Introduction to Natural Language Processing with Python
Daniel Deutsch, MSc (JKU), LL.M. (WU)
Artificial Intelligence and Business Law
An article introducing natural language processing with Python - theory and a code example.
Table of contents
- Intro
  - Challenges in NLP
  - Machine Learning Workflow
- Theory of summarizing and classifying texts
  - Method Abstract Extraction
  - Connecting machine learning to the articles
  - Classification
- Code example
  - 1. Getting text from a website
  - 2. Summarize text
  - 3. Find themes in articles
- Useful links & credits
"The limits of my language are the limits of my world." ? Ludwig Wittgenstein
Intro
Challenges in NLP
- Tokenization (breaking text into smaller pieces)
- Stopword Removal (filtering out unimportant words)
- N-Grams (grouping related words)
- Word Sense Disambiguation (identifying the context in which a word occurs)
- Parts-of-Speech (identifying the grammatical role of each word)
- Stemming (reducing words to their base form by removing endings)
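A minimal sketch of a few of these steps with NLTK (assuming the punkt, stopwords and averaged_perceptron_tagger resources have already been downloaded via nltk.download; the sentence is just an example):

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.util import ngrams

sentence = "The limits of my language are the limits of my world."

# Tokenization: break the sentence into single words
tokens = word_tokenize(sentence.lower())

# Stopword Removal: filter out unimportant words
stopWords = set(stopwords.words('english'))
content = [w for w in tokens if w.isalpha() and w not in stopWords]

# N-Grams: group neighbouring words into pairs
bigrams = list(ngrams(content, 2))

# Parts-of-Speech: tag the grammatical role of each token
tagged = nltk.pos_tag(tokens)

# Stemming: cut words down to their stems
stems = [PorterStemmer().stem(w) for w in content]

print(tokens, content, bigrams, tagged, stems, sep="\n")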
Machine Learning Workflow
- Pick a problem
- Represent data using numeric attributes
- Use an algorithm to find a model
Theory of summarizing and classifying texts
Method Abstract Extraction
- Find the most important words (important words generally occur more often)
- Calculate a score for each sentence based on the important words it contains
- Select the highest-scoring (most important) sentences
Connecting machine learning to the articles
- Divide the articles into groups based on common attributes (clustering); we want to maximize intra-cluster similarity
- Use tf-idf to find the important words in an article and to represent each document numerically (see the small tf-idf example after this list)
- Apply the K-Means clustering technique: set the cluster centers, assign each point to its nearest center, recompute the centers, and repeat until the centers no longer change
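As a small illustration of how tf-idf turns documents into numeric vectors, here is a sketch on a few made-up toy sentences (not the court ruling used in the code example below):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the court dismissed the appeal",
    "the court ruled on data protection",
    "data protection law applies to search engines",
]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)            # one tf-idf vector per document

print(vectorizer.get_feature_names_out())     # the vocabulary (scikit-learn >= 1.0)
print(X.toarray().round(2))                   # higher weight = more characteristic of that document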
Classification
A typical classification workflow consists of representing the data with numeric attributes, training a model on that data, and finally evaluating the model on separate test data.
To apply any machine learning we need data; here the data is a set of articles. In those articles certain themes are identified, and these themes are then assigned to new articles: for a new, unseen article the trained model predicts the corresponding theme.
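The code example below takes the unsupervised route with K-Means, but a supervised version of this workflow could look roughly like the following sketch with scikit-learn; the training texts and themes are made up purely for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy training data: articles whose themes are already known (made-up examples)
trainTexts = [
    "the court ruled on the protection of personal data",
    "the operator of a search engine must remove certain links",
    "the appeal was dismissed and the costs were awarded",
]
trainThemes = ["data protection", "search engines", "procedure"]

# represent the articles with numeric (tf-idf) attributes and train the model
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(trainTexts, trainThemes)

# assign a theme to a new, unseen article
print(model.predict(["the search engine operator processes personal data"]))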
Code example
In this example I am going to fetch the paragraphs of a ruling of the European Court of Justice, display the most important paragraphs using the abstract extraction method, and group the paragraphs into themes with the K-Means technique.
1. Getting text from a website

from urllib.request import urlopen
from bs4 import BeautifulSoup

articleURL = "https://curia.europa.eu/juris/document/document.jsf?text=&docid=139407&pageIndex=0&doclang=EN&mode=lst&dir=&occ=first&part=1&cid=52454"

def getText(url):
    # download the page and parse it with BeautifulSoup
    page = urlopen(url).read().decode('utf8', 'ignore')
    soup = BeautifulSoup(page, 'lxml')
    # join the text of all <p> tags into one string
    text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
    # drop characters that cannot be represented in ASCII
    return text.encode('ascii', errors='replace').decode().replace("?", "")

text = getText(articleURL)
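A quick sanity check that the retrieval worked:

print(text[:300])   # the first few hundred characters of the ruling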
2. Summarize text

import nltk
# nltk.download('punkt')
# nltk.download('stopwords')
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from collections import defaultdict
from string import punctuation
from heapq import nlargest

def summarize(text, n):
    # split the text into sentences and into lower-cased words
    sents = sent_tokenize(text)
    assert n <= len(sents)
    wordSent = word_tokenize(text.lower())
    # remove stopwords and punctuation
    stopWords = set(stopwords.words('english') + list(punctuation))
    wordSent = [word for word in wordSent if word not in stopWords]
    # frequency of the remaining (important) words
    freq = FreqDist(wordSent)
    # score each sentence by the frequencies of the words it contains
    ranking = defaultdict(int)
    for i, sent in enumerate(sents):
        for w in word_tokenize(sent.lower()):
            if w in freq:
                ranking[i] += freq[w]
    # take the n highest-scoring sentences and return them in original order
    sentsIDX = nlargest(n, ranking, key=ranking.get)
    return [sents[j] for j in sorted(sentsIDX)]

summaryArr = summarize(text, 10)
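The selected sentences can then be printed one by one:

for sentence in summaryArr:
    print(sentence, "\n")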
3. Find themes in articles

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np

# represent each summary sentence as a tf-idf vector
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')
X = vectorizer.fit_transform(summaryArr)

# cluster the sentences into 3 groups with K-Means
km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1, verbose=True)
km.fit(X)
print(np.unique(km.labels_, return_counts=True))   # how many sentences ended up in each cluster

# concatenate the sentences of each cluster into one document per cluster
clusterText = {}
for i, cluster in enumerate(km.labels_):
    oneDocument = summaryArr[i]
    if cluster not in clusterText:
        clusterText[cluster] = oneDocument
    else:
        clusterText[cluster] += oneDocument

# the most frequent words per cluster
stopWords = set(stopwords.words('english') + list(punctuation))
keywords = {}
counts = {}
for cluster in range(3):
    word_sent = word_tokenize(clusterText[cluster].lower())
    word_sent = [word for word in word_sent if word not in stopWords]
    freq = FreqDist(word_sent)
    keywords[cluster] = nlargest(100, freq, key=freq.get)
    counts[cluster] = freq

# keep only the keywords that do not appear in the other clusters
uniqueKeys = {}
for cluster in range(3):
    other_clusters = list(set(range(3)) - set([cluster]))
    keys_other_clusters = set(keywords[other_clusters[0]]).union(set(keywords[other_clusters[1]]))
    unique = set(keywords[cluster]) - keys_other_clusters
    uniqueKeys[cluster] = nlargest(10, unique, key=counts[cluster].get)

print(uniqueKeys)
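With the fitted vectorizer and K-Means model, a new paragraph can be assigned to one of the three themes, which is the classification step described above (the paragraph here is just a made-up example):

newParagraph = "The operator of a search engine is obliged to remove links to web pages containing personal data."
newX = vectorizer.transform([newParagraph])
print(km.predict(newX))   # index of the cluster (theme) the new paragraph is assigned to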
Useful links & credits
- A great course can be found on Pluralsight
Thanks for reading my article! Feel free to leave any feedback!