Natural Language Processing (NLP) with Python: A Use Case

Introduction:

Today, with the digitization of everything, 80% of the data being created is unstructured. Audio, video, our social footprints, the data generated from conversations between customer service reps, tons of legal documents, and the text processed in the financial sector are all examples of unstructured data stored in Big Data systems. Organizations are turning to natural language processing (NLP) technology to derive understanding from the myriad of unstructured data available online, in call logs, and in other sources.

NLP describes the ability of computers to understand human speech as it is spoken. NLP is a branch of artificial intelligence that has many important implications for the ways computers and humans interact. Machine learning has helped computers parse the ambiguity of human language. Apache OpenNLP, the Natural Language Toolkit (NLTK), and Stanford NLP are examples of open source NLP libraries used in real-world applications.

Companies are collecting all these different kinds of data for better customer targeting and more meaningful insights. To process all of these unstructured data sources, we need people who understand NLP.

Building these applications requires a very specific skill set: a deep understanding of language and of the tools needed to process it efficiently. So it is not just hype that makes NLP one of the most niche areas; it is the kinds of applications that can be created with NLP that make it one of the most unique skills to have.

APPLICATIONS WE CAN BUILD WITH NATURAL LANGUAGE PROCESSING

Machine translation

The easiest way to understand machine translation is to think about how we translate from one language to another. Our mind parses the sentence structure and tries to understand the sentence. Once we understand it, we try to substitute the words of the original language with words from the target language. While substituting, we apply the grammar rules of the target language, and the translated sentence is finally achieved.

If we start from the source language text, we have to tokenize the sentences and parse them into a tree (the syntactic structure, in easy words) to make sure the sentences are correctly formulated. The semantic structure holds the meaning of the sentences, and at the next level we reach the state of Interlingua, which is an abstract state independent of any language. There are multiple ways in which people have developed methods of translation. In the classic machine translation pyramid, the closer you move towards Interlingua at the top, the more intense the NLP processing required. Based on these levels of transfer, a variety of methods are available. I have listed two of them here:

Direct translation: This is more of a dictionary-based machine translation, where you have huge corpora of source and target language words. This kind of transfer is possible for applications where we have a large corpus of the languages available. It is popular because of its simplicity (a minimal sketch of the idea appears after this list).

Syntactic transfer: Here you try to build a parser for the source language. There are a variety of ways in which people have approached the problem of parsing; there are deep parsers that take care of some parts of the semantics too. Once you have a parser, target word substitution happens, and the target-language parser then generates the final sentence in the target language.
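To make the direct (dictionary-based) approach concrete, here is a minimal sketch. The tiny English-to-Spanish dictionary and the example sentence are invented purely for illustration; a real system would rely on huge bilingual corpora.

import nltk

# Toy English-to-Spanish dictionary, invented for illustration only.
en_to_es = {"the": "el", "cat": "gato", "drinks": "bebe", "milk": "leche"}

def direct_translate(sentence, dictionary):
    tokens = nltk.word_tokenize(sentence.lower())
    # Word-for-word substitution; no reordering or target-grammar rules are applied.
    return " ".join(dictionary.get(tok, tok) for tok in tokens)

print(direct_translate("The cat drinks milk", en_to_es))
# el gato bebe leche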

Speech recognition

Speech recognition is a very old NLP problem. People have been trying to address it since the era of World War I, and it is still one of the hottest topics in computing. The idea is really intuitive: given the speech uttered by a human, can we convert it to text? The problem with speech is that we produce a sequence of sounds, called phonemes, that are hard to process, so speech segmentation itself is a big problem. Once the speech is processable, the next step is to pass it through a set of constraints (models) built from the available training data, which involves heavy machine learning.

The modeling step, which applies these constraints, is actually one of the most complex components of the entire system. Acoustic modeling builds models based on phonemes, while lexical models address smaller segments of sentences, associating a meaning with each segment. Separately, language models are built on unigrams and bigrams of words. Once we have built these models, an utterance of a sentence is passed through the pipeline: after the initial preprocessing, it goes through the acoustic, lexical, and language models to generate the tokens as output.
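To give a feel for the language-model component mentioned above, here is a minimal sketch of a bigram model built with NLTK. The tiny corpus is invented purely for illustration; a real recognizer would estimate these probabilities from very large amounts of text.

from nltk import bigrams, ConditionalFreqDist, word_tokenize

corpus = "we need the text . we need the speech . the speech needs a model ."
tokens = word_tokenize(corpus.lower())

# Conditional frequency distribution: how often each word follows a given word.
cfd = ConditionalFreqDist(bigrams(tokens))

# Relative-frequency estimates of P(next word | "the").
print(cfd["the"].freq("speech"))   # ~0.67
print(cfd["the"].freq("text"))     # ~0.33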

Text classification

Text classification is a well-defined and somewhat solved problem, and it has been applied across many domains. Typically, text classification is the process of classifying text documents using words and combinations of words. While it is a typical machine learning problem, many of the preprocessing steps used in text classification come from NLP.

At a pipeline level, the process can be described as follows:

How much of this pipeline we use depends on the kind of text classification problem we are trying to solve; in some cases it is more a matter of feature engineering, and we drop some of the preprocessing steps. The final goal of feature engineering is to generate a term-document matrix (TDM), which holds the vocabulary of the entire corpus: the columns are the terms, the rows are the documents, and each cell holds a score under the Bag of Words (BOW) representation. The weighting scheme can be varied between TF, TF-IDF, Bernoulli, and other variations of term frequency. There are also ways to induce features such as the POS of a given feature, contextual POS, and others, to make the feature space more NLP-intense. Once the TDM is generated, the text classification problem becomes a typical supervised/unsupervised classification problem: given a set of samples, we need to predict which sample belongs to which class. This is definitely a splendid application of NLP/ML and is used quite often for commercial purposes.
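As an illustration, here is a minimal sketch of this pipeline using scikit-learn (an assumption on my part; the rest of this article only uses NLTK). The tiny labelled corpus is invented purely for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap loans apply now", "meeting schedule for monday",
        "win a cheap prize now", "project schedule and notes"]
labels = ["spam", "ham", "spam", "ham"]

# Build the term-document matrix: rows are documents, columns are vocabulary
# terms, and each cell holds a TF-IDF weight.
vectorizer = TfidfVectorizer()
tdm = vectorizer.fit_transform(docs)

# A typical supervised classifier trained on the TDM.
clf = MultinomialNB()
clf.fit(tdm, labels)

print(clf.predict(vectorizer.transform(["cheap prize now"])))  # expected: ['spam']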

Information extraction

Information extraction (IE) is a process of extracting meaningful information from unstructured text. IE is yet another widely popular and highly important application. In general, an information extraction engine harnesses huge numbers of unstructured documents and generates some sort of structured/semi-structured knowledge base (KB) that can be deployed to build an application around it. A simple example is that of generating a very good ontology using a huge set of unstructured text documents.

There are mainly two ways of extracting information:

Rule-based extraction: This method uses a template-filling mechanism. The idea is to define templates for the expected outcomes and then mine the unstructured text for those specific templates. For example, building a knowledge base about football would involve getting information on all the players and their profiles, the statistics, some personal information, and so on. All of that can be well defined and extracted using either pattern-based rules or POS tags, NERs, and relation extraction.

Machine learning based: The other approach involves deeper NLP-based methods such as building a parser specific to the need of our knowledge base. Some of the KBs will require mining the entities that can't be extracted using a pre-trained NER, so we have to build a custom NER.
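As a small illustration of extraction with a pre-trained NER, here is a sketch using NLTK's ne_chunk (the example sentence is just for illustration, and the exact entity labels depend on the pre-trained chunker):

import nltk

text = "Lionel Messi joined Paris Saint-Germain from Barcelona in 2021."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))

# Keep only the chunks that the chunker labelled as named entities;
# plain (word, tag) tuples are skipped because they have no label.
entities = [(chunk.label(), " ".join(word for word, tag in chunk))
            for chunk in tree if hasattr(chunk, "label")]
print(entities)
# e.g. [('PERSON', 'Lionel'), ('PERSON', 'Messi'), ('GPE', 'Barcelona'), ...]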

Question answering systems

Question answering (QA) systems are intelligent systems that can address any question based on their knowledge base. One of the major examples of this is IBM Watson, which competed on the TV show Jeopardy! and beat its human opponents. A QA system can be broken down into components: speech recognition for posing the query, and a knowledge base generated using information retrieval and extraction. Once the system has a question, one big problem is to classify/categorize the question in different ways. The other aspect is to search the knowledge base effectively and retrieve the most precise documents. Even after that, we have to generate the answer in a natural way using some of the other applications, such as summarization and parsing.

Dialog systems

Dialog systems are considered the dream application: given speech in a source language, the system performs speech recognition and transcribes it to text. This text then goes to a machine translation system that translates it into the target language, and a text-to-speech system converts the result into speech in the target language. This is one of the most desirable applications of NLP, where we can talk to a computer in any language and the computer will reply in the same language. This kind of application could actually break down the language barriers that exist in the world.

Apple Siri and Google Voice are examples of some of the commercial applications in the line of dialog systems intelligent enough to understand our information needs, try to address them in a set of actions or information, and respond in a human-like manner.

Building your first NLP application:

Let's start with one of the more complex NLP applications: summarization. The concept of summarization is quite simple: we are given an article/passage/story and we have to generate a summary of the content automatically. Summarization actually requires deep knowledge of NLP, because we need to understand not just the structure of the sentences but also the structure of the entire text. We also need to know about the genre of the text and the theme of the content.

For the following example, I have scraped an article from the New York Times into a text file, nyt.txt. The idea is to summarize this news article for us. Let's build our own version of Google News for personal use.

To start off, we need to keep in mind that, typically, a sentence with more entities and nouns has greater importance than other sentences. We will encode this logic as an importance score, using the following code. To get the top-n sentences, we can choose a threshold on the importance score.

import nltk

# Read the scraped article from the text file.
with open("nyt.txt") as f:
    news_content = f.read()

results = []
for sent_no, sentence in enumerate(nltk.sent_tokenize(news_content)):
    no_of_tokens = len(nltk.word_tokenize(sentence))
    # Let's do POS tagging
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    # Count the number of nouns in the sentence
    no_of_nouns = len([word for word, pos in tagged if pos in ["NN", "NNP"]])
    # Use NER to tag the named entities (entity chunks carry a label; plain tokens don't)
    ners = nltk.ne_chunk(tagged, binary=False)
    no_of_ners = len([chunk for chunk in ners if hasattr(chunk, 'label')])
    score = (no_of_ners + no_of_nouns) / float(no_of_tokens)
    results.append((sent_no, no_of_tokens, no_of_ners,
                    no_of_nouns, score, sentence))

# Print the sentences, most important first.
for sent in sorted(results, key=lambda x: x[4], reverse=True):
    print(sent[5])
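
To turn this ranking into an actual summary, one simple option (a sketch; the choice of three sentences is arbitrary) is to keep the highest-scoring sentences and print them in their original order:

# Keep the three highest-scoring sentences (an arbitrary choice of n)
# and print them in their original order to form a short summary.
top_n = sorted(results, key=lambda x: x[4], reverse=True)[:3]
for sent_no, _, _, _, _, sentence in sorted(top_n, key=lambda x: x[0]):
    print(sentence)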


