Exploring NLP: Market Dynamics, Gartner's Insights, and Effective Text Analysis Techniques

It all began with Natural Language Processing, which paved the way for tokenization, word embeddings, attention mechanisms, and sequence modeling. The way humans write sentences to convey a message to other human beings was translated to machines, and this began back in the 1940s, after World War II. At that time, people recognized the importance of translating from one language to another and hoped to create a machine that could do it automatically.

As I mentioned at the beginning, in this beautiful moment of the blooming Artificial Intelligence era, we have crossed paths with Alan Turing's "Computing Machinery and Intelligence." In this blog, I will highlight my learning of NLP, from the basics to the project my team and I are building.

We will also review the market dynamics of Natural Language Processing, a technique working under the hood of modern AI whose relevance has never faded.

Here are a few snapshots from Gartner's Hype Cycle trends from the past ten years:

NLP saw its peak of expectations in 2013.

The 'Peak of Inflated Expectations' is described by Gartner as "Early publicity produces several success stories — often accompanied by scores of failures." Interestingly, it was not just NLP that was trending in 2013; its close associates, Content Analytics and Speech-to-Speech Translation, were trending too. Let us move to 2015, when NLP saw a dip in its slope of hype:

By 2015, NLP had started sliding down from the peak.

Fast forward to 2022: NLP still holds the same place as in 2015, the 'Trough of Disillusionment,' which is when "Interest wanes as experiments and implementations fail to deliver."

The plateau is expected to be reached in about 5 to 10 years.

Natural Language Processing has been pivotal in developing models like ChatGPT and other foundation models. Here is how NLP has influenced them:

  1. Training Data: NLP models rely on large amounts of text data for training. This data teaches the model about language patterns, semantics, and syntax. The availability of large corpora of text data has enabled the training of increasingly sophisticated models like ChatGPT.
  2. Feature Engineering: NLP provides a rich set of features that can be extracted from text data. These features are crucial for training machine learning models, including those used in ChatGPT.
  3. Model Architecture: NLP has led to the development of architectures that are used for handling text data; Transformers, in particular, have revolutionized the field with their ability to capture long-range dependencies in text (a toy sketch follows this list).
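
To make the Transformer point concrete, here is a toy sketch of scaled dot-product attention, the operation that lets a model weigh every token against every other token. The matrices are random placeholders, not a real model:

```python
# A toy sketch of scaled dot-product attention, the core Transformer operation.
# The random matrices stand in for learned query/key/value projections and are
# purely illustrative.
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # how much each query attends to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over key positions
    return weights @ V                                 # weighted mix of value vectors

seq_len, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.random((seq_len, d)) for _ in range(3))
print(attention(Q, K, V).shape)                        # (4, 8): one output vector per token
```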


When comparing the hype trend graph with the number of research papers published on the topic of NLP, this is what I found.

The number of papers published post-2015 kept rising even as NLP slid down Gartner's hype curve.

This does not necessarily mean that more research papers get published as the hype cycle declines; it only supports the idea that NLP is a base for all the other ground-breaking applications of foundation models (this is just a hypothesis and would need evidence to be proved).

The global natural language processing market was valued at USD 27.73 billion in 2022 and is expected to expand at a compound annual growth rate (CAGR) of 40.4% from 2023 to 2030.
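
As a quick back-of-the-envelope check of what that forecast implies (reading 2023-2030 as eight growth years is my own assumption, so the figure is only approximate):

```python
# Back-of-the-envelope check of the cited forecast: compounding the 2022 base
# of USD 27.73B at 40.4% per year over eight growth years (my reading of the
# 2023-2030 window).
value_2022 = 27.73                       # USD billions
cagr = 0.404
projected_2030 = value_2022 * (1 + cagr) ** 8
print(f"{projected_2030:.1f}")           # ~418.7 (USD billions)
```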

How I started my NLP journey

It is my final semester in the Information Systems major, specializing in Business Analytics. For my capstone project, my team and I are building an "NER model for converting noisy unstructured newspaper data on road accidents to a structured database." I enrolled in LinkedIn Learning's "NLP with Python for Machine Learning Essential Training" by Derek Jedamski to start my learning. I went ahead and completed the advanced chapters, too.

The steps that I am following for building my final model:

  • Text Pre-Processing: The first step in any NLP task is pre-processing the data, which means putting the raw text into a predictable and analyzable form. Pre-processing involves the following steps (a sketch follows):
      • Tokenization
      • Stop-Word Removal
      • Stemming and Lemmatization
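
Here is a minimal sketch of those three steps; NLTK and the sample sentence are my choices for illustration, and spaCy offers equivalents:

```python
# A minimal sketch of the three pre-processing steps using NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

text = "Three vehicles collided on the interstate during the morning rush."

tokens = nltk.word_tokenize(text.lower())                    # tokenization
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]  # stop-word removal

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in filtered])                   # stemming: 'collided' -> 'collid'
print([lemmatizer.lemmatize(t, pos="v") for t in filtered])  # lemmatization: 'collided' -> 'collide'
```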

  • Vectorization: Vectorization is the process of converting textual data, such as sentences or documents, into numerical vectors that can be used for data analysis, Machine Learning, and other computational tasks. Machine Learning models cannot work directly with human words and linguistic sentences, so to make machines understand the language, we convert the words into their corresponding word embeddings.

A few ways of vectorization (a minimal sketch follows this list):

  • Word2Vec
  • Doc2Vec
  • TF-IDF (Term Frequency - Inverse Document Frequency)
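
Here is a minimal sketch of two of these approaches, TF-IDF via scikit-learn and Word2Vec via gensim, on an invented toy corpus:

```python
# A minimal sketch of two vectorization approaches; the toy corpus is
# illustrative only.
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "two cars collided on the highway last night",
    "a truck overturned near the city bridge",
]

# TF-IDF: each document becomes a sparse vector of term weights
tfidf = TfidfVectorizer()
doc_vectors = tfidf.fit_transform(corpus)
print(doc_vectors.shape)                  # (2, vocabulary size)

# Word2Vec: each word gets a dense embedding learned from its contexts
tokenized = [doc.split() for doc in corpus]
w2v = Word2Vec(sentences=tokenized, vector_size=50, window=3, min_count=1)
print(w2v.wv["highway"][:5])              # first 5 dimensions of the 'highway' vector
```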

Since my project is based on entity extraction, I delved deep into the method of extracting entities using Named Entity Recognition (NER).

Named Entities refer to key subjects of a text, such as names, locations, companies, events, products, themes, topics, times, monetary values, and percentages. The three commonly used NER systems are:

  1. Supervised ML-based systems
  2. Rule-based systems
  3. Dictionary-based systems

NER models that can be used for this project are: spaCy, BERT, and BERT-based models fine-tuned for NER ('BERT-NER,' 'RoBERTa-NER,' 'DistilBERT-NER,' etc.).

Currently, we use the spaCy and gensim libraries for pre-processing, annotation, and word embeddings.
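
As a starting point, here is a minimal sketch of what entity extraction looks like with spaCy's pre-trained English pipeline; the sample sentence is invented, and a real pipeline for our project would be fine-tuned on annotated accident reports:

```python
# A minimal sketch of entity extraction with spaCy's pre-trained pipeline.
# Setup: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Two people were injured when a bus collided with a truck in Fremont on Sunday."

for ent in nlp(text).ents:
    print(ent.text, ent.label_)           # e.g., 'Two' CARDINAL, 'Fremont' GPE, 'Sunday' DATE
```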


We will share the results of our project in the next blog, and if you have any suggestions for building a pipeline, please reach out to me or comment on this post.


