Traditional Text Analysis Methods In The World of Deep Learning
Shanif Dhanani
Founder of Nobi, an AI shopping assistant that boosts conversion rates by improving discovery and recommendations
Several years ago a new technique for processing text flipped the world of natural language processing (NLP) on its head. That technique, of course, was the process of converting words into vectors, known today as word embeddings.
Word embeddings allow data scientists to represent a word in a multi-dimensional space whose dimensions capture statistical properties of word co-occurrence. They let us concisely represent key properties of text using a small(ish) number of numerical weights.
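As a quick illustration, here's a minimal sketch of an embedding lookup using gensim and one of its downloadable pre-trained GloVe models. The model name and example word are purely illustrative, not necessarily what we use in production.

```python
# Minimal sketch: looking up pre-trained word embeddings with gensim.
# "glove-wiki-gigaword-50" is one example of a downloadable model.
import gensim.downloader as api

# Load 50-dimensional GloVe vectors trained on Wikipedia + Gigaword.
vectors = api.load("glove-wiki-gigaword-50")

# Each word maps to a dense vector whose dimensions reflect co-occurrence
# statistics learned from the training corpus.
print(vectors["coffee"].shape)                 # (50,)
print(vectors.most_similar("coffee", topn=3))  # semantically related words
```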
This is a huge improvement over what had to be done before word embeddings existed. Back then, words were typically one-hot encoded, which caused an explosion in input feature dimensionality. On top of that, it was common to use n-grams, stemming, lemmatization, and other text pre-processing techniques to make it easier to encode words as numbers.
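For contrast, here's a rough sketch of that older style of pipeline using NLTK and scikit-learn. The sample sentences are purely illustrative.

```python
# Sketch of the older pipeline: stemming/lemmatization plus a sparse
# bag-of-words / n-gram encoding, rather than dense embeddings.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

# The lemmatizer needs the WordNet corpus data.
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))        # "run"
print(lemmatizer.lemmatize("geese"))  # "goose"

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Unigrams and bigrams, one column per term: dimensionality grows with
# vocabulary size, unlike a fixed-width embedding.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
print(X.shape)  # (2, number of unique uni/bigrams)
```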
At Apteo, we’ve been using embeddings without any of the pre-processing methods mentioned above. However, we were interested in seeing whether any of those methods could improve the accuracy of the network in which these embeddings are used.
So we grabbed our document corpus, transformed all of the words using these traditional methods, then used embeddings for the transformed words and tossed the resulting vectors into our neural network to see if there would be any improvements in our accuracy.
There weren’t.
None of the pre-processing methods helped improve our cross-validated accuracy.
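To give a flavor of the kind of comparison we ran, here's a hedged sketch: average pre-trained embeddings per document, with and without stemming, and compare cross-validated accuracy. The toy corpus, classifier, and embedding model are illustrative stand-ins, not our actual setup.

```python
# Sketch: does stemming before embedding lookup change cross-validated
# accuracy? Toy data and a simple classifier stand in for the real pipeline.
import numpy as np
import gensim.downloader as api
from nltk.stem import PorterStemmer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

vectors = api.load("glove-wiki-gigaword-50")
stemmer = PorterStemmer()

def doc_vector(doc, preprocess=None):
    """Average the embeddings of a document's tokens (skipping OOV words)."""
    tokens = doc.lower().split()
    if preprocess is not None:
        tokens = [preprocess(t) for t in tokens]
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(vectors.vector_size)

# Tiny illustrative corpus with binary labels.
docs = ["great product fast shipping", "terrible quality broke quickly",
        "loved it works perfectly", "awful experience never again"] * 10
labels = np.array([1, 0, 1, 0] * 10)

for name, prep in [("raw", None), ("stemmed", stemmer.stem)]:
    X = np.vstack([doc_vector(d, prep) for d in docs])
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```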
Of course, in retrospect, that’s not so surprising. The whole point of vectorizing a word is to transform it into a high-dimensional space that can represent the nuances of how that word is used in real life. Dropping suffixes or lemmatizing words doesn’t change the underlying context of a word all that much, so it’s unlikely that the resulting changes to the embeddings would have much of an impact.
However, despite the fact that traditional text pre-processing didn’t help us all that much, I still have hopes for other methods that could help us process our text in a more effective manner.
Topic extraction techniques like LDA and LSI, which are statistical methods for loosely labeling documents with topics, may still be able to provide deep networks with useful information about the high-level context of what’s being said in these documents.
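For example, a topic model's per-document topic distribution could be concatenated with embeddings as extra document-level features. Here's a brief sketch of LDA with gensim; the tiny corpus and topic count are illustrative only, and gensim's LsiModel can be swapped in the same way.

```python
# Sketch: fit an LDA topic model and get a topic distribution per document,
# which could then be fed to a downstream network as additional features.
from gensim import corpora, models

docs = [["shipping", "delivery", "late", "package"],
        ["fabric", "fit", "size", "color"],
        ["refund", "return", "customer", "service"]]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)

# Each document gets a sparse distribution over topics.
for doc in bow:
    print(lda.get_document_topics(doc))
```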
We have yet to see whether these methods can improve our performance. We obviously hope they do, but we won’t know until we try them.
It’s great to see how far we’ve come in the world of NLP in a few years. I have no doubt that as AI advances and more researchers start looking into text processing, we’ll get even better.