TF-IDF Technique Overview
Sagar Shroff
Sr Software Development Engineer In Test - Selenium | Cucumber | Karate | Cypress | Javascript | Java | AWS
I am currently learning about feature engineering, and today I explored the TF-IDF technique. I decided to write it up because, as the saying goes, if you can't explain something in simple terms, you haven't fully understood it yourself. Here is my attempt to break down this simple yet foundational method for quantifying and working with text data.
Brief
TF-IDF is a statistical measure (I will explain the maths in a bit) that quantifies the importance of a particular term/word in a document relative to the entire corpus, i.e. the collection of documents available.
This can be helpful for finding which document in a large collection is most relevant for a given term. Mind you, this is not a simple text search that just matches keywords: TF-IDF quantifies the importance of terms by considering both their frequency within a document and their rarity across all documents.
So there are 2 key ideas it follows to achieve this:
Term Frequency (TF): how often a term appears in a document, usually normalised by the document's length. The more often a term appears in a document, the more important it likely is to that document.
Inverse Document Frequency (IDF): how rare a term is across the whole corpus, typically log(N / df), where N is the number of documents and df is the number of documents containing the term. Common words like "the" appear everywhere, so their IDF is low.
Therefore, TF-IDF is simply the product: term frequency * inverse document frequency.
The result (the product of tf * idf) is the weightage assigned to each term/word in each document. Simple but beautiful.
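The weights described above can be computed from scratch in a few lines. This is a simplified sketch using the plain tf = count/length and idf = log(N/df) definitions (real libraries often add smoothing and normalisation):

```python
import math

def tf_idf(corpus):
    """Compute a TF-IDF weight for each term in each document.

    tf(t, d) = count of t in d / total terms in d
    idf(t)   = log(N / df(t)), where df(t) is the number of
               documents containing t and N is the corpus size.
    """
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1

    # Weight of each term = tf * idf, computed per document.
    weights = []
    for doc in docs:
        scores = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)
            idf = math.log(n_docs / df[term])
            scores[term] = tf * idf
        weights.append(scores)
    return weights

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]
weights = tf_idf(corpus)
```

Running this, a word like "mat" (unique to one document) gets a high weight there, while "the" (present in two of the three documents) is weighted down by its low IDF.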
You can review my notebook below, which demonstrates this with a simple example.