What is the difference between keyword search and text mining?

In my daily work, customers keep asking me questions such as: "Is your text mining solution able to read specific words?", "How does it deal with sarcasm?", "What happens if a word is misspelled or contains a typo?" and many others. Unfortunately, most people who are not familiar with text mining confuse keyword (or word) search with text mining. These are two totally different things.

When you use keyword (word) search, which basically means searching for words in a corpus, you type some words into a search engine and the software brings back one or more documents that contain those words. Each hit corresponds to one document, and typically you need to read the document to decide whether it is relevant or not. So if you have 1,000 hits, you need to read 1,000 documents.
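
To make this concrete, here is a minimal Python sketch of keyword search. The three feedback snippets, the document ids and the keyword_search helper are invented for illustration; a real search engine adds indexing and ranking, which are left out here.

```python
import re

# Hypothetical mini corpus of customer feedback (invented for this sketch).
corpus = {
    "d1": "Your customer care was quick and friendly.",
    "d2": "I do not care what the customer brochure says, the product broke.",
    "d3": "The support team never answered my email.",
}

def keyword_search(query, documents):
    """Return the ids of all documents that contain every word of the query."""
    terms = set(re.findall(r"[a-z']+", query.lower()))
    hits = []
    for doc_id, text in documents.items():
        words = set(re.findall(r"[a-z']+", text.lower()))
        if terms <= words:  # every query word appears somewhere in the document
            hits.append(doc_id)
    return hits

hits = keyword_search("customer care", corpus)
print(hits)
```

Both d1 and d2 are returned, although only d1 is about customer care, and d3 (which is about the same topic in different words) is missed. You still have to read every hit to judge its relevance.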

At the other end, text mining software is able to "read" and "interpret" the meaning of the data inside a document. It identifies concepts and relationships, and presents the results back to you in a structured form. The results are fragments of text that correspond to facts, associations or relationships. You only need to read the document once you have found a relevant hit.

I personally think this confusion has been generated by vendors, especially in the CXM (customer experience management) space, setting wrong expectations. Common wrong expectations are:

  • With a click of the mouse you get all the topics out of a big data set (corpus) of customer feedback.
  • It is magic: you don't need to do anything, the software itself will understand all topics, sentiment and correlations inside the text (corpus). Don't believe this bullshit, even if the guy telling it to you is called Watson ;-)

Unfortunately this is not the case, even if, especially for the first point, we are not far from a solution. The hilarious fact is that, as I said, many CXM vendors sell keyword search as text mining ... and I can tell you, they put a very high price tag on that gimmick!

Why is keyword search not the right way to identify topics in a customer feedback corpus? As I said before, you typically need to read each document to decide whether it is relevant or not. It turns into a nightmare: the quality of your classification will be horrible, and maintaining the rules to identify topics by keywords will be a never-ending sad story.

With "real text mining", especially deep machine learning, you will be able to get close to a "push the button" unsupervised process for identifying topics. At a very high level, the process is as "easy" as 3 steps:

  1. "Clean" your corpus, using specific linguistic techniques such as tokenisation, part-of-speech tagging, stop-word removal, disambiguation, lemmatisation, etc.
  2. Turn words into numbers (vectors) using different techniques, for instance continuous "bag of words", a "sparse matrix", a "tf-idf matrix", etc. The reason for turning words into numbers is that deep machine learning (neural networks) loves numbers.
  3. Apply specific machine learning techniques such as word2vec, Latent Dirichlet Allocation (LDA), K-means clustering, etc. to automatically discover topics and sentiment. All those techniques are "code-able" with libraries such as TensorFlow, Gensim, GloVe, spaCy, etc.
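
The three steps above can be sketched in plain Python. This is only a toy illustration under loud assumptions: the four feedback sentences and the tiny stop-word list are invented, step 1 does no lemmatisation or disambiguation, step 2 uses a simple tf-idf weighting, and step 3 is a hand-written K-means with k fixed to 2 and a deterministic start. In a real project you would rely on a library such as Gensim rather than on code like this.

```python
import math
import re
from collections import Counter

# Invented stop-word list, just large enough for the toy corpus below.
STOP_WORDS = {"the", "was", "and", "is", "too", "for", "this", "to", "again", "a"}

def clean(text):
    """Step 1: lowercase, tokenise and drop stop words (no lemmatisation here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def tfidf_vectors(docs_tokens):
    """Step 2: one L2-normalised tf-idf vector per document over a shared vocabulary."""
    vocab = sorted({t for tokens in docs_tokens for t in tokens})
    n_docs = len(docs_tokens)
    df = Counter(t for tokens in docs_tokens for t in set(tokens))
    vectors = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        vec = [tf[t] * (math.log(n_docs / df[t]) + 1.0) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_two(vectors, iterations=10):
    """Step 3: K-means with k=2, started from document 0 and the document farthest from it."""
    far = max(range(len(vectors)), key=lambda i: sq_dist(vectors[0], vectors[i]))
    centroids = [vectors[0], vectors[far]]
    labels = []
    for _ in range(iterations):
        # Assign each document to its nearest centroid.
        labels = [min((0, 1), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        # Move each centroid to the mean of its members.
        for c in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

feedback = [
    "The delivery was late and the parcel arrived damaged",
    "Late delivery again, the parcel was damaged",
    "The price is too high for this subscription",
    "Subscription price is high compared to competitors",
]
tokens = [clean(text) for text in feedback]
vocab, vectors = tfidf_vectors(tokens)
labels = kmeans_two(vectors)
print(labels)
```

The two delivery complaints end up in one cluster and the two pricing complaints in the other, without any keyword rule having been written. Naming the clusters ("delivery", "pricing") is still a human step in this sketch.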

This approach allows you to answer some of the specific questions mentioned before:

  1. The computer will not "read" words: it will turn them into vectors and understand them in a mathematical way (e.g. word2vec, lda2vec, etc.).
  2. The approach is language agnostic: it will analyse English the same way as Swiss German, Arabic or Japanese ... and I can assure you, this is a big advantage.
  3. The solution will cope with misspelled words: the vectors of "my father is coming home" and "my father is coming ome" are so close that there will be practically no difference as input for a neural network.
  4. New words and concepts popping up will not be a problem: new vectors will appear in the vector space, and you will be able to identify them and follow them as a trend.
  5. What we call "multitopics" in the same sentence, such as "Your product is great but your customer care sucks!", can be easily detected by LDA models, which also report the right sentiment.
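
As a rough illustration of point 3, the sketch below compares the two example sentences using character-trigram count vectors and cosine similarity. This representation is an assumption chosen to keep the example free of dependencies; it is not word2vec, and a trained embedding model would give different numbers. The third sentence is invented as a contrast.

```python
import math
from collections import Counter

def char_trigrams(text):
    """Count the overlapping 3-character windows of the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

correct = char_trigrams("my father is coming home")
typo = char_trigrams("my father is coming ome")
other = char_trigrams("the invoice amount is wrong")  # invented contrast sentence

sim_typo = cosine(correct, typo)
sim_other = cosine(correct, other)
print(f"correct vs typo:  {sim_typo:.2f}")
print(f"correct vs other: {sim_other:.2f}")
```

The misspelled sentence stays very close to the correct one, while the unrelated sentence is far away. An exact keyword match on "home", by contrast, would simply miss the misspelled version.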

Easy as 1, 2 and 3 ... well, easy if you are working with the right partner ;-)

Stat Reviewer

Statistical and Medical Informatics Reviewer

6 years ago

Good article. The following tutorial may also be helpful: "Tutorial on Mining of Biomedical Literature with the Help of R Package", www.vinaitheerthan.com/research.php

Andrew Tucker, Ph.D.

Co-Founder, CEO at Mettle Capital

7 years ago

I agree that the distinction between keyword search and text mining isn't clear in clients' minds. As a result, we haven't yet realised the full potential of the latter. As a specialist group, we need to be showing what is possible with text mining. For example, my team look at what drives trust in brands to predict market share growth. Perhaps your next post could cover standout examples of where businesses use text mining? Happy to contribute if that helps?

Walter Mitchell

Maritime Domain Expert | Analysis and Advisory

7 years ago

SparkCognition's DeepNLP would be very useful in this app

Patrizia Alfiero

Head of Pricing Analytics - Poste Italiane

7 years ago

Thank you, Federico! A concentrate of technique and wisdom.

Brigitte Kobi

Stilettissimo - Numbered Luxury - The Next Level of Footwear - Swiss Design - Italian Artistry - Uncompromising Excellence - The Right Shoe Is Glamour

7 years ago

Very clearly explained. Thank you, Federico Cesconi
