Smart indexing your business

Miguel Gaspar

AI & GenAI Enabler / Advocate

Published: January 8, 2021

Going beyond full-text indexing for improved search results

It is common to index documents using full-text indexing, which gives end users the ability to search for documents based on the words they contain. This works well; however, there are techniques that promise better results and reduce the time needed to find relevant documents.

Most written text has a lot of function words, like “a”, “the”, or “is”, which are important to the person reading the content because they help it flow in a cohesive manner, but aren’t necessarily as important to someone searching the content of your documents. Consider the word “the”, which could easily appear hundreds of times or more in a standard email or attachment. When a user performs a search, part of the algorithm that calculates the relevance of any document in the search index counts the number of times a word appears in the text being searched. The more often it appears, the more relevant the document is considered to be.
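
To illustrate how raw word counts favour function words, here is a minimal Python sketch. The sample sentence and the `term_frequencies` helper are assumptions made for this example; real search engines use more elaborate relevance formulas than a plain count.

```python
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Count how often each lowercase word appears in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Hypothetical document text, used only for illustration.
doc = "The invoice for the server is in the attachment. The server is new."

tf = term_frequencies(doc)
top_terms = tf.most_common(3)

# "the" dominates the counts even though it says nothing about the topic.
print(top_terms)
```

A relevance score built only on these counts would rank this document highly for a query containing “the”, which is exactly the behaviour stop-word filtering is meant to prevent.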

To avoid ranking documents as relevant based on words that add no meaning to a search, best practice is to apply preprocessing that removes stop words, like the ones mentioned above, before indexing the documents on a full-text field.
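
The filtering step can be sketched in a few lines of Python. This is not how HCI implements it; the `STOP_WORDS` set and the `index_terms` function are hypothetical names, and the list is deliberately tiny.

```python
import re

# Hypothetical, deliberately short stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "in", "for", "of", "and", "to"}


def index_terms(text: str, stop_words: set[str] = STOP_WORDS) -> list[str]:
    """Tokenize the text and drop stop words before the terms reach the index."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stop_words]


terms = index_terms("The invoice for the server is in the attachment.")
print(terms)
```

Passing a different `stop_words` set corresponds to adjusting the list for a specific domain.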

Hitachi Content Intelligence (HCI) gives you that out of the box, along with the ability to adjust the stop-word list to your specific needs.

How often do you end up wasting time on irrelevant results, unable to find what you were looking for?

Filtering stop words alone won’t give you the best possible results if you have objects, like emails and attachments, that contain a lot of words. Many of those words aren’t, and shouldn’t be, considered stop words, yet they will be indexed even though they are not relevant.

If we take into consideration that the goal of indexing documents is to find the relevant documents when a search is executed, the best search results are the ones that provide the fewest results with the highest coverage of relevant documents.
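
In Information Retrieval terms, “fewest results” and “highest coverage” correspond to precision and recall. The sketch below computes both for two made-up result sets; the document identifiers are assumptions for illustration only.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for a set of retrieved documents."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids: two documents are actually relevant.
relevant = {"doc1", "doc2"}
broad = {"doc1", "doc2", "doc3", "doc4", "doc5", "doc6", "doc7", "doc8"}
narrow = {"doc1", "doc2", "doc3", "doc4"}

broad_scores = precision_recall(broad, relevant)
narrow_scores = precision_recall(narrow, relevant)
print(broad_scores, narrow_scores)
```

Both result sets cover every relevant document, but the narrower one does so with half as many results, so the user has less to sift through.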

Text indexing and retrieval techniques have their roots in the field of Information Retrieval where the task is to extract documents that best match a query.

Articles such as Document Indexing Using Named Entities (2001), by Rada Mihalcea and Dan I. Moldovan in Studies in Informatics and Control, show that indexing documents by named entities is a way to improve the relevance of documents retrieved from large collections: it reduces the number of retrieved documents by a factor of 2, while still retrieving the relevant documents.

When indexing a huge number of documents using full-text indexing, even with stop-word filtering applied, a search can still return non-relevant documents. In these cases, stemming and named entity recognition can be used to reduce the number of retrieved documents while keeping the relevant ones.
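
The effect of indexing by named entities can be sketched with a toy inverted index. Everything here is an assumption for illustration: the `KNOWN_ENTITIES` lookup set stands in for a trained NER model, and the three documents are invented.

```python
import re
from collections import defaultdict

# Hypothetical gazetteer standing in for a trained NER model.
KNOWN_ENTITIES = {"hitachi", "lisbon"}

# Invented sample documents.
DOCS = {
    "d1": "Hitachi opened a new office in Lisbon.",
    "d2": "The new office chairs arrived in the office today.",
    "d3": "Meeting notes: the office move to Lisbon is approved.",
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs, keep=None):
    """Build term -> document ids; if `keep` is given, index only those terms."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            if keep is None or token in keep:
                index[token].add(doc_id)
    return index


def search(index, query):
    """Return documents containing any query term present in the index."""
    results = set()
    for token in tokenize(query):
        results |= index.get(token, set())
    return results


full_text_index = build_index(DOCS)
entity_index = build_index(DOCS, keep=KNOWN_ENTITIES)

full_hits = search(full_text_index, "Lisbon office")
entity_hits = search(entity_index, "Lisbon office")
print(sorted(full_hits), sorted(entity_hits))
```

With the full-text index, the generic word “office” pulls in the document about chairs; the entity index returns only the two documents that mention Lisbon. The entity index also holds far fewer terms.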

There is still a consideration to keep in mind: results will depend on the named entities you recognize and use for indexing. Different industries and businesses will also find some entities more important than others, so having some flexibility can really make a difference to the results you provide to end users, helping them find the documents they were looking for much faster.

Improving results with stemming and named entity recognition

“The best search results are the ones that provide the fewest results with the highest coverage of relevant documents” means that a good search result gives you as many of the relevant documents as possible, while restricting the results to that minimum.

While Hitachi Content Platform (HCP) is great for storing objects with relevant annotations and compliance, Hitachi Content Intelligence (HCI) is great for pre-processing, enriching and indexing those documents in an easy and very performant way.

I am proposing that stemming and named entity recognition be performed in HCI, indexing documents in a way that retrieves fewer, but relevant, results.

HCI is flexible and allows us to create custom plugins, which is why we have created a plugin with the goal of providing the fewest results with the highest coverage.

Together with HCI, while processing, pre-processing and enriching document metadata, the plugin can perform the following tasks out of the box:

Named Entity Recognition (NER), in the English language:

Stemming can be a bit trickier. There are two important phases when working with indexed documents: index and search. To be effective, stemming should be applied in both the indexing and the search phases, and HCI allows you to configure this on its internal indexes.
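
The need to stem in both phases can be shown with a toy example. The `toy_stem` function below is a crude suffix stripper written for this sketch, not a real stemming algorithm and not what HCI uses; the documents and query are invented.

```python
import re
from collections import defaultdict


def toy_stem(word: str) -> str:
    """Strip a few common English suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text, stem):
    """Tokenize the text and optionally stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) if stem else t for t in tokens]


def build_index(docs, stem):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text, stem):
            index[term].add(doc_id)
    return index


def search(index, query, stem):
    results = set()
    for term in analyze(query, stem):
        results |= index.get(term, set())
    return results


# Invented sample documents.
DOCS = {"d1": "Indexing documents", "d2": "Quarterly report"}

stemmed_index = build_index(DOCS, stem=True)

# Same analysis at index and search time: the query terms match the stored stems.
both_phases = search(stemmed_index, "indexed documents", stem=True)
# Stemming only at index time: the raw query terms no longer match anything.
index_only = search(stemmed_index, "indexed documents", stem=False)
print(both_phases, index_only)
```

When the query is analyzed the same way as the documents, “indexed documents” finds the document containing “Indexing documents”; when only the index is stemmed, the same query finds nothing.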

Keep in mind that the real value will come from training the models to recognize the named entities that are specific to your business, while stemming can be applied out of the box. Applying both is where you will see the most improvement.

Another advantage is that your index sizes will be reduced, lowering the cost of the required infrastructure and improving performance, both at the indexing stage and when retrieving search results.

Do not waste valuable time looking at non-relevant results. Improve the efficiency of enterprise search and give your users time to perform the tasks that are crucial for your business.

Reach out to me and we will help you exceed your business expectations.
