What I learned while building a semantic search engine

In 2020, we started to build a semantic search engine with a team of four developers.

Over a few months, we analysed the market and met with potential clients until we decided not to pursue this idea any further.

Still, I believe there is an opportunity to build a successful business in this field.

For that reason, let me share the data I collected to help someone else make it happen.

Information management is a mess

For years, I have been organising files by coming up with universally understood names, forming elaborate folder structures, and distributing access to those files.

Often, I have even maintained reorganised copies of the whole company’s knowledge base to ensure that I could find everything when needed, potentially creating a security risk. And I know that other people have the same frustrations.

Eventually, I wrote down an idea for a product in the spring of 2020 that would help people like me find and organise their files better.

The overall premise at that point was that people are bad at systematically naming and grouping files and folders, thus making it hard for others (and themselves) to find necessary files when needed.

Based on that, I described a potential solution whereby we would first create an internal search system for cloud storage that finds files based on content. Even better, I hoped it would evolve into an AI tool that perhaps even renames and relocates them to create a better system.

However, I understood from the get-go that much of the created work is only saved in private communication, like messages and emails. Thus, it was clear that we would need to integrate all communication services and enable searching across them to provide a full-service solution.

Meanwhile, I recognised that in larger organisations, teams need to invest in separate processes to manage access rights and data security. Thus, a potential product also had to let teams expose only the existence of certain files, so that people could find them in a search and then request access to them.
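To make that access idea concrete, here is a minimal sketch of how "visible but locked" search results could work. It is only an illustration under my own assumptions, not what we built: the names `FileEntry`, `SearchHit`, `discoverable` and `search` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class FileEntry:
    name: str
    content: str
    allowed_users: Set[str]
    discoverable: bool = True  # hypothetical flag: may non-members see that it exists?


@dataclass
class SearchHit:
    name: str
    can_open: bool
    snippet: Optional[str]  # None when the user may only request access


def search(entries: List[FileEntry], user: str, term: str) -> List[SearchHit]:
    """Return full hits for permitted files and locked hits for discoverable ones."""
    hits = []
    for entry in entries:
        if term.lower() not in entry.content.lower():
            continue
        if user in entry.allowed_users:
            hits.append(SearchHit(entry.name, True, entry.content[:40]))
        elif entry.discoverable:
            # Only the file name is revealed; the content stays hidden.
            hits.append(SearchHit(entry.name, False, None))
    return hits


entries = [
    FileEntry("budget.xlsx", "Annual budget forecast", {"alice"}),
    FileEntry("payroll.xlsx", "Payroll budget details", {"carol"}, discoverable=False),
]
results = search(entries, "bob", "budget")
```

In this sketch, `bob` sees that `budget.xlsx` exists and can request access to it, while `payroll.xlsx` stays invisible to him. Note that even a locked hit reveals that a file matches the search term, which is a trade-off a real product would have to weigh.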

But as with many other ideas, I initially just wrote it down and forgot about it.

There is no universal search tool

A few months later, I found a team of engineers who had built a similar solution: a simple integration between Confluence and a Microsoft Teams-based bot for finding files more quickly.

After some initial discussions with them, we decided to collaborate and develop a semantic search engine for SMEs and enterprises. It would differ from other search engines by searching for what people meant rather than only matching the specific keywords they typed.
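The difference between keyword and semantic search can be shown with a toy sketch. It assumes hand-written concept vectors (`TOY_VECTORS`) in place of a trained embedding model, and every name in it is made up for illustration; a real engine would use learned embeddings and a proper index.

```python
import math

# Hypothetical hand-assigned vectors over four concepts:
# (vehicle, repair, payment, leisure). A real system would learn these.
TOY_VECTORS = {
    "car": (1.0, 0.0, 0.0, 0.0),
    "automobile": (1.0, 0.0, 0.0, 0.0),
    "repair": (0.0, 1.0, 0.0, 0.0),
    "maintenance": (0.0, 1.0, 0.0, 0.0),
    "invoice": (0.0, 0.0, 1.0, 0.0),
    "bill": (0.0, 0.0, 1.0, 0.0),
    "holiday": (0.0, 0.0, 0.0, 1.0),
    "schedule": (0.0, 0.0, 0.0, 1.0),
}

DOCS = {
    "a.txt": "automobile maintenance bill",
    "b.txt": "holiday schedule",
}


def tokenize(text):
    return text.lower().split()


def keyword_search(query, docs):
    """Return documents that share at least one exact word with the query."""
    query_words = set(tokenize(query))
    return [name for name, text in docs.items() if query_words & set(tokenize(text))]


def embed(text):
    """Average the toy vectors of all known words in the text."""
    vectors = [TOY_VECTORS[t] for t in tokenize(text) if t in TOY_VECTORS]
    if not vectors:
        return (0.0, 0.0, 0.0, 0.0)
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(query, docs):
    """Rank documents by cosine similarity to the query, dropping zero scores."""
    query_vector = embed(query)
    scored = [(cosine(query_vector, embed(text)), name) for name, text in docs.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


QUERY = "car repair invoice"
keyword_hits = keyword_search(QUERY, DOCS)
semantic_hits = semantic_search(QUERY, DOCS)
```

Here the keyword search finds nothing, because the query and `a.txt` share no exact words, while the semantic search still returns `a.txt`, because both texts map to the same concepts.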

From there on, I started doing actual research.

I discovered that over 1 trillion work documents are produced each year and that they are primarily unstructured, so people generally do not know where to find them. I also wrote down the following findings.

  • Information is fragmented across both cloud and on-premise storage systems.
  • Because there is no universal search tool and little time to structure data, valuable information gets lost.
  • This results in considerable wasted time, with knowledge workers spending up to 20% of their time searching for things they know exist but cannot quickly locate.
  • And as 90% of all data was created in the last two years, the problem keeps getting worse.

When digging deeper, I found that, according to McKinsey’s 2012 study, employees spend 1.8 hours every day searching for and gathering information.

Next, it was time to check whether there was a large enough market for such a solution.

According to a Grand View Research report, the global enterprise search market (sized by end use) was worth $5 billion as of 2020.

An IDC report stated that the compound annual growth rate (CAGR) of data created by enterprise work was 44.6%, also as of 2020.

And according to another IDC report, 85% of all data is expected to be unstructured by 2025.

Moreover, if the market size estimates are to be believed, Maximize Market Research expects the global enterprise search market to reach $9.5B by 2027. At the same time, Credence Research, Inc stated in its 2020 report that the global big data analytics market would reach $105B by 2027.
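As a rough sanity check on these figures, the snippet below works out the growth they imply. It assumes the $5B (2020) and $9.5B (2027) estimates are comparable even though they come from different research firms, and that the 44.6% data growth rate stays constant; the function names are my own.

```python
import math


def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate that takes start_value to end_value."""
    return (end_value / start_value) ** (1 / years) - 1


def doubling_time(annual_growth: float) -> float:
    """Years needed to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


# Market estimates in billions of USD, taken from the two reports above.
market_cagr = implied_cagr(5.0, 9.5, 2027 - 2020)

# Enterprise data growth rate from the IDC report above.
data_doubling = doubling_time(0.446)

print(f"Implied market CAGR: {market_cagr:.1%}")
print(f"Data doubling time: {data_doubling:.1f} years")
```

Under those assumptions, the enterprise search market would grow by roughly 9.6% a year, while the volume of enterprise data would double in a little under two years.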

Give people access to their data

Based on the information gathered, we believed the problem in the SME and enterprise field was big enough to justify digging deeper.

First, we decided to interview potential clients inside my network.

Our guesstimate was that people working with many different documents would need such a tool the most. Thus, we focused on meeting various consultants, finance people and lawyers.

After the initial meetings, we recognised that clients wanted an easy-to-use tool that gives people access to their data whilst being highly scalable and secure.

However, we challenged ourselves not to focus only on internal documentation search. Instead, we saw a need for a general semantic search tool that would support building search features for websites and other products.

So we started to build an AI-driven hybrid cloud enterprise search solution that would feature:

  • Hybrid cloud — accessing data in the cloud and on-premise
  • Multi-lingual — working across 50 languages
  • Scalability — having a 75–100ms response time with 100GB of files
  • Security — with data being encrypted & not uploaded to our servers
  • Advanced AI — learning user behaviour on the go
  • Analytics — via built-in unstructured data analysis tools

But shortly after launching our MVP, the team realised that this might not have been the product we wanted to build.

On the one hand, some interviewees stated that they could use Google’s Programmable Search Engine, Algolia’s Search API or Elastic’s open-source stack to get adequate results at a much lower cost, at least compared to our suggested pricing plan.

On the other hand, new startups like Searchable.ai and Command E had recently launched, and we could not figure out how to differentiate our product enough from them to make it worth pursuing this idea further. Plus, while testing out their products, I didn’t find any of them sticky enough.

But most of all, we acknowledged that this particular problem had less to do with building an AI-driven product and was primarily a search optimisation problem. In contrast, the engineering team had wanted to work on a more machine-learning-focused problem.

Therefore, we went back to the drawing board to brainstorm what else we could build to help solve a similar problem.

That is when we grasped that such search tools, like most other products out there, needed some AI-related building blocks without being AI-driven themselves.

With this in mind, I began new research into what companies do when it comes to implementing AI-based features in their products.

Still, if anyone is planning to build a semantic search engine or something similar, feel free to use any of the statistics I’ve referred to here.

At the same time, I’m happy to share more of my thoughts or potentially test your product in the future.

Vassili Rusmanov

Senior DevOps Lead / Chief Security Officer / Quality Manager

3 years ago

Did you take a look at Data Lakes and the platforms that support them? They exist exactly for unstructured data and provide lots of the things you mentioned, like some search and logic. And of course, the idea of Data Lakes is to have some AI/ML/DL when/if you have such algorithms. It dramatically simplifies the product. You don’t need to invest time in storing and aggregating all and everything.
