AI search tools for patents: how to test & compare them? Part I
Linus Wretblad
Innovation Advisor * Boosting IP decisions * QPIP Qualified Patent Information Professional * Founder & CEO
Part I (of III): The background to unveil the black box
Lately, the AI buzz has also reached the work of the information professional. It is indeed a new option for expanding the prior art toolbox, offering smarter automated information retrieval through text-based searches. However, the common message from solution providers seems to be "Yes, we have it too, AI is the future" rather than transparency about actual performance. Evaluating new tools has therefore become quite a challenge. On the one hand, the AI black box, with all the complex algorithms behind it, is very difficult to understand. On the other hand, validating the performance manually is a time consuming process.
As a user you would like to know straight away whether the quality of a tool is good enough, and whether the results contribute to a more efficient search and decision process. In short: does it help me or not? Other questions follow: what is the actual performance within my specific technical domain, and how do different providers compare? The focus of an evaluation should be on the performance delivered, not on the technology used. It is comparable to buying a car with an automatic gearbox: as a driver you are interested in how the car performs, not in how the gearbox works.
It would be great to have a universal model for evaluating such text-based search tools: a standard test to validate any system that, from a text input and without human interaction, automatically retrieves supposedly relevant documents. Better still would be an automated procedure for running the test, instead of manually defining and reviewing search queries. That is why I am writing this article: to share the knowledge we have gained from research and experience, and to create transparency.
Let us first back up a little to establish a ground for such an evaluation. Information retrieval, the general term in academia for research on searching and finding relevant data, has been around since the early 1960s. Two basic metrics are defined for measuring performance: Precision and Recall [1] (read more online at Wikipedia). Precision defines, as it sounds, the quality ratio of the retrieved hits or documents. In short: how happy are we with what we got? The score is the ratio between the number of relevant documents in the result and the total number of documents retrieved. In a perfect world there is zero noise and every hit is relevant to the subject of the input query text. Recall shows the proportion of correct answers found. In common words: have we found all the documents we were looking for? The score tells us how many of the relevant documents in a known set of right answers were found by the search tool.
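To make the two definitions concrete, here is a minimal sketch in Python; the document identifiers and sets are invented example data, not taken from any real search.

```python
# Minimal sketch of Precision and Recall on a single query (hypothetical data).
retrieved = {"EP1234567", "US2005123456", "WO2010111222", "US7654321"}  # hits returned by the tool
relevant  = {"EP1234567", "US7654321", "JP2008999888"}                  # ground truth: known correct answers

found = retrieved & relevant             # relevant documents that were actually retrieved

precision = len(found) / len(retrieved)  # how much of what we got is relevant (noise level)
recall    = len(found) / len(relevant)   # how much of what we wanted was actually found

print(f"Precision: {precision:.2f}")     # 2 of 4 hits are relevant -> 0.50
print(f"Recall:    {recall:.2f}")        # 2 of 3 known answers found -> 0.67
```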
The relation and trade-off between the two scores is a contradictory challenge. By looking at more documents in the result list to identify further correct answers and increase Recall, we also decrease Precision accordingly. The relation is shown in the following graph.
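The same relation can also be seen numerically. Below is a minimal sketch, with an invented ranked hit list and answer set, that computes Precision and Recall at increasing cutoff depths k: digging deeper into the list raises Recall but tends to lower Precision.

```python
# Illustration of the Precision/Recall trade-off along a ranked hit list (hypothetical data).
ranked_hits = ["D1", "D2", "D7", "D3", "D9", "D8", "D4", "D6", "D5", "D0"]  # tool output, best first
relevant = {"D1", "D2", "D3", "D4"}                                          # ground truth answers

for k in (2, 4, 6, 8, 10):
    top_k = ranked_hits[:k]
    found = sum(1 for doc in top_k if doc in relevant)
    precision_at_k = found / k                 # relevant share of what we have read so far
    recall_at_k = found / len(relevant)        # share of the known answers found so far
    print(f"k={k:2d}  Precision={precision_at_k:.2f}  Recall={recall_at_k:.2f}")
# Output: Precision falls from 1.00 towards 0.40 while Recall climbs from 0.50 to 1.00.
```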
The evaluation procedure for the two metrics requires preparing test queries and their associated correct answers, a so-called ground truth (sometimes also called a gold standard). Automating the measurement of Precision is very complex, as you need to know the relevancy of every retrieved hit in relation to the query text. Algorithms are not yet accurate enough to assess whether a document is explicitly relevant or not, so a manual review remains necessary to obtain a reliable score. That is, an IP specialist who understands the technical domain of the query must assess and categorize each retrieved document as relevant or noise. This is a very tedious process, and in practice it is limited to random sampling with quality check procedures on a few cases.
Verifying Recall is more straightforward. Each case is defined by a query (text input) and a correct set of answers (documents to be found). The validation procedure is limited to identifying how many of the known documents were found. An IP specialist could manually assess this number and calculate a performance ratio. It still requires quite some manual work, both to prepare the queries and to assess the score. For this task, however, automation is possible, as the analysis is limited to identifying known answers.
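As an illustration, such an automated Recall check could look roughly like the sketch below: for each test case, the query text is sent to the search tool and we count how many of the known answers appear among the top N results. The `search_tool` interface and the test cases are placeholders assumed for the example, not an actual product API.

```python
# Sketch of an automated Recall evaluation over a set of test cases (hypothetical API and data).
# Each test case holds a query text and the set of documents known to be correct answers.
test_cases = [
    {"query": "method for wireless charging of an electric vehicle", "answers": {"EP111", "US222"}},
    {"query": "biodegradable polymer packaging film",                "answers": {"WO333", "US444", "EP555"}},
]

N = 50  # how deep into the hit list we look

def evaluate_recall(search_tool, cases, n=N):
    """Average Recall@n over all test cases; search_tool(query, n) is assumed
    to return a ranked list of document identifiers (placeholder interface)."""
    scores = []
    for case in cases:
        hits = set(search_tool(case["query"], n))
        found = hits & case["answers"]
        scores.append(len(found) / len(case["answers"]))
    return sum(scores) / len(scores)

# Usage, once some search_tool implementation is available:
# print(f"Average Recall@{N}: {evaluate_recall(my_tool, test_cases):.2f}")
```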
For these kinds of evaluations, spinning a handful of queries is far from enough. You would need thousands of test queries to get statistically reliable quality measurements, which is not feasible to do manually. Some research has been done on defining models for automatically evaluating Precision and Recall based on log data or search reports.
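A back-of-the-envelope way to see why: if we treat each test case's score as a noisy sample of the tool's true quality, the uncertainty of the measured average shrinks only with the square root of the number of queries. The numbers below are purely illustrative.

```python
# Back-of-the-envelope: uncertainty of a measured Recall of ~0.7
# as the number of test queries grows (illustrative numbers only).
from math import sqrt

p = 0.7  # assumed "true" Recall of the tool
for n in (10, 100, 1000, 10000):
    se = sqrt(p * (1 - p) / n)  # standard error of a proportion
    print(f"{n:5d} queries -> measured Recall roughly {p:.2f} +/- {1.96 * se:.3f} (95% interval)")
# With 10 queries the interval is about +/- 0.28; only in the thousands does it tighten to a few percent.
```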
For an information professional, the performance question is quite simple. You want to see the most relevant documents at the top of the hit list (high Recall) to save time during the reading process. At the same time, you only want to see relevant documents in the result list (high Precision), so that you can trust the results. Here we have the dilemma of Recall and Precision in its very essence: improving one will lower the other.
The balance between the two is the key question for creating trust in black box solutions for automated searches: a result that shows both high Precision and high Recall. Furthermore, how many documents is it reasonable to review in a screening process, given a certain Recall and Precision score? From these thoughts, we should define an evaluation model.
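One hedged way to reason about that last question: if a case has G truly relevant documents and a tool reaches Recall R at Precision P, then roughly (R x G) / P documents have to be read to encounter those answers. A tiny sketch with invented numbers:

```python
# Rough estimate of screening effort for a given Recall and Precision (illustrative numbers).
def documents_to_review(n_relevant, recall, precision):
    found = recall * n_relevant  # relevant documents we expect to encounter at that Recall
    return found / precision     # total documents read to encounter them

# Example: 10 truly relevant documents exist for the case.
print(documents_to_review(n_relevant=10, recall=0.8, precision=0.10))  # -> about 80 documents to review
print(documents_to_review(n_relevant=10, recall=0.8, precision=0.40))  # -> about 20 documents to review
```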
The next step is to create such a test platform for automatic evaluation and to define a baseline structure. Our research and results are disclosed in my next post, “Defining a Baseline and a Ground Truth”.
[1] Harman, Donna: Information Retrieval Evaluation. Morgan & Claypool Publishers, 2011.
Linus Wretblad is the co-founder of Uppdragshuset and IPscreener. He has a Master of Science degree in Technical Physics and Electrical Engineering from Linköping University, Sweden, and holds a French DEA degree in microelectronics. He studied for an MBA in Innovation and Entrepreneurship at the University of Stockholm. Linus has 20 years' experience of innovation processes and IPR with a focus on prior art searches and analysis, starting as an examiner at the Swedish Patent Office. Since 2008 he has been on the steering committee of the Swedish IP Information Group (SIPIG), and during 2012-2017 he served on the board and as president of the Confederacy of European Patent Information User Groups (CEPIUG). Linus is one of the coordinators of the certification program for information professionals. He is currently involved in a Eurostars research project together with the Technical University of Vienna on automated, text-based and AI-supported prior art screening.
This article and its content are copyright of Linus Wretblad - © IPscreener 2019. All rights reserved. Any redistribution or reproduction of part or all of the contents in any form is prohibited other than the following:
- you may print or download extracts to a local hard disk for your personal and non-commercial use only
- you may copy the content to individual third parties for their personal use, or disclose it in a presentation to an audience, only if you acknowledge this as the source of the material
You may not, except with express written permission, commercially exploit the content or store it on any other website.