AI search tools for patents; How to test & compare them? Part II
Linus Wretblad
Innovation Advisor * Boosting IP decisions * QPIP Qualified Patent Information Professional * Founder & CEO
Part II (of III): Defining a Baseline and a Ground Truth
In Part I, I described the pains of measuring the performance of a black-box search solution within the industry, and how precision and recall are key indicators for assuring the quality of such performance. This post covers what a platform for automated performance measurement could look like.
For this task, we carried out a research study together with TU Vienna (special thanks to Mihai Lupu and Linda Andersson) to review and refine existing procedures. Much of the research was based on evaluation models presented in the TREC research initiative [1] (read more online at NIST). The work resulted in a platform for assessing the quality of text-based searches on a large scale. Our key questions were:
- What input delivers good results and where are potential pitfalls?
- How can we reliably measure the performance to guarantee continuous improvements?
The goal is to exemplify quality measurement in an information retrieval procedure and to track such scores across different industries. As finding good documents is a primary objective of a search, the first evaluation model traces recall scores and the associated ranking (at which position in the result list the answers are found). The challenge is to define a methodology and a standard set of queries together with the known correct answers (documents), i.e. a ground truth (also called a gold standard).
Luckily, the patent domain does indeed have such a "truth", if we presume that the examination reports from the patent offices contain the correct answers to be found. Of course, it may be possible to find other prior art documents that are as relevant as (or even more relevant than) the documents cited in the examination report. However, this definition of the ground truth provides a transparent evaluation and permits us to automate the procedure on a large set of queries. That is essential for obtaining a reliable performance score in an efficient way.
Thus, the ground truth consists of queries (the patent applications) and the known answers against which the results are compared (the relevant documents cited by the examiner). We measure recall as the ratio of relevant citations mentioned in the patent office examination report that the automated tool actually found. This approach makes it possible to create an automated evaluation procedure and is a first step towards opening up the black box for comparison.
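To make this concrete, here is a minimal sketch (not the platform's actual code) of the recall calculation for a single query; the document identifiers are made up purely for illustration.

```python
def recall(examiner_citations, retrieved_documents):
    """Fraction of examiner-cited documents that the tool retrieved."""
    cited = set(examiner_citations)
    if not cited:
        return 0.0
    found = cited.intersection(retrieved_documents)
    return len(found) / len(cited)

# Hypothetical example: the examiner cited three documents and the
# tool's result list contains two of them -> recall = 2/3.
citations = ["EP1234567", "US7654321", "WO2010123456"]
results = ["US1111111", "EP1234567", "US2222222", "WO2010123456"]
print(recall(citations, results))  # 0.666...
```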
From the information professional's perspective, we should focus the evaluation model on recall scores within a limited number of presented result hits. We want to understand the ranking of the correct answers retrieved: how many of them are found, and where in the result list? Tony Trippe also emphasized this after my first post and even suggested it as a third metric. Optionally, you could also measure at what position in the result set the tool found the first correct answer (e.g. the first knockout citation was found at position 187 in the result set).
However, we chose to focus on the interval approach for clarity and a better overview. To achieve this, we define ranges corresponding to intervals normally accepted or used for screening purposes, e.g. how many of the predefined documents that "should" be found in a search were retrieved among the top 100, 50, 25 and 10 listed hits respectively.
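A sketch of how those interval scores could be computed for a single query, assuming a ranked list of document identifiers; it also returns the rank of the first correct answer, the optional metric mentioned above. The function name and default cutoffs are illustrative only.

```python
def recall_at_cutoffs(examiner_citations, ranked_results, cutoffs=(10, 25, 50, 100)):
    """Recall within the top-k hits for each cutoff, plus the 1-based rank
    of the first correct answer (None if no citation was retrieved)."""
    cited = set(examiner_citations)
    scores = {
        k: (len(cited.intersection(ranked_results[:k])) / len(cited)) if cited else 0.0
        for k in cutoffs
    }
    first_hit = next((i + 1 for i, doc in enumerate(ranked_results) if doc in cited), None)
    return scores, first_hit
```

Averaged over a whole set of queries, these per-interval scores are what the baseline measurements below are built on.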
The second challenge is to trace the performance across technologies. Here the patent collections help: they are highly structured by the IPC/CPC classification system, categorized into main technical sections from A to H with an associated hierarchy (explained in more detail here). This permits us to measure and compare the search performance for different technologies, since the query sets (documents) can be grouped according to the classification. It also creates transparency around the quality of automatically generated results.
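For illustration, the subclass and group levels used for grouping can be derived directly from a classification symbol; a small sketch, assuming symbols in the common "G06F 3/048" notation:

```python
def subclass_and_group(symbol):
    """Split an IPC/CPC symbol such as 'G06F 3/048' into the subclass
    ('G06F') and the main-group level ('G06F-003')."""
    subclass, rest = symbol.split()        # e.g. 'G06F' and '3/048'
    main_group = rest.split("/")[0]        # '3'
    return subclass, f"{subclass}-{main_group.zfill(3)}"

print(subclass_and_group("G06F 3/048"))   # ('G06F', 'G06F-003')
```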
For statistical relevance, we required at least 100 queries per defined technical area. To cover all populated technical domains, the test collection added up to a total of 175,000 queries. These are randomly selected patent applications with examination reports, spread over classes and applicants from the last 20 years. This constitutes the baseline query set used for the measurements.
Based on the assumption that the text input affects the results the most, we experimented with different input formats and their associated output performance. The old rule of "garbage in, garbage out" applies even more to algorithms. Thus, we chose the toughest input possible for a "worst case" evaluation: only the title and abstract as input, with the test queries grouped at Subclass level (e.g. G06F) and Group level (e.g. G06F-003). This represents a good trade-off between granularity of the query sets, harmonization of the query length and the technology focus. As a measurement reference, and to establish a graphic presentation of the results, we ran the baseline queries through a text-based prior art screening algorithm.
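As a rough sketch of how such a baseline run could be set up, the snippet below builds the worst-case query from title and abstract and averages recall within the top 100 hits per subclass; the field names and the `search_tool` callable are placeholders for whatever black-box tool is under test.

```python
from collections import defaultdict
from statistics import mean

def build_query(application):
    # "Worst case" input: title and abstract only.
    return f"{application['title']}\n{application['abstract']}"

def baseline_recall_per_subclass(applications, search_tool, k=100):
    """Average recall within the top-k hits, grouped by subclass (e.g. 'G06F')."""
    per_subclass = defaultdict(list)
    for app in applications:
        ranked = search_tool(build_query(app))       # ranked list of document IDs
        cited = set(app["examiner_citations"])
        if not cited:
            continue
        found = cited.intersection(ranked[:k])
        per_subclass[app["subclass"]].append(len(found) / len(cited))
    return {subclass: mean(scores) for subclass, scores in per_subclass.items()}
```

Run over at least 100 queries per subclass, as described above, this yields the per-class recall scores shown in the diagram below.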
The resulting diagram below shows the recall scores as percentages: the average share of the answers (cited documents) found within the first 100 hits, in view of the total set of citations reported by the examiner. The scores in our example are grouped at Subclass level (G06F).
The performance shows major variation depending on the technical classes in question. The fluctuation does not come as a surprise, and it typically applies to any automated tool on the market. However, even though this conclusion is quite obvious, it is normally not communicated to me as a user. Furthermore, as the input text has the highest impact on the quality of the output, it is important to understand when I need to retailor an input for better performance. Using a baseline is one way to unveil and identify such domains. More importantly, by including multiple result sets in one view you can also compare the performance of different providers of automated tools. That would help in understanding where one tool is better than another.
However, a diagram divided into individual classes still becomes complex. The end user would certainly wish for more transparency. It is not common knowledge what each class actually relates to, and even experts do not know them all by heart. An analysis view in a more comprehensible presentation format would be a great help.
I will elaborate on how to establish a more transparent presentation in my coming post “AI performance understandable by everyone”.
[1] Lupu, Huang, Zhu, Tait: TREC chemical information retrieval – An initial evaluation effort for chemical IR systems. World Patent Information, 2011.
Linus Wretblad is the co-founder of Uppdragshuset and IPscreener. He has a Master of Science degree in Technical Physics and Electrical Engineering from Linköping University, Sweden, and holds a French DEA degree in microelectronics. He studied an MBA in Innovation and Entrepreneurship at the University of Stockholm. Linus has 20 years of experience in innovation processes and IPR with a focus on prior art searches and analysis, starting as an examiner at the Swedish Patent Office. Since 2008 he has been on the steering committee of the Swedish IP Information Group (SIPIG), and during 2012-2017 he served on the board and as president of the Confederacy of European Patent Information User Groups (CEPIUG). Linus is one of the coordinators of the certification program for information professionals. He is currently involved in a EUROSTAR research project together with the Technical University of Vienna on automated, text-based and AI-supported prior art screening.
This article and its content are copyright of Linus Wretblad - © IPscreener 2019. All rights reserved. Any redistribution or reproduction of part or all of the contents in any form is prohibited other than the following:
- you may print or download extracts to a local hard disk for your personal and non-commercial use only
- you may copy the content to individual third parties for their personal use, or disclose it in a presentation to an audience, only if you acknowledge this as the source of the material
You may not, except with express written permission, commercially exploit the content or store it on any other website.