Deconstruction of fake #AI Benchmarks - Recommender Systems Case Study

We have recently spent a lot of time creating & delivering top-notch AI-driven solutions and products at Synerise. In the last year alone we have published 7 papers, which shows our team's attitude to scientific work. Moreover, we have won or been awarded in prestigious competitions organized by amazing companies such as Booking.com, Rakuten, Stanford, Facebook AI, or supported by specialists from Amazon, Apple, Nvidia, Alibaba, Adobe, Zalando, eBay and more - all of these events were highly professional, both in terms of methodology and measurement techniques.


We have taken part in dozens of conferences presenting our scientific papers and invested in an AI school program for young people, because we believe we can support the democratization of AI only by educating society. We have also published our open-source framework cleora.AI to the community: a general-purpose model for efficient, scalable learning of stable and inductive entity embeddings for heterogeneous relational data. From the data processing perspective, we have built our proprietary DB engine, Terrarium, from scratch. Day by day, we transparently share with clients and partners our ideas, the ambitions of the projects we have delivered, and our vision of what we want to achieve in the future.

Recently, the biggest enemies of AI adoption have been fake research, poor-quality solutions sold by immature companies, understatements, intellectual bias, over-hyped results and overpromised ROI. Each month we observe new self-made "benchmarks" published by many companies, prepared to convince clients to choose a specific solution because it is "state of the art". Reading such studies can be genuinely amusing - the number of ridiculous things you can find in such ebooks and offers is remarkably high.

In this text, I want to explain what really matters when you try to audit and validate a vendor you want to work with, by showing a real example of how to deconstruct false advertising claims in the field of AI-driven recommender systems.

A self-proclaimed #AI company has recently published a document purporting to benchmark their recommender systems. Due to its numerous false claims, misrepresentations and erroneous methodology, we have fact-checked the document to prevent the spread of false and potentially harmful misinformation, and turned the exercise into a guide for companies that want to validate AI vendors and their results. Let's deconstruct the falsehoods and convert them into a useful guide.



PROBLEM 1:

Using work from 6 YEARS AGO as a baseline:

RED FLAG! MISLEADING!

When comparing to prior methods, it is important to pick the most recent and most relevant prior publications. 6 years in Machine Learning is a very long time, and results from that long ago are often terrible compared to the state-of-the-art. A single GPU today can do the work of a Google-sized cluster back then.



While the paper is sound and has been published at a prestigious conference, it is cited as:

[Screenshot: the citation as it appears in the vendor's document]

This is inconsistent with any accepted citation standard, and we have to assume that it intentionally omits the 2015 publication date. By current standards, the approaches in that publication are extremely weak baselines.

PROBLEM 2:

Using the WRONG METRICS: ROOKIE MISTAKE!

R-Precision as a metric has not been used to evaluate recommender systems since 2015!

The correct metric that replaced it is Precision@K, which accounts for the fact that the user sees only the top K most relevant recommendations. Most publications evaluate recommender systems by reporting results for a few values of K, e.g. Precision@10, Precision@20, Precision@100. There are other important metrics for recommender systems as well: Recall@K, Mean Reciprocal Rank and Normalized Discounted Cumulative Gain – every one of them matters for a well-balanced recommender system. These are all missing from the vendor's presentation. But even knowing these numbers would still be useless when comparing systems on different, undisclosed datasets.
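To make these definitions concrete, here is a minimal, self-contained Python sketch of the four metrics. The function names and the toy data are ours, for illustration only; they are not taken from the vendor's document or any particular library.

```python
import math

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(recommended, relevant):
    """Reciprocal rank of the first relevant item (0 if none is found)."""
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(recommended, relevant, k):
    """Normalized Discounted Cumulative Gain with binary relevance."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(recommended[:k], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: a ranked list of product ids and the set the user actually interacted with.
recommended = ["p7", "p3", "p9", "p1", "p4"]
relevant = {"p3", "p4", "p8"}
print(precision_at_k(recommended, relevant, 5))  # 2/5 = 0.4
print(recall_at_k(recommended, relevant, 5))     # 2/3 ≈ 0.67
print(mrr(recommended, relevant))                # first hit at rank 2 -> 0.5
print(ndcg_at_k(recommended, relevant, 5))       # ≈ 0.48
```

Note how all of these metrics evaluate a ranked list against held-out interactions of the same user on the same dataset; the numbers are meaningless if the dataset changes between the systems being compared.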

PROBLEM 3:

Comparing to prior work based on UNDISCLOSED DATASETS: ABSOLUTE BULL-"YOU-KNOW-WHAT"!

When comparing ML algorithms and models, it is absolutely necessary to hold the dataset constant. The same algorithm will always give different results on different datasets. If a publication uses undisclosed datasets, it is impossible to compare any method against it once it has been published!

The referenced publication does not identify the datasets, making any subsequent replications or comparisons impossible:

[Screenshot: the relevant passage from the referenced publication]

Neither does the vendor identify their datasets, but one thing is certain: they are using different datasets than the ones used in the paper. (They must be, since the datasets used in the paper were never disclosed.)

Comparing a method on some mysterious dataset of your own to different methods on some equally mysterious datasets from an ancient paper proves only one thing: the authors definitely have no idea what they’re doing. The methodology is simply incorrect.

A dataset can be prepared to give ANY results. When the dataset is secret, and your method is the only one tested - the results can be anything and have exactly zero meaning.

There are multiple established and widely accepted public datasets for benchmarking recommender systems, e.g. Netflix Prize Dataset, MovieLens datasets, Diginetica, Yoochoose.
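As an illustration of what "holding the dataset constant" means in practice, here is a small sketch assuming the public MovieLens 100K files have been downloaded locally. The ml-100k/u.data path and the leave-one-out split are our own choices for the example, not a prescribed protocol; the point is that every compared method must be trained and tested on exactly the same split.

```python
import pandas as pd

# MovieLens 100K "u.data" is tab-separated: user_id, item_id, rating, timestamp.
ratings = pd.read_csv(
    "ml-100k/u.data",                       # assumed local path to the downloaded dataset
    sep="\t",
    names=["user_id", "item_id", "rating", "timestamp"],
)

# A reproducible, leave-one-out style split: the most recent interaction of
# each user goes to the test set, everything earlier is training data.
ratings = ratings.sort_values(["user_id", "timestamp"])
test = ratings.groupby("user_id").tail(1)
train = ratings.drop(test.index)

print(len(train), len(test))  # every compared method must use this exact split
```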

PROBLEM 4:

Claim that “last visited product” recommendation is 2x better than the state-of-the-art: INSANE BULL-"YOU-KNOW-WHAT"!

This is just insanely ridiculous. Think about it.


If showing the user his/her most recently seen products were more than 2x better than anything on the market, why would Amazon, YouTube, Alibaba, Netflix and many others invest billions of dollars in recommender research? The big players are fighting for 0.1% gains; gains of 200% would surely blow their minds! This again shows that the authors not only have no idea what they are doing, but may also be doing harm. How can such an insane claim arise? See the next point:

PROBLEM 5:

Comparing Top-K recommenders to Session-based recommenders: ABSOLUTE BULL-"YOU-KNOW-WHAT"!

The referenced whitepaper uses a setting called “top-k” recommendation – recommending items based on item sets, without any sequential or temporal information. It is an entirely different problem & research setting from “session-based” recommendation, which explicitly uses sequences and timestamps. “Last seen products” is a baseline approach in session-based recommendation, but it cannot be applied to the top-k setting. The authors confuse the two settings and compare one to the other, resulting in the ridiculous claim that “last seen products” outperforms state-of-the-art recommenders.
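To make the distinction concrete, here is a small illustrative sketch with toy data of our own: a recency baseline is trivially defined on a timestamped session, and simply undefined on an unordered item set.

```python
# Top-k setting: an unordered set of items per user, no timestamps.
# There is no "last" item here, so a recency baseline is undefined.
topk_interactions = {
    "user_1": {"p3", "p7", "p9"},
}

# Session-based setting: an ordered sequence of (timestamp, item) events.
session = [
    (1617181000, "p3"),
    (1617181042, "p9"),
    (1617181100, "p7"),
]

def last_seen_baseline(session, k=3):
    """Recommend the most recently viewed items, most recent first."""
    ordered = [item for _, item in sorted(session, key=lambda e: e[0], reverse=True)]
    seen, recs = set(), []
    for item in ordered:          # keep first occurrence only, preserving recency order
        if item not in seen:
            seen.add(item)
            recs.append(item)
    return recs[:k]

print(last_seen_baseline(session))  # ['p7', 'p9', 'p3']
```

Comparing this baseline against top-k methods that never see timestamps is comparing answers to two different questions.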


PROBLEM 6:

Suggestion that a 2015 paper evaluated somebody else’s 2021 method: MISLEADING!

The screenshot of a table from a 2015 whitepaper, combined with the phrasing:

“Fortunately, across the world there are many researches published by scientists that evaluate well known recommendation algorithms. Here’s what they found:”

[Screenshot: results table from the 2015 whitepaper]

seems to misleadingly suggest that independent scientists examined the claims of VENDOR and published them on their own. In reality, the numbers were generated by the company itself, have not been peer reviewed, and, due to the aforementioned problems (comparing apples to oranges, comparing numbers obtained on different datasets, using the wrong metrics), would not pass careful peer review.

PROBLEM 7:

Some methods presented have significantly limited applicability: MISLEADING!

Methods such as “Abandoned cart” can only be applied in a limited number of cases (only to users who have items in an abandoned shopping cart). Comparing the performance of such methods, which operate on a small subsample of the user population, with methods able to operate on all users is highly misleading.

PROBLEM 8:

Best performing methods are not ML/AI powered & do not consider client history: FALSE CLAIM!

According to the presentation, “Frequently bought after visit” is the best performing method. This method does not take the client's actual interaction history into consideration during inference. As described on slide 19, “Frequently bought after visit” recommends products “bought by other users who displayed the product being viewed”. It is a product-to-product recommendation, not a client-to-product recommendation, which makes comparison to other methods such as Collaborative Filtering very misleading.
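For illustration, here is a minimal sketch of what such a product-to-product co-occurrence baseline computes, based on our reading of the slide's description. The toy logs and function names are ours, not the vendor's; note that the current user's own history never enters the ranking.

```python
from collections import Counter, defaultdict

# Toy logs: (user_id, viewed_product) and (user_id, bought_product).
views = [("u1", "p1"), ("u2", "p1"), ("u3", "p1"), ("u3", "p2")]
purchases = [("u1", "p5"), ("u2", "p5"), ("u3", "p5"), ("u3", "p6")]

viewers = defaultdict(set)           # product -> users who viewed it
for user, product in views:
    viewers[product].add(user)

bought_by_user = defaultdict(set)    # user -> products bought
for user, product in purchases:
    bought_by_user[user].add(product)

def frequently_bought_after_visit(viewed_product, k=3):
    """Rank products by how many viewers of `viewed_product` also bought them."""
    counts = Counter()
    for user in viewers[viewed_product]:
        for product in bought_by_user[user]:
            counts[product] += 1
    return [product for product, _ in counts.most_common(k)]

print(frequently_bought_after_visit("p1"))  # ['p5', 'p6'] on this toy data
```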


The fact that a simple baseline such as “frequently bought after visit” outperforms real ML-powered approaches that consider the user’s interaction history is inconsistent with state-of-the-art research & applied engineering practice across the world. It should be considered proof of an inadequate exploration of widely available ML-powered approaches.

[Screenshot: vendor slide comparing heuristic and AI-based methods]

The slides confirm the ineffectiveness of the vendor's AI methods, clearly showing that simple heuristic methods outperform their AI-based methods! Based on the wide body of published research & accepted industry practice, one would reasonably expect AI-based methods to outperform simple heuristics.

SOLUTION:

Proper recommender evaluations:

WHAT SHOULD IT LOOK LIKE?

A proper evaluation of a recommender system should exhibit the following key characteristics:

1. Clear and detailed definitions of the datasets used. If public datasets are available (such as Netflix Prize, MovieLens, Retail Rocket for recommendations), they should be used.

2. If the paper uses unidentified, proprietary datasets, it is not enough to measure the performance of a single model on them. Prior work has to be replicated and compared on the introduced proprietary dataset as well.

3. The recommendation problem should be clearly stated (e.g. top-k recommendation or session-based recommendation), and methods solving the problem should be compared to other methods solving the same problem. Comparing anything across different problems is meaningless.

4. When evaluating methods in a practical, applied setting, one cannot compare against synthetic published benchmarks. Many factors can impact the performance of live recommender systems – the design & layout of a website, the way in which items are recommended, the placement of the recommendation section, the size of images, etc. The only proper way to evaluate production systems is via an A/B/X testing approach (see the significance-test sketch after this list).

5. References, benchmarks and comparisons should be done against recent publications. Compared solutions don’t have to be exclusively state-of-the-art, but should be up-to-date with the current state of research knowledge.

6. Proper evaluation metrics should be chosen according to best practices, or a convincing argument for a different choice has to be made.

7. The evaluation protocol, including any data pre-processing, filtering & tuning, must be described in detail. The definitions of the metrics used must also be made clear.

8. Preferably a publication containing the above characteristics should be peer-reviewed by a respected body, such as a leading conference or a journal.

9. Preferably, source code allowing others to replicate the experiments and confirm the claims should be published.

10. Don't treat your clients & partners like idiots.
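As an illustration of point 4, here is a minimal sketch of checking a live A/B test of two recommender variants for statistical significance using a two-proportion z-test on click-through rate. The traffic numbers are invented for the example; in practice the threshold and sample size should be decided before the test starts.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z statistic, two-sided p-value) for CTR of variant A vs. variant B."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    p_pool = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Variant A: existing recommender, Variant B: candidate recommender (toy numbers).
z, p = two_proportion_z_test(clicks_a=1200, views_a=50000,
                             clicks_b=1310, views_b=50000)
print(f"z = {z:.2f}, p = {p:.4f}")  # compare p against a pre-registered threshold, e.g. 0.05
```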

Thanks for your time. Sharing and refining knowledge is the best thing you can do to move the AI discipline forward.

Synerise TEAM

BTW, the link to our scientific papers is below. You are welcome to criticize, improve, and comment :)


Jerzy Opar

12 Years of Sales Experience | Startup, Technology, Marketing | Webflow Websites @ mind&matter

4y

Strong move Jaroslaw Krolewski ! This is clearly your territory and the amount of work your team has put into ML research is tremendous and impressive. It's undeniable that you guys rock in this area.

Barbara Rychalska, PhD

LLMs meet Graphs for solving RAG problems | Eridani.AI

4y

This is such a big problem. It's natural that we are excited by our own research ideas but let's put them under scrutiny, always. Otherwise we're not making any progress but in fact walking backwards and amplifying noise in the field. Negative results (experiments which failed) are incredibly valuable too! Mental note to self: show more of these :)
