Signals: Sustained CTR Growth 18 Months On

Times Internet

Published: September 3, 2024

Co-authored By: Ashish Jaiswal and Manish Mishra

Quick Recap: In February 2022, we began work on Signals, a publication-agnostic personalization platform for news. Within six months, we had built a simple ranking model for TOI+, our subscription product, which resulted in a minimum 100% boost in CTR. Encouraged by this success, we formally launched work on a large-scale collaborative filtering model for the larger business. By March 2023, we had implemented it on our website in our recirculation widgets, achieving an 85% boost in CTR.

A lot has happened since then.

The engineering teams put in a tremendous amount of work to improve the reliability of our infrastructure: data ingestion, low-latency data pipelines, regular data audits, server scaling, and cache tuning to serve an audience of millions. A deep dive into these topics will be covered in a separate blog post.

In this post, we’ll focus primarily on the evolution of our product and machine learning functions.

Outcomes

Compared to editorial distribution, Signals continues to sustain its CTR growth even after 1.5 years:

Web: We are seeing a 95% higher CTR on widgets and 2.5x higher CTR on push notifications.
App: We are seeing a 30–50% higher CTR on the homepage feed and push notifications.
The algorithm is successfully tapping into the archive and getting clicks on older stories. 50% of clicks from personalized push notifications are from stories older than 2 days.

The last point is critical because it demonstrates that tapping into evergreen content is a ‘distribution problem.’

Improvements to the Model

Self-Learning (and Data Drift):

One of the major reasons the CTR uplift has been sustained (and has grown in many situations) is that we’ve resisted the urge to add any business rules to the model. It is completely self-learning. Whenever we discover a new negative phenomenon, instead of adding business rules as a band-aid solution, our ML team tries to discover causality (why it is happening) in the data and solve it mathematically.

Here’s an example:

Phenomenon: Until just a couple of months ago, most of our analysis was at the day level. When we explored the data at the hour level, we realized that the model’s performance was extremely high during the morning but progressively worsened from 1 PM until 4 AM the next morning.

Caused by Data Drift: After weeks of exploratory data analysis and tests, the team established a direct causal relationship between the quality of personalization and the quantity and diversity of fresh content supply available. While our digital editorial team publishes a steady quantity of stories across multiple shifts, at 4 AM, our digital CMS (Denmark) pulls all stories from yesterday’s newspaper (from across all editions) and auto-publishes them.

Solution: The team started dynamically changing hyperparameters of the model to account for this phenomenon, which stopped the degradation of CTR in the evening.
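To illustrate why the hour-level view surfaced what the day-level view hid, here is a minimal sketch. The event tuples and all numbers are made up for illustration, and `ctr_by_hour` is a hypothetical helper, not part of the Signals pipeline.

```python
from collections import defaultdict

def ctr_by_hour(events):
    """Aggregate (hour_of_day, impressions, clicks) tuples into CTR per hour."""
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for hour, imp, clk in events:
        impressions[hour] += imp
        clicks[hour] += clk
    return {h: clicks[h] / impressions[h]
            for h in sorted(impressions) if impressions[h] > 0}

# Made-up sample: strong morning, weaker afternoon and night.
sample_events = [
    (8, 1000, 60), (8, 500, 30),
    (14, 1000, 40),
    (22, 1000, 25),
]

hourly_ctr = ctr_by_hour(sample_events)

# The single day-level number averages the decline away.
daily_ctr = (sum(clk for _, _, clk in sample_events)
             / sum(imp for _, imp, _ in sample_events))

for hour, ctr in hourly_ctr.items():
    print(f"{hour:02d}:00  CTR={ctr:.3f}")
print(f"day     CTR={daily_ctr:.3f}")
```

In this toy data the day-level CTR sits between the morning and night values, so a dashboard that only shows the daily figure would not reveal the within-day decline.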

Progressively Personalize:

All good search systems start wide and progressively narrow down to what the user sees. At each phase of narrowing, the number of data points considered increases.

At the training layer, conceptually it works like this:

100 million: Filter out records from the entire archive and clickstream history primarily using one data point - event date.
10 million: Use collaborative filtering but with only 3 data points.
1 million: Remove bias through normalization from the previous phase’s output.
100 thousand: Use content filtering with 10 data points to re-sort the previous phase’s output.
10 thousand: Use a model to remove stale stories from the previous phase’s output.

At the model serving API layer, it further optimizes while serving the API output:

1 thousand: Remove all stories read by this user and stories presented to them recently but not clicked on.
100: Run in-session adjustments to further refine what’s shown.
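The funnel above can be sketched as a chain of stages, each of which receives fewer candidates than the one before it. This is a conceptual sketch only: the field names (`age_days`, `cf_score`, `content_score`), the stage sizes, and the synthetic scores are assumptions made for illustration, and the real phases (collaborative filtering, normalization, content filtering) are replaced by simple placeholder filters and sorts.

```python
def keep_if(predicate):
    """Stage that keeps only the stories matching a predicate."""
    def stage(stories):
        return [s for s in stories if predicate(s)]
    return stage

def top_k(key, k):
    """Stage that keeps the k highest-scoring stories under `key`."""
    def stage(stories):
        return sorted(stories, key=key, reverse=True)[:k]
    return stage

def run_funnel(stories, stages):
    """Apply stages in order and record the candidate count after each."""
    sizes = [len(stories)]
    for stage in stages:
        stories = stage(stories)
        sizes.append(len(stories))
    return stories, sizes

# Synthetic candidates with deterministic, meaningless scores.
candidates = [
    {"id": i, "age_days": i % 30,
     "cf_score": (i * 37) % 101, "content_score": (i * 53) % 97}
    for i in range(1000)
]
seen_ids = {3, 7, 11}  # hypothetical stories this user already read

stages = [
    keep_if(lambda s: s["age_days"] <= 14),      # cheap filter on one data point
    top_k(lambda s: s["cf_score"], 100),         # stand-in for collaborative filtering
    top_k(lambda s: s["content_score"], 20),     # stand-in for the content re-sort
    keep_if(lambda s: s["id"] not in seen_ids),  # serving-time removal of read stories
]

final, sizes = run_funnel(candidates, stages)
print(sizes)
```

The design point is that the cheapest checks run on the largest candidate set, while the more expensive, data-hungry steps only ever see a small shortlist.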

Remove Stale Stories:

Personalization teams in news companies have a major advantage over those at social media firms.

The entire content supply at a news firm is vetted and verified by the editorial team, eliminating concerns about conspiracy theories, misinformation, etc.
Moreover, there’s no risk of the model leading users down echo chambers, as our editorial team maintains a balanced content mix.

However, there’s one big challenge: Audiences do not expect to see stale content on their feed. Identifying and removing such stories is an extremely hard problem to solve.

Stale doesn’t necessarily mean old. Stories can be old, but their information value can remain relevant.

Stories like ‘who won the toss before a cricket match begins’ become obsolete within as little as 1 hour.
Daily stock market updates become stale in 12 hours.
In some cases, the information isn’t stale, but the way the story is written makes it appear stale.

Our first attempt at solving this problem was to train a binary classifier. However, the model occasionally produced false positives, leading to user complaints about stale stories.

Next, we started forecasting each story’s CTR based on its prior day’s performance, which greatly reduced the number of stale stories being recommended.
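As a rough illustration of the forecasting idea, the sketch below extrapolates a story's CTR from the trend in its prior-day hourly CTR and flags it as stale if the forecast falls below a floor. The trend formula, the `floor` value, and the sample numbers are all assumptions made for this example; the post does not describe the actual forecasting model.

```python
def forecast_ctr(prior_day_hourly_ctr):
    """Naive forecast: scale the late-day CTR by the late/early trend ratio."""
    n = len(prior_day_hourly_ctr)
    if n < 2:
        raise ValueError("need at least two hourly observations")
    half = n // 2
    early = sum(prior_day_hourly_ctr[:half]) / half
    late = sum(prior_day_hourly_ctr[half:]) / (n - half)
    if early == 0:
        return late  # no early signal to derive a trend from
    return late * (late / early)

def is_stale(prior_day_hourly_ctr, floor=0.01):
    """Flag a story whose forecast CTR drops below a hypothetical floor."""
    return forecast_ctr(prior_day_hourly_ctr) < floor

# Made-up CTR series: a toss report that collapses, and a steady explainer.
toss_report = [0.08, 0.06, 0.02, 0.01]
explainer = [0.03, 0.03, 0.03, 0.03]

print(is_stale(toss_report), is_stale(explainer))
```

Unlike a binary classifier on story features, this kind of check relies on how readers actually responded, so an old story that still earns clicks is not penalized for its age.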

Forgetting:

In civil law, there is a concept of the statute of limitations, a period within which legal proceedings may be initiated. Similarly, in personalization, it is important for the model to forget user preferences after a certain period. This keeps recommendations adaptive to recent trends and interests.
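One common way to implement forgetting is to weight each past click by an exponential time decay and drop anything older than a hard cutoff. The sketch below shows that approach; the half-life, the cutoff, and the click data are hypothetical values chosen for illustration, and the post does not say which mechanism Signals uses.

```python
def topic_affinity(clicks, now_day, half_life_days=14.0, max_age_days=90):
    """Score topics from (topic, day) clicks, forgetting old behaviour.

    Each click's weight halves every `half_life_days`; clicks older than
    `max_age_days` are ignored entirely (the 'statute of limitations').
    """
    scores = {}
    for topic, day in clicks:
        age = now_day - day
        if age < 0 or age > max_age_days:
            continue
        weight = 0.5 ** (age / half_life_days)
        scores[topic] = scores.get(topic, 0.0) + weight
    return scores

# Made-up click history, expressed in day numbers.
history = [
    ("cricket", 0), ("cricket", 10),   # heavy interest long ago
    ("politics", 5),                   # past the cutoff on day 100
    ("markets", 95), ("markets", 100), # recent interest
]

affinity = topic_affinity(history, now_day=100)
print(affinity)
```

With these numbers the recent interest in markets outweighs the older interest in cricket, and the single politics click is forgotten completely.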

Change in Perspective

A year and a half ago, we thought of personalization as a layer sitting on top of everything else. Now we increasingly see personalization impacting every aspect of the business. Here are three examples:

Mitigate ‘Concept Drift’ with the Product Team:

Concept Drift occurs when the relationship between input and output starts changing. This can happen if the input data collected isn’t representative of reality (i.e., what users are experiencing) or if its definition has changed. Such drifts would typically bypass most data audit systems because the data’s valid values remain the same. Eventually, this makes that particular data point no longer reliable or fully accurate.

For example, let’s say there is a data point that is true if a user sees a paywall on a story. If, after a feature release, users still see the paywall but this data point is now false, it results in silent concept drift.

So far, there doesn’t seem to be an automated way to catch this. Hence, just as teams seek SEO and performance approval before launching a feature on the browser, we will need to introduce a review process for data concept checks.
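The paywall example can be made concrete with a small sketch showing why a value-level audit misses this kind of drift. The field names `paywall_shown` and `user_saw_paywall` are hypothetical, and `user_saw_paywall` stands for ground truth that a production pipeline normally does not have; in practice it would come from something like manual QA during a review.

```python
def passes_value_audit(records):
    """Typical audit: is every value of the flag a valid boolean?"""
    return all(isinstance(r["paywall_shown"], bool) for r in records)

def agreement_with_reality(records):
    """Share of records where the logged flag matches what the user saw."""
    matches = sum(r["paywall_shown"] == r["user_saw_paywall"] for r in records)
    return matches / len(records)

# Before the release: the flag tracks reality.
before = [
    {"paywall_shown": True, "user_saw_paywall": True},
    {"paywall_shown": False, "user_saw_paywall": False},
] * 5

# After the release: users still hit the paywall, but the flag is always False.
after = [
    {"paywall_shown": False, "user_saw_paywall": True},
    {"paywall_shown": False, "user_saw_paywall": False},
] * 5

print(passes_value_audit(before), passes_value_audit(after))
print(agreement_with_reality(before), agreement_with_reality(after))
```

Both datasets pass the value audit, yet the flag's meaning has changed after the release. Only the comparison against observed reality exposes the problem, which is why a human review step is proposed above.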

Mitigate ‘Bias’ with the Design Team:

There’s a reason why most algorithmically distributed internet products serve content in simple, straightforward feeds. This is true for Google, LinkedIn, X, Facebook, Instagram, Google News, News Inshorts, etc. This is because:

The user experience adds bias to clickstream data, which can negatively impact a model's performance. Hence, it is critical for models to normalize and de-bias the input data. However, this normalization step requires input data, which is only possible to collect reliably if the UX is a simple feed.

Additionally, all personalization is essentially a search and sort problem. Therefore, you won’t find ‘widgets’ or ‘sections’ in the middle of the feed, as widgets distort the sort order of the algorithm.

To ensure this, we need to work closely with the platform design teams to drive decisions.
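One widely used way to de-bias clickstream data for position is "clicks over expected clicks" (COEC): divide a story's clicks by the clicks an average story would have received at the same positions. The sketch below illustrates that general technique on made-up data; it is not a description of the normalization Signals actually applies.

```python
from collections import defaultdict

def position_ctr(log):
    """Average CTR per feed position from (story, position, clicked) rows."""
    impressions, clicks = defaultdict(int), defaultdict(int)
    for _, position, clicked in log:
        impressions[position] += 1
        clicks[position] += int(clicked)
    return {p: clicks[p] / impressions[p] for p in impressions}

def coec(log):
    """Clicks over expected clicks per story, given each position's baseline."""
    baseline = position_ctr(log)
    clicks, expected = defaultdict(int), defaultdict(float)
    for story, position, clicked in log:
        clicks[story] += int(clicked)
        expected[story] += baseline[position]
    return {s: clicks[s] / expected[s] for s in clicks if expected[s] > 0}

def rows(story, position, impressions, clicks):
    """Build synthetic log rows: the first `clicks` impressions are clicked."""
    return [(story, position, i < clicks) for i in range(impressions)]

# Made-up log: A and C appear at the top slot, B and D far down the feed.
log = (rows("A", 1, 10, 2) + rows("C", 1, 10, 4)
       + rows("B", 5, 10, 1) + rows("D", 5, 10, 0))

scores = coec(log)
print(scores)
```

Story B has a lower raw CTR than story A (10% versus 20%), but once position is accounted for it outperforms A. In a simple feed, position is a single clean variable; widgets and sections add layout effects that are much harder to measure and correct for.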

Factor in ‘Cost of Revenue’ in Distribution With The Business Team:

The Times of India digital operates multiple revenue models — direct ads, indirect ads, sponsored content, subscription, affiliate, microtransactions, etc. — on the same real estate (website or app) targeting the same users. Previously, what was shown to each user was governed by business rules. For example, story numbers 6 and 10 in the feed would be affiliate stories. However, data shows that such rules drastically reduce engagement levels.

We’ve come to realize that our ‘recommender systems’ need not just to personalize but also to ‘maximize revenue’ while balancing the trade-offs of these competing revenue models.

The closest analogy we found is from the Mutual Fund industry. Depending on each user’s risk appetite, a mutual fund builds a portfolio of investments across various asset classes — gold, real estate, equity, etc. Similarly, our recommender models need to compute the risk/reward from each revenue model for a particular user to maximize revenue while maintaining engagement.

However, this again requires working closely with business teams to ensure their targets are met, regardless of how exposure is managed by Signals.
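To make the portfolio analogy concrete, here is a toy sketch that picks the mix of story types for a few feed slots by maximizing expected revenue subject to an engagement floor. The revenue and engagement figures, the floor values, and the brute-force search are all assumptions made for illustration; a production system would estimate these quantities per user and would need a more scalable optimizer.

```python
from itertools import combinations_with_replacement

def best_mix(options, slots, min_engagement):
    """Return the slot mix with the highest expected revenue whose mean
    expected engagement stays at or above `min_engagement`.

    `options` maps a story type to (expected_revenue, expected_engagement).
    Returns None if no mix satisfies the floor.
    """
    best, best_revenue = None, float("-inf")
    for mix in combinations_with_replacement(sorted(options), slots):
        revenue = sum(options[t][0] for t in mix)
        engagement = sum(options[t][1] for t in mix) / slots
        if engagement >= min_engagement and revenue > best_revenue:
            best, best_revenue = mix, revenue
    return best

# Hypothetical per-user estimates: (expected revenue, expected engagement).
user_options = {
    "editorial": (0.1, 0.9),
    "affiliate": (0.5, 0.4),
    "subscription": (0.8, 0.2),
}

cautious_mix = best_mix(user_options, slots=3, min_engagement=0.6)
aggressive_mix = best_mix(user_options, slots=3, min_engagement=0.45)
print(cautious_mix, aggressive_mix)
```

Raising or lowering the engagement floor plays the role of the risk appetite in the mutual fund analogy: a stricter floor yields a mix dominated by editorial stories, while a looser one admits more revenue-bearing slots.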

And before we sign off, the customary hiring plug.

There is a lot of hype about AI in the market. However, there are few jobs where companies are investing in building in-house models that scale cost-effectively and handle subtle nuances. More importantly, you’ll get to work under the tutelage of an extremely senior data science and data pipeline team that has decades of experience putting such products into production.

We are looking to hire Java and Android developers.

Additionally, we are considering hiring an AI Training Editor. Personalization is inherently an editorial product. It involves editorial judgment regarding what should and shouldn’t be shown to users. In this regard, there is much to learn from editorial leads about how they make editorial judgments. Deep conversations with them can lead to better understanding and formulation of nuances. Additionally, manually labeled datasets can help with rolling out algorithms. Yes, this is tedious work, but you’ll get to observe how AI works firsthand.

Join Us and #TakeUsToTheNextLevel: https://timesinternet.in/careers

Credits:

Algorithm: Dharmendra Mahajani, Honey Bansal, Karn Bhushan, Manish Mishra
Backend: Amit Chouhan, Kulbhushan Pandey, Prafulla Gupta
Design: Jay Kedia, Rahul Das
Frontend: Ankit Jindal, Arpit Toshniwal, Rahul Tokas, Subhashish Singh
Product: Ritvvij Parrikh
