System Design Study: Twitter's Recommendation Algorithm

Vivek Bansal

Senior Software Engineer at Uber | Ex-Grab | Ex-Directi

Published: June 30, 2024

If you like the free content I put out, consider subscribing to my newsletter on substack to get a well-researched article every week delivered straight to your inbox.

In 2023, Twitter published a blog post explaining how it builds a user’s timeline. This article summarizes that post and covers its key ideas. It should help you understand how social media timelines are designed, and you can use it as a reference if you are working on a similar product or preparing for system design interviews.

The recommendation pipeline consists of four main stages:

Candidate Sourcing: Find the best tweets to show to the end user.
Ranking: Rank the tweets in the best order, personalized for the end user.
Heuristics and Filtering: Apply heuristics and filters to prepare a well-balanced feed that is personalized for the end user.
Mixing and Serving: Mix the tweets with non-tweet content and then display them to the end user.

Let’s take a deep dive into each of these points to understand better.

1. Candidate Sourcing

Twitter selects the best 1500 tweets for the user’s timeline from a pool of hundreds of millions of tweets. These tweets are split in a 50:50 ratio between In-Network tweets and Out-of-Network tweets. In-Network tweets come from people you follow, and Out-of-Network tweets come from people you don’t follow. Let’s discuss one specific problem related to each of them.

In-network tweets: The main problem is how to rank all in-network tweets so that the timeline is most relevant to you.

Twitter ranks all the in-network tweets using a machine-learning model called Real Graph, which predicts the likelihood of engagement between two users. The higher the Real Graph score between you and the author of a tweet, the more of their tweets Twitter will include.
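To make the idea concrete, here is a minimal sketch of ordering in-network candidates by a viewer-to-author engagement score. The function name, the tweet dictionaries, and the `real_graph_scores` mapping are hypothetical stand-ins for illustration; the real model and its features are not described in this article.

```python
def rank_in_network(tweets, real_graph_scores):
    """Order in-network tweets by the viewer's engagement score with each author.

    `real_graph_scores` is a hypothetical mapping of author -> predicted
    likelihood of engagement between the viewer and that author.
    Authors without a score are treated as 0.0.
    """
    return sorted(
        tweets,
        key=lambda tweet: real_graph_scores.get(tweet["author"], 0.0),
        reverse=True,  # highest predicted engagement first
    )


tweets = [
    {"id": 1, "author": "a"},
    {"id": 2, "author": "b"},
    {"id": 3, "author": "a"},
]
scores = {"a": 0.2, "b": 0.9}
ranked_ids = [t["id"] for t in rank_in_network(tweets, scores)]
```

Because Python's sort is stable, tweets from the same author keep their original relative order.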

Out-of-network tweets: The main problem is how to tell if a certain tweet is relevant to you if you don’t follow the author.

To be honest, this is a trickier problem and involves a lot of prediction-based algorithms. Twitter solves it using two approaches:

Social Graph: Twitter developed GraphJet, a graph processing engine that maintains a real-time interaction graph between users and tweets. Twitter’s algorithm traverses this graph of engagements to find which tweets the people you follow recently engaged with.
Embedding Spaces: This more general approach computes the content similarity between other people's tweets and your interests.
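Both approaches can be sketched in a few lines. This is a toy illustration under stated assumptions, not GraphJet or Twitter's embedding models: the `follows` and `engagements` dictionaries, the function names, and the use of cosine similarity as the similarity measure are all assumptions made for the example.

```python
import math
from collections import Counter


def tweets_engaged_by_followees(follows, engagements, user):
    """Count, per tweet, how many accounts that `user` follows engaged with it.

    follows:     hypothetical mapping user -> iterable of followed accounts
    engagements: hypothetical mapping account -> set of tweet ids engaged with
    """
    counts = Counter()
    for followee in follows.get(user, ()):
        for tweet_id in engagements.get(followee, ()):
            counts[tweet_id] += 1
    # Most-engaged tweets first; ties broken by tweet id for determinism.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def cosine_similarity(a, b):
    """Similarity between two embedding vectors (e.g. user interests vs. tweet)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


follows = {"me": ["alice", "bob"]}
engagements = {"alice": {"t1", "t2"}, "bob": {"t2"}, "carol": {"t3"}}
candidates = tweets_engaged_by_followees(follows, engagements, "me")
```

Here `t2` comes first because two followed accounts engaged with it, while `t3` never appears because nobody the user follows touched it.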

Overall, a blend of In-Network tweets and out-of-network tweets is prepared and the best 1500 tweets are collected.

2. Ranking

At this point, candidate sourcing is complete and the 1500 tweets are treated equally, regardless of which source they came from. They are now ranked using a ~48M parameter neural network that is continuously trained on real-time tweet interactions to optimize for positive engagement, such as likes, replies, and retweets.

After doing all the required processing, the ranking engine outputs ten labels for each tweet, where each label represents the probability of an engagement.

As per the source code on Twitter’s GitHub, the ten labels produced are:

scored_tweets_model_weight_fav: The probability the user will favorite the Tweet.
scored_tweets_model_weight_retweet: The probability the user will Retweet the Tweet.
scored_tweets_model_weight_reply: The probability the user replies to the Tweet.
scored_tweets_model_weight_good_profile_click: The probability the user opens the Tweet author profile and Likes or replies to a Tweet.
scored_tweets_model_weight_video_playback50: The probability (for a video Tweet) that the user will watch at least half of the video.
scored_tweets_model_weight_reply_engaged_by_author: The probability the user replies to the Tweet and this reply is engaged by the Tweet author.
scored_tweets_model_weight_good_click: The probability the user will click into the conversation of this Tweet and reply or Like a Tweet.
scored_tweets_model_weight_good_click_v2: The probability the user will click into the conversation of this Tweet and stay there for at least 2 minutes.
scored_tweets_model_weight_negative_feedback_v2: The probability the user will react negatively (requesting "show less often" on the Tweet or author, block or mute the Tweet author).
scored_tweets_model_weight_report: The probability the user will click Report Tweet.

Each label has an associated weight:

scored_tweets_model_weight_fav: 0.5
scored_tweets_model_weight_retweet: 1.0
scored_tweets_model_weight_reply: 13.5
scored_tweets_model_weight_good_profile_click: 12.0
scored_tweets_model_weight_video_playback50: 0.005
scored_tweets_model_weight_reply_engaged_by_author: 75.0
scored_tweets_model_weight_good_click: 11.0
scored_tweets_model_weight_good_click_v2: 10.0
scored_tweets_model_weight_negative_feedback_v2: -74.0
scored_tweets_model_weight_report: -369.0

Then, a final score is calculated using the following formula:

score = sum_i { (weight of engagement i)*(probability of engagement i) }

The higher the overall score, the higher the rank of the tweet in the overall timeline.
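The weighted sum is easy to sketch in code. The weights below are the ones listed in this article, with the `scored_tweets_model_weight_` prefix dropped for brevity; the `tweet_score` function and the sample probabilities are hypothetical and only illustrate the arithmetic.

```python
# Weights as listed above (prefix "scored_tweets_model_weight_" omitted).
WEIGHTS = {
    "fav": 0.5,
    "retweet": 1.0,
    "reply": 13.5,
    "good_profile_click": 12.0,
    "video_playback50": 0.005,
    "reply_engaged_by_author": 75.0,
    "good_click": 11.0,
    "good_click_v2": 10.0,
    "negative_feedback_v2": -74.0,
    "report": -369.0,
}


def tweet_score(probabilities):
    """Weighted sum of predicted engagement probabilities for one tweet.

    Labels missing from `probabilities` contribute nothing to the score.
    """
    return sum(WEIGHTS[label] * p for label, p in probabilities.items())


# Made-up probabilities for two candidate tweets.
liked_and_replied = tweet_score({"fav": 0.2, "reply": 0.1})    # 0.1 + 1.35
likely_reported = tweet_score({"fav": 0.2, "report": 0.01})    # 0.1 - 3.69
```

Even a 1% chance of a report outweighs a 20% chance of a like, which shows how strongly the negative weights push a tweet down the ranking.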

3. Heuristics and Filtering

This stage consists of applying different kinds of heuristics and filters such as:

Filter tweets based on your preferences and do not show any tweets from users that you have blocked or muted.
Avoid too many consecutive tweets from a single author
Ensure a fair balance of in-network and out-of-network tweets
Lower the score of a particular tweet if the user has provided negative feedback on the same tweet.
Apply quality safeguards, such as including only those out-of-network tweets that people you follow have engaged with, or whose author is followed by people you follow.

These are some examples that make sure a well-balanced feed is delivered to the end user and make the user experience of scrolling through their timeline as relevant as possible.
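Two of these rules, hiding blocked or muted authors and limiting consecutive tweets from one author, can be sketched as a single pass over the ranked list. This is a simplified illustration: the `filter_feed` name, the `max_run` parameter, and the choice to drop excess tweets outright are assumptions, and Twitter's actual heuristics may work differently.

```python
def filter_feed(ranked_tweets, hidden_authors, max_run=2):
    """Drop tweets from blocked/muted authors and cap consecutive tweets per author.

    ranked_tweets:  tweets already sorted by score, each a dict with an "author" key
    hidden_authors: set of authors the viewer has blocked or muted
    max_run:        hypothetical cap on back-to-back tweets from one author
    """
    feed = []
    run_author, run_length = None, 0
    for tweet in ranked_tweets:
        author = tweet["author"]
        if author in hidden_authors:
            continue  # never show blocked or muted authors
        if author == run_author:
            if run_length >= max_run:
                continue  # would exceed the allowed run, so skip this tweet
            run_length += 1
        else:
            run_author, run_length = author, 1
        feed.append(tweet)
    return feed


ranked = [
    {"id": 1, "author": "a"},
    {"id": 2, "author": "a"},
    {"id": 3, "author": "a"},
    {"id": 4, "author": "b"},
    {"id": 5, "author": "x"},
    {"id": 6, "author": "a"},
]
feed_ids = [t["id"] for t in filter_feed(ranked, hidden_authors={"x"})]
```

Tweet 3 is dropped because it would be a third consecutive tweet from author `a`, and tweet 5 is dropped because its author is hidden.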

4. Mixing and Serving

As the last step in the whole process, the Home Mixer service (which is responsible for constructing and serving the For You timeline) mixes the tweets with non-tweet content like Ads, Follow Recommendations, and Onboarding prompts.

As per Twitter’s blog, this pipeline runs approximately 5 billion times per day and completes in under 1.5 seconds on average. This shows the massive amount of data processing it takes to construct a single user’s timeline.
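A quick back-of-the-envelope calculation puts those two numbers in perspective. It assumes requests are spread evenly over the day (real traffic is not) and treats 1.5 seconds as the average latency, so the results are rough upper-bound estimates rather than figures from Twitter.

```python
RUNS_PER_DAY = 5_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400
AVG_LATENCY_SECONDS = 1.5               # "under 1.5 seconds", used as an upper bound

# Average request rate if traffic were perfectly uniform.
runs_per_second = RUNS_PER_DAY / SECONDS_PER_DAY

# Little's law: requests in flight = arrival rate * time spent in the system.
in_flight = runs_per_second * AVG_LATENCY_SECONDS

print(f"{runs_per_second:,.0f} timeline requests per second")
print(f"{in_flight:,.0f} requests in flight at any moment")
```

That works out to roughly 58,000 timeline requests per second, with on the order of 87,000 being processed concurrently.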

That’s it, folks, for this edition of the newsletter. In future editions, I’ll try to cover the algorithms of other popular social media platforms to give a holistic picture of how a social media timeline is built.

Please consider liking and sharing with your friends as it motivates me to bring you good content for free. If you think I am doing a decent job, share this article in a nice summary with your network. Connect with me on Linkedin or Twitter for more technical posts in the future!

Book exclusive 1:1 with me here.

Thanks for reading Curious Engineer! Subscribe for free to receive new posts and support my work.

Resources

Twitter’s Recommendation Algorithm

Twitter’s Algorithm Source Code
