Search Rankings & Recommendations

Welcome to the Data Science Growth Series hosted by PrepVector! In this series, we help you level up in your career by bringing the latest insights on data science from industry experts through events, articles, and webinars. We recently invited Sivakumar Palanivel to speak with us about large-scale machine learning systems, with a focus on Search Rankings & Recommendations.

About the Host:

Manisha Arora is a Data Scientist at Google with 10 years of experience in driving business impact through data-driven decision making. She currently leads ad measurement and experimentation for Ads across Search, YouTube, Shopping, and Display, and works with top Google advertisers to support their marketing and business objectives through data insights, machine learning, and experimentation solutions.

About the Speaker:

Sivakumar P. is a Data Scientist at Microsoft with 12 years of experience in large-scale business applications, statistical modeling, and machine learning. He has architected and developed back-office systems to automate and analyze data. Currently, Sivakumar leads NLP-based information-extraction models and search recommendations at Microsoft.

What are Large-Scale ML Systems?

Large-scale ML systems are machine learning models with millions to billions of parameters, giving them the capacity to digest massive datasets and deliver powerful predictive analytics. Building them is about much more than prediction accuracy. Because they involve huge datasets and expensive training, they must be approached from multiple perspectives: overly complex models often run into computational problems, which can be avoided by starting with a basic model and scaling it up. These systems also produce large datasets in a very short time, so processing that data effectively requires a multi-dimensional view. And if the system is global, training models for multiple languages is yet another perspective to consider.

What are the Principles Behind Designing a Large-Scale ML System?

Understanding the different requirements of batch processing and real-time processing is an important aspect of building large-scale ML systems. Which of the two applies depends on the use case, and each requires its own data stream. Another significant consideration is caching and how to retrain the model on new data: some systems must be retrained every 30 minutes, so the model has to be redeployed with fresh datasets frequently.

Let's dive deeper into Search Rankings and Personalization:

Search rankings and personalization share common characteristics; personalization can be thought of as a layer on top of search ranking. If we are building a system that only shows results for a search query, it may not require personalization, and the focus can be solely on ranking the content. If it is an e-commerce website, on the other hand, the emphasis shifts toward personalization and user behavior.

There are various approaches to designing search ranking and personalization models. Let us take a look at the traditional ones.

Content-Based Filtering:

Content-based filtering identifies similar items in the data repository. For example, when a user opens YouTube without logging in, the application shows random videos without any filters. But once the user clicks on a particular video, the YouTube algorithm filters content based on what the user watched earlier. This is user-centric content-based filtering. The system also uses item-centric content-based filtering, where the content shown to the user is selected based on the popularity of other pieces of content, such as popular videos.

Another example of content-based filtering is Stack Overflow. On any question page, the right-hand sidebar suggests other similar questions, filtered purely by content similarity. These suggestions are not personalized to the user.
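As a toy illustration of item-centric content-based filtering, the sketch below ranks items by bag-of-words cosine similarity. The question titles and IDs are made up, and Stack Overflow's actual algorithm is not public; this only shows the general idea of matching items by their content.

```python
import math
from collections import Counter

def cosine_sim(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical question titles standing in for an item repository.
items = {
    "q1": "python list sort by key",
    "q2": "sort a python dictionary by value",
    "q3": "java null pointer exception",
}
vectors = {qid: Counter(text.split()) for qid, text in items.items()}

# Rank all other items by similarity to q1.
query = "q1"
ranked = sorted(
    (qid for qid in vectors if qid != query),
    key=lambda qid: cosine_sim(vectors[query], vectors[qid]),
    reverse=True,
)
print(ranked)  # q2 shares "python", "sort", "by" with q1, so it ranks first
```

In production, raw term counts would typically be replaced by TF-IDF weights or learned embeddings, but the ranking-by-similarity structure is the same.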

Collaborative Filtering:

This approach compares users with similar behavior toward content and presents similar content across them. For instance, if you have a profile set up with minimal basic information (language, location, gender, etc.) but no activity, the system maps your profile to other users with similar profiles and studies their behavior to suggest relevant content based on their interests.
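A minimal sketch of user-based collaborative filtering. The watch histories are hypothetical, and Jaccard overlap is used as the similarity measure purely for illustration; the talk does not specify a particular similarity function.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two users' item sets, from 0 (disjoint) to 1 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical watch histories keyed by user id.
history = {
    "u1": {"video_a", "video_b", "video_c"},
    "u2": {"video_b", "video_c", "video_d"},
    "u3": {"video_x", "video_y"},
}

def recommend(user: str) -> set:
    """Suggest items the most similar other user watched but this user has not."""
    others = sorted(
        (u for u in history if u != user),
        key=lambda u: jaccard(history[user], history[u]),
        reverse=True,
    )
    neighbor = others[0]
    return history[neighbor] - history[user]

print(recommend("u1"))  # u2 is the closest neighbor, so video_d is suggested
```

Real systems compute this over millions of users with approximate nearest-neighbor search rather than a full sort, but the neighbor-then-difference logic is the core of the approach.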

Although the impact of content-based and collaborative filtering depends on the use case, the latter generally has more advantages. Most companies prefer a hybrid approach for search rankings and recommendations; Amazon, for example, uses one for product recommendations. Without a hybrid approach, users tend to receive very similar recommendations, which makes for a poor experience, whereas a hybrid approach introduces diversity. In this use case, product recommendations can be generated with content-based filtering, and recommendations based on user interest are layered on top to increase diversity.

Mapping items together is easy compared to mapping users: segmenting items into categories is a fairly straightforward process, whereas segmenting users into categories is not.

Components of Search Design:

When filtering content from a repository containing a large amount of data, we should first perform candidate generation: the process of identifying potential candidates, which removes irrelevant data and leaves us with relevant data to rank and filter into more specific content. Data flows through a funnel, narrowing from roughly 100 million items -> hundreds of items -> about a dozen items, which are finally shown to the user. This funnel-based approach is used widely for search and recommendations.

A two-tower neural network is a widely used approach for candidate identification. It creates an embedding for the user and for the content: the user's interests and the item descriptions are matched, and from that match the system identifies potential candidates. The goal is to train two neural networks, a user encoder and an item encoder, such that items whose embeddings are very similar to the user's make great recommendations. The two towers are the item encoder and the user encoder. For a user-item pair, we pass each through its tower to get a fixed-size user embedding and an item embedding of the same size, then compute the dot product of the two vectors. If this dot product is high, the item is considered a good match for the user.
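The dot-product scoring step can be sketched as follows. The embeddings here are made-up constants standing in for the outputs of trained user and item encoders; in a real two-tower system, each vector would be produced by a neural network from raw user or item features.

```python
def dot(u: list, v: list) -> float:
    """Dot product of two equal-length embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical encoder outputs: one user embedding and a small item corpus,
# all in the same 3-dimensional embedding space.
user_embedding = [0.9, 0.1, 0.3]
item_embeddings = {
    "item_a": [0.8, 0.2, 0.1],
    "item_b": [0.1, 0.9, 0.0],
}

# Score every item against the user; the highest dot product wins.
scores = {item: dot(user_embedding, vec) for item, vec in item_embeddings.items()}
best = max(scores, key=scores.get)
print(best)  # item_a points in nearly the same direction as the user vector
```

Because the item embeddings do not depend on the user, they can be precomputed and indexed, which is what makes two-tower retrieval fast at serving time.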

End-to-end Design of ML Systems:

The end-to-end approach for large scale ranking systems consists of 4 steps:

  • Retrieval: Every piece of content is converted into vectors, and the vectors are clustered; the clusters are then matched against the input query. This reduces the data from millions of items to thousands.
  • Filtering: Removes invalid data based on business logic.
  • Scoring: Adds additional features or signals to each item (e.g., counts of likes and comments).
  • Ordering: Decides which items appear on top by applying further business logic.
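The four stages above can be sketched as a toy pipeline. The catalog, field names, and rules here are all illustrative, not from any production system; the point is only how each stage progressively narrows and reorders the data.

```python
# Tiny in-memory catalog standing in for a large content repository.
catalog = [
    {"id": 1, "text": "ml ranking", "likes": 50, "active": True},
    {"id": 2, "text": "ml ranking intro", "likes": 5, "active": False},
    {"id": 3, "text": "ml ranking tips", "likes": 20, "active": True},
    {"id": 4, "text": "cooking", "likes": 99, "active": True},
]

def retrieve(query: str) -> list:
    # Stage 1: keep only items that match the query at all.
    return [c for c in catalog if query in c["text"]]

def filter_valid(candidates: list) -> list:
    # Stage 2: drop items that fail a business rule (here: inactive content).
    return [c for c in candidates if c["active"]]

def score(candidates: list) -> list:
    # Stage 3: attach a score from engagement features.
    return [{**c, "score": c["likes"]} for c in candidates]

def order(candidates: list) -> list:
    # Stage 4: final ordering, highest score first.
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

results = order(score(filter_valid(retrieve("ranking"))))
print([c["id"] for c in results])  # → [1, 3]
```

Item 4 never matches the query, item 2 is dropped by the business rule, and the survivors are ordered by score, mirroring the millions-to-dozen funnel described earlier.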

Product Quantizer:

Even with the content encoded as vectors, finding potential candidates can still be computationally expensive. To reduce this time, each vector is sliced into sub-vectors, similar sub-vectors are clustered, and the distances to the cluster centroids are pre-calculated and stored separately by vector ID. This makes it much easier to filter through large datasets and find the right elements, reducing the computational time.
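A toy product-quantization sketch. The two tiny codebooks below are hand-picked stand-ins for centroids that would normally be learned by k-means over each slice of the vector space; real quantizers use many more sub-spaces and centroids.

```python
import math

# Hypothetical codebooks: for each of the two sub-spaces, a few centroids
# assumed to have been learned ahead of time (e.g., by k-means).
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],   # centroids for dimensions 0-1
    [(0.0, 1.0), (1.0, 0.0)],   # centroids for dimensions 2-3
]

def encode(vec: list) -> tuple:
    """Slice a 4-d vector into two 2-d chunks; store each chunk's nearest centroid id."""
    codes = []
    for i, book in enumerate(codebooks):
        chunk = vec[2 * i: 2 * i + 2]
        dists = [math.dist(chunk, centroid) for centroid in book]
        codes.append(dists.index(min(dists)))
    return tuple(codes)

def approx_dist(query: list, codes: tuple) -> float:
    """Approximate distance to an encoded vector using only its centroid ids."""
    total = 0.0
    for i, code in enumerate(codes):
        chunk = query[2 * i: 2 * i + 2]
        total += math.dist(chunk, codebooks[i][code]) ** 2
    return math.sqrt(total)

codes = encode([0.9, 1.1, 0.1, 0.9])
print(codes)  # → (1, 0): first chunk is near (1, 1), second is near (0, 1)
```

Storing a couple of small integer codes per vector, plus a precomputed table of chunk-to-centroid distances, is what lets the search scan huge collections without touching the full-precision vectors.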

Scaling up Large-Scale ML Systems:

When scaling ML systems, the inference server extracts potential candidates using the candidate-generation approach. A cache and an evaluation store become important when the model has to work with fresh datasets frequently. Model exploration, used during implementation, is connected to a data lake. This is the overall scaling setup.

Metrics:

If you are dealing with ranking models over an index, NDCG (Normalized Discounted Cumulative Gain) is a recommended metric. You should also measure from a business perspective: A/B testing can be paired with business metrics, run on randomly selected power users.
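A minimal NDCG computation from graded relevance labels. The labels below are illustrative; the log-discount form used here is the standard one, with DCG normalized by the DCG of the ideal ordering so that a perfect ranking scores 1.0.

```python
import math

def dcg(relevances: list) -> float:
    """Discounted cumulative gain: gains at lower positions are discounted by log rank."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(relevances: list) -> float:
    """DCG normalized by the DCG of the ideal (best-first) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Graded relevance of the results our ranker returned, top to bottom:
# a 3 in position 3 should have been ranked above the 2 in position 2.
ranked = [3, 2, 3, 0, 1]
print(round(ndcg(ranked), 3))  # → 0.972
```

A score just below 1.0 reflects the single swap away from the ideal order, which is exactly the position-sensitive behavior that makes NDCG suitable for ranking evaluation.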

It has been an absolute pleasure to have Sivakumar Palanivel with us for this fireside chat. If you are exploring data science roles and need more information, reach out to me.

If you are looking to hone your skills in data science, PrepVector offers a comprehensive course led by experienced professionals. You will gain skills in product sense, A/B testing, machine learning, and more through a series of live coaching sessions, industry mentors, and personalized career coaching. In addition, you will compound your skills by learning with like-minded professionals and sharing your learnings with the larger community along the way.

The next cohort will kick off Oct 17, 2022. Book a free consultation to learn more!

Kudos to Sujithra Gunasekar for helping draft this article. Subscribe to this newsletter to stay tuned for more such events!
