Essentials of vector embeddings for: Recommendation Engines


Have you ever wondered how Amazon recommends shirts or trousers based on your search history? Or perhaps you have listened to recommended songs on your favorite music app, which keeps track of the artists and albums you follow the most? Most SaaS (Software as a Service) applications today have such built-in engines to better understand user interactions, improve customer service, and enrich personalization.

The secret sauce behind them, which you might have guessed from the title, is something called vector embeddings. In my opinion, they have truly revolutionized the way we discover, explore, and connect with the world around us. These ingenious mathematical representations unlock hidden patterns within vast datasets, transforming mere words, images, or items into dense, meaningful vectors that capture the essence of our preferences and desires.

But, what does that even mean? Why was there any need to create them in the first place?

This article captures the essential characteristics of vector-space embeddings, the mathematics behind them, and how they work in practice using Google Cloud's Vertex AI Matching Engine.


Embeddings? What's that?

Embeddings are a way of representing data (almost any kind of data: text, images, videos, users, music, whatever) as points in space, where the locations of those points are semantically meaningful.

One of the best ways to understand this concept is through an example: a Google-developed technique whereby words are mapped into vector space to create what are called word embeddings. More technically, the technique is referred to as Word2Vec (Word to Vector).

They look something like this:

[Image: words represented as vectors V1, V2 and V3]

Here V1, V2, and V3 are the mathematical equivalents of words in text space.

Here's a more popular, and perhaps the most intuitive, visual for Word2Vec:

[Image: Word2Vec embeddings clustering related words together]

With Word2Vec, similar words cluster together in space, so the vectors/points representing "king" and "queen", or "man" and "woman", will all sit nearby.

So now you can probably appreciate how this is useful: embeddings allow us to cluster similar data points together in vector space. Hence, we could build a function that takes, say, "apple" as input and returns the closest, most similar data items, like "orange" or "mango", which are clustered together in n-dimensional space.

Suppose we embedded the lyrics of songs; then we could build a method that takes the semantics of a song as input and gives us the n most similar songs as output. That would be really cool, right?

Guess what, we shall be doing exactly that. But before jumping into the code, let's analyze how embeddings work.


But, what exactly do we mean by "clustering similar data"?

In simple terms, it means computing a similarity score between embedded data points; that is, "how similar is this shirt to that one?". One way to compute this score is to calculate the distance between two embedded points in space, the Euclidean distance, and say that:

"the closer they are, the more similar they are"

This can also be done using the cosine distance, via the vector dot product, i.e.,

cos(angular distance) = (V1 · V2) / (|V1| * |V2|)        

where |Vi| is the magnitude of a given vector Vi.
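
To make this concrete, here is a minimal NumPy sketch of both measures; the three toy vectors are made up purely for illustration:

import numpy as np

# Toy 3-dimensional "embeddings" (made up for illustration only)
apple  = np.array([0.9, 0.1, 0.3])
mango  = np.array([0.8, 0.2, 0.4])
guitar = np.array([0.1, 0.9, 0.7])

def euclidean_distance(v1, v2):
    # Smaller distance means "more similar"
    return np.linalg.norm(v1 - v2)

def cosine_similarity(v1, v2):
    # Closer to 1 means "more similar"
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(euclidean_distance(apple, mango), euclidean_distance(apple, guitar))
print(cosine_similarity(apple, mango), cosine_similarity(apple, guitar))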



Furthermore, similarity scores are useful for applications like duplicate detection and facial recognition.

To implement facial recognition, for example, you might embed pictures of people’s faces, then determine that if two pictures have a high enough similarity score, they’re of the same person. Or, if you were to embed all the pictures on your cell phone camera and found photos that were very nearby in embedding space, you could conclude those points were likely near-duplicate photos.


That said, in reality the mathematics behind these engines is a little more complicated than simply calculating the distance between two vector points.

In 2013, Tomas Mikolov et al. published two seminal papers proposing two different model architectures for word embeddings:

  • the continuous bag-of-words (CBOW) model, and
  • the skip-gram model.

In this article, we shall focus on the famous skip-gram architecture, which underlies many query-based engines.

Let's take an array containing a sequence of words:

Sequence = { W0, W1, W2, ..., Wj-1, Wj }        

For a word 'Wk', the context can be obtained from its left and right neighborhoods within a window of size m.

That is:

left_context  = { Wk-m, ..., Wk-1 }
right_context = { Wk+1, ..., Wk+m }        

That is, if the following represents the sequence:

[Image: example word sequence]

then:

[Image: left and right context windows around the central word (credits: https://towardsdatascience.com/)]

Each word 'w' is assigned a vector representation 'v', and the probability that a word Wo appears in the context of a word Wi is defined as:

[Image: skip-gram probability formula (credits: https://towardsdatascience.com/)]
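
Since the original image is not reproduced here, for reference this is the standard skip-gram softmax from the Word2Vec papers, written in the same plain-text style as the cosine formula above; Vo and Vi are the vector representations of Wo and Wi, and the sum runs over every word w in the vocabulary W (the full formulation keeps separate "input" and "output" vectors per word, which is omitted here for brevity):

P(Wo | Wi) = exp(Vo · Vi) / Σ_{w ∈ W} exp(Vw · Vi)        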

The primary objective of the skip-gram model is to predict the context of central words; hence its training involves computing a set of vectors V that maximizes the objective function, the average log-probability of the observed context words.

However, as the size of the sequence grows to, say, thousands of words, it quickly becomes computationally inefficient to compute these probabilities for every word, since each softmax runs over the whole vocabulary. This makes optimization necessary for such a model to work at scale.

Therefore, it is better to discard words with a probability tied to their frequency in the text: the more frequent a word is, the more likely it is to be dropped. This prunes very common words (like "the" or "a"), reducing the amount of data to be processed. This technique is called word subsampling. The details of the technique are beyond the scope of this article; for that, you can visit here.
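
Concretely, the rule from the original Word2Vec paper is simple enough to state: each occurrence of a word w is discarded with probability

P(discard w) = 1 - sqrt(t / f(w))        

where f(w) is the relative frequency of w in the corpus and t is a small threshold (around 10^-5 in the paper), so the most frequent words are dropped most often.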


Up to this point, we have covered the purpose of building scalable embeddings and how they work under the hood, but we still haven't seen an implementation.

After some exploration, I found that there are plenty of free, pre-trained text embedding models, and one of the most popular is the Universal Sentence Encoder. It can be downloaded from the TensorFlow Hub model repository.

The code for using a text embedding model from the repository is pretty straightforward; this snippet is taken from their website:

import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed([
    "The quick brown fox jumps over the lazy dog.",
    "I am a sentence for which I would like to get its embedding"])
print(embeddings)        
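
The result is one 512-dimensional vector per sentence, so the similarity scores discussed earlier drop straight in; a small follow-up sketch, continuing from the snippet above:

import numpy as np

# Cosine similarity between the two sentences embedded above
vecs = embeddings.numpy()
similarity = np.dot(vecs[0], vecs[1]) / (np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1]))
print(similarity)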

OpenAI's CLIP model can also be used; it takes text and images as input and maps both data types into the same embedding space.

Additionally, we can also train our own model instead of relying on pre-trained ones. TensorFlow Recommenders allows you to do exactly that.


Notice that so far in our discussion, we have been comparing vectors one by one against a large pool of vectors. Instead, you can use something called ANN, or Approximate Nearest Neighbor, search. Simply put, an algorithm that doesn't guarantee returning the actual nearest neighbor in every case, but instead finds "approximate" ones, can dramatically improve query speed and the cost of searching the vector space, and that is exactly what ANN does.

They use something called Vector Quantization (VQ).

Vector..what?


Let's say each picture is represented by a set of numbers, like a list of features. These features could be things like the color of the animal, its size, and other characteristics. So for each picture, you have a vector of numbers that describes it.

Something like this:

[Image: example feature vectors describing pictures]

Now, vector quantization is a method that takes these vectors and tries to find a small number of representative vectors called centroids. Think of these centroids as special pictures that represent a group of similar pictures.

To group the pictures, we look at each picture's vector and find the centroid that is closest to it. We consider the centroid to be the best representative for that picture. So instead of comparing a picture to all other pictures, we only compare it to a few centroids.

This process helps us organize the pictures into different groups based on their similarities. Pictures with similar features will have vectors that are closer to the same centroid. It's like putting pictures of cats in one group, pictures of dogs in another group, and so on.

This VQ technique dramatically enhances query speed and is an essential part of many ANN algorithms, just as indexing is an essential part of relational databases and full-text search engines.
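
As a minimal sketch of the idea, here is vector quantization with scikit-learn's KMeans on random, purely illustrative data; the 16 centroids play the role of the codewords described above:

import numpy as np
from sklearn.cluster import KMeans

# 1,000 illustrative 64-dimensional "picture" vectors
vectors = np.random.rand(1000, 64)

# Learn 16 representative centroids (the "codewords")
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(vectors)
group_ids = kmeans.predict(vectors)

# A query is first compared against the 16 centroids only,
# then against the vectors inside the best-matching group
query = np.random.rand(1, 64)
nearest_group = kmeans.predict(query)[0]
candidates = vectors[group_ids == nearest_group]
print(candidates.shape)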

To better understand ANN, think of the vectors as organized into groups:

[Image: vector space partitioned into groups around codewords (taken from Wikipedia)]

In the figure above, the codewords represent the centroids of the groups. It is also clear that as the number of groups searched increases, the speed of the search decreases while the accuracy increases.

Managing this trade-off — getting higher accuracy at shorter latency — has been a key challenge with ANN algorithms.

To tackle such challenges, Google introduced ScaNN, which uses a relatively new VQ algorithm called Anisotropic Vector Quantization.


What is Anisotropy now?

Anisotropy refers to the property of data where the measurements or characteristics vary in different directions. In other words, the data is not equally distributed or uniform in all directions. For example, in an image, certain features or patterns may be more pronounced or have different variations along specific directions.

The ScaNN algorithm takes the anisotropy of the data into account by weighting the quantization error differently in different directions. This allows for a more accurate representation and grouping of data that exhibits anisotropic properties.

The specific implementation of anisotropic vector quantization depends on the data and the application. It may involve using different distance measures, adjusting the weighting or scaling of different dimensions, or applying dimension-specific transformations to align the data.

This is the magic ingredient in the user experience you feel when you are using Google Image Search, YouTube, Google Play, and many other services that rely on recommendations and search. In short, Google's ANN technology enables users to find valuable information in milliseconds, in the vast sea of web content.

So, why should we even bother learning about ScaNN?


Well, Google's well-established Vertex AI Matching Engine is powered by a ScaNN backend, which is responsible for fast and scalable vector search across many of their services, and it recently became GA (generally available) and ready for production use.

To truly understand the idea, I created a small project which demonstrates a use case for the Matching Engine.

  1. The idea is to fetch the lyrics of songs from a particular album using the Spotify API and the Genius API, and then store them in cloud-based storage; here I have used Firestore as a NoSQL document database.
  2. After that, take voice input using Google Chirp, a foundation model trained for Automatic Speech Recognition tasks, which converts the audio input to text.
  3. Lastly, create vector embeddings for the lyrics text and the input text, such that the closer the input text is to the lyrics of a particular song, the more similar they are, and hence the higher the probability that the input audio is from that song.

Diagrammatically:

[Image: Project Architecture]

To fetch the lyrics, we first need access token credentials from Spotify, which can be obtained here.

Thereafter, we also need an access token from Genius to authorize the API calls made to it. The official docs for that can be found here.

Here's the code I wrote to fetch the lyrics.


import pandas as pd
import spotipy
import lyricsgenius
import requests

from spotipy.oauth2 import SpotifyClientCredentials
from bs4 import BeautifulSoup

# Spotify API credentials
cid = 'SPOTIFY_CLIENT_ID'
secret = 'SPOTIFY_CLIENT_SECRET'

client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

# Genius client (lyricsgenius expects an access token)
genius_access_token = 'YOUR_GENIUS_ACCESS_TOKEN'
genius = lyricsgenius.Genius(genius_access_token)


album_name = 'ALBUM_NAME'
artist_name = 'ARTIST_NAME'

results = sp.search(q=f'album:{album_name} artist:{artist_name}', type='album', limit=1)

def get_album_tracks(uri_info):
    uri = []
    track = []
    duration = []
    explicit = []
    track_number = []
    one = sp.album_tracks(uri_info, limit=50, offset=0, market='US')
    df1 = pd.DataFrame(one)

    for i, x in df1['items'].items():
        uri.append(x['uri'])
        track.append(x['name'])
        duration.append(x['duration_ms'])
        explicit.append(x['explicit'])
        track_number.append(x['track_number'])

    df2 = pd.DataFrame({
        'uri': uri,
        'track': track,
        'duration_ms': duration,
        'explicit': explicit,
        'track_number': track_number})

    return df2

album_track = get_album_tracks("spotify:album:3M2LeGpESC5bWeYklMGtgk")


def get_track_info(df):
    danceability = []
    energy = []
    key = []
    loudness = []
    speechiness = []
    acousticness = []
    instrumentalness = []
    liveness = []
    valence = []
    tempo = []

    for i in df['uri']:
        for x in sp.audio_features(tracks=[i]):
            danceability.append(x['danceability'])
            energy.append(x['energy'])
            key.append(x['key'])
            loudness.append(x['loudness'])
            speechiness.append(x['speechiness'])
            acousticness.append(x['acousticness'])
            instrumentalness.append(x['instrumentalness'])
            liveness.append(x['liveness'])
            valence.append(x['valence'])
            tempo.append(x['tempo'])

    df2 = pd.DataFrame({
        'danceability': danceability,
        'energy': energy,
        'key': key,
        'loudness': loudness,
        'speechiness': speechiness,
        'acousticness': acousticness,
        'instrumentalness': instrumentalness,
        'liveness': liveness,
        'valence': valence,
        'tempo': tempo})

    return df2

track_info = get_track_info(album_track)


def merge_frames(df1, df2):
    df3 = df1.merge(df2, left_index=True, right_index=True)
    return df3

merge_frames(album_track, track_info)

def request_song_info(song_title, artist_name):
    base_url = 'https://api.genius.com'
    headers = {'Authorization': 'Bearer ' + 'YOUR_GENIUS_ACCESS_TOKEN'}
    search_url = base_url + '/search'
    data = {'q': song_title + ' ' + artist_name}
    response = requests.get(search_url, data=data, headers=headers)

    return response

response = request_song_info("Happier", "Marshmello")


remote_song_info = None
json_data = response.json()

for hit in json_data['response']['hits']:
    if artist_name.lower() in hit['result']['primary_artist']['name'].lower():
        remote_song_info = hit
        break

if remote_song_info:
    song_url = remote_song_info['result']['url']

def scrap_song_url(url):
    page = requests.get(url)
    html = BeautifulSoup(page.text, 'html.parser')
    lyrics = html.find('div', class_='lyrics').get_text()

    return lyrics

res = scrap_song_url(song_url)

print(res)        

The lyrics are stored in a DataFrame using pandas. After this, we shall upload the data into Firestore. You can use the client API or upload it through the Google Cloud console.
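
For reference, here is a minimal sketch of the programmatic route with the google-cloud-firestore client; the collection name 'lyrics' and the lyrics_df DataFrame with its 'track' and 'lyrics' columns are hypothetical names used only for this illustration:

from google.cloud import firestore

db = firestore.Client()

# lyrics_df is a hypothetical DataFrame holding one row per track
for _, row in lyrics_df.iterrows():
    db.collection('lyrics').document(row['track']).set({
        'track': row['track'],
        'lyrics': row['lyrics'],
    })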

After logging into the console, it will look something like this:

[Image: Firestore console]

The next step is the speech-to-text transcription. For that, we can either upload the MP3 audio files to Cloud Storage and automate the creation of a new text document for each file, or manually upload the input audio, which is what we shall do here.
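
If you prefer the programmatic route for short clips, a sketch with the Speech-to-Text v2 client could look like the following; the project id, audio file name, and default recognizer path are placeholders, and longer files would instead go through batch recognition into a storage bucket, which is what the console flow below produces:

from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

# Chirp is served from specific regions, e.g. us-central1
client = SpeechClient(
    client_options=ClientOptions(api_endpoint="us-central1-speech.googleapis.com"))

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="chirp",
)

with open("input_audio.mp3", "rb") as f:
    audio_bytes = f.read()

request = cloud_speech.RecognizeRequest(
    recognizer="projects/YOUR_PROJECT_ID/locations/us-central1/recognizers/_",
    config=config,
    content=audio_bytes,
)

response = client.recognize(request=request)
for result in response.results:
    print(result.alternatives[0].transcript)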


Note that Chirp is a recently launched foundation model. It is limited to certain regions, such as us-central1 and a few Asia-Pacific regions. In this scenario, I have chosen the us-central1 region to test the model.


Also, make sure you have granted the required permissions in the IAM console. This caused me a lot of headaches in the later stages of the project, so beware of such small things :).

After a few minutes of coffee break, it'll look something like this:

[Image: Chirp transcription output]

The above transcription is saved in the specified storage bucket, which in this case is "input-audio-bucket0", and the name of the stored file is:

Marshmello ft. Bastille - Happier (Official Lyric Video)_transcript_6477bd54-0000-2959-9b9e-582429c357ac.json        

Lastly, to create embeddings and perform semantic search using ScaNN, we use a dataset consisting of the song lyrics, and import it:

!gsutil cp gs://your_data_set.json .        

The dataset, initially stored in Firestore, can be fetched back using the gsutil command. Note that a vector database like Pinecone on Google Cloud offers better stability and features such as data wrangling, cleaning, and fast query results. It works with JSON files, representing a set of vectors as an array of JSON objects.

[Image: Pinecone console]
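
As an aside, if you went the Pinecone route, upserting vectors with the classic pinecone-client would look roughly like this; the API key, environment, index name, and the sample vector are all placeholders, and the 768 dimensions match the embedding model used later:

import pinecone

pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="us-west1-gcp")

# One vector per song; cosine metric pairs naturally with text embeddings
pinecone.create_index("song-lyrics", dimension=768, metric="cosine")
index = pinecone.Index("song-lyrics")

index.upsert(vectors=[
    ("song-0", [0.1] * 768, {"textContent": "example lyric line"}),
])

print(index.query(vector=[0.1] * 768, top_k=3))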

Let's store the records in a Python list, appending each line of the JSON Lines file, as shown below:

import json

records = []
with open("your_dataset.jsonl") as f:
    for line in f:
        record = json.loads(line)
        records.append(record)        

Then, to get the embeddings, we can define a function which takes text as input and returns the embedded values. This can be achieved as follows:

import time

def get_embedding(text):
    get_embedding.counter += 1
    try:
        # Throttle requests to stay under the embedding API's rate limits
        if get_embedding.counter % 100 == 0:
            time.sleep(3)
        return model.get_embeddings([text])[0].values
    except Exception:
        return []


get_embedding.counter = 0        
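Note that the model object used above is never initialized in the snippet; assuming the Vertex AI text embedding model (whose 768-dimensional output matches the index built below), it can be set up like this, with the project id and region as placeholders:

import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="YOUR_PROJECT_ID", location="us-central1")
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")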

Furthermore, we need to create an index: the data structure that organizes the vectors (each identified by its position in the dataset) so that specific vectors can be retrieved efficiently.

With ScaNN, it can be created using the scann.scann_ops_pybind.builder() function, as shown:

import numpy as np
import scann

record_count = len(records)

# Build a DataFrame from the records; each record is assumed to carry
# an 'embedding' (768-dimensional) and a 'textContent' field
df = pd.DataFrame(records)

dataset = np.empty((record_count, 768))
for i in range(record_count):
    dataset[i] = df.embedding[i]

normalform_dataset = dataset / np.linalg.norm(dataset, axis=1)[:, np.newaxis]


searcher = (
    scann.scann_ops_pybind.builder(normalform_dataset, 10, "dot_product")
    .tree(
        num_leaves=record_count,
        num_leaves_to_search=record_count,
        training_sample_size=record_count,
    )
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .build()
)        

The code for building the searcher object can be found in this repo, which explores the function calls in detail.

Finally, to provide input to our model, we can construct a query function which accepts a query as input.

In our case the input is the JSON transcript from the Chirp model, which first needs to be converted to plain text, since the function accepts a string as input.

This can be done using:

# Use a separate variable so we don't overwrite the embeddings DataFrame df
transcript_df = pd.read_json('Marshmello ft. Bastille - Happier (Official Lyric Video)_transcript_6477bd54-0000-2959-9b9e-582429c357ac.json')
transcript_df.to_csv(r'New_File.txt', index=False)        

Note that the above transcript is what we obtained earlier from Chirp.

And our query function can be defined as:

def search(query):
    start = time.time()
    query_embedding = model.get_embeddings([query])[0].values
    neighbors, distances = searcher.search(query_embedding, final_num_neighbors=3)
    end = time.time()

    for id, dist in zip(neighbors, distances):
        print(f"[docid:{id}] [{dist}] -- {df.textContent[int(id)][:125]}...")
    print("Latency (ms):", 1000 * (end - start))        

We can call the function by feeding it the converted text file from earlier, line by line:

with open('New_File.txt') as New_File:
    for line in New_File:
        search(line)        
