Essentials of vector embeddings for: Recommendation Engines
Saahil Rathore
Have you ever wondered how Amazon recommends shirts or trousers based on your search history? Or perhaps you have listened to recommended songs on your favorite music app, which keeps track of the artists and albums you follow the most? Most SaaS (Software as a Service) applications today have such built-in engines to better understand user interactions, improve customer service, and enrich personalization.
The secret sauce behind them, which you might have guessed from the title, is something called vector embeddings. In my opinion, they have truly revolutionized the way we discover, explore, and connect with the world around us. These ingenious mathematical representations unlock the hidden patterns within vast datasets, transforming mere words, images, or items into dense, meaningful vectors that capture the essence of our preferences and desires.
But, what does that even mean? Why was there any need to create them in the first place?
This article covers the essential characteristics of vector-space embeddings, the mathematics behind them, and how they work in practice using Google Cloud's Vertex AI Matching Engine.
Embeddings? What are those?
Embeddings are a way of representing data–almost any kind of data, like text, images, videos, users, music, whatever–as points in space whose locations are semantically meaningful.
One of the best ways to understand this concept is through an example: a technique developed at Google whereby words are mapped into vector space to create what are called word embeddings, more technically referred to as Word2Vec (Word to Vector).
They look something like this:
Where V1, V2 & V3 are the mathematical equivalents of words in the text space.
Here's a more popular, and perhaps the most intuitive, visual for Word2Vec:
With Word2Vec, similar words cluster together in space–so the vectors/points representing "king", "queen", "man" and "woman" will all cluster nearby.
So now you can probably appreciate how this is useful: embeddings allow us to cluster similar data points together in vector space. Hence, we could build a function which takes, say, "apple" as an input and returns the closest, most similar data items, like "orange" or "mango", which are clustered together in n-dimensional space.
Let's suppose we embedded the lyrics of songs; then we could build a method which takes the semantics of a song as input and gives us n similar songs as output. That would be really cool, right?
Guess what, we shall be doing exactly that. But before jumping into the code, let's analyze how embeddings actually work.
But what exactly do we mean by "clustering similar data"?
In simple terms, it means computing a similarity score between embedded data points, answering questions like "how similar is this shirt to that one?". One way to compute this score is to measure the distance between two embedded points in space, the Euclidean distance, and say that:
"the more closer they are, the more similar they are"
This can also be done using the cosine similarity, computed from the vector dot product, i.e.,
cos(θ) = (V1 · V2) / (|V1| * |V2|)
where |Vi| is the magnitude of a given vector Vi and θ is the angle between the two vectors.
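As a quick illustration, here is a minimal NumPy sketch of both measures (the two vectors are made-up toy embeddings, purely for the example):

import numpy as np

v1 = np.array([0.20, 0.90, 0.40])   # toy embedding for "king"
v2 = np.array([0.25, 0.85, 0.50])   # toy embedding for "queen"

# Euclidean distance: smaller means more similar
euclidean = np.linalg.norm(v1 - v2)

# Cosine similarity: closer to 1 means more similar
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(euclidean, cosine)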
Furthermore, similarity scores are useful for applications like duplicate detection and facial recognition.
To implement facial recognition, for example, you might embed pictures of people’s faces, then determine that if two pictures have a high enough similarity score, they’re of the same person. Or, if you were to embed all the pictures on your cell phone camera and found photos that were very nearby in embedding space, you could conclude those points were likely near-duplicate photos.
That said, in reality the mathematics behind these engines is a little more involved than simply calculating the distance between two vectors.
In 2013, Tomas Mikolov et al. published two seminal papers proposing two different model architectures for word embeddings:
In this article, we shall focus on the famous skip-gram architecture, which is used in most query-based engines.
Let's take an array containing a sequence of words:
Sequence = { W0, W1, W2, ..., Wj-1, Wj }
For a word Wk, its context is obtained from its left and right neighborhoods within a window of size m;
That is:
left_context = { Wk-m, ..., Wk-1 }
right_context = { Wk+1, ..., Wk+m }
That is, if the following represents the sequence;
then;
Each word w is assigned a vector representation v, and the probability that a word Wo appears in the context of a center word Wi is defined by the softmax:
P(Wo | Wi) = exp(Vo · Vi) / Σw exp(Vw · Vi)
where the sum in the denominator runs over every word w in the vocabulary.
The primary objective of the skip-gram model is to predict the context of each central word; its training therefore involves finding the set of vectors V which maximizes the objective function, the average log-probability of the observed context words:
(1/T) Σt Σ j in [-m, m], j ≠ 0 log P(Wt+j | Wt)
However, as the size of the sequence grows, say to thousands of words, it quickly becomes computationally inefficient to compute these probabilities for every word, since the softmax sums over the entire vocabulary. This makes optimization very much necessary for such a model to work.
Therefore, it is better to sample training words with a probability that decreases with their frequency in the text sequence. This discards many occurrences of the most common words, reducing the amount of data to be processed. This design technique is called word subsampling. The details of the technique are beyond the scope of this article; for that you can visit here.
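For reference, the subsampling rule from the original word2vec paper discards each occurrence of a word w with probability P(w) = 1 - sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a small threshold (around 10^-5). Here is a minimal sketch of that rule, assuming a precomputed dictionary freq of relative word frequencies:

import math
import random

def keep_word(word, freq, t=1e-5):
    # The probability of discarding a word grows with its frequency in the corpus
    discard_prob = 1 - math.sqrt(t / freq[word])
    return random.random() > discard_prob

# Very frequent words like "the" get dropped most of the time; rare words are almost always kept
freq = {"the": 0.05, "embedding": 0.00001}
sequence = ["the", "embedding", "the", "the"]
subsampled = [w for w in sequence if keep_word(w, freq)]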
Up to this point, we have understood the purpose of building scalable embeddings and how they work under the hood, but we still haven't seen an implementation.
After much exploration, I found that there are tons of pre-trained text embedding models available for free, and one of the most popular is the Universal Sentence Encoder. It can be downloaded here from the TensorFlow Hub model repository.
The code for using a text embedding model from the repository is pretty straightforward; taken from their website:
import tensorflow_hub as hub

# Load the Universal Sentence Encoder and embed a couple of example sentences
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed([
    "The quick brown fox jumps over the lazy dog.",
    "I am a sentence for which I would like to get its embedding"])
print(embeddings)
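Since the result is just a batch of vectors, we can plug them straight into the similarity score from earlier. For instance, a small addition on top of the snippet above (Universal Sentence Encoder vectors are approximately unit length, so the inner product behaves like cosine similarity):

import numpy as np

# Similarity between the two example sentences embedded above
similarity = np.inner(embeddings[0], embeddings[1])
print(similarity)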
OpenAI's CLIP model can also be used; it takes text and images as input and maps both data types into the same embedding space.
Additionally, we can train our own model instead of using a pre-trained one. TensorFlow Recommenders allows you to do exactly that.
Notice that so far in our discussion, we have been comparing vectors one by one against a large pool of vectors. Instead, you can use something called ANN, or Approximate Nearest Neighbor, search. Simply put, an algorithm which doesn't guarantee to return the exact nearest neighbor in every case, but instead finds "approximate" ones, can significantly improve query speed and memory usage, and that is exactly what ANN does.
Many ANN algorithms use something called Vector Quantization (VQ).
Vector..what?
Let's say we have a collection of animal pictures, and each picture is represented by a set of numbers: a list of features. These features could be things like the color of the animal, its size, and other characteristics. So for each picture, you have a vector of numbers that describes it.
Something like this:
Now, vector quantization is a method that takes these vectors and tries to find a small number of representative vectors called centroids. Think of these centroids as special pictures that represent a group of similar pictures.
To group the pictures, we look at each picture's vector and find the centroid that is closest to it. We consider the centroid to be the best representative for that picture. So instead of comparing a picture to all other pictures, we only compare it to a few centroids.
This process helps us organize the pictures into different groups based on their similarities. Pictures with similar features will have vectors that are closer to the same centroid. It's like putting pictures of cats in one group, pictures of dogs in another group, and so on.
This VQ technique dramatically improves query speed and is an essential part of many ANN algorithms, much like indexing is an essential part of relational databases and full-text search engines.
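To make the idea concrete, here is a minimal vector-quantization sketch using k-means from scikit-learn (the "picture" vectors below are random placeholders, purely for illustration):

import numpy as np
from sklearn.cluster import KMeans

# 1,000 made-up picture vectors with 128 features each
vectors = np.random.rand(1000, 128)

# Learn 10 centroids (the "codewords"); each vector is assigned to its nearest centroid
kmeans = KMeans(n_clusters=10, n_init=10).fit(vectors)
centroids = kmeans.cluster_centers_
groups = kmeans.labels_

# At query time, we only search inside the group whose centroid is closest to the query
query = np.random.rand(1, 128)
nearest_group = kmeans.predict(query)[0]
candidates = vectors[groups == nearest_group]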
To better understand ANN, think of the vectors organized into groups.
In the figure above, the codewords represent the centroids of the groups. It is also clear that as the number of groups searched increases, the search becomes slower but more accurate.
Managing this trade-off — getting higher accuracy at shorter latency — has been a key challenge with ANN algorithms.
To tackle such challenges, Google introduced ScaNN, which uses a relatively new VQ algorithm called Anisotropic Vector Quantization.
What is Anisotropy now?
Anisotropy refers to the property of data where the measurements or characteristics vary in different directions. In other words, the data is not equally distributed or uniform in all directions. For example, in an image, certain features or patterns may be more pronounced or have different variations along specific directions.
The ScaNN algorithm takes the anisotropy of the data into account by using a distance metric that considers the varying characteristics in different directions. This allows for more accurate representation and grouping of data that exhibits anisotropic properties.
The specific implementation of anisotropic vector quantization depends on the data and the application. It may involve using different distance measures, adjusting the weighting or scaling of different dimensions, or applying dimension-specific transformations to align the data.
This is the magic ingredient in the user experience you feel when you are using Google Image Search, YouTube, Google Play, and many other services that rely on recommendations and search. In short, Google's ANN technology enables users to find valuable information in milliseconds, in the vast sea of web content.
So, why should we even bother knowing about ScaNN?
Well, you see, Google's well-established Vertex AI Matching Engine is powered by a ScaNN backend, which provides fast and scalable vector search for many of their services, and it recently became generally available (GA) and ready for production use.
To truly understand the idea, I created a small project which demonstrates a use case for the Matching Engine.
Diagrammatically:
To fetch the lyrics, we first need access token credentials from Spotify, which can be obtained here.
Thereafter, we shall also need an access token from Genius to authorize the API calls made to it. The official docs for the same can be found here.
Here's the code I wrote to fetch the lyrics.
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import lyricsgenius
from bs4 import BeautifulSoup
import requests

# Spotify API credentials
cid = 'SPOTIFY_CLIENT_ID'
secret = 'SPOTIFY_CLIENT_SECRET'
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

# Genius API client (the lyricsgenius library expects an access token)
genius_access_token = 'YOUR_GENIUS_ACCESS_TOKEN'
genius = lyricsgenius.Genius(genius_access_token)

album_name = 'ALBUM_NAME'
artist_name = 'ARTIST_NAME'
results = sp.search(q=f'album:{album_name} artist:{artist_name}', type='album', limit=1)
def get_album_tracks(uri_info):
    # Collect basic track metadata for every song on the album
    uri = []
    track = []
    duration = []
    explicit = []
    track_number = []
    one = sp.album_tracks(uri_info, limit=50, offset=0, market='US')
    df1 = pd.DataFrame(one)

    for i, x in df1['items'].items():
        uri.append(x['uri'])
        track.append(x['name'])
        duration.append(x['duration_ms'])
        explicit.append(x['explicit'])
        track_number.append(x['track_number'])

    df2 = pd.DataFrame({
        'uri': uri,
        'track': track,
        'duration_ms': duration,
        'explicit': explicit,
        'track_number': track_number})

    return df2
album_track = get_album_tracks("spotify:album:3M2LeGpESC5bWeYklMGtgk")
def get_track_info(df):
    # Pull Spotify audio features (danceability, energy, etc.) for every track URI
    danceability = []
    energy = []
    key = []
    loudness = []
    speechiness = []
    acousticness = []
    instrumentalness = []
    liveness = []
    valence = []
    tempo = []
    for i in df['uri']:
        for x in sp.audio_features(tracks=[i]):
            danceability.append(x['danceability'])
            energy.append(x['energy'])
            key.append(x['key'])
            loudness.append(x['loudness'])
            speechiness.append(x['speechiness'])
            acousticness.append(x['acousticness'])
            instrumentalness.append(x['instrumentalness'])
            liveness.append(x['liveness'])
            valence.append(x['valence'])
            tempo.append(x['tempo'])

    df2 = pd.DataFrame({
        'danceability': danceability,
        'energy': energy,
        'key': key,
        'loudness': loudness,
        'speechiness': speechiness,
        'acousticness': acousticness,
        'instrumentalness': instrumentalness,
        'liveness': liveness,
        'valence': valence,
        'tempo': tempo})

    return df2
track_info = get_track_info(album_track)
def merge_frames(df1, df2):
    # Join the track metadata and audio-feature frames on their row index
    df3 = df1.merge(df2, left_index=True, right_index=True)
    return df3
merge_frames(album_track, track_info)
import requests

def request_song_info(song_title, artist_name):
    # Search the Genius API for a song and return the raw HTTP response
    base_url = 'https://api.genius.com'
    headers = {'Authorization': 'Bearer ' + 'YOUR_GENIUS_ACCESS_TOKEN'}
    search_url = base_url + '/search'
    params = {'q': song_title + ' ' + artist_name}
    response = requests.get(search_url, params=params, headers=headers)
    return response

artist_name = "Marshmello"
response = request_song_info("Happier", artist_name)
json_body = response.json()

# Pick the first search hit whose primary artist matches the one we searched for
remote_song_info = None
for hit in json_body['response']['hits']:
    if artist_name.lower() in hit['result']['primary_artist']['name'].lower():
        remote_song_info = hit
        break

if remote_song_info:
    song_url = remote_song_info['result']['url']

from bs4 import BeautifulSoup

def scrap_song_url(url):
    # Download the Genius lyrics page and extract the lyrics text
    page = requests.get(url)
    html = BeautifulSoup(page.text, 'html.parser')
    lyrics = html.find('div', class_='lyrics').get_text()
    return lyrics

res = scrap_song_url(song_url)
print(res)
The code stores the lyrics in a pandas DataFrame. After this, we shall upload the DataFrame to Firestore; you can use the Cloud Storage API or upload it through the Google Cloud console.
After logging into the console, it will look something like this:
The next step is speech-to-text transcription. For that, we can either upload the MP3 audio files to Cloud Storage and automate the task of creating a new text document for each file, or manually upload the input audio, which is the approach we shall follow here.
Note that Chirp is a recently launched foundation model. It is limited to select regions such as us-central1 and a few Asia-Pacific regions; in this scenario, I have chosen the us-central1 region to test the model.
Also make sure you have granted the required permissions in the IAM console. This caused me a lot of headaches in the later stages of this project, so beware of such small things :).
After a few minutes (enough for a coffee break), it'll look something like this:
The above transcription is saved in the specified storage bucket, which in this case is "input-audio-bucket0", and the name of the stored file is:
Marshmello ft. Bastille - Happier (Official Lyric Video)_transcript_6477bd54-0000-2959-9b9e-582429c357ac.json
Lastly, to create embeddings and perform semantic search using ScaNN, we use a dataset consisting of song lyrics and import it:
!gsutil cp gs://your_bucket/your_dataset.jsonl .
The dataset, initially uploaded to Cloud Storage, can be fetched back using the gsutil command. Note that a managed vector database like Pinecone offers better stability and features such as data wrangling, cleaning and fast query results. Here the dataset is a JSONL file, where each line is a JSON object representing one record and its vector.
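For illustration, each line of the JSONL dataset is assumed to look roughly like the record below; the field names (id, textContent, embedding) are simply the ones the later code expects, and the embedding array is truncated here:

{"id": "0", "textContent": "first few lines of the song lyrics ...", "embedding": [0.012, -0.047, 0.103, ...]}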
Let's store the records in a Python list, appending one parsed record per line of the JSONL file, as shown below:
import json

records = []
with open("your_dataset.jsonl") as f:
    for line in f:
        record = json.loads(line)
        records.append(record)
Then, to get the embeddings, we can define a function which takes text as input and returns the embedded values. This can be achieved as follows:
def get_embedding(text):
    # Pause briefly every 100 calls to stay under the embedding API's rate limit
    get_embedding.counter += 1
    try:
        if get_embedding.counter % 100 == 0:
            time.sleep(3)
        return model.get_embeddings([text])[0].values
    except Exception:
        return []

get_embedding.counter = 0
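Note that model above is assumed to be a Vertex AI text embedding model loaded beforehand. A minimal sketch of that setup (the project ID, region and model version are placeholders; check the Vertex AI documentation for currently available model names):

import time
import vertexai
from vertexai.language_models import TextEmbeddingModel

# Initialise the Vertex AI SDK and load a 768-dimensional text embedding model
vertexai.init(project="YOUR_PROJECT_ID", location="us-central1")
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")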
Furthermore, we need to create an index. An index organizes the embedded vectors and assigns each one an identifier, so that specific vectors can be located and retrieved quickly instead of scanning the whole collection.
With the ScaNN library, it can be created using the scann.scann_ops_pybind.builder() function as shown:
import numpy as np
import scann

# Build a DataFrame from the records so the embeddings and lyrics are easy to index
df = pd.DataFrame(records)

record_count = len(records)
dataset = np.empty((record_count, 768))
for i in range(record_count):
    dataset[i] = df.embedding[i]

# Normalize each vector so that the dot product behaves like cosine similarity
normalform_dataset = dataset / np.linalg.norm(dataset, axis=1)[:, np.newaxis]

searcher = (
    scann.scann_ops_pybind.builder(normalform_dataset, 10, "dot_product")
    .tree(
        num_leaves=record_count,
        num_leaves_to_search=record_count,
        training_sample_size=record_count,
    )
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .build()
)
The full code for building the searcher object can be found in this repo, which explores the function calls in detail.
Finally, to provide input to our model, we can construct a query function which accepts a query as input.
In our case, the input is the JSON transcript from the Chirp model, which first needs to be converted to plain text, since the function accepts a string as input.
This can be done using:
transcript_df = pd.read_json('Marshmello ft. Bastille - Happier (Official Lyric Video)_transcript_6477bd54-0000-2959-9b9e-582429c357ac.json')
transcript_df.to_csv(r'New_File.txt', index=False)
Note that the above transcript is what we obtained earlier from Chirp.
And our query function can be defined as:
def search(query):
    # Embed the query, then look up its nearest neighbours in the ScaNN index
    start = time.time()
    query = model.get_embeddings([query])[0].values
    neighbors, distances = searcher.search(query, final_num_neighbors=3)
    end = time.time()
    for id, dist in zip(neighbors, distances):
        print(f"[docid:{id}] [{dist}] -- {df.textContent[int(id)][:125]}...")
    print("Latency (ms):", 1000 * (end - start))
We can call the function by feeding it, line by line, the text file we converted earlier:
with open('New_File.txt') as f:
    for line in f:
        search(line)