Unlocking the Power of Semantic Search: Methods, Algorithms, and Code Walkthrough with Arabic Text Handling

Prasanth V

Generative AI & Machine Learning Engineer, Professional Prompt Engineer, Data Scientist | Driving Innovation in AI Solutions | Expert in LLMs, LangChain, and Cloud Technologies | AI apps with AngularJS and Streamlit

Published: October 27, 2024


Did You Know You Can Search English Content with an Arabic Query?

For example, if you search in Arabic for:

"أفضل طريقة لكتابة كود بايثون" (Translation: "best way to write Python code"),

Your semantic search system will understand the meaning behind the words and return documents or tutorials related to Python programming, even if the exact terms don’t match in the content. This showcases the power of semantic search for handling non-English languages, such as Arabic.

Introduction

As search engines evolve, the demand for multilingual search capabilities, including support for languages like Arabic, grows. Traditional keyword-based search methods often fail to grasp the nuances of different languages, especially non-English ones. However, semantic search goes beyond simple keyword matching by understanding the meaning behind words, enabling it to provide relevant results for complex languages like Arabic.

In this article, we’ll walk through the journey of building a sophisticated semantic search engine, from traditional BM25 keyword-based methods to dense retrieval with embeddings and even generative search. We will also highlight how these methods work effectively for languages such as Arabic.

Section 1: What is Semantic Search?

Semantic search focuses on understanding the intent and meaning behind a user’s query, rather than just looking for keyword matches. This is especially critical for languages like Arabic, where words can have different meanings and forms depending on context.

Example: A query in Arabic like "أفضل طريقة لكتابة كود بايثون" (best way to write Python code) will return results about Python programming, even if the exact keywords aren’t present in the documents.

In contrast, a traditional keyword search might only return results containing the specific words "أفضل" (best), "طريقة" (way), and "بايثون" (Python) without considering the relationships between these words.

Real-world Applications:

Google: Uses semantic search to understand user intent and deliver relevant search results.
E-commerce: Platforms like Amazon rely on semantic search to recommend products based on user behavior and contextual relevance.

Section 2: Algorithms and Techniques Used

Let’s dive into the methods and algorithms used, from the initial BM25 model to dense retrieval with transformer-based models, and explore how these methods handle Arabic and other languages.

1. Initial Method: BM25 in Keyword Search (L1-Keyword_Search.py)

The search engine starts with BM25 (Best Matching 25), a popular ranking function used in information retrieval. BM25 scores documents by how often the query terms appear in them, weighted by how rare each term is across the collection. It refines TF-IDF (Term Frequency - Inverse Document Frequency) by saturating term frequency and normalizing for document length.

For a query like "أفضل طريقة لكتابة كود بايثون" in Arabic, BM25 will retrieve documents that contain the exact terms "أفضل" (best), "طريقة" (way), and "بايثون" (Python), but it won’t understand the deeper meaning behind the query.
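To make the scoring concrete, here is a minimal sketch of the BM25 formula, assuming whitespace tokenization and the commonly used defaults k1 = 1.5 and b = 0.75. The function name bm25_scores and the sample documents are made up for illustration, and this is not how any particular search engine implements it internally.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with an Okapi BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: number of documents containing each term
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no exact match, so no contribution
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates via k1; b controls length normalization
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Illustrative documents (assumed, not from the Wikipedia index)
docs = [
    "python code style guide",
    "recipes for cooking rice",
    "writing clean programs in a scripting language",
]
scores = bm25_scores("best way to write python code", docs)
print(scores)
```

Only the first document shares exact tokens with the query, so it is the only one with a non-zero score. The third document is arguably on topic, but BM25 cannot see that because "writing" and "write" are different strings.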

Key Code from L1-Keyword_Search.py:

#!/usr/bin/env python
# coding: utf-8

# # Keyword Search

# ## Setup
# 
# Load needed API keys and relevant Python libraries.

# In[12]:


# !pip install cohere
# !pip install weaviate-client


# In[13]:


import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file


# Let's start by importing Weaviate to access the Wikipedia database.

# In[14]:


import weaviate
auth_config = weaviate.auth.AuthApiKey(
    api_key=os.environ['WEAVIATE_API_KEY'])



# In[15]:


client = weaviate.Client(
    url=os.environ['WEAVIATE_API_URL'],
    auth_client_secret=auth_config,
    additional_headers={
        "X-Cohere-Api-Key": os.environ['COHERE_API_KEY'],
    }
)


# In[16]:


client.is_ready() 


# # Keyword Search

# In[17]:


def keyword_search(query,
                   results_lang='en',
                   properties = ["title","url","text"],
                   num_results=3):

    where_filter = {
    "path": ["lang"],
    "operator": "Equal",
    "valueString": results_lang
    }
    
    response = (
        client.query.get("Articles", properties)
        .with_bm25(
            query=query
        )
        .with_where(where_filter)
        .with_limit(num_results)
        .do()
        )

    result = response['data']['Get']['Articles']
    return result


# In[18]:


query = "What is the most viewed televised event?"
keyword_search_results = keyword_search(query)
print(keyword_search_results)


# ### Try modifying the search options
# - Other languages to try: `en, de, fr, es, it, ja, ar, zh, ko, hi`

# In[19]:


properties = ["text", "title", "url", 
             "views", "lang"]


# In[20]:


def print_result(result):
    """ Print each result's fields as key:value lines """
    for i,item in enumerate(result):
        print(f'item {i}')
        for key in item.keys():
            print(f"{key}:{item.get(key)}")
            print()
        print()


# In[21]:


print_result(keyword_search_results)


# In[22]:


query = "What is the most viewed televised event?"
keyword_search_results = keyword_search(query, results_lang='de')
print_result(keyword_search_results)

Algorithm: BM25
Technique: Keyword matching scored by term frequency and inverse document frequency, with document-length normalization.
Tool: Weaviate's built-in BM25 query (with_bm25 in the Python client).

This method forms the backbone of keyword-based search but falls short when handling queries in languages like Arabic, which often involve more complex word forms and meanings.
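A small sketch of why exact token matching struggles with Arabic word forms: the definite article "ال" is written attached to the noun, so "الكتاب" (the book) and "كتاب" (book) are different strings. The helper names exact_matches and strip_article are hypothetical, and the prefix stripping is a deliberately naive stand-in for real morphological analysis.

```python
def exact_matches(query, doc):
    """Return the query tokens that appear verbatim in the document."""
    return set(query.split()) & set(doc.split())

def strip_article(token):
    # Naive illustration only: drops a leading definite article "ال".
    # Real Arabic normalization needs a proper morphological analyzer.
    return token[2:] if token.startswith("ال") and len(token) > 3 else token

doc = "الكتاب مفيد"   # "the book is useful"
query = "كتاب"        # "book"

raw = exact_matches(query, doc)
normalized = exact_matches(
    query, " ".join(strip_article(t) for t in doc.split())
)
print(raw)         # no overlap: the two word forms are different strings
print(normalized)  # overlap appears once the article is stripped
```

Rule-based stripping like this breaks quickly on real text, which is one reason embedding-based methods are attractive for Arabic.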

2. Introducing Word Embeddings (L2-Embeddings.py)

Next, we move to embeddings. Embeddings represent words or sentences as vectors in a high-dimensional space, capturing their semantic relationships. Classic examples are GloVe and Word2Vec; the code below uses Cohere's embed endpoint instead. This method improves search relevance by considering the meaning of words rather than just their literal appearance.

For example, for the query "طريقة جيدة لتعلم بايثون" (good way to learn Python), the system understands that "جيدة" (good) and "أفضل" (best) are semantically related, even if the document doesn’t contain exact keyword matches.
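Closeness between embeddings is usually measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors purely for illustration; real embeddings come from a model and have hundreds or thousands of dimensions, and the values in toy_embeddings are assumptions, not model output.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical vectors, invented for this example
toy_embeddings = {
    "good": [0.9, 0.8, 0.1],
    "best": [0.95, 0.85, 0.05],
    "potato": [0.1, 0.2, 0.9],
}

sim_related = cosine_similarity(toy_embeddings["good"], toy_embeddings["best"])
sim_unrelated = cosine_similarity(toy_embeddings["good"], toy_embeddings["potato"])
print(f"good vs best:   {sim_related:.3f}")
print(f"good vs potato: {sim_unrelated:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, while unrelated ones score much lower, regardless of whether the underlying words share any characters.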

Key Code from L2-Embeddings.py:

#!/usr/bin/env python
# coding: utf-8

# # Lesson 2: Embeddings

# Note: The numeric values of embeddings you see in your notebook may vary slightly from those filmed.

# ### Setup
# Load needed API keys and relevant Python libraries.

# In[1]:


# !pip install cohere umap-learn altair datasets


# In[2]:


import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file


# In[3]:


import cohere
co = cohere.Client(os.environ['COHERE_API_KEY'])


# In[4]:


import pandas as pd


# ## Word Embeddings
# 
# Consider a very small dataset of three words.

# In[5]:


three_words = pd.DataFrame({'text':
  [
      'joy',
      'happiness',
      'potato'
  ]})

three_words


# Let's create the embeddings for the three words:
# You may see an 'unknown field' warning which can be ignored.

# In[6]:


three_words_emb = co.embed(texts=list(three_words['text']),
                           model='embed-english-v2.0').embeddings


# In[7]:


word_1 = three_words_emb[0]
word_2 = three_words_emb[1]
word_3 = three_words_emb[2]


# In[8]:


word_1[:10]


# ## Sentence Embeddings

# Consider a very small dataset of eight sentences.

# In[9]:


sentences = pd.DataFrame({'text':
  [
   'Where is the world cup?',
   'The world cup is in Qatar',
   'What color is the sky?',
   'The sky is blue',
   'Where does the bear live?',
   'The bear lives in the woods',
   'What is an apple?',
   'An apple is a fruit',
  ]})

sentences


# Let's create the embeddings for the eight sentences:

# In[10]:


emb = co.embed(texts=list(sentences['text']),
               model='embed-english-v2.0').embeddings

# Print the first 3 values of each sentence embedding:
for e in emb:
    print(e[:3])


# In[11]:


len(emb[0])


# In[12]:


#import umap
#import altair as alt


# The next code cell is for hiding some warnings that appear when importing the `umap_plot` library.

# In[13]:


# hide the warnings that would appear when importing the UMAP library
from numba.core.errors import NumbaDeprecationWarning, NumbaPendingDeprecationWarning
import warnings
warnings.simplefilter('ignore', category=NumbaDeprecationWarning)
warnings.simplefilter('ignore', category=NumbaPendingDeprecationWarning)


# In[14]:


from utils import umap_plot


# In[15]:


chart = umap_plot(sentences, emb)


# In[16]:


chart.interactive()


# ## Articles Embeddings

# In[17]:


import pandas as pd
wiki_articles = pd.read_pickle('wikipedia.pkl')
wiki_articles


# In[18]:


import numpy as np
from utils import umap_plot_big


# In[19]:


articles = wiki_articles[['title', 'text']]
embeds = np.array([d for d in wiki_articles['emb']])

chart = umap_plot_big(articles, embeds)
chart.interactive()

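Algorithm: Text embeddings from Cohere's embed-english-v2.0 model.
Technique: Vectorization of words and sentences so that closeness between query and document embeddings can be measured (for example with cosine similarity); UMAP projects the vectors to 2D for plotting.
Tool: Cohere Python SDK for embeddings; the utils helpers umap_plot and umap_plot_big for visualization.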

With embeddings, the system becomes capable of recognizing semantic similarity between queries and documents, which is crucial for Arabic text with rich morphology and complex grammar.

3. Dense Retrieval with Transformer Models (L3-Dense_Retrieval.py)

Dense retrieval takes semantic search to the next level by using transformer-based embedding models; BERT and Sentence-BERT are well-known examples, and the code below uses Cohere embeddings with Weaviate and Annoy. These models embed both queries and documents into a dense vector space, so that texts with similar meaning end up close together. This suits morphologically rich languages like Arabic because the model encodes whole phrases in context rather than isolated surface forms.
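At its core, dense retrieval is a nearest-neighbor lookup over vectors. Here is a brute-force sketch with NumPy, assuming the embeddings are already computed; the function name top_k and the toy vectors are hypothetical. Vector databases and libraries such as Annoy replace the exhaustive comparison with an approximate index so that it scales to large collections.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, cosine similarity) for the k closest documents."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity to every document
    order = np.argsort(-sims)[:k]     # indices of the highest similarities
    return [(int(i), float(sims[i])) for i in order]

# Hypothetical document embeddings, one row per document
doc_vecs = np.array([
    [0.9, 0.1, 0.0],   # doc 0
    [0.1, 0.9, 0.1],   # doc 1
    [0.8, 0.3, 0.1],   # doc 2
])
query_vec = np.array([1.0, 0.2, 0.0])

results = top_k(query_vec, doc_vecs, k=2)
print(results)
```

Because the comparison happens in vector space, the query and the documents do not need to share any words, or even a language, as long as the embedding model maps them close together.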

Key Code from L3-Dense_Retrieval.py:

#!/usr/bin/env python
# coding: utf-8

# # Dense Retrieval

# ## Setup
# 
# Load needed API keys and relevant Python libraries.

# In[1]:


# !pip install cohere 
# !pip install weaviate-client Annoy


# In[2]:


import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file


# In[3]:


import cohere
co = cohere.Client(os.environ['COHERE_API_KEY'])


# In[4]:


import weaviate
auth_config = weaviate.auth.AuthApiKey(
    api_key=os.environ['WEAVIATE_API_KEY'])


# In[5]:


client = weaviate.Client(
    url=os.environ['WEAVIATE_API_URL'],
    auth_client_secret=auth_config,
    additional_headers={
        "X-Cohere-Api-Key": os.environ['COHERE_API_KEY'],
    }
)
client.is_ready() #check if True


# ## Part 1: Vector Database for semantic Search

# In[6]:


def dense_retrieval(query, 
                    results_lang='en', 
                    properties = ["text", "title", "url", "views", "lang", "_additional {distance}"],
                    num_results=5):

    nearText = {"concepts": [query]}
    
    # To filter by language
    where_filter = {
    "path": ["lang"],
    "operator": "Equal",
    "valueString": results_lang
    }
    response = (
        client.query
        .get("Articles", properties)
        .with_near_text(nearText)
        .with_where(where_filter)
        .with_limit(num_results)
        .do()
    )

    result = response['data']['Get']['Articles']

    return result


# In[7]:


from utils import print_result


# ### Basic Query

# In[8]:


query = "Who wrote Hamlet?"
dense_retrieval_results = dense_retrieval(query)
print_result(dense_retrieval_results)


# ### Medium Query

# In[9]:


query = "What is the capital of Canada?"
dense_retrieval_results = dense_retrieval(query)
print_result(dense_retrieval_results)


# In[10]:


from utils import keyword_search

query = "What is the capital of Canada?"
keyword_search_results = keyword_search(query, client)
print_result(keyword_search_results)


# ### Complicated Query

# In[11]:


from utils import keyword_search

query = "Tallest person in history?"
keyword_search_results = keyword_search(query, client)
print_result(keyword_search_results)


# In[12]:


query = "Tallest person in history"
dense_retrieval_results = dense_retrieval(query)
print_result(dense_retrieval_results)


# In[14]:


query = "أطول رجل في التاريخ"
dense_retrieval_results = dense_retrieval(query)
print_result(dense_retrieval_results)


# In[15]:


query = "film about a time travel paradox"
dense_retrieval_results = dense_retrieval(query)
print_result(dense_retrieval_results)


# ## Part 2: Building Semantic Search from Scratch
# 
# ### Get the text archive:

# In[16]:


from annoy import AnnoyIndex
import numpy as np
import pandas as pd
import re


# In[17]:


text = """
Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan.
It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine.
Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind.

Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007.
Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar.
Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm.
Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles.
Interstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects.

Interstellar premiered on October 26, 2014, in Los Angeles.
In the United States, it was first released on film stock, expanding to venues using digital projectors.
The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014.
It received acclaim for its performances, direction, screenplay, musical score, visual effects, ambition, themes, and emotional weight.
It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics. Since its premiere, Interstellar gained a cult following,[5] and now is regarded by many sci-fi experts as one of the best science-fiction films of all time.
Interstellar was nominated for five awards at the 87th Academy Awards, winning Best Visual Effects, and received numerous other accolades"""


# ### Chunking: 

# In[18]:


# Split into a list of sentences
texts = text.split('.')

# Clean up to remove empty spaces and new lines
texts = np.array([t.strip(' \n') for t in texts])


# In[19]:


texts


# In[20]:


# Split into a list of paragraphs
texts = text.split('\n\n')

# Clean up to remove empty spaces and new lines
texts = np.array([t.strip(' \n') for t in texts])


# In[21]:


texts


# In[22]:


# Split into a list of sentences
texts = text.split('.')

# Clean up to remove empty spaces and new lines
texts = np.array([t.strip(' \n') for t in texts])


# In[23]:


title = 'Interstellar (film)'

texts = np.array([f"{title} {t}" for t in texts])


# In[24]:


texts


# ### Get the embeddings:

# In[25]:


response = co.embed(
    texts=texts.tolist()
).embeddings


# In[26]:


embeds = np.array(response)
embeds.shape


# ### Create the search index:

# In[27]:


search_index = AnnoyIndex(embeds.shape[1], 'angular')
# Add all the vectors to the search index
for i in range(len(embeds)):
    search_index.add_item(i, embeds[i])

search_index.build(10) # 10 trees
search_index.save('test.ann')


# In[28]:


pd.set_option('display.max_colwidth', None)

def search(query):

  # Get the query's embedding
  query_embed = co.embed(texts=[query]).embeddings

  # Retrieve the nearest neighbors
  similar_item_ids = search_index.get_nns_by_vector(query_embed[0],
                                                    3,
                                                  include_distances=True)
  # Format the results
  results = pd.DataFrame(data={'texts': texts[similar_item_ids[0]],
                              'distance': similar_item_ids[1]})

  print(texts[similar_item_ids[0]])
    
  return results


# In[29]:


query = "How much did the film make?"
search(query)

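Algorithm: Transformer-based embedding model (Cohere embed) with approximate nearest-neighbor search.

Technique: Dense retrieval using embeddings and angular (cosine-based) distance.
Tool: Weaviate near-text queries for the hosted index; Cohere embed and Annoy for the from-scratch index.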

Dense retrieval ensures that even complex queries in Arabic, like "كيف يعمل التعلم العميق" (how deep learning works), retrieve semantically relevant documents.

4. Re-ranking Results for Precision (L4-Rerank.py)

After the initial retrieval, we apply re-ranking to refine the results and improve precision. A first stage (BM25 or dense retrieval) fetches a broad set of candidates, and a rerank model then scores each candidate against the query so that the most relevant documents move to the top.
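The two-stage pattern can be sketched independently of any particular service. In the example below, keyword_overlap is a crude stand-in for the first-stage retriever, and HYPOTHETICAL_RELEVANCE holds hand-assigned scores that stand in for a rerank model's output; both are assumptions for illustration only.

```python
def retrieve_then_rerank(query, docs, first_stage, scorer,
                         n_candidates=3, top_n=2):
    """Cheap recall-oriented retrieval, then precise re-scoring."""
    # Stage 1: keep the n_candidates best documents by the cheap score
    candidates = sorted(docs, key=lambda d: first_stage(query, d),
                        reverse=True)[:n_candidates]
    # Stage 2: re-order only those candidates with the expensive scorer
    reranked = sorted(candidates, key=lambda d: scorer(query, d),
                      reverse=True)
    return reranked[:top_n]

def keyword_overlap(query, doc):
    """Number of distinct tokens shared by query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

# Hand-assigned relevance scores standing in for a rerank model (assumed)
HYPOTHETICAL_RELEVANCE = {
    "Toronto is the capital of the province of Ontario in Canada": 0.45,
    "Ottawa is the capital city of Canada": 0.97,
    "Canada is known for maple syrup": 0.05,
    "The capital of France is Paris": 0.10,
}

docs = list(HYPOTHETICAL_RELEVANCE)
query = "what is the capital of canada"

top = retrieve_then_rerank(
    query, docs,
    first_stage=keyword_overlap,
    scorer=lambda q, d: HYPOTHETICAL_RELEVANCE[d],
)
for doc in top:
    print(doc)
```

Keyword overlap alone cannot separate the Toronto and Ottawa sentences, since both share five tokens with the query. The second stage is what promotes the sentence that actually answers the question.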

Key Code from L4-Rerank.py:

#!/usr/bin/env python
# coding: utf-8

# # ReRank

# ## Setup
# 
# Load needed API keys and relevant Python libraries.

# In[25]:


# !pip install cohere 
# !pip install weaviate-client


# In[26]:


import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file


# In[27]:


import cohere
co = cohere.Client(os.environ['COHERE_API_KEY'])


# In[28]:


import weaviate
auth_config = weaviate.auth.AuthApiKey(
    api_key=os.environ['WEAVIATE_API_KEY'])


# In[29]:


client = weaviate.Client(
    url=os.environ['WEAVIATE_API_URL'],
    auth_client_secret=auth_config,
    additional_headers={
        "X-Cohere-Api-Key": os.environ['COHERE_API_KEY'],
    }
)


# ## Dense Retrieval

# In[30]:


from utils import dense_retrieval


# In[31]:


query = "What is the capital of Canada?"


# In[32]:


dense_retrieval_results = dense_retrieval(query, client)


# In[33]:


from utils import print_result


# In[34]:


print_result(dense_retrieval_results)


# ## Improving Keyword Search with ReRank

# In[35]:


from utils import keyword_search


# In[36]:


query_1 = "What is the capital of Canada?"


# In[37]:


query_1 = "What is the capital of Canada?"
results = keyword_search(query_1,
                         client,
                         properties=["text", "title", "url", "views", "lang", "_additional {distance}"],
                         num_results=3
                        )

for i, result in enumerate(results):
    print(f"i:{i}")
    print(result.get('title'))
    print(result.get('text'))


# In[ ]:


query_1 = "What is the capital of Canada?"
results = keyword_search(query_1,
                         client,
                         properties=["text", "title", "url", "views", "lang", "_additional {distance}"],
                         num_results=500
                        )

for i, result in enumerate(results):
    print(f"i:{i}")
    print(result.get('title'))
    #print(result.get('text'))


# In[ ]:


def rerank_responses(query, responses, num_responses=10):
    reranked_responses = co.rerank(
        model = 'rerank-english-v2.0',
        query = query,
        documents = responses,
        top_n = num_responses,
        )
    return reranked_responses


# In[ ]:


texts = [result.get('text') for result in results]
reranked_text = rerank_responses(query_1, texts)


# In[ ]:


for i, rerank_result in enumerate(reranked_text):
    print(f"i:{i}")
    print(f"{rerank_result}")
    print()


# ## Improving Dense Retrieval with ReRank

# In[ ]:


from utils import dense_retrieval


# In[ ]:


query_2 = "Who is the tallest person in history?"


# In[ ]:


results = dense_retrieval(query_2,client)


# In[ ]:


for i, result in enumerate(results):
    print(f"i:{i}")
    print(result.get('title'))
    print(result.get('text'))
    print()


# In[ ]:


texts = [result.get('text') for result in results]
reranked_text = rerank_responses(query_2, texts)


# In[ ]:


for i, rerank_result in enumerate(reranked_text):
    print(f"i:{i}")
    print(f"{rerank_result}")
    print()

Algorithm: Retrieval (BM25 or dense) followed by a rerank model.
Technique: Two-step ranking where the first-stage results are re-scored for relevance to the query.
Tools: Weaviate for retrieval and Cohere's rerank endpoint (rerank-english-v2.0).

This step helps queries in Arabic, like "كيف أبدأ في تعلم الآلة" (how to start learning machine learning), get ranked more accurately based on meaning, provided a multilingual rerank model is used in place of the English one shown above.

5. Generative Search (L5-Generative_Search.py)

The final step introduces generative search, where a large language model such as GPT-4 or one of Cohere’s generative models writes an answer based on the query and the retrieved text. Instead of just returning documents, the system generates answers to complex queries.
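A common way to wire this up is to paste the retrieved chunks into the prompt and ask the model to answer from them. The sketch below shows only the prompt assembly; build_prompt, its wording, and the sample chunks are assumptions for illustration, and the call to an actual generation API is left out on purpose.

```python
def build_prompt(question, retrieved_chunks):
    """Combine retrieved text and the user's question into one prompt."""
    context = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Illustrative chunks, as if returned by the search step
chunks = [
    "Many people start by working on small projects in their spare time.",
    "Use early projects to learn and gradually step up to bigger projects.",
]
question = "Are side projects important when you are starting to learn about AI?"

prompt = build_prompt(question, chunks)
print(prompt)
```

Numbering the chunks makes it possible to ask the model to cite which passage supports its answer, and restricting it to the supplied context is a simple guard against answers that the retrieved text does not support.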

Key Code from L5-Generative_Search.py:

#!/usr/bin/env python
# coding: utf-8

# # Generating Answers

# In[47]:


question = "Are side projects important when you are starting to learn about AI?"


# In[48]:


text = """
The rapid rise of AI has led to a rapid rise in AI jobs, and many people are building exciting careers in this field. A career is a decades-long journey, and the path is not always straightforward. Over many years, I’ve been privileged to see thousands of students as well as engineers in companies large and small navigate careers in AI. In this and the next few letters, I’d like to share a few thoughts that might be useful in charting your own course.

Three key steps of career growth are learning (to gain technical and other skills), working on projects (to deepen skills, build a portfolio, and create impact) and searching for a job. These steps stack on top of each other:

Initially, you focus on gaining foundational technical skills.
After having gained foundational skills, you lean into project work. During this period, you’ll probably keep learning.
Later, you might occasionally carry out a job search. Throughout this process, you’ll probably continue to learn and work on meaningful projects.
These phases apply in a wide range of professions, but AI involves unique elements. For example:

AI is nascent, and many technologies are still evolving. While the foundations of machine learning and deep learning are maturing — and coursework is an efficient way to master them — beyond these foundations, keeping up-to-date with changing technology is more important in AI than fields that are more mature.
Project work often means working with stakeholders who lack expertise in AI. This can make it challenging to find a suitable project, estimate the project’s timeline and return on investment, and set expectations. In addition, the highly iterative nature of AI projects leads to special challenges in project management: How can you come up with a plan for building a system when you don’t know in advance how long it will take to achieve the target accuracy? Even after the system has hit the target, further iteration may be necessary to address post-deployment drift.
While searching for a job in AI can be similar to searching for a job in other sectors, there are some differences. Many companies are still trying to figure out which AI skills they need and how to hire people who have them. Things you’ve worked on may be significantly different than anything your interviewer has seen, and you’re more likely to have to educate potential employers about some elements of your work.
Throughout these steps, a supportive community is a big help. Having a group of friends and allies who can help you — and whom you strive to help — makes the path easier. This is true whether you’re taking your first steps or you’ve been on the journey for years.

I’m excited to work with all of you to grow the global AI community, and that includes helping everyone in our community develop their careers. I’ll dive more deeply into these topics in the next few weeks.

Last week, I wrote about key steps for building a career in AI: learning technical skills, doing project work, and searching for a job, all of which is supported by being part of a community. In this letter, I’d like to dive more deeply into the first step.

More papers have been published on AI than any person can read in a lifetime. So, in your efforts to learn, it’s critical to prioritize topic selection. I believe the most important topics for a technical career in machine learning are:

Foundational machine learning skills. For example, it’s important to understand models such as linear regression, logistic regression, neural networks, decision trees, clustering, and anomaly detection. Beyond specific models, it’s even more important to understand the core concepts behind how and why machine learning works, such as bias/variance, cost functions, regularization, optimization algorithms, and error analysis.
Deep learning. This has become such a large fraction of machine learning that it’s hard to excel in the field without some understanding of it! It’s valuable to know the basics of neural networks, practical skills for making them work (such as hyperparameter tuning), convolutional networks, sequence models, and transformers.
Math relevant to machine learning. Key areas include linear algebra (vectors, matrices, and various manipulations of them) as well as probability and statistics (including discrete and continuous probability, standard probability distributions, basic rules such as independence and Bayes rule, and hypothesis testing). In addition, exploratory data analysis (EDA) — using visualizations and other methods to systematically explore a dataset — is an underrated skill. I’ve found EDA particularly useful in data-centric AI development, where analyzing errors and gaining insights can really help drive progress! Finally, a basic intuitive understanding of calculus will also help. In a previous letter, I described how the math needed to do machine learning well has been changing. For instance, although some tasks require calculus, improved automatic differentiation software makes it possible to invent and implement new neural network architectures without doing any calculus. This was almost impossible a decade ago.
Software development. While you can get a job and make huge contributions with only machine learning modeling skills, your job opportunities will increase if you can also write good software to implement complex AI systems. These skills include programming fundamentals, data structures (especially those that relate to machine learning, such as data frames), algorithms (including those related to databases and data manipulation), software design, familiarity with Python, and familiarity with key libraries such as TensorFlow or PyTorch, and scikit-learn.
This is a lot to learn! Even after you master everything in this list, I hope you’ll keep learning and continue to deepen your technical knowledge. I’ve known many machine learning engineers who benefitted from deeper skills in an application area such as natural language processing or computer vision, or in a technology area such as probabilistic graphical models or building scalable software systems.

How do you gain these skills? There’s a lot of good content on the internet, and in theory reading dozens of web pages could work. But when the goal is deep understanding, reading disjointed web pages is inefficient because they tend to repeat each other, use inconsistent terminology (which slows you down), vary in quality, and leave gaps. That’s why a good course — in which a body of material has been organized into a coherent and logical form — is often the most time-efficient way to master a meaningful body of knowledge. When you’ve absorbed the knowledge available in courses, you can switch over to research papers and other resources.

Finally, keep in mind that no one can cram everything they need to know over a weekend or even a month. Everyone I know who’s great at machine learning is a lifelong learner. In fact, given how quickly our field is changing, there’s little choice but to keep learning if you want to keep up. How can you maintain a steady pace of learning for years? I’ve written about the value of habits. If you cultivate the habit of learning a little bit every week, you can make significant progress with what feels like less effort.

In the last two letters, I wrote about developing a career in AI and shared tips for gaining technical skills. This time, I’d like to discuss an important step in building a career: project work.

It goes without saying that we should only work on projects that are responsible and ethical, and that benefit people. But those limits leave a large variety to choose from. I wrote previously about how to identify and scope AI projects. This and next week’s letter have a different emphasis: picking and executing projects with an eye toward career development.

A fruitful career will include many projects, hopefully growing in scope, complexity, and impact over time. Thus, it is fine to start small. Use early projects to learn and gradually step up to bigger projects as your skills grow.

When you’re starting out, don’t expect others to hand great ideas or resources to you on a platter. Many people start by working on small projects in their spare time. With initial successes — even small ones — under your belt, your growing skills increase your ability to come up with better ideas, and it becomes easier to persuade others to help you step up to bigger projects.

What if you don’t have any project ideas? Here are a few ways to generate them:

Join existing projects. If you find someone else with an idea, ask to join their project.
Keep reading and talking to people. I come up with new ideas whenever I spend a lot of time reading, taking courses, or talking with domain experts. I’m confident that you will, too.
Focus on an application area. Many researchers are trying to advance basic AI technology — say, by inventing the next generation of transformers or further scaling up language models — so, while this is an exciting direction, it is hard. But the variety of applications to which machine learning has not yet been applied is vast! I’m fortunate to have been able to apply neural networks to everything from autonomous helicopter flight to online advertising, partly because I jumped in when relatively few people were working on those applications. If your company or school cares about a particular application, explore the possibilities for machine learning. That can give you a first look at a potentially creative application — one where you can do unique work — that no one else has done yet.
Develop a side hustle. Even if you have a full-time job, a fun project that may or may not develop into something bigger can stir the creative juices and strengthen bonds with collaborators. When I was a full-time professor, working on online education wasn’t part of my “job” (which was doing research and teaching classes). It was a fun hobby that I often worked on out of passion for education. My early experiences recording videos at home helped me later in working on online education in a more substantive way. Silicon Valley abounds with stories of startups that started as side projects. So long as it doesn’t create a conflict with your employer, these projects can be a stepping stone to something significant.
Given a few project ideas, which one should you jump into? Here’s a quick checklist of factors to consider:

Will the project help you grow technically? Ideally, it should be challenging enough to stretch your skills but not so hard that you have little chance of success. This will put you on a path toward mastering ever-greater technical complexity.
Do you have good teammates to work with? If not, are there people you can discuss things with? We learn a lot from the people around us, and good collaborators will have a huge impact on your growth.
Can it be a stepping stone? If the project is successful, will its technical complexity and/or business impact make it a meaningful stepping stone to larger projects? (If the project is bigger than those you’ve worked on before, there’s a good chance it could be such a stepping stone.)
Finally, avoid analysis paralysis. It doesn’t make sense to spend a month deciding whether to work on a project that would take a week to complete. You'll work on multiple projects over the course of your career, so you’ll have ample opportunity to refine your thinking on what’s worthwhile. Given the huge number of possible AI projects, rather than the conventional “ready, aim, fire” approach, you can accelerate your progress with “ready, fire, aim.”

"""


# ## Setup
# 
# Load needed API keys and relevant Python libraries.

# In[49]:


import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file


# In[50]:


import cohere

import numpy as np
import warnings
warnings.filterwarnings('ignore')


# ## Chunking

# In[51]:


# Split into a list of paragraphs
texts = text.split('\n\n')

# Strip surrounding spaces and newlines, and drop chunks that are empty after stripping
texts = np.array([t.strip(' \n') for t in texts if t.strip(' \n')])


# In[52]:


texts[:3]


# ## Embeddings

# In[53]:


co = cohere.Client(os.environ['COHERE_API_KEY'])

# Get the embeddings
response = co.embed(
    texts=texts.tolist(),
).embeddings


# ## Build a search index

# In[54]:


from annoy import AnnoyIndex


# In[55]:


# Convert the embeddings to a NumPy array so the vector size can be read from its shape
embeds = np.array(response)

# Create the search index, pass the size of embedding
search_index = AnnoyIndex(embeds.shape[1], 'angular')
# Add all the vectors to the search index
for i in range(len(embeds)):
    search_index.add_item(i, embeds[i])

search_index.build(10) # 10 trees
search_index.save('test.ann')


# ## Searching Articles

# In[56]:


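def search_andrews_article(query):
    # Get the query's embedding
    query_embed = co.embed(texts=[query]).embeddings

    # Retrieve the 10 nearest neighbors; with include_distances=True,
    # Annoy returns a tuple of (item ids, distances)
    similar_item_ids, _distances = search_index.get_nns_by_vector(
        query_embed[0], 10, include_distances=True
    )

    # Look up the matching paragraphs, ordered from nearest to farthest
    search_results = texts[similar_item_ids]

    return search_results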


# In[57]:


results = search_andrews_article(
    "Are side projects a good idea when trying to build a career in AI?"
)

print(results[0])


# ## Generating Answers

# In[58]:


def ask_andrews_article(question, num_generations=1):
    
    # Search the text archive
    results = search_andrews_article(question)

    # Get the top result
    context = results[0]

    # Prepare the prompt
    prompt = f"""
    Excerpt from the article titled "How to Build a Career in AI" 
    by Andrew Ng: 
    {context}
    Question: {question}
    
    Extract the answer of the question from the text provided. 
    If the text doesn't contain the answer, 
    reply that the answer is not available."""

    prediction = co.generate(
        prompt=prompt,
        max_tokens=70,
        model="command-nightly",
        temperature=0.5,
        num_generations=num_generations
    )

    return prediction.generations


# In[59]:


results = ask_andrews_article(
    "Are side projects a good idea when trying to build a career in AI?"
)

print(results[0])


# In[60]:


results = ask_andrews_article(
    "Are side projects a good idea when trying to build a career in AI?",
    num_generations=3
)

for gen in results:
    print(gen)
    print('--')


# In[61]:


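# An off-topic question: the article does not cover it, so the prompt
# instructs the model to reply that the answer is not available
results = ask_andrews_article(
    "What is the most viewed televised event?",
    num_generations=5
)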


# In[62]:


for gen in results:
    print(gen)
    print('--')

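Algorithm: Cohere's generative models (command-nightly in the code above); other large language models such as GPT-4 could fill the same role.

Technique: Text generation grounded in retrieved text. The code above passes only the top-ranked paragraph to the model; passing several retrieved passages lets the model synthesize an answer from multiple documents.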
Tool: Cohere API.

For Arabic queries like "ما هو التعلم الآلي" (what is machine learning?), the system can generate an answer based on multiple relevant sources, providing a concise, synthesized response.
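
The multi-source idea can be sketched as a small prompt-building step. This is a minimal sketch, not part of the original notebook: `build_prompt` and its `max_passages` parameter are hypothetical names, and the passages are short placeholders standing in for real search results.

```python
def build_prompt(question, passages, max_passages=3):
    """Assemble one grounded prompt from the top-ranked passages.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Keep only the best-ranked passages and trim stray whitespace
    selected = [p.strip() for p in passages[:max_passages]]
    # Number each passage so they stay distinguishable inside the prompt
    numbered = "\n\n".join(
        f"[{i}] {passage}" for i, passage in enumerate(selected, start=1)
    )
    return (
        "Excerpts from the retrieved documents:\n"
        f"{numbered}\n\n"
        f"Question: {question}\n\n"
        "Extract the answer to the question from the excerpts provided. "
        "If the excerpts don't contain the answer, "
        "reply that the answer is not available."
    )


# Placeholder passages standing in for real search results
passages = [
    "Develop a side hustle. A fun project can stir the creative juices.",
    "Focus on an application area.",
    "Finally, avoid analysis paralysis.",
]
question = "Are side projects a good idea when trying to build a career in AI?"
prompt = build_prompt(question, passages, max_passages=2)
print(prompt)
```

The resulting string could be passed as the `prompt` argument of `co.generate` in place of the single-paragraph prompt. Including more passages gives the model more evidence but also lengthens the prompt.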

Section 4: Benefits of Semantic Search

More Relevant Results: Dense retrieval matches queries to documents by meaning rather than exact wording, so results stay semantically relevant even for complex queries in Arabic and English.
Improved User Satisfaction: Re-ranking and generative search provide meaningful, precise responses that align better with user expectations.
Contextual Understanding: Semantic search understands the full meaning of a query, not just individual keywords, which is crucial for languages with complex morphology like Arabic.

Section 5: Challenges and Future Directions

Scalability: Dense retrieval and generative models like GPT-4 are computationally expensive, making them challenging to scale for massive datasets.
Computational Complexity: Embedding-based models require more computational power than traditional BM25 keyword searches.
Future of Search: With advancements in hardware (GPUs/TPUs) and indexing techniques (e.g., FAISS for vector search), semantic search will become even more scalable and powerful, making multilingual search engines more accurate and efficient.
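
To see why indexing matters, consider exact dense retrieval: it compares the query against every stored vector, so its cost grows linearly with the size of the corpus. Below is a minimal sketch of that brute-force baseline. The function names are hypothetical and the 3-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions). Approximate indexes such as Annoy or FAISS trade a little accuracy to avoid the full scan.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs, top_k=2):
    """Score every document against the query (a full scan) and keep the best."""
    scored = [
        (cosine_similarity(query_vec, vec), idx)
        for idx, vec in enumerate(doc_vecs)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(idx, score) for score, idx in scored[:top_k]]


# Toy vectors, made up for illustration (not real embeddings)
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 0.9],
]
query_vec = [0.8, 0.2, 0.0]

top = rank_by_similarity(query_vec, doc_vecs)
for idx, score in top:
    print(f"doc {idx}: similarity {score:.3f}")
```

Document 0 ranks first because its vector points in nearly the same direction as the query. Annoy's 'angular' metric used earlier is a monotonic function of this cosine similarity, so it orders neighbors the same way.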

Conclusion

By transitioning from BM25 to dense retrieval and finally to generative search, you’ve built a robust and powerful semantic search system that handles both English and Arabic text. Each stage enhances the ability to understand user intent, whether the query is in a complex language like Arabic or a more straightforward one like English. With continued advancements in transformer models and embedding techniques, semantic search will only continue to evolve, delivering more precise and contextually relevant results for multilingual queries.
