Embedding My Data into GPT in (5) Simple Steps Using LlamaIndex

0) Pre-requisites

Basic Python understanding, the LlamaIndex library, and optionally the python-dotenv library for loading environment variables from a .env file.
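
If you don't have the libraries yet, a typical install looks like this (assuming the pre-0.10 llama_index package layout used throughout this article):

pip install llama-index python-dotenv openai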

import os
import openai

from dotenv import load_dotenv
load_dotenv()  # must run before importing llama_index so env variables are available

from llama_index import (
    GPTVectorStoreIndex,
    ServiceContext,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)
from llama_index.embeddings import OpenAIEmbedding
from llama_index.llms import OpenAI

As we're using GPT-3.5 Turbo from OpenAI for this exercise, we need an API key set as an environment variable:

openai.api_key = os.environ.get("OPENAI_API_KEY")        
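
Since load_dotenv() runs before anything else, all three environment variables used in this article can live in a single .env file next to your script. A minimal example (the paths are illustrative placeholders):

OPENAI_API_KEY=sk-...
STORAGE_PATH=./storage
DATA_PATH=./data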

1) Build Storage

Building storage is an easy step. In LlamaIndex it can be done in many ways; for simplicity, we build the storage using the GPTVectorStoreIndex class from a set of documents sitting on our file system in data_dir (which we will initialize later). At the end we return the storage index.

def build_storage(data_dir, persist_dir):
    # Read every file in data_dir into Document objects
    documents = SimpleDirectoryReader(data_dir).load_data()
    # Embed the documents and build the vector index
    index = GPTVectorStoreIndex.from_documents(documents)
    # Persist the index to disk so we don't re-embed next time
    index.storage_context.persist(persist_dir)
    return index
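
As a quick sanity check, you could call this directly; the paths are placeholders for wherever your files live and wherever you want the index persisted:

index = build_storage("./data", "./storage")  # first run: reads, embeds, persists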

2) Read From Storage

Reading from storage is yet another easy task; using StorageContext and load_index_from_storage from the LlamaIndex library, we can achieve that:

def read_from_storage(persist_dir):
    # Point a storage context at the persisted files and rebuild the index
    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    return load_index_from_storage(storage_context)
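
On subsequent runs, this reloads the same index without re-embedding anything; again, the path is a placeholder and must match the persist_dir used in build_storage:

index = read_from_storage("./storage")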

3) Get Storage Index

In this function we decide whether the storage was already initialized or whether we still need to do so, using the two functions we just defined; then we return the actual index accordingly.

Notice that we also get to set the storage path and the data path here!

Data can be text, PDF, or another readable file format; you might need to install additional libraries to handle extensions other than text.

def get_storage_index():
    persist_dir = os.environ.get("STORAGE_PATH")
    data_dir = os.environ.get("DATA_PATH")
    index = None

    # Load the index from disk if it exists, otherwise build it from scratch
    if os.path.exists(persist_dir):
        index = read_from_storage(persist_dir)
    else:
        index = build_storage(data_dir, persist_dir)

    return index
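
If you prefer not to use a .env file, you can set the same variables in code before the first call; the values below are placeholders:

os.environ["STORAGE_PATH"] = "./storage"
os.environ["DATA_PATH"] = "./data"

index = get_storage_index()  # builds on the first run, loads from disk afterwards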

4) Get Query Engine

This is where the magic happens. You get to define your GPT model here, as well as other parameters like the embed model, which is the model that will be used to turn your documents/indexes into embeddings for the GPT model.

If you're still wondering what an embedding is, it's nothing but a vector representation of your text, usually a bunch of floating-point numbers, that the LLM can easily work with.
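
Here's a minimal sketch of what that looks like in practice, using the OpenAIEmbedding class from our imports (get_text_embedding is its standard method for embedding a single string):

embed_model = OpenAIEmbedding()
vector = embed_model.get_text_embedding("Hello, world!")
print(len(vector), vector[:3])  # typically 1536 floats (text-embedding-ada-002)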

In this function, we also create an instance of the query engine object that we will use next to send queries, along with embeddings, to the LLM.

def get_query_engine():
    # gpt-3.5-turbo-16k gives extra context room; low temperature keeps answers factual
    llm = OpenAI(model="gpt-3.5-turbo-16k", temperature=0.2, max_tokens=8000, streaming=True)
    embed_model = OpenAIEmbedding()
    service_context = ServiceContext.from_defaults(
        llm=llm, embed_model=embed_model, chunk_size=512
    )
    # Alternatively: set_global_service_context(service_context)
    # (requires importing set_global_service_context from llama_index)

    index = get_storage_index()
    query_engine = index.as_query_engine(service_context=service_context)
    return query_engine

5) Query

Voilà! Just pass your query to the query_llm function and enjoy a response grounded in the context of your data files.

def query_llm(q):
    query_engine = get_query_engine()

    response = query_engine.query(q)

    print(response)
    return response
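
A hypothetical end-to-end run (the question is just an example; ask anything about your own data):

if __name__ == "__main__":
    query_llm("What are the key topics covered in my documents?")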

Additionally

You could expose this as a REST service using Python Flask. In the sketch below, the endpoint app, the PREFIX route prefix, and the gptindex module (holding the query_llm function from the steps above) are wired up explicitly to make the snippet self-contained:

import json

from flask import Flask, request, abort

import gptindex  # the module containing query_llm from the steps above

PREFIX = "/api"  # route prefix; an assumption, set it to whatever you use
endpoint = Flask(__name__)

@endpoint.route(f'{PREFIX}/query', methods=['POST'])
def query():
    content = request.get_json(silent=True)
    q = content['query']

    try:
        res = gptindex.query_llm(q)

        return json.dumps({"response": str(res)})
    except Exception as err:
        print(err)
        abort(500)  # surface engine failures as a server error
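
Assuming the service runs locally on port 5000 with the hypothetical /api prefix above, you could exercise the endpoint like this:

import requests

res = requests.post(
    "http://localhost:5000/api/query",
    json={"query": "What does my data say about the main topic?"},
)
print(res.json()["response"])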

About the Author

Amr Salem was born in Cairo, Egypt. He is a technology geek and currently holds the position of Distinguished Engineer - Cloud | SRE | Fullstack Development | System Architecture at Verizon, Temple Terrace, FL, USA. Amr plays a pivotal role in providing innovative solutions across Verizon’s Network Systems. Before joining Verizon, Amr was with the IBM Clients Innovation Center, where he honed his skills and expertise in the technology field. His diverse talents and dedication make him a valuable asset in the technology industry and a source of inspiration for aspiring writers.

