Embedding My Data into GPT in 5 Simple Steps Using LlamaIndex
0) Pre-requisites
Basic Python understanding, the LlamaIndex library, and optionally the dotenv library for loading environment variables from a .env file.
import os
import openai
from dotenv import load_dotenv

load_dotenv()  # Must be called before the llama_index imports

from llama_index import (
    GPTVectorStoreIndex,
    SimpleDirectoryReader,
    ServiceContext,
    StorageContext,
    load_index_from_storage,
)
from llama_index.embeddings import OpenAIEmbedding
from llama_index.llms import OpenAI
As we're using GPT-3.5 Turbo from OpenAI for this exercise, we need an API key set as an environment variable:
openai.api_key = os.environ.get("OPENAI_API_KEY")
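If the key is missing, it's nicer to fail fast at startup than mid-query. A minimal sketch of such a guard (require_env is a hypothetical helper, not part of LlamaIndex or the OpenAI SDK):

```python
import os

def require_env(name):
    """Fetch an environment variable, failing loudly if it is unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# e.g. openai.api_key = require_env("OPENAI_API_KEY")
```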
1) Build Storage
Building storage is an easy step. LlamaIndex supports many ways to do it; for simplicity, we build the storage using the GPTVectorStoreIndex class from a set of documents sitting on our file system in data_dir (which we initialize later), persist it to persist_dir, and return the index.
def build_storage(data_dir, persist_dir):
    documents = SimpleDirectoryReader(data_dir).load_data()
    index = GPTVectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir)
    return index
2) Read From Storage
Reading from storage is another easy task: using StorageContext and load_index_from_storage from the LlamaIndex library, we can achieve that.
def read_from_storage(persist_dir):
    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    return load_index_from_storage(storage_context)
3) Get Storage Index
In this function we decide whether the storage has already been initialized or still needs to be, using the two functions we just defined, and return the index accordingly.
Notice that we also read the storage path and the data path here, from environment variables.
Data can be text, PDF, or another readable file format; you may need additional libraries to handle extensions other than plain text.
def get_storage_index():
    persist_dir = os.environ.get("STORAGE_PATH")
    data_dir = os.environ.get("DATA_PATH")
    if os.path.exists(persist_dir):
        index = read_from_storage(persist_dir)
    else:
        index = build_storage(data_dir, persist_dir)
    return index
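The build-or-read decision above is a general cache-on-disk pattern. A minimal generic sketch of the same idea (the load_or_build name and JSON persistence are illustrative, not LlamaIndex APIs):

```python
import json
import os

def load_or_build(path, build):
    """Load a cached artifact from disk if present; otherwise build and persist it."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    data = build()  # expensive step, runs only on the first call
    with open(path, "w") as f:
        json.dump(data, f)
    return data
```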
4) Get Query Engine
This is where the magic happens: you define your GPT model here, along with other parameters such as the embed model, which is the model used to turn your documents/indexes into embeddings for the GPT model.
If you're still wondering what an embedding is, it's nothing but a vector representation of your text, usually a bunch of floating-point numbers, that the LLM can easily work with.
In this function, we also create the instance of the query engine object that we will use to send queries (with embeddings) to the LLM next.
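To make the notion of an embedding concrete, here is a toy illustration with hand-made 3-dimensional vectors (real OpenAI embeddings have on the order of 1,500 dimensions); similar texts map to vectors pointing in similar directions, which cosine similarity captures:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for three words (values invented for illustration)
cat = [0.9, 0.1, 0.0]
kitten = [0.85, 0.15, 0.05]
car = [0.1, 0.9, 0.2]
```

A vector index answers "which stored chunks are most relevant to this query?" by comparing exactly this kind of similarity score.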
def get_query_engine():
    llm = OpenAI(
        model="gpt-3.5-turbo-16k",
        temperature=0.2,
        max_tokens=8000,
        streaming=True,
    )
    embed_model = OpenAIEmbedding()
    service_context = ServiceContext.from_defaults(
        llm=llm, embed_model=embed_model, chunk_size=512
    )
    index = get_storage_index()
    query_engine = index.as_query_engine(service_context=service_context)
    return query_engine
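The chunk_size=512 passed to ServiceContext controls how documents are split before being embedded. LlamaIndex's actual splitter is token-aware and smarter about sentence boundaries; a simplified character-based sketch of the idea (chunk_text is a hypothetical helper, not a LlamaIndex API):

```python
def chunk_text(text, chunk_size=512, overlap=64):
    """Split text into fixed-size chunks, overlapping so context isn't cut mid-thought."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping `overlap` chars of context
    return chunks
```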
5) Query
Voilà! Just pass your query to the query_llm function and enjoy a response grounded in the context of your data files.
def query_llm(q):
    query_engine = get_query_engine()
    response = query_engine.query(q)
    print(response)
    return response
Additionally
You could expose this as a REST service using Python Flask:
# Assumes the code above lives in a module named gptindex,
# and that PREFIX (the route prefix) is defined elsewhere.
import json

from flask import Blueprint, abort, request

import gptindex

endpoint = Blueprint("endpoint", __name__)

@endpoint.route(f'{PREFIX}/query', methods=['POST'])
def query():
    content = request.get_json(silent=True)
    q = content['query']
    try:
        res = gptindex.query_llm(q)
        return json.dumps({"response": str(res)})
    except Exception as err:
        print(err)
        abort(404)
About the Author
Amr Salem was born in Cairo, Egypt. He is a technology geek and currently holds the position of Distinguished Engineer - Cloud | SRE | Full-stack Development | System Architecture at Verizon, Temple Terrace, FL, USA. Amr plays a pivotal role in providing innovative solutions across Verizon’s Network Systems. Before joining Verizon, Amr was with the IBM Clients Innovation Center, where he honed his skills and expertise in the technology field. His diverse talents and dedication make him a valuable asset in the technology industry and a source of inspiration for aspiring writers.