Serve Llama 2 with FastAPI using a Colab GPU and Ngrok for free
I have always liked the idea of using open-source products, which can be useful for many companies, whether for budget reasons or to keep their sensitive data secure.
In our case, we will test serving Llama 2, an open-source LLM by Meta, using FastAPI on Colab. Ngrok is used to expose Colab's local web server on a public URL.
We use Colab because it gives us free GPU allocation.
First of all, we change the runtime type to T4 GPU in Runtime > Change runtime type:
Then we install the necessary packages:
llama-cpp-python, fastapi[all], uvicorn, python-multipart, transformers, pydantic, tensorflow
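In a Colab cell, the installation could look like this (a minimal sketch; package versions are not pinned):

# Colab cell: install the packages needed for serving Llama 2 with FastAPI
!pip install llama-cpp-python "fastapi[all]" uvicorn python-multipart transformers pydantic tensorflow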
Next, we install Ngrok in our Colab session:
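One way to get Ngrok inside the notebook is through the pyngrok wrapper (an assumption; the Ngrok binary could also be installed directly):

# Colab cell: install the pyngrok wrapper around the Ngrok client
!pip install pyngrok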
Now we need to create a free Ngrok account. Next, we set our authentication token as shown in the screenshot; it will be saved in the Ngrok configuration.
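With pyngrok, registering the token could look like the following sketch (YOUR_AUTHTOKEN is a placeholder for the token shown in the Ngrok dashboard):

from pyngrok import ngrok

# Save the authtoken to the local Ngrok configuration (placeholder value)
ngrok.set_auth_token("YOUR_AUTHTOKEN")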
After installing the packages and setting the authentication token, we create our FastAPI app in the app.py file; you can find the full version in the Colab link.
In our FastAPI app, we created a POST route '/generate'.
The route uses the Llama 2 model to generate a response, as simple as that.
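A minimal sketch of such an app.py is shown below; the model path, request field names, and default values are assumptions, and the full version is in the Colab link:

# app.py -- minimal FastAPI app serving a quantized Llama 2 model via llama-cpp-python
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()

# Path to a locally downloaded quantized Llama 2 model file (assumed filename)
llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf")

class Parameters(BaseModel):
    temperature: float = 0.1
    max_tokens: int = 400

class GenerateRequest(BaseModel):
    inputs: str
    parameters: Parameters = Parameters()

@app.post("/generate")
def generate(request: GenerateRequest):
    # Run the prompt through the model with the requested sampling parameters
    output = llm(
        request.inputs,
        temperature=request.parameters.temperature,
        max_tokens=request.parameters.max_tokens,
    )
    return {"generated_text": output["choices"][0]["text"]}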
Finally, we run this code to expose Colab's localhost through our Ngrok account:
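A sketch of that step using pyngrok and Uvicorn inside the notebook (port 8000 and the nest_asyncio workaround are assumptions; nest_asyncio may need to be installed first):

import nest_asyncio
import uvicorn
from pyngrok import ngrok

# Open a public HTTP tunnel to the port Uvicorn will listen on
public_url = ngrok.connect(8000)
print("Public URL:", public_url)

# Allow Uvicorn to run inside the notebook's already-running event loop
nest_asyncio.apply()
uvicorn.run("app:app", host="0.0.0.0", port=8000)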
Documentation of our REST API (FastAPI automatically serves interactive Swagger documentation at the /docs path of the public URL):
We use this route by sending a prompt text along with the temperature (higher means more creative responses, lower means more precise ones) and max_tokens (the maximum number of tokens, roughly "words", allowed in the generated output) parameters:
{
"inputs": "Who is elon musk ?",
"parameters": {"temperature":0.1, "max_tokens":400}
}
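For example, calling the route with the requests library could look like this (the Ngrok URL below is a placeholder for the public URL printed earlier):

import requests

# Replace with the public URL printed by Ngrok
url = "https://<your-ngrok-subdomain>.ngrok-free.app/generate"
payload = {
    "inputs": "Who is elon musk ?",
    "parameters": {"temperature": 0.1, "max_tokens": 400},
}
response = requests.post(url, json=payload)
print(response.json())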
Prompt result:
URL to the Colab file: