Running a Compact AI Model Locally: How I Deployed Phi3 on a Raspberry Pi 5 Using FastAPI
Harri J Salomaa
Executive Advisor | Entrepreneur | Applied AI & ML Specialist | Non-Executive Director
As a coding hobbyist and an executive with a deep interest in AI, I often experiment with technology that bridges the gap between professional development and hands-on fun. When Amazon delivered my new 8GB Raspberry Pi 5 this Sunday at 7am, I couldn't wait to see how far I could push this tiny powerhouse.
In just a few hours, I had the Phi3 model running locally and served via a FastAPI server. Quite fast, right? This post recounts how I unboxed the Raspberry Pi, tackled memory constraints, and turned it into a functional AI demo. It wasn't totally smooth sailing, but by the end, I had a fully working model on hardware most people wouldn't expect to handle such a task. Here's how I made it work.
Taking the Raspberry Pi 5 Out of the Box
Upon receiving my brand new CanaKit Raspberry Pi 5 Starter Kit PRO (Turbine Black, 128GB Edition with 8GB RAM), I quickly realized that a bit of assembly was required before getting the Pi up and running. The kit included not only the Raspberry Pi 5 board but also a heat sink, cooling fan, case, power supply, and a microSD card preloaded with the operating system. The setup was straightforward, but it did involve carefully assembling the heat sink and fan to ensure proper cooling—a crucial step considering the Pi would be running AI models that demand efficient thermal management.
Once the hardware was fully assembled and everything connected, including the HDMI and USB cables, I booted up the Raspberry Pi with the standard Raspbian OS. It was a pleasant surprise to discover that Visual Studio Code is now available for Raspberry Pi OS. As someone who regularly uses VS Code for development, having access to my go-to IDE on the Pi made the entire setup process even more seamless and familiar.
Installing Ollama to Run the Phi3 Model
Phi-3 is a family of lightweight 3B (Mini) and 14B (Medium) state-of-the-art open models by Microsoft.
Next, I needed a tool to actually run the Phi3 model. For this, I turned to Ollama, a great tool that simplifies local inference for AI models.
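Since Raspberry Pi OS is a Debian-based Linux, installation is a one-liner; the command below is Ollama's official Linux install script at the time of writing:

curl -fsSL https://ollama.com/install.sh | sh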
Once Ollama was installed, it was time to see if the Phi3 model could run smoothly on the Raspberry Pi 5. I knew that 8GB of RAM wasn’t much for an AI model, so I wasn’t expecting blazing-fast performance. But hey, this was about pushing boundaries, right?
It's worth noting that when you run the command ollama run phi3 for the first time, it downloads the Phi3 model onto the device, as shown below. This step took a bit of time.
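For reference, starting the model looks like this (the first invocation pulls a couple of gigabytes of quantized weights for the Mini variant, then drops into an interactive prompt):

ollama run phi3
>>> How fast is a cheetah?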
Sure enough, the model started up! However, I quickly realized that I’d need a larger swap file to make up for the memory limitations.
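On Raspberry Pi OS, swap is managed by dphys-swapfile, so enlarging it is a matter of editing one config value. A sketch of the steps I'd use (the 2048 MB size is my choice for this experiment, not a requirement):

sudo dphys-swapfile swapoff
sudo nano /etc/dphys-swapfile    # set CONF_SWAPSIZE=2048 (value in MB)
sudo dphys-swapfile setup
sudo dphys-swapfile swapon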
Making the Model Accessible via FastAPI
Next, I wanted to make the model accessible via an API, so I turned to FastAPI, which is both easy to use and highly performant. Installing FastAPI on the Raspberry Pi was straightforward.
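For completeness, the install is just a pip command. On recent Raspberry Pi OS releases, system-wide pip installs are blocked by default, so I'd assume a virtual environment (the path name here is arbitrary):

python3 -m venv ~/phi3-api && source ~/phi3-api/bin/activate
pip install fastapi uvicorn requests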
Running the FastAPI App with Uvicorn
After setting up the FastAPI framework, I needed a way to run the app. This is where Uvicorn comes into play. Uvicorn is a lightweight, high-performance ASGI (Asynchronous Server Gateway Interface) server. In simple terms, it's the engine that takes a FastAPI app and turns it into a running web service, allowing your API to handle requests and return responses.
Once FastAPI and Uvicorn were installed, I wrote a simple Python server to handle requests and pass prompts to the Phi3 model.
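I haven't reproduced my exact script here, but a minimal sketch looks like the following. It assumes Ollama is serving its local REST API on the default port 11434, and it exposes the /chat/phi3 route used in the test further down; the file name main.py is just my convention.

# main.py - minimal FastAPI wrapper around the local Ollama server
import requests
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

class ChatRequest(BaseModel):
    prompt: str

@app.post("/chat/phi3")
def chat_phi3(req: ChatRequest):
    # Forward the prompt to Ollama and wait for the full (non-streamed) reply.
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "phi3", "prompt": req.prompt, "stream": False},
        timeout=300,  # inference on the Pi is slow, so allow a generous timeout
    )
    if resp.status_code != 200:
        raise HTTPException(status_code=502, detail="Ollama request failed")
    return {"response": resp.json()["response"]}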
The Moment of Truth: Testing the API
To make the API accessible, I started the FastAPI server using Uvicorn. This made the server live on my local network, and I could access it from any device connected to the network.
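Binding to 0.0.0.0 is what makes the server reachable from other devices on the network; the command below assumes the app object lives in main.py:

uvicorn main:app --host 0.0.0.0 --port 8000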
The real test came when I sent a prompt to the model via curl:
curl -X 'POST' \
  'http://<raspberry-pi-ip>:8000/chat/phi3' \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "How fast is a cheetah?"}'
And it worked! While the response was slow, the Phi3 model eventually provided a concise and accurate answer:
“The cheetah can run at speeds up to 70-75 miles per hour.”
Success!
Is It Worth It?
Running an AI model like Phi3 on a Raspberry Pi 5 was a fun experiment, but it wasn’t without challenges. The swap file allowed me to overcome memory limitations, but the speed of inference reminded me that this hardware, while impressive, still has its limits when it comes to high-end AI models.
That said, it’s fascinating to see how much you can accomplish with a small, affordable device like the Raspberry Pi. This experiment showed me that you don’t need a massive cloud infrastructure to explore AI models—sometimes, all you need is a Pi and a bit of persistence.
I hope this post inspires others to experiment with AI models on resource-constrained devices. Feel free to reach out if you have any questions or connect with me on LinkedIn to discuss further!
Best regards,
Harri, AI enthusiast.