Solving object detection at minimal cost with Phi-2 and Florence
A few months ago, a customer reached out with an AI problem.
They are a transportation company whose drivers are paid extra for deliveries involving stairs. Drivers take a picture of the stairs at the time of delivery, and head office reviews the pictures and adjusts the driver's paycheck accordingly.
That represents about 10,000 pictures to review per day, so there is real value to capture by automating the review.
Despite using a large number of training pictures, they had failed to train a model to detect staircases. They were getting lots of false positives: the model was confusing treadmills and pallets with staircases.
We told them that Microsoft had recently released an image-to-text model called Project Florence, which you can consume as a service through the denseCaptions feature of the Azure Vision API.
Given an image, Florence returns up to ten text captions describing it.
We then just need to pass those ten captions to a GPT-3.5 text-to-text model with a well-crafted prompt asking whether the object is in the picture or not.
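To make that step concrete, here is a minimal sketch of how the captions can be folded into a yes/no prompt. The function name and wording are illustrative, not the exact prompt used in the prototype.

```python
def build_detection_prompt(captions: list[str], target_object: str) -> str:
    """Turn Florence's dense captions into a yes/no question for a text model."""
    caption_lines = "\n".join(f"- {caption}" for caption in captions)
    return (
        "The following captions describe a single delivery photo:\n"
        f"{caption_lines}\n\n"
        f"Based only on these captions, is a {target_object} visible in the photo? "
        "Answer with 'yes' or 'no'."
    )

# Example with captions similar to what denseCaptions returns
captions = ["a staircase leading to a front door", "a cardboard box on a porch"]
print(build_detection_prompt(captions, "staircase"))
```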
This customer sent us 80 delivery pictures, focusing on the hard ones with partial staircases in the frame and plenty of false-positive candidates such as treadmills and pallets. We added 20 random pictures to the set. The prototype we built for them detected all staircases correctly except in two pictures (both misses came from Florence's captions). That is 98% out of the box on a hard dataset. They were impressed and happy, and are taking the development in house.
Out of the box, and on a difficult dataset, the prototype detected 98% of staircases correctly
What was interesting in building this prototype is that we now have an ML pipeline that detects just about any common object without additional training. When we priced the prototype, GPT-3.5 accounted for about 85% of the total LLM compute cost while Florence represented only 15%.
GPT-3.5 consumed about 85% of the total LLM compute
The next logical step is to bring down the inference cost of the GPT-3.5 part of the pipeline.
Back in December, Microsoft made Phi-2 commercially available under the MIT license. Phi-2 is Microsoft's small language model with 2.7 billion parameters (versus 175 billion for GPT-3). It is cheaper to run, it can run offline or on mobile, and because it consumes less power it is better for the planet. Phi-2 is cool tech and a great value proposition.
In this prototype we combine Florence and Phi-2 to solve object detection at a minimal cost for everyone
The Detect table: our benchmark, output, and log table in Dataverse
The first thing we do is create our Detect table to host our images. We use Power Platform Dataverse for that. From there, Dataverse lets us create a working app on top of this table in no time: a simple view to browse the records and a form to edit a record.
This Detect table serves as our benchmark, output, and log table at the same time. It contains all our pictures and, for each one, which object to detect and whether the object is actually in the picture (from human labeling).
Last, each picture has a status. When it is set to "To Be Processed", a flow kicks off: it uploads the picture, extracts the captions with Florence, processes them through GPT-3.5 (AI Builder) and Phi-2, and updates the record with the output and the logs.
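As a rough sketch, this is the shape of a Detect record and the status values the flow reacts to. The field names below are hypothetical stand-ins for the actual Dataverse column names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DetectRecord:
    """Hypothetical stand-in for the columns of the Detect table in Dataverse."""
    picture_url: str                   # the delivery photo (a file / blob reference)
    object_to_detect: str              # e.g. "staircase"
    object_present: bool               # human label: is the object really in the picture?
    status: str                        # "To Be Processed" -> "Processing" -> "Processed"
    gpt_output: Optional[str] = None   # answer returned by GPT-3.5 (AI Builder)
    phi2_output: Optional[str] = None  # answer returned by Phi-2
    log: Optional[str] = None          # raw captions and model responses, for debugging
```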
The machine learning pipeline in Power Automate
To chain our models into an ML pipeline, we use Power Automate. In a business application context, Power Automate is the equivalent of LangChain and PromptFlow combined; think of it as the Zapier of business applications. It connects safely, out of the box, and in no time to just about any legacy business application stack.
The flow wakes up when a record is flagged "To Be Processed", updates the record to "Processing", and starts extracting the data.
The Florence API is called over HTTP. Note that the image needs to be saved to blob storage before calling Florence.
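For reference, that call would look roughly like the sketch below from Python. The endpoint path, api-version, and response field names follow the public Image Analysis 4.0 REST documentation, but treat them as assumptions to verify against your own Vision resource.

```python
import requests

# Assumptions: the endpoint path, api-version and response field names come from
# the public Image Analysis 4.0 REST docs -- verify them against your Vision resource.
VISION_ENDPOINT = "https://<your-vision-resource>.cognitiveservices.azure.com"
VISION_KEY = "<your-key>"

def get_dense_captions(image_blob_url: str) -> list[str]:
    """Ask Azure AI Vision (Florence) for dense captions of an image already in blob storage."""
    response = requests.post(
        f"{VISION_ENDPOINT}/computervision/imageanalysis:analyze",
        params={"api-version": "2023-10-01", "features": "denseCaptions"},
        headers={
            "Ocp-Apim-Subscription-Key": VISION_KEY,
            "Content-Type": "application/json",
        },
        json={"url": image_blob_url},
    )
    response.raise_for_status()
    result = response.json()
    return [caption["text"] for caption in result["denseCaptionsResult"]["values"]]
```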
From there, in series, we call GPT-3.5 through the AI Builder Prompt action and Phi-2 over HTTP against our own Azure AI Studio Phi-2 deployment. Once the data is extracted, we update the record with the output and the logs, and set its status to "Processed".
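Calling the Phi-2 deployment is a plain HTTP POST. The sketch below is one way that could look from Python; the scoring URL, key, and payload shape depend entirely on how your Azure AI Studio deployment is configured, so copy the real values from the endpoint's Consume tab.

```python
import requests

# Assumptions: the scoring URL, key and payload shape depend on your own
# Azure AI Studio deployment -- copy the real values from the endpoint's Consume tab.
PHI2_SCORING_URL = "https://<your-endpoint>.<region>.inference.ml.azure.com/score"
PHI2_KEY = "<your-endpoint-key>"

def ask_phi2(prompt: str) -> dict:
    """Send a prompt to the Phi-2 deployment and return the parsed JSON response."""
    response = requests.post(
        PHI2_SCORING_URL,
        headers={
            "Authorization": f"Bearer {PHI2_KEY}",
            "Content-Type": "application/json",
        },
        # Payload shape shown for a typical text-generation deployment; adjust to yours.
        json={
            "input_data": {
                "input_string": [prompt],
                "parameters": {"max_new_tokens": 100, "temperature": 0.0},
            }
        },
    )
    response.raise_for_status()
    return response.json()  # the exact response shape varies by deployment; inspect it once
```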
Deploying Phi-2 in Azure AI Studio
Azure AI Studio makes it simple to pick from thousands of models in the Azure model catalog and deploy them as an API endpoint. You may need to request additional Azure quota; in the meantime, Azure AI Studio lets you run the deployment for free for 7 days. It doesn't get much simpler than that.
Prompting Phi-2 and results
Phi-2 likes to keep writing: at times it will give you the right answer and then keep going. Playing with the model, you can tell the reasoning is there, but just as GPT-3.5 likes to chat, Phi-2 can be tricky to instruct. One useful trick with GPT-3.5 is to ask it to output JSON; the model understands JSON well, and whatever you ask for inside that JSON reads as more of an instruction. This trick seems to work for Phi-2 as well, and that is what we used to prompt it.
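Here is a sketch of that JSON trick: ask for a tiny JSON object, then parse the first JSON object out of whatever the model writes. The prompt wording is illustrative, not the exact prompt from the prototype.

```python
import json
import re

def build_json_prompt(captions: list[str], target_object: str) -> str:
    """Ask the model to answer inside a small JSON object instead of free text."""
    caption_lines = "\n".join(f"- {caption}" for caption in captions)
    return (
        "You are given captions describing one photo:\n"
        f"{caption_lines}\n"
        f'Answer only with JSON in the form {{"{target_object}_present": true or false}}.'
    )

def parse_json_answer(model_output: str, target_object: str) -> bool:
    """Pull the first JSON object out of the output, tolerating Phi-2's extra prose."""
    match = re.search(r"\{.*?\}", model_output, re.DOTALL)
    if not match:
        raise ValueError("No JSON object found in model output")
    return bool(json.loads(match.group(0))[f"{target_object}_present"])

# Phi-2 may keep writing after the answer; the parser ignores the trailing text.
print(parse_json_answer('{"staircase_present": true} The photo also shows...', "staircase"))
```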
With this Phi-2 prompt, 98% of the pictures were classified correctly, on par with our GPT-3.5 baseline!
The prompt is simple, but it is carefully crafted and sensitive to wording (try it for yourself!). I am sure it is not the best possible prompt and there are ways to do better, but for our 100-picture benchmark it got us on par with GPT-3.5. That is a success; we can stop there.
This prototype was built on Power Platform in one day. If you have not already, check out Power Platform. It is a terrific way to release enterprise-grade AI quickly and at low risk for your users, while reducing technical debt in your legacy IT stack.