Solving object detection at minimal cost with Phi-2 and Florence

A few months ago, a customer reached out with an AI problem.

They are a transportation company whose drivers are paid extra for deliveries involving stairs. Drivers take a picture of the stairs at the time of delivery, and head office reviews the pictures and adjusts the driver's paycheck accordingly.

That represents about 10,000 pictures per day, so there is room to chase value.

Despite using a large number of training pictures, they had failed to train a model to detect staircases. They were getting lots of false positives: the model was confusing treadmills and pallets with staircases.

We told them that Microsoft had recently released an image-to-text model called Project Florence, which you can consume as a service through the denseCaptions feature of the Azure Vision API.

Given an image, Florence returns up to ten text captions describing it.

denseCaptions (Project Florence), available as a service on Azure AI


We then just need to pass those ten captions to a GPT-3.5 text-to-text model with a well-crafted prompt asking whether the object is in the picture or not.
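As a minimal sketch of that step (the caption texts, function name, and prompt wording here are our own illustration, not the exact prompt used in the prototype), the captions can be assembled into a yes/no question like this:

```python
def build_detection_prompt(captions, target_object):
    """Assemble a yes/no prompt from the dense captions returned by Florence."""
    caption_lines = "\n".join(f"- {c}" for c in captions)
    return (
        "Here are up to ten captions describing one image:\n"
        f"{caption_lines}\n"
        f"Based only on these captions, is there a {target_object} in the image? "
        "Answer yes or no."
    )

prompt = build_detection_prompt(
    ["a staircase leading to a porch", "a wooden door"],  # hypothetical captions
    "staircase",
)
```

The same prompt string can then be sent to GPT-3.5 (or Phi-2) unchanged, which is what makes the pipeline object-agnostic.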

Prompting GPT-3.5 from Power Platform AI Builder


This customer sent us 80 delivery pictures, focusing on the hard ones: partial staircases in the frame and lots of false positives with treadmills and pallets. We added 20 random pictures to the set. The prototype we built detected staircases correctly in all but two pictures (both misses traceable to Florence's captions). That is 98% out of the box on a hard dataset. They were impressed and happy, and are taking development in-house.

The prototype detected, out of the box and from a difficult dataset, 98% of staircases correctly

What was interesting in building this prototype is that we now have an ML pipeline that detects nearly any common object without additional training. From a cost perspective, GPT-3.5 consumed about 85% of the total LLM compute while Florence represented only 15%.

GPT-3 consumed about 85% of the total LLM compute

The next logical step is to reduce the inference cost of the GPT-3.5 part of the pipeline.

Back in December, Microsoft made Phi-2 commercially available under the MIT license. Phi-2 is Microsoft's small language model with 2.7 billion parameters (vs. 175 billion for GPT-3). It is cheaper to run, it can run offline or on mobile, and it ultimately consumes less power, which is better for the planet. Phi-2 is cool tech and a strong value proposition.

In this prototype we combine Florence and Phi-2 to solve object detection at a minimal cost for everyone

The Detect table, our benchmark, output, and log table in Dataverse

The first thing we do is create our Detect table to host our images. We use Power Platform Dataverse for that. From there, Dataverse lets us create a working app on top of this table in no time: a simple view to browse the records and a form to edit a record.

the Detect table in Dataverse


View of images with the object to detect, whether the object is actually in the image (human label), and the Phi-2 and GPT-3.5 outputs

This Detect data source serves as our benchmark, output, and log table all at once. It contains all our pictures and, for each one, which object to detect and whether the object is actually in the picture (from human input).

Last, we have a status for each picture. When it is "To Be Processed", a flow kicks off, uploads the picture, extracts the captions with Florence, processes them through GPT-3.5 (AI Builder) and Phi-2, and updates the record with the output and the logs.

The machine learning pipeline in Power Automate

To chain our models (the ML pipeline) we use Power Automate. In a business application context, Power Automate is the equivalent of LangChain and PromptFlow together. Power Automate is the Zapier of business applications: it connects safely, out of the box, and in no time to virtually any legacy business application stack.

The flow wakes up when a record is flagged "To Be Processed", updates the record to "Processing", and starts extracting the data.

The Florence API is called over HTTP. Note that the image needs to be saved to blob storage before calling Florence.
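For readers rebuilding this outside Power Automate, a rough sketch of how that HTTP request could be assembled (the resource name, blob URL, and `2023-10-01` api-version are assumptions; check the Azure Vision Image Analysis documentation for your region and version):

```python
def build_dense_captions_request(endpoint, api_key, image_url):
    """Build an Azure Vision Image Analysis request asking for denseCaptions.

    The image must already be reachable over HTTP, e.g. uploaded to blob
    storage, since we pass its URL rather than the raw bytes.
    """
    url = f"{endpoint}/computervision/imageanalysis:analyze"
    params = {"features": "denseCaptions", "api-version": "2023-10-01"}
    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "application/json",
    }
    body = {"url": image_url}
    return url, params, headers, body

url, params, headers, body = build_dense_captions_request(
    "https://my-vision-resource.cognitiveservices.azure.com",  # hypothetical resource
    "<api-key>",
    "https://mystorage.blob.core.windows.net/deliveries/pic001.jpg",  # hypothetical blob
)
# POSTing this (e.g. with the requests library) returns a JSON payload whose
# dense-captions result contains the caption texts to forward to the LLM.
```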

From there, in series, we call GPT-3.5 through the AI Builder Prompt action and Phi-2 over HTTP against our own Azure AI Studio Phi-2 deployment. Once the data is extracted, we update the record with the output, the log, and the status "Processed".
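The orchestration itself is just actions in series; in plain Python it would look roughly like this (the three model calls are stubbed, since in the real flow they are Power Automate HTTP and AI Builder actions, and the field names are our own illustration):

```python
def process_record(record, get_captions, ask_gpt, ask_phi2):
    """Run one Detect record through the pipeline, mirroring the flow's steps."""
    record["status"] = "Processing"
    captions = get_captions(record["image_url"])                  # Florence denseCaptions
    record["gpt_output"] = ask_gpt(captions, record["object"])    # AI Builder Prompt action
    record["phi2_output"] = ask_phi2(captions, record["object"])  # Azure AI Studio endpoint
    record["log"] = f"{len(captions)} captions processed"
    record["status"] = "Processed"
    return record

# Stubbed model calls, for illustration only.
record = process_record(
    {"image_url": "blob://pic001.jpg", "object": "staircase", "status": "To Be Processed"},
    get_captions=lambda url: ["a staircase leading to a porch"],
    ask_gpt=lambda caps, obj: "yes",
    ask_phi2=lambda caps, obj: "yes",
)
```

Passing the model calls in as functions keeps the pipeline shape visible while leaving the actual transport (HTTP, AI Builder) to the flow.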

Object detection ML pipeline: when a record is "To Be Processed", Florence, then GPT-3.5 and Phi-2, are called and the record is updated in Dataverse

Deploying Phi-2 in Azure AI Studio

Azure AI Studio makes it simple to choose from thousands of models in the Azure model catalog and deploy them as an API endpoint. You may need to request additional Azure quota; in the meantime, Azure AI Studio lets you run the model for free for 7 days. It does not get much simpler than that.

The Azure Model Catalog
Deploying Phi-2 as an endpoint with Azure AI Studio

Prompting Phi-2 and results

Phi-2 likes to keep writing. At times it will give you the right answer but still keep going. Playing with the model, you can tell the reasoning is there. Like GPT-3.5, it likes to chat, which can make it tricky to instruct. One useful trick with GPT-3.5 is to ask for output in JSON format: the model understands JSON well, and whatever you request inside the JSON tends to be followed more faithfully. This trick seems to work for Phi-2 as well, and that is what we used to prompt it.
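One way to make the JSON trick robust against Phi-2 writing past the answer is to parse just the first JSON object out of the raw completion. A sketch (the key name `object_in_image` and the sample completion are our own convention, not the prototype's exact schema):

```python
import json

def first_json_object(text):
    """Extract the first JSON object from a completion that may keep rambling."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    return None  # no parseable JSON object found

completion = '{"object_in_image": "yes"} Explanation: the captions mention a staircase...'
answer = first_json_object(completion)
# answer == {"object_in_image": "yes"}, the trailing chatter is ignored
```

`raw_decode` stops at the end of the first valid object, so everything the model adds afterwards is simply discarded.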


Prompting Phi-2 from Power Automate


Phi-2 and AI Builder GPT-3 outputs and logs once processed


With this Phi-2 prompt we got 98% of the pictures detected correctly, on par with GPT-3.5!

This prompt is simple. It is also hand-crafted and sensitive to small changes (try it for yourself!). I am sure it is not the best prompt and there are ways to do better, but on our 100-picture benchmark it got us on par with GPT-3.5. That is a success; we can stop there.

This prototype was built on Power Platform in one day. If you have not already, check out Power Platform. It is a terrific way to release enterprise-grade AI at a fast pace and low risk for your users, while reducing technical debt in your legacy IT stack.

Nico Sprotti

Copilot Studio & Power Platform @ Microsoft
