Solving object detection at minimal cost with Phi-2 and Florence
A few months ago, a customer reached out with an AI problem.
They are a transportation company whose drivers are paid extra for deliveries involving stairs. Drivers take a picture of the stairs at the time of delivery, and head office reviews the pictures and adjusts the driver's paycheck accordingly.
That represents about 10,000 pictures to review per day, so there is real value to capture by automating the review.
Despite using a large number of training pictures, they had failed to train a model to detect staircases. They were getting lots of false positives: the model was confusing treadmills and pallets with staircases.
We told them that Microsoft had recently released an image-to-text model called Project Florence, which you can consume as a service through the denseCaptions feature of the Azure Vision API.
Given an image, Florence returns up to ten text captions describing it.
We then just need to pass those ten captions to a GPT-3.5 text-to-text model with a well-crafted prompt asking whether the object is in the picture or not.
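To make that step concrete, here is a minimal sketch of how the captions can be folded into a yes/no prompt. The function name and wording are illustrative, not the exact prompt used in the prototype.

```python
def build_detection_prompt(captions: list[str], target_object: str) -> str:
    """Turn Florence's dense captions into a yes/no question for a text model."""
    caption_lines = "\n".join(f"- {caption}" for caption in captions)
    return (
        "The following captions describe a single delivery photo:\n"
        f"{caption_lines}\n\n"
        f"Based only on these captions, is a {target_object} visible in the photo? "
        "Answer with 'yes' or 'no'."
    )

# Example with captions similar to what denseCaptions returns
captions = ["a staircase leading to a front door", "a cardboard box on a porch"]
print(build_detection_prompt(captions, "staircase"))
```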
This customer sent us 80 delivery pictures, focusing on the hard ones with partial staircases in the frame and plenty of false-positive candidates such as treadmills and pallets. We added 20 random pictures to the set. The prototype we built for them detected all staircases correctly except in two pictures (both misses came from Florence's captions). That is 98% out of the box on a hard dataset. They were impressed and happy, and are taking the development in house.
Out of the box, and on a difficult dataset, the prototype detected 98% of staircases correctly
What was interesting in building this prototype is that we now have an ML pipeline that detects just about any common object without additional training. When we priced the prototype, GPT-3.5 accounted for about 85% of the total LLM compute cost while Florence represented only 15%.
GPT-3.5 consumed about 85% of the total LLM compute
The next logical step is to bring down the inference cost of the GPT-3.5 part of the pipeline.
Back in December, Microsoft made Phi-2 commercially available under the MIT license. Phi-2 is Microsoft's small language model with 2.7 billion parameters (versus 175 billion for GPT-3). It is cheaper to run, it can run offline or on mobile, and because it consumes less power it is better for the planet. Phi-2 is cool tech and a great value proposition.
In this prototype we combine Florence and Phi-2 to solve object detection at a minimal cost for everyone
The Detect table: our benchmark, output, and log table in Dataverse
The first thing we do is create our Detect table to host our images. We use Power Platform Dataverse for that. From there, Dataverse lets us create a working app on top of this table in no time: a simple view to browse the records and a form to edit a record.
This Detect table serves as our benchmark, output, and log table at the same time. It contains all our pictures and, for each one, which object to detect and whether the object is actually in the picture (from human labeling).
Last, each picture has a status. When it is set to "To Be Processed", a flow kicks off: it uploads the picture, extracts the captions with Florence, processes them through GPT-3.5 (AI Builder) and Phi-2, and updates the record with the output and the logs.
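As a rough sketch, this is the shape of a Detect record and the status values the flow reacts to. The field names below are hypothetical stand-ins for the actual Dataverse column names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DetectRecord:
    """Hypothetical stand-in for the columns of the Detect table in Dataverse."""
    picture_url: str                   # the delivery photo (a file / blob reference)
    object_to_detect: str              # e.g. "staircase"
    object_present: bool               # human label: is the object really in the picture?
    status: str                        # "To Be Processed" -> "Processing" -> "Processed"
    gpt_output: Optional[str] = None   # answer returned by GPT-3.5 (AI Builder)
    phi2_output: Optional[str] = None  # answer returned by Phi-2
    log: Optional[str] = None          # raw captions and model responses, for debugging
```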
The machine learning pipeline in Power Automate
To chain our models into an ML pipeline, we use Power Automate. In a business application context, Power Automate is the equivalent of LangChain and PromptFlow combined; think of it as the Zapier of business applications. It connects safely, out of the box, and in no time to just about any legacy business application stack.
The flow wakes up when a record is flagged "To Be Processed", updates the record to "Processing", and starts extracting the data.
The Florence API is called over HTTP. Note that the image needs to be saved to blob storage before calling Florence.
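For reference, that call would look roughly like the sketch below from Python. The endpoint path, api-version, and response field names follow the public Image Analysis 4.0 REST documentation, but treat them as assumptions to verify against your own Vision resource.

```python
import requests

# Assumptions: the endpoint path, api-version and response field names come from
# the public Image Analysis 4.0 REST docs -- verify them against your Vision resource.
VISION_ENDPOINT = "https://<your-vision-resource>.cognitiveservices.azure.com"
VISION_KEY = "<your-key>"

def get_dense_captions(image_blob_url: str) -> list[str]:
    """Ask Azure AI Vision (Florence) for dense captions of an image already in blob storage."""
    response = requests.post(
        f"{VISION_ENDPOINT}/computervision/imageanalysis:analyze",
        params={"api-version": "2023-10-01", "features": "denseCaptions"},
        headers={
            "Ocp-Apim-Subscription-Key": VISION_KEY,
            "Content-Type": "application/json",
        },
        json={"url": image_blob_url},
    )
    response.raise_for_status()
    result = response.json()
    return [caption["text"] for caption in result["denseCaptionsResult"]["values"]]
```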
From there, in series, we call GPT-3.5 through the AI Builder Prompt action and Phi-2 over HTTP against our own Azure AI Studio Phi-2 deployment. Once the data is extracted, we update the record with the output and the logs, and set its status to "Processed".
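Calling the Phi-2 deployment is a plain HTTP POST. The sketch below is one way that could look from Python; the scoring URL, key, and payload shape depend entirely on how your Azure AI Studio deployment is configured, so copy the real values from the endpoint's Consume tab.

```python
import requests

# Assumptions: the scoring URL, key and payload shape depend on your own
# Azure AI Studio deployment -- copy the real values from the endpoint's Consume tab.
PHI2_SCORING_URL = "https://<your-endpoint>.<region>.inference.ml.azure.com/score"
PHI2_KEY = "<your-endpoint-key>"

def ask_phi2(prompt: str) -> dict:
    """Send a prompt to the Phi-2 deployment and return the parsed JSON response."""
    response = requests.post(
        PHI2_SCORING_URL,
        headers={
            "Authorization": f"Bearer {PHI2_KEY}",
            "Content-Type": "application/json",
        },
        # Payload shape shown for a typical text-generation deployment; adjust to yours.
        json={
            "input_data": {
                "input_string": [prompt],
                "parameters": {"max_new_tokens": 100, "temperature": 0.0},
            }
        },
    )
    response.raise_for_status()
    return response.json()  # the exact response shape varies by deployment; inspect it once
```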
Deploying Phi-2 in Azure AI Studio
Azure AI Studio makes it simple to pick from thousands of models in the Azure model catalog and deploy them as an API endpoint. You may need to request additional Azure quota; in the meantime, Azure AI Studio lets you run the deployment for free for 7 days. It doesn't get much simpler than that.
Prompting Phi-2 and results
Phi-2 likes to keep writing: at times it will give you the right answer and then keep going. Playing with the model, you can tell the reasoning is there, but just as GPT-3.5 likes to chat, Phi-2 can be tricky to instruct. One useful trick with GPT-3.5 is to ask it to output JSON; the model understands JSON well, and whatever you ask for inside that JSON reads as more of an instruction. This trick seems to work for Phi-2 as well, and that is what we used to prompt it.
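Here is a sketch of that JSON trick: ask for a tiny JSON object, then parse the first JSON object out of whatever the model writes. The prompt wording is illustrative, not the exact prompt from the prototype.

```python
import json
import re

def build_json_prompt(captions: list[str], target_object: str) -> str:
    """Ask the model to answer inside a small JSON object instead of free text."""
    caption_lines = "\n".join(f"- {caption}" for caption in captions)
    return (
        "You are given captions describing one photo:\n"
        f"{caption_lines}\n"
        f'Answer only with JSON in the form {{"{target_object}_present": true or false}}.'
    )

def parse_json_answer(model_output: str, target_object: str) -> bool:
    """Pull the first JSON object out of the output, tolerating Phi-2's extra prose."""
    match = re.search(r"\{.*?\}", model_output, re.DOTALL)
    if not match:
        raise ValueError("No JSON object found in model output")
    return bool(json.loads(match.group(0))[f"{target_object}_present"])

# Phi-2 may keep writing after the answer; the parser ignores the trailing text.
print(parse_json_answer('{"staircase_present": true} The photo also shows...', "staircase"))
```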
With this Phi-2 prompt, 98% of the pictures were classified correctly, on par with our GPT-3.5 baseline!
The prompt is simple, but it is carefully crafted and sensitive to wording (try it for yourself!). I am sure it is not the best possible prompt and there are ways to do better, but for our 100-picture benchmark it got us on par with GPT-3.5. That is a success; we can stop there.
This prototype was built on Power Platform in one day. If you have not already, check out Power Platform. It is a terrific way to release enterprise-grade AI quickly and at low risk for your users, while reducing technical debt in your legacy IT stack.