Introducing InkyMM: The First Commercial Open Source Multimodal Model

Today, OctoML is announcing InkyMM, the first open-source, fully commercializable image + text LLM, built upon the great work of researchers at King Abdullah University and on the MPT-7B Instruct model published by MosaicML.

We've captured the highlights in this article, but if you want the deep dive, read the full post.

Computer Vision has Been Hard Work

Deep learning has made computer vision much more powerful, but still not easy: getting good performance usually depends on gathering large amounts of training data, selecting the right model architecture, and training that model on your data.

For example, when Octonaut Ben created a cat door that locks his cat out when the cat is trying to bring in a “present”, he had to gather and hand-label more than 22,000 images.

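For concreteness, the traditional workflow looks roughly like the sketch below: a folder of hand-labeled images, an off-the-shelf architecture, and a training loop. The paths, label names, and hyperparameters here are illustrative, not taken from Ben's actual project.

```python
# Minimal sketch of the "classic" supervised CV workflow: thousands of
# hand-labeled images, an off-the-shelf backbone, and a training loop.
# Paths, label names, and hyperparameters are illustrative only.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Expects a folder layout like labeled_images/train/{no_prey,prey}/*.jpg
train_set = datasets.ImageFolder("labeled_images/train", transform=transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

# Fine-tune a pretrained backbone for the binary "prey / no prey" decision.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```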

The same thing goes for any computer vision use case. Recent computer vision competitions have featured detecting player collisions in the NFL, finding ancient Roman ink in scrolls carbonized by Vesuvius, and predicting whether a piece of clothing will look good on a customer.

[Image: New applications for computer vision continue to emerge]

But all of these require extensive training sets. There have been many advances aimed at easing this pain, such as better labeling tools and self- and semi-supervised learning techniques. Even so, getting good performance almost always means painstaking data gathering.

Furthermore, a model trained on such datasets will usually not go beyond them—it won’t deal well with novel types of images, and it certainly can’t have a chat with you about whether that blazer you’re wearing is in fashion at the moment.

Multimodal learning is an approach in which a model is trained to understand and work with multiple forms of input data, such as text, images, and audio.

The Holy Grail: Zero-Shot Image Models

There have been many recent attempts to create image labeling models that can work as “zero-shot” detectors, meaning they can label any image without task-specific training data. Popular models in this space include OpenAI’s CLIP, Salesforce’s BLIP, and ViLT.

All of these are impressive in their own right, but their ability to reason about images is very limited, and sometimes they really miss the point. Below is an example of how BLIP handles a question that could be useful for any e-commerce company:

[Image: The description of the package condition could use some work.]
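If you want to poke at this kind of zero-shot behavior yourself, the sketch below shows a minimal visual question answering call against the public Salesforce/blip-vqa-base checkpoint via Hugging Face transformers. The image path and question are illustrative; they are not the exact inputs from the screenshot above.

```python
# Minimal sketch: zero-shot visual question answering with BLIP via
# Hugging Face transformers. Image path and question are illustrative.
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("package_photo.jpg").convert("RGB")  # hypothetical local image
question = "What condition is the package in?"

inputs = processor(image, question, return_tensors="pt")
output_ids = model.generate(**inputs)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```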

Since the release of BLIP, many others have released models with zero-shot, multimodal capabilities:

March 14th, 2023: OpenAI announced that GPT-4 can respond to both images and text. OpenAI is currently testing this capability with a single customer, Be My Eyes, and has not released it to the general market.

April 20, 2023: Researchers at King Abdullah University released MiniGPT-4, a multimodal LLM built on top of BLIP-2 and Vicuna (a fine-tuned LLaMA). MiniGPT-4 shows an impressive jump in capability by combining an LLM with an image captioning model, though it has some flaws, including slow responses and a tendency to hallucinate.

May 10, 2023: Google announced that Bard will soon have multimodal capabilities.

The only open source model from the list above is MiniGPT-4. However, this open source contribution has a fatal flaw: MiniGPT-4 is based upon Vicuna, which was trained on prompts and responses scraped from ChatGPT, and as such it cannot be used for commercial purposes.

Enter OctoML with InkyMM

Since we’re all about making models ready for commercial enterprise, we decided to try an experiment: what if we could replicate MiniGPT-4’s training process, but attach it to a fully commercializable language model instead of Vicuna? To do so, we anchored on MosaicML’s MPT-7B Instruct model, a commercially usable LLM trained on instruction-following datasets. After a surprisingly small amount of effort, we succeeded in creating InkyMM, a multimodal LLM with no legal encumbrances, ready for commercial use.
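At a high level, MiniGPT-4-style training keeps a pretrained vision encoder and a pretrained LLM frozen and learns only a small projection that maps visual features into the LLM's embedding space. The sketch below is a conceptual illustration of that idea, not InkyMM's actual implementation; the module name, dimensions, and token counts are assumptions.

```python
import torch
from torch import nn

class VisionToLLMProjector(nn.Module):
    """Conceptual sketch: map frozen vision-encoder features into an LLM's
    token-embedding space so image tokens can be prepended to the prompt.
    Dimensions and names are illustrative, not InkyMM's real code."""

    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        # In MiniGPT-4-style training, this projection is the only part
        # that gets trained; the vision encoder and LLM stay frozen.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_image_tokens, vision_dim)
        return self.proj(vision_features)


# Pseudo-usage: concatenate projected image tokens with text embeddings
# before running the (frozen) language model.
projector = VisionToLLMProjector()
image_feats = torch.randn(1, 32, 768)      # placeholder vision-encoder output
image_tokens = projector(image_feats)      # (1, 32, 4096)
text_embeds = torch.randn(1, 16, 4096)     # placeholder LLM token embeddings
llm_inputs = torch.cat([image_tokens, text_embeds], dim=1)
```

Because only the projection is trained while both large components stay frozen, the alignment step is cheap relative to pretraining either model, which helps explain why swapping in a different base LLM is a relatively modest effort.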

Early access users of the OctoML compute service can also access InkyMM endpoints for application development.

Join the early access program here.

We should note a few things:

  • Since MPT-7B Instruct is not as strong an LLM as Vicuna, InkyMM has more of a “hallucination” issue than MiniGPT-4. We’re working on it!
  • The web version of InkyMM is still slow. We have yet to apply acceleration techniques to this model pipeline, but we will deploy an accelerated version to our compute service once it’s running fast.

We are excited to release an API endpoint version for application developers in June!

Comments

I have a white beard! I didn't know that! :)


Hey, I remember seeing this demoed at your Tuesday event. Looks super cool, and congrats to the team!

Harry Kim (ML inference product) · 1y

I just tried it and it's much faster than MiniGPT. Congrats, OctoML team :) What hardware are you running InkyMM on for inference?

Harry Kim (ML inference product) · 1y

I tried out MiniGPT before, so I couldn't be more excited about this particular release :) Thank you, OctoML team!
