Meta's new model Chameleon blew my mind
It's multi-modal (text and images), but it handles both with the same encoder! Unlike image gen models, it uses tokens, not diffusion!
To do this, it converts the text and images into a single sequence of tokens. Then a single transformer processes these mixed-modal tokens, eliminating the need for a different encoder per modality.
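To make that concrete, here's a rough sketch of the "one sequence, one transformer" idea in PyTorch. The tokenizer stand-ins, vocab sizes, and special tokens are made up for illustration; this is not Meta's actual code or API.

```python
# Sketch of early-fusion mixed-modal tokens (illustrative only, not Chameleon's real code).
import torch
import torch.nn as nn

VOCAB_TEXT = 65_536      # hypothetical BPE text vocabulary size
VOCAB_IMAGE = 8_192      # hypothetical discrete image codes from a VQ-style tokenizer
BOI = VOCAB_TEXT + VOCAB_IMAGE        # begin-of-image marker (made up)
EOI = VOCAB_TEXT + VOCAB_IMAGE + 1    # end-of-image marker (made up)
TOTAL_VOCAB = VOCAB_TEXT + VOCAB_IMAGE + 2

def tokenize_text(text: str) -> list[int]:
    """Stand-in for a BPE tokenizer: returns ids in [0, VOCAB_TEXT)."""
    return [hash(w) % VOCAB_TEXT for w in text.split()]

def tokenize_image(image: torch.Tensor) -> list[int]:
    """Stand-in for a VQ image tokenizer: one image becomes ~1024 discrete codes,
    offset into [VOCAB_TEXT, VOCAB_TEXT + VOCAB_IMAGE) so they share one vocabulary."""
    codes = torch.randint(0, VOCAB_IMAGE, (1024,))  # placeholder codes
    return (codes + VOCAB_TEXT).tolist()

# One flat sequence: text ids and image ids live in the same vocabulary,
# so a single autoregressive transformer can model both.
sequence = (
    tokenize_text("A photo of my dog:")
    + [BOI] + tokenize_image(torch.rand(3, 512, 512)) + [EOI]
    + tokenize_text("What breed is it?")
)

# A single transformer over the mixed-modal sequence, with a causal mask.
embed = nn.Embedding(TOTAL_VOCAB, 512)
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=4,
)
logits_head = nn.Linear(512, TOTAL_VOCAB)

ids = torch.tensor(sequence).unsqueeze(0)                       # (1, seq_len)
mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
hidden = model(embed(ids), mask=mask)                           # same weights for every token
next_token_logits = logits_head(hidden[:, -1])                  # can predict text OR image tokens
```

The key point is that nothing in the model cares which modality a token came from; the fusion happens at the tokenizer, not inside the architecture.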
If this works well, it might make for more scalable and efficient multi-modal models!
I'll put a link to run it in the comments. Note: they made it "safety-aligned" and ripped out the ability to generate images; however, that code is still in there. If you can unlock it, I'll give you $1000.