Voxel51 Filtered Views Newsletter - March 29, 2024
Author: Harpreet Sahota (Hacker in Residence at Voxel51)
Welcome to Voxel51's bi-weekly digest of the latest trending AI, machine learning and computer vision news, events and resources! Subscribe to the email version.
The Industry Pulse
Ubisoft's new gaming NPC
Ubisoft has introduced "neo NPCs" like Bloom, a new breed of gaming NPC designed to have meaningful conversations with players, thanks to the power of generative AI. This development represents a significant step forward in making game worlds more interactive and engaging. The Game Developers Conference highlighted the growing interest in AI across the gaming industry. Ubisoft's approach to designing NPCs like Bloom encourages players to use their social instincts within the game, making the gaming experience more personal and relatable. The possibilities are endless; what kind of character would you want to have a conversation with in your favorite game?
Apple's move into GenAI
Apple is expected to address its approach to generative AI at the upcoming Worldwide Developers Conference (WWDC), which has led to widespread speculation and anticipation. The company has reportedly invested heavily in training its AI models and is exploring various avenues, including potential content partnerships and collaborations with leading AI entities. This move marks a significant shift in Apple's technological strategy, which has historically focused on on-device machine learning. We’ll soon see how Apple’s strategic moves in AI will reshape the competitive landscape and influence the future of consumer technology.
Suno's AI blues
The song "Soul Of The Machine" by an AI music generation startup called Suno attempts to mimic the soulful essence of Mississippi Delta Blues with an AI singing about its sadness. However, the reception among musicians reveals the challenges AI faces in replicating the human touch in music. While the song showcases AI's ability to generate music that fits within a specific genre, it lacks the emotional depth and nuanced understanding of rhythm and tension that human musicians bring to their performances. The article also highlights the irreplaceable value of live music, emphasizing the dynamic interaction between performers and their audience that is currently beyond AI's reach.
GitHub Gems
New LLaVA release
LLaVA-NeXT (aka LLaVA-1.6) is here! The improved version of the LLaVA (Large Language and Vision Assistant) model, an open-source multimodal AI assistant capable of processing both text and images, boasts enhanced reasoning, optical character recognition (OCR), and world knowledge capabilities.
Here are the key improvements in LLaVA-NeXT compared to the previous LLaVA-1.5 version:
LLaVA-NeXT is a massive advancement in open-source multimodal AI, making powerful visual-language capabilities more widely accessible to researchers and developers.
Get hands-on with the model and see it in action yourself with this Colab notebook we created for you!
Good Reads
Multimodality and Large Multimodal Models (LMMs)
Think about how you experience the world: you see, hear, touch, and talk.
As a human, you have the uncanny ability to process and interact with the world using multiple modes of data simultaneously. You can output data in various ways, whether speaking, writing, typing, drawing, singing, or more. Developing AI systems that can operate in the "real world" means building models that understand the world as you do. It requires models that can take in multiple input types, reason over that input, and generate output across different modalities.
CLIP's architecture. Both encoders and projection matrices are jointly trained together from scratch. The training goal is to maximize the similarity scores of the right (image, text) pairings while minimizing the similarity scores of the wrong pairings (contrastive learning).
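To make the contrastive objective in that caption concrete, here is a minimal NumPy sketch. It assumes you already have a batch of image and text embeddings where row i of each array belongs to the same (image, text) pair. The function name, the symmetric cross-entropy formulation, and the temperature value of 0.07 are illustrative assumptions, not code from CLIP or from Chip's post.

```python
import numpy as np


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Assumption: row i of image_emb and row i of text_emb form a correct pair.
    The temperature value is an illustrative choice.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Entry (i, j) scores image i against text j.
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct pairing for row i sits in column i.
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    loss_i2t = cross_entropy_on_diagonal(logits)
    loss_t2i = cross_entropy_on_diagonal(logits.T)
    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch: four orthogonal embeddings stand in for encoder outputs.
aligned = np.eye(4)
shuffled = np.roll(aligned, 1, axis=0)  # every text is paired with the wrong image

loss_aligned = clip_contrastive_loss(aligned, aligned)
loss_shuffled = clip_contrastive_loss(aligned, shuffled)
```

In this toy setup the loss is small when each image lines up with its own text and large when the pairings are shuffled, which is the behavior the training goal rewards.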
Chip's blog post discusses the latest advancements in training multimodal models. It’s split into three parts:
It's a long read, but it's time well spent. I highly recommend checking it out. My main takeaways are summarized below.
The Essence of Multimodality
Chip outlines that multimodality involves interactions between different data types, including text, images, audio, etc. She mentions that it can mean one or more of the following:
Data Modalities
Chip Huyen's exploration into multimodality and LMMs offers an excellent overview of the topic, a thorough explanation of groundbreaking models like CLIP and Flamingo, plus a compelling glimpse into the future of AI, where the integration of diverse data types promises to unlock new levels of intelligence and utility.
Good Listens
Kate Park has made massive contributions to the development of data engines.
Her journey began at Tesla, where she played a crucial role in enhancing their Autopilot system. She then continued her work at Scale AI, focusing on natural language processing (NLP) systems. Park's pioneering work at Tesla involved leveraging data engines to enhance Tesla Autopilot's capabilities. She emphasized that the key to achieving advanced levels of autonomy lies in data, not just in quantity but quality.
In a conversation on The Gradient Podcast, she highlights the balance between data quantity, quality, and the efficient allocation of resources to maximize model performance improvements. This podcast episode is a masterclass on how data engines create a systematic approach to improving machine learning models through collecting and processing high-quality data.
Here are my main takeaways:
Data Engines and GenAI
Beyond the key takeaways highlighted, Kate shares insight into the practical application and impact of data engines from her experience within the industry. She offers invaluable insights from her time at Tesla and Scale AI, detailing the data engines' role in autonomous driving and natural language processing domains.
Good Research
How to Read Conference Papers
Prof Jason Corso published a blog post about how to read research papers a while back. His PACES (problem, approach, claim, evaluation, substantiation) method has been my go-to for understanding papers.
The Role of Data Curation in Image Captioning
This week, I’ll apply his methodology to "The Role of Data Curation in Image Captioning" by Wenyan Li, Jonas F. Lotz, Chen Qiu, and Desmond Elliott.
Problem
Image captioning models are typically trained by treating all samples equally, ignoring variations in captions or the presence of mismatched or hard-to-caption data points. This negatively impacts a model's ability to generate captions accurately because it can "confuse" the model during training. This paper investigates whether actively curating difficult samples within datasets can enhance model performance without increasing the total number of samples.
Approach
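The paper introduces three data curation methods to improve image captioning models by actively curating difficult samples within the dataset. These methods are complete removal of a sample, caption replacement, and image replacement via a text-to-image generation model.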
Claim
The authors claim that actively curating difficult samples in datasets, without increasing the total number of samples, enhances image captioning performance through three data curation methods: complete removal of a sample, caption replacement, or image replacement via a text-to-image generation model.
Their experiments show that the best strategy varies between datasets but is generalizable across different vision-language models.
Evaluation
The methods were evaluated with two state-of-the-art pretrained vision-language models (BLIP and BEiT-3) on widely used datasets (MS COCO and Flickr30K), focusing on how these curation methods impact the performance of image captioning models. They used metrics like CIDEr and BLEU scores to measure improvements.
The study finds that Flickr30K benefits more from removing high-loss training samples, suggesting it may be noisier than MS COCO.
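As a rough illustration of the removal strategy, the sketch below flags training samples whose loss is unusually high and drops them. The `Sample` class, the function name, and the "mean plus one standard deviation" cutoff are hypothetical choices made for this example; the summary above does not state how the paper decides which samples count as high-loss, and caption or image replacement is not shown.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Sample:
    """Hypothetical training record: one image, one caption, its current loss."""
    image_id: str
    caption: str
    loss: float


def split_by_loss(samples, num_std=1.0):
    """Partition samples into (kept, flagged) using a loss threshold.

    Assumption: a sample is "high-loss" if its loss exceeds
    mean + num_std * standard deviation of the batch. This cutoff is
    illustrative, not the paper's criterion. Needs at least two samples.
    """
    losses = [s.loss for s in samples]
    threshold = mean(losses) + num_std * stdev(losses)
    kept = [s for s in samples if s.loss <= threshold]
    flagged = [s for s in samples if s.loss > threshold]
    return kept, flagged


# Made-up losses: one caption clearly does not match its image.
samples = [
    Sample("img_001", "a dog runs on the beach", 1.0),
    Sample("img_002", "two people ride bicycles", 1.1),
    Sample("img_003", "a plate of pasta on a table", 0.9),
    Sample("img_004", "a child flies a kite", 1.0),
    Sample("img_005", "a red bus parked downtown", 5.0),
]

kept, flagged = split_by_loss(samples)
```

In a replacement-based variant, the flagged samples would be handed to a captioning or text-to-image model rather than discarded.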
Substantiation
The study discovered that:
Upcoming Events
Check out these upcoming AI, machine learning and computer vision events! View the full calendar and register for an event.