Voxel51 Filtered Views Newsletter - September 20, 2024
Welcome to Voxel51's weekly digest of the latest trending AI, machine learning and computer vision news, events and resources! Subscribe to the email version.
The Industry Pulse
ViewCrafter Synthesizes High-Fidelity Novel Views Like Magic!
ViewCrafter synthesizes high-fidelity novel views of generic scenes using single or sparse images, video diffusion models, and point-based 3D representations. This approach sidesteps the dense multi-view captures that previous neural 3D reconstruction methods rely on.
But how does it do what it does?
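In broad strokes, it lifts the sparse inputs into an explicit point cloud, renders that cloud along a target camera path, and lets a video diffusion model polish the artifact-ridden renders. Here's a conceptual sketch of that pipeline (every function name below is a hypothetical stand-in, not ViewCrafter's actual API):

```python
# Conceptual ViewCrafter-style pipeline; all helper functions are hypothetical.
def synthesize_novel_views(images, camera_trajectory):
    # 1. Lift the sparse input views into an explicit point-based 3D scene
    #    (the paper builds this with a dense stereo reconstruction model).
    points, colors = estimate_point_cloud(images)

    # 2. Rasterize the point cloud along the target camera path. These
    #    renders are geometrically consistent but full of holes/artifacts.
    coarse_frames = [render_point_cloud(points, colors, cam)
                     for cam in camera_trajectory]

    # 3. A video diffusion model conditions on the coarse renders and
    #    "inpaints" them into high-fidelity, temporally coherent frames.
    diffusion = load_video_diffusion()
    return diffusion.refine(coarse_frames, reference_views=images)
```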
Robo-researchers are infiltrating academia
Ah, the wonders of modern academia! Who needs rigorous peer review when you've got ChatGPT churning out "scientific" papers faster than you can say "publish or perish"?
A new study has uncovered 139 questionable papers on Google Scholar with suspected deceptive use of LLM tools. It seems no platform is safe from the invasion of robo-researchers: most of the papers were in non-indexed journals or working papers, surfacing everywhere from ResearchGate to ORCiD to X (because who needs academic standards when you've got hypebeasts on social media).
Still, some appeared in established journals and conferences.
This trend doesn't seem endemic to any one field in particular: the study found these papers in environmental studies, health studies, and computing. It seems LLMs have it all covered.
Why bother with years of painstaking research when you can have a transformer spit out a paper in seconds?
Read more here.
OpenAI is considering a corporate restructuring in the coming year.
The company is currently in talks to raise $6.5 billion at a $150 billion pre-money valuation, but this deal is contingent on removing the profit cap for investors. CEO Sam Altman has reportedly informed employees that OpenAI's structure will likely change in 2025, moving closer to a traditional for-profit model. The current structure, where a nonprofit controls the for-profit arm, appears to be a point of contention for potential investors.
Despite these potential changes, OpenAI has stated that the nonprofit aspect will remain central to its mission. The company emphasizes its commitment to developing AI that benefits everyone while positioning itself for success in achieving its goals.
If they make this change, I think "ForProfitAI" would make for a suitable new name.
Whatever the case, I firmly believe it's Open Source for the win.
Read more here.
GitHub Gems
Reader-LM: Get this, it's an LLM that converts HTML to Markdown. It's included here ironically, because it's a thing no one asked for and will likely never use. Why write regex to clean your dataset when you can throw a 500M-parameter (~1GB) LLM and unnecessary GPU power at it!
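That said, if you insist on the GPU route, it's only a few lines with Hugging Face transformers. A minimal sketch, assuming the checkpoint id from Jina's Hugging Face page:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint id; double-check Jina's Hugging Face page before running.
checkpoint = "jinaai/reader-lm-0.5b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Reader-LM takes raw HTML as the user message and emits Markdown.
html = "<html><body><h1>Hello</h1><p>So much GPU for so little HTML.</p></body></html>"
messages = [{"role": "user", "content": html}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens (the Markdown).
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```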
Hi3D: A novel video diffusion-based approach for generating high-resolution, multi-view consistent images with detailed textures from a single input image. Hi3D leverages the temporal consistency knowledge in video diffusion models to achieve geometry consistency across multiple views, and employs a 3D-aware video-to-video refiner to scale up the multi-view images while preserving high-resolution texture details.
Prompt2Fashion: This repo introduces a dataset of automatically generated fashion images created using the methodology presented in the "AutoFashion" paper. The dataset focuses on personalization, incorporating a variety of requirements like gender, body type, occasions, and styles, and their combinations, to generate fashion images without human intervention in designing the final outfit or the conditioning prompt for the Diffusion Model.
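The general recipe is straightforward: compose the user's requirements into a conditioning prompt and hand it to a text-to-image diffusion model. A toy sketch of that idea (the prompt template and checkpoint below are illustrative stand-ins, not the paper's actual pipeline):

```python
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical prompt template in the spirit of Prompt2Fashion.
def build_prompt(gender: str, body_type: str, occasion: str, style: str) -> str:
    return (f"full-body fashion photo of a {body_type} {gender} model, "
            f"outfit styled for {occasion}, {style} aesthetic")

# Stand-in checkpoint; the paper may use a different diffusion model.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = build_prompt("woman", "petite", "a summer wedding", "bohemian")
image = pipe(prompt).images[0]
image.save("outfit.png")
```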
Lexicon3D: This framework extracts features from various foundation models, constructs 3D feature embeddings as scene embeddings, and evaluates them on multiple downstream tasks. The paper presents a novel approach to representing complex indoor scenes using a combination of 2D and 3D modalities, such as posed images, videos, and 3D point clouds. The extracted feature embeddings from image- and video-based models are projected into 3D space using a multi-view 3D projection module for subsequent 3D scene evaluation tasks.
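The lifting step is the interesting part: per-pixel 2D features get unprojected into world space using depth and camera poses, then pooled per point. Here's a minimal sketch of that generic operation (standard pinhole-camera math, not Lexicon3D's exact module):

```python
import numpy as np

def unproject_features(feat, depth, K, cam_to_world):
    """Lift per-pixel image features into world space.

    feat: (H, W, C) image features; depth: (H, W) depth map;
    K: (3, 3) camera intrinsics; cam_to_world: (4, 4) camera pose.
    Returns (H*W, 3) world points and their (H*W, C) features.
    """
    H, W, C = feat.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T          # pixel -> camera-space ray
    pts_cam = rays * depth.reshape(-1, 1)    # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((H * W, 1))], axis=1)
    pts_world = (pts_h @ cam_to_world.T)[:, :3]
    return pts_world, feat.reshape(-1, C)

# Repeating this over all views puts every feature in one shared frame,
# where features can be averaged per point/voxel to form the scene embedding.
```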
Good Read: Founder Mode
It came out of nowhere. All of a sudden, my entire X and LinkedIn feeds were filled with posts (and plenty of memes) using the words “Founder Mode.”
Eventually, I learned the origin of the term: a Paul Graham essay titled Founder Mode.
After hearing Airbnb's Brian Chesky speak at a Y Combinator event, Paul Graham, tech guru and startup whisperer, had an epiphany: it turns out, the conventional wisdom of "hire good people and let them do their thing" is about as effective as trying to put out a fire with gasoline.
Inspired by this revelation, Graham furiously penned an essay and hit "publish." The impact was immediate and powerful. Since then, the world of startups hasn't been the same.
As founders read his words, a collective "Aha!" moment swept through the startup world. Suddenly, tech founders everywhere were nodding in recognition, realizing they'd been fighting the same uphill battle in their own companies. Graham's essay struck a chord that reverberated through the tech industry, uniting founders in their shared struggles and sparking a new conversation about what it really takes to build a successful startup.
A tale of two modes
Founder Mode
Graham argues that Founder Mode is more complex but ultimately more effective than Manager Mode.
Basically, Founder Mode is like Manager Mode, but with 100% more founder intuition and 50% less delegation. Results may vary, batteries not included.
Graham admits we know about as much about Founder Mode as we do about dark matter. But fear not! He predicts that once we figure it out, founders will achieve even greater heights, like building rockets to Mars or creating social networks that definitely won't cause any problems whatsoever.
Good Listens: A full breakdown of the Reflection-70B fiasco
You might wanna grab some popcorn for this one!
Earlier this month (September 2024), Reflection 70B was making waves on X as THE new open-source LLM. Released by HyperWrite, a New York startup, it claimed to be the world's top open-source model. Yet soon after its release, its performance was called into question amid accusations of potential fraud.
Initial Announcement and Claims
Thursday, September 5, 2024:
- Matt Shumer, co-founder and CEO of OthersideAI (HyperWrite), releases Reflection 70B on Hugging Face.
- Shumer claims it's "the world's top open-source model" and posts impressive benchmark results.
- The model is said to be a variant of Meta's Llama 3.1, trained using Glaive AI's synthetic data generation platform.
- Shumer attributes the performance to "Reflection Tuning," allowing the model to self-assess and refine responses.
Skepticism and Investigations
Friday, September 6 - Monday, September 9, 2024:
- Independent evaluators and the open-source AI community begin questioning the model's performance.
- Attempts to replicate the impressive results fail.
- Some responses indicate a possible connection to Anthropic's Claude 3.5 Sonnet model.
- Artificial Analysis posts on X that its tests yield significantly lower scores than initially claimed.
- It's revealed that Shumer is invested in Glaive AI, which he didn't disclose when releasing Reflection 70B.
- Shumer attributes discrepancies to issues during the model's upload to Hugging Face and promises to correct the weights.
On September 8, X user Shin Megami Boson openly accused Shumer of "fraud in the AI research community."
Silence and Response
Sunday, September 8 - Monday, September 9, 2024:
- Shumer goes silent on Sunday evening.
Tuesday, September 10, 2024:
Shumer breaks his silence, apologizing and claiming he "got ahead of himself." He states that a team is working to understand what happened and promises transparency. Sahil Chaudhary, founder of Glaive AI, posts that the benchmark scores shared with Shumer haven't been reproducible.
Yuchen Jin, CTO of Hyperbolic Labs, details his efforts to host Reflection 70B and expresses disappointment in Shumer's lack of communication.
Ongoing Skepticism
Post-September 10, 2024:
The AI community remains skeptical of Shumer's and Chaudhary's explanations. Many are calling for more detailed explanations of the discrepancies and the true nature of Reflection 70B. The situation continues to evolve, with the AI community awaiting further clarification and evidence from Shumer and his team regarding the true capabilities and origins of Reflection 70B.
Here’s a good recap of the entire fiasco on YouTube, and you can also read more about this here.
Good Research: A survey on comic understanding
Comics present unique challenges for models due to their combination of visual and textual narratives, creative variations in style, non-linear storytelling, and distinctive compositional elements.
While vision-language models have advanced significantly, their application to Comics Understanding is still developing and faces several challenges, not least the lack of a comprehensive framework for categorizing and understanding the tasks involved. This survey introduces such a framework, the Layer of Comics Understanding (LoCU), which categorizes tasks based on input/output modalities and spatio-temporal dimensions.
The LoCU framework aims to guide researchers through the intricacies of Comics Understanding, from basic recognition to advanced synthesis tasks.
Holy frameworks, Batman! With LoCU, we're ready to take on the comic book universe!
This framework provides a structured approach to understanding the various tasks involved in Comics Understanding, from simple classification to complex generation and synthesis. Each layer builds upon the previous ones, increasing in complexity and abstraction:
Layer 1: Tagging and Augmentation
Layer 2: Grounding, Analysis, and Segmentation
Layer 3: Retrieval and Modification
Layer 4: Understanding
Layer 5: Generation and Synthesis
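To make the taxonomy concrete, here's the hierarchy as a tiny data structure; the example tasks are illustrative guesses based on the layer names, not the survey's full task list:

```python
# LoCU layers, bottom to top; example tasks are illustrative guesses.
LOCU_LAYERS = {
    1: ("Tagging and Augmentation", ["character tagging", "super-resolution"]),
    2: ("Grounding, Analysis, and Segmentation", ["speech-bubble grounding", "panel segmentation"]),
    3: ("Retrieval and Modification", ["sketch-to-panel retrieval", "style modification"]),
    4: ("Understanding", ["visual question answering", "narrative comprehension"]),
    5: ("Generation and Synthesis", ["panel generation", "story synthesis"]),
}

for level, (name, tasks) in LOCU_LAYERS.items():
    print(f"Layer {level}: {name} (e.g., {', '.join(tasks)})")
```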
Thanks to Vivoli et al., we have the Layer of Comics Understanding (LoCU) framework to guide us through this maze of panels, bubbles, and superhero capes.
From basic tagging to full-blown narrative synthesis, it's like having a superhero team of AI models ready to tackle every comic challenge. Time to turn those pixels into epic stories and make our favorite comics come alive in ways we never thought possible. Let's get our capes on and save the day, one panel at a time!
Upcoming Events
Check out these upcoming AI, machine learning and computer vision events! View the full calendar and register for an event.