The Google-Reddit deal will accelerate AI training, but there's no Reddit for AgTech AI (and that's OK)
(I posted a list of top 10 things in AgTech that matter the most for 2024 - this was item 4. I will put a cross-link on the main post after posting this article.)
One of the often overlooked nuances of quality artificial intelligence (AI) products is the need for high-quality training data to develop AI models. This is true of all AI products in all categories, including AgTech. Think about automation - the reason that weeding robots of all types (laser weeders, mechanical weeders, spray weeders) can shorten development cycles for each successive product or weed identification is that the image library they build makes their models smarter with each new data set, each new crop type, and each new weed type. Once you know how to identify romaine lettuce and weed around it, it's a lot easier to identify other types of lettuce and weed around them. Plus, once you can identify certain types of weeds in one crop, you can identify them in all crops. Both of these factors - better plant identification and better weed identification - have allowed weeding robots to go from 1 crop type to 40+, and in some cases they are approaching 100 different crop types they can identify and weed.
Weeding is a current use case because weeding robots are in market with some traction. Let's think about future use cases. For biologicals, the ability to treat each pest and each soil pathogen will be developed in a similar manner to crops and weeds for weeding robots. Each pest and each pathogen will require images and training data, including pictures of them at various growth stages and various conditions (partially hidden by plants or soil, different lighting conditions, different disease impacts). For food safety, the methodology will be similar - large amounts of training data (think lots of images, lots and lots of images) for different food products in different conditions.
While the three use cases above are different, the key data required for all 3 is similar. The common thread is that pictures of crops in various growth stages and weather conditions help all 3 use cases improve faster, and more pictures shorten model iteration times. The not so common thread is that weeds, pests, soil pathogens, and food safety pathogens are all different targets with different image requirements. The speed with which each of these three problems can be effectively addressed by AI models is tied pretty tightly to the ability of startups in each space to get large amounts of training data for their models.
This is where things get interesting (I know, as if the above content weren't interesting enough). Let's take a look at what just happened for horizontal AI models. Reddit signed a deal with Google for $60 million a year for 3 years for all of Reddit's data, which Google can now use to train AI models. In theory, this will give Google a running start on its next set of models because Reddit provides a large amount of useful training data for whatever AI products Google wants to build. In theory, this data also gives Google a competitive advantage in its large language models (LLMs) over other tech competitors that do not have as much data to use in their training. That edge comes at the expense of companies that aren't going to like that fact much (think Amazon, Microsoft, and Apple, for starters) and are known as being a wee bit competitive.
So what's the likely outcome of the Reddit deal for AI? My prediction is that the Reddit deal is going to accelerate AI efforts for Google, which will in turn push the others on the list above to try and accelerate their own AI efforts. None of them wants any one of the others to hold an advantage position in AI during the early development, because all 4 of those companies are very familiar with first mover advantage and network effects. That one deal between Reddit and Google will pick up the investment and R&D pace for the entire AI world - all that for a relatively minor sticker price of $60M/year.
At the same time, I have a second prediction. Because there is no Reddit available in the case of AgTech data, there will be no corresponding opportunity for anyone in AgTech to license data for their model. Think about it - Reddit is large amounts of user generated content with a lot of different users with different viewpoints and content characteristics, so the AI model automatically has a lot of use case data and edge case data built into it because Reddit threads can go deep (really deep - you can argue the quality of the depth and some of the notorious Reddit rabbit holes, but you cannot argue they go verrrrry deep) on a variety of topics with posts by a wide variety of users. Without something like Reddit in play, there's no deal any of the companies can make to create an advantage position like Google just did with Reddit.
All that being said, I think people in AgTech will see the Google-Reddit deal as an opportunity to try and create a competitive advantage through a similar licensing deal or create a revenue stream for training data if you're on the Reddit side of the deal. The difference is there is no user-generated content (UGC) site like Reddit for AgTech where the user wants to share the information on a platform and the value they receive is the mere publication of the content on Reddit. Even better for Reddit, they've been doing this for a while so their data has been accumulated over years, which is often the best UGC from a value perspective.
The closest thing that AgTech has is the data captured on tractors and implements by the equipment manufacturers, but that content is the furthest thing from UGC. Farmers understand the value of that content and are usually very reluctant to even talk about sharing it, and increasingly they are only willing to do so if a revenue sharing discussion is attached to the conversation (which is fair - if you believe that your farming operations are best in class, why share data that helps people reverse engineer them for their own operational advantage?).
I believe that licensing deals for content that will be used as AI training data will pick up – for a while. I think early movers are likely to have the advantage position and should monetize what they can reasonably quickly, for one reason: the longer AI is out there and data is being collected, the less valuable the data is to others, particularly if it’s relatively commoditized and easily replaced by cheaper sources that provide it to all via open source platforms with data sharing. In short, the larger the pool of available free data, the harder it will be to find premium pricing partnership opportunities. Well done Reddit – this one looks good. It, and other deals like it, will make sure that training data is available at scale for a lot of key AI platforms. This will accelerate everything about general horizontal AI.
What about vertical AI, where we are focused on a specific segment? For ag purposes, it's not clear there is a Reddit or UGC equivalent, so there won't be any shortcuts for this process. AI companies in AgTech will have to follow Ben Hogan's golf swing advice and dig it out of the dirt (well, ok, unless you're flying drones or satellites ...) one set of data at a time. Two things can be true at the same time:
1) The Reddit-Google deal will accelerate AI development overall by accelerating training data sets at scale.
2) The Reddit-Google deal has no equivalent in AgTech so AgTech training data sets will take longer to acquire and leverage. AgTech will benefit from the overall $50B investment in AI in 2023 and improvements from that set of startups and large tech innovators, but will not benefit from a UGC deal like Reddit.
HR Operations | Implementation of HRIS systems & Employee Onboarding | HR Policies | Exit Interviews
6 months ago: Great article. ModelOps, the next phase after DataOps, aims to develop and maintain highly accurate Machine Learning models for production use. The ModelOps pipeline encompasses six key components: (a) Feature Engineering, (b) Model Training and Hyperparameter Tuning, (c) Model Validation and Testing, (d) Model Packaging and Versioning, (e) Model Serving and Predicting, and (f) Model Performance Monitoring and Logging. Feature Engineering involves categorizing and transforming features. Model Training optimizes algorithms using the training dataset and adjusts hyperparameters like training epochs. Model Validation and Testing assess the trained model's accuracy against a separate dataset, potentially requiring iterative refinement. Packaging is done in formats like PMML and Pickle for operationalization. Serving and predicting, facilitated by containerization (e.g., Docker, Kubernetes), enable flexible scaling of infrastructure. Model Performance Monitoring and Logging address potential data or concept drift, thereby ensuring ongoing model accuracy. Logging predictions aids statistical analysis, guiding adjustments to maintain model efficacy and prevent degradation. More about this topic: https://lnkd.in/gPjFMgy7
Founder & CEO, Group 8 Security Solutions Inc. DBA Machine Learning Intelligence
8 months ago: Appreciate your post!
Manager at Aactive Engineering
8 months ago: "10 - Do the Math!!" - Opinion + Facts + Math > Opinion. Napkin arithmetic, works?
Co-Founder and CEO at TheoryMesh | Sustainability and food safety through transparency and traceability | ex-Microsoft | Keynote speaker
8 months ago: Really? Has anyone been on Reddit lately? Reddit is one of the worst possible training sources for AI. Hugely skewed demographics, limited content review, etc. Search it if you don't believe me. Why for Reddit? Well, Sam Altman's take from the Reddit IPO is approx $600M. So the agreements between Reddit and AI companies will only accelerate the transfer from consumer contributors to Private Equity. Reddit is trading at an industry average multiple for revenue, so nothing much to see.
Founder & CEO at Viable | Scaling Startups into Global Ventures | Venture Builder & Investor | Forbes 30 Under 30
8 months ago: Exciting times ahead for AI development with this Reddit-Google deal! The possibilities are endless.