On model distillation, intellectual property, QARTA and orange juice
Credit to Microsoft Designer tool @2025

...this is like buying oranges from a shop, then finding out that you're not allowed to make orange juice out of the fruit you just bought, ...or that you're not allowed to plant the seeds!

You might have heard the latest allegations from OpenAI that DeepSeek distilled knowledge from ChatGPT to train its model [1]. I guess we'll know soon how true this is, especially if someone manages to reproduce a model with performance comparable to DeepSeek's by following the design steps they shared publicly, in particular the heavy use of RL to generate high-quality training samples.

Back in 2018, I was working with Rade Stanojevic on traffic modelling for road networks. As we were thinking about productionizing QARTA [2], our local routing engine, we figured that we needed to build traffic awareness into our open-source-based system to be on par with Google Maps' offerings.

One way to do this is to collect trajectory data from a fleet of vehicles and train a model that predicts the weight of various edges (road segments) at different times of the day and week [3].
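To make the idea concrete, here is a minimal sketch of the simplest version I can think of: average the observed traversal time of each edge per hour-of-week bucket. It assumes the trajectories have already been map-matched into (edge_id, unix_timestamp, seconds) observations. That format and the function names are made up for illustration, and this is a naive baseline, not a description of the model in [3].

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_of_week(ts: float) -> int:
    """Map a Unix timestamp to a bucket in 0..167 (Monday 00:00 UTC is 0)."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return dt.weekday() * 24 + dt.hour


def edge_weights(observations):
    """Average traversal time per (edge_id, hour_of_week) bucket.

    observations: iterable of (edge_id, unix_timestamp, seconds) tuples.
    This input format is an assumption made for the sketch.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of seconds, count]
    for edge_id, ts, seconds in observations:
        bucket = totals[(edge_id, hour_of_week(ts))]
        bucket[0] += seconds
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

Buckets with no observations simply do not appear in the result, which is exactly where the need for a large enough fleet shows up.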

For Qatar, we were fortunate to have a strong partnership with the local taxi company, which shared that data with us. In a real-life experiment of 20k trips, the company recorded the travel times predicted by GMaps and QARTA, then compared them with the actual driving time recorded by the vehicle's computer. It revealed that QARTA's RMSE was 20% better than GMaps'.
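For readers who want to see what such a comparison boils down to, here is a small sketch. It reads "20% better" as a 20% relative reduction in RMSE, which is my assumption about the metric, and the numbers in the usage note are toy values, not data from the experiment.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equally long sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def relative_improvement(rmse_ours, rmse_baseline):
    """Fraction by which our RMSE is lower than the baseline's."""
    return (rmse_baseline - rmse_ours) / rmse_baseline
```

With made-up trips lasting 600, 900 and 1200 seconds, a baseline that is off by 50 seconds on each trip has an RMSE of 50, a model that is off by 40 seconds has an RMSE of 40, and `relative_improvement(40, 50)` gives 0.2.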

Following the local success, we wanted to go regional and export our solution to neighboring countries. However, we quickly realised how hard that would be: for QARTA to work, we'd need to partner with large enough fleets in each country we'd like to conquer. These fleets need to collect enough data in locations that are relevant to the business model (e.g., large cities).

After a few days of deep thinking, or maybe weeks, I don't really remember, we came up with three ways to tackle the data challenge:

1. Buy trajectory data from legitimate “legacy” operators such as TomTom. Unfortunately, TomTom's customer service was stuck in the pre-internet era.

2. Buy dodgy user location data from dodgy data brokers who collect data from those popular free mobile apps and games you and I use. Yes, there's a market for that data, and while it's not a perfect fit for our purpose, we thought we'd be able to clean it a bit to isolate vehicle-based trajectories. But who wants to do this?

3. Distill GMaps. Oh yeah, just use GMaps as a training data generator to bootstrap our system.

Understandably, the last option was the most appealing to us, and it would have worked as follows:

- Take any city in the world.

- Use heuristics (e.g., based on Foursquare POIs, OSM dwellings, and network betweenness/centrality) to identify relevant location pairs for realistic trajectories.

- Run a few thousand queries against the GMaps API (this would cost a few quid per city).

- Use the GMaps results to train our own models.
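The recipe above can be sketched in a few lines. This is only an illustration under my own assumptions: every candidate location already carries a relevance score (say, from POI density or centrality), and the "teacher" is a plain callable passed in by the caller. No real routing API is called here, and all names are hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Coord = Tuple[float, float]  # (latitude, longitude)


def sample_od_pairs(scores: Dict[Coord, float], n: int,
                    seed: int = 0) -> List[Tuple[Coord, Coord]]:
    """Draw n origin/destination pairs, favouring high-score locations."""
    places = [p for p, s in scores.items() if s > 0]
    if len(places) < 2:
        raise ValueError("need at least two locations with a positive score")
    weights = [scores[p] for p in places]
    rng = random.Random(seed)  # seeded so that runs are reproducible
    pairs = []
    while len(pairs) < n:
        origin, dest = rng.choices(places, weights=weights, k=2)
        if origin != dest:  # a trip to the same spot teaches nothing
            pairs.append((origin, dest))
    return pairs


def build_training_set(pairs: Sequence[Tuple[Coord, Coord]],
                       departure_times: Sequence[float],
                       teacher: Callable[[float, Coord, Coord], float]):
    """Ask the teacher for a duration for every pair and departure time."""
    return [(t, o, d, teacher(t, o, d))
            for (o, d) in pairs
            for t in departure_times]
```

Each row that comes out is a (time, origin, destination, duration) tuple, which is all a student model needs to start learning edge weights.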

Bingo! Just like this. I hope you get it now, right?

This is what Sam is accusing DeepSeek of doing.


Well, because we're good people, we genuinely felt uneasy with this solution. It kind of felt like cheating! So we went on to read the GMaps ToS just to be on the safe side, and surprise! It was there waiting for us, written in black and white, something along the lines of: you are not allowed to use our service to create a similar service! Seriously, Google!


I still remember my friend's immediate and priceless reaction:

...but this is insane! It's as if you went to the market to buy oranges, but then the seller tells you that you’re not allowed to make orange juice out of those oranges...

I must confess that my friend had a strong point here.

Well, we obviously had long, very long, conversations about whether oranges and orange juice were the right example. We got into very interesting philosophical and practical aspects, including whether GMaps has the right to include such a term given that it most likely uses users' private location data to build its services, or whether there was an antitrust aspect to this term given the monopoly GMaps had in the specific traffic-aware routing API business. We even went down the rabbit hole of Robin Hood's principles...


In the end, we resorted to creating an API to train traffic-aware routing systems for any city. It would ingest a collection of quadruplets <time, start_coordinates, end_coordinates, duration> and output the weights of the network. And we thought that it should be the responsibility of businesses interested in using our system to provide data in that format if they wanted to run a locally hosted traffic-aware routing engine! How they get that data is their business, not ours :)
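To give a feel for what such an API has to do, here is a rough sketch and not QARTA's actual implementation. It assumes a hypothetical `route_of` function that maps each trip to the list of edges it most likely used, and it fits one weight per edge with a Kaczmarz-style iterative update. Time of day is ignored to keep it short; a time-aware version would keep one weight per (edge, time bucket).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

Coord = Tuple[float, float]  # (latitude, longitude)


@dataclass(frozen=True)
class Trip:
    time: float      # departure time, Unix seconds
    start: Coord     # start_coordinates
    end: Coord       # end_coordinates
    duration: float  # observed travel time, seconds


def fit_edge_weights(trips: Iterable[Trip],
                     route_of: Callable[[Trip], List[str]],
                     initial: Dict[str, float],
                     epochs: int = 200,
                     lr: float = 0.5) -> Dict[str, float]:
    """Adjust edge weights so that route sums match observed durations."""
    trips = list(trips)
    weights = dict(initial)
    for _ in range(epochs):
        for trip in trips:
            route = route_of(trip)  # hypothetical map-matching step
            if not route:
                continue
            predicted = sum(weights[e] for e in route)
            # Spread a share of the residual evenly over the route's edges.
            step = lr * (trip.duration - predicted) / len(route)
            for e in route:
                weights[e] = max(0.0, weights[e] + step)  # no negative times
    return weights
```

For example, three trips with durations 30, 50 and 40 seconds over the routes [a, b], [b, c] and [a, c] pin the weights down to a = 10, b = 20 and c = 30, and the loop above converges to those values from a flat initial guess.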




Syed Shadab Mustafa

Software Engineering Manager | Azure | SQL | MongoDB | .Net | REST APIs | IoT | CANBus | Payment Gateways | Data Analytics | DevOps | GitHub

4 weeks ago

Great article. While I understand what OpenAI is saying, and what the case was with GMaps from a commercial perspective, how about we use their services to validate our model? In the case of DeepSeek versus OpenAI, we would check how it scores against their outputs and adjust our models for better outcomes. Would that be a fine line that avoids breaching the ToS?

Rade Stanojevic

Software Engineer at Uber

1 month ago

Thinking about the army of SWE I work with at Uber to build a QARTA-like system makes me incredibly proud of what we did with a few machines and a lot of grit.

Belkacem Mouffok, MS EE, PMP

Sr. Project & Operation Manager

1 month ago

The least we can say is: their arrogance knows no bounds. Permissible for them, forbidden for us.

Yazan Boshmaf

AI | Cybersecurity | Web3

1 month ago

It’s not really buying oranges to make orange juice. It’s more like buying orange juice to make orange juice. Nice article!

Anastasios F.

Technical Lead, Engineering & Design, Research Computing Computing Infrastructure, QCRI, GPCS

1 month ago

Good point, Sofiane.


More articles by Sofiane Abbar, Ph.D
