The Ethics Of ML Data Sourcing For ADAS/AD: An Uncomfortable Conversation
Dear Reader,
This edition of Automotive Industry News is a bit different. You may find this piece opinionated, perhaps even preachy: I would not blame you for not enjoying the read. In fact, I would have preferred not to have had to write it.
This month, I am focusing on unfortunate news and an uncomfortable topic: ethics in data sourcing for ML-based ADAS/AD. I do this not because I think it makes for good and pleasant reading, but because I think we have to write, read and talk about these things - even if they make us uncomfortable. Let me get right to it ...
Data-Related Labor Needs In ADAS/AD
It’s a topic that is not native to automotive, but an essential building block of next-gen ADAS and AD systems: Artificial Intelligence, or Machine Learning, whichever you prefer.
The complexity of advanced ODDs/DDTs makes the use of rule-based systems at scale impossible, so OEMs and ADAS suppliers are developing ML-based systems. One major difference is that these systems’ performance is optimized by training them on large amounts of data, rather than having their algorithms fine-tuned by experienced signal processing engineers.
Most training in this domain is ‘supervised learning’, which presents annotated datasets to ML models: scenes or scenarios with machine-readable information about the objects/events they contain - e.g. a camera sequence with bounding boxes around other traffic participants to train a perception system. In stark contrast to the high-level automation the end result is meant to achieve, these data labels are to a large degree placed, corrected and/or reviewed manually by humans.
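To make the above concrete, here is a minimal sketch of what one such manually produced label might look like as a data structure. The field names and values are purely illustrative assumptions, not any specific dataset's schema:

```python
from dataclasses import dataclass

@dataclass
class BoundingBox:
    """One manually placed/reviewed object label in a camera frame.

    Field names are illustrative, not taken from any real dataset schema.
    """
    frame_id: int
    object_class: str   # e.g. "pedestrian", "car", "cyclist"
    x_min: float        # pixel coordinates of the box corners
    y_min: float
    x_max: float
    y_max: float
    annotator_id: str   # the human behind the label

# Even a short clip needs every frame reviewed; multiply boxes per frame
# by frames per clip and clips per fleet-day, and the volume of manual
# work behind a perception training set becomes clear.
clip = [
    BoundingBox(0, "pedestrian", 412.0, 188.0, 460.0, 310.0, "worker_0042"),
    BoundingBox(0, "car", 90.0, 200.0, 260.0, 330.0, "worker_0042"),
]
print(len(clip), clip[0].object_class)
```

Note the `annotator_id` field: every label in such a dataset traces back to a person who placed or reviewed it - which is precisely the labor this piece is about.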
Basically, if you want to achieve high-performance automation with ML, you need to invest in a considerable amount of manual labor to make that happen. And because human work is relatively expensive, producing these datasets costs a lot of money - which is the core of the problem: a classic conflict between cost/margin goals and working conditions/wages.
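The scale of that manual labor is easy to underestimate. A back-of-envelope calculation makes it tangible - all numbers below are illustrative assumptions, not industry figures:

```python
# Back-of-envelope: manual labeling effort for one camera dataset.
# All inputs are illustrative assumptions, not measured industry data.

frames = 1_000_000        # labeled camera frames in the dataset
boxes_per_frame = 8       # average objects to annotate per frame
seconds_per_box = 30      # time to place and review one bounding box

total_hours = frames * boxes_per_frame * seconds_per_box / 3600
person_years = total_hours / 1800   # assuming ~1800 working hours/year

print(f"{total_hours:,.0f} hours ≈ {person_years:,.0f} person-years")
# → 66,667 hours ≈ 37 person-years
```

Even under these modest assumptions, a single dataset swallows decades of human working time - which is why annotation workforces number in the thousands, and why the pressure to source that labor cheaply is so strong.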
Now, we are no strangers to handling this conflict in first-world societies. Unions are powerful in automotive; and over the decades there’s been a balance achieved between companies’ economic growth and workers earning a proper livelihood. Sure, someone’s always unhappy about something; but overall, things have been working out for us. When it comes to data annotation, however, the people performing that work are usually not part of the ‘us’ that is protected by this balance …
The Unfortunate Reality Of Labor Exploitation In Data Sourcing
Data annotation is typically outsourced to low-cost/best-cost countries. That is not problematic in and of itself. But unfortunately, the work is often also crowdsourced to individuals entirely outside social security systems - which means no workers' rights, no guarantee of equipment, no income stability, no health care or anything else we consider basic employment hygiene for ourselves.
One year ago, the MIT Technology Review published an article with the title “How the AI industry profits from catastrophe”. It was the result of in-depth research into data annotation workers' experience and shone an uncomfortably bright light on things:
“As the demand for data labeling exploded, an economic catastrophe turned Venezuela into ground zero for a new model of labor exploitation.”
This was the first time I was really exposed to these problems, which are by no means limited to Venezuela. Since then, I have read quite a few articles illuminating them - most recently, a Forbes piece looking at working conditions for data annotators in Southeast Asia and Africa. To quote from that article:
"In a 2022 study into working conditions on 15 digital labor platforms, University of Oxford researchers concluded [one major company]* met the 'minimum standards of fair work' in just two of 10 criteria, flunking equitable pay—which early employees say is pennies per hour on average—and fair representation.
[...] Lead researcher Kelle Howson compared data labelers on digital labor services like [said company]* to garment factory workers in many of the same countries. 'There is pretty much zero accountability for those working conditions,' she added."
*I've redacted the company name from the quote above. However, both mentioned articles and others report specifically on companies that are active suppliers to the automotive industry: This is not some Silicon Valley issue we can dismiss as irrelevant to our little niche; this is an ethical problem right in the foundation of data-defined ADAS/AD. Ignoring this problem means actively contributing to it, so how do we deal with the fact that our ML successes are built on the shoulders of exploited labor?
Facing Systemic Ethics Problems As An Individual
One possible view is that you should ‘let the consumer vote with their wallet’, driving low-cost suppliers to change: Shady business practices should theoretically not be sustainable; damages to a company’s public image as well as resulting revenue loss should eventually force them to course-correct - the market regulating itself. However, this argument places responsibility for large-scale, systemic injustice on individual buyers:
When I choose a chocolate brand in the supermarket, I don’t want the option of getting the one with child labor in its supply chain - that choice should not exist in our rich, advanced society. It should not be on consumers to ‘vote with their wallet’ whether human rights apply to people far away, and consumers shouldn't have to choose between that and paying their own bills. Similarly, I think companies that produce ethical alternatives should not have to compete with the low costs that come from child labor or violation of animal rights. And it seems pretty clear that no consumer has the ability to do a full supply chain audit of their supermarket’s chocolate aisle before picking a brand, anyway: The argument is buck-passing, and it serves to uphold the status quo.
Therefore, I think it is unfair to expect automotive companies/purchasers to ‘do the right thing’ on their own when sourcing data: This is a highly cost-driven industry, where single-digit margin improvements are huge successes. How are buyers inside such a system supposed to solve ethical sourcing on an individual basis; and how much auditing can they really be expected to perform?
What we need is a level playing field that excludes unethical offerings altogether, and for the remaining players to compete in the arena of quality, time and cost as we always do. Robbing thousands and thousands of workers of the fruits of their labor to undercut the competition’s prices is not a clever business strategy: It’s profiting from precarious working conditions while perpetuating them - and it should not even be on the shelf as an option. Perhaps better industry-wide supply chain (self-)regulation can help? I hope to hear more about this, and soon.
The fact that this is currently an unresolved systemic issue does not excuse complacency on the part of individual companies and buyers, however: It’s scandalous that the supermarket would sell products of child labor in the first place; and the blame for that lies with the system and the powerful players that enable it - but once you know that a specific chocolate bar comes from child labor, you should still stop buying it. The same is true for those of us leveraging data to bring ML-based products to market: Once we’ve become aware of the problem, we share responsibility. Our sourcing choices decide whether our company values are really values - or just buzzwords which can be disregarded in favor of better margins.
That shared responsibility is also the reason why I decided to write this. I will not be able to create any change through a LinkedIn newsletter, let’s be real … But I want to use what little platform I have to at least spread awareness and try to lend momentum: to a conversation that is uncomfortable, that is awkward, that I would rather avoid - but that needs to be had if our values are supposed to mean something.
---
That's it for today. I am aware that this piece does not offer much in terms of suggestions to improve the situation, but that is something I would love to talk more about: If you have any suggestions or related experiences to share, I absolutely encourage you to do so!
All the best
Tom Dahlström
---
Reader Comments

Well said.

Expert/Investor:
Good thing companies like IBEO and MICROVISION have auto annotation software that is available to everyone.

Embedded System Software & Safety, Self-Driving Vehicles, Consulting:
Thanks for bringing up this topic, Tom Dahlström. Obtaining massive amounts of training data is straining multiple aspects of our society, touching not only on the possible exploitation of cheap labor pools, but also on copyright abuse, leakage of trade secrets, and privacy concerns. For this particular topic, readers might not appreciate the scale of the issue: It is common for an autonomous vehicle company to have a data labeling effort with thousands of people. (It varies, but mid-thousands is an expected answer when you ask that question of someone involved.) The scope is significant.

Experienced Engineering Manager | Agile Leadership, Autonomous Driving, Validation Testing:
Certainly a difficult topic, but good to keep talking and reading about. Nicely written!