Stack Overflow users revolt: How should model developers react?

Sean Easter

Machine learning executive | Inclusive servant leader

Published: May 13, 2024

Last week, Tom's Hardware reported that many Stack Overflow (SO) contributors are outraged to learn SO will partner with OpenAI, and some users have sought to remove their own content, sometimes citing deletion protections of GDPR. "However," Tom's Hardware reports, "Stack Overflow's Terms of Service contains a clause carving out Stack Overflow's irrevocable ownership of all content subscribers provide to the site."

The full article includes a link in that sentence. Follow it and you'll learn (or be reminded) that Stack's ability to do this relies on its contributors agreeing to license their content under a Creative Commons Attribution-ShareAlike (CC BY-SA 4.0) license. As one such contributor, and a machine learning pro working on and with LLMs, I offer a few reflections on my motivation for authoring content on Stack Exchange and licensing it, and a few questions for all of us looking to build responsible AI to weigh in on. (I think "carving out Stack Overflow's irrevocable ownership" goes too far on the article's part, but deadlines are tight.)

Though I've contributed to SO, my primary home on Stack Exchange is Cross Validated (CV), the site for stats and machine learning. My motivations for contributing were to develop and demonstrate quantitative competency, and to help others wherever I could. CV has a special place in my heart as the home to my only known top Google result: Kick open an incognito tab and search for "mcmc vs variational inference" and the first link you'll see is a question where my answer is the top-voted response. That and an answer on PAC theory tend to get clusters of seasonal upvotes around Jan-Feb and Sep-Oct of each year (I say based on memory), my reminder that semesters are starting and someone somewhere is taking a Bayesian methods or introductory machine learning course and benefitting from a few answers I wrote years ago.

This experience sums up the main reasons many of us contribute to the Stack Exchange sites focused on technical or professional disciplines: a few free internet points ("reputation") that remind us we've helped someone, and the opportunity to develop a public body of work showing off what we know. I don't actively contribute to CV these days, but when I did, I thought these rewards a completely fair exchange for the non-exclusive CC BY-SA 4.0 license. Why?

CC BY-SA 4.0 gives an irrevocable license for all uses, even commercial, provided that the licensee gives proper attribution: "You must give appropriate credit, provide a link to the license, and indicate if changes were made." The licensee must likewise share their transformations under the same license: "If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original." Together, these terms are a reasonable assurance that redistribution would support my goals of helping my professional communities by sharing things I know, and furthering my reputation by authorship demonstrating I know them. Someone could even edit my answers to create new ones, provided they noted source and changes, e.g., "This answer changed that of Sean Easter to account for developments since its original creation." What does CC BY-SA 4.0 mean for LLMs, and would I have licensed my content if LLMs had existed then?

On the topic of how CC BY-SA 4.0 affects LLMs, I have precisely zero legal knowledge to offer. But as we wait for my attorney pals to find my snitch tags and weigh in (Hi Pedro Pavón! Hi Lisa A. Hayes! Hi Alisa B. Hall! Hi Jeff Dietz!), as a Stack author and deep learning geek, here are my perspectives on attribution and share alike:

If any LLM responses that use my content attribute me, as an author I'll consider this requirement satisfied.
Share alike's requirement to distribute remixes or transformations is where things get interesting. What's a generative model, if not a transformation of its training data designed to produce remixes?

If someone asks OverflowAI, "Should I use MCMC or variational inference for this model," I'd take no issue with answers that summarize, paraphrase or quote my work as long as a link were included. (My understanding is that OverflowAI will be programming focused, making this example a little tortured. Bear with me.) This kind of use is just an ML-powered version of someone asking the same question in a Slack and having a helpful human being share a link to my work. It makes for a cool RAG case: Attribute any documents retrieved from the index to the author and source, and I as one author would consider your attribution obligation to me to be met.
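
To make the RAG case concrete, here is a minimal sketch in Python of what "attribute any documents retrieved from the index" could look like. Everything in it is an assumption for illustration: the `Document` fields, the toy word-overlap retriever, the function names, and the wording of the credit line are all hypothetical, and none of it describes how OverflowAI or any real product works. It only shows the idea that each retrieved source carries its author, link, and license through to the response.

```python
from dataclasses import dataclass

LICENSE_NAME = "CC BY-SA 4.0"
LICENSE_URL = "https://creativecommons.org/licenses/by-sa/4.0/"


@dataclass(frozen=True)
class Document:
    """A retrievable post plus the metadata needed to credit it (hypothetical schema)."""
    author: str
    title: str
    url: str
    text: str


def retrieve(query: str, index: list[Document], k: int = 2) -> list[Document]:
    """Toy retriever: rank documents by the number of words shared with the query."""
    query_words = set(query.lower().split())

    def overlap(doc: Document) -> int:
        return len(query_words & set(doc.text.lower().split()))

    ranked = sorted(index, key=overlap, reverse=True)
    # Keep at most k documents, and only those that share a word with the query.
    return [doc for doc in ranked[:k] if overlap(doc) > 0]


def attribution_line(doc: Document, modified: bool) -> str:
    """Credit the author, link the source and the license, and say if changes were made."""
    change_note = "paraphrased from the original" if modified else "quoted verbatim"
    return (
        f'"{doc.title}" by {doc.author} ({doc.url}), '
        f"licensed under {LICENSE_NAME} ({LICENSE_URL}); {change_note}."
    )


def answer_with_sources(generated_text: str, sources: list[Document],
                        modified: bool = True) -> str:
    """Append one attribution line per retrieved source to the generated answer."""
    if not sources:
        return generated_text
    credits = "\n".join(f"- {attribution_line(doc, modified)}" for doc in sources)
    return f"{generated_text}\n\nSources:\n{credits}"
```

A caller would pass the user's question to `retrieve`, hand the hits to whatever model writes the answer, and wrap the result with `answer_with_sources`. Whether a credit line like this satisfies the license is a legal question; the sketch only covers the mechanics.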

In this same example, I (am not a lawyer and nonetheless) think that quotes with attribution fall under fair use: Quoting me is not remixing or transforming my work. It would not, then, require CC BY-SA 4.0 distribution. If a corporate model developer cited my work in their recommendation to use MCMC, I wouldn't expect their employer to publicly distribute the presentation under CC BY-SA 4.0. I chuckle at the prospect.

But if you're scraping my work, transforming it to numerical representations to train a base LLM, to which you send queries and random seeds to generate novel remixes, how should I expect you to meet your redistribution requirement for those responses? How am I to know to what extent any LLM response contrasting MCMC and variational inference has remixed my work? (There are many better references; my own post is a glorified link to one.)

Put another way, I as an author who licensed content under CC BY-SA 4.0 might argue:

Any model transforms its training data.
Any generative model response constitutes a remix of the same training data.
Any LLM trained on CC BY-SA 4.0 content is therefore required to distribute any of its responses under the same license.

This argument is obviously pretty scary to anyone trying to create a moat around their LLMs or create private services around them. It would preclude guarantees of confidentiality to customers paying for responses from models trained on CC BY-SA 4.0 content. (Cf. those that retrieve that content without using it in training data.) It also relies on my legal-layperson interpretations of "remix" and "transform," which no one should trust, except inasmuch as they demonstrate the thinking of one author who (1) licensed their content according to a license published in November 2013 and (2) could not possibly have reasoned through how these terms would intersect with the development and use of LLMs years later. How could I have licensed a freedom of use I could not have even conceived? If I could have, would I have made a different decision and contributed less content? Who knows, but many outraged Stack contributors are doing exactly that, and I was much more prone to anger when I was young.

Concluding questions for a productive conversation:

To my lawyer friends, how should we model developers think about licensee obligations when training models? How have courts and legal minds interpreted these topics?
To my deep learning pals, how else might we implement guarantees of licensing requirements like attribution and share-alike in generative applications?
