RFC: Banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
- Reading through the linked thread at https://github.com/pkgxdev/pantry/issues/5358, I'm in total agreement with Gentoo.
- I can't say how strongly I disagree with the ethical/copyright concerns raised here. - The idea that intelligences, whether they be human, artificial or alien, should be forbidden from learning from code freely shared on the internet goes against everything I like about open source. - I think it's fair that no one should be able to use reproduced copyrighted code verbatim, whether that be by a human memorizing something or a computer copying it. - But I take the complete opposite view on the ethics of letting machines learn from existing work. I think this should be encouraged.
- It looks like a knee-jerk reaction by some "AI" hater rather than a well-thought-out request. - - 7 instances of the word "shit". I don't mind swearing, but it is indicative of the author being maybe a bit too emotional for a technical proposal. - - It is unnecessarily broad. Not using AI to create bug reports? What if you use AI tech to find a bug? Are you not allowed to report it? The stated issue here seems to be mostly about code completion, but it is stretched to everything AI-related, everywhere. - The points raised are copyright, quality and ethics, which are valid points, but not specific to AI. - Copyright: You have the same problem when copy-pasting code, and people do that; you can't really single out AI. Instead of banning AI, a more sensible guideline would be to just be aware of copyright when importing code from elsewhere, including AI-generated code, but also copy-pasting from online sources (e.g., Stack Overflow) and using external libraries. There are tools to check for copyright compliance (a toy sketch follows this comment). - Quality: AI-generated code is often lower quality, but so is code written by bad coders; judge by the quality of the contribution, not by how it is done. As for the "we can't really rely on all our contributors being aware of the risks", maybe start by picking contributors you can rely on. And if you think they may not be aware of the risks, tell them about the risks rather than saying "you can't do that". - Ethics: I don't know what Gentoo stands for, but I'm guessing it is mostly about making a good source-based Linux distribution. Don't hijack the project for your own goals. Now, I have no problem with a Linux distribution that has "no AI" as one of its core values, but it doesn't have to be Gentoo.
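- To make the "tools to check for copyright compliance" point concrete: real scanners such as ScanCode or FOSSology are far more sophisticated, but a toy sketch of the underlying idea, fingerprinting normalized windows of a contribution and flagging verbatim overlap with a known corpus (all names below are illustrative), could look like this:

```python
# Toy verbatim-overlap checker: hash normalized 5-line windows of a
# contribution and compare against windows from a reference corpus.
# Real compliance scanners are far more robust; this only shows the idea.
import hashlib

WINDOW = 5  # lines per fingerprint window


def fingerprints(source: str) -> set[str]:
    # Strip whitespace and blank lines so trivial reformatting can't hide a copy.
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    return {
        hashlib.sha1("\n".join(lines[i:i + WINDOW]).encode()).hexdigest()
        for i in range(max(len(lines) - WINDOW + 1, 1))
    }


def verbatim_overlap(contribution: str, corpus: list[str]) -> float:
    """Fraction of the contribution's windows found verbatim in the corpus."""
    contrib = fingerprints(contribution)
    known: set[str] = set().union(*map(fingerprints, corpus)) if corpus else set()
    return len(contrib & known) / len(contrib) if contrib else 0.0
```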
- No, thank you. - "The goal of Gentoo is to design tools and systems that allow a user to do that work as pleasantly and efficiently as possible, as they see fit." https://www.gentoo.org/get-started/philosophy/ - And this is what Torvalds had to say about LLM-enhanced submissions to the kernel. https://www.youtube.com/watch?v=w7-gJicosyA 
- This seems insane and totally unenforceable. - Even when I write every single line of code myself, I use AI to ask questions about the programming language or the library that I'm using. Banning that is just handicapping yourself. - I do like the sentiment. You absolutely do not want people to commit code they don't understand themselves, but the solution isn't to outright ban AI. The solution is to have trusted, knowledgeable developers who are aware of the limits of AI and use it appropriately.
- Those concerns seem legit? Surprised at the negativity here. 
- Seems impossible to enforce, but I applaud the spirit of it. - Pretty soon anyone looking to add "open source contributor" to their GH profile can take a Gentoo issue, ask an AI to cook up a solution, put that on a PR, and send it in. - This will be a nightmare for maintainers. I'm not sure there is a solution, since AI usage will spread regardless of how good/accurate it is, and there's no way for us to differentiate between plausible bullshit and actual contributions without reading each one carefully. Reputation of contributors is probably the best proxy for genuine contributions, but that's a catch-22, so it can't be the only way.
- > The AI bubble is causing huge energy waste. - Pretty ironic coming from a distribution that requires every user to compile everything from source. 
- Seems like a reasonable move to me. - I’ve seen folks use ChatGPT to generate code and review it for security flaws. It does often solve the tasks. And it leaves behind many kinds of vulnerabilities: injections, overruns, etc. - Based on the little empirical evidence we have about informal code review [0], it seems that we ought to limit or outright ban generated code. A human can only read so much code before their impact on catching errors significantly drops. OSS project maintainers have enough on their plate and we don’t need to exhaust them by trying to maintain AI-generated code. - [0] https://sail.cs.queensu.ca/data/pdfs/EMSE_AnEmpiricalStudyOf...
- Why not evaluate contributions based on how well the code/documentation is written? What does it matter who wrote it, if it's good? Assuming no spamming by bots.
- I do write some niche open source projects, ones which: - - have been written with CoPilot enabled in my editor, - - optionally use GPT 3.5 as a translation API, and - - use OpenAI's text-to-speech model to generate spoken dialog files for testing. - I suppose I can try to mark my projects in such a way as to inform Gentoo that it's against their policy to package them. - Overall, I would guess that my CoPilot-assisted code is slightly worse than code I hand-craft. The biggest difference seems to be that with CoPilot I write fewer tiny functions, and I tend to keep more related code in one place. On the other hand, CoPilot makes writing test code extremely quick. And I'm not talking about generic boilerplate here: CoPilot can write non-trivial parser or type inference code that relies heavily on internal project APIs that do not exist outside my project. - Overall, I'd guess that CoPilot allows me to produce twice as much code at 90-95% of the quality. Which, since we're talking about open source projects that I maintain in my spare time (and that were painfully over-engineered to begin with), is probably a decent tradeoff.
- > In other words, explicitly forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to create ebuilds, code, documentation, messages, bug reports and so on for use in Gentoo. - Maybe a naive question but, how will they know? 
- All very well, but the question is: how will you know? And if you can’t reliably differentiate between AI and human contributions, how could this be enforced?
- But DeepL is fine, apparently (as it should be, in my opinion), so I guess what maintainers will and won't allow is going to be a random dice roll. - I can understand the sentiment behind this proposal, but it is way too nuanced and complicated to solve with a few basic rules.
- > In other words, explicitly forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to create ebuilds, code, documentation, messages, bug reports and so on for use in Gentoo. - On the one hand, I’m on the fence about this heavy-handed approach. Tons of people, myself included, use AI assistants to create high quality work in less time. Of course I’m also aware of tons of low quality garbage. - On the other hand, I’m all for banning automated submissions, which have been on the rise for the past couple of years and are often thinly veiled (if veiled at all) ads for AI startups. GitHub in particular should allow owners to ban all unsanctioned bots, and report unlabeled bots.
- The quality concerns are absolutely justified. The complaints about energy use sound like unfounded, extremely far-fetched arguments just used by people who don't like LLMs for other reasons. - The inference energy cost is likely on the same order of magnitude as that of the computer + screen used to read the answer (higher wattage, but much shorter time to generate the response than to formulate the request and read it). - The training energy cost is significant only if we ignore that the model is used by many people. For GPT-3, I've seen plausible estimates of ~1 GWh, which works out to about 400 tons of CO2, roughly as much as a single long-distance plane round trip (total, not per passenger; fuel consumption only). Estimates for newer models usually ignore the existence and likely use of more efficient accelerators.
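- As a sanity check on the arithmetic above: assuming a grid intensity of ~0.4 kg CO2 per kWh (an assumed average, just like the ~1 GWh figure itself), the numbers are consistent:

```python
# Back-of-envelope check; both inputs are estimates, not measured values.
training_energy_kwh = 1_000_000   # ~1 GWh claimed for GPT-3-scale training
grid_kg_co2_per_kwh = 0.4         # assumed grid average (~400 g CO2/kWh)

co2_tonnes = training_energy_kwh * grid_kg_co2_per_kwh / 1000  # kg -> tonnes
print(co2_tonnes)  # -> 400.0, matching the ~400 tons cited above
```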
- It's interesting, because you can see it as both a very conservative approach and a high-risk stance, which don't seem like common bedfellows. - If you acknowledge up front that AI is unfit for purpose and is very likely to introduce some serious security problems, then it seems wise. When it turns out the LLMs have all been compromised to insert backdoors into the compiler toolchain, you win by being the last distro left standing. You could look at it as a very high-risk strategy for the same reason, if you think you'll be "left behind". Either way, who dares wins (or dies). Dare to go against the mob, or dare to bet the farm on a principle. - AFAIK Gentoo is one of the more conservative communities. But I'd also expect to see this policy being considered in BSD circles too.
- Define "use". What kind of uses? Verbatim code sourced from AI or using the technology in general? - I've used AI to learn about massive codebases. It's a bit stupid but still extremely helpful. The free ChatGPT was capable of explaining the concepts in the code and the file system structure of the project, allowing me to get started much faster. It sure as hell beats being a help vampire on some IRC channel or mailing list. - This technology is literally too good to be banned. We should be working on taking it as far as humanly possible by getting it running locally and completely uncensored. 
- RFC: Banning contributions written on systems with proprietary software 
- This is from February, is there any update or progress? 
- I think it is a more general problem. One can't see in Git which tools were used to create and validate the code. - It is impossible to use most modern devtools completely without AI. I think it is better to regulate the usage and enforce transparency, e.g., via commit-message trailers (sketched below).
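- One way "enforce transparency" could work in practice is a commit-message trailer convention. Git does not define such a trailer, so the "Assisted-by:" name and the checker below are purely a hypothetical sketch:

```python
# Minimal sketch: flag commits that lack a hypothetical "Assisted-by:" trailer
# (e.g. "Assisted-by: GitHub Copilot" or "Assisted-by: none"). The trailer
# name is an assumed project convention, not a Git or Gentoo standard.
import subprocess

TRAILER = "Assisted-by:"


def commits_missing_trailer(rev_range: str = "origin/main..HEAD") -> list[str]:
    """Return short hashes of commits in rev_range whose message lacks the trailer."""
    log = subprocess.run(
        # %h = short hash, %B = raw body; NUL/SOH bytes keep entries separable.
        ["git", "log", "--format=%h%x00%B%x01", rev_range],
        capture_output=True, text=True, check=True,
    ).stdout
    missing = []
    for entry in log.split("\x01"):
        if not entry.strip():
            continue
        sha, _, body = entry.partition("\x00")
        if TRAILER not in body:
            missing.append(sha.strip())
    return missing


if __name__ == "__main__":
    for sha in commits_missing_trailer():
        print(f"{sha}: missing {TRAILER} trailer")
```

A CI job could run something like this on each pull request and ask the author to declare whatever tooling was used.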
- I haven't used Gentoo in a while, but if I went back to it and had to manipulate ebuilds, LLMs would absolutely be involved. Even if it were enforceable, a no-LLM policy would deprive infrequent contributors of one of the most powerful coding tools we've invented. It would be as inane as banning man pages or syntax highlighting. While I understand the goal of high-quality input, the net result would just be to gatekeep the process to experienced contributors, to the detriment of the project in the long run. LLMs are here to stay and are way too useful to be ruled out.
- In completely unrelated news (/s), supposedly Microsoft is silently (?) installing Copilot on Windows Server 2022 systems?
- Not to be too snarky, but this is exactly the proposal and associated conversation that I’d expect to see around Gentoo. It is as impractical as it is unnecessarily standoffish. - There are multiple repliers that very clearly don’t understand how LLMs work. “It’s computers, so I can intuit it!” is typical techie hubris. 
- I am not sure what “AI” means but: Yes, yes. It is past time we do it. - People are producing plausible-sounding bullshit because of the ease of just cranking out and iterating code quickly. And we have made it way too easy to incorporate potentially copyrighted other people’s code. - As Donald Knuth would know, back when you had to generate punch cards and stand in line to load them on the mainframe, you spent a lot more time carefully designing and logically working through your code, instead of producing massive amounts of plausible bullshit. So yes, I agree. - You are talking about banning interactive editors with copy/paste and interactive compilers and debuggers, right?
- As a Gentoo cultist… no thanks! 
- Based 