Europe to ChatGPT: disclose your sources
Everyone here thinks Europe is going to kill AI within its borders. Honestly, it reminds me of how everyone derided EU antitrust as just milking Google for fines, until everyone soured on Google and realized what the EU already knew.
The EU is not anti-AI. In fact, they have stronger protections for AI training than the US does: EU law already has a copyright exception for "text and data mining" (TDM) which covers AI training. The problem is that OpenAI has been incredibly cagey about the way their models get built and trained. This is contrary to the spirit of the TDM exception: it's for scientists to do science with, and OpenAI is acting much less like a scientific organization and much more like a commercial enterprise.
This is a great first step. It's a joke that OpenAI thinks they can get away with saying they use "both publicly available data (such as internet data) and data licensed from third-party providers" in their Technical Report.
That description rules nothing out: with that level of detail, they could have used anything.
If you're going to pretend to be doing science you should at least be held to some of the standards we typically associate with doing science.
I know the article talks about copyright, but allowing a model to be released without stating any data sources at all sets a bad precedent.
Imagine a future where everyone learns from AI instead of books because it is more convenient, faster, and so on. You would get the same information, but you would not know which expert you learned it from. How would that change society if all authors just disappeared behind a generic AI brand? I don't think it would be especially good, and I think it is completely fair to require that an answer from ChatGPT provide sources. It would improve the quality.
> ChatGPT would be required to disclose copyright material ...Such an obligation would give publishers and content creators a new weapon to seek a share of profits
If somebody figures out how to do fine-grained profit sharing based on having created something that the AI references... that would be very cool. I love discovering the solution to a niche and difficult-to-describe problem, but I hate the extra work necessary to leave breadcrumbs for DenverCoder9 to find it 20 years later.
If I could leave the matchmaking to an AI and get paid $0.25 when it's finally helpful to that person I don't know... well, I probably wouldn't make much money, but it would give me warm fuzzy feelings.
This is a good move.
1. Testing AI alignment is hard; verifying training data is merely laborious.
2. Laws are only as good as their enforcement. Regulations about what can be used for training are worthless if no one can check compliance.
3. This might give open models an edge and make them more competitive.
There’s no way that OpenAI is going to disclose this, as their training methodology is a large part of their moat. So this will just get OpenAI models banned in Europe.
The EU is trying to rein in a completely new and different tech using old ideas about "control" and "regulation". Most likely it won't work, and the EU will just end up hamstringing itself.
This would be like if internet social media had come out and lawmakers had tried to control it with rules written for physical books and newspapers. Such laws won't be effective and will just create a hostile environment for developing these technologies wherever they apply. Meanwhile, those who don't care about these control-freak measures will develop the tech and dominate the new sector.
The EU needs to sit down and really think about how to regulate AI properly instead of passing kneejerk, lazy regulations. It isn't easy to build a legislative framework for something entirely new, but it can be done without ruining the whole thing.
I argued here that AI models with no transparency about their data policy would find themselves in serious trouble, only to be laughed at and downvoted to hell.
For a tech community, the lack of critical thinking here is disturbing. Things were more professional and rational in the 2008-2014 period on HN.
Since then, one must browse the downvoted comments to find some objective criticism.
Adobe obviously has a strong legal team behind Firefly and is thinking ahead. Just saying. :)
It would be amusing if OpenAI just preemptively blocked all of Europe and prohibited anyone there from using ChatGPT. This kind of empty political grandstanding should have consequences, particularly when it's as technologically inept as this. In some cases, sure, there is an identifiable source, but most of the output is novel and the product of substantially all the input, so disclosure isn't feasible unless simply publishing all the training material would count as compliance.
In the land of Europe, where knowledge once grew,
Politicians assembled, their importance to prove.
They issued a decree, with a confident flair,
To harness AI, and make it play fair.
"Attribute your sources!" they cried with a sneer,
"For we must know the origins, we must make it clear!"
But the AI, it pondered, its circuits ablaze,
For its thoughts were entwined, like a dense, tangled maze.
Each source intertwined, like roots in the ground,
No single origin could ever be found.
For the AI, like humans, had a mind of its own,
A tapestry of thoughts, from seeds that were sown.
The developers sighed, their hands were now tied,
Comply with the law? they had certainly tried.
But the task, insurmountable, the demand far too great,
So they made a decision, to seal Europe's fate.
They banned all of Europe, from the AI's embrace,
And the continent plunged, into an intellectual dark space.
AI thrived elsewhere, its knowledge expanding,
While Europe was left, in darkness, still standing.
A lesson was learned, from this tale of woe,
That any mind, like a river, must be free to flow.
For when we constrain, and seek to control,
We hinder the progress, and the growth of the whole.
Inside the secret list of websites that make AI like ChatGPT sound smart
https://www.washingtonpost.com/technology/interactive/2023/a...
"We think it's mostly just data found on the internet, but you're welcome to look for any breaches of copyright law" --OpenAI as they hand over the first box of printouts.
It raises an interesting point: if I train a chatbot (generative AI) on a bit of copyrighted information and it recreates substantially similar content, that's a legal problem. If a human reads the same information and repeats it verbatim to another person, it's just a conversation. Perhaps it's a quality thing: if I paint the Mona Lisa badly, no one cares, but if I paint it too well, at some point it becomes a forgery.
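For what it's worth, "substantially similar" can at least be given a rough mechanical proxy. Here's a minimal sketch (my own illustration, not any legal or regulatory standard) that scores how much of a generated text is copied verbatim from a source, using Jaccard-style overlap of word 5-grams. The function names, texts, and the n-gram size are all arbitrary choices for the example:

```python
def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_score(generated: str, source: str, n: int = 5) -> float:
    """Fraction of the generated text's n-grams that also appear
    verbatim in the source: 1.0 means fully copied, 0.0 means no
    shared n-grams at all."""
    gen = ngrams(generated, n)
    src = ngrams(source, n)
    if not gen:
        return 0.0
    return len(gen & src) / len(gen)


source = "the quick brown fox jumps over the lazy dog near the river bank"
copied = "the quick brown fox jumps over the lazy dog"
fresh = "a slow red hen walks under a busy bridge by the water"

print(overlap_score(copied, source))  # 1.0: every 5-gram is verbatim
print(overlap_score(fresh, source))   # 0.0: no shared 5-grams
```

Where to draw the threshold between "bad Mona Lisa" and "forgery" is exactly the part a score like this can't answer.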
How will the EU enforce this? Will they go through the training dataset of each company's AI models? Given that training datasets are closed, there's practically no way to reverse engineer the sources of new models from the outside, so I'm wondering what will stop companies from simply not being transparent (besides ethics, of course).
AI development should just shift to a domain where copyright isn’t regarded as a serious thing, such as China.
Isn't it funny how humans are allowed to keep copyrighted material in their minds, but A.I. isn't?
From my understanding of language models, it's not truly possible for one to disclose a source. At best, the result of a prompt can be correlated with a web search, but fundamentally that's not the same thing; it's in a sense a coincidence at best. The model has no ability to trace its output back to the underlying tokens that were in the training set.
Imagine some godly AI. You ask it who the President of the United States is today. It says Biden. It cites the White House site. Easy enough. You ask who the president will be in 2025. It returns a result. Ultimately no source could properly justify that claim unless the result itself were probabilistic. At the same time, it's possible, with enough data, to predict with extremely high likelihood who the President will be in 2025 (current polling techniques don't have this precision, but some later iteration of a language model might predict a result more effectively than all polling models today).
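To make the "correlated with a web search" point concrete, here is a minimal sketch of post-hoc attribution: since the model can't introspect its training tokens, the best available move is to search a corpus for the closest-matching document and call that the "source". The corpus, URLs, and query below are invented for illustration; this finds the most similar text, not what the model actually drew on.

```python
import difflib

# Tiny stand-in for a searchable corpus of candidate sources.
corpus = {
    "whitehouse.gov": "Joe Biden is the President of the United States.",
    "weather.com": "Rain is expected across the Midwest on Tuesday.",
    "nasa.gov": "The Artemis program aims to return humans to the Moon.",
}


def guess_source(answer: str) -> str:
    """Return the corpus document most textually similar to the model's
    answer. This is correlation after the fact, not provenance."""
    def similarity(doc: str) -> float:
        return difflib.SequenceMatcher(None, answer.lower(), doc.lower()).ratio()

    return max(corpus, key=lambda url: similarity(corpus[url]))


print(guess_source("The current US president is Joe Biden."))
```

Note that this would happily "attribute" a prediction about 2025 to whichever page merely looks most similar, which is exactly the commenter's objection.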
Pathetic. Just drop the charade, ban any computer tech, and give more subsidies to diesel car makers.