OpenAI's bot crushed this seven-person company's web site 'like a DDoS attack'

  • Recent and related:

    AI companies cause most of traffic on forums - https://news.ycombinator.com/item?id=42549624 - Dec 2024 (438 comments)

  • This keeps happening -- we wrote about multiple AI bots that were hammering us over at Read the Docs for >10TB of traffic: https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse...

    They really are trying to burn all their goodwill to the ground with this stuff.

  • > “OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it’s way more,” he said of the IP addresses the bot used to attempt to consume his site.

    The IP addresses in the screenshot are all owned by Cloudflare, meaning that their server logs are only recording the IPs of Cloudflare's reverse proxy, not the real client IPs.

    Also, the logs don't show any timestamps and there doesn't seem to be any mention of the request rate in the whole article.

    I'm not trying to defend OpenAI, but as someone who scrapes data I think it's unfair to throw around terms like "DDoS attack" without providing basic request-rate metrics. The claim seems to be based purely on the use of multiple IPs, which in this case is an artifact of their own server configuration and has nothing to do with OpenAI.
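
    For reference, if you want your logs to show real client IPs rather than Cloudflare's proxy addresses, you have to restore them from the forwarding headers. A minimal sketch in Python, assuming Cloudflare's documented CF-Connecting-IP header is being passed through (the addresses below are illustrative):

      # Recover the real client IP behind Cloudflare for logging purposes.
      def real_client_ip(headers: dict, remote_addr: str) -> str:
          """Prefer Cloudflare's header, fall back to X-Forwarded-For, then the socket peer."""
          if "CF-Connecting-IP" in headers:
              return headers["CF-Connecting-IP"].strip()
          if "X-Forwarded-For" in headers:
              # The left-most entry is the original client; only trust this behind your own proxy.
              return headers["X-Forwarded-For"].split(",")[0].strip()
          return remote_addr

      # Example: the log line should record 203.0.113.7, not the proxy's 172.68.0.1.
      headers = {"CF-Connecting-IP": "203.0.113.7", "X-Forwarded-For": "203.0.113.7, 172.68.0.1"}
      print(real_client_ip(headers, remote_addr="172.68.0.1"))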

  • I’ve been a web developer for decades, as well as scraping, indexing, and analyzing millions of sites.

    Just follow the golden rule: don’t ever load any site more aggressively than you would want yours to be.

    This isn’t hard stuff, and these AI companies have grossly inefficient and obnoxious scrapers.

    As a site owner this pisses me off as a matter of basic decency on the web, but as an engineer doing distributed data collection I’m offended by how shitty and inefficient their crawlers are.
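
    The golden rule isn't complicated to implement, either. A minimal sketch (the site, user agent, and 5-second delay are illustrative, not anyone's production crawler):

      import time
      import urllib.request
      import urllib.robotparser

      BASE = "https://example.com"
      USER_AGENT = "polite-example-bot"
      DELAY_SECONDS = 5.0

      # Honor robots.txt before fetching anything.
      robots = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
      robots.read()

      def polite_fetch(paths):
          for path in paths:
              url = BASE + path
              if not robots.can_fetch(USER_AGENT, url):
                  continue  # the site said no; skip it
              req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
              with urllib.request.urlopen(req) as resp:
                  yield url, resp.read()
              time.sleep(DELAY_SECONDS)  # one request at a time, never hammer the host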

  • It's "robots.txt", not "robot.txt". I'm not just nitpicking -- it's a clear signal the journalist has no idea what they're talking about.

    That, and the fact that they're using a log file with the timestamps omitted as evidence of "how ruthlessly an OpenAI bot was accessing the site," makes the claims in the article a bit suspect.

    OpenAI isn't necessarily in the clear here, but this is a low-quality article that doesn't provide much signal either way.

  • It's funny how history repeats. The web originally grew because it was a way to get "an API" into a company. You could get information without a phone call. Then, with forms and credit cards and eventually actual APIs, you could not only get information, you could get companies to do stuff. For a short while this was possible.

    Now everybody calls this abuse. And a lot of it is abuse, to be fair.

    Now that has been mostly blocked. Every website tries really hard to block bots (and mostly fails, because Google pours millions into its crawler while companies raise a stink over paying a single SWE), yet automated interactions with companies (through third-party services, for example) are still not really possible. I cannot give my credit card info to a company and have it order my favorite foods to my home every day, for example.

    What AI promises, in a way, is to re-enable this, because AI bots are unblockable (they're more human than humans as far as these tests are concerned). For companies, and for users. And that would be a way to... put APIs into people and companies again.

    Back to step 1.

  • First time I heard this story it was '98 or so, and the perp was somebody in the overfunded CS department and the victim somebody in the underfunded math department on the other side of a short and fat pipe. (Probably running Apache httpd on an SGI workstation without enough RAM to even run Win '95.)

    In years of running web crawlers I've had very little trouble, yet I've had more trouble in the last year than in the previous 25. (I wrote my first crawler in '99; funny how my crawlers have gotten simpler over time, not more complex.)

    In one case I found a site got terribly slow although I was hitting it at much less than 1 request per second. Careful observation showed the wheels were coming off the site and it had nothing to do with me.

    There's another site that I've probably crawled in its entirety at least ten times over the past twenty years. I have a crawl from two years ago; my plan was to feed it into a BERT-based system, not for training but to discover content that is like the content I like. I thought I'd get a fresh copy w/ httrack (polite, respects robots.txt, ...) and they blocked both my home IP addresses in 10 minutes. (Granted, I don't think the past 2 years of this site were as good as what came before, so I will just load what I have into my semantic search & tagging system and use that instead.)

    I was angry about how unfair the Google Economy was back in 2013, in line with what this blogger has been saying ever since:

    http://www.seobook.com/blog

    (I can say it's a strange way to market an expensive SEO community but...) and it drives me up the wall that people looking in the rear view mirror are getting upset about it now.

    Back in '98 I was excited about "personal webcrawlers" that could be your own web agent. On one hand, LLMs could provide so much utility in terms of classification, extraction, clustering, and otherwise drinking from that firehose; on the other, the fear that somebody is stealing their precious creativity is going to close the door forever... and entrench a completely unfair Google Economy. It makes me sad.

    ----

    Oddly those stupid ReCAPTCHAs and Cloudflare CAPTCHAs torment me all the time as a human but I haven't once had them get in the way of a crawling project.

  • People who have published books recently on Amazon have noticed that fraudulent knockoff copies, with the title slightly changed, appear almost immediately. These are created by AI and are competing with humans. A person this happened to was recently interviewed about their experience on the BBC.

  • From the article:

    "As Tomchuk experienced, if a site isn’t properly using robot.txt, OpenAI and others take that to mean they can scrape to their hearts’ content."

    The takeaway: check your robots.txt.

    How much load robots can reasonably generate when they are allowed to crawl is a separate question.
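
    If you want to check what a policy actually permits, the standard library can parse it for you. A minimal sketch (the rules below are an example policy that blocks OpenAI's documented GPTBot user agent, not Triplegangers' real file):

      import urllib.robotparser

      rules = [
          "User-agent: GPTBot",
          "Disallow: /",
          "",
          "User-agent: *",
          "Allow: /",
      ]

      rp = urllib.robotparser.RobotFileParser()
      rp.parse(rules)

      print(rp.can_fetch("GPTBot", "https://example.com/browse/"))        # False: blocked
      print(rp.can_fetch("SomeOtherBot", "https://example.com/browse/"))  # True: allowed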

  • Sites should learn to use HTTP 429 (Too Many Requests) to slow bots down to a reasonable pace. If the bots are coming from a subnet, apply it to the whole subnet, not to individual IPs. No other action is needed.
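
    A minimal sketch of per-subnet throttling (the window, budget, and /24 bucketing are illustrative; a real setup would handle IPv6 prefixes and use a proper sliding window):

      import ipaddress
      import time
      from collections import defaultdict

      WINDOW_SECONDS = 60
      MAX_REQUESTS_PER_SUBNET = 600
      counters = defaultdict(list)  # subnet -> timestamps of recent requests

      def check_rate_limit(client_ip: str):
          """Return (status, headers) the caller should use for this request."""
          subnet = ipaddress.ip_network(client_ip + "/24", strict=False)
          now = time.monotonic()
          recent = [t for t in counters[subnet] if now - t < WINDOW_SECONDS]
          recent.append(now)
          counters[subnet] = recent
          if len(recent) > MAX_REQUESTS_PER_SUBNET:
              return 429, {"Retry-After": str(WINDOW_SECONDS)}  # whole subnet backs off
          return 200, {}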

  • I used to have a problem with some Chinese crawlers. First I told them no with robots.txt, then I saw a swarm of non-bot user agents coming from cloud providers in China, so I blocked their ASNs, and then another wave of IPs from some Chinese ISP, so eventually I had to block the entire country_code = cn and just serve them a robots.txt.

  • What options exist if you want to handle this traffic and you own your hardware on prem?

    It seems that any router or switch over 100G is extremely expensive, and often requires some paid-for OS.

    The pro move would be not to block these bots. Well, I guess block them if you truly can't handle their request throughput (would an ASN blacklist work?).

    Or, if you want to force them to slow down, respond to only a random fraction of requests (say, ignore 85% of the traffic they spam you with and reply to the rest at a very low rate, or purposely send bad data). A toy version of that idea is sketched below.

    Or perhaps reach out to your peering partners and talk about traffic-shaping these requests.
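
    A toy sketch of the "reply to only a fraction" idea (the drop probability and stall time are made up, and the ASN flagging is assumed to happen elsewhere):

      import random

      IGNORE_PROBABILITY = 0.85   # silently drop ~85% of flagged traffic
      SLOW_RESPONSE_DELAY = 10.0  # seconds to stall the requests we do answer

      def handle_flagged_request():
          """Decide what to do with a request from a flagged ASN."""
          if random.random() < IGNORE_PROBABILITY:
              return ("drop", None)                     # never respond / reset the connection
          return ("serve-slowly", SLOW_RESPONSE_DELAY)  # reply, but only after a long stall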

  • I'm working on fixing this exact problem[1]. Crawlers are gonna keep crawling no matter what, so a solution to meet them where they are is to create a centralized platform that builds in an edge TTL cache, respects robots.txt and retry-after headers out of the box, etc. If there is a convenient and affordable solution that plays nicely with websites, the hope is that devs will gravitate towards the well-behaved solution.

    [1] https://crawlspace.dev
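
    For context, honoring 429 and Retry-After from the crawler side looks roughly like this (a standard-library sketch, not Crawlspace's actual code; it assumes Retry-After is given in seconds rather than as an HTTP date):

      import time
      import urllib.error
      import urllib.request

      def fetch_with_backoff(url, max_attempts=5):
          for attempt in range(max_attempts):
              try:
                  with urllib.request.urlopen(url) as resp:
                      return resp.read()
              except urllib.error.HTTPError as err:
                  if err.code == 429:
                      # Honor the server's hint, falling back to exponential backoff.
                      delay = int(err.headers.get("Retry-After", 2 ** attempt))
                      time.sleep(delay)
                      continue
                  raise
          raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")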

  • Cloudflare should provide a service (paid or free) to block AI crawlers.

  • Has anyone been successfully sued for excess hosting costs due to scraping?

  • > has a terms of service page on its site that forbids bots from taking its images without permission. But that alone did nothing.

    It's time to level up in this arms race. Let's stop delivering HTML documents and instead use animated rendering of information positioned in a scene, so the user has to move elements around for it to be recognizable, like a full-site CAPTCHA. It doesn't need to be overly complex for a user who can intuitively navigate even a 3D world, but it will take 1000x more processing for OpenAI. Feel free to come up with your own creative designs to make automation more difficult.

  • Stuff like this will happen to all websites soon, once AI agents are let loose on the web.

  • Is there somewhere a list of the AWS servers these companies use, so sites can block them?

  • I had the same problem with my club's website.

  • Fail2ban would have been all they needed. It's what I have, works great.

  • Greedy and relentless though OpenAI's scraping may be, the fact that his web-based startup didn't have even a rudimentary robots.txt in place seems inexcusably naive. Correctly configuring this file has been one of the most basic steps of running a website for as long as anyone can remember, and skipping it doesn't speak highly of the company's technical acumen.

    >“We’re in a business where the rights are kind of a serious issue, because we scan actual people,” he said. With laws like Europe’s GDPR, “they cannot just take a photo of anyone on the web and use it.”

    Yes, and protecting that data was your responsibility, Tomchuk. You dropped the ball and are now trying to blame the other players.

  • I have little sympathy for the company in this article. If you put your content on the web, and don't require authentication to access it, it's going to be crawled and scraped. Most of the time you're happy about this — you want search providers to index your content.

    It's one thing if a company ignores robots.txt and causes serious interference with the service, as Perplexity was doing, but the details here don't really add up: this company didn't have a robots.txt in place, and although the article mentions tens or hundreds of thousands of requests, it says nothing about them being made unreasonably quickly.

    The default-public accessibility of information on the internet is a net-good for the technology ecosystem. Want to host things online? Learn how.

    EDIT: They're a very media-heavy website. Here's one of the product pages from their catalog: https://triplegangers.com/browse/scans/full-body/sara-liang-.... Each of the body-pose images is displayed at about 35x70px but is served as a 500x1000px image. It now seems like they have some Cloudflare caching in place, at least. (A sketch of serving properly sized thumbnails is below.)

    I stand by my belief that unless we get some evidence that they were being scraped particularly aggressively, this is on them, and this is being blown out of proportion for publicity.
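
    On the image sizing: pre-generating a thumbnail for the ~35x70px slot instead of shipping the full 500x1000px asset is a few lines with an imaging library. A minimal sketch, assuming Pillow and illustrative file names (serving at 2x the display size for high-DPI screens):

      from PIL import Image  # Pillow, assumed installed

      with Image.open("pose-original.jpg") as img:  # e.g. a 500x1000 source image
          img.thumbnail((70, 140))                  # fit within 70x140, keep aspect ratio
          img.save("pose-thumb.jpg", quality=85)    # serve this in the 35x70 slot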