Ask HN: How do huge sites with millions of URLs manage their sitemaps?
Sites like Twitter, Facebook, Quora, etc. probably have sitemaps so that search engines can crawl them more effectively.
How do they manage such a huge sitemap?
What kind of special tools do they use for it?
I have a few sites with a few million URIs and have found that a sitemap leads to much faster and better indexing than I would get otherwise.
The sitemap protocol is a little awkward at this scale: a single sitemap file can only contain 50k links, so you need a sitemap index. If you are not exceptionally careful, you will see files updated while the crawlers are downloading them, sitemap indexes that are not perfectly in sync with the sitemaps they point to, and other things that cause transient errors. In theory you could be careful with how things are date stamped (lastmod) to reduce the considerable load of crawlers re-downloading your sitemaps, but web crawlers don't 100% trust your assertions about how often things get updated.
Adding it all up, sitemaps work OK for a million pages, but probably not for a billion.
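Roughly, the kind of generator I mean looks like this (a Python sketch; the directory, file names and base_url are made up for illustration). It chunks URLs into 50k-link files, writes the matching index last, and renames each file into place so a crawler never downloads a half-written sitemap:

  import os
  from datetime import datetime, timezone
  from xml.sax.saxutils import escape

  MAX_URLS_PER_FILE = 50_000  # hard limit from the sitemap protocol

  def write_sitemaps(urls, out_dir, base_url):
      """Split urls into 50k-link sitemap files plus a sitemap index."""
      now = datetime.now(timezone.utc).strftime("%Y-%m-%d")
      sitemap_names = []
      chunk, n = [], 0
      for url in urls:
          chunk.append(url)
          if len(chunk) == MAX_URLS_PER_FILE:
              sitemap_names.append(_write_one(chunk, out_dir, n, now))
              chunk, n = [], n + 1
      if chunk:
          sitemap_names.append(_write_one(chunk, out_dir, n, now))

      # Write the index last, so it never points at a file that isn't there yet.
      tmp = os.path.join(out_dir, "sitemap_index.xml.tmp")
      with open(tmp, "w", encoding="utf-8") as f:
          f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
          f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
          for name in sitemap_names:
              f.write(f"  <sitemap><loc>{base_url}/{name}</loc>"
                      f"<lastmod>{now}</lastmod></sitemap>\n")
          f.write("</sitemapindex>\n")
      os.replace(tmp, os.path.join(out_dir, "sitemap_index.xml"))

  def _write_one(chunk, out_dir, n, now):
      name = f"sitemap-{n:05d}.xml"
      tmp = os.path.join(out_dir, name + ".tmp")
      with open(tmp, "w", encoding="utf-8") as f:
          f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
          f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
          for url in chunk:
              f.write(f"  <url><loc>{escape(url)}</loc><lastmod>{now}</lastmod></url>\n")
          f.write("</urlset>\n")
      os.replace(tmp, os.path.join(out_dir, name))  # atomic swap into place
      return name

Then a single Sitemap: line in robots.txt pointing at the index is the usual way to expose the whole set.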
I can't seem to find any evidence of a sitemap on Twitter, Facebook or Quora.
Quora has THIS: http://www.quora.com/sitemap
I can't find any sort of XML, JSON, or even TXT sitemap on any major site like that.
They may have a special agreement with the search engines, or maybe their sitemaps are only served to search bot user agents/IPs.
The other alternative is that they simply DON'T HAVE A SITEMAP. Search bots might just be crawling them as normal, or there may be some sort of link feed agreement with the companies that lets the engines index new links as soon as they are created internally.
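One quick way to test the user-agent theory is to fetch robots.txt as a browser and as a bot and compare the Sitemap: lines (a rough sketch; it won't catch IP-based cloaking, and the domain is just an example):

  import urllib.request

  def sitemap_lines(domain, user_agent):
      """Fetch robots.txt with a given User-Agent and return its Sitemap: lines."""
      req = urllib.request.Request(f"https://{domain}/robots.txt",
                                   headers={"User-Agent": user_agent})
      with urllib.request.urlopen(req, timeout=10) as resp:
          body = resp.read().decode("utf-8", errors="replace")
      return [line for line in body.splitlines()
              if line.lower().startswith("sitemap:")]

  # Compare what a browser sees vs. what something claiming to be Googlebot sees.
  for ua in ["Mozilla/5.0", "Googlebot/2.1 (+http://www.google.com/bot.html)"]:
      print(ua, sitemap_lines("www.quora.com", ua))

If the two results differ, the sitemap is being cloaked; if both come back empty, that's at least consistent with there being no public sitemap at all.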