How I Wrote a Search Engine in 6 Weeks

  • I would suggest that you read the article again and revise your judgement. Here is why.

    I founded a search engine (web, image, video, and business search), now in alpha, with limited resources as well; trust me, I know how hard it is.

    Simple questions: how good is your crawler? (You shouldn’t implement a scheme for each web site, though I understand that for your business it kinda makes sense.)

    How much information do you have in your repository? (You should be thinking in terabytes; if not, you are still far from any trouble.) Ferret is a great indexing tool, but how much data can it index? How scalable is it? Looking at the tech behind Ferret, how many resources does it use? How good is your relevancy model? (This question is tightly linked to your indexing.)

    I loved your idea. Just one thing: work on the relevancy again. I searched for ‘ruby on rails’ and got ‘ruby’-only results first, with the relevant ones after. I would also suggest you cache images to improve the user experience. Please don’t take my review personally.

  • 6 weeks? Why so long? (Just kidding.) I am building a search engine as my current project. It took all of a weekend to get it up and running. I didn't do it alone; my friend helped, so I guess maybe that doesn't count ;) We've been crawling for a little under 3 weeks, and I keep making interface and search-results tweaks, but otherwise it works "ok". I am in the process of switching it over to S3 (maybe EC2); after that change I think I'll open it up to the public.

  • Why do they use a cartoon of Ann Coulter as their logo?

  • Thanks for sharing your experience.

    For doing product search at the few-hundred-thousand-item scale, I would suggest SOLR rather than Nutch from the Lucene family.

    You'd need to do your own crawling/scraping, but the indexing is solid, simple, and flexible. (SOLR's pedigree is from CNET's own product search.)

  •   a. You wrote a crawler, not a search engine.  
      b. Ferret will bite you in the ass.  
      c. For a really good off-the-shelf crawler, look at Heritrix.
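
The relevancy complaint raised in the thread (a search for ‘ruby on rails’ returning ‘ruby’-only results ahead of the relevant ones) can be illustrated with a toy ranking function. This is only a sketch in plain Python, not Ferret's API and not how the site in the article actually ranks; the names `score`, `rank`, and `phrase_bonus` are made up for illustration. It assumes whitespace tokenization and simply rewards documents that contain the whole query as a contiguous phrase.

```python
def score(query: str, doc: str, phrase_bonus: float = 2.0) -> float:
    """Fraction of query terms found in the doc, plus a bonus for the exact phrase."""
    q_terms = query.lower().split()
    d_terms = doc.lower().split()
    if not q_terms:
        return 0.0

    # Term coverage: how many of the query terms appear anywhere in the document.
    d_set = set(d_terms)
    coverage = sum(1 for t in q_terms if t in d_set) / len(q_terms)

    # Phrase match: the query terms appear contiguously and in order.
    n = len(q_terms)
    has_phrase = any(
        d_terms[i:i + n] == q_terms for i in range(len(d_terms) - n + 1)
    )
    return coverage + (phrase_bonus if has_phrase else 0.0)


def rank(query: str, docs: list[str]) -> list[str]:
    """Return docs ordered from highest to lowest score (stable for ties)."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)


if __name__ == "__main__":
    # Hypothetical sample titles, used only to exercise the function.
    docs = [
        "Ruby gemstone buying guide",
        "Learn Ruby on Rails in a weekend",
        "Ruby language reference",
    ]
    for title in rank("ruby on rails", docs):
        print(title)
```

With a plain term-match score, all three titles would tie on ‘ruby’; the phrase bonus is what lifts the Rails title to the top. A real engine would combine this with term weighting and proximity rather than an all-or-nothing bonus.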