Show HN: I scraped 25M Shopify products to build a search engine

Hi HN! I built Agora as a side-project leading up to the holiday season. I wanted to find an easier way to find Christmas gifts, without needing to go store-by-store.

My wife asked me for a pair of red shoes for Christmas. I quickly typed it into Google and found a combination of ads from large retailers and links to a 1948 movie called 'Red Shoes'. I decided to build Agora to solve my own problem (and stay happily married). The product is a search engine that automatically crawls thousands of Shopify stores and makes them easily accessible with a search interface. There are a few additional features to enhance the buying experience, including saving products, filters, reviews, and popular products.

I've started with exclusively Shopify stores and plan to expand the crawler to other e-commerce platforms like BigCommerce, WooCommerce, Wix, etc. The technical challenge I've found is keeping the search speed and performance strong as the data set becomes larger. There's about 25 million products on Agora right now. I'll ramp this up carefully to make sure we don't compromise the search speed and user experience.

I'd love any feedback!

  • I hope you have better luck than I did!

    A few years ago, my partner and I built vendazzo.com (now defunct). It was an e-commerce search engine on products listed on Shopify shops (sound familiar? :)). At the time, we had > 100m products listed, and I don't remember how many shops we were indexing.. over 100k I think, but we had access to over a million. Overall, I think your approach is very similar to ours, but we managed to keep our costs lower. At the time, we were spending ~$550/mo, and our search times were under 300ms. We had established partnerships with a number of shops, and we had a few users, but not nearly enough. That's where the wheels came off. The site operated for over a year, but the monthly costs wore us down until we finally decided to pull the plug.

    I still maintain that this is a good idea, and constantly have to fight off the urge to "try again", however, to do it properly, I think funding would be necessary, or finding some way to organically gain a lot of users.

    Looking back, there are things I could have done to reduce my opex further, but in the end, it still wouldn't have mattered if I couldn't figure out how to acquire users.

  • What was the process for scraping 25M products?

    I have always used standard python tools like selenium, bs4 and the like. But I'm guessing none of these work at scale.

    Could you talk a little about your process and the key bottlenecks at that scale? Also, how much did it cost?

    ______________

    A recommendation for how to improve search.

    Your base captions will be pretty bad. You can use spot instances on a smaller GPU machine to run a dense captioning model (https://portal.vision.cognitive.azure.com/demo/dense-caption...) and generate captions for all your images.

    Then for search, a simple vector store index would be a great retrieval solution here. Searching over the generated captions as well should give better results.

    Both are pretty cheap and can be done reliably within 20-30 lines of code each in python. 3rd party tools for these are pretty stable.
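
    A minimal sketch of the vector index idea, assuming each product already has an embedding. The three-number vectors and product names below are made up, and the brute-force scan stands in for a real vector store:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class TinyVectorIndex:
    """Brute-force in-memory index: fine for a demo, not for 25M products."""

    def __init__(self):
        self.items = []  # list of (product_id, vector)

    def add(self, product_id, vector):
        self.items.append((product_id, vector))

    def search(self, query_vector, k=3):
        # Score every stored vector against the query and keep the top k
        scored = [(cosine(query_vector, vec), pid) for pid, vec in self.items]
        scored.sort(reverse=True)
        return [pid for _, pid in scored[:k]]

index = TinyVectorIndex()
index.add("red-sneaker", [0.9, 0.1, 0.0])   # hypothetical embeddings
index.add("blue-boot", [0.1, 0.9, 0.0])
index.add("red-heel", [0.8, 0.0, 0.2])
print(index.search([1.0, 0.0, 0.0], k=2))
```

    In practice the vectors would come from a caption or image embedding model, and an approximate nearest-neighbour index would replace the linear scan.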

  • I love your approach; you found a problem and developed a solution for it. And then you got the courage to share with the larger technical community. Good on you.

    There's obviously some rough edges (multiple duplicate products, issues with product links linking to empty pages, and no results for broad terms), but don't let that stop you. I'm certain they can all be fixed.

    Keep going! At the least, you'll come out of this with an excellent project in your portfolio.

  • Shopify has tried a few times to build a tool like this but hasn’t ever managed to get any traction. I think that missing any curation at all could be what eventually kills it. Their current attempt is https://shop.app and a query for red shoes is mostly red shoes.

  • I built this a couple years ago (now defunct) for the same reason :) The public JSON endpoints on shopify stores make it pretty easy to get the data. You mentioned using Mongo but it sounds expensive. I honestly think you could do this with just elastic or even postgres full text search and save money.

    Here's a pro tip + feature you should implement: Shopify has a semi-hidden hack where you can link directly to checkout of a product if you know the variant ID. You could add a BUY NOW button to your site without forcing the user to navigate the original site or checkout flow. Example: https://hapaboardshop.com/cart/42165521907955 (it also supports quantities and coupon codes)

    A word of caution: more products isn't necessarily better. I definitely found there to be a long tail of really bad Shopify stores and products. IMO it's better to curate or audit the stores you index; otherwise you risk your site being littered with kitschy t-shirts or drop-shipping garbage.
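
    A rough sketch of building that checkout link, assuming the permalink takes the form /cart/<variant_id>:<quantity> with an optional discount query parameter (my understanding of the format, worth checking against Shopify's docs). The coupon code is invented:

```python
from urllib.parse import urlencode

def cart_permalink(shop_domain, variant_id, quantity=1, discount=None):
    """Build a direct-to-checkout link for one variant on a Shopify store."""
    # Assumed path form: /cart/<variant_id>:<quantity>
    url = f"https://{shop_domain}/cart/{variant_id}:{quantity}"
    if discount:
        # Assumed: coupon codes are passed as a "discount" query parameter
        url += "?" + urlencode({"discount": discount})
    return url

# "HOLIDAY10" is a hypothetical coupon code
print(cart_permalink("hapaboardshop.com", 42165521907955, quantity=2, discount="HOLIDAY10"))
```

    A BUY NOW button would then just link to the returned URL for whichever variant the user picked.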

  • Hey, I have a Shopify store that sells e-paper calendars / smart screens. I tried to search for it but I could not find it. What should I do so your crawler can find me?

    https://shop.invisible-computers.com

  • There are a few conferences dedicated to ecommerce search. Mices is pretty good. I did not go there this year but I know some of the people behind it. Good community and lots of stuff happening.

    Two points here.

    - 25 million is really not a lot for most search engines. Something like Elasticsearch can easily deal with that if you deal with it properly. And there are plenty of equally capable solutions. I have worked with logging clusters that process log entries in those numbers on a daily basis. A modestly sized cluster goes a long way for that. Bare metal is cheaper than cloud for this. But a couple of simple servers with decent CPUs and memory and SSDs should go a long way here. Start worrying once you hit a few hundred GB of storage used. Anything below that is easy to deal with.

    - The key challenge with this volume is not performance but search quality. Building a competitive search engine is hard. You might have thousands of potential matches out of millions for any given query and your job is to pick the best 3, 5, 10 (whatever fits on your screen) ones. This is hard.

    So, what makes for a good answer is the key question to answer. All the naive solutions for this problem put you at the bottom of the market in terms of competitiveness. If you can't do better, you are just another low quality search engine not quite solving the problem. The bar is high these days for a good search engine and most of the better ecommerce companies have highly skilled search teams working on this.
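
    To put rough numbers on the first point above, here is a back-of-the-envelope estimate. The 2 KB average document size and the 1.5x index overhead are guesses for illustration, not measurements from Agora:

```python
def estimate_index_gb(n_docs, bytes_per_doc, overhead=1.5):
    """Very rough index size: raw document bytes times an overhead factor."""
    return n_docs * bytes_per_doc * overhead / 1e9

PRODUCTS = 25_000_000
ASSUMED_BYTES_PER_DOC = 2_000   # hypothetical average per product document
ASSUMED_OVERHEAD = 1.5          # hypothetical allowance for index structures

print(estimate_index_gb(PRODUCTS, ASSUMED_BYTES_PER_DOC, ASSUMED_OVERHEAD), "GB")
```

    Under those assumptions the index stays well below the "few hundred GB" threshold; plug in real averages to see where you actually land.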

  • Great site. Having built a search engine that needed to handle product data on a similar scale, it's not an easy thing to manage.

    Some observations:

    - Don't use infinite scrolling, it's an outdated UI practice that leads to bad user experience. It also makes the footer entirely unviewable.

    - Clicking on a product card image does not reliably open up the product. I have to randomly click on it a few times (Chrome, Brave)

    - Clicking on product card image and title leads to different actions, this is a bit unexpected, should show some hint of the difference.

    - The product page pop up will reset the search list when closed, this messes up my search navigation, breaks the flow of browsing.

  • Searching is slow (kinda expected that right now), but after clicking a product and then hitting back, I have to wait for the search again.

    Not at computer so I didn't check the headers, but maybe allow the client to cache the response for a short time so it doesn't need to load search results again.
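
    Something along these lines on the search response might be enough. This is only a sketch of the header; the function name and the 60 second default are made up, and whether it helps depends on how the front end fetches results:

```python
def search_response_headers(max_age_seconds=60):
    """Headers that let the browser reuse a search response for a short time."""
    return {
        # "private": cache in the user's browser only, not in shared proxies
        # "max-age": how many seconds the cached copy counts as fresh
        "Cache-Control": f"private, max-age={max_age_seconds}",
    }

print(search_response_headers(120))
```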

  • Have Swedish family. Searched dala because family wants traditional Christmas ornaments. Sure enough, there were several results that were 10x cheaper than what I could find on the first page of Big Search Company. Great job!

  • The Terms page goes to "Jaggi Enterprises", "A Modern Investment Fund. We buy, build, and invest in software companies with recurring revenue.".

    So maybe this is not really something a guy built for his wife, but some anonymous startup that googled "Which terms rank best on Hacker News" and then wrote the "I did ... my wife .." story?

  • Agora also doesn't return red shoes for the search query "red shoes". Seems like you haven't fully solved the problem yet :)

    From a technical perspective, crawling 25M products is impressive but the search itself doesn't provide much value to me. I already use large e-commerce sites (Amazon, Walmart, ...) and targeted ones (Nordstrom, SSENSE, ...). Sure, I may not be searching through all the Shopify and Wix stores, but I need to know why that's valuable to me to begin with. Perhaps understanding the value prop of SMBs and educating me about it would be a better positioning for Agora than simply being a search engine.

  • I believe Shopify built their own app / website where you can search for products exclusively from Shopify merchants. https://shop.app/

  • Great project. If you continue to crawl the data, be sure to save it so you can detect price changes a la camelcamelcamel.
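
    A minimal sketch of what that could look like, assuming each crawl is stored as a product-id-to-price mapping. The ids and prices are invented:

```python
from decimal import Decimal

def price_changes(previous, current):
    """Return {product_id: (old_price, new_price)} for prices that moved."""
    changes = {}
    for product_id, new_price in current.items():
        old_price = previous.get(product_id)
        # Skip products that are new in this crawl or whose price is unchanged
        if old_price is not None and old_price != new_price:
            changes[product_id] = (old_price, new_price)
    return changes

# Two hypothetical crawl snapshots
monday = {"sku-1": Decimal("59.00"), "sku-2": Decimal("20.00")}
friday = {"sku-1": Decimal("49.00"), "sku-2": Decimal("20.00"), "sku-3": Decimal("5.00")}
print(price_changes(monday, friday))
```

    Keeping every snapshot rather than only the latest one is what makes a price history chart possible later.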

  • For those unaware, Shopify already has platform wide search. You can use https://shop.app/ (or the app), and it also has some chatbot thing that can offer suggestions

  • Cool! But how did you get the initial dataset of 643,000+ Shopify stores (data as per your “About” page) in the first place, to then scrape the products from their /products.json feeds? Or did you just try a huge list of domain names at random?

  • That's funny, I made a domain-specific version of this for canadian coffee deals.

    https://beangrid.mcconomy.org/

  • What technology did you use to build the scraper and how did you get around the usual challenges (anti bot, ip banning, etc) with scraping large amounts of data?

  • Cool project!

    As you scale, you may benefit from these two projects I maintain, which Big Tech also uses :)

    https://github.com/unum-cloud/usearch - for faster search

    https://github.com/unum-cloud/uform - for cheaper multi-lingual multi-modal embeddings

    Feel free to reach out with feedback and feature requests!

  • When I search for “op-1”, partial match like “Frontier Co-op Turkey Rub, Organic 1 lb. -- Frontier Co-op” gets ranked higher than “teenage engineering op-1”. I would expect the opposite.
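
    I don't know how Agora tokenizes or ranks, but if "op-1" gets split into "op" and "1", both titles match on tokens alone. A sketch of one common fix, boosting titles that contain the query tokens as an adjacent phrase (the scoring weights are arbitrary):

```python
import re

def tokens(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(haystack, needle):
    """True if the needle tokens appear adjacent and in order in haystack."""
    n = len(needle)
    if n == 0:
        return False
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))

def score(query, title, phrase_boost=10.0):
    q, t = tokens(query), tokens(title)
    base = sum(t.count(term) for term in q)  # naive term-frequency overlap
    return base + (phrase_boost if contains_phrase(t, q) else 0.0)

titles = [
    "Frontier Co-op Turkey Rub, Organic 1 lb. -- Frontier Co-op",
    "teenage engineering op-1",
]
for title in sorted(titles, key=lambda t: score("op-1", t), reverse=True):
    print(score("op-1", title), title)
```

    Without the boost, the rub wins on raw token counts because "op" appears twice; with it, the exact product name comes first.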

  • Really neat. I tried your search for red shoes, and I found some, er, unexpected imagery on page 1.

    One thing you could do is add semantic search so when a user searches "red shoes," the index returns images that look like red shoes even if the metadata doesn't say anything about color or item types. To do this, I'd use a model like CLIP. Here's an example of using CLIP and Supabase to do semantic image search: https://blog.roboflow.com/how-to-use-semantic-search-supabas...

  • This is great - just a couple UI things bugging me. 1. When clicking "Open" on a product, the user should be able to open that in a separate tab. Currently that's not possible; I'm sure because it's being delivered in a single page (can't check now because you're getting hugged to death by HN).

    2. When the server's slow, as it just was, there should be some kind of waiter / loader to immediately show the user that the "Open" click was sent on a product. Otherwise people will keep clicking it (or worse, clicking other products) and there's no indication that it's loading.

    3. Once a product is open, it's not clear how to get back from it. I see the "X" in the corner, but doing that seems to take me back to a blank search page, not to my search results. The back button also doesn't take me back to the search results...

  • HN— Not sure if anyone will see this but I wanted to thank you all for the support. Although I haven't slept much since going live, it has been amazing getting early feedback from the community.

    Agora is still in MVP stage but getting better by the day. Just pushed a big update: fixing an image shifting bug, a blur effect on loading, Redis for caching, brand pages, architecture fixes, and several other things. Currently working on improving the relevancy algorithm, adding all ~5 million Shopify stores, and then adding WooCommerce stores over the next few days.

    If you have any suggestions or ideas, reach out to me at support @ searchagora .com :)

  • On the page where you show details about the product, I would like to have it include the same product from other Shopify stores by doing an image similarity comparison.

    And then highlight how the price compares.

    For example, here are some pretty crazy red shoes. But they are too expensive for me. Would be interesting to see if this is the only store selling these shoes, or if someone else has the same shoes much cheaper.

    https://www.searchagora.com/products/vasco-4-47fb0f87-5b89-4...

  • How are you planning to monetize this? You mentioned you are spending around $2K just to run it. Is there a commission strategy or ads? Or populate with your products at one point so you sell your own thing?

  • Idea! Shopify has a ton of resellers that sell junk from China. If you figure out how to avoid them, your life would be 10x easier.

  • This is amazing for finding cute collectibles from my favorite TV show that I would otherwise not have noticed among random t-shirts and other "slap the picture and call it co-branded" products! I'm not super sure how long it is going to be around, but I think I'm gonna keep playing with it for a while.

  • I searched for 'pão francês' and my store was the #1 result. I think you're doing it right! :)

  • Awesome! It would be good to listen to the enter key when typing in a search query. Your privacy and terms links point to what appears to be the saas code framework you used (just a guess). I was looking for your contact/email so I can ask you some questions.

  • Hey, I'm the CEO of Meilisearch. If your issue is performance, I would love to give you a try with Meilisearch. You'll be able to create an "as you type" experience with our engine that responds in less than 50ms!

  • Do you plan to add filters: price etc?

    I was about to add 'reviews' to the above list as well but decided not to, as they are not always trustworthy. AI is now advanced enough that it can be used to detect fake reviews and exclude them from sampling.

  • cool project. You might have noticed, but there's a non-trivial amount of fraud on shopify (fake shops, info stealers, etc). Might be interesting to look at that dataset and explore a bit =) it's quite fascinating

  • I have no clue how to implement a search but maybe some words are more important than others.

    I searched for "mens dress shirts button long sleeve" and after about 6 results it was all women's clothing.
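
    For what it's worth, the usual name for "some words are more important than others" is inverse document frequency: terms that appear in fewer products get a higher weight. A toy sketch with an invented four-title catalog (the smoothing formula is one common variant, not necessarily what any particular engine uses):

```python
import math
from collections import Counter

def idf_weights(titles):
    """Weight each term by how rare it is across the given titles."""
    n = len(titles)
    doc_freq = Counter()
    for title in titles:
        # Count each term once per title, however often it repeats
        doc_freq.update(set(title.lower().split()))
    # Smoothed IDF: rarer terms get larger weights
    return {term: math.log((n + 1) / (count + 1)) + 1 for term, count in doc_freq.items()}

catalog = [  # hypothetical product titles
    "mens dress shirt long sleeve",
    "womens dress shirt long sleeve",
    "womens long sleeve blouse",
    "womens dress",
]
weights = idf_weights(catalog)
print(round(weights["mens"], 3), round(weights["sleeve"], 3))
```

    Here "mens" outweighs "sleeve" because it appears in only one title, so a ranker using these weights would penalize results that drop it.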

  • I'm a Shopify store owner myself. I saw there is a $99 per month fee to get your product verified. How would this compete in terms of CPC with a traditional channel such as Google Ads or Meta ads?

  • Amazing job! I have one question: how did you find the price of every product? I mean, every product page has a different id or class that identifies the price. Do you use a regex?
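
    Other comments in this thread mention the public /products.json feed on Shopify stores. If that is the source, the price would already be a structured field on each variant and no regex is needed. A sketch that parses a hand-written payload shaped like that feed (field names as I understand them; the sample data is invented):

```python
import json
from decimal import Decimal

# Hand-written sample imitating the assumed shape of a /products.json response
SAMPLE = """
{"products": [
  {"id": 1, "title": "Red Shoes", "handle": "red-shoes",
   "variants": [
     {"id": 11, "title": "Size 8", "price": "59.00"},
     {"id": 12, "title": "Size 9", "price": "61.50"}
   ]}
]}
"""

def variant_prices(payload):
    """Flatten a products payload into (product title, variant id, price) rows."""
    rows = []
    for product in json.loads(payload).get("products", []):
        for variant in product.get("variants", []):
            # Prices arrive as strings; Decimal avoids float rounding
            rows.append((product["title"], variant["id"], Decimal(variant["price"])))
    return rows

for row in variant_prices(SAMPLE):
    print(row)
```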

  • Aside: The ending of the 1948 "The Red Shoes" was funny to me, but I think I was a little loopy after slogging thru it. I don't know if I recommend it or not.

  • I like it.

    I need to be able to filter search to if it will deliver to my country.

    It desperately needs some indication that your action is being processed, like a spinner, when you search.

  • What's your revenue model? I see you expanded on the details of your $1.5K monthly cost, but I'm failing to see how you make money. Affiliate fees?

  • wow! Nice work. I've been trying to build an index of shopify stores. Did you search for all domains pointing to shopify's name servers?

  • Worked well for me, great job. I searched for something I've been looking for and found some interesting options I haven't seen before.

  • You should def give Algolia and Typesense a try. You can get 10k in free Algolia credits for the first year too via Secret (startup deals site).

  • Could you make it so that I can easily open a product in a new tab? I like to compare lots of products at the same time.

  • Love it! Some improvements are needed on search but is an amazing MVP, I'll use this for my late christmas shopping

  • Why not manticore as backend? Much better perf than ES, less memory intense, sql syntax. Just fantastic all round!!

  • Clicking an item could show you similar items before it takes you to the item (or have capability for similar)

  • "There's about 25 million products on Agora right now."

    How many stores are represented in the index?

  • Incredible. Where can I connect with you? Want to pick your brain & swap some thoughts :)

  • heh, I used to work on the data team at Shopify. I built something similar to search internal dbs for secret santa gifts based on some weird criteria. Scraping might have a large margin of error because a lot of products tend to be ephemeral.

    Neat project though!

  • Any Unicode input (Japanese or Greek text for example) currently causes a 500 error.

  • So cool, good luck in the marriage, you made a very cool thing!!

  • Amazing. Why doesn't Shopify build this natively?

  • How did you find the list of shopify stores and names?

  • how did you avoid ip based blocking? rotating proxies?

  • Maybe I'm clearly ignorant, but how does this differ from Klarna (https://www.klarna.com)?

  • Is this really within the TOS of Shopify?

  • Amazing! Does it have an api?

  • Built the same thing a while back while collecting a lead list for sales. Not bothered to keep data updated but was a fun thing to build in a couple days. (disclaimer mobile experience is meh cause it was a fun project)

    https://zensear.ch

    How did you find list of all Shopify stores? I ended up just checking every .com, .net, etc as I didn't find an easy way to figure it out directly from shopify.

  • how did you get a list of the 25 million stores to crawl?

  • Basically it’s Amazon

  • where did you find a list of shopify stores to scrape

  • Super cool!!!

  • Great idea

  • Gg

  • Incredible. Would love to connect with you. Where can I find you LOL

  • I'm sorry, but I have to question whether this heartfelt story about looking out for your wife is in any way real.

    The website certainly doesn't look like a side project, it has a fully fledged system for merchants to advertise on Agora for a fee, an affiliate system offering $50 commissions to onboard merchants and the ToS and Privacy policy link to a website with the following mission statement:

    > We buy, build, and invest in software companies with recurring revenue and product-led growth.

  • This isn't worth the cost or effort. Shopify already has an internal tool with this functionality that they are planning to publicize.