Hacker News Clone

Show HN: An API for scraping recipe web pages

by brad0 on 7/18/2017, 7:50 AM with 105 comments

by oneeyedpigeon on 7/18/2017, 9:29 AM
This really seems to be an exercise in 'structuring recipe data' rather than the ins-and-outs of scraping. Seems like a much-needed task; is there anything approaching a 'standard' for recipe data already?
```
    "ingredients": [
        "600g pineapple, peeled, chopped"
    ]
```
This seems like a prime candidate for improvement; something like the following would seem to be more useful:
```
    "ingredients": [{
        "ingredient": "pineapple",
        "quantity": {
            "value": 600,
            "unit": "g"
        },
        "preparation": [ "peeled", "chopped" ]
    }]
```
Using this kind of data, I can imagine automated ordering, management of several planned recipes in conjunction, personalised timing estimations, etc. I think the whole field of cooking is ripe for some quality API-based disruption!
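As a rough illustration of how a free-text line could be turned into that structured shape, here is a minimal Python sketch. The helper name `parse_ingredient` and the regular expression are assumptions made for this example, not anything the API is known to do, and it only understands lines of the form "<number><unit> <name>, <preparation>, ...".
```python
import re

# Hypothetical pattern: a number, a unit made of letters, then the rest of the line.
_LINE = re.compile(
    r"^\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[a-zA-Z]+)\s+(?P<rest>.+)$"
)


def parse_ingredient(line: str) -> dict:
    """Split one ingredient line into ingredient, quantity and preparation."""
    match = _LINE.match(line)
    if match is None:
        # Keep the raw text rather than guessing at a quantity.
        return {"ingredient": line.strip(), "quantity": None, "preparation": []}

    # The first comma-separated part is the name, the rest are preparation steps.
    name, *prep = [part.strip() for part in match.group("rest").split(",")]
    value = float(match.group("value"))
    return {
        "ingredient": name,
        "quantity": {
            # Report whole numbers as integers (600 rather than 600.0).
            "value": int(value) if value.is_integer() else value,
            "unit": match.group("unit"),
        },
        "preparation": [p for p in prep if p],
    }


print(parse_ingredient("600g pineapple, peeled, chopped"))
```
Lines without a leading number (for example "salt") fall through to the raw-text branch. Real recipes have far messier cases, such as fractions, ranges and "a pinch of", which this sketch does not attempt.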
by Freak_NL on 7/18/2017, 9:22 AM
```
    "prepTime": {
        "text": "15 minutes",
        "iso": "PT15M",
        "minutes": 15
    }
```
Why are you triplicating the time periods? I would limit that to just the ISO 8601 value (PT15M) and let the UI choose the proper rendering of the value.
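To show what I mean by letting the UI derive the other two forms, here is a minimal Python sketch. `parse_duration` and `render` are hypothetical names for this example, and the pattern covers only days, hours, minutes and seconds, not the full ISO 8601 duration grammar (no weeks, months, years or fractional values).
```python
import re
from datetime import timedelta

# Simplified ISO 8601 duration pattern: P[nD][T[nH][nM][nS]]
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(iso: str) -> timedelta:
    """Convert a value such as 'PT15M' into a timedelta."""
    match = _DURATION.match(iso)
    if match is None or iso in ("P", "PT"):
        raise ValueError(f"unsupported duration: {iso!r}")
    parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    return timedelta(**parts)


def render(duration: timedelta) -> str:
    """Produce a display string in hours and minutes (seconds are dropped)."""
    total_minutes = int(duration.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    pieces = []
    if hours:
        pieces.append(f"{hours} hour" + ("s" if hours != 1 else ""))
    if minutes or not pieces:
        pieces.append(f"{minutes} minute" + ("s" if minutes != 1 else ""))
    return " ".join(pieces)


prep = parse_duration("PT15M")
print(render(prep))                      # display text
print(int(prep.total_seconds() // 60))   # plain minutes
```
With this, both the "text" and "minutes" fields can be computed from the "iso" field on the client.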
by Freak_NL on 7/18/2017, 9:27 AM
It looks good and seems to work with some arbitrarily picked recipes on the usual large recipe websites, although more obscure links cause some bugs¹. Is the source code on GitHub? Are you handling specific websites such as allrecipes.com with bespoke code?
Technically the project is interesting, but if you want to offer a commercial API you might run into copyright and fair use issues (as with any scraping tool). Not so much a problem for personal use, but expect angry letters from the major recipe websites for violating their terms of use (i.e., this is a threat to their business model).
1: Try this link: https://www.thespruce.com/peking-duck-recipe-694920 . You'll see a lot of garbage HTML and XML entities in the JSON that can be filtered and replaced fairly easily.
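A minimal Python sketch of that kind of filtering, using only the standard library. `clean_text` is a hypothetical helper and the tag regex is deliberately crude, so treat it as an assumption about what a cleanup step might look like rather than a full HTML sanitizer.
```python
import html
import re

_TAG = re.compile(r"<[^>]+>")   # crude match for leftover markup
_WHITESPACE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Strip leftover tags, decode entities and collapse whitespace."""
    text = _TAG.sub(" ", raw)        # drop tags such as <p> or <br/>
    text = html.unescape(text)       # &amp; -> &, &#39; -> ', &nbsp; -> U+00A0
    return _WHITESPACE.sub(" ", text).strip()


print(clean_text("<p>Chef&#39;s duck &amp;&nbsp;pancakes</p>"))
```
Tags are removed before entities are decoded so that an encoded `&lt;` in the recipe text is not mistaken for markup.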
by polm23 on 7/18/2017, 11:37 AM
Obligatory mention of the NYTimes using a CRF to get structured data from recipes:
https://open.blogs.nytimes.com/2015/04/09/extracting-structu...
by borne0 on 7/18/2017, 10:58 AM
This is sorely needed; so many recipe blogs follow that tired format of 'long diatribe about something moderately health- or family-related', then the recipe. Cool!
by keithasaurus on 7/18/2017, 6:48 PM
I can't access the linked page, but I run https://www.cinc.kitchen. It tries to be the GitHub for recipes, but also has some of the most advanced scaling and parsing available. So a lot of thought has gone into how data is organized and processed.
On scrapers specifically, cinc's recipe importer is decent, but mainly relies on structured metadata.
There's a lot of room to improve these tools though. Lots of complexity and edge cases with recipes :)
Happy to answer questions people have about this stuff.
by mstaoru on 7/18/2017, 9:47 AM
Good work! Parsing hRecipe and Schema.org Recipe entities is what we also did before for Spiceship. Unfortunately, the quality of recipes from the Internet is unspeakably low, so we had to switch almost entirely to parsing e-books. Some websites like SeriousEats are better than the others, but generally, it's not serving any purpose except aggregating the recipes Yummly-style and getting some kind of data insights from them.
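For anyone curious what reading Schema.org Recipe entities can look like, here is a minimal Python sketch using only the standard library. It assumes the page embeds its recipe as JSON-LD in a `<script type="application/ld+json">` tag. Pages that use microdata or hRecipe markup instead, or that nest the recipe inside an `@graph` array, are not handled. `find_recipes` is a hypothetical name, not part of Spiceship or the posted API.
```python
import json
from html.parser import HTMLParser


class _JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def find_recipes(page_html: str) -> list:
    """Return every top-level JSON-LD object whose @type is exactly 'Recipe'."""
    collector = _JsonLdCollector()
    collector.feed(page_html)
    recipes = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed metadata instead of failing the whole page
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Recipe":
                recipes.append(item)
    return recipes
```
Each returned dictionary can then be read for fields such as `name`, `recipeIngredient` and `prepTime`, where the site provides them.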
by moepstar on 7/18/2017, 4:50 PM
Not sure if this is supposed to work, but the German recipe site "chefkoch.de"[1] gave me an "Unexpected token in JSON at position 1611" error.
[1] http://www.chefkoch.de/rezepte/565001154855998/Big-Kahuna-Bu...
by dbot on 7/18/2017, 12:57 PM
Related, there's an iOS app called Mealboard which lets you plan out recipes for the week/month/etc. We've started using it all the time, mostly because it has an amazing in-app browser that scrapes web recipes and stores them in the recipe list. Really impressive tool.
by Shivetya on 7/18/2017, 1:46 PM
While I like the idea, one of the issues I have run into way too often is that far too many sites simply link to the actual site for the recipe. It is getting really difficult to filter these results out as new ones crop up weekly.
Example: https://yurielkaim.com/7-green-detox-juice-recipes/
Fortunately it doesn't take seven pages to find the actual recipes, but I still need to go to each individually and possibly be buried under the ad load. The worst sites spread the recipes out over their own seven pages and still require a link to the site holding the recipe.
by joshstrange on 7/18/2017, 2:54 PM
I use Paprika for my recipes and IIRC their API for fetching back your recipes (not public) was actually pretty nice. I have played around with creating a meal prep blog for a while and I was going to write a little service that would hit their API so I could make embeddable widgets for the recipe that pulled directly from my account (so if I updated it in the app the web would update as well). Not the same thing as this at all but I'd be interested to see what JSON format they use for recipes.
by bomdo on 7/18/2017, 9:38 AM
Impressively, it also works for languages other than English (albeit having trouble with formatting here and there).
Is there a behind-the-scenes somewhere? Is this regex magic alone?
by pawelkomarnicki on 7/18/2017, 12:46 PM
I came up with my own schema for recipes for CookArr.com :-) but in the long run, making the recipes is more interesting for me than scraping them off other websites :-)
by zackify on 7/18/2017, 1:40 PM
The first recipe I tried gave me an unexpected JSON error. Seems to only work with a few sites. Cool idea though.
by staticelf on 7/18/2017, 11:36 AM
Very impressed, works fine on websites in Swedish even.
by mapster on 7/18/2017, 6:03 PM
Use case: Alexa walking me through preparing a meal for 4, learning my tastes and diet, and making recommendations when asked.
by Cosmopolitan on 7/18/2017, 4:12 PM
I've been using Chrome's scraper extension. This looks like it could be fun to try out though.
by Dowwie on 7/18/2017, 10:53 AM
This falls under the category of unethical scraping