Introducing the Open Images Dataset

  • Lawyers are funny:

    >Today, we introduce Open Images, a dataset consisting of ~9 million URLs ... having a Creative Commons Attribution license*.

    Then the footnote below:

    >* While we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.

    I think this might be the most blatant instance I've ever seen of, "We have to write this even though it's essentially impossible for you to actually follow our directions."

  • Interesting that the base data consists of URLs. I guess it makes sense given copyright issues. Does anybody know the ballpark expected half-life of such URLs?

  • Any guesses on how large the resulting dataset would be if you actually downloaded all the images? I imagine the URLs will get removed in a hurry as everybody starts automating the downloads.

  • First video, now images - wonder if speech and others are on the way?

    It's nice that they're doing this, helps advance the art I think. But it also puts a lot of smaller operations in unis sort of under the Google system in that they're best compared to Google's ML work and others using these datasets. It's a small way of stacking the deck to make Google and DeepMind more embedded in the community.

    That said, its utility for others surely outweighs the strategic advantage gained here, so I for one welcome these libraries. A lot of work goes into them. Hopefully others will release theirs as well.

  • I'm glad I'm getting a return on all the effort clicking street signs and store fronts on reCaptcha.

  • I've put an efficient downloader here for the interested crowd: https://github.com/beniz/openimages_downloader. It's a fork of the script I used to grab ImageNet.

  • Is there a link to the trained model somewhere?

  • Are there any other libraries that are similar?

  • Looking forward to someone trying a TensorFlow CNN on this.
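
  • On the size and half-life questions above, here is a rough back-of-envelope sketch. Only the ~9 million count comes from the announcement. The average image sizes and the exponential-decay model for dead URLs are assumptions for illustration, not measurements, and the function names are made up for this example.

```python
def estimated_size_tb(num_images: int, avg_bytes: float) -> float:
    """Total download size in terabytes (1 TB = 10**12 bytes)."""
    return num_images * avg_bytes / 1e12


def surviving_fraction(years: float, half_life_years: float) -> float:
    """Fraction of URLs still reachable after `years`, assuming
    simple exponential decay with the given half-life."""
    return 0.5 ** (years / half_life_years)


NUM_IMAGES = 9_000_000  # "~9 million URLs" from the announcement

# Assumed average image sizes in KB (hypothetical, not measured).
for avg_kb in (100, 300, 1000):
    size_tb = estimated_size_tb(NUM_IMAGES, avg_kb * 1000)
    print(f"{avg_kb} KB/image -> {size_tb:.1f} TB")

# Assumed half-lives in years (hypothetical, not measured).
for half_life in (2, 5, 10):
    alive = surviving_fraction(3, half_life)
    print(f"half-life {half_life} y -> {alive:.0%} of URLs alive after 3 y")
```

    So the answer swings from under a terabyte to several terabytes depending on what the real average image size turns out to be. Plug in measured numbers once someone has actually crawled a sample.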