Ask HN: What's the Heroku for Data?
Heroku made it so easy to one-click deploy a web application. What's the equivalent for building a data pipeline, data warehouse, or data lake?
I work at Google, so I'll give examples from Google land:
* Google BigQuery (data warehouse) https://cloud.google.com/bigquery/
* Google Dataflow https://cloud.google.com/dataflow/
* Looker https://cloud.google.com/looker
More on data analytics: https://cloud.google.com/products#section-6 databases: https://cloud.google.com/products#databases
Most of these products are 'serverless', similar to the Heroku model: you don't manage the infrastructure, it scales with demand, and you pay per use.
For a data warehouse, try ClickHouse.
We're working on an internal platform: object storage, structured or unstructured data, and a plug-in architecture that makes it easy to add applications. It offers notebooks with GPUs and a choice of images with lots of pre-installed libraries, available in one click, so our people don't have to deal with CUDA and don't need a powerful machine of their own.
It supports one-click publishing of an automatically parametrized notebook as an application, an AppBook, without writing widget code or tagging cells. Parameters and metrics are tracked automatically, and models are saved to object storage behind the scenes with no boilerplate code, so people can't _forget_ to log. You can send POST requests to your instrumented training notebooks and programmatically tweak training parameters. You can also take a runtime notebook with the parameters you want and publish it as an AppBook, so a domain expert can either validate it by entering values or train models themselves by tweaking parameters.
There's also one-click deployment that gives you an endpoint to serve your model. You invoke the model with a POST request, using an authentication token you can generate from the application.
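As a rough sketch of what calling these endpoints might look like (the URLs, paths, token, and payload field names here are hypothetical placeholders, not our actual API):

```python
import json
import urllib.request


def build_json_post(url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated JSON POST request (built but not sent here)."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Token generated from the application's UI (placeholder value).
            "Authorization": "Bearer MY_TOKEN",
        },
    ) if token == "MY_TOKEN" else build_json_post(url, "MY_TOKEN", payload)


# Tweak a training parameter on an instrumented notebook (hypothetical endpoint):
train_req = build_json_post(
    "https://platform.example.com/notebooks/42/train",
    "MY_TOKEN",
    {"learning_rate": 0.001, "epochs": 10},
)

# Invoke a deployed model for a prediction (hypothetical endpoint):
predict_req = build_json_post(
    "https://platform.example.com/models/churn/predict",
    "MY_TOKEN",
    {"instances": [[5.1, 3.5, 1.4, 0.2]]},
)

# urllib.request.urlopen(predict_req) would actually send it; omitted here.
```

The requests are plain authenticated JSON POSTs, so any HTTP client (curl, requests, a frontend fetch) works the same way.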
We're adding functionality for projects and better user management, and supporting more data sources (S3 only for now, with Google Cloud Storage and Azure in the pipeline). We don't have async training or invocation yet.
We're opening it up to about 30 of a colleague's students so they can prepare their master's theses on an instance we deployed on Google Cloud Platform, with pretty big datasets; they each get to use a Tesla K80, which they couldn't before.
If you're interested, drop me a line. We also support the fast.ai course so people can start learning hassle-free.
We've worked as a consulting company on ML projects with large organizations, and we've suffered through the usual problems: configuration, hardware, libraries, dependencies, notebook sharing, model deployment, and building applications spanning everything from the problem and the model to the JavaScript on the front end. It's not trivial to find or keep people who can do all of that, and it's risky to depend on them; for a small consulting company doing contract work, the cost of full-time specialists, if you can even find them, is draining. We also had key people leave projects with no one left who knew what the project was about. Now we're adding features for collaboration and for institutionalizing project knowledge, as we do in our day-to-day software engineering, where we're all aligned and everyone knows why something was done (rationale, assumptions, hypotheses, etc.).
So we built this tool to give us leverage: it lets our data scientists deploy models and share interactive notebooks without writing widget code, and without asking the people who can deploy things to take care of it. It must be one click.