A Bunch of Money on AWS and Some Benchmark Results
As someone who's spent a lot of time working with TPC-DS [1] and talking to people about it [2], I see a couple of areas that could be improved in this benchmark:
1. Total run time is not an appropriate way to summarize the performance across queries, because some queries take 100x longer than others. The appropriate way to summarize this kind of data is to use the geomean [3].
2. The official TPC-DS queries make heavy use of grouping sets, which are a rarely-used SQL feature. I think TPC-DS is better if you rewrite the queries to eliminate grouping sets.
3. You used the exact same queries to "warm up" the data warehouse, and to test the performance. Some data warehouses (notably Redshift) aggressively cache intermediate compilation results, so they are much faster the second time they see a query or even a fragment of a query. To model a real user submitting queries interactively, you should use warmup queries that are similar to but not the same as the ones you use to measure performance.
4. You can solve the "vendor benchmarking their own product" problem by submitting a PR to our repo [4], which currently tests Redshift, Snowflake, BigQuery, Azure SQL DW, and Presto. We'd be happy to review it and endorse the timing if it meets our standards for fairness!
[1] https://fivetran.com/blog/warehouse-benchmark
[2] https://www.youtube.com/watch?v=XpaN-PqSczM
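To make point 1 concrete, here's a rough sketch of how total run time and geomean can disagree. The per-query timings below are made up for illustration (they are not from any benchmark run), and `geomean` is just a small helper written for this example.

```python
import math

def geomean(times):
    """Geometric mean of positive run times, computed via logs."""
    if not times or any(t <= 0 for t in times):
        raise ValueError("need a non-empty list of positive run times")
    return math.exp(sum(math.log(t) for t in times) / len(times))

# Hypothetical per-query run times in seconds; one query is ~100x slower.
baseline     = [1.0, 2.0, 4.0, 400.0]
# Variant A: the three short queries each get 2x faster (3.5 s saved).
faster_short = [0.5, 1.0, 2.0, 400.0]
# Variant B: the same 3.5 s is shaved off the one long query instead.
faster_long  = [1.0, 2.0, 4.0, 396.5]

for name, times in [("baseline", baseline),
                    ("faster_short", faster_short),
                    ("faster_long", faster_long)]:
    print(f"{name:13s} total={sum(times):7.1f}s  geomean={geomean(times):6.3f}s")
```

Total run time scores variants A and B identically, since both save 3.5 seconds. The geomean drops a lot for A (three of four queries doubled in speed) and barely moves for B, which is closer to what someone running a mix of queries would actually notice.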
This benchmark is pretty ridiculous for the following reasons:
1. Their database is run in asynchronous durability mode.
2. They specifically do the one thing that TPC-C says you shouldn't do, which is get really high throughput on a small dataset. TPC-C requires that you scale the amount of data stored with the query throughput. CockroachDB maxes out at ~12.8 tpmC/warehouse because it's waiting at the legal maximum throughput, as opposed to running up the numbers in a way that's against the rules (and spirit) of the benchmark.
3. They make all the TPC-DS mistakes that georgewfraser points out elsewhere in this thread.
4. They run in read committed mode (they don't support anything higher), while CockroachDB runs in serializable mode.
I ended up ranting about this on Twitter, so rather than reproducing everything here, I'm going to link to my rant there. Apologies for the cross-posting across fora: https://twitter.com/narayanarjun/status/1128393193941274624
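For anyone who wants to sanity-check point 2 on a published number, here's a back-of-the-envelope sketch. It assumes the commonly cited ceiling of about 12.86 tpmC per warehouse that falls out of TPC-C's required keying and think times; the constant and the helper names are mine, and the throughput figures are hypothetical, not taken from the benchmark being discussed.

```python
import math

# Assumption: commonly cited per-warehouse ceiling implied by TPC-C's
# mandatory keying and think times.
MAX_TPMC_PER_WAREHOUSE = 12.86

def min_warehouses(tpmc):
    """Smallest warehouse count that could legally sustain this tpmC."""
    return math.ceil(tpmc / MAX_TPMC_PER_WAREHOUSE)

def exceeds_ceiling(tpmc, warehouses):
    """True if the reported throughput is above the per-warehouse ceiling."""
    return tpmc > warehouses * MAX_TPMC_PER_WAREHOUSE

# Hypothetical claim: 100,000 tpmC on only 1,000 warehouses.
print(min_warehouses(100_000))          # warehouses a compliant run would need
print(exceeds_ceiling(100_000, 1_000))  # is the claim above the ceiling?
print(exceeds_ceiling(12_000, 1_000))   # a run that stays under the ceiling
```

If a result needs far fewer warehouses than `min_warehouses` says, the run skipped the wait times and the dataset was too small for the throughput it reports.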
Might be advisable to note that you don't support foreign keys if you're going to show how much better your performance is versus a database that does.
The great thing about AWS and other cloud platforms is that you can put the data in e.g. S3, then use the tool of choice for your workloads.
I'm assuming that MemSQL works fine with that sort of configuration, rather than requiring you to lock your data up in some proprietary format.
Also, unlike the bad old days of on-premise platforms, you can try things out to see how they work. You could even do that with a public dataset first, to see how it works (see https://registry.opendata.aws/ for a list of these).
For example, there is an Amazon Customer Review dataset of over 160 million customer reviews - you could use that and try MemSQL for various use cases, then look at alternatives.
Disclaimer - clearly as a Kognitio employee I'd suggest you look at us for analytics use cases, and you can see an example of sentiment analysis at scale using the Amazon Customer Review data set at https://kognitio.com/blog/sentiment-analysis-amazon-reviews-.... Also, there are a couple of articles on LinkedIn: https://www.linkedin.com/pulse/100-shades-grey-other-amazon-... for another piece of work on that same data, and https://www.linkedin.com/pulse/media-brexit-story-so-far-may... for a view on global media coverage of Brexit over time.
Regardless of the minor technical nitpicks here, I respect the hell out of the person/people who wrote this article, if only because I know how insanely hard it is to express a technical thought in great detail while maintaining a certain amount of levity to try and keep people interested.
Similar to the phrase "Lies, damned lies, and Benchmarks", there should be the phrase "Lies, damned lies, and AWS costs".