Redshift vs. BigQuery vs. Snowflake benchmark

  • Disclosure: I work on Google Cloud (but not BigQuery specifically).

    Cool! First, thanks for doing this.

    You allude to it a little in your text, but I think your cost model:

    (cost_per_hour / 3600) * query_in_seconds / expected_utilization

    doesn't (often) work out. Your benchmarks only run one query at a time, so BigQuery will look pretty expensive compared to the smallest Redshift/Snowflake cluster.

    Have you tried simulating, say, a team of 10 analysts by sending concurrent queries? (You will see the average runtime go up as you fill the box.) Alternatively, even 18% utilization is probably way too high for a single person/analyst!
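
    To make the utilization point concrete, here is a back-of-the-envelope version of that formula with made-up numbers (a single $0.25/hour node and a 10-second query are assumptions for illustration, not figures from the article; the 18% utilization is the number above). It can even be run as a trivial SELECT:

      -- hypothetical: one $0.25/hour node, a 10-second query, 18% utilization
      SELECT (0.25 / 3600.0) * 10 / 0.18 AS cost_per_query_usd;   -- ~$0.0039
      -- at 1.8% utilization (closer to one occasional analyst) the same query
      -- costs ~$0.039; concurrent load pushes utilization up and the per-query
      -- cost of a fixed-price cluster back down, which is exactly what
      -- single-query benchmarks miss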

    P.S.: You seem to have checked in export PGPASSWORD=...

  • I have to consider these results meaningless based solely on the fact that you do not use the "advanced" features like sort and distribution keys. In Redshift especially, sort and distribution keys are not "advanced" features; they are fundamental factors in determining performance.

    Your justification for this is that you don't know how queries will run, but that's a poor excuse. With your star schema example dataset it should be easy to pick distribution keys that match the access patterns of the commonly joined tables (see the sketch below). There may be tradeoffs here, but it's far from impossible, and you are crippling performance by not using these features.
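
    As a purely illustrative sketch (table and column names are invented; the right keys obviously depend on the real access patterns), the kind of DDL being asked for is roughly:

      -- hypothetical star schema: replicate the small dimension, co-locate and
      -- sort the fact table on the columns it is filtered and joined on
      CREATE TABLE dim_customer (
          customer_id BIGINT,
          region      VARCHAR(32)
      )
      DISTSTYLE ALL;                 -- small dimension: a full copy on every node

      CREATE TABLE fact_sales (
          sale_id     BIGINT,
          customer_id BIGINT,
          sale_date   DATE,
          amount      DECIMAL(12,2)
      )
      DISTKEY (customer_id)          -- distribute on the most commonly joined key
      SORTKEY (sale_date);           -- range filters on date skip whole blocks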

  • I'm not sure leaving out sort and dist keys is actually being fair to Redshift here.

    I get the argument that you're avoiding any "special tuning", but these are basic features that the docs and most guides will tell you are not really optional. To my mind it's more like leaving the primary key declaration off a table in Postgres or MySQL.

  • BigQuery is one of the few products that truly feels magical. It lives up to the "no-ops" model of just working without thinking about details and handles any scale of data you have.

    The only downsides would be the pay-per-query pricing and the lack of low-latency queries.

    This is super well written! I've seen a lot of benchmarks show up on Hacker News, and I think this one is exceptional for calling out and justifying all of its assumptions, then comparing itself to other benchmarks and explaining how they might differ. I really appreciate all of that clarity. The academic but easy-to-read style doesn't make for flashy headlines, but it's so much more informative.

  • I don't know; personally I feel like the scale here is much too low to be relevant. Also I think not investing the time to tune sort and dist keys makes the comparison meaningless.

    But maybe that only becomes meaningful at larger data sizes, and maybe most people work with less data most of the time.

  • Do you plan to take the developer cost into account? BigQuery is more managed than Redshift, so that definitely factors into the "real world" cost analysis.

  • I have been using Redshift for two years now, and while every database has its own SQL dialect and its own tricks, it is a really good product. It is a pity that Amazon forked an early PostgreSQL 8 rather than PostgreSQL 9 (with array support), but when I think about the databases I used before (I was the admin of a Netezza instance and a user of an Oracle instance; they were data warehouses too), I realize what a big step forward it is that we can now create, with a few clicks and really fast provisioning, a database that used to take days or even months (contracts, waiting for the hardware to arrive at the datacenter, DBAs installing and configuring it, etc.) to get up and running. Next to Redshift or BigQuery, that feels like another era.

  • There is a different side to the cost benchmark that's not captured by the description here. If your use case needs a lot of stored data but not a matching degree of peak CPU (even if your query load is otherwise pretty consistent), Redshift becomes really expensive really fast and feels like a waste. BigQuery, meanwhile, keeps costs (almost) linear in your actual query usage, with very low storage costs.

    For example, you may need to provision a 20-node cluster only because you need 10+ terabytes of storage across several datasets that have to stay "hot" for sporadic use throughout the day/week, even though you don't need nearly that much computational capacity around the clock. Unlike BigQuery, Redshift doesn't separate storage from querying. Redshift also doesn't offer a practically acceptable way to scale up/down: resizes at that scale can take up to a day, deleting/restoring datasets causes lots of administrative overhead, and even capacity tuning between multiple users is a frequent concern.

    Making matters worse, it is common for a small number of tables to be the large "source of truth" tables that you need to keep around to re-populate various intermediate tables, even if they themselves don't get queried that often. In Redshift, you end up provisioning a large cluster just to keep them around, even though 99% of your queries hit one of the smaller tables.

    That said, I haven't tried the relatively new "query data on S3" Redshift functionality. It doesn't seem quite equivalent to what BigQuery does, but it may perhaps alleviate this issue.
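
    For reference, that "query data on S3" feature (Redshift Spectrum) looks roughly like the sketch below, where cold data stays in S3 and is only scanned when queried. The schema, table, and bucket names and the IAM role ARN are placeholders, not anything from the article:

      -- hypothetical: expose Parquet files sitting in S3 as an external table
      CREATE EXTERNAL SCHEMA cold
      FROM DATA CATALOG
      DATABASE 'colddb'
      IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
      CREATE EXTERNAL DATABASE IF NOT EXISTS;

      CREATE EXTERNAL TABLE cold.events (
          event_id   BIGINT,
          event_time TIMESTAMP,
          payload    VARCHAR(65535)
      )
      STORED AS PARQUET
      LOCATION 's3://my-bucket/events/';

      -- queried like any other table, without keeping the data on the cluster's disks
      SELECT COUNT(*) FROM cold.events WHERE event_time > '2017-01-01';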

    Sidenote: I have been a huge Redshift fan pretty much since its release on AWS. I do, however, think that it is starting to lose its edge and show its age amid the recent advances in the space; I have been increasingly impressed with the ease of use (including intra-team and even inter-team collaboration) in the BigQuery camp.

  • I've been using Redshift fairly heavily for the past couple of years. Unfortunately, I've been too strapped for time to do a comparison with BigQuery, and we have only recently started experimenting with Snowflake.

    I would like to try out BigQuery, but experimenting with it would require too much work: I would have to move my entire data set (several petabytes) over to Google.

    Redshift has been fine, but we have two big issues with it:

    1) Loading data takes too long. Even with efficient COPY operations from S3, it takes too long to import data at scale.

    2) Inefficient queries frequently overflow to disk and consume the entire SSD. I've tried to train my data science team on how to avoid inefficient queries, but it's been impossible.
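
    (As a hedged aside on issue 2: Redshift's system views can at least surface the offenders. The query below is a sketch based on how I remember svl_query_summary from the docs; worth double-checking the column names.)

      -- find recent query steps that spilled to disk
      SELECT query, step, rows, workmem, label
      FROM svl_query_summary
      WHERE is_diskbased = 't'
      ORDER BY query DESC
      LIMIT 20;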

    So, I’m really looking forward to seeing the results from part 2!

  • Also, here is a rudimentary, 2-part comparison of Redshift vs Azure SQL DW (in preview) for good measure:

    Part 1 - http://bicortex.com/microsoft-azure-sql-data-warehouse-quick...
    Part 2 - http://bicortex.com/microsoft-azure-sql-data-warehouse-quick...

  • How hard would it be to add https://clickhouse.yandex/ to the benchmarks?

  • Curious, any reason you've left out MS Azure DW?

    I'd be interested in seeing their numbers in there as well.

  • A BigQuery partner here and a long-time user of Redshift. (https://cloud.google.com/partners/directory/#?q=caura https://discourse.looker.com/t/bigquery-migrating-from-legac... and you can Google my creds for Redshift with "Troubleshooting Redshift performance")

    So having benchmark tests is great as a general guideline for what works under different architectures/schema designs. Unfortunately, benchmark results are highly sensitive to the initial choices. I am a big fan of BigQuery (enough to go through Google's vetting process), but there are plenty of performance issues I've run into that would have been easily resolved with Redshift. Here are some concrete examples:

    1) Running a query across several very small tables. It turns out that occasionally querying small tables causes heavy network traffic within Google's distributed system. The solution on Redshift would be to adjust the distribution. On BigQuery, however, you don't have any control over this; you just have to hope that Google's algorithms pick up the issue based on usage (they don't).

    2) Joining large tables. Avoid joining large tables in BigQuery. In Redshift you would handle the join by making sure the sortkey is set on the join column of the (typically) right table, and by giving the two tables a common distkey (so that the relevant data for both tables lives on the same node); a sketch follows below. BigQuery just throws resources at the problem, and it turns out that throwing resources at the problem is super slow (think 5-15 seconds on Redshift vs. 200 seconds on BigQuery).
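
    As a purely hypothetical illustration of that Redshift-side setup (invented table names; the exact keys obviously depend on the actual join):

      -- both large tables share the join column as DISTKEY, so matching rows are
      -- co-located on the same node; the SORTKEY on that column helps the join
      CREATE TABLE orders (
          order_id    BIGINT,
          customer_id BIGINT,
          total       DECIMAL(12,2)
      )
      DISTKEY (customer_id)
      SORTKEY (customer_id);

      CREATE TABLE order_events (
          event_id    BIGINT,
          customer_id BIGINT,
          event_time  TIMESTAMP
      )
      DISTKEY (customer_id)
      SORTKEY (customer_id);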

    Re: Snowflake. I can't speak to it, as I haven't had personal experience with it. I have worked with data people whose opinions fell on both the favorable and the negative ends of the spectrum, which just suggests to me that, like Redshift and BigQuery, Snowflake is not a universal solution. You really need to understand: 1) what your goals are for usage among the various consumers, and 2) what skill sets the various users of the database have.