Loading half a billion rows into MySQL

  • Assuming your event data is immutable (i.e. no UPDATEs, just INSERTs), you'd probably have fewer headaches long-term if you just dumped the database to flat files, stored them in HDFS, and queried them using Hive (which has MySQL-ish query syntax anyway). This architecture will take you to billions of rows quite happily.

    This is the architecture we use for eventstream analysis at SnowPlow (https://github.com/snowplow/snowplow).
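
    As a rough HiveQL sketch of that setup, assuming tab-separated event dumps already sitting in HDFS (all table, column, and path names here are made up):

        -- Point an external Hive table at the dumped flat files in HDFS
        CREATE EXTERNAL TABLE events (
          user_id    INT,
          event_type STRING,
          created_at STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        LOCATION '/data/events/';

        -- Then query it with Hive's SQL-ish syntax
        SELECT event_type, COUNT(*) FROM events GROUP BY event_type;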

  • I want to give this 10 upvotes. This is one of those things you only see on HN occasionally, and it's full of all kinds of awesome little nuggets. That command to get the count from the information_schema is gold in and of itself (I now know about the tee command).
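
    For anyone who hasn't seen it, the information_schema row-count query is roughly the following (the schema and table names are placeholders, and the count it returns is only an estimate for InnoDB):

        SELECT table_rows
        FROM information_schema.tables
        WHERE table_schema = 'mydb'
          AND table_name   = 'events';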

  • I saw chunk load time increase from 1m40s to around an hour per million inserts.

    Your insert performance falls off a cliff when a majority of the index pages for the table no longer fit into the InnoDB buffer pool. After that happens, there is going to be a bunch of random I/O. You can solve this problem with partitioning; that way only a single partition's worth of index pages needs to fit into the buffer pool to keep inserts into that partition fast. Of course you have to size your partitions accordingly.

    A few other tips. Disable fsync() at commit entirely by setting innodb_flush_log_at_trx_commit=0. If you crash during the data load, start over. Set your transaction logs to be as large as possible, which is usually 4G.
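
    A minimal sketch of those settings, assuming a MySQL 5.5-era server (the redo log size is a my.cnf option, not something you can change at runtime there):

        -- Stop flushing the redo log to disk on every commit; flush roughly once a second instead.
        -- If the server crashes during the load, you restart the load anyway.
        SET GLOBAL innodb_flush_log_at_trx_commit = 0;

        -- The transaction (redo) log size goes in my.cnf, e.g.:
        -- [mysqld]
        -- innodb_log_files_in_group = 2
        -- innodb_log_file_size      = 2000M   -- combined size just under the 5.5-era 4G cap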

  • Does it blow anyone else's mind that database servers are so powerful nowadays that you can just load half a billion rows into a SINGLE TABLE and pretty much expect things to work? Obviously there are caveats but all in all a nice chunk of RAM backed by a large SSD goes a long way.

  • I routinely load and reload ~7 billion rows into Oracle 11g, once every 5 months or so. It takes about 4 days (20 days if you do something stupid like creating the indexes before loading), although I think Oracle can go quite a bit faster; that 4 days is limited by processing and non-DB I/O.

    We use Oracle because of its better partitioning options and bitmapped indexes. (We wanted more partitions so we could use a hierarchical triangular mesh for point-radius searches.)

  • If this data is primarily archival and there are multiple backups of the same dataset out there, why not use MyISAM? In my mind the only reasons to use InnoDB are integrity-related, things like real foreign keys and a more ACIDy commit method. If the dataset is read-only and copied in several places, surely this stuff does not matter too much and MyISAM is much more performant. Maybe I misread the use case?

  • "MySQL Partitioning. We decided against it because it seemed likely that it wouldn’t be much faster than our current solution."

    What if you partition by HASH(user_id) instead of partitioning by month (http://dev.mysql.com/doc/refman/5.5/en/partitioning-hash.htm...)?
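
    For illustration, a hash-partitioned layout might look something like this (table and column names are hypothetical):

        CREATE TABLE events (
          user_id    INT UNSIGNED NOT NULL,
          created_at DATETIME     NOT NULL,
          payload    VARBINARY(255),
          KEY (user_id, created_at)   -- lookups by user_id prune down to a single partition
        ) ENGINE=InnoDB
        PARTITION BY HASH(user_id)
        PARTITIONS 32;

    Queries that filter on user_id then only need that one partition's index pages in the buffer pool.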

  • This struck me as odd:

    "You could drop the indices completely and add them later, but with a table size this big I didn’t think it would help much."

    Surely this is one of the main reasons why the load speed went way down as the database size increased. It has to be better to add an index later, possibly much better.
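
    If one did want to try it, the drop-and-rebuild approach is just a pair of ALTERs around the bulk load (index, column, and table names below are made up):

        ALTER TABLE events DROP INDEX idx_user_created;
        -- ... run the LOAD DATA INFILE chunks here ...
        ALTER TABLE events ADD INDEX idx_user_created (user_id, created_at);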

  • I'm not familiar with Percona, what problem does it solve?

  • In my experience the built-in partitioning support of MySQL (using PARTITION BY RANGE for example) is very good.

    Inserts no longer drag due to huge indexes, DELETEs are instant (drop a partition of stale data), and SELECTs can limit their lookups to specific underlying files.

    This is for a 500m row table of time series data.
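
    As a sketch, a range-partitioned time-series table along those lines might look like this (names and boundary dates are made up):

        CREATE TABLE metrics (
          recorded_at DATETIME     NOT NULL,
          metric_id   INT UNSIGNED NOT NULL,
          value       DOUBLE       NOT NULL,
          KEY (metric_id, recorded_at)
        ) ENGINE=InnoDB
        PARTITION BY RANGE (TO_DAYS(recorded_at)) (
          PARTITION p2012_01 VALUES LESS THAN (TO_DAYS('2012-02-01')),
          PARTITION p2012_02 VALUES LESS THAN (TO_DAYS('2012-03-01')),
          PARTITION pmax     VALUES LESS THAN MAXVALUE
        );

        -- Dropping stale data becomes a near-instant metadata operation instead of a huge DELETE:
        ALTER TABLE metrics DROP PARTITION p2012_01;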

  • Issues that can also severely degrade a system's performance, and may not come into play until your data has been growing over extended periods of time (not represented in the benchmark here):

    1) Table statistics can grow stale and degrade the engine's ability to select the correct access path. This can be corrected by refreshing the statistics (ANALYZE TABLE in MySQL; other databases have equivalents such as CREATE STATISTICS), depending on your DB engine and version.

    2) Page fragmentation, not to be confused with disk fragmentation, where the engine ends up allocating non-contiguous space. A rebuild of the table will be necessary if your pages are extremely fragmented and/or you are using a non-optimal allocation size.
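
    In MySQL terms, both issues map to routine maintenance statements (the table name is a placeholder):

        ANALYZE TABLE events;    -- refresh index cardinality statistics for the optimizer (issue 1)
        OPTIMIZE TABLE events;   -- rebuild the table and indexes to defragment pages (issue 2)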

  • "push as much complexity as possible to the database"

    In general this is bad practice and difficult to scale. I do agree with using a single table, but I disagree with the general premise of that statement.

  •     SELCT table_rows
    
    Should be,

        SELECT table_rows

  • Good info. Did Percona fix the LOAD DATA NULLs ticket?

    LOAD DATA INFILE has had this ticket open since 2006; sometime during the 2000s it was converted to a feature request:

    http://bugs.mysql.com/bug.php?id=23212

    I think that's why my load does

        FIELDS ESCAPED BY '\\'
    
    but it's been so long I can't remember. (At this point a lot of folks would mention Postgres, but I'll refrain.)
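
    For reference, here is a hedged sketch of how that clause fits into a full LOAD DATA INFILE statement, so that \N in the dump file is read back as SQL NULL (file and table names are made up):

        LOAD DATA INFILE '/export/events.tsv'
        INTO TABLE events
        FIELDS TERMINATED BY '\t' ESCAPED BY '\\'
        LINES TERMINATED BY '\n';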

  • I fail to see why setting the transaction isolation level to read uncommitted would make any difference to the data load process.

  • Good stuff. I remember trying to load a Wikipedia database dump into MySQL on my laptop years ago and giving up.

  • I want to say this gentleman is doing it wrong. If his current set-up is what I perceive from the description, he is not going to gain any speed or scalability by moving to one giant table. In fact he might lose some. It sounds like he set up 12 monthly tables, inserts data into each one by month, and then queries across all 12 tables and joins the results. You can do that query in one fell swoop using UNION (sketched at the end of this comment). If you have a significant number of entries, the overhead of opening the 12 tables is negligible. If you're looking up on a primary key then you will be more than fast enough.

    Switch your tables to MyISAM and make a 13th table. Insert the current month into the 13th table and query across all 13 tables when you need to get data out. At the end of the month, move all records in the 13th table into their proper month table. (Make the 13th table a memory table and sync it to disk every few minutes, or put it on a really fast SSD.)
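
    A rough sketch of that multi-table query, assuming monthly tables named events_2012_01 through events_2012_12 plus a current-month table (UNION ALL avoids a needless de-duplication pass):

        SELECT * FROM events_2012_01 WHERE user_id = 42
        UNION ALL
        SELECT * FROM events_2012_02 WHERE user_id = 42
        -- ... one branch per remaining monthly table ...
        UNION ALL
        SELECT * FROM events_current WHERE user_id = 42;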

  • If you are storing this data for archival purposes & querying, did you check out Infobright? They have built it as a MySQL engine.

  • Why would anyone partition tables -- ever?

  • All great advice! As a fellow habitual migrator of billions of rows in MySQL, a few things I'd add:

    Do several chunked LOAD DATA INFILE queries in parallel (sketched just below). It's counter-intuitive, but you'll get a performance boost. I usually chunk large data sets into 100+ pieces and do 40 concurrent import queries. Mileage may vary; the sweet spot depends a lot on your hardware and data set.

    Don't use mysqldump for the export. SELECT ... INTO OUTFILE is faster, especially since you can apply the exact same parallelized chunking technique. Do this on a standby slave (i.e. one not receiving reads) with replication stopped, so that you get a consistent dump at a point in time.
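
    A sketch combining both points, assuming an events table with an auto-increment id column (paths, names, and chunk ranges are made up; in practice many of these ranges run concurrently):

        -- On the standby slave: export one chunk of rows to a TSV file
        SELECT * FROM events
        WHERE id BETWEEN 1 AND 5000000
        INTO OUTFILE '/export/events_chunk_0001.tsv';

        -- On the target server: import that chunk (several of these can run in parallel)
        LOAD DATA INFILE '/export/events_chunk_0001.tsv' INTO TABLE events;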

    The TSV dump doesn't need to be in your database directory. It can be anywhere mysqld can access. It can be beneficial to put it on a separate volume entirely, so that you're reading heavily from one volume while writing heavily to another.

    Be mindful of disk scheduler choices in Linux if you're exporting from a faster disk to a slower disk; there are some definite edge cases that can lead to starvation of core processes. Also, be mindful of disk scheduler choices in Linux in general: avoid CFQ on your database data volume.

    Watch disk utilization with iostat or another tool of your choice. If you're coming close to saturating one or more volumes, that probably means you're on the right track :)

    As other commenters have mentioned, the row count in information_schema (or equivalently SHOW TABLE STATUS) is an estimate, due to how MVCC works. If you need a more accurate count (though still not a perfect moment-in-time count for a growing data set), do it with parallel chunked SELECT COUNT(*) queries.

    Happy migrating!

  • It's more work for the computer to process data stored as "rows" than data stored as columns.

    Think of how a disk is organized.

    If you abandon MySQL and store data as columns you can load a trillion rows.

    But forget I mentioned this.

  • This discussion reminds me of non-technical users. They will tolerate incredible sluggishness. It is amazing what Oracle and the free alternatives get away with. People, including engineers/developers, not only tolerate sluggishness but when something actually works on the first try they think it's amazing.

    I've witnessed this tolerance for mediocrity for years and it still continues to blow my mind.

  • FTA:

    "...I decided to migrate the existing system into a single table setup."

    "Alternative Proposed Solutions: MySQL Partitioning and Redis"

    I'm surprised he didn't consider Mongo, Couch, etc.