Ask HN: Developing Web Apps around SQL tables that have millions of rows

A few buddies and I have been working on a project that seems to have started taking off. Currently I'm in charge of building our dashboard analytics platform for our client base, and suffice it to say the amount of data is beyond what I've been used to dealing with. We're upwards of 31 million rows as of yesterday, and it's rising at an insane pace (almost exponentially) as our client base grows. We're using an Apache/PHP/MySQL stack, no frameworks on the server side but Backbone on the front end. It looks like we'll hit 50 million rows within a week, and that pace will only increase as we add more clients.

My question for everyone here is: what (if any) experiences, insights, or resources did you find really helpful when you started having to deal with querying 1+, 10+, 100+ million rows? Any advice? I'm currently scouring the net for past experiences and alternative solutions to the issues we're facing now and the problems we may encounter in the future.

  • You've already gotten very good tips. I'll add these for completeness:

    Review your backups and contingency plans (and measure your restore times).

    Usually you will want one mysql.conf for production and a different one for a full restore (with different parameters). A full restore of such a big database with the MySQL defaults can take a looooooooong time (there's a small sketch of this at the end of this comment).

    As an anecdote: in the biggest MySQL system I've ever touched, there was a "performance problem". When I got the credentials and started to review it, I found that 3 different sysadmins had scheduled 3 different daily full backups, all using dumps, and some of them during office hours.

    Yes, that's just plain stupid, but it's an example from real life.

    On a big database system, backups (and restores) are something you'll want to design with care, monitor, and measure/adapt periodically.

    If you go to master/slave(s), you can usually take backups from a slave without disrupting the master's operation (also sketched at the end of this comment).

    Also, you may be interested in playing with things like mysqlproxy to split reads/writes, so writes go to the master but reads are spread across multiple slaves.

    I assume you're already on 64-bit, and that you already know about the MySQL tuning variables.
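
    Since the restore-configuration bit can be abstract, here's a minimal sketch of the kind of settings people often relax while replaying a big dump. It assumes InnoDB, assumes you can live with reduced durability and no binary logging for the duration of the restore, and the dump path is just a placeholder:

        -- in the session that replays the dump
        SET unique_checks = 0;                           -- skip secondary unique-key checks during the load
        SET foreign_key_checks = 0;                      -- skip foreign-key validation during the load
        SET sql_log_bin = 0;                             -- keep the restore out of the binary log (only if you don't need it replicated)
        SET GLOBAL innodb_flush_log_at_trx_commit = 2;   -- flush the redo log about once a second instead of on every commit

        SOURCE /path/to/full_backup.sql;

        -- put everything back once the restore finishes
        SET GLOBAL innodb_flush_log_at_trx_commit = 1;
        SET foreign_key_checks = 1;
        SET unique_checks = 1;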
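
    And a tiny sketch of the "back up from a slave" idea, using standard MySQL replication commands. It assumes you already have a slave you can let lag for the length of the backup:

        -- on the slave: keep receiving replication events, but stop applying them
        STOP SLAVE SQL_THREAD;

        -- run the backup here (e.g. mysqldump or a filesystem snapshot) against the now-quiet slave

        -- let the slave catch up again afterwards
        START SLAVE SQL_THREAD;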

  • TLDR; Split databases, split tables, prefetch queries with cron jobs or database functions, use separate servers, don't collect redundant data

    As a BI developer, I'd also like to second the idea of splitting the table up and changing your architecture. Ideally, you would leave your production database untouched and work against a replicated copy. The replicated database will of course still be really massive, so you will want to create tables specifically for querying (an attempt to simulate a data mart) where you only keep the data you actually need for reporting. Another idea would be to have cron jobs pre-run the expensive queries at, say, midnight, so when your users hit the dashboard they are fetching pre-computed data (see the sketch at the end of this comment).

    Also, if the data really is growing exponentially, I would have a look at whether you actually need all of it. It's great to collect every single piece of data if you can, but if you can't cover the costs of doing that and there's no added value in all that data, it doesn't make sense to keep everything.
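
    A rough sketch of the pre-querying idea, with made-up table and column names (an "events" fact table with client_id and created_at; adjust to whatever your schema actually looks like):

        -- hypothetical summary table the dashboard reads from
        CREATE TABLE daily_client_stats (
            client_id   INT UNSIGNED NOT NULL,
            stat_date   DATE         NOT NULL,
            event_count INT UNSIGNED NOT NULL,
            PRIMARY KEY (client_id, stat_date)
        ) ENGINE=InnoDB;

        -- a nightly cron job re-runs something like this for the previous day
        REPLACE INTO daily_client_stats (client_id, stat_date, event_count)
        SELECT client_id, DATE(created_at), COUNT(*)
        FROM events
        WHERE created_at >= CURDATE() - INTERVAL 1 DAY
          AND created_at <  CURDATE()
        GROUP BY client_id, DATE(created_at);

    The dashboard then hits the small summary table instead of scanning tens of millions of raw rows.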

  • If you're growing at that pace, you can't do just one thing; you've got to attack it on multiple fronts. I'd approach something like this in roughly this order:

    1) Optimize data type usage

    2) Optimize indexing - removing redundant indexes, adding new indexes, changing existing indexes

    3) Optimize storage - get your SAN/disks in order

    4) Partition the data in such a way that you serve queries and users quickly

    If you do the above, you should easily be able to scale any decent hardware to 10x where you are now. That will buy you some time to build a data warehouse and the backend support/reporting systems around it. At some point you're going to be in a spot where you either have to invest $5m+ in hardware to keep serving your current data, or you're going to have to start archiving some of it to a data warehouse. (Steps 1 and 4 are sketched below.)
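
    To make steps 1 and 4 a bit more concrete, here's an illustrative sketch against a made-up "events" table. One caveat: MySQL requires the partitioning column to appear in every unique key (including the primary key), so partitioning often forces a schema change first:

        -- 1) tighten data types where the value range allows it
        ALTER TABLE events
            MODIFY client_id INT UNSIGNED NOT NULL,        -- instead of BIGINT
            MODIFY status    TINYINT UNSIGNED NOT NULL;    -- instead of a VARCHAR status string

        -- 4) range-partition by date so reports and purges only touch a few partitions
        ALTER TABLE events
            PARTITION BY RANGE (TO_DAYS(created_at)) (
                PARTITION p2013_01 VALUES LESS THAN (TO_DAYS('2013-02-01')),
                PARTITION p2013_02 VALUES LESS THAN (TO_DAYS('2013-03-01')),
                PARTITION pmax     VALUES LESS THAN MAXVALUE
            );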

  • What type of data makes up the rows? Relational databases fit many data persistence problems, but there are other types of databases that might be a better fit for your problem. Essentially, use the right tool for the job.

    Relational, key/value, document, columnar, graph. Within each of those types there are specific implementations (mysql, postgres, oracle, redis, riak, neo4j, titan, cassandra, mongodb, couchdb, etc.), each with their own pros/cons.

    Within the relational database realm, make sure you understand: when and when not to normalize/denormalize, clustered/non-clustered indexes, btree/rtree/hash indexes, schema design, storage engines, partitioning, master/slave, replication, caching strategies, execution plan optimization, etc..
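
    For the execution-plan part in particular, it's a habit worth building early; table and column names here are made up:

        -- see how MySQL plans to run a typical dashboard query
        EXPLAIN
        SELECT client_id, COUNT(*)
        FROM events
        WHERE created_at >= '2013-01-01'
        GROUP BY client_id;

        -- a covering index on (created_at, client_id) lets that query be answered from the index alone
        ALTER TABLE events ADD INDEX idx_created_client (created_at, client_id);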

  • Check whether there are any indices you don't need anymore. Check whether you have indices on long text or string columns, and see if you can convert those into indices on bigint or int columns (by hashing the string; sketched at the end of this comment).

    In my opinion, it's not the amount of data that is the bigger issue, but the amount of reads/writes coming into the system. 100+ million rows with just 1 thread reading and writing to it really isn't a big deal, technically. If you have load issues, you might need to shard the data, do bulk inserts by buffering the data, increase the memory, cache commonly read rows in memory, etc.
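
    A sketch of the hash-the-string trick mentioned above, assuming a made-up "events" table with a long url column and that you're OK re-checking the original string to handle hash collisions:

        -- add a 64-bit hash column and index it instead of the long string
        ALTER TABLE events
            ADD COLUMN url_hash BIGINT UNSIGNED NOT NULL DEFAULT 0,
            ADD INDEX idx_url_hash (url_hash);

        -- backfill: first 16 hex chars of MD5, stored as an unsigned 64-bit integer
        UPDATE events
        SET url_hash = CAST(CONV(SUBSTRING(MD5(url), 1, 16), 16, 10) AS UNSIGNED);

        -- lookups filter on the cheap integer index first, then re-check the string for collisions
        SELECT *
        FROM events
        WHERE url_hash = CAST(CONV(SUBSTRING(MD5('http://example.com/some/long/path'), 1, 16), 16, 10) AS UNSIGNED)
          AND url = 'http://example.com/some/long/path';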

  • Make sure you're storing your data optimally, and look into sharding your data if possible.

  • Drop Apache, use Nginx, and make sure you're caching what you can.