Why SQL syntax sucks, and why it matters
Yes, SQL isn't good for data analysis. I recently had to help one of our data analysts port some code from Impala to Hive, and the query they gave me was terrifying: a dozen CTEs strung together in a query that was literally hundreds of lines long.
However, I think we already have an answer to that complexity, and it's not Malloy, but data frames. With data frames you can split your query into logical steps, each of which can be debugged and examined without having to rewrite the query like you have to with CTEs.
You probably know that the earliest proponents of the relational model criticized SQL and still do. Chris Date and Hugh Darwen pushed the Tutorial D query language; Date has used that in a couple of his relational database books and in his seminars. Both experts and regular users have recognized the flaws and warts of the SQL language for decades. Nevertheless SQL remains standing with no serious replacement in sight, I think for obvious reasons:
- Vast amounts of SQL code already written, with widespread language and library support.
- The various relational database vendors long ago settled on various flavors of SQL, and SQL has an ISO/ANSI standard. While not entirely compatible across vendors the commercial and open source SQL dialects have mostly stayed "close enough" to allow changing database engines (though that rarely actually happens in my experience).
- Oracle has led the relational database business for decades, they have a big investment in their flavor of SQL and no incentive to disrupt their customer base.
It often happens in the software world that a language or tool catches on and gets widespread use, and then in retrospect we see the flaws. Fixing those flaws, or migrating to a different language or tool, has higher costs and risks (actual or perceived) than living with the flaws. Leaving the mainstream of relational databases may work for startups and hobby projects, but at enterprise scale you can't get any traction trying to shift from a well-understood and working language to an unproven competitor.
To paraphrase Steve Jobs, a SQL replacement would have to offer an order of magnitude improvement over SQL to get attention, not just incremental or cosmetic improvements. The relational model and SQL did offer that order of magnitude improvement over what we had before (and I remember, my career started before relational databases). None of the replacements or alternatives that have come along since the '80s show huge improvements along any dimension except in some niche cases.
I write SQL every day, and have for many years. Sometimes the language annoys me, but I can live with that -- I see it as a tool rather than a lifestyle choice. When I mentor programmers or teach someone relational databases and SQL and see them struggle I think that happens because they have not formed the correct mental model of relational databases, not because of flaws in the SQL language.
This is a follow up to my last post comparing SQL to the new Malloy query language, and a response to some of the HN commentary I received around it.