Maintainable ETL Practices at Stitch Fix

  • Huge +1 to leveraging SQL as much as possible during ETL.

    We recently moved our data warehouse to Snowflake. Along the way, we adopted an ELT paradigm instead of ETL, where you load your raw data into the warehouse and then perform any transformations in the warehouse. A tool called dbt (https://www.getdbt.com/) has made this a joy. You write select queries in plain SQL files, dbt infers a DAG from the dependencies between them, and it can then build tables/views efficiently in parallel. dbt also comes with some nifty testing facilities to validate your data once it is transformed.
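
    For anyone who hasn't used it, a dbt model is just a select statement in a .sql file; the ref() call is how dbt infers the DAG. The model and column names below are invented for illustration:

        -- models/stg_orders.sql (hypothetical model; raw_orders is
        -- another model it depends on). dbt reads the ref() call,
        -- adds the dependency edge, and materializes the result as
        -- a table or view per its configuration.
        select
            order_id,
            customer_id,
            cast(ordered_at as timestamp) as ordered_at,
            amount_cents / 100.0          as amount_usd
        from {{ ref('raw_orders') }}
        where order_id is not null

    The tests (e.g. dbt's built-in not_null and unique checks) are declared in a YAML file alongside the model and run against the built table.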

    This is a great article, thank you for writing it!

  • In discussions like this, I often wonder how there can be enough data to require a "big data" pipeline with things like Spark and Presto.

    Stitch Fix seems to be one of those online services that send you sets of clothes that they think fit your style. That seems like a really narrow, low-data kind of industry. How much data can they possibly have? And why so big on the backend? In January 2018 they had 100 engineers. Presumably they're even larger now. Just for a service that sends out clothing.

    Maybe I'm lacking in imagination or insight into what it takes to run a company like this. On the other hand, a single PostgreSQL instance on one node can run complex ad-hoc queries, CTEs and all, over millions, even billions, of rows.
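
    To make that concrete, here's the kind of ad-hoc query I mean, against an invented single-table schema on plain PostgreSQL:

        -- Hypothetical: one wide events table on a single node.
        -- A CTE aggregates per customer per month, then the outer
        -- query averages across customers.
        with monthly as (
            select customer_id,
                   date_trunc('month', created_at) as month,
                   count(*) as events
            from events
            group by 1, 2
        )
        select month, avg(events) as avg_events_per_customer
        from monthly
        group by month
        order by month;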

  • Too large a chunk of my life was burnt on the ETL from hell. Other than SQL, I can't see any tool that would have helped.

    I found the easiest way was just to dump tables of data into long varchars - even the supposedly numeric stuff, because often enough it wasn't - then scrub & filter it from there. (NB, if you take anything from this post, make it that).
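
    Roughly this pattern, sketched here in PostgreSQL syntax with invented table/column names:

        -- Stage everything as text: nothing gets rejected on load.
        create table staging_orders (
            order_id    varchar(4000),
            amount      varchar(4000),   -- "numeric", allegedly
            ordered_at  varchar(4000)
        );

        -- Then scrub & filter: keep rows that actually parse,
        -- investigate the rest at leisure.
        insert into clean_orders (order_id, amount, ordered_at)
        select order_id::bigint,
               amount::numeric,
               ordered_at::timestamp
        from staging_orders
        where order_id   ~ '^[0-9]+$'
          and amount     ~ '^-?[0-9]+(\.[0-9]+)?$'
          and ordered_at ~ '^[0-9]{4}-[0-9]{2}-[0-9]{2}';  -- loose check; truly bad dates still raise on cast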

    There were multiple challenges, but the most painful was the sheer filth mixed into the data. Sometimes even a human couldn't work out what the heck was supposed to be in some field. Usually we could, but it ate a lot of time.

    It doesn't sound like your ETLs are anything like mine; you start with pretty clean data. You are lucky.

  • “The primary benefit of SQL is that it’s understood by all data professionals: data scientists, data engineers, analytics engineers, data analysts, DB admins, and many business analysts.”

    No, the primary benefit of using SQL is not SQL itself but the “free features” you get with the RDBMS you’re using: enforcement of primary keys, automatic data type conversions, free transactions and rollbacks ... and the sheer gift that is set-based operations on large chunks of data. It is just plain stupid to code the transformation or insertion of millions of rows iteratively rather than to rely on the implicit rollback power of an INSERT statement.
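
    The contrast in a nutshell, with invented table names: the set-based version is one statement and therefore one implicit transaction, so a failure part-way through leaves no half-loaded table behind.

        -- Set-based: one INSERT ... SELECT, one implicit transaction.
        -- If any row fails, the whole statement rolls back.
        insert into dim_customer (customer_id, full_name, country_code)
        select customer_id,
               trim(first_name || ' ' || last_name),
               upper(country_code)
        from staging_customer
        where customer_id is not null;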

    Rule #1 in my projects: avoid getting sexy in code; use the RDBMS for all transformations post staging. Always stage to the RDBMS “as is” (data warts and all). Clean up with SQL and stored procedures. This way you can query the whole original data set with SQL and figure out the best cleanup and transformations needed.

  • One comment: in some databases CTEs act as optimization fences. PostgreSQL before version 12, for example, always materialized the CTE result, so outer predicates weren't pushed down into it and indexes on the underlying tables could go unused. That's something to consider if you may need to port code between databases whose optimizers treat CTEs differently.
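
    For the curious, PostgreSQL 12+ inlines CTEs by default and lets you choose per query; a sketch with invented names:

        -- PostgreSQL 12+ syntax. NOT MATERIALIZED lets the planner
        -- inline the CTE, so an index on big_table(id) can serve the
        -- outer filter; MATERIALIZED forces the old fence behaviour.
        with recent as not materialized (
            select *
            from big_table
            where created_at > now() - interval '7 days'
        )
        select * from recent where id = 42;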