Data engineering learning path with recommended resources

  • First, this is a nice resource, so good job!

    As someone who has worked in this space for the past several years, I'd say the biggest problems in data engineering are holistic in nature. Sure, you need to know Python, SQL, data warehouses, data modeling, etc., but to me by far the biggest problems have to do with the entire architecture, i.e., how do you extract data from potentially unreliable data sources, pull that data into some staging area, and build further workflows on top of this raw data to reliably update or create data warehouses/marts or deploy ML models? How do you allow everyone in your company to access and work with the data in a compliant and secure way? How do you test any of this? How can distributed teams, sometimes technical, sometimes more business-oriented, interact with the architecture, add and control data, and release it into the overall company data stream? Has anyone found a reliable and maintainable way to set up CI/CD for company data architecture/pipelines/projects?

    To me these are the big problems. And if anyone has any resources for any of these topics I would be super interested, since I deal with these problems daily :)
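    To make the "unreliable source, staging area, how do you test it" questions a bit more concrete, here is a minimal Python sketch, not a recommended architecture. Everything in it is hypothetical: `FlakySource` simulates an unreliable upstream, `extract_with_retry` pulls rows into an in-memory "staging" list, and `validate` splits staged rows into accepted and rejected before anything downstream would use them. The required column names `id` and `amount` are assumptions made up for the example.

```python
import time


class FlakySource:
    """Hypothetical stand-in for an unreliable upstream system."""

    def __init__(self, failures, rows):
        self.failures = failures  # number of calls that fail before one succeeds
        self.rows = rows
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated outage")
        return self.rows


def extract_with_retry(fetch, attempts=3, base_delay=0.0):
    """Call fetch() up to `attempts` times with exponential backoff.

    Returns the fetched rows as a list (the "staging area" in this sketch).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return list(fetch())
        except ConnectionError as exc:
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"source failed after {attempts} attempts") from last_error


def validate(rows, required=("id", "amount")):
    """Split staged rows into (good, rejected) based on required fields."""
    good, rejected = [], []
    for row in rows:
        if all(row.get(key) is not None for key in required):
            good.append(row)
        else:
            rejected.append(row)
    return good, rejected
```

    Because the source is injected as a plain callable, the same functions can be exercised in CI against a fake like `FlakySource` without touching a real system, which is one small piece of the testing question.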

  • Here are some more awesome and free Python learning resources:

    * https://greenteapress.com/wp/think-python-2e/

    * https://automatetheboringstuff.com/2e/

    * https://dabeaz-course.github.io/practical-python/Notes/Conte...

    Also, I'd highly discourage tutorialspoint as a resource. Here's an example of them rewording another tutorial as their own: https://twitter.com/nixcraft/status/998248317661335552

  • I’m currently working as a data engineer. I used to be a DBA for 5 years. I’m thinking now that the role of data engineer combines what were three roles at my first job: data warehouse engineer, DBA, and software engineer. It’s really the best of many worlds, and I really enjoy it. I get to write the Python I’m good at (I was never really good at general software engineering and feature development), gatekeep a bit to keep my DBA chops up (SQL code quality, query tuning, access control, etc., but without needing to be intimately well versed in any particular database), and spend my time creating new ETL processes and maintaining various EDWs and data lakes.

    It’s my favorite role I’ve had to date and I’m really happy in it.

  • Is it just me or does anybody feel overwhelmed with lists like these?

    I really appreciate the effort, but as an anxious person I always feel paralyzed or disheartened by the road ahead.

    For instance, one of the recommendations is Learning Python, 5th Edition - Mark Lutz. This book alone is a tome.

    But anyways, it looks very well presented. Much better than plain bullet points. Well done!

  • Nice job! Perhaps an interesting resource to add: I'm maintaining an "open-source data engineering" awesome list: https://github.com/gunnarmorling/awesome-opensource-data-eng....

  • This isn't really a valid Show HN so I've taken that out of the title. It's maybe a borderline case because the website has some interactivity, but it's ultimately a list, and those are explicitly ruled out: https://news.ycombinator.com/showhn.html

  • Looks nice! I am the maintainer of https://roadmap.sh which is a similar list of roadmaps and learning plans. I am currently in the process of making the roadmaps interactive and this gave me some ideas for improving the format that I was preparing. Thank you for sharing!

  • Is anyone aware of a similar list but for a systems programmer learning path?

  • Security and privacy are last on the list and marked with an "essentiality" score of 1/3. I think as an industry, we need to do a better job of emphasizing and prioritizing those topics early and often throughout the educational process, or else the perpetual cycle of data misuse, leaks, and breaches is bound to continue.

  • Very nice job. As a data engineer / data warehouse architect with 15 years of experience, I can say that this learning path is laid out pretty accurately.

    The rendering of the webpage is interesting. What framework did you use? A tutorial on rendering JSON data to a webpage like this would be really helpful.

  • Here's a worthwhile guide that shows a learning path, includes more skills, and seems easier to comprehend: https://github.com/datastacktv/data-engineer-roadmap

  • Pipelines Management (Workflow management): the second link is wrong (copy-paste error?)

    Good resource.

  • There’s one link I saw on GDPR.

    I’d encourage folks to think about their data retention policies early. Build your data architecture with privacy in mind. Regulations like GDPR can require you to serve customers a copy of their data or delete their data from your systems.

    Don’t store things you don’t need. Keep retention policies. Please protect customer data.
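    As a rough illustration of what "keep retention policies" can look like in code, here is a small Python sketch using the standard-library `sqlite3` module. It is not legal or compliance guidance. The `events` table, its columns, the 30-day `RETENTION_DAYS` value, and all function names are hypothetical choices for the example.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 30  # assumed policy value, purely illustrative


def make_store():
    """Create an in-memory store with a hypothetical `events` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        "customer_id TEXT NOT NULL, "
        "payload TEXT NOT NULL, "
        "created_at REAL NOT NULL)"  # epoch seconds
    )
    return conn


def record(conn, customer_id, payload, created_at):
    """Store one event with its creation time."""
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?)",
        (customer_id, payload, created_at.timestamp()),
    )


def purge_expired(conn, now, retention_days=RETENTION_DAYS):
    """Delete rows older than the retention window; return how many were removed."""
    cutoff = (now - timedelta(days=retention_days)).timestamp()
    return conn.execute(
        "DELETE FROM events WHERE created_at < ?", (cutoff,)
    ).rowcount


def export_customer(conn, customer_id):
    """Return everything stored about one customer (a 'copy of their data')."""
    rows = conn.execute(
        "SELECT payload, created_at FROM events "
        "WHERE customer_id = ? ORDER BY created_at",
        (customer_id,),
    )
    return [{"payload": payload, "created_at": created} for payload, created in rows]


def erase_customer(conn, customer_id):
    """Delete everything stored about one customer; return how many rows were removed."""
    return conn.execute(
        "DELETE FROM events WHERE customer_id = ?", (customer_id,)
    ).rowcount
```

    The point of the sketch is only that export and erasure become one-line queries when every row carries a customer key and a timestamp from the start. Retrofitting those onto an existing warehouse, plus backups and downstream copies, is the hard part.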