Parquet and ORC's many shortfalls for machine learning, and what to do about them

  • This article summarizes research from my lab, done in collaboration with ByteDance and published at CIDR (a computer science conference to be held in Amsterdam two weeks from now), on a new columnar format designed for ML workloads.