Pandas 2.0 and the Arrow revolution
The Arrow revolution is particularly important for pandas users.
pandas DataFrames live entirely in memory. As of 2017, the rule of thumb was to have 5-10x as much RAM as the in-memory size of your dataset. Let's assume pandas has made improvements and it's more like 2x now.
That means you can process a dataset that takes up 8GB in memory on a 16GB machine. But 8GB in memory corresponds to a lot less data than you'd expect with pandas.
pandas historically stored string columns as Python objects, which was wildly inefficient. The new string[pyarrow] column type is around 3.5 times more memory-efficient from what I've seen.
Let's say a pandas user can only process a string dataset that has 2GB of data on disk (8GB in memory) on their 16GB machine for a particular analysis. If their dataset grows to 3GB, then the analysis errors out with an out of memory exception.
Perhaps this user can now start processing string datasets up to 7GB (3.5 times bigger) with this more efficient string column type. This is a big deal for a lot of pandas users.
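A rough sketch of where a figure like that 3.5x could come from (illustrative only, not from the post; assumes pandas 2.x with pyarrow installed):

    import pandas as pd

    # a million distinct, moderately long strings
    words = [f"row {i}: some moderately long string" for i in range(1_000_000)]

    s_obj = pd.Series(words, dtype="object")           # classic object-backed strings
    s_arw = pd.Series(words, dtype="string[pyarrow]")  # Arrow-backed strings

    print(s_obj.memory_usage(deep=True))  # bytes used by the object column
    print(s_arw.memory_usage(deep=True))  # typically several times smaller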
The submission link points to the blog instead of the specific post. It should be: https://datapythonista.me/blog/pandas-20-and-the-arrow-revol...
I highly recommend checking out polars if you're tired of pandas' confusing API and poor ergonomics. It takes some time to make the mental switch, but polars is so much more performant and has a much more consistent API (especially if you're used to SQL). I find the code is much more readable, and it takes fewer lines of code to do things.
Just beware that polars is not as mature, so take this into consideration if choosing it for your next project. It also currently lacks some of the more advanced data operations, but you can always convert back and forth to pandas for anything special (of course paying a price for the conversion).
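To give a feel for the switch, here is a minimal polars sketch (column names are made up; recent polars spells it group_by, older releases used groupby), including the pandas round-trip mentioned above:

    import polars as pl

    df = pl.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})

    out = (
        df.filter(pl.col("value") > 1)          # expression-based filtering
          .group_by("group")
          .agg(pl.col("value").sum().alias("total"))
    )

    # convert to pandas for anything polars doesn't cover yet, then back
    pdf = out.to_pandas()
    back = pl.from_pandas(pdf)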
There is also Polars, which is backed by Arrow and is a great alternative to pandas.
Finally! I use pandas all the time, particularly for handling strings (DNA/AA sequences) and tuples (often nested). Some of the most annoying bugs I encounter in my code are the result of random dtype changes in pandas. Things like it auto-converting str -> np.string_ (which is NOT a string) during pivot operations.
There are also all kinds of annoying workarounds you have to do when using tuples as indexes, because pandas converts them to a MultiIndex. For example,
import pandas as pd
srs = pd.Series({('a'): 1, ('b', 'c'): 2})
is a Series of length 2. srs.loc[('b','c')] throws an error, while srs.loc[('a')] and srs.loc[[('b','c')]] do not. Not just to vent my frustrations, but this maybe gives an idea of why this change is important, and I very much look forward to improvements in this area!
Polars can do a lot of useful processing while streaming a very large dataset without ever having to load in memory much more than one row at a time. Are there any simple ways to achieve such map/reduce tasks with pandas on datasets that may vastly exceed the available RAM?
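Not an answer from the thread, but the usual pandas workaround is chunked iteration, e.g. read_csv(chunksize=...); a minimal sketch computing a mean over a file that never fits in RAM (file and column names are assumptions):

    import pandas as pd

    total = 0.0
    count = 0
    # each chunk is an ordinary DataFrame of at most 1,000,000 rows
    for chunk in pd.read_csv("huge.csv", chunksize=1_000_000):
        total += chunk["value"].sum()  # "map" step per chunk
        count += len(chunk)

    print(total / count)               # "reduce" across chunks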
I do a lot of

    some_pandas_object.values

to get at the raw numpy, because dealing with a raw np buffer is often more efficient or ergonomic. Hopefully losing the numpy foundations will not affect (the efficiency of) code which does this.
My feeling is that the pandas community should be bold and consider also overhauling the API, not just the internals. Maybe keep the existing API for backward compatibility, but rethink what would be desirable for the next decade of pandas, so to speak. The idea would be to borrow liberally from what works in other ecosystems' APIs. E.g. R, while far from beautiful, can be more concise, etc.
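On the .values point: a minimal sketch (assuming pandas 2.x with pyarrow installed) of how this behaves once columns are Arrow-backed; .to_numpy() remains the explicit escape hatch, possibly at the cost of a copy:

    import pandas as pd

    s_np = pd.Series([1, 2, 3], dtype="int64")           # numpy-backed
    s_pa = pd.Series([1, 2, 3], dtype="int64[pyarrow]")  # Arrow-backed

    print(type(s_np.values))      # numpy.ndarray
    print(type(s_pa.values))      # ArrowExtensionArray, not an ndarray
    print(type(s_pa.to_numpy()))  # numpy.ndarray, converted (and possibly copied)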
Obviously this improves interoperability and the handling of nulls and strings. My naïve understanding is that polars columns are immutable because that makes multiprocessing faster/easier. I’m assuming pandas will not change their API to make columns immutable, so they won’t be targeting multiprocessing like polars does?
> As mentioned earlier, one of our top priorities is not breaking existing code or APIs
This is doomed, then. The pandas API is already extremely bloated.
I recently switched to Polars because Pandas is so absurdly weird. Polars is much, much better. I'll be interested in seeing how Pandas 2 is.
So where does this place Polars? My perhaps incorrect understanding was that this (Arrow integration) was a key differentiator of Polars vs Pandas.
Yes! Swapping out NumPy for Arrow underneath! Excited to have the performance of Arrow with the API of Pandas. Huge win for the data community!
The changes to handling strings and Python data types are welcome.
However, I am curious how Arrow beats NumPy on regular ints and floats.
For the last 10 years I've been under impression that int and float columns in Pandas are basically NumPy ndarrays with extra methods.
Then NumPy ndarrays are basically C arrays with well defined vector operations which are often trivially parallelizable.
So how does Arrow beat Numpy when calculating

    mean (int64)      2.03 ms    1.11 ms    1.8x
    mean (float64)    3.56 ms    1.73 ms    2.1x

What is the trick?
Now can mysql/pg provide binary Arrow protocols directly?
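Back to the mean() timings above: a hypothetical micro-benchmark to reproduce numbers in that ballpark (results depend on machine and pandas/pyarrow versions):

    import numpy as np
    import pandas as pd
    import timeit

    data = np.random.randn(10_000_000)

    s_numpy = pd.Series(data)                            # numpy-backed float64
    s_arrow = pd.Series(data, dtype="float64[pyarrow]")  # Arrow-backed float64

    print("numpy:", timeit.timeit(s_numpy.mean, number=20))
    print("arrow:", timeit.timeit(s_arrow.mean, number=20))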
If you were facing memory issues, then why not use numpy memmap? It's effing incredible: https://stackoverflow.com/a/72240526/5739514
pandas is just for 2D columnar stuff; it's sugar on numpy
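A minimal np.memmap sketch (file name and shape are made up; the array is backed by the file, so only the pages you actually touch are loaded into RAM):

    import numpy as np

    # write a large array to disk once
    arr = np.memmap("big.dat", dtype="float64", mode="w+", shape=(10_000_000,))
    arr[:] = 1.0
    arr.flush()

    # later: open it read-only and compute without loading everything eagerly
    ro = np.memmap("big.dat", dtype="float64", mode="r", shape=(10_000_000,))
    print(ro[::1_000_000].mean())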
Pandas is still way too slow; if only there were an integration with DataFusion or arrow2.