Pandas Is No Longer My Default

Or Is It?

Collin Mitchell


December 2, 2022

Quick Observation

I am starting to understand why there are so many posts about tools and so few with actual analysis. Posting about a new tool is quite alluring when the alternative is spending weeks working through a problem which may not work out. That is not to say these posts are of no value, but they’re certainly not results-driven.

Modin Is My New Default.

I wish I could give credit to wherever I stumbled across this, but I admittedly already lost the link. Soon after my previous post about being excited that Datatables have come to Python (and I will certainly do a follow-up post about using datatables), I fell over another new tool which is definitely replacing Pandas for me: Modin. So, what is this?

> The modin.pandas DataFrame is an extremely light-weight parallel DataFrame. Modin transparently distributes the data and computation so that all you need to do is continue using the pandas API as you were before installing Modin. Unlike other parallel DataFrame systems, Modin is an extremely light-weight, robust DataFrame. Because it is so light-weight, Modin provides speed-ups of up to 4x on a laptop with 4 physical cores.

It is a drop-in replacement which is built on top of Pandas. But it also has two engines underneath, Ray and Dask, which (if you installed them) silently let you scale your analysis to larger data. Let's rerun the test from the previous post, which timed how long it takes to import the data and plot the graph, but now in Modin!

import modin.pandas as md
import pandas as pd
from pathlib import Path
import numpy as np
import seaborn as sns

# If it uses ray:
# import ray
# ray.init()

You can change the engine it uses underneath by setting the MODIN_ENGINE environment variable - per the Docs.

import os

# Pick one engine; a second assignment would simply override the first.
os.environ["MODIN_ENGINE"] = "ray"  # Modin will use Ray
# os.environ["MODIN_ENGINE"] = "dask"  # Modin will use Dask

I Don’t Recommend It.

It might be jarring to see me excited and then not. I tried to get this all working, and the good news is that I did, but only kind of. When I ran it without choosing an engine, it defaulted to Ray, which redlined my CPU and locked up my system. I did eventually get a successful run, but the speed made the effort moot.

%%timeit -r2 -n3
# The big boi
path = Path('../../_drafts/_data/accepted_2007_to_2018Q4.csv')

data = md.read_csv(path)
_ = sns.displot(data, x='loan_amnt', kde=True);
1min 47s ± 1.34 s per loop (mean ± std. dev. of 2 runs, 3 loops each)
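
For anyone wanting to repeat this kind of measurement outside a notebook, here is a minimal sketch of roughly what `%%timeit -r2 -n3` does, using only the standard library. The `time_per_loop` helper and the stand-in workload are hypothetical names for illustration, and using the population standard deviation for the spread is an assumption about how the magic reports its "± std. dev." figure.

```python
import statistics
import timeit


def time_per_loop(func, repeats=2, loops=3):
    """Time func in `repeats` runs of `loops` calls each (like -r2 -n3).

    Returns (mean, spread) of the per-loop time in seconds across runs.
    """
    # timeit.repeat returns one total time per run, covering `loops` calls.
    totals = timeit.repeat(func, repeat=repeats, number=loops)
    per_loop = [total / loops for total in totals]
    # Assumption: population standard deviation across the runs.
    return statistics.mean(per_loop), statistics.pstdev(per_loop)


def stand_in_workload():
    # Placeholder for the real cell body (read_csv + displot).
    return sum(range(100_000))


mean, spread = time_per_loop(stand_in_workload)
print(f"{mean:.6f} s ± {spread:.6f} s per loop")
```

Swapping `stand_in_workload` for a function that wraps the read and plot calls should give numbers comparable in shape to the notebook output, though not necessarily identical.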

If we compare those numbers to the tests from the previous post, it is worse:

pandas:    1min 8s ± 64.7 ms per loop (mean ± std. dev. of 2 runs, 3 loops each)
datatable: 55 s ± 389 ms per loop (mean ± std. dev. of 2 runs, 3 loops each)
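
To put a number on "worse", here is a quick sketch that converts the per-loop means reported above to seconds and computes how much slower the Modin run was. It only restates the three measurements already shown; the variable names are mine.

```python
# Per-loop means from the timing output, converted to seconds.
timings = {
    "modin": 1 * 60 + 47,    # 1min 47s
    "pandas": 1 * 60 + 8,    # 1min 8s
    "datatable": 55,         # 55 s
}

slowdown_vs_pandas = timings["modin"] / timings["pandas"]
slowdown_vs_datatable = timings["modin"] / timings["datatable"]

print(f"modin vs pandas:    {slowdown_vs_pandas:.2f}x slower")     # 1.57x
print(f"modin vs datatable: {slowdown_vs_datatable:.2f}x slower")  # 1.95x
```

So on this machine and this file, the Modin run took roughly one and a half times as long as Pandas and nearly twice as long as datatable.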

So I get worse performance plus the risk of crashing my computer. Maybe there is a way to tune the CPU usage so it is not this painful, but under these conditions I cannot trust this on a large dataset.
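
One possible tuning knob, sketched here and untested against the dataset above: Modin reads a `MODIN_CPUS` environment variable alongside `MODIN_ENGINE`, so capping it below the machine's core count might leave enough headroom to keep the system responsive. The `cap_modin_cpus` helper, the choice to reserve two cores, and the choice of Dask are all assumptions for illustration, and whether this actually avoids the lockup is not something this post verifies.

```python
import os


def cap_modin_cpus(reserve=2, engine="dask"):
    """Set MODIN_ENGINE and MODIN_CPUS so that `reserve` cores stay free.

    Hypothetical helper; returns the CPU count handed to Modin.
    """
    total = os.cpu_count() or 1
    cpus = max(1, total - reserve)  # always leave Modin at least one core
    os.environ["MODIN_ENGINE"] = engine
    os.environ["MODIN_CPUS"] = str(cpus)
    return cpus


cpus = cap_modin_cpus(reserve=2)
print(f"Modin limited to {cpus} CPU(s)")

# Import Modin only after the variables are set, to be safe:
# import modin.pandas as md
```

Rerunning the timing cell after this would show whether the cap trades away the lockup at the cost of even slower reads, or whether the overhead lies elsewhere.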