Overview of Pandas vs Polars

Simple Comparison of Pandas and Polars

Tags: python, pandas, polars, data analyst, data science

Published November 14, 2023

Polars Vs Pandas: A Brief Overview of Common Functionality

There has been a growing amount of buzz around a new Python data frame library called Polars. Polars is a data frame library for two languages, Python and Rust: it is written primarily in Rust with Python bindings, which shows as you read through the documentation's examples, and which the project admits outright on the front page of its site. The library's biggest boast is how much faster and more memory-efficient it is. This is excellent for us Python programmers, since speed and memory usage are known weaknesses of Python - at least when pure Python is meant.

In today's post, I'm going to compare these two libraries against one another on common data analytics tasks. So, let's jump right in!

import numpy as np
import polars as pl # proper nomenclature for importing polars
import pandas as pd

Methods To Create Data Frames

The first and most common operation is simply making a data frame. Surprisingly often when working on projects, one has to convert external data into a frame for processing, whether that data comes from files on your own filesystem or from scraped web data being wrangled into a usable state. We'll generate some random data using the approach from Real Python's example post, which uses the newer random number generator interface built into the numpy library; thanks for showing this off, Real Python! Anyway, here is their example, but with a lot more rows of data:

rowSize = 5000000
rng = np.random.default_rng(seed=7)

randomdata = {
    "sqft": rng.exponential(scale=1000, size=rowSize),
    "year": rng.integers(low=1889, high=2023, size=rowSize),
    "building_type": rng.choice(["X", "Y", "Z"], size=rowSize),
}

Both libraries offer the same default interface for creating a data frame from a Python dictionary:

polarBuildings = pl.DataFrame(randomdata)
pandaBuildings = pd.DataFrame(randomdata)

There are some real differences, however, in the constructor definitions between Pandas and Polars. If you read those documentation pages carefully, you will likely notice that Polars refers to a schema. With Polars, you can spell out the data types you expect for every column instead of letting the library infer them. This is quite useful if you have pre-cleaned data or expect data structured in a specific way. Pandas's constructor does not include this ability, so you will have to convert and clean up the columns after the import takes place.
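For illustration, here is a minimal sketch of passing a schema, using the same columns as our random data above; the variable name and the choice of dtypes are just mine:

polarBuildingsTyped = pl.DataFrame(
    randomdata,
    schema={
        "sqft": pl.Float64,
        "year": pl.Int32,  # narrower than the Int64 Polars would infer
        "building_type": pl.Utf8,
    },
)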

If we take a look at these data frames, we can already see a marked difference in presentation:

polarBuildings.head()
shape: (5, 3)
sqft year building_type
f64 i64 str
707.529256 1975 "Z"
1025.203348 2011 "X"
568.548657 1970 "X"
895.109864 1999 "Y"
206.532754 2011 "Y"
pandaBuildings.head()
sqft year building_type
0 707.529256 1975 Z
1 1025.203348 2011 X
2 568.548657 1970 X
3 895.109864 1999 Y
4 206.532754 2011 Y

Polars includes the data type as a subheader, which is another nice default feature. You can quickly tell whether column type inference or a cast has failed. Also, there is no index printed for the Polars data frame as there is for the Pandas frame. Positional access still exists - as we'll see when we start doing subsetting - but it is simply not cluttering up the output. Honestly, outside of time series data I find the index an annoyance to manage, so this is another plus for me. Having to forcibly tell Pandas not to include the index when writing a data frame to CSV is just another way you can trip over yourself while trying to do work. This does, however, raise the question of how developed the Polars library is with respect to time series problems.
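For reference, that Pandas wrinkle is the index=False flag on .to_csv(); the filename here is just illustrative:

# Without index=False, Pandas writes the row index out as an unnamed extra column.
pandaBuildings.to_csv('buildings.csv', index=False)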

Reading in Data from Files

Both Polars and Pandas have the ability to read multiple formats across the data landscape: CSV, Arrow, JSON, etc. They both have matching function calls for this, as you would expect:

# Some Data I Know has problems:
url = 'https://media.githubusercontent.com/media/Shoklan/public-projects/main/Data/Kaggle/AllThrillerSeriesList.csv'

dfl = pl.read_csv(url)
dfd = pd.read_csv(url)
dfl.head(7)
shape: (7, 9)
Name Year Tv Certificate Duration per episode Genre Ratings Actor/Actress Votes
i64 str str str str str f64 str str
0 "Andor" "2022– " "TV-14 " " 40 min " " Action, Adven… 8.4 " Diego Luna, K… " 82,474"
1 "The Peripheral… "2022– " "TV-MA " null " Drama, Myster… 8.0 " Chloë Grace M… " 34,768"
2 "The Walking De… "2010–2022" "TV-MA " " 44 min " " Drama, Horror… 8.1 " Andrew Lincol… " 988,666"
3 "Criminal Minds… "2005– " "TV-14 " " 42 min " " Crime, Drama,… 8.1 " Kirsten Vangs… " 198,262"
4 "Breaking Bad" "2008–2013" "TV-MA " " 49 min " " Crime, Drama,… 9.5 " Bryan Cransto… " 1,872,005"
5 "Dark" "2017–2020" "TV-MA " " 60 min " " Crime, Drama,… 8.7 " Louis Hofmann… " 384,702"
6 "Manifest" "2018– " "TV-14 " " 43 min " " Drama, Myster… 7.1 " Melissa Roxbu… " 66,158"
dfd.head(7)
Unnamed: 0 Name Year Tv Certificate Duration per episode Genre Ratings Actor/Actress Votes
0 0 Andor 2022– TV-14 40 min Action, Adventure, Drama 8.4 Diego Luna, Kyle Soller, Stellan Skarsgård, G... 82,474
1 1 The Peripheral 2022– TV-MA NaN Drama, Mystery, Sci-Fi 8.0 Chloë Grace Moretz, Gary Carr, Jack Reynor, J... 34,768
2 2 The Walking Dead 2010–2022 TV-MA 44 min Drama, Horror, Thriller 8.1 Andrew Lincoln, Norman Reedus, Melissa McBrid... 988,666
3 3 Criminal Minds 2005– TV-14 42 min Crime, Drama, Mystery 8.1 Kirsten Vangsness, Matthew Gray Gubler, A.J. ... 198,262
4 4 Breaking Bad 2008–2013 TV-MA 49 min Crime, Drama, Thriller 9.5 Bryan Cranston, Aaron Paul, Anna Gunn, Betsy ... 1,872,005
5 5 Dark 2017–2020 TV-MA 60 min Crime, Drama, Mystery 8.7 Louis Hofmann, Karoline Eichhorn, Lisa Vicari... 384,702
6 6 Manifest 2018– TV-14 43 min Drama, Mystery, Sci-Fi 7.1 Melissa Roxburgh, Josh Dallas, J.R. Ramirez, ... 66,158

Reading Vs Scanning

Here is a topic where we need to briefly switch our focus from comparison to Polars alone: Polars has a lazy evaluation API for reading files. If you have not come across lazy evaluation before, all it means is that work is delayed until the moment we actually need a result. A deeper discussion of this topic is beyond the scope of this post, but by delaying the work the program can build a plan of action which can then be optimized; if you are interested in this topic, I recommend looking up discussions of Directed Acyclic Graphs. For now, this allows Polars to read only the columns which are necessary for your analysis, saving you time, compute, and memory on data tasks. Please note that while both .read_csv() functions work with URLs, the scan_* functions do not work with URLs. I tried, and I can definitely confirm it does not work!
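To make that concrete, here is a minimal sketch of the lazy workflow, assuming a local copy of the CSV from earlier (remember, scan_csv() will not take a URL):

# Build a query plan; nothing is read from disk yet.
lazyShows = pl.scan_csv('AllThrillerSeriesList.csv')

highlyRated = (
    lazyShows
    .select(['Name', 'Ratings'])       # only these two columns get read
    .filter(pl.col('Ratings') > 8.5)
    .collect()                         # execution happens here
)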

Data Frame Summaries

Once your data is imported, you will often want to do some quick checks to make sure the import worked and the results are sane. Both Polars and Pandas share the .describe() function, which gives you summary statistical information about the columns:

polarBuildings.describe()
shape: (9, 4)
describe sqft year building_type
str f64 f64 str
"count" 5e6 5e6 "5000000"
"null_count" 0.0 0.0 "0"
"mean" 999.647777 1955.507764 null
"std" 999.390909 38.687521 null
"min" 0.000078 1889.0 "X"
"25%" 287.605602 1922.0 null
"50%" 692.846786 1956.0 null
"75%" 1386.66725 1989.0 null
"max" 15701.555622 2022.0 "Z"
pandaBuildings.describe()
sqft year
count 5.000000e+06 5.000000e+06
mean 9.996478e+02 1.955508e+03
std 9.993909e+02 3.868752e+01
min 7.785435e-05 1.889000e+03
25% 2.876055e+02 1.922000e+03
50% 6.928466e+02 1.956000e+03
75% 1.386667e+03 1.989000e+03
max 1.570156e+04 2.022000e+03

While most of the information is the same, I would like to point out that a null_count row for missing values has been added to the Polars summary output. Since there is no .info() call in Polars, this is a proper place for that information.
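If the missing-value counts are all you are after, both libraries can produce just those; a quick sketch:

polarBuildings.null_count()   # one-row frame of per-column null counts
pandaBuildings.isna().sum()   # the Pandas equivalent, as a Series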

Methods to Slice a Data Frame

Moving past basic information, another common operation is taking slices out of data frames. Pandas has two accessors for this: .loc[] and .iloc[]. One is supposed to take labels while the other is supposed to take integer positions. Frankly, these are not as clearly separated as they should be. For example, you can use integer indexes with .iloc[] but also a boolean array, per the documentation. The problem is that both accessors accept this boolean array when only one or the other should. And, with the default index, slicing rows looks like plain numbers in both anyway.

# Here is using numbers:
pandaBuildings.iloc[1:10]
sqft year building_type
1 1025.203348 2011 X
2 568.548657 1970 X
3 895.109864 1999 Y
4 206.532754 2011 Y
5 3383.637351 2007 Z
6 9.753627 1948 Y
7 2809.215763 1968 Y
8 575.332756 1953 Z
9 300.534013 1928 X
# This is still using numbers:
pandaBuildings.loc[1:10,'year']
1     2011
2     1970
3     1999
4     2011
5     2007
6     1948
7     1968
8     1953
9     1928
10    1988
Name: year, dtype: int64
from itertools import repeat

# This is not really ok; a boolean mask passed to .iloc[]
pandaBuildings.iloc[list(repeat(False,rowSize - 10)) + list(repeat(True, 10))]
sqft year building_type
4999990 573.107861 1940 X
4999991 255.946544 1995 X
4999992 1193.402742 1966 Y
4999993 241.536200 1981 Y
4999994 116.882680 2012 Y
4999995 813.839454 1895 Y
4999996 832.370784 1976 Y
4999997 910.292342 1940 Z
4999998 219.078462 1946 X
4999999 1464.927639 1956 Y
# This also works and should actually be ok
pandaBuildings.loc[list(repeat(False,rowSize - 10)) + list(repeat(True, 10))]
sqft year building_type
4999990 573.107861 1940 X
4999991 255.946544 1995 X
4999992 1193.402742 1966 Y
4999993 241.536200 1981 Y
4999994 116.882680 2012 Y
4999995 813.839454 1895 Y
4999996 832.370784 1976 Y
4999997 910.292342 1940 Z
4999998 219.078462 1946 X
4999999 1464.927639 1956 Y

When working with Polars, all of this goes through a single indexing interface, which is surprisingly intelligent.

# just knows you want the rows
polarBuildings[1:10]
shape: (9, 3)
sqft year building_type
f64 i64 str
1025.203348 2011 "X"
568.548657 1970 "X"
895.109864 1999 "Y"
206.532754 2011 "Y"
3383.637351 2007 "Z"
9.753627 1948 "Y"
2809.215763 1968 "Y"
575.332756 1953 "Z"
300.534013 1928 "X"
# .... and, you can pass the labels as you would hope
polarBuildings[['year', 'sqft'], 1:10 ]
shape: (9, 2)
year sqft
i64 f64
2011 1025.203348
1970 568.548657
1999 895.109864
2011 206.532754
2007 3383.637351
1948 9.753627
1968 2809.215763
1953 575.332756
1928 300.534013
# ... and, we can switch the row, column order without problems!
polarBuildings[1:10,['year', 'sqft'] ]
shape: (9, 2)
year sqft
i64 f64
2011 1025.203348
1970 568.548657
1999 895.109864
2011 206.532754
2007 3383.637351
1948 9.753627
1968 2809.215763
1953 575.332756
1928 300.534013
# ... and preserves the selection order as well!
polarBuildings[1:10,['sqft', 'year'] ]
shape: (9, 2)
sqft year
f64 i64
1025.203348 2011
568.548657 1970
895.109864 1999
206.532754 2011
3383.637351 2007
9.753627 1948
2809.215763 1968
575.332756 1953
300.534013 1928
# .... and we can still specify with just numbers
polarBuildings[1:10, 0:2]
shape: (9, 2)
sqft year
f64 i64
1025.203348 2011
568.548657 1970
895.109864 1999
206.532754 2011
3383.637351 2007
9.753627 1948
2809.215763 1968
575.332756 1953
300.534013 1928

This is a really big win for Polars, as I no longer have to juggle between .loc[] and .iloc[].

Broadcasting Values

An often-used pattern is to create a new column, fill it with a default value, and then conditionally update the column based on other features in the data frame. Doing this in Pandas is very simple and efficient.

pandaBuildings['turtles'] = 'Best Animal'
pandaBuildings.head()
sqft year building_type turtles
0 707.529256 1975 Z Best Animal
1 1025.203348 2011 X Best Animal
2 568.548657 1970 X Best Animal
3 895.109864 1999 Y Best Animal
4 206.532754 2011 Y Best Animal

Doing this in Polars? Not so simple. Sadly, we cannot simply assign a default value to a column we wish existed.

# This fails
try:
    polarBuildings['turtles'] = 'Best Animal'
except Exception as e:
    print(e)
DataFrame object does not support `Series` assignment by index

Use `DataFrame.with_columns`.

You can still do this with Polars, but the process is less direct. Basically, you need to tell Polars that this is a literal value with pl.lit() and then name the column with an .alias() call.

polarBuildings.with_columns(pl.lit('Best Animal').alias('Turtles')).head()
shape: (5, 4)
sqft year building_type Turtles
f64 i64 str str
707.529256 1975 "Z" "Best Animal"
1025.203348 2011 "X" "Best Animal"
568.548657 1970 "X" "Best Animal"
895.109864 1999 "Y" "Best Animal"
206.532754 2011 "Y" "Best Animal"

Another problem with the Polars method is that it is not done in place. What I mean by that is that if you re-check the data frame, the column we created does not exist.

# This has not been changed.
polarBuildings.head()
shape: (5, 3)
sqft year building_type
f64 i64 str
707.529256 1975 "Z"
1025.203348 2011 "X"
568.548657 1970 "X"
895.109864 1999 "Y"
206.532754 2011 "Y"

If we want to keep the results, we need to assign them back to the same variable. This is somewhat frustrating, since it means every broadcast constructs a new data frame object.
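In sketch form, that reassignment looks like the following; I use a new name here so the frames in the later examples stay as they were:

# Persist the broadcast column by assigning the result back.
polarBuildingsTurtles = polarBuildings.with_columns(
    pl.lit('Best Animal').alias('Turtles')
)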

Filtering and Aggregation

Next on the list of common features is manipulating the data frame. The two most common forms of this are filtering for values and aggregations such as counts or summary statistics. Polars has its own expression language for this: see the documentation page for more information. The language reads like a simplified SQL applied to data frames, something Pandas already has partial support for. Using our toy data, if we wanted every year after 2017, there are a few ways to do this in Pandas. Personally, I'm quite a fan of the .query() function, so I'll use that; approach this however you like.

pandaBuildings.query('year > 2017').head(7)
sqft year building_type turtles
14 1884.250052 2018 Z Best Animal
36 250.345347 2018 Z Best Animal
46 1148.245259 2022 Y Best Animal
58 763.546235 2019 Y Best Animal
64 26.353141 2021 X Best Animal
65 1085.041269 2021 Z Best Animal
89 469.285667 2022 Y Best Animal

In Polars, we would use the function closest to the functionality we are after: whether .select() or .filter() - or maybe .group_by():

# Same thing but with polars
polarBuildings.filter(pl.col('year') > 2017).head(7)
shape: (7, 3)
sqft year building_type
f64 i64 str
1884.250052 2018 "Z"
250.345347 2018 "Z"
1148.245259 2022 "Y"
763.546235 2019 "Y"
26.353141 2021 "X"
1085.041269 2021 "Z"
469.285667 2022 "Y"

There is a little more typing when using Polars here, but this is more or less the same. One boast from the documentation is that Polars runs aggregations like the following in parallel:

polarBuildings.group_by('building_type').agg([
    pl.mean('sqft'),
    pl.median('year'),
    pl.count()
])
shape: (3, 4)
building_type sqft year count
str f64 f64 u32
"Z" 999.42085 1956.0 1665501
"X" 999.225287 1956.0 1668418
"Y" 1000.297707 1955.0 1666081

Without profiling this code - which I am considering for a future post - we cannot really know how efficient this is; a quick-and-dirty timing sketch follows the Pandas version below. But the readability here is solid compared to the Pandas version asking for the same information:

pandaBuildings.groupby('building_type').agg({
    'sqft':['mean'],
    'year':['median', 'size']})\
    .rename(columns={'size':'count'})
sqft year
mean median count
building_type
X 999.225287 1956.0 1668418
Y 1000.297707 1955.0 1666081
Z 999.420850 1956.0 1665501
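For the curious, here is that rough timing sketch using time.perf_counter. This is not a rigorous benchmark, just a way to get a first impression on your own machine:

import time

start = time.perf_counter()
polarBuildings.group_by('building_type').agg([
    pl.mean('sqft'),
    pl.median('year'),
    pl.count()
])
print(f"Polars: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
pandaBuildings.groupby('building_type').agg({
    'sqft': ['mean'],
    'year': ['median', 'size']})
print(f"Pandas: {time.perf_counter() - start:.3f}s")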

Quick Analysis of Movie Data Using Polars

Let's try using Polars to replicate an example project I've worked on before. Previously, I did some analysis of a Kaggle dataset on thriller movies with Apache Spark; that post is here if you are curious.

# Download the cleaned data from online
dataUrl = 'https://media.githubusercontent.com/media/Shoklan/public-projects/main/Data/Kaggle/AllThrillerSeriesListClean.csv'
movies = pl.read_csv(dataUrl)
movies.head()
shape: (5, 9)
Name Year Tv Certificate Duration per episode Genre Ratings Actor/Actress Votes
i64 str str str str str f64 str str
0 "Andor" "2022– " "TV-14 " " 40 min " " Action, Adven… 8.4 " Diego Luna, K… " 82,474"
1 "The Peripheral… "2022– " "TV-MA " null " Drama, Myster… 8.0 " Chloë Grace M… " 34,768"
2 "The Walking De… "2010–2022" "TV-MA " " 44 min " " Drama, Horror… 8.1 " Andrew Lincol… " 988,666"
3 "Criminal Minds… "2005– " "TV-14 " " 42 min " " Crime, Drama,… 8.1 " Kirsten Vangs… " 198,262"
4 "Breaking Bad" "2008–2013" "TV-MA " " 49 min " " Crime, Drama,… 9.5 " Bryan Cransto… " 1,872,005"
# Get rid of the superfluous index column
movies.drop_in_place(name='')
movies.head()
shape: (5, 8)
Name Year Tv Certificate Duration per episode Genre Ratings Actor/Actress Votes
str str str str str f64 str str
"Andor" "2022– " "TV-14 " " 40 min " " Action, Adven… 8.4 " Diego Luna, K… " 82,474"
"The Peripheral… "2022– " "TV-MA " null " Drama, Myster… 8.0 " Chloë Grace M… " 34,768"
"The Walking De… "2010–2022" "TV-MA " " 44 min " " Drama, Horror… 8.1 " Andrew Lincol… " 988,666"
"Criminal Minds… "2005– " "TV-14 " " 42 min " " Crime, Drama,… 8.1 " Kirsten Vangs… " 198,262"
"Breaking Bad" "2008–2013" "TV-MA " " 49 min " " Crime, Drama,… 9.5 " Bryan Cransto… " 1,872,005"
# clean up and cast year,votes for analysis
movies = movies.with_columns(
    pl.col('Year').map_elements(lambda x: x.split('–')[0]),
    pl.col('Votes').map_elements(lambda x: x.replace(',', '').strip()).cast(pl.Float64),
)
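As an aside, those lambdas drop back into Python for every element. A hedged sketch of the same cleanup using Polars's native string expressions, which stay in Rust, as an alternative to the block above:

# Alternative to the map_elements version above, using native expressions.
movies = movies.with_columns(
    pl.col('Year').str.split('–').list.first(),
    pl.col('Votes').str.strip_chars().str.replace_all(',', '').cast(pl.Float64),
)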

movies = movies.rename({'Actor/Actress':'Actor'})
movies.head()
shape: (5, 8)
Name Year Tv Certificate Duration per episode Genre Ratings Actor Votes
str str str str str f64 str f64
"Andor" "2022" "TV-14 " " 40 min " " Action, Adven… 8.4 " Diego Luna, K… 82474.0
"The Peripheral… "2022" "TV-MA " null " Drama, Myster… 8.0 " Chloë Grace M… 34768.0
"The Walking De… "2010" "TV-MA " " 44 min " " Drama, Horror… 8.1 " Andrew Lincol… 988666.0
"Criminal Minds… "2005" "TV-14 " " 42 min " " Crime, Drama,… 8.1 " Kirsten Vangs… 198262.0
"Breaking Bad" "2008" "TV-MA " " 49 min " " Crime, Drama,… 9.5 " Bryan Cransto… 1.872005e6
cols = ['Name', 'Genre', 'Ratings', 'Votes']
medianMovieInfo = movies[cols].with_columns(
    pl.col('Genre').str.strip_chars().str.split(', '))\
    .explode('Genre').group_by("Genre").agg(
        pl.col('Ratings').median(),
        pl.col('Votes').median())\
    .sort('Ratings', descending=True)
meanMovieInfo = movies[cols].with_columns(
    pl.col('Genre').str.strip_chars().str.split(', ')
).explode('Genre').group_by("Genre").agg(
    pl.col('Ratings').mean(),
    pl.col('Votes').mean()
).sort('Ratings', descending=True)
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (14,7);

# Wish we didn't have to convert this
dmi = medianMovieInfo.to_pandas().sort_values('Votes', ascending=False)
plt.xticks(rotation=41);
sns.barplot(dmi, x='Genre', y='Votes', color='seagreen').set_title('Median Votes Per Movie Genre');

nmi = meanMovieInfo.to_pandas().sort_values('Votes', ascending=False)
plt.xticks(rotation=41);
sns.barplot(nmi, x='Genre', y='Votes', color='seagreen').set_title('Mean Votes Per Movie Genre');
[Figures: 'Median Votes Per Movie Genre' and 'Mean Votes Per Movie Genre' bar plots]

Conclusions

Hopefully this was as useful to you as it was for me. I like Polars, and I think I'll be using it when I know the data I'm working with is cleaned and ready for analysis. Otherwise, I think cleaning the data with Pandas is still likely the best first choice.