import numpy as np
import polars as pl # proper nomenclature for importing polars
import pandas as pd
Polars Vs Pandas: A Brief Overview of Common Functionality
There has been a growing amount of buzz around a new Python data frame library called Polars. Polars is a data frame library for two languages, Python and Rust: it is written primarily in Rust with Python bindings, which you can tell as you read through the documentation’s examples, and which the project admits outright on the front page of its site. The library’s biggest boast is how fast it is and how performant it is with respect to memory usage. This is excellent for us Python programmers, since these are known weaknesses of Python; at least where pure Python is concerned.
In today’s post, I’m going to compare these two libraries against one another for common tasks in data analytics. So, let’s jump right in!
Methods To Create Data Frames
The first and most common operation is simply making a data frame. Surprisingly often when working on projects, one has to convert external data into a frame for processing, whether that data comes from files on your own filesystem or from scraped web data being wrangled into a usable state. We’ll generate some random data using a novel approach to data generation from Real Python’s example post, which uses the improved random number generator built into the numpy library; thanks for showing this off, Real Python! Anyway, here is their example, but with a lot more rows of data:
rowSize = 5000000
rng = np.random.default_rng(seed=7)

randomdata = {
    "sqft": rng.exponential(scale=1000, size=rowSize),
    "year": rng.integers(low=1889, high=2023, size=rowSize),
    "building_type": rng.choice(["X", "Y", "Z"], size=rowSize),
}
Both libraries have a default interface to create a data frame from a Python dictionary:
polarBuildings = pl.DataFrame(randomdata)
pandaBuildings = pd.DataFrame(randomdata)
There are, however, some real differences in the function definitions between pandas and Polars. If you read those documentation pages carefully, you will likely notice that Polars refers to a schema. With Polars, you can declare what data types you expect all the columns to be instead of letting the library infer them. This is quite useful if you have pre-cleaned data or expect data structured in a specific way. Pandas does not include a per-column version of this ability, so you will have to convert and clean up the columns after the import takes place.
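As a minimal sketch of what that looks like (the dtype choices here are illustrative, not anything the library requires):

# Declare the expected column types up front instead of letting Polars infer them.
schema = {"sqft": pl.Float64, "year": pl.Int32, "building_type": pl.Utf8}
typedBuildings = pl.DataFrame(randomdata, schema=schema)

If a value cannot be represented in the declared type, you should find out at construction time rather than halfway through an analysis.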
If we take a look at these data frames, we can already see a marked difference in presentation:
polarBuildings.head()
sqft | year | building_type |
---|---|---|
f64 | i64 | str |
707.529256 | 1975 | "Z" |
1025.203348 | 2011 | "X" |
568.548657 | 1970 | "X" |
895.109864 | 1999 | "Y" |
206.532754 | 2011 | "Y" |
pandaBuildings.head()
sqft | year | building_type | |
---|---|---|---|
0 | 707.529256 | 1975 | Z |
1 | 1025.203348 | 2011 | X |
2 | 568.548657 | 1970 | X |
3 | 895.109864 | 1999 | Y |
4 | 206.532754 | 2011 | Y |
Polars includes the data type as a subheader, which is another nice default feature: you can quickly tell whether the inferred column types or your casts failed. Also, there is no index printed for the Polars data frame as there is for the Pandas frame. This index does exist - as we’ll see when we start doing subsetting - but it is simply not cluttering up the output. Honestly, outside of time series data I find the index an annoyance to manage, so this is another plus to me. Having to forcibly tell Pandas not to include the index when writing a data frame out to CSV is just another way you can trip over yourself while trying to do work. This does, however, raise the question of how developed the Polars library is with respect to time series problems.
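If you have spent any time with Pandas, you have likely typed this line more than once (the filename here is just for illustration):

# Without index=False, the index tags along as an unnamed extra column.
pandaBuildings.to_csv('buildings.csv', index=False)

Polars has no index, so there is nothing to remember to suppress.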
Reading in Data from Files
Both Polars and Pandas have the ability to read multiple formats from across the data landscape: CSV, Arrow, JSON, etc. They both have matching function calls for this, as you would expect:
# Some Data I Know has problems:
url = 'https://media.githubusercontent.com/media/Shoklan/public-projects/main/Data/Kaggle/AllThrillerSeriesList.csv'

dfl = pl.read_csv(url)
dfd = pd.read_csv(url)
dfl.head(7)
Name | Year | Tv Certificate | Duration per episode | Genre | Ratings | Actor/Actress | Votes | |
---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | str | f64 | str | str |
0 | "Andor" | "2022– " | "TV-14 " | " 40 min " | " Action, Adven… | 8.4 | " Diego Luna, K… | " 82,474" |
1 | "The Peripheral… | "2022– " | "TV-MA " | null | " Drama, Myster… | 8.0 | " Chloë Grace M… | " 34,768" |
2 | "The Walking De… | "2010–2022" | "TV-MA " | " 44 min " | " Drama, Horror… | 8.1 | " Andrew Lincol… | " 988,666" |
3 | "Criminal Minds… | "2005– " | "TV-14 " | " 42 min " | " Crime, Drama,… | 8.1 | " Kirsten Vangs… | " 198,262" |
4 | "Breaking Bad" | "2008–2013" | "TV-MA " | " 49 min " | " Crime, Drama,… | 9.5 | " Bryan Cransto… | " 1,872,005" |
5 | "Dark" | "2017–2020" | "TV-MA " | " 60 min " | " Crime, Drama,… | 8.7 | " Louis Hofmann… | " 384,702" |
6 | "Manifest" | "2018– " | "TV-14 " | " 43 min " | " Drama, Myster… | 7.1 | " Melissa Roxbu… | " 66,158" |
dfd.head(7)
Unnamed: 0 | Name | Year | Tv Certificate | Duration per episode | Genre | Ratings | Actor/Actress | Votes | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | Andor | 2022– | TV-14 | 40 min | Action, Adventure, Drama | 8.4 | Diego Luna, Kyle Soller, Stellan Skarsgård, G... | 82,474 |
1 | 1 | The Peripheral | 2022– | TV-MA | NaN | Drama, Mystery, Sci-Fi | 8.0 | Chloë Grace Moretz, Gary Carr, Jack Reynor, J... | 34,768 |
2 | 2 | The Walking Dead | 2010–2022 | TV-MA | 44 min | Drama, Horror, Thriller | 8.1 | Andrew Lincoln, Norman Reedus, Melissa McBrid... | 988,666 |
3 | 3 | Criminal Minds | 2005– | TV-14 | 42 min | Crime, Drama, Mystery | 8.1 | Kirsten Vangsness, Matthew Gray Gubler, A.J. ... | 198,262 |
4 | 4 | Breaking Bad | 2008–2013 | TV-MA | 49 min | Crime, Drama, Thriller | 9.5 | Bryan Cranston, Aaron Paul, Anna Gunn, Betsy ... | 1,872,005 |
5 | 5 | Dark | 2017–2020 | TV-MA | 60 min | Crime, Drama, Mystery | 8.7 | Louis Hofmann, Karoline Eichhorn, Lisa Vicari... | 384,702 |
6 | 6 | Manifest | 2018– | TV-14 | 43 min | Drama, Mystery, Sci-Fi | 7.1 | Melissa Roxburgh, Josh Dallas, J.R. Ramirez, ... | 66,158 |
Reading Vs Scanning
Here is a topic where we will need to briefly switch our focus from comparison to Polars alone: Polars has a lazy evaluation API for reading files. If you have not come across lazy evaluation before, all it means is that work is delayed until such time as we need a result. A deeper discussion of this topic is beyond the scope of this post, but by delaying the work the program can create a plan of action which can be optimized; if you are interested in this topic, I recommend looking up discussions of Directed Acyclic Graphs. For now, this allows Polars to read only the columns which are necessary for your analysis, saving you time, compute, and memory on data tasks. Please note that while the .read_csv() functions both work with URLs, the scan_* functions do not. I tried and can definitely confirm it does not work!
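Here is a minimal sketch of the lazy API, assuming the CSV above has been downloaded to a local file first:

# Lazily scan the file; nothing is read yet.
lazyShows = pl.scan_csv('AllThrillerSeriesList.csv')

# Build a query plan; Polars now knows only two columns are needed.
plan = lazyShows.select(['Name', 'Ratings']).filter(pl.col('Ratings') > 8.0)

# The read happens - for only those columns - when we collect the result.
result = plan.collect()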
Data Frame Summaries
Once your data is imported, you will often want to do some quick checks to make sure the import worked and the results are sane. Both Polars and Pandas share the .describe() function, which gives you summary statistical information about the columns:
polarBuildings.describe()
describe | sqft | year | building_type |
---|---|---|---|
str | f64 | f64 | str |
"count" | 5e6 | 5e6 | "5000000" |
"null_count" | 0.0 | 0.0 | "0" |
"mean" | 999.647777 | 1955.507764 | null |
"std" | 999.390909 | 38.687521 | null |
"min" | 0.000078 | 1889.0 | "X" |
"25%" | 287.605602 | 1922.0 | null |
"50%" | 692.846786 | 1956.0 | null |
"75%" | 1386.66725 | 1989.0 | null |
"max" | 15701.555622 | 2022.0 | "Z" |
pandaBuildings.describe()
sqft | year | |
---|---|---|
count | 5.000000e+06 | 5.000000e+06 |
mean | 9.996478e+02 | 1.955508e+03 |
std | 9.993909e+02 | 3.868752e+01 |
min | 7.785435e-05 | 1.889000e+03 |
25% | 2.876055e+02 | 1.922000e+03 |
50% | 6.928466e+02 | 1.956000e+03 |
75% | 1.386667e+03 | 1.989000e+03 |
max | 1.570156e+04 | 2.022000e+03 |
While most of the information is the same, I would like to point out that missing value counts have been added to this summary output. Since there is no .info() call in Polars, this is a proper place for that information.
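And if the null counts are all you are after, Polars also exposes them directly; a quick sketch:

# Per-column null counts as a one-row frame.
polarBuildings.null_count()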
Methods to Slice a Data Frame
Moving past basic information, another common operation is taking slices out of data frames. Pandas has two accessors for this: .loc[] and .iloc[]. The former is supposed to use labels while the latter is supposed to use numeric positions. Frankly, they are not as clearly separated as they should be. For example, you can use numeric indexes with .iloc[], but also, per the documentation, a boolean array. The problem here is that both accessors allow you to use this boolean array when only one or the other should. And all forms of slicing the rows come down to numbers anyway.
# Here is using numbers:
pandaBuildings.iloc[1:10]
sqft | year | building_type | |
---|---|---|---|
1 | 1025.203348 | 2011 | X |
2 | 568.548657 | 1970 | X |
3 | 895.109864 | 1999 | Y |
4 | 206.532754 | 2011 | Y |
5 | 3383.637351 | 2007 | Z |
6 | 9.753627 | 1948 | Y |
7 | 2809.215763 | 1968 | Y |
8 | 575.332756 | 1953 | Z |
9 | 300.534013 | 1928 | X |
# This is still using numbers:
pandaBuildings.loc[1:10, 'year']
1 2011
2 1970
3 1999
4 2011
5 2007
6 1948
7 1968
8 1953
9 1928
10 1988
Name: year, dtype: int64
from itertools import repeat

# This is not really ok; boolean array
pandaBuildings.iloc[list(repeat(False, rowSize - 10)) + list(repeat(True, 10))]
sqft | year | building_type | |
---|---|---|---|
4999990 | 573.107861 | 1940 | X |
4999991 | 255.946544 | 1995 | X |
4999992 | 1193.402742 | 1966 | Y |
4999993 | 241.536200 | 1981 | Y |
4999994 | 116.882680 | 2012 | Y |
4999995 | 813.839454 | 1895 | Y |
4999996 | 832.370784 | 1976 | Y |
4999997 | 910.292342 | 1940 | Z |
4999998 | 219.078462 | 1946 | X |
4999999 | 1464.927639 | 1956 | Y |
# This also works and should actually be ok
pandaBuildings.loc[list(repeat(False, rowSize - 10)) + list(repeat(True, 10))]
sqft | year | building_type | |
---|---|---|---|
4999990 | 573.107861 | 1940 | X |
4999991 | 255.946544 | 1995 | X |
4999992 | 1193.402742 | 1966 | Y |
4999993 | 241.536200 | 1981 | Y |
4999994 | 116.882680 | 2012 | Y |
4999995 | 813.839454 | 1895 | Y |
4999996 | 832.370784 | 1976 | Y |
4999997 | 910.292342 | 1940 | Z |
4999998 | 219.078462 | 1946 | X |
4999999 | 1464.927639 | 1956 | Y |
When working with Polars, all of this goes through a single indexing interface, which is surprisingly intelligent.
# just knows you want the rows
polarBuildings[1:10]
sqft | year | building_type |
---|---|---|
f64 | i64 | str |
1025.203348 | 2011 | "X" |
568.548657 | 1970 | "X" |
895.109864 | 1999 | "Y" |
206.532754 | 2011 | "Y" |
3383.637351 | 2007 | "Z" |
9.753627 | 1948 | "Y" |
2809.215763 | 1968 | "Y" |
575.332756 | 1953 | "Z" |
300.534013 | 1928 | "X" |
# .... and, you can pass the labels as you would hope
polarBuildings[['year', 'sqft'], 1:10]
year | sqft |
---|---|
i64 | f64 |
2011 | 1025.203348 |
1970 | 568.548657 |
1999 | 895.109864 |
2011 | 206.532754 |
2007 | 3383.637351 |
1948 | 9.753627 |
1968 | 2809.215763 |
1953 | 575.332756 |
1928 | 300.534013 |
# ... and, we can switch the row, column order without problems!
polarBuildings[1:10, ['year', 'sqft']]
year | sqft |
---|---|
i64 | f64 |
2011 | 1025.203348 |
1970 | 568.548657 |
1999 | 895.109864 |
2011 | 206.532754 |
2007 | 3383.637351 |
1948 | 9.753627 |
1968 | 2809.215763 |
1953 | 575.332756 |
1928 | 300.534013 |
# ... and preserves the selection order as well!
polarBuildings[1:10, ['sqft', 'year']]
sqft | year |
---|---|
f64 | i64 |
1025.203348 | 2011 |
568.548657 | 1970 |
895.109864 | 1999 |
206.532754 | 2011 |
3383.637351 | 2007 |
9.753627 | 1948 |
2809.215763 | 1968 |
575.332756 | 1953 |
300.534013 | 1928 |
# .... and we can still specify with just numbers
polarBuildings[1:10, 0:2]
sqft | year |
---|---|
f64 | i64 |
1025.203348 | 2011 |
568.548657 | 1970 |
895.109864 | 1999 |
206.532754 | 2011 |
3383.637351 | 2007 |
9.753627 | 1948 |
2809.215763 | 1968 |
575.332756 | 1953 |
300.534013 | 1928 |
This is a really big win for Polars, as I no longer have to juggle between .loc[] and .iloc[].
Broadcasting Values
An often-used pattern is to create a new column, fill it with a default value, and then conditionally update the column based on other features in the data frame. Doing this in Pandas is very simple and efficient.
pandaBuildings['turtles'] = 'Best Animal'
pandaBuildings.head()
sqft | year | building_type | turtles | |
---|---|---|---|---|
0 | 707.529256 | 1975 | Z | Best Animal |
1 | 1025.203348 | 2011 | X | Best Animal |
2 | 568.548657 | 1970 | X | Best Animal |
3 | 895.109864 | 1999 | Y | Best Animal |
4 | 206.532754 | 2011 | Y | Best Animal |
Doing this in Polars? Not so simple. Sadly, we cannot just assign a default value to a column we wish existed.
# This fails
try:
    polarBuildings['turtles'] = 'Best Animal'
except Exception as e:
    print(e)
DataFrame object does not support `Series` assignment by index
Use `DataFrame.with_columns`.
You can still do this with Polars, but the process is strange. Basically, you need to tell Polars that this is a literal value via pl.lit() and then name the column with an .alias() function call.
polarBuildings.with_columns(pl.lit('Best Animal').alias('Turtles')).head()
sqft | year | building_type | Turtles |
---|---|---|---|
f64 | i64 | str | str |
707.529256 | 1975 | "Z" | "Best Animal" |
1025.203348 | 2011 | "X" | "Best Animal" |
568.548657 | 1970 | "X" | "Best Animal" |
895.109864 | 1999 | "Y" | "Best Animal" |
206.532754 | 2011 | "Y" | "Best Animal" |
Another problem with the Polars method is that it is not done in place: if you re-check the data frame, the column we created does not exist.
# This has not been changed.
polarBuildings.head()
sqft | year | building_type |
---|---|---|
f64 | i64 | str |
707.529256 | 1975 | "Z" |
1025.203348 | 2011 | "X" |
568.548657 | 1970 | "X" |
895.109864 | 1999 | "Y" |
206.532754 | 2011 | "Y" |
If we want to keep the results then we will need to assign them back to the same variable. This is somewhat frustrating, since it means that every time you want to do a broadcast you are making a new object.
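For the record, keeping the column would look like the line below; I will leave it unrun so the frame stays as-is for the next section:

# .with_columns() returns a new frame, so assign it back to keep the column.
polarBuildings = polarBuildings.with_columns(pl.lit('Best Animal').alias('Turtles'))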
Filtering and Aggregation
Next on the list of common features is manipulating the data frame. The two most common forms of this are filtering for values and aggregations such as counting or summary statistics. Polars has its own expression language for this: see the documentation page for more information. The language reads like a simplified SQL applied to data frames, something Pandas already supports in part. Using our toy data, if we wanted every year after 2017, there are a few ways to do it in pandas. Personally, I’m quite a fan of the .query() function, so I’ll use that; approach this however you like.
pandaBuildings.query('year > 2017').head(7)
sqft | year | building_type | turtles | |
---|---|---|---|---|
14 | 1884.250052 | 2018 | Z | Best Animal |
36 | 250.345347 | 2018 | Z | Best Animal |
46 | 1148.245259 | 2022 | Y | Best Animal |
58 | 763.546235 | 2019 | Y | Best Animal |
64 | 26.353141 | 2021 | X | Best Animal |
65 | 1085.041269 | 2021 | Z | Best Animal |
89 | 469.285667 | 2022 | Y | Best Animal |
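For reference, here is the plain boolean-mask version of the same filter, if .query() is not to your taste:

# Equivalent to the .query() call above.
pandaBuildings[pandaBuildings['year'] > 2017].head(7)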
In Polars, we would use whichever function is closest to the functionality we are after: .select(), .filter(), or maybe .group_by():
# Same thing but with polars
polarBuildings.filter(pl.col('year') > 2017).head(7)
sqft | year | building_type |
---|---|---|
f64 | i64 | str |
1884.250052 | 2018 | "Z" |
250.345347 | 2018 | "Z" |
1148.245259 | 2022 | "Y" |
763.546235 | 2019 | "Y" |
26.353141 | 2021 | "X" |
1085.041269 | 2021 | "Z" |
469.285667 | 2022 | "Y" |
There is a little more typing when using Polars here, but this is more or less the same. One boast from the documentation is that Polars runs computations like the following in parallel:
polarBuildings.group_by('building_type').agg([
    pl.mean('sqft'),
    pl.median('year'),
    pl.count()
])
building_type | sqft | year | count |
---|---|---|---|
str | f64 | f64 | u32 |
"Z" | 999.42085 | 1956.0 | 1665501 |
"X" | 999.225287 | 1956.0 | 1668418 |
"Y" | 1000.297707 | 1955.0 | 1666081 |
Without profiling this code - which I am considering for a future post - we cannot really know how efficient this is. But the readability here is solid compared to the Pandas version asking for the same information:
pandaBuildings.groupby('building_type').agg({
    'sqft': ['mean'],
    'year': ['median', 'size']})\
    .rename(columns={'size': 'count'})
sqft | year | ||
---|---|---|---|
mean | median | count | |
building_type | |||
X | 999.225287 | 1956.0 | 1668418 |
Y | 1000.297707 | 1955.0 | 1666081 |
Z | 999.420850 | 1956.0 | 1665501 |
Quick Analysis of Movie Data Using Polars
Let’s try using Polars to replicate an example project which I’ve worked on before. Previously, I did some analysis of a Kaggle dataset on thriller movies with Apache Spark; that post is here if you are curious.
# Download the cleaned data from online
dataUrl = 'https://media.githubusercontent.com/media/Shoklan/public-projects/main/Data/Kaggle/AllThrillerSeriesListClean.csv'
movies = pl.read_csv(dataUrl)
movies.head()
Name | Year | Tv Certificate | Duration per episode | Genre | Ratings | Actor/Actress | Votes | |
---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | str | f64 | str | str |
0 | "Andor" | "2022– " | "TV-14 " | " 40 min " | " Action, Adven… | 8.4 | " Diego Luna, K… | " 82,474" |
1 | "The Peripheral… | "2022– " | "TV-MA " | null | " Drama, Myster… | 8.0 | " Chloë Grace M… | " 34,768" |
2 | "The Walking De… | "2010–2022" | "TV-MA " | " 44 min " | " Drama, Horror… | 8.1 | " Andrew Lincol… | " 988,666" |
3 | "Criminal Minds… | "2005– " | "TV-14 " | " 42 min " | " Crime, Drama,… | 8.1 | " Kirsten Vangs… | " 198,262" |
4 | "Breaking Bad" | "2008–2013" | "TV-MA " | " 49 min " | " Crime, Drama,… | 9.5 | " Bryan Cransto… | " 1,872,005" |
# Get rid of the superfluous index column
movies.drop_in_place(name='')
movies.head()
Name | Year | Tv Certificate | Duration per episode | Genre | Ratings | Actor/Actress | Votes |
---|---|---|---|---|---|---|---|
str | str | str | str | str | f64 | str | str |
"Andor" | "2022– " | "TV-14 " | " 40 min " | " Action, Adven… | 8.4 | " Diego Luna, K… | " 82,474" |
"The Peripheral… | "2022– " | "TV-MA " | null | " Drama, Myster… | 8.0 | " Chloë Grace M… | " 34,768" |
"The Walking De… | "2010–2022" | "TV-MA " | " 44 min " | " Drama, Horror… | 8.1 | " Andrew Lincol… | " 988,666" |
"Criminal Minds… | "2005– " | "TV-14 " | " 42 min " | " Crime, Drama,… | 8.1 | " Kirsten Vangs… | " 198,262" |
"Breaking Bad" | "2008–2013" | "TV-MA " | " 49 min " | " Crime, Drama,… | 9.5 | " Bryan Cransto… | " 1,872,005" |
# Clean up and cast Year, Votes for analysis
movies = movies.with_columns(
    pl.col('Year').map_elements(lambda x: x.split('–')[0]),
    pl.col('Votes').map_elements(lambda x: x.replace(',', '').strip()).cast(pl.Float64),
)

movies = movies.rename({'Actor/Actress':'Actor'})
movies.head()
Name | Year | Tv Certificate | Duration per episode | Genre | Ratings | Actor | Votes |
---|---|---|---|---|---|---|---|
str | str | str | str | str | f64 | str | f64 |
"Andor" | "2022" | "TV-14 " | " 40 min " | " Action, Adven… | 8.4 | " Diego Luna, K… | 82474.0 |
"The Peripheral… | "2022" | "TV-MA " | null | " Drama, Myster… | 8.0 | " Chloë Grace M… | 34768.0 |
"The Walking De… | "2010" | "TV-MA " | " 44 min " | " Drama, Horror… | 8.1 | " Andrew Lincol… | 988666.0 |
"Criminal Minds… | "2005" | "TV-14 " | " 42 min " | " Crime, Drama,… | 8.1 | " Kirsten Vangs… | 198262.0 |
"Breaking Bad" | "2008" | "TV-MA " | " 49 min " | " Crime, Drama,… | 9.5 | " Bryan Cransto… | 1.872005e6 |
cols = ['Name', 'Genre', 'Ratings', 'Votes']
medianMovieInfo = movies[cols].with_columns(
    pl.col('Genre').str.strip_chars().str.split(', '))\
    .explode('Genre').group_by("Genre").agg(
        pl.col('Ratings').median(),
        pl.col('Votes').median())\
    .sort('Ratings', descending=True)
meanMovieInfo = movies[cols].with_columns(
    pl.col('Genre').str.strip_chars().str.split(', ')
).explode('Genre').group_by("Genre").agg(
    pl.col('Ratings').mean(),
    pl.col('Votes').mean()
).sort('Ratings', descending=True)
import seaborn as sns
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (14,7);

# Wish we didn't have to convert this
dmi = medianMovieInfo.to_pandas().sort_values('Votes', ascending=False)

plt.xticks(rotation=41);
sns.barplot(dmi, x='Genre', y='Votes', color='seagreen').set_title('Median Votes Per Movie Genre');
nmi = meanMovieInfo.to_pandas().sort_values('Votes', ascending=False)

plt.xticks(rotation=41);
sns.barplot(nmi, x='Genre', y='Votes', color='seagreen').set_title('Mean Votes Per Movie Genre');
Conclusions
Hopefully this was as useful to you as it was to me. I like Polars, and I think I’ll be using it when I know the data I am working with is cleaned and ready for analysis. Otherwise, I think cleaning the data with Pandas is still the best first choice.