import datatable as dt
import pandas as pd
import seaborn as sns
from pathlib import Path
Another Useful Tool From R
Browsing Medium can still be quite useful; there are some gems in there. I came across a post about getting much faster read times from CSV files, and the results looked really good. As I was reading it, I noticed that the command to read the files was .fread(), which looked exactly like the data.table library from R. And that's exactly what it is:
> Thanks for sharing the story on datatable Parul Pandey. The team H2O.ai is working tirelessly to add missing pandas.Frame functionalities to datatable. If there is something that you wished it would have to file issues here → https://github.com/h2oai/datatable/issues
cf: Medium
So, let’s try it out!
diamonds = sns.load_dataset('diamonds')
diamonds.head()
dtDiamonds = dt.Frame(diamonds)
dtDiamonds.head()
 | carat | cut | color | clarity | depth | table | price | x | y | z |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
1 | 0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
2 | 0.23 | Good | E | VS1 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 |
3 | 0.29 | Premium | I | VS2 | 62.4 | 58 | 334 | 4.2 | 4.23 | 2.63 |
4 | 0.31 | Good | J | SI2 | 63.3 | 58 | 335 | 4.34 | 4.35 | 2.75 |
5 | 0.24 | Very Good | J | VVS2 | 62.8 | 57 | 336 | 3.94 | 3.96 | 2.48 |
6 | 0.24 | Very Good | I | VVS1 | 62.3 | 57 | 336 | 3.95 | 3.98 | 2.47 |
7 | 0.26 | Very Good | H | SI1 | 61.9 | 55 | 337 | 4.07 | 4.11 | 2.53 |
8 | 0.22 | Fair | E | VS2 | 65.1 | 61 | 337 | 3.87 | 3.78 | 2.49 |
9 | 0.23 | Very Good | H | VS1 | 59.4 | 61 | 338 | 4 | 4.05 | 2.39 |
No Plotting By Default
One point that might hurt someone's willingness to switch is that plotting is not built into the objects the way it is with pandas. This means you'll have to be explicit about importing and using matplotlib or seaborn. And that's not all, because if you try to pass the datatable frame to seaborn, it will fail:
# You can run this but it will fail:
sns.displot(dtDiamonds, x='x')
When you run this, you will get the error:
> ValueError: Could not interpret value `x` for parameter `x`
… and the code which causes this is:
# Raise when data object is present and a vector can't matched
if isinstance(data, pd.DataFrame) and not isinstance(val, pd.Series):
So, if it's not a pandas data frame, seaborn just won't accept it. There is a comparable tool that implements the Grammar of Graphics for Python in the package plotnine. I tried using it within the VM and it literally crashed my virtual machine: not just my Python kernel but the whole thing. So, we're not going to do that, and I wouldn't recommend that you do it either. That's a shame, since I really like ggplot and the plotnine library for Python.
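Since the frame has to become a pandas data frame before seaborn will take it, one option is a small conversion helper. The sketch below is only an illustration: as_pandas is a hypothetical name, not part of datatable or seaborn, and it relies on the to_pandas() method that datatable frames provide. FakeFrame is a made-up stand-in so the snippet runs without datatable installed.

```python
def as_pandas(frame):
    """Return an object that is ready to hand to a plotting library.

    Hypothetical helper: if `frame` offers a callable `to_pandas()` (as
    datatable frames do), call it; otherwise return the object unchanged.
    """
    to_pandas = getattr(frame, "to_pandas", None)
    return to_pandas() if callable(to_pandas) else frame


class FakeFrame:
    """Made-up stand-in for a datatable frame, for demonstration only."""

    def to_pandas(self):
        # Placeholder for the pandas DataFrame a real frame would return.
        return {"x": [3.95, 3.89, 4.05]}


converted = as_pandas(FakeFrame())  # goes through to_pandas()
untouched = as_pandas([1, 2, 3])    # no to_pandas(), returned as-is
```

With the real libraries, the call would presumably look like sns.displot(as_pandas(dtDiamonds), x='x'), and the price is the conversion itself.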
Is this Worth it?
You should check out the documentation to see whether the analytics side of this tool is worth it. Having used data.table on the R side, I'm definitely going to try this out. But if I want to do any graphing, I have to convert to pandas, and that conversion has a cost. Let's measure the cost like the other bloggers did. First, we'll write the dataset to a CSV, since the timing has to account for the conversion back to pandas.
diamonds.to_csv(Path("_data/diamonds.csv")), len(diamonds)
import matplotlib.pyplot as plt
I will have to copy the results in by hand because I could not find a way to suppress the graphs while keeping the timeit outputs. You can copy and run these cells, but keep in mind that they will spam you with graphs.
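One possible way around the graph spam, offered as a sketch rather than something tested in this post, is to time the pipeline from a plain script with the standard library's timeit module instead of the %%timeit cell magic. The helper name time_pipeline and the stand-in workload are assumptions made for illustration. In a real run the workload would read the CSV, plot, and then call plt.close('all'); selecting a non-interactive backend with matplotlib.use('Agg') should keep anything from being displayed.

```python
import statistics
import timeit


def time_pipeline(pipeline, runs=2, loops=10):
    """Roughly mirror `%%timeit -r2 -n10`: `runs` repeats of `loops` calls.

    Returns (mean, standard deviation) of the per-loop time in seconds.
    """
    totals = timeit.repeat(pipeline, repeat=runs, number=loops)
    per_loop = [total / loops for total in totals]
    return statistics.mean(per_loop), statistics.pstdev(per_loop)


def stand_in_workload():
    # Placeholder only. A real version would do something like:
    #   data = pd.read_csv(Path("_data/diamonds.csv"))
    #   sns.displot(data, x='x', kde=True)
    #   plt.close('all')  # discard the figure instead of showing it
    return sum(range(10_000))


mean_s, std_s = time_pipeline(stand_in_workload, runs=2, loops=10)
print(f"{mean_s * 1e3:.3f} ms ± {std_s * 1e3:.3f} ms per loop")
```

The numbers from such a script are not directly comparable to the notebook timings below, since the rendering backend differs.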
%%timeit -r2 -n10
data = pd.read_csv(Path("_data/diamonds.csv"))
a = sns.displot(data, x='x', kde=True);
%%timeit -r2 -n10
data = dt.fread(Path("_data/diamonds.csv"))
a = sns.displot(data.to_pandas(), x='x', kde=True);
Results:
pandas: 321 ms ± 2.59 ms per loop (mean ± std. dev. of 2 runs, 10 loops each)
datatable: 339 ms ± 8.55 ms per loop (mean ± std. dev. of 2 runs, 10 loops each)
So, pandas wins. This dataset is small, though, so let's try a more real-world one. The analysis in the posts used a dataset with millions of rows, so we can test this with a much bigger dataset: All Lending Club loan data.
# The big boi
path = Path('_data/accepted_2007_to_2018Q4.csv')
%%timeit -r2 -n3
data = pd.read_csv(path)
_ = sns.displot(data, x='loan_amnt', kde=True);
%%timeit -r2 -n3
data = dt.fread(path)
_ = sns.displot(data.to_pandas(), x='loan_amnt', kde=True);
Results:
pandas: 1min 8s ± 64.7 ms per loop (mean ± std. dev. of 2 runs, 3 loops each)
datatable: 55 s ± 389 ms per loop (mean ± std. dev. of 2 runs, 3 loops each)
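To put the two sets of results on the same scale, here is a small arithmetic sketch that expresses each datatable timing relative to pandas. The timings are copied from the results above (1min 8s taken as 68 s); the function name relative_difference is just an illustrative choice.

```python
# Timings copied from the results above, in seconds per loop.
results = {
    "diamonds (small)": {"pandas": 0.321, "datatable": 0.339},
    "Lending Club (large)": {"pandas": 68.0, "datatable": 55.0},
}


def relative_difference(pandas_s, datatable_s):
    """Fraction of the pandas time that datatable saves (+) or adds (-)."""
    return (pandas_s - datatable_s) / pandas_s


for name, timing in results.items():
    diff = relative_difference(timing["pandas"], timing["datatable"])
    print(f"{name}: datatable vs pandas {diff:+.1%}")
```

On these numbers, datatable takes about 5.6% longer on the small file and about 19% less time on the large one, conversion to pandas included.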
Conclusions
And so datatable wins on the larger dataset, even when you have to convert it over. The break-even point lies somewhere between 53,940 and 2,260,701 rows. As with most tools, you'll have to use your own judgement, based on your own circumstances, about whether you'll find it useful. I'm definitely going to pick it up, if for no other reason than that the read speed is superior and I liked my experience with data.table when I was using R.