Another Useful Tool From R

Browsing Medium can sometimes be quite useful; you can find some gems in there still. I came across this post which was about getting much faster read times from CSV files and the results looked really good. As I was reading it, I realized the command to read the files in was .fread() and then I realized this looked exactly like the data.table library from R. And, that’s exactly what it is: > Thanks for sharing the story on datatable Parul Pandey. The team H2O.ai is working tirelessly to add missing pandas.Frame functionalities to datatable. If there is something that you wished it would have to file issues here → https://github.com/h2oai/datatable/issues

cf: Medium

So, let’s try it out!

import datatable as dt
import pandas as pd
import seaborn as sns
from pathlib import Path

diamonds = sns.load_dataset('diamonds')
diamonds.head()
dtDiamonds = dt.Frame(diamonds)
dtDiamonds.head()

	carat	cut	color	clarity	depth	table	price	x	y	z
	▪▪▪▪▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	0.23	Ideal	E	SI2	61.5	55	326	3.95	3.98	2.43
1	0.21	Premium	E	SI1	59.8	61	326	3.89	3.84	2.31
2	0.23	Good	E	VS1	56.9	65	327	4.05	4.07	2.31
3	0.29	Premium	I	VS2	62.4	58	334	4.2	4.23	2.63
4	0.31	Good	J	SI2	63.3	58	335	4.34	4.35	2.75
5	0.24	Very Good	J	VVS2	62.8	57	336	3.94	3.96	2.48
6	0.24	Very Good	I	VVS1	62.3	57	336	3.95	3.98	2.47
7	0.26	Very Good	H	SI1	61.9	55	337	4.07	4.11	2.53
8	0.22	Fair	E	VS2	65.1	61	337	3.87	3.78	2.49
9	0.23	Very Good	H	VS1	59.4	61	338	4	4.05	2.39

No Plotting By Default

One point which might harm someone’s willingness to switch over is that plotting is not built directly into the objects like it is with pandas. This means you’ll have to be explicit about importing and using matplotlib or seaborn. But, not only that becuase if you try to pass the datatable frame to Seaborn then it will fair:

# You can run this but it will fail:
sns.displot(dtDiamonds, x='x')

When you run this, you will get the error: > ValueError: Could not interpret value x for parameter x
… and the code which causes this is:

# Raise when data object is present and a vector can't matched
if isinstance(data, pd.DataFrame) and not isinstance(val, pd.Series):

So, if it’s not a pandas data frame then seaborn just wont accept it. There is a matching tool which implements the Grammar of Graphics for python in the package plotnine. I tried doing this within the VM and it literally crashed my Virtual Machine. Not just my Python Kernel but the whole thing. So, we’re not going to do that. And, I wouldn’t recommend that you do it either. Which is a shame since I really like ggplot and the plotnine library from python.

Is this Worth it?

You should check out the documentation to see if the analytics side of this tool is worth it. From using Datatable on the R side I’m definitely going to be trying this out. But, if I want to do any graphing then I have to convert to pandas - which has a cost to convert. Let’s measure the cost like the other bloggers did. First, we’ll write this to a CSV since we’ll have to account for the transition back.

diamonds.to_csv(Path("_data/diamonds.csv")), len( diamonds )
import matplotlib.pyplot as plt

I will have to copy the results because I just could not find a way to suppress the graphs printing while keeping the timeit outputs. You can copy and run these but keep in mind it will spam you with graphs.

%%timeit -r2 -n10
data = pd.read_csv(Path("_data/diamonds.csv"))
a = sns.displot(data, x='x', kde=True);

%%timeit -r2 -n10
data = dt.fread(Path("_data/diamonds.csv"))
a = sns.displot(data.to_pandas(), x='x', kde=True);

Results:

pandas:    321 ms ± 2.59 ms per loop (mean ± std. dev. of 2 runs, 10 loops each)
datatable: 339 ms ± 8.55 ms per loop (mean ± std. dev. of 2 runs, 10 loops each)

So, pandas wins. This dataset though is small though so let’s try a more real world dataset. The analysis in the posts used a dataset with millions of rows so maybe we can test this using a much bigger dataset: All Lending Club loan data.

# The big boi
path = Path('_data/accepted_2007_to_2018Q4.csv')

%%timeit -r2 -n3
data = pd.read_csv(path)
_ = sns.displot(data, x='loan_amnt', kde=True);

%%timeit -r2 -n3
data = dt.fread(path)
_ = sns.displot(data.to_pandas(), x='loan_amnt', kde=True);

Results:

pandas:    1min 8s ± 64.7 ms per loop (mean ± std. dev. of 2 runs, 3 loops each)
datatable: 55 s ± 389 ms per loop (mean ± std. dev. of 2 runs, 3 loops each)

Conclusions

And, so datatable wins on the larger dataset even when you have to convert it over. So, somewhere between 53940 and 2260701 rows is where this works better. Like most tools, you’ll have to use your own judgement and your own circumstances whether you’ll find the tool useful. I’m definitely going to pick it up for no other reason than the read speed is superior and I happen to like the data.table experiences when I was using R.