Collecting External Data for Python.

Or, Delete the Intermediate When Collecting Data.
python
data
r
rda
download
Published

October 26, 2022

While preparing for some upcoming blog posts that draw on Design and Analysis of Experiments with R by John Lawson, I wanted to convert the problems and solutions from R code to Python code. Doing this requires the real data and, luckily, the data from the book is online on GitHub. Because of how R packages are built, the data is stored in the repository as binary files, which we can use. Unfortunately, the data is in the .rda format, which doesn’t convert easily into Python.

There is a package for this conversion: pyreadr, which we’re going to use to convert the data into a dataframe Python understands. Sadly, this package doesn’t handle URLs, so we’ll need to download the data first. We could clone the whole repository to collect the data, but then we’d have to start manually managing the data, which I don’t want to do.

After a bit of working around, we can use the tempfile built-in package from Python to create a temporary file to dump the data into. This is useful since the file is deleted as soon as .close() is called on it. But we’ll want the Named version, since the file has to be accessible through the file system:

> This function operates exactly as TemporaryFile() does, except that the file is guaranteed to have a visible name in the file system (on Unix, the directory entry is not unlinked). That name can be retrieved from the name attribute of the returned file-like object. Whether the name can be used to open the file a second time, while the named temporary file is still open, varies across platforms (it can be so used on Unix; it cannot on Windows).

Source
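
To make the quoted behaviour concrete, here is a minimal sketch that uses only the standard library. The payload and the .rda suffix are placeholders for illustration, not real data, and the reopen-by-name step assumes a Unix-like system, as the quote warns.

```python
import os
import tempfile

# Placeholder bytes standing in for a downloaded file (an assumption for illustration).
payload = b"example bytes standing in for a downloaded file"

with tempfile.NamedTemporaryFile(suffix=".rda") as f:
    f.write(payload)
    f.flush()  # push buffered bytes to disk so other readers can see them
    temp_path = f.name
    existed_while_open = os.path.exists(temp_path)
    # Reopening by name works on Unix; on Windows it may fail while f is still open.
    with open(temp_path, "rb") as reader:
        round_trip = reader.read()

# Leaving the with block closes the file, which also deletes it.
exists_after_close = os.path.exists(temp_path)
print(existed_while_open, exists_after_close, round_trip == payload)
```

The file has a real path while it is open and is gone once the block exits, which is exactly the "delete the intermediate" behaviour we are after.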

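We’ll use the requests library to pull the data from the internet since it’s easy to use. Note that requests is a third-party package rather than part of the standard library, so it may need to be installed first.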

# !pip install pyreadr requests
import pyreadr as pyr
import tempfile as tmp
import requests as r

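One caveat here is that you’ll need to rewind the file with f.seek(0) before reading the temporary file, otherwise you’ll get a LibrdataError: Unable to read from file. Since pyreadr opens the file by name, what matters is that the written bytes have actually reached the disk; seeking flushes the pending write buffer, and calling f.flush() should work just as well.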

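with tmp.NamedTemporaryFile() as f:
    # Download the raw .rda bytes and fail early on an HTTP error.
    response = r.get("https://github.com/cran/daewr/raw/master/data/Apo.rda")
    response.raise_for_status()
    f.write(response.content)
    # Rewind before handing the file name to pyreadr.
    f.seek(0)
    # read_r returns a dictionary of data frames keyed by R object name.
    data = pyr.read_r(f.name)['Apo']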

data.head(15).T
          0      1      2      3      4      5      6      7      8      9     10     11     12     13     14
lab       A      A      A      A      A      A      A      B      B      B      B      B      B      B      B
conc  1.195  1.144  1.167  1.249  1.177  1.217  1.187  1.155  1.173  1.171  1.175  1.153  1.139  1.185  1.144
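
The effect of the f.seek(0) call in the download snippet can be checked without any network access. The sketch below writes a small placeholder payload (made up for illustration) and compares the on-disk size of the temporary file before and after seeking. It assumes CPython's default buffered file objects, where a small write stays in memory until the buffer is flushed.

```python
import os
import tempfile

payload = b"hello, rda!"  # placeholder for downloaded bytes (an assumption)

with tempfile.NamedTemporaryFile() as f:
    f.write(payload)
    # The bytes are still in Python's write buffer, so the file on disk is empty.
    size_before_seek = os.path.getsize(f.name)
    # Seeking flushes the pending writes before moving the position.
    f.seek(0)
    size_after_seek = os.path.getsize(f.name)

print(size_before_seek, size_after_seek)
```

Anything that opens the file by name before the flush sees an empty file, which would be consistent with the read error mentioned earlier.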

There we go! You can use this as a simple way to collect data from the internet and feed it into a package that doesn’t support reading from URLs. Expect to see it used in upcoming posts as I work through the textbook.
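
If the pattern comes up often, it could be wrapped in a small helper. The function below is a hypothetical sketch, not part of pyreadr or requests: read_bytes_via_tempfile is a name made up here, and it accepts any reader that takes a file path. The demo uses a trivial stand-in reader so it runs without a network connection.

```python
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def read_bytes_via_tempfile(
    payload: bytes, reader: Callable[[str], T], suffix: str = ""
) -> T:
    """Write payload to a named temporary file and pass its path to reader.

    Hypothetical helper: the file is deleted once the with block exits, so
    the reader must finish loading the data before returning.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix) as f:
        f.write(payload)
        f.flush()  # make sure the bytes are on disk before the reader opens the path
        return reader(f.name)


def read_back(path: str) -> bytes:
    """Stand-in reader for the demo: return the raw bytes stored at path."""
    with open(path, "rb") as handle:
        return handle.read()


result = read_bytes_via_tempfile(b"demo payload", read_back, suffix=".bin")
print(result)
```

With the libraries from this post, the call would presumably look like read_bytes_via_tempfile(response.content, pyr.read_r, suffix=".rda")['Apo'], where response is the result of the requests call. As with the original snippet, reopening the file by name relies on Unix-like behaviour.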