One of the assumptions you need to confirm before using some of these tests is normality, which is a way of saying that the data follows a normal (Gaussian) distribution. When this is true, it looks like the bell curve that is posted all over when learning about statistics, and it is where most of a student's time in statistics classes is spent. If you're reading this, then you probably already know what one is, but if you don't then I'd suggest brushing up with a Statistics 101 course; something like Khan Academy or the Wikipedia page.
This test's job is to look at the data and check whether it departs from a normal distribution; the null hypothesis is that the data is normally distributed, so a small p-value is evidence that it is not.
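As a quick illustration of how the test behaves, here is a minimal sketch using simulated data (the seed and sample sizes are arbitrary choices, not from the dataset below):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=0, scale=1, size=100)  # drawn from a normal distribution
skewed_sample = rng.exponential(scale=1, size=100)    # clearly non-normal

# shapiro returns the W statistic and a p-value;
# a small p-value is evidence against normality
w_norm, p_norm = stats.shapiro(normal_sample)
w_skew, p_skew = stats.shapiro(skewed_sample)
print(f"Normal sample: W={w_norm:.3f}, p={p_norm:.3f}")
print(f"Skewed sample: W={w_skew:.3f}, p={p_skew:.3g}")
```

The skewed sample should come back with a tiny p-value, while the truly normal sample will usually (though not always; that is how significance levels work) have a p-value above 0.05.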
Shapiro-Wilk in Python
Again, we’ll be borrowing some data from an online class I’ve taken in the past; this dataset is about usage based on IDE.
import pandas as pd
import scipy as sci
import scipy.stats  # some SciPy versions don't load submodules automatically

data = pd.read_csv('https://query.data.world/s/patv4rpu4qipeb4hgggtjen7qh3vdr')
data.info()
This test expects a single array of numeric data so we’ll want to split on the IDE type since that is what we’re interested in:
# Don't forget to split the IDE column to separate them
ide1 = data.query('IDE == "VStudio"')['Time']
ide2 = data.query('IDE == "Eclipse"')['Time']

w1, p1 = sci.stats.shapiro(ide1)
w2, p2 = sci.stats.shapiro(ide2)
print(f"Visual Studio had a W Stat of {w1:.3} (P={p1:.3E})",
      f"Eclipse had a W Stat of {w2:.3} (P={p2:.3E})")
Visual Studio had a W Stat of 0.844 (P=4.191E-03) Eclipse had a W Stat of 0.872 (P=1.281E-02)
What we're looking for is the p-value: the null hypothesis is that the data is normally distributed, so if the p-value falls below our significance level (commonly 0.05) we reject it. (The W statistic itself is bounded above by 1.0, with values near 1 suggesting normality, but the p-value is what drives the decision.) Here both p-values are below 0.05, so we reject the null hypothesis for both IDEs; neither sample looks normally distributed. While this works, I'd rather be able to do this for all numeric columns we are interested in. Then, we can include our normality check right in a pipeline.
# The trailing backslash causes Python to ignore the line break
# and keep reading on the next line;
# a kind of hacky way to get something looking like R pipes
swStats = data \
    .drop('Subject', axis=1) \
    .groupby('IDE') \
    .apply(lambda x: sci.stats.shapiro(x['Time']))

for ide, stats in swStats.items():
    print(f"{ide} had a W Stat of {stats[0]:.3} (P={stats[1]:.3E})")
Eclipse had a W Stat of 0.872 (P=1.281E-02)
VStudio had a W Stat of 0.844 (P=4.191E-03)
That is much better!
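As an aside, wrapping the chain in parentheses is a common alternative to the trailing backslashes; this sketch uses a small inline frame with made-up values in place of the CSV download just so it stands alone:

```python
import pandas as pd
from scipy import stats

# stand-in for the real dataset; values here are invented for illustration
data = pd.DataFrame({
    'IDE':  ['VStudio', 'VStudio', 'VStudio', 'VStudio',
             'Eclipse', 'Eclipse', 'Eclipse', 'Eclipse'],
    'Time': [200, 220, 250, 210, 300, 310, 290, 305],
})

# parentheses let the chain span lines without backslashes
swStats = (
    data
    .groupby('IDE')['Time']
    .apply(lambda x: stats.shapiro(x))
)

for ide, (w, p) in swStats.items():
    print(f"{ide} had a W Stat of {w:.3} (P={p:.3E})")
```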
Shapiro-Wilk in R
Once again, R is a wonderful language and this is easy to work with. Here is the R version of our pipeline with the same results:
library(tidyverse)

data <- read_csv("https://query.data.world/s/patv4rpu4qipeb4hgggtjen7qh3vdr")
Rows: 40 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): IDE
dbl (2): Subject, Time
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# skip to the pipeline!
swState <- data %>%
    select(IDE, Time) %>%
    group_by(IDE) %>%
    summarize(
        stat   = shapiro.test(Time)[['statistic']],
        pvalue = shapiro.test(Time)[['p.value']]
    )
swState
# A tibble: 2 × 3
IDE stat pvalue
<chr> <dbl> <dbl>
1 Eclipse 0.872 0.0128
2 VStudio 0.844 0.00419
Conclusions
There is some disagreement about whether you should even bother with this test, since a Q-Q plot does the same thing but visually. I touched on the Q-Q plot in another post about Apache Spark. Really, it's up to you and what your needs are, but this test can be included in a data pipeline and a Q-Q plot cannot, so there is at least one point in its favor.
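For completeness, scipy can also compute the Q-Q coordinates programmatically via `probplot`, so even the visual check can live partly in code. A minimal sketch with simulated data standing in for the real column:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(size=50)  # placeholder; swap in the real 'Time' values

# probplot returns the theoretical vs. ordered sample quantiles,
# plus a least-squares reference line fitted through them
(theoretical, ordered), (slope, intercept, r) = stats.probplot(sample, dist="norm")

# Plotting `ordered` against `theoretical` (e.g. with matplotlib) gives the
# Q-Q plot; points hugging the reference line suggest normality.
print(f"Correlation with the reference line: r={r:.3f}")
```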