# Shapiro-Wilk Test for Normality in Python and R

What Even Is Normal, Anyway?
R
python
data
experiment
analysis
Published

February 6, 2023

# We Need Normality!

Part of the assumptions you need to confirm before using some of these tests is normality, which is a way of saying that the data follows a normal (Gaussian) distribution. When this is true, the data looks like the bell curve that is posted all over when learning about statistics, and where most of a student’s time in statistics classes is spent. If you’re reading this, then you probably already know what one is, but if you don’t, I’d suggest brushing up with a Statistics 101 course; something like Khan Academy or the Wikipedia page.
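As a quick refresher (toy numbers of my own, nothing from the dataset below): samples drawn from a normal distribution pile up around the mean, with roughly 68% landing within one standard deviation.

```python
import numpy as np

# Draw a large sample from a normal distribution with mean 100, sd 15
# (arbitrary illustrative parameters)
rng = np.random.default_rng(42)
samples = rng.normal(loc=100, scale=15, size=100_000)

# For a normal distribution, about 68% of values fall within one sd of the mean
within_one_sd = np.mean(np.abs(samples - 100) <= 15)
print(f"{within_one_sd:.1%} of samples within one standard deviation")
```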

This test’s job is to look at the data and check whether it deviates from a normal distribution. The null hypothesis is that the data *is* normally distributed, so a small p-value is evidence against normality.
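To see that behavior on toy data of my own (not the course dataset), compare a normal sample against a clearly skewed one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_data = rng.normal(size=200)       # drawn from a normal distribution
skewed_data = rng.exponential(size=200)  # clearly non-normal

w_norm, p_norm = stats.shapiro(normal_data)
w_skew, p_skew = stats.shapiro(skewed_data)

# The skewed sample should get a tiny p-value (reject normality),
# while the normal sample will usually have a p-value well above 0.05
print(f"normal sample: W={w_norm:.3f}, P={p_norm:.3f}")
print(f"skewed sample: W={w_skew:.3f}, P={p_skew:.3g}")
```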

## Shapiro-Wilk in Python

Again, we’ll be borrowing some data from an online class I’ve taken in the past; this dataset tracks task times by IDE.

```python
import pandas as pd
import scipy as sci
import scipy.stats  # scipy doesn't auto-load submodules, so import stats explicitly

data = pd.read_csv("https://query.data.world/s/patv4rpu4qipeb4hgggtjen7qh3vdr")
data.info()
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Subject  40 non-null     int64
 1   IDE      40 non-null     object
 2   Time     40 non-null     int64
dtypes: int64(2), object(1)
memory usage: 1.1+ KB
```
```python
data.sample(10)
```
```
    Subject      IDE  Time
31       32  Eclipse   362
30       31  Eclipse   562
17       18  VStudio   305
5         6  VStudio   270
14       15  VStudio   310
3         4  VStudio   155
6         7  VStudio   250
8         9  VStudio   209
39       40  Eclipse   244
16       17  VStudio   376
```

This test expects a single array of numeric data, so we’ll want to split on the IDE type, since that is the grouping we’re interested in:

```python
# Don't forget to split on the IDE column to separate the two groups
ide1 = data.query('IDE == "VStudio"')['Time']
ide2 = data.query('IDE == "Eclipse"')['Time']

w1, p1 = sci.stats.shapiro(ide1)
w2, p2 = sci.stats.shapiro(ide2)

print(f"Visual Studio had a W Stat of {w1:.3} (P={p1:.3E})",
      f"Eclipse had a W Stat of {w2:.3} (P={p2:.3E})")
```
```
Visual Studio had a W Stat of 0.844 (P=4.191E-03) Eclipse had a W Stat of 0.872 (P=1.281E-02)
```

What we’re looking for is the p-value: the W statistic always falls between 0 and 1, with values near 1 meaning the sample looks normal, and the null hypothesis is that the data is normally distributed. Both p-values here are below 0.05, so we reject the Null Hypothesis; neither group’s times pass the normality check. While this works, I’d rather be able to do this for all numeric columns we are interested in. Then we can include our normality check right in a pipeline.
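To make the decision rule explicit, here is a small helper of my own (the name `check_normality` and the default alpha are my choices, not anything from the course):

```python
from scipy import stats

def check_normality(values, alpha=0.05):
    """Shapiro-Wilk wrapper: returns (W, p, normal), where normal means
    we failed to reject the null hypothesis of normality at the given alpha."""
    w, p = stats.shapiro(values)
    return w, p, p > alpha
```

With the p-values reported above (4.191E-03 and 1.281E-02), both IDE groups would come back `normal=False` at alpha = 0.05.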

```python
# The trailing backslash tells Python to ignore the line break,
# so the chain keeps reading on the next line;
# a kind of hacky way to get something looking like R pipes
swStats = data\
    .drop('Subject', axis=1)\
    .groupby('IDE')['Time']\
    .apply(lambda x: sci.stats.shapiro(x))

for ide, stats in swStats.items():
    print(f"{ide} had a W Stat of {stats.statistic:.3} (P={stats.pvalue:.3E})")
```
```
Eclipse had a W Stat of 0.872 (P=1.281E-02)
VStudio had a W Stat of 0.844 (P=4.191E-03)
```

That is much better!

## Shapiro-Wilk in R

Once again, R is a wonderful language and this is easy to work with. Here is the R version of our pipeline, with the same results:

```r
library(tidyverse)
```
```
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.1
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.3.0      ✔ stringr 1.5.0
✔ readr   2.1.3      ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
```
```r
data <- read_csv("https://query.data.world/s/patv4rpu4qipeb4hgggtjen7qh3vdr")
```
```
Rows: 40 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): IDE
dbl (2): Subject, Time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```
```r
# skip to the pipeline!
swState <- data %>%
    select(IDE, Time) %>%
    group_by(IDE) %>%
    summarize(
        stat   = shapiro.test(Time)[['statistic']],
        pvalue = shapiro.test(Time)[['p.value']])
swState
```
```
# A tibble: 2 × 3
  IDE      stat  pvalue
  <chr>   <dbl>   <dbl>
1 Eclipse 0.872  0.0128
2 VStudio 0.844  0.00419
```

## Conclusions

There is some disagreement about whether you should even bother with this test, since a Q-Q plot does the same job visually. I touched on the Q-Q plot in another post about Apache Spark. Really, it’s up to you and what your needs are, but this test can be included in a data pipeline while a Q-Q plot cannot, so there is at least one point in its favor.
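For reference, here is roughly what the programmatic side of a Q-Q check looks like in Python with `scipy.stats.probplot` (toy data of my own, sized like the IDE groups; pass a matplotlib axis via `plot=` to actually draw the plot):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=250, scale=60, size=40)  # toy data, 40 points like each IDE group

# probplot pairs the ordered sample values with the theoretical normal quantiles
# and fits a line through them; r near 1.0 suggests the data tracks a normal
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm")
print(f"quantile-fit r = {r:.3f}")
```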