Shapiro-Wilk Test for Normality in Python and R

What Even Is Normal Anyway?
R
python
data
experiment
analysis
Published

February 6, 2023

We Need Normality!

Among the assumptions you need to confirm before using some of these tests is Normality, which is a way of saying that the data follows a Normal/Gaussian Distribution. When this is true, the data looks like the Bell Curve that is posted all over when learning about Statistics, and it is where most of a student's time in Statistics classes is spent. If you're reading this, then you probably already know what one is, but if you don't, I'd suggest brushing up with a Statistics 101 course; something like Khan Academy or the Wikipedia page.

This test's job is to look at the data and check for evidence that it is not normally distributed: the Null Hypothesis is that the data is normal, and a small p-value counts as evidence against that.
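To see what the test reports in clear-cut cases, here is a small sketch using NumPy's random generator (the seed and sample sizes are arbitrary choices of mine, not anything from the dataset below):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed, just for reproducibility

normal_sample = rng.normal(loc=0, scale=1, size=200)  # drawn from a normal distribution
skewed_sample = rng.exponential(scale=1.0, size=200)  # right-skewed, clearly non-normal

# The null hypothesis is that the sample is normally distributed,
# so the normal sample should typically get a large p-value
# and the skewed sample a very small one.
print(stats.shapiro(normal_sample).pvalue)
print(stats.shapiro(skewed_sample).pvalue)
```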

Shapiro-Wilk in Python

Again, we'll be borrowing some data from an online class I've taken in the past; this dataset is about time spent by IDE.

import pandas as pd
import scipy as sci

data = pd.read_csv('https://query.data.world/s/patv4rpu4qipeb4hgggtjen7qh3vdr')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Subject  40 non-null     int64 
 1   IDE      40 non-null     object
 2   Time     40 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 1.1+ KB
data.sample(10)
    Subject      IDE  Time
8         9  VStudio   209
38       39  Eclipse   232
20       21  Eclipse   806
39       40  Eclipse   244
37       38  Eclipse   285
32       33  Eclipse   275
25       26  Eclipse   865
36       37  Eclipse   656
23       24  Eclipse   317
15       16  VStudio   334

This test expects a single array of numeric data, so we'll want to split on the IDE type since that is what we're interested in:

# Don't forget to split on the IDE column to separate the groups
ide1 = data.query('IDE == "VStudio"')['Time']
ide2 = data.query('IDE == "Eclipse"')['Time']

w1, p1 = sci.stats.shapiro(ide1)
w2, p2 = sci.stats.shapiro(ide2)

print(f"Visual Studio had a W Stat of {w1:.3} (P={p1:.3E})", f"Eclipse had a W Stat of {w2:.3} (P={p2:.3E})")
Visual Studio had a W Stat of 0.844 (P=4.191E-03) Eclipse had a W Stat of 0.872 (P=1.281E-02)

The W Stat is at most 1.0, with values near 1 suggesting normality, but the decision comes from the p-value: if it falls below our significance level (commonly 0.05), we reject the Null Hypothesis that the data is normally distributed. Here both p-values are below 0.05, so we reject the Null Hypothesis for each IDE; neither group's times look normal. While this works, I'd rather be able to do this for all numeric columns we are interested in. Then we can include our Normality Check right in a pipeline.
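The conventional decision rule can be wrapped in a small helper (the function name and the α = 0.05 default are my own choices, not part of the original analysis):

```python
from scipy import stats

def check_normality(values, alpha=0.05):
    """Shapiro-Wilk test: return (W, p, verdict) at significance level alpha."""
    w, p = stats.shapiro(values)
    # Null hypothesis: the sample came from a normal distribution.
    # A p-value below alpha is evidence against normality.
    verdict = "reject normality" if p < alpha else "fail to reject normality"
    return w, p, verdict

# A heavily right-skewed sample should be flagged as non-normal
print(check_normality([2**k for k in range(12)]))
```

Keeping the rule in one place makes the interpretation consistent anywhere the test is reused.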

# This slash causes Python to ignore the newline at the end,
# so Python keeps reading the next line;
# a somewhat hacky way to get something that looks like R pipes
swStats = data\
    .groupby('IDE')['Time']\
    .apply(sci.stats.shapiro)
for ide,stats in swStats.items():
    print(f"{ide} had a W Stat of {stats[0]:.3} (P={stats[1]:.3E})", )
Eclipse had a W Stat of 0.872 (P=1.281E-02)
VStudio had a W Stat of 0.844 (P=4.191E-03)

That is much better!

Shapiro-Wilks in R

Once again, R is a wonderful language and this is easy to work with. Here is the R version of our pipeline, with the same results:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read_csv("https://query.data.world/s/patv4rpu4qipeb4hgggtjen7qh3vdr")
Rows: 40 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): IDE
dbl (2): Subject, Time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# skip to the pipeline!
swStats <- data %>%
    select(IDE, Time) %>%
    group_by(IDE) %>%
    summarize(
        stat = shapiro.test(Time)[['statistic']],
        pvalue = shapiro.test(Time)[['p.value']])
swStats
# A tibble: 2 × 3
  IDE      stat  pvalue
  <chr>   <dbl>   <dbl>
1 Eclipse 0.872 0.0128 
2 VStudio 0.844 0.00419

Conclusions

There is some disagreement about whether you should even bother with this test, since a Q-Q plot does the same thing visually. I touched on the Q-Q plot in another post about Apache Spark. Really, it's up to you and what your needs are, but this test can be included in an automated data pipeline and a Q-Q plot cannot, so there is at least one point in its favor.
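As one sketch of what such a pipeline check could look like (the function, its name, and the α threshold are my own invention, not something from this post or any particular library):

```python
import pandas as pd
from scipy import stats

def assert_groups_normal(df, group_col, value_col, alpha=0.05):
    """Run Shapiro-Wilk per group and raise if any group looks non-normal."""
    results = df.groupby(group_col)[value_col].apply(stats.shapiro)
    failures = {g: r.pvalue for g, r in results.items() if r.pvalue < alpha}
    if failures:
        raise ValueError(f"Non-normal groups at alpha={alpha}: {failures}")
    return results
```

A step like this fails loudly before a downstream test that assumes normality ever runs.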