4 Assignment 3 (Midterm) - Sleep & Exercise Analysis

Author

Shannon Joyce

4.1 Introduction

In this assignment, I have been tasked with analyzing data from Dr. Matthew Walker’s sleep lab to determine which exercise type best improves sleep quality and duration. I used both datasets provided to create derived variables with which I ran descriptive statistics and simple visualizations. I followed these preliminary insights with t-tests, ANOVAs, and post-hoc tests.

4.2 Setup & Data Import

library(readxl)
library(tidyverse)
library(supernova)
library(mosaic)
library(ggplot2)
library(WRS2)
library(knitr)

participant_info_midterm <- read_excel("participant_info_midterm.xlsx")
sleep_data_midterm <- read_excel("sleep_data_midterm.xlsx")

kable(head(participant_info_midterm[, 1:4]))

Table 4.1: First 6 rows of the participant info dataset.

ID	Exercise_Group	Sex	Age
P001	NONE	Male	35
P002	Nonee	Malee	57
P003	None	Female	26
P004	None	Female	29
P005	None	Male	33
P006	None	Female	33

kable(head(sleep_data_midterm[, 1:4]))

Table 4.2: First 6 rows of the sleep dataset.

ID	Pre_Sleep	Post_Sleep	Sleep_Efficiency
P001	zzz-5.8	4.7	81.6
P002	Sleep-6.6	7.4	75.7
P003	NA	6.2	82.9
P004	SLEEP-7.2	7.3	83.6
P005	score-7.4	7.4	83.5
P006	Sleep-6.6	7.1	88.5

In this first chunk, I loaded all of my packages and uploaded the two datasets I will be working with.

Did you know?

You can actually upload multiple sheets of data from one file without separating them. In this assignment, I was unaware of that possibility, and I ended up saving the individual sheets as multiple files in order to successfully upload both datasets. In the next chapter, you’ll see how I used the function lapply() to upload multiple sheets of data at once!

4.3 Merge & Base Cleaning

unique(participant_info_midterm$Exercise_Group)

 [1] "NONE"           "Nonee"          "None"           "N"             
 [5] "Cardio"         "C"              "WEIGHTZ"        "WEIGHTS"       
 [9] "WEIGHTSSS"      "Cardio+Weights" "CW"             "C+W"

participant_info_midterm <- participant_info_midterm %>%
  mutate(Exercise_Group = dplyr::recode(Exercise_Group,
                                 "NONE" = "None",
                                 "Nonee" = "None",
                                 "N" = "None",
                                 "C" = "Cardio",
                                 "WEIGHTZ" = "Weights",
                                 "WEIGHTS" = "Weights",
                                 "WEIGHTSSS" = "Weights",
                                 "CW" = "Cardio+Weights",
                                 "C+W" = "Cardio+Weights"))

unique(participant_info_midterm$Sex)

[1] "Male"    "Malee"   "Female"  "Femalee" "F"       "M"       "Fem"    
[8] "MALE"    "Mal"

participant_info_midterm <- participant_info_midterm %>%
  mutate(Sex = dplyr::recode(Sex,
                      "Malee" = "Male",
                      "Femalee" = "Female",
                      "F" = "Female",
                      "M" = "Male",
                      "Fem" = "Female",
                      "MALE" = "Male",
                      "Mal" = "Male"))

unique(sleep_data_midterm$Pre_Sleep)

 [1] "zzz-5.8"   "Sleep-6.6" NA          "SLEEP-7.2" "score-7.4" "Sleep-6"  
 [7] "zzz-8.1"   "sleep-5.5" "sleep-5.7" "score-7"   "Sleep-8"   "score-5.3"
[13] "SLEEP-7.8" "zzz-6.7"   "SLEEP-7.4" "score-7.1" "zzz-6.8"   "Sleep-5.7"
[19] "sleep-5"   "score-6"   "Sleep-6.2" "score-5.2" "sleep-6.6" "SLEEP-5.6"
[25] "SLEEP-5.7" "Sleep-6.9" "SLEEP-6.9" "zzz-5.9"   "SLEEP-6.7" "zzz-5.4"  
[31] "zzz-7"     "zzz-6"     "zzz-6.9"   "score-4"   "SLEEP-7.1" "SLEEP-5.8"
[37] "sleep-6.7" "zzz-6.4"   "Sleep-6.5" "Sleep-7.2" "score-7.3" "sleep-6.2"
[43] "sleep-7.3" "SLEEP-6.2" "SLEEP-6.6" "SLEEP-6.1" "score-6.7" "score-5.1"
[49] "score-5.5" "sleep-6.4" "score-7.5" "zzz-6.6"   "Sleep-5.6" "SLEEP-5.1"
[55] "sleep-6.5" "score-5.9" "score-6.6" "score-6.5" "sleep-8"   "score-6.2"
[61] "zzz-7.3"   "sleep-5.6" "zzz-6.1"   "sleep-5.9" "SLEEP-4.4" "zzz-6.5"

sleep_data_midterm <- sleep_data_midterm %>%
  mutate(Pre_Sleep = str_remove(Pre_Sleep, "zzz-|SLEEP-|sleep-|score-|Sleep-"))

participant_sleep_midterm <- merge(sleep_data_midterm, participant_info_midterm, by="ID")

kable(head(participant_sleep_midterm[, 1:7]))

Table 4.3: First 6 rows of the merged dataset.

ID	Pre_Sleep	Post_Sleep	Sleep_Efficiency	Exercise_Group	Sex	Age
P001	5.8	4.7	81.6	None	Male	35
P002	6.6	7.4	75.7	None	Male	57
P003	NA	6.2	82.9	None	Female	26
P004	7.2	7.3	83.6	None	Female	29
P005	7.4	7.4	83.5	None	Male	33
P006	6.6	7.1	88.5	None	Female	33

In this chunk, I focused on cleaning the data. This included standardizing the ‘Exercise Group’ and ‘Sex’ columns as well as cleaning up the scores in the ‘Pre-Sleep’ column. Afterwards, I merged the two datasets into one large dataset.

4.4 Create Derived Variables

str(participant_sleep_midterm$Pre_Sleep)

 chr [1:100] "5.8" "6.6" NA "7.2" "7.4" "6.6" "6" "8.1" "5.5" "5.7" "7" ...

participant_sleep_midterm <- participant_sleep_midterm %>%
  mutate(Pre_Sleep = as.numeric(Pre_Sleep))

participant_sleep_midterm$Sleep_Difference <- participant_sleep_midterm$Post_Sleep - participant_sleep_midterm$Pre_Sleep

participant_sleep_midterm <- participant_sleep_midterm %>%
  mutate(
    AgeGroup2 = case_when(
      Age < 40  ~ "<40",
      Age >= 40 ~ ">=40"
    )
  )

count(is.na(participant_sleep_midterm$Sleep_Difference))

n_TRUE 
    14

participant_sleep_midterm <- participant_sleep_midterm %>% 
  filter(!is.na(Sleep_Difference))

kable(head(participant_sleep_midterm[, 1:9]))

Table 4.4: Participant Sleep Data with Derived Variables.

ID	Pre_Sleep	Post_Sleep	Sleep_Efficiency	Exercise_Group	Sex	Age	Sleep_Difference	AgeGroup2
P001	5.8	4.7	81.6	None	Male	35	-1.1	<40
P002	6.6	7.4	75.7	None	Male	57	0.8	>=40
P004	7.2	7.3	83.6	None	Female	29	0.1	<40
P005	7.4	7.4	83.5	None	Male	33	0.0	<40
P006	6.6	7.1	88.5	None	Female	33	0.5	<40
P007	6.0	6.7	83.6	None	Male	32	0.7	<40

In this chunk, I created new variables using existing data. I created a ‘Sleep_Difference’ variable by finding the difference between the Post- and Pre- sleep scores. I then created an age group variable by sorting participants into 2 groups: those under 40 years old, and those who are 40 years or older. Finally, I found that there were 14 rows that had NA values in the ‘Sleep_Difference’ column, and I deleted those participants from the dataset.

4.5 Descriptive Statistics

favstats(participant_sleep_midterm$Sleep_Difference) %>% kable()

	min	Q1	median	Q3	max	mean	sd	n	missing
	-1.1	0.3	0.75	1.1	2.1	0.6825581	0.6610494	86	0

Table 4.5: Table showing preliminary descriptive statistics such as averages and frequencies to better understand the dataset and what patterns to explore regarding difference in sleep quality.

favstats(participant_sleep_midterm$Sleep_Efficiency) %>% kable()

	min	Q1	median	Q3	max	mean	sd	n	missing
	71.7	79.975	83.3	88.425	101.5	83.77558	5.973804	86	0

Table 4.6: Table showing preliminary descriptive statistics such as averages and frequencies to better understand the dataset and what patterns to explore regarding sleep efficiency.

participant_sleep_midterm %>%
  group_by(Exercise_Group) %>%
  dplyr::summarize(
    mean_Sleep_Difference = mean(Sleep_Difference),
    mean_Sleep_Efficiency = mean(Sleep_Efficiency)
  ) %>% kable()

Exercise_Group	mean_Sleep_Difference	mean_Sleep_Efficiency
Cardio	1.1380952	85.44762
Cardio+Weights	0.8608696	86.83478
None	0.0476190	81.07143
Weights	0.6666667	81.45714

Table 4.7: Table representing the average difference in sleep quality and average sleep efficiency scores for each exercise group.

This is a summary of the descriptive statistics of the ‘Sleep_Difference’ and ’Sleep_Efficiency variables, including mean, sd, min, max, and group-wise means.

4.6 Visualizations

ggplot(participant_sleep_midterm, aes(x = Sleep_Difference, y = Exercise_Group, fill = Exercise_Group)) +
  geom_boxplot(outlier.shape = NA) +
  theme_classic() +
  coord_flip() +
  theme(legend.position = "none") +
  labs(title = "Difference in Sleep Efficiency by Type of Exercise",
          x = "Difference in Sleep Efficiency", y = "Type of Exercise")

Figure 4.1: Difference in sleep efficiency between participants who exercised using cardiovascular exercise only, weights only, both cardiovascular and weights, or no exercise at all.

ggplot(participant_sleep_midterm, aes(x = Sleep_Efficiency, y = Exercise_Group, fill = Exercise_Group)) +
  geom_boxplot(outlier.shape = NA) +
  theme_classic() +
  coord_flip() +
  theme(legend.position = "none") +
  labs(title = "Sleep Efficiency by Type of Exercise",
       x = "Sleep Efficiency", y = "Type of Exercise")

Figure 4.2: Average sleep efficiency between participants who exercised using cardiovascular exercise only, weights only, both cardiovascular and weights, or no exercise at all.

ggplot(participant_sleep_midterm, aes(x = Sleep_Difference, y = Sleep_Efficiency)) +
  geom_point() +
  theme_classic() +
  geom_smooth(method = "lm", color = "steelblue") +
  labs(title = "Difference in Sleep Efficiency by Participant After Exercise",
       x = "Difference in Sleep Efficiency", y = "Sleep Efficiency")

`geom_smooth()` using formula = 'y ~ x'

Figure 4.3: Difference in sleep efficiency by participant after exercise portion.

Here are three visualizations: two box plots that represent how type of exercise affects difference in sleep efficiency and sleep efficiency itself, and a scatterplot that represents how exercise affected the difference in sleep efficiency for each participant (including a trend line).

4.7 T-Tests

4.7.0.1 Independent T-Test: Sleep_Difference ~ Sex

sleepdif_sex_t_test <-t.test(Sleep_Difference ~ Sex, data = participant_sleep_midterm)
sleepdif_sex_t_test


    Welch Two Sample t-test

data:  Sleep_Difference by Sex
t = 1.5801, df = 77.647, p-value = 0.1182
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -0.05865017  0.50972574
sample estimates:
mean in group Female   mean in group Male 
           0.7795918            0.5540541

Female mean: 0.7796
Male mean: 0.5541
P-value: 0.12 (>0.05)

Differences in sleep efficiency between male and female participants are not statistically different from one another/not statistically significant. We fail to reject the null.

4.7.0.2 Independent T-Test: Sleep_Difference ~ AgeGroup2

sleepdif_age_t_test <-t.test(Sleep_Difference ~ AgeGroup2, data = participant_sleep_midterm)
sleepdif_age_t_test


    Welch Two Sample t-test

data:  Sleep_Difference by AgeGroup2
t = -1.3746, df = 36.662, p-value = 0.1776
alternative hypothesis: true difference in means between group <40 and group >=40 is not equal to 0
95 percent confidence interval:
 -0.50676303  0.09717936
sample estimates:
 mean in group <40 mean in group >=40 
         0.6373134          0.8421053

<40 mean: 0.6373
=>40 mean: 0.8421
P-value: 0.18 (>0.05)

Differences in sleep efficiency between participants under 40 years old and participants 40 years old or older are not statistically different from one another/not statistically significant. Again, we fail to reject the null.

4.8 ANOVAs

4.8.0.1 ANOVA A: Sleep_Difference ~ Exercise_Group

anova_model_diff_exercise <- aov(Sleep_Difference ~ Exercise_Group, data = participant_sleep_midterm)

summary(anova_model_diff_exercise)

               Df Sum Sq Mean Sq F value   Pr(>F)    
Exercise_Group  3  13.56   4.520   15.72 3.67e-08 ***
Residuals      82  23.58   0.288                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Table 4.8: ANOVA table representing how much variance in the difference in sleep can be explained by the type of exercise intervention.

supernova(anova_model_diff_exercise)

 Analysis of Variance Table (Type III SS)
 Model: Sleep_Difference ~ Exercise_Group

                             SS df    MS      F   PRE     p
 ----- --------------- | ------ -- ----- ------ ----- -----
 Model (error reduced) | 13.560  3 4.520 15.717 .3651 .0000
 Error (from model)    | 23.583 82 0.288                   
 ----- --------------- | ------ -- ----- ------ ----- -----
 Total (empty model)   | 37.144 85 0.437

Table 4.9: A clearer version of the previous ANOVA table.

F-value: 15.717 (large)
df: 85
p-value: 3.67e-08 (<0.05, significant!)
PRE: 0.3651 (~36.6% of variance)

Exercise group accounts for about 37% of total variance in difference in sleep efficiency.

4.8.0.2 Post-Hoc: Tukey HSD

TukeyHSD(anova_model_diff_exercise)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Sleep_Difference ~ Exercise_Group, data = participant_sleep_midterm)

$Exercise_Group
                             diff        lwr         upr     p adj
Cardio+Weights-Cardio  -0.2772257 -0.7017134  0.14726203 0.3237562
None-Cardio            -1.0904762 -1.5245041 -0.65644825 0.0000000
Weights-Cardio         -0.4714286 -0.9054565 -0.03740063 0.0278779
None-Cardio+Weights    -0.8132505 -1.2377382 -0.38876282 0.0000171
Weights-Cardio+Weights -0.1942029 -0.6186906  0.23028480 0.6287294
Weights-None            0.6190476  0.1850197  1.05307556 0.0018927

The ‘Cardio’ and ‘Cardio+Weights’ groups were tied for the most positive change in sleep quality; there was no statistically significant difference between these two exercise groups. In addition, the ‘Weights’ group showed a significantly greater improvement in sleep quality compared to the ‘None’ group. The ‘None’ group produced significantly worse results in sleep quality than all of the other groups.

ANOVA B: Sleep_Efficiency ~ Exercise_Group

anova_model_eff_exercise <- aov(Sleep_Efficiency ~ Exercise_Group, data = participant_sleep_midterm)

summary(anova_model_eff_exercise)

               Df Sum Sq Mean Sq F value  Pr(>F)   
Exercise_Group  3  540.4   180.1   5.925 0.00104 **
Residuals      82 2492.9    30.4                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Table 4.10: ANOVA table representing how much variance in sleep efficiency can be explained by the type of exercise intervention.

supernova(anova_model_eff_exercise)

 Analysis of Variance Table (Type III SS)
 Model: Sleep_Efficiency ~ Exercise_Group

                               SS df      MS     F   PRE     p
 ----- --------------- | -------- -- ------- ----- ----- -----
 Model (error reduced) |  540.400  3 180.133 5.925 .1782 .0010
 Error (from model)    | 2492.939 82  30.402                  
 ----- --------------- | -------- -- ------- ----- ----- -----
 Total (empty model)   | 3033.339 85  35.686

Table 4.11: A clearer version of the previous ANOVA table.

F-value: 5.925
df: 85
p-value: 0.001 (<0.05, significant!)
PRE: 0.1782 (~17.8% of variance)

Exercise group accounts for about 18% of total variance in sleep efficiency.

4.8.0.3 Post-Hoc: Tukey HSD

TukeyHSD(anova_model_eff_exercise)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Sleep_Efficiency ~ Exercise_Group, data = participant_sleep_midterm)

$Exercise_Group
                             diff        lwr         upr     p adj
Cardio+Weights-Cardio   1.3871636  -2.977172  5.75149915 0.8383629
None-Cardio            -4.3761905  -8.838613  0.08623232 0.0566544
Weights-Cardio         -3.9904762  -8.452899  0.47194661 0.0962888
None-Cardio+Weights    -5.7633540 -10.127690 -1.39901844 0.0046379
Weights-Cardio+Weights -5.3776398  -9.741975 -1.01330416 0.0094267
Weights-None            0.3857143  -4.076709  4.84813708 0.9958617

Only the ‘Cardio+Weights’ group showed statistically significant improvements in sleep efficiency compared to other groups. Specifically, it performed significantly better than both the ‘Weights’ and ‘None’ groups. There was no significant difference between the ‘Cardio’ and ‘Cardio+Weights’ groups. The ‘Cardio’ group showed potential improvement when compared to the ‘None’ group, but the difference was not found to be statistically significant. The ‘None’ group had the worst overall sleep efficiency.

The ‘Cardio+Weights’ group

4.9 Synthesis & Recommendation

Based on both Sleep_Difference and Sleep_Efficiency, the ‘Cardio+Weights’ exercise is the best choice for improving sleep. It had the highest average improvements and was the only group that showed statistically significant improvements with both variables. For Sleep_Difference, the ANOVA showed a strong effect (F(3, 82) = 15.72, p < .001), with ‘Cardio+Weights’ showing significant improvement compared to the ‘None’ group (p < .001) and ‘Weights’ group (p = .00002). For Sleep_Efficiency, the ANOVA model was also significant (F(3, 82) = 5.93, p = .001), and again, ‘Cardio+Weights’ had much better quality than ‘None’ (p = .0046) and ‘Weights’ (p = .0094). While ‘Cardio’ alone came close to significance, it was not significantly better than any other group. Overall, ‘Cardio+Weights’ consistently led to better sleep outcomes and is the most effective recommendation.

4.10 Reflection

The analyses were definitely the toughest part- coding them was one thing, but understanding and interpreting them required a completely different mindset (I need to re-up on my statistics for sure). I would say that took the longest. I felt really confident about the cleaning, merging, and plotting, and I had fun trying out different ggplot2 themes to see which I liked the best. Though by the end, I felt confident enough in my data analyses and interpretations, it took a lot of resources to bring me to those conclusions, and I would like to work on my interpretation skills for the future so that it does not take so long for me to parse through everything to figure out what it all means.