12 T검정, 로지스틱 회귀분석 플랏

12.1 install package from github

tidyvere and htmlTable, and broom 가 데이터를 변환하고 표시하는데 자주 사용됩니다. devtools 은 “github”에 있는 함수를 불러오는데 사용됩니다.
install_github 을 통해 패키지를 설치하겠습니다.

rm(list=ls())
#basic requirment
if(!require("tidyverse")) install.packages("tidyverse")
if(!require("htmlTable")) install.packages("htmlTable")
if(!require("broom")) install.packages("broom")
if(!require("labelled")) install.packages("labelled")
# packages from github
if(!require("devtools")) install.packages("devtools")
library(devtools)
if(!require("tabf")) install_github("jinhaslab/tabf",  quiet = TRUE)
library(tabf)

12.2 데이터 준비

12.2.1 근로환경조사 기본 변수 생성 (선택)

data manipulation에서 실습한 내용을 통해 변수를 실습을 위한 변수를 생성한 데이터인 kwcsData1.rds 를 이용하겠습니다. 이에 대한 자세한 과정은 이전 챔터에 있습니다.
Table 1 변수 생성과정 챕터

url <- "https://raw.githubusercontent.com/jinhaslab/opendata/main/kwcs/kwcsData1.rds"
download.file(url, "data/kwcsData1.rds")
dat1 = readRDS("data/kwcsData1.rds")

12.3 Overview of data and Table 1

Variables can be divided into dependent and independent variables, and independent variables can be further categorized based on their nature into continuous or nominal variables. Let’s distinguish them as such.

stratas  = c("sleepgp")
catVars = c("wwa1gp", "shortReturn","shiftWork" , "njob", "sexgp",  "edugp", "empgp")
conVars = c("AGE","satisfaction")

tab1 = tabf(dat1=dat1, stratas = stratas, catVars = catVars, conVars = conVars)
tab1

# A tibble: 22 × 5
   variables      values       `0.non distrubance` `1.sleep disturbance` p.value
   <chr>          <chr>        <chr>               <chr>                 <chr>  
 1 "AGE"          ""           46.8±12.4           49.7±11.9             "<0.00…
 2 "wwa1gp"       "Never"      12222 (94.7%)       684 (5.3%)            "<0.00…
 3 ""             "Rarely"     12316 (94.5%)       714 (5.5%)            ""     
 4 ""             "Sometimes"  9112 (90.3%)        981 (9.7%)            ""     
 5 ""             "Often"      3456 (82.8%)        717 (17.2%)           ""     
 6 ""             "Always"     634 (70.0%)         272 (30.0%)           ""     
 7 "satisfaction" ""           2.1±0.5             2.4±0.6               "<0.00…
 8 "shortReturn"  "non short … 36183 (92.5%)       2927 (7.5%)           "<0.00…
 9 ""             "short retu… 1557 (77.9%)        441 (22.1%)           ""     
10 "shiftWork"    "non shift … 35056 (91.9%)       3073 (8.1%)           "<0.00…
# ℹ 12 more rows

tab1 %>% 
  setNames(c("", "", "None", "Disturbance", "P value")) %>%
  htmlTable(
    cgroup = c("",  "Sleep disturbance", ""), 
    n.cgroup = c(2, 2, 1), 
    tfoot = "P value calculated by Chisq-Test and T-Test", 
    rnames = FALSE, 
    caption = "Basic Characteristics according to Sleep disturbance", 
    css.table = "font-family: 'Times New Roman';"
  )

Basic Characteristics according to Sleep disturbance
		Sleep disturbance
		None	Disturbance	P value
AGE		46.8±12.4	49.7±11.9	<0.001
wwa1gp	Never	12222 (94.7%)	684 (5.3%)	<0.001
	Rarely	12316 (94.5%)	714 (5.5%)
	Sometimes	9112 (90.3%)	981 (9.7%)
	Often	3456 (82.8%)	717 (17.2%)
	Always	634 (70.0%)	272 (30.0%)
satisfaction		2.1±0.5	2.4±0.6	<0.001
shortReturn	non short return	36183 (92.5%)	2927 (7.5%)	<0.001
	short return	1557 (77.9%)	441 (22.1%)
shiftWork	non shift work	35056 (91.9%)	3073 (8.1%)	<0.001
	shift work	2684 (90.1%)	295 (9.9%)
njob	one-job	37471 (91.9%)	3317 (8.1%)	<0.001
	njob	269 (84.1%)	51 (15.9%)
sexgp	Men	17892 (93.1%)	1327 (6.9%)	<0.001
	Women	19848 (90.7%)	2041 (9.3%)
edugp	university or more	19597 (92.9%)	1502 (7.1%)	<0.001
	high school	14943 (91.9%)	1318 (8.1%)
	middle school or below	3200 (85.4%)	548 (14.6%)
empgp	paid-worker	25786 (92.4%)	2122 (7.6%)	<0.001
	employer/self-employer	2539 (91.7%)	229 (8.3%)
	own-account worker	8359 (90.5%)	880 (9.5%)
	unpaind family work	1056 (88.5%)	137 (11.5%)
P value calculated by Chisq-Test and T-Test

T-test

The mean level of age are differ according to sleep disturbance. Box plot show mean and distribution of values. There are two method of t.test in R. One is using two vector, and the other is using of relationship.

#stratas = "sleepgp"
age1 = dat1$AGE[dat1$sleepgp =="0.non distrubance"]
age2 = dat1$AGE[dat1$sleepgp =="1.sleep disturbance"] 
t.test(age1, age2)


    Welch Two Sample t-test

data:  age1 and age2
t = -13.409, df = 4040.8, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.306275 -2.462776
sample estimates:
mean of x mean of y 
 46.77878  49.66330

t.test(dat1$AGE ~ dat1$sleepgp)


    Welch Two Sample t-test

data:  dat1$AGE by dat1$sleepgp
t = -13.409, df = 4040.8, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0.non distrubance and group 1.sleep disturbance is not equal to 0
95 percent confidence interval:
 -3.306275 -2.462776
sample estimates:
  mean in group 0.non distrubance mean in group 1.sleep disturbance 
                         46.77878                          49.66330

T test’s p-value can be generated by following pipe code.

ttestPvalue <- dat1 %>%
  select(sleepgp, AGE) %>%
  pivot_longer(-sleepgp) %>%
  nest(dat=-name) %>%
  mutate(fit =map(dat, ~t.test(.$value ~ .$sleepgp)), 
         tidied =map(fit, tidy)) %>%
  unnest(tidied) %>%
  select(name, p.value) %>%
  mutate(pvalue = ifelse(p.value < 0.001, "<0.001", sprintf("%.3f", p.value))) %>%
  pull(pvalue)

# 위에는 여러개의 p value를 구할때 사용하는 것이고 하나만 사용할 때는
myttest <- t.test(dat1$AGE ~ dat1$sexgp)
myttest$p.value #를 사용한다.

[1] 3.287249e-28

dat1 %>%
  select(sleepgp, AGE) %>%
  ggplot(aes(x=sleepgp, y= AGE)) +
  geom_boxplot()

dat1 %>%
  select(sleepgp, AGE) %>%
  ggplot(aes(x=sleepgp, y= AGE)) +
  geom_boxplot() +
  ggtitle("Age according to sleep disturbance")+
  xlab("Sleep Disturbance") + 
  theme(text=element_text(family="Times"))

dat1 %>%
  select(sleepgp, AGE) %>%
  ggplot(aes(x=sleepgp, y= AGE)) +
  geom_boxplot() +
  ggtitle("Age according to sleep disturbance")+
  xlab("Sleep Disturbance") + 
  annotate(
    geom ="text", -Inf, Inf, 
    hjust=-3, vjust=5,
    label =sprintf("P value:  %s", ttestPvalue), 
    family = "Times"
  ) #+

  #theme(text=element_text(family="Times"))

Chisq Test

wwa1gp is “A. kept worrying about work when you were not working”, and the response is 0.Never, 1.Rarely, 2.Sometimes, 3.Often and 4.Always. In the previous table, the distribution of sleep distrubance are differ according to wwa1gp.
The bar chart is one of the best option for distribution visualization.

dat1 %>%
  group_by(wwa1gp) %>%
  count(sleepgp) %>%
  mutate(prob = n/sum(n)*100, 
         prob = round(prob, 1)) %>%
  filter(sleepgp =="1.sleep disturbance") %>%
  ggplot(aes(x=wwa1gp, y = prob)) +
  geom_bar(stat="identity")

Some modification are needed to communicat to other researcher. Percent scale are easy to follow.

dat1 %>%
  group_by(wwa1gp) %>%
  count(sleepgp) %>%
  mutate(prob = n/sum(n)*100, 
         prob = round(prob, 1)) %>%
  filter(sleepgp =="1.sleep disturbance") %>%
  ggplot(aes(x=wwa1gp, y = prob)) +
  #theme(text=element_text(family="Times New Roman", face="bold", size=12)) +
  xlab("kept worrying about work when you were not working") +
  ylab("Sleep Disturbance (proportion)") +
  ylim(0, 0.35)+
  geom_bar(stat="identity") +
  geom_text(aes(label = sprintf("%s%%", prob)),  vjust = -0.5) +
  scale_y_continuous(labels = function(x) sprintf("%s%%", round(x)))

Now, we added the p value of chisq.test().

chisqp = dat1 %>%
  select(wwa1gp, sleepgp) %>%
  table() %>%
  chisq.test()
if (chisqp$p.value <0.001){
  pChisqV = "<0.001"
  } else{
    pChisqV = chisqp$p.value %>% sprintf("%.3f", .)
  }
pChisqV

[1] "<0.001"

#library(extrafont)
library(ggthemes)
dat1 %>%
  group_by(wwa1gp) %>%
  count(sleepgp) %>%
  mutate(prob = n/sum(n)*100, 
         prob = round(prob, 1)) %>%
  filter(sleepgp =="1.sleep disturbance") %>%
  ggplot(aes(x=wwa1gp, y = prob)) +
  xlab("kept worrying about work when you were not working") +
  ylab("Sleep Disturbance (proportion)") +
  geom_bar(stat="identity") +
  geom_text(aes(label = sprintf("%s%%", prob)),  vjust = -0.5) +
  scale_y_continuous(labels = function(x) sprintf("%s%%", round(x))) +
  annotate(
    geom ="text", -Inf, Inf, 
    hjust=-0.5, vjust=8,
    label =sprintf("P value:  %s", pChisqV), 
    #family = "Times New Roman", 
    fontface = "bold"
  ) +
  theme_minimal() #+

  #theme(text=element_text(family="Times New Roman", face="bold", size=12))

12.4 Odds ratio and 95% confidence interval

Logistic regression model

There are 3 kind of models, model II and III are include more confounding variables compare to model I.

mod1 = dat1 %>%
  glm(data=.,family="binomial",formula = sleepgp == "1.sleep disturbance"  
      ~ wwa1gp)
mod2 = dat1 %>%
  glm(data=.,family="binomial",formula = sleepgp == "1.sleep disturbance"
      ~ wwa1gp + AGE + sexgp +satisfaction)
mod3 = dat1 %>%
  glm(data=.,family="binomial",formula = sleepgp == "1.sleep disturbance"
      ~ wwa1gp + AGE + sexgp +satisfaction + shiftWork + njob)

oddf0() function gives us odds ratio, lower limt and upper limit (95% confidence interval). We can draw OR (95% CI) plot, using errobar() function. we can also change the color using RGB triplet (ref: http://www.cookbook-r.com/Graphs/Colors_(ggplot2)).

oddf0(mod1) %>%
  ggplot(aes(y=values, x = or)) +
  theme_classic() +
  geom_errorbarh(aes(xmin = ll, xmax = ul), height = 0.09, color = "#666666", size=0.5) +
  geom_point(shape = 15, size = 2, color = "#444444")  +
  geom_vline(xintercept = 1, color = "#FF0000", linetype = "dashed", cex =0.5, alpha = 0.5) +
  ylab("kept worrying about work when you were not working") +
  xlab("Odds Ratio and 95% Confidence Interval of Sleep Dsturbance") #+

  #theme(text = element_text(family ="Times"))

Vertical error bar also suitable to plot the OR (95% CI).

oddf0(mod1) %>%
  ggplot(aes(x=values, y = or)) +
  theme_classic() +
  geom_errorbar(aes(ymin = ll, ymax = ul), width = 0.07, color = "#666666", size=0.5) +
  geom_point(shape = 15, size = 2.5, color = "#444444") +
  geom_hline(yintercept = 1, color = "#FF0000", linetype = "dashed", cex =0.5, alpha = 0.5) +
  xlab("Kept worrying about work when you were not working") +
  ylab("Odds Ratio and 95% Confidence Interval of Sleep Dsturbance")# +

  #theme(text = element_text(family ="Times"))