Predictors of Heart Disease: Statistical Modeling and Testing

Authors

Julian L. Costa

Connor Horan

Elishel Sason

Published

October 4, 2025

1 Introduction

Heart disease stands as one of the leading causes of mortality worldwide, impacting the lives of millions and placing extreme burden on healthcare systems. Understanding the various risk factors related to heart disease can help create more efficient prevention strategies and improve overall health outcomes. In this project, we seek to identify and study the significant predictors of heart disease in hopes to improve intervention strategies and encourage lifestyle changes that may aid at-risk populations.

2 Research Question & Hypotheses

Our research question for this project is:

“What are the significant risk factors associated with heart disease?”

Our Hypothesis for this project is:

“Higher Age, Smoking, High Blood Pressure, and Stroke are significantly associated with a greater likelihood of heart disease.“

In addition to Higher Age, Smoking, and High Blood Pressure, we plan to examine the effects of other predictors in the model to determine how, if at all, they influence the likelihood of heart disease. These three variables were chosen as the focus of our initial analysis because we believed they would be the most significant.

3 Data Description

This project utilizes the Heart Disease Health Indicators dataset from Kaggle, which contains health-related information on over 253,000 individuals. The data was originally collected from the Behavioral Risk Factor Surveillance System (BRFSS), a health-related telephone survey that is collected annually by the CDC.

Link to Dataset: https://www.kaggle.com/datasets/alexteboul/heart-disease-health-indicators-dataset?resource=download

3.0.1 Variables, according to the BRFSS Codebook:

HeartDiseaseorAttack: Whether the respondent has coronary heart disease or myocardial infarction
- Values: 0 = No, 1 = Yes.
HighBP: Whether the respondent has been told they have high blood pressure
- Values: 0 = No, 1 = Yes.
HighChol: Whether the respondent has been told they have high cholesterol
- Values: 0 = No, 1 = Yes.
CholCheck: Whether the respondent has had their cholesterol checked within the past 5 years
- Values: 0 = No, 1 = Yes.
BMI: Body Mass Index
- Values: Numeric (to one decimal place, e.g., 24.5).
Smoker: Whether the respondent has smoked at least 100 cigarettes in their lifetime
- Values: 0 = No, 1 = Yes.
Stroke: Whether the respondent has ever had a stroke
- Values: 0 = No, 1 = Yes.
Diabetes: Whether the respondent has diabetes or prediabetes
- Values: 0 = No, 1 = Yes.
PhysActivity: Whether the respondent had physical activity or exercise other than their regular job in the past 30 days
- Values: 0 = No, 1 = Yes.
Fruits: Whether the respondent consumes fruit one or more times per day
- Values: 0 = No, 1 = Yes.
Veggies: Whether the respondent consumes vegetables one or more times per day
- Values: 0 = No, 1 = Yes.
HvyAlcoholConsump: Whether the respondent is a heavy drinker (defined as consuming more than 14 drinks per week for men and more than 7 drinks per week for women)
- Values: 0 = No, 1 = Yes.
AnyHealthcare: Whether the respondent has any kind of health care coverage
- Values: 0 = No, 1 = Yes.
NoDocbcCost: Whether the respondent could not see a doctor in the past year because of cost
- Values: 0 = No, 1 = Yes.
GenHlth: General health status
- Values: Ordinal scale (1 = Excellent, 2 = Very Good, 3 = Good, 4 = Fair, 5 = Poor).
MentHlth: Number of days in the past 30 days that the respondent’s mental health was not good
- Values: Integer (0–30).
PhysHlth: Number of days in the past 30 days that the respondent’s physical health was not good
- Values: Integer (0–30).
DiffWalk: Whether the respondent has serious difficulty walking or climbing stairs
- Values: 0 = No, 1 = Yes.
Sex: The respondent’s gender
- Values: 0 = Female, 1 = Male.
Age: The respondent’s age group
- Values: Ordinal scale (1 = 18–24, 2 = 25–29, 3 = 30–34, 4 = 35–39, 5 = 40–44, 6 = 45–49, 7 = 50–54, 8 = 55–59, 9 = 60–64, 10 = 65–69, 11 = 70–74, 12 = 75–79, 13 = 80+)
Education: Education level of the respondent
- Values: Ordinal scale (1 = Did not graduate high school, 2 = Graduated high school, ..., 5 = College graduate).
Income: Income level of the respondent
- Values: Ordinal scale (1 = Less than $10,000, 2 = $10,000 to $15,000, ..., 8 = $75,000 or more).

4 Loading the Dataset

heart_data <- read.csv("heart_data.csv")

5 Data Overview

# Examining the structure of the data
str(heart_data)

'data.frame':   253680 obs. of  22 variables:
 $ HeartDiseaseorAttack: num  0 0 0 0 0 0 0 0 1 0 ...
 $ HighBP              : num  1 0 1 1 1 1 1 1 1 0 ...
 $ HighChol            : num  1 0 1 0 1 1 0 1 1 0 ...
 $ CholCheck           : num  1 0 1 1 1 1 1 1 1 1 ...
 $ BMI                 : num  40 25 28 27 24 25 30 25 30 24 ...
 $ Smoker              : num  1 1 0 0 0 1 1 1 1 0 ...
 $ Stroke              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Diabetes            : num  0 0 0 0 0 0 0 0 2 0 ...
 $ PhysActivity        : num  0 1 0 1 1 1 0 1 0 0 ...
 $ Fruits              : num  0 0 1 1 1 1 0 0 1 0 ...
 $ Veggies             : num  1 0 0 1 1 1 0 1 1 1 ...
 $ HvyAlcoholConsump   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ AnyHealthcare       : num  1 0 1 1 1 1 1 1 1 1 ...
 $ NoDocbcCost         : num  0 1 1 0 0 0 0 0 0 0 ...
 $ GenHlth             : num  5 3 5 2 2 2 3 3 5 2 ...
 $ MentHlth            : num  18 0 30 0 3 0 0 0 30 0 ...
 $ PhysHlth            : num  15 0 30 0 0 2 14 0 30 0 ...
 $ DiffWalk            : num  1 0 1 0 0 0 0 1 1 0 ...
 $ Sex                 : num  0 0 0 0 0 1 0 0 0 1 ...
 $ Age                 : num  9 7 9 11 11 10 9 11 9 8 ...
 $ Education           : num  4 6 4 3 5 6 6 4 5 4 ...
 $ Income              : num  3 1 8 6 4 8 7 4 1 3 ...

As we can see, the dataset contains 253680 observations of 22 variables.

5.0.1 Checking for Missing Values

colSums(is.na(heart_data))

HeartDiseaseorAttack               HighBP             HighChol 
                   0                    0                    0 
           CholCheck                  BMI               Smoker 
                   0                    0                    0 
              Stroke             Diabetes         PhysActivity 
                   0                    0                    0 
              Fruits              Veggies    HvyAlcoholConsump 
                   0                    0                    0 
       AnyHealthcare          NoDocbcCost              GenHlth 
                   0                    0                    0 
            MentHlth             PhysHlth             DiffWalk 
                   0                    0                    0 
                 Sex                  Age            Education 
                   0                    0                    0 
              Income 
                   0

5.0.2 Summary Statistics

# Outlining summary statistics for all of the continuous variables
summary(heart_data[, c("BMI", "PhysHlth", "MentHlth")])

      BMI           PhysHlth         MentHlth     
 Min.   :12.00   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:24.00   1st Qu.: 0.000   1st Qu.: 0.000  
 Median :27.00   Median : 0.000   Median : 0.000  
 Mean   :28.38   Mean   : 4.242   Mean   : 3.185  
 3rd Qu.:31.00   3rd Qu.: 3.000   3rd Qu.: 2.000  
 Max.   :98.00   Max.   :30.000   Max.   :30.000

# Outlining frequency tables and bar plots for binary variables

# Creating a bar plot for Heart Disease Status
hd_counts <- table(heart_data$HeartDiseaseorAttack)
colors_hd <- c("skyblue", "salmon")
barplot(hd_counts, 
        col = colors_hd, 
        names.arg = c("No Heart Disease", "Heart Disease"), 
        main = "Heart Disease Status", 
        xlab = "Heart Disease Status", 
        ylab = "Count")

# Creating a bar plot for High Blood Pressure Status
bp_counts <- table(heart_data$HighBP)
colors_bp <- c("lightgreen", "tomato")
barplot(bp_counts, 
        col = colors_bp, 
        names.arg = c("No High BP", "High BP"), 
        main = "High Blood Pressure Status", 
        xlab = "High Blood Pressure Status", 
        ylab = "Count")

# Creating a bar plot for Smoking Status
smoker_counts <- table(heart_data$Smoker)
colors_smoker <- c("lightblue", "orange")
barplot(smoker_counts, 
        col = colors_smoker, 
        names.arg = c("Non-Smoker", "Smoker"), 
        main = "Smoking Status", 
        xlab = "Smoking Status", 
        ylab = "Count")

# Creating a histogram for Age
hist(heart_data$Age, main = "Age", xlab = "Age Category")

# Creating a histogram for BMI
hist(heart_data$BMI, main = "BMI", xlab = "BMI")

5.1 Variable Relationships

5.1.1 Physical Health by Heart Disease

# Creating a box plot for Physical Health by Heart Disease Status
boxplot(heart_data$PhysHlth ~ heart_data$HeartDiseaseorAttack,
        main = "Physical Health by Heart Disease Status",
        xlab = "Heart Disease Status",
        ylab = "Physical Health Score",
        col = c("skyblue", "salmon"))

5.1.2 High Blood Pressure by Heart Disease

# Outlining a table of high blood pressure counts by heart disease status
high_bp_hd <- table(heart_data$HighBP, heart_data$HeartDiseaseorAttack)

# Plotting the table into a barplot
barplot(high_bp_hd,
        beside = TRUE,
        col = c("skyblue", "salmon"),
        names.arg = c("No Heart Disease", "Heart Disease"),
        legend.text = c("No High Blood Pressure", "High Blood Pressure"),
        args.legend = list(x = "topright", bty = "n"),
        main = "High Blood Pressure by Heart Disease Status",
        xlab = "Heart Disease Status",
        ylab = "Count")

5.1.3 Heart Disease by Age

# Calculating the proportion of individuals with heart disease in each age category
age_hd_table <- table(heart_data$Age, heart_data$HeartDiseaseorAttack)
hd_proportions <- prop.table(age_hd_table, margin = 1)[, "1"]

# Plotting the line chart
plot(names(hd_proportions), hd_proportions, type = "o", pch = 16, col = "salmon",
     main = "Heart Disease Prevalence by Age Category",
     xlab = "Age Category", ylab = "Proportion with Heart Disease",
     ylim = c(0, max(hd_proportions) * 1.1))

# Adding a horizontal reference line at the average proportion
abline(h = mean(hd_proportions), col = "gray", lty = 2)

5.1.4 Physical and Mental Health Scores by Heart Disease Status

ggplot(heart_data, aes(x = factor(HeartDiseaseorAttack), y = PhysHlth, fill = factor(HeartDiseaseorAttack))) +
  geom_boxplot() +
  labs(title = "Physical Health by Heart Disease Status", x = "Heart Disease Status", y = "Physical Health Score") +
  scale_fill_manual(values = c("skyblue", "salmon"), labels = c("No Heart Disease", "Heart Disease"))

ggplot(heart_data, aes(x = factor(HeartDiseaseorAttack), y = MentHlth, fill = factor(HeartDiseaseorAttack))) +
  geom_boxplot() +
  labs(title = "Mental Health by Heart Disease Status", x = "Heart Disease Status", y = "Mental Health Score") +
  scale_fill_manual(values = c("skyblue", "salmon"), labels = c("No Heart Disease", "Heart Disease"))

6 Hypothesis Testing

6.1 Chi-Squared

6.1.1 Proportional Tables

# Outlining the categorical variables
categorical_vars <- c("Smoker", "HighBP", "PhysActivity", "HighChol", "Stroke", "Diabetes",
                      "Fruits", "Veggies", "AnyHealthcare", "NoDocbcCost", "GenHlth", 
                      "DiffWalk", "Sex", "Age", "Education", "Income")

# Setting up proportional tables list
proportional_tables <- list()
for (var in categorical_vars) {
  table_var <- table(heart_data$HeartDiseaseorAttack, heart_data[[var]])
  proportional_tables[[var]] <- prop.table(table_var, margin = 2)
}

# Merging proportional tables
prop_table_summary <- do.call(rbind, lapply(names(proportional_tables), function(var) {
  as.data.frame(as.table(proportional_tables[[var]])) %>% mutate(Variable = var)
}))
colnames(prop_table_summary) <- c("HeartDisease", "Category", "Proportion", "Variable")

# Plotting proportion heatmap using tables
proportion_plot <- ggplot(prop_table_summary, aes(x = Category, y = Variable, fill = Proportion)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0.5, 
                       limit = c(0, 1), name = "Proportion") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Proportional Distribution of Variables by Heart Disease",
       x = "Category", y = "Variable")

# Displaying proportion heatmap
print(proportion_plot)

6.1.2 Correlation

# Outlining the numeric variables for correlation analysis
numeric_vars <- c("BMI", "MentHlth", "PhysHlth")
numeric_vars <- numeric_vars[numeric_vars %in% names(heart_data)]

# Verifying numeric_vars
if (length(numeric_vars) > 0) {
  # Calculating correlations
  correlation_values <- sapply(numeric_vars, function(var) {
    cor(heart_data$HeartDiseaseorAttack, heart_data[[var]], use = "complete.obs")
  })
  
  # Creating a data frame for correlation results
  correlation_results <- data.frame(
    Variable = numeric_vars,
    Correlation = round(correlation_values, 3),
    stringsAsFactors = FALSE
  )
} else {
  correlation_results <- NULL
}

6.1.3 Correlation Results

# Displaying results
if (!is.null(correlation_results)) {
  cat("\n## Correlation of Continuous Variables with Heart Disease\n")
  print(correlation_results, row.names = FALSE)
} else {
  cat("\n## Correlation of Continuous Variables with Heart Disease\n")
  cat("Not Applicable.\n")
}


## Correlation of Continuous Variables with Heart Disease
 Variable Correlation
      BMI       0.053
 MentHlth       0.065
 PhysHlth       0.182

6.1.4 Chi-Squared Test

$H_0 : \text{There is no association between the categorical variable and heart disease status.}$

$H_a : \text{There is an association between the categorical variable and heart disease status.}$

$\alpha = 0.05$

# Creating a function to calculate Cramér's V
cramers_v <- function(chi_sq, n, df) {
  return(sqrt(chi_sq / (n * min(df, n - 1))))
}

# Initializing the results table
chi_sq_results <- data.frame(
  Variable = character(), 
  Chi_Square = numeric(),
  df = numeric(),
  p_value = numeric(),
  Cramers_V = numeric(),
  stringsAsFactors = FALSE
)

# Looping through categorical variables
for (var in categorical_vars) {
  table_var <- table(heart_data$HeartDiseaseorAttack, heart_data[[var]])
  chi_sq_test <- chisq.test(table_var)
  n <- sum(table_var)
  v <- cramers_v(chi_sq_test$statistic, n, chi_sq_test$parameter)
  
  # Adding results to the table
  chi_sq_results <- rbind(chi_sq_results, data.frame(
    Variable = var,
    Chi_Square = round(chi_sq_test$statistic, 2),
    df = chi_sq_test$parameter,
    p_value = round(chi_sq_test$p.value, 4),
    Cramers_V = round(v, 2)
  ))
}

6.1.5 Chi-Squared Results

# Displaying the results
print(chi_sq_results, row.names = FALSE)

      Variable Chi_Square df p_value Cramers_V
        Smoker    3321.61  1       0      0.11
        HighBP   11117.88  1       0      0.21
  PhysActivity    1932.63  1       0      0.09
      HighChol    8288.02  1       0      0.18
        Stroke   10450.58  1       0      0.20
      Diabetes    8244.89  2       0      0.13
        Fruits      99.22  1       0      0.02
       Veggies     388.82  1       0      0.04
 AnyHealthcare      88.74  1       0      0.02
   NoDocbcCost     243.40  1       0      0.03
       GenHlth   19008.16  4       0      0.14
      DiffWalk   11475.80  1       0      0.21
           Sex    1879.79  1       0      0.09
           Age   13731.04 12       0      0.07
     Education    2589.79  5       0      0.05
        Income    5277.60  7       0      0.05

6.1.6 Summary

The chi-squared results show that nearly all categorical predictors are significantly associated with heart disease (p < 0.05), as indicated by their high chi-square values and low p-values. HighBP, HighChol, and Stroke have the largest chi-square values, suggesting they are the strongest contributors among these variables.

Cramér’s V values indicate moderate association strength for HighBP (0.21), HighChol (0.18), and Stroke (0.20), while other variables like Smoker (0.11), Diabetes (0.13), and PhysActivity (0.09) show smaller, yet still notable associations. Predictors such as Fruits, Veggies, AnyHealthcare, and NoDocbcCost have very weak associations (Cramér’s V < 0.05), indicating minimal practical significance despite statistical significance.

6.2 T-Test

$H_0 : \mu_1 = \mu_2 \quad \text{(The means of the two groups are equal)}$

$H_a : \text{There is an association between the categorical variable and heart disease status.}$

$\alpha = 0.05$

# Subsets for individuals with and without heart disease
heart_disease_yes <- heart_data[heart_data$HeartDiseaseorAttack == 1, ]
heart_disease_no <- heart_data[heart_data$HeartDiseaseorAttack == 0, ]

# Function to calculate Cohen's d
cohen_d <- function(group1, group2) {
  n1 <- length(group1)
  n2 <- length(group2)
  mean1 <- mean(group1, na.rm = TRUE)
  mean2 <- mean(group2, na.rm = TRUE)
  sd1 <- sd(group1, na.rm = TRUE)
  sd2 <- sd(group2, na.rm = TRUE)
  sp <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
  d <- (mean1 - mean2) / sp
  return(d)
}

# Outlining the variables to test
t_test_vars <- c("Age", "BMI", "GenHlth", "MentHlth", "PhysHlth", "Education", "Income")

# Generating a results table
t_test_results_table <- data.frame(
  Variable = character(),
  t_Statistic = numeric(),
  p_Value = numeric(),
  Cohen_d = numeric(),
  stringsAsFactors = FALSE
)

# Looping for t-tests and Cohen's d
for (var in t_test_vars) {
  t_test_result <- t.test(heart_disease_yes[[var]], heart_disease_no[[var]], var.equal = FALSE)
  cohen_d_value <- cohen_d(heart_disease_yes[[var]], heart_disease_no[[var]])
  
  # Adding results to the correct table
  t_test_results_table <- rbind(t_test_results_table, data.frame(
    Variable = var,
    t_Statistic = round(t_test_result$statistic, 2),
    p_Value = round(t_test_result$p.value, 4),
    Cohen_d = round(cohen_d_value, 2)
  ))
}

6.2.1 T-Test Results

# Displaying the table
print(t_test_results_table, row.names = FALSE)

  Variable t_Statistic p_Value Cohen_d
       Age      147.62       0    0.78
       BMI       26.18       0    0.18
   GenHlth      128.81       0    0.92
  MentHlth       26.74       0    0.22
  PhysHlth       68.93       0    0.63
 Education      -46.92       0   -0.34
    Income      -67.35       0   -0.49

6.2.2 Summary

According to the T tests with Cohen’s D on the continuous variables, at a significance level of 𝛂 = 0.05:

Age and GenHlth appear to be the strongest predictors of heart disease with the largest differences between groups.
PhysHlth and Income appear to be significant predictors with moderate effect sizes.
BMI, MentHlth, and Education appear to be significant predictors, but with small effect sizes.

6.3 ANOVA

$H_0 : \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k \quad \text{(All group means are equal)}$

$H_a : \text{At least one group mean is different.}$

$\alpha = 0.05$

# Ensuring HeartDiseaseorAttack is a factor so it can be used in ANOVA
heart_data$HeartDiseaseorAttack <- as.factor(heart_data$HeartDiseaseorAttack)

# Performing ANOVA and collecting results
anova_results_hd <- do.call(rbind, lapply(numeric_vars, function(var) {
  anova_model <- aov(heart_data[[var]] ~ heart_data$HeartDiseaseorAttack)
  anova_summary <- summary(anova_model)
  
  # Extracting the key metrics from the ANOVa for simpler overview
  ms_between <- round(anova_summary[[1]]$`Mean Sq`[1], 2)
  ms_residual <- round(anova_summary[[1]]$`Mean Sq`[2], 2)
  f_stat <- round(anova_summary[[1]]$`F value`[1], 2)
  p_val <- round(anova_summary[[1]]$`Pr(>F)`[1], 4) 
  
  # Setting up to return as a data frame row
  data.frame(
    Variable = var,
    MeanSq_Between = ms_between,
    MeanSq_Residual = ms_residual,
    F_Statistic = f_stat,
    P_Value = p_val,
    stringsAsFactors = FALSE
  )
}))

6.3.1 ANOVA Results: Numeric Variables by HeartDisease

# Displaying results in a table
print(anova_results_hd, row.names = FALSE)

 Variable MeanSq_Between MeanSq_Residual F_Statistic P_Value
      BMI       31009.66           43.55      712.00       0
 MentHlth       58211.10           54.72     1063.78       0
 PhysHlth      636519.14           73.49     8660.85       0

Tukey’s HSD

# Tukey's HSD for BMI
anova_bmi <- aov(BMI ~ HeartDiseaseorAttack, data = heart_data)
tukey_bmi <- TukeyHSD(anova_bmi)
print(tukey_bmi)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = BMI ~ HeartDiseaseorAttack, data = heart_data)

$HeartDiseaseorAttack
        diff      lwr      upr p adj
1-0 1.196998 1.109076 1.284921     0

# Tukey's HSD for MentHlth
anova_ment <- aov(MentHlth ~ HeartDiseaseorAttack, data = heart_data)
tukey_ment <- TukeyHSD(anova_ment)
print(tukey_ment)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = MentHlth ~ HeartDiseaseorAttack, data = heart_data)

$HeartDiseaseorAttack
        diff      lwr      upr p adj
1-0 1.640016 1.541462 1.738569     0

# Tukey's HSD for PhysHlth
anova_phys <- aov(PhysHlth ~ HeartDiseaseorAttack, data = heart_data)
tukey_phys <- TukeyHSD(anova_phys)
print(tukey_phys)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = PhysHlth ~ HeartDiseaseorAttack, data = heart_data)

$HeartDiseaseorAttack
       diff      lwr      upr p adj
1-0 5.42314 5.308926 5.537353     0

6.3.2 Boxplots for ANOVA variables

# Generating boxplots for the ANOVA variables
for (var in numeric_vars) {
  plot <- ggplot(heart_data, aes(x = HeartDiseaseorAttack, y = .data[[var]], fill = HeartDiseaseorAttack)) +
    geom_boxplot(show.legend = FALSE) +
    labs(
      title = paste("Boxplot of", var, "by HeartDisease"),
      x = "HeartDisease (0 = No, 1 = Yes)",
      y = var
    ) +
    scale_fill_manual(values = c("blue", "orange"), labels = c("No HeartDisease", "HeartDisease")) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
  # Displaying the plots
  print(plot)
}

6.3.3 Summary

BMI:

High F-statistic (712.00) and p-value (0) indicate significant differences in BMI across the heart attack status groups.
Suggests that BMI is strongly associated with heart attack status.

Mental Health:

Extremely high F-statistic (1063.78) and p-value (0) suggest significant differences in mental health (unhealthy days) across groups.
Indicates that mental health may play a role in heart attack status.

Physical Health:

Exceptionally high F-statistic (8660.85) and p-value (0) show strong differences in physical health (unhealthy days) across heart attack groups.
Highlights physical health as a critical factor.

Tukey Interpretation:

BMI

Difference: 1.197
- Indicates that individuals with a history of heart attack (1) have, on average, a BMI 1.197 units higher than those without a history (0).
Confidence Interval (lwr, upr): [1.109, 1.285]
- The true difference in mean BMI between groups is likely between 1.109 and 1.285.
p adj: 0
- This extremely low p-value confirms the difference is statistically significant.

Mental Health

Difference: 1.640
- Individuals with heart attack history report 1.640 more unhealthy mental health days on average.
Confidence Interval (lwr, upr): [1.541, 1.739]
- The true difference lies within this range.
p adj: 0
- This is a statistically significant difference.

Physical Health

Difference: 5.423
- Individuals with a history of heart attack report 5.423 more unhealthy physical health days on average.
Confidence Interval (lwr, upr): [5.309, 5.537]
- The difference is precise and significant, as indicated by the small interval.
p adj: 0
- The low p-value confirms a significant difference.

7 Model Building

Next, we begin the assembly of our regression model, speficially utilizing logistic regression.

7.1 Assumptions of Logistic Regression

Binary Outcome Variable: The dependent variable is binary.
Independence of Observations: Observations are independent of each other.
Linearity of Logit: The relationship between the predictors and the log-odds of the outcome is linear.
No Multicollinearity: Predictors are not highly correlated with each other.
Large Sample Size: Logistic regression works best with a large sample size to ensure reliable estimates.

We will confirm these assumptions in the Model Diagnostics and Validation section below.

7.2 Base Model

We begin with our base model, including all predictors:

base_model <- glm(HeartDiseaseorAttack ~ ., data = heart_data, family = binomial)
summary(base_model)


Call:
glm(formula = HeartDiseaseorAttack ~ ., family = binomial, data = heart_data)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -7.9124089  0.1028054 -76.965  < 2e-16 ***
HighBP             0.5245081  0.0177520  29.546  < 2e-16 ***
HighChol           0.6112677  0.0164496  37.160  < 2e-16 ***
CholCheck          0.5248111  0.0662510   7.922 2.35e-15 ***
BMI                0.0009744  0.0012122   0.804   0.4215    
Smoker             0.3629524  0.0157323  23.071  < 2e-16 ***
Stroke             0.9783443  0.0244338  40.041  < 2e-16 ***
Diabetes           0.1465123  0.0089720  16.330  < 2e-16 ***
PhysActivity       0.0398975  0.0171991   2.320   0.0204 *  
Fruits             0.0060203  0.0163274   0.369   0.7123    
Veggies            0.0426327  0.0189359   2.251   0.0244 *  
HvyAlcoholConsump -0.2940714  0.0392877  -7.485 7.15e-14 ***
AnyHealthcare     -0.0070009  0.0412820  -0.170   0.8653    
NoDocbcCost        0.2528463  0.0268896   9.403  < 2e-16 ***
GenHlth            0.4907058  0.0095105  51.596  < 2e-16 ***
MentHlth           0.0024628  0.0009778   2.519   0.0118 *  
PhysHlth           0.0010542  0.0008766   1.202   0.2292    
DiffWalk           0.2947780  0.0193855  15.206  < 2e-16 ***
Sex                0.7611811  0.0160326  47.477  < 2e-16 ***
Age                0.2556493  0.0036439  70.158  < 2e-16 ***
Education          0.0110314  0.0081696   1.350   0.1769    
Income            -0.0431551  0.0042476 -10.160  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 158355  on 253679  degrees of freedom
Residual deviance: 120933  on 253658  degrees of freedom
AIC: 120977

Number of Fisher Scoring iterations: 6

stepwise_model <- stepAIC(base_model, direction = "both", trace = FALSE)
summary(stepwise_model)


Call:
glm(formula = HeartDiseaseorAttack ~ HighBP + HighChol + CholCheck + 
    Smoker + Stroke + Diabetes + PhysActivity + Veggies + HvyAlcoholConsump + 
    NoDocbcCost + GenHlth + MentHlth + DiffWalk + Sex + Age + 
    Income, family = binomial, data = heart_data)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -7.849811   0.084630 -92.755  < 2e-16 ***
HighBP             0.525327   0.017568  29.902  < 2e-16 ***
HighChol           0.611535   0.016439  37.201  < 2e-16 ***
CholCheck          0.525554   0.066050   7.957 1.76e-15 ***
Smoker             0.360912   0.015659  23.048  < 2e-16 ***
Stroke             0.978517   0.024405  40.096  < 2e-16 ***
Diabetes           0.147432   0.008805  16.745  < 2e-16 ***
PhysActivity       0.039919   0.017008   2.347  0.01892 *  
Veggies            0.046723   0.018425   2.536  0.01122 *  
HvyAlcoholConsump -0.295596   0.039239  -7.533 4.95e-14 ***
NoDocbcCost        0.254267   0.026437   9.618  < 2e-16 ***
GenHlth            0.494177   0.008690  56.866  < 2e-16 ***
MentHlth           0.002741   0.000954   2.873  0.00406 ** 
DiffWalk           0.302854   0.018630  16.256  < 2e-16 ***
Sex                0.761142   0.015982  47.626  < 2e-16 ***
Age                0.254982   0.003500  72.862  < 2e-16 ***
Income            -0.040995   0.003938 -10.410  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 158355  on 253679  degrees of freedom
Residual deviance: 120937  on 253663  degrees of freedom
AIC: 120971

Number of Fisher Scoring iterations: 6

Note: We have omitted the full stepwise regression output from this report to ensure smoother rendering and reduce processing times.

Looking at the summary of the logistic regression model with all variables, it appears that the predictors BMI, Fruits, AnyHealthcare, PhysHlth, and Education have p-values over 0.05, indicating insignificance to the model.

A stepwise selection process in both directions was used to select the best model by using AIC to identify the most significant factors. This selection method combines forward selection and backward elimination.

The model resulted in the inclusion of the following predictors as most significant:

HighBP
HighChol
CholCheck
Smoker
Stroke
Diabetes
PhysActivity
Veggies
HvyAlcoholConsump
NoDocbcCost
GenHlth
MentHlth
DiffWalk
Sex
Age
Income

The selection process included 6 iterations, removing the predictors BMI, Fruits, AnyHealthcare, PhysHlth, and Education from the full model as they did not significantly contribute to the model’s fit according to the AIC.

7.2.1 Summary

As detailed above, the final model based on our stepwise selection process is as follows:

# Model after stepwise selection
final_model <- glm(HeartDiseaseorAttack ~ HighBP + HighChol + CholCheck + Smoker + 
                     Stroke + Diabetes + PhysActivity + Veggies + HvyAlcoholConsump + 
                     NoDocbcCost + GenHlth + MentHlth + DiffWalk + Sex + Age + 
                     Income, data = heart_data, family = binomial)

8 Model Diagnostics and Validation

Let us now confirm that the assumptions of logistic regression in our model:

8.1 Binary Outcome Variable

The dependent variable is HeartDiseaseorAttack, which has values: 0 = No, 1 = Yes

This variable is binary and confirms this assumption.

8.2 Independence

In our dataset, each observation represents a unique individual, and there are no repeated measures or clustering effects. Therefore, the independence assumption is confirmed.

8.3 Linearity of Logit

The age variable is categorized into groups, making it more complex to analyze in logistic regression. Our attempts to test its linearity caused technical issues with the model, so keep the analysis straightforward, we left the variable as is. Accurate assessment of this assumption may require a more thorough refining of the age variable, be it through transformations or other means.

8.4 No Multicollinearity

# Checking for multicollinearity via the Variance Inflation Factor (VIF)
vif_values <- vif(final_model)
print("Variance Inflation Factor (VIF) values:")

[1] "Variance Inflation Factor (VIF) values:"

print(vif_values)

           HighBP          HighChol         CholCheck            Smoker 
         1.118635          1.060904          1.013787          1.041327 
           Stroke          Diabetes      PhysActivity           Veggies 
         1.033980          1.105517          1.131639          1.044400 
HvyAlcoholConsump       NoDocbcCost           GenHlth          MentHlth 
         1.017747          1.129484          1.526453          1.235811 
         DiffWalk               Sex               Age            Income 
         1.392603          1.110160          1.140386          1.311509

Based on the details outlines above, the Variance Inflation Factor (VIF) values indicate that there are no concerning multicollinearity issues present in our final model. Generally, VIF values below 5 (sometimes 10 for larger datasets) suggest that predictors are not highly correlated.

8.5 Cook’s Distance

# Checking for influential points via Cook's Distance plot
cooks_d <- cooks.distance(stepwise_model)
cook_thresh <- 4 / nrow(heart_data)
plot(cooks_d, type = "h", main = "Cook's Distance for Each Observation", 
     ylab = "Cook's Distance", xlab = "Observations")
abline(h = cook_thresh, col = "red", lty = 2)

# Checking for influential points via leverage plot
leverage <- hatvalues(stepwise_model)
leverage_thresh <- 2 * (length(coef(stepwise_model)) + 1) / nrow(heart_data)
plot(leverage, cooks_d, xlab = "Leverage", ylab = "Cook's Distance",
     main = "Influential Observations", pch = 20, col = "black")
abline(h = cook_thresh, col = "blue", lty = 2)
abline(v = leverage_thresh, col = "red", lty = 2)

Based on both the Cooks Distance and Leverage plots, we can see that no single observation appears to be excessively influential, indicating that the model is robust. Given that no significant outliers have been identified, we will not be removing any observations from the model.

8.6 Large Sample Size

Given that our dataset containts over 250,000 observations, the assumption of large sample size is met.

8.7 Interaction Terms

In an effort to describe the model in the most detail possible, we will also include the following interaction terms:

HighBP × HighChol
HighBP × Smoker
HighChol × Stroke
Smoker × Stroke
Stroke × GenHlth
GenHlth × Age
PhysActivity × Veggies
HvyAlcoholConsump × NoDocbcCost

# Outlining the interaction model
interaction_model <- glm(
  HeartDiseaseorAttack ~ HighBP * HighChol +
    HighBP * Smoker +
    HighChol * Stroke +
    Smoker * Stroke +
    Stroke * GenHlth +
    GenHlth * Age +
    PhysActivity * Veggies +
    HvyAlcoholConsump * NoDocbcCost +
    HighBP + HighChol + Smoker + Stroke + GenHlth + PhysActivity + Veggies +
    HvyAlcoholConsump + NoDocbcCost + GenHlth + MentHlth + DiffWalk +
    Sex + Age + Income + CholCheck + Diabetes,
  data = heart_data,
  family = binomial
)

# Printing a summary of the interaction model
summary(interaction_model)


Call:
glm(formula = HeartDiseaseorAttack ~ HighBP * HighChol + HighBP * 
    Smoker + HighChol * Stroke + Smoker * Stroke + Stroke * GenHlth + 
    GenHlth * Age + PhysActivity * Veggies + HvyAlcoholConsump * 
    NoDocbcCost + HighBP + HighChol + Smoker + Stroke + GenHlth + 
    PhysActivity + Veggies + HvyAlcoholConsump + NoDocbcCost + 
    GenHlth + MentHlth + DiffWalk + Sex + Age + Income + CholCheck + 
    Diabetes, family = binomial, data = heart_data)

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -9.0547531  0.1325791 -68.297  < 2e-16 ***
HighBP                         0.4907252  0.0332683  14.751  < 2e-16 ***
HighChol                       0.5700414  0.0284349  20.047  < 2e-16 ***
Smoker                         0.4259475  0.0290007  14.687  < 2e-16 ***
Stroke                         1.5480487  0.0905713  17.092  < 2e-16 ***
GenHlth                        0.8676025  0.0325858  26.625  < 2e-16 ***
Age                            0.3697799  0.0106097  34.853  < 2e-16 ***
PhysActivity                   0.0712664  0.0320648   2.223 0.026245 *  
Veggies                        0.0710875  0.0282127   2.520 0.011746 *  
HvyAlcoholConsump             -0.3192760  0.0417302  -7.651 1.99e-14 ***
NoDocbcCost                    0.2205944  0.0269747   8.178 2.89e-16 ***
MentHlth                       0.0015246  0.0009568   1.593 0.111079    
DiffWalk                       0.3023440  0.0185343  16.313  < 2e-16 ***
Sex                            0.7559528  0.0159463  47.406  < 2e-16 ***
Income                        -0.0373180  0.0039367  -9.479  < 2e-16 ***
CholCheck                      0.5193762  0.0661194   7.855 3.99e-15 ***
Diabetes                       0.1442157  0.0087666  16.451  < 2e-16 ***
HighBP:HighChol                0.1279302  0.0346383   3.693 0.000221 ***
HighBP:Smoker                 -0.0884719  0.0339417  -2.607 0.009145 ** 
HighChol:Stroke               -0.4416252  0.0507694  -8.699  < 2e-16 ***
Smoker:Stroke                 -0.1004807  0.0485797  -2.068 0.038605 *  
Stroke:GenHlth                -0.0585123  0.0225713  -2.592 0.009533 ** 
GenHlth:Age                   -0.0367094  0.0031181 -11.773  < 2e-16 ***
PhysActivity:Veggies          -0.0447662  0.0368020  -1.216 0.223830    
HvyAlcoholConsump:NoDocbcCost  0.1921910  0.1206683   1.593 0.111223    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 158355  on 253679  degrees of freedom
Residual deviance: 120680  on 253655  degrees of freedom
AIC: 120730

Number of Fisher Scoring iterations: 7

Next we will perform the stepwise selection again, this timem focusing on the new interaction model:

# Outlining the stepwise selection for the new interaction model
stepwise_interaction_model <- stepAIC(interaction_model, direction = "both", trace = FALSE)

# Printing a summary
summary(stepwise_interaction_model)


Call:
glm(formula = HeartDiseaseorAttack ~ HighBP + HighChol + Smoker + 
    Stroke + GenHlth + Age + PhysActivity + Veggies + HvyAlcoholConsump + 
    NoDocbcCost + MentHlth + DiffWalk + Sex + Income + CholCheck + 
    Diabetes + HighBP:HighChol + HighBP:Smoker + HighChol:Stroke + 
    Smoker:Stroke + Stroke:GenHlth + GenHlth:Age + HvyAlcoholConsump:NoDocbcCost, 
    family = binomial, data = heart_data)

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -9.0374608  0.1318007 -68.569  < 2e-16 ***
HighBP                         0.4908139  0.0332682  14.753  < 2e-16 ***
HighChol                       0.5700173  0.0284345  20.047  < 2e-16 ***
Smoker                         0.4261185  0.0290001  14.694  < 2e-16 ***
Stroke                         1.5486694  0.0905683  17.099  < 2e-16 ***
GenHlth                        0.8678146  0.0325848  26.632  < 2e-16 ***
Age                            0.3697842  0.0106094  34.854  < 2e-16 ***
PhysActivity                   0.0381591  0.0169391   2.253 0.024276 *  
Veggies                        0.0450789  0.0183618   2.455 0.014087 *  
HvyAlcoholConsump             -0.3195208  0.0417301  -7.657 1.91e-14 ***
NoDocbcCost                    0.2203268  0.0269749   8.168 3.14e-16 ***
MentHlth                       0.0015095  0.0009568   1.578 0.114632    
DiffWalk                       0.3025022  0.0185347  16.321  < 2e-16 ***
Sex                            0.7562402  0.0159439  47.431  < 2e-16 ***
Income                        -0.0373760  0.0039367  -9.494  < 2e-16 ***
CholCheck                      0.5192716  0.0661177   7.854 4.04e-15 ***
Diabetes                       0.1442449  0.0087667  16.454  < 2e-16 ***
HighBP:HighChol                0.1280530  0.0346379   3.697 0.000218 ***
HighBP:Smoker                 -0.0886864  0.0339409  -2.613 0.008976 ** 
HighChol:Stroke               -0.4416122  0.0507695  -8.698  < 2e-16 ***
Smoker:Stroke                 -0.1000196  0.0485779  -2.059 0.039499 *  
Stroke:GenHlth                -0.0587523  0.0225704  -2.603 0.009239 ** 
GenHlth:Age                   -0.0367077  0.0031181 -11.773  < 2e-16 ***
HvyAlcoholConsump:NoDocbcCost  0.1913418  0.1206844   1.585 0.112859    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 158355  on 253679  degrees of freedom
Residual deviance: 120681  on 253656  degrees of freedom
AIC: 120729

Number of Fisher Scoring iterations: 7

Note: We have omitted the full stepwise regression output from this report to ensure smoother rendering and reduce processing times.

As shown through the stepwise selection process above, we have arrived at a logistic regression model that aims to predicts the likelihood of heart disease. This model is based on the significant predictors from our first final model, as well as interactions between key variables. Below is a summary of the model’s components:

# Outlining the final model
final_interaction_model <- glm(formula = HeartDiseaseorAttack ~ HighBP + HighChol + Smoker + Stroke + GenHlth + Age + PhysActivity + Veggies + HvyAlcoholConsump + 
    NoDocbcCost + MentHlth + DiffWalk + Sex + Income + CholCheck + 
    Diabetes + HighBP:HighChol + HighBP:Smoker + HighChol:Stroke + 
    Smoker:Stroke + Stroke:GenHlth + GenHlth:Age + HvyAlcoholConsump:NoDocbcCost, 
    family = binomial, data = heart_data)

# Printing a summary of the final model
summary(final_interaction_model)


Call:
glm(formula = HeartDiseaseorAttack ~ HighBP + HighChol + Smoker + 
    Stroke + GenHlth + Age + PhysActivity + Veggies + HvyAlcoholConsump + 
    NoDocbcCost + MentHlth + DiffWalk + Sex + Income + CholCheck + 
    Diabetes + HighBP:HighChol + HighBP:Smoker + HighChol:Stroke + 
    Smoker:Stroke + Stroke:GenHlth + GenHlth:Age + HvyAlcoholConsump:NoDocbcCost, 
    family = binomial, data = heart_data)

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -9.0374608  0.1318007 -68.569  < 2e-16 ***
HighBP                         0.4908139  0.0332682  14.753  < 2e-16 ***
HighChol                       0.5700173  0.0284345  20.047  < 2e-16 ***
Smoker                         0.4261185  0.0290001  14.694  < 2e-16 ***
Stroke                         1.5486694  0.0905683  17.099  < 2e-16 ***
GenHlth                        0.8678146  0.0325848  26.632  < 2e-16 ***
Age                            0.3697842  0.0106094  34.854  < 2e-16 ***
PhysActivity                   0.0381591  0.0169391   2.253 0.024276 *  
Veggies                        0.0450789  0.0183618   2.455 0.014087 *  
HvyAlcoholConsump             -0.3195208  0.0417301  -7.657 1.91e-14 ***
NoDocbcCost                    0.2203268  0.0269749   8.168 3.14e-16 ***
MentHlth                       0.0015095  0.0009568   1.578 0.114632    
DiffWalk                       0.3025022  0.0185347  16.321  < 2e-16 ***
Sex                            0.7562402  0.0159439  47.431  < 2e-16 ***
Income                        -0.0373760  0.0039367  -9.494  < 2e-16 ***
CholCheck                      0.5192716  0.0661177   7.854 4.04e-15 ***
Diabetes                       0.1442449  0.0087667  16.454  < 2e-16 ***
HighBP:HighChol                0.1280530  0.0346379   3.697 0.000218 ***
HighBP:Smoker                 -0.0886864  0.0339409  -2.613 0.008976 ** 
HighChol:Stroke               -0.4416122  0.0507695  -8.698  < 2e-16 ***
Smoker:Stroke                 -0.1000196  0.0485779  -2.059 0.039499 *  
Stroke:GenHlth                -0.0587523  0.0225704  -2.603 0.009239 ** 
GenHlth:Age                   -0.0367077  0.0031181 -11.773  < 2e-16 ***
HvyAlcoholConsump:NoDocbcCost  0.1913418  0.1206844   1.585 0.112859    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 158355  on 253679  degrees of freedom
Residual deviance: 120681  on 253656  degrees of freedom
AIC: 120729

Number of Fisher Scoring iterations: 7

8.7.1 Main Predictors

Detailed below are the main variables that remain significant without being part of an interaction:

HighBP (p<0.001): Individuals with high blood pressure have higher odds of heart disease.
HighChol (p<0.001): High cholesterol is strongly associated with an increased likelihood of heart disease.
Smoker (p<0.001): Smoking significantly raises the odds of heart disease.
Stroke (p<0.001): A history of stroke is highly predictive of heart disease.
GenHlth (p<0.001): Poorer general health is associated with a higher likelihood of heart disease.
Age (p<0.001): Age is positively correlated with the odds of heart disease.
PhysActivity (p=0.024): Regular physical activity reduces the likelihood of heart disease, albeit with a modest effect.
Veggies (p=0.014): Consuming vegetables is associated with a lower risk of heart disease.
HvyAlcoholConsump (p<0.001): Heavy alcohol consumption is negatively associated with heart disease likelihood.
NoDocbcCost (p<0.001): Inability to afford medical care increases the odds of heart disease.
DiffWalk (p<0.001): Difficulty walking is strongly associated with heart disease.
Sex (p<0.001): Males have higher odds of heart disease compared to females.
Income (p<0.001): Higher income levels are negatively associated with heart disease likelihood.
CholCheck (p<0.001): Recent cholesterol checks are positively associated with identifying heart disease.
Diabetes (p<0.001): Individuals with diabetes have higher odds of heart disease.

8.7.2 Interaction Terms

Detailed below are the significant interaction effects:

HighBP (p=0.0002): The combined presence of high blood pressure and high cholesterol has a multiplicative effect, increasing the odds of heart disease.
HighBP (p=0.009): Smoking interacts with high blood pressure to slightly reduce the risk compared to their independent effects.
HighChol (p<0.001): High cholesterol’s effect on heart disease risk is mitigated for those with a history of stroke.
Smoker (p=0.039): Smoking combined with a stroke history slightly reduces the risk compared to their independent effects.
Stroke (p=0.009): Poor general health modifies the effect of having a stroke on heart disease risk.
GenHlth (p<0.001): Age modifies the effect of general health on heart disease risk.

8.7.3 Interpretation of Results

Main Effects:
- Variables like high blood pressure, cholesterol, smoking, and stroke remain strong independent predictors of heart disease.
- Lifestyle factors (e.g., physical activity, vegetable consumption) and socioeconomic factors (e.g., income, access to healthcare) also play key roles.
Interactions:
- The interaction terms show how combinations of predictors influence heart disease differently compared to their independent effects. For example, the interaction between age and general health highlights how age can amplify the impact of poor health.
Variables Dropped in Interaction:
- MentHlth (p=0.115): Number of bad mental health days is no longer significant in the interaction model.
- HvyAlcoholConsump (p=0.113): This interaction term did not show a significant effect on heart disease risk.

8.7.4 Model Performance

AIC: The final model’s AIC (120729) indicates a good fit relative to earlier models.
Residual Deviance: The residual deviance (120681) suggests the model explains most of the variability in the data.

9 Predictive Model Outputs and Interpretation

Now, let us outline key outputs from the final logistic regression model, including odds ratios, ROC and calibration curves. This will help us to evaluate the model’s predictive performance and provide insights into the factors influencing heart disease, helping us ensure the model’s results are interpretable and aligned with our research objectives.

9.1 Odds Ratio

# odds ratios and confidence intervals
model_summary <- tidy(final_interaction_model, exponentiate = TRUE, conf.int = TRUE)

colnames(model_summary) <- c("Variable", "OddsRatio", "StdError", "ConfLow", "ConfHigh", "PValue")

# odds ratio plot
ggplot(model_summary, aes(x = reorder(Variable, OddsRatio), y = OddsRatio)) +
  geom_point(size = 3, color = "blue") +
  geom_errorbar(aes(ymin = ConfLow, ymax = ConfHigh), width = 0.2) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  coord_flip() +
  labs(
    title = "Odds Ratio Plot",
    x = "Predictor Variables",
    y = "Odds Ratio (with 95% CI)"
  ) +
  theme_minimal()

# Outlining interpreations of the odds ratio in a table for compactness
predictors_table <- data.frame(
  Predictor = c(
    "HighBP", "HighChol", "Smoker", 
    "Stroke", "GenHlth", "Age", "DiffWalk", 
    "Income", "CholCheck", "PhysActivity", 
    "Veggies", "HvyAlcoholConsump", 
    "Diabetes", "Sex", "NoDocbcCost",
    "HighBP × HighChol", "HighBP × Smoker", "HighChol × Stroke", 
    "Smoker × Stroke", "Stroke × GenHlth", "GenHlth × Age"
  ),
  Coefficient = c(
    0.491, 0.570, 0.426, 1.549, 0.868, 0.370, 0.303, 
    -0.037, 0.519, 0.038, 0.045, -0.320, 0.144, 0.756, 0.220,
    0.128, -0.089, -0.442, -0.100, -0.059, -0.037
  ),
  OddsRatio = c(
    1.63, 1.77, 1.53, 4.71, 2.38, 1.45, 1.35, 
    0.96, 1.68, 1.04, 1.05, 0.73, 1.15, 2.13, 1.25,
    1.14, 0.91, 0.64, 0.90, 0.94, 0.96
  ),
  Interpretation = c(
    "High blood pressure increases the odds of heart disease by 63%",
    "High cholesterol raises the odds of heart disease by 77%",
    "Smokers are 53% more likely to have heart disease",
    "A history of stroke increases the likelihood of heart disease by 371%",
    "Poorer general health increases heart disease risk by 138%",
    "Moving to the next age group increases heart disease odds by 45%",
    "Individuals with mobility issues are 35% more likely to have heart disease",
    "Higher income levels are slightly protective against heart disease",
    "Recent cholesterol screening is associated with a 68% higher likelihood of identifying heart disease",
    "Slight increase in risk; likely due to other confounding health behaviors",
    "Associated with a minor protective effect",
    "Heavy alcohol consumption reduces odds, but this result may reflect underlying health or reporting biases",
    "Diabetes increases the odds of heart disease by 15%",
    "Males have over twice the odds of heart disease compared to females",
    "Inability to afford medical care increases heart disease odds by 25%",
    "Combined presence of high blood pressure and high cholesterol increases odds by 14% over their independent effects",
    "Smoking slightly offsets the risk of high blood pressure on heart disease, potentially due to competing risks",
    "The effect of high cholesterol is lessened for those with a history of stroke",
    "The combined effect of smoking and stroke history is slightly less than their individual effects",
    "Poor general health slightly reduces the stroke effect on heart disease risk",
    "As individuals age, the impact of general health on heart disease slightly diminishes"
  )
)

# Generating the the table
library(knitr)
kable(
  predictors_table, 
  col.names = c("Predictor", "Coefficient", "Odds Ratio", "Interpretation"),
  caption = "Summary of Predictors, Interaction Terms, and Odds Ratios"
)

Summary of Predictors, Interaction Terms, and Odds Ratios
Predictor	Coefficient	Odds Ratio	Interpretation
HighBP	0.491	1.63	High blood pressure increases the odds of heart disease by 63%
HighChol	0.570	1.77	High cholesterol raises the odds of heart disease by 77%
Smoker	0.426	1.53	Smokers are 53% more likely to have heart disease
Stroke	1.549	4.71	A history of stroke increases the likelihood of heart disease by 371%
GenHlth	0.868	2.38	Poorer general health increases heart disease risk by 138%
Age	0.370	1.45	Moving to the next age group increases heart disease odds by 45%
DiffWalk	0.303	1.35	Individuals with mobility issues are 35% more likely to have heart disease
Income	-0.037	0.96	Higher income levels are slightly protective against heart disease
CholCheck	0.519	1.68	Recent cholesterol screening is associated with a 68% higher likelihood of identifying heart disease
PhysActivity	0.038	1.04	Slight increase in risk; likely due to other confounding health behaviors
Veggies	0.045	1.05	Associated with a minor protective effect
HvyAlcoholConsump	-0.320	0.73	Heavy alcohol consumption reduces odds, but this result may reflect underlying health or reporting biases
Diabetes	0.144	1.15	Diabetes increases the odds of heart disease by 15%
Sex	0.756	2.13	Males have over twice the odds of heart disease compared to females
NoDocbcCost	0.220	1.25	Inability to afford medical care increases heart disease odds by 25%
HighBP × HighChol	0.128	1.14	Combined presence of high blood pressure and high cholesterol increases odds by 14% over their independent effects
HighBP × Smoker	-0.089	0.91	Smoking slightly offsets the risk of high blood pressure on heart disease, potentially due to competing risks
HighChol × Stroke	-0.442	0.64	The effect of high cholesterol is lessened for those with a history of stroke
Smoker × Stroke	-0.100	0.90	The combined effect of smoking and stroke history is slightly less than their individual effects
Stroke × GenHlth	-0.059	0.94	Poor general health slightly reduces the stroke effect on heart disease risk
GenHlth × Age	-0.037	0.96	As individuals age, the impact of general health on heart disease slightly diminishes

9.2 ROC Curve

# ROC curve code:

# Predicted probabilities from final model
predicted_probs <- predict(final_interaction_model, heart_data, type = "response")

# Actual outcomes
actual_outcomes <- as.numeric(heart_data$HeartDiseaseorAttack) # EDIT: Converted to numeric to ensure compatibility with `roc()` function.

# ROC curve
roc_curve <- roc(actual_outcomes, predicted_probs)

# ROC curve plot
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "gray") # Diagonal reference line

# AUC
auc_value <- auc(roc_curve)
cat("Area Under the Curve (AUC):", auc_value, "\n")

Area Under the Curve (AUC): 0.8475062

An AUC of 0.8475 suggests that the model has good discriminatory power, meaning it can effectively separate individuals with and without the outcome (heart disease).

9.3 Calibration Curve

# Calibration curve code:

# Grouping predicted probabilities
calibration_data <- data.frame(
  predicted_probs = predicted_probs,
  actual_outcomes = actual_outcomes
)
calibration_data$bin <- cut(
  calibration_data$predicted_probs,
  breaks = seq(0, 1, by = 0.1),
  include.lowest = TRUE
)

# Mean predicted probabilities and observed outcomes in each bin
calibration_summary <- calibration_data %>%
  group_by(bin) %>%
  summarise(
    mean_predicted = mean(predicted_probs, na.rm = TRUE), # EDIT: Added `na.rm = TRUE` to handle potential NAs in predicted probabilities.
    mean_observed = mean(actual_outcomes, na.rm = TRUE)  # EDIT: Added `na.rm = TRUE` to handle potential NAs in observed outcomes.
  )

# Calibration curve plot
plot(
  calibration_summary$mean_predicted,
  calibration_summary$mean_observed,
  xlab = "Mean Predicted Probability",
  ylab = "Mean Observed Probability",
  main = "Calibration Curve",
  col = "blue",
  pch = 16
)
abline(0, 1, lty = 2, col = "gray")

The points are close to the dotted line meaning the predicted probabilities align well with observed outcomes.

10 Summary: Research Question and Hypotheses

The analysis sought to identify significant predictors of heart disease. Our hypothesis was supported: age, smoking, high blood pressure, and stroke are strongly associated with increased odds of heart disease. Factors like cholesterol, general health, and socioeconomic indicators like income also play critical roles.