heart_data <- read.csv("heart_data.csv")Predictors of Heart Disease: Statistical Modeling and Testing
1 Introduction
Heart disease stands as one of the leading causes of mortality worldwide, impacting the lives of millions and placing extreme burden on healthcare systems. Understanding the various risk factors related to heart disease can help create more efficient prevention strategies and improve overall health outcomes. In this project, we seek to identify and study the significant predictors of heart disease in hopes to improve intervention strategies and encourage lifestyle changes that may aid at-risk populations.
2 Research Question & Hypotheses
Our research question for this project is:
“What are the significant risk factors associated with heart disease?”
Our Hypothesis for this project is:
“Higher Age, Smoking, High Blood Pressure, and Stroke are significantly associated with a greater likelihood of heart disease.“
In addition to Higher Age, Smoking, and High Blood Pressure, we plan to examine the effects of other predictors in the model to determine how, if at all, they influence the likelihood of heart disease. These three variables were chosen as the focus of our initial analysis because we believed they would be the most significant.
3 Data Description
This project utilizes the Heart Disease Health Indicators dataset from Kaggle, which contains health-related information on over 253,000 individuals. The data was originally collected from the Behavioral Risk Factor Surveillance System (BRFSS), a health-related telephone survey that is collected annually by the CDC.
Link to Dataset: https://www.kaggle.com/datasets/alexteboul/heart-disease-health-indicators-dataset?resource=download
3.0.1 Variables, according to the BRFSS Codebook:
HeartDiseaseorAttack: Whether the respondent has coronary heart disease or myocardial infarction
- Values: 0 = No, 1 = Yes.
HighBP: Whether the respondent has been told they have high blood pressure
- Values: 0 = No, 1 = Yes.
HighChol: Whether the respondent has been told they have high cholesterol
- Values: 0 = No, 1 = Yes.
CholCheck: Whether the respondent has had their cholesterol checked within the past 5 years
- Values: 0 = No, 1 = Yes.
BMI: Body Mass Index
- Values: Numeric (to one decimal place, e.g., 24.5).
Smoker: Whether the respondent has smoked at least 100 cigarettes in their lifetime
- Values: 0 = No, 1 = Yes.
Stroke: Whether the respondent has ever had a stroke
- Values: 0 = No, 1 = Yes.
Diabetes: Whether the respondent has diabetes or prediabetes
- Values: 0 = No, 1 = Yes.
PhysActivity: Whether the respondent had physical activity or exercise other than their regular job in the past 30 days
- Values: 0 = No, 1 = Yes.
Fruits: Whether the respondent consumes fruit one or more times per day
- Values: 0 = No, 1 = Yes.
Veggies: Whether the respondent consumes vegetables one or more times per day
- Values: 0 = No, 1 = Yes.
HvyAlcoholConsump: Whether the respondent is a heavy drinker (defined as consuming more than 14 drinks per week for men and more than 7 drinks per week for women)
- Values: 0 = No, 1 = Yes.
AnyHealthcare: Whether the respondent has any kind of health care coverage
- Values: 0 = No, 1 = Yes.
NoDocbcCost: Whether the respondent could not see a doctor in the past year because of cost
- Values: 0 = No, 1 = Yes.
GenHlth: General health status
- Values: Ordinal scale (1 = Excellent, 2 = Very Good, 3 = Good, 4 = Fair, 5 = Poor).
MentHlth: Number of days in the past 30 days that the respondent’s mental health was not good
- Values: Integer (0–30).
PhysHlth: Number of days in the past 30 days that the respondent’s physical health was not good
- Values: Integer (0–30).
DiffWalk: Whether the respondent has serious difficulty walking or climbing stairs
- Values: 0 = No, 1 = Yes.
Sex: The respondent’s gender
- Values: 0 = Female, 1 = Male.
Age: The respondent’s age group
- Values: Ordinal scale (1 = 18–24, 2 = 25–29, 3 = 30–34, 4 = 35–39, 5 = 40–44, 6 = 45–49, 7 = 50–54, 8 = 55–59, 9 = 60–64, 10 = 65–69, 11 = 70–74, 12 = 75–79, 13 = 80+)
Education: Education level of the respondent
- Values: Ordinal scale (1 = Did not graduate high school, 2 = Graduated high school, ..., 5 = College graduate).
Income: Income level of the respondent
- Values: Ordinal scale (1 = Less than $10,000, 2 = $10,000 to $15,000, ..., 8 = $75,000 or more).
4 Loading the Dataset
5 Data Overview
# Examining the structure of the data
str(heart_data)'data.frame': 253680 obs. of 22 variables:
$ HeartDiseaseorAttack: num 0 0 0 0 0 0 0 0 1 0 ...
$ HighBP : num 1 0 1 1 1 1 1 1 1 0 ...
$ HighChol : num 1 0 1 0 1 1 0 1 1 0 ...
$ CholCheck : num 1 0 1 1 1 1 1 1 1 1 ...
$ BMI : num 40 25 28 27 24 25 30 25 30 24 ...
$ Smoker : num 1 1 0 0 0 1 1 1 1 0 ...
$ Stroke : num 0 0 0 0 0 0 0 0 0 0 ...
$ Diabetes : num 0 0 0 0 0 0 0 0 2 0 ...
$ PhysActivity : num 0 1 0 1 1 1 0 1 0 0 ...
$ Fruits : num 0 0 1 1 1 1 0 0 1 0 ...
$ Veggies : num 1 0 0 1 1 1 0 1 1 1 ...
$ HvyAlcoholConsump : num 0 0 0 0 0 0 0 0 0 0 ...
$ AnyHealthcare : num 1 0 1 1 1 1 1 1 1 1 ...
$ NoDocbcCost : num 0 1 1 0 0 0 0 0 0 0 ...
$ GenHlth : num 5 3 5 2 2 2 3 3 5 2 ...
$ MentHlth : num 18 0 30 0 3 0 0 0 30 0 ...
$ PhysHlth : num 15 0 30 0 0 2 14 0 30 0 ...
$ DiffWalk : num 1 0 1 0 0 0 0 1 1 0 ...
$ Sex : num 0 0 0 0 0 1 0 0 0 1 ...
$ Age : num 9 7 9 11 11 10 9 11 9 8 ...
$ Education : num 4 6 4 3 5 6 6 4 5 4 ...
$ Income : num 3 1 8 6 4 8 7 4 1 3 ...
As we can see, the dataset contains 253680 observations of 22 variables.
5.0.1 Checking for Missing Values
colSums(is.na(heart_data))HeartDiseaseorAttack HighBP HighChol
0 0 0
CholCheck BMI Smoker
0 0 0
Stroke Diabetes PhysActivity
0 0 0
Fruits Veggies HvyAlcoholConsump
0 0 0
AnyHealthcare NoDocbcCost GenHlth
0 0 0
MentHlth PhysHlth DiffWalk
0 0 0
Sex Age Education
0 0 0
Income
0
5.0.2 Summary Statistics
# Outlining summary statistics for all of the continuous variables
summary(heart_data[, c("BMI", "PhysHlth", "MentHlth")]) BMI PhysHlth MentHlth
Min. :12.00 Min. : 0.000 Min. : 0.000
1st Qu.:24.00 1st Qu.: 0.000 1st Qu.: 0.000
Median :27.00 Median : 0.000 Median : 0.000
Mean :28.38 Mean : 4.242 Mean : 3.185
3rd Qu.:31.00 3rd Qu.: 3.000 3rd Qu.: 2.000
Max. :98.00 Max. :30.000 Max. :30.000
# Outlining frequency tables and bar plots for binary variables
# Creating a bar plot for Heart Disease Status
hd_counts <- table(heart_data$HeartDiseaseorAttack)
colors_hd <- c("skyblue", "salmon")
barplot(hd_counts,
col = colors_hd,
names.arg = c("No Heart Disease", "Heart Disease"),
main = "Heart Disease Status",
xlab = "Heart Disease Status",
ylab = "Count")# Creating a bar plot for High Blood Pressure Status
bp_counts <- table(heart_data$HighBP)
colors_bp <- c("lightgreen", "tomato")
barplot(bp_counts,
col = colors_bp,
names.arg = c("No High BP", "High BP"),
main = "High Blood Pressure Status",
xlab = "High Blood Pressure Status",
ylab = "Count")# Creating a bar plot for Smoking Status
smoker_counts <- table(heart_data$Smoker)
colors_smoker <- c("lightblue", "orange")
barplot(smoker_counts,
col = colors_smoker,
names.arg = c("Non-Smoker", "Smoker"),
main = "Smoking Status",
xlab = "Smoking Status",
ylab = "Count")# Creating a histogram for Age
hist(heart_data$Age, main = "Age", xlab = "Age Category")# Creating a histogram for BMI
hist(heart_data$BMI, main = "BMI", xlab = "BMI")5.1 Variable Relationships
5.1.1 Physical Health by Heart Disease
# Creating a box plot for Physical Health by Heart Disease Status
boxplot(heart_data$PhysHlth ~ heart_data$HeartDiseaseorAttack,
main = "Physical Health by Heart Disease Status",
xlab = "Heart Disease Status",
ylab = "Physical Health Score",
col = c("skyblue", "salmon"))5.1.2 High Blood Pressure by Heart Disease
# Outlining a table of high blood pressure counts by heart disease status
high_bp_hd <- table(heart_data$HighBP, heart_data$HeartDiseaseorAttack)
# Plotting the table into a barplot
barplot(high_bp_hd,
beside = TRUE,
col = c("skyblue", "salmon"),
names.arg = c("No Heart Disease", "Heart Disease"),
legend.text = c("No High Blood Pressure", "High Blood Pressure"),
args.legend = list(x = "topright", bty = "n"),
main = "High Blood Pressure by Heart Disease Status",
xlab = "Heart Disease Status",
ylab = "Count")5.1.3 Heart Disease by Age
# Calculating the proportion of individuals with heart disease in each age category
age_hd_table <- table(heart_data$Age, heart_data$HeartDiseaseorAttack)
hd_proportions <- prop.table(age_hd_table, margin = 1)[, "1"]
# Plotting the line chart
plot(names(hd_proportions), hd_proportions, type = "o", pch = 16, col = "salmon",
main = "Heart Disease Prevalence by Age Category",
xlab = "Age Category", ylab = "Proportion with Heart Disease",
ylim = c(0, max(hd_proportions) * 1.1))
# Adding a horizontal reference line at the average proportion
abline(h = mean(hd_proportions), col = "gray", lty = 2)5.1.4 Physical and Mental Health Scores by Heart Disease Status
ggplot(heart_data, aes(x = factor(HeartDiseaseorAttack), y = PhysHlth, fill = factor(HeartDiseaseorAttack))) +
geom_boxplot() +
labs(title = "Physical Health by Heart Disease Status", x = "Heart Disease Status", y = "Physical Health Score") +
scale_fill_manual(values = c("skyblue", "salmon"), labels = c("No Heart Disease", "Heart Disease"))ggplot(heart_data, aes(x = factor(HeartDiseaseorAttack), y = MentHlth, fill = factor(HeartDiseaseorAttack))) +
geom_boxplot() +
labs(title = "Mental Health by Heart Disease Status", x = "Heart Disease Status", y = "Mental Health Score") +
scale_fill_manual(values = c("skyblue", "salmon"), labels = c("No Heart Disease", "Heart Disease"))6 Hypothesis Testing
6.1 Chi-Squared
6.1.1 Proportional Tables
# Outlining the categorical variables
categorical_vars <- c("Smoker", "HighBP", "PhysActivity", "HighChol", "Stroke", "Diabetes",
"Fruits", "Veggies", "AnyHealthcare", "NoDocbcCost", "GenHlth",
"DiffWalk", "Sex", "Age", "Education", "Income")
# Setting up proportional tables list
proportional_tables <- list()
for (var in categorical_vars) {
table_var <- table(heart_data$HeartDiseaseorAttack, heart_data[[var]])
proportional_tables[[var]] <- prop.table(table_var, margin = 2)
}
# Merging proportional tables
prop_table_summary <- do.call(rbind, lapply(names(proportional_tables), function(var) {
as.data.frame(as.table(proportional_tables[[var]])) %>% mutate(Variable = var)
}))
colnames(prop_table_summary) <- c("HeartDisease", "Category", "Proportion", "Variable")
# Plotting proportion heatmap using tables
proportion_plot <- ggplot(prop_table_summary, aes(x = Category, y = Variable, fill = Proportion)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0.5,
limit = c(0, 1), name = "Proportion") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Proportional Distribution of Variables by Heart Disease",
x = "Category", y = "Variable")
# Displaying proportion heatmap
print(proportion_plot)6.1.2 Correlation
# Outlining the numeric variables for correlation analysis
numeric_vars <- c("BMI", "MentHlth", "PhysHlth")
numeric_vars <- numeric_vars[numeric_vars %in% names(heart_data)]
# Verifying numeric_vars
if (length(numeric_vars) > 0) {
# Calculating correlations
correlation_values <- sapply(numeric_vars, function(var) {
cor(heart_data$HeartDiseaseorAttack, heart_data[[var]], use = "complete.obs")
})
# Creating a data frame for correlation results
correlation_results <- data.frame(
Variable = numeric_vars,
Correlation = round(correlation_values, 3),
stringsAsFactors = FALSE
)
} else {
correlation_results <- NULL
}6.1.3 Correlation Results
# Displaying results
if (!is.null(correlation_results)) {
cat("\n## Correlation of Continuous Variables with Heart Disease\n")
print(correlation_results, row.names = FALSE)
} else {
cat("\n## Correlation of Continuous Variables with Heart Disease\n")
cat("Not Applicable.\n")
}
## Correlation of Continuous Variables with Heart Disease
Variable Correlation
BMI 0.053
MentHlth 0.065
PhysHlth 0.182
6.1.4 Chi-Squared Test
\(H_0 : \text{There is no association between the categorical variable and heart disease status.}\)
\(H_a : \text{There is an association between the categorical variable and heart disease status.}\)
\(\alpha = 0.05\)
# Creating a function to calculate Cramér's V
cramers_v <- function(chi_sq, n, df) {
return(sqrt(chi_sq / (n * min(df, n - 1))))
}
# Initializing the results table
chi_sq_results <- data.frame(
Variable = character(),
Chi_Square = numeric(),
df = numeric(),
p_value = numeric(),
Cramers_V = numeric(),
stringsAsFactors = FALSE
)
# Looping through categorical variables
for (var in categorical_vars) {
table_var <- table(heart_data$HeartDiseaseorAttack, heart_data[[var]])
chi_sq_test <- chisq.test(table_var)
n <- sum(table_var)
v <- cramers_v(chi_sq_test$statistic, n, chi_sq_test$parameter)
# Adding results to the table
chi_sq_results <- rbind(chi_sq_results, data.frame(
Variable = var,
Chi_Square = round(chi_sq_test$statistic, 2),
df = chi_sq_test$parameter,
p_value = round(chi_sq_test$p.value, 4),
Cramers_V = round(v, 2)
))
}6.1.5 Chi-Squared Results
# Displaying the results
print(chi_sq_results, row.names = FALSE) Variable Chi_Square df p_value Cramers_V
Smoker 3321.61 1 0 0.11
HighBP 11117.88 1 0 0.21
PhysActivity 1932.63 1 0 0.09
HighChol 8288.02 1 0 0.18
Stroke 10450.58 1 0 0.20
Diabetes 8244.89 2 0 0.13
Fruits 99.22 1 0 0.02
Veggies 388.82 1 0 0.04
AnyHealthcare 88.74 1 0 0.02
NoDocbcCost 243.40 1 0 0.03
GenHlth 19008.16 4 0 0.14
DiffWalk 11475.80 1 0 0.21
Sex 1879.79 1 0 0.09
Age 13731.04 12 0 0.07
Education 2589.79 5 0 0.05
Income 5277.60 7 0 0.05
6.1.6 Summary
The chi-squared results show that nearly all categorical predictors are significantly associated with heart disease (p < 0.05), as indicated by their high chi-square values and low p-values. HighBP, HighChol, and Stroke have the largest chi-square values, suggesting they are the strongest contributors among these variables.
Cramér’s V values indicate moderate association strength for HighBP (0.21), HighChol (0.18), and Stroke (0.20), while other variables like Smoker (0.11), Diabetes (0.13), and PhysActivity (0.09) show smaller, yet still notable associations. Predictors such as Fruits, Veggies, AnyHealthcare, and NoDocbcCost have very weak associations (Cramér’s V < 0.05), indicating minimal practical significance despite statistical significance.
6.2 T-Test
\(H_0 : \mu_1 = \mu_2 \quad \text{(The means of the two groups are equal)}\)
\(H_a : \text{There is an association between the categorical variable and heart disease status.}\)
\(\alpha = 0.05\)
# Subsets for individuals with and without heart disease
heart_disease_yes <- heart_data[heart_data$HeartDiseaseorAttack == 1, ]
heart_disease_no <- heart_data[heart_data$HeartDiseaseorAttack == 0, ]
# Function to calculate Cohen's d
cohen_d <- function(group1, group2) {
n1 <- length(group1)
n2 <- length(group2)
mean1 <- mean(group1, na.rm = TRUE)
mean2 <- mean(group2, na.rm = TRUE)
sd1 <- sd(group1, na.rm = TRUE)
sd2 <- sd(group2, na.rm = TRUE)
sp <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
d <- (mean1 - mean2) / sp
return(d)
}
# Outlining the variables to test
t_test_vars <- c("Age", "BMI", "GenHlth", "MentHlth", "PhysHlth", "Education", "Income")
# Generating a results table
t_test_results_table <- data.frame(
Variable = character(),
t_Statistic = numeric(),
p_Value = numeric(),
Cohen_d = numeric(),
stringsAsFactors = FALSE
)
# Looping for t-tests and Cohen's d
for (var in t_test_vars) {
t_test_result <- t.test(heart_disease_yes[[var]], heart_disease_no[[var]], var.equal = FALSE)
cohen_d_value <- cohen_d(heart_disease_yes[[var]], heart_disease_no[[var]])
# Adding results to the correct table
t_test_results_table <- rbind(t_test_results_table, data.frame(
Variable = var,
t_Statistic = round(t_test_result$statistic, 2),
p_Value = round(t_test_result$p.value, 4),
Cohen_d = round(cohen_d_value, 2)
))
}6.2.1 T-Test Results
# Displaying the table
print(t_test_results_table, row.names = FALSE) Variable t_Statistic p_Value Cohen_d
Age 147.62 0 0.78
BMI 26.18 0 0.18
GenHlth 128.81 0 0.92
MentHlth 26.74 0 0.22
PhysHlth 68.93 0 0.63
Education -46.92 0 -0.34
Income -67.35 0 -0.49
6.2.2 Summary
According to the T tests with Cohen’s D on the continuous variables, at a significance level of 𝛂 = 0.05:
Age and GenHlth appear to be the strongest predictors of heart disease with the largest differences between groups.
PhysHlth and Income appear to be significant predictors with moderate effect sizes.
BMI, MentHlth, and Education appear to be significant predictors, but with small effect sizes.
6.3 ANOVA
\(H_0 : \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k \quad \text{(All group means are equal)}\)
\(H_a : \text{At least one group mean is different.}\)
\(\alpha = 0.05\)
# Ensuring HeartDiseaseorAttack is a factor so it can be used in ANOVA
heart_data$HeartDiseaseorAttack <- as.factor(heart_data$HeartDiseaseorAttack)
# Performing ANOVA and collecting results
anova_results_hd <- do.call(rbind, lapply(numeric_vars, function(var) {
anova_model <- aov(heart_data[[var]] ~ heart_data$HeartDiseaseorAttack)
anova_summary <- summary(anova_model)
# Extracting the key metrics from the ANOVa for simpler overview
ms_between <- round(anova_summary[[1]]$`Mean Sq`[1], 2)
ms_residual <- round(anova_summary[[1]]$`Mean Sq`[2], 2)
f_stat <- round(anova_summary[[1]]$`F value`[1], 2)
p_val <- round(anova_summary[[1]]$`Pr(>F)`[1], 4)
# Setting up to return as a data frame row
data.frame(
Variable = var,
MeanSq_Between = ms_between,
MeanSq_Residual = ms_residual,
F_Statistic = f_stat,
P_Value = p_val,
stringsAsFactors = FALSE
)
}))6.3.1 ANOVA Results: Numeric Variables by HeartDisease
# Displaying results in a table
print(anova_results_hd, row.names = FALSE) Variable MeanSq_Between MeanSq_Residual F_Statistic P_Value
BMI 31009.66 43.55 712.00 0
MentHlth 58211.10 54.72 1063.78 0
PhysHlth 636519.14 73.49 8660.85 0
Tukey’s HSD
# Tukey's HSD for BMI
anova_bmi <- aov(BMI ~ HeartDiseaseorAttack, data = heart_data)
tukey_bmi <- TukeyHSD(anova_bmi)
print(tukey_bmi) Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = BMI ~ HeartDiseaseorAttack, data = heart_data)
$HeartDiseaseorAttack
diff lwr upr p adj
1-0 1.196998 1.109076 1.284921 0
# Tukey's HSD for MentHlth
anova_ment <- aov(MentHlth ~ HeartDiseaseorAttack, data = heart_data)
tukey_ment <- TukeyHSD(anova_ment)
print(tukey_ment) Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = MentHlth ~ HeartDiseaseorAttack, data = heart_data)
$HeartDiseaseorAttack
diff lwr upr p adj
1-0 1.640016 1.541462 1.738569 0
# Tukey's HSD for PhysHlth
anova_phys <- aov(PhysHlth ~ HeartDiseaseorAttack, data = heart_data)
tukey_phys <- TukeyHSD(anova_phys)
print(tukey_phys) Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = PhysHlth ~ HeartDiseaseorAttack, data = heart_data)
$HeartDiseaseorAttack
diff lwr upr p adj
1-0 5.42314 5.308926 5.537353 0
6.3.2 Boxplots for ANOVA variables
# Generating boxplots for the ANOVA variables
for (var in numeric_vars) {
plot <- ggplot(heart_data, aes(x = HeartDiseaseorAttack, y = .data[[var]], fill = HeartDiseaseorAttack)) +
geom_boxplot(show.legend = FALSE) +
labs(
title = paste("Boxplot of", var, "by HeartDisease"),
x = "HeartDisease (0 = No, 1 = Yes)",
y = var
) +
scale_fill_manual(values = c("blue", "orange"), labels = c("No HeartDisease", "HeartDisease")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Displaying the plots
print(plot)
}6.3.3 Summary
BMI:
High F-statistic (712.00) and p-value (0) indicate significant differences in BMI across the heart attack status groups.
Suggests that BMI is strongly associated with heart attack status.
Mental Health:
Extremely high F-statistic (1063.78) and p-value (0) suggest significant differences in mental health (unhealthy days) across groups.
Indicates that mental health may play a role in heart attack status.
Physical Health:
Exceptionally high F-statistic (8660.85) and p-value (0) show strong differences in physical health (unhealthy days) across heart attack groups.
Highlights physical health as a critical factor.
Tukey Interpretation:
BMI
Difference: 1.197
- Indicates that individuals with a history of heart attack (1) have, on average, a BMI 1.197 units higher than those without a history (0).
Confidence Interval (lwr, upr): [1.109, 1.285]
- The true difference in mean BMI between groups is likely between 1.109 and 1.285.
p adj: 0
- This extremely low p-value confirms the difference is statistically significant.
Mental Health
Difference: 1.640
- Individuals with heart attack history report 1.640 more unhealthy mental health days on average.
Confidence Interval (lwr, upr): [1.541, 1.739]
- The true difference lies within this range.
p adj: 0
- This is a statistically significant difference.
Physical Health
Difference: 5.423
- Individuals with a history of heart attack report 5.423 more unhealthy physical health days on average.
Confidence Interval (lwr, upr): [5.309, 5.537]
- The difference is precise and significant, as indicated by the small interval.
p adj: 0
- The low p-value confirms a significant difference.
7 Model Building
Next, we begin the assembly of our regression model, speficially utilizing logistic regression.
7.1 Assumptions of Logistic Regression
Binary Outcome Variable: The dependent variable is binary.
Independence of Observations: Observations are independent of each other.
Linearity of Logit: The relationship between the predictors and the log-odds of the outcome is linear.
No Multicollinearity: Predictors are not highly correlated with each other.
Large Sample Size: Logistic regression works best with a large sample size to ensure reliable estimates.
We will confirm these assumptions in the Model Diagnostics and Validation section below.
7.2 Base Model
We begin with our base model, including all predictors:
base_model <- glm(HeartDiseaseorAttack ~ ., data = heart_data, family = binomial)
summary(base_model)
Call:
glm(formula = HeartDiseaseorAttack ~ ., family = binomial, data = heart_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.9124089 0.1028054 -76.965 < 2e-16 ***
HighBP 0.5245081 0.0177520 29.546 < 2e-16 ***
HighChol 0.6112677 0.0164496 37.160 < 2e-16 ***
CholCheck 0.5248111 0.0662510 7.922 2.35e-15 ***
BMI 0.0009744 0.0012122 0.804 0.4215
Smoker 0.3629524 0.0157323 23.071 < 2e-16 ***
Stroke 0.9783443 0.0244338 40.041 < 2e-16 ***
Diabetes 0.1465123 0.0089720 16.330 < 2e-16 ***
PhysActivity 0.0398975 0.0171991 2.320 0.0204 *
Fruits 0.0060203 0.0163274 0.369 0.7123
Veggies 0.0426327 0.0189359 2.251 0.0244 *
HvyAlcoholConsump -0.2940714 0.0392877 -7.485 7.15e-14 ***
AnyHealthcare -0.0070009 0.0412820 -0.170 0.8653
NoDocbcCost 0.2528463 0.0268896 9.403 < 2e-16 ***
GenHlth 0.4907058 0.0095105 51.596 < 2e-16 ***
MentHlth 0.0024628 0.0009778 2.519 0.0118 *
PhysHlth 0.0010542 0.0008766 1.202 0.2292
DiffWalk 0.2947780 0.0193855 15.206 < 2e-16 ***
Sex 0.7611811 0.0160326 47.477 < 2e-16 ***
Age 0.2556493 0.0036439 70.158 < 2e-16 ***
Education 0.0110314 0.0081696 1.350 0.1769
Income -0.0431551 0.0042476 -10.160 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 158355 on 253679 degrees of freedom
Residual deviance: 120933 on 253658 degrees of freedom
AIC: 120977
Number of Fisher Scoring iterations: 6
stepwise_model <- stepAIC(base_model, direction = "both", trace = FALSE)
summary(stepwise_model)
Call:
glm(formula = HeartDiseaseorAttack ~ HighBP + HighChol + CholCheck +
Smoker + Stroke + Diabetes + PhysActivity + Veggies + HvyAlcoholConsump +
NoDocbcCost + GenHlth + MentHlth + DiffWalk + Sex + Age +
Income, family = binomial, data = heart_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.849811 0.084630 -92.755 < 2e-16 ***
HighBP 0.525327 0.017568 29.902 < 2e-16 ***
HighChol 0.611535 0.016439 37.201 < 2e-16 ***
CholCheck 0.525554 0.066050 7.957 1.76e-15 ***
Smoker 0.360912 0.015659 23.048 < 2e-16 ***
Stroke 0.978517 0.024405 40.096 < 2e-16 ***
Diabetes 0.147432 0.008805 16.745 < 2e-16 ***
PhysActivity 0.039919 0.017008 2.347 0.01892 *
Veggies 0.046723 0.018425 2.536 0.01122 *
HvyAlcoholConsump -0.295596 0.039239 -7.533 4.95e-14 ***
NoDocbcCost 0.254267 0.026437 9.618 < 2e-16 ***
GenHlth 0.494177 0.008690 56.866 < 2e-16 ***
MentHlth 0.002741 0.000954 2.873 0.00406 **
DiffWalk 0.302854 0.018630 16.256 < 2e-16 ***
Sex 0.761142 0.015982 47.626 < 2e-16 ***
Age 0.254982 0.003500 72.862 < 2e-16 ***
Income -0.040995 0.003938 -10.410 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 158355 on 253679 degrees of freedom
Residual deviance: 120937 on 253663 degrees of freedom
AIC: 120971
Number of Fisher Scoring iterations: 6
Note: We have omitted the full stepwise regression output from this report to ensure smoother rendering and reduce processing times.
Looking at the summary of the logistic regression model with all variables, it appears that the predictors BMI, Fruits, AnyHealthcare, PhysHlth, and Education have p-values over 0.05, indicating insignificance to the model.
A stepwise selection process in both directions was used to select the best model by using AIC to identify the most significant factors. This selection method combines forward selection and backward elimination.
The model resulted in the inclusion of the following predictors as most significant:
HighBP
HighChol
CholCheck
Smoker
Stroke
Diabetes
PhysActivity
Veggies
HvyAlcoholConsump
NoDocbcCost
GenHlth
MentHlth
DiffWalk
Sex
Age
Income
The selection process included 6 iterations, removing the predictors BMI, Fruits, AnyHealthcare, PhysHlth, and Education from the full model as they did not significantly contribute to the model’s fit according to the AIC.
7.2.1 Summary
As detailed above, the final model based on our stepwise selection process is as follows:
# Model after stepwise selection
final_model <- glm(HeartDiseaseorAttack ~ HighBP + HighChol + CholCheck + Smoker +
Stroke + Diabetes + PhysActivity + Veggies + HvyAlcoholConsump +
NoDocbcCost + GenHlth + MentHlth + DiffWalk + Sex + Age +
Income, data = heart_data, family = binomial)8 Model Diagnostics and Validation
Let us now confirm that the assumptions of logistic regression in our model:
8.1 Binary Outcome Variable
The dependent variable is HeartDiseaseorAttack, which has values: 0 = No, 1 = Yes
This variable is binary and confirms this assumption.
8.2 Independence
In our dataset, each observation represents a unique individual, and there are no repeated measures or clustering effects. Therefore, the independence assumption is confirmed.
8.3 Linearity of Logit
The age variable is categorized into groups, making it more complex to analyze in logistic regression. Our attempts to test its linearity caused technical issues with the model, so keep the analysis straightforward, we left the variable as is. Accurate assessment of this assumption may require a more thorough refining of the age variable, be it through transformations or other means.
8.4 No Multicollinearity
# Checking for multicollinearity via the Variance Inflation Factor (VIF)
vif_values <- vif(final_model)
print("Variance Inflation Factor (VIF) values:")[1] "Variance Inflation Factor (VIF) values:"
print(vif_values) HighBP HighChol CholCheck Smoker
1.118635 1.060904 1.013787 1.041327
Stroke Diabetes PhysActivity Veggies
1.033980 1.105517 1.131639 1.044400
HvyAlcoholConsump NoDocbcCost GenHlth MentHlth
1.017747 1.129484 1.526453 1.235811
DiffWalk Sex Age Income
1.392603 1.110160 1.140386 1.311509
Based on the details outlines above, the Variance Inflation Factor (VIF) values indicate that there are no concerning multicollinearity issues present in our final model. Generally, VIF values below 5 (sometimes 10 for larger datasets) suggest that predictors are not highly correlated.
8.5 Cook’s Distance
# Checking for influential points via Cook's Distance plot
cooks_d <- cooks.distance(stepwise_model)
cook_thresh <- 4 / nrow(heart_data)
plot(cooks_d, type = "h", main = "Cook's Distance for Each Observation",
ylab = "Cook's Distance", xlab = "Observations")
abline(h = cook_thresh, col = "red", lty = 2)# Checking for influential points via leverage plot
leverage <- hatvalues(stepwise_model)
leverage_thresh <- 2 * (length(coef(stepwise_model)) + 1) / nrow(heart_data)
plot(leverage, cooks_d, xlab = "Leverage", ylab = "Cook's Distance",
main = "Influential Observations", pch = 20, col = "black")
abline(h = cook_thresh, col = "blue", lty = 2)
abline(v = leverage_thresh, col = "red", lty = 2)Based on both the Cooks Distance and Leverage plots, we can see that no single observation appears to be excessively influential, indicating that the model is robust. Given that no significant outliers have been identified, we will not be removing any observations from the model.
8.6 Large Sample Size
Given that our dataset containts over 250,000 observations, the assumption of large sample size is met.
8.7 Interaction Terms
In an effort to describe the model in the most detail possible, we will also include the following interaction terms:
HighBP × HighChol
HighBP × Smoker
HighChol × Stroke
Smoker × Stroke
Stroke × GenHlth
GenHlth × Age
PhysActivity × Veggies
HvyAlcoholConsump × NoDocbcCost
# Outlining the interaction model
interaction_model <- glm(
HeartDiseaseorAttack ~ HighBP * HighChol +
HighBP * Smoker +
HighChol * Stroke +
Smoker * Stroke +
Stroke * GenHlth +
GenHlth * Age +
PhysActivity * Veggies +
HvyAlcoholConsump * NoDocbcCost +
HighBP + HighChol + Smoker + Stroke + GenHlth + PhysActivity + Veggies +
HvyAlcoholConsump + NoDocbcCost + GenHlth + MentHlth + DiffWalk +
Sex + Age + Income + CholCheck + Diabetes,
data = heart_data,
family = binomial
)
# Printing a summary of the interaction model
summary(interaction_model)
Call:
glm(formula = HeartDiseaseorAttack ~ HighBP * HighChol + HighBP *
Smoker + HighChol * Stroke + Smoker * Stroke + Stroke * GenHlth +
GenHlth * Age + PhysActivity * Veggies + HvyAlcoholConsump *
NoDocbcCost + HighBP + HighChol + Smoker + Stroke + GenHlth +
PhysActivity + Veggies + HvyAlcoholConsump + NoDocbcCost +
GenHlth + MentHlth + DiffWalk + Sex + Age + Income + CholCheck +
Diabetes, family = binomial, data = heart_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.0547531 0.1325791 -68.297 < 2e-16 ***
HighBP 0.4907252 0.0332683 14.751 < 2e-16 ***
HighChol 0.5700414 0.0284349 20.047 < 2e-16 ***
Smoker 0.4259475 0.0290007 14.687 < 2e-16 ***
Stroke 1.5480487 0.0905713 17.092 < 2e-16 ***
GenHlth 0.8676025 0.0325858 26.625 < 2e-16 ***
Age 0.3697799 0.0106097 34.853 < 2e-16 ***
PhysActivity 0.0712664 0.0320648 2.223 0.026245 *
Veggies 0.0710875 0.0282127 2.520 0.011746 *
HvyAlcoholConsump -0.3192760 0.0417302 -7.651 1.99e-14 ***
NoDocbcCost 0.2205944 0.0269747 8.178 2.89e-16 ***
MentHlth 0.0015246 0.0009568 1.593 0.111079
DiffWalk 0.3023440 0.0185343 16.313 < 2e-16 ***
Sex 0.7559528 0.0159463 47.406 < 2e-16 ***
Income -0.0373180 0.0039367 -9.479 < 2e-16 ***
CholCheck 0.5193762 0.0661194 7.855 3.99e-15 ***
Diabetes 0.1442157 0.0087666 16.451 < 2e-16 ***
HighBP:HighChol 0.1279302 0.0346383 3.693 0.000221 ***
HighBP:Smoker -0.0884719 0.0339417 -2.607 0.009145 **
HighChol:Stroke -0.4416252 0.0507694 -8.699 < 2e-16 ***
Smoker:Stroke -0.1004807 0.0485797 -2.068 0.038605 *
Stroke:GenHlth -0.0585123 0.0225713 -2.592 0.009533 **
GenHlth:Age -0.0367094 0.0031181 -11.773 < 2e-16 ***
PhysActivity:Veggies -0.0447662 0.0368020 -1.216 0.223830
HvyAlcoholConsump:NoDocbcCost 0.1921910 0.1206683 1.593 0.111223
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 158355 on 253679 degrees of freedom
Residual deviance: 120680 on 253655 degrees of freedom
AIC: 120730
Number of Fisher Scoring iterations: 7
Next we will perform the stepwise selection again, this timem focusing on the new interaction model:
# Outlining the stepwise selection for the new interaction model
stepwise_interaction_model <- stepAIC(interaction_model, direction = "both", trace = FALSE)
# Printing a summary
summary(stepwise_interaction_model)
Call:
glm(formula = HeartDiseaseorAttack ~ HighBP + HighChol + Smoker +
Stroke + GenHlth + Age + PhysActivity + Veggies + HvyAlcoholConsump +
NoDocbcCost + MentHlth + DiffWalk + Sex + Income + CholCheck +
Diabetes + HighBP:HighChol + HighBP:Smoker + HighChol:Stroke +
Smoker:Stroke + Stroke:GenHlth + GenHlth:Age + HvyAlcoholConsump:NoDocbcCost,
family = binomial, data = heart_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.0374608 0.1318007 -68.569 < 2e-16 ***
HighBP 0.4908139 0.0332682 14.753 < 2e-16 ***
HighChol 0.5700173 0.0284345 20.047 < 2e-16 ***
Smoker 0.4261185 0.0290001 14.694 < 2e-16 ***
Stroke 1.5486694 0.0905683 17.099 < 2e-16 ***
GenHlth 0.8678146 0.0325848 26.632 < 2e-16 ***
Age 0.3697842 0.0106094 34.854 < 2e-16 ***
PhysActivity 0.0381591 0.0169391 2.253 0.024276 *
Veggies 0.0450789 0.0183618 2.455 0.014087 *
HvyAlcoholConsump -0.3195208 0.0417301 -7.657 1.91e-14 ***
NoDocbcCost 0.2203268 0.0269749 8.168 3.14e-16 ***
MentHlth 0.0015095 0.0009568 1.578 0.114632
DiffWalk 0.3025022 0.0185347 16.321 < 2e-16 ***
Sex 0.7562402 0.0159439 47.431 < 2e-16 ***
Income -0.0373760 0.0039367 -9.494 < 2e-16 ***
CholCheck 0.5192716 0.0661177 7.854 4.04e-15 ***
Diabetes 0.1442449 0.0087667 16.454 < 2e-16 ***
HighBP:HighChol 0.1280530 0.0346379 3.697 0.000218 ***
HighBP:Smoker -0.0886864 0.0339409 -2.613 0.008976 **
HighChol:Stroke -0.4416122 0.0507695 -8.698 < 2e-16 ***
Smoker:Stroke -0.1000196 0.0485779 -2.059 0.039499 *
Stroke:GenHlth -0.0587523 0.0225704 -2.603 0.009239 **
GenHlth:Age -0.0367077 0.0031181 -11.773 < 2e-16 ***
HvyAlcoholConsump:NoDocbcCost 0.1913418 0.1206844 1.585 0.112859
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 158355 on 253679 degrees of freedom
Residual deviance: 120681 on 253656 degrees of freedom
AIC: 120729
Number of Fisher Scoring iterations: 7
Note: We have omitted the full stepwise regression output from this report to ensure smoother rendering and reduce processing times.
As shown through the stepwise selection process above, we have arrived at a logistic regression model that aims to predicts the likelihood of heart disease. This model is based on the significant predictors from our first final model, as well as interactions between key variables. Below is a summary of the model’s components:
# Outlining the final model
final_interaction_model <- glm(formula = HeartDiseaseorAttack ~ HighBP + HighChol + Smoker + Stroke + GenHlth + Age + PhysActivity + Veggies + HvyAlcoholConsump +
NoDocbcCost + MentHlth + DiffWalk + Sex + Income + CholCheck +
Diabetes + HighBP:HighChol + HighBP:Smoker + HighChol:Stroke +
Smoker:Stroke + Stroke:GenHlth + GenHlth:Age + HvyAlcoholConsump:NoDocbcCost,
family = binomial, data = heart_data)
# Printing a summary of the final model
summary(final_interaction_model)
Call:
glm(formula = HeartDiseaseorAttack ~ HighBP + HighChol + Smoker +
Stroke + GenHlth + Age + PhysActivity + Veggies + HvyAlcoholConsump +
NoDocbcCost + MentHlth + DiffWalk + Sex + Income + CholCheck +
Diabetes + HighBP:HighChol + HighBP:Smoker + HighChol:Stroke +
Smoker:Stroke + Stroke:GenHlth + GenHlth:Age + HvyAlcoholConsump:NoDocbcCost,
family = binomial, data = heart_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.0374608 0.1318007 -68.569 < 2e-16 ***
HighBP 0.4908139 0.0332682 14.753 < 2e-16 ***
HighChol 0.5700173 0.0284345 20.047 < 2e-16 ***
Smoker 0.4261185 0.0290001 14.694 < 2e-16 ***
Stroke 1.5486694 0.0905683 17.099 < 2e-16 ***
GenHlth 0.8678146 0.0325848 26.632 < 2e-16 ***
Age 0.3697842 0.0106094 34.854 < 2e-16 ***
PhysActivity 0.0381591 0.0169391 2.253 0.024276 *
Veggies 0.0450789 0.0183618 2.455 0.014087 *
HvyAlcoholConsump -0.3195208 0.0417301 -7.657 1.91e-14 ***
NoDocbcCost 0.2203268 0.0269749 8.168 3.14e-16 ***
MentHlth 0.0015095 0.0009568 1.578 0.114632
DiffWalk 0.3025022 0.0185347 16.321 < 2e-16 ***
Sex 0.7562402 0.0159439 47.431 < 2e-16 ***
Income -0.0373760 0.0039367 -9.494 < 2e-16 ***
CholCheck 0.5192716 0.0661177 7.854 4.04e-15 ***
Diabetes 0.1442449 0.0087667 16.454 < 2e-16 ***
HighBP:HighChol 0.1280530 0.0346379 3.697 0.000218 ***
HighBP:Smoker -0.0886864 0.0339409 -2.613 0.008976 **
HighChol:Stroke -0.4416122 0.0507695 -8.698 < 2e-16 ***
Smoker:Stroke -0.1000196 0.0485779 -2.059 0.039499 *
Stroke:GenHlth -0.0587523 0.0225704 -2.603 0.009239 **
GenHlth:Age -0.0367077 0.0031181 -11.773 < 2e-16 ***
HvyAlcoholConsump:NoDocbcCost 0.1913418 0.1206844 1.585 0.112859
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 158355 on 253679 degrees of freedom
Residual deviance: 120681 on 253656 degrees of freedom
AIC: 120729
Number of Fisher Scoring iterations: 7
8.7.1 Main Predictors
Detailed below are the main variables that remain significant without being part of an interaction:
HighBP (p<0.001): Individuals with high blood pressure have higher odds of heart disease.
HighChol (p<0.001): High cholesterol is strongly associated with an increased likelihood of heart disease.
Smoker (p<0.001): Smoking significantly raises the odds of heart disease.
Stroke (p<0.001): A history of stroke is highly predictive of heart disease.
GenHlth (p<0.001): Poorer general health is associated with a higher likelihood of heart disease.
Age (p<0.001): Age is positively correlated with the odds of heart disease.
PhysActivity (p=0.024): Regular physical activity reduces the likelihood of heart disease, albeit with a modest effect.
Veggies (p=0.014): Consuming vegetables is associated with a lower risk of heart disease.
HvyAlcoholConsump (p<0.001): Heavy alcohol consumption is negatively associated with heart disease likelihood.
NoDocbcCost (p<0.001): Inability to afford medical care increases the odds of heart disease.
DiffWalk (p<0.001): Difficulty walking is strongly associated with heart disease.
Sex (p<0.001): Males have higher odds of heart disease compared to females.
Income (p<0.001): Higher income levels are negatively associated with heart disease likelihood.
CholCheck (p<0.001): Recent cholesterol checks are positively associated with identifying heart disease.
Diabetes (p<0.001): Individuals with diabetes have higher odds of heart disease.
8.7.2 Interaction Terms
Detailed below are the significant interaction effects:
HighBP (p=0.0002): The combined presence of high blood pressure and high cholesterol has a multiplicative effect, increasing the odds of heart disease.
HighBP (p=0.009): Smoking interacts with high blood pressure to slightly reduce the risk compared to their independent effects.
HighChol (p<0.001): High cholesterol’s effect on heart disease risk is mitigated for those with a history of stroke.
Smoker (p=0.039): Smoking combined with a stroke history slightly reduces the risk compared to their independent effects.
Stroke (p=0.009): Poor general health modifies the effect of having a stroke on heart disease risk.
GenHlth (p<0.001): Age modifies the effect of general health on heart disease risk.
8.7.3 Interpretation of Results
Main Effects:
Variables like high blood pressure, cholesterol, smoking, and stroke remain strong independent predictors of heart disease.
Lifestyle factors (e.g., physical activity, vegetable consumption) and socioeconomic factors (e.g., income, access to healthcare) also play key roles.
Interactions:
- The interaction terms show how combinations of predictors influence heart disease differently compared to their independent effects. For example, the interaction between age and general health highlights how age can amplify the impact of poor health.
Variables Dropped in Interaction:
MentHlth (p=0.115): Number of bad mental health days is no longer significant in the interaction model.
HvyAlcoholConsump (p=0.113): This interaction term did not show a significant effect on heart disease risk.
8.7.4 Model Performance
AIC: The final model’s AIC (120729) indicates a good fit relative to earlier models.
Residual Deviance: The residual deviance (120681) suggests the model explains most of the variability in the data.
9 Predictive Model Outputs and Interpretation
Now, let us outline key outputs from the final logistic regression model, including odds ratios, ROC and calibration curves. This will help us to evaluate the model’s predictive performance and provide insights into the factors influencing heart disease, helping us ensure the model’s results are interpretable and aligned with our research objectives.
9.1 Odds Ratio
# odds ratios and confidence intervals
model_summary <- tidy(final_interaction_model, exponentiate = TRUE, conf.int = TRUE)
colnames(model_summary) <- c("Variable", "OddsRatio", "StdError", "ConfLow", "ConfHigh", "PValue")
# odds ratio plot
ggplot(model_summary, aes(x = reorder(Variable, OddsRatio), y = OddsRatio)) +
geom_point(size = 3, color = "blue") +
geom_errorbar(aes(ymin = ConfLow, ymax = ConfHigh), width = 0.2) +
geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
coord_flip() +
labs(
title = "Odds Ratio Plot",
x = "Predictor Variables",
y = "Odds Ratio (with 95% CI)"
) +
theme_minimal()# Outlining interpreations of the odds ratio in a table for compactness
predictors_table <- data.frame(
Predictor = c(
"HighBP", "HighChol", "Smoker",
"Stroke", "GenHlth", "Age", "DiffWalk",
"Income", "CholCheck", "PhysActivity",
"Veggies", "HvyAlcoholConsump",
"Diabetes", "Sex", "NoDocbcCost",
"HighBP × HighChol", "HighBP × Smoker", "HighChol × Stroke",
"Smoker × Stroke", "Stroke × GenHlth", "GenHlth × Age"
),
Coefficient = c(
0.491, 0.570, 0.426, 1.549, 0.868, 0.370, 0.303,
-0.037, 0.519, 0.038, 0.045, -0.320, 0.144, 0.756, 0.220,
0.128, -0.089, -0.442, -0.100, -0.059, -0.037
),
OddsRatio = c(
1.63, 1.77, 1.53, 4.71, 2.38, 1.45, 1.35,
0.96, 1.68, 1.04, 1.05, 0.73, 1.15, 2.13, 1.25,
1.14, 0.91, 0.64, 0.90, 0.94, 0.96
),
Interpretation = c(
"High blood pressure increases the odds of heart disease by 63%",
"High cholesterol raises the odds of heart disease by 77%",
"Smokers are 53% more likely to have heart disease",
"A history of stroke increases the likelihood of heart disease by 371%",
"Poorer general health increases heart disease risk by 138%",
"Moving to the next age group increases heart disease odds by 45%",
"Individuals with mobility issues are 35% more likely to have heart disease",
"Higher income levels are slightly protective against heart disease",
"Recent cholesterol screening is associated with a 68% higher likelihood of identifying heart disease",
"Slight increase in risk; likely due to other confounding health behaviors",
"Associated with a minor protective effect",
"Heavy alcohol consumption reduces odds, but this result may reflect underlying health or reporting biases",
"Diabetes increases the odds of heart disease by 15%",
"Males have over twice the odds of heart disease compared to females",
"Inability to afford medical care increases heart disease odds by 25%",
"Combined presence of high blood pressure and high cholesterol increases odds by 14% over their independent effects",
"Smoking slightly offsets the risk of high blood pressure on heart disease, potentially due to competing risks",
"The effect of high cholesterol is lessened for those with a history of stroke",
"The combined effect of smoking and stroke history is slightly less than their individual effects",
"Poor general health slightly reduces the stroke effect on heart disease risk",
"As individuals age, the impact of general health on heart disease slightly diminishes"
)
)
# Generating the the table
library(knitr)
kable(
predictors_table,
col.names = c("Predictor", "Coefficient", "Odds Ratio", "Interpretation"),
caption = "Summary of Predictors, Interaction Terms, and Odds Ratios"
)| Predictor | Coefficient | Odds Ratio | Interpretation |
|---|---|---|---|
| HighBP | 0.491 | 1.63 | High blood pressure increases the odds of heart disease by 63% |
| HighChol | 0.570 | 1.77 | High cholesterol raises the odds of heart disease by 77% |
| Smoker | 0.426 | 1.53 | Smokers are 53% more likely to have heart disease |
| Stroke | 1.549 | 4.71 | A history of stroke increases the likelihood of heart disease by 371% |
| GenHlth | 0.868 | 2.38 | Poorer general health increases heart disease risk by 138% |
| Age | 0.370 | 1.45 | Moving to the next age group increases heart disease odds by 45% |
| DiffWalk | 0.303 | 1.35 | Individuals with mobility issues are 35% more likely to have heart disease |
| Income | -0.037 | 0.96 | Higher income levels are slightly protective against heart disease |
| CholCheck | 0.519 | 1.68 | Recent cholesterol screening is associated with a 68% higher likelihood of identifying heart disease |
| PhysActivity | 0.038 | 1.04 | Slight increase in risk; likely due to other confounding health behaviors |
| Veggies | 0.045 | 1.05 | Associated with a minor protective effect |
| HvyAlcoholConsump | -0.320 | 0.73 | Heavy alcohol consumption reduces odds, but this result may reflect underlying health or reporting biases |
| Diabetes | 0.144 | 1.15 | Diabetes increases the odds of heart disease by 15% |
| Sex | 0.756 | 2.13 | Males have over twice the odds of heart disease compared to females |
| NoDocbcCost | 0.220 | 1.25 | Inability to afford medical care increases heart disease odds by 25% |
| HighBP × HighChol | 0.128 | 1.14 | Combined presence of high blood pressure and high cholesterol increases odds by 14% over their independent effects |
| HighBP × Smoker | -0.089 | 0.91 | Smoking slightly offsets the risk of high blood pressure on heart disease, potentially due to competing risks |
| HighChol × Stroke | -0.442 | 0.64 | The effect of high cholesterol is lessened for those with a history of stroke |
| Smoker × Stroke | -0.100 | 0.90 | The combined effect of smoking and stroke history is slightly less than their individual effects |
| Stroke × GenHlth | -0.059 | 0.94 | Poor general health slightly reduces the stroke effect on heart disease risk |
| GenHlth × Age | -0.037 | 0.96 | As individuals age, the impact of general health on heart disease slightly diminishes |
9.2 ROC Curve
# ROC curve code:
# Predicted probabilities from final model
predicted_probs <- predict(final_interaction_model, heart_data, type = "response")
# Actual outcomes
actual_outcomes <- as.numeric(heart_data$HeartDiseaseorAttack) # EDIT: Converted to numeric to ensure compatibility with `roc()` function.
# ROC curve
roc_curve <- roc(actual_outcomes, predicted_probs)
# ROC curve plot
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "gray") # Diagonal reference line# AUC
auc_value <- auc(roc_curve)
cat("Area Under the Curve (AUC):", auc_value, "\n")Area Under the Curve (AUC): 0.8475062
An AUC of 0.8475 suggests that the model has good discriminatory power, meaning it can effectively separate individuals with and without the outcome (heart disease).
9.3 Calibration Curve
# Calibration curve code:
# Grouping predicted probabilities
calibration_data <- data.frame(
predicted_probs = predicted_probs,
actual_outcomes = actual_outcomes
)
calibration_data$bin <- cut(
calibration_data$predicted_probs,
breaks = seq(0, 1, by = 0.1),
include.lowest = TRUE
)
# Mean predicted probabilities and observed outcomes in each bin
calibration_summary <- calibration_data %>%
group_by(bin) %>%
summarise(
mean_predicted = mean(predicted_probs, na.rm = TRUE), # EDIT: Added `na.rm = TRUE` to handle potential NAs in predicted probabilities.
mean_observed = mean(actual_outcomes, na.rm = TRUE) # EDIT: Added `na.rm = TRUE` to handle potential NAs in observed outcomes.
)
# Calibration curve plot
plot(
calibration_summary$mean_predicted,
calibration_summary$mean_observed,
xlab = "Mean Predicted Probability",
ylab = "Mean Observed Probability",
main = "Calibration Curve",
col = "blue",
pch = 16
)
abline(0, 1, lty = 2, col = "gray")The points are close to the dotted line meaning the predicted probabilities align well with observed outcomes.
10 Summary: Research Question and Hypotheses
The analysis sought to identify significant predictors of heart disease. Our hypothesis was supported: age, smoking, high blood pressure, and stroke are strongly associated with increased odds of heart disease. Factors like cholesterol, general health, and socioeconomic indicators like income also play critical roles.