Demographic and Behavioral Correlates of Driving Under the Influence of Alcohol

Analysis of the 2023 National Survey on Drug Use and Health

Author

Julian L. Costa

Published

October 4, 2025

1 Introduction and Research Question

Driving under the influence of alcohol remains one of the most severe public health and safety issues in the United States. In 2020 alone, 11,654 people were killed in crashes involving alcohol-impaired drivers, with 38% of those fatalities being passengers, other drivers, or pedestrians (Centers for Disease Control and Prevention [CDC], 2023). These figures highlight the necessity for stronger prevention efforts to reduce alcohol-impaired driving and protect everyone on the road.

1.1 Research Question

Which demographic and behavioral factors are associated with driving under the influence of alcohol?

2 Data Description

In this project, we use data from the 2023 National Survey on Drug Use and Health (NSDUH), a nationally representative survey conducted annually by the Substance Abuse and Mental Health Services Administration (SAMHSA) to assess substance use and mental health patterns among individuals in the U.S. population.

The primary outcome of interest is the variable drink_drive, a binary indicator of whether a respondent reported driving under the influence of alcohol in the past 12 months (1 = Yes, 0 = No).

Detailed below is a breakdown of all variables we will be analyzing. Note that the variable names reflect the final, cleaned names used throughout this report. These were re-coded from the original NSDUH dataset for clarity and consistency, as detailed in the preparation steps in the following section.

  • drink_drive: Whether respondent drove under the influence (0 = No, 1 = Yes)

  • age: Age group (factor with levels: 12–13 to 65+)

  • sex: Respondent’s sex (Male, Female)

  • edu_level: Highest level of education attained

  • employment: Current employment status

  • marijuana_days: Days used marijuana in past 12 months

  • alcohol_days: Days used alcohol in past 12 months

  • binge_days: Days of binge drinking in past 30 days

2.1 Preparation

To begin our analysis, we’ll load in all of the data:

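# Loading the packages used in this report (MASS first so it does not mask dplyr verbs)
library(MASS); library(dplyr); library(ggplot2); library(pROC)

# Loading in the dataset
load("NSDUH_2023.Rdata")
substance_use_data <- puf2023_102124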

Next we’ll use the NSDUH code book, provided by SAMHSA, to re-code some of the multi-categorical variables for easier analysis:

# Recoding DRVINALCO into binary outcome (0 = No, 1 = Yes)
substance_use_data <- substance_use_data %>%
  mutate(drink_drive = case_when(
    DRVINALCO == 1 ~ 1,
    DRVINALCO == 2 ~ 0,
    TRUE ~ NA_real_
  ))

# Recoding AGE3 to labeled factors
substance_use_data <- substance_use_data %>%
  mutate(age = factor(AGE3, levels = 1:11, labels = c(
    "12-13", "14-15", "16-17", "18-20", "21-23",
    "24-25", "26-29", "30-34", "35-49", "50-64", "65+"
  )))

# Recoding IRSEX into labeled factor for sex
substance_use_data <- substance_use_data %>%
  mutate(sex = factor(IRSEX, levels = c(1, 2), labels = c("Male", "Female")))

# Recoding IREDUHIGHST2 into grouped education levels
substance_use_data <- substance_use_data %>%
  mutate(edu_level = case_when(
    IREDUHIGHST2 <= 7 ~ "Less than HS",
    IREDUHIGHST2 == 8 ~ "HS Grad",
    IREDUHIGHST2 %in% c(9, 10) ~ "Some College/AA",
    IREDUHIGHST2 == 11 ~ "College+",
    TRUE ~ NA_character_
  )) %>%
  mutate(edu_level = factor(edu_level, levels = c(
    "Less than HS", "HS Grad", "Some College/AA", "College+"
  )))

# Recoding IRWRKSTAT into grouped employment status
substance_use_data <- substance_use_data %>%
  mutate(employment = case_when(
    IRWRKSTAT %in% c(1, 2) ~ "Employed",
    IRWRKSTAT == 3 ~ "Unemployed",
    IRWRKSTAT == 4 ~ "Not in labor force",
    TRUE ~ NA_character_
  )) %>%
  mutate(employment = factor(employment, levels = c(
    "Employed", "Unemployed", "Not in labor force"
  )))


# Creating labeled variables for binge, alcohol, and marijuana use from raw NSDUH data
substance_use_data <- substance_use_data %>%
  mutate(
    binge_days = ALCBNG30D,       # Binge days in past 30 days
    alcohol_days = ALCYRTOT,      # Alcohol use days in past year
    marijuana_days = MJYRTOT      # Marijuana use days in past year
  )

Now we can observe the totals across the newly labeled outcome variable, drink_drive, as well as its proportions:

# Outlining the totals across the outcome variable
table(substance_use_data$drink_drive, useNA = "ifany")

    0     1  <NA> 
29572  2977 24156 
prop.table(table(substance_use_data$drink_drive))

        0         1 
0.9085379 0.0914621 

As seen above in the totals, a considerable number of responses are missing on the outcome variable (NA = 24,156, about 43% of the 56,705 records). To account for this, we’ll create a separate dataset so that our analysis is based only on complete cases.

# Creating dataset with complete cases
substance_use_model_data <- substance_use_data %>%
  filter(
    !is.na(drink_drive),
    !is.na(age),
    !is.na(sex),
    !is.na(edu_level),
    !is.na(employment),
    !is.na(marijuana_days),
    !is.na(alcohol_days),
    !is.na(binge_days)
  )

Now, let’s quickly review our new dataset:

# Reviewing the distribution after filtering
table(substance_use_model_data$drink_drive)

    0     1 
29184  2971 
# Reviewing the proportions after filtering
prop.table(table(substance_use_model_data$drink_drive))

         0          1 
0.90760379 0.09239621 

After filtering for complete cases across the outcome and selected predictors, we retained 32,155 observations (29,184 respondents who did not report driving under the influence and 2,971 who did). Note that this filter only removes values already stored as NA; special survey codes in the numeric variables are still present at this point and are handled in the data cleaning step that follows.
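
When filtering for complete cases, it can help to see how much each variable contributes to the missingness. The sketch below uses a small made-up data frame (`toy`, not NSDUH data) to show one way to tally missing values per column and count complete rows in base R.

```r
# Toy data standing in for the survey variables (values are made up)
toy <- data.frame(
  drink_drive = c(1, 0, NA, 0, 1),
  binge_days  = c(2, NA, 0, 5, 1),
  sex         = factor(c("Male", "Female", "Female", NA, "Male"))
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
missing_per_column

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))
n_complete
```

The same two calls could be applied to the survey columns of interest before deciding which predictors to keep.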

2.2 Data Cleaning

Before moving on to our summary statistics, let’s scan all variables for out-of-range or specially coded values that could represent skipped questions or missing data.

Categorical Variables:

# Outlining the unique values for categorical/factor-like variables
sort(unique(substance_use_model_data$sex))
[1] Male   Female
Levels: Male Female
sort(unique(substance_use_model_data$age))
 [1] 14-15 16-17 18-20 21-23 24-25 26-29 30-34 35-49 50-64 65+  
Levels: 12-13 14-15 16-17 18-20 21-23 24-25 26-29 30-34 35-49 50-64 65+
sort(unique(substance_use_model_data$employment))
[1] Employed           Unemployed         Not in labor force
Levels: Employed Unemployed Not in labor force
sort(unique(substance_use_model_data$edu_level))
[1] Less than HS    HS Grad         Some College/AA College+       
Levels: Less than HS HS Grad Some College/AA College+
sort(unique(substance_use_model_data$drink_drive))
[1] 0 1

Continuous Variables:

# Outlining the full range for the continuous variables
range(substance_use_model_data$marijuana_days, na.rm = TRUE)
[1]   1 998
range(substance_use_model_data$alcohol_days, na.rm = TRUE)
[1]   1 997
range(substance_use_model_data$binge_days, na.rm = TRUE)
[1]  0 98

As shown, marijuana_days, alcohol_days, and binge_days contain out-of-range values. Values such as 997 and 998 exceed the 365 days in a year, and binge_days reaches 98 despite covering only a 30-day window. The NSDUH dataset uses high-valued codes like these to indicate skips, refusals, or other non-substantive responses. The range output shows no infinite values for binge_days, so the is.infinite() check in the cleaning code below is only a safeguard.

To resolve this, we can re-code these values as NA so they do not affect our analysis. Because this step runs after the complete-case filter, it introduces new missing values, which glm() later drops when fitting the models:

# Cleaning the marijuana_days variable
substance_use_model_data <- substance_use_model_data %>%
  mutate(marijuana_days = ifelse(marijuana_days > 365 | marijuana_days %in% c(998), NA, marijuana_days))

# Cleaning the alcohol_days variable
substance_use_model_data <- substance_use_model_data %>%
  mutate(alcohol_days = ifelse(alcohol_days > 365 | alcohol_days %in% c(997), NA, alcohol_days))

# Cleaning the binge_days variable
substance_use_model_data <- substance_use_model_data %>%
  mutate(binge_days = ifelse(is.infinite(binge_days) | binge_days > 30, NA, binge_days))
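
The three cleaning steps above follow the same pattern, so they could also be expressed with one small helper. The sketch below is illustrative only: `set_invalid_to_na()` is a hypothetical function, and the example vectors are made up rather than taken from NSDUH.

```r
# Hypothetical helper: set any value outside [min_valid, max_valid] to NA.
# Infinite values fall outside the range, so they become NA as well.
set_invalid_to_na <- function(x, min_valid, max_valid) {
  x[!is.na(x) & (x < min_valid | x > max_valid)] <- NA
  x
}

# Made-up example values, including special codes like those seen above
days_year  <- c(1, 52, 365, 997, 998)
days_month <- c(0, 3, 30, 98, Inf)

cleaned_year  <- set_invalid_to_na(days_year, 0, 365)
cleaned_month <- set_invalid_to_na(days_month, 0, 30)
cleaned_year
cleaned_month
```

Stating the valid range explicitly makes the rule easy to audit against the codebook.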

2.3 Summary Statistics

Next, we can run some basic summary statistics to get a better view of the data and note any particular trends or findings we may want to consider in our analysis.

2.3.1 Outcome Variable: drink_drive

# Creating a bar plot for drink_drive
ggplot(substance_use_model_data, aes(x = factor(drink_drive, labels = c("No", "Yes")))) +
  geom_bar(fill = "firebrick4") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, size = 3) +
  labs(
    title = "Self-Reported Driving Under the Influence of Alcohol",
    x = "Drove Under the Influence (Past Year)",
    y = "Number of Respondents"
  ) +
  theme_minimal()

2.3.2 Age Group: age

# Creating a bar plot for age
ggplot(substance_use_model_data, aes(x = age)) +
  geom_bar(fill = "bisque4") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, size = 3) +
  labs(
    title = "Age Group Distribution",
    x = "Age Group",
    y = "Number of Respondents"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

2.3.3 Sex: sex

# Creating a bar plot for sex
ggplot(substance_use_model_data, aes(x = factor(sex, labels = c("Male", "Female")))) +
  geom_bar(fill = "darkorchid4") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, size = 3.5) +
  labs(
    title = "Sex Distribution",
    x = "Sex",
    y = "Count"
  ) +
  theme_minimal()

2.3.4 Education: edu_level

# Creating a bar plot for edu_level
ggplot(substance_use_model_data, aes(x = edu_level)) +
  geom_bar(fill = "darkslategrey") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, size = 2) +
  labs(
    title = "Education Level of Respondents",
    x = "Education Level",
    y = "Count"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

2.3.5 Employment: employment

# Creating a bar plot for employment
ggplot(substance_use_model_data, aes(x = employment)) +
  geom_bar(fill = "lightpink4") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, size = 2) +
  labs(
    title = "Employment Status of Respondents",
    x = "Employment Status",
    y = "Count"
  ) +
  theme_minimal()

2.3.6 Marijuana Use: marijuana_days

# Generating summary stats for marijuana_days
summary(substance_use_model_data$marijuana_days)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    1.0    10.0    95.0   144.3   300.0   365.0   20413 
sd(substance_use_model_data$marijuana_days, na.rm = TRUE)
[1] 141.7845
# Creating a histogram for marijuana_days
ggplot(substance_use_model_data, aes(x = marijuana_days)) +
  geom_histogram(fill = "forestgreen", bins = 30) +
  labs(
    title = "Distribution of Marijuana Use (Days per Year)",
    x = "Days Used Marijuana in Past Year",
    y = "Number of Respondents"
  ) +
  theme_minimal()

2.3.7 Alcohol Use: alcohol_days

# Generating summary stats for alcohol_days
summary(substance_use_model_data$alcohol_days)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1.00   12.00   48.00   79.35  104.00  365.00     257 
sd(substance_use_model_data$alcohol_days, na.rm = TRUE)
[1] 90.38573
# Creating a histogram for alcohol_days
ggplot(substance_use_model_data, aes(x = alcohol_days)) +
  geom_histogram(fill = "steelblue", bins = 30) +
  labs(
    title = "Distribution of Alcohol Use (Past Year)",
    x = "Days Used Alcohol in Past Year",
    y = "Number of Respondents"
  ) +
  theme_minimal()

2.3.8 Binge Drinking: binge_days

# Generating summary stats for binge_days
summary(substance_use_model_data$binge_days)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   0.000   0.000   2.141   2.000  30.000    8628 
sd(substance_use_model_data$binge_days, na.rm = TRUE)
[1] 4.494708
# Creating a histogram for binge_days
ggplot(substance_use_model_data, aes(x = binge_days)) +
  geom_histogram(fill = "indianred4", bins = 30) +
  labs(
    title = "Distribution of Binge Drinking (Past 30 Days)",
    x = "Binge Drinking Days (Past 30 Days)",
    y = "Number of Respondents"
  ) +
  theme_minimal()

3 Model Selection

3.0.1 Full Model

Now, we can begin the process of building our model. This can be done by first outlining the full model, containing all of the predictors we are interested in analyzing for our report.

The full model will include age, sex, edu_level, employment, marijuana_days, alcohol_days, and binge_days.
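
Written out, the model being fitted has the general form below. This is a sketch in notation introduced here for illustration: \(p_i\) denotes the probability that respondent \(i\) reports driving under the influence, and \(x_{ij}\) are the predictor columns (dummy variables for the categorical predictors).

```latex
\[
\log\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij},
\qquad p_i = P(\text{drink\_drive}_i = 1)
\]
```

Each \(\beta_j\) is a change in log-odds, so \(e^{\beta_j}\) is the corresponding odds ratio.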

# Generating the full model with all predictors
full_model <- glm(drink_drive ~ age + sex + edu_level + employment + marijuana_days + alcohol_days + binge_days, data = substance_use_model_data, family = binomial)

# Outlining the full model
summary(full_model)

Call:
glm(formula = drink_drive ~ age + sex + edu_level + employment + 
    marijuana_days + alcohol_days + binge_days, family = binomial, 
    data = substance_use_model_data)

Coefficients:
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  -2.2385533  0.4354719  -5.141 2.74e-07 ***
age16-17                      0.3800072  0.4682059   0.812 0.417007    
age18-20                      0.1863738  0.4587349   0.406 0.684539    
age21-23                     -0.0976258  0.4559279  -0.214 0.830449    
age24-25                      0.0458616  0.4589095   0.100 0.920395    
age26-29                     -0.2007250  0.4599995  -0.436 0.662576    
age30-34                     -0.1723373  0.4578464  -0.376 0.706613    
age35-49                     -0.1579920  0.4550952  -0.347 0.728469    
age50-64                     -0.0435211  0.4649984  -0.094 0.925432    
age65+                       -0.3605761  0.4840207  -0.745 0.456296    
sexFemale                    -0.4227684  0.0598654  -7.062 1.64e-12 ***
edu_levelHS Grad              0.1135641  0.1456435   0.780 0.435544    
edu_levelSome College/AA      0.5125278  0.1410381   3.634 0.000279 ***
edu_levelCollege+             0.7599047  0.1435514   5.294 1.20e-07 ***
employmentUnemployed         -0.7375579  0.1378221  -5.352 8.72e-08 ***
employmentNot in labor force -0.4998876  0.0900559  -5.551 2.84e-08 ***
marijuana_days               -0.0005017  0.0002171  -2.311 0.020815 *  
alcohol_days                  0.0038247  0.0003458  11.059  < 2e-16 ***
binge_days                    0.0526844  0.0054169   9.726  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8198.2  on 9021  degrees of freedom
Residual deviance: 7520.8  on 9003  degrees of freedom
  (23133 observations deleted due to missingness)
AIC: 7558.8

Number of Fisher Scoring iterations: 5

As shown above in the model summary, none of the age categories were statistically significant, and neither was edu_levelHS Grad (each compared with its reference level). However, individual p-values alone are not a sound basis for removing variables from the model, so we’ll use a formal model selection process: stepwise selection.

3.0.2 Stepwise Selection

Next, we can conduct a stepwise model selection using the stepAIC() function. This will allow us to generate a new reduced model that fits only the predictors most useful to the model’s predictive power.

# Conducting a model selection process using stepAIC()
stepAIC(full_model)
Start:  AIC=7558.81
drink_drive ~ age + sex + edu_level + employment + marijuana_days + 
    alcohol_days + binge_days

                 Df Deviance    AIC
- age             9   7538.5 7558.5
<none>                7520.8 7558.8
- marijuana_days  1   7526.2 7562.2
- sex             1   7571.0 7607.0
- employment      2   7578.9 7612.9
- edu_level       3   7587.7 7619.7
- binge_days      1   7614.7 7650.7
- alcohol_days    1   7639.7 7675.7

Step:  AIC=7558.48
drink_drive ~ sex + edu_level + employment + marijuana_days + 
    alcohol_days + binge_days

                 Df Deviance    AIC
<none>                7538.5 7558.5
- marijuana_days  1   7544.4 7562.4
- sex             1   7585.8 7603.8
- edu_level       3   7594.5 7608.5
- employment      2   7602.0 7618.0
- binge_days      1   7635.5 7653.5
- alcohol_days    1   7650.1 7668.1

Call:  glm(formula = drink_drive ~ sex + edu_level + employment + marijuana_days + 
    alcohol_days + binge_days, family = binomial, data = substance_use_model_data)

Coefficients:
                 (Intercept)                     sexFemale  
                  -2.1414273                    -0.4093347  
            edu_levelHS Grad      edu_levelSome College/AA  
                  -0.0176445                     0.3552848  
           edu_levelCollege+          employmentUnemployed  
                   0.5566774                    -0.7212350  
employmentNot in labor force                marijuana_days  
                  -0.5263620                    -0.0005249  
                alcohol_days                    binge_days  
                   0.0035793                     0.0531265  

Degrees of Freedom: 9021 Total (i.e. Null);  9012 Residual
  (23133 observations deleted due to missingness)
Null Deviance:      8198 
Residual Deviance: 7538     AIC: 7558

As shown above, following the stepwise selection process, the variable age was dropped from the model.

The resulting reduced model included sex, edu_level, employment, marijuana_days, alcohol_days, and binge_days. This reduced model has an AIC of 7558.48, a negligible improvement over the full model (AIC = 7558.81). The two models fit about equally well, but the reduced model uses nine fewer coefficients.
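
The two AIC values can be reproduced by hand from the reported deviances. The sketch below assumes the usual relationship for a binary (0/1) outcome, where AIC equals the residual deviance plus twice the number of estimated coefficients; the coefficient counts (19 and 10) are read off the two coefficient tables.

```r
# Residual deviances and coefficient counts read from the output above
dev_full    <- 7520.8
k_full      <- 19
dev_reduced <- 7538.5
k_reduced   <- 10

# For a binary (0/1) outcome, AIC = residual deviance + 2 * number of coefficients
aic_full    <- dev_full + 2 * k_full
aic_reduced <- dev_reduced + 2 * k_reduced

c(full = aic_full, reduced = aic_reduced, difference = aic_full - aic_reduced)
```

Dropping age raises the deviance by about 17.7 but saves 18 points of penalty, which is why the AIC difference is only about 0.3.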

3.1 Final Reduced Model

# Generating the final reduced model
final_reduced_model <- glm(drink_drive ~ sex + edu_level + employment + marijuana_days + alcohol_days + binge_days, data = substance_use_model_data, family = binomial)

# Outlining the final reduced model
summary(final_reduced_model)

Call:
glm(formula = drink_drive ~ sex + edu_level + employment + marijuana_days + 
    alcohol_days + binge_days, family = binomial, data = substance_use_model_data)

Coefficients:
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  -2.1414273  0.1202664 -17.806  < 2e-16 ***
sexFemale                    -0.4093347  0.0596762  -6.859 6.92e-12 ***
edu_levelHS Grad             -0.0176445  0.1248941  -0.141  0.88765    
edu_levelSome College/AA      0.3552848  0.1176211   3.021  0.00252 ** 
edu_levelCollege+             0.5566774  0.1176148   4.733 2.21e-06 ***
employmentUnemployed         -0.7212350  0.1375610  -5.243 1.58e-07 ***
employmentNot in labor force -0.5263620  0.0857782  -6.136 8.45e-10 ***
marijuana_days               -0.0005249  0.0002161  -2.429  0.01515 *  
alcohol_days                  0.0035793  0.0003326  10.762  < 2e-16 ***
binge_days                    0.0531265  0.0053744   9.885  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8198.2  on 9021  degrees of freedom
Residual deviance: 7538.5  on 9012  degrees of freedom
  (23133 observations deleted due to missingness)
AIC: 7558.5

Number of Fisher Scoring iterations: 5

For each variable in the model, we conducted a Wald test to analyze its contribution to predicting self-reported impaired driving.

Detailed below are the hypotheses tested for each variable:

  • \(H_0: \beta = 0 \text{ (the predictor is not associated with the odds of driving under the influence)}\)

  • \(H_a: \beta \neq 0 \text{ (the predictor is associated with the odds of driving under the influence)}\)

  • \(\alpha = 0.05\)
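
As a minimal sketch of how the Wald statistic is formed, the z value and two-sided p-value for sexFemale can be recomputed from the estimate and standard error reported in the summary above. The numbers are copied by hand, so small rounding differences from R's output are expected.

```r
# Estimate and standard error for sexFemale, copied from the summary above
estimate  <- -0.4093347
std_error <-  0.0596762

# Wald z statistic and two-sided p-value under the null hypothesis beta = 0
z_value <- estimate / std_error
p_value <- 2 * pnorm(-abs(z_value))

round(z_value, 3)
p_value
```

The result should agree with the z value and Pr(>|z|) columns of the coefficient table.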

Reviewing the summary of the final reduced model, we note the following regarding each of our predictors:

  • Sex: Female respondents had lower odds of reporting driving under the influence compared to males (p < 0.001).

  • Education:

    • HS Grad was not statistically significant (p = 0.888).

    • Some College/AA and College+ were both associated with higher odds of drinking and driving, with p-values < 0.01.

  • Employment:

    • Both Unemployed and Not in labor force were significantly associated with lower odds of drinking and driving compared to those employed (p < 0.001).

  • Marijuana Use: Each additional day of marijuana use was associated with a small but statistically significant decrease in the odds of drinking and driving (p = 0.015).

  • Alcohol Use: More days of alcohol use in the past year were associated with a higher likelihood of drinking and driving (p < 0.001).

  • Binge Drinking: Each additional binge drinking day in the past 30 days was also associated with significantly increased odds of driving under the influence (p < 0.001).
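
The coefficients behind these statements are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. The sketch below does this for a few coefficients copied by hand from the final reduced model summary; the 30-day contrast for alcohol_days is an illustrative choice, not part of the original analysis.

```r
# Coefficients copied from the final reduced model summary
coefs <- c(
  sexFemale            = -0.4093347,
  "edu_levelCollege+"  =  0.5566774,
  employmentUnemployed = -0.7212350,
  alcohol_days         =  0.0035793,
  binge_days           =  0.0531265
)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratios <- exp(coefs)
round(odds_ratios, 3)

# Odds ratio for a 30-day increase in alcohol_days (illustrative contrast)
or_30_alcohol_days <- exp(30 * coefs[["alcohol_days"]])
round(or_30_alcohol_days, 3)
```

Read this way, female respondents have roughly 0.66 times the odds of male respondents, and each additional binge day multiplies the odds by roughly 1.055, holding the other predictors fixed.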

Although the reduced model’s AIC (7558.48) is only marginally lower than the full model’s, we will move forward with the reduced model because it achieves essentially the same fit with fewer parameters. Its predictive performance is examined in the next section.

4 Model Reliability

4.1 Accuracy, Sensitivity, and Specificity

To assess the model’s predictive performance, let us first generate a confusion matrix to compare predicted vs. actual outcomes, and then calculate accuracy, sensitivity (True Positive Rate), and specificity (True Negative Rate).

4.1.1 Setup

# Generating the predicted probabilities from reduced model
phat <- predict(final_reduced_model, type = "response")

# Pulling the actual outcome variable using model.frame() to ensure correct length
actual <- model.frame(final_reduced_model)$drink_drive

# Generating the predicted classifications
yhat <- ifelse(phat >= 0.5, 1, 0)

4.1.2 Confusion Matrix

# Outlining the confusion matrix
table_pred_actual <- table(Predicted = yhat, Actual = actual)
table_pred_actual
         Actual
Predicted    0    1
        0 7390 1444
        1  107   81

Using our confusion matrix, we can calculate the accuracy, sensitivity, and specificity:

4.1.3 Accuracy

# Calculating accuracy
accuracy <- sum(diag(table_pred_actual)) / sum(table_pred_actual)

# Outlining the results
accuracy
[1] 0.8280869

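Here, we can see that the model correctly classified about 82.8% of all cases. This figure needs context, however: 7,497 of the 9,022 respondents in the model sample (83.1%) did not report driving under the influence, so a rule that always predicts “No” would score slightly higher. Accuracy alone therefore says little about how well the model performs.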
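
A common reference point for accuracy is the no-information rate: the accuracy obtained by always predicting the most frequent class. The sketch below computes it from the confusion matrix counts reported above, re-entered by hand.

```r
# Confusion matrix counts copied from the output above (filled column by column)
confusion <- matrix(
  c(7390, 107, 1444, 81),
  nrow = 2,
  dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1"))
)

# Accuracy of the fitted model
model_accuracy <- sum(diag(confusion)) / sum(confusion)

# Accuracy from always predicting the majority class ("0")
no_information_rate <- max(colSums(confusion)) / sum(confusion)

round(c(model = model_accuracy, baseline = no_information_rate), 4)
```

When the two numbers are this close, sensitivity, specificity, and AUC are more informative than accuracy.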

4.1.4 Sensitivity

# Calculating sensitivity
sensitivity <- table_pred_actual["1", "1"] / sum(table_pred_actual[, "1"])

# Outlining the results
sensitivity
[1] 0.05311475

As shown here, among all the people who actually drove under the influence, the model correctly identified only 5.3% of them. This value is extremely low and shows us that the model struggles to accurately detect true positives.

4.1.5 Specificity

# Calculating specificity
specificity <- table_pred_actual["0", "0"] / sum(table_pred_actual[, "0"])

# Outlining the results
specificity
[1] 0.9857276

As shown here, among all the people who did not drive under the influence, the model correctly classified about 98.6%, showcasing a very strong ability for the model to detect true negatives.

4.1.6 Conclusion

Given our findings from the analysis above, it appears that our model is extremely conservative. While it does a great job of identifying those who did not drink and drive, it misses the large majority of those who did. A likely contributor is the 0.5 classification threshold: only about 17% of respondents in the model sample reported drinking and driving, so predicted probabilities rarely exceed 0.5. Underreporting of drinking and driving in a self-report survey could also play a role. We’ll need to analyze the model’s performance further, independent of any single threshold, before drawing conclusions.
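
Sensitivity and specificity depend on the cutoff used to turn predicted probabilities into classes. The sketch below illustrates this on simulated data (not NSDUH) with a rare outcome, comparing the 0.5 cutoff with a cutoff set at the observed prevalence; the helper `classify_metrics()` and all variable names are hypothetical.

```r
# Simulated data (not NSDUH): one predictor and a rare binary outcome
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-2 + x))

# Fit a logistic regression and get predicted probabilities
fit <- glm(y ~ x, family = binomial)
phat_sim <- predict(fit, type = "response")

# Sensitivity and specificity at a given classification threshold
classify_metrics <- function(threshold) {
  yhat_sim <- as.integer(phat_sim >= threshold)
  c(
    threshold   = threshold,
    sensitivity = sum(yhat_sim == 1 & y == 1) / sum(y == 1),
    specificity = sum(yhat_sim == 0 & y == 0) / sum(y == 0)
  )
}

# Compare the 0.5 threshold with a threshold at the observed prevalence
m_default <- classify_metrics(0.5)
m_low     <- classify_metrics(mean(y))
rbind(m_default, m_low)
```

Lowering the cutoff can only keep or raise sensitivity, at the cost of specificity, so the choice depends on which type of error matters more.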

4.2 ROC Curve and AUC

Next, we’ll observe the model’s Receiver Operating Characteristic (ROC) Curve and Area Under the Curve (AUC) value to get a better idea of the model’s overall performance and predictive capability. The ROC Curve will help us to determine how well our model separates those who did and did not engage in drinking and driving. Here, a model with perfect separation would hug the top-left corner of the plot, whereas a model closer to the diagonal line would suggest the model is no better than random guessing. The AUC helps us to quantify this, where values near 0.5 indicate weak predictive power and values close to 1.0 indicate a stronger model performance.
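
One way to build intuition for the AUC is to read it as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below implements that definition directly on made-up scores; `auc_by_pairs()` is a hypothetical helper written for illustration and is not how pROC computes the value internally.

```r
# AUC as the share of (positive, negative) pairs in which the positive case
# has the higher score; tied pairs count as one half
auc_by_pairs <- function(scores, outcome) {
  pos <- scores[outcome == 1]
  neg <- scores[outcome == 0]
  comparisons <- outer(pos, neg, function(p, q) (p > q) + 0.5 * (p == q))
  mean(comparisons)
}

# Made-up outcomes and scores
outcome        <- c(0, 0, 0, 1, 1)
scores_perfect <- c(0.1, 0.2, 0.3, 0.8, 0.9)  # every positive outranks every negative
scores_flat    <- rep(0.5, 5)                 # identical scores carry no information

auc_perfect <- auc_by_pairs(scores_perfect, outcome)  # 1
auc_flat    <- auc_by_pairs(scores_flat, outcome)     # 0.5
c(perfect = auc_perfect, flat = auc_flat)
```

The pairwise comparison grows with the product of the two group sizes, so it suits small illustrations rather than full survey samples.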

# Generating the ROC curve
roc_curve <- roc(actual, phat)

# Plotting the ROC curve
plot(roc_curve, col = "dodgerblue", lwd = 2, main = "ROC Curve for Final Reduced Model")
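# pROC plots specificity on a reversed x-axis, so the chance line is sensitivity = 1 - specificity
abline(a = 1, b = -1, lty = 2, col = "gray40")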

# Calculating the AUC
auc_value <- auc(roc_curve)
auc_value
Area under the curve: 0.7057

As we can see, the ROC curve for our model sits above the diagonal, suggesting that the model has some ability to distinguish between the two groups. The AUC of 0.706 confirms that the model performs better than random guessing, although this reflects moderate rather than strong discrimination.

5 Conclusions

In this report, we worked with the 2023 NSDUH dataset to identify key demographic and behavioral factors associated with self-reported driving under the influence of alcohol. Following an exploratory analysis of all considered variables, as well as a stepwise model selection process to determine the final model, we arrived at a final reduced model containing the following six predictors:

  • Sex: sex

  • Education level: edu_level

  • Employment status: employment

  • Marijuana use: marijuana_days

  • Alcohol use: alcohol_days

  • Binge drinking frequency: binge_days

As a result of our analysis, we found that females had significantly lower odds of driving under the influence compared to males. We also observed that respondents with some college or a college degree had higher odds of drinking and driving compared to those with less education. Meanwhile, individuals who were unemployed or not in the labor force had lower odds than employed individuals. Our findings also suggested that more days of alcohol use and more frequent binge drinking were strongly associated with an increased likelihood of impaired driving. Lastly, we observed that greater marijuana use appeared to be linked to slightly lower odds of drinking and driving.

In terms of performance, the model showed high specificity (98.6%) but struggled to identify individuals who actually engaged in impaired driving (sensitivity = 5.3%). Its overall accuracy (82.8%) is close to what would be obtained by classifying every respondent as not having driven under the influence, so accuracy should be interpreted with caution. The Area Under the Curve (AUC = 0.706) indicated that the model performs better than random guessing, though the low sensitivity at the 0.5 threshold remains a notable limitation. As an extension of this analysis, future analyses might include roadside data, such as crash and toxicology reports, allowing for a more precise analysis.

In summary, it is important to note that education and prevention efforts play a crucial role in reducing impaired driving. It is essential that we encourage safe use of motor vehicles and roadways, continuing to highlight the dangerous outcomes associated with impaired driving. It is the hope of this analysis to outline some of the key factors associated with impaired driving, with the aim of encouraging future research in this area, contributing to the decline of DUI crashes and fatalities. With the right research and interventions, progress can continue, and lives can be saved.

6 Works Cited

  1. Agresti, A. (2019). An Introduction to Categorical Data Analysis (3rd ed.). Hoboken, NJ: Wiley.

  2. Centers for Disease Control and Prevention. (2023, August 22). Impaired driving: Get the facts. U.S. Department of Health & Human Services. https://www.cdc.gov/impaired-driving/facts/

  3. Foundation for Advancing Alcohol Responsibility. (2024). Drunk driving fatality statistics. https://www.responsibility.org/alcohol-statistics/drunk-driving-statistics/drunk-driving-fatality-statistics/

  4. National Institute on Alcohol Abuse and Alcoholism. (2023). Alcohol-related emergencies and deaths in the United States. U.S. Department of Health & Human Services. https://www.niaaa.nih.gov/alcohols-effects-health/alcohol-topics-z/alcohol-facts-and-statistics/alcohol-related-emergencies-and-deaths-united-states

  5. National Highway Traffic Safety Administration. (2024). Traffic safety facts 2022: Alcohol-impaired driving (Report No. DOT HS 813 294). U.S. Department of Transportation. https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/813294

  6. National Highway Traffic Safety Administration. (2007, April). Traffic safety facts: Alcohol-impaired driving (DOT HS 810 942). U.S. Department of Transportation. https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/810942

  7. Substance Abuse and Mental Health Services Administration. (2024). National Survey on Drug Use and Health (NSDUH), 2023 [Data set]. https://www.samhsa.gov/data/dataset/national-survey-drug-use-and-health-2023-nsduh-2023-ds0001

  8. Substance Abuse and Mental Health Services Administration. (2024). NSDUH 2023 codebook for public use file [PDF]. https://www.samhsa.gov/data/system/files/media-puf-file/NSDUH-2023-DS0001-info-codebook_v1.pdf