Multiple Linear Regression Analysis with Categorical Predictors

In our previous post, we described how to handle categorical predictors in a regression equation. If you missed it, you can read it here. In this post, we will carry out the Multiple Linear Regression Analysis on our dataset. If you don't have the dataset yet, you can download it here.

First, let's look at some scatterplots to get an idea of the relationships between the variables.

Given below is the scatterplot of charges vs. age, with the categorical variables "smoker" and "sex" used as grouping variables.

[Figure: scatterplot of charges vs. age, grouped by smoker and sex]

Here you can see a slight positive relationship between age and insurance charges; a correlation analysis of the above data gives a correlation coefficient of 0.297. Smokers, both male and female, have high insurance charges, while almost all non-smoking males and females have charges of around $10,000 or less. Only one non-smoking observation is above that level.
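As a rough sketch of how such a correlation and the smoker/non-smoker split can be checked with pandas (the few rows below are made up for illustration; only the column names `age`, `smoker`, and `charges` match the real dataset):

```python
import pandas as pd

# A few made-up rows standing in for the insurance data
# (the real file also has sex, bmi, children, and region columns).
df = pd.DataFrame({
    "age":     [19, 33, 28, 45, 52, 23, 61, 37],
    "smoker":  ["yes", "no", "no", "yes", "no", "no", "yes", "no"],
    "charges": [16885.0, 4450, 3867, 21984, 9900, 2200, 28950, 6400],
})

# Pearson correlation between age and charges (0.297 on the real data)
print(df["age"].corr(df["charges"]).round(3))

# Mean charges by smoking status: smokers are far more expensive
print(df.groupby("smoker")["charges"].mean())
```

On the real file you would load the data with `pd.read_csv` instead of building the frame by hand.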

Now let's generate the scatterplot of charges vs. BMI, again grouped by the categorical variables "smoker" and "sex".

[Figure: scatterplot of charges vs. BMI, grouped by smoker and sex]

Here there is little apparent relationship between BMI and charges; the data points are scattered widely, and the correlation analysis gives a coefficient of only 0.2.

The scatterplot below shows the relationship between age and BMI.

[Figure: scatterplot of BMI vs. age]

We can clearly see that there is no meaningful relationship between the two variables; the data points are scattered everywhere, and the correlation coefficient of 0.112 supports this claim.

Okay, now let’s jump into the Regression Analysis.

We first conduct a simple linear regression analysis of the dependent variable on each independent variable, and then move on to the full regression model.
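As an illustration of a single simple linear regression fit, here is a minimal numpy sketch; the (age, charges) pairs below are made up, not taken from the dataset:

```python
import numpy as np

# Made-up (age, charges) pairs; the real analysis uses the full dataset
age = np.array([19, 23, 28, 33, 37, 45, 52, 61], dtype=float)
charges = np.array([2200, 2900, 3867, 4450, 6400, 7100, 9900, 11200], dtype=float)

# Least-squares fit of charges = intercept + slope * age
slope, intercept = np.polyfit(age, charges, deg=1)

# R-squared of this simple model
pred = intercept + slope * age
ss_res = np.sum((charges - pred) ** 2)
ss_tot = np.sum((charges - charges.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(slope, 1), round(r2, 3))
```

The same fit (with full ANOVA output, including the S and R-sq values reported below) is what a statistics package such as Minitab or statsmodels produces for each candidate model.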

Since displaying all of the regression output would take too much space, I have summarized the results in the following table.

| Model            | F-Value | P-Value | S       | R-sq (adj) | R-sq (pred) |
|------------------|---------|---------|---------|------------|-------------|
| age              | 2.61    | 0.118   | 11271.2 | 5.44%      | 0%          |
| bmi              | 1.43    | 0.243   | 11503.5 | 1.50%      | 0%          |
| bmi, age         | 1.86    | 0.176   | 11251.2 | 5.77%      | 0%          |
| bmi, sex         | 1.58    | 0.225   | 11356.8 | 4.00%      | 0%          |
| age, sex         | 1.33    | 0.282   | 11456.9 | 2.30%      | 0%          |
| age, smoke       | 39.88   | 0       | 5963.69 | 73.53%     | 64.03%      |
| bmi, smoke       | 20.93   | 0       | 7445.25 | 58.74%     | 50.98%      |
| bmi, smoke, age  | 25.60   | 0       | 6078.94 | 72.49%     | 59.27%      |
| bmi, age, sex    | 1.28    | 0.303   | 11421   | 2.91%      | 0%          |
| full model       | 19.71   | 0       | 6048.29 | 72.71%     | 57.10%      |
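To show how two of these models can be compared on R-sq (adj), here is a minimal numpy sketch of ordinary least squares with an adjusted R-squared calculation. The data is synthetic (generated so that charges depend on age and smoking, roughly mimicking the real pattern), so the numbers will not match the table:

```python
import numpy as np

def adjusted_r2(X, y):
    """Fit OLS by least squares; return (R-squared, adjusted R-squared)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])    # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = resid @ resid
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    p = A.shape[1] - 1                      # number of predictors
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

# Synthetic stand-in for age, smoker (coded 0/1), and charges
rng = np.random.default_rng(0)
age = rng.uniform(18, 65, 50)
smoker = rng.integers(0, 2, 50).astype(float)
charges = 250 * age + 24000 * smoker + rng.normal(0, 3000, 50)

r2_age, adj_age = adjusted_r2(age.reshape(-1, 1), charges)
r2_full, adj_full = adjusted_r2(np.column_stack([age, smoker]), charges)
print(round(adj_age, 3), round(adj_full, 3))  # age+smoker fits far better here
```

Because adjusted R-squared penalizes extra predictors, it can be compared across models with different numbers of terms, which is exactly how the table above is read.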

From the above table, you can see that the single-variable models are out of the question. The age, smoke model is significant and has the lowest S (standard error of the regression), as well as the highest R-sq (adj) and R-sq (pred) values; those values are even better than the full regression model's. Therefore we can conclude that the regression equation with the age and smoke variables is the best model for this data.
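The R-sq (pred) column used in this comparison is the predicted R-squared, computed from the PRESS statistic (leave-one-out prediction errors). A minimal numpy sketch of that computation, again on synthetic stand-in data rather than the real dataset:

```python
import numpy as np

def predicted_r2(X, y):
    """Predicted R-squared via PRESS: each ordinary residual is scaled
    by 1/(1 - h_ii), giving the leave-one-out residual, where h_ii is
    the i-th diagonal of the hat matrix."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    H = A @ np.linalg.pinv(A)               # hat (projection) matrix
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    loo_resid = (y - A @ beta) / (1 - np.diag(H))
    press = np.sum(loo_resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - press / ss_tot

# Synthetic age/smoker/charges data in place of the real dataset
rng = np.random.default_rng(1)
age = rng.uniform(18, 65, 60)
smoker = rng.integers(0, 2, 60).astype(float)
charges = 250 * age + 24000 * smoker + rng.normal(0, 3000, 60)

X = np.column_stack([age, smoker])
print(round(predicted_r2(X, charges), 3))
```

A predicted R-squared well below the ordinary R-squared is a warning that the model is overfitting, which is why the 0% entries in the table rule those models out for prediction.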
