
What are dummy variables in regression?

We did a basic Multiple Linear Regression Analysis in our previous post. If you missed that, please click the link and refer to it. Here we go one step forward by adding categorical variables to the regression equation. Let’s see how to do that!

What is a Categorical Variable?

A Categorical Variable has two or more categories. Such variables are also called Qualitative Variables. Examples of Categorical Variables are “gender” and “marital status”.

What is a Dummy Variable?

When we have one or more Categorical Variables in our regression equation, we express them as “Dummy Variables”. A variable with n categories always needs (n-1) dummy variables, because the remaining category serves as the baseline: it is identified by all of the dummy variables being 0. Dummy Variables are also called “Indicator Variables”.

Example of a Dummy Variable

Say we have the categorical variable “Gender” in our regression equation. We can represent this as 0 for Male and 1 for Female.
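The (n-1) rule and the Gender example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `dummy_code`, the choice of baseline, and the sample values are hypothetical and do not come from the insurance dataset.

```python
def dummy_code(values, baseline):
    """Return one 0/1 indicator column per category, leaving out `baseline`.

    A variable with n categories therefore yields n - 1 columns.
    """
    categories = sorted(set(values))
    if baseline not in categories:
        raise ValueError(f"unknown baseline category: {baseline!r}")
    kept = [c for c in categories if c != baseline]
    return {c: [1 if v == c else 0 for v in values] for c in kept}


# Hypothetical sample: Male is the baseline (0), Female is coded 1.
gender = ["Male", "Female", "Female", "Male"]
gender_dummies = dummy_code(gender, baseline="Male")
print(gender_dummies)  # {'Female': [0, 1, 1, 0]}
```

With two categories we get a single indicator column, which matches the 0 for Male and 1 for Female coding described above.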

[Figure: coding a categorical variable into an indicator variable]

Let’s jump into our problem

The dataset that we are going to use here is called “insurance”. You can download the dataset from here. We are interested in predicting insurance charges. The factors most likely to determine the insurance charges for an individual are listed below.

We have 6 Independent Variables here.

Variable    Type (no. of categories)
Age         Quantitative
Sex         Categorical (2)
BMI         Quantitative
Children    Quantitative
Smoker      Categorical (2)
Region      Categorical (4)

The table above lists all the Predictor Variables and their types. The number of categories is given in parentheses after each categorical variable.

[Figure: First 10 Rows of the Dataset, before coding the indicator variables]

Let’s code each categorical variable into indicator (dummy) variables

Sex

Male -> 0
Female -> 1

Smoker

No -> 0
Yes -> 1
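The two binary codings above can be applied with simple lookup tables. The following is a minimal sketch, assuming the raw file spells the categories exactly as this post does ("Male", "Female", "Yes", "No"); the actual spelling in the downloaded file may differ. The helper `encode_row` is a hypothetical name used only for illustration.

```python
# Lookup tables that mirror the coding described above.
SEX_CODES = {"Male": 0, "Female": 1}
SMOKER_CODES = {"No": 0, "Yes": 1}


def encode_row(row):
    """Return a copy of `row` with sex and smoker replaced by 0/1 codes."""
    coded = dict(row)
    coded["sex"] = SEX_CODES[row["sex"]]
    coded["smoker"] = SMOKER_CODES[row["smoker"]]
    return coded


# Two illustrative rows (only three of the columns are shown).
rows = [
    {"age": 19, "sex": "Female", "smoker": "Yes"},
    {"age": 18, "sex": "Male", "smoker": "No"},
]
coded_rows = [encode_row(r) for r in rows]
print(coded_rows[0])  # {'age': 19, 'sex': 1, 'smoker': 1}
```

A dictionary lookup raises a `KeyError` on an unexpected label, which is a convenient way to catch misspelled categories before running the regression.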

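Region

We have four categories here (n = 4). Therefore we need (4-1) = 3 Indicator Variables.

As a rule of thumb, we code the categories in alphabetical order. The codes below label the four regions; the three Indicator Variables are built from these labels, with one region serving as the baseline.

Region        Code
North East    0
North West    1
South East    2
South West    3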
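To make the step from a four-valued region code to three Indicator Variables concrete, here is a small Python sketch. It assumes that North East (code 0) is the baseline, which is a choice made for this illustration and not something the post states; the indicator names `region_nw`, `region_se` and `region_sw` are likewise hypothetical.

```python
# Region codes in alphabetical order, as in the table.
REGION_CODES = {"North East": 0, "North West": 1, "South East": 2, "South West": 3}


def region_indicators(code):
    """Expand a region code (0-3) into three 0/1 indicators.

    Code 0 (North East) is the assumed baseline: all three indicators are 0.
    """
    if code not in REGION_CODES.values():
        raise ValueError(f"unknown region code: {code!r}")
    return {
        "region_nw": int(code == 1),
        "region_se": int(code == 2),
        "region_sw": int(code == 3),
    }


print(region_indicators(REGION_CODES["South West"]))
# {'region_nw': 0, 'region_se': 0, 'region_sw': 1}
```

At most one of the three indicators is 1 for any row, so the four regions stay distinguishable without a fourth column.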

age  sex  bmi     children  smoker  region
19   1    27.9    0         1       3
18   0    33.77   1         0       2
28   0    33      3         0       2
33   0    22.705  0         0       1
32   0    28.88   0         0       1
31   1    25.74   0         0       2
46   1    33.44   1         0       2
37   1    27.74   3         0       1
37   0    29.83   2         0       0
60   1    25.84   0         0       1

Number of Possible Regression Equations

For a moment, assume that there is only one Categorical Predictor variable named ‘Gender’ and One Quantitative Predictor named ‘Age’.

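Our regression equation looks as follows:

y = β0 + β1 × Age + β2 × Gender

Here y is the response, and β1 and β2 are the coefficients of Age and Gender.

Since Gender can only take the values 0 and 1, whenever Gender = 0 the regression equation consists of only β0 and β1, that is, y = β0 + β1 × Age. Whenever Gender = 1, the intercept shifts to (β0 + β2), giving y = (β0 + β2) + β1 × Age. So one categorical variable with two categories gives two regression equations.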
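A quick numeric sketch may help here. It assumes the equation has the form y = β0 + β1 × Age + β2 × Gender, and the coefficient values are made up purely for illustration; they are not estimates from the insurance data.

```python
def predict(age, gender, b0, b1, b2):
    """Evaluate y = b0 + b1 * age + b2 * gender for a 0/1 gender code."""
    return b0 + b1 * age + b2 * gender


# Hypothetical coefficients, chosen only to keep the arithmetic easy.
b0, b1, b2 = 100.0, 10.0, 50.0

y_gender0 = predict(30, 0, b0, b1, b2)  # uses only b0 and b1
y_gender1 = predict(30, 1, b0, b1, b2)  # intercept becomes b0 + b2

print(y_gender0, y_gender1)  # 400.0 450.0
```

At the same age the two predictions differ by exactly b2, which is why the dummy variable can be read as a shift in the intercept between the two groups.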

Therefore, whenever categorical variables are present, the number of regression equations equals the product of the numbers of categories of those variables. In our example above we have 3 categorical variables, which gives (4*2*2) = 16 equations altogether.
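The count of 16 can be checked by listing every combination of category codes. This is a minimal Python sketch; the dictionary `levels` simply restates the codes used in this post.

```python
from itertools import product
from math import prod

# Category codes for each categorical predictor.
levels = {
    "sex": [0, 1],
    "smoker": [0, 1],
    "region": [0, 1, 2, 3],
}

# Each combination of codes corresponds to one regression equation.
combinations = list(product(*levels.values()))
n_equations = len(combinations)

print(n_equations)                               # 16
print(prod(len(v) for v in levels.values()))     # 16
```

Enumerating the combinations and multiplying the numbers of categories give the same answer.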

Now we have completed the preliminary stage of our Multiple Linear Regression Analysis, that is, coding our variables. Next we have to do the Regression Analysis itself. We will do that using Minitab software.

Note that we have truncated this data set and dropped some columns to make our regression analysis simple.

Please refer to the next post for the analysis.
