What are Dummy Variables in Regression?

We did a basic Multiple Linear Regression Analysis in our previous post. If you missed that, please click the link and refer to it. Here we go one step further by adding categorical variables to the regression equation. Let’s see how to do that!

What is a Categorical Variable?

A Categorical Variable takes one of two or more categories. Categorical Variables are also called Qualitative Variables. Examples of Categorical Variables are “gender” and “marital status”.

What is a Dummy Variable?

When we have one or more Categorical Variables in our regression equation, we express them as “Dummy Variables”. For a variable with n categories, there are always (n-1) dummy variables, because the one remaining category serves as the baseline against which the others are compared. Dummy Variables are also called “Indicator Variables”.

Example of a Dummy Variable:-

Say we have the categorical variable “Gender” in our regression equation. We can represent this as 0 for Male and 1 for Female.

Figure: Coding a categorical variable into an indicator variable
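As a small illustration of the coding just described, here is a minimal Python sketch. The function name `encode_gender` and the sample labels are made up for this example; only the mapping (0 for Male, 1 for Female) comes from the text.

```python
# Coding taken from the text: Male -> 0, Female -> 1.
GENDER_CODES = {"Male": 0, "Female": 1}


def encode_gender(values):
    """Replace each gender label with its 0/1 indicator value."""
    return [GENDER_CODES[v] for v in values]


# Made-up labels, for illustration only.
sample = ["Female", "Male", "Male", "Female"]
encoded = encode_gender(sample)
print(encoded)  # [1, 0, 0, 1]
```

A label that is not in the mapping raises a `KeyError`, which is usually what you want: it flags a misspelled category instead of silently coding it.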

Let’s jump into our problem

The dataset that we are going to use here is called “insurance”. You can download the dataset from here. We are interested in predicting insurance charges. The factors most likely to determine the insurance charges for an individual are listed below.

We have 6 Independent Variables here.

Variable	Type
Age	Quantitative
Sex	Categorical (2)
BMI	Quantitative
Children	Quantitative
Smoker	Categorical (2)
Region	Categorical (4)

The table above lists all the Predictor Variables and their types. For each categorical variable, the number of categories is given in parentheses.

Figure: First 10 rows of the dataset, before coding the indicator variables

Let’s code each categorical variable into indicator (dummy) variables

Sex

Male -> 0, Female -> 1

Smoker

Yes -> 1, No -> 0

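Region

We have four categories here (n = 4). Therefore we need (4-1) = 3 Indicator Variables here. For compactness, the table below gives each region a single numeric code; in the regression itself, Region is represented by three 0/1 Indicator Variables, with the remaining region acting as the baseline.

Region	Code
North East	0
North West	1
South East	2
South West	3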
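The three Indicator Variables for Region can be sketched as follows. This is only an illustration: the function name `region_dummies` is made up, and taking the first region in alphabetical order (North East) as the baseline is an assumption. Statistical software may choose a different baseline.

```python
# The four regions from the text, in alphabetical order.
REGIONS = ["North East", "North West", "South East", "South West"]


def region_dummies(region, categories=REGIONS):
    """Return n-1 indicator values for one region.

    Assumption: the first category is the baseline, so it is
    represented by all zeros.
    """
    if region not in categories:
        raise ValueError(f"unknown region: {region!r}")
    return [1 if region == c else 0 for c in categories[1:]]


for r in REGIONS:
    print(r, region_dummies(r))
```

Each region gets at most one 1 across the three indicators, and the baseline region is the row of all zeros.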

As a rule of thumb, we code the categories in alphabetical order, as done for Smoker and Region above. The first 10 rows of the coded dataset are shown below.

age	sex	bmi	children	smoker	region
19	1	27.9	0	1	3
18	0	33.77	1	0	2
28	0	33	3	0	2
33	0	22.705	0	0	1
32	0	28.88	0	0	1
31	1	25.74	0	0	2
46	1	33.44	1	0	2
37	1	27.74	3	0	1
37	0	29.83	2	0	0
60	1	25.84	0	0	1
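The coding step that produces the table above can be sketched in Python as follows. The raw rows were obtained by decoding the first three coded rows with the mappings from this post; the exact spelling of the labels (lowercase, no spaces) and the function name `encode_row` are assumptions made for this example.

```python
# Codings from this post. The label spellings are assumptions.
SEX = {"male": 0, "female": 1}
SMOKER = {"no": 0, "yes": 1}
REGION = {"northeast": 0, "northwest": 1, "southeast": 2, "southwest": 3}


def encode_row(row):
    """Replace the three categorical fields of one row with their codes."""
    age, sex, bmi, children, smoker, region = row
    return [age, SEX[sex], bmi, children, SMOKER[smoker], REGION[region]]


# First three rows of the table above, decoded back to labels.
raw_rows = [
    (19, "female", 27.9, 0, "yes", "southwest"),
    (18, "male", 33.77, 1, "no", "southeast"),
    (28, "male", 33, 3, "no", "southeast"),
]

coded = [encode_row(r) for r in raw_rows]
for row in coded:
    print(row)
```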

Number of Possible Regression Equations

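For a moment, assume that there is only one Categorical Predictor variable named ‘Gender’ and one Quantitative Predictor named ‘Age’.

Our regression equation looks as follows:

Y = β0 + β1(Age) + β2(Gender) + ε

Since Gender can only take the values 1 and 0, whenever Gender = 0 the regression equation consists of only β0 and β1, that is, Y = β0 + β1(Age) + ε. Whenever Gender = 1, the intercept becomes (β0 + β2) while the slope for Age stays the same.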
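To see the two resulting equations side by side, here is a minimal Python sketch. It assumes the model has the form beta 0 + beta 1 × Age + beta 2 × Gender. The coefficient values are hypothetical numbers chosen only to keep the arithmetic easy; they are not estimates from the insurance data.

```python
def predict(age, gender, b0, b1, b2):
    """Fitted value for the assumed model b0 + b1*age + b2*gender."""
    return b0 + b1 * age + b2 * gender


# Hypothetical coefficients, for illustration only.
b0, b1, b2 = 100.0, 5.0, 20.0

# gender = 0: the b2 term vanishes, leaving b0 + b1*age.
y_gender0 = predict(30, 0, b0, b1, b2)

# gender = 1: the intercept shifts from b0 to b0 + b2.
y_gender1 = predict(30, 1, b0, b1, b2)

print(y_gender0)              # 250.0
print(y_gender1)              # 270.0
print(y_gender1 - y_gender0)  # 20.0, equal to b2 at any age
```

The gap between the two lines is always beta 2, whatever the age, so the dummy variable shifts the line up or down without changing its slope.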

Therefore, whenever categorical variables are present, the number of regression equations equals the product of the numbers of categories of those variables. In our example above we have 3 categorical variables, which together give (4*2*2) = 16 equations.
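The count of 16 can be checked by listing every combination of categories. The sketch below uses only the Python standard library; the dictionary `levels` is a made-up name holding the codes from this post.

```python
from itertools import product
from math import prod

# Category codes for each categorical variable, as coded in this post.
levels = {
    "sex": [0, 1],
    "smoker": [0, 1],
    "region": [0, 1, 2, 3],
}

# Product of the numbers of categories: 2 * 2 * 4.
n_equations = prod(len(codes) for codes in levels.values())

# Every (sex, smoker, region) combination; each one gives its own equation.
combos = list(product(*levels.values()))

print(n_equations)  # 16
print(len(combos))  # 16
```

In a model without interaction terms, these 16 equations share the same slopes for the quantitative predictors and differ only in their intercepts.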

We have now completed the preliminary stage of our Multiple Linear Regression Analysis, that is, coding our variables. Next we have to do the Regression Analysis itself. We do that using Minitab Software.

Note that we have truncated this dataset and dropped some columns to keep our regression analysis simple.

Please refer to the next post for the analysis.
