# What is Linear Regression Analysis?

Introduction

Linear regression uses the fact that there is a statistically significant correlation between two variables to allow you to make predictions about one variable based on your knowledge of the other.

When you conduct a regression analysis in compensation, you are trying to establish the strength of the relationship between two variables, such as Age and Salary, Tenure and Salary, Job Size and Salary, or Job Grade and Salary, so that one can be used to “predict” the other.

A regression analysis of job size versus salary is used:

• To determine internal equity of the company i.e. the bigger the job, the higher the salary.
• To determine the salary spread of jobs within the same job points / grade.
• To identify outliers i.e. jobs falling outside the two controlling lines (maximum and minimum).
• To identify gaps in grade structure.

Formula for Linear Regression

Source: Biddle website
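
For reference, the usual textbook form of the simple linear regression line is given here. The symbols are assumptions chosen for this sketch, not notation taken from the Biddle source: x is the predictor (for example, tenure), y is the outcome (for example, salary), a is the intercept, b is the slope, and a bar denotes a sample mean.

```latex
\hat{y} = a + b x, \qquad
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
a = \bar{y} - b\,\bar{x}
```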

Linear Regression

When there is only one predictor variable (such as tenure or job grade), the regression is called simple linear regression and is usually represented by a line drawn through the middle of the data points.

Refer to the figure below. The diagram is a scatterplot showing the relationship between Time With Company (Tenure) and Hourly Compensation (Wage/Salary), with a line drawn through the middle of the group of dots. This line is called the regression line.

Source: Introduction to Linear Regression, Biddle website

If you are trying to predict the hourly compensation of an employee who has worked for the company for, say, 20 months, then your best single guess is the average compensation paid to people who have worked for 20 months with the company. Looking at Figure 3-1, above, you can see that the average compensation for people who have worked for the company for 20 months is around 32 dollars per hour. So, if you knew that an employee had worked for the company for 20 months, and knew nothing else about the employee, your best guess about the compensation that employee receives is around 32 dollars per hour.

The key point is that the larger the correlation coefficient is between the two variables, in this case Time With Company and Hourly Compensation, the stronger the relationship that exists between them. The stronger the relationship, the more accurate your prediction will be!
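
The prediction described above can be sketched in a few lines of Python. This is a minimal illustration, not the method behind the Biddle figure. The five data points are hypothetical, chosen so that the fitted line gives roughly 32 dollars per hour at 20 months, and `fit_line` and `predict` are names introduced here for the example.

```python
# Hypothetical data: months with the company and hourly pay in dollars.
tenure_months = [6, 12, 18, 24, 30]
hourly_pay = [24.0, 27.0, 32.0, 33.0, 39.0]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Read the predicted y value off the regression line at x."""
    return intercept + slope * x


intercept, slope = fit_line(tenure_months, hourly_pay)
predicted = predict(intercept, slope, 20)
print(f"{predicted:.2f}")  # prints 32.20
```

With real data, the tenure and pay lists would come from the payroll records being analysed. The arithmetic stays the same.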

Correlation Coefficient

The quantity r, called the linear correlation coefficient, measures the strength and
the direction of a linear relationship between two variables.

The value of r is such that -1 ≤ r ≤ +1. The + and – signs are used for positive
linear correlations and negative linear correlations, respectively.

Positive correlation: If x and y have a strong positive linear correlation, r is close
to +1. An r value of exactly +1 indicates a perfect positive fit. Positive values
indicate a relationship between x and y such that as values for x increase,
values for y also increase.

Negative correlation: If x and y have a strong negative linear correlation, r is close
to -1. An r value of exactly -1 indicates a perfect negative fit. Negative values
indicate a relationship between x and y such that as values for x increase, values
for y decrease.

No correlation: If there is no linear correlation or a weak linear correlation, r is
close to 0. A value near zero means that there is no linear relationship between the
two variables; the points may be randomly scattered or may follow a nonlinear pattern. (Note that r is a dimensionless quantity; that is, it does not depend on the units employed.)

A perfect correlation of  ± 1 occurs only when the data points all lie exactly on a
straight line.  If r = +1, the slope of this line is positive.  If r = -1, the slope of this
line is negative.

A correlation greater than 0.8 in absolute value is generally described as strong, whereas a correlation less than 0.5 in absolute value is generally described as weak.

Source: Correlation Coefficient, Mathbits website.
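
As a rough sketch of how r can be computed, the Python below applies the standard Pearson formula to hypothetical tenure and pay figures (the data are made up for illustration). The helper names `pearson_r` and `describe_strength` are introduced here, and the "moderate" label for values between 0.5 and 0.8 is an assumption of this sketch; the text above only names the strong and weak bands.

```python
import math

# Hypothetical data: months with the company and hourly pay in dollars.
tenure_months = [6, 12, 18, 24, 30]
hourly_pay = [24.0, 27.0, 32.0, 33.0, 39.0]


def pearson_r(xs, ys):
    """Linear correlation coefficient r between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def describe_strength(r):
    """Label r using the 0.8 and 0.5 cut-offs; 'moderate' is an assumed label."""
    size = abs(r)
    if size > 0.8:
        return "strong"
    if size < 0.5:
        return "weak"
    return "moderate"


r = pearson_r(tenure_months, hourly_pay)
print(f"{r:.3f}", describe_strength(r))  # prints 0.983 strong
```

Because the deviations from the mean are divided by their own spread, changing the units (for example, years instead of months) leaves r unchanged.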

Coefficient of Determination

The quantity r², called the coefficient of determination, is the proportion of the total variation in y that is explained by the line of best fit.

For example, if r = 0.922, then r² = 0.850, which means that 85% of the total variation in y can be explained by the linear relationship between x and y (as described by the regression equation). The other 15% of the total variation in y remains unexplained.

The coefficient of determination is a measure of how well the regression line represents the data.  If the regression line passes exactly through every point on the scatter plot, it would be able to explain all of the variation. The further the line is away from the points, the less it is able to explain.

Source: Correlation Coefficient, Mathbits website.
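
The "explained variation" reading of the coefficient of determination can be checked numerically. The sketch below uses hypothetical tenure and pay figures (made up for illustration) and computes it as one minus the ratio of unexplained variation to total variation. The function name `r_squared_from_fit` is introduced here for the example.

```python
# Hypothetical data: months with the company and hourly pay in dollars.
tenure_months = [6, 12, 18, 24, 30]
hourly_pay = [24.0, 27.0, 32.0, 33.0, 39.0]


def r_squared_from_fit(xs, ys):
    """Coefficient of determination of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Unexplained variation: squared distances from the points to the line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation: squared distances from the points to the mean of y.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


r_squared = r_squared_from_fit(tenure_months, hourly_pay)
print(f"{r_squared:.3f}")  # prints 0.967

# The worked example from the text: r = 0.922 gives r squared of about 0.850.
print(f"{0.922 ** 2:.3f}")  # prints 0.850
```

If every point sat exactly on the line, the unexplained variation would be zero and the function would return 1.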

The R-square value expresses the strength of this relationship. The closer it is to 1 (or 100%), the more of the variation in the result it explains. In practice, you almost never get exactly 1. So, for example, if the R-square value between job grade and salary is 52%, it means that job grade “explains” 52% of the variation in salary.

However, more than one factor influences salary. So, for example, if the R-square value between tenure and salary is 29%, it means that tenure “explains” 29% of the variation in salary. It means that you can make a better prediction