What is regression analysis?

Regression analysis is a statistical method for describing the relationship between two or more variables. Usually shown as a graph, it tests how a dependent variable relates to one or more independent variables. Typically, the dependent variable changes as the independent variables change, and the regression analysis attempts to answer which factors matter most to that change.

We know that we need to make data-driven decisions, but when there are millions or even trillions of data points, where do you even begin? Fortunately, artificial intelligence (AI) and machine learning (ML) can take enormous amounts of data and parse it in a matter of hours to make it more digestible. It is then up to the analyst to examine the relationships more closely.

An example of a regression analysis

In the real world, a scenario where regression analysis is used might look something like this.

A retail business needs to predict sales figures for the next month (the dependent variable). This is difficult, since so many variables surround that number (the independent variables): the weather, a new model release, what your competitors do, or the maintenance work going on to the pavement outside.

Many may have an opinion, such as Bob from accounts or Rachel who has worked on the sales floor for ten years. But regression analysis sorts through all the measurable variables and indicates which are likely to have an impact. The analysis tells you which factors are associated with changes in sales and how the variables interact with each other. This helps the business to make better, data-driven decisions.

In this retail business example, the dependent variable is sales, and the independent variables are the weather, competitor behavior, footpath maintenance and new model releases.

The use of regression lines in regression analysis

To start a regression analysis, a data scientist will collect all the data they need about the variables. This will likely include sales figures for a substantial period beforehand, and the weather, including rainfall levels, for that same period. Then, the data is processed and presented in a chart.

By convention, the Y-axis shows the dependent variable, the thing you are trying to predict: in this case, sales figures. The X-axis shows the independent variable, the number of inches of rain. Looking at this simple fictional chart, you can see that sales increase when it rains, a positive correlation. But the chart does not tell you exactly how much you can expect to sell with a certain amount of rainfall. This is when you add a regression line.

This is the line that best fits the data and summarizes the relationship between the dependent and independent variables. In this example, you can see the regression line running through the data points, giving a visual prediction of sales for any amount of rainfall.

A regression line uses a formula to calculate its predictions: Y = A + BX. Y is the dependent variable (sales), X is the independent variable (rainfall), B is the slope of the line, and A is the Y-intercept, the value of Y where the line crosses the Y-axis (that is, when X is zero).
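As a minimal sketch of how A and B in Y = A + BX can be estimated, the example below uses the ordinary least-squares formulas with NumPy. The rainfall and sales numbers are made up purely for illustration, and the names `slope_b`, `intercept_a` and `predict_sales` are hypothetical.

```python
import numpy as np

# Made-up illustrative data: inches of rain (X) and units sold (Y).
rainfall = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
sales = np.array([102.0, 118.0, 141.0, 159.0, 180.0])

# Ordinary least-squares estimates for Y = A + BX:
#   B = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   A = mean_y - B * mean_x
x_mean, y_mean = rainfall.mean(), sales.mean()
slope_b = np.sum((rainfall - x_mean) * (sales - y_mean)) / np.sum((rainfall - x_mean) ** 2)
intercept_a = y_mean - slope_b * x_mean


def predict_sales(inches_of_rain):
    """Predicted sales on the fitted line for a given amount of rain."""
    return intercept_a + slope_b * inches_of_rain


print("slope B:", slope_b)
print("intercept A:", intercept_a)
print("predicted sales at 2.5 inches:", predict_sales(2.5))
```

In practice a library routine such as `np.polyfit(rainfall, sales, 1)` gives the same two numbers; the explicit formulas are shown here only to connect the code to the equation.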

In data science, software runs all these calculations in a split second to produce data-driven predictions.

Multiple regressions

While there can only be one dependent variable per regression, there can be multiple independent variables. This is generally referred to as a multiple regression.

This allows statisticians to identify complex relationships between variables. While the outcomes will be more complex, they can be more realistic than those of a simple, one-variable regression analysis. In the retail example, a multiple regression can show the effects of weather, product releases and competitors' advertising on sales in the store.
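A multiple regression can be sketched in the same way, with one coefficient per independent variable. The example below is an illustration under stated assumptions, not a definitive implementation: the weekly figures are invented, the variable names are hypothetical, and the fit uses NumPy's general least-squares solver.

```python
import numpy as np

# Made-up illustrative data, one entry per week.
rainfall = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 1.0, 3.0, 2.0])        # inches of rain
new_release = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0])     # 1 if a new model launched
competitor_ads = np.array([2.0, 1.0, 3.0, 0.0, 2.0, 1.0, 3.0, 2.0])  # competitor campaigns running
sales = np.array([92.0, 113.0, 157.0, 161.0, 198.0, 146.0, 143.0, 131.0])

# Design matrix: a column of ones for the intercept, then one column per variable.
X = np.column_stack([np.ones_like(rainfall), rainfall, new_release, competitor_ads])

# Least-squares fit of sales = A + B1*rainfall + B2*new_release + B3*competitor_ads.
coefficients, _, rank, _ = np.linalg.lstsq(X, sales, rcond=None)

labels = ["intercept", "rainfall", "new_release", "competitor_ads"]
for label, value in zip(labels, coefficients):
    print(f"{label}: {value:.2f}")
```

Each coefficient estimates how sales change when that variable rises by one unit while the others are held fixed, which is what lets a multiple regression separate the effects of several factors.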

What are error terms?

Regression analyses do not prove causation, just the relationship between variables. While it is tempting to say that the rainfall level obviously affects sales figures, the regression alone does not prove that this is the case. Independent variables will never be a perfect predictor of a dependent variable.

The error term is the part of the dependent variable that the formula does not explain: the gap between the sales actually observed and the sales the regression line predicts. The larger these gaps, the less you can trust the line. A common summary is the share of the variation in sales that the model explains. If the model explains only a small share, the independent variable tells you little about sales; if it explains 85 percent, the independent variable is strongly related to the dependent variable.
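In standard statistical usage, the estimated errors (residuals) are the gaps between each observed value and the value the line predicts. The sketch below, using made-up rainfall and sales numbers and hypothetical variable names, computes those residuals and the share of variation explained (often called R-squared).

```python
import numpy as np

# Made-up illustrative data: inches of rain and units sold.
rainfall = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
sales = np.array([102.0, 118.0, 141.0, 159.0, 180.0])

# Fit Y = A + BX; np.polyfit returns the highest power first (slope, then intercept).
slope_b, intercept_a = np.polyfit(rainfall, sales, 1)
predicted = intercept_a + slope_b * rainfall

# Residuals: what the line fails to explain for each observation.
residuals = sales - predicted

# Share of the variation in sales explained by the line (R-squared).
ss_residual = np.sum(residuals ** 2)
ss_total = np.sum((sales - sales.mean()) ** 2)
r_squared = 1.0 - ss_residual / ss_total

print("residuals:", np.round(residuals, 2))
print("share of variation explained:", round(r_squared, 4))
```

A value near 1 means the line tracks the data closely; a value near 0 means the independent variable explains little of the variation in sales.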

Correlation does not equal causation: it might not be the rain causing that increase in sales; it could be another variable. While the variables seem to be linked, something else altogether may be at work, and only by running multiple analyses will a business gain a clearer understanding of the factors involved. It is almost impossible to prove direct cause and effect with regression analysis alone.

This is why regression analyses usually include a number of variables, so that it’s more likely that you’re finding the actual cause of the sales increase or decrease. Of course, including multiple independent variables can create a messy set of outcomes, but good data scientists and statisticians can sort through the data to get reliable results.

The other thing that can help is knowledge of the business. The store might sell more products on days with heavier rainfall, but if the data scientists talk to the sales staff, they may find out that more people come in for the free coffee that is given away on rainy days. If that is the case, is the cause of increased sales the rain, or the free coffee?

This means the business needs to do a bit of market research, asking customers why they purchased something on a specific day. It may be that the coffee drew them in, the rain made them stay, and then they saw a product they had been intending to buy. In that case, the rain contributes to the increased sales, but you need to factor in the free coffee too. One without the other will not result in the same outcome.
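One way to express "one without the other will not result in the same outcome" in a regression is an interaction term, the product of the two variables. The sketch below is hypothetical: it assumes the store could also offer free coffee on some dry days (otherwise rain and coffee always occur together and their effects cannot be separated), and all the numbers are invented.

```python
import numpy as np

# Made-up illustrative data: two days for each combination of rain and free coffee.
rain = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0])     # 1 if it rained
coffee = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])   # 1 if free coffee was offered
sales = np.array([100.0, 102.0, 110.0, 108.0, 105.0, 107.0, 140.0, 138.0])

# Columns: intercept, rain, coffee, and the interaction rain * coffee.
X = np.column_stack([np.ones_like(rain), rain, coffee, rain * coffee])
coefficients, _, _, _ = np.linalg.lstsq(X, sales, rcond=None)

base, rain_effect, coffee_effect, combined_bonus = coefficients
print("baseline sales:", round(base, 2))
print("rain alone:", round(rain_effect, 2))
print("coffee alone:", round(coffee_effect, 2))
print("extra when both occur together:", round(combined_bonus, 2))
```

A large interaction coefficient relative to the two individual effects would suggest that rain and coffee matter mostly in combination, which is the situation the paragraph above describes.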

How can a company use regression analysis?

Generally, regression analysis is used to:

Explain a phenomenon
Predict future events
Optimize manufacturing and delivery processes
Resolve errors
Provide new insights

Phenomenon explanation

This could be trying to find a reason (variable) why sales soar on a certain day of the month, why service calls rose in a certain month, or why people return rental cars late on certain days only.

Make predictions

If the regression analysis showed that people purchased more of a product after a certain promotion, the business can make an informed decision about which advertising to run or promotion to use.

Predictions in regression analysis can cover a wide variety of situations and scenarios. For example, predicting how many people will see a billboard can help management decide if an investment into advertising there is a good idea; in which scenario does this billboard offer a good return on investment?

Insurance companies and banks rely heavily on the predictions of regression analysis. How many mortgage holders will pay back their loans on time? How many policyholders will have a car accident or have thefts occur at their homes? These predictions support risk assessment and also help set fees and premium prices.

Optimize processes

In a bakery, there could be a relationship between the shelf life of cookies and the oven temperature during baking. The goal of optimization here would be the longest shelf life that still retains the chewy quality of the cookies. A call centre might need to know the relationship between complaint volumes and wait times, so it can train or hire staff to respond to calls within a certain time frame for maximum customer satisfaction. Call volumes also change throughout the day, and modelling that pattern further equips management to make educated, optimized decisions about staffing levels.

Resolving errors

A store manager comes up with a bright idea: extending opening hours will increase sales. After all, the manager explains, if you are open for four more hours a day, sales should rise correspondingly. Except that keeping a store open longer does not always mean an increase in profit. A regression analysis can show that the increase in sales might not cover the cost of staying open longer. Such quantitative analysis provides support for executive decisions.
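The opening-hours question can be sketched as a small calculation: estimate the extra sales per additional hour with a regression, then compare the profit on those sales with the cost of staying open. Every figure below (hours, sales, the 30 percent margin and the hourly cost) is a made-up assumption for illustration only.

```python
import numpy as np

# Made-up illustrative data: hours open per day and daily sales in dollars.
hours_open = np.array([8.0, 9.0, 10.0, 11.0, 12.0])
daily_sales = np.array([4000.0, 4300.0, 4500.0, 4600.0, 4650.0])

# The slope of the fitted line estimates the extra sales from one more hour open.
extra_sales_per_hour, _ = np.polyfit(hours_open, daily_sales, 1)

# Hypothetical business assumptions, not taken from any real store.
gross_margin = 0.30          # profit earned per dollar of sales
cost_per_extra_hour = 120.0  # wages, power and so on for one more hour

net_profit_per_extra_hour = extra_sales_per_hour * gross_margin - cost_per_extra_hour

print("estimated extra sales per hour:", round(extra_sales_per_hour, 2))
print("net profit per extra hour:", round(net_profit_per_extra_hour, 2))
print("net profit for four extra hours:", round(4 * net_profit_per_extra_hour, 2))
```

With these invented numbers the net result is negative, so the extra sales would not cover the cost of the longer hours. A straight line is also a simplification here, since the gain from each additional hour may shrink as hours are extended.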

New insights

Most businesses have large volumes of data, often in a chaotic state. Using regression analysis, this data can yield information about relationships between variables that may have gone unnoticed in the past. If you analyze your point-of-sale data, you may discover busy times of the day, spikes in demand, or previously unnoticed high-sales dates.

Challenges with regression analysis

Correlation does not equal causation. You can show a relationship between any two variables, but that does not prove that one of the variables causes the other. Some people think when they see a positive relationship in a regression analysis that it is a clear sign of cause and effect. However, as we discussed before, regression analysis only shows the relationship between variables, not the cause and effect. You must be careful that you are not making assumptions about relationships that do not actually exist in real life.

The independent variable may be something you can’t control. For instance, you know that rain increases sales volumes, but you cannot control the weather. Does that variable even matter? You can control a lot of internal factors: your marketing, store layout, staff behaviour, features and promos. Waiting for it to rain is not a good sales strategy.

GIGO (garbage in, garbage out)

A large part of a data scientist’s role is cleaning data. This is because your calculations are only as good as the data provided. If the input information is garbage, the outcome of the regression analysis will be too. While statistics and data cleansing can manage and control for some irregularities or imperfections, the data must be accurate in order for the resulting predictions to be accurate.

Ignoring the error term. If the results say the model explains 60 percent of the result, there may be important information in the remaining 40 percent that must be examined. You must ask yourself: Is this calculation accurate enough to trust, or is there a bigger factor or variable at play here? Often, getting an experienced manager or person involved with the business to look at the outcome can be a sanity check. Intuition and business domain knowledge are important, because they help ensure nothing is missed or falsely attributed.
