What is logistic regression?
Logistic regression is a statistical model that Is used to determine the probability that an event will happen. It shows the relationship between features, and then calculates the probability of a certain outcome.
Logistic regression is used in machine learning (ML) to help create accurate predictions. It is similar to linear regression, except rather than a graphical outcome, the target variable is binary; the value is either 1, or 0.
There are two types of measurables, the explanatory variables/ features (item being measured) and the response variable/ target binary variable, which is the outcome.
For example, when trying to predict whether a student will pass or fail a test, the hours studied are the feature, and the response variable will have two values - pass or fail.
There are three basic kinds of logistic regression:
- Binary logistic regression: Here there are only two possible outcomes for the categorical response. As in the example above – a student passes or fails.
- Multinomial logistic regression: This is where the response variables can include three or more variables, which will not be in any order. An example is predicting whether diners at a restaurant prefer a certain kind of food – vegetarian, meat or vegan.
- Ordinal logistic regression: Like multinomial regression, there can be three or more variables. However, there is an order the measurements follow. An example is rating a hotel on a scale of 1 to 5.
Assumptions used for logistic regression
When working with logistic regression, there are certain assumptions that are made.
- In binary logistic regression, it is necessary that the response variable is a binary. The outcome is either one thing, or another.
- The desired outcome should be represented by the factor level 1 of the response variable, the undesired is 0.
- Only variables that are meaningful must be included.
- Independent variables have to essentially be independent of one another. There should be little to no multi-co-linearity.
- Log odds and independent variables have to be linearly related.
- Logistic regression must be applied only to massive sample sizes.
Applications of logistic regression
There are several fields and ways in which logistic regression can be used and these include almost all fields of medical and social sciences.
For example, the Trauma and Injury Severity Score (TRISS). This is used across the world to predict fatality in injured patients. This model has been developed with the application of logistic regression. It uses variables such as the revised trauma score, injury severity score, and the age of patient to predict health outcomes. It is a technique that can even be used to predict the possibility of a person being afflicted by a certain disease. For example, ailments like diabetes and heart disease can be predicted based on variables such as age, gender, weight and genetic factors.
Logistic regression can also be used to attempt to predict elections. Will a Democrat, Republican or Independent leader come to power in the USA? These predictions are made on the basis of variables such as age, gender, place of residence, social standing and previous voting patterns (variables) to produce a vote prediction (response variable).
Logistic regression can be used in engineering to predict the success or failure of a system that is being tested, or a prototype product.
LR can be used to predict the chances of a customer’s enquiry turning into a sale, the possibility of a subscription being started or terminated, or even potential customer interest in a new product line.
An example of use in the financial sector is in a credit card company that uses it to predict the likelihood that a customer will default on their payments. The model built could be for the issuance of a credit card to a customer or not. The model can say whether a certain customer will “default” or “not default”. This is known as the “default propensity modeling” in banking terms.
Much along the same lines, e-commerce companies invest heavily in advertising and promotional campaigns across media. They want to see which campaign is the most effective and the option most likely to get a response from their potential target audience. The model set will categorize the customer as a “responder” or “non responder”. This model is called propensity to respond modeling.
With insights that come from logistic regression outputs, companies are able to optimize their strategies and achieve business goals with reduction in expenses as well as losses. Logistic regressions help to maximize return on investment (ROI) in marketing campaigns, a benefit to the bottom line of a company in the long run.
Advantages and disadvantages of logistic regression
Logistic Regression is widely used because it is extremely efficient and does not need huge amounts of computational resources. It can be interpreted easily and does not need scaling of input features. It is simple to regularize, and the outputs it provides are well-calibrated predicted probabilities.
Just as it does in linear regression, logistic regression tends to work more efficiently when attributes unrelated to the output variable and those that are correlated, are omitted. Feature engineering therefore has an important role to play in the efficacy of performance of logistic and linear regression.
Logistic regression is also easily implemented and simple to train and that’s what makes it a great baseline to help measure the performance of other complex algorithms.
Logistic regression cannot be used to solve nonlinear problems and unfortunately, many of today’s systems are nonlinear. Additionally, logistic regression is not the most powerful algorithm available. There are several alternatives that can create much better, more complex predictions.
Logistic regression also relies heavily on data presentation. This means that unless you have identified all the necessary independent variables, the output is of no value. With an outcome that is discrete, logistic regression can only be used to predict a categorical outcome. And finally, it is an algorithm with a known history of vulnerability to over-fitting.
Ready for immersive, real-time insights for everyone?