What is outlier detection?
Outlier detection is the process of identifying data points that lie far from the average and, depending on what you are trying to accomplish, potentially removing or resolving them from the analysis to prevent skewed results. It is one of the most important steps in creating good, reliable data.
What is an outlier?
Outliers are extreme data points that fall beyond the expected norms for their type. This can be a whole data set that is confounding, or the extremities of a certain data set. On a standard bell curve, the outliers are the data points at the far left and right. These outliers can indicate fraud or some other anomaly you are trying to detect, but they can also be measurement errors, experimental problems, or a novel, one-off blip.
There are two types of outliers: univariate and multivariate. A univariate outlier is a data point that is extreme for a single variable. A multivariate outlier is an unusual combination of values across at least two variables, even if each value looks normal on its own.
Point outliers: These are single data points that are far removed from the rest of the data points.
Contextual outliers: These are values that are anomalous only in a particular context and are often treated as ‘noise’, such as punctuation symbols and commas in text, or background noise when performing speech recognition.
Collective outliers: These are subsets of unexpected data that show a deviation from conventional data, which may indicate a new phenomenon.
What causes an outlier?
There are eight main causes of outliers.
- Incorrect data entry by humans
- Codes used instead of values
- Sampling errors, where data has been extracted from the wrong source or mixed with other data
- Unexpected distribution of variables
- Measurement errors caused by the application or system
- Experimental errors in extracting the data or planning errors
- Intentional dummy outliers inserted to test the detection methods
- Natural deviations in the data that are not actually errors but may indicate fraud or some other anomaly you are trying to detect
When collecting and processing data, outliers can come from a range of sources and hide in many ways. It is part of the outlier detection process to identify these and distinguish them from genuine data that is behaving in unexpected ways.
Outliers which are not actual errors but a genuine set of unexpected data are called novelties. Part of a data scientist's work is identifying the novelties and leaving them in the data set, as they are important in decision making and ensuring accurate results.
Why should a user look for outliers?
One of the core issues in artificial intelligence (AI), machine learning (ML), and data science is data quality. As the world of data science has grown, the volume of data has expanded, and with it the rate of outliers and anomalies. Aberrant data can distort model specification, bias parameter estimation, and generate incorrect results. Think about where data science is used and how this faulty data matters:
- Voting irregularities
- Clinical drug trials: If an effective drug shows poor results because its data was measured incorrectly, a range of treatment options could be missed.
- Fraud detection: This could result in people being denied credit when they were low risk or given credit when they were high risk.
- Business decisions: If a business is told to make a certain choice but the data was faulty, this could result in a huge marketing spend for little to no return on the investment, or even worse, losing valuable customers.
- Smart cities: If data quality is poor or hacked into and maliciously changed, city administrators will struggle to make accurate decisions about anything in their city including traffic light installations, rubbish collection, or policing numbers.
Techniques used for outlier detection
A data scientist can use a number of techniques to identify outliers and decide if they are errors or novelties.
Box plots: This is the simplest nonparametric technique, used for data in a one-dimensional space. The data is divided into quartiles, and the range limits are set as the upper and lower whiskers of a box plot, typically 1.5 times the interquartile range beyond the first and third quartiles. Data points that fall outside those ranges can then be flagged as outliers and removed.
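As a sketch of the box-plot method, the following pure-Python snippet computes Tukey's fences (1.5 × IQR beyond the first and third quartiles, a common rule of thumb) and flags anything outside them. The sample values and the 1.5 multiplier are illustrative:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag points outside the box-plot whiskers (Tukey's fences)."""
    # quantiles() with n=4 returns the three quartile cut points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

sample = [10, 12, 11, 13, 12, 11, 10, 95]
print(iqr_outliers(sample))  # the extreme value 95 falls outside the whiskers
```

The multiplier `k` controls how aggressive the fences are: larger values keep more borderline points in the data set.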
Z-score: This parametric technique indicates how many standard deviations a data point lies from the sample’s mean. It assumes a Gaussian distribution (a normal, bell-shaped curve). If the data is not normally distributed, it can first be transformed by scaling to give it a more normal shape. The Z-score of each data point is then calculated and placed on the bell curve, and a heuristic (rule of thumb) cut-off threshold in standard deviations is chosen. Data points lying beyond that threshold are classified as outliers and removed. The Z-score is a simple, powerful way to remove outliers, but it is only useful with small to medium data sets and cannot be used for nonparametric data.
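A minimal Z-score sketch in pure Python. The threshold of 2 standard deviations used below is an illustrative choice for a small sample; 3 is also a common rule of thumb on larger data:

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs((v - mean) / sd) > threshold]

sample = [10, 12, 11, 13, 12, 11, 10, 95]
print(zscore_outliers(sample))  # the extreme value 95 exceeds the threshold
```

Note a limitation visible even here: an extreme value inflates the standard deviation it is measured against, which is one reason the Z-score struggles on small data sets with multiple outliers.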
DBSCAN: Density-Based Spatial Clustering of Applications with Noise groups data by density. Using two parameters, a neighborhood radius and a minimum number of neighboring points, it clusters related points together and divides the data into core points, border points, and outliers. Core points sit inside dense regions, border points have enough nearby density to be considered part of a cluster, and outliers belong to no cluster at all and can be disregarded. DBSCAN works well across three or more dimensions, and it is very intuitive, making visualization easy. However, the values in the feature space need to be scaled, selecting the optimal parameters can be tricky, and the model needs to be re-calibrated every time new data is analyzed.
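A compact, illustrative DBSCAN implementation in pure Python (in practice you would normally reach for a library implementation such as scikit-learn's `DBSCAN`). The points, `eps` (neighborhood radius), and `min_pts` (minimum neighborhood size) below are made-up values:

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for outliers (noise)."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE          # not dense enough: outlier (for now)
            continue
        labels[i] = cluster            # i is a core point: start a new cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # noise reachable from a core point -> border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)   # j is also a core point: keep expanding
        cluster += 1
    return labels

pts = [(0, 0), (0.2, 0.1), (0.1, 0.3), (5, 5), (5.1, 4.9), (4.9, 5.2), (12, 12)]
print(dbscan(pts, eps=1.0, min_pts=3))  # two dense clusters; (12, 12) is labeled -1
```

The two parameters mentioned above are exactly what makes tuning tricky: `eps` too small turns everything into noise, `eps` too large merges distinct clusters.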
Isolation forest: This method is effective for finding novelties and outliers. It builds binary decision trees using randomly selected features and random split values; the trees together form a forest, and each point’s path lengths are averaged across the trees. An outlier score between 0 and 1 is then calculated for each data point, with scores near 0 being normal and scores near 1 being more likely outliers. Isolation forests don’t require scaling, and they are effective when you cannot assume a value distribution. The method has very few parameters, making it robust and simple to optimize. However, visualizing the results is complex, and scoring can be a long, expensive process.
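A simplified one-dimensional isolation-forest sketch in pure Python, using the standard average-path-length normalization. A production implementation (e.g. scikit-learn's `IsolationForest`) would subsample the data and handle multiple features; the data and tree count here are illustrative:

```python
import math
import random

def build_tree(values, depth, max_depth, rng):
    """Recursively isolate points with random split values (1-D sketch)."""
    if depth >= max_depth or len(values) <= 1 or min(values) == max(values):
        return ("leaf", len(values))
    split = rng.uniform(min(values), max(values))
    left = [v for v in values if v < split]
    right = [v for v in values if v >= split]
    return ("node", split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))

def path_length(tree, v, depth=0):
    if tree[0] == "leaf":
        # unresolved leaves get the average-path-length adjustment c(n)
        n = tree[1]
        return depth + (2 * (math.log(n - 1) + 0.5772) - 2 * (n - 1) / n if n > 1 else 0)
    _, split, left, right = tree
    return path_length(left if v < split else right, v, depth + 1)

def isolation_scores(values, n_trees=100, seed=0):
    """Score each value in [0, 1]; easily isolated values score closer to 1."""
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(len(values)))
    trees = [build_tree(values, 0, max_depth, rng) for _ in range(n_trees)]
    n = len(values)
    c = 2 * (math.log(n - 1) + 0.5772) - 2 * (n - 1) / n  # expected path length
    return [2 ** (-sum(path_length(t, v) for t in trees) / n_trees / c) for v in values]

data = [10, 11, 12, 11, 10, 12, 11, 95]
scores = isolation_scores(data)
# the extreme value 95 takes far fewer random splits to isolate,
# so it should receive the highest score
```

The intuition is visible in the code: an anomalous point is separated by a random split after only a step or two, so its average path length is short and its score is pushed toward 1.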
Challenges with outlier detection
No mathematical process or data science strategy is immune from error. Large data sets in particular must be managed well in order to remove outliers correctly while keeping valid data and novelties intact. Some challenges include:
- When noise or outliers are very similar to valid data, it can be difficult to tease the flawed data from the good data.
- Outlier behavior can change characteristics. This means that algorithms and models that previously correctly identified outliers may no longer work.
- Data can be over-pruned, removing genuine novelties that should be included in the data set.
- Malicious data attacks can change data to confound results.
All these challenges can be overcome with well-designed algorithms that are regularly reassessed to ensure they remain accurate.