How to best handle outliers or anomalous data points?
Data is a collection of facts, figures, objects, symbols, and events gathered from different sources. Organizations collect data with various data collection methods to make better decisions. Without data, it would be difficult for organizations to make appropriate decisions, so data is collected from different audiences at various points in time.
Outlier detection in data mining (DM) and knowledge discovery from data (KDD) is of great interest in areas that rely on decision support systems. In finance, for example, DM can be used to detect fraud or to find errors introduced by users. It is therefore essential to evaluate the veracity of the information with methods that detect unusual behavior in the data.
This article proposes a method to detect outliers in a database of nominal data. The method combines a global k-nearest-neighbors algorithm, the k-means clustering algorithm, and the chi-square statistical test. These techniques were applied to a database of clients who had requested financial credit. The experiment was performed on a data set of 1180 tuples into which outliers were deliberately introduced. The results showed that the proposed method detected all of the introduced outliers.
Detecting outliers is a challenge in data mining. Outliers, also called anomalous values, differ from the bulk of the data: because of the nature of their values, they do not behave like the majority. Anomalous values can also be introduced by malicious mechanisms (Atkinson, 1981).
Mandhare and Idate (2017) consider this type of data to be a threat and define it as irrelevant or malicious. Additionally, this data creates conflicts during the analysis process, resulting in unreliable and inconsistent information. However, although anomalous data are irrelevant for finding patterns in everyday data, they are useful as an object of study in cases where, through them, it is possible to identify problems such as financial fraud through an uncontrolled process.
What is anomaly detection?
Anomaly detection examines specific data points and detects unusual occurrences that appear suspicious because they are different from established patterns of behavior. Anomaly detection is not new, but as the volume of data increases, manual tracking is no longer practical.
Why is anomaly detection important?
Anomaly detection is especially important in industries such as finance, retail, and cybersecurity, but all businesses should consider implementing an anomaly detection solution. Such a solution provides an automated means to detect harmful outliers and protects data. For example, banking is a sector that benefits from anomaly detection. Thanks to it, banks can identify fraudulent activity and inconsistent patterns, and protect data.
Data is the lifeline of your business, and compromising it can put your operation at risk. Without anomaly detection, you could lose revenue and brand value that took years to cultivate. Your company could face security breaches and the loss of confidential customer information. If this happens, you risk losing a level of customer trust that may be irretrievable.
Data mining techniques facilitate the search for anomalous values (Arce, Lima, Orellana, Ortega and Sellers, 2018). Several studies show that most of this type of data also originates from domains such as credit cards (Bansal, Gaur, & Singh, 2016), security systems (Khan, Pradhan, & Fatima, 2017), and electronic health information (Zhang & Wang, 2018).
Detection relies on data mining tools based on unsupervised algorithms (Onan, 2017) and follows one of two approaches: local or global (Monamo, Marivate, & Twala, 2017). Global approaches assign each data point an anomaly score relative to the entire data set. Local approaches, by contrast, score a data point relative to its direct neighborhood, that is, the data points that are close to it in terms of the similarity of their characteristics.
Consequently, the local approach can detect outliers that a global approach misses, especially in data with variable density (Amer and Goldstein, 2012). Examples of such algorithms are those based on i) clustering and ii) nearest neighbors. Nearest-neighbor algorithms consider outliers to be points in sparse neighborhoods, far from their nearest neighbors, while clustering-based algorithms operate on grouped data (Onan, 2017).
There are several approaches to outlier detection. Hassanat, Abbadi, Altarawneh, and Alhasanat (2015) carried out a survey that summarizes the different families of outlier detection studies: the statistics-based approach, the distance-based approach, and the density-based approach. The authors discuss outliers and conclude that the k-means algorithm is the most popular for clustering a data set.
Furthermore, other studies (Dang et al., 2015; Ganji, 2012; Gu et al., 2017; Malini and Pushpa, 2017; Mandhare and Idate, 2017; Sumaiya Thaseen and Aswani Kumar, 2017; Yan et al., 2016) use data mining techniques, statistical methods, or both. For outlier detection, k-nearest-neighbor (KNN) techniques have commonly been applied together with other methods to find unusual patterns in data behavior or to improve process performance. One 2017 study presents an efficient grid-based method for finding outlier data patterns in large data sets.
Similarly, Yan et al. (2016) propose an outlier detection method with KNN and data pruning, which takes successive samples of tuples and columns, and applies a KNN algorithm to reduce dimensionality without losing relevant information.
Classification of significant columns
To classify the significant columns, the chi-square statistic was used. Chi-square is a non-parametric test used to determine whether a distribution of observed frequencies differs from the expected theoretical frequencies (Gol and Abur, 2015). The weight of each input column (the columns that determine the customer profile) is calculated in relation to the output column (credit amount). Weights lie on a scale of zero to one, and the higher the weight of an input column, the more relevant it is considered.
That is, the closer the weight is to one, the stronger the relationship with the output column. The statistic can only be applied to nominal columns and was selected as the method for defining relevance. Chi-square reports the level of significance of associations or dependencies, and it was used as a hypothesis test on the weight or importance of each column with respect to the output column S. The resulting value is stored in a column called weight, which is reported together with the anomaly score at the end of the process.
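The weighting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: the column names and toy values are hypothetical, and dividing each chi-square statistic by the largest one is only an assumed way of obtaining the zero-to-one scale.

```python
from collections import Counter

def chi_square(xs, ys):
    """Pearson chi-square statistic for two nominal columns of equal length."""
    n = len(xs)
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    observed = Counter(zip(xs, ys))
    stat = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n  # frequency expected under independence
            stat += (observed[(x, y)] - expected) ** 2 / expected
    return stat

def column_weights(columns, output):
    """Chi-square of each input column against the output column.

    Assumption: weights are scaled to [0, 1] by dividing by the largest
    statistic; the source does not say how its scale is obtained.
    """
    raw = {name: chi_square(values, output) for name, values in columns.items()}
    top = max(raw.values()) or 1.0  # avoid dividing by zero
    return {name: value / top for name, value in raw.items()}

# Hypothetical toy data: two input columns and the output column (credit amount).
columns = {
    "employment": ["fixed", "fixed", "temp", "temp", "fixed", "temp"],
    "region": ["north", "south", "north", "south", "north", "south"],
}
credit = ["high", "high", "low", "low", "high", "low"]

weights = column_weights(columns, credit)
print(weights)
```

In this toy data, employment determines the credit level exactly, so it receives the largest weight, while region is only weakly associated with the output and receives a weight close to zero.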
Nearest local neighbor rating
To obtain the values suspected of being abnormal, the K-NN Global Anomaly Score is used. It is based on the k-nearest-neighbor (KNN) algorithm and calculates the anomaly score of each data point relative to its neighborhood. Usually, outliers are far from their neighbors, or their neighborhood is sparse. The first case is known as global anomaly detection and is identified with KNN; the second refers to an approach based on local density.
By default, the score is the average distance to the k nearest neighbors (Amer and Goldstein, 2012). In k-nearest-neighbor classification, a new, unclassified data point is assigned the value of the output column S of its nearest neighbor in the training dataset; this implies a linear decision boundary.
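As a rough illustration of the global score, the sketch below computes the average distance from each tuple to its k nearest neighbors. It assumes a Hamming distance (the number of differing attributes) because the data are nominal; the source does not state which distance was used, and the rows are hypothetical.

```python
def hamming(a, b):
    """Number of attributes on which two nominal tuples differ (assumed metric)."""
    return sum(x != y for x, y in zip(a, b))

def knn_global_scores(rows, k, distance=hamming):
    """Average distance from each row to its k nearest neighbors."""
    if not 1 <= k < len(rows):
        raise ValueError("k must be between 1 and len(rows) - 1")
    scores = []
    for i, row in enumerate(rows):
        # Distances to every other row, smallest first.
        dists = sorted(distance(row, other)
                       for j, other in enumerate(rows) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical customer profiles; the last row is deliberately unusual.
rows = [
    ("fixed", "north", "owner"),
    ("fixed", "north", "owner"),
    ("fixed", "north", "tenant"),
    ("fixed", "south", "owner"),
    ("temp", "west", "other"),
]

scores = knn_global_scores(rows, k=2)
most_anomalous = max(range(len(rows)), key=scores.__getitem__)
print(scores, most_anomalous)
```

The higher the score, the farther a tuple lies from its nearest neighbors, so the tuples with the highest scores are the candidates for closer inspection.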
To obtain a correct prediction, the value k (number of neighbors to be considered around the analyzed value) must be carefully configured. A high value of k represents a poor solution with respect to prediction, while low values tend to generate noise ( Bhattacharyya, Jha, Tharakunnel, & Westland, 2011 ).
Frequently, the parameter k is chosen empirically and depends on the problem. Hassanat, Abbadi, Altarawneh, and Alhasanat (2014) propose testing different numbers of nearest neighbors until the one with the best precision is found. Their proposal tries values from k = 1 up to k = the square root of the number of tuples in the training dataset. A common rule of thumb is to set k to the square root of the number of tuples in dataset D.
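The search over k can be sketched as follows. The function names are hypothetical, and `evaluate` is a placeholder for whatever precision measure is used on the problem at hand; the lambda in the usage lines is only a stand-in so the sketch runs.

```python
import math

def candidate_ks(n_tuples):
    """Candidate k values from 1 up to floor(sqrt(n_tuples))."""
    return list(range(1, math.isqrt(n_tuples) + 1))

def best_k(n_tuples, evaluate):
    """Return the candidate k with the highest score from evaluate(k).

    `evaluate` is a hypothetical callback, for example the precision
    obtained on a validation set when k neighbors are used.
    """
    return max(candidate_ks(n_tuples), key=evaluate)

# For a data set of 1180 tuples, the candidates run from 1 to 34.
ks = candidate_ks(1180)
print(ks[0], ks[-1])

# Stand-in scoring function that happens to peak at k = 7.
chosen = best_k(1180, lambda k: -abs(k - 7))
print(chosen)
```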
HOW CAN WE SOLVE THE PROBLEM OF ATYPICAL DATA?
If we have confirmed that these outliers are not due to an error in constructing the database or in measuring the variable, eliminating them is not the solution. When an outlier is not the result of an error, eliminating or replacing it can alter the inferences drawn from the data, because doing so introduces bias, reduces the sample size, and can affect both the distribution and the variances.
Furthermore, the treasure of our research lies in the variability of the data!
That is, variability (differences in the behavior of a phenomenon) must be explained, not eliminated. And if you still can’t explain it, you should at least be able to reduce the influence of these outliers on your data.
The best option is to reduce the weight of these atypical observations using robust techniques.
Robust statistical methods are modern techniques that address these problems. They are similar to the classic ones but are less affected by the presence of outliers or small variations with respect to the models’ hypotheses.
ALTERNATIVES TO THE MEAN
If we calculate the median (the center value of an ordered sample) for the second data set, we get a value of 14 (the same as for the first data set). We see that this measure of central tendency has not been disturbed by the presence of an extreme value; it is therefore more robust.
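The effect is easy to reproduce with Python's standard statistics module. The two samples below are hypothetical and are not the article's data sets; they differ only in one extreme value.

```python
import statistics

# Hypothetical samples: identical except for one extreme value.
first = [10, 12, 14, 16, 18]
second = [10, 12, 14, 16, 180]

for name, sample in (("first", first), ("second", second)):
    print(name,
          "mean:", statistics.mean(sample),
          "median:", statistics.median(sample))
```

The mean of the second sample is pulled far away from the bulk of the values, while the median is the same for both samples.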
Let’s look at other alternatives…
The trimmed mean (trimming) "discards" extreme values. That is, it removes a fraction of the extreme data from the analysis (e.g., 20%) and calculates the mean of the remaining data. The trimmed mean for our case would be 13.67.
The winsorized mean replaces a percentage of the extreme values (e.g., 20%) with less extreme ones. In our case, the winsorized mean of the second sample would be 13.62.
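Both estimators can be sketched in plain Python. This is a minimal version under stated assumptions: the proportion is applied to each tail of the sorted sample, which is one common convention but not necessarily the one behind the figures above, and the sample is hypothetical rather than the article's data.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each tail."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    data = sorted(values)
    cut = int(len(data) * proportion)  # values dropped per tail
    kept = data[cut:len(data) - cut]
    return sum(kept) / len(kept)

def winsorized_mean(values, proportion):
    """Mean after replacing each tail with the nearest retained value."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    data = sorted(values)
    cut = int(len(data) * proportion)  # values replaced per tail
    if cut:
        data[:cut] = [data[cut]] * cut        # raise the lowest values
        data[-cut:] = [data[-cut - 1]] * cut  # lower the highest values
    return sum(data) / len(data)

# Hypothetical sample with one extreme value.
sample = [10, 11, 12, 13, 14, 14, 15, 17, 19, 180]

plain = sum(sample) / len(sample)
trimmed = trimmed_mean(sample, 0.2)
winsorized = winsorized_mean(sample, 0.2)
print(plain, trimmed, winsorized)
```

On this sample both robust estimates stay close to the bulk of the values, while the plain mean is dominated by the single extreme observation.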
We see that all of these robust estimates better represent the sample and are less affected by extreme data.