Easy Last updated on May 7, 2022, 1:19 a.m.
Biased data has several definitions in statistics. In simple terms, it is data that is not representative of the real-world data, or that lacks the features needed to capture the desired objective. For example, in a classification task the data is biased when one class has many entries and the other classes have comparatively few, which leads to poorly trained models and misleading results.
In a classification setting (or for categorical features), analyzing label frequencies and visualizing them is usually the most direct way to detect biased data. We can further plot histograms of individual features to reveal hidden skewness in the data. Even for advanced models such as CNNs, visualizing the features learned by intermediate layers is used to detect bias.
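The label-frequency check described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the helper names `class_shares` and `imbalance_ratio` and the toy labels are assumptions made for this example, not part of any particular library.

```python
from collections import Counter


def class_shares(labels):
    """Return each class's share of the dataset, most frequent class first."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.most_common()}


def imbalance_ratio(labels):
    """Ratio of the most frequent class count to the least frequent one."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


# Hypothetical toy labels: 90 "negative" entries and 10 "positive" ones.
labels = ["negative"] * 90 + ["positive"] * 10

shares = class_shares(labels)
ratio = imbalance_ratio(labels)

print(shares)
print(ratio)
```

A ratio far above 1 suggests that one class dominates. What counts as "too imbalanced" depends on the problem, so any threshold should be treated as a judgment call.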
For regression targets or numerical features, summary statistics such as the mean, median, and standard deviation are used to understand and detect bias. For example, figure 2 shows how to identify the type of skewness. As a rule of thumb, if the mean of the data is approximately equal to the median, we can treat the distribution as roughly symmetric (unskewed); if median < mean, it is right-skewed, and if median > mean, it is left-skewed.
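The mean-versus-median rule of thumb can be written as a small helper. This is a sketch under stated assumptions: the function name `skew_direction` and the `tolerance` parameter are hypothetical, and scaling the mean-median gap by the standard deviation is just one reasonable way to decide what "approximately equal" means.

```python
import statistics


def skew_direction(values, tolerance=0.05):
    """Classify skew from the mean-median gap, scaled by the standard deviation.

    `tolerance` is an assumed cutoff for "approximately equal", not a standard value.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        # All values are identical, so there is no skew to speak of.
        return "symmetric"
    gap = (mean - median) / spread
    if gap > tolerance:
        return "right-skewed"  # median < mean
    if gap < -tolerance:
        return "left-skewed"   # median > mean
    return "symmetric"


# Hypothetical toy samples.
right = skew_direction([1, 2, 2, 3, 3, 3, 4, 50])
left = skew_direction([-50, 1, 2, 3, 3, 3, 4, 4])
even = skew_direction([1, 2, 3, 4, 5])

print(right, left, even)
```

This heuristic can mislead on multimodal or very small samples, so it is best used alongside a histogram rather than in place of one.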
There are multiple methods to deal with biased data:
1. Sampling algorithms: These can be used to reduce the skewness of the target variable's distribution. Commonly used algorithms are rejection sampling, Metropolis-Hastings, the Gibbs sampler, and importance sampling.
2. Data preprocessing: For regression, or when features are skewed, data transformation techniques can also be used. Common techniques are squaring (applied to left skew), log transformation (positive values only), removing outliers, min-max normalization, cube root (when values are very large), square root (non-negative values only), and the Box-Cox transformation (strictly positive values only).
3. Using probabilistic models: If we cannot make the data unbiased, we can opt for probabilistic modeling approaches such as a Bayesian classifier or hidden Markov models. These models assume an underlying probability distribution, so they can fall back on a prior even where data is absent.
4. Appropriate evaluation metrics: Another way to handle skewed data is to use evaluation metrics that suit it. For example, for classification we can use precision, recall, the ROC curve, etc., whereas for regression we can use weighted MSE, MAE, etc.
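To illustrate the fourth point, the sketch below shows why precision and recall can be more informative than accuracy on imbalanced labels. The data and the helper `classification_report` are hypothetical and written with plain Python only. The "model" here is a stand-in that always predicts the majority class, and precision is set to 0.0 by convention when there are no positive predictions.

```python
def classification_report(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Convention assumed here: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical imbalanced data: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A stand-in "model" that always predicts the majority class.
y_pred = [0] * 100

accuracy, precision, recall = classification_report(y_true, y_pred)
print(accuracy, precision, recall)
```

Accuracy looks high because the majority class dominates, while recall shows that none of the minority-class entries were found.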