Mahalanobis and robust distance

The Mahalanobis distance measures how far each data item lies from the center of the data, taking the scatter of the data into account.

The formula for the Mahalanobis distance of a data item $x_i$ is

$$MD(x_i) = \sqrt{(x_i - T(X))^{\top} \, C(X)^{-1} \, (x_i - T(X))}$$

C(X) represents the covariance matrix and T(X) the arithmetic mean per dimension of all data items. Both the arithmetic mean and the covariance matrix can be strongly influenced by outliers. The robust distance is therefore calculated in the same way, but with robust estimates for the center of the data T(X) and the scatter C(X).
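The calculation can be sketched in Python with NumPy. This is a minimal illustration, not the implementation used for the pictures in this section; the function name `mahalanobis_distances` and its `center` and `cov` parameters are hypothetical. With the defaults it uses the arithmetic mean and the sample covariance matrix. Passing robust estimates instead would give a robust distance.

```python
import numpy as np

def mahalanobis_distances(X, center=None, cov=None):
    """Distance of every row of X from `center`, scaled by `cov`.

    Defaults: arithmetic mean per dimension and sample covariance,
    which gives the classical Mahalanobis distance.
    """
    X = np.asarray(X, dtype=float)
    if center is None:
        center = X.mean(axis=0)            # T(X): mean per dimension
    if cov is None:
        cov = np.cov(X, rowvar=False)      # C(X): sample covariance matrix
    diff = X - center
    # Solve C(X) z = diff^T rather than forming the inverse explicitly.
    solved = np.linalg.solve(cov, diff.T).T
    return np.sqrt(np.sum(diff * solved, axis=1))
```

With an identity matrix as `cov`, the result reduces to the ordinary Euclidean distance from `center`, which is a convenient sanity check.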

To demonstrate the difference between the Mahalanobis and the robust distance, the artificial dataset of Hawkins, Bradu, and Kass (1984) is used. This dataset contains 75 data items, 14 of which are outliers. The dataset is illustrated in the following picture.

Picture 5: Illustration of the HBK dataset in parallel coordinates

This visualization clearly shows a group of data items that differs completely from the rest of the data. However, such so-called outliers cannot always be detected in an illustration of a high-dimensional dataset. It is therefore important to calculate a reliable measure that allows the detection of data items that behave differently from the majority of the data.

To detect such items in a high-dimensional data space, the Mahalanobis and the robust distances were calculated and shown in the parallel coordinates as additional dimensions next to those of the HBK dataset. In picture 6, the highest values of the Mahalanobis distance dimension were selected to identify outliers. However, this selection also includes "normal" data items. Alternatively, only the single data item with a markedly higher distance value could have been selected, but then not all outliers would have been detected.

Picture 6: Outlier detection according to the Mahalanobis distance

In contrast, the robust distance clearly separates the outliers from the main part of the data, as can be seen in picture 7. It also helps that the distance values show a pronounced gap between the two groups.

Picture 7: Outlier detection according to the robust distance
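The separation described above can be reproduced in a small sketch. The text does not state which robust estimator was used, so the code below is an assumption: it uses a simplified concentration-step procedure in the spirit of MCD-type estimators (single start, no consistency factor, no reweighting). The function name `robust_distances` is hypothetical, and the data are a synthetic stand-in, not the HBK dataset.

```python
import numpy as np

def robust_distances(X, n_steps=20):
    """Robust distances from a simplified concentration-step estimate.

    Assumption: a bare-bones stand-in for an MCD-type estimator,
    not the estimator used for the pictures in this section.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    h = (n + p + 1) // 2                  # subset size: just over half the data

    # Initial subset: the h items closest to the coordinate-wise median,
    # each coordinate scaled by its median absolute deviation
    # (assumes no coordinate has a zero MAD).
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    start = np.sum(((X - med) / mad) ** 2, axis=1)
    subset = np.argsort(start)[:h]

    for _ in range(n_steps):
        center = X[subset].mean(axis=0)          # robust T(X)
        cov = np.cov(X[subset], rowvar=False)    # robust C(X)
        diff = X - center
        d2 = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
        new_subset = np.argsort(d2)[:h]          # keep the h closest items
        if set(new_subset) == set(subset):
            break                                # subset no longer changes
        subset = new_subset
    return np.sqrt(d2)

# Synthetic stand-in for a contaminated dataset (not the HBK data):
# 10 clustered outliers followed by 60 regular items in 3 dimensions.
rng = np.random.default_rng(0)
outliers = 10 + 0.1 * rng.normal(size=(10, 3))
regular = rng.normal(size=(60, 3))
X = np.vstack([outliers, regular])
rd = robust_distances(X)
```

Because the center and scatter are estimated only from the most central half of the data, the clustered outliers cannot pull the estimates toward themselves, so their distances in `rd[:10]` should be far larger than those of the regular items.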