Compute Mahalanobis Distance and Flag Multivariate Outliers

Pipe-friendly wrapper around the function mahalanobis(), which returns the squared Mahalanobis distance of all rows in x. Unlike the base function, it also flags multivariate outliers automatically.

Mahalanobis distance is a common metric used to identify multivariate outliers. The larger the value of Mahalanobis distance, the more unusual the data point (i.e., the more likely it is to be a multivariate outlier).

The distance tells us how far an observation is from the center of the cloud, taking into account the shape (covariance) of the cloud as well.

To detect outliers, the calculated Mahalanobis distance is compared against a chi-square (X^2) distribution with degrees of freedom equal to the number of dependent (outcome) variables and an alpha level of 0.001.

The threshold for declaring a multivariate outlier is computed with qchisq(0.999, df), where df is the degrees of freedom (i.e., the number of dependent variables used in the computation).
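The rule described above can be sketched in base R. This is only an illustration of the computation, not the package's actual source; the variable names x, d2, threshold and flagged are chosen for this example, and the column names simply mirror those listed under Value.

```r
# Numeric variables only: the four iris measurements.
x <- as.matrix(iris[, 1:4])

# Squared Mahalanobis distance of each row from the column means,
# using the sample covariance matrix of x.
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Cutoff: 0.999 quantile of a chi-square distribution with
# df = number of variables (about 18.47 for df = 4).
threshold <- qchisq(0.999, df = ncol(x))

# Attach the distance and the outlier flag to the data.
flagged <- data.frame(x, mahal.dist = d2, is.outlier = d2 > threshold)
head(flagged)
```

Because the cutoff depends only on df, it stays the same for any data set with the same number of variables.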

mahalanobis_distance(data, ...)

Arguments

data: a data frame. Columns are variables.
...: One or more unquoted expressions (or variable names). Used to select the variables of interest. Can also be used to exclude variables that are not needed for the computation. For example, specify -id to ignore the id column.
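A minimal sketch of the -id selection mentioned above, assuming the rstatix package is installed. The data frame df and its id column are hypothetical and exist only for this illustration.

```r
library(rstatix)

# Hypothetical data: the four iris measurements plus an illustrative id column.
df <- iris[, 1:4]
df$id <- seq_len(nrow(df))

# Exclude id so that only the four measurements enter the distance computation.
res <- mahalanobis_distance(df, -id)
head(res)
```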

Value

Returns the input data frame with two additional columns: 1) "mahal.dist": Mahalanobis distance values; and 2) "is.outlier": logical values specifying whether a given observation is a multivariate outlier.

Examples


library(rstatix)

# Compute Mahalanobis distance and flag outliers, if any
iris %>%
  doo(~mahalanobis_distance(.))
#> # A tibble: 150 × 6
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width mahal.dist is.outlier
#>           <dbl>       <dbl>        <dbl>       <dbl>      <dbl> <lgl>     
#>  1          5.1         3.5          1.4         0.2       2.13 FALSE     
#>  2          4.9         3            1.4         0.2       2.85 FALSE     
#>  3          4.7         3.2          1.3         0.2       2.08 FALSE     
#>  4          4.6         3.1          1.5         0.2       2.45 FALSE     
#>  5          5           3.6          1.4         0.2       2.46 FALSE     
#>  6          5.4         3.9          1.7         0.4       3.88 FALSE     
#>  7          4.6         3.4          1.4         0.3       2.86 FALSE     
#>  8          5           3.4          1.5         0.2       1.83 FALSE     
#>  9          4.4         2.9          1.4         0.2       3.38 FALSE     
#> 10          4.9         3.1          1.5         0.1       2.38 FALSE     
#> # … with 140 more rows

# Compute distance by groups and filter outliers
iris %>%
  group_by(Species) %>%
  doo(~mahalanobis_distance(.)) %>%
  filter(is.outlier == TRUE)
#> # A tibble: 0 × 7
#> # … with 7 variables: Species <fct>, Sepal.Length <dbl>, Sepal.Width <dbl>,
#> #   Petal.Length <dbl>, Petal.Width <dbl>, mahal.dist <dbl>, is.outlier <lgl>