Detect outliers using boxplot methods. Boxplots are a popular and easy method for identifying outliers. There are two categories of outliers: (1) outliers and (2) extreme points.

Values above Q3 + 1.5 × IQR or below Q1 - 1.5 × IQR are considered outliers. Values above Q3 + 3 × IQR or below Q1 - 3 × IQR are considered extreme points (or extreme outliers).

Q1 and Q3 are the first and third quartile, respectively. IQR is the interquartile range (IQR = Q3 - Q1).

Generally speaking, data points labelled as outliers in boxplots are considered less troublesome than those labelled as extreme points, and might even be ignored. Note that any NA and NaN values are automatically removed before the quantiles are computed.
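The thresholds described above can be sketched in base R. This is only an illustration of the rule, not the package's implementation: the helper name flag_outside_fences is hypothetical, and treating missing values as "not flagged" is an assumption made for this sketch.

```r
# Hypothetical helper (not part of the documented API): flag values that fall
# outside Q1 - coef * IQR and Q3 + coef * IQR.
flag_outside_fences <- function(x, coef = 1.5) {
  # NA and NaN are dropped before the quartiles are computed
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - coef * iqr
  upper <- q[2] + coef * iqr
  flagged <- x < lower | x > upper
  # Assumption of this sketch: missing values are reported as not flagged
  flagged[is.na(flagged)] <- FALSE
  flagged
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100, NA)
flag_outside_fences(x)            # coef = 1.5: outliers
flag_outside_fences(x, coef = 3)  # coef = 3: extreme points
```

Here Q1 = 3.25 and Q3 = 7.75, so the upper thresholds are 14.5 (outliers) and 21.25 (extreme points); only the value 100 exceeds them.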

identify_outliers(data, ..., variable = NULL)

is_outlier(x, coef = 1.5)

is_extreme(x)

Arguments

data

a data frame

...

One unquoted expression (or variable name). Used to select a variable of interest. Alternative to the argument variable.

variable

variable name for detecting outliers

x

a numeric vector

coef

coefficient specifying how far an outlier should be from the edge of the box, as a multiple of the IQR. Possible values are 1.5 (for outliers) and 3 (for extreme points only). Default is 1.5.

Value

  • identify_outliers(). Returns a data frame containing the rows of the input flagged as outliers, with two additional columns: "is.outlier" and "is.extreme", which hold logical values.

  • is_outlier() and is_extreme(). Returns logical vectors.

Functions

  • identify_outliers: takes a data frame and extracts the rows suspected to be outliers according to a numeric column. The columns "is.outlier" and "is.extreme" are added.

  • is_outlier: detects outliers in a numeric vector. Returns a logical vector.

  • is_extreme: detects extreme points in a numeric vector. An alias of is_outlier() with coef = 3. Returns a logical vector.
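As a brief illustration of the two vector helpers, the sketch below applies them to a small numeric vector. It assumes these functions are provided by the rstatix package and that it is installed; the vector x is made up for the example.

```r
# Assumption: is_outlier() and is_extreme() come from the rstatix package
library(rstatix)

# Made-up data: Q1 = 12.5, Q3 = 17.5, IQR = 5,
# so the upper thresholds are 25 (coef = 1.5) and 32.5 (coef = 3)
x <- c(10, 11, 12, 13, 14, 15, 16, 17, 18, 28, 60)

is_outlier(x)            # flags 28 and 60
is_extreme(x)            # flags only 60
is_outlier(x, coef = 3)  # same result as is_extreme(x)
```

The value 28 lies beyond the 1.5 × IQR threshold but within the 3 × IQR threshold, so it is an outlier without being an extreme point.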

Examples

# Generate a demo data
set.seed(123)
demo.data <- data.frame(
  sample = 1:20,
  score = c(rnorm(19, mean = 5, sd = 2), 50),
  gender = rep(c("Male", "Female"), each = 10)
)

# Identify outliers according to the variable score
demo.data %>%
  identify_outliers(score)
#>   sample score gender is.outlier is.extreme
#> 1     20    50 Female       TRUE       TRUE

# Identify outliers by groups
demo.data %>%
  group_by(gender) %>%
  identify_outliers("score")
#> # A tibble: 2 x 5
#>   gender sample score is.outlier is.extreme
#>   <fct>   <int> <dbl> <lgl>      <lgl>
#> 1 Female     18  1.07 TRUE       FALSE
#> 2 Female     20 50    TRUE       TRUE