Detect outliers using boxplot methods. Boxplots are a popular and an easy method for identifying outliers. There are two categories of outlier: (1) outliers and (2) extreme points.

Values above Q3 + 1.5xIQR or below Q1 - 1.5xIQR are considered as outliers. Values above Q3 + 3xIQR or below Q1 - 3xIQR are considered as extreme points (or extreme outliers).

Q1 and Q3 are the first and third quartile, respectively. IQR is the interquartile range (IQR = Q3 - Q1).

Generally speaking, data points that are labelled outliers in boxplots are not considered as troublesome as those considered extreme points and might even be ignored.

identify_outliers(data, ..., variable = NULL)

is_outlier(x, coef = 1.5)

is_extreme(x)

Arguments

data

a data frame

...

One unquoted expressions (or variable name). Used to select a variable of interest. Alternative to the argument variable.

variable

variable name for detecting outliers

x

a numeric vector

coef

coefficient specifying how far the outlier should be from the edge of their box. Possible values are 1.5 (for outlier) and 3 (for extreme points only). Default is 1.5

Value

  • identify_outliers(). Returns the input data frame with two additional columns: "is.outlier" and "is.extreme", which hold logical values.

  • is_outlier() and is_extreme(). Returns logical vectors.

Functions

  • identify_outliers: takes a data frame and extract rows suspected as outliers according to a numeric column. The following columns are added "is.outlier" and "is.extreme".

  • is_outlier: detect outliers in a numeric vector. Returns logical vector.

  • is_extreme: detect extreme points in a numeric vector. An alias of is_outlier(), where coef = 3. Returns logical vector.

Examples

# Generate a demo data set.seed(123) demo.data <- data.frame( sample = 1:20, score = c(rnorm(19, mean = 5, sd = 2), 50), gender = rep(c("Male", "Female"), each = 10) ) # Identify outliers according to the variable score demo.data %>% identify_outliers(score)
#> sample score gender is.outlier is.extreme #> 1 20 50 Female TRUE TRUE
# Identify outliers by groups demo.data %>% group_by(gender) %>% identify_outliers("score")
#> # A tibble: 2 x 5 #> gender sample score is.outlier is.extreme #> <fct> <int> <dbl> <lgl> <lgl> #> 1 Female 18 1.07 TRUE FALSE #> 2 Female 20 50 TRUE TRUE