Determine the Optimal Cutpoint for Continuous Variables

Determine the optimal cutpoint for one or multiple continuous variables at once, using the maximally selected rank statistics from the 'maxstat' R package. This is an outcome-oriented methods providing a value of a cutpoint that correspond to the most significant relation with outcome (here, survival).

surv_cutpoint(): Determine the optimal cutpoint for each variable using 'maxstat'.
surv_categorize(): Divide each variable values based on the cutpoint returned by surv_cutpoint().

Usage

surv_cutpoint(
  data,
  time = "time",
  event = "event",
  variables,
  minprop = 0.1,
  progressbar = TRUE
)

surv_categorize(x, variables = NULL, labels = c("low", "high"))

# S3 method for class 'surv_cutpoint'
summary(object, ...)

# S3 method for class 'surv_cutpoint'
print(x, ...)

# S3 method for class 'surv_cutpoint'
plot(x, variables = NULL, ggtheme = theme_classic(), bins = 30, ...)

# S3 method for class 'plot_surv_cutpoint'
print(x, ..., newpage = TRUE)

Arguments

data: a data frame containing survival information (time, event) and continuous variables (e.g.: gene expression data).
time, event: column names containing time and event data, respectively. Event values sould be 0 or 1.
variables: a character vector containing the names of variables of interest, for wich we want to estimate the optimal cutpoint.
minprop: the minimal proportion of observations per group.
progressbar: logical value. If TRUE, show progress bar. Progressbar is shown only, when the number of variables > 5.
x, object: an object of class surv_cutpoint
labels: labels for the levels of the resulting category.
...: other arguments. For plots, see ?ggpubr::ggpar
ggtheme: function, ggplot2 theme name. Default value is theme_classic. Allowed values include ggplot2 official themes. See theme().
bins: Number of bins for histogram. Defaults to 30.
newpage: open a new page. See grid.arrange.

Value

surv_cutpoint(): returns an object of class 'surv_cutpoint', which is a list with the following components:
- maxstat results for each variable (see ?maxstat::maxstat)
- cutpoint: a data frame containing the optimal cutpoint of each variable. Rows are variable names and columns are c("cutpoint", "statistic").
- data: a data frame containing the survival data and the original data for the specified variables.
- minprop: the minimal proportion of observations per group.
- not_numeric: contains data for non-numeric variables, in the context where the user provided categorical variable names in the argument variables.
Methods defined for surv_cutpoint object are summary, print and plot.
surv_categorize(): returns an object of class 'surv_categorize', which is a data frame containing the survival data and the categorized variables.

Author

Alboukadel Kassambara, alboukadel.kassambara@gmail.com

Examples

# 0. Load some data
data(myeloma)
head(myeloma)
#>          molecular_group chr1q21_status treatment event  time   CCND1 CRIM1
#> GSM50986      Cyclin D-1       3 copies       TT2     0 69.24  9908.4 420.9
#> GSM50988      Cyclin D-2       2 copies       TT2     0 66.43 16698.8  52.0
#> GSM50989           MMSET       2 copies       TT2     0 66.50   294.5 617.9
#> GSM50990           MMSET       3 copies       TT2     1 42.67   241.9  11.9
#> GSM50991             MAF           <NA>       TT2     0 65.00   472.6  38.8
#> GSM50992    Hyperdiploid       2 copies       TT2     0 65.20   664.1  16.9
#>          DEPDC1    IRF4   TP53   WHSC1
#> GSM50986  523.5 16156.5   10.0   261.9
#> GSM50988   21.1 16946.2 1056.9   363.8
#> GSM50989  192.9  8903.9 1762.8 10042.9
#> GSM50990  184.7 11894.7  946.8  4931.0
#> GSM50991  212.0  7563.1  361.4   165.0
#> GSM50992  341.6 16023.4 2096.3   569.2

# 1. Determine the optimal cutpoint of variables
res.cut <- surv_cutpoint(myeloma, time = "time", event = "event",
   variables = c("DEPDC1", "WHSC1", "CRIM1"))

summary(res.cut)
#>        cutpoint statistic
#> DEPDC1    279.8  4.275452
#> WHSC1    3205.6  3.361330
#> CRIM1      82.3  1.968317

# 2. Plot cutpoint for DEPDC1
# palette = "npg" (nature publishing group), see ?ggpubr::ggpar
plot(res.cut, "DEPDC1", palette = "npg")
#> $DEPDC1

#> 

# 3. Categorize variables
res.cat <- surv_categorize(res.cut)
head(res.cat)
#>           time event DEPDC1 WHSC1 CRIM1
#> GSM50986 69.24     0   high   low  high
#> GSM50988 66.43     0    low   low   low
#> GSM50989 66.50     0    low  high  high
#> GSM50990 42.67     1    low  high   low
#> GSM50991 65.00     0    low   low   low
#> GSM50992 65.20     0   high   low   low

# 4. Fit survival curves and visualize
library("survival")
fit <- survfit(Surv(time, event) ~DEPDC1, data = res.cat)
ggsurvplot(fit, data = res.cat, risk.table = TRUE, conf.int = TRUE)