Censored Regression

Introduction

Data collected from an experimental study is sometimes censored, so that only partial information is known about a subset of observations. For instance, when measuring the lifespan of mice, we may find a number of subjects live beyond the duration of the project. Thus, all we know is the lower bound on their lifespan. This right censoring can be incorporated into a regression model via convex optimization.

Suppose that only \(K\) of our observations \((x_i,y_i)\) are fully observed, and the remaining are censored such that we observe \(x_i\), but only know \(y_i \geq D\) for \(i = K+1,\ldots,m\) and some constant \(D \in {\mathbf R}\). We can build an OLS model using the uncensored data, restricting the fitted values \(\hat y_i = x_i^T\beta\) to lie above \(D\) for the censored observations:

\[ \begin{array}{ll} \underset{\beta}{\mbox{minimize}} & \sum_{i=1}^K (y_i - x_i^T\beta)^2 \\ \mbox{subject to} & x_i^T\beta \geq D, \quad i = K+1,\ldots,m. \end{array} \]

This avoids the bias introduced by standard OLS, while still utilizing all of the data points in the regression. The constraint requires only one more line in CVXR.

Example

We will generate synthetic data for this example, censoring observations beyond a value \(D\).

## Problem data
n <- 30
M <- 50
K <- 200

set.seed(n * M * K)
X <- matrix(stats::rnorm(K * n), nrow = K, ncol = n)
beta_true <- matrix(stats::rnorm(n), nrow = n, ncol = 1)
y <- X %*% beta_true + 0.3 * sqrt(n) * stats::rnorm(K)

## Order variables based on y
idx <- order(y, decreasing = FALSE)
y_ordered <- y[idx]
X_ordered <- X[idx, ]

## Find cutoff and censor
D <- (y_ordered[M] + y_ordered[M + 1]) / 2
censored <- (y_ordered > D)
y_censored <- pmin(y_ordered, D)

We now fit regular OLS, OLS using just the censored data and finally the censored regression.

## Regular OLS
beta <- Variable(n)
obj <- sum((y_censored - X_ordered %*% beta)^2)
prob <- Problem(Minimize(obj))
result <- solve(prob)
beta_ols <- result$getValue(beta)

## OLS using uncensored data
obj <- sum((y_censored[1:M] - X_ordered[1:M,] %*% beta)^2)
prob <- Problem(Minimize(obj))
result <- solve(prob)
beta_unc <- result$getValue(beta)

## Censored regression
obj <- sum((y_censored[1:M] - X_ordered[1:M,] %*% beta)^2)
constr <- list(X_ordered[(M+1):K,] %*% beta >= D)
prob <- Problem(Minimize(obj), constr)
result <- solve(prob)
beta_cens <- result$getValue(beta)

Here’s are some plots comparing the results. The blue diamond points are estimates.

plot_results <- function(beta_res, title) {
    d <- data.frame(index = seq_len(K),
                    y_ordered = y_ordered,
                    y_fit = as.numeric(X_ordered %*% beta_res),
                    censored = as.factor(censored))
    ggplot(data = d) +
        geom_point(mapping = aes(x = index, y = y_ordered, color = censored)) +
        geom_point(mapping = aes(x = index, y = y_fit), color = "blue", shape = 23) +
        geom_abline(intercept = D, slope = 0, lty = "dashed") +
        labs(x = "Observations", y = "y") +
        ggtitle(title)
}
plot_results(beta_ols, "Regular OLS.")

plot_results(beta_unc, "OLS using uncensored data.")

plot_results(beta_cens, "Censored Regression.")

## Testthat Results: No output is good
## Error: `beta_ols` not identical to censored$beta_ols.
## Objects equal but not identical

Session Info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
## [1] ggplot2_3.5.1    CVXR_1.0-15      testthat_3.2.1.1 here_1.0.1      
## 
## loaded via a namespace (and not attached):
##  [1] gmp_0.7-5         clarabel_0.9.0.1  sass_0.4.9        utf8_1.2.4       
##  [5] generics_0.1.3    slam_0.1-54       blogdown_1.19     lattice_0.22-6   
##  [9] digest_0.6.37     magrittr_2.0.3    evaluate_1.0.1    grid_4.4.2       
## [13] bookdown_0.41     pkgload_1.4.0     fastmap_1.2.0     rprojroot_2.0.4  
## [17] jsonlite_1.8.9    Matrix_1.7-1      brio_1.1.5        Rmosek_10.2.0    
## [21] fansi_1.0.6       scales_1.3.0      codetools_0.2-20  jquerylib_0.1.4  
## [25] cli_3.6.3         Rmpfr_0.9-5       rlang_1.1.4       Rglpk_0.6-5.1    
## [29] bit64_4.5.2       munsell_0.5.1     withr_3.0.2       cachem_1.1.0     
## [33] yaml_2.3.10       tools_4.4.2       osqp_0.6.3.3      Rcplex_0.3-6     
## [37] rcbc_0.1.0.9001   dplyr_1.1.4       colorspace_2.1-1  gurobi_11.0-0    
## [41] assertthat_0.2.1  vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4  
## [45] bit_4.5.0         desc_1.4.3        cccp_0.3-1        pkgconfig_2.0.3  
## [49] bslib_0.8.0       pillar_1.9.0      gtable_0.3.6      glue_1.8.0       
## [53] Rcpp_1.0.13-1     highr_0.11        xfun_0.49         tibble_3.2.1     
## [57] tidyselect_1.2.1  knitr_1.48        farver_2.1.2      htmltools_0.5.8.1
## [61] labeling_0.4.3    rmarkdown_2.29    compiler_4.4.2

Source

R Markdown

References