class: center, middle, inverse, title-slide # Bayesian Cumulative Probability Models for Continuous and Mixed Outcomes ### Nathan T. James ### December 4, 2019 --- <!-- Outline -> Background of CPMs, advantages -> Bayesian inference review -> Additional advantages for Bayesian CPM -> Model formulation -- model specification -- likelihood -- priors -- full model -> Estimation of functionals -> Software -> Simulation setting and results -> Case study: HIV biomarkers -> Discussion -> Future Research --> # Outline .font160[ - Background - Bayesian Cumulative Probability Model - Simulations - Case Study - Discussion and Future Research ] --- # Background: .font70[Cumulative Probability Models] .font140[ Ordinal cumulative probability models (CPMs) such as the proportional odds regression model are typically used for discrete ordered outcomes. Why use a CPM for continuous outcomes? ] -- .font140[ - Regression coefficients are invariant to monotonic transformations of outcome - Directly models full conditional cumulative distribution function (CDF) - Handles ordered outcomes including mixed continuous/discrete distributions (e.g., continuous outcome with lower limit of detection) ] -- .font140[ Recent papers by Liu et al.<sup>1</sup> and Tian et al.<sup>2</sup> describe estimation and inference for CPMs with continuous responses using non-parametric maximum likelihood estimation (NPMLE). ] .font90[ .footnote[[1] Liu et al. (2017). Modeling continuous response variables using ordinal regression. *Statistics in Medicine*, 36(27), 4316-4335.] .footnote[[2] Tian et al. (In Press). An Empirical Comparison of Two Novel Transformation Models. *Statistics in Medicine*. ] ] --- # Background: .font70[Bayesian Inference] .font150[ Under the Bayesian paradigm we combine the `\(\color{RoyalBlue}{\text{likelihood}}\)` with `\(\color{orange}{\text{prior}}\)` information to get a `\(\color{LimeGreen}{\text{posterior}}\)` for the parameters given the data. 
`$$\color{LimeGreen}{p(\theta|y)}=\frac{\color{orange}{p(\theta)}\color{RoyalBlue}{p(y|\theta)}}{\int \color{orange}{p(\theta)} \color{RoyalBlue}{p(y|\theta)}\, d\theta}$$` For fixed data `\(y\)` the denominator is a constant with respect to `\(\theta\)` so `$$\color{LimeGreen}{p(\theta|y)} \propto \color{orange}{p(\theta)}\color{RoyalBlue}{p(y|\theta)}$$` All inference is based on the `\(\color{LimeGreen}{\text{posterior}}\)` distribution of the parameters ] --- # Bayesian Cumulative Probability Model .font150[ Bayesian CPMs inherit many of the properties of CPMs estimated with NPMLE since they use the same likelihood Additional Advantages ➕ Interpretation using posterior probabilities ➕ Exact inference within simulation error ➕ Ability to incorporate prior information if available ➕ Extensions to hierarchical models, mixture models ] -- .font150[ Disadvantages of Bayesian CPM ➖ More computationally intensive ➖ Determining appropriate prior distribution can be challenging ] --- # Bayesian Cumulative Probability Model: .font70[Model specification] .font150[ Let `\(Y_i\)` be the outcome for individual `\(i=1, \ldots, n\)` with `\(p\)` associated covariates `\(\boldsymbol{X_i}=(X_{i1},\ldots,X_{ip})\)` Each `\(Y_i\)` falls into one of `\(j=1,\ldots,J\)` ordered categories ] -- .font150[ This can be modeled as `\(Y_i \sim Multinomial(1,\boldsymbol{\pi_i})\)` where `\(\boldsymbol{\pi_i}=(\pi_{i1},\ldots,\pi_{iJ})\)` are the probabilities of individual `\(i\)` being in category `\(j\)` and `\(\sum_{j=1}^{J} \pi_{ij}=1\)` Then the **cumulative probability** of falling into category `\(j\)` or lower is `\(P(Y_i \le j)=\eta_{ij}=\sum_{k=1}^{j} \pi_{ik}\)` ] --- # Bayesian Cumulative Probability Model: .font70[Likelihood] .font150[ The CPM relates the cumulative probabilities to the covariates through a monotonically increasing link function `\(G(\cdot)\)` `$$G(\eta_{ij})=\gamma_j-\boldsymbol{x_i^{T}\beta} \text{ or equivalently } 
\eta_{ij}=G^{-1}(\gamma_j-\boldsymbol{x_i^{T}\beta})$$` where the `\(\gamma_j\)` are latent continuous cutpoints `\(-\infty \equiv \gamma_0 < \gamma_1 < \cdots < \gamma_{J-1} < \gamma_J \equiv \infty\)` ] -- .font150[ Common choices for `\(G(\cdot)\)` are logit `\(G(p)=\log\left(\frac{p}{1-p}\right)\)` , probit `\(G(p)=\Phi^{-1}(p)\)` and loglog `\(G(p)=-\log(-\log(p))\)` ] --- # Bayesian Cumulative Probability Model: .font70[Likelihood] .font140[ The probabilities of category membership are `$$\pi_{ij}=\eta_{i,j} - \eta_{i,j-1} = G^{-1} (\gamma_j-\boldsymbol{x_i^{T}\beta}) - G^{-1}(\gamma_{j-1}-\boldsymbol{x_i^{T}\beta})$$` ] -- .font140[ so the likelihood for a random sample of observations `\((y_1,\ldots,y_n)\)` is `$$p(\boldsymbol{y}|\boldsymbol{x},\boldsymbol{\gamma},\boldsymbol{\beta})=\prod_{j=1}^{J}\prod_{i:y_i=j} [G^{-1}(\gamma_j-\boldsymbol{x_i^{T}\beta}) - G^{-1}(\gamma_{j-1}-\boldsymbol{x_i^{T}\beta})]$$` ] -- .font140[ For continuous data with no ties `\(J=n\)`. Letting `\(r(y_i)\)` be the rank of `\(y_i\)` the `\(\color{RoyalBlue}{\text{likelihood}}\)` is `$$\color{RoyalBlue}{p(\boldsymbol{y}|\boldsymbol{x},\boldsymbol{\gamma},\boldsymbol{\beta})}=\prod_{i=1}^{n} [G^{-1}(\gamma_{r(y_i)}-\boldsymbol{x_i^{T}\beta}) - G^{-1}(\gamma_{r(y_i)-1}-\boldsymbol{x_i^{T}\beta})]$$` ] <!-- The probabilities of category membership are `\(\pi_{ij}=\begin{cases}\eta_{ij} & \text{for } j=1 \\\eta_{ij} - \eta_{i(j-1)} & \text{for } j=2\ldots,J-1\\1- \eta_{i(J-1)} & \text{for } j=J\end{cases}\)` `\(\pi_{ij}=\begin{cases}G^{-1}[\gamma_j-\boldsymbol{x_i^{T}\beta}] & \text{for } j=1 \\G^{-1}[\gamma_j-\boldsymbol{x_i^{T}\beta}] - G^{-1}[\gamma_{i(j-1)}-\boldsymbol{x_i^{T}\beta}] & \text{for } j=2\ldots,J-1\\1- G^{-1}[\gamma_{J-1}-\boldsymbol{x_i^{T}\beta}] & \text{for } j=J\end{cases}\)` --> --- # Bayesian Cumulative Probability Model: .font70[Priors] .font150[ To complete the Bayesian model specification we need to define `\(\color{orange}{\text{priors}}\)` for the unknown 
parameters `\(\color{orange}{p(\boldsymbol{\beta},\boldsymbol{\gamma})}\)` ] -- .font150[ We assume a priori independence between `\(\boldsymbol{\beta}\)` and `\(\boldsymbol{\gamma}\)` so `\(\color{orange}{p(\boldsymbol{\beta},\boldsymbol{\gamma})}=\color{orange}{p(\boldsymbol{\beta})p(\boldsymbol{\gamma})}\)` For convenience, we let the prior for the regression coefficients be uninformative `\(\color{orange}{p(\boldsymbol{\beta})} \propto 1\)` but other choices are possible ] -- .font150[ Specifying a prior for the `\(\boldsymbol{\gamma}\)` directly is challenging because of the ordering constraint and high dimensionality Instead, we specify a prior for a **transformation** of `\(\boldsymbol{\gamma}\)` ] --- # Bayesian Cumulative Probability Model: .font70[Priors] .font150[ Let `\(\pi_{\cdot j} \equiv Pr(r(y)=j|x=\boldsymbol{0})\)` be the probability of being in category `\(j\)` when all the covariates are 0. From the previous definition `$$\pi_{\cdot j}=G^{-1}(\gamma_j-0) - G^{-1}(\gamma_{j-1}-0)=G^{-1}(\gamma_j) - G^{-1}(\gamma_{j-1})$$` ] -- <!-- `\(\pi_{ij}=G^{-1}(\gamma_j-\boldsymbol{x_i^{T}\beta}) - G^{-1}(\gamma_{j-1}-\boldsymbol{x_i^{T}\beta})\)` --> .font150[ Conversely, the sum telescopes: `$$\sum_{k=1}^{j}\pi_{\cdot k}=\sum_{k=1}^{j}\left[G^{-1}(\gamma_k) - G^{-1}(\gamma_{k-1})\right]=G^{-1}(\gamma_{j}) \quad \Rightarrow \quad G\left(\sum_{k=1}^{j} \pi_{\cdot k}\right)=\gamma_j$$` These equations define a transformation `\(h(\boldsymbol{\gamma})\)` between the cutpoints `\(\boldsymbol{\gamma}\)` and the probabilities of category membership if all the covariates were 0, `\(\boldsymbol{\pi_{\cdot}}=(\pi_{\cdot 1},\ldots,\pi_{\cdot J})\)` ] --- # Bayesian Cumulative Probability Model: .font70[Dirichlet distribution] .font130[ Finally, we specify a Dirichlet prior for `\(\boldsymbol{\pi_{\cdot}}\)` as `\(\color{orange}{p(\boldsymbol{\pi_{\cdot}})} \propto \prod_{j=1}^{J}\pi_{\cdot j}^{\alpha_j-1}\)` where `\(\alpha_1 = \cdots =\alpha_J=\frac{1}{J}\)` The Dirichlet distribution is the
multivariate generalization of a Beta distribution to `\(J \ge 2\)` dimensions with probabilities `\(\pi_j \in (0,1)\)` such that `\(\sum_{j=1}^{J}\pi_j=1\)` and parameters `\(\alpha_j > 0\; \forall j\)` ] <!-- The density function for a Dirichlet with `\(J\)` dimensions is `\(p(\pi_1,\ldots,\pi_J)=\frac{\Gamma(\sum_{j=1}^{J} \alpha_j)}{\prod_{j=1}^{J} \Gamma(\alpha_j)}\prod_{j=1}^{J}\pi_j^{\alpha_j-1}\)` with `\(\pi_j \in (0,1)\)` and `\(\sum_{j=1}^{J}\pi_j=1\)` and parameters `\(\alpha_j > 0\)` for all `\(j\)` --> -- .font130[ Because the Dirichlet is conjugate to the multinomial distribution, the `\(\alpha_j\)` can be interpreted as the number of pseudo-observations in each category contributed by the prior Setting `\(\alpha_1 = \cdots =\alpha_J=\frac{1}{J}\)` implies a total prior contribution of `\(\sum_{j=1}^{J} \alpha_j=\sum_{j=1}^{J}\frac{1}{J}=1\)` observation Additionally, this prior implies `\(\pi_{\cdot j}>0\)` for all `\(j\)` when all the covariates are 0 ] --- # Bayesian Cumulative Probability Model: .font70[Full Bayesian model] .font160[ Combining the `\(\color{orange}{\text{priors}}\)` for `\(\boldsymbol{\beta}\)` and `\(\boldsymbol{\gamma}\)` with the `\(\color{RoyalBlue}{\text{likelihood}}\)` we have `\begin{align*} \color{LimeGreen}{p(\boldsymbol{\gamma},\boldsymbol{\beta}|\boldsymbol{x},\boldsymbol{y})} & \propto \color{orange}{p(\boldsymbol{\gamma})p(\boldsymbol{\beta})} \color{RoyalBlue}{p(\boldsymbol{y}|\boldsymbol{x},\boldsymbol{\gamma},\boldsymbol{\beta})}\\ &\propto \color{orange}{p(h(\boldsymbol{\gamma}))|\mathcal{J_h}|p(\boldsymbol{\beta})} \color{RoyalBlue}{p(\boldsymbol{y}|\boldsymbol{x},\boldsymbol{\gamma},\boldsymbol{\beta})}\\ &\propto \color{orange}{p(\boldsymbol{\pi_{\cdot}})|\mathcal{J_h}|p(\boldsymbol{\beta})} \color{RoyalBlue}{p(\boldsymbol{y}|\boldsymbol{x},\boldsymbol{\gamma},\boldsymbol{\beta})} \end{align*}` where `\(h(\boldsymbol{\gamma})=\boldsymbol{\pi_{\cdot}}\)` is the transformation from the cutpoints to the probabilities
of category membership when all covariates are 0 and `\(\mathcal{J_h}\)` is the Jacobian of the transformation ] --- # Software .font150[ We implement the model using `Stan` through the `R` front-end `rstan` ] -- <img src="fig/stan_logo.png" width="10%" style="display: block; margin: auto auto auto 0;" /> .font150[ `Stan` is a probabilistic programming language for Bayesian modeling which uses a variant of Markov Chain Monte Carlo (MCMC) to draw samples from the `\(\color{LimeGreen}{\text{posterior}}\)` distribution Code and examples are available at https://github.com/ntjames/bayes_cpm ] --- # Posterior conditional distributions .font150[ Using MCMC samples from the posterior distribution ( `\(\tilde{\boldsymbol{\gamma}}, \tilde{\boldsymbol{\beta}}\)` ), it is straightforward to calculate quantities of interest for a given set of covariates `\(\boldsymbol{x}\)` ] -- .font150[ - Posterior conditional CDF: `\(\tilde{F}(y_i|\boldsymbol{x})=G^{-1}(\tilde{\gamma}_{r(y_i)}-\boldsymbol{x}^{T}\tilde{\boldsymbol{\beta}})\)` ] -- .font150[ - Posterior conditional mean: `\(\tilde{E}[Y|\boldsymbol{x}]=\sum_{i=1}^{n}y_i\tilde{f}(y_i|\boldsymbol{x})\)` where `\(\tilde{f}(y_i|\boldsymbol{x})=\tilde{F}(y_i|\boldsymbol{x})-\tilde{F}(y_{i-1}|\boldsymbol{x})\)` and `\(y_1 < \cdots < y_n\)` are the ordered outcome values ] -- .font150[ - Posterior conditional quantile: To estimate the `\(q^{th}\)` quantile - Find `\(y_i=\inf\{y:\tilde{F}(y|\boldsymbol{x})\ge q\}\)` and the next smallest value `\(y_{i-1}\)` - Use linear interpolation to find quantile `\(y_q\)` where `\(y_{i-1}<y_q<y_i\)` ] --- # Posterior conditional distributions: .font70[Example] .font120[ Consider an example with continuous outcome `\(Y\)` and one continuous predictor `\(X \sim N(0,1)\)` `$$Y=X\beta +\varepsilon \quad \varepsilon \sim Logistic(0,1)$$` Simulate `\(n=100\)` observations from this model with `\(\beta=0.9\)` and use a Bayesian CPM with logistic link to get `\(5000\)` samples from the posterior distribution for `\(\gamma_j\,(j=1,\ldots, 99)\)` and `\(\beta\)`. 
The table shows values from the first `\(6\)` posterior samples ] <table class="table table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> \(\gamma_1\) </th> <th style="text-align:right;"> \(\gamma_2\) </th> <th style="text-align:right;"> \(\gamma_3\) </th> <th style="text-align:left;"> \(\cdots\) </th> <th style="text-align:right;"> \(\gamma_{98}\) </th> <th style="text-align:right;"> \(\gamma_{99}\) </th> <th style="text-align:right;"> \(\beta\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -5.439155 </td> <td style="text-align:right;"> -4.805212 </td> <td style="text-align:right;"> -4.229542 </td> <td style="text-align:left;"> ... </td> <td style="text-align:right;"> 5.121333 </td> <td style="text-align:right;"> 5.365933 </td> <td style="text-align:right;"> 1.0661293 </td> </tr> <tr> <td style="text-align:right;"> -5.346134 </td> <td style="text-align:right;"> -5.206944 </td> <td style="text-align:right;"> -4.166805 </td> <td style="text-align:left;"> ... </td> <td style="text-align:right;"> 3.560953 </td> <td style="text-align:right;"> 5.214068 </td> <td style="text-align:right;"> 0.4648851 </td> </tr> <tr> <td style="text-align:right;"> -5.584148 </td> <td style="text-align:right;"> -3.243365 </td> <td style="text-align:right;"> -3.086034 </td> <td style="text-align:left;"> ... </td> <td style="text-align:right;"> 5.243730 </td> <td style="text-align:right;"> 5.509438 </td> <td style="text-align:right;"> 0.8648104 </td> </tr> <tr> <td style="text-align:right;"> -6.658529 </td> <td style="text-align:right;"> -5.830662 </td> <td style="text-align:right;"> -3.941990 </td> <td style="text-align:left;"> ... 
</td> <td style="text-align:right;"> 4.892642 </td> <td style="text-align:right;"> 5.578405 </td> <td style="text-align:right;"> 0.8610707 </td> </tr> <tr> <td style="text-align:right;"> -4.032964 </td> <td style="text-align:right;"> -3.723712 </td> <td style="text-align:right;"> -3.604697 </td> <td style="text-align:left;"> ... </td> <td style="text-align:right;"> 4.197490 </td> <td style="text-align:right;"> 4.940910 </td> <td style="text-align:right;"> 0.6785981 </td> </tr> <tr> <td style="text-align:right;"> -4.174127 </td> <td style="text-align:right;"> -3.815911 </td> <td style="text-align:right;"> -3.704210 </td> <td style="text-align:left;"> ... </td> <td style="text-align:right;"> 3.618923 </td> <td style="text-align:right;"> 5.120170 </td> <td style="text-align:right;"> 0.8566873 </td> </tr> </tbody> </table> --- # Posterior conditional distributions: .font70[Example] .font120[ Estimated median conditional CDFs, 90% credible intervals, and true conditional CDFs for three `\(x\)` values ] <img src="fig/cond_cdf2.png" width="80%" style="display: block; margin: auto;" /> --- # Posterior conditional distributions: .font70[Example] .font110[ Conditional mean and quantile distributions are obtained from the conditional CDF. 
Kernel density plots for the conditional mean and median distributions are shown below ] <img src="fig/cond_mn2.png" width="48%" /><img src="fig/cond_md2.png" width="48%" /> -- .font110[ - No asymptotic approximations are needed to obtain credible intervals - Quantities can be directly interpreted using posterior probabilities - There is a `\(95\%\)` probability that the mean when `\(x=2\)` is between `\(0.67\)` and `\(2.28\)` - When `\(x=0\)`, the probability that the absolute value of the median exceeds `\(0.25\)` is `\(\approx 21\%\)` ] --- # Simulations: .font70[Setting] .font150[ To evaluate the long-run frequency properties of the Bayesian CPM we generate data from a log-normal model `$$Y=\exp(X_1\beta_1 + X_2 \beta_2 + \varepsilon) \quad \varepsilon \sim N(0,1)$$` where `\(\beta_1=1\)`, `\(\beta_2=-0.5\)`, `\(X_1 \sim Bernoulli(0.5)\)` and `\(X_2 \sim N(0,1)\)` `\(1000\)` datasets were simulated for sample sizes `\(n=25,50,100,200\)` and `\(400\)` A Bayesian CPM with the correctly specified probit link was fit to each dataset ] --- # Simulations: .font70[Results] .font120[ The average bias of the posterior median and standard error are shown for the two `\(\beta\)` regression parameters and five `\(\gamma\)` intercept parameters corresponding to `\(y_1=e^{-1}\)`, `\(y_2=e^{-0.33}\)`, `\(y_3=e^{0.5}\)`, `\(y_4=e^{1.33}\)`, `\(y_5= e^{2}\)` ] <img src="fig/sim_full.png" width="80%" style="display: block; margin: auto;" /> --- # Simulations: .font70[Results] .font120[ To simulate a mixed discrete/continuous outcome, identical datasets were produced but with values of `\(Y<0\)` set to `\(0\)` ] <img src="fig/sim_cens.png" width="80%" style="display: block; margin: auto;" /> --- # Case Study: .font70[Description] .font110[ Data were collected from 216 HIV-positive adults on antiretroviral therapy in two cohort studies The aim of the analysis is to estimate the association between body mass index (BMI) and several inflammation biomarkers since people living with HIV have higher risk of 
diabetes and cardiovascular disease ] -- .font110[ Both biomarkers are skewed and have values censored below a lower detection limit, which are recorded as 0 ] <img src="fig/il_6_hist.png" width="47%" /><img src="fig/il_1_beta_hist.png" width="47%" /> --- # Case Study: .font70[Analysis] <!-- Interleukin 6 ( IL-6 ) and Interleukin 1 `\(\beta\)` ( IL-1-$\beta$ ) --> .font140[ To deal with the skewness and censoring in the outcomes, we fit a Bayesian CPM with a probit link to estimate the association between BMI and the conditional mean, median, and 90th quantile of each biomarker In addition to BMI, the analysis adjusted for age, sex, race (nonwhite/white), smoking status (yes/no), study location, and CD4 cell count ] -- .font140[ A more traditional approach would require three separate models for each outcome, e.g. - censored regression with outcome transformation for the conditional mean - two quantile regression models for the conditional median and 90th quantile ] --- # Case Study: .font70[Results IL-6] Estimated transformation function based on the `\(\boldsymbol{\gamma}\)` and posterior estimates and intervals for the covariates for the IL-6 biomarker <img src="fig/il_6_trans.png" width="50%" /><img src="fig/il_6_post.png" width="50%" /> --- # Case Study: .font70[Results IL-6] Increasing BMI is associated with increased mean, median, and 90th quantile of IL-6 <img src="fig/il_6_bmi.png" width="80%" style="display: block; margin: auto;" /> --- # Case Study: .font70[Results IL-1-beta] Estimated transformation function based on the `\(\boldsymbol{\gamma}\)` and posterior estimates and intervals for the covariates for the IL-1- `\(\beta\)` biomarker <img src="fig/il_1_beta_trans.png" width="50%" /><img src="fig/il_1_beta_post.png" width="50%" /> --- # Case Study: .font70[Results IL-1-beta] There is no noticeable association between BMI and the mean, median, and 90th quantile of IL-1- `\(\beta\)` <img src="fig/il_1_beta_bmi.png" width="80%"
style="display: block; margin: auto;" /> --- # Discussion .font160[ - Bayesian CPMs are a versatile modeling approach with many advantages - Avoid specification of outcome transformation - Handle continuous and discrete ordered outcomes - Estimate full conditional CDF, conditional mean and quantiles with one model - Provide exact inference based on interpretable posterior probabilities - Incorporate prior information if available - Although they are more computationally intensive than NPMLE CPMs, preliminary simulations show run time increases linearly with sample size ] --- # Future Research: .font70[Bayesian Nonparametric prior] .font150[ - The prior for the current model requires knowledge of the number of distinct outcome values, which is determined *after* the data is observed - This violates the principle that the prior should be determined without looking at the data - We can avoid specifying the number of categories by using a _Dirichlet Process prior_, an infinite-dimensional generalization of the Dirichlet distribution ] --- # Future Research: .font70[Mixture Link] .font140[ - Another potential limitation in our model is the assumption of a specific link function `\(G(\cdot)\)`. One approach to relax this assumption is to provide a mixture link - Lang<sup>3</sup> defines a family of links that is a mixture of a complementary loglog link, a logistic link, and a loglog link - `\(G_{\lambda}(p)=m_1(\lambda)G_{c\log\log}(p)+m_2(\lambda)G_{\text{logistic}}(p)+m_3(\lambda)G_{\log\log}(p)\)` - where `\(m_1(\lambda)\)`, `\(m_2(\lambda)\)`, and `\(m_3(\lambda)\)` are functions defining the weight for each component link .font80[ .footnote[[3] Lang (1999). Bayesian ordinal and binary regression models with a parametric family of mixture links. *Computational Statistics & Data Analysis*, 31, 59-87.] ] ] --- # Acknowledgements .font160[ > Frank Harrell > Bryan Shepherd > Yuqi Tian > Leena Choi ] .font120[ Thank you to Dr. 
John Koethe for providing the HIV biomarker data. Slides are available at http://www.ntjames.com/seminar/2019dec04 ] --- # References .font90[ Albert and Chib (1993). Bayesian Analysis of Binary and Polychotomous Response Data. *Journal of the American Statistical Association*, 88(422), 669-679. Albert and Chib (1997). Bayesian Methods for Cumulative, Sequential, and Two-Step Ordinal Data Regression. Report. Department of Mathematics and Statistics, Bowling Green State University. Congdon (2005). *Bayesian Models for Categorical Data*. John Wiley & Sons: Chichester, West Sussex. Gelman et al. (2013). *Bayesian Data Analysis*. Chapman and Hall/CRC: New York. Harrell (2015). *Regression Modeling Strategies*. Springer: New York. Johnson and Albert (1999). *Ordinal Data Modeling*. Springer-Verlag: New York. Lang (1999). Bayesian ordinal and binary regression models with a parametric family of mixture links. *Computational Statistics & Data Analysis*, 31, 59-87. Liu et al. (2017). Modeling continuous response variables using ordinal regression. *Statistics in Medicine*, 36(27), 4316-4335. McKinley et al. (2015). Bayesian Model Choice in Cumulative Link Ordinal Regression Models. *Bayesian Analysis*, 10(1), 1-30. Tian et al. (In Press). An Empirical Comparison of Two Novel Transformation Models. *Statistics in Medicine*. ] --- # Alternate `\(\boldsymbol{\gamma}\)` parameterizations .font140[ Approach 1 - Sequentially truncated distributions (McKinley et al.) `$$p(\boldsymbol{\gamma})=p(\gamma_1)\prod_{j=2}^{J-1}p(\gamma_j|\gamma_{j-1})\\\gamma_j|\gamma_{j-1} \sim N(0,\sigma_{\gamma}^2)I(\gamma_{j-1},\infty)$$` for `\(j=2,\ldots,J-1\)` where `\(I(\gamma_{j-1},\infty)\)` indicates truncation to the region `\((\gamma_{j-1},\infty)\)` Approach 2 - Transformation to unconstrained space (Albert and Chib) `$$\delta_1=\log \gamma_1\;\; \delta_j=\log(\gamma_j-\gamma_{j-1}),\,2\le j \le J-1\\ \boldsymbol{\delta} \sim N_{J-1}(\boldsymbol{\mu_0},\boldsymbol{\Sigma_0})$$` ]
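---

# Appendix: .font70[Likelihood sketch (illustrative)]

The continuous-outcome likelihood can be evaluated directly from the rank formula. The sketch below is illustrative only, not the talk's `Stan`/`rstan` implementation: `cpm_loglik` is a name invented for this example, and the logistic link is assumed so `\(G^{-1}\)` is the standard logistic CDF.

```python
import numpy as np
from scipy.stats import logistic, rankdata

def cpm_loglik(gamma, beta, y, X, Ginv=logistic.cdf):
    """Log-likelihood for continuous y with no ties (J = n):
    sum_i log[ G^{-1}(gamma_{r(y_i)} - x_i'beta) - G^{-1}(gamma_{r(y_i)-1} - x_i'beta) ]
    where gamma has length n-1 and gamma_0 = -inf, gamma_n = +inf."""
    r = rankdata(y).astype(int)                       # ranks r(y_i) in 1..n
    g = np.concatenate(([-np.inf], gamma, [np.inf]))  # pad with gamma_0 and gamma_n
    lp = X @ beta                                     # linear predictor x_i'beta
    return np.sum(np.log(Ginv(g[r] - lp) - Ginv(g[r - 1] - lp)))
```

Each factor is the probability of the cell `\((\gamma_{r(y_i)-1}, \gamma_{r(y_i)}]\)`, so the `\(n\)` cell probabilities for a given observation pattern sum to one.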
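---

# Appendix: .font70[Cutpoint transformation sketch (illustrative)]

The transformation `\(h\)` between the cutpoints `\(\boldsymbol{\gamma}\)` and the baseline category probabilities `\(\boldsymbol{\pi_{\cdot}}\)` is easy to check numerically. A minimal sketch with the logit link; the function names `gamma_to_pi` and `pi_to_gamma` are invented for this illustration and are not part of the `bayes_cpm` code:

```python
import numpy as np
from scipy.special import expit, logit  # G^{-1} and G for the logit link

def gamma_to_pi(gamma):
    """pi_j = G^{-1}(gamma_j) - G^{-1}(gamma_{j-1}), using G^{-1}(-inf)=0, G^{-1}(inf)=1."""
    return np.diff(np.concatenate(([0.0], expit(gamma), [1.0])))

def pi_to_gamma(pi):
    """h: gamma_j = G(sum_{k<=j} pi_k) for j = 1,...,J-1."""
    return logit(np.cumsum(pi)[:-1])
```

Because the two maps are inverses, a Dirichlet prior on `\(\boldsymbol{\pi_{\cdot}}\)` induces (via the Jacobian `\(\mathcal{J_h}\)`) a prior on `\(\boldsymbol{\gamma}\)` that automatically respects the ordering constraint.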
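---

# Appendix: .font70[Posterior functionals sketch (illustrative)]

Given saved draws `\((\tilde{\boldsymbol{\gamma}}, \tilde{\boldsymbol{\beta}})\)`, the conditional CDF, mean, and median can be computed draw by draw. In this sketch the draws are random placeholders rather than fit output, a probit link is assumed, and the median uses the first `\(y_j\)` with `\(\tilde{F}(y_j|\boldsymbol{x}) \ge 0.5\)` rather than the linear-interpolation step described earlier:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# placeholder posterior draws: each row of gamma is an ordered cutpoint vector
n_draws, J = 4000, 5
gamma = np.sort(rng.normal(0.0, 2.0, size=(n_draws, J - 1)), axis=1)
beta = rng.normal(1.0, 0.3, size=n_draws)
y = np.array([-1.0, -0.3, 0.5, 1.3, 2.0])   # ordered outcome values y_1 < ... < y_J
x = 0.5                                      # covariate value of interest

# conditional CDF per draw: F(y_j|x) = G^{-1}(gamma_j - x*beta), probit link
F = np.hstack([norm.cdf(gamma - x * beta[:, None]),
               np.ones((n_draws, 1))])                        # F(y_J|x) = 1
f = np.diff(np.hstack([np.zeros((n_draws, 1)), F]), axis=1)   # cell probabilities

post_mean = f @ y                           # draw-wise conditional mean E[Y|x]
post_median = y[(F >= 0.5).argmax(axis=1)]  # first y_j with F(y_j|x) >= 0.5
```

A point estimate and credible interval then come straight from, e.g., `np.quantile(post_mean, [0.5, 0.05, 0.95])`, with no asymptotic approximation.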