Journal

Statistics in Medicine

Analyzing Coarsened and Missing Data by Imputation Methods

In various missing data problems, values are not entirely missing but coarsened: instead of the true value, a subset of values ‐ strictly smaller than the full sample space of the variable ‐ is observed, to which the true value belongs. In our motivating example for patients with endometrial carcinoma, the degree of lymphovascular space invasion (LVSI) can be absent, focally present, or substantially present. For a subset of individuals, however, LVSI is reported only as being present, which covers both non‐absent options. In the analysis of such a dataset, difficulties arise when coarsened observations are to be used in an imputation procedure. To our knowledge, no clear‐cut method has been described in the literature for handling an observed subset of values, and treating such observations as entirely missing can lead to biased estimates. In this paper, we therefore evaluated strategies for dealing with coarsened and missing data in multiple imputation. We tested a number of plausible ad hoc approaches, possibly already in use by statisticians. Additionally, we propose a principled approach: an adaptation of the SMC‐FCS algorithm (SMC‐FCS: coarsening compatible) that ensures imputed values adhere to the coarsening information. These methods were compared in a simulation study, which shows that methods that prevent imputation of incompatible values, like the SMC‐FCS method, consistently achieve lower bias and RMSE and better coverage than methods that ignore coarsening or handle it naïvely. The analysis of the motivating example shows that how the coarsening information is handled can matter substantially, leading to different conclusions across methods. Overall, our proposed SMC‐FCS method outperforms the other methods in handling coarsened data, adds limited computational cost, and is easily extendable to other scenarios.
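The core idea of the coarsening‐compatible approach can be sketched simply: when imputing a coarsened observation, the draw is restricted (and renormalised) to the subset of categories the coarsening allows, rather than the full sample space. The sketch below is illustrative only, not the authors' SMC‐FCS implementation; the conditional probabilities are hypothetical.

```python
import random

CATEGORIES = ["absent", "focal", "substantial"]  # LVSI levels in the example

def impute_coarsened(probs, compatible, rng):
    """Draw one imputation from the conditional distribution `probs`,
    restricted and renormalised to the `compatible` categories implied
    by the coarsened observation."""
    restricted = [(c, probs[c]) for c in compatible]
    total = sum(p for _, p in restricted)
    u = rng.random() * total
    acc = 0.0
    for c, p in restricted:
        acc += p
        if u < acc:
            return c
    return restricted[-1][0]

rng = random.Random(1)
# Coarsened record: LVSI reported only as "present" (focal or substantial)
probs = {"absent": 0.5, "focal": 0.3, "substantial": 0.2}
draws = [impute_coarsened(probs, ["focal", "substantial"], rng)
         for _ in range(1000)]
# Every imputed value respects the coarsening: "absent" is never drawn,
# and "focal" is drawn with renormalised probability 0.3/0.5 = 0.6.
```

Treating the same record as entirely missing would instead draw "absent" about half the time, which is the source of the bias the paper describes.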

Incorporating Additional Evidence as Prior Information to Resolve Non‐Identifiability in Bayesian Disease Model Calibration: A Tutorial

Disease models are used to examine the likely impact of therapies, interventions, and public policy changes. Ensuring that these models are well calibrated on the basis of available data, and that the uncertainty in their projections is properly quantified, is an important part of the process. Non‐identifiability poses a challenge to disease model calibration: multiple parameter sets generate identical model outputs. For statisticians evaluating the impact of policy interventions such as screening or vaccination, this is a critical issue. This study explores the Bayesian framework as a natural way to calibrate models and address non‐identifiability probabilistically in the context of disease modeling. We present Bayesian approaches for incorporating expert knowledge and external data to ensure that appropriately informative priors are specified on the joint parameter space. These approaches are applied to two common disease models: a basic susceptible‐infected‐susceptible (SIS) model and a much more complex agent‐based model that has previously been used to address public policy questions in HPV and cervical cancer. The conditions that allow the non‐identifiability to be resolved are demonstrated for the SIS model. For the larger HPV model, an overview of the findings is presented, with particular attention to how non‐identifiability impacts the calibration process. Through case studies, we demonstrate how informative priors can help resolve non‐identifiability and improve model inference. We also discuss how sensitivity analysis can be used to assess the impact of prior specifications on model results. Overall, this work provides a tutorial for researchers interested in applying Bayesian methods to calibrate disease models and handle non‐identifiability.
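For the basic SIS model, the non‐identifiability has a simple closed form: the equilibrium prevalence is 1 − γ/β, so any (β, γ) pair with the same ratio produces the same long‐run prevalence. The sketch below (hypothetical parameter values, plain Euler integration, not the tutorial's code) makes this concrete.

```python
def sis_prevalence(beta, gamma, i0=0.01, dt=0.01, steps=200_000):
    """Euler-integrate the SIS equation dI/dt = beta*I*(1-I) - gamma*I
    and return the long-run infected fraction (equilibrium prevalence)."""
    i = i0
    for _ in range(steps):
        i += dt * (beta * i * (1 - i) - gamma * i)
    return i

# Two parameter sets with the same ratio gamma/beta = 0.5 ...
p1 = sis_prevalence(beta=0.4, gamma=0.2)
p2 = sis_prevalence(beta=0.8, gamma=0.4)
# ... reach the same equilibrium prevalence 1 - gamma/beta = 0.5, so
# calibrating to prevalence alone cannot separate beta from gamma;
# additional data or informative priors on the joint space are needed.
```

Any calibration target that depends on the parameters only through γ/β (such as equilibrium prevalence here) leaves a whole ridge of equally good parameter sets, which is exactly the situation the tutorial's informative priors are designed to resolve.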

Sample‐weighted semiparametric estimation of cause‐specific cumulative risk and incidence using left‐ or interval‐censored data from electronic health records

Electronic health records (EHRs) can be a cost‐effective data source for forming cohorts and developing risk models in the context of disease screening. However, several issues need to be handled: competing outcomes, left‐censoring of prevalent disease, interval‐censoring of incident disease, and uncertainty about prevalent disease when accurate disease ascertainment is not conducted at baseline. Furthermore, novel tests that are costly and limited in availability can be conducted on stored biospecimens selected as samples from EHRs using different sampling fractions. We extend sample‐weighted semiparametric marginal mixture models to the estimation of competing risks. For flexible modeling of relative risks, a general transformation of the subdistribution hazard function and regression parameters is used. We propose a numerical algorithm for nonparametrically calculating maximum likelihood estimates of subdistribution hazard functions and regression parameters, and we present methods for calculating consistent confidence intervals for relative and absolute risk estimates. The proposed algorithm and methods show reliable finite‐sample performance in simulation studies. We apply our methods to a cohort assembled from EHRs at a health maintenance organization, where we estimate the cumulative risk of cervical precancer/cancer and the incidence of infection clearance by human papillomavirus (HPV) genotype among HPV‐positive women. There is no significant difference in 3‐year HPV‐clearance rates across HPV types, but the 3‐year cumulative risk of progression to precancer/cancer from HPV‐16 is higher than for the other HPV genotypes.

Hidden mover‐stayer model for disease progression accounting for misclassified and partially observed diagnostic tests: Application to the natural history of human papillomavirus and cervical precancer

Hidden Markov models (HMMs) have been proposed to model the natural history of diseases while accounting for misclassification in state identification. We introduce a discrete‐time HMM for human papillomavirus (HPV) and cervical precancer/cancer where the hidden and observed state spaces are defined by all possible combinations of HPV, cytology, and colposcopy results. Because the population of women undergoing cervical cancer screening is heterogeneous with respect to sexual behavior, and therefore risk of HPV acquisition and subsequent precancers, we use a mover‐stayer mixture model that assumes a proportion of the population will stay in the healthy state and are not subject to disease progression. As each state is a combination of three distinct tests that characterize the cervix, partially observed data arise when at least one but not every test is observed. The standard forward‐backward algorithm, used for evaluating the E‐step within the E‐M algorithm for maximum‐likelihood estimation of HMMs, cannot incorporate time points with partially observed data. We propose a new forward‐backward algorithm that considers all possible fully observed states that could have occurred across a participant's follow‐up visits. We apply our method to data from a large management trial for women with low‐grade cervical abnormalities. Our simulation study found that our method has relatively little bias and outperforms simpler methods, which resulted in larger bias.
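The key algorithmic modification can be sketched with a toy HMM: at a visit where only some tests are recorded, the emission term sums the emission probabilities of all fully observed outcomes compatible with the recorded results. This is an illustrative sketch with made‐up numbers, not the authors' model, which additionally includes the mover‐stayer mixture and misclassification structure.

```python
import numpy as np

# Toy HMM: 2 hidden states; observed space = 2 binary tests -> 4 outcomes,
# columns ordered (0,0), (0,1), (1,0), (1,1)
T = np.array([[0.9, 0.1],            # transition matrix
              [0.2, 0.8]])
pi = np.array([0.7, 0.3])            # initial hidden-state distribution
E = np.array([[0.60, 0.20, 0.15, 0.05],   # emission probabilities
              [0.05, 0.15, 0.20, 0.60]])  # rows = hidden state

def forward(visit_outcomes):
    """Forward pass where each visit is the index array of fully observed
    outcomes compatible with the recorded test results; a partially
    observed visit marginalises over its compatible outcomes."""
    alpha = pi * E[:, visit_outcomes[0]].sum(axis=1)
    for compatible in visit_outcomes[1:]:
        alpha = (alpha @ T) * E[:, compatible].sum(axis=1)
    return alpha.sum()  # likelihood of the observation sequence

fully_observed = [np.array([3]), np.array([0])]      # both tests recorded
partially_obs  = [np.array([3]), np.array([0, 1])]   # visit 2: test 2 missing
lik_full = forward(fully_observed)
lik_partial = forward(partially_obs)
```

At the partially observed visit, the forward recursion sums over the outcomes (0,0) and (0,1) that agree with the one recorded test, rather than dropping the visit or treating it as fully missing.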

Variant‐Specific Mendelian Risk Prediction Model

Many pathogenic sequence variants (PSVs) have been associated with increased risk of cancers. Mendelian risk prediction models use Mendelian laws of inheritance, as well as specified PSV frequency and penetrance (age‐specific probability of developing cancer given genotype), to predict the probability of having a PSV based on family history. Most existing models assume that the penetrance is the same for all PSVs in a given gene. However, for some genes (e.g., BRCA1/2), cancer risk has been shown to vary by PSV. We propose an extension of Mendelian risk prediction models that relaxes the assumption of homogeneous gene‐level risk by incorporating PSV‐specific penetrances, and we illustrate this extension on an existing Mendelian risk prediction model, Fam3PRO. The resulting Fam3PRO‐variant model incorporates variant‐specific BRCA1/2 PSVs through region classifications. Based on prior literature, we defined three cancer‐specific risk regions: the breast cancer clustering region (BCCR), the ovarian cancer clustering region (OCCR), and the "other" region. We conducted simulations to evaluate the performance of the proposed Fam3PRO‐variant model compared to the existing Fam3PRO model. Simulation results showed that the Fam3PRO‐variant model was well calibrated to predict region‐specific BRCA1/2 carrier status with high discrimination and accuracy. Importantly, our simulations also highlighted the impact of underreporting in family history data on model performance: while underreporting slightly reduced absolute calibration, the Fam3PRO‐variant model remained robust in discrimination and provided more accurate region‐specific PSV risk predictions than gene‐level models. We further evaluated Fam3PRO‐variant on two cohorts: 1897 families from the Cancer Genetics Network (CGN) and 25 671 families from the Clinical Cancer Genomics Community Research Network (CCGCRN). Results showed that our proposed model provides region‐specific PSV carrier probabilities with high accuracy, while the calibration, discrimination, and accuracy of gene‐specific PSV carrier probabilities were comparable to the existing gene‐specific model. Moreover, we assessed the clinical utility of Fam3PRO‐variant by evaluating positive predictive value (PPV), negative predictive value (NPV), sensitivity, and specificity at clinically relevant thresholds (2.5%, 5%, and 10%), as recommended by NCCN guidelines. Fam3PRO‐variant performed comparably to Fam3PRO at the gene level across all metrics, with notably high specificity and NPV at the region‐specific level. These results suggest that, even in the presence of underreporting, Mendelian risk prediction models can be effectively extended to incorporate variant‐specific penetrances, providing more precise region‐specific PSV carrier probabilities and improving cancer prevention and risk prediction.
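At its core, a Mendelian model turns PSV frequency and penetrance into a posterior carrier probability via Bayes' rule; a variant‐specific model simply swaps in a region‐specific penetrance. The single‐person sketch below uses hypothetical numbers (not the paper's penetrance estimates) and ignores family structure, which Fam3PRO handles through Mendelian inheritance.

```python
def carrier_prob(freq, pen_carrier, pen_noncarrier):
    """Posterior probability of carrying a PSV given that the person is
    affected: Bayes' rule with prior `freq` and age-specific penetrances."""
    num = freq * pen_carrier
    return num / (num + (1 - freq) * pen_noncarrier)

# Hypothetical inputs: PSV frequency, a gene-level average penetrance,
# and a higher region-specific (e.g., BCCR) penetrance for the same PSV.
freq = 0.001
p_gene   = carrier_prob(freq, pen_carrier=0.40, pen_noncarrier=0.05)
p_region = carrier_prob(freq, pen_carrier=0.60, pen_noncarrier=0.05)
# The region-specific penetrance yields a higher posterior carrier
# probability for the same observed phenotype.
```

Replacing one gene‐level penetrance with region‐specific penetrances is exactly the relaxation the Fam3PRO‐variant extension makes, propagated through the full pedigree rather than a single individual.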

SpiderLearner: An ensemble approach to Gaussian graphical model estimation

Gaussian graphical models (GGMs) are a popular form of network model in which nodes represent features in multivariate normal data and edges reflect conditional dependencies between these features. GGM estimation is an active area of research. Currently available tools for GGM estimation require investigators to make several choices regarding algorithms, scoring criteria, and tuning parameters. An estimated GGM may be highly sensitive to these choices, and the accuracy of each method can vary based on structural characteristics of the network such as topology, degree distribution, and density. Because these characteristics are a priori unknown, it is not straightforward to establish universal guidelines for choosing a GGM estimation method. We address this problem by introducing SpiderLearner, an ensemble method that constructs a consensus network from multiple estimated GGMs. Given a set of candidate methods, SpiderLearner estimates the optimal convex combination of results from each method using a likelihood‐based loss function. K‐fold cross‐validation is applied in this process, reducing the risk of overfitting. In simulations, SpiderLearner performs better than or comparably to the best candidate methods according to a variety of metrics, including relative Frobenius norm and out‐of‐sample likelihood. We apply SpiderLearner to publicly available ovarian cancer gene expression data including 2013 participants from 13 diverse studies, demonstrating our tool's potential to identify biomarkers of complex disease. SpiderLearner is implemented as flexible, extensible, open‐source code in the R package ensembleGGM at https://github.com/katehoffshutta/ensembleGGM.
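The ensemble step itself is simple: given candidate precision‐matrix estimates, choose the convex weights that minimise a held‐out negative Gaussian log‐likelihood. The two‐candidate sketch below uses a plain train/test split and a weight grid rather than the package's K‐fold scheme and optimiser; the candidates (a dense MLE and an independence estimate) are hypothetical stand‐ins for real GGM estimation methods.

```python
import numpy as np

rng = np.random.default_rng(0)

def neg_loglik(theta, S):
    """Negative Gaussian log-likelihood (up to constants) of precision
    matrix `theta` against sample covariance `S`: tr(S @ theta) - log|theta|."""
    _, logdet = np.linalg.slogdet(theta)
    return np.trace(S @ theta) - logdet

# Simulate data from a known 3-variable GGM
theta_true = np.array([[2.0, 0.6, 0.0],
                       [0.6, 2.0, 0.6],
                       [0.0, 0.6, 2.0]])
X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(theta_true), size=500)
train, test = X[:250], X[250:]
S_test = np.cov(test, rowvar=False)

# Two stand-in candidate estimates fitted on the training half
cand = [np.linalg.inv(np.cov(train, rowvar=False)),   # dense MLE
        np.diag(1 / np.var(train, axis=0))]           # independence model

# Grid search for the convex weight minimising the held-out loss
weights = np.linspace(0, 1, 101)
losses = [neg_loglik(w * cand[0] + (1 - w) * cand[1], S_test)
          for w in weights]
best_w = weights[int(np.argmin(losses))]
```

Because every convex combination of positive‐definite candidates is itself positive definite, the weighted consensus is always a valid precision matrix, and the held‐out loss guards against favouring an overfit candidate.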

Statistical approaches using longitudinal biomarkers for disease early detection: A comparison of methodologies

Early detection of clinical outcomes such as cancer may be predicted using longitudinal biomarker measurements. Tracking longitudinal biomarkers as a way to identify early disease onset may help to reduce mortality from diseases like ovarian cancer that are more treatable if detected early. Two disease risk prediction frameworks, the shared random effects model (SREM) and the pattern mixture model (PMM), can be used to assess longitudinal biomarkers for early disease detection. In this article, we studied the discrimination and calibration performance of SREM and PMM for early disease detection through an application to ovarian cancer, where early detection using the risk of ovarian cancer algorithm (ROCA) has been evaluated. These three approaches (SREM, PMM, and ROCA) were compared via analyses of the ovarian cancer data from the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial. Discrimination was evaluated by the time‐dependent receiver operating characteristic curve and its area, while calibration was assessed using calibration plots and the ratio of observed to expected numbers of diseased subjects. Out‐of‐sample performance was calculated via leave‐one‐out cross‐validation, aiming to minimize potential model overfitting. A careful analysis using the biomarker cancer antigen 125 for ovarian cancer early detection showed significantly improved discrimination performance of PMM compared with SREM and ROCA; nevertheless, all approaches were generally well calibrated. The robustness of all approaches was further investigated in extensive simulation studies. The improved performance of PMM relative to ROCA is in part due to the fact that the biomarker measurements were taken at yearly intervals, which is not frequent enough to reliably estimate the changepoint, or the slope after the changepoint, in cases under ROCA.

Comparing the sensitivities of two screening tests in nonblinded randomized paired screen‐positive trials with differential screening uptake

Before a new screening test can be used in routine screening, its performance needs to be compared to the standard screening test. This comparison is generally done in population screening trials with a screen‐positive design where participants undergo one or both screening tests, after which disease verification takes place for those positive on at least one screening test. We consider the randomized paired screen‐positive design of Alonzo and Kittelson, where participants are randomized to receive one of the two screening tests and only participants with a positive screening test subsequently receive the other screening test followed by disease verification. The tests are usually offered in an unblinded fashion, in which case the screening uptake may differ between arms, in particular when one test is more burdensome than the other. When uptake is associated with disease, the estimator for the relative sensitivity derived by Alonzo and Kittelson may be biased and the type I error of the associated statistical test is no longer guaranteed to be controlled. We present methods for comparing sensitivities of screening tests in randomized paired screen‐positive trials that are robust to differential screening uptake. In a simulation study, we show that our methods adequately control the type I error when screening uptake is associated with disease. We apply the developed methods to data from the IMPROVE trial, a nonblinded cervical cancer screening trial comparing the accuracy of HPV testing on self‐collected versus provider‐collected samples. In this trial, screening uptake was higher among participants randomized to self‐collection.

Publisher

Wiley

ISSN

0277-6715