Journal

Briefings in Bioinformatics

Papers (17)

XGSEA: CROSS-species gene set enrichment analysis via domain adaptation

Abstract Motivation: Gene set enrichment analysis (GSEA) has been widely used to identify gene sets with statistically significant differences between cases and controls against a large gene set. GSEA needs both phenotype labels and gene expression data. However, gene expression is assessed more often for model organisms than for minor species. Importantly, gene expression is also not measured well under specific conditions in humans, owing to the high risk of direct experiments such as non-approved treatments or gene knockouts, and is then often substituted by mouse data. Thus, predicting the enrichment significance (on a phenotype) of a given gene set of one species (the target, say human) by using gene expression measured under the same phenotype in another species (the source, say mouse) is a vital and challenging problem, which we call the CROSS-species gene set enrichment problem (XGSEP). Results: For XGSEP, we propose the CROSS-species gene set enrichment analysis (XGSEA), with three steps: (1) running GSEA for a source species to obtain enrichment scores and $p$-values of source gene sets; (2) representing the relation between source and target gene sets by domain adaptation; and (3) using regression to predict $p$-values of target gene sets, based on the representation in (2). We extensively validated XGSEA using five regression and one classification measurements on four real data sets under various settings, showing that XGSEA significantly outperformed three baseline methods in most cases. A case study identifying important human pathways for T-cell dysfunction and reprogramming from mouse ATAC-Seq data further confirmed the reliability of XGSEA. Availability: Source code of XGSEA is available through https://github.com/LiminLi-xjtu/XGSEA.
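
For readers unfamiliar with the enrichment scores mentioned in step (1), the following is a minimal sketch of an unweighted, Kolmogorov-Smirnov-style running-sum score. It is an illustration only and is not taken from the XGSEA source code; the function name `enrichment_score` and the toy gene labels are assumptions, and standard GSEA additionally weights hits by their ranking statistic.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (hypothetical helper).

    ranked_genes: genes ordered from most to least associated with the phenotype.
    gene_set: the genes whose enrichment is being scored.
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    members = set(gene_set)
    n_hit = sum(1 for g in ranked_genes if g in members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0  # score is undefined without both hits and misses
    running, best = 0.0, 0.0
    for g in ranked_genes:
        # step up on a member of the set, step down otherwise
        running += 1.0 / n_hit if g in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


ranked = ["a", "b", "c", "d", "e", "f"]
top_score = enrichment_score(ranked, {"a", "b"})      # set sits at the top of the ranking
bottom_score = enrichment_score(ranked, {"e", "f"})   # set sits at the bottom
```

A set concentrated at the top of the ranking yields a positive score and one concentrated at the bottom a negative score; significance is then usually assessed by permutation.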

MORE interpretable multi-omic regulatory networks to characterise phenotypes

Abstract Studying phenotype-specific regulatory mechanisms is crucial to understanding the molecular basis of diseases and other complex traits. However, existing approaches for constructing multi-omic regulatory networks (MO-RN) are scarce, and most cannot integrate diverse omics modalities, incorporate prior biological knowledge, or infer phenotype-specific networks. To address these challenges, we present MORE (Multi-Omics REgulation), a novel R package for inferring multi-modal regulatory networks. MORE is available at https://github.com/BiostatOmics/MORE and supports any number and type of omics layers while optionally incorporating prior regulatory knowledge. Leveraging advanced regression-based models and variable selection techniques, MORE identifies significant regulatory relationships. This tool also provides useful functionalities for the biological interpretation of MO-RN: network visualisations, differential regulatory networks, and functional enrichment analyses of key network features. We evaluated MORE on simulated multi-omic datasets and benchmarked it against state-of-the-art tools. Our tool consistently outperformed other methods regarding accuracy in identifying significant regulators, model goodness-of-fit, and computational efficiency. We further applied MORE to a multi-omic ovarian cancer dataset to uncover tumour subtype-specific regulatory mechanisms associated with distinct survival outcomes. This analysis revealed differential regulatory patterns that help explain the molecular basis of each subtype. By addressing the limitations of methods for multi-omic network inference, MORE represents a valuable resource for studying regulatory systems. Its ability to construct phenotype-specific regulatory networks with high accuracy and interpretability positions it as a useful resource for researchers seeking to unravel the complexities of molecular interactions and regulatory mechanisms across diverse biological contexts.

sTPLS: identifying common and specific correlated patterns under multiple biological conditions

Abstract The rapidly emerging large-scale data in diverse biological research fields present valuable opportunities to explore the underlying mechanisms of tissue development and disease progression. However, few existing methods can simultaneously capture common and condition-specific association between different types of features across different biological conditions, such as cancer types or cell populations. Therefore, we developed the sparse tensor-based partial least squares (sTPLS) method, which integrates multiple pairs of datasets containing two types of features but derived from different biological conditions. We demonstrated the effectiveness and versatility of sTPLS through simulation study and three biological applications. By integrating the pairwise pharmacogenomic data, sTPLS identified 11 gene-drug comodules with high biological functional relevance specific for seven cancer types and two comodules that shared across multi-type cancers, such as breast, ovarian, and colorectal cancers. When applied to single-cell data, it uncovered nine gene-peak comodules representing transcriptional regulatory relationships specific for five cell types and three comodules shared across similar cell types, such as intermediate and naïve B cells. Furthermore, sTPLS can be directly applied to tensor-structured data, successfully revealing shared and distinct cell communication patterns mediated by the MK signaling pathway in coronavirus disease 2019 patients and healthy controls. These results highlight the effectiveness of sTPLS in identifying biologically meaningful relationships across diverse conditions, making it useful for multi-omics integrative analysis.

scPAS: single-cell phenotype-associated subpopulation identifier

Abstract Despite significant advancements in single-cell sequencing analysis for characterizing tissue sample heterogeneity, identifying the associations between cell subpopulations and disease phenotypes remains a challenging task. Here, we introduce scPAS, a new bioinformatics tool designed to integrate bulk data to identify phenotype-associated cell subpopulations within single-cell data. scPAS employs a network-regularized sparse regression model to quantify the association between each cell in single-cell data and a phenotype. Additionally, it estimates the significance of these associations through a permutation test, thereby identifying phenotype-associated cell subpopulations. Utilizing simulated data and various single-cell datasets from breast carcinoma, ovarian cancer, and atherosclerosis, as well as spatial transcriptomics data from multiple cancers, we demonstrated the accuracy, flexibility, and broad applicability of scPAS. Evaluations on large datasets revealed that scPAS exhibits superior operational efficiency compared to other methods. The open-source scPAS R package is available at GitHub website: https://github.com/aiminXie/scPAS.
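
The idea of estimating significance through a permutation test can be illustrated with a generic sketch. This is not the scPAS procedure itself; the function name `permutation_p_value`, the mean-difference statistic, and the toy data are assumptions chosen for brevity.

```python
import random


def permutation_p_value(scores, labels, n_perm=1000, seed=0):
    """One-sided permutation p-value for the difference in mean score
    between label 1 and label 0 (hypothetical helper, generic illustration)."""

    def mean_diff(lab):
        pos = [s for s, l in zip(scores, lab) if l == 1]
        neg = [s for s, l in zip(scores, lab) if l == 0]
        return sum(pos) / len(pos) - sum(neg) / len(neg)

    rng = random.Random(seed)      # fixed seed so the sketch is reproducible
    observed = mean_diff(labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)      # break the score-label association
        if mean_diff(shuffled) >= observed:
            exceed += 1
    # add-one correction keeps the estimate strictly positive
    return (exceed + 1) / (n_perm + 1)


scores = [5.1, 4.9, 5.3, 5.0, 1.0, 1.2, 0.9, 1.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
p = permutation_p_value(scores, labels)
```

The add-one correction is a common convention that avoids reporting a p-value of exactly zero from a finite number of permutations.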

Ovarian cancer is detectable from peripheral blood using machine learning over T-cell receptor repertoires

Abstract The extraordinary diversity of T cells and B cells is critical for body maintenance. This diversity has an important role in protecting against tumor formation. In humans, the T-cell receptor (TCR) repertoire is generated through a striking stochastic process called V(D)J recombination, in which different gene segments are assembled and modified, leading to extensive variety. In ovarian cancer (OC), an unfortunate 80% of cases are detected late, leading to poor survival outcomes. However, when detected early, approximately 94% of patients live longer than 5 years after diagnosis. Thus, early detection is critical for patient survival. To determine whether the TCR repertoire obtained from peripheral blood is associated with tumor status, we collected blood samples from 85 women with or without OC and obtained TCR information. We then used machine learning to learn the characteristics of samples and to finally predict, over a set of unseen samples, whether the person is with or without OC. We successfully stratified the two groups, thereby associating the peripheral blood TCR repertoire with the formation of OC tumors. A careful study of the origin of the set of T cells most informative for the signature indicated the involvement of a specific invariant natural killer T (iNKT) clone and a specific mucosal-associated invariant T (MAIT) clone. Our findings here support the proposition that tumor-relevant signal is maintained by the immune system and is coded in the T-cell repertoire available in peripheral blood. It is also possible that the immune system detects tumors early enough for repertoire technologies to inform us near the beginning of tumor formation. Although such detection is made by the immune system, we might be able to identify it, using repertoire data from peripheral blood, to offer a pragmatic way to search for early signs of cancer with minimal patient burden, possibly with enhanced sensitivity.

WDNE: an integrative graphical model for inferring differential networks from multi-platform gene expression data with missing values

Abstract The mechanisms controlling biological processes, such as the development of disease or cell differentiation, can be investigated by examining changes in the networks of gene dependencies between states in the process. High-throughput experimental methods, like microarray and RNA sequencing, have been widely used to gather gene expression data, which paves the way to infer gene dependencies based on computational methods. However, most differential network analysis methods are designed to deal with fully observed data, whereas missing values, such as the dropout events in single-cell RNA-sequencing data, are frequent. New methods are needed to take account of these missing values. Moreover, since the changes of gene dependencies may be driven by certain perturbed genes, considering the changes in gene expression levels may promote the identification of gene network rewiring. In this study, a novel weighted differential network estimation (WDNE) model is proposed to handle multi-platform gene expression data with missing values and take account of changes in gene expression levels. Simulation studies demonstrate that WDNE outperforms state-of-the-art differential network estimation methods. When WDNE is applied to infer differential gene networks associated with drug resistance in ovarian tumors, cell differentiation and breast tumor heterogeneity, the hub genes in the estimated differential gene networks can provide important insights into the underlying mechanisms. Furthermore, a Matlab toolbox, differential network analysis toolbox, was developed to implement the WDNE model and visualize the estimated differential networks.

Integration and interplay of machine learning and bioinformatics approach to identify genetic interaction related to ovarian cancer chemoresistance

Abstract Although chemotherapy is the first-line treatment for ovarian cancer (OCa) patients, chemoresistance (CR) decreases their progression-free survival. This paper investigates the genetic interaction (GI) related to OCa-CR. To decrease the complexity of establishing gene networks, individual signature genes related to OCa-CR are identified using a gradient boosting decision tree algorithm. Additionally, the genetic interaction coefficient (GIC) is proposed to measure the correlation of two signature genes quantitatively and explain their joint influence on OCa-CR. A gene pair with a high GIC is identified as a signature pair. A total of 24 signature gene pairs, comprising 10 individual signature genes, are selected, and the influence of signature gene pairs on OCa-CR is explored. Finally, a signature gene pair-based prediction of OCa-CR is identified. The area under curve (AUC) is a widely used performance measure for machine learning prediction. The AUC of signature gene pair-based prediction reaches 0.9658, whereas the AUC of individual signature gene-based prediction is only 0.6823. The identified signature gene pairs not only build an efficient GI network of OCa-CR but also provide an interesting way for OCa-CR prediction. This improvement shows that our proposed method is a useful tool to investigate GI related to OCa-CR.
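
As a reminder of what the AUC measures, the sketch below computes it as the fraction of positive-negative pairs in which the positive sample receives the higher score (ties count one half). The function name `auc` and the toy scores are assumptions for illustration and are unrelated to the paper's data or code.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (hypothetical helper).

    Equivalent to the probability that a randomly chosen positive sample
    is scored above a randomly chosen negative sample.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties contribute half a win
    return wins / (len(pos) * len(neg))


perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])   # every positive outranks every negative
partial = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])   # one of four pairs is misordered
```

The quadratic pairwise loop is chosen for clarity; a rank-based formulation gives the same value more efficiently on large datasets.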

Mechanistically derived patient-level framework for precision medicine identifies a personalized immune prognostic signature in high-grade serous ovarian cancer

Abstract An accurate prognosis assessment for cancer patients could aid in guiding clinical decision-making. Reliance on traditional clinical features alone in a complex clinical environment is challenging and unsatisfactory in the era of precision medicine; thus, reliable prognostic biomarkers are urgently required to improve a patient staging system. In this study, we proposed a patient-level computational framework from mechanistic and translational perspectives to establish a personalized prognostic signature (named PLPPS) in high-grade serous ovarian carcinoma (HGSOC). The PLPPS composed of 68 immune genes achieved accurate prognostic risk stratification for 1190 patients in the meta-training cohort and was rigorously validated in multiple cross-platform independent cohorts comprising 792 HGSOC patients. Furthermore, the PLPPS was shown to be the better prognostic factor compared with clinical parameters in the univariate analysis and retained a significant independent association with prognosis after adjusting for clinical parameters in the multivariate analysis. In benchmark comparisons, the performance of PLPPS (hazard ratio (HR), 1.371; concordance index (C-index), 0.604 and area under the curve (AUC), 0.637) is comparable to or better than other published gene signatures (HR, 0.972 to 1.340; C-index, 0.495 to 0.592 and AUC, 0.48–0.624). With further validation in prospective clinical trials, we hope that the PLPPS might become a promising genomic tool to guide personalized management and decision-making of HGSOC in clinical practice.
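
The concordance index (C-index) reported above can be illustrated with a minimal sketch of Harrell's estimator for right-censored survival data. The function name `concordance_index` and the toy inputs are assumptions; this is not the evaluation code used for the PLPPS.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index (hypothetical helper).

    times:  observed follow-up times.
    events: 1 if the event was observed, 0 if the sample was censored.
    risks:  predicted risk scores (higher means worse expected outcome).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # a pair is comparable when i has an observed event strictly before j's time
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied predictions count one half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


c = concordance_index(times=[1, 2, 3, 4], events=[1, 1, 1, 1], risks=[4, 3, 2, 1])
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of risk against survival time.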

Survey and comparative assessments of computational multi-omics integrative methods with multiple regulatory networks identifying distinct tumor compositions across pan-cancer data sets

Abstract The significance of pan-cancer categories has recently been recognized as widespread in cancer research. Pan-cancer categorizes a cancer based on its molecular pathology rather than an organ. The molecular similarities among multi-omics data found in different cancer types can play several roles in both biological processes and therapeutic developments. Therefore, an integrated analysis for various genomic data is frequently used to reveal novel genetic and molecular mechanisms. However, a variety of algorithms for multi-omics clustering have been proposed in different fields. The comparison of different computational clustering methods in pan-cancer analysis performance remains unclear. To increase the utilization of current integrative methods in pan-cancer analysis, we first provide an overview of five popular computational integrative tools: similarity network fusion, integrative clustering of multiple genomic data types (iCluster), cancer integration via multi-kernel learning (CIMLR), perturbation clustering for data integration and disease subtyping (PINS) and low-rank clustering (LRACluster). Then, a priori interactions in multi-omics data were incorporated to detect prominent molecular patterns in pan-cancer data sets. Finally, we present comparative assessments of these methods, with discussion over key issues in applying these algorithms. We found that all five methods can identify distinct tumor compositions. The pan-cancer samples can be reclassified into several groups by different proportions. Interestingly, each method can classify the tumors into categories that are different from original cancer types or subtypes, especially for ovarian serous cystadenocarcinoma (OV) and breast invasive carcinoma (BRCA) tumors. In addition, all clusters of the five computational methods show notable prognostic values. Furthermore, both the 9 recurrent differential genes and the 15 common pathway characteristics were identified across all the methods. 
The results and discussion can help the community select appropriate integrative tools according to different research tasks or aims in pan-cancer analysis.

DeepHPV: a deep learning model to predict human papillomavirus integration sites

Abstract Human papillomavirus (HPV) integrating into human genome is the main cause of cervical carcinogenesis. HPV integration selection preference shows strong dependence on local genomic environment. Due to this theory, it is possible to predict HPV integration sites. However, a published bioinformatic tool is not available to date. Thus, we developed an attention-based deep learning model DeepHPV to predict HPV integration sites by learning environment features automatically. In total, 3608 known HPV integration sites were applied to train the model, and 584 reviewed HPV integration sites were used as the testing dataset. DeepHPV showed an area under the receiver-operating characteristic (AUROC) of 0.6336 and an area under the precision recall (AUPR) of 0.5670. Adding RepeatMasker and TCGA Pan Cancer peaks improved the model performance to 0.8464 and 0.8501 in AUROC and 0.7985 and 0.8106 in AUPR, respectively. Next, we tested these trained models on independent database VISDB and found the model adding TCGA Pan Cancer performed better (AUROC: 0.7175, AUPR: 0.6284) than the model adding RepeatMasker peaks (AUROC: 0.6102, AUPR: 0.5577). Moreover, we introduced attention mechanism in DeepHPV and enriched the transcription factor binding sites including BHLHA15, CHR, COUP-TFII, DMRTA2, E2A, HIC1, INR, NPAS, Nr5a2, RARa, SCL, Snail1, Sox10, Sox3, Sox4, Sox6, STAT6, Tbet, Tbx5, TEAD, Tgif2, ZNF189, ZNF416 near attention intensive sites. Together, DeepHPV is a robust and explainable deep learning model, providing new insights into HPV integration preference and mechanism. Availability: DeepHPV is available as an open-source software and can be downloaded from https://github.com/JiuxingLiang/DeepHPV.git, Contact: huzheng1998@163.com, liangjiuxing@m.scnu.edu.cn, lizheyzy@163.com

BiChemoCLAM: a weakly supervised multimodal framework for chemotherapy response prediction

Abstract Chemotherapy is an important treatment for cancer patients, but it comes with risks. Therefore, effective chemotherapy response prediction is crucial. While whole slide image provides high-resolution insights into tumour environments, existing weakly supervised learning frameworks struggle to effectively integrate molecular data, such as gene expression, limiting their predictive power in complex chemotherapy response and small-sample scenarios. We present a bimodal chemotherapy response multi-instance learning framework, BiChemoCLAM, a novel multimodal deep learning framework that combines attention-driven multiple instance learning with multimodal compact bilinear pooling for interpretable and data-efficient chemotherapy response prediction. It achieves an Area Under Curve (AUC) of 80.91%, 71.68%, and 75.80% on ovarian serous cystadenocarcinoma, colorectal adenocarcinoma, and bladder urothelial carcinoma cancer datasets, respectively. The experimental results show that BiChemoCLAM is an effective model for predicting response to chemotherapy.

Gene-based mediation analysis in epigenetic studies

Abstract Mediation analysis has been a useful tool for investigating the effect of mediators that lie in the path from the independent variable to the outcome. With the increasing dimensionality of mediators, such as in (epi)genomics studies, high-dimensional mediation models are needed. In this work, we focus on epigenetic studies with the goal of identifying important DNA methylations that act as mediators between an exposure and a disease outcome. Specifically, we focus on gene-based high-dimensional mediation analysis implemented with kernel principal component analysis to capture potential nonlinear mediation effects. We first review the current high-dimensional mediation models and then propose two gene-based analytical approaches: gene-based high-dimensional mediation analysis based on a linearity assumption between mediators and outcome (gHMA-L) and gene-based high-dimensional mediation analysis based on a nonlinearity assumption (gHMA-NL). Since the underlying true mediation relationship is unknown in practice, we further propose an omnibus test of gene-based high-dimensional mediation analysis (gHMA-O) by combining gHMA-L and gHMA-NL. Extensive simulation studies show that gHMA-L performs better under the linear model assumption and gHMA-NL does better under the nonlinear model assumption, while gHMA-O is a more powerful and robust method by combining the two. We apply the proposed methods to two datasets to investigate genes whose methylation levels act as important mediators in the relationship: (1) between alcohol consumption and epithelial ovarian cancer risk using data from the Mayo Clinic Ovarian Cancer Case-Control Study and (2) between childhood maltreatment and comorbid post-traumatic stress disorder and depression in adulthood using data from the Gray Trauma Project.

Graph-based deep learning for integrating single-cell and bulk transcriptomic data to identify clinical cancer subtypes

Abstract The integration of single-cell RNA sequencing (scRNA-seq) and bulk transcriptomic data has become essential for deciphering the complex heterogeneity of cancer and identifying clinical cancer subtypes. However, the inherent challenges posed by the high dimensionality, sparsity, and noise characteristics of scRNA-seq data have significantly hindered its widespread clinical translation. To address these limitations, we introduce single-cell and bulk transcriptomic graph deep learning (scBGDL), a graph-based deep learning method that synergistically integrates scRNA-seq and bulk transcriptomic data to precisely identify cancer subtypes and predict clinical outcomes. scBGDL constructs sample-specific gene graphs modeling complex gene–gene interactions and cellular relationships. The architecture employs Graph Attention Networks for feature aggregation, MinCutPool layers for dimensionality reduction, and Transformer modules to capture high-order biological dependencies. Independently validated in each of 16 distinct The Cancer Genome Atlas cancer types, scBGDL significantly outperformed existing methods in prognostic accuracy (mean C-index: 0.7060 versus 0.6709 for the best competitor), demonstrating robustness and generalizability to diverse transcriptional architectures. To demonstrate clinical versatility, we further evaluated scBGDL in three therapeutic contexts using multicenter cohorts: lung adenocarcinoma survival prediction (n = 1099), epithelial ovarian cancer platinum-based chemotherapy response (n = 762), and skin cutaneous melanoma immunotherapy outcome (n = 305). scBGDL consistently delivered robust risk stratification (log-rank P < 0.05 across cohorts), identified key driver edges, and uncovered clinically relevant biological interpretations. By enabling multimodal data integration and interpretable biological insights, scBGDL advances precision oncology for prognosis prediction, therapy optimization, and biomarker discovery. The source code for the scBGDL model is available online (https://github.com/NEFLab/scBGDL).

Scanning window analysis of non-coding regions within normal-tumor whole-genome sequence samples

Abstract Genomics has benefited from an explosion in affordable high-throughput technology for whole-genome sequencing. The regulatory and functional aspects in non-coding regions may be an important contributor to oncogenesis. Whole-genome tumor-normal paired alignments were used to examine the non-coding regions in five cancer types and two races. Both a sliding window and a binning strategy were introduced to uncover areas of higher than expected variation for additional study. We show that the majority of cancer associated mutations in 154 whole-genome sequences covering breast invasive carcinoma, colon adenocarcinoma, kidney renal papillary cell carcinoma, lung adenocarcinoma and uterine corpus endometrial carcinoma cancers and two races are found outside of the coding region (4 432 885 in non-gene regions versus 1 412 731 in gene regions). A pan-cancer analysis found significantly mutated windows (292 to 3881 in count) demonstrating that there are significant numbers of large mutated regions in the non-coding genome. The 59 significantly mutated windows were found in all studied races and cancers. These offer 16 regions ripe for additional study within 12 different chromosomes—2, 4, 5, 7, 10, 11, 16, 18, 20, 21 and X. Many of these regions were found in centromeric locations. The X chromosome had the largest set of universal windows that cluster almost exclusively in Xq11.1—an area linked to chromosomal instability and oncogenesis. Large consecutive clusters (super windows) were found (19 to 114 in count) providing further evidence that large mutated regions in the genome are influencing cancer development. We show remarkable similarity in highly mutated non-coding regions across both cancer and race.
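
The sliding window strategy described above can be sketched as a simple count of mutation positions per window. This is a generic illustration under assumed parameters, not the authors' pipeline: the function name `sliding_window_counts`, the window and step sizes, and the toy coordinates are all hypothetical, and the paper's test for higher than expected variation is not reproduced here.

```python
from bisect import bisect_left


def sliding_window_counts(positions, region_length, window, step):
    """Count mutations in each half-open window [start, start + window).

    positions:     0-based mutation coordinates within the region.
    region_length: total length of the region being scanned.
    Returns a list of (start, end, count) tuples (hypothetical helper).
    """
    ordered = sorted(positions)
    counts = []
    last_start = max(region_length - window, 0)
    for start in range(0, last_start + 1, step):
        end = start + window
        lo = bisect_left(ordered, start)   # first position >= start
        hi = bisect_left(ordered, end)     # first position >= end
        counts.append((start, end, hi - lo))
    return counts


windows = sliding_window_counts([1, 2, 3, 50, 51, 99], region_length=100, window=50, step=25)
```

Setting `step` equal to `window` turns the same function into a non-overlapping binning strategy; windows whose counts exceed a chosen expectation could then be flagged for further study.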

Learning directed acyclic graphs for ligands and receptors based on spatially resolved transcriptomic data of ovarian cancer

Abstract To unravel the mechanism of immune activation and suppression within tumors, a critical step is to identify transcriptional signals governing cell–cell communication between tumor and immune/stromal cells in the tumor microenvironment. Central to this communication are interactions between secreted ligands and cell-surface receptors, creating a highly connected signaling network among cells. Recent advancements in in situ-omics profiling, particularly spatial transcriptomic (ST) technology, provide unique opportunities to directly characterize ligand–receptor signaling networks that power cell–cell communication. In this paper, we propose a novel statistical method, LRnetST, to characterize the ligand–receptor interaction networks between adjacent tumor and immune/stroma cells based on ST data. LRnetST utilizes a directed acyclic graph model with a novel approach to handle the zero-inflated distributions of ST data. It also leverages existing ligand–receptor regulation databases as prior information, and employs a bootstrap aggregation strategy to achieve robust network estimation. Application of LRnetST to ST data of high-grade serous ovarian tumor samples revealed both common and distinct ligand–receptor regulations across different tumors. Some of these interactions were validated through both a MERFISH dataset and a CosMx SMI dataset of independent ovarian tumor samples. These results cast light on biological processes relating to the communication between tumor and immune/stromal cells in ovarian tumors. An open-source R package of LRnetST is available on GitHub at https://github.com/jie108/LRnetST.

Understanding the unimodal distributions of cancer occurrence rates: it takes two factors for a cancer to occur

Abstract Data from the SEER reports reveal that the occurrence rate of a cancer type generally follows a unimodal distribution over age, peaking at an age that is cancer-type specific and ranges from 30+ through 70+. Previous studies attribute such bell-shaped distributions to the reduced proliferative potential in senior years but fail to explain why some cancers have their occurrence peak at 30+ or 40+. We present a computational model to offer a new explanation to such distributions. The model uses two factors to explain the observed age-dependent cancer occurrence rates: cancer risk of an organ and the availability level of the growth signals in circulation needed by a cancer type, with the former increasing and the latter decreasing with age. Regression analyses were conducted of known occurrence rates against such factors for triple negative breast cancer, testicular cancer and cervical cancer; and all achieved highly tight fitting results, which were also consistent with clinical, gene-expression and cancer-drug data. These reveal a fundamentally important relationship: while cancer is driven by endogenous stressors, it requires sufficient levels of exogenous growth signals to happen, hence suggesting the realistic possibility for treating cancer via cleaning out the growth signals in circulation needed by a cancer.

Identification of key candidate genes for ovarian cancer using integrated statistical and machine learning approaches

Abstract Ovarian cancer (OC) is a highly lethal malignancy worldwide, necessitating the identification of key genes to uncover its molecular mechanisms and improve diagnostic and therapeutic strategies. This study utilized statistical and machine learning approaches to identify key candidate genes for OC. Three microarray datasets were obtained from the gene expression omnibus database, and analysis began with normalization and differential gene expression analysis using the Limma package. Highly discriminative differentially expressed genes (HDDEGs) were identified through a support vector machine-based approach, yielding 84 overlapping HDDEGs across the datasets. Enrichment analysis of HDDEGs was conducted using DAVID. A protein–protein interaction network constructed via STRING pinpointed central hub genes using CytoHubba metrics. Significant modules were analyzed with molecular complex detection, identifying 18 central hub genes, 11 hub module genes, and 54 meta-hub genes. The intersection of these three gene sets revealed eight shared key genes (FANCD2, BUB1B, BUB1, KIF4A, DTL, NCAPG, KIF20A, and UBE2C). Weighted gene co-expression network analysis identified key modules linked to clinical traits and confirmed grouping eight key candidate genes into a single cluster. These genes were validated using two independent datasets (GSE38666 and TCGA-OC), with area under the curve and survival analyses underscoring their predictive and prognostic significance in OC. This integrative approach advances understanding of OC’s molecular basis, identifies potential biomarkers, and emphasizes the clinical relevance of the eight key candidate genes for OC diagnosis, prognosis, and treatment.

Publisher

Oxford University Press (OUP)

ISSN

1467-5463