Journal

Scientific Data

Papers (14)

The compositional behavior of the human T cell receptor repertoire in ovarian cancer compared to healthy donors

The distinctive characteristics of an individual's T cell receptor repertoire are crucial in recognizing and responding to a diverse array of antigens, contributing to immune specificity and adaptability. The repertoire, famously vast due to a series of cellular mechanisms, can be quantified using repertoire sequencing. In this study, we sampled the repertoire of 85 women: ovarian cancer patients (OC) and healthy donors (HD), generating a dataset of T cell clones and their abundance. For the alpha chain we obtained 6.4·10

A Multi-Center, Multi-Parametric MRI Dataset of Primary and Secondary Brain Tumors

Brain metastases (BMs) and high-grade gliomas (HGGs) are the most common and aggressive types of malignant brain tumors in adults, with often poor prognosis and short survival. As their clinical symptoms and image appearances on conventional magnetic resonance imaging (MRI) can be astonishingly similar, their accurate differentiation based solely on clinical and radiological information can be very challenging, particularly for “cancer of unknown primary”, where no systemic malignancy is known or found. Non-invasive multiparametric MRI and radiomics offer the potential to identify these distinct biological properties, aiding in the characterization and differentiation of HGGs and BMs. However, there is a scarcity of publicly available multi-origin brain tumor imaging data for tumor characterization. In this paper, we introduce a multi-center, multi-origin brain tumor MRI (MOTUM) imaging dataset obtained from 67 patients: 29 with high-grade gliomas, 20 with lung metastases, 10 with breast metastases, 2 with gastric metastasis, 4 with ovarian metastasis, and 2 with melanoma metastasis. This dataset includes anonymized DICOM files alongside processed FLAIR, T1-weighted, contrast-enhanced T1-weighted, and T2-weighted sequence images, segmentation masks of two tumor regions, and clinical data. Our data-sharing initiative is to support the benchmarking of automated tumor segmentation, multi-modal machine learning, and disease differentiation of multi-origin brain tumors in a multi-center setting.

Homologous Recombination Deficiency Unrelated to Platinum and PARP Inhibitor Response in Cell Line Libraries

While large publicly available cancer cell line databases are invaluable for preclinical drug discovery and biomarker development, the association between homologous recombination deficiency (HRD) and drug sensitivity in these resources remains unclear. In this study, we comprehensively analyzed molecular profiles and drug screening data from the Cancer Cell Line Encyclopedia. Unexpectedly, gene alterations in BRCA1/2 or homologous recombination-related genes, HRD scores, or mutational signature 3 were not positively correlated with sensitivity to platinum agents or PARP inhibitors. Rather, higher HRD scores and mutational signature 3 were significantly associated with resistance to these agents in multiple assays. These findings were consistent when analyzing exclusively breast and ovarian cancer cell lines and when using data from the COSMIC Cell Line Project. Collectively, the existing data from established cancer cell lines do not reflect the expected association between HRD status and drug response to platinum agents and PARP inhibitors in clinical tumors. This discrepancy may extend to other tumor characteristics, highlighting the importance of recognizing potential limitations in cell line data for researchers.

RIVA: An Image Dataset of Conventional Pap Smear Cytology with Multiple Independent Annotations

The Pap smear remains the primary screening test for cervical cancer in many low-resource regions, yet publicly available image datasets largely feature liquid-based preparations. We introduce RIVA, a high-resolution collection of 959 conventional-smear images (1024 × 1024 px) scanned at 40x magnification, sourced from 115 patients. To ensure label quality, each image was annotated by up to four independent medical professionals, with 42% of the images reviewed by all four, resulting in 26,158 annotations based on the Bethesda classification. Annotations provide coordinates of nuclei and classification labels by up to four annotators. The dataset includes 15,949 unique cells across five (pre)cancerous types (SCC, HSIL, ASCH, LSIL, ASCUS) and three non-lesion categories (NILM, ENDO, INFL). These four-expert annotations not only give RIVA a consensus-driven ground truth for robust AI training but also enable inter-annotator consistency analysis: agreement rates reach 94% for lesion vs. non-lesion and 74% across the full eight-category Bethesda scheme.

High-throughput drug screening identifies novel therapeutics for Low Grade Serous Ovarian Carcinoma

Low grade serous carcinoma (LGSOC) is a rare epithelial ovarian cancer with unique molecular characteristics compared to the more common tubo-ovarian high-grade serous ovarian carcinoma. Pivotal clinical trials guiding the management of epithelial ovarian cancer lack sufficient cases of LGSOC for meaningful subgroup analysis, hence overall findings cannot be extrapolated to rarer chemo-resistant subtypes such as LGSOC. Furthermore, there is a need for more effective therapies for the treatment of relapsed disease, as treatment options are limited. To address this, we conducted the largest quantitative high-throughput drug screening effort (n = 3436 compounds) in 12 patient-derived LGSOC cell lines and one normal ovary cell line to identify unexplored therapeutic avenues. Using a combination of high-throughput robotics, high-content imaging and novel data analysis pipelines, our data set identified 60 high and 19 moderate confidence hits which induced cancer cell specific cytotoxicity at the lowest compound dose assessed (0.1 µM). We also revealed a series of known (mTOR/PI3K/AKT) and novel (EGFR and MDM2-p53) drug classes to which LGSOC cell lines showed demonstrable susceptibility.

Machine Learning-Enhanced Extraction of Biomarkers for High-Grade Serous Ovarian Cancer from Proteomics Data

Comprehensive biomedical proteomic datasets are accumulating exponentially, warranting robust analytics to deconvolute them for identifying novel biological insights. Here, we report a strategic machine learning (ML)-based feature extraction workflow that was applied to unveil high-performing protein markers for high-grade serous ovarian carcinoma (HGSOC) from publicly available ovarian cancer tissue and serum proteomics datasets. Diagnosis of HGSOC, an aggressive form of ovarian cancer, currently relies on diagnostic methods based on tissue biopsy and/or non-specific biomarkers such as the cancer antigen 125 (CA125) and human epididymis protein 4 (HE4). Our newly developed ML-based approach enabled the identification of new serum proteomic biomarkers for HGSOC. The performance verification of these marker combinations using two independent cohorts affirmed their outperformance against known biomarkers for ovarian cancer including clinically used serum markers with >97% AUC. Our analysis also added novel biological insights such as enriched cancer-related processes associated with HGSOC.

GWAS Explorer: an open-source tool to explore, visualize, and access GWAS summary statistics in the PLCO Atlas

The Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial is a prospective cohort study of nearly 155,000 U.S. volunteers aged 55–74 at enrollment in 1993–2001. We developed the PLCO Atlas Project, a large resource for multi-trait genome-wide association studies (GWAS), by genotyping participants with available DNA and genomic consent. Genotyping on high-density arrays and imputation was performed, and GWAS were conducted using a custom semi-automated pipeline. Association summary statistics were generated from a total of 110,562 participants of European, African and Asian ancestry. Application programming interfaces (APIs) and open-source software development kits (SDKs) enable exploring, visualizing and open data access through the PLCO Atlas GWAS Explorer website, promoting Findable, Accessible, Interoperable, and Re-usable (FAIR) principles. Currently the GWAS Explorer hosts association data for 90 traits and >78,000,000 genomic markers, focusing on cancer and cancer-related phenotypes. New traits will be posted as association data becomes available. The PLCO Atlas is a FAIR resource of high-quality genetic and phenotypic data with many potential reuse opportunities for cancer research and genetic epidemiology.

The Swedish Cervical Screening Cohort

The Cervical Screening Cohort enrols women screened for human papillomavirus (HPV) and cervical abnormalities within the capital region of Sweden from the organised screening program and the non-organised testing of cervical samples. The cohort started in 2011 and has enrolled more than 670,000 women, contributing more than 1.2 million biobanked samples. The cohort is systematically updated with individual-level data from the Swedish National Cervical Screening Registry (NKCx). Key variables include birthdate, sampling date, cytological, histopathological and HPV analysis results, and invitation history. Each sampling and subsequent clinical follow-up is sequentially registered, allowing for longitudinal analyses of screening results and associated results of the clinical workup. The cohort is ideal for longitudinal, long-term follow-up studies due to its validated documentation and registry-derived information. From the data, it is possible to penetrate important human health mechanisms. The data are available as open-data and GDPR-compliant. Samples are available after getting the required permissions. Results will help researchers understand factors that increase cancer risk and other diseases.

BMT: A Cross-Validated ThinPrep Pap Cervical Cytology Dataset for Machine Learning Model Training and Validation

In the past several years, a few cervical Pap smear datasets have been published for use in clinical training. However, most publicly available datasets consist of pre-segmented single cell images, contain on-image annotations that must be manually edited out, or are prepared using the conventional Pap smear method. Multicellular liquid Pap image datasets are a more accurate reflection of current cervical screening techniques. While a multicellular liquid SurePath™ dataset has been created, machine learning models struggle to classify a test image set when it is prepared differently from the training set due to visual differences. Therefore, this dataset of multicellular Pap smear images prepared with the more common ThinPrep® protocol is presented as a helpful resource for training and testing artificial intelligence models, particularly for future application in cervical dysplasia diagnosis. The “Brown Multicellular ThinPrep” (BMT) dataset is the first publicly available multicellular ThinPrep® dataset, consisting of 600 clinically vetted images collected from 180 Pap smear slides from 180 patients, classified into three key diagnostic categories.

A large annotated cervical cytology images dataset for AI models to aid cervical cancer screening

Accurate detection of abnormal cervical cells in cervical cancer screening increases the chances of timely treatment. The vigorous development of deep learning methods has established a new ecosystem for cervical cancer screening, which has been proven to effectively improve efficiency and accuracy of cell detection in many studies. Although many contributing studies have been conducted, limited public datasets and time-consuming collection efforts may hinder the generalization performance of those advanced models and restrict further research. Through this work, we seek to provide a large dataset of cervical cytology images with exhaustive annotations of abnormal cervical cells. The dataset consists of 8,037 images derived from 129 scanned ThinPrep cytologic test (TCT) slide images. Furthermore, we performed evaluation experiments to demonstrate the performance of representative models trained on our dataset in abnormal cell detection.

Large-scale uterine myoma MRI dataset covering all FIGO types with pixel-level annotations

Uterine myomas are the most common pelvic tumors in women, which can lead to abnormal uterine bleeding, abdominal pain, pelvic compression symptoms, infertility, or adverse pregnancy. In this article, we provide a dataset named uterine myoma MRI dataset (UMD), which can be used for clinical research on uterine myoma imaging. The UMD is the largest publicly available uterine MRI dataset to date including 300 cases of uterine myoma T2-weighted imaging (T2WI) sagittal patient images and their corresponding annotation files. The UMD covers 9 types of uterine myomas classified by the International Federation of Gynecology and Obstetrics (FIGO), which were annotated and reviewed by 11 experienced doctors to ensure the authority of the annotated data. The UMD is helpful for uterine myomas classification and uterine 3D reconstruction tasks, which has important implications for clinical research on uterine myomas.

Annotated Pap cell images and smear slices for cell classification

Machine learning-based systems have become instrumental in augmenting global efforts to combat cervical cancer. A burgeoning area of research focuses on leveraging artificial intelligence to enhance the cervical screening process, primarily through the exhaustive examination of Pap smears, traditionally reliant on the meticulous and labor-intensive analysis conducted by specialized experts. Despite the existence of some comprehensive and readily accessible datasets, the field is presently constrained by the limited volume of publicly available images and smears. As a remedy, our work unveils APACC (Annotated PAp cell images and smear slices for Cell Classification), a comprehensive dataset designed to bridge this gap. The APACC dataset features a remarkable array of images crucial for advancing research in this field. It comprises 103,675 annotated cell images, carefully extracted from 107 whole smears, which are further divided into 21,371 sub-regions for a more refined analysis. This dataset includes a vast number of cell images from conventional Pap smears and their specific locations on each smear, offering a valuable resource for in-depth investigation and study.

Pixel-wise segmentation of cells in digitized Pap smear images

A simple and cheap way to recognize cervical cancer is using light microscopic analysis of Pap smear images. Training artificial intelligence-based systems becomes possible in this domain, e.g., to follow the European recommendation to screen negative smears to reduce false negative cases. The first step for such a process is segmenting the cells. A large and manually segmented dataset is required for this task, which can be used to train deep learning-based solutions. We describe a corresponding dataset with accurate manual segmentations for the enclosed cells. Altogether, the APACS23 (Annotated PAp smear images for Cell Segmentation 2023) dataset contains about 37 000 manually segmented cells and is separated into dedicated training and test parts, which could be used for an official benchmark of scientific investigations or a grand challenge.

Histopathological whole slide image dataset for classification of treatment effectiveness to ovarian cancer

Ovarian cancer is the leading cause of gynecologic cancer death among women. Regardless of the development made in the past two decades in the surgery and chemotherapy of ovarian cancer, most advanced-stage patients experience recurrence and die. The conventional treatment for ovarian cancer is to remove cancerous tissues using surgery followed by chemotherapy; however, patients with such treatment remain at great risk for tumor recurrence and progressive resistance. Nowadays, new treatments with molecular-targeted agents have become accessible. Bevacizumab as a monotherapy in combination with chemotherapy has been recently approved by the FDA for the treatment of epithelial ovarian cancer (EOC). Prediction of therapeutic effects and individualization of therapeutic strategies are critical, but to the authors’ best knowledge, there are no effective biomarkers that can be used to predict patient response to bevacizumab treatment for EOC and peritoneal serous papillary carcinoma (PSPC). This dataset helps researchers to explore and develop methods to predict the therapeutic effect of bevacizumab in patients with EOC and PSPC.

Publisher

Springer Science and Business Media LLC

ISSN

2052-4463