Interpretable Machine Learning for Proteomics‐Based Subtyping and Tumor Mutational Burden Prediction in Endometrial Cancer
ABSTRACT
Background
Endometrial carcinoma (EC) represents a significant clinical challenge due to its pronounced molecular heterogeneity, which directly influences prognosis and therapeutic response. Accurate classification of molecular subtypes (CNV‐high, CNV‐low, MSI‐H, POLE) and precise tumor mutational burden (TMB) assessment are crucial for guiding personalized therapeutic interventions. Integrating proteomics data with advanced machine learning (ML) techniques offers a promising strategy for achieving precise, clinically actionable classification and biomarker discovery in EC.
Materials and Methods
Using proteomic data from 95 EC patients (83 endometrioid, 12 serous) sourced from the Clinical Proteomic Tumor Analysis Consortium (CPTAC), we developed an ML pipeline integrating proteomic feature selection (Lasso‐penalized logistic regression), classification modeling, and interpretability analysis. The dataset was divided into training (70%) and test (30%) sets, with the synthetic minority oversampling technique (SMOTE) applied to address class imbalance. Logistic regression models were trained for molecular subtype classification and TMB prediction, and model performance was evaluated using accuracy, AUC, precision, recall, and F1‐score. Model interpretability was enhanced using explainable AI (XAI) techniques: SHapley Additive exPlanations (SHAP) and Local Interpretable Model‐agnostic Explanations (LIME).
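As an illustration of the oversampling step, the following is a minimal sketch of the SMOTE idea: each synthetic sample is placed on the line segment between a minority-class sample and one of its nearest minority-class neighbours. It is not the authors' implementation. The function name `smote`, the neighbour count `k=3`, the toy abundance values, and the choice to oversample only the training set are all assumptions made for this example.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the minority class,
        # ranked by squared Euclidean distance (base itself excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        mate = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> mate
        synthetic.append([a + t * (b - a) for a, b in zip(base, mate)])
    return synthetic

# Toy, made-up abundances for two hypothetical proteins (not CPTAC data)
minority_train = [[0.1, 1.2], [0.3, 1.0], [0.2, 1.5], [0.4, 1.1]]
majority_train = [[2.0 + 0.1 * i, -1.0] for i in range(10)]

# Assumption: only the training split is oversampled, up to class parity
new_points = smote(minority_train, n_new=len(majority_train) - len(minority_train))
balanced_minority = minority_train + new_points
```

Because every synthetic point is an interpolation, it stays within the range spanned by the observed minority samples. Oversampling only the training split, as assumed here, keeps synthetic samples out of the held-out evaluation.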
Results
Feature selection reduced the proteomic dataset from 11,000 proteins to eight key proteins. The proteomics‐based ML model demonstrated robust predictive performance, accurately classifying EC molecular subtypes (accuracy: 82.8%; AUC: 0.990) and distinguishing high‐TMB (≥10 mutations/Mb) from low‐TMB (<10 mutations/Mb) cases (accuracy: 89.7%; AUC: 0.984). SHAP analysis highlighted clinically recognized biomarkers (MLH1, PMS2, STAT1) and identified novel protein candidates (MTHFD2, MAST4, RPL22L1, MX2, SEC16A). LIME analysis provided individualized prediction interpretations, clarifying each protein biomarker's influence on model decisions.
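To make the TMB dichotomization and the reported threshold metrics concrete, here is a small sketch that labels cases as high or low TMB at the 10 mutations/Mb cutoff stated above and computes accuracy, precision, recall, and F1‐score from a confusion matrix. The TMB values and predictions are invented for illustration, the helper names `tmb_label` and `binary_metrics` are hypothetical, and AUC is omitted because it requires predicted probabilities rather than hard labels.

```python
TMB_CUTOFF = 10.0  # mutations/Mb, the threshold stated in the text

def tmb_label(tmb):
    """Return 1 for high TMB (>= cutoff), 0 for low TMB (< cutoff)."""
    return 1 if tmb >= TMB_CUTOFF else 0

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the positive (high-TMB) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up TMB values and hypothetical model predictions (not study data)
tmb_values = [2.1, 35.0, 10.0, 4.7, 120.3, 9.9]
y_true = [tmb_label(v) for v in tmb_values]
y_pred = [0, 1, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
```

Note that a case at exactly 10.0 mutations/Mb falls in the high‐TMB group under the ≥10 definition.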
Conclusion
Our proteomics‐driven ML approach demonstrates high accuracy and interpretability in EC subtype classification and TMB prediction. By identifying validated and novel biomarkers, this strategy provides essential biological insights and a strong foundation for the future development of non‐invasive diagnostics, personalized treatments, and precision medicine in EC.