Integrated bioinformatic and machine learning analysis identifies MCM7 and ADAM17 as potential biomarkers for early stage gastric cancer
Original Article

Fei Li1#, Gang Liu1#, Jin Wang2#, Zhizhai Luo2,3

1Department of Gastrointestinal Surgery, The First People’s Hospital of Nanning, The Fifth Affiliated Hospital of Guangxi Medical University, Nanning, China; 2Department of Gland Surgery, Affiliated Hospital of Youjiang Medical University for Nationalities, Baise, China; 3Key Laboratory of Tumor Molecular Pathology of Baise, Baise, China

Contributions: (I) Conception and design: F Li, G Liu, J Wang; (II) Administrative support: Z Luo; (III) Provision of study materials or patients: F Li, G Liu, J Wang; (IV) Collection and assembly of data: J Wang; (V) Data analysis and interpretation: F Li, G Liu; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

#These authors contributed equally to this work.

Correspondence to: Zhizhai Luo, MD. Department of Gland Surgery, Affiliated Hospital of Youjiang Medical University for Nationalities, Baise, China; Key Laboratory of Tumor Molecular Pathology of Baise, No. 18 Zhongshan Second Road, Baise 533000, China. Email: 2102@ymcn.edu.cn.

Background: Early detection of gastric cancer is crucial for improving prognosis, yet current diagnostic biomarkers remain insufficient for identifying early gastric cancer (EGC, stage I–II). While previous studies have proposed molecular markers, few have systematically validated them across multiple cohorts, and their diagnostic accuracy and immune relevance remain unclear. This study aimed to identify and validate potential early diagnostic biomarkers for EGC using an integrated bioinformatic and machine learning framework.

Methods: Transcriptome data from four Gene Expression Omnibus (GEO) datasets comprising 434 tumor and 100 normal samples were integrated. Only stage I–II gastric cancer samples, defined by pathological criteria according to the American Joint Committee on Cancer Tumor-Node-Metastasis (AJCC TNM) staging system, were included in this study, while advanced-stage cases were excluded to ensure a homogeneous early-stage cohort. Normal gastric tissues were obtained from non-tumor regions of gastrectomy specimens and served as controls. Differentially expressed genes (DEGs) were identified using the limma algorithm. Three machine-learning methods [i.e., least absolute shrinkage and selection operator (LASSO) regression, support vector machine recursive feature elimination (SVM-RFE), and random forest (RF)] were applied to screen feature genes. A diagnostic support vector machine (SVM) model was constructed based on the DEGs selected by all three methods. External validation was conducted using The Cancer Genome Atlas – Stomach Adenocarcinoma (TCGA-STAD) and Human Protein Atlas (HPA) datasets. Functional enrichment and CIBERSORT immune infiltration analyses were performed to explore potential mechanisms.

Results: A total of 101 DEGs were identified, and four feature genes (i.e., MCM7, ADAM17, DPT, and KIT) were selected by all three machine-learning algorithms. The SVM diagnostic model showed excellent performance [area under the curve (AUC) =0.998, sensitivity =96.5%, specificity =95.2%]. Among these genes, MCM7 and ADAM17 were significantly overexpressed in the tumor tissues, were associated with a poor prognosis (P<0.05), and each showed strong diagnostic performance (AUC >0.85). The SHapley Additive exPlanations (SHAP) analysis revealed that these two genes contributed most to the model’s predictions. The functional analysis showed that MCM7 was enriched in DNA replication and cell cycle pathways, while ADAM17 was involved in inflammatory and tumor-related signaling. The immune infiltration analysis indicated that both genes were significantly associated with various immune cell subpopulations, suggesting a potential role in modulating the tumor immune microenvironment.

Conclusions: This study identified MCM7 and ADAM17 as potential biomarkers for EGC through integrated multi-cohort bioinformatic analysis. Further experimental and clinical studies are required to validate their diagnostic specificity and applicability in real-world settings.

Keywords: Gastric cancer; early diagnosis; machine learning; immune microenvironment


Submitted Jul 29, 2025. Accepted for publication Sep 01, 2025. Published online Sep 26, 2025.

doi: 10.21037/jgo-2025-604


Highlight box

Key findings

• This study identified MCM7 and ADAM17 as robust diagnostic biomarkers for early gastric cancer (EGC). A machine learning-based diagnostic model integrating least absolute shrinkage and selection operator (LASSO) regression, support vector machine recursive feature elimination (SVM-RFE), and random forest (RF) algorithms achieved high accuracy [area under the curve (AUC) =0.998]. The MCM7 and ADAM17 expression levels were significantly correlated with immune cell infiltration patterns, suggesting roles in tumor-immune interactions.

What is known, and what is new?

• Previous studies have explored molecular markers for gastric cancer, but few have focused on early stage disease using integrated multi-cohort analysis and cross-validation. This study combined multi-algorithm machine learning with the transcriptomic integration of Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) datasets, together with SHapley Additive exPlanations (SHAP) analysis, to construct a highly interpretable diagnostic model specific to stage I–II gastric cancer.

• To our knowledge, this study is the first to suggest immunomodulatory roles for MCM7 and ADAM17 in EGC, with the expression of both genes linked to specific immune cell subsets and tumor microenvironment (TME) changes.

What is the implication, and what should change now?

• MCM7 and ADAM17 may serve as effective biomarkers for the early detection of gastric cancer, which could improve diagnostic rates in the asymptomatic phase. Future clinical studies should focus on translating these biomarkers into non-invasive diagnostic tools, such as liquid biopsies. Functional validation and prospective studies are needed to confirm their utility in clinical decision making and screening programs.


Introduction

According to GLOBOCAN 2020 estimates, gastric cancer accounts for approximately one million new cases and 700,000 deaths worldwide each year (1). Gastric cancer is the fifth most commonly diagnosed malignancy and the fourth leading cause of cancer-related death worldwide, and its incidence is particularly high in Asia (1). The absolute number of gastric cancer cases is expected to increase further in the coming years due to ongoing risk factors, such as Helicobacter pylori infection, high-salt diets, and tobacco and alcohol use, as well as population aging; thus, gastric cancer represents a substantial healthcare burden (2).

Compared with advanced gastric cancer (AGC), early gastric cancer (EGC) is associated with a significantly better prognosis. The 5-year survival rate of EGC patients exceeds 90%, while that of AGC patients is less than 30%, and that of stage IV disease patients is less than 15% (3,4). Therefore, early detection is critical. However, the global detection rate of EGC remains suboptimal. Despite technological advances, current endoscopic techniques such as white light endoscopy, narrow band imaging, and flexible spectral imaging color enhancement still exhibit limited sensitivity and specificity in detecting EGC, as demonstrated in recent meta-analyses (5). These limitations underscore the importance of complementary diagnostic biomarkers to improve early detection.

Countries with established screening systems, such as Japan and South Korea, report EGC detection rates exceeding 50%; however, the detection rate remains around 20% in China and several Western nations (6). A number of key factors contribute to the low detection rate, including the following: (I) asymptomatic presentation or non-specific symptoms that resemble gastritis or peptic ulcers, often leading to diagnostic delays; (II) variability in endoscopic quality due to differences in equipment and operator experience, resulting in missed diagnoses; (III) the limited sensitivity of imaging techniques such as computed tomography (CT) in detecting superficial mucosal lesions or early nodal metastasis; and (IV) socioeconomic constraints, including limited screening capacity in primary care settings (7,8). These limitations highlight the need for alternative strategies beyond traditional endoscopy and imaging, and have prompted interest in molecular biomarkers with high sensitivity and specificity for early detection. In particular, biomarker-based approaches may help address the challenge of differentiating EGC from benign gastric diseases or gastric lymphoma, which represent key diagnostic dilemmas in clinical practice. Furthermore, prognostic heterogeneity exists even within early-stage disease, with intestinal and diffuse subtypes showing comparable survival outcomes among elderly patients, suggesting that molecular stratification beyond histology is needed (9).

In recent years, advances in multi-omics technologies, including serum profiling, exosome analysis, and tissue transcriptomics, have enabled the development of minimally invasive or non-invasive diagnostic approaches, offering new avenues for EGC screening (10,11). Several studies have explored diagnostic biomarkers for EGC. Nevertheless, most biomarker studies have important limitations. Many rely on small, single-cohort datasets with limited cross-platform validation, raising concerns about reproducibility. Others have focused on AGC or mixed-stage cohorts, thereby limiting their applicability to true early-stage (I–II) disease. Moreover, algorithmic strategies such as least absolute shrinkage and selection operator (LASSO) regression, random forest (RF), and support vector machine recursive feature elimination (SVM-RFE) have been widely applied, but prior studies often lack integration of multiple methods or fail to provide interpretable models, which hampers their clinical translation. For instance, one transcriptome-based study developed a 15-gene diagnostic model capable of detecting lymph node metastasis in early stage cases, with an area under the curve (AUC) of approximately 0.76. The model outperformed traditional tumor markers (e.g., CEA and CA19-9) and CT imaging in the validation cohort, although its clinical utility requires further confirmation and external validation because of the limited sample size of the study (12).

Another study identified AGT, SERPINH1, and MMP7 using the LASSO regression, RF, and SVM-RFE algorithms. These genes showed diagnostic and prognostic relevance in gastric cancer, and were associated with serum exosomal levels and cell migration. However, the study lacked in-depth analysis specific to EGC and primarily focused on peripheral blood samples, without sufficient validation at the tissue level (13). In addition, several investigations have assessed immune cell infiltration using algorithms such as CIBERSORT, and have found that these features are correlated with gene expression (14); however, most analyses have been limited to single-cohort studies, lacked cross-platform validation, and rarely focused on early stage (I–II) samples (15).

This study aimed to identify diagnostic biomarkers specific to EGC (stage I–II) and construct an interpretable diagnostic model. Transcriptome data for stage I–II gastric cancer and matched normal tissues were obtained from Gene Expression Omnibus (GEO) datasets and The Cancer Genome Atlas – Stomach Adenocarcinoma (TCGA-STAD) dataset. After batch-effect correction and dataset integration, the differentially expressed genes (DEGs) were identified. Feature selection was performed using three machine-learning algorithms (i.e., LASSO regression, SVM-RFE, and RF), and the overlapping genes were used to develop a support vector machine (SVM)-based diagnostic model. A SHapley Additive exPlanations (SHAP) analysis was applied to assess the contribution of individual genes to model performance. Gene expression and diagnostic efficacy were validated using the GEO/TCGA cohorts and the Human Protein Atlas (HPA). The prognostic relevance of candidate genes was evaluated using the Kaplan-Meier Plotter database. Further, a gene set enrichment analysis (GSEA) was conducted based on gene-specific expression stratification to explore the associated signaling pathways. Immune infiltration was assessed using CIBERSORT, and the correlations between gene expression and the immune cell subsets were analyzed to elucidate the potential mechanisms underlying EGC. Collectively, this study provides candidate biomarkers and methodological insights to enhance the accuracy of EGC diagnosis and support precision treatment strategies. We present this article in accordance with the TRIPOD reporting checklist (available at https://jgo.amegroups.com/article/view/10.21037/jgo-2025-604/rc).


Methods

The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Raw data acquisition

Four gastric cancer datasets, comprising transcriptomic and sample data, were retrieved from the GEO database. Only EGC (stage I–II) and normal tissue samples were included in the subsequent analyses. In relation to the datasets, GSE66229 comprised 100 normal gastric tissue samples; GSE26253 comprised 235 stage I–II gastric cancer samples; GSE26942 comprised 87 stage I–II gastric cancer samples; and GSE84437 comprised 112 stage I–II gastric cancer samples. In addition, transcriptomic and clinical data from TCGA-STAD dataset, which comprised 36 normal samples and 180 stage I–II gastric cancer samples, were obtained.

Batch-effect correction, merging, and differential expression analysis of the GEO datasets

After extracting the transcriptomic profiles of the stage I–II samples from each of the four GEO datasets, the batch effects were corrected using the “ComBat” function in R package sva (https://bioconductor.org/packages/release/bioc/html/sva.html). The corrected datasets were merged into a combined cohort, comprising 100 normal and 434 tumor tissue samples, for further analysis. A principal component analysis (PCA) was conducted before and after batch correction to assess data distribution. DEGs were identified using the limma package (https://bioconductor.org/packages/release/bioc/html/limma.html) based on the following filtering criteria: |log fold change (FC)| >2 and false discovery rate (FDR) <0.05.
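The filtering step can be illustrated with a minimal Python sketch. It is not the authors' R/limma pipeline: it assumes that per-gene log fold changes and raw P values are already available, and that the FDR is obtained by Benjamini-Hochberg adjustment (an assumption, as the text does not name the adjustment method). The function names and the toy gene labels are hypothetical.

```python
from typing import Dict, List, Tuple


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(
    stats: Dict[str, Tuple[float, float]],
    lfc_cutoff: float = 2.0,
    fdr_cutoff: float = 0.05,
) -> Dict[str, str]:
    """Keep genes with |logFC| > lfc_cutoff and FDR < fdr_cutoff.

    `stats` maps a gene name to (logFC, raw P value). The result maps each
    retained gene to "up" or "down" according to the sign of its logFC.
    """
    genes = list(stats)
    fdr = benjamini_hochberg([stats[g][1] for g in genes])
    return {
        g: ("up" if stats[g][0] > 0 else "down")
        for g, q in zip(genes, fdr)
        if abs(stats[g][0]) > lfc_cutoff and q < fdr_cutoff
    }


# Toy input with placeholder gene names (not data from the study).
toy_stats = {
    "GENE_A": (2.5, 0.001),
    "GENE_B": (-3.1, 0.002),
    "GENE_C": (1.2, 0.0001),  # fails the fold-change cut-off
    "GENE_D": (2.8, 0.6),     # fails the FDR cut-off
}
print(select_degs(toy_stats))
```

Applying both cut-offs jointly means that a gene with a very small P value but a modest fold change (GENE_C in the toy input) is still excluded.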

Identification of disease-specific genes and construction of the machine learning-based diagnostic model

To identify the gastric cancer-specific signature genes, three machine-learning algorithms [i.e., LASSO regression (glmnet package, https://cran.r-project.org/web/packages/glmnet/index.html), SVM-RFE, and RF] were applied to the DEGs in the combined GEO cohort. The genes identified by all three algorithms were determined using the ggvenn package (https://cran.r-project.org/web/packages/ggvenn/index.html) and used for model construction. Additionally, the circlize package (https://jokergoo.github.io/circlize_book/book/) was used to visualize the chromosomal distribution of the selected genes.
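The consensus step amounts to a set intersection, sketched below in Python rather than with the R packages named above. The per-algorithm gene lists are abbreviated, illustrative placeholders: only the four consensus genes come from the Results, and the `GENE_n` entries and the function name are hypothetical.

```python
from functools import reduce
from typing import Dict, List


def consensus_features(selections: Dict[str, List[str]]) -> List[str]:
    """Return the genes chosen by every selector, sorted for a stable output."""
    gene_sets = [set(genes) for genes in selections.values()]
    if not gene_sets:
        return []
    return sorted(reduce(set.intersection, gene_sets))


# Abbreviated placeholder lists; GENE_n names are not from the study.
selections = {
    "LASSO": ["MCM7", "ADAM17", "DPT", "KIT", "GENE_1", "GENE_2"],
    "SVM-RFE": ["KIT", "MCM7", "GENE_2", "ADAM17", "DPT"],
    "RF": ["DPT", "ADAM17", "GENE_3", "MCM7", "KIT"],
}
print(consensus_features(selections))
```

Requiring agreement among all three selectors is a conservative choice: a gene that only one or two methods rank highly (GENE_2 in the toy lists) is dropped.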

SHAP analysis of the model

To improve model interpretability, a SHAP analysis was performed on the optimal classifier [determined by the highest receiver operating characteristic (ROC)-AUC]. The expression data of the candidate genes were extracted and merged with sample labels (tumor vs. normal). Stratified sampling was performed using the “createDataPartition” function in the caret package (https://cran.r-project.org/web/packages/caret/index.html) to split the data into training (70%) and test (30%) sets. The optimal model was trained on the training set, and the SHAP values were computed using the “permshap” function to evaluate the contribution of each gene to model predictions (https://cran.r-project.org/web/packages/permshap/index.html). Visualizations, including bar plots, bee-swarm plots, and scatter plots, were generated using the shapviz package (https://cran.r-project.org/web/packages/shapviz/index.html) to illustrate global and local feature importance. These results provided insights into the model decision mechanisms and the roles of key genes in classification.
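To make the idea behind permutation-based SHAP concrete, the following Python sketch computes exact Shapley values for a single sample by enumerating every feature ordering, which is feasible for a four-gene model (4! = 24 orderings). It is an illustration under stated assumptions, not the R implementation used in the study: the scoring function, its weights, and the background rows are hypothetical.

```python
from itertools import permutations
from typing import Callable, List, Sequence


def permutation_shap(
    predict: Callable[[List[float]], float],
    x: Sequence[float],
    background: Sequence[Sequence[float]],
) -> List[float]:
    """Exact Shapley values of `predict` at `x`, relative to `background`."""
    n = len(x)

    def value(coalition: set) -> float:
        # Mean prediction when coalition features take the values of x and
        # the remaining features are taken from each background row in turn.
        total = 0.0
        for row in background:
            mixed = [x[j] if j in coalition else row[j] for j in range(n)]
            total += predict(mixed)
        return total / len(background)

    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        coalition: set = set()
        previous = value(coalition)
        for j in order:
            coalition.add(j)
            current = value(coalition)
            phi[j] += current - previous  # marginal contribution of feature j
            previous = current
    return [p / len(orderings) for p in phi]


def toy_score(features: List[float]) -> float:
    """Hypothetical linear score over four features (weights are made up)."""
    weights = [1.0, 2.0, -1.0, 0.5]
    return sum(w * v for w, v in zip(weights, features))


sample = [1.0, 1.0, 1.0, 1.0]
reference = [[0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 2.0, 0.0]]
print(permutation_shap(toy_score, sample, reference))
```

The values sum to the difference between the prediction for the sample and the mean prediction over the background rows, which is the property that lets per-gene contributions be read as shares of one prediction.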

Expression and prognostic relevance of diagnostic genes

Boxplots were used to assess the differential expression of the model genes between the tumor and normal tissues in both the GEO and TCGA cohorts. The Kaplan-Meier Plotter database (https://kmplot.com/analysis/) was used to evaluate the association between gene expression and overall survival in the gastric cancer patients. The patients were allocated into high- and low-expression groups based on the optimal cut-off value. The genes with statistically significant survival differences (log-rank P<0.05) were identified as the key genes for further analysis. The diagnostic value of each gene was assessed by ROC curve analysis in both the GEO and TCGA cohorts, and the 95% confidence intervals (CIs) for the AUCs were calculated using the non-parametric bootstrap method. The immunohistochemical (IHC) validation of the key genes was performed using representative histological images from the Human Protein Atlas (HPA, https://www.proteinatlas.org/). The HPA tissue microarrays contain paraffin-embedded specimens from normal and tumor tissues, collected under standardized protocols and evaluated by certified pathologists. For each biomarker, staining intensity (negative, weak, moderate, strong), subcellular localization (nuclear, cytoplasmic, membrane), and the proportion of positive cells were recorded and compared between normal gastric mucosa and gastric cancer samples.
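A single-gene AUC with a bootstrap confidence interval can be sketched as follows. This Python example is illustrative only: it assumes a percentile bootstrap over resampled samples (the text specifies only a non-parametric bootstrap), treats label 1 as tumor and 0 as normal, and uses hypothetical function names and parameter defaults.

```python
import random
from typing import List, Sequence, Tuple


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a tumor sample (1) outranks a normal one (0).

    Tied scores count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for q in neg:
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(
    scores: Sequence[float],
    labels: Sequence[int],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the AUC (samples drawn with replacement)."""
    rng = random.Random(seed)
    n = len(scores)
    aucs: List[float] = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            aucs.append(roc_auc([scores[i] for i in idx], ys))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

With very few samples the interval is wide and can reach 1.0, so an interval of this kind is informative mainly for cohorts of the size used in the study.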

GSEA based on single-gene stratification

To investigate the biological pathways associated with different expression levels of key genes, a GSEA was conducted based on the high- and low-expression groups. Normalized expression matrices were filtered for the tumor samples, and the samples were divided according to the median expression of each key gene. The log FC for each gene was calculated based on the average expression difference between the high- and low-expression groups to generate ranked gene lists. The enrichment analysis was performed using the “GSEA” function in the clusterProfiler package (https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html), with the Kyoto Encyclopedia of Genes and Genomes (KEGG; c2.cp.kegg, https://www.genome.jp/kegg/) gene set as the background and a significance threshold of P<0.05. Normalized enrichment scores were used to plot the top five significantly enriched pathways in both expression groups, highlighting the potential functional roles of key genes.
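The construction of the ranked gene list can be sketched in Python as shown below. The sketch assumes that expression values are already on a log scale, so that a difference in group means is a log fold change, and that samples exactly at the median are assigned to the low-expression group; both points are assumptions, and the function name and toy values are hypothetical.

```python
from statistics import mean, median
from typing import Dict, List, Tuple


def rank_genes_by_key_gene(
    expr: Dict[str, List[float]], key_gene: str
) -> List[Tuple[str, float]]:
    """Rank genes by mean(high group) - mean(low group) for a key gene.

    `expr` maps each gene to log-scale expression across the same tumor
    samples, in the same sample order.
    """
    cutoff = median(expr[key_gene])
    high = [i for i, v in enumerate(expr[key_gene]) if v > cutoff]
    low = [i for i, v in enumerate(expr[key_gene]) if v <= cutoff]
    if not high or not low:
        raise ValueError("median split produced an empty group")
    ranked = []
    for gene, values in expr.items():
        log_fc = mean([values[i] for i in high]) - mean([values[i] for i in low])
        ranked.append((gene, log_fc))
    # Descending order, as expected for a pre-ranked enrichment input.
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Toy matrix with placeholder gene names (not data from the study).
toy_expr = {
    "KEY": [1.0, 2.0, 3.0, 4.0],
    "A": [0.0, 0.0, 3.0, 3.0],
    "B": [5.0, 5.0, 1.0, 1.0],
    "C": [1.0, 1.0, 1.0, 1.0],
}
print(rank_genes_by_key_gene(toy_expr, "KEY"))
```

Genes at the top of the list track the key gene, and genes at the bottom run against it; the enrichment analysis then asks whether pathway members cluster at either end.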

Immune cell infiltration and correlation with key genes

The CIBERSORT algorithm (https://cibersort.stanford.edu/) was used to estimate immune cell infiltration in the stage I–II gastric cancer samples in the combined GEO cohort, using a normalized expression matrix and the signature gene set of 22 immune cell types, with 1,000 permutations. Stacked bar plots and boxplots were used to compare immune cell distributions between the tumor and normal samples, with differences assessed using the Wilcoxon rank-sum test.

Further, the correlation between the key genes and immune cell populations was analyzed. Expression data from the tumor samples were matched to the CIBERSORT matrix, excluding cell types with zero variance. Spearman’s correlation coefficients and P values were calculated to assess associations, which were categorized based on significance and correlation strength. Correlation networks between the genes and immune cells were visualized using the linkET package (https://cran.r-project.org/web/packages/linkET/index.html).
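The correlation step can be illustrated with a small Python sketch of Spearman's coefficient, computed as the Pearson correlation of average ranks. It is not the R code used in the study; the function names are hypothetical, P values are omitted, and returning `None` for a zero-variance series is an assumption that mirrors the exclusion of such cell types described above.

```python
from math import sqrt
from typing import List, Optional, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Spearman correlation; None when either series has zero variance."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)
```

Because only ranks enter the calculation, the coefficient is unaffected by monotone transformations of expression values or cell fractions, which suits CIBERSORT proportions that are bounded and often skewed.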

Statistical analysis

All the analyses were performed using R (version 4.4.3). The batch correction of the GEO datasets was conducted using the “ComBat” function in the sva package. After merging, the expression matrix comprised 100 normal and 434 tumor samples. A differential expression analysis was performed using the limma package based on the following criteria: FDR <0.05 and |logFC| >2. ROC curves and AUC values (with 95% CIs) were generated using the pROC package to evaluate diagnostic performance for EGC. Prognostic analyses were conducted with Kaplan-Meier Plotter, using optimal cut-off values and the log-rank test. A GSEA was performed using the clusterProfiler package with KEGG as the reference database. An immune infiltration analysis was conducted using the CIBERSORT algorithm, and the correlations between the key genes and immune cells were assessed using Spearman’s method. The correlation networks were visualized using linkET. The group comparisons for the continuous variables were performed using the Wilcoxon rank-sum test. All tests were two-sided, and a P value <0.05 was considered statistically significant.


Results

DEG screening and data integration

Four GEO datasets related to EGC, comprising 100 normal and 434 tumor tissue samples, were selected. The PCA revealed pronounced batch effects before correction, while the post-correction data showed consistent distributions in the principal component space, indicating effective batch removal (Figure 1A,1B). The differential expression analysis using the limma package identified 101 DEGs, of which 68 were upregulated and 33 were downregulated (Figure 1C,1D).

Figure 1 PCA and identification of DEGs. (A,B) PCA plots before (A) and after (B) batch-effect correction in the merged GEO datasets. (C) Heatmap of the top DEGs in the merged GEO cohort. (D) Volcano plot showing the DEGs between the early-stage gastric cancer and normal tissues. DEGs, differentially expressed genes; FC, fold change; GEO, Gene Expression Omnibus; N, normal; PCA, principal component analysis; T, tumor.

Feature gene selection via multiple machine-learning algorithms

To identify the diagnostic signature genes for EGC, the DEGs were analyzed using three machine-learning algorithms; that is, LASSO regression (glmnet), SVM-RFE, and RF. The LASSO regression identified 30 genes (Figure 2A,2B), SVM-RFE identified 14 genes (Figure 2C,2D), and RF identified 10 genes (Figure 2E,2F). Four intersecting genes (i.e., DPT, KIT, MCM7, and ADAM17) were identified (Figure 3A). A chromosomal ideogram was generated to visualize their genomic locations (Figure 3B).

Figure 2 Feature gene selection using three machine-learning algorithms. LASSO regression results: selection of optimal lambda (A) and corresponding gene coefficients (B). SVM-RFE results: cross-validation accuracy (C) and selected features (D). RF results: error rates by tree number (E) and genes ranked by importance score (F). CV, cross-validation; LASSO, least absolute shrinkage and selection operator; RF, random forest; SVM-RFE, support vector machine-recursive feature elimination.

Figure 3 Diagnostic gene selection and model interpretation. (A) Venn diagram showing four intersecting genes selected by all three algorithms. (B) The chromosomal locations of the signature genes. (C) ROC curves for 10 classifiers; the SVM model achieved the best performance (AUC =0.998). (D) SHAP bar plot indicating global feature importance. SHAP bee-swarm (E) and scatter (F) plots illustrating gene-wise contributions to model predictions. AUC, area under the curve; DTS, decision tree series; GBM, gradient boosting machine; glmBoost, generalized linear model with boosting; KNN, k-nearest neighbors; LASSO, least absolute shrinkage and selection operator; PLS, partial least squares; RF, random forest; RFE, recursive feature elimination; ROC, receiver operating characteristic; SVM, support vector machine; SHAP, SHapley Additive exPlanations; XGBoost, extreme gradient boosting.

SHAP-based interpretation and feature importance evaluation

A diagnostic model was constructed using the intersecting feature genes. Multiple algorithms, including SVM, RF, and LASSO regression, were evaluated via fivefold cross-validation. All the models had high AUC values; the best-performing model was SVM, which had an AUC of 0.998 (sensitivity =96.5%, specificity =95.2%) (Figure 3C). A SHAP analysis was then conducted to interpret the model predictions. The SHAP bar plots showed that ADAM17 contributed the most to model classification, followed by MCM7, DPT, and KIT (Figure 3D). The SHAP bee-swarm and scatter plots further showed the directional effects of gene expression on model predictions (Figure 3E,3F).

Expression and prognostic significance of model genes

The expression analysis revealed that DPT and KIT were significantly downregulated in the gastric cancer tissues compared to the normal tissues in both the GEO and TCGA cohorts (P<0.001), while MCM7 and ADAM17 were significantly upregulated (P<0.001) (Figure 4A,4B). The Kaplan-Meier analysis showed that MCM7 (P=0.006) and ADAM17 (P<0.001) expression was associated with worse overall survival, while DPT and KIT had no significant prognostic value (P>0.05) (Figure 4C-4F). The ROC curve analysis showed that MCM7 and ADAM17 had AUCs above 0.85 in both the GEO and TCGA datasets, indicating strong diagnostic performance (Figure 5A-5D).

Figure 4 Expression and prognostic value of diagnostic genes. (A,B) Boxplots comparing gene expression between the tumor and normal tissues in the GEO and TCGA cohorts. ***, P<0.001. (C-F) Kaplan-Meier survival curves for overall survival stratified by high vs. low expression of DPT, KIT, MCM7, and ADAM17. CI, confidence interval; GEO, Gene Expression Omnibus; HR, hazard ratio; TCGA, The Cancer Genome Atlas.

Figure 5 Diagnostic performance of the key genes. (A,B) ROC curves for MCM7 (A) and ADAM17 (B) in the GEO datasets. (C,D) ROC curves for MCM7 (C) and ADAM17 (D) in the TCGA dataset. AUC, area under the curve; CI, confidence interval; GEO, Gene Expression Omnibus; ROC, receiver operating characteristic; TCGA, The Cancer Genome Atlas.

IHC validation of key genes using the HPA database

To further validate the differential protein expression of the key genes, we analyzed the IHC results of MCM7 and ADAM17 using the HPA database. The results showed that MCM7 was moderately expressed in the normal gastric tissues, and it was mainly localized in the nuclei of glandular cells, with strong staining intensity and a positive cell proportion of less than 25% (https://www.proteinatlas.org/ENSG00000166508-MCM7/tissue/stomach). In the gastric cancer tissues, the expression of MCM7 was elevated, and it was predominantly localized in the nuclei of tumor cells, with strong staining intensity and a positive cell proportion ranging from 25% to 75% (https://www.proteinatlas.org/ENSG00000166508-MCM7/cancer/stomach+cancer). ADAM17 was moderately expressed in the normal gastric tissues, and it was mainly distributed in the cytoplasm and cell membrane, with a positive cell proportion greater than 75% and moderate staining intensity (https://www.proteinatlas.org/ENSG00000151694-ADAM17/tissue/stomach). In the gastric cancer tissues, ADAM17 expression was further upregulated, but it was still localized in the cytoplasm and membrane, with strong staining intensity and a positive cell proportion exceeding 75% (https://www.proteinatlas.org/ENSG00000151694-ADAM17/cancer/stomach+cancer). Collectively, these IHC results from the HPA database suggest that both MCM7 and ADAM17 are expressed at higher protein levels in gastric cancer tissues than in normal tissues, indicating their potential roles in tumorigenesis and progression.

GSEA-based functional enrichment analysis of key genes

A GSEA was then performed to investigate the functional implications of MCM7 and ADAM17 expression in EGC. In the MCM7 high-expression group, the significantly enriched KEGG pathways included base excision repair, cell cycle, DNA replication, homologous recombination, and mismatch repair, indicating heightened activity of pathways related to cell proliferation and DNA repair (Figure 6A). Conversely, the MCM7 low-expression group was enriched in pathways such as cell adhesion molecules (CAMs), extracellular matrix (ECM)-receptor interaction, focal adhesion, gap junction, and the transforming growth factor-beta (TGF-beta) signaling pathway, suggesting the potential suppression or reduced activity of adhesion and signaling pathways (Figure 6B).

Figure 6 GSEA of the key genes. Top five enriched KEGG pathways in MCM7 high- (A) and low-expression (B) groups. Top five enriched KEGG pathways in ADAM17 high- (C) and low-expression (D) groups. GSEA, gene set enrichment analysis; KEGG, Kyoto Encyclopedia of Genes and Genomes.

In the ADAM17 high-expression group, the enriched KEGG pathways included basal cell carcinoma, cytokine-cytokine receptor interaction, dilated cardiomyopathy, the hedgehog signaling pathway, and hematopoietic cell lineage, reflecting ADAM17 involvement in carcinogenesis, inflammation, and cell differentiation (Figure 6C). In the ADAM17 low-expression group, the enriched pathways were related to the cell cycle, gap junction, pancreatic cancer, protein export, and ubiquitin-mediated proteolysis, but had relatively low enrichment scores, implying limited pathway activation (Figure 6D).

Taken together, these results indicate that MCM7 is associated with the activation of cell cycle and DNA repair pathways, consistent with a critical role in cellular proliferation and genomic stability. Conversely, its low expression may affect cell adhesion and intercellular communication. ADAM17 overexpression is linked to pathways involved in inflammation, tumorigenesis, and hematopoietic differentiation, pointing to an oncogenic role in the tumor microenvironment (TME). Low ADAM17 expression is associated with protein metabolism and cancer subtype-specific pathways, which may reflect a context-dependent regulatory function.

Immune infiltration and correlation between key genes and immune cells

The immune infiltration analysis revealed that the expression levels of MCM7 and ADAM17 were significantly correlated with immune cell distribution in the gastric cancer microenvironment. The stacked bar chart revealed distinct immune cell compositions between the normal (green) and tumor (red) tissues (Figure 7A). The box plots showed that the tumor tissues had significantly higher proportions of activated cluster of differentiation 4 (CD4) memory T cells, resting and activated natural killer (NK) cells, M0 and M1 macrophages, resting and activated dendritic cells (DCs), activated mast cells, and eosinophils (P<0.05). Conversely, the proportions of memory B cells, plasma cells, CD8+ T cells, resting CD4 memory T cells, gamma delta T cells, M2 macrophages, and resting mast cells were significantly reduced (P<0.05) (Figure 7B).

Figure 7 Immune infiltration analysis and correlation with MCM7 and ADAM17 expression. (A) Stacked bar chart showing immune cell composition in the tumor and normal tissues. (B) Boxplots of the differentially infiltrated immune cells between the tumor and normal samples. **, P<0.01; ***, P<0.001. (C) Heatmap displaying the Spearman correlations between key gene expression and the proportions of 22 immune cell types.

The correlation heatmaps revealed significant associations between gene expression and immune cell infiltration. MCM7 expression was positively correlated with T follicular helper (Tfh) cells and M0 macrophages, and negatively correlated with resting CD4 memory T cells, monocytes, resting DCs, and activated mast cells. ADAM17 expression was positively associated with naïve and memory B cells, naïve CD4 T cells, Tfh cells, activated NK cells, M0 macrophages, and eosinophils, but was negatively correlated with resting and activated CD4 memory T cells, resting NK cells, monocytes, resting DCs, and activated mast cells (Figure 7C).

These findings suggest that MCM7 and ADAM17 are closely involved in modulating the immune microenvironment of gastric cancer, and may influence immune cell recruitment and functional states.
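The gene-to-immune-cell associations in Figure 7C are Spearman correlations, that is, Pearson correlations computed on ranks. The following is a minimal sketch of that calculation using only the Python standard library. The sample values and the variable names (`gene_expression`, `m0_fraction`) are hypothetical toy data, not values from this study, and the sketch does not reproduce the authors' actual pipeline.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: one gene's expression and one immune cell
# fraction (for example, M0 macrophages) across five samples.
gene_expression = [2.1, 3.4, 1.8, 4.0, 2.9]
m0_fraction = [0.05, 0.12, 0.04, 0.20, 0.09]

rho = spearman(gene_expression, m0_fraction)
print(f"Spearman rho = {rho:.3f}")
```

In this toy example the two variables are ordered identically across samples, so the coefficient is at its maximum. In practice, a heatmap such as Figure 7C would repeat this calculation for each gene against each of the 22 immune cell types, usually together with a significance test.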


Discussion

In this study, transcriptomic data from four independent GEO datasets comprising stage I–II gastric cancer and normal tissue samples were integrated into a single dataset of 100 normal and 434 tumor tissues. A total of 101 DEGs were identified using the limma package. Subsequently, three machine-learning algorithms (i.e., LASSO regression, SVM-RFE, and RF) were applied for feature selection, and four key genes (i.e., DPT, KIT, MCM7, and ADAM17) were identified by intersecting the genes selected by the three algorithms. Based on these four key genes, an SVM diagnostic model was constructed, which had an AUC of 0.998. A SHAP analysis was used to improve model interpretability, with MCM7 and ADAM17 contributing most significantly to the prediction. External validation using the GEO and TCGA datasets confirmed that MCM7 and ADAM17 were significantly upregulated in the tumor tissues and associated with an unfavorable prognosis (P<0.05), with both showing excellent diagnostic performance (AUC >0.85). Immunohistochemistry data from the HPA database further corroborated that these genes were overexpressed at the protein level. The functional enrichment analyses revealed that MCM7 was primarily associated with DNA replication and cell cycle-related pathways, while ADAM17 was enriched in inflammation- and cancer-related signaling pathways. An immune infiltration analysis via CIBERSORT showed that the expression of these two genes was correlated with various immune cell subsets, such as M0 macrophages, Tfh cells, and NK cells, suggesting their potential immunoregulatory roles in the TME.

In recent years, the integration of high-throughput omics data with machine-learning approaches has offered novel opportunities for biomarker discovery in oncology. Compared to traditional statistical models, machine learning provides a superior non-linear modeling capacity, automatic feature extraction, and better generalizability, and has thus emerged as a promising tool for molecular diagnostics in cancer (16,17). However, many existing diagnostic models for EGC are limited by small sample sizes, algorithmic homogeneity, and a lack of interpretability, hindering their translational application.

To address these challenges, this study adopted a large-scale multi-cohort bioinformatic approach to integrate EGC datasets, thereby mitigating the effects of platform heterogeneity and sample scarcity. Moreover, an independent validation cohort comprising early stage cases from TCGA-STAD (36 normal and 180 tumor tissues) was included to assess the robustness of the model. The combined application of LASSO regression, SVM-RFE, and RF in the feature selection stage improved model stability and reliability. LASSO regression reduces overfitting via regularization, while SVM-RFE is well suited for high-dimensional small-sample data, and RF captures non-linear relationships effectively (18). The final model based on the four intersecting genes (i.e., MCM7, ADAM17, DPT, and KIT) achieved an AUC of 0.998, outperforming models based on single algorithms.
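The consensus step described above, keeping only genes selected by all three algorithms, amounts to a set intersection. The sketch below illustrates that step alone, under stated assumptions: the per-method gene lists are hypothetical, the `GENE_A` to `GENE_D` entries are placeholders, and only the four-gene overlap matches what the study reports. It does not run LASSO, SVM-RFE, or RF itself.

```python
from collections import Counter


def consensus_features(selections):
    """Return the features chosen by every method, sorted for stable output."""
    if not selections:
        raise ValueError("at least one selection method is required")
    feature_sets = [set(features) for features in selections.values()]
    return sorted(set.intersection(*feature_sets))


def selection_counts(selections):
    """Count how many methods selected each feature."""
    counts = Counter()
    for features in selections.values():
        counts.update(set(features))
    return dict(counts)


# Hypothetical per-method outputs; GENE_A..GENE_D are placeholders and
# do not correspond to genes reported in the study.
selected_by_method = {
    "LASSO": ["MCM7", "ADAM17", "DPT", "KIT", "GENE_A", "GENE_B"],
    "SVM-RFE": ["MCM7", "ADAM17", "DPT", "KIT", "GENE_B", "GENE_C"],
    "RF": ["MCM7", "ADAM17", "DPT", "KIT", "GENE_A", "GENE_D"],
}

key_genes = consensus_features(selected_by_method)
counts = selection_counts(selected_by_method)
print("Consensus genes:", key_genes)
```

Requiring agreement across all methods is a conservative choice: it trades sensitivity for stability, because a gene favored by only one algorithm's inductive bias is dropped. The per-feature counts make it easy to inspect near misses selected by two of the three methods.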

Previous findings also support the use of the multi-algorithm approach in robust biomarker identification. For example, Zeng et al. identified early diagnostic and prognostic markers for pancreatic cancer using combined LASSO and SVM-RFE methods (19). Further, the application of the SHAP analysis in this study improved model interpretability, helping to elucidate the contribution of individual genes to prediction outcomes (20), which may improve the clinical guidance for the diagnosis of EGC.
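SHAP attributes a prediction to individual features using Shapley values. For a model with only four inputs, these values can be computed exactly by enumerating every feature coalition, which the sketch below does with the standard library. This is an illustration of the underlying idea, not the SHAP library and not the explainer used in the study, which the text does not specify. The logistic toy model, its weights, the standardized expression values, and the all-zero baseline are all hypothetical.

```python
from itertools import combinations
from math import exp, factorial

GENES = ["MCM7", "ADAM17", "DPT", "KIT"]
# Hypothetical weights for a toy logistic model; not fitted to any data.
WEIGHTS = [0.9, 0.7, -0.4, -0.3]


def toy_predict(features):
    """Toy tumor probability: logistic function of a weighted sum."""
    logit = sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + exp(-logit))


def shapley_values(predict, instance, baseline):
    """Exact Shapley values by enumerating all coalitions of features.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(instance)

    def value(coalition):
        mixed = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        contributions.append(total)
    return contributions


# Hypothetical standardized expression for one sample, compared against
# an all-zero baseline (the mean after standardization).
instance = [1.5, 1.2, -0.8, -0.5]
baseline = [0.0, 0.0, 0.0, 0.0]

phi = shapley_values(toy_predict, instance, baseline)
for gene, contribution in zip(GENES, phi):
    print(f"{gene}: {contribution:+.4f}")
```

A useful sanity check is the efficiency property: the contributions sum to the difference between the prediction for the sample and the prediction for the baseline. Exact enumeration scales exponentially with the number of features, which is why approximate explainers are used for larger models.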

Based on the above findings, this study identified MCM7 and ADAM17 as two key genes with significantly altered expression in EGC and constructed a high-performance diagnostic model validated in an independent cohort. From a clinical perspective, the consistent upregulation and high diagnostic accuracy (AUC >0.85) of MCM7 and ADAM17 in both GEO and TCGA datasets, together with IHC validation, support their potential application as adjunctive biomarkers in EGC diagnosis. In current practice, they could be used to complement histopathological assessment of endoscopic biopsy specimens, especially when morphological changes are subtle. Additionally, future studies should explore their detection in liquid biopsy samples, enabling non-invasive screening in high-risk populations. Combining these biomarkers with established clinical and endoscopic parameters may further enhance risk stratification and optimize early detection strategies.
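For a single marker, the AUC quoted above has a direct interpretation: it is the probability that a randomly chosen tumor sample has a higher marker value than a randomly chosen normal sample, with ties counted as one half. The sketch below computes it by pairwise comparison. The expression values are hypothetical toy numbers chosen for illustration, not data from GEO or TCGA.

```python
def auc_from_scores(positive_scores, negative_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties contribute 0.5, matching the Mann-Whitney U formulation.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical expression values of one marker gene.
tumor_expression = [5.2, 6.1, 4.8, 7.0]
normal_expression = [4.0, 4.9, 3.5]

auc = auc_from_scores(tumor_expression, normal_expression)
print(f"AUC = {auc:.3f}")
```

Here 11 of the 12 tumor/normal pairs are ordered correctly, giving an AUC of about 0.917. The pairwise loop is quadratic in sample size, which is acceptable for a sketch; rank-based formulas give the same value more efficiently for large cohorts.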

MCM7, a core subunit of the MCM2-7 helicase complex, plays an essential role in the initiation of DNA replication and cell cycle progression (21). Previous studies have shown that MCM7 is upregulated in various tumors and closely associated with cell proliferation. For example, in hepatocellular carcinoma and breast cancer, MCM7 expression is strongly correlated with Ki67 and indicative of a poor prognosis (21). In the context of gastric cancer, Yang et al. reported that the expression of MCM7 was increased in premalignant and malignant lesions, particularly in adenomas and early stage carcinomas, and its expression was positively correlated with Ki67 levels (22). They proposed MCM7 as a sensitive proliferation marker for evaluating gastric cancer and precursor lesions.

This study further confirmed the elevated expression of MCM7 in gastric cancer through comprehensive GEO, TCGA, and HPA analyses, and showed its association with an unfavorable prognosis. The functional enrichment of MCM7 indicated significant involvement in cell cycle and DNA replication pathways, reinforcing its role in tumor proliferation. Additionally, in vitro experiments have shown that MCM7 knockdown suppresses tumor cell proliferation and colony formation (23), which is consistent with our findings. The GSEA showed that low MCM7 expression was associated with enrichment in pathways such as CAMs, ECM-receptor interaction, TGF-beta signaling, and gap junctions, suggesting its potential role in epithelial-mesenchymal transition suppression and tissue homeostasis (24). Notably, the TGF-beta pathway is known for its dual role in tumor suppression and promotion, and is closely associated with ECM remodeling in gastric cancer (25). Thus, the expression level of MCM7 may influence the cellular shift between proliferative and homeostatic states.

ADAM17 is a membrane-bound metalloprotease responsible for the ectodomain shedding of multiple substrates, including tumor necrosis factor-alpha (TNF-α) and pro-epidermal growth factor (pro-EGF) ligands, which in turn modulates intercellular communication and the TME (26). It plays a pivotal role in inflammation and tumor progression. ADAM17 is frequently upregulated in gastric cancer and is associated with adverse pathological features such as lymph node metastasis. One study reported that ADAM17 promotes tumor cell proliferation and migration by activating Notch and Wnt signaling in lymph node-positive gastric cancer cell lines (27), which is consistent with our observations of its upregulation in early lesions. Importantly, ADAM17 is also involved in immune regulation in the TME. It is expressed in tumor-associated macrophages, and regulates the release of proinflammatory mediators such as TNF-α and cyclooxygenase-2 (COX-2), facilitating tumor initiation and progression (28).

Our CIBERSORT analysis revealed that ADAM17 expression was correlated with the infiltration of various immune cells, including macrophages and DCs, which provides further evidence of its role in shaping the inflammatory and immune landscape of EGC. The GSEA further showed that low ADAM17 expression was associated with metabolic pathways, including proteasome, ubiquitin-mediated proteolysis, and protein transport, indicating a more quiescent and homeostatic cellular state.

Additionally, the immune cell composition analysis revealed a distinct immunological pattern in EGC. The tumor tissues displayed significantly higher proportions of activated CD4+ memory T cells, M1 macrophages, activated NK cells, and DCs, while the proportions of CD8+ T cells, M2 macrophages, resting NK cells, and plasma cells were notably reduced. This suggests an immune activation pattern marked by a shift from innate to adaptive responses. Early stage gastric tumors may stimulate antigen presentation and innate immune activation via NK cells and DCs, which then prime CD4+ T cells for anti-tumor activity (29).

Our study showed that MCM7 was positively associated with Tfh cells and M0 macrophages, while ADAM17 was positively correlated with activated NK cells and DCs, which suggests that these genes may have immunomodulatory roles in addition to their proliferative functions. Notably, the enrichment of cytokine-cytokine receptor interaction and hematopoietic lineage pathways in the high ADAM17 expression group suggests its involvement in immune cell trafficking and differentiation (30). These findings support those of previous studies on the role of ADAM family proteins in inducing inflammatory tumor environments (31,32), and suggest that ADAM17 could serve as a potential immunotherapeutic target.

Despite the strengths of this study, including multi-cohort integration, machine learning-based feature selection, and comprehensive immune analysis, certain limitations remain. First, while this study comprehensively validated MCM7 and ADAM17 using independent transcriptomic datasets and immunohistochemical images from the HPA database, additional verification on fresh patient-derived tumor materials would further enhance the robustness of our conclusions. Future work will therefore include prospective collection of early-stage gastric cancer specimens to perform gene and protein expression analyses, providing functional and clinical corroboration of our current bioinformatics findings. Moreover, future studies using gene knockout, overexpression, or CRISPR interference approaches should be conducted to elucidate the mechanistic roles of MCM7 and ADAM17 in gastric cancer proliferation, migration, and immune regulation. Second, our analysis did not include other gastric diseases, such as gastric ulcer or gastric lymphoma, as controls, which may limit the diagnostic specificity of MCM7 and ADAM17; further validation in broader clinical cohorts will therefore be required. Third, this study focused on tissue-based transcriptomic data, without evaluating non-invasive samples such as serum or gastric fluid. Further research is needed to assess the feasibility of using MCM7 and ADAM17 as biomarkers for early detection via liquid biopsy. Finally, while the diagnostic model demonstrated high statistical performance, its practical clinical application will require consideration of the assay cost, simplicity, and feasibility.


Conclusions

In this study, we systematically integrated multi-cohort transcriptomic data and applied three mainstream machine-learning algorithms to identify key diagnostic biomarkers for EGC. Two robust candidate genes, MCM7 and ADAM17, were identified and validated in independent cohorts. Both genes were significantly overexpressed in tumor tissues, associated with poor prognosis, and showed strong diagnostic power. The functional enrichment and immune infiltration analyses further suggested that these two genes play critical roles in cell proliferation and modulation of the tumor immune microenvironment. Although the model demonstrated excellent statistical performance, further biological validation in cell and animal models is warranted. Moreover, future studies should explore the feasibility of using MCM7 and ADAM17 in non-invasive diagnostic strategies, such as liquid biopsies. Our findings provide novel insights into EGC detection and establish a foundation for future translational research.


Acknowledgments

None.


Footnote

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://jgo.amegroups.com/article/view/10.21037/jgo-2025-604/rc

Peer Review File: Available at https://jgo.amegroups.com/article/view/10.21037/jgo-2025-604/prf

Funding: This work was supported by the Science and Technology Plan Projects in Baise (No. BK20243434).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jgo.amegroups.com/article/view/10.21037/jgo-2025-604/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study used publicly available data, so ethical approval was not applicable. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Sung H, Ferlay J, Siegel RL, et al. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin 2021;71:209-49. [Crossref] [PubMed]
  2. Malfertheiner P, Camargo MC, El-Omar E, et al. Helicobacter pylori infection. Nat Rev Dis Primers 2023;9:19. [Crossref] [PubMed]
  3. Huang J, Lucero-Prisno DE 3rd, Zhang L, et al. Updated epidemiology of gastrointestinal cancers in East Asia. Nat Rev Gastroenterol Hepatol 2023;20:271-87. [Crossref] [PubMed]
  4. Sundar R, Nakayama I, Markar SR, et al. Gastric cancer. Lancet 2025;405:2087-102. [Crossref] [PubMed]
  5. Liu Z, Ji J, Yang Q, et al. Diagnostic performance of white light endoscopy, narrow band imaging, and flexible spectral imaging color enhancement for early gastric cancer: a systematic review and meta-analysis. Quant Imaging Med Surg 2025;15:5534-45. [Crossref] [PubMed]
  6. Wang FH, Zhang XT, Li YF, et al. The Chinese Society of Clinical Oncology (CSCO): Clinical guidelines for the diagnosis and treatment of gastric cancer, 2021. Cancer Commun (Lond) 2021;41:747-95. [Crossref] [PubMed]
  7. Fu XY, Mao XL, Chen YH, et al. The Feasibility of Applying Artificial Intelligence to Gastrointestinal Endoscopy to Improve the Detection Rate of Early Gastric Cancer Screening. Front Med (Lausanne) 2022;9:886853. [Crossref] [PubMed]
  8. Su JW, Zeng YT, Luo SA, et al. Adjuvant Chemotherapy in Node-Negative Advanced Gastric Cancer Patients. J Oncol 2022;2022:2286040. [Crossref] [PubMed]
  9. Yin P, Cai R, Zhou X, et al. Comparable prognosis of early gastric cancer between intestinal type and diffuse type in patients of age 75 and older: a SEER-based cohort study. Transl Cancer Res 2024;13:888-99. [Crossref] [PubMed]
  10. Guo X, Lv X, Ru Y, et al. Circulating Exosomal Gastric Cancer-Associated Long Noncoding RNA1 as a Biomarker for Early Detection and Monitoring Progression of Gastric Cancer: A Multiphase Study. JAMA Surg 2020;155:572-9. [Crossref] [PubMed]
  11. Ma S, Zhou M, Xu Y, et al. Clinical application and detection techniques of liquid biopsy in gastric cancer. Mol Cancer 2023;22:7. [Crossref] [PubMed]
  12. Izumi D, Gao F, Toden S, et al. A genomewide transcriptomic approach identifies a novel gene expression signature for the detection of lymph node metastasis in patients with early stage gastric cancer. EBioMedicine 2019;41:268-75. [Crossref] [PubMed]
  13. Liu L, Pang H, He Q, et al. A novel strategy to identify candidate diagnostic and prognostic biomarkers for gastric cancer. Cancer Cell Int 2021;21:335. [Crossref] [PubMed]
  14. Zhang B, Zhang B, Wang T, et al. Integrated bulk and single-cell profiling characterize sphingolipid metabolism in pancreatic cancer. BMC Cancer 2024;24:1347. [Crossref] [PubMed]
  15. Xu L, Liu J, An Y, et al. Glycolysis-related genes predict prognosis and indicate immune microenvironment features in gastric cancer. BMC Cancer 2024;24:979. [Crossref] [PubMed]
  16. Swanson K, Wu E, Zhang A, et al. From patterns to patients: Advances in clinical machine learning for cancer diagnosis, prognosis, and treatment. Cell 2023;186:1772-91. [Crossref] [PubMed]
  17. Wan R, Pan L, Wang Q, et al. Decoding Gastric Cancer: Machine Learning Insights Into the Significance of COMMDs Family in Immunotherapy and Diagnosis. J Cancer 2024;15:3580-95. [Crossref] [PubMed]
  18. Patton MJ, Liu VX. Predictive Modeling Using Artificial Intelligence and Machine Learning Algorithms on Electronic Health Record Data: Advantages and Challenges. Crit Care Clin 2023;39:647-73. [Crossref] [PubMed]
  19. Zeng L, Chen Z. Screening of genes characteristic of pancreatic cancer by LASSO regression combined with support vector machine and recursive feature elimination, and immune correlation analysis. J Int Med Res 2024;52:3000605241233160. [Crossref] [PubMed]
  20. Ladbury C, Zarinshenas R, Semwal H, et al. Utilization of model-agnostic explainable artificial intelligence frameworks in oncology: a narrative review. Transl Cancer Res 2022;11:3853-68. [Crossref] [PubMed]
  21. Lashen AG, Toss MS, Rutland CS, et al. Prognostic and Clinical Significance of the Proliferation Marker MCM7 in Breast Cancer. Pathobiology 2025;92:18-27. [PubMed]
  22. Yang JY, Li D, Zhang Y, et al. The Expression of MCM7 is a Useful Biomarker in the Early Diagnostic of Gastric Cancer. Pathol Oncol Res 2018;24:367-72. [Crossref] [PubMed]
  23. Qiu YT, Wang WJ, Zhang B, et al. MCM7 amplification and overexpression promote cell proliferation, colony formation and migration in esophageal squamous cell carcinoma by activating the AKT1/mTOR signaling pathway. Oncol Rep 2017;37:3590-6. [Crossref] [PubMed]
  24. Lou X, Deng W, Shuai L, et al. RAI2 acts as a tumor suppressor with functional significance in gastric cancer. Aging (Albany NY) 2023;15:11831-44. [Crossref] [PubMed]
  25. Moaaz M, Lotfy H, Elsherbini B, et al. TGF-β Enhances the Anti-inflammatory Effect of Tumor-Infiltrating CD33+11b+HLA-DR Myeloid-Derived Suppressor Cells in Gastric Cancer: A Possible Relation to MicroRNA-494. Asian Pac J Cancer Prev 2020;21:3393-403. [Crossref] [PubMed]
  26. Wang K, Xuan Z, Liu X, et al. Immunomodulatory role of metalloproteinase ADAM17 in tumor development. Front Immunol 2022;13:1059376. [Crossref] [PubMed]
  27. Li W, Wang D, Sun X, et al. ADAM17 promotes lymph node metastasis in gastric cancer via activation of the Notch and Wnt signaling pathways. Int J Mol Med 2019;43:914-26. [PubMed]
  28. Bohrer LR, Chaffee TS, Chuntova P, et al. ADAM17 in tumor associated leukocytes regulates inflammatory mediators and promotes mammary tumor formation. Genes Cancer 2016;7:240-53. [Crossref] [PubMed]
  29. Fu W, Han X, Hao X, et al. Dynamic changes of host immune response during Helicobacter pylori-induced gastric cancer development. Clin Exp Immunol 2025;219:uxae109. [Crossref] [PubMed]
  30. Song SH, Jeon MS, Nam JW, et al. Aberrant GATA2 epigenetic dysregulation induces a GATA2/GATA6 switch in human gastric cancer. Oncogene 2018;37:993-1004. [Crossref] [PubMed]
  31. Zadka L, Kulus MJ, Piatek K. ADAM protein family - its role in tumorigenesis, mechanisms of chemoresistance and potential as diagnostic and prognostic factors. Neoplasma 2018;65:823-39. [Crossref] [PubMed]
  32. Chen J, Yuan Q, Guan H, et al. Unraveling the role of ADAMs in clinical heterogeneity and the immune microenvironment of hepatocellular carcinoma: insights from single-cell, spatial transcriptomics, and bulk RNA sequencing. Front Immunol 2024;15:1461424. [Crossref] [PubMed]
Cite this article as: Li F, Liu G, Wang J, Luo Z. Integrated bioinformatic and machine learning analysis identifies MCM7 and ADAM17 as potential biomarkers for early stage gastric cancer. J Gastrointest Oncol 2025;16(5):1862-1877. doi: 10.21037/jgo-2025-604
