A systematic review of artificial intelligence and machine learning for gut microbiome-based CRC screening
Highlight box
Key findings
• Artificial intelligence (AI) and machine learning (ML) models using gut microbiome data consistently demonstrate moderate diagnostic performance for colorectal cancer (CRC) detection.
What is known and what is new?
• CRC is associated with gut microbiome dysbiosis, and existing stool-based screening tests (fecal immunochemical testing and guaiac fecal occult blood testing) have limited sensitivity, particularly for early-stage CRC and adenomas.
• This review shows that AI/ML microbiome models achieve moderate area under the curve values and can detect CRC using a multi-omics approach incorporating microbiome genomic data.
What is the implication, and what should change now?
• AI/ML gut microbiome models show moderate area under the curve for CRC detection and warrant consideration as complementary screening tools, pending prospective validation in true screening cohorts.
Introduction
Rationale and knowledge gap
Colorectal cancer (CRC) accounts for approximately 10% of cancers worldwide, and better screening methods are needed to improve early detection and reduce mortality. Colonoscopy is the gold standard for CRC screening; noninvasive alternatives include guaiac fecal occult blood testing (gFOBT) and fecal immunochemical testing (FIT). Colonoscopy, however, has absolute contraindications, relative contraindications, and disadvantages, including but not limited to active colitis, inability to tolerate sedation, severe cardiopulmonary disease, risk of perforation, and higher costs compared with stool-based screening (1). Artificial intelligence (AI) is increasingly recognized as a valuable tool in the management of CRC, with applications in risk prediction, diagnosis, prognostication, and monitoring treatment response. AI models, particularly those based on machine learning (ML) and deep learning, have demonstrated the ability to integrate clinical, pathological, histologic, and molecular data to stratify patient risk, predict CRC outcomes, and evaluate therapeutic efficacy (2,3). In the realm of non-invasive screening, AI and ML are increasingly being studied for their potential to use biomarkers, genetic profiles, and gut microbiome composition to improve early detection and risk stratification. Applying AI and ML to noninvasive CRC screening offers several advantages, including the potential to reduce unnecessary colonoscopies, lower healthcare costs (1), and more effectively identify individuals at higher risk for CRC.
Objective
This systematic review aims to discuss how AI and ML, in conjunction with microbiome data, are being studied for CRC screening. Given the rapid and significant advancements in the field, this review focuses on recent developments from January 1, 2023 to November 1, 2025. Its objectives are: (I) to systematically evaluate the diagnostic performance [primarily area under the curve (AUC) and secondarily balanced accuracy, sensitivity, and specificity] of AI/ML models using gut microbiome data for CRC detection; (II) to compare AI/ML-based microbiome screening approaches; (III) to identify microbial genera consistently associated with CRC across included studies; and (IV) to assess study quality and risk of bias using QUADAS-2 and certainty of evidence using the GRADE approach.
Background: gut microbiome and CRC
Over the past decade, many studies have correlated particular gut microbiota profiles with increased risk of CRC. Human studies from 2023 to the present demonstrate significant associations and potential mechanisms by which certain flora can serve as markers for CRC (4-6). A 2025 multi-cohort study analyzed fecal genomes from 1,647 participants, including CRC patients, first-degree relatives, and healthy controls, revealing distinct microbial differences that were predictive of CRC (4,7). Specifically, Fusobacterium nucleatum and Bacteroides fragilis were significantly elevated in CRC patients, suggesting involvement in colorectal tumorigenesis (4). Proposed mechanisms of tumorigenesis include adhesion to colonic epithelia, genotoxin production (e.g., colibactin from pks+ Escherichia coli), and suppression of anti-tumor immunity via FadA adhesin and polysaccharide A pathways (8,9). Conversely, short-chain fatty acid (SCFA)-producing taxa, such as Roseburia, Coprococcus, and Faecalibacterium prausnitzii, were significantly depleted, correlating with reduced epithelial barrier integrity and increased inflammatory cytokines such as IL-6 and TNF-α (4,6). Newer data show that Fusobacterium nucleatum can induce DNA damage and create microsatellite instability (8,10). Even before 2023, ML models exploiting this dysbiosis were being developed and achieved AUCs of up to 0.92 for CRC detection, surpassing fecal immunochemical tests in identifying early-stage disease (4).
Additionally, a 2025 meta-analysis of 31 human studies reported elevated odds ratios for Fusobacterium, Bacteroides, and Helicobacter pylori colonization associated with CRC, while Bifidobacterium had an OR of 0.72, suggesting a protective role (5,9,11). Longitudinal cohort data from 2024–2025 demonstrated that Fusobacterium overgrowth and Clostridium shifts correlated with increased adenoma and carcinoma rates, suggesting that decreased butyrate and increased deoxycholic acid levels may activate β-catenin/Wnt signaling (8,12). A 2025 review of 15 prospective studies reported that high-fiber intake restored Ruminococcaceae diversity and lowered CRC incidence by 15–20% in at-risk populations (6,9,10,12). Recent research has shown gut microbiomes to be heritable; relatives of CRC patients tended to have microbiome alterations similar to those of their affected relatives, such as Bacteroides ovatus enrichment, predisposing them to adenoma formation and suggesting familial transmission via shared environment or vertical inheritance (4,7). Collectively, these studies prompt the integration of microbiome-based data, alongside FIT/gFOBT, to better stratify patients for CRC risk and early detection (11). We present this article in accordance with the PRISMA reporting checklist (available at https://jgo.amegroups.com/article/view/10.21037/jgo-2026-1-0006/rc).
Methods
Protocol and registration
This systematic review was registered in PROSPERO (registration No. CRD420251272700). The review protocol can be accessed on the PROSPERO website. Major and minor amendments were made throughout to fit the review; the Methods, Data collection and synthesis, and Results sections were refined but unchanged in concept. Search and study selection were performed independently by two reviewers.
Search criteria
A comprehensive literature search was performed using the following databases: PubMed, MEDLINE, Scopus, The Cochrane Library (CLIB), Embase (via Ovid), and Embase.com. Controlled vocabulary and free-text terms were used to ensure a comprehensive and reproducible search strategy. For example, in PubMed/MEDLINE, Medical Subject Headings (MeSH) and free-text terms were combined as follows: (“Colorectal Neoplasms”[Mesh] OR “Colon Neoplasms”[Mesh] OR colorectal cancer OR CRC OR colon cancer) AND (“Artificial Intelligence”[Mesh] OR “Machine Learning”[Mesh] OR “Deep Learning”[Mesh] OR “Algorithms”[Mesh] OR artificial intelligence OR machine learning OR deep learning OR AI OR ML) AND (“Microbiota”[Mesh] OR “Gastrointestinal Microbiome”[Mesh] OR microbiome OR gut microbiome OR intestinal microbiota) AND (“Early Detection of Cancer”[Mesh] OR “Mass Screening”[Mesh] OR screening OR detection OR diagnosis OR diagnostic accuracy OR AUC OR noninvasive OR non-invasive), with filters for humans, English language, and publication dates from January 1, 2023, to November 1, 2025. Equivalent Emtree terms were used in Embase. All terms were combined using Boolean operators (AND/OR), and full database-specific search strategies are provided in Table S1 for reproducibility.
Reference lists of included studies were manually screened to identify additional eligible publications. The term “non-invasive” was defined by the National Cancer Institute as a procedure that does not require inserting an instrument through the skin or into a body opening (13). By this definition, colonoscopy is considered invasive and was excluded. Keywords of publications selected for inclusion were manually reviewed to ensure that relevant studies were not inadvertently excluded. Figure 1 displays the studies identified and included in the review.
Eligibility criteria
Studies were included if they met the following criteria: (I) published from January 1, 2023 to November 1, 2025; (II) involved human subjects and human biodata; (III) utilized AI or ML algorithms with human gut microbiome data. Studies were excluded if they focused exclusively on invasive diagnostic procedures, imaging without screening intent, treatment response prediction, prognostic modeling, animal or in vitro experiments, meta-analyses, narrative reviews, conference abstracts, or editorials. Studies were also excluded if they included an ineligible population, defined as: (I) participants with CRC combined with other primary gastrointestinal malignancies (e.g., gastric, esophageal, or hepatocellular carcinoma) without separate reporting of CRC outcomes; (II) inflammatory bowel disease (Crohn’s disease or ulcerative colitis); (III) irritable bowel syndrome (IBS-C or IBS-D); (IV) recent antibiotic, probiotic, or prebiotic use prior to stool sampling; or (V) history of bowel resection. These conditions were excluded due to their independent and well-established effects on gut microbiome composition, which could confound diagnostic performance estimates. Other exclusion criteria included incomplete metadata (i.e., age, sex, diagnosis, sample type, BMI, sequencing information) and sequencing runs with insufficient read depth (<10,000 reads after quality control). Incomplete metadata can introduce confounding bias, incorrect labeling for ML models, inability to stratify study groups (e.g., eoCRC vs. aoCRC or CRC vs. controls), and poor reproducibility of results, which may compromise the validity and reliability of the analysis. All records were screened using Rayyan systematic review software, where exclusion reasons were predefined and applied consistently during title/abstract and full-text screening.
Data extraction
Data were independently extracted by two reviewers. Information collected included: year, study population, study design, validation strategy, AI/ML model, comparator group, type of data, AUC, balanced accuracy, sensitivity, specificity, and significant genera identified by AI/ML as associated with CRC. The information of interest was found in each study’s methods, materials, results, discussion, and conclusion sections. A custom data collection form template containing the aforementioned details was created.
Data analysis
Due to the heterogeneity of the data, a meta-analysis was not performed. The main reasons for heterogeneity included: microbiome methods, differences in AI/ML approaches, and various validation strategies used across 12 studies. Several studies utilized publicly available 16S rRNA sequencing datasets obtained from microbial genetic repositories. In these cases, the original DNA extraction protocols—including whether mechanical lysis or bead-beating was performed—were often not fully reported or were unavailable. In contrast, studies that included external validation cohorts and/or directly collected human stool samples typically provided detailed descriptions of their DNA extraction methods. Eight studies relied primarily on previously deposited microbiome datasets (Freitas et al., 2023; Novielli et al., 2024; Rynazal et al., 2023; Pateriya et al., 2025; Bakir-Gungor et al., 2024; Tsai et al., 2025; Liu et al., 2024), for which extraction protocols were not provided and could not be verified. This limitation was considered when interpreting differences in reported microbial genera and diagnostic performance across studies, as variability in extraction methods may influence taxonomic representation.
In addition, substantial heterogeneity resulted from differences in the AI/ML approaches used in each study. Table 1 summarizes the specific AI/ML used per study and the corresponding diagnostic performance metrics. All studies reported AUC values, seven reported balanced accuracy, and only three reported both sensitivity and specificity, limiting direct comparison across studies. Different ML models use distinct training procedures, feature selection methods, and parameter tuning, which can produce different performance results even when applied to similar datasets. Even when studies used the same ML algorithm [e.g., random forest (RF)], preprocessing and feature selection differed (e.g., SHAP for interpretation, PAM clustering, LEfSe, or LASSO), leading to different feature sets and thus non-comparable performance metrics across models.
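To make concrete why preprocessing and feature selection drive non-comparable results, the following minimal sketch (simulated abundance-style data, scikit-learn only; random-forest impurity importance stands in for the SHAP rankings used by several included studies) shows two selection pipelines choosing different feature sets from the same matrix:

```python
# Sketch: two feature-selection pipelines applied to the same synthetic
# "taxon abundance" matrix can select different features and thus yield
# non-comparable models. Data are simulated, not from any included study.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 200 samples x 50 pseudo-taxa, 8 genuinely informative features
X, y = make_classification(n_samples=200, n_features=50, n_informative=8,
                           random_state=0)

# Pipeline 1: random-forest importance ranking (stand-in for a SHAP ranking)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
rf_top = set(np.argsort(rf.feature_importances_)[-10:])

# Pipeline 2: L1-penalized logistic regression (LASSO-style selection)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
lasso_top = set(np.flatnonzero(lasso.coef_[0]))

# The two selectors generally disagree on at least some features, which is
# one driver of divergent feature sets and AUCs across studies.
overlap = rf_top & lasso_top
auc_rf = cross_val_score(rf, X, y, cv=5, scoring="roc_auc").mean()
```

Even with identical input data, the selected feature sets (`rf_top` vs. `lasso_top`) differ, so downstream AUCs reflect pipeline choices as much as biology.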
Table 1
| Study | AI/ML | AUC | Balanced accuracy | Sensitivity | Specificity |
|---|---|---|---|---|---|
| Freitas et al. (14), 2023 | RF and SHAP algorithm (XAI) and oversampling | 0.92 | 67% | X | X |
| Jayakrishnan et al. (15), 2024 | DIABLO | Metabolomics-based 0.98; microbiome-based 0.61 | X | X | X |
| Novielli et al. (16), 2024 | RF and SHAP algorithm (XAI) | 0.699 | 67.30% | X | X |
| Rynazal et al. (17), 2023 | RF and SHAP algorithm (XAI) | 0.72 | X | X | X |
| Pateriya et al. (18), 2025 | XGBoost | 0.9 | 80% | X | X |
| Bakir-Gungor et al. (19), 2024 | EC nomenclature-based GSM (RF, enzyme data, MaTE, miRModuleNet, CogNet, PriPath, miRdisNET, GeNetOntology) | 0.769 | 69.50% | 61.50% | 76.70% |
| Novielli et al. (20), 2025 | CatBoost and SHAP Analysis (XAI) | Internal 0.71; external 0.70 | Internal 69%; external 63% | X | X |
| Tsai et al. (21), 2025 | RF | Internal 0.90; external 0.87 | Internal 85%; external 67% | Internal 43%; external 38% | Internal 97%; external 96% |
| Rotelli et al. (22), 2024 | XGBoost and SHAP Analysis (XAI) | 0.98 | 86% | X | X |
| Lu et al. (23), 2023 | RF and LEfSe and LASSO | 0.926 | 75% | 77.80% | 66.70% |
| Liu et al. (24), 2024 | RF and LEfSe with PAM clustering algorithm | Internal 0.84; external 0.85 | X | X | X |
| Piccinno et al. (25), 2023 | RF | Internal 0.87; external 0.85 | X | X | X |
X = not reported; internal = indicates values for the internal validation group; external = indicates values for the external cohort validation group. AI, artificial intelligence; AUC, area under the curve; DIABLO, Data Integration Analysis for Biomarker Discovery using Latent variable approaches for Omics studies; EC, enzyme commission; GSM, Grouping-Scoring-Modeling; LASSO, least absolute shrinkage and selection operator; LEfSe, linear discriminant analysis effect size; ML, machine learning; PAM, partitioning around medoids; RF, random forest; SHAP, SHapley Additive exPlanations; XAI, explainable artificial intelligence; XGBoost, eXtreme gradient boosting.
Regarding model validation, the 12 studies used varied strategies (Table 2), including k-fold cross-validation (CV), leave-one-out cross-validation (LOOCV), and train-test splits; some also included validation on an external cohort. These differing strategies produce AUC estimates with different bias and variance profiles, so the reported AUC values are not methodologically equivalent and cannot be directly compared or meaningfully pooled. Pooling such estimates risks combining optimistic and more conservative performance metrics, potentially producing a misleading summary estimate. Therefore, variability in validation approaches contributes to statistical and methodological heterogeneity and justifies a narrative synthesis rather than a quantitative meta-analysis.
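The gap between within-cohort CV and cohort-aware validation can be illustrated with a small sketch (simulated data with an artificial per-cohort batch effect; scikit-learn's `LeaveOneGroupOut` approximates a leave-one-dataset-out design):

```python
# Sketch: within-cohort k-fold CV vs leave-one-dataset-out (LODO) style
# evaluation. Simulated data with an artificial per-cohort batch effect;
# not drawn from any included study.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (StratifiedKFold, LeaveOneGroupOut,
                                     cross_val_score)

X, y = make_classification(n_samples=300, n_features=30, random_state=1)
cohort = np.repeat([0, 1, 2], 100)  # pretend samples come from 3 cohorts
# add a cohort-specific shift so cross-cohort generalization is harder
shift = np.random.RandomState(1).normal(0, 0.5, size=(1, 30))
X = X + cohort[:, None] * shift

clf = RandomForestClassifier(n_estimators=100, random_state=1)

# Within-cohort CV: train and test folds share all three cohorts
auc_kfold = cross_val_score(
    clf, X, y, scoring="roc_auc",
    cv=StratifiedKFold(5, shuffle=True, random_state=1)).mean()

# LODO-style CV: each fold holds out one entire cohort
auc_lodo = cross_val_score(
    clf, X, y, scoring="roc_auc",
    cv=LeaveOneGroupOut(), groups=cohort).mean()
# auc_kfold is typically higher than auc_lodo, because k-fold never has to
# generalize to an unseen cohort's batch structure.
```

This is why AUCs from simple k-fold CV and from LODO or external validation sit on different optimism scales and should not be pooled.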
Table 2
| Study | Design | Validation | Population | Control | Data |
|---|---|---|---|---|---|
| Freitas et al., 2023 (14) | Retrospective multi-class classification study developing a machine learning approach to distinguish cancer types based on microbiome flora | Internal; stratified 5-fold CV | Total (n=512): HNSC, n=155; STAD, n=127; COAD, n=125; ESCA, n=60; READ, n=45 | None | TCMA containing 16S rRNA sequences |
| Jayakrishnan et al., 2024 (15) | Observational cross-sectional comparative study | Internal; LOOCV | Total (n=139): individuals with stages I–IV CRC (n=64), eoCRC (age ≤50 years, n=20), aoCRC (age ≥60 years, n=44), healthy (n=49) | Healthy | Human specimens using 16S rRNA amplicon sequencing |
| Novielli et al., 2024 (16) | Secondary analysis of previously collected case-control cohorts | Internal; 5-fold CV with nested 3-fold CV | Total (n=442): CRC (n=251), healthy (n=191) | Healthy | Microbiome bank; three datasets from three different studies (Zackular et al., 2014; Zeller et al., 2014; Baxter et al., 2016); 16S metagenomic sequencing of the V4 region from Canada, France, and the United States of America |
| Rynazal et al., 2023 (17) | Observational, retrospective multi-dataset case-control study | Internal; LODO analysis | Total (n=802): CRC (n=424), control (n=378) | Healthy | curatedMetagenomicData R package containing taxonomic abundance datasets from five countries (Japan, China, USA, France, Germany) |
| Pateriya et al., 2025 (18) | Retrospective, multicohort, case-control diagnostic accuracy study | Internal; 10-fold CV | Total (n=1,728): CRC (n=1,022), adenoma (n=141), healthy (n=728) | Healthy | 1,728 publicly available gut metagenome samples from 11 studies across eight countries (India, France, USA, Japan, China, Italy, Germany, Australia) |
| Bakir-Gungor et al., 2024 (19) | Observational case-control study to identify CRC-associated microbes and microbial enzymes | Internal; 10-fold MCCV | Total (n=1,262): CRC (n=600), healthy control (n=662) | Healthy | Metagenomic dataset collected from eight different countries |
| Novielli et al., 2025 (20) | Retrospective, multi-cohort diagnostic modelling study | Internal with 5-fold CV and external cohort | Total n=453 (adenoma or CRC) | None; AP vs. CRC | V4 16S rRNA gene sequencing data from Canada, France, and the United States of America |
| Tsai et al., 2025 (21) | Secondary data analysis diagnostic study using a case-control design | Internal with stratified, nested 10-fold CV and external cohort | Total (n=1,181): control (n=454), non-advanced adenomas (n=373), advanced adenomas (n=197), CRC (n=157) | Healthy | Five publicly available 16S rRNA sequencing datasets from North America and East Asia (Baxter, Dadkhah, Zackular, Yang, and Cong) |
| Rotelli et al., 2024 (22) | Retrospective observational case-control | Internal; 10-fold CV | 148 samples belonging to 61 patients; AP: 34 samples from 16 patients; CRC: 114 samples from 46 patients | None; AP vs. CRC | Data originated from a prior observational research project using V3-V4 region bacterial 16S rRNA sequencing |
| Lu et al., 2023 (23) | Cross-sectional case-control study | Internal; 10-fold CV | Total (n=38): healthy (n=17), CRC (n=21) | Healthy | Collected human stool and used 16S rRNA gene sequencing |
| Liu et al., 2024 (24) | Multicohort cross-sectional case-control study | Internal with random hold-out with discovery cohort and external cohort | Total (n=672): CRC (n=339), healthy controls (n=333) | Healthy | Microbiome bank and external human cohort using 16S rRNA gene sequencing |
| Piccinno et al., 2023 (25) | Pooled analysis with six external cohorts | Internal with 10-fold CV and LODO and external cohort | Total (n=2,116): CRC (n=930), adenomas (n=210), healthy control (n=976) | Healthy | Publicly available gut microbiome sporadic CRC cohorts with six newly collected and sequenced in house cohorts (cohort 1, 2, 3, 4, 5 and 6), for a total of 1,625 new shotgun gut metagenomes |
aoCRC, average onset CRC; AP, adenomatous polyps; COAD, colon adenocarcinoma; CRC, colorectal cancer; CV, cross-validation; eoCRC, early onset CRC; ESCA, esophageal carcinoma; HNSC, head and neck squamous cell carcinoma; LODO, leave-one-dataset-out; LOOCV, leave-one-out cross validation; READ, rectum adenocarcinoma; STAD, stomach adenocarcinoma.
Data were transcribed into a Microsoft Excel worksheet on a file-sharing service. Discrepancies between authors were resolved through discussion until unanimous agreement was reached. Risk of bias was assessed with QUADAS-2 for each source, and certainty of findings was assessed with the GRADE approach. Included studies were synthesized narratively.
Results
Study selection
A comprehensive literature search identified 1,247 records across six electronic databases: PubMed (n=312), MEDLINE (n=276), Scopus (n=295), Embase via Ovid (n=221), Embase.com (n=103), and The Cochrane Library (n=40). After removal of duplicate records, 1,089 unique studies remained for title and abstract screening. During initial screening, 978 records were excluded for failing to meet inclusion criteria, most commonly due to evaluation of an ineligible population (n=450), absence of AI or ML methodology (n=320), or lack of reported diagnostic performance metrics (n=208). Full-text review was performed for 111 articles, all of which were successfully retrieved and independently assessed by two reviewers. Of these, 99 studies were excluded, including review articles (n=40), animal-only studies (n=25), conference abstracts without sufficient methodological detail or outcome reporting (n=29), duplicate publications (n=1), and studies reporting methodological trends or microbiome associations without primary diagnostic performance outcomes (n=4). Ultimately, 12 studies met all eligibility criteria and were included in the qualitative synthesis of this systematic review (Figure 1). Disagreements were resolved by consensus.
Characteristics of included studies
The 12 included studies, published between 2023 and 2025, evaluated AI and ML models applied to gut microbiome data for CRC detection. All studies were observational in design, most commonly retrospective case-control or cohort analyses. The majority included adult patients with histologically confirmed CRC compared with non-cancer controls, including healthy individuals, patients with benign colorectal disease, or individuals with adenomas, to approximate screening-relevant populations. Sample sizes varied substantially, ranging from small, highly curated cohorts (n=64) to large datasets exceeding 2,000 participants. Most studies included mixed-stage CRC populations, while several focused specifically on early-stage CRC or advanced adenomas to assess screening performance. Study settings were predominantly academic or multicenter, with data derived from hospital-based recruitment, biobanks, or publicly available microbiome databases. Microbiome profiling methods were heterogeneous across studies. Most utilized 16S rRNA gene sequencing or shotgun metagenomic sequencing of stool samples to derive microbial taxonomic and functional features. Two studies, Lu et al. and Jayakrishnan et al., performed primary sequencing, whereas the remaining studies used previously sequenced data. ML methodologies included RF, XGBoost, neural networks, and ensemble models, frequently paired with explainable AI (XAI) frameworks (e.g., SHAP) to enhance interpretability. Comparator groups varied by study design, but most studies used healthy controls as the comparator group (Table 2). Many of the AI and ML models used a multi-omics approach, incorporating metabolites, enzymes, demographics, and clinical data in conjunction with microbiome data to detect CRC. Nine studies compared CRC cases with healthy controls, while two others included individuals with adenomas. Table 2 displays included study characteristics.
Study quality
QUADAS-2
The QUADAS-2 assessment demonstrated an overall high risk of bias among the included studies evaluating AI/ML models using microbiome data for CRC detection. All twelve studies were rated as high risk of bias in the patient selection domain, largely due to their retrospective design, non-random sampling, and use of pre-selected case–control datasets, which may overestimate diagnostic performance. The index test domain, the specific AI/ML combined with microbiome data, showed high or unclear risk in most studies, reflecting limited reporting on model development and potential overfitting of ML algorithms. In contrast, the reference standard domain was consistently rated as low risk of bias across all studies, as CRC diagnosis was confirmed using accepted clinical or pathological standards such as colonoscopy and histopathology. The flow and timing domain was frequently rated as unclear or high risk, indicating insufficient information regarding whether all participants received the same reference standard, whether exclusions occurred after enrollment, or whether the timing between microbiome sampling and diagnosis was appropriate. Applicability concerns for patient selection were also commonly rated as high or unclear, suggesting that the study populations may not be representative of real-world screening settings, particularly because many studies used small, enriched, or highly selected cohorts. Overall, the QUADAS-2 results indicate methodological limitations that may lead to overestimation of diagnostic accuracy, and these findings support downgrading the certainty of evidence in the GRADE assessment. Table 3 displays the QUADAS-2 assessment results.
Table 3
| Study | Risk of bias: patient selection | Risk of bias: index test | Risk of bias: reference standard | Risk of bias: flow and timing | Applicability: patient selection | Applicability: index test |
|---|---|---|---|---|---|---|
| Freitas et al. (14), 2023 | High | High | Low | Unclear | High | High | |
| Jayakrishnan et al. (15), 2024 | High | High | Low | Unclear | High | Low | |
| Novielli et al. (16), 2024 | High | Unclear | Low | Unclear | Unclear | Unclear | |
| Rynazal et al. (17), 2023 | High | Unclear | Low | Unclear | Unclear | Unclear | |
| Pateriya et al. (18), 2025 | High | High | Low | Unclear | Unclear | Unclear | |
| Bakir-Gungor et al. (19), 2024 | High | High | Low | Unclear | High | High |
| Novielli et al. (20), 2025 | High | Unclear | Low | Unclear | Unclear | Unclear | |
| Tsai et al. (21), 2025 | High | Unclear | Low | Unclear | Unclear | Unclear | |
| Rotelli et al. (22), 2024 | High | High | Low | High | High | High | |
| Lu et al. (23), 2023 | High | High | Low | High | High | Unclear | |
| Liu et al. (24), 2024 | High | High | Low | Unclear | Unclear | Unclear | |
| Piccinno et al. (25), 2023 | High | Unclear | Low | Unclear | Unclear | Unclear | |
QUADAS-2, Quality Assessment of Diagnostic Accuracy Studies 2.
The GRADE approach
Per PRISMA 2020 guidelines, the GRADE approach was used to assess the certainty of evidence for whether AI/ML-microbiome models outperform gFOBT/FIT for CRC detection, using AUC as the primary outcome. Despite some high AUCs, GRADE rated overall evidence quality as low due to high QUADAS-2 risk of bias, result inconsistency, and indirectness (retrospective case-control designs). Large effect sizes were insufficient to offset these limitations. As assessed by QUADAS-2, there was a high risk of bias across most studies, largely due to their observational and retrospective designs. Many studies relied on secondary datasets or retrospective biobank samples, increasing susceptibility to selection bias and overfitting. Due to the retrospective nature of these studies, patient selection was often non-random, and blinding or predefined thresholds were not consistently reported, further contributing to bias.
Inconsistency was also observed across studies, reflected by substantial heterogeneity in AUC values. Across the twelve studies, AUC values ranged from 0.70 to 0.87 in external validation sets and from 0.61 to 0.98 in internal validation sets. Only four of the twelve studies included external validation, while all twelve reported internal validation. The external validation cohorts generally had smaller sample sizes and reduced statistical power, limiting the robustness and generalizability of findings. Internal validation sample sizes ranged widely, from as few as 35 participants to as many as 2,116. The wide variation in AUC values is likely attributable to differences in AI/ML algorithms, feature selection methods, microbiome sequencing platforms, and preprocessing techniques.
Further inconsistency arises from incomplete reporting of performance metrics. Four studies did not report balanced accuracy, and only three studies reported both sensitivity and specificity. This incomplete reporting contributes to concerns regarding “missing results” and selective outcome reporting, thereby lowering the certainty of evidence. Sensitivity and specificity are particularly important for CRC diagnostics and risk stratification, as they allow direct comparison with established stool-based screening tests, particularly gFOBT and FIT. Without consistent reporting of these measures, it is difficult to meaningfully compare AI/ML microbiome-based models with current clinical standards.
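The point that an AUC alone does not determine sensitivity or specificity can be illustrated with simulated classifier scores (not study data): the same model yields different sensitivity/specificity pairs depending on the chosen decision threshold, which is why threshold-level reporting is needed for comparison with FIT and gFOBT:

```python
# Sketch: one model, one AUC, but many sensitivity/specificity pairs.
# Scores are simulated from two Gaussians; not data from any included study.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(42)
y = np.r_[np.zeros(500, dtype=int), np.ones(500, dtype=int)]     # 0=control, 1=CRC
scores = np.r_[rng.normal(0.0, 1.0, 500), rng.normal(1.5, 1.0, 500)]

auc = roc_auc_score(y, scores)   # a single, threshold-free summary

def sens_spec(threshold):
    """Sensitivity and specificity at one operating point."""
    pred = scores >= threshold
    sens = (pred & (y == 1)).sum() / (y == 1).sum()
    spec = (~pred & (y == 0)).sum() / (y == 0).sum()
    return sens, spec

# Two thresholds on the same scores trade sensitivity against specificity:
sens_lo, spec_lo = sens_spec(0.2)   # permissive cut-off: higher sensitivity
sens_hi, spec_hi = sens_spec(1.3)   # strict cut-off: higher specificity
```

Without a reported operating threshold, an AUC cannot be mapped to the sensitivity/specificity scale on which FIT and gFOBT performance is conventionally quoted.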
Indirectness of evidence was also present. Two studies, Bakir-Gungor et al. and Jayakrishnan et al., assessed enzymes and metabolites produced by bacterial genera, respectively, rather than directly analyzing microbial taxa. While metabolite- and enzyme-based profiling is biologically informative and has been shown to correlate significantly with CRC (15,19), it represents an indirect measure of gut microbiota composition. Such surrogate markers may not fully capture the gut flora and may introduce additional biological variability. Nevertheless, these studies were included because metabolomic and enzymatic outputs are downstream functional products of the microbiome and still constitute microbiome-related data. Additional indirectness arises from comparisons between AI/ML model AUCs and the sensitivity and specificity of gFOBT and FIT. Because AUC and sensitivity/specificity represent different performance metrics, such indirect comparisons limit interpretability and reduce confidence in direct clinical applicability. There was low concern for imprecision and publication bias. Overall, due to high risk of bias, heterogeneity of results, incomplete reporting of key diagnostic metrics, and indirectness of evidence, the certainty of evidence for AUC as a measure of diagnostic performance remains low. Because this systematic review did not pool studies, risk ratios, odds ratios, and hazard ratios are not applicable. Table 4 shows the quality of evidence using the GRADE approach.
Table 4
| Outcome | Range of AUC of validation sets | No. of studies | Study design | Risk of bias | Inconsistency | Indirectness | Imprecision | Publication bias | Certainty |
|---|---|---|---|---|---|---|---|---|---|
| AUC for CRC detection | External AUC: 0.70–0.87; Internal AUC: 0.61–0.98 | 12 | Observational diagnostic studies | Serious | Serious | Serious | Not serious | Not serious | Low |
AUC, area under the curve; CRC, colorectal cancer; GRADE, Grading of Recommendations Assessment, Development and Evaluation.
Outcomes
Diagnostic performance
Diagnostic performance of AI and ML models for CRC detection using gut microbiome-based features was evaluated across 12 observational studies published between 2023 and 2025 (Table 2). All 12 studies reported area under the receiver operating characteristic curve (AUC) as a primary outcome. Differences in validation design created a wide spread in AUCs, a source of heterogeneity that also prevented pooling of the data. Purely internal CV tends to give higher AUCs than external validation. Internal validation (any k-fold CV, LOOCV, or random hold-out) reuses the same source population and data structure, so estimates reflect “reproducibility” within that setting and are typically optimistic compared with performance in new settings (26). For instance, studies that lacked external cohort validation (Rotelli et al., 2024, Lu et al., 2023, Pateriya et al., 2025) (18,22,23) reported high AUC values, ranging from 0.90 to 0.98. Standard k-fold CV or random hold-out within one cohort is “easier”, because train and test sets share recruitment center, laboratory pipeline, and case mix, thereby yielding higher AUC values (27). Designs such as leave-one-dataset-out (LODO/LOSO, internal-external CV) explicitly test generalization across cohorts and are known to lower AUC compared with within-cohort CV. In this review, Rynazal et al., 2023 and Piccinno et al., 2023 (17,25) both used LODO for internal validation and operated in a stricter validation regime than simple k-fold CV, so their AUCs are more conservative. Prior research has shown that 10-fold CV tends to give relatively stable, low-bias AUC estimates across models and sample sizes, often performing better than other internal methods (28).
LOOCV uses nearly all the data for training in each iteration, so it can be high-variance and may overestimate AUC values with smaller sample sizes, as demonstrated by Jayakrishnan et al., 2024 (15), with an AUC of 0.98 for a metabolomics model but only 0.61 for a microbiome-only model. The following studies used the least strict validation methods, thereby inflating the AUC: Freitas et al., 2023; Pateriya et al., 2025; Rotelli et al., 2024; Lu et al., 2023; and Bakir-Gungor et al., 2024 (14,18,19,22,23). This is because such designs assess reproducibility under nearly identical conditions (28). Validation of moderate strictness was used by Novielli et al., 2024 and Tsai et al., 2025 (16,21), as nested schemes reduce AUC inflation compared with simple CV. The strictest validation methods (LODO, LOSO, and internal-external validation) were used by Rynazal et al., 2023; Piccinno et al., 2025; Novielli et al., 2025; Tsai et al., 2025; and Liu et al., 2024 (17,20,21,24,25). Studies with external cohorts had internal and external AUC values that were generally comparable without extreme divergence, supporting that their internal AUCs are less overfitted. Because AUC values vary substantially across validation settings, combining high AUCs from less stringent internal cross-validation with lower AUCs from nested, LODO, or external validation designs would artificially inflate between-study heterogeneity and compromise the validity of any pooled estimate. Table 2 reports the validation strategy per study, and Table 1 reports the AI/ML model, AUC, sensitivity, specificity, and balanced accuracy by study.
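The contrast between within-cohort CV schemes described above can be sketched in a few lines. This is an illustrative example on synthetic data only: the sample size, feature count, and model settings are hypothetical and are not taken from any of the included studies.

```python
# Illustrative sketch (synthetic data, hypothetical settings): estimate AUC for
# a single-cohort study design using 10-fold CV vs. LOOCV. Both are *internal*
# validation schemes; neither probes cross-cohort generalization.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, LeaveOneOut, cross_val_predict
from sklearn.metrics import roc_auc_score

# Small n, many features: mimics a typical single-cohort microbiome dataset.
X, y = make_classification(n_samples=100, n_features=300, n_informative=15,
                           random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

# 10-fold CV: held-out probabilities from each fold pooled into one AUC.
p_kfold = cross_val_predict(model, X, y, cv=StratifiedKFold(10),
                            method="predict_proba")[:, 1]
auc_kfold = roc_auc_score(y, p_kfold)

# LOOCV: one sample held out per fit; high-variance at small n.
p_loo = cross_val_predict(model, X, y, cv=LeaveOneOut(),
                          method="predict_proba")[:, 1]
auc_loo = roc_auc_score(y, p_loo)

print(f"10-fold CV AUC: {auc_kfold:.2f}  LOOCV AUC: {auc_loo:.2f}")
```

Because both schemes reuse the same cohort for training and testing, either estimate can be optimistic relative to performance on an external dataset; the sketch only illustrates the mechanics of the two resampling designs.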
AI/ML models and validation
All included studies employed supervised ML algorithms. The most frequently used models were random forest (RF), followed by gradient boosting methods (e.g., XGBoost), reflecting their suitability for high-dimensional biological data. Several studies also explored deep learning architectures, particularly when integrating multi-omics datasets. Model development typically involved internal cross-validation, with limited use of independent external validation cohorts. Table 1 describes the AI and ML models used in each study, and Table 2 indicates which studies used internal validation, external validation, or both.
Across the 12 studies, validation strategies ranged from relatively optimistic internal cross-validation to more stringent external and LODO designs, and this spectrum helps explain some of the variability in reported AUCs. Most studies relied solely on internal validation using k-fold cross-validation or related resampling: Freitas et al., Pateriya et al., Rotelli et al., Lu et al., and Bakir-Gungor et al. (14,18,19,22,23) all evaluated performance within a single cohort, which tends to yield higher, more optimistic estimates because the training and test sets share the same underlying population and technical conditions. Nested cross-validation, as used by Novielli et al. [2024] and Tsai et al. (16,21), offers some protection against overfitting during feature selection and hyperparameter tuning, and may partly explain why the AUC reported by Novielli et al. [2024] is lower than in several other internal-validation-only studies.
A few studies used stricter forms of internal-external validation that better probe generalizability across datasets. Rynazal et al. (17) employed LODO analysis, and Piccinno et al. (25) combined 10-fold CV with LODO and external testing; both approaches enforce a domain shift between training and test sets and therefore provide more conservative, clinically relevant estimates than simple within-cohort CV. Four studies included a separate external cohort: Novielli et al. [2025], Tsai et al., Liu et al., and Piccinno et al. (20,21,24,25). In these, the small differences between internal and external AUCs suggest that the models may be reasonably robust to between-cohort variation, although the external datasets were often diagnostic/enriched rather than true screening populations.
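The LODO design described above can be sketched with a grouped cross-validation splitter: each cohort is held out in turn while the model trains on all remaining cohorts. This is a minimal illustration on synthetic data; the cohort labels, sample sizes, and model settings are all hypothetical.

```python
# Illustrative LODO sketch (synthetic data, hypothetical cohorts): hold out one
# "dataset" per split, training only on the remaining cohorts, which enforces a
# domain shift between train and test.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=300, n_features=200, n_informative=10,
                           random_state=1)
# Assign each sample to one of three hypothetical cohorts/datasets.
cohort = np.repeat([0, 1, 2], 100)

lodo_aucs = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=cohort):
    model = RandomForestClassifier(n_estimators=100, random_state=1)
    model.fit(X[train_idx], y[train_idx])
    p = model.predict_proba(X[test_idx])[:, 1]
    lodo_aucs.append(roc_auc_score(y[test_idx], p))

print("Per-cohort held-out AUCs:", [round(a, 2) for a in lodo_aucs])
```

In real LODO analyses the groups are independent study cohorts with different recruitment and sequencing pipelines, so the per-cohort AUCs are typically lower and more variable than within-cohort CV estimates.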
Finally, the use of different cross-validation schemes likely contributes to heterogeneity. Jayakrishnan et al. applied LOOCV and observed a very high AUC of 0.98 for the metabolomics model but only 0.61 for the microbiome-only model (15), underscoring that both the validation scheme and the data modality influence apparent performance. Taken together, high AUCs from less strict internal CV coexist with more modest AUCs from nested, LODO, and external validations, and this mixture of validation rigor is likely an important driver of between-study heterogeneity. Table 2 shows the specific type and method of validation.
Comparisons with gFOBT/FIT
There was no direct head-to-head comparison with gFOBT or FIT across the 12 studies; only indirect comparisons were possible. Only three studies reported sensitivity and specificity, and in each case these were calculated in diagnostic/enriched cohorts. Tsai et al. [2025] reported a sensitivity of 38% for CRC vs. control in the external validation cohort (composed of 30 CRC cases, 30 adenoma cases, and 30 healthy controls) (21); this represents performance in a retrospective, enriched, diagnostic-type case-control sample, not a true screening population. The other two studies, Bakir-Gungor et al. [2024] and Lu et al. [2023] (19,23), reported sensitivity and specificity for internal validation datasets comparing healthy controls and CRC cases; these values likewise reflect diagnostic/enriched case-control cohorts. Because the sensitivity and specificity were estimated in diagnostic/enriched cohorts rather than true screening populations, they cannot be directly compared with values for gFOBT and FIT derived from true screening populations, and sensitivity and specificity reported in diagnostic/enriched cohorts can be misleading.
In the context of AI/ML microbiome models, sensitivity and specificity estimated in diagnostic/enriched cohorts are still useful as proof-of-concept measures showing that a model can, in principle, discriminate CRC from non-CRC using non-invasive samples. They are valuable for early-phase method development and model selection (e.g., comparing algorithms or feature sets), for hypothesis generation about informative microbial or molecular signatures, and as an optimistic upper bound against which future performance in true screening cohorts can be contrasted. However, these metrics should be framed as development-stage performance in enriched case-control settings, not as estimates of real-world screening accuracy.
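One concrete way to see why enriched-cohort metrics do not translate into screening performance is through positive predictive value (PPV), which depends on prevalence even when sensitivity and specificity are fixed. The numbers below are purely illustrative (a hypothetical 80%/90% model, a roughly 50:50 case-control mix, and an assumed ~0.5% screening prevalence), not values from the included studies.

```python
# Illustrative sketch: the same sensitivity/specificity yields very different
# PPV in an enriched case-control cohort vs. a screening population.
# All numbers below are hypothetical.
def ppv(sens: float, spec: float, prev: float) -> float:
    """Bayes' rule: P(disease | positive test)."""
    true_pos = sens * prev
    false_pos = (1 - spec) * (1 - prev)
    return true_pos / (true_pos + false_pos)

sens, spec = 0.80, 0.90                       # hypothetical model performance
ppv_enriched = ppv(sens, spec, prev=0.50)     # ~50% cases in a case-control set
ppv_screening = ppv(sens, spec, prev=0.005)   # assumed ~0.5% screening prevalence

print(f"PPV at 50% prevalence: {ppv_enriched:.2f}")    # high
print(f"PPV at 0.5% prevalence: {ppv_screening:.2f}")  # low
```

Under these assumptions the PPV falls from roughly 0.89 in the enriched setting to about 0.04 at screening prevalence, which is why development-stage metrics must not be read as real-world screening accuracy.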
Significant microbial genera and model interpretability
Across the 12 studies, several taxa emerged as consistently enriched in CRC, including Porphyromonas (e.g., P. asaccharolytica), Peptostreptococcus (e.g., P. stomatis), Fusobacterium (particularly F. nucleatum subspecies animalis, vincentii, polymorphum, and sensu stricto), Parvimonas micra, Gemella morbillorum, Bacteroides (B. fragilis), and Streptococcus species, while protective butyrate producers were typically depleted, such as Faecalibacterium (F. prausnitzii), Eubacterium (E. eligens, E. hallii), Roseburia intestinalis, and Lachnospiraceae (14-25). Site- and stage-specific patterns included oral taxa (Veillonella parvula/atypica, Streptococcus parasanguinis) elevated in right-sided CRC. Hungatella hathewayi and Methanobrevibacter smithii, a methane producer, were significantly elevated in stage IV/metastatic disease (25). Atopobium and Haemophilus were significantly associated with adenomas (21). Early-onset CRC markers, such as Parasutterella and Ruminococcaceae UCG-002, were also identified (15). Bakir-Gungor et al. uniquely emphasized CRC-linked enzymes (glycosidases, CoA-transferases) from pathogens including Escherichia coli, Klebsiella pneumoniae, and Clostridioides difficile (19), while Lu et al. and Liu et al. highlighted enrichment of E. coli/Escherichia-Shigella and depletion of Lachnospiraceae/Faecalibacterium (23,24). Table 5 summarizes the significant microbial genera and species associated with CRC identified in the 12 studies. These findings are consistent with prior literature linking Porphyromonas, Granulicatella, Peptostreptococcus, Parvimonas, and Escherichia to mucosal inflammation, dysbiosis, and colorectal carcinogenesis (29-31).
Table 5
| Study | Summary of significant findings |
|---|---|
| Freitas et al., 2023 (14) | Colon adenocarcinoma significantly enriched genera: Bacteroides, Fusobacterium |
| | Rectal adenocarcinoma significantly enriched genera: Streptococcus, Parabacteroides |
| | Significantly enriched in both: Solobacterium, Porphyromonas, Granulicatella |
| Jayakrishnan et al., 2024 (15) | Significantly enriched in eoCRC: Parasutterella, Ruminococcaceae UCG-002, and Acidovorax |
| Novielli et al., 2024 (16) | Porphyromonas > Peptostreptococcus > Fusobacterium > Parvimonas |
| Rynazal et al., 2023 (17) | Peptostreptococcus stomatis > Fusobacterium nucleatum > Gemella morbillorum > Solobacterium moorei > Clostridium symbiosum |
| Pateriya et al., 2025 (18) | Top discriminants enriched in CRC: Porphyromonas asaccharolytica, Parvimonas micra, Gemella morbillorum, Fusobacterium animalis, Fusobacterium nucleatum |
| | Depleted in CRC but elevated in the healthy gut: Roseburia intestinalis, Lachnospira eligens, Faecalibacterium sp. HTF-F, Anaerostipes hadrus |
| | Enriched in the healthy gut: Butyrivibrio crossotus, Eubacterium sp. MJ-33, Romboutsia ilealis, Pseudobutyrivibrio xylanivorans |
| Bakir-Gungor et al., 2024 (19) | Determined the most significant and prominent genera and species based on the enzymes produced |
| | Significant enzymes: glycosidases, CoA-transferases, hydro-lyases, oligo-1,6-glucosidase, crotonobetainyl-CoA hydratase, and citrate CoA-transferase |
| | Enzymes synthesized by: Escherichia coli, Salmonella enterica, Klebsiella pneumoniae, Staphylococcus aureus, Streptococcus pneumoniae, and Clostridioides difficile |
| Novielli et al., 2025 (20) | Peptostreptococcus, Fusobacterium, Porphyromonas |
| | Eubacterium eligens was a significant, robust negative predictor of CRC |
| Tsai et al., 2025 (21) | Most significant genera enriched in CRC: Porphyromonas, Peptostreptococcus, Parvimonas, Fusobacterium, Collinsella |
| | Most abundant in adenomas: Atopobium, Haemophilus |
| Rotelli et al., 2024 (22) | Serratia > Ruminococcus gnavus > Faecalibacterium > Parvimonas > Eubacterium coprostanoligenes (significantly elevated in CRC but not in AP) > Subdoligranulum |
| Lu et al., 2023 (23) | Highly abundant in CRC: E. coli, Escherichia-Shigella, Prevotella |
| | Dominant in healthy gut controls: Lachnospiraceae |
| Liu et al., 2024 (24) | Significantly elevated in CRC: Bacteroides, Bifidobacterium, Streptococcus, Fusobacterium, Klebsiella, Parvimonas, Alistipes, Peptostreptococcus, Rothia, Granulicatella, Gemella |
| | Significantly depleted in CRC: Faecalibacterium, Eubacterium |
| Piccinno et al., 2025 (25) | Significantly enriched species associated with CRC: Parvimonas micra, Gemella morbillorum, and Peptostreptococcus stomatis |
| | Significant genera in CRC: Bacteroides fragilis, Hungatella hathewayi |
| | Found new subgroups of F. nucleatum significantly elevated in CRC: F. nucleatum animalis, F. nucleatum vincentii, F. nucleatum sensu stricto, F. nucleatum polymorphum |
| | Most significantly abundant in stage IV CRC: H. hathewayi, Methanobrevibacter smithii |
| | Significantly increased oral flora in right-sided CRC: Veillonella parvula, Veillonella atypica, Trueperella pyogenes |
| | Non-oral microbes significantly elevated in right-sided CRC: Streptococcus parasanguinis and Veillonella spp. |
| | Metastatic CRC had significantly elevated Methanobrevibacter smithii (a methane producer) |
| | Left-sided CRC had significantly elevated Clostridia sp. |
CRC, colorectal cancer.
To address concerns regarding model transparency, six studies (Novielli et al., 2024; Rynazal et al., 2023; Pateriya et al., 2025; Tsai et al., 2025; Jayakrishnan et al., 2024; Bakir-Gungor et al., 2024) (15-19,21) incorporated explainable AI (XAI) techniques such as Shapley additive explanations (SHAP) values or feature attribution analyses. These methods enabled linkage of model predictions to specific microbial genera without compromising diagnostic accuracy. Notably, the genera highlighted through XAI approaches were biologically consistent with established mechanisms of colorectal tumorigenesis (29-31), strengthening confidence in the clinical relevance of AI-derived microbiome models. Explainability is crucial for clinical adoption because regulators and clinicians must understand why a model produces a given output in order to judge its safety and validity. Clear, interpretable explanations support regulatory review, help clinicians detect errors or bias, and make clinicians more willing to trust and act on AI-assisted, non-invasive CRC predictions.
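The general idea of feature attribution can be sketched without the SHAP library itself: permutation importance is a simpler, model-agnostic attribution method that ranks features by how much shuffling each one degrades held-out performance. This is an illustrative stand-in on synthetic data, not the SHAP procedure used by the included studies, and the "genus" feature names are placeholders.

```python
# Illustrative feature-attribution sketch (synthetic data, placeholder taxa):
# rank features by the mean drop in held-out AUC when each is permuted. This is
# a simpler model-agnostic alternative to SHAP, shown here for self-containment.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=2)
genus_names = [f"genus_{i}" for i in range(X.shape[1])]  # hypothetical taxa

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)
model = RandomForestClassifier(n_estimators=200, random_state=2).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the resulting drop in AUC.
result = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                                n_repeats=10, random_state=2)
ranked = sorted(zip(genus_names, result.importances_mean),
                key=lambda t: t[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: mean AUC drop {importance:.3f}")
```

In the included studies, the analogous SHAP rankings are what allowed predictions to be traced back to specific genera such as Fusobacterium or Porphyromonas.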
Discussion
Key findings and explanation
This systematic review synthesizes current evidence on the application of AI and ML models to gut microbiome data for CRC screening. Across 12 observational studies published between January 1, 2023 and November 1, 2025, AI/ML-based microbiome models consistently demonstrated moderate diagnostic performance, with AUC values ranging from 0.61–0.98 in internal validation sets and 0.70–0.87 in external validation sets. Together, these findings suggest that microbiome data paired with AI/ML models show moderate AUCs for CRC detection and warrant consideration as complementary screening tools, pending prospective validation in true screening cohorts. The QUADAS-2 assessment indicated a high risk of bias across studies, primarily due to retrospective case-control designs and selective outcome reporting. GRADE evaluation rated the overall evidence quality as low, reflecting potential missing results, lack of prospective validation in true screening cohorts, and methodological heterogeneity. These limitations underscore the preliminary nature of current findings and emphasize the need for higher-quality, prospective studies before clinical implementation.
Several patterns emerged across the 12 studies. First, diagnostic performance was robust across a range of ML and AI architectures, with ensemble-based methods, including RF and XGBoost, demonstrating the most stable and reproducible results. From Table 1, eight studies employed RF, with AUC values ranging from 0.699 to 0.926 (14,16,17,19,21,23-25), and two studies used XGBoost, with AUC values ranging from 0.90 to 0.98 (18,22). The data support the use of RF and XGBoost given their established effectiveness with high-dimensional microbiome datasets. While deep learning approaches achieved good discrimination in select analyses, performance was more variable and often dependent on larger sample sizes, underscoring challenges related to model complexity and generalizability in microbiome research.
Second, models integrating multi-omics data or clinical variables generally outperformed microbiome-only approaches, as Jayakrishnan et al. (15) reported an AUC of 0.98 with a multi-omics approach as opposed to 0.61 with microbial data alone. Sensitivity and specificity reported by Bakir-Gungor et al., Lu et al., and Tsai et al. (19,21,23) suggest potential comparability to gFOBT and FIT but should be interpreted with caution, as these were diagnostic/enriched cohorts and not true screening cohorts. Microbiome AI/ML models showed internal AUCs of 0.61–0.98 and external AUCs of 0.70–0.87. By comparison, FIT generally has a sensitivity of approximately 69–86% and a specificity of 92–95% for CRC detection, and gFOBT generally has a sensitivity of 50–75%, with a specificity of 78–98% (32-37). However, direct comparisons are limited by case-control enrichment in the AI studies vs. prospective screening cohorts for FIT/gFOBT, threshold differences (AUC vs. fixed sensitivity/specificity), and a lack of true population-based validation for the models.
Beyond diagnostic accuracy, the identification of potential microbial associations with CRC strengthens the utility of these models. Across multiple independent cohorts, Fusobacterium, particularly Fusobacterium nucleatum, emerged as the most consistently enriched genus in CRC patients and was frequently ranked among the most influential predictive features (38-40). Other genera, including Porphyromonas, Peptostreptococcus, and Parvimonas, were repeatedly associated with CRC, aligning with prior evidence linking these taxa to inflammation, mucosal invasion, and tumor-promoting microenvironments (30,41-45). Conversely, depletion of beneficial commensal genera such as Faecalibacterium (F. prausnitzii), Eubacterium (E. eligens, E. hallii), Roseburia intestinalis, and Lachnospiraceae was commonly observed, supporting the concept of CRC-associated dysbiosis characterized by loss of protective microbial functions (10,46,47).
Importantly, many studies employed XAI techniques to enhance model interpretability. Feature attribution methods, including SHAP values and feature importance rankings, enabled transparent linkage between model predictions and specific microbial genera or metabolic pathways. The concordance between AI-identified features and established biological mechanisms addresses a major barrier to clinical adoption of ML-based diagnostics, namely the “black box” nature of complex models. These findings suggest that explainability can be achieved without sacrificing predictive performance.
Comparison with similar research
The findings of this systematic review are consistent with earlier work examining AI and ML models for CRC detection using gut microbiome data. Prior to 2023, systematic reviews and narrative syntheses demonstrated that microbiome-based models can discriminate CRC from healthy controls with moderate to high accuracy, reporting AUC values from 0.75 to 0.90 (45,48-50). However, earlier analyses tended to yield lower AUCs, potentially reflecting the early developmental stage of AI and ML modalities. In contrast, the present review focuses exclusively on recent human-based studies and reflects methodological advances, including XAI frameworks, integration of multi-omics data, and improved validation strategies, that were largely absent from earlier work (45,48-51). Compared with prior work, the included studies more frequently evaluated early-stage CRC and adenomas, placing emphasis on screening and risk stratification rather than disease classification alone (21,51). Notably, several studies directly or indirectly compared model performance with established noninvasive screening modalities such as FIT or gFOBT, an aspect not addressed in earlier studies (16,21,35,48,50,51). Overall, AI and ML microbiome models have progressed from proof-of-concept toward clinically relevant screening augmentation, showing consistent discriminative performance and complementing rather than replacing existing CRC screening strategies.
Strengths and limitations
This systematic review provides a comprehensive, methodologically rigorous, and focused synthesis of contemporary studies applying AI and ML to gut microbiome data for CRC detection. Notably, the main strength is the consistency and reproducibility of findings across diverse human populations and modeling strategies, supporting the emerging and potential role of microbiome-based AI approaches for CRC screening. Despite favorable results, this body of evidence has notable limitations. All included studies were observational, predominantly retrospective case-control designs, which may overestimate diagnostic performance. There was substantial heterogeneity in sequencing platforms, selection strategies, and validation methods, limiting direct comparability across studies and precluding quantitative meta-analysis. Additionally, external validation was inconsistently performed, and few studies evaluated model performance in true screening populations.
Future directions
Future research should prioritize prospective, multicenter validation in asymptomatic screening cohorts, standardized reporting of diagnostic metrics, and direct comparison with established screening tools using uniform study designs. Integration of microbiome-based AI and ML models into existing CRC screening offers a noninvasive approach to enhance risk stratification. Additionally, continued emphasis on explainability, and on establishing biological correlation and causation between the microbiome and CRC, will be needed to support clinician acceptance. While microbiome-based AI models show promise for non-invasive CRC detection, practical barriers to clinical implementation remain substantial, including high sequencing costs, turnaround times of several weeks, and a lack of standardized sample collection and bioinformatics pipelines across laboratories. These constraints currently limit scalability and reproducibility compared with established tests such as FIT. Future research should also prioritize cost-reduction strategies, rapid sequencing technologies, and protocol harmonization to enable real-world deployment.
Conclusions
AI/ML gut microbiome models show reproducible CRC detection performance (AUCs 0.61–0.98), with ensemble methods (RF, XGBoost) and multi-omics approaches performing best, and XAI techniques identifying biologically plausible microbial signatures. Despite methodological heterogeneity and retrospective designs limiting clinical translation, these findings support their potential as complementary noninvasive screening tools, pending prospective validation against FIT/gFOBT in true screening cohorts.
Acknowledgments
None.
Footnote
Reporting Checklist: The authors have completed the PRISMA reporting checklist. Available at https://jgo.amegroups.com/article/view/10.21037/jgo-2026-1-0006/rc
Peer Review File: Available at https://jgo.amegroups.com/article/view/10.21037/jgo-2026-1-0006/prf
Funding: None.
Conflicts of Interest: Both authors have completed the ICMJE uniform disclosure form (available at https://jgo.amegroups.com/article/view/10.21037/jgo-2026-1-0006/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Areia M, Mori Y, Correale L, et al. Cost-effectiveness of artificial intelligence for screening colonoscopy: a modelling study. Lancet Digit Health 2022;4:e436-44.
- Mansur A, Saleem Z, Elhakim T, et al. Role of artificial intelligence in risk prediction, prognostication, and therapy response assessment in colorectal cancer: current state and future directions. Front Oncol 2023;13:1065402.
- Yin Z, Yao C, Zhang L, et al. Application of artificial intelligence in diagnosis and treatment of colorectal cancer: a novel prospect. Front Med (Lausanne) 2023;10:1128084.
- Wang H, Zhu W, Lei J, et al. Gut microbiome differences and disease risk in colorectal cancer relatives and healthy individuals. Front Cell Infect Microbiol 2025;15:1573216.
- Bai B, Ma J, Xu W, et al. Gut microbiota and colorectal cancer: mechanistic insights, diagnostic advances, and microbiome-based therapeutic strategies. Front Microbiol 2025;16:1699893.
- Takamaru H, Tsay C, Shiba S, et al. Microbiome and Colorectal Cancer in Humans: A Review of Recent Studies. J Anus Rectum Colon 2025;9:20-4.
- Tudorache M, Treteanu AR, Gradisteanu Pircalabioru G, et al. Gut Microbiome Alterations in Colorectal Cancer: Mechanisms, Therapeutic Strategies, and Precision Oncology Perspectives. Cancers (Basel) 2025;17:2294.
- Khawaja TW, Zhao L, Siddiq R, et al. Unmasking the microbiome: the hidden role of gut bacteria in the pathogenesis of colorectal cancer and its prevention strategies. Explor Target Antitumor Ther 2025;6:1002351.
- Russ CA, Zertalis NA, Nanton V. Gut Bacterial Microbiome Profiles Associated with Colorectal Cancer Risk: A Narrative Review and Meta-Analysis. EMJ Gastroenterol 2024;13:72-83.
- Chen C, Su Q, Zi M, et al. Harnessing gut microbiota for colorectal cancer therapy: from clinical insights to therapeutic innovations. npj Biofilms Microbiomes 2025;11:190.
- Yan R, Zheng R, Han Y, et al. Meta-analysis of gut microbiome reveals patterns of dysbiosis in colorectal cancer patients. J Med Microbiol 2025;74:002042.
- Seneviwickrama M, Gunasekera KM, Gamage K, et al. Role of gut microbiome in colorectal cancer: a comprehensive umbrella review protocol. BMJ Open 2025;15:e104450.
- Definition of noninvasive. NCI Dictionary of Cancer Terms. National Cancer Institute; 2011. Available online: https://www.cancer.gov/publications/dictionaries/cancer-terms/def/noninvasive
- Freitas P, Silva F, Sousa JV, et al. Machine learning-based approaches for cancer prediction using microbiome data. Sci Rep 2023;13:11821.
- Jayakrishnan TT, Sangwan N, Barot SV, et al. Multi-omics machine learning to study host-microbiome interactions in early-onset colorectal cancer. NPJ Precis Oncol 2024;8:146.
- Novielli P, Romano D, Magarelli M, et al. Explainable artificial intelligence for microbiome data analysis in colorectal cancer biomarker identification. Front Microbiol 2024;15:1348974.
- Rynazal R, Fujisawa K, Shiroma H, et al. Leveraging explainable AI for gut microbiome-based colorectal cancer classification. Genome Biol 2023;24:21.
- Pateriya D, Malwe AS, Sharma VK. CRCpred: An AI-ML tool for colorectal cancer prediction using gut microbiome. Comput Biol Med 2025;195:110592.
- Bakir-Gungor B, Ersoz NS, Yousef M, et al. Integrating Biological Domain Knowledge with Machine Learning for Identifying Colorectal-Cancer-Associated Microbial Enzymes in Metagenomic Data. Appl Sci 2025;15:2940.
- Novielli P, Baldi S, Romano D, et al. Personalized colorectal cancer risk assessment through explainable AI and gut microbiome profiling. Gut Microbes 2025;17:2543124.
- Tsai YJ, Lyu WN, Liao NS, et al. Gut microbiome-based machine learning model for early colorectal cancer and adenoma screening. Gut Pathog 2025;17:80.
- Rotelli A, Salman A, Di Gloria L, et al. Analysis of Microbiome for AP and CRC Discrimination. Bioengineering (Basel) 2025;12:713.
- Lu F, Lei T, Zhou J, et al. Using gut microbiota as a diagnostic tool for colorectal cancer: machine learning techniques reveal promising results. J Med Microbiol 2023;
- Liu G, Su L, Kong C, et al. Improved diagnostic efficiency of CRC subgroups revealed using machine learning based on intestinal microbes. BMC Gastroenterol 2024;24:315.
- Piccinno G, Thompson KN, Manghi P, et al. Pooled analysis of 3,741 stool metagenomes from 18 cohorts for cross-stage and strain-level reproducible microbial biomarkers of colorectal cancer. Nat Med 2025;31:2416-29.
- Wang LY, Lee WC. A permutation method to assess heterogeneity in external validation for risk prediction models. PLoS One 2015;10:e0116957.
- Karwowska Z, Aasmets O. Effects of data transformation and model selection on feature importance in microbiome classification data. Microbiome 2025;13:2.
- Zhang C, Yan R, Liu X, et al. Empirical simulation of internal validation methods for prediction models: comparing k-fold cross-validation with bootstrap-based optimism correction. J Clin Epidemiol 2026;190:112101.
- Ma Y, Chen T, Sun T, et al. The oncomicrobiome: new insights into microorganisms in cancer. Microb Pathog 2024;197:107091.
- Osman MA, Neoh HM, Ab Mutalib NS, et al. Parvimonas micra, Peptostreptococcus stomatis, Fusobacterium nucleatum and Akkermansia muciniphila as a four-bacteria biomarker panel of colorectal cancer. Sci Rep 2021;11:2925.
- Abu-Ghazaleh N, Chua WJ, Gopalan V. Intestinal microbiota and its association with colon cancer and red/processed meat consumption. J Gastroenterol Hepatol 2021;36:75-88.
- Lee JK, Liles EG, Bent S, et al. Accuracy of fecal immunochemical tests for colorectal cancer: systematic review and meta-analysis. Ann Intern Med 2014;160:171.
- Allison JE, Tekawa IS, Ransom LJ, et al. A comparison of fecal occult-blood tests for colorectal-cancer screening. N Engl J Med 1996;334:155-9.
- Tinmouth J, Lansdorp-Vogelaar I, Allison JE. Faecal immunochemical tests versus guaiac faecal occult blood tests: what clinicians and colorectal cancer screening programme organisers need to know. Gut 2015;64:1327-37.
- Imperiale TF, Ransohoff DF, Itzkowitz SH, et al. Multitarget stool DNA testing for colorectal-cancer screening. N Engl J Med 2014;370:1287-97.
- Hewitson P, Glasziou P, Watson E, et al. Cochrane systematic review of colorectal cancer screening using the fecal occult blood test (hemoccult): an update. Am J Gastroenterol 2008;103:1541-9.
- Grobbee EJ, Wisse PHA, Schreuders EH, et al. Guaiac-based faecal occult blood tests versus faecal immunochemical tests for colorectal cancer screening in average-risk individuals. Cochrane Database Syst Rev 2022;6:CD009276.
- Ou S, Wang H, Tao Y, et al. Fusobacterium nucleatum and colorectal cancer: from phenomenon to mechanism. Front Cell Infect Microbiol 2022;12:1020583.
- Zepeda-Rivera M, Minot SS, Bouzek H, et al. A distinct Fusobacterium nucleatum clade dominates the colorectal cancer niche. Nature 2024;628:424-32.
- Yang Y, Weng W, Peng J, et al. Fusobacterium nucleatum Increases Proliferation of Colorectal Cancer Cells and Tumor Development in Mice by Activating Toll-Like Receptor 4 Signaling to Nuclear Factor-κB, and Up-regulating Expression of MicroRNA-21. Gastroenterology 2017;152:851-866.
- Bokhari SFH, Bakht D, Amir M, et al. Granulicatella infections: comprehensive review of an elusive opportunistic pathogen. World J Clin Cases 2025;13:110965.
- Senthakumaran T, Tannæs TM, Moen AEF, et al. Detection of colorectal-cancer-associated bacterial taxa in fecal samples using next-generation sequencing and 19 newly established qPCR assays. Mol Oncol 2025;19:412-29.
- Dai W, Li C, Li T, et al. Super-taxon in human microbiome are identified to be associated with colorectal cancer. BMC Bioinformatics 2022;23:243.
- Alexander JL, Posma JM, Scott A, et al. Pathobionts in the tumour microbiota predict survival following resection for colorectal cancer. Microbiome 2023;11:100.
- Shen X, Li J, Li J, et al. Fecal Enterotoxigenic Bacteroides fragilis-Peptostreptococcus stomatis-Parvimonas micra Biomarker for Noninvasive Diagnosis and Prognosis of Colorectal Laterally Spreading Tumor. Front Oncol 2021;11:661048.
- Pandey H, Tang DWT, Wong SH, et al. Gut Microbiota in Colorectal Cancer: Biological Role and Therapeutic Opportunities. Cancers (Basel) 2023;15:866.
- Saus E, Iraola-Guzmán S, Willis JR, et al. Microbiome and colorectal cancer: roles in carcinogenesis and clinical potential. Mol Aspects Med 2019;69:93-106.
- Thomas AM, Manghi P, Asnicar F, et al. Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat Med 2019;25:667-78.
- Wirbel J, Pyl PT, Kartal E, et al. Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat Med 2019;25:679-89.
- Loomba R, Seguritan V, Li W, et al. Gut Microbiome-Based Metagenomic Signature for Non-invasive Detection of Advanced Fibrosis in Human Nonalcoholic Fatty Liver Disease. Cell Metab 2017;25:1054-62.
- Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med 2019;25:44-56.

