World Journal of Oncology, ISSN 1920-4531 print, 1920-454X online, Open Access
Article copyright, the authors; Journal compilation copyright, World J Oncol and Elmer Press Inc
Journal website https://www.wjon.org

Original Article

Volume 14, Number 5, October 2023, pages 406-422


Development of a Machine Learning-Based Prognostic Model for Hormone Receptor-Positive Breast Cancer Using Nine-Gene Expression Signature

Figures

Figure 1. Extraction of all prognosis-related genes and preparation of a recurrence prediction model using machine learning. (a) Volcano plot of the mRNAs differentially expressed between BC patients with and without distant recurrence in the METABRIC HR+ HER2- cohort. X-axis: log2 FC; Y-axis: -log10 adjusted P-value from limma analysis. mRNAs with adjusted P-value < 0.05 and log2 FC > 0.25 are marked in red, those with adjusted P-value > 0.05 and log2 FC < 0.25 in green, those with adjusted P-value < 0.05 and log2 FC < 0.25 in blue, and all others in black. (b) Heatmap of the expression intensity of the 155 genes extracted in (a), with colors ranging from red to blue as indicated in the key. Both rows and columns are clustered using correlation distance and average linkage. (c) Logarithm of the integrated hazard ratio for all 155 genes extracted in (a). The complete list of these genes identified by meta-analysis is provided in Supplementary Material 1 (www.wjon.org). (d) Kaplan-Meier curves for distant RFS in METABRIC HR+ HER2- patients stratified by high and low risk in the recurrence prediction model. BC: breast cancer; METABRIC: Molecular Taxonomy of Breast Cancer International Consortium; FC: fold change; RFS: recurrence-free survival; HR+: hormone receptor positive; HER2: human epidermal growth factor receptor 2; LRM: logistic regression model.
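The red-point criterion in panel (a) can be written out as a simple filter. The following is a minimal sketch, not the authors' code: the `GeneStat` record, its field names, and the three toy rows are hypothetical, and applying the fold-change cutoff to the absolute value of log2 FC is an assumption, since the caption states only "log2 FC > 0.25".

```python
from dataclasses import dataclass

ADJ_P_CUTOFF = 0.05    # adjusted P-value threshold stated in the caption
LOG2FC_CUTOFF = 0.25   # log2 fold-change threshold stated in the caption


@dataclass(frozen=True)
class GeneStat:
    """One row of a differential-expression table (hypothetical layout)."""
    gene: str
    log2fc: float
    adj_p: float


def passes_both_cutoffs(stat: GeneStat) -> bool:
    """Return True when a gene meets both cutoffs (the red points in panel a).

    Assumption: the fold-change cutoff is applied to |log2 FC| so that
    up- and down-regulated genes are treated symmetrically.
    """
    return stat.adj_p < ADJ_P_CUTOFF and abs(stat.log2fc) > LOG2FC_CUTOFF


# Toy values for illustration only; they are not taken from the study.
toy_table = [
    GeneStat("gene_a", 0.40, 0.001),   # significant and large change
    GeneStat("gene_b", 0.10, 0.001),   # significant but small change
    GeneStat("gene_c", -0.60, 0.200),  # large change but not significant
]

selected = [stat.gene for stat in toy_table if passes_both_cutoffs(stat)]
```

In this toy table only `gene_a` passes both cutoffs; in the study, the same kind of selection yielded the 155 genes carried forward to panels (b) and (c).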
Figure 2. Validation of the relationship between the recurrence prediction model and survival rate in other HR+ HER2- BC cohorts. Kaplan-Meier plots showing the association of the recurrence prediction model with RFS when applied to the TCGA, GSE199135, GSE9195, GSE6532, and GSE21653 cohorts. HR+: hormone receptor positive; HER2: human epidermal growth factor receptor 2; BC: breast cancer; RFS: recurrence-free survival; TCGA: The Cancer Genome Atlas; FC: fold change.
Figure 3. Gene expression profiles based on high and low risk in the recurrence prediction model. GSEA comparing BC patients at high and low risk in the recurrence prediction model in the METABRIC HR+ HER2- cohort. Pathways upregulated in the high-risk group compared with the low-risk group in the LRM included mitotic spindle, G2/M checkpoint, E2F targets, MYC targets v2, and PI3K-AKT-mTOR signaling. Pathways were considered significant at a threshold of NES > 1.6 or < -1.6 and FDR q-value < 0.025. GSEA: Gene Set Enrichment Analysis; BC: breast cancer; METABRIC: Molecular Taxonomy of Breast Cancer International Consortium; HR+: hormone receptor positive; HER2: human epidermal growth factor receptor 2; mTOR: mammalian target of rapamycin; LRM: logistic regression model; NES: normalized enrichment score; FDR: false discovery rate.
Figure 4. Differences in TME composition between high and low risk in the recurrence prediction model. TME composition was compared between the high- and low-risk groups of the recurrence prediction model using xCell. Box plots show the relationship between recurrence risk in the recurrence prediction model and the TME in the METABRIC HR+ HER2- cohort. The left panel shows the cell fractions upregulated in the high-risk group, and the right panel shows the cell fractions upregulated in the low-risk group. ****P < 0.0001, ***P < 0.001, **P < 0.01, *P < 0.05. TME: tumor microenvironment; METABRIC: Molecular Taxonomy of Breast Cancer International Consortium; HR+: hormone receptor positive; HER2: human epidermal growth factor receptor 2; CYT: immune cytolytic activity; CD4+ tcm: central memory CD4+ T cell; CD8+ tem: effector memory CD8+ T cell; NKT: natural killer T cell; Tregs: regulatory T cells; aDC: activated dendritic cell; MSC: mesenchymal stem cell; CD4+ tem: effector memory CD4+ T cell; CD8+ tcm: central memory CD8+ T cell; cDC: conventional dendritic cell; iDC: immature dendritic cell; CMP: common myeloid progenitor; GMP: granulocyte-macrophage progenitor; HSC: hematopoietic stem cell; MEP: megakaryocyte-erythroid progenitor.
Figure 5. Validation of the relationship between risk classification in the recurrence prediction model and the therapeutic effect of chemotherapy and endocrine therapy in HR+ HER2- BC patients. Kaplan-Meier plots of distant RFS, total RFS, and local RFS showing the association between recurrence risk in the recurrence prediction model and outcome in chemotherapy- and endocrine-treated patients in the METABRIC HR+ HER2- cohort. For total RFS and local RFS, Kaplan-Meier plots of total recurrence are shown on the far right. HR+: hormone receptor positive; HER2: human epidermal growth factor receptor 2; BC: breast cancer; RFS: recurrence-free survival; METABRIC: Molecular Taxonomy of Breast Cancer International Consortium.
Figure 6. Analysis of the tumor microenvironment with no additional effect of chemotherapy and endocrine therapy. Box plots of the relationship between recurrence by treatment and signaling pathways in GSVA (a), CYT (b), and immune cell composition (c) in the METABRIC HR+ HER2- cohort. Based on treatment and recurrence, patients were classified into three categories: CT rec, patients treated with ET and CT who relapsed; ET rec, patients treated with ET alone who relapsed; and No rec, patients treated with ET with or without CT who did not relapse. ****P < 0.0001, ***P < 0.001, **P < 0.01, *P < 0.05. METABRIC: Molecular Taxonomy of Breast Cancer International Consortium; GSVA: Gene Set Variation Analysis; CYT: immune cytolytic activity; ET: endocrine therapy; CT: chemotherapy; E2F: E2F_TARGETS; G2M: G2M_CHECKPOINT; Myc2: MYC_TARGETS_v2; PI3K: PI3K_AKT_MTOR_SIGNALING; FA: FATTY_ACID_METABOLISM; PS: PROTEIN_SECRETION; XM: XENOBIOTIC_METABOLISM; ERE: ESTROGEN_RESPONSE_EARLY; ERL: ESTROGEN_RESPONSE_LATE; WNTβ: WNT_BETA_CATENIN_SIGNALING; M1: M1 macrophage; Tfh: follicular helper T cells; M2: M2 macrophage; Tregs: CD4+ regulatory T cells.
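The asterisk notation used in the captions of Figures 4 and 6 corresponds to a fixed set of P-value cutoffs. A minimal sketch of that mapping follows; the function name is hypothetical, and returning an empty string for P >= 0.05 is an assumption, because the captions define no symbol for non-significant comparisons.

```python
def significance_stars(p_value: float) -> str:
    """Map a P-value to the asterisk notation given in the figure captions."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    if p_value < 0.0001:
        return "****"
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    # Assumption: no symbol is defined in the captions for P >= 0.05.
    return ""
```

The cutoffs are checked from the strictest to the loosest, so each P-value receives the largest number of asterisks it qualifies for.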

Tables

Table 1. Key resources
 
| Resource | Source | Identifier |
| --- | --- | --- |
| Deposited data | | |
| METABRIC | METABRIC | [31] |
| TCGA | TCGA PanCancer Atlas | [31] |
| GSE199135 | Takeshita et al [24] | [32] |
| GSE9195; GSE6532 | Loi et al, 2010 dataset [25] | [32] |
| GSE21653 | Sabatier et al, 2011 [26] | [32] |
| Software and algorithms | | |
| Python 3.11.0 | Python Software Foundation | [33] |
| NumPy v 1.23.4 | Van Der Walt et al, 2011 [27] | [34] |
| SciPy v 1.9.3 | Virtanen et al, 2020 [28] | [35] |
| Pandas v 1.5.1 | Pandas - Python Data Analysis Library | [36] |
| Seaborn v 0.12.1 | Waskom, 2021 [29] | [37] |
| Matplotlib v 3.6.2 | Hunter, 2007 [30] | [38] |
| R 4.0.2 | The R Foundation | [39] |

 

Table 2. The Nine Best Predictor Genes Extracted From 23 Signature Genes Using the Cox-PH Model With Recursive Feature Elimination
 
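| Variable | coef | std err | z | P > \|z\| | CI lower (0.025) | CI upper (0.975) |
| --- | --- | --- | --- | --- | --- | --- |
| 23 genes | | | | | | |
| const | -0.568 | 0.684 | -0.83 | 0.406 | -1.909 | 0.773 |
| AGL | 0.2822 | 0.109 | 2.591 | 0.01 | 0.069 | 0.496 |
| BIRC5 | 0.1692 | 0.227 | 0.747 | 0.455 | -0.275 | 0.613 |
| C1orf64 | -0.216 | 0.061 | -3.565 | 0 | -0.335 | -0.097 |
| CDCA3 | 0.1451 | 0.282 | 0.514 | 0.607 | -0.408 | 0.698 |
| CENPF | 0.1927 | 0.243 | 0.795 | 0.427 | -0.283 | 0.668 |
| CEP55 | -0.7396 | 0.378 | -1.957 | 0.05 | -1.48 | 0.001 |
| CIDEC | 0.021 | 0.077 | 0.272 | 0.785 | -0.13 | 0.172 |
| CKAP2L | 0.8004 | 0.44 | 1.82 | 0.069 | -0.062 | 1.663 |
| CRTAP | -0.3557 | 0.197 | -1.808 | 0.071 | -0.741 | 0.03 |
| CYP4F22 | -0.1655 | 0.09 | -1.845 | 0.065 | -0.341 | 0.01 |
| E2F2 | -0.3632 | 0.253 | -1.434 | 0.152 | -0.86 | 0.133 |
| FHL2 | -0.0056 | 0.098 | -0.057 | 0.955 | -0.197 | 0.186 |
| FOS | 0.0434 | 0.077 | 0.566 | 0.571 | -0.107 | 0.194 |
| GSTM2 | -0.0754 | 0.08 | -0.937 | 0.349 | -0.233 | 0.082 |
| HNMT | -0.5129 | 0.213 | -2.405 | 0.016 | -0.931 | -0.095 |
| KIF20A | 1.2829 | 0.324 | 3.956 | 0 | 0.647 | 1.919 |
| LAD1 | 0.2009 | 0.083 | 2.423 | 0.015 | 0.038 | 0.363 |
| PIP | 0.0504 | 0.044 | 1.155 | 0.248 | -0.035 | 0.136 |
| PRC1 | -0.5401 | 0.27 | -2 | 0.045 | -1.069 | -0.011 |
| S100P | 0.1684 | 0.047 | 3.571 | 0 | 0.076 | 0.261 |
| SEPP1 | 0.2943 | 0.13 | 2.267 | 0.023 | 0.04 | 0.549 |
| STAT1 | -0.0774 | 0.111 | -0.695 | 0.487 | -0.296 | 0.141 |
| TUBA3D | -0.2673 | 0.074 | -3.619 | 0 | -0.412 | -0.123 |
| 13 genes | | | | | | |
| const | -0.93 | 0.458 | -2.029 | 0.042 | -1.828 | -0.032 |
| AGL | 0.2695 | 0.105 | 2.557 | 0.011 | 0.063 | 0.476 |
| C1orf64 | -0.2189 | 0.058 | -3.774 | 0 | -0.333 | -0.105 |
| CEP55 | -0.7108 | 0.353 | -2.012 | 0.044 | -1.403 | -0.019 |
| CKAP2L | 0.7766 | 0.4 | 1.943 | 0.052 | -0.007 | 1.56 |
| CRTAP | -0.345 | 0.184 | -1.876 | 0.061 | -0.705 | 0.015 |
| CYP4F22 | -0.1731 | 0.088 | -1.96 | 0.05 | -0.346 | -3.02E-05 |
| HNMT | -0.3755 | 0.198 | -1.893 | 0.058 | -0.764 | 0.013 |
| KIF20A | 1.2816 | 0.307 | 4.168 | 0 | 0.679 | 1.884 |
| LAD1 | 0.2107 | 0.081 | 2.611 | 0.009 | 0.053 | 0.369 |
| PRC1 | -0.513 | 0.261 | -1.966 | 0.049 | -1.024 | -0.002 |
| S100P | 0.1733 | 0.046 | 3.74 | 0 | 0.082 | 0.264 |
| SEPP1 | 0.2812 | 0.123 | 2.277 | 0.023 | 0.039 | 0.523 |
| TUBA3D | -0.2489 | 0.07 | -3.532 | 0 | -0.387 | -0.111 |
| 12 genes | | | | | | |
| const | -0.9519 | 0.458 | -2.079 | 0.038 | -1.85 | -0.054 |
| AGL | 0.3102 | 0.103 | 3.013 | 0.003 | 0.108 | 0.512 |
| C1orf64 | -0.2124 | 0.058 | -3.668 | 0 | -0.326 | -0.099 |
| CEP55 | -0.631 | 0.35 | -1.803 | 0.071 | -1.317 | 0.055 |
| CKAP2L | 0.9341 | 0.392 | 2.383 | 0.017 | 0.166 | 1.702 |
| CYP4F22 | -0.17 | 0.088 | -1.926 | 0.054 | -0.343 | 0.003 |
| HNMT | -0.5147 | 0.184 | -2.793 | 0.005 | -0.876 | -0.153 |
| KIF20A | 1.1914 | 0.303 | 3.931 | 0 | 0.597 | 1.785 |
| LAD1 | 0.2105 | 0.081 | 2.603 | 0.009 | 0.052 | 0.369 |
| PRC1 | -0.5702 | 0.259 | -2.202 | 0.028 | -1.078 | -0.063 |
| S100P | 0.1651 | 0.046 | 3.582 | 0 | 0.075 | 0.255 |
| SEPP1 | 0.183 | 0.112 | 1.636 | 0.102 | -0.036 | 0.402 |
| TUBA3D | -0.2344 | 0.07 | -3.353 | 0.001 | -0.371 | -0.097 |
| 11 genes | | | | | | |
| const | -0.8428 | 0.452 | -1.866 | 0.062 | -1.728 | 0.042 |
| AGL | 0.3101 | 0.103 | 3.014 | 0.003 | 0.108 | 0.512 |
| C1orf64 | -0.2005 | 0.057 | -3.495 | 0 | -0.313 | -0.088 |
| CEP55 | -0.5666 | 0.347 | -1.632 | 0.103 | -1.247 | 0.114 |
| CKAP2L | 0.8471 | 0.388 | 2.182 | 0.029 | 0.086 | 1.608 |
| CYP4F22 | -0.1772 | 0.088 | -2.014 | 0.044 | -0.35 | -0.005 |
| HNMT | -0.3585 | 0.157 | -2.289 | 0.022 | -0.665 | -0.051 |
| KIF20A | 1.1552 | 0.302 | 3.822 | 0 | 0.563 | 1.748 |
| LAD1 | 0.2013 | 0.08 | 2.501 | 0.012 | 0.044 | 0.359 |
| PRC1 | -0.5527 | 0.259 | -2.134 | 0.033 | -1.06 | -0.045 |
| S100P | 0.1609 | 0.046 | 3.504 | 0 | 0.071 | 0.251 |
| TUBA3D | -0.2376 | 0.07 | -3.398 | 0.001 | -0.375 | -0.101 |
| 10 genes | | | | | | |
| const | -0.7569 | 0.447 | -1.692 | 0.091 | -1.633 | 0.12 |
| AGL | 0.2979 | 0.102 | 2.907 | 0.004 | 0.097 | 0.499 |
| C1orf64 | -0.1867 | 0.057 | -3.294 | 0.001 | -0.298 | -0.076 |
| CKAP2L | 0.5869 | 0.352 | 1.67 | 0.095 | -0.102 | 1.276 |
| CYP4F22 | -0.1749 | 0.088 | -1.99 | 0.047 | -0.347 | -0.003 |
| HNMT | -0.4006 | 0.154 | -2.602 | 0.009 | -0.702 | -0.099 |
| KIF20A | 1.036 | 0.292 | 3.545 | 0 | 0.463 | 1.609 |
| LAD1 | 0.1867 | 0.08 | 2.341 | 0.019 | 0.03 | 0.343 |
| PRC1 | -0.6683 | 0.249 | -2.688 | 0.007 | -1.156 | -0.181 |
| S100P | 0.1609 | 0.046 | 3.509 | 0 | 0.071 | 0.251 |
| TUBA3D | -0.2381 | 0.07 | -3.416 | 0.001 | -0.375 | -0.101 |
| 9 genes | | | | | | |
| const | -1.0651 | 0.407 | -2.617 | 0.009 | -1.863 | -0.267 |
| AGL | 0.3107 | 0.102 | 3.04 | 0.002 | 0.11 | 0.511 |
| C1orf64 | -0.1932 | 0.057 | -3.416 | 0.001 | -0.304 | -0.082 |
| CYP4F22 | -0.176 | 0.088 | -2.009 | 0.045 | -0.348 | -0.004 |
| HNMT | -0.4024 | 0.154 | -2.611 | 0.009 | -0.705 | -0.1 |
| KIF20A | 1.2687 | 0.258 | 4.915 | 0 | 0.763 | 1.775 |
| LAD1 | 0.1841 | 0.08 | 2.309 | 0.021 | 0.028 | 0.34 |
| PRC1 | -0.4968 | 0.226 | -2.2 | 0.028 | -0.939 | -0.054 |
| S100P | 0.1645 | 0.046 | 3.596 | 0 | 0.075 | 0.254 |
| TUBA3D | -0.2352 | 0.07 | -3.38 | 0.001 | -0.372 | -0.099 |

Cox-PH: Cox proportional-hazards.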
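The columns of Table 2 are linked by standard Wald arithmetic: z is the coefficient divided by its standard error, the P-value is the two-sided normal tail of z, and the interval bounds are the coefficient plus or minus about 1.96 standard errors. The sketch below reproduces these quantities for the final nine-gene model from the published coefficients and standard errors. It is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, small discrepancies from the table are expected because the standard errors are rounded, and the logistic transform assumes the coefficients are applied as the logistic regression model (LRM) named in the figure captions. The expression scale expected by the model and the cutoff separating high from low risk are not given in this section, so neither is encoded here.

```python
import math

# Final nine-gene model from Table 2: term -> (coef, std err).
NINE_GENE_MODEL = {
    "const": (-1.0651, 0.407),
    "AGL": (0.3107, 0.102),
    "C1orf64": (-0.1932, 0.057),
    "CYP4F22": (-0.1760, 0.088),
    "HNMT": (-0.4024, 0.154),
    "KIF20A": (1.2687, 0.258),
    "LAD1": (0.1841, 0.080),
    "PRC1": (-0.4968, 0.226),
    "S100P": (0.1645, 0.046),
    "TUBA3D": (-0.2352, 0.070),
}

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution


def wald_summary(coef, std_err):
    """Return (z, two-sided P, lower bound, upper bound) for one coefficient."""
    z = coef / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    half_width = Z_975 * std_err
    return z, p_value, coef - half_width, coef + half_width


def linear_predictor(expression):
    """Intercept plus the coefficient-weighted sum of the nine gene values.

    `expression` maps gene symbol -> expression value. The scale the model
    expects is an assumption left to the caller.
    """
    score = NINE_GENE_MODEL["const"][0]
    for gene, (coef, _std_err) in NINE_GENE_MODEL.items():
        if gene != "const":
            score += coef * expression[gene]
    return score


def logistic(score):
    """Logistic transform of the linear predictor (assumes an LRM reading)."""
    return 1.0 / (1.0 + math.exp(-score))


if __name__ == "__main__":
    for term, (coef, std_err) in NINE_GENE_MODEL.items():
        z, p_value, lower, upper = wald_summary(coef, std_err)
        print(f"{term:8s} z={z:7.3f} P={p_value:.3f} CI=({lower:.3f}, {upper:.3f})")
```

Under this reading, positive coefficients (for example KIF20A) raise the predicted risk as expression increases, and negative coefficients (for example HNMT) lower it.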

 

Table 3. Patient and Clinical Characteristics Associated With the Recurrence Prediction Model in the METABRIC HR+ HER2- Cohort
 
| Variables | Total (N = 1,355) | Recurrence prediction model: high risk (N = 486) | Recurrence prediction model: low risk (N = 869) | P-value |
| --- | --- | --- | --- | --- |
| Age | | | | |
| ≥ 50 | 220 (16.2) | 58 (11.9) | 162 (18.6) | 0.0013* |
| < 50 | 1,135 (83.8) | 428 (88.1) | 707 (81.4) | |
| Menopausal state | | | | |
| Pre | 220 (16.2) | 58 (11.9) | 162 (18.6) | 0.0013* |
| Post | 1,135 (83.8) | 428 (88.1) | 707 (81.4) | |
| Tumor size (cm) | | | | |
| ≥ 2 | 601 (44.4) | 179 (36.8) | 422 (48.6) | 0.000027* |
| < 2 | 742 (54.8) | 303 (62.3) | 439 (50.5) | |
| Unknown | 12 (0.9) | 4 (0.8) | 8 (0.9) | |
| Lymph node metastases | | | | |
| Negative | 745 (55) | 246 (50.6) | 499 (57.4) | 0.016* |
| Positive | 610 (45) | 240 (49.4) | 370 (42.6) | |
| Histopathology | | | | |
| Ductal | 1,006 (74.2) | 395 (81.3) | 611 (70.3) | 0.000051* |
| Lobular | 118 (8.7) | 29 (6) | 89 (10.2) | |
| Others/unknown | 231 (17) | 62 (12.8) | 169 (19.4) | |
| Tumor grade | | | | |
| 1 | 159 (11.7) | 18 (3.7) | 141 (16.2) | < 0.00001* |
| 2, 3 | 1,135 (83.8) | 452 (93) | 683 (78.6) | |
| Unknown | 61 (4.5) | 16 (3.3) | 45 (5.2) | |
| Clinical stage | | | | |
| I/II | 933 (68.9) | 317 (65.2) | 616 (70.9) | 0.13 |
| III/IV | 70 (5.2) | 30 (6.2) | 40 (4.6) | |
| Unknown | 352 (26) | 139 (28.6) | 213 (24.5) | |
| PgR | | | | |
| Negative | 411 (30.3) | 206 (42.4) | 205 (23.6) | < 0.00001* |
| Positive | 944 (69.7) | 280 (57.6) | 664 (76.4) | |
| Molecular characterization | | | | |
| Luminal A | 656 (48.4) | 137 (28.2) | 519 (59.7) | < 0.00001* |
| Luminal B | 419 (30.9) | 222 (45.7) | 197 (22.7) | |
| HER2 | 63 (4.6) | 53 (10.9) | 10 (1.2) | |
| Basal-like | 25 (1.8) | 22 (4.5) | 3 (0.3) | |
| Claudin-low | 72 (5.3) | 19 (3.9) | 53 (6.1) | |
| Normal | 114 (8.4) | 30 (6.2) | 84 (9.7) | |

Values are number of patients (%). *Also significant in univariate and multivariate analyses. P < 0.05 was considered statistically significant. METABRIC: Molecular Taxonomy of Breast Cancer International Consortium; HR+: hormone receptor positive; HER2: human epidermal growth factor receptor 2; PgR: progesterone receptor.
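A comparison of category counts between the high- and low-risk groups, as in Table 3, can be illustrated with a Pearson chi-square test on a 2 x 2 table. The sketch below uses the lymph node and PgR counts from the table. The choice of test is an assumption: this section does not state which test produced the reported P-values, and the function name is hypothetical. No continuity correction is applied.

```python
import math


def pearson_chi2_2x2(table):
    """Pearson chi-square test (1 degree of freedom, no continuity
    correction) for a 2 x 2 table given as [[a, b], [c, d]].

    Returns (chi-square statistic, P-value).
    """
    (a, b), (c, d) = table
    total = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Rows: category; columns: (high risk, low risk). Counts from Table 3.
lymph_node = [[246, 499],   # negative
              [240, 370]]   # positive
pgr = [[206, 205],          # negative
       [280, 664]]          # positive

chi2_ln, p_ln = pearson_chi2_2x2(lymph_node)
chi2_pgr, p_pgr = pearson_chi2_2x2(pgr)
```

Variables with more than two categories, such as histopathology or molecular characterization, would need the general r x c form of the test with the corresponding degrees of freedom, which this sketch does not cover.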