Detecting novel lung cancer groups with Mapper

Published Paper DOI
10.1371/journal.pone.0284820


Materials and methods

  • FPKM counts of RNAseq data from human lung tissue:
    ↪ 19,648 genes per sample.
    ↪ 314 healthy samples (GTEx).
    ↪ 500 cancerous samples (TCGA).
  • Fit a Gaussian Mixture Model (GMM) to the data:
    ↪ Gives an accurate transformation to a unimodal Gaussian.
    ↪ Work with the resulting Z-scores from this point onward.
  • Compute pairwise correlation across all samples:
    ↪ Filter data by mean correlation value.
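The correlation step can be sketched as follows. This is a minimal illustration, not the published pipeline: it assumes Pearson correlation, assumes that samples (rather than genes) are filtered, and assumes that samples with a low mean correlation are the ones dropped. The function name `mean_correlation_filter` and the `threshold` value are hypothetical; the source states neither a cutoff nor a filtering direction.

```python
import numpy as np

def mean_correlation_filter(z, threshold):
    """Keep samples whose mean correlation with the other samples is >= threshold.

    z: array of shape (n_samples, n_genes), e.g. Z-scores.
    threshold: hypothetical cutoff (not given in the source).
    """
    corr = np.corrcoef(z)  # (n_samples, n_samples) Pearson correlation matrix
    n = corr.shape[0]
    # Row mean over off-diagonal entries (drop the self-correlation of 1).
    mean_corr = (corr.sum(axis=1) - 1.0) / (n - 1)
    keep = mean_corr >= threshold
    return z[keep], mean_corr

# Toy data: 5 samples sharing a common signal plus 1 unrelated sample.
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
rows = [signal + 0.3 * rng.normal(size=200) for _ in range(5)]
rows.append(rng.normal(size=200))
z = np.vstack(rows)

filtered, mean_corr = mean_correlation_filter(z, threshold=0.5)
print(filtered.shape, mean_corr.round(2))
```

In the toy data the unrelated sample has a mean correlation near zero, so it is the only one removed.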

Novel subgroups revealed with Mapper

  • Vary the number of bins b: 60 ≤ b ≤ 110.
  • Vary the overlap percentage p: 30 ≤ p ≤ 80.
  • Bright yellow: 100% cancerous samples.
  • Deep purple: 100% healthy samples.
  • Regardless of parameters:
    ↪ Healthy samples tend to stay at the center.
    ↪ Cancerous samples are split between both ends.
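To make the two Mapper parameters concrete, here is a small sketch of how b bins with p percent overlap can cover the range of a one-dimensional filter, and of the quantity that drives the node coloring. It assumes that overlap means the share of an interval's length it has in common with the next interval; Mapper implementations differ on this convention. The helper names and the grid step of 10 are hypothetical.

```python
from itertools import product

def cover_intervals(lo, hi, n_bins, overlap_pct):
    """Cover [lo, hi] with n_bins equal-length intervals in which consecutive
    intervals share overlap_pct percent of their length."""
    frac = overlap_pct / 100.0
    # n_bins intervals of length L, each shifted by L * (1 - frac), span
    # L * (1 + (n_bins - 1) * (1 - frac)) in total; solve for L.
    length = (hi - lo) / (1 + (n_bins - 1) * (1 - frac))
    step = length * (1 - frac)
    return [(lo + i * step, lo + i * step + length) for i in range(n_bins)]

def cancer_fraction(labels):
    """Fraction of cancerous samples in a node (1 = cancerous, 0 = healthy).
    1.0 corresponds to bright yellow, 0.0 to deep purple."""
    return sum(labels) / len(labels)

# Parameter ranges from the text; the step of 10 is an assumption.
grid = list(product(range(60, 111, 10), range(30, 81, 10)))

intervals = cover_intervals(0.0, 1.0, n_bins=60, overlap_pct=30)
print(len(grid), len(intervals), cancer_fraction([1, 1, 0, 0]))
```

Running Mapper once per `(b, p)` pair in `grid` and comparing the resulting graphs is one way to check that a feature, such as the split of cancerous samples, does not depend on a particular parameter choice.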

Split not captured by tSNE

  • tSNE separates healthy from cancerous samples.
  • However, fine structural details are lost with tSNE:
    (a) Using FPKM, (b) Using GMM Z-scores.

GO Enrichment Analysis

  • Two possible processes for forming lung cancer.
  • Strand -1: Mostly tumor cells
    ↪ Primarily upregulating inflammatory reactions.
  • Strand +1: Mixed bag but high risk
    ↪ Environmental factors and tumor gene interactions.
  • Similar conclusions when analyzing KEGG pathways.
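For readers unfamiliar with enrichment analysis, the sketch below shows a one-sided hypergeometric over-representation test, a common basis for GO and KEGG enrichment. The source does not say which tool or test was used, so this is illustrative only. The function name is hypothetical, and every number in the example is made up except the background size of 19,648 genes.

```python
from math import comb

def enrichment_pvalue(n_background, n_annotated, n_selected, n_hits):
    """P(X >= n_hits) when n_selected genes are drawn without replacement from
    n_background genes, of which n_annotated carry the term of interest."""
    total = comb(n_background, n_selected)
    upper = min(n_annotated, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total  # exact integer counts, converted to a float ratio

# Hypothetical example: 200 selected genes, 10 of them among the 150 genes
# annotated with some term, out of 19,648 background genes.
p = enrichment_pvalue(19648, 150, 200, 10)
print(p)
```

A small p-value indicates that the term appears among the selected genes more often than expected by chance; in practice the p-values are also corrected for testing many terms at once.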

Acknowledgements

This research was supported by the National Institute of General Medical Sciences, Centers of Biomedical Research Excellence (COBRE), grant number P20GM125508.