TDA for Medical Data

Overview

Topological Data Analysis (TDA) is a rapidly growing branch of applied mathematics and machine learning that extracts shape-based features from data. Unlike conventional statistical or geometric methods that depend on a particular coordinate system or on assumptions of linearity, TDA captures topological invariants that persist across multiple scales: connected components (H0), loops (H1), voids (H2), and higher-dimensional structures.

My research explores how TDA can be applied across medical, biological, and AI domains to extract patterns that traditional pipelines often miss: from genomic subtype discovery, to cohort stratification of disease progression, to robustness analysis of deep learning models. The unifying theme is using shape as a complementary signal to numerical statistics.

Persistent Homology & Mapper

Two core TDA techniques drive most of this work:

Persistent Homology: Tracks topological features across a continuum of scales, producing persistence diagrams and barcodes. These representations are stable under small perturbations of the input, which makes them well suited to noisy biomedical data.
Mapper Algorithm: Generates a graph-based summary of high-dimensional data, exposing branching and cyclic structures that can reflect biological progression, treatment response trajectories, or class boundaries.
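
As a minimal sketch of the simplest persistence computation, the following pure-Python snippet computes H0 (connected-component) persistence pairs of a Vietoris-Rips filtration with a union-find structure. The function name h0_persistence and the toy points are illustrative assumptions, not code from this project. The sketch also assumes the convention that an edge enters the filtration at a scale equal to its length. In practice a library such as Ripser or GUDHI would be used, and it would also return H1 and higher-dimensional features.

```python
from itertools import combinations
from math import dist, inf


def h0_persistence(points):
    """Return (birth, death) pairs for connected components (H0) of a
    Vietoris-Rips filtration on `points` (hypothetical helper)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Edges sorted by length: the order in which they enter the filtration.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    pairs = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Two components merge, so one of them dies at this scale.
            parent[ri] = rj
            pairs.append((0.0, length))
    # The last surviving component never dies.
    pairs.append((0.0, inf))
    return pairs


# Toy input: two well-separated clusters (illustrative data only).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0)]
diagram = h0_persistence(points)
```

On this toy input, three components die at scale 1.0, the two clusters merge at 9.0, and one component persists indefinitely. The long bar ending at 9.0 is the signal that the data contains two clusters.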

Both can be combined with deep learning via persistence images, persistence landscapes, or differentiable persistence layers to train end-to-end models that respect topology.
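
To illustrate how a persistence diagram becomes a fixed-length vector that a neural network can consume, here is a hedged sketch of a persistence landscape evaluated on a grid. The helper names landscape and landscape_vector, the toy diagram, and the grid are assumptions made for illustration. The sketch expects finite (birth, death) pairs, so infinite bars should be dropped or capped first.

```python
def landscape(diagram, k, t):
    """Value at t of the k-th persistence landscape function (k = 1 is the top layer)."""
    # Each (birth, death) pair contributes a "tent" that peaks at its midpoint.
    tents = sorted(
        (max(0.0, min(t - b, d - t)) for b, d in diagram),
        reverse=True,
    )
    return tents[k - 1] if k <= len(tents) else 0.0


def landscape_vector(diagram, num_layers, grid):
    """Concatenate the first `num_layers` landscape functions sampled on `grid`."""
    return [
        landscape(diagram, k, t)
        for k in range(1, num_layers + 1)
        for t in grid
    ]


# Toy diagram with two nested bars (illustrative data only).
diagram = [(0.0, 4.0), (1.0, 3.0)]
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
features = landscape_vector(diagram, num_layers=2, grid=grid)
# features == [0.0, 1.0, 2.0, 1.0, 0.0,   0.0, 0.0, 1.0, 0.0, 0.0]
```

The resulting vector has length num_layers times len(grid) regardless of how many points the diagram contains, which is what makes it usable as a model input.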

Application Areas

Pan-Cancer Genomics: Apply TDA to gene-expression and mutation data across multiple TCGA cohorts to discover novel topological biomarkers and refine molecular subtype classifications (the basis of the H2C panel, described in a paper accepted at IEEE COINS 2026).
Biosignal Processing (EEG / ECG / EMG): Extract topological features from physiological time series for anomaly detection, seizure prediction, and arrhythmia classification.
Medical Image Analysis: Use cubical or alpha complex persistence on tumors and tissue scans to capture morphology beyond pixel-wise statistics.
Drug Discovery & Molecular Design: Apply persistent homology to molecular point clouds, protein-ligand interaction networks, and binding pocket geometries — guiding virtual screening, lead optimization, and structure-based drug design with shape-aware features.
Adversarial Robustness: Investigate TDA-based detectors for adversarial examples — generative-AI-driven perturbations often shift topological signatures even when imperceptible to humans.
Patient Stratification: Use Mapper to identify subgroups in clinical cohorts that share latent disease trajectories, supporting personalized treatment decisions.
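
The Mapper construction used for patient stratification can be sketched in a few steps: project the data with a lens function, cover the lens range with overlapping intervals, cluster the points in each interval, and connect clusters that share points. The snippet below is a minimal standard-library sketch under stated assumptions, not KeplerMapper's API. The function name mapper_graph, the parameters n_intervals, overlap, and eps, and the Y-shaped toy data are all hypothetical.

```python
from itertools import combinations
from math import dist


def mapper_graph(points, lens, n_intervals, overlap, eps):
    """Tiny Mapper sketch (hypothetical helper).

    Assumes the lens takes at least two distinct values on `points`.
    """
    values = [lens(p) for p in points]
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_intervals
    pad = step * overlap / 2  # each interval is widened by `pad` on both sides

    nodes = []  # each node is a frozenset of point indices
    for k in range(n_intervals):
        a, b = lo + k * step - pad, lo + (k + 1) * step + pad
        members = [i for i, v in enumerate(values) if a <= v <= b]
        # Single-linkage clustering: connected components of the graph
        # joining points that are at most `eps` apart.
        unvisited = set(members)
        for seed in members:
            if seed not in unvisited:
                continue
            unvisited.discard(seed)
            cluster, stack = {seed}, [seed]
            while stack:
                i = stack.pop()
                near = {j for j in unvisited
                        if dist(points[i], points[j]) <= eps}
                unvisited -= near
                cluster |= near
                stack.extend(near)
            nodes.append(frozenset(cluster))

    # Two nodes are linked when their clusters share at least one point.
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges


# Y-shaped toy data: a stem that splits into two branches (illustrative only).
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
          (3.0, 1.0), (4.0, 2.0), (3.0, -1.0), (4.0, -2.0)]
nodes, edges = mapper_graph(points, lens=lambda p: p[0],
                            n_intervals=4, overlap=0.5, eps=1.5)
# edges == [(0, 1), (1, 2), (2, 3), (2, 4)]
```

In the toy output, node 2 connects to two separate downstream nodes, so the graph recovers the branch point of the Y. In a clinical cohort, such a branch would be a candidate for two diverging subgroups, subject to validation.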

Why TDA for Medical Data?

Medical datasets are typically high-dimensional, noisy, heterogeneous, and small in sample size. Under these conditions, purely statistical methods and data-hungry deep learning approaches often struggle. TDA provides a complementary lens: it is noise-tolerant, coordinate-free, and multiscale. Topological features can remain informative even with few samples, and they often align with clinically interpretable structures (for example, disease subtypes may appear as distinct connected components or loops).

Combined with modern ML, TDA enables hybrid pipelines: shape features extracted by persistent homology are fed into neural networks or gradient-boosted models, or used as priors in Bayesian frameworks, which can lead to more robust and more interpretable models.
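
One simple way to feed shape features into a downstream model is to reduce each persistence diagram to a handful of summary statistics. The sketch below computes the number of finite bars, total persistence, maximum persistence, and persistence entropy. The function name diagram_features and the toy diagram are assumptions for illustration, and this particular choice of statistics is one common option rather than the feature set used in this project.

```python
from math import isfinite, log


def diagram_features(diagram):
    """Summarize a persistence diagram as [count, total, max, entropy].

    Infinite bars and zero-length bars are ignored (hypothetical helper).
    """
    lifetimes = [d - b for b, d in diagram if isfinite(d) and d > b]
    total = sum(lifetimes)
    if total == 0:
        return [0.0, 0.0, 0.0, 0.0]
    # Persistence entropy: Shannon entropy of the normalized lifetimes.
    entropy = -sum((l / total) * log(l / total) for l in lifetimes)
    return [float(len(lifetimes)), total, max(lifetimes), entropy]


# Toy diagram with one infinite bar (illustrative data only).
vec = diagram_features([(0.0, 1.0), (0.0, 1.0), (0.0, 2.0), (0.0, float("inf"))])
```

Computing one such vector per sample (and per homology dimension) yields a tabular feature matrix that can be concatenated with conventional features before training a gradient-boosted model or a neural network.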

H2C — Pan-Cancer Gene Panel via Persistent Homology (IEEE COINS 2026)

In collaboration with Permillion, this project applies TDA to pan-cancer genomic datasets (TCGA plus supplementary public sources) to discover novel variables and hidden topological structures that conventional statistical methods often miss. Specifically, we apply persistent homology over the latent space of a topological autoencoder to extract topological features that persist across cancer-type embeddings, and use them to discover a compact pan-cancer gene panel with diagnostic and prognostic value.

This work resulted in the paper "H2C: A Pan-Cancer Gene Panel Discovered via Persistent Homology in Topological Autoencoder Latent Space" (Sunjun Hwang, Dohyun Hwang, Eunho Choi), accepted at IEEE COINS 2026 — IEEE International Conference on Omni-Layer Intelligent Systems.

Tools & Frameworks

GUDHI, Ripser, scikit-tda, giotto-tda, KeplerMapper, PyTorch, scikit-learn, Bioconductor, TCGA APIs

Related Publications

"H2C: A Pan-Cancer Gene Panel Discovered via Persistent Homology in Topological Autoencoder Latent Space" — IEEE COINS 2026 (Accepted, Jun 2026)