The Data Scientist’s Guide to Topological Data Analysis: Preamble
Topological Data Analysis, abbreviated TDA, is a suite of data analytic methods inspired by the mathematical field of algebraic topology. TDA is attractive yet elusive for most data scientists, since its potential as a data exploration tool is often communicated through esoteric terminology unfamiliar to non-mathematicians. The purpose of this guide is to bridge the communication gap between academia and industry, so that non-mathematician data scientists may add current TDA methods to their analytic toolkits and anticipate new developments in the field of TDA.
The guide begins with an overview of Mapper, a TDA algorithm that has recently transitioned from academia to industry with commercial success. We explain the Mapper algorithm, demo open-source software, and present a handful of its commercial use-cases (some of which are original). Then we switch to persistent homology, a TDA method that has not yet broken through to industry but is supported by a growing body of academic work. We explain the intuition behind homotopy, approximation, homology, and persistence, and demo open-source persistent homology software. We hope that data scientists reading this guide will be inspired to try Mapper in their own analytic work, and to watch for developments in persistent homology that push it from academia to industry.
Mapper
- Algorithm. The Mapper algorithm summarizes high-dimensional data as a small network (a graph of nodes and edges) that retains the main topological features of the data and is easy to visualize.
- Software. To run the Mapper algorithm on small to medium-size datasets, one can use the open-source R package TDAmapper.
- Use-Cases at Ayasdi. On a larger scale, Mapper has been used commercially by the company Ayasdi to forecast returns, detect fraud, aid in oil and gas exploration, plan ad campaigns, and discover biomarkers.
- Use-Cases at Aunalytics. At Aunalytics, Mapper (via R's TDAmapper) yielded granular insights on a location-tracking dataset, and was still informative on a sparse call-center dataset even though the resulting network had little cohesion.
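To make the Algorithm bullet above concrete, here is a minimal sketch of the Mapper pipeline in plain Python. It is not the TDAmapper implementation. The function names (`mapper`, `single_linkage_clusters`), the parameters (`n_intervals`, `overlap`, `eps`), and the choice of threshold-based single-linkage clustering are all assumptions made for illustration. The sketch covers the range of a filter function with overlapping intervals, clusters the points that fall in each interval, and connects two clusters whenever they share a point.

```python
import math
from itertools import combinations


def single_linkage_clusters(indices, points, eps):
    """Split `indices` into connected components, linking points at distance <= eps."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in remaining if math.dist(points[i], points[j]) <= eps}
            remaining -= near
            component |= near
            frontier.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper(points, filter_fn, n_intervals=4, overlap=0.5, eps=0.5):
    """Illustrative Mapper: returns (nodes, edges); each node is a set of point indices."""
    values = [filter_fn(p) for p in points]
    lo, hi = min(values), max(values)
    # Interval length chosen so that n_intervals overlapping intervals cover [lo, hi].
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    nodes = []
    for k in range(n_intervals):
        start = lo + k * step
        # Pin the last interval to hi so rounding cannot drop the maximum point.
        end = hi if k == n_intervals - 1 else start + length
        members = [i for i, v in enumerate(values) if start <= v <= end]
        nodes.extend(single_linkage_clusters(members, points, eps))
    # Two nodes are joined by an edge when their clusters share at least one point.
    edges = {(a, b) for a, b in combinations(range(len(nodes)), 2) if nodes[a] & nodes[b]}
    return nodes, edges


# Toy data: 24 points on the unit circle, filtered by their x-coordinate.
points = [(math.cos(2 * math.pi * k / 24), math.sin(2 * math.pi * k / 24)) for k in range(24)]
nodes, edges = mapper(points, filter_fn=lambda p: p[0])
print(len(nodes), "nodes,", len(edges), "edges")
```

On this toy circle the resulting network has 6 nodes and 6 edges arranged in a single cycle, so the graph keeps the one loop of the underlying shape. Real datasets need more care in choosing the filter function, the cover, and the clustering method.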
Persistent Homology
- Homotopy. Algebraic topology aims to describe the connectivity of a space. One way it does this is through homotopy groups, which, loosely speaking, capture the "loops" (holes) of a space in each dimension.
- Approximation. In computational topology, a dataset can be interpreted as a sample taken from an underlying topological space, and for any given margin of error a topological space (in practice, a simplicial complex) can be constructed from the sample to approximate the underlying space.
- Homology. Homotopy groups are extremely difficult to compute in high dimensions. Homology is a related concept that also detects holes in each dimension and is generally far easier to compute.
- Persistence. Persistence barcode plots show which topological features persist through many scales of the data, and can be used to calculate similarity between different spaces.
- Software. To compute persistent homology of small to medium-size datasets, one can use the open-source R package TDA.
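As a rough illustration of the Persistence bullet above, the sketch below computes the barcode for connected components only (dimension 0) in plain Python. It does not use the R package TDA. The function name `h0_barcode` is hypothetical, and we assume the convention that two points become connected once the scale reaches the distance between them (some software uses half that distance instead). Every point starts as its own component at scale 0, and a bar ends at the scale where its component merges into another.

```python
import math
from itertools import combinations


def h0_barcode(points):
    """Return (birth, death) bars for connected components as the scale grows."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process all pairwise distances from smallest to largest.
    pairs = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    bars = []
    for dist, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj           # two components merge at this scale
            bars.append((0.0, dist))  # so one of them dies here
    bars.append((0.0, math.inf))      # the last surviving component never dies
    return bars


# Toy data: two tight groups of three points, far apart from each other.
points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
bars = h0_barcode(points)
for birth, death in bars:
    print(f"[{birth:.2f}, {death:.2f})")
```

Four of the six bars end at a scale of about 0.1 and can be read as noise within each group. The two long bars (one ending near 7, one infinite) persist across many scales and indicate that the data has two well-separated clusters. Loops and higher-dimensional features require a full homology computation, which this sketch does not attempt.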