PCA, t-SNE, and UMAP: Modern Approaches to Dimension Reduction

written by Sean Law and Benjamin Zaitlen on 2018-06-12

Dimension reduction is the task of finding a low-dimensional representation of high-dimensional data. It is useful as a visualisation technique (by reducing to 2 or 3 dimensions) and as a pre-processing step for further machine learning tasks such as clustering or classification. This talk will provide an overview of different approaches to dimension reduction, including more recent ones like t-SNE, before introducing a new algorithm called UMAP.
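As a concrete illustration of reducing to 2 dimensions, here is a minimal sketch of PCA, the simplest of the three methods in the title, written with NumPy's singular value decomposition. The function name `pca_reduce` and the synthetic data are assumptions made for this example and do not come from the talk.

```python
import numpy as np

def pca_reduce(data, n_components=2):
    """Project the rows of `data` onto their top principal components."""
    # PCA operates on mean-centred data
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 10))  # synthetic data: 200 samples, 10 features
embedding = pca_reduce(points, n_components=2)
print(embedding.shape)  # (200, 2)
```

Each row of `embedding` is a 2-dimensional point that can be drawn on a scatter plot or passed to a clustering algorithm.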

The talk will begin with an introduction to dimension reduction in general, then build the theory that motivates UMAP and explain how the algorithm works. Next, we will look at implementation details, including how Numba allows the implementation to be both efficient and highly flexible (supporting custom distance metrics). Finally, the talk will discuss how future versions of UMAP will support semi-supervised dimension reduction and potentially operate directly on arbitrary pandas dataframes.
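To make the point about custom distance metrics concrete, here is a small sketch of a distance function compiled with Numba's `njit` decorator. The choice of Chebyshev distance and the sample vectors are assumptions for illustration only; the fallback decorator exists solely so the sketch still runs where Numba is not installed. The umap-learn documentation describes passing a Numba-compiled function of this shape as the `metric` argument, but that call is not shown here.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: fall back to plain Python without Numba
    def njit(func):
        return func

@njit
def chebyshev(a, b):
    """Largest absolute coordinate difference between two vectors."""
    largest = 0.0
    for i in range(a.shape[0]):
        diff = abs(a[i] - b[i])
        if diff > largest:
            largest = diff
    return largest

x = np.array([0.0, 1.0, 5.0])
y = np.array([2.0, 1.0, 1.0])
print(chebyshev(x, y))  # 4.0
```

Because the function is an explicit loop over plain arrays, Numba can compile it to machine code, which is what lets a user-supplied metric run at a speed comparable to a built-in one.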


Leland McInnes is a researcher at the Tutte Institute for Mathematics and Computing in Canada. Leland's background is in pure mathematics, but he currently works in data science and machine learning, with a focus on unsupervised learning and on fairness, accountability, and transparency in machine learning.