One of the aims of dimension reduction is to find intrinsic coordinates that describe the data manifold. Manifold learning algorithms developed in machine learning return abstract coordinates; finding their physical or domain-related meaning is not formalized and is left to domain experts. In this talk, I propose a method to explain embedding coordinates of a manifold as non-linear compositions of functions from a user-defined dictionary. We show that this problem can be set up as a sparse linear Group Lasso recovery problem, find sufficient recovery conditions, and demonstrate its effectiveness on data. With this class of new methods, called ManifoldLasso, a scientist can specify a (large) set of functions of interest, and obtain from them intrinsic coordinates for her data in a semi-automatic, principled fashion.
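
To make the Group Lasso ingredient concrete, here is a minimal sketch of a generic Group Lasso solver using proximal gradient descent with block soft-thresholding. It is not the ManifoldLasso algorithm itself; the function names, the toy data, and the choice of solver are all illustrative assumptions. One could loosely read each coefficient group as belonging to one candidate dictionary function, so that a group shrunk exactly to zero means that function is discarded.

```python
import numpy as np

def group_soft_threshold(b, groups, thresh):
    """Shrink each coefficient group toward zero; zero it if its norm <= thresh."""
    out = np.zeros_like(b)
    for g in groups:
        norm = np.linalg.norm(b[g])
        if norm > thresh:
            out[g] = (1.0 - thresh / norm) * b[g]
    return out

def group_lasso(X, y, groups, lam, n_iter=500):
    """Minimize 0.5 * ||y - X b||^2 + lam * sum_g ||b_g||_2 by proximal gradient."""
    # Step size 1/L, where L is the Lipschitz constant of the smooth part.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)
        b = group_soft_threshold(b - step * grad, groups, step * lam)
    return b

# Toy problem (hypothetical data): three groups of two coefficients each,
# only the first group is truly active.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
b_true = np.array([1.5, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ b_true + 0.01 * rng.standard_normal(200)

b_hat = group_lasso(X, y, groups, lam=20.0)
active = [i for i, g in enumerate(groups) if np.linalg.norm(b_hat[g]) > 1e-8]
print("active groups:", active)
```

The penalty acts on whole groups rather than on individual coefficients, which is what lets the solver select or reject a dictionary function as a unit.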

In the more general case, when functions with physical meaning are not available, I will present a statistically founded methodology to estimate and then cancel out the distortions introduced by a manifold learning algorithm, thus effectively preserving the Riemannian geometry of the original data. This method builds on the relationship between the Laplace-Beltrami operator and the Riemannian metric on a manifold. The method can be taken further, to relax a manifold embedding towards isometry, or to optimize the embedding parameters in a data-driven fashion.
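
The relationship between the Laplace-Beltrami operator and the metric can be illustrated with the standard identity <grad f_i, grad f_j> = 1/2 [Δ(f_i f_j) - f_i Δ f_j - f_j Δ f_i], applied to the embedding coordinates f_i. The sketch below is an assumption-laden toy, not the megaman implementation: `dual_metric` is a hypothetical helper, and a finite-difference second-derivative matrix on a line stands in for a Laplacian estimated from data.

```python
import numpy as np

def dual_metric(L, Y):
    """Estimate the dual metric H[p, i, j] = <grad f_i, grad f_j> at each point p.

    L : (n, n) matrix approximating the Laplace-Beltrami operator
        (sign convention: L f approximates the second derivative of f;
        many graph Laplacians use the opposite sign).
    Y : (n, m) embedding coordinates, one column per coordinate function.
    """
    n, m = Y.shape
    LY = L @ Y
    H = np.empty((n, m, m))
    for i in range(m):
        for j in range(m):
            H[:, i, j] = 0.5 * (
                L @ (Y[:, i] * Y[:, j]) - Y[:, i] * LY[:, j] - Y[:, j] * LY[:, i]
            )
    return H

# Toy example (hypothetical): points on the unit interval, "embedded" by f(x) = 2x,
# which stretches all lengths by a factor of 2.
n = 50
x = np.linspace(0.0, 1.0, n)
dx = x[1] - x[0]
L = (np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1) - 2.0 * np.eye(n)) / dx**2
Y = (2.0 * x)[:, None]

H = dual_metric(L, Y)
H_interior = H[1:-1]                  # boundary rows of the stencil are not valid
G_interior = np.linalg.pinv(H_interior)  # metric = pseudo-inverse of the dual metric
print(H_interior[0, 0, 0], G_interior[0, 0, 0])
```

In this toy case the dual metric at interior points equals 4 and the metric equals 1/4, so measuring lengths in the stretched embedding with the estimated metric undoes the factor-of-2 distortion.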

All the methods described are implemented in the Python package megaman, and can be applied to data sets of up to a million points.

This work is part of Marina Meila's current research program, "Unsupervised Validation for Unsupervised Learning," which aims to design broad-ranging, mathematically and statistically grounded methods to interpret, verify, and validate the output of unsupervised machine learning algorithms with a minimum of assumptions and of human intervention.