Researchers: Joaquín Fontbona, Christopher Paul Ley, Daniel Remenik and Felipe Tobar
Coordinator: Joaquín Fontbona
The Data Science (DS) group at CMM seeks to take advantage of the various opportunities offered by the data revolution and to address its myriad open questions and challenges, with a distinctive approach based on mathematical thinking and modeling. The group’s approach to DS is strongly interdisciplinary, with a team of mathematicians, computer scientists, physicists and engineers who collaborate with experts in fields of application to provide solutions spanning the entire spectrum of data science and practice. At the same time, members of the group collaborate actively with other applied research groups at CMM on data-intensive challenges. In this sense, and since DS is becoming an increasingly relevant asset in most areas of science and technology, the efforts of the CMM-Data group are transversal to the activities of the center.
The DS activities at CMM take place at three different levels:
- Applied DS projects and technology transfer. Here the focus is on the design and implementation of innovative solutions for data-related challenges, where we seek to contribute to the improvement of processes in both the private and public sectors. In practice, our methodology is twofold: we implement and adapt solutions based on existing theory, and we develop new theory to address situations in which off-the-shelf methods fail to deliver. Our field experience in applied DS includes projects in areas such as bioinformatics and astroinformatics, marketing, retail, security, finance, audio processing and natural language processing.
- Basic research in DS. We seek to address technical and theoretical issues in the development, interpretation and training of machine learning (ML) models, as well as the application of automatic (i.e. artificial intelligence) methods to various domains. We approach these problems from a mathematical perspective, as we believe that mathematical methods and insights have much to contribute to consolidating the rigorous practice of DS; at the same time, this challenge will require and lead to the development of novel and interesting mathematics. Two particular subjects of interest are: (i) the interplay between the remarkable success and generalization properties of neural networks and their interpretability, as well as the theoretical understanding of probabilistic and PDE models based on first principles; (ii) the Bayesian approach to ML, which is better suited to represent and take advantage of uncertainty than its frequentist counterparts, but whose development has lagged due to intractabilities for general latent variable models, and where insights from mathematical theories such as optimal transport should play a key role.
- Education. Making modern DS methods available to professionals and the general public is essential to achieve the technological literacy needed in our country. Here, CMM-Data operates on three fronts: (i) formal education programs, in particular the newly created Master of Data Science at University of Chile, of which the center is one of the main drivers; (ii) continuing education, in particular through tailored courses on DS methods and modeling for industry and public sector institutions; (iii) outreach activities, a highlight of which is the CMM-Data Days series of public events focused on data-driven subjects with social value (past editions have addressed subjects such as digital health, smart agriculture, the Internet of Things, and voting mechanisms and representativity).
Astroinformatics
A new generation of astronomical observatories is producing data at unprecedented rates. Extracting physical knowledge from this data requires new skills and interdisciplinary working methodologies. These include ingesting large volumes of data in real time, managing and processing data in a distributed fashion, developing new statistical and machine learning methods, visualizing data, connecting to an interoperable infrastructure, and developing new methods for inferring physical parameters. At the astroinformatics laboratory we are currently developing the ALeRCE project, which draws on many of the tools listed above within an interdisciplinary approach. This project involves several institutions in Chile (e.g. MAS, DO) and abroad (e.g. Caltech, U. Harvard, U. Washington), and it currently serves a community of users in more than 60 countries. Apart from ALeRCE, the astroinformatics laboratory has been involved in projects such as the discovery of supernovae in real time, the inverse problem in radio interferometry, and the galaxy classification problem.
NLHPC
CMM hosts the National Laboratory for High Performance Computing (NLHPC), which is the national supercomputing center in Chile. This laboratory manages Guacolda-Leftraru, the most powerful supercomputer in Chile and one of the largest in South America.
The laboratory was created in 2010 in partnership with several universities and research centers (at present more than 45 institutions) to meet the growing demand for processing large volumes of data and simulating very complex systems. Typical workloads include large optimization problems, astronomical data processing, simulation of particle systems and complex physical systems, and a wide range of data science methods, to mention a few.
Some featured projects:
- Astroinformatics: High cadence Transient Survey (HiTS), real-time detection of transient events
- Astroinformatics: ALeRCE (Automatic Learning for the Rapid Classification of Events)
- National Laboratory for High Performance Computing (NLHPC)
- Marketing analytics: collaboration with NoiseGrasp
- Group of Machine leArning, infErence and Signals (GAMES)
- The Multioutput Gaussian process Toolkit
- University of Chile’s Master of Data Science