We are overwhelmed with data of different types and sizes. In everyday life, we prefer to make decisions by considering all the available information, and often find that the inclusion of even seemingly circumstantial evidence provides an advantage. It would be great if our computational methods could conjecture in a similar way. In molecular biology, incorporating all the data, especially indirect information, may provide substantial gains in the accuracy of predictions and help to deepen our overall understanding of the problem data.
In this tutorial, we will present a high-level introduction to biomedical data integration and data fusion, highlighting the state of the art in the field as well as outstanding computational challenges. These include an introduction to the goals of typical data fusion studies and the bioinformatics techniques currently available to achieve them. We will then focus on matrix factorization-based data fusion approaches. We will introduce a technique for fusing any number of data sets that is agnostic to any specific model organism or data domain, discuss our own work on large-scale data fusion, and demonstrate our data fusion toolbox on several studies. The tutorial will include an easy-to-follow explanation of relevant mathematical concepts and a showcase with benchmarks and examples that attendees will be able to reproduce during the tutorial through either programmatic access to our toolbox or via tinkering in a user-friendly visual programming environment.
Data fusion approaches that will be discussed in detail in the tutorial build upon latent models and matrix factorization to collectively address possibly large volumes of heterogeneous data and construct accurate prediction models. In particular, for problems with heterogeneity in both the prediction task and data, we have recently proposed a data mining approach that can jointly learn in multiple related tasks with overlapping, partially overlapping or even completely different feature spaces. The factorization-based data fusion leaves the data in their own domain spaces, requires no or minimal data engineering at the input, and is able to leverage heterogeneous data to improve learning performance. Just like with Lego bricks, the approach assembles the data in a mosaic called a data fusion graph that through joint compression of all data matrices yields a model wherein every piece of evidence counts, even if distantly related to the prediction task. The method we will present can be applied both to unsupervised and supervised modeling of the data, and provides a venue to jointly collectively consider a plethora of information available in molecular biology.
The goal of the tutorial is to familiarize the audience with the concepts and methods of data fusion. The presentation of computational approaches will be accompanied with practical training using visual programming and scripting tools. By the end of the tutorial, the attendees should be able to execute data fusion-based analysis on their own data and use data fusion in functional genomics and gene prioritization. Our goals are to:
To achieve these goals we will intertwine short lectures with hand-son activities, where attendees will have the opportunity to work with our data fusion software and explore the benefits of data fusion in a series of real-world examples.
A basic knowledge of data mining and omics is assumed. We will make every effort to explain data fusion intuitively and in simple terms, thus trying to reach out to both biologists and computational scientists. We will present the mathematical background of the data fusion methodology, guide the hands-on parts of the tutorial, and use Python programming in a short section of tutorial, taking care not to be too technical and with the aim of motivating the computational scientists in the audience to explore further.
Computational scientists, data mining researchers, and molecular biologists with interest in data analysis and large-scale data integration.
Attendees are encouraged to bring their own laptops to the tutorial and to install our visual programming software prior to the event. A special installation of Orange data mining suite will be prepared for the tutorial and will be accompanied with the data needed to perform the case studies.
Tuesday, August 25 2015
Coffee break (2:30-3:00pm)
Data fusion with case studies (3:00-5:00pm)
Discussion and conclusions (5:00-5:30pm)
Marinka Žitnik (University of Ljubljana Slovenia)
Blaž Zupan (University of Ljubljana, Slovenia & Baylor College of Medicine, Houston, USA)
Blaz Zupan (http://www.fri.uni-lj.si/en/blaz-zupan/default.html) is professor of computer science at the University of Ljubljana, and a visiting professor at Baylor College of Medicine in Houston. His research focuses on methods for data mining and applications in bioinformatics and systems biology. Marinka Zitnik (http://helikoid.si) is currently working towards her PhD in computer science. Her research interests include machine learning, optimization and matrix analysis.
Blaz Zupan is coauthor of Orange (http://orange.biolab.si), a Python-based and visual programming data mining suite, and several bioinformatics web applications, such as dictyExpress for gene expression analytics (http://dictyexpress.biolab.si) and GenePath for epistasis analysis (http://genepath.biolab.si).
Together with Marinka Zitnik, they developed Nimfa, a Python programming library for nonnegative matrix factorization (http://nimfa.biolab.si). They have recently coorganized several workshops (http://goo.gl/kt1Ao9, http://goo.gl/7EdjFp) with a few hundred audience members about techniques for large-scale functional genomics data analysis in Orange. Workshops included lectures and hands-on activities with practical exercises in data mining for researchers from the University of Ljubljana, Baylor College of Medicine, Houston, and the University of Toronto.
Figure 1: The following figure shows a data fusion workflow in Orange data mining toolbox (http://orange.biolab.si) with an example on mining user-movie preferences and additionally using information on movie actors and genres. The tutorial will include a hands-on part where we will use Orange and its data fusion add-on to learn about matrix tri-factorization, data fusion, model scoring and post-processing of latent factors to gain understanding of the data.