Length: 3 hours (half day)
Intended Audience: This tutorial is intended for researchers and practitioners from a variety of areas,
e.g., cancer research, health care, communication, business processes, and
databases, who are interested in integrating information (several data
sets of the same type or data sets of distinct types) to filter noise, and in
applying machine learning and statistical methods to identify objects of interest, e.g.,
the true mutations in the DNA of cancer patients.
Description: Big data has tremendous potential to transform businesses and research but raises
significant challenges in pre-processing, extracting useful information, and
integrating information to identify objects of interest. In this tutorial, I will present
some statistical and machine learning methods for fusion and analysis of big data in
cancer research, e.g., DNA sequencing data, gene expression data (RNA-seq) from
The Cancer Genome Atlas (TCGA), protein expression data, and clinical features of
cancer patients. This tutorial aims to cover both useful statistical/data mining
methods and cutting-edge directions.
Topics include the following: (1) integration of data sets to filter noise in the
information, (2) sampling of big data to reduce the computational burden while retaining
a certain level of prediction accuracy, (3) applying machine learning/statistics to identify true
objects, e.g., true mutations in DNA sequencing data of cancer patients, and (4)
integration of distinct types of information to identify objects, e.g., using DNA,
RNA gene expression and protein data, and clinical features of cancer patients to
find novel drug targets for cancers and identify prognostic markers for cancer patients.
Prerequisites: Basic knowledge of probability and statistics, data mining or
databases will be helpful.
Presenter: Grace S. Shieh
Grace S. Shieh is a full research fellow/professor at the Institute of Statistical Science,
Academia Sinica/National Taiwan University. She received her PhD in Statistics
from the University of Wisconsin-Madison, taught at the University of Missouri-Columbia
from 1990 to 1994, and joined ISS-AS in 1994; she branched into computational biology
in 2000. Her research expertise includes integration of data (information),
information quality, machine learning, directional statistics and association. She has
worked on problems of integrating distinct types of information (data) to uncover
novel drug targets and find prognostic markers for cancers, preprocessing in
information fusion, and integrating several data sets (especially data from cutting-edge
biotechnologies such as next-generation sequencing) to identify true mutations in
DNA, among others. Her research was funded by government agencies as well as IT
companies such as Taiwan Semiconductor Manufacturing Company. She has
published numerous papers and is an elected fellow of the International Statistical
Institute. She has served as a committee member, session chair, organizer and
workshop/tutorial lecturer for numerous international conferences. She is also an
associate editor for Statistical Methodology, Frontiers in Statistical Methodology
and Genetics and STAT.