Abstract

In recent years, significant advances in next-generation sequencing technologies have made RNA sequencing (RNA-seq) a popular choice for studies of gene expression. Although microarrays and RNA-seq both aim to characterize transcriptional activity, the statistical tools developed for the analysis of the former are ill-suited to the latter. To date, methodological developments for RNA-seq data have mainly focused on normalization and differential analysis, but the testing procedures currently proposed lack power to detect differentially expressed genes, and little methodological research has been devoted to the identification of co-expressed genes in RNA-seq data. Yet as the cost of RNA-seq experiments continues to decrease, such studies are likely to replace microarrays in many investigations of the transcriptome. It is therefore crucial to pursue research on statistical methods that allow biologists to exploit RNA-seq data. In the MixStatSeq project, we focus on three main biological questions for RNA-seq data: (i) the detection of differentially expressed genes, (ii) the detection of co-expressed gene clusters, and (iii) the detection of invariant genes, i.e., those with stable expression across several biological conditions. To address these three questions, we propose to develop a suite of statistically sound methods based on mixture models.
For the analysis of differential expression, two points of view are envisaged. In the first, we aim to construct a powerful testing procedure by first performing a gene clustering step, followed by a testing procedure for each subgroup of genes and a correction for multiple testing. In the second, we will investigate model-based clustering procedures that directly cluster genes into groups representing differential and nondifferential expression.
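The first point of view (cluster the genes, test within each cluster, correct for multiple testing) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the project's procedure: it assumes that per-gene p-values and cluster labels are already available, and it uses the Benjamini-Hochberg step-up rule as a stand-in correction, since the text does not name one. The function names are hypothetical. Whether a per-cluster correction gives the desired error control over all genes depends on how the clusters were formed and is not addressed here.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if rejected at FDR level alpha."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the hypotheses, sorted by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


def bh_by_cluster(p_values, labels, alpha=0.05):
    """Apply the correction separately within each gene cluster (assumed design)."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    rejected = [False] * len(p_values)
    for members in clusters.values():
        flags = benjamini_hochberg([p_values[i] for i in members], alpha)
        for i, flag in zip(members, flags):
            rejected[i] = flag
    return rejected
```

Calling `bh_by_cluster(p_values, labels)` returns a flag per gene, so the result can be compared directly with a single correction applied to the pooled list of genes.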
For the detection of co-expressed gene clusters, we will extend our preliminary work on the use of mixture models. In particular, as the number of RNA-seq experiments will continue to increase in the coming years, it is crucial to develop variable selection procedures, as well as to incorporate external biological knowledge, in order to improve the interpretability of gene clustering.
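To make the idea of model-based clustering of count data concrete, the sketch below fits a mixture of independent Poisson distributions by the EM algorithm. The Poisson choice, the function name `poisson_mixture_em`, and its parameters are assumptions made for illustration; the text does not specify the mixture family, and the sketch ignores normalization, variable selection, and external biological knowledge.

```python
import math
import random


def poisson_mixture_em(counts, n_clusters, n_iter=50, init_rows=None, seed=0):
    """Fit a mixture of independent Poissons by EM (illustrative sketch).

    counts: list of genes, each a list of non-negative counts (one per condition).
    Returns (weights, means, labels), with labels from the last E-step.
    """
    n, d = len(counts), len(counts[0])
    if init_rows is None:
        # Assumed initialization: use randomly chosen genes as starting profiles.
        init_rows = random.Random(seed).sample(range(n), n_clusters)
    floor = 1e-8  # keeps logarithms finite for empty or all-zero clusters
    means = [[max(float(counts[i][j]), 0.5) for j in range(d)] for i in init_rows]
    weights = [1.0 / n_clusters] * n_clusters
    resp = []

    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each gene.
        # The log(y!) term is the same for every cluster, so it is dropped.
        resp = []
        for y in counts:
            log_p = [
                math.log(weights[k])
                + sum(y[j] * math.log(means[k][j]) - means[k][j] for j in range(d))
                for k in range(n_clusters)
            ]
            top = max(log_p)  # subtract the maximum for numerical stability
            unnorm = [math.exp(v - top) for v in log_p]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])

        # M-step: update mixing proportions and per-condition Poisson means.
        for k in range(n_clusters):
            mass = sum(r[k] for r in resp)
            weights[k] = max(mass / n, floor)
            for j in range(d):
                weighted = sum(r[k] * y[j] for r, y in zip(resp, counts))
                means[k][j] = max(weighted / max(mass, floor), floor)

    # Assign each gene to its most probable cluster.
    labels = [max(range(n_clusters), key=lambda k: r[k]) for r in resp]
    return weights, means, labels
```

EM only finds a local optimum, so in practice several initializations would be compared, and the number of clusters would be chosen with a model selection criterion rather than fixed in advance.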
For the detection of invariant genes, we aim to develop a non-asymptotic multiple hypothesis testing procedure to test a single distribution against a mixture of distributions, and to study its theoretical properties to ensure a powerful test. Beyond the biological application, such a development is a difficult theoretical challenge.
Throughout the MixStatSeq project, the team will foster collaborations with biologists from several laboratories to validate the chosen models and test the developed approaches on real RNA-seq data obtained from different organisms. The originality of the MixStatSeq project lies in the continuous exchange between theoretical, methodological and applied research, including assessment by biologists, in order to ensure the immediate potential impact of the developed procedures. Moreover, beyond the study of RNA-seq data, this project will provide new theoretical and methodological knowledge for the study of count data with mixtures.