Thomas Moreau

A lower bound and a near-optimal algorithm for bilevel empirical risk minimization
Mathieu Dagréou, Thomas Moreau, Samuel Vaiter, Pierre Ablin, May 2024, In proceedings of AISTATS

Bilevel optimization problems, which are problems where two optimization problems are nested, have more and more applications in machine learning. In many practical cases, the upper and the lower objectives correspond to empirical risk minimization problems and therefore have a sum structure. In this context, we propose a bilevel extension of the celebrated SARAH algorithm. We demonstrate that the algorithm requires O((n+m)^{1/2}ε^{−1}) gradient computations to achieve ε-stationarity with n+m the total number of samples, which improves over all previous bilevel algorithms. Moreover, we provide a lower bound on the number of oracle calls required to get an approximate stationary point of the objective function of the bilevel problem. This lower bound is attained by our algorithm, which is therefore optimal in terms of sample complexity.
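For context, the single-level SARAH algorithm that this work extends builds its gradient estimate recursively. The sketch below states that standard recursion only; the symbols f_i, w_t, v_t, i_t and the step size \gamma are notation assumed here and do not come from the abstract, and the bilevel extension itself is not reproduced.

```latex
% Finite-sum objective: f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)
% Start of an outer loop: one full gradient
v_0 = \nabla f(w_0), \qquad w_1 = w_0 - \gamma v_0
% Inner loop: draw i_t uniformly in \{1, \dots, n\}, then update recursively
v_t = \nabla f_{i_t}(w_t) - \nabla f_{i_t}(w_{t-1}) + v_{t-1},
\qquad w_{t+1} = w_t - \gamma v_t
```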

FaDIn: Fast Discretized Inference for Hawkes Processes with General Parametric Kernels
Guillaume Staerman, Cédric Allain, Alexandre Gramfort & Thomas Moreau, Jul 2023, In proceedings of International Conference on Machine Learning (ICML)

Temporal point processes (TPP) are a natural tool for modeling event-based data. Among all TPP models, Hawkes processes have proven to be the most widely used, mainly due to their simplicity and computational ease when considering exponential or non-parametric kernels. Although non-parametric kernels are an option, such models require large datasets. While exponential kernels are more data efficient and relevant for certain applications where events immediately trigger more events, they are ill-suited for applications where latencies need to be estimated, such as in neuroscience. This work aims to offer an efficient solution to TPP inference using general parametric kernels with finite support. The developed solution consists of a fast L2 gradient-based solver leveraging a discretized version of the events. After supporting the use of discretization theoretically, the statistical and computational efficiency of the novel approach is demonstrated through various numerical experiments. Finally, the effectiveness of the method is evaluated by modeling the occurrence of stimuli-induced patterns from brain signals recorded with magnetoencephalography (MEG). Given the use of general parametric kernels, results show that the proposed approach leads to a more plausible estimation of pattern latency compared to the state-of-the-art.
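To illustrate what working with a discretized version of the events can look like, here is a minimal sketch that evaluates the intensity of a univariate Hawkes process on a regular grid with a finite-support kernel. It is not the FaDIn solver: the function name, its arguments, and the rectangular kernel used in the example are assumptions made for illustration.

```python
import numpy as np


def discretized_intensity(events, T, delta, baseline, alpha, kernel, support):
    """Intensity of a univariate Hawkes process on a regular grid (illustrative).

    events   : event times in [0, T]
    delta    : grid step
    baseline : constant background rate
    alpha    : excitation scale
    kernel   : function returning kernel values at positive lags
    support  : kernel support length (the kernel is zero beyond it)
    """
    n_grid = int(round(T / delta)) + 1
    # Project each event onto its nearest grid point and count events per bin.
    idx = np.clip(np.round(np.asarray(events) / delta).astype(int), 0, n_grid - 1)
    counts = np.bincount(idx, minlength=n_grid)
    # Sample the kernel on the grid; lag 0 is set to zero so that an event
    # does not excite itself.
    lags = np.arange(1, int(round(support / delta)) + 1) * delta
    phi = np.concatenate([[0.0], kernel(lags)])
    # Causal discrete convolution:
    # intensity[k] = baseline + alpha * sum_j phi[j] * counts[k - j]
    excitation = np.convolve(counts, phi)[:n_grid]
    return baseline + alpha * excitation


# Example: one event at t = 1.0 and a rectangular kernel of support 1.0.
intensity = discretized_intensity(
    events=[1.0], T=3.0, delta=0.5, baseline=0.2, alpha=0.5,
    kernel=lambda t: np.ones_like(t), support=1.0,
)
print(intensity)
```

On the grid, the cost of one intensity evaluation depends on the number of grid points and on the kernel support rather than on pairwise event differences, which is the kind of saving discretization is meant to provide.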

Sliced-Wasserstein on Symmetric Positive Definite Matrices for M/EEG Signals
Clément Bonet, Benoît Malézieux, Alain Rakotomamonjy, Lucas Drumetz, Thomas Moreau, Matthieu Kowalski, Nicolas Courty, Jul 2023, In proceedings of International Conference on Machine Learning (ICML)

When dealing with electro or magnetoencephalography records, many supervised prediction tasks are solved by working with covariance matrices to summarize the signals. Learning with these matrices requires using Riemannian geometry to account for their structure. In this paper, we propose a new method to deal with distributions of covariance matrices and demonstrate its computational efficiency on M/EEG multivariate time series. More specifically, we define a Sliced-Wasserstein distance between measures of symmetric positive definite matrices that comes with strong theoretical guarantees. Then, we take advantage of its properties and kernel methods to apply this distance to brain-age prediction from MEG data and compare it to state-of-the-art algorithms based on Riemannian geometry. Finally, we show that it is an efficient surrogate to the Wasserstein distance in domain adaptation for Brain Computer Interface applications.

Test like you Train in Implicit Deep Learning
Zaccharie Ramzi, Pierre Ablin, Gabriel Peyré, Thomas Moreau, May 2023, preprint

Implicit deep learning has recently gained popularity with applications ranging from meta-learning to Deep Equilibrium Networks (DEQs). In its general formulation, it relies on expressing some components of deep learning pipelines implicitly, typically via a root equation called the inner problem. In practice, the solution of the inner problem is approximated during training with an iterative procedure, usually with a fixed number of inner iterations. During inference, the inner problem needs to be solved with new data. A popular belief is that increasing the number of inner iterations compared to the one used during training yields better performance. In this paper, we question such an assumption and provide a detailed theoretical analysis in a simple setting. We demonstrate that overparametrization plays a key role: increasing the number of iterations at test time cannot improve performance for overparametrized networks. We validate our theory on an array of implicit deep-learning problems. DEQs, which are typically overparametrized, do not benefit from increasing the number of iterations at inference while meta-learning, which is typically not overparametrized, benefits from it.
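The notion of an inner problem solved with a fixed budget of iterations can be sketched on a toy fixed-point layer. Everything below is a hypothetical illustration: the map `inner_map`, the matrix `W`, and the iteration budgets are assumptions, not the models studied in the paper.

```python
import numpy as np


def inner_map(z, x, W):
    """Hypothetical implicit layer: the inner problem is the root of inner_map(z) - z."""
    return np.tanh(W @ z + x)


def solve_inner(x, W, n_iter):
    """Approximate the fixed point with a fixed budget of inner iterations."""
    z = np.zeros_like(x)
    for _ in range(n_iter):
        z = inner_map(z, x, W)
    return z


def residual(z, x, W):
    """Norm of the root equation; it is zero at an exact solution."""
    return np.linalg.norm(inner_map(z, x, W) - z)


# W has spectral norm 0.8 and tanh is 1-Lipschitz, so the map is a contraction.
W = np.array([[0.0, 0.8], [-0.8, 0.0]])
x = np.array([1.0, -2.0])

z_train = solve_inner(x, W, n_iter=5)   # budget used "during training"
z_test = solve_inner(x, W, n_iter=20)   # larger budget "at inference"
print(residual(z_train, x, W), residual(z_test, x, W))
```

A larger budget brings the iterate closer to the exact root here. Whether that closer solution translates into better test performance is precisely the assumption the paper questions.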

Using convolutional dictionary learning to detect task-related neuromagnetic transients and ageing trends in a large open-access dataset
Lindsey Power, Cédric Allain, Thomas Moreau, Alexandre Gramfort, Timothy Bardouille, Feb 2023, NeuroImage

Human neuromagnetic activity is characterised by a complex combination of transient bursts with varying spatial and temporal characteristics. The characteristics of these transient bursts change during task performance and normal ageing in ways that can inform about underlying cortical sources. Many methods have been proposed to detect transient bursts, with the most successful ones being those that employ multi-channel, data-driven approaches to minimize bias in the detection procedure. There has been little research, however, into the application of these data-driven methods to large datasets for group-level analyses. In the current work, we apply a data-driven convolutional dictionary learning (CDL) approach to detect neuromagnetic transient bursts in a large group of healthy participants from the Cam-CAN dataset. CDL was used to extract repeating spatiotemporal motifs in 538 participants between the ages of 18-88 during a sensorimotor task. Motifs were then clustered across participants based on similarity, and relevant task-related clusters were analysed for age-related trends in their spatiotemporal characteristics. Seven task-related motifs resembling known transient burst types were identified through this analysis, including beta, mu, and alpha type bursts. All burst types showed positive trends in their activation levels with age that could be explained by increasing burst rate with age. This work validated the data-driven CDL approach for transient burst detection on a large dataset and identified robust information about the complex characteristics of human brain signals and how they change with age.

A framework for bilevel optimization that enables stochastic and global variance reduction algorithms
Mathieu Dagréou, Pierre Ablin, Samuel Vaiter, Thomas Moreau, Nov 2022, In proceedings of Advances in Neural Information Processing Systems (NeurIPS)

Bilevel optimization, the problem of minimizing a value function which involves the arg-minimum of another function, appears in many areas of machine learning. In a large scale setting where the number of samples is huge, it is crucial to develop stochastic methods, which only use a few samples at a time to progress. However, computing the gradient of the value function involves solving a linear system, which makes it difficult to derive unbiased stochastic estimates. To overcome this problem we introduce a novel framework, in which the solution of the inner problem, the solution of the linear system, and the main variable evolve at the same time. These directions are written as a sum, making it straightforward to derive unbiased estimates. The simplicity of our approach allows us to develop global variance reduction algorithms, where the dynamics of all variables is subject to variance reduction. We demonstrate that SABA, an adaptation of the celebrated SAGA algorithm in our framework, has an O(1/T) convergence rate, and that it achieves linear convergence under the Polyak-Łojasiewicz assumption. This is the first stochastic algorithm for bilevel optimization that verifies either of these properties. Numerical experiments validate the usefulness of our method.
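The linear system mentioned in this abstract comes from the implicit function theorem. As a sketch, under the assumption that the inner function G is smooth and strongly convex in z, and with notation (F, G, h, z^*, v^*) chosen here for illustration rather than taken from the paper:

```latex
% Bilevel problem: outer variable x, inner variable z
\min_{x} \; h(x) = F\bigl(z^*(x), x\bigr)
\quad \text{with} \quad
z^*(x) = \arg\min_{z} G(z, x)

% Gradient of the value function
\nabla h(x) = \nabla_x F\bigl(z^*(x), x\bigr)
  - \nabla^2_{xz} G\bigl(z^*(x), x\bigr) \, v^*(x)

% where v^*(x) solves the linear system
\nabla^2_{zz} G\bigl(z^*(x), x\bigr) \, v^*(x) = \nabla_z F\bigl(z^*(x), x\bigr)
```

Because the inverse Hessian enters this expression nonlinearly, plugging in sampled gradients and Hessians does not give an unbiased estimate of \nabla h(x), which is the difficulty the abstract refers to.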

Deep invariant networks with differentiable augmentation layers
Cédric Rommel, Thomas Moreau, Alexandre Gramfort, Nov 2022, In proceedings of Advances in Neural Information Processing Systems (NeurIPS)

Designing learning systems which are invariant to certain data transformations is critical in machine learning. Practitioners can typically enforce a desired invariance on the trained model through the choice of a network architecture, e.g. using convolutions for translations, or using data augmentation. Yet, enforcing true invariance in the network can be difficult, and data invariances are not always known a priori. State-of-the-art methods for learning data augmentation policies require held-out data and are based on bilevel optimization problems, which are complex to solve and often computationally demanding. In this work we investigate new ways of learning invariances only from the training data. Using learnable augmentation layers built directly in the network, we demonstrate that our method is very versatile. It can incorporate any type of differentiable augmentation and be applied to a broad class of learning problems beyond computer vision. We provide empirical evidence showing that our approach is easier and faster to train than modern automatic data augmentation techniques based on bilevel optimization, while achieving comparable results. Experiments show that while the invariances transferred to a model through automatic data augmentation are limited by the model expressivity, the invariance yielded by our approach is insensitive to it by design.

Benchopt: Reproducible, efficient and collaborative optimization benchmarks
Thomas Moreau, Mathurin Massias, Alexandre Gramfort, et al., Nov 2022, In proceedings of Advances in Neural Information Processing Systems (NeurIPS)

Numerical validation is at the core of machine learning research as it allows us to assess the actual impact of new methods, and to confirm the agreement between theory and practice. Yet, the rapid development of the field poses several challenges: researchers are confronted with a profusion of methods to compare, limited transparency and consensus on best practices, as well as tedious re-implementation work. As a result, validation is often very partial, which can lead to wrong conclusions that slow down the progress of research. We propose Benchopt, a collaborative framework to automate, publish and reproduce optimization benchmarks in machine learning across programming languages and hardware architectures. Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing and extending experiments. To demonstrate its broad usability, we showcase benchmarks on three standard ML tasks: ℓ2-regularized logistic regression, Lasso and ResNet18 training for image classification. These benchmarks highlight key practical findings that give a more nuanced view of state-of-the-art for these problems, showing that for practical evaluation, the devil is in the details.

Data augmentation for learning predictive models on EEG: a systematic comparison
Cédric Rommel, Joseph Paillard, Thomas Moreau, Alexandre Gramfort, Nov 2022, Journal of Neural Engineering

Objective. The use of deep learning for electroencephalography (EEG) classification tasks has been rapidly growing in the last years, yet its application has been limited by the relatively small size of EEG datasets. Data augmentation, which consists in artificially increasing the size of the dataset during training, can be employed to alleviate this problem. While a few augmentation transformations for EEG data have been proposed in the literature, their positive impact on performance is often evaluated on a single dataset and compared to one or two competing augmentation methods. This work proposes to better validate the existing data augmentation approaches through a unified and exhaustive analysis.
Approach. We compare quantitatively 13 different augmentations with two different predictive tasks, datasets and models, using three different types of experiments.
Main results. We demonstrate that employing the adequate data augmentations can bring up to 45% accuracy improvements in low data regimes compared to the same model trained without any augmentation. Our experiments also show that there is no single best augmentation strategy, as the good augmentations differ on each task.
Significance. Our results highlight the best data augmentations to consider for sleep stage classification and motor imagery brain–computer interfaces. More broadly, it demonstrates that EEG classification tasks benefit from adequate data augmentation.

SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models
Zaccharie Ramzi, Florian Mannel, Shaojie Bai, Jean-Luc Starck, Philippe Ciuciu, Thomas Moreau, Apr 2022, In proceedings of International Conference on Learning Representations (ICLR)

In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks. While their training is memory-efficient, they are still significantly slower to train than their explicit counterparts. In Deep Equilibrium Models (DEQs), the training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix. In this paper, we propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer. The main idea is to use the quasi-Newton matrices from the forward pass to efficiently approximate the inverse Jacobian matrix in the direction needed for the gradient computation. We provide a theorem that motivates using our method with the original forward algorithms. In addition, by modifying these forward algorithms, we further provide theoretical guarantees that our method asymptotically estimates the true implicit gradient. We empirically study this approach in many settings, ranging from hyperparameter optimization to large Multiscale DEQs applied to CIFAR and ImageNet. We show that it reduces the computational cost of the backward pass by up to two orders of magnitude. All this is achieved while retaining the excellent performance of the original models in hyperparameter optimization and on CIFAR, and giving encouraging and competitive results on ImageNet.

Dictionary and prior learning with unrolled algorithms for unsupervised inverse problems
Benoît Malézieux, Thomas Moreau, Matthieu Kowalski, Apr 2022, In proceedings of International Conference on Learning Representations (ICLR)

Dictionary learning consists of finding a sparse representation from noisy data and is a common way to encode data-driven prior knowledge on signals. Alternating minimization (AM) is standard for the underlying optimization, where gradient descent steps alternate with sparse coding procedures. The major drawback of this method is its prohibitive computational cost, making it impractical on large real-world data sets. This work studies an approximate formulation of dictionary learning based on unrolling and compares it to alternating minimization to find the best trade-off between speed and precision. We analyze the asymptotic behavior and convergence rate of gradient estimates in both methods. We show that unrolling performs better on the support of the inner problem solution and during the first iterations. Finally, we apply unrolling on pattern learning in magnetoencephalography (MEG) with the help of a stochastic algorithm and compare the performance to a state-of-the-art method.

CADDA: Class-wise Automatic Differentiable Data Augmentation for EEG Signals
Cédric Rommel, Thomas Moreau, Joseph Paillard, Alexandre Gramfort, Apr 2022, In proceedings of International Conference on Learning Representations (ICLR)

Data augmentation is a key element of deep learning pipelines, as it informs the network during training about transformations of the input data that keep the label unchanged. Manually finding adequate augmentation methods and parameters for a given pipeline, however, quickly becomes cumbersome. In particular, while intuition can guide this decision for images, the design and choice of augmentation policies remain unclear for more complex types of data, such as neuroscience signals. Besides, class-dependent augmentation strategies have been surprisingly unexplored in the literature, although they are quite intuitive: changing the color of a car image does not change the object class to be predicted, but doing the same to the picture of an orange does. This paper investigates gradient-based automatic data augmentation algorithms amenable to class-wise policies with exponentially larger search spaces. Motivated by supervised learning applications using EEG signals for which good augmentation policies are mostly unknown, we propose a new differentiable relaxation of the problem. In the class-agnostic setting, results show that our new relaxation leads to optimal performance with faster training than competing gradient-based methods, while also outperforming gradient-free methods in the class-wise setting. This work also proposes novel differentiable augmentation operations relevant for sleep stage classification.

DriPP: Driven Point Processes to Model Stimuli Induced Patterns in M/EEG Signals
Cédric Allain, Alexandre Gramfort, Thomas Moreau, Apr 2022, In proceedings of International Conference on Learning Representations (ICLR)

The quantitative analysis of non-invasive electrophysiology signals from electroencephalography (EEG) and magnetoencephalography (MEG) boils down to the identification of temporal patterns such as evoked responses, transient bursts of neural oscillations but also blinks or heartbeats for data cleaning. Several works have shown that these patterns can be extracted efficiently in an unsupervised way, e.g., using Convolutional Dictionary Learning. This leads to an event-based description of the data. Given these events, a natural question is to estimate how their occurrences are modulated by certain cognitive tasks and experimental manipulations. To address it, we propose a point process approach. While point processes have been used in neuroscience in the past, in particular for single cell recordings (spike trains), techniques such as Convolutional Dictionary Learning make them amenable to human studies based on EEG/MEG signals. We develop a novel statistical point process model – called driven temporal point processes (DriPP) – where the intensity function of the point process model is linked to a set of point processes corresponding to stimulation events. We derive a fast and principled expectation-maximization algorithm to estimate the parameters of this model. Simulations reveal that model parameters can be identified from long enough signals. Results on standard MEG datasets demonstrate that our methodology reveals event-related neural responses – both evoked and induced – and isolates non-task specific temporal patterns.

Deep invariant networks with differentiable augmentation layers
Cédric Rommel, Thomas Moreau, Alexandre Gramfort, Apr 2022, preprint arXiv

Designing learning systems which are invariant to certain data transformations is critical in machine learning. Practitioners can typically enforce a desired invariance on the trained model through the choice of a network architecture, e.g. using convolutions for translations, or using data augmentation. Yet, enforcing true invariance in the network can be difficult, and data invariances are not always known a priori. State-of-the-art methods for learning data augmentation policies require held-out data and are based on bilevel optimization problems, which are complex to solve and often computationally demanding. In this work we investigate new ways of learning invariances only from the training data. Using learnable augmentation layers built directly in the network, we demonstrate that our method is very versatile. It can incorporate any type of differentiable augmentation and be applied to a broad class of learning problems beyond computer vision. We provide empirical evidence showing that our approach is easier and faster to train than modern automatic data augmentation techniques based on bilevel optimization, while achieving comparable results. Experiments show that while the invariances transferred to a model through automatic data augmentation are limited by the model expressivity, the invariance yielded by our approach is insensitive to it by design.

A framework for bilevel optimization that enables stochastic and global variance reduction algorithms
Mathieu Dagréou, Pierre Ablin, Samuel Vaiter, Thomas Moreau, Jan 2022, preprint arXiv

Bilevel optimization, the problem of minimizing a value function which involves the arg-minimum of another function, appears in many areas of machine learning. In a large scale setting where the number of samples is huge, it is crucial to develop stochastic methods, which only use a few samples at a time to progress. However, computing the gradient of the value function involves solving a linear system, which makes it difficult to derive unbiased stochastic estimates. To overcome this problem we introduce a novel framework, in which the solution of the inner problem, the solution of the linear system, and the main variable evolve at the same time. These directions are written as a sum, making it straightforward to derive unbiased estimates. The simplicity of our approach allows us to develop global variance reduction algorithms, where the dynamics of all variables is subject to variance reduction. We demonstrate that SABA, an adaptation of the celebrated SAGA algorithm in our framework, has an O(1/T) convergence rate, and that it achieves linear convergence under the Polyak-Łojasiewicz assumption. This is the first stochastic algorithm for bilevel optimization that verifies either of these properties. Numerical experiments validate the usefulness of our method.

Risk of death in individuals hospitalized for COVID-19 with and without psychiatric disorders: an observational multicenter study in France
Nicolas Hoertel, Marina Sánchez-Rico, Pedro de La Muela, Miriam Abellán, Carlos Blanco, Marion Leboyer, Céline Cougoule, Erich Gulbins, Johannes Kornhuber, Alexander Carpinteiro, Katrin Anne Becker, Raphaël Vernet, Nathanaël Beeker, Antoine Neuraz, Jesús M Alvarado, Juan José Herrera-Morueco, Guillaume Airagnes, Cédric Lemogne, Frédéric Limosin, Pierre-Yves Ancel, Alain Bauchet, Vincent Benoit, Mélodie Bernaux, Ali Bellamine, Romain Bey, Aurélie Bourmaud, Stéphane Breant, Anita Burgun et al., Jan 2022, Biological Psychiatry Global Open Science

Background: Prior research suggests that psychiatric disorders could be linked to increased mortality among patients with COVID-19. However, whether all or specific psychiatric disorders are intrinsic risk factors of death in COVID-19 or whether these associations reflect the greater prevalence of medical risk factors in people with psychiatric disorders has yet to be evaluated.

Methods: We performed an observational, multicenter, retrospective cohort study to examine the association between psychiatric disorders and mortality among patients hospitalized for laboratory-confirmed COVID-19 at 36 Greater Paris University hospitals.

Results: Of 15,168 adult patients, 857 (5.7%) had an ICD-10 diagnosis of psychiatric disorder. Over a mean follow-up period of 14.6 days (SD = 17.9), 326 of 857 (38.0%) patients with a diagnosis of psychiatric disorder died compared with 1276 of 14,311 (8.9%) patients without such a diagnosis (odds ratio 6.27, 95% CI 5.40–7.28, p < .01). When adjusting for age, sex, hospital, current smoking status, and medications according to compassionate use or as part of a clinical trial, this association remained significant (adjusted odds ratio 3.27, 95% CI 2.78–3.85, p < .01). However, additional adjustments for obesity and number of medical conditions resulted in a nonsignificant association (adjusted odds ratio 1.02, 95% CI 0.84–1.23, p = .86). Exploratory analyses after the same adjustments suggested that a diagnosis of mood disorders was significantly associated with reduced mortality, which might be explained by the use of antidepressants.

Conclusions: These findings suggest that the increased risk of COVID-19–related mortality in individuals with psychiatric disorders hospitalized for COVID-19 might be explained by the greater number of medical conditions and the higher prevalence of obesity in this population and not by the underlying psychiatric disease.
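The unadjusted odds ratio reported in the Results can be checked directly from the counts given there. The following is a minimal sketch, assuming a standard Wald interval on the log odds ratio (the abstract does not state which interval the authors used); the function name is illustrative only.

```python
import math


def odds_ratio_with_ci(events_a, total_a, events_b, total_b, z=1.96):
    """Unadjusted odds ratio of group A vs. group B with a Wald interval.

    Assumption: the interval is computed on the log scale with the usual
    standard error sqrt(1/a + 1/b + 1/c + 1/d).
    """
    a, b = events_a, total_a - events_a  # deaths / survivors in group A
    c, d = events_b, total_b - events_b  # deaths / survivors in group B
    odds_ratio = (a / b) / (c / d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se), math.exp(log_or + z * se)


# Counts taken from the abstract: 326 of 857 vs. 1276 of 14,311 deaths.
or_, low, high = odds_ratio_with_ci(326, 857, 1276, 14311)
print(f"OR = {or_:.2f}, 95% CI {low:.2f}-{high:.2f}")
```

Under these assumptions the result agrees with the reported unadjusted value of 6.27 (5.40–7.28). The adjusted odds ratios cannot be recomputed this way, since they depend on patient-level covariates.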

Leveraging Global Parameters for Flow-based Neural Posterior Estimation
Pedro L. C. Rodrigues, Thomas Moreau, Gilles Louppe, Alexandre Gramfort, Dec 2021, In proceedings of Advances in Neural Information Processing Systems (NeurIPS)

Inferring the parameters of a stochastic model based on experimental observations is central to the scientific method. A particularly challenging setting is when the model is strongly indeterminate, i.e., when distinct sets of parameters yield identical observations.

Inferring the parameters of a stochastic model based on experimental observations is central to the scientific method. A particularly challenging setting is when the model is strongly indeterminate, i.e., when distinct sets of parameters yield identical observations. This arises in many practical situations, such as when inferring the distance and power of a radio source (is the source close and weak or far and strong?) or when estimating the amplifier gain and underlying brain activity of an electrophysiological experiment. In this work, we present a method for cracking such indeterminacy by exploiting additional information conveyed by an auxiliary set of observations sharing global parameters. Our method extends recent developments in simulation-based inference (SBI) based on normalizing flows to Bayesian hierarchical models. We quantitatively validate our proposal on a motivating example amenable to analytical solutions, and then apply it to invert a well-known non-linear model from computational neuroscience.

Multivariate semi-blind deconvolution of fMRI time series
Hamza Cherkaoui, Thomas Moreau, Abderrahim Halimi, Claire Leroy, Philippe Ciuciu, Nov 2021, NeuroImage

Whole brain estimation of the haemodynamic response function (HRF) in functional magnetic resonance imaging (fMRI) is critical to get insight on the global status of the neurovascular coupling of an individual in healthy or pathological condition. Most existing approaches in the literature work on task-fMRI data and rely on the experimental paradigm as a surrogate of neural activity, hence remaining inoperative on resting-state fMRI (rs-fMRI) data.

Whole brain estimation of the haemodynamic response function (HRF) in functional magnetic resonance imaging (fMRI) is critical to get insight on the global status of the neurovascular coupling of an individual in healthy or pathological condition. Most existing approaches in the literature work on task-fMRI data and rely on the experimental paradigm as a surrogate of neural activity, hence remaining inoperative on resting-state fMRI (rs-fMRI) data. To cope with this issue, recent works have performed either a two-step analysis to detect large neural events and then characterize the HRF shape or a joint estimation of both the neural and haemodynamic components in a univariate fashion. In this work, we express the neural activity signals as a combination of piece-wise constant temporal atoms associated with sparse spatial maps and introduce a haemodynamic parcellation of the brain featuring a temporally dilated version of a given HRF model in each parcel with unknown dilation parameters. We formulate the joint estimation of the HRF shapes and spatio-temporal neural representations as a multivariate semi-blind deconvolution problem in a paradigm-free setting and introduce constraints inspired from the dictionary learning literature to ease its identifiability. An efficient alternating minimization algorithm is proposed and validated on both synthetic and real rs-fMRI data at the subject level. To demonstrate its significance at the population level, we apply this new framework to the UK Biobank data set, first for the discrimination of haemodynamic territories between balanced groups (n = 24 individuals in each) of patients with a history of stroke and healthy controls and second, for the analysis of normal aging on the neurovascular coupling. Overall, we statistically demonstrate that a pathology like stroke or a condition like normal brain aging induces longer haemodynamic delays in certain brain areas (e.g. Willis polygon, occipital, temporal and frontal cortices) and that this haemodynamic feature may be predictive with an accuracy of 74 % of the individual's age in a supervised classification task performed on n = 459 subjects.

Diabetes increases severe COVID-19 outcomes primarily in younger adults
Marc Diedisheim, Etienne Dancoisne, Jean-François Gautier, Etienne Larger, Emmanuel Cosson, Bruno Fève, Philippe Chanson, Sébastien Czernichow, Sopio Tatulashvili, Marie-Laure Raffin-Sanson, Kankoé Sallah, Muriel Bourgeon, Christiane Ajzenberg, Agnès Hartemann, Christel Daniel, Thomas Moreau, Ronan Roussel, Louis Potier, Sep 2021, The Journal of Clinical Endocrinology & Metabolism

Diabetes is reported as a risk factor for severe coronavirus disease 2019 (COVID-19), but whether this risk is similar in all categories of age remains unclear. To investigate the risk of severe COVID-19 outcomes in hospitalized patients with and without diabetes according to age categories.

Context Diabetes is reported as a risk factor for severe coronavirus disease 2019 (COVID-19), but whether this risk is similar in all categories of age remains unclear.

Objective To investigate the risk of severe COVID-19 outcomes in hospitalized patients with and without diabetes according to age categories.

Design Setting and Participants We conducted a retrospective observational cohort study of 6314 consecutive patients hospitalized for COVID-19 between February and 30 June 2020 in the Paris metropolitan area, France; follow-up was recorded until 30 September 2020.

Main Outcome Measure(s) The main outcome was a composite outcome of mortality and orotracheal intubation in subjects with diabetes compared with subjects without diabetes, after adjustment for confounding variables and according to age categories.

Results Diabetes was recorded in 39% of subjects. Main outcome was higher in patients with diabetes, independently of confounding variables (hazard ratio [HR] 1.13 [1.03-1.24]) and increased with age in individuals without diabetes, from 23% for those <50 to 35% for those >80 years but reached a plateau after 70 years in those with diabetes. In direct comparison between patients with and without diabetes, diabetes-associated risk was inversely proportional to age, highest in <50 years and similar after 70 years. Similarly, mortality was higher in patients with diabetes (26%) than in those without diabetes (22%, P < 0.001), but adjusted HR for diabetes was significant only in patients younger than age 50 years (HR 1.81 [1.14-2.87]).

Conclusions Diabetes should be considered as an independent risk factor for the severity of COVID-19 in young adults more so than in older adults, especially for individuals younger than 70 years.

Wavelets in the Deep Learning Era
Zaccharie Ramzi, Jean-Luc Starck, Thomas Moreau, Philippe Ciuciu, Jan 2021, In proceedings of European Signal Processing Conference (EUSIPCO)

Sparsity based methods, such as wavelets, have been state-of-the-art for more than 20 years for inverse problems before being overtaken by neural networks. In particular, U-nets have proven to be extremely effective. Their main ingredients are a highly non-linear processing, a massive learning made possible by the flourishing of optimization algorithms with the power of computers (GPU) and the use of large available data sets for training.

Sparsity based methods, such as wavelets, have been state-of-the-art for more than 20 years for inverse problems before being overtaken by neural networks. In particular, U-nets have proven to be extremely effective. Their main ingredients are a highly non-linear processing, a massive learning made possible by the flourishing of optimization algorithms with the power of computers (GPU) and the use of large available data sets for training. While the many stages of non-linearity are intrinsic to deep learning, the usage of learning with training data could also be exploited by sparsity based approaches. The aim of our study is to push the limits of sparsity with learning, and to compare the results with U-nets. We present a new network architecture, which conserves the properties of sparsity based methods such as exact reconstruction and good generalization properties, while fostering the power of neural networks for learning and fast calculation. We evaluate the model on image denoising tasks and show it is competitive with learning-based models.

Learning to solve TV regularized problems with unrolled algorithms
Hamza Cherkaoui, Jeremias Sulam, Thomas Moreau, Dec 2020, In proceedings of Advances in Neural Information Processing Systems (NeurIPS)

In this paper, we accelerate iterative algorithms for 1D TV regularized problems by unfolding proximal gradient descent solvers in order to learn their parameters. While this could be done using the synthesis formulation, we demonstrate that this leads to slower performance. The main difficulty in applying such methods in the analysis formulation lies in proposing a way to compute the derivatives through the proximal operator.

Total Variation (TV) is a popular regularization strategy that promotes piece-wise constant signals by constraining the ℓ1-norm of the first order derivative of the estimated signal. The resulting optimization problem is usually solved using iterative algorithms such as proximal gradient descent, primal-dual algorithms or ADMM. However, such methods can require a very large number of iterations to converge to a suitable solution. In this paper, we accelerate such iterative algorithms by unfolding proximal gradient descent solvers in order to learn their parameters for 1D TV regularized problems. While this could be done using the synthesis formulation, we demonstrate that this leads to slower performance. The main difficulty in applying such methods in the analysis formulation lies in proposing a way to compute the derivatives through the proximal operator. As our main contribution, we develop and characterize two approaches to do so, describe their benefits and limitations, and discuss the regime where they can actually improve over iterative procedures. We validate those findings with experiments on synthetic and real data.
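To make the analysis and synthesis formulations mentioned above concrete, here is a minimal sketch for a 1D signal. In the analysis view the penalty is the ℓ1-norm of the finite differences of the signal u; in the synthesis view the signal is written as a cumulative sum u = cumsum(z) and the penalty is the ℓ1-norm of z without its first entry. The function names are illustrative and not taken from the paper's code.

```python
from itertools import accumulate


def total_variation(u):
    """Analysis view: l1-norm of the first-order differences of u."""
    return sum(abs(b - a) for a, b in zip(u, u[1:]))


def to_synthesis(u):
    """Map a signal u to its synthesis code z, with z[0] = u[0] and
    z[i] = u[i] - u[i - 1], so that u is the cumulative sum of z."""
    return [u[0]] + [b - a for a, b in zip(u, u[1:])]


def from_synthesis(z):
    """Synthesis view: rebuild the signal as the cumulative sum of z."""
    return list(accumulate(z))


u = [1, 1, 3, 3, 2]            # piece-wise constant signal
z = to_synthesis(u)            # sparse code: non-zeros mark the jumps
tv_analysis = total_variation(u)
tv_synthesis = sum(abs(v) for v in z[1:])
print(z, tv_analysis, tv_synthesis)
```

Both views give the same penalty value. They differ in the optimization problem that results: the synthesis form is a Lasso with a cumulative-sum dictionary, whereas the analysis form requires the proximal operator of TV.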

NeuMiss networks: differential programming for supervised learning with missing values
Marine Le Morvan, Julie Josse, Thomas Moreau, Erwan Scornet, Gaël Varoquaux, Dec 2020, In proceedings of Advances in Neural Information Processing Systems (NeurIPS)

The presence of missing values makes supervised learning much more challenging. Indeed, previous work has shown that even when the response is a linear function of the complete data, the optimal predictor is a complex function of the observed entries and the missingness indicator. As a result, the computational or sample complexities of consistent approaches depend on the number of missing patterns, which can be exponential in the number of dimensions.

The presence of missing values makes supervised learning much more challenging. Indeed, previous work has shown that even when the response is a linear function of the complete data, the optimal predictor is a complex function of the observed entries and the missingness indicator. As a result, the computational or sample complexities of consistent approaches depend on the number of missing patterns, which can be exponential in the number of dimensions. In this work, we derive the analytical form of the optimal predictor under a linearity assumption and various missing data mechanisms including Missing at Random (MAR) and self-masking (Missing Not At Random). Based on a Neumann-series approximation of the optimal predictor, we propose a new principled architecture, named NeuMiss networks. Their originality and strength come from the use of a new type of non-linearity: the multiplication by the missingness indicator. We provide an upper bound on the Bayes risk of NeuMiss networks, and show that they have good predictive accuracy with both a number of parameters and a computational complexity independent of the number of missing data patterns. As a result they scale well to problems with many features, and remain statistically efficient for medium-sized samples. Moreover, we show that, contrary to procedures using EM or imputation, they are robust to the missing data mechanism, including difficult MNAR settings such as self-masking.
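The Neumann-series approximation mentioned above can be illustrated on its own. The sketch below approximates the inverse of a small matrix S by the truncated series sum of (I - S)^k, which converges when the spectral radius of I - S is below 1. It is only the underlying series, written in plain Python with hypothetical helper names, and not the NeuMiss architecture itself.

```python
def matmul(A, B):
    """Product of two small dense matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def neumann_inverse(S, n_terms):
    """Approximate inv(S) by sum_{k=0}^{n_terms} (I - S)^k.

    Uses the recursion X <- (I - S) X + I, starting from X = I.
    Assumption: the spectral radius of I - S is strictly below 1.
    """
    n = len(S)
    eye = identity(n)
    M = [[eye[i][j] - S[i][j] for j in range(n)] for i in range(n)]
    X = identity(n)
    for _ in range(n_terms):
        MX = matmul(M, X)
        X = [[MX[i][j] + eye[i][j] for j in range(n)] for i in range(n)]
    return X


S = [[1.0, 0.2], [0.2, 0.8]]      # toy covariance-like matrix
approx_inv = neumann_inverse(S, 60)
product = matmul(S, approx_inv)   # should be close to the identity
print(product)
```

Each extra term costs one matrix product, which is what makes a fixed-depth truncation cheap to evaluate.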

DiCoDiLe: Distributed Convolutional Dictionary Learning
Thomas Moreau and Alexandre Gramfort, Nov 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)

DiCoDiLe: a distributed and asynchronous algorithm, employing locally greedy coordinate descent and an asynchronous locking mechanism that does not require a central server.

Convolutional dictionary learning (CDL) estimates shift-invariant bases adapted to multidimensional data. CDL has proven useful for image denoising or inpainting, as well as for pattern discovery on multivariate signals. As estimated patterns can be positioned anywhere in signals or images, optimization techniques face the difficulty of working in extremely high dimensions with millions of pixels or time samples, in contrast to standard patch-based dictionary learning. To address this optimization problem, this work proposes a distributed and asynchronous algorithm, employing locally greedy coordinate descent and an asynchronous locking mechanism that does not require a central server. This algorithm can be used to distribute the computation on a number of workers which scales linearly with the encoded signal's size. Experiments confirm the scaling properties, which allow us to learn patterns on large-scale images from the Hubble Space Telescope.

International electronic health record-derived COVID-19 clinical course profiles: the 4CE consortium
Gabriel A Brat, Griffin M Weber, Nils Gehlenborg, Paul Avillach, Nathan P Palmer, Luca Chiovato, James Cimino, Lemuel R Waitman, Gilbert S Omenn, Alberto Malovini, Jason H Moore, Brett K Beaulieu-Jones, Valentina Tibollo, Shawn N Murphy, Sehi L’Yi, Mark S Keller, Riccardo Bellazzi, David A Hanauer, Arnaud Serret-Larmande, Alba Gutierrez-Sacristan, John J Holmes, Douglas S Bell, Kenneth D Mandl, Robert W Follett, Jeffrey G Klann, Douglas A Murad, Luigia Scudeller, Mauro Bucalo, et al., Aug 2020, Npj Digital Medicine

We leveraged the largely untapped resource of electronic health record data to address critical clinical and epidemiological questions about Coronavirus Disease 2019 (COVID-19). To do this, we formed an international consortium (4CE) of 96 hospitals across five countries (www.covidclinical.net). Contributors utilized the Informatics for Integrating Biology and the Bedside (i2b2) or Observational Medical Outcomes Partnership (OMOP) platforms to map to a common data model.

We leveraged the largely untapped resource of electronic health record data to address critical clinical and epidemiological questions about Coronavirus Disease 2019 (COVID-19). To do this, we formed an international consortium (4CE) of 96 hospitals across five countries (www.covidclinical.net). Contributors utilized the Informatics for Integrating Biology and the Bedside (i2b2) or Observational Medical Outcomes Partnership (OMOP) platforms to map to a common data model. The group focused on temporal changes in key laboratory test values. Harmonized data were analyzed locally and converted to a shared aggregate form for rapid analysis and visualization of regional differences and global commonalities. Data covered 27,584 COVID-19 cases with 187,802 laboratory tests. Case counts and laboratory trajectories were concordant with existing literature. Laboratory tests at the time of diagnosis showed hospital-level differences equivalent to country-level variation across the consortium partners. Despite the limitations of decentralized data generation, we established a framework to capture the trajectory of COVID-19 disease in patients and their response to interventions.

Extraction of Nystagmus Patterns from Eye-Tracker Data with Convolutional Sparse Coding
Clément Lalanne, Maxence Rateaux, Laurent Oudre, Matthieu Robert, Thomas Moreau, Jul 2020, In proceedings of Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC)

The analysis of the Nystagmus waveforms from eye-tracking records is crucial for the clinical interpretation of this pathological movement. A major issue to automatize this analysis is the presence of natural eye movements and eye blink artefacts that are mixed with the signal of interest. We propose a method based on Convolutional Dictionary Learning that is able to automatically highlight the Nystagmus waveforms, separating the natural motion from the pathological movements.

The analysis of the Nystagmus waveforms from eye-tracking records is crucial for the clinical interpretation of this pathological movement. A major issue to automatize this analysis is the presence of natural eye movements and eye blink artefacts that are mixed with the signal of interest. We propose a method based on Convolutional Dictionary Learning that is able to automatically highlight the Nystagmus waveforms, separating the natural motion from the pathological movements. We show on simulated signals that our method can indeed improve the pattern recovery rate and provide clinical examples to illustrate how this algorithm performs.

Hydroxychloroquine with or without azithromycin and in-hospital mortality or discharge in patients hospitalized for COVID-19 infection: a cohort study of 4,642 in-patients in France
Emilie Sbidian, Julie Josse, Guillaume Lemaitre, Imke Meyer, Mélodie Bernaux, Alexandre Gramfort, Nathanaël Lapidus, Nicolas Paris, Antoine Neuraz, Ivan Lerner, Nicolas Garcelon, Bastien Rance, Olivier Grisel, Thomas Moreau, Ali Bellamine, Pierre Wolkenstein, Gaël Varoquaux, Eric Caumes, Marc Lavielle, Armand Mekontso Dessap, Etienne Audureau, AP-HP Covid CDR Initiative, Jun 2020, preprint medRxiv

To assess the clinical effectiveness of oral hydroxychloroquine (HCQ) with or without azithromycin (AZI) in preventing death or leading to hospital discharge. Design: Retrospective cohort study.

Objective: To assess the clinical effectiveness of oral hydroxychloroquine (HCQ) with or without azithromycin (AZI) in preventing death or leading to hospital discharge.

Design: Retrospective cohort study.

Setting: An analysis of data from electronic medical records and administrative claim data from the French Assistance Publique - Hôpitaux de Paris (AP-HP) data warehouse, in 39 public hospitals, Ile-de-France, France.

Participants: All adult inpatients with at least one PCR-documented SARS-CoV-2 RNA from a nasopharyngeal sample between February 1st, 2020 and April 6th, 2020 were eligible for analysis. The study population was restricted to patients who did not receive COVID-19 treatments assessed in ongoing trials, including antivirals and immunosuppressive drugs. End of follow-up was defined as the date of death, discharge home, day 28 after admission, whichever occurred first, or administrative censoring on May 4, 2020.

Intervention: Patients were further classified into 3 groups: (i) receiving HCQ alone, (ii) receiving HCQ together with AZI, and (iii) receiving neither HCQ nor AZI. Exposure to a HCQ/AZI combination was defined as a simultaneous prescription of the 2 treatments (more or less one day).

Main outcome measures: The primary outcome was all-cause 28-day mortality as a time-to-event endpoint under a competing risks survival analysis framework. The secondary outcome was 28-day discharge home. Augmented inverse probability of treatment weighted (AIPTW) estimates of the average treatment effect (ATE) were computed to account for confounding.

Results: A total of 4,642 patients (mean age: 66.1 ± 18; males: 2,738 (59%)) were included, of whom 623 (13.4%) received HCQ alone, 227 (5.9%) received HCQ plus AZI, and 3,792 (81.7%) neither drug. Patients receiving ‘HCQ alone’ or ‘HCQ plus AZI’ were more likely younger, males, current smokers and overall presented with slightly more co-morbidities (obesity, diabetes, any chronic pulmonary diseases, liver diseases), while no major difference was apparent in biological parameters. After accounting for confounding, no statistically significant difference was observed between the ‘HCQ’ and ‘Neither drug’ groups for 28-day mortality: AIPTW absolute difference in ATE was +1.24% (−5.63 to 8.12), ratio in ATE 1.05 (0.77 to 1.33). 28-day discharge rates were statistically significantly higher in the ‘HCQ’ group: AIPTW absolute difference in ATE (+11.1% [3.30 to 18.9]), ratio in ATE (1.25 [1.07 to 1.42]). As for the ‘HCQ+AZI’ vs neither drug, trends for significant differences and ratios in AIPTW ATE were found suggesting higher mortality rates in the former group (difference in ATE +9.83% [-0.51 to 20.17], ratio in ATE 1.40 [0.98 to 1.81]; p = 0.062).

Conclusions: Using a large non-selected population of inpatients hospitalized for COVID-19 infection in 39 hospitals in France and robust methodological approaches, we found no evidence for efficacy of HCQ or HCQ combined with AZI on 28-day mortality. Our results suggested a possible excess risk of mortality associated with HCQ combined with AZI, but not with HCQ alone. Significantly higher rates of discharge home were observed in patients treated by HCQ, a novel finding warranting further confirmation in replicative studies. Altogether, our findings further support the need to complete currently undergoing randomized clinical trials.

Super-efficiency of automatic differentiation for functions defined as a minimum
Pierre Ablin, Gabriel Peyré and Thomas Moreau, Feb 2020, preprint

We study different techniques to differentiate a function defined as the minimum of another.

In min-min optimization or max-min optimization, one has to compute the gradient of a function defined as a minimum. In most cases, the minimum has no closed form, and an approximation is obtained via an iterative algorithm. There are two usual ways of estimating the gradient of the function: using either an analytic formula obtained by assuming exactness of the approximation, or automatic differentiation through the algorithm. In this paper, we study the asymptotic error made by these estimators as a function of the optimization error. We find that the error of the automatic estimator is close to the square of the error of the analytic estimator, reflecting a super-efficiency phenomenon. The convergence of the automatic estimator greatly depends on the convergence of the Jacobian of the algorithm. We analyze it for gradient descent and stochastic gradient descent and derive convergence rates for the estimators in these cases. Our analysis is backed by numerical experiments on toy problems and on Wasserstein barycenter computation. Finally, we discuss the computational complexity of these estimators and give practical guidelines to choose between them.
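The squared-error relation between the two estimators can be seen on a toy problem. The sketch below assumes the scalar function L(z, x) = a z²/2 - x z, whose minimum over z gives g(x) = -x²/(2a) and g'(x) = -x/a, and runs gradient descent on z from z = 0. The derivative of the iterate with respect to x is propagated by hand (forward mode), standing in for an automatic differentiation library. The toy problem and all names are choices made here for illustration, not taken from the paper.

```python
def gradient_estimators(a, x, rho, n_iter):
    """Return the (analytic, automatic) estimates of g'(x) after n_iter
    gradient descent steps on z -> L(z, x) = a * z**2 / 2 - x * z."""
    z, dz_dx = 0.0, 0.0
    for _ in range(n_iter):
        grad_z = a * z - x
        # Differentiate the update z <- z - rho * (a * z - x) w.r.t. x.
        z, dz_dx = z - rho * grad_z, dz_dx - rho * (a * dz_dx - 1.0)
    # Analytic estimator: plug z into dL/dx = -z, as if z were exact.
    analytic = -z
    # Automatic estimator: total derivative of x -> L(z(x), x).
    automatic = (a * z - x) * dz_dx - z
    return analytic, automatic


a, x, rho, n_iter = 2.0, 1.0, 0.25, 5
analytic, automatic = gradient_estimators(a, x, rho, n_iter)
true_grad = -x / a
err_analytic = abs(analytic - true_grad)
err_automatic = abs(automatic - true_grad)
print(err_analytic, err_automatic)
```

On this quadratic example the analytic error is (x/a)(1 - ρa)^t and the automatic error is (x/a)(1 - ρa)^(2t), so the second is the square of the first up to the factor a/x.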

Learning step sizes for unfolded sparse coding
Pierre Ablin, Thomas Moreau, Mathurin Massias, Alexandre Gramfort, Dec 2019, In proceedings of Advances in Neural Information Processing Systems (NeurIPS)

This paper presents a theoretical study on how LISTA can learn to accelerate computation compared to ISTA based on larger step sizes adapted to the sparsity distribution of the solution estimate. This mechanism is the only one that ensures asymptotic convergence to the Lasso estimator.

Sparse coding is typically solved by iterative optimization techniques, such as the Iterative Shrinkage-Thresholding Algorithm (ISTA). Unfolding and learning weights of ISTA using neural networks is a practical way to accelerate estimation. In this paper, we study the selection of adapted step sizes for ISTA. We show that a simple step size strategy can improve the convergence rate of ISTA by leveraging the sparsity of the iterates. However, it is impractical in most large-scale applications. Therefore, we propose a network architecture where only the step sizes of ISTA are learned. We demonstrate that for a large class of unfolded algorithms, if the algorithm converges to the solution of the Lasso, its last layers correspond to ISTA with learned step sizes. Experiments show that our method is competitive with state-of-the-art networks when the solutions are sparse enough.
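For reference, the ISTA iteration that the abstract builds on can be sketched as follows for the Lasso objective 0.5 ||x - Dz||² + λ ||z||₁. This is plain ISTA with a fixed step size, written in pure Python on a toy dictionary; it does not implement the sparsity-adapted or learned step sizes studied in the paper. The conservative step 1/||D||_F² is a choice made here because the squared Frobenius norm upper-bounds the Lipschitz constant of the smooth part.

```python
import math


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied entry-wise."""
    return [math.copysign(max(abs(vi) - t, 0.0), vi) for vi in v]


def lasso_objective(D, x, z, lam):
    residual = [sum(d * zj for d, zj in zip(row, z)) - xi
                for row, xi in zip(D, x)]
    return 0.5 * sum(r * r for r in residual) + lam * sum(abs(zj) for zj in z)


def ista(D, x, lam, step, n_iter):
    """Run n_iter ISTA steps from z = 0; D is a list of rows, x ~ D z."""
    n_atoms = len(D[0])
    z = [0.0] * n_atoms
    for _ in range(n_iter):
        residual = [sum(d * zj for d, zj in zip(row, z)) - xi
                    for row, xi in zip(D, x)]
        # Gradient of the smooth part: D^T (D z - x).
        grad = [sum(D[i][j] * residual[i] for i in range(len(D)))
                for j in range(n_atoms)]
        z = soft_threshold([zj - step * gj for zj, gj in zip(z, grad)],
                           step * lam)
    return z


D = [[1.0, 0.5], [0.0, 1.0]]   # toy dictionary, one row per sample
x = [1.0, 2.0]
step = 1.0 / sum(d * d for row in D for d in row)  # 1 / ||D||_F^2
z_hat = ista(D, x, lam=0.1, step=step, n_iter=200)
print(z_hat, lasso_objective(D, x, z_hat, 0.1))
```

With any step no larger than the inverse Lipschitz constant the objective decreases at every iteration. The abstract's point is that larger steps become admissible once the iterates are sparse.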

A Data Set for the Study of Human Locomotion with Inertial Measurements Units
Charles Truong, Rémi Barrois-Müller, Thomas Moreau, Clément Provost, Aliénor Vienne-Jumeau, Albane Moreau, Pierre-Paul Vidal, Nicolas Vayatis, Stéphane Buffat, Alain Yelnik, Damien Ricard, Laurent Oudre, Nov 2019, Image Processing On Line

A data set of 1020 multivariate gait signals collected with two inertial measurement units, from 230 subjects undergoing a fixed protocol.

This article thoroughly describes a data set of 1020 multivariate gait signals collected with two inertial measurement units, from 230 subjects undergoing a fixed protocol: standing still, walking 10 m, turning around, walking back and stopping. In total, 8.5 h of gait time series are distributed. The measured population was composed of healthy subjects as well as patients with neurological or orthopedic disorders. An outstanding feature of this data set is the amount of signal metadata that are provided. In particular, the start and end time stamps of more than 40,000 footsteps are available, as well as contextual information about each trial. This exact data set was used in [Oudre et al., Template-based step detection with inertial measurement units, Sensors 18, 2018] to design and evaluate a step detection procedure.

Sparsity-based blind deconvolution of neural activation signal in fMRI
Hamza Cherkaoui, Thomas Moreau, Abderrahim Halimi, Philippe Ciuciu, May 2019, In proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In this work, we formulate the joint estimation of the HRF and neural activation signal as a semi-blind deconvolution problem.

The estimation of the hemodynamic response function (HRF) in functional magnetic resonance imaging (fMRI) is critical to deconvolve a time-resolved neural activity and get insights on the underlying cognitive processes. Existing methods propose to estimate the HRF using the experimental paradigm (EP) in task fMRI as a surrogate of neural activity. These approaches induce a bias as they do not account for latencies in the cognitive responses compared to EP and cannot be applied to resting-state data as no EP is available. In this work, we formulate the joint estimation of the HRF and neural activation signal as a semi-blind deconvolution problem. Its solution can be approximated using an efficient alternating minimization algorithm. The proposed approach is applied to task fMRI data for validation purposes and compared to a state-of-the-art HRF estimation technique. Numerical experiments suggest that our approach is competitive with others while not requiring EP information.

Multivariate Convolutional Sparse Coding for Electromagnetic Brain Signals
Tom Dupré La Tour, Thomas Moreau, Mainak Jas and Alexandre Gramfort, Dec 2018, In proceedings of Advances in Neural Information Processing Systems (NeurIPS)

A multivariate CSC algorithm with a rank-1 constraint, designed to study brain activity waveforms.

Frequency-specific patterns of neural activity are traditionally interpreted as sustained rhythmic oscillations, and related to cognitive mechanisms such as attention, high level visual processing or motor control. While alpha waves (8-12 Hz) are known to closely resemble short sinusoids, and thus are revealed by Fourier analysis or wavelet transforms, there is an evolving debate that electromagnetic neural signals are composed of more complex waveforms that cannot be analyzed by linear filters and traditional signal representations. In this paper, we propose to learn dedicated representations of such recordings using a multivariate convolutional sparse coding (CSC) algorithm. Applied to electroencephalography (EEG) or magnetoencephalography (MEG) data, this method is able to learn not only prototypical temporal waveforms, but also associated spatial patterns so their origin can be localized in the brain. Our algorithm is based on alternated minimization and a greedy coordinate descent solver that leads to state-of-the-art running time on long time series. To demonstrate the implications of this method, we apply it to MEG data and show that it is able to recover biological artifacts. More remarkably, our approach also reveals the presence of non-sinusoidal mu-shaped patterns, along with their topographic maps related to the somatosensory cortex.

Template-Based Step Detection with Inertial Measurement Units
Laurent Oudre, Rémi Barrois-Müller, Thomas Moreau, Charles Truong, Aliénor Vienne-Jumeau, Damien Ricard, Nicolas Vayatis and Pierre-Paul Vidal, Nov 2018, Sensors

Step detection in inertial recordings using template-based detection.

This article presents a method for step detection from accelerometer and gyrometer signals recorded with Inertial Measurement Units (IMUs). The principle of our step detection algorithm is to recognize the start and end times of the steps in the signal thanks to a predefined library of templates. The algorithm is tested on a database of 1020 recordings, composed of healthy subjects and patients with various neurological or orthopedic troubles. Simulations on more than 40,000 steps show that the template-based method achieves remarkable results with a 98% recall and a 98% precision. The method adapts well to pathological subjects and can be used in a medical context for robust step estimation and gait characterization.

DICOD: Distributed Convolutional Coordinate Descent for Convolutional Sparse Coding
Thomas Moreau, Laurent Oudre and Nicolas Vayatis, Jul 2018, In proceedings of International Conference on Machine Learning (ICML)

In this paper, we introduce DICOD, a distributed convolutional sparse coding algorithm to build shift invariant representations for long signals.

In this paper, we introduce DICOD, a distributed convolutional sparse coding algorithm to build shift invariant representations for long signals. This algorithm is designed to run in a distributed setting, with local message passing, making it communication efficient. It is based on coordinate descent and uses locally greedy updates which accelerate the resolution compared to greedy coordinate selection. We prove the convergence of this algorithm and highlight its computational speed-up which is super-linear in the number of cores used. We also provide empirical evidence for the acceleration properties of our algorithm compared to state-of-the-art methods.

Convolutional Sparse Representations -- application to physiological signals and interpretability for Deep Learning
Thomas Moreau, Dec 2017

Convolutional representations extract recurrent patterns which lead to the discovery of local structures in a set of signals. In this dissertation, we describe recent advances on both computational and theoretical aspects of these models.

Convolutional representations extract recurrent patterns which lead to the discovery of local structures in a set of signals. They are well suited to analyze physiological signals which require interpretable representations in order to understand the relevant information. Moreover, these representations can be linked to deep learning models, as a way to bring interpretability in their internal representations. In this dissertation, we describe recent advances on both computational and theoretical aspects of these models. First, we show that the Singular Spectrum Analysis can be used to compute convolutional representations. This representation is dense and we describe an automatized procedure to improve its interpretability. Also, we propose an asynchronous algorithm, called DICOD, based on greedy coordinate descent, to solve convolutional sparse coding for long signals. Our algorithm has super-linear acceleration. In a second part, we focus on the link between representations and neural networks. An extra training step for deep learning, called post-training, is introduced to boost the performances of the trained network by making sure the last layer is optimal. Then, we study the mechanisms which allow to accelerate sparse coding algorithms with neural networks. We show that it is linked to a factorization of the Gram matrix of the dictionary. Finally, we illustrate the relevance of convolutional representations for physiological signals. Convolutional dictionary learning is used to summarize human walk signals and Singular Spectrum Analysis is used to remove the gaze movement in young infant’s oculometric recordings.

Understanding the Learned Iterative Soft Thresholding Algorithm with matrix factorization
Thomas Moreau and Joan Bruna, Jun 2017, arXiv preprint

This paper aims to extend the results from our previous paper studying the mechanisms of LISTA and gives an acceleration certificate for generic dictionaries.

Sparse coding is a core building block in many data analysis and machine learning pipelines. Typically it is solved by relying on generic optimization techniques, such as the Iterative Soft Thresholding Algorithm and its accelerated version (ISTA, FISTA). These methods are optimal in the class of first-order methods for non-smooth, convex functions. However, they do not exploit the particular structure of the problem at hand nor the input data distribution. An acceleration using neural networks, coined LISTA, was proposed in Gregor and Le Cun (2010), which showed empirically that one could achieve high quality estimates with few iterations by modifying the parameters of the proximal splitting appropriately.
In this paper we study the reasons for such acceleration. Our mathematical analysis reveals that it is related to a specific matrix factorization of the Gram kernel of the dictionary, which attempts to nearly diagonalise the kernel with a basis that produces a small perturbation of the ℓ1 ball. When this factorization succeeds, we prove that the resulting splitting algorithm enjoys an improved convergence bound with respect to the non-adaptive version. Moreover, our analysis also shows that conditions for acceleration occur mostly at the beginning of the iterative process, consistent with numerical experiments. We further validate our analysis by showing that on dictionaries where this factorization does not exist, adaptive acceleration fails.
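As a point of reference for the abstract above, here is a minimal sketch of the plain, non-adaptive ISTA iteration that LISTA modifies. It is not the paper's code: the function names, the list-of-rows layout for the dictionary `D`, and the choice of the maximum absolute row sum of the Gram matrix as a Lipschitz bound are all assumptions made for illustration.

```python
def soft_threshold(x, t):
    """Proximal operator of t * |.|: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def ista(D, x, lam, n_iter=200):
    """Minimise 0.5 * ||x - D z||^2 + lam * ||z||_1 over z with plain ISTA.

    D is a list of rows (n_samples x n_atoms), x a list of length n_samples.
    """
    n, k = len(D), len(D[0])
    # Gram matrix B = D^T D and correlation vector D^T x.
    B = [[sum(D[i][a] * D[i][b] for i in range(n)) for b in range(k)]
         for a in range(k)]
    Dtx = [sum(D[i][a] * x[i] for i in range(n)) for a in range(k)]
    # The max absolute row sum upper-bounds the largest eigenvalue of B,
    # so 1 / L is a valid (possibly conservative) step size.
    L = max(sum(abs(v) for v in row) for row in B)
    z = [0.0] * k
    for _ in range(n_iter):
        # Gradient of the smooth part: B z - D^T x.
        grad = [sum(B[a][b] * z[b] for b in range(k)) - Dtx[a]
                for a in range(k)]
        # Gradient step followed by the l1 proximal step.
        z = [soft_threshold(z[a] - grad[a] / L, lam / L) for a in range(k)]
    return z
```

With an identity dictionary the problem decouples and the solution is the soft-thresholded input, so `ista([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], 1.0)` gives a quick sanity check. Roughly speaking, LISTA replaces the fixed quantities derived from `D` in this loop with parameters learned from data.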

Understanding Trainable Sparse Coding with Matrix Factorization
Thomas Moreau and Joan Bruna, Apr 2017, In proceedings of International Conference on Learning Representations (ICLR)

In this paper we study the mechanisms behind LISTA.

Sparse coding is a core building block in many data analysis and machine learning pipelines. Typically it is solved by relying on generic optimization techniques that are optimal in the class of first-order methods for non-smooth, convex functions, such as the Iterative Soft Thresholding Algorithm and its accelerated version (ISTA, FISTA). However, these methods don't exploit the particular structure of the problem at hand nor the input data distribution. An acceleration using neural networks, coined LISTA, was proposed in Gregor and Le Cun (2010), which showed empirically that one could achieve high quality estimates with few iterations by modifying the parameters of the proximal splitting appropriately.
In this paper we study the reasons for such acceleration. Our mathematical analysis reveals that it is related to a specific matrix factorization of the Gram kernel of the dictionary, which attempts to nearly diagonalise the kernel with a basis that produces a small perturbation of the ℓ1 ball. When this factorization succeeds, we prove that the resulting splitting algorithm enjoys an improved convergence bound with respect to the non-adaptive version. Moreover, our analysis also shows that conditions for acceleration occur mostly at the beginning of the iterative process, consistent with numerical experiments. We further validate our analysis by showing that on dictionaries where this factorization does not exist, adaptive acceleration fails.

Post Training in Deep Learning with Last Kernel
Thomas Moreau and Julien Audiffren, Nov 2016, arXiv preprint

An additional training step for deep networks that optimizes the use of the features learned during classical training. A link with existing kernel methods is then discussed.

One of the main challenges of deep learning methods is the choice of an appropriate training strategy. In particular, additional steps, such as unsupervised pre-training, have been shown to greatly improve the performance of deep structures. In this article, we propose an extra training step, called post-training, which only optimizes the last layer of the network. We show that this procedure can be analyzed in the context of kernel theory, with the first layers computing an embedding of the data and the last layer a statistical model to solve the task based on this embedding. This step makes sure that the embedding, or representation, of the data is used in the best possible way for the considered task. This idea is then tested on multiple architectures with various data sets, showing that it consistently provides a boost in performance.

Distributed Convolutional Sparse Coding via Message Passing Interface
Thomas Moreau, Laurent Oudre and Nicolas Vayatis, Dec 2015, In NIPS Workshop Nonparametric Methods for Large Scale Representation Learning

An asynchronous algorithm to solve convolutional sparse coding, which can be implemented using the MPI framework.

We consider the problem of building shift-invariant representations of signals from sensors with a high acquisition frequency. We propose a distributed algorithm for convolutional sparse coding, called DICOD, that is based on coordinate descent and achieves a speed-up that is quadratic in the number of processing units. Indeed, our implementation avoids sharing variables between cores and does not require any lock or synchronization at every step. We present theoretical results and empirical evidence of convergence of DICOD, and also provide numerical comparisons with widely used algorithms for convolutional sparse coding.
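To make the coordinate-descent building block concrete, here is a sequential, single-process sketch of greedy coordinate descent on a plain (non-convolutional) sparse coding problem. It is not DICOD itself, which is distributed, asynchronous and convolutional. The function names, the list-of-rows layout of `D`, and the assumption that every column of `D` is non-zero are illustrative choices.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t (proximal operator of t * |.|)."""
    return max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0)


def greedy_cd(D, x, lam, tol=1e-10, max_iter=1000):
    """Minimise 0.5 * ||x - D z||^2 + lam * ||z||_1 by greedy coordinate descent.

    At each iteration, the coordinate whose optimal one-dimensional update
    changes the most is selected and updated. Assumes non-zero columns in D.
    """
    n, k = len(D), len(D[0])
    norms = [sum(D[i][a] ** 2 for i in range(n)) for a in range(k)]
    z = [0.0] * k
    residual = list(x)  # equals x - D z, with z = 0 initially
    for _ in range(max_iter):
        best, best_delta = 0, 0.0
        for a in range(k):
            # Correlation of atom a with the residual, adding back the
            # atom's own current contribution.
            beta = sum(D[i][a] * residual[i] for i in range(n)) + norms[a] * z[a]
            delta = soft_threshold(beta, lam) / norms[a] - z[a]
            if abs(delta) > abs(best_delta):
                best, best_delta = a, delta
        if abs(best_delta) < tol:
            break  # no coordinate can still move noticeably
        z[best] += best_delta
        for i in range(n):
            residual[i] -= D[i][best] * best_delta
    return z
```

For example, `greedy_cd([[1.0, 0.0], [0.0, 2.0]], [3.0, 4.0], 1.0)` updates the first coordinate, then the second, and stops. In a convolutional setting each update only affects nearby coefficients, which is the kind of locality that allows workers to update distant coordinates without locks.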

Groupement automatique pour l’analyse du spectre singulier
Thomas Moreau, Laurent Oudre and Nicolas Vayatis, Sep 2015, In proceedings of the Groupe de Recherche et d'Etudes en Traitement du Signal et des Images (GRETSI)

This paper introduces several automatic grouping strategies for Singular Spectrum Analysis (SSA) components in a unified framework.

This paper introduces several automatic grouping strategies for Singular Spectrum Analysis (SSA) components. This step is useful to retrieve meaningful insight about the temporal dynamics of the series. A unifying framework is proposed to evaluate several original methods and compare their efficiency with existing ones.

Détection de pas à partir de données d'accélérométrie
Laurent Oudre, Thomas Moreau, Charles Truong, Rémi Barrois-Müller, Robert Dadashi and Thomas Grégory, Sep 2015, In proceedings of the Groupe de Recherche et d'Etudes en Traitement du Signal et des Images (GRETSI)

This article presents a method for step detection from accelerometer signals based on template matching.

This article presents a method for step detection from accelerometer signals based on template matching. This method uses a library of step templates extracted from real data in order to not only count the steps but also to retrieve the start and end times of each step. The algorithm is tested on a large database of 300 recordings, composed of healthy subjects and patients with various orthopaedic disorders. Simulations show that even with only 20 templates, our method achieves remarkable results, with a 97% recall and a 96% precision, is robust, and adapts well to pathological subjects.
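The template-matching idea can be illustrated with a small sketch. The abstract does not specify the similarity measure or the selection rule, so the Pearson correlation score, the `threshold` parameter and the greedy non-overlap rule below are assumptions chosen for illustration, not the published method.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    if va == 0.0 or vb == 0.0:
        return 0.0
    return cov / math.sqrt(va * vb)


def detect_steps(signal, templates, threshold=0.9):
    """Return sorted, non-overlapping (start, end) index pairs of detected steps."""
    candidates = []
    for template in templates:
        width = len(template)
        # Slide the template over the signal and score every window.
        for start in range(len(signal) - width + 1):
            score = pearson(signal[start:start + width], template)
            if score >= threshold:
                candidates.append((score, start, start + width))
    # Keep the best-scoring matches first and drop any that overlap them.
    candidates.sort(reverse=True)
    steps = []
    for _, start, end in candidates:
        if all(end <= s or start >= e for s, e in steps):
            steps.append((start, end))
    return sorted(steps)
```

Because the score is scale-invariant, a single template `[0, 1, 3, 1, 0]` matches both a same-amplitude and a doubled-amplitude occurrence in a signal. Each returned pair gives the start index and the exclusive end index of a detected step, which mirrors the goal of recovering step boundaries and not only a step count.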

Thomas Moreau

Parietal - Inria Saclay

thomas.moreau [AT] inria.fr
