#1. Sensitivity Analysis via the Proportion of Unmeasured Confounding
Matteo Bonvini, Edward H Kennedy
In observational studies, identification of ATEs is generally achieved by assuming "no unmeasured confounding," possibly after conditioning on enough covariates. Because this assumption is both strong and untestable, a sensitivity analysis should be performed. Common approaches include modeling the bias directly or varying the propensity scores to probe the effects of a potential unmeasured confounder. In this paper, we take a novel approach whereby the sensitivity parameter is the proportion of unmeasured confounding. We consider different assumptions on the probability of a unit being unconfounded. In each case, we derive sharp bounds on the average treatment effect as a function of the sensitivity parameter and propose nonparametric estimators that allow flexible covariate adjustment. We also introduce a one-number summary of a study's robustness to the number of confounded units. Finally, we explore finite-sample properties via simulation, and apply the methods to an observational database used to assess the effects of right...
#2. Outlier detection and a tail-adjusted boxplot based on extreme value theory
Shrijita Bhattacharya, Jan Beirlant
Whether an extreme observation is an outlier or not depends strongly on the corresponding tail behaviour of the underlying distribution. We develop an automatic, data-driven method to identify extreme tail behaviour that deviates from the intermediate and central characteristics. This allows for detecting extreme outliers or sets of extreme data that show less spread than the bulk of the data. To this end, we extend a testing method proposed in Bhattacharya et al. (2019) for the specific case of heavy-tailed models to all max-domains of attraction. Consequently, we propose a tail-adjusted boxplot which yields a more accurate representation of possible outliers. Several examples and simulation results illustrate the finite-sample behaviour of this approach.
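For context, the classical boxplot rule that a tail-adjusted version would refine flags points outside Tukey's fences, Q1 - 1.5 IQR and Q3 + 1.5 IQR. The sketch below shows only that baseline rule, not the authors' method; the function names and the sample data are made up for illustration.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return the (lower, upper) fences of the classical boxplot rule."""
    # "inclusive" treats the data as the full population when interpolating quartiles.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Return the observations that fall outside the fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample with one extreme observation.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0]
print(tukey_fences(data))
print(flag_outliers(data))
```

Because the fences depend only on the quartiles, the same multiplier k is applied whatever the tail behaviour of the data, which is the limitation the abstract points to.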
#3. Advanced analysis of temporal data using Fisher-Shannon information: theoretical development and application in geosciences
Fabian Guignard, Mohamed Laib, Federico Amato, Mikhail Kanevski
Complex non-linear time series are ubiquitous in geosciences. Quantifying complexity and non-stationarity of these data is a challenging task, and advanced complexity-based exploratory tools are required for understanding and visualizing such data. This paper discusses the Fisher-Shannon method, from which one can obtain a complexity measure and detect non-stationarity, as an efficient data exploration tool. The state-of-the-art studies related to the Fisher-Shannon measures are collected, and new analytical formulas for positive unimodal skewed distributions are proposed. Case studies on both synthetic and real data illustrate the usefulness of the Fisher-Shannon method, which can find application in different domains including time series discrimination and generation of time series features for clustering, modeling and forecasting. The paper is accompanied by Python and R libraries for the non-parametric estimation of the proposed measures.
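As a reminder of the quantities involved, the standard textbook definitions for a real-valued random variable $X$ with smooth density $f$ are given below; the paper's own notation may differ.

$$
H_X = -\int f(x)\,\log f(x)\,dx, \qquad
N_X = \frac{1}{2\pi e}\, e^{2 H_X}, \qquad
I_X = \int \frac{\bigl(f'(x)\bigr)^2}{f(x)}\,dx, \qquad
C_X = N_X\, I_X \ge 1 .
$$

Here $N_X$ is the Shannon entropy power, $I_X$ the Fisher information measure, and $C_X$ the Fisher-Shannon complexity. For a Gaussian with variance $\sigma^2$, $H_X = \tfrac{1}{2}\log(2\pi e \sigma^2)$, so $N_X = \sigma^2$, $I_X = 1/\sigma^2$ and $C_X = 1$, the lower bound.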
#4. Another look at the Lady Tasting Tea and permutation-based randomization tests
Jesse Hemerik, Jelle J. Goeman
Fisher's famous Lady Tasting Tea experiment is often referred to as the first permutation test or as an example of such a test. Permutation tests are special cases of the general group invariance test. Recently it has been emphasized that the set of permutations used within a permutation test should have a group structure, in the algebraic sense. If not, the test can be very anti-conservative. In this paper, however, we note that in the Lady Tasting Tea experiment, the type I error rate is controlled even if the set of permutations used does not correspond to a group. We explain the difference between permutation-based tests that fundamentally rely on a group structure, and permutation-based tests that do not. The latter are tests based on randomization of treatments. When using such tests, it can be useful to consider a randomization scheme that does correspond to a group. In particular, we can use randomization schemes where the number of possible treatment patterns is larger than in standard permutation-based randomization...
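The classical version of the experiment can be worked out by brute-force enumeration. The sketch below assumes the usual textbook design (8 cups, 4 with milk poured first, and the lady told to pick exactly 4); it illustrates the standard randomization p-value, not the larger randomization schemes the abstract mentions, and the function name is made up for illustration.

```python
from fractions import Fraction
from itertools import combinations

def tasting_tea_p_value(n_cups=8, n_milk_first=4, n_correct=4):
    """P(at least n_correct milk-first cups identified) under pure guessing."""
    # Label the truly milk-first cups 0..n_milk_first-1; the labelling is arbitrary.
    truth = set(range(n_milk_first))
    # Under the null, every choice of n_milk_first cups is equally likely.
    guesses = list(combinations(range(n_cups), n_milk_first))
    hits = sum(1 for g in guesses if len(truth & set(g)) >= n_correct)
    return Fraction(hits, len(guesses))

print(tasting_tea_p_value())             # all four identified: 1/70
print(tasting_tea_p_value(n_correct=3))  # at least three identified: 17/70
```

With 70 equally likely selections, only a perfect score gives a p-value below 0.05 in this design.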
#5. Bayesian Semiparametric Longitudinal Drift-Diffusion Mixed Models for Tone Learning in Adults
Giorgio Paulon, Fernando Llanos, Bharath Chandrasekaran, Abhra Sarkar
Understanding how adult humans learn to categorize novel auditory categories is an important problem in auditory behavioral neuroscience. Drift-diffusion models are popular, neurobiologically relevant approaches to assess the mechanisms underlying speech learning. Motivated by these problems, we develop a novel inverse-Gaussian drift-diffusion mixed model for multi-alternative decision-making processes in longitudinal settings. Our methodology builds on a novel Bayesian semiparametric framework for longitudinal data in the presence of a categorical covariate that allows automated assessment of the predictor's local time-varying influences. We design a Markov chain Monte Carlo algorithm for posterior computation. We evaluate the method's empirical performance through synthetic experiments. Applied to a speech category learning data set, the method provides novel insights into the underlying mechanisms.

J. Kenneth Tay, Robert Tibshirani
Sparse generalized additive models (GAMs) are an extension of sparse generalized linear models which allow a model's prediction to vary non-linearly with an input variable. This enables the data analyst to build more accurate models, especially when the linearity assumption is known to be a poor approximation of reality. Motivated by reluctant interaction modeling (Yu et al. 2019), we propose a multi-stage algorithm, called $\textit{reluctant additive modeling (RAM)}$, that can fit sparse generalized additive models at scale. It is guided by the principle that, if all else is equal, one should prefer a linear feature over a non-linear feature. Unlike existing methods for sparse GAMs, RAM can be extended easily to binary, count and survival data. We demonstrate the method's effectiveness on real and simulated examples.
#7. Covariate-dependent control limits for the detection of abnormal price changes in scanner data
Youngrae Kim, Sangkyun Kim, Johan Lim, Sungim Lee, Won Son, Heejin Hwang
Currently, large-scale sales data for consumer goods, named scanner data, are obtained by scanning the bar codes of individual products at the points of sale in retail outlets. Many national statistical offices (NSOs) attempt to use scanner data to build consumer price statistics. As in other statistical procedures, the detection of abnormal transactions in sales prices is an important step in the analysis. Two of the most popular methods for outlier detection are the quartile method and the Tukey algorithm. Both methods are based solely on information about price changes and not on other covariates (e.g., sales volume or types of retail shops) that are also available from the scanner data. In this paper, we propose a new method to detect abnormal changes in price that takes into account additional covariates, particularly sales volume. We assume that the variance of the log of the price change is a smooth function of the sales volume and estimate it from the previously observed data. We numerically show the advantage of the new...
#8. Bayesian Group Selection in Logistic Regression with Application to MRI Data Analysis
Kyoungjae Lee, Xuan Cao
We consider Bayesian logistic regression models with group-structured covariates. In high-dimensional settings, it is often assumed that only a small portion of groups are significant, so consistent group selection is of great importance. While consistent frequentist group selection methods have been proposed, theoretical properties of Bayesian group selection methods for logistic regression models have not yet been investigated. In this paper, we consider a hierarchical group spike and slab prior for logistic regression models in high-dimensional settings. Under mild conditions, we establish strong group selection consistency of the induced posterior, which is the first theoretical result in the Bayesian literature. Through simulation studies, we demonstrate that the proposed method outperforms existing state-of-the-art methods in various settings. We further apply our method to an MRI data set for predicting Parkinson's disease and show its benefits over other contenders.
