##### #1. Shapley Interpretation and Activation in Neural Networks
We propose a novel Shapley value approach to help address neural networks' interpretability and "vanishing gradient" problems. Our method is based on an accurate analytical approximation to the Shapley value of a neuron with ReLU activation. This analytical approximation admits a linear propagation of relevance across neural network layers, resulting in a simple, fast and sensible interpretation of neural networks' decision-making process. We then derive a globally continuous and non-vanishing Shapley gradient, which can replace the conventional gradient in training neural network layers with ReLU activation, leading to better training performance. We further derive a Shapley Activation (SA) function, which is a close approximation to ReLU but features the Shapley gradient. The SA is easy to implement in existing machine learning frameworks. Numerical tests show that SA consistently outperforms ReLU in training convergence, accuracy and stability.
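For orientation, the quantity being approximated can be computed exactly for a small neuron by brute force. The sketch below enumerates all input orderings for a single ReLU neuron; it is not the analytical approximation from the abstract. The function name `shapley_relu` and the convention that an absent input contributes zero to the pre-activation are assumptions made for illustration.

```python
from itertools import permutations
from math import factorial


def relu(z):
    return max(0.0, z)


def shapley_relu(weights, inputs, bias=0.0):
    """Exact Shapley value of each input to relu(w . x + b).

    Assumption: an input that is "absent" from a coalition contributes 0
    to the pre-activation (zero baseline).
    """
    n = len(weights)
    contrib = [w * x for w, x in zip(weights, inputs)]
    phi = [0.0] * n
    # Average each input's marginal contribution over all orderings.
    for order in permutations(range(n)):
        pre = bias
        for i in order:
            before = relu(pre)
            pre += contrib[i]
            phi[i] += relu(pre) - before
    n_perm = factorial(n)
    return [p / n_perm for p in phi]


phi = shapley_relu([1.0, -2.0, 0.5], [2.0, 0.5, 4.0])
# Efficiency property: the values sum to relu(w . x + b) - relu(b).
total = sum(phi)
```

The enumeration costs n! evaluations, which is why an analytical approximation is attractive for neurons with many inputs.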
##### #2. A Knowledge Transfer Framework for Differentially Private Sparse Learning
###### Lingxiao Wang, Quanquan Gu
We study the problem of estimating high dimensional models with underlying sparse structures while preserving the privacy of each training example. We develop a differentially private high-dimensional sparse learning framework using the idea of knowledge transfer. More specifically, we propose to distill the knowledge from a "teacher" estimator trained on a private dataset, by creating a new dataset from auxiliary features, and then train a differentially private "student" estimator using this new dataset. In addition, we establish the linear convergence rate as well as the utility guarantee for our proposed method. For sparse linear regression and sparse logistic regression, our method achieves improved utility guarantees compared with the best known results (Kifer et al., 2012; Wang and Gu, 2019). We further demonstrate the superiority of our framework through both synthetic and real-world data experiments.
##### #3. Active learning for level set estimation under cost-dependent input uncertainty
###### Yu Inatsu, Masayuki Karasuyama, Keiichi Inoue, Ichiro Takeuchi
As part of a quality control process in manufacturing, it is often necessary to test whether all parts of a product satisfy a required property, with as few inspections as possible. When multiple inspection apparatuses with different costs and precision exist, it is desirable that testing can be carried out cost-effectively by properly controlling the trade-off between the costs and the precision. In this paper, we formulate this as a level set estimation (LSE) problem under cost-dependent input uncertainty. LSE is a type of active learning for estimating the level set, i.e., the subset of the input space in which an unknown function value is greater or smaller than a pre-determined threshold. We then propose a new algorithm for LSE under cost-dependent input uncertainty with a theoretical convergence guarantee. We demonstrate the effectiveness of the proposed algorithm by applying it to synthetic and real datasets.
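To make the level set notion concrete, here is a minimal sketch of a confidence-interval classification rule of the kind commonly used in LSE. It assumes a posterior mean and standard deviation are already available for each candidate point, and it ignores input uncertainty and inspection cost entirely, so it is not the algorithm proposed in the paper. The name `classify_level_set` and the width parameter `beta` are hypothetical.

```python
def classify_level_set(means, stds, threshold, beta=2.0):
    """Label each point relative to the threshold using mean +/- beta * std.

    Assumption: means/stds come from some surrogate model (e.g. a Gaussian
    process posterior); beta controls how conservative the labels are.
    """
    labels = []
    for m, s in zip(means, stds):
        lower, upper = m - beta * s, m + beta * s
        if lower > threshold:
            labels.append("above")      # confidently in the upper level set
        elif upper < threshold:
            labels.append("below")      # confidently in the lower level set
        else:
            labels.append("uncertain")  # candidate for further inspection
    return labels


labels = classify_level_set([1.5, 0.2, 0.9], [0.1, 0.1, 0.3], threshold=1.0)
```

An active learning loop would spend its next inspection on a point still labeled "uncertain".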
##### #4. Shallow Self-Learning for Reject Inference in Credit Scoring
###### Nikita Kozodoi, Panagiotis Katsas, Stefan Lessmann, Luis Moreira-Matias, Konstantinos Papakonstantinou
Credit scoring models support loan approval decisions in the financial services industry. Lenders train these models on data from previously granted credit applications, where the borrowers' repayment behavior has been observed. This approach creates sample bias. The scoring model (i.e., classifier) is trained on accepted cases only. Applying the resulting model to screen credit applications from the population of all borrowers degrades model performance. Reject inference comprises techniques to overcome sampling bias through assigning labels to rejected cases. The paper makes two contributions. First, we propose a self-learning framework for reject inference. The framework is geared toward real-world credit scoring requirements through considering distinct training regimes for iterative labeling and model training. Second, we introduce a new measure to assess the effectiveness of reject inference strategies. Our measure leverages domain knowledge to avoid artificial labeling of rejected cases during strategy evaluation. We...
##### #5. d-blink: Distributed End-to-End Bayesian Entity Resolution
###### Neil G. Marchant, Rebecca C. Steorts, Andee Kaplan, Benjamin I. P. Rubinstein, Daniel N. Elazar
Entity resolution (ER) (record linkage or de-duplication) is the process of merging together noisy databases, often in the absence of a unique identifier. A major advancement in ER methodology has been the application of Bayesian generative models. Such models provide a natural framework for clustering records to unobserved (latent) entities, while providing exact uncertainty quantification and tight performance bounds. Despite these advancements, existing models do not scale to realistically-sized databases (larger than 1000 records) and they do not incorporate probabilistic blocking. In this paper, we propose "distributed Bayesian linkage" or d-blink -- the first scalable and distributed end-to-end Bayesian model for ER, which propagates uncertainty in blocking, matching and merging. We make several novel contributions, including: (i) incorporating probabilistic blocking directly into the model through auxiliary partitions; (ii) support for missing values; (iii) a partially-collapsed Gibbs sampler; and (iv) a novel perturbation...
##### #6. Estimating Fisher Information Matrix in Latent Variable Models based on the Score Function
###### Maud Delattre, Estelle Kuhn
The Fisher information matrix (FIM) is a key quantity in statistics, as it is required, for example, for evaluating asymptotic precisions of parameter estimates, for computing test statistics or asymptotic distributions in statistical testing, and for evaluating post-model-selection inference results or optimality criteria in experimental designs. However, its exact computation is often not trivial. In particular, in many latent variable models it is intricate due to the presence of unobserved variables. Therefore the observed FIM is usually considered in this context to estimate the FIM. Several methods have been proposed to approximate the observed FIM when it cannot be evaluated analytically. Among the most frequently used approaches are Monte-Carlo methods or iterative algorithms derived from the missing information principle. All these methods require computing second derivatives of the complete data log-likelihood, which leads to some disadvantages from a computational point of view. In this paper, we present a new approach to...
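As background for the score-based viewpoint in the title: under standard regularity conditions, the FIM equals the expected outer product of the score, so it can be estimated from first derivatives alone. The sketch below illustrates this identity on a toy Gaussian model with known unit variance and no latent variables. It is not the estimator proposed in the paper, and the function names are hypothetical.

```python
import random


def score_gaussian_mean(x, mu):
    # d/dmu log N(x | mu, 1) = x - mu
    return x - mu


def fim_from_scores(samples, mu):
    """Estimate the Fisher information for mu as the mean squared score."""
    scores = [score_gaussian_mean(x, mu) for x in samples]
    return sum(s * s for s in scores) / len(scores)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
mu_true = 1.5
samples = [rng.gauss(mu_true, 1.0) for _ in range(20000)]
fim_hat = fim_from_scores(samples, mu_true)
# The exact Fisher information for mu in N(mu, 1) is 1.
```

No second derivative of the log-likelihood is evaluated anywhere in this estimate.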
##### #7. Monte Carlo Approximation of Bayes Factors via Mixing with Surrogate Distributions
###### Chenguang Dai, Jun S. Liu
By mixing the posterior distribution with a surrogate distribution, of which the normalizing constant is tractable, we describe a new method to estimate the normalizing constant using the Wang-Landau algorithm. We then introduce an accelerated version of the proposed method using the momentum technique. In addition, several extensions are discussed, including (1) a parallel variant, which inserts a sequence of intermediate distributions between the posterior distribution and the surrogate distribution, to further improve the efficiency of the proposed method; (2) the use of the surrogate distribution to help detect potential multimodality of the posterior distribution, upon which a better sampler can be designed utilizing mode jumping algorithms; (3) a new jumping mechanism for general reversible jump Markov chain Monte Carlo algorithms that combines the Multiple-try Metropolis and the directional sampling algorithm, which can be used to estimate the normalizing constant when a surrogate distribution is difficult to come by. We...
##### #8. Generalized Records for Functional Time Series with Application to Unit Root Tests
###### Israel Martínez-Hernández, Marc G. Genton
A generalization of the definition of records to functional data is proposed. The definition is based on ranking curves using a notion of functional depth. This approach allows us to study the curves of the number of records over time. We focus on functional time series and apply ideas from univariate time series to demonstrate the asymptotic distribution describing the number of records. A unit root test is proposed as an application of functional record theory. Through a Monte Carlo study, different scenarios of functional processes are simulated to evaluate the performance of the unit root test. The generalized record definition is applied to two datasets: annual mortality rates in France and daily curves of wind speed at Yanbu, Saudi Arabia. The record curves are identified and the underlying functional process is studied based on the number of record curves observed.
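The classical univariate notion that this work generalizes can be sketched in a few lines: an observation is an upper record if it exceeds every earlier observation. The code below tracks the running number of records in a scalar series. It does not implement the depth-based functional definition from the paper, and the name `count_records` is hypothetical.

```python
def count_records(series):
    """Return the running count of upper records in a scalar series.

    Convention assumed here: the first observation counts as a record,
    and ties do not create a new record.
    """
    counts = []
    n_records = 0
    current_max = float("-inf")
    for value in series:
        if value > current_max:
            n_records += 1
            current_max = value
        counts.append(n_records)
    return counts


counts = count_records([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
```

For i.i.d. continuous observations the expected number of records after n steps is the harmonic number 1 + 1/2 + ... + 1/n, so records become rare quickly; how fast the count grows is the kind of behaviour a record-based test can exploit.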
##### #9. MACE: Multiscale Abrupt Change Estimation Under Complex Temporal Dynamics
###### Weichi Wu, Zhou Zhou
We consider the problem of detecting abrupt changes in an otherwise smoothly evolving trend whilst the covariance and higher-order structures of the system can experience both smooth and abrupt changes over time. The number of abrupt change points is allowed to diverge to infinity with the jump sizes possibly shrinking to zero. The method is based on a multiscale application of an optimal jump-pass filter to the time series, where the scales are dense between admissible lower and upper bounds. The MACE method is shown to be able to detect all abrupt change points within a nearly optimal range with a prescribed probability asymptotically. For a time series of length $n$, the computational complexity of MACE is $O(n)$ for each scale and $O(n\log^{1+\epsilon} n)$ overall, where $\epsilon$ is an arbitrarily small positive constant. Simulations and data analysis show that, under complex temporal dynamics, MACE performs favourably compared with some of the state-of-the-art multiscale change point detection methods.
##### #10. A Double Penalty Model for Interpretability
###### Wenjia Wang, Yi-Hui Zhou
Modern statistical learning techniques have often emphasized prediction performance over interpretability, giving rise to "black box" models that may be difficult to understand, and to generalize to other settings. We conceptually divide a prediction model into interpretable and non-interpretable portions, as a means to produce models that are highly interpretable with little loss in performance. Implementation of the model is achieved by considering separability of the interpretable and non-interpretable portions, along with a doubly penalized procedure for model fitting. We specify conditions under which convergence of model estimation can be achieved via cyclic coordinate ascent, and the consistency of model estimation holds. We apply the methods to datasets for microbiome host trait prediction and a diabetes trait, and discuss practical tradeoff diagnostics to select models with high interpretability.
