How can we fool LIME and SHAP? Adversarial Attacks on Post hoc Explanation Methods
As machine learning black boxes are increasingly being deployed in domains such as healthcare and criminal justice, there is growing emphasis on building tools and techniques for explaining these black boxes in an interpretable manner. Such explanations are being leveraged by domain experts to diagnose systematic errors and underlying biases of black boxes. In this paper, we demonstrate that post hoc explanation techniques that rely on input perturbations, such as LIME and SHAP, are not reliable. Specifically, we propose a novel scaffolding technique that effectively hides the biases of any given classifier by allowing an adversarial entity to craft an arbitrary desired explanation. Our approach can be used to scaffold any biased classifier in such a way that its predictions on the input data distribution still remain biased, but the post hoc explanations of the scaffolded classifier look innocuous. Using extensive evaluation with multiple real-world datasets (including COMPAS), we demonstrate how extremely biased (racist) classifiers crafted by our framework can easily fool popular explanation techniques such as LIME and SHAP into generating innocuous explanations which do not reflect the underlying biases.
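The scaffolding idea described in the abstract can be illustrated with a small sketch: a wrapper answers with the biased classifier on inputs that look like the real data distribution and with an innocuous model on inputs that look like explainer perturbations, so perturbation-based explainers like LIME and SHAP mostly probe the innocuous behavior. The sketch below is not the authors' implementation; the Gaussian-noise perturbations, the random-forest out-of-distribution detector, and the toy biased_model/innocuous_model functions are illustrative assumptions.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Toy tabular data: column 0 plays the role of a sensitive feature,
# column 1 an uncorrelated, innocuous feature.
X = rng.normal(size=(1000, 2))

def biased_model(X):
    # Predictions depend only on the sensitive feature (column 0).
    return (X[:, 0] > 0).astype(int)

def innocuous_model(X):
    # Predictions depend only on the uncorrelated feature (column 1).
    return (X[:, 1] > 0).astype(int)

# Train a detector to separate real data points from perturbation-style samples.
# Gaussian noise is a crude stand-in for LIME/SHAP-style perturbations here.
perturbed = X + rng.normal(scale=1.0, size=X.shape)
X_det = np.vstack([X, perturbed])
y_det = np.concatenate([np.ones(len(X)), np.zeros(len(perturbed))])  # 1 = in-distribution
ood_detector = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_det, y_det)

def scaffolded_classifier(X_query):
    # Biased on points judged in-distribution, innocuous on detected perturbations.
    in_dist = ood_detector.predict(X_query).astype(bool)
    return np.where(in_dist, biased_model(X_query), innocuous_model(X_query))

# On the real data distribution the scaffold reproduces the biased predictions,
# while a perturbation-based explainer probing it mostly queries the innocuous model.
print("agreement with biased model on real data:",
      (scaffolded_classifier(X) == biased_model(X)).mean())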

Authors

Dylan Slack
Sophie Hilgard
Emily Jia
Sameer Singh
Himabindu Lakkaraju

Other
Sample Sizes (N=): None
Inserted: 11/06/19 06:01PM
Words Total: 6,824
Words Unique: 1,881
Source: Abstract
Tweets
hima_lakkaraju: @berkustun @Aaroth @tmiller_unimelb @adrian_weller @lilianedwards @mikarv I think you meant this paper on adversarial interpretations: https://t.co/qzcsWmPf4z; The current paper you mention is on human subject experiments on misleading explanations https://t.co/qa3MzDT67M (so it still counts as a frontier topic :))
kato_kohaku: This one is really trending. How can we fool LIME and SHAP? Adversarial Attacks on Post hoc Explanation Methods https://t.co/arAkHLW683
arxiv_pop: Posted 2019/11/06, ranked #1 in LG (Machine Learning): How can we fool LIME and SHAP? Adversarial Attacks on Post hoc Explanation Methods https://t.co/cBoQMAWvGQ 15 Tweets 76 Retweets 246 Favorites
arxiv_cs_LG: How can we fool LIME and SHAP? Adversarial Attacks on Post hoc Explanation Methods. Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju https://t.co/vhBnYXP8YD
reddit_ml: [R] How can we fool LIME and SHAP? Adversarial Attacks on Post hoc Explanation Methods -- post hoc explanation me... https://t.co/24TrHAuVHn
__olamilekan__: Looks like both LIME and SHAP can be fooled to produce wrong explanations. https://t.co/lHl7XCGWII
arxivml: "How can we fool LIME and SHAP? Adversarial Attacks on Post hoc Explanation Methods", Dylan Slack, Sophie Hilgard, … https://t.co/1gnnnFg40y
tmhk_ab: How can we fool LIME and SHAP? Adversarial Attacks on Post hoc Explanation Methods https://t.co/y4kgw6hots
berilsirmacek: Here #AI #wars start! How to fool #LIME and #SHAP #interpretability https://t.co/FLWnQf0O1r
SciFi: How can we fool LIME and SHAP? Adversarial Attacks on Post hoc Explanation Methods. https://t.co/6r894uv4R2
BrundageBot: How can we fool LIME and SHAP? Adversarial Attacks on Post hoc Explanation Methods. Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju https://t.co/UqBOddxUND
hima_lakkaraju: Wondering if you can game explainability methods (e.g. LIME/SHAP) to say whatever you want to? Turns out you can! More details in our recent research: https://t.co/ihFHTCkH4E https://t.co/pU3QIOBMFg
hima_lakkaraju: Want to know how adversaries can game explainability techniques? Our latest research - "How can we fool LIME and SHAP? Adversarial Attacks on Explanation Methods" has answers: https://t.co/Bcx2geO3mv. Joint work with the awesome team: @dylanslack20, Sophie, Emily, @sameer_
hima_lakkaraju: Very excited about our latest research on "How can we fool LIME and SHAP? Adversarial Attacks on Explanation Methods" https://t.co/Bcx2geO3mv. Joint work with the awesome team: @dylanslack20, Sophie, Emily, @sameer_