ContextCite: Attributing Model Generation to Context
Benjamin Cohen-Wang*, Harshay Shah*, Kristian Georgiev*, Aleksander Mądry
Neural Information Processing Systems
(NeurIPS), 2024
+ Workshop on Next Generation AI Safety
(ICML NextGenAISafety), 2024
How do language models actually use information provided as context when generating a response? Can we infer whether a particular generated statement is actually grounded in the context, a misinterpretation, or fabricated? To help answer these questions, we introduce the problem of context attribution: pinpointing the parts of the context (if any) that led a model to generate a particular statement. We then present ContextCite, a simple and scalable method for context attribution that can be applied on top of any existing language model. Finally, we showcase the utility of ContextCite through two case studies: (1) automatically verifying statements based on the attributed parts of the context and (2) improving response quality by extracting query-relevant information from the context.
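To make the mechanics concrete, here is a minimal sketch of attributing a response via context ablations and a sparse linear surrogate; this is an illustration, not the ContextCite implementation. The callable score_response is a hypothetical stand-in that returns the model's score (e.g., log-probability) for the fixed response when only the unmasked context sources are included.

import numpy as np
from sklearn.linear_model import Lasso

def attribute_context(sources, score_response, n_ablations=64, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly include (1) or exclude (0) each context source in every ablation.
    masks = rng.integers(0, 2, size=(n_ablations, len(sources)))
    scores = np.array([score_response(sources, mask) for mask in masks])
    # Fit a sparse linear surrogate: which sources drive the response score?
    surrogate = Lasso(alpha=0.01).fit(masks, scores)
    return surrogate.coef_  # one attribution score per context source

# Toy usage with a stand-in scorer (a real use would query a language model):
sources = ["sentence one.", "sentence two.", "sentence three."]
toy_score = lambda srcs, mask: 2.0 * mask[1] + 0.1 * mask[0]
print(attribute_context(sources, toy_score))  # source 1 gets most of the credit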
@article{cohen2024contextcite,
title={ContextCite: Attributing Model Generation to Context},
author={Cohen-Wang, Benjamin and Shah, Harshay and Georgiev, Kristian and Madry, Aleksander},
journal={arXiv preprint arXiv:2409.00729},
year={2024}
}
Decomposing and Editing Predictions by Modeling Model Computation
Harshay Shah, Andrew Ilyas, Aleksander Mądry
International Conference on Machine Learning
(ICML), 2024
+ Workshop on Foundation Model Interventions
(NeurIPS MINT), 2024
How does the internal computation of a machine learning model transform inputs into predictions? In this paper, we introduce a task called component modeling that aims to address this question. The goal of component modeling is to decompose an ML model's prediction in terms of its components: simple functions (e.g., convolution filters, attention heads) that are the "building blocks" of model computation. We focus on a special case of this task, component attribution, where the goal is to estimate the counterfactual impact of individual components on a given prediction. We then present COAR, a scalable algorithm for estimating component attributions; we demonstrate its effectiveness across models, datasets, and modalities. Finally, we show that component attributions estimated with COAR directly enable model editing across five tasks, namely: fixing model errors, "forgetting" specific classes, boosting subpopulation robustness, localizing backdoor attacks, and improving robustness to typographic attacks. We provide code for COAR at github.com/MadryLab/modelcomponents.
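As a rough illustration of the ablate-and-regress recipe (a simplification, not the COAR algorithm itself), the sketch below estimates per-component attributions for a single example. The callable eval_example is a hypothetical stand-in that returns the model's output on that example (e.g., the correct-class margin) when the masked components are ablated.

import numpy as np

def component_attributions(n_components, eval_example, n_samples=200,
                           ablate_frac=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # Each row marks which components are ablated (1) in that trial.
    masks = (rng.random((n_samples, n_components)) < ablate_frac).astype(float)
    outputs = np.array([eval_example(mask) for mask in masks])
    # Regress outputs on ablation masks; the coefficients estimate each
    # component's counterfactual impact on this example's prediction.
    design = np.hstack([masks, np.ones((n_samples, 1))])  # add an intercept column
    coef, *_ = np.linalg.lstsq(design, outputs, rcond=None)
    return coef[:-1]

# Toy usage: a fake "model" whose output depends only on components 3 and 7.
toy_eval = lambda mask: 1.0 - 0.8 * mask[3] - 0.5 * mask[7]
print(component_attributions(10, toy_eval).round(2))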
@article{shah2024decomposing,
title={Decomposing and Editing Predictions by Modeling Model Computation},
author={Shah, Harshay and Ilyas, Andrew and Madry, Aleksander},
journal={arXiv preprint arXiv:2404.11534},
year={2024}
}
Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
Samira Abnar*, Harshay Shah*, Dan Busbridge, Alaa El-Nouby, Josh Susskind, Vimal Thilak*
Workshop on Attributing Model Behavior at Scale
(NeurIPS Attrib), 2024
Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity is primarily defined along two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity is not yet fully understood. We explore this relationship in the context of sparse Mixture-of-Experts models (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the ratio of non-active to total parameters, affects both pretraining and downstream performance. We find that under different constraints (e.g., parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing work in this area, offering insights for designing more efficient architectures.
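To make the sparsity notion concrete, here is a small worked example computing the ratio of non-active to total parameters for a single MoE layer with equal-sized experts and top-k routing; the numbers are made up for illustration and are not a configuration from the paper.

def moe_sparsity(total_experts, active_experts, params_per_expert, shared_params=0):
    # Sparsity = fraction of the layer's parameters NOT used for a given token.
    total = shared_params + total_experts * params_per_expert
    active = shared_params + active_experts * params_per_expert
    return 1.0 - active / total

# e.g., 64 equal-sized experts with top-2 routing and no shared parameters:
print(moe_sparsity(total_experts=64, active_experts=2, params_per_expert=10_000_000))
# 0.96875 -> roughly 97% of the layer's parameters are inactive per token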
@misc{abnar2025parameters,
title={Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models},
author={Samira Abnar and Harshay Shah and Dan Busbridge and Alaaeldin Mohamed Elnouby Ali and Josh Susskind and Vimal Thilak},
year={2025},
eprint={2501.12370},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
ModelDiff: A Framework for Comparing Learning Algorithms
Harshay Shah*, Sung Min Park*, Andrew Ilyas*, Aleksander Mądry
International Conference on Machine Learning
(ICML), 2023
+ Workshop on Spurious Correlations, Invariance, and Stability
(ICML SCIS), 2023
We study the problem of (learning) algorithm comparison, where the goal is to find differences between models trained with two different learning algorithms. We begin by formalizing this goal as one of finding distinguishing feature transformations, i.e., input transformations that change the predictions of models trained with one learning algorithm but not the other. We then present ModelDiff, a method that leverages the datamodels framework (Ilyas et al., 2022) to compare learning algorithms based on how they use their training data. We demonstrate ModelDiff through three case studies, comparing models trained with/without data augmentation, with/without pre-training, and with different SGD hyperparameters. Our code is available at github.com/MadryLab/modeldiff.
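In the spirit of the datamodel-based comparison described above (an illustrative sketch, not the ModelDiff procedure), one can look for training-data directions that one algorithm's datamodels use but the other's do not. Here A and B are assumed to be per-example datamodel weight matrices of shape (n_test_examples, n_train_examples) for the two learning algorithms.

import numpy as np

def distinguishing_directions(A, B, k=3):
    # Orthonormal basis for the span of algorithm A's datamodel rows.
    Q, _ = np.linalg.qr(A.T)
    # Residual: the part of B's datamodels not explained by that span,
    # i.e., training data that B "uses" but A does not.
    residual = B - (B @ Q) @ Q.T
    # Top principal directions of the residual over the training set are
    # candidate distinguishing directions.
    _, _, Vt = np.linalg.svd(residual, full_matrices=False)
    return Vt[:k]

# Toy usage with random matrices (100 test examples, 500 training examples):
rng = np.random.default_rng(0)
A, B = rng.normal(size=(100, 500)), rng.normal(size=(100, 500))
print(distinguishing_directions(A, B, k=2).shape)  # (2, 500)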
@inproceedings{shah2023modeldiff,
title={ModelDiff: A Framework for Comparing Learning Algorithms},
author={Shah, Harshay and Park, Sung Min and Ilyas, Andrew and Madry, Aleksander},
booktitle={International Conference on Machine Learning},
pages={30646--30688},
year={2023},
organization={PMLR}
}
The Pitfalls of Simplicity Bias in Neural Networks
Harshay Shah, Kaustav Tamuly, Aditi Raghunathan, Prateek Jain, Praneeth Netrapalli
Neural Information Processing Systems
(NeurIPS), 2020
+ Workshop on Uncertainty and Robustness in Deep Learning
(ICML UDL), 2020
Several works have proposed Simplicity Bias (SB)—the tendency of standard training procedures such as Stochastic Gradient Descent (SGD) to find simple models—to justify why neural networks generalize well [Arpit et al. 2017, Nakkiran et al. 2019, Soudry et al. 2018]. However, the precise notion of simplicity remains vague. Furthermore, previous settings that use SB to theoretically justify why neural networks generalize well do not simultaneously capture the non-robustness of neural networks—a widely observed phenomenon in practice [Goodfellow et al. 2014, Jo and Bengio 2017]. We attempt to reconcile SB and the superior standard generalization of neural networks with the non-robustness observed in practice by designing datasets that (a) incorporate a precise notion of simplicity, (b) comprise multiple predictive features with varying levels of simplicity, and (c) capture the non-robustness of neural networks trained on real data. Through theory and empirics on these datasets, we make four observations: (i) SB of SGD and variants can be extreme: neural networks can exclusively rely on the simplest feature and remain invariant to all predictive complex features. (ii) The extreme aspect of SB could explain why seemingly benign distribution shifts and small adversarial perturbations significantly degrade model performance. (iii) Contrary to conventional wisdom, SB can also hurt generalization on the same data distribution, as SB persists even when the simplest feature has less predictive power than the more complex features. (iv) Common approaches to improve generalization and robustness—ensembles and adversarial training—can fail in mitigating SB and its pitfalls. Given the role of SB in training neural networks, we hope that the proposed datasets and methods serve as an effective testbed to evaluate novel algorithmic approaches aimed at avoiding the pitfalls of SB; code and data available at github.com/harshays/simplicitybiaspitfalls.
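Here is a toy construction in the spirit of the datasets described above, though not the paper's exact setup: both coordinates predict the label, but the first is linearly separable while the second requires a multi-interval decision rule, so one can check whether a trained network leans exclusively on the simple coordinate.

import numpy as np

def simple_plus_complex(n=1000, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n) * 2 - 1           # labels in {-1, +1}
    simple = y * (0.5 + rng.random(n))               # sign alone reveals the label
    # "Complex" coordinate: the label is encoded by which of several interleaved
    # slabs the value falls into, so no single threshold recovers it.
    slab = rng.integers(0, 3, size=n)
    complex_ = 2.0 * slab + (y + 1) / 2 + 0.1 * rng.random(n)
    X = np.stack([simple, complex_], axis=1)
    return X, y

X, y = simple_plus_complex()
print(X.shape, y[:5])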
@article{shah2020pitfalls,
title={The Pitfalls of Simplicity Bias in Neural Networks},
author={Shah, Harshay and Tamuly, Kaustav and Raghunathan, Aditi and Jain, Prateek and Netrapalli, Praneeth},
journal={Advances in Neural Information Processing Systems},
volume={33},
year={2020}
}
Growing Attributed Networks through Local Processes
Harshay Shah, Suhansanu Kumar, Hari Sundaram
World Wide Web Conference
(WWW), 2019
This paper proposes an attributed network growth model. Despite the knowledge that individuals use limited resources to form connections to similar others, we lack an understanding of how local and resource-constrained mechanisms explain the emergence of rich structural properties found in real-world networks. We make three contributions. First, we propose a parsimonious and accurate model of attributed network growth that jointly explains the emergence of in-degree distributions, local clustering, clustering-degree relationship and attribute mixing patterns. Second, our model is based on biased random walks and uses local processes to form edges without recourse to global network information. Third, we account for multiple sociological phenomena: bounded rationality, structural constraints, triadic closure, attribute homophily, and preferential attachment. Our experiments indicate that the proposed Attributed Random Walk (ARW) model accurately preserves network structure and attribute mixing patterns of six real-world networks; it improves upon the performance of eight state-of-the-art models by a statistically significant margin of 2.5-10x.
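For intuition only, below is a heavily simplified, undirected toy version of attribute-biased random-walk growth; the parameters and mechanics are invented for illustration and differ from the ARW model in the paper. Each newcomer starts a short walk from a random existing node, steps preferentially to neighbors sharing its attribute, and links to some of the nodes it visits.

import random

def grow_network(n_nodes, p_same_attr=0.7, p_link=0.5, walk_len=4, seed=0):
    random.seed(seed)
    attrs = {0: 0, 1: 1}                    # binary node attributes
    edges = {0: {1}, 1: {0}}                # connected two-node seed graph
    for v in range(2, n_nodes):
        attrs[v] = random.randint(0, 1)
        cur = random.choice(list(edges))    # start the walk at a random existing node
        edges[v] = set()
        for _ in range(walk_len):
            if random.random() < p_link:    # link to nodes encountered on the walk
                edges[v].add(cur)
                edges[cur].add(v)
            candidates = list(edges[cur] - {v})
            if not candidates:
                break
            # Biased step: prefer neighbors sharing the newcomer's attribute.
            same = [u for u in candidates if attrs[u] == attrs[v]]
            pool = same if same and random.random() < p_same_attr else candidates
            cur = random.choice(pool)
        if not edges[v]:                    # guarantee the newcomer attaches somewhere
            edges[v].add(cur)
            edges[cur].add(v)
    return edges, attrs

edges, attrs = grow_network(200)
print(sum(len(nbrs) for nbrs in edges.values()) // 2, "edges among", len(edges), "nodes")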
@inproceedings{shah2019growing,
title={Growing Attributed Networks through Local Processes},
author={Shah, Harshay and Kumar, Suhansanu and Sundaram, Hari},
booktitle={The World Wide Web Conference},
pages={3208--3214},
year={2019},
organization={ACM}
}