As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions. For example, in healthcare and consumer markets, it has been suggested that individuals should be compensated for the data that they generate, but it is not clear what an equitable valuation for individual data would be. In this talk, we discuss a principled framework to address data valuation in the context of supervised machine learning. Given a learning algorithm trained on a number of data points to produce a predictor, we propose data Shapley as a metric to quantify the value of each training datum to the predictor's performance. The data Shapley value uniquely satisfies several natural properties of equitable data valuation. We introduce Monte Carlo and gradient-based methods to efficiently estimate data Shapley values in practical settings where complex learning algorithms, including neural networks, are trained on large datasets. We then briefly discuss the notion of distributional Shapley, where the value of a point is defined in the context of the underlying data distribution.
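The Monte Carlo estimation mentioned in the abstract can be sketched roughly as follows: sample random permutations of the training set and average each point's marginal contribution to validation performance. This is a minimal illustration under assumptions of my own, not the speakers' implementation; the 1-nearest-neighbour scorer, the toy dataset, and the function names are all invented for the example.

```python
import random

def score(train, val):
    """Validation accuracy of a 1-nearest-neighbour classifier fit on `train`.
    (A stand-in value function; the framework allows any learning algorithm.)"""
    if not train:
        return 0.0
    correct = 0
    for x_val, y_val in val:
        _, y_pred = min(train, key=lambda p: abs(p[0] - x_val))
        correct += int(y_pred == y_val)
    return correct / len(val)

def monte_carlo_data_shapley(train, val, n_perms=200, seed=0):
    """Estimate data Shapley values by averaging each point's marginal
    contribution to the score over random permutations of the training set."""
    rng = random.Random(seed)
    values = [0.0] * len(train)
    for _ in range(n_perms):
        order = list(range(len(train)))
        rng.shuffle(order)
        subset, prev = [], score([], val)
        for i in order:
            subset.append(train[i])
            cur = score(subset, val)
            values[i] += cur - prev  # marginal contribution of point i
            prev = cur
    return [v / n_perms for v in values]

# Toy 1-D dataset of (feature, label) pairs; the last point is mislabelled.
train = [(0.0, 0), (0.2, 0), (1.0, 1), (0.9, 1), (0.1, 1)]
val = [(0.02, 0), (0.18, 0), (0.96, 1), (1.1, 1)]

values = monte_carlo_data_shapley(train, val)
# Within each permutation the marginals telescope, so the estimates always
# sum exactly to score(train, val) - score([], val) (the efficiency property).
```

In practice a mislabelled or noisy point like the last one tends to receive a low (often negative) estimated value, which is one of the diagnostic uses of data Shapley discussed in the talk.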
He is a PhD student at Stanford University. He has previously been a research intern at Google Brain and Google Brain Medical, working on machine learning interpretability and fairness. Before joining Stanford, he earned his Bachelor of Science in Electrical Engineering at Sharif University of Technology, working on problems in signal processing and game theory. He has been working on machine learning fairness, machine learning interpretability, and applications of machine learning in healthcare and genomics.