Presented on Feb 18, 2016
There are two goals in analyzing data: prediction and extracting information (or understanding and exploring the data). There are two approaches toward these goals: data modeling and algorithmic modeling.
The data modeling culture is popular among statisticians, and consists of assuming a stochastic model of how the data came to be (e.g., linear/logistic regression and the Cox model) and then estimating the model's parameters from the data. The fitted parameters can then be used for either information extraction or prediction. Models are validated using goodness-of-fit tests and by examining residuals.
The algorithmic modeling culture considers the process from the input variables x to the output variables y to be unknown. Instead of assuming the data was generated in a particular way, this approach is to find some function to approximate/mimic the data generation process, such as decision trees and neural nets. The models are validated with predictive accuracy and this approach is most popular in disciplines outside of statistics, like computer science.
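The two cultures can be sketched side by side on the same data. This scikit-learn example is purely illustrative (the synthetic dataset and model choices are my assumptions, not from the paper): the logistic regression plays the data-model role (a stochastic form for p(y|x) whose coefficients are interpreted), while the decision tree plays the algorithmic role (an unknown input-output mapping judged only by predictive accuracy).

```python
# Data model vs. algorithmic model on the same synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Data-modeling culture: assume a parametric form, fit its parameters,
# and read information out of the coefficients.
data_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Algorithmic-modeling culture: treat x -> y as unknown and approximate
# it with a flexible function, validated by predictive accuracy.
algo_model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

print("logistic regression accuracy:", data_model.score(X_te, y_te))
print("decision tree accuracy:", algo_model.score(X_te, y_te))
print("fitted coefficients (information extraction):", data_model.coef_[0][:3])
```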
A more recent distinction within the data modeling culture, not made in the paper: discriminative vs. generative models. The author focuses only on discriminative data models.
This paper argues that the focus on data modeling in statistics has:
led to irrelevant theory and questionable scientific conclusions
prevented statisticians from using more suitable algorithmic models
prevented statisticians from working on interesting problems
The ozone project: a complete failure due to high false alarm rate. He blames generative modeling.
The chlorine project: success with 95% accuracy, using a decision tree. He claims it is a win for discriminative models.
Data models: conclusions are drawn about the model, not about the data. Gender bias in salary: a study collected 25 variables and concluded at the 5% significance level that gender bias exists in salaries. He argues this conclusion could be erroneous, but does not offer solutions.
Limitations of data models: he brushes aside more complicated models (beyond regression and the like) as too difficult to interpret.
Theory of algorithmic modeling still assumes that the data are iid draws from an unknown multivariate distribution.
Rashomon: multiplicity of good models
in statistics, many different models can achieve the same error rate, so which should we choose?
bagging (by Breiman, the author) helps for both algorithms and data modeling
Occam dilemma: interpretability and prediction accuracy are at odds; Breiman's advice is to go for predictive accuracy first
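Bagging (bootstrap aggregating) can be sketched in a few lines. This NumPy/scikit-learn version is an illustrative assumption, not Breiman's original implementation: each model is trained on a bootstrap resample of the data, and the ensemble predicts by majority vote.

```python
# Bagging sketch: fit each tree on a bootstrap resample (sampling rows
# with replacement), then combine the trees by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)

models = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Majority vote across the ensemble.
votes = np.mean([m.predict(X) for m in models], axis=0)
bagged_pred = (votes > 0.5).astype(int)
print("bagged training accuracy:", np.mean(bagged_pred == y))
```

Averaging over bootstrap replicates reduces variance, which is why bagging helps high-variance learners like deep trees on both algorithmic and data models.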
Bellman’s curse of dimensionality
we want to reduce the number of variables to reduce the variance of the model
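Variable reduction can be illustrated with a quick univariate feature-selection sketch (the dataset and the choice of scikit-learn's SelectKBest are assumptions for illustration, not from the paper): out of 100 features with only 5 informative ones, we keep the 5 highest-scoring.

```python
# Reduce 100 variables to the 5 with the strongest univariate signal.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=100,
                           n_informative=5, random_state=0)
X_reduced = SelectKBest(f_classif, k=5).fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)
```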
Greg showed that in genomics probabilistic models are more popular (in terms of citations), but algorithmic models are also very popular.
Comments / responses:
Efron: “At first glance Leo Breiman’s stimulating paper looks like an argument against parsimony and scientific insight, and in favor of black boxes with lots of knobs to twiddle. At second glance it still looks that way, but the paper is stimulating, and Leo has some important points to hammer home.”
Hoadley: “Algorithmic modeling is a very important area of statistics. It has evolved naturally in environments with lots of data and lots of decisions. But you can do it without suffering the Occam dilemma; for example, use medium trees with interpretable GAMs in the leaves.”
Parzen: prediction (management/profit) vs. science (truth)