Statistical Inference, Observational Data and Machine Learning
- Post author: Alex Uriarte
- Post published: April 28, 2025
- Post category: Fan

I spent some time thinking about where statistical inference enters the predictions made by Machine Learning (ML) models, and came to the conclusion that, most (if not all) of the time, the answer is nowhere. The reason is the typical use of observational data by ML models. Here is my reasoning.
Statistical Inference
Statistical inference consists of using observed events to draw conclusions about the underlying process that generated them. Most commonly, this is used to draw conclusions about an unobserved population based on an observed sample.
Perhaps the most important tool in statistical inference is random sampling. Random sampling consists of selecting subsets of a population (samples) at random, that is, through some experiment where the outcome cannot be predicted in advance and where each unit of observation has the same chance of being selected for the sample.
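To make this concrete, here is a minimal sketch of simple random sampling, assuming NumPy is available; the finite "population" and all numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical finite population of 100,000 values (made up for illustration).
population = rng.lognormal(mean=10, sigma=0.5, size=100_000)

# Simple random sample without replacement: the draw cannot be predicted in
# advance, and every unit has the same chance of ending up in the sample.
sample = rng.choice(population, size=500, replace=False)

print(f"Population mean: {population.mean():,.0f}")
print(f"Sample mean:     {sample.mean():,.0f}")
```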
Random samples have a few properties that allow us to make statements about the population (or the generating process). In particular, if we have a random sample:
- The law of large numbers (LLN) states that the average of the variables in that sample converges to the average of those variables in the population as the size of the sample grows.
- The central limit theorem (CLT) states that, as multiple random samples are pulled from the same population, the average of the samples is distributed approximately normally and can be normalized to follow a standard normal distribution (with mean equal to 0 and variance equal to 1). A short simulation of both results follows this list.
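This is a small simulation sketch of the LLN and the CLT, assuming NumPy; the "population" is an arbitrary skewed distribution chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# An arbitrary skewed "population" with true mean 2.0 (illustrative only).
population = rng.exponential(scale=2.0, size=100_000)
true_mean, true_sd = population.mean(), population.std()

# LLN: the sample mean gets closer to the population mean as the sample grows.
for n in (10, 100, 10_000):
    sample = rng.choice(population, size=n, replace=False)
    print(f"n={n:>6}: sample mean = {sample.mean():.3f} (population mean = {true_mean:.3f})")

# CLT: across many random samples of a fixed size (drawn with replacement here,
# for speed), the standardized sample means are approximately standard normal.
n, n_samples = 200, 2_000
means = np.array([rng.choice(population, size=n).mean() for _ in range(n_samples)])
z = (means - true_mean) / (true_sd / np.sqrt(n))
print(f"standardized sample means: mean = {z.mean():.2f}, variance = {z.var():.2f}")
```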
Random sampling, the LLN and the CLT are all key in allowing us to draw conclusions about a population (or generating process) based on an observed sample.
Note that all the above refers to statistical inference, not causal inference. In statistical inference we draw conclusions about a population based on a sample. In causal inference we draw conclusions about how one factor may cause another. Random sampling by itself does not help with causal inference the way it does with statistical inference. Causal inference is where, for example, randomized controlled trials (RCTs) are particularly powerful. But that may be the subject for another post.
Observational Data
So what do you do when you have a set of data that was not generated through some random process, but rather where the data were simply observed? As far as I understand, the only real option is to limit ourselves to the analysis of the sample data itself, that is, to draw no inferences about the generating process. This, however, is rarely the purpose of data analysis. We most often analyze data in order to draw conclusions about the generating process and to predict future outcomes, not just to understand past observed data.
The more common practice seems to be to continue the analysis of the data as if they were a random sample from some generating process. For example, we often see econometric models where someone conducts a linear regression on observational data, estimates the p-values of the parameters (the coefficients of the regression), and then interprets those p-values as the probability of observing coefficient values as far from zero or farther than the ones estimated, in a situation where the true value of the coefficients in a hypothetical population were in fact zero. In other words, if they see a p-value low enough (say 5%), they say that in only 5% of samples extracted from such a population would we estimate a coefficient as far from zero as the one we estimated, and therefore they conclude that there is a very good chance that the true coefficient in the population is in fact not zero. They then celebrate that they found a “significant” correlation in the data. But all of this interpretation assumes that the observed data were randomly sampled from some generating process. With observational data, that is not the case. So the use of statistical inference, the calculation of p-values and hypothesis testing make no sense.
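For concreteness, the sketch below reproduces the mechanics being described, assuming NumPy and statsmodels are available; the data are synthetic and the coefficient value is made up. The point is that the reported p-values are only meaningful under the random-sampling story just described.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(seed=1)

# Synthetic data standing in for an "observational" data set (illustrative only).
n = 500
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)    # the "true" coefficient here is 0.3 by construction

X = sm.add_constant(x)              # add an intercept column
results = sm.OLS(y, X).fit()

print(results.params)               # estimated intercept and slope
print(results.pvalues)              # p-values for H0: the coefficient is zero,
                                    # which presume the data are a random sample
```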
Additional confusion is generated by terminology commonly used in linear regression. For example, the Gauss-Markov theorem gives us conditions under which an estimator is the Best Linear Unbiased Estimator (BLUE). However:
- “Estimator” refers to a linear function for obtaining an expected value of the explained variable Y given values of the explanatory variables X. It does NOT refer to a function for obtaining estimates of population parameters based on a sample;
- Similarly, “unbiased” refers to the estimator of Y in the observed data set, given the explanatory X variables, not to an estimator of the population Y. A small sketch of this reading of “estimator” follows this list.
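To make the distinction concrete, here is a minimal sketch, assuming NumPy and an entirely synthetic data set: the OLS coefficients define a linear rule for producing expected values of Y for the X values in the data at hand.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Entirely synthetic data set: an intercept column plus one explanatory variable.
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 1.0 + 2.0 * X[:, 1] + rng.normal(size=n)

# OLS coefficients: beta_hat = (X'X)^{-1} X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The "estimator" in this sense is the linear rule X @ beta_hat: expected values
# of Y for the X values in this data set, not estimates about a wider population.
y_hat = X @ beta_hat
print("coefficients:", beta_hat)
print("first fitted values:", y_hat[:3])
```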
So what does this mean for ML models?
Machine Learning
Because ML typically relies on observational data, my conclusion is that traditional machine learning models suffer from the same constraints as any other statistical analysis of observational data: because the observations are not pulled at random from a population, there is very little we can say or do to ensure that our ML estimators are unbiased estimators of an underlying generating process.
What we can do is:
- Knowing that data sets used in ML models are typically large (“big data”), consider the extent to which the data set is likely very close to the universe of interest, or even consider that the data may themselves be the universe of interest. In that case, no statistical inference is needed. In other words, consider the potential for “bias” in the observed data relative to a hypothetical population of interest.
- Consider whether bootstrapping sub-samples of the data and testing the model on many sub-samples is likely to generate estimators robust enough that, perhaps combined with the large size of the data set, we can be sufficiently confident that biases in the sample will not meaningfully affect predicted outcomes (a minimal bootstrap sketch follows this list).
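This is a minimal bootstrap sketch, assuming NumPy and scikit-learn and a synthetic data set (all names and sizes are made up): refit the model on many resamples of the observed data and see how stable the estimated coefficients are. As noted further down, this only measures variability within the observed data, not bias relative to a wider population.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(seed=3)

# Synthetic stand-in for an observational data set (illustrative only).
n = 1_000
X = rng.normal(size=(n, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

# Bootstrap: resample rows with replacement many times and refit the model,
# then look at how much the estimated coefficients move across resamples.
coefs = []
for _ in range(1_000):
    idx = rng.integers(0, n, size=n)
    coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_)

coefs = np.array(coefs)
print("mean of coefficients across resamples:", coefs.mean(axis=0))
print("std. dev. of coefficients across resamples:", coefs.std(axis=0))
```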
To understand the above, let us take a closer look at traditional ML models.
Using again the example of linear regression, regression analysis is now commonly thought of as one more application of traditional ML. In traditional ML approaches, an algorithm is applied to a (typically large) data set in search of patterns. When the outcomes in the data are “labelled” and the algorithm looks for patterns that identify the outcome, this is called supervised machine learning. This is the case of regression analysis, since we observe the dependent variable in the data set. The algorithm typically applies some optimization criterion – such as least squares or maximum likelihood – but sometimes the criteria are based on sufficiency. In ML, the machine is said to “learn” when it has generated a model, that is, when it has identified patterns by applying an algorithm to a dataset. In an unsupervised ML approach, there is no labelling of the outcome in the dataset. In either case, some algorithm searches for patterns (the sketch after the list below contrasts the two approaches), and this means that:
- “Features” of the data set in which patterns should be found are either selected by the modeler before applying the algorithm or are part of the algorithm itself.
- The criteria for identifying a pattern are defined by the choice of algorithm.
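As a rough illustration of the two approaches, here is a sketch assuming NumPy and scikit-learn, with made-up data: a supervised model (least squares) fits labelled outcomes, while an unsupervised one (k-means) searches for structure in the features alone.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=4)
X = rng.normal(size=(300, 2))

# Supervised: the outcome y is labelled in the data, and the algorithm
# (least squares here) searches for a pattern linking the features to y.
y = 2.0 * X[:, 0] + rng.normal(size=300)
reg = LinearRegression().fit(X, y)
print("fitted coefficients:", reg.coef_)

# Unsupervised: no labelled outcome; the algorithm (k-means here) searches for
# structure in the features alone. The number of clusters is a modelling choice.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))
```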
The choice of algorithm is key, and the choice is often guided by the purpose of the model, that is, by what type of event we are trying to identify in the data. However, when we want to compare different algorithms, and also when testing how good the model we obtain is:
- The common criteria typically focus on internal validation, that is, how well the model fits the data. In supervised models, this means criteria such as precision, recall and the F1 score (which balances the two). In unsupervised models, there are different types of internal validation criteria depending on the algorithm used.
- To increase the chances that the model will work well on other datasets, the dataset is typically broken into sub-samples and model parameters are estimated on more than one sub-sample. This reduces the “overfitting” of a model to one specific sub-sample. A short sketch of both points follows this list.
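This sketch illustrates both points, assuming scikit-learn and a synthetic classification problem: hold out part of the data, fit on the rest, and judge the model on the held-out observations with precision, recall and F1.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

rng = np.random.default_rng(seed=5)

# Synthetic classification data (illustrative only).
X = rng.normal(size=(2_000, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2_000) > 0).astype(int)

# Hold out 30% of the observations so that fit is judged on data the model
# did not see, which reduces overfitting to one specific sub-sample.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))
print("F1 score: ", f1_score(y_test, pred))
```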
So, back to the question of how to deal with observational data in ML models. It seems to me that the best we can do is (again) to:
- ask ourselves whether the data at hand can be considered our universe of interest, or else look for potential bias in the data (relative to our targeted universe); and
- if we think there is bias relative to our population of interest, consider whether testing the model on many bootstrapped sub-samples is likely to generate estimators robust enough that we can be sufficiently confident that biases in the sample will not meaningfully affect predicted outcomes.
But the bootstrapped sub-samples, even though they are randomly generated from the data at hand, still do not allow us to draw conclusions about an underlying process and population, unless that population is assumed to be the observed data set itself.
Sources (and note)
I did not do a very good job of registering my sources for this post. I relied mainly on my own knowledge while searching Wikipedia and interacting with ChatGPT 4o. I have become a big fan of ChatGPT 4o and have been using it considerably to learn. I find it is particularly good when we are interacting about subjects we know enough about to identify possible weak or incomplete answers, when we are able to follow up with questions that lead to answers closer to what we wanted to know, and when we are able to judge somewhat the reliability of the answers based on previous knowledge. In the future I will try to better register Wikipedia pages and ChatGPT prompts.
OpenAI. ChatGPT 4o (omni, May 2024). Accessed March–April 2025.
Wikipedia contributors. Various pages. Accessed March–April 2025.