Summing Up the Data at Hand (Sampling Data Optimally)
Summarizing information is one of the most fundamental processes of a sentient being. You'd think we'd have automated this process in the data world by now, producing a reliable summary of a dataset for further analysis. After all, given the abundance of data in so many areas today, it makes sense to be able to do that, to facilitate the work of analysts and data scientists. Nevertheless, most people still fall back on sampling, which is a haphazard way of performing this task. The assumption is that if the data is selected randomly, you end up with a representative sample, since randomness doesn't have any biases. Or so the statisticians would have us believe. When has Statistics ever been wrong?
In practice, having a reliable and succinct representation of the data involves the following requirements:
- the sample size needs to be customizable
- the sample needs to be free of biases, i.e., yield more or less the same descriptive metrics as the original dataset
- the sample needs to be the same every time the sampling method is run
- the process needs to be scalable and not too computationally expensive
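The second requirement is the easiest one to check empirically. Here's a minimal sketch (the function name `sample_bias` and the toy dataset are my own, for illustration): it compares a couple of descriptive metrics between a sample and the full dataset and reports the relative discrepancies.

```python
import numpy as np

def sample_bias(data, sample):
    """Crude check of the second requirement: compare the mean and
    standard deviation of a sample against the full dataset and
    return the relative discrepancies (smaller means less biased)."""
    data = np.asarray(data, dtype=float)
    sample = np.asarray(sample, dtype=float)
    mean_gap = abs(sample.mean() - data.mean()) / (abs(data.mean()) + 1e-12)
    std_gap = abs(sample.std() - data.std()) / (data.std() + 1e-12)
    return mean_gap, std_gap

data = np.linspace(0.0, 100.0, 1001)   # toy dataset: an evenly spaced grid
subset = data[::10]                    # every 10th point (size is customizable)
mean_gap, std_gap = sample_bias(data, subset)
# both gaps are tiny here, so this particular subset passes the check
```

In practice you'd compare more metrics than these two (quantiles, correlations between variables, and so on), but the idea is the same: a sample is only as good as the metrics it preserves.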
Many people bypass the obvious shortcoming of a random sample (its failure to satisfy the third bullet point) by setting the seed of the random number generator used for the sampling. This approach, however, is just a cop-out: there is no guarantee whatsoever that the particular (pseudo-random) sample is a good one, as it may carry more bias than other samples would. Since the analyst has better things to do than collect a bunch of random samples and pick the one that seems least biased (a process that is, by definition, subject to selection bias), she has to rely on the goodwill of the computer, which is another way of saying that she relies on luck!
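To make the point concrete, here's a small sketch of this "pin the seed" practice (toy data and names of my own choosing). Pinning the seed makes the sample reproducible, but the amount of bias depends entirely on which seed you happened to pick:

```python
import numpy as np

# A skewed toy dataset, itself generated reproducibly
data = np.random.default_rng(0).exponential(scale=2.0, size=10_000)

def seeded_sample(data, size, seed):
    """Random sample that is reproducible only because the seed is pinned."""
    rng = np.random.default_rng(seed)
    return rng.choice(data, size=size, replace=False)

s1 = seeded_sample(data, 100, seed=42)
s2 = seeded_sample(data, 100, seed=42)
assert np.array_equal(s1, s2)   # same seed -> same sample, every time

# The bias, however, varies from seed to seed:
gaps = [abs(seeded_sample(data, 100, seed=s).mean() - data.mean())
        for s in range(20)]
# Some seeds land close to the true mean, others don't -- the pinned
# seed buys repeatability, not representativeness.
```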
Being able to obtain a representative sample without using randomness isn't easy, which is why Statistics doesn't offer (and probably never will offer) a solution. For better or for worse, Statistics is based on probabilities, and the latter are innately linked to randomness. What if there were a way to sample a dataset properly, ticking all four boxes in the list above?
Enter the BROOM framework, a data-driven library for data engineering tasks. One of its methods involves the deterministic sampling of a variable, and recently I've extended it to handle whole datasets (i.e., a collection of variables, which may or may not relate to each other; as long as they have the same number of data points, they can be merged into a matrix and processed by the method). The ofsd function, as it's called, manages to shrink the dataset methodically, taking into account all of the correlations among the variables at hand. As a bonus, it handles all potential outliers too (that's the "of" part of its name, which stands for "outlier-free"; I'll let you figure out what the "sd" part stands for!).
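To give a flavor of what deterministic sampling can look like, here is one simple approach for a single variable: take evenly spaced order statistics of the sorted data. To be clear, this is my own illustrative sketch, not the BROOM ofsd method, which additionally handles outliers and the correlations among multiple variables.

```python
import numpy as np

def quantile_sample(x, k):
    """Deterministic, seed-free sample of k points: pick evenly spaced
    positions in the sorted data (order statistics). The same input
    always yields the same sample, and the empirical distribution is
    preserved reasonably well. Illustrative only -- NOT the ofsd method."""
    x = np.sort(np.asarray(x, dtype=float))
    idx = np.linspace(0, len(x) - 1, k).round().astype(int)
    return x[idx]

x = np.random.default_rng(7).normal(size=5_000)  # toy data
s = quantile_sample(x, 50)                       # customizable sample size
# Repeated calls return the identical sample -- the third requirement holds:
assert np.array_equal(s, quantile_sample(x, 50))
```

This ticks the four boxes for one variable (customizable size, low bias, repeatability, and O(n log n) cost from the sort), but it falls apart for multivariate data, where sorting each column independently destroys the correlations; that's the harder part of the problem.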
A deterministic sample may not sound like a big deal, but this isn't an easy problem to solve. Together with other such problems, it makes data engineering (a.k.a. data preprocessing) a challenging matter that is next to impossible to automate without giving up any hope of transparency. That's why many of us advocate that this part of the data science pipeline be done by a human rather than by A.I. models (we can use the A.I. models in the next part of the pipeline). What are your thoughts on this? How would you perform this sort of task? Let me know in the comments below, or feel free to DM me. Cheers!