Zacharias 🐝 Voulgaris

Summing Up the Data at Hand (Sampling Data Optimally)

[Image] Source: pixabay.com

Summarizing information is one of the most fundamental processes of a sentient being. You'd think that by now we'd have figured out how to automate this process in the data world, creating a reliable summary of a dataset for further analysis. After all, with the abundance of data in so many areas today, it makes sense to be able to do that, if only to facilitate the work of analysts and data scientists. Nevertheless, most people still revert to random sampling, which is a haphazard way of performing this task. The assumption is that if the data is selected randomly, you'll end up with a representative sample, since randomness doesn't have any biases. Or so the statisticians would have us believe. After all, when has Statistics ever been wrong?

In practice, having a reliable and succinct representation of the data involves the following requirements:

  • the sample size needs to be customizable
  • the sample needs to be free of biases, i.e., it should yield more or less the same descriptive metrics as the original dataset (see the sketch after this list)
  • the sample needs to be the same every time the sampling method is run
  • the process needs to be scalable and not too computationally expensive
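To make the second requirement measurable, you can compare a few descriptive metrics of the sample against those of the full dataset. Here is a minimal Python sketch of such a check; the function name bias_report and the particular metrics chosen are my own illustration, not part of any specific library.

```python
import numpy as np

def bias_report(data: np.ndarray, sample: np.ndarray) -> dict:
    """Relative difference between sample and full-data descriptive metrics.

    Small values on every metric suggest the sample is reasonably
    representative (requirement 2 in the list above)."""
    metrics = {"mean": np.mean, "std": np.std, "median": np.median}
    return {
        name: abs(f(sample) - f(data)) / (abs(f(data)) + 1e-12)
        for name, f in metrics.items()
    }

# A skewed toy dataset and one random sample from it.
rng = np.random.default_rng(42)
data = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
sample = rng.choice(data, size=1_000, replace=False)
print(bias_report(data, sample))  # e.g. {'mean': 0.02..., 'std': 0.05..., ...}
```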

Many people bypass the obvious shortcoming of a random sample (its failure to meet the third bullet point) by setting the seed of the random number generator used for the sampling. This approach, however, is just a cop-out: there is no guarantee whatsoever that the particular (pseudo-random) sample will be a good one, as it may carry more biases than other samples would. And since the analyst has better things to do than collect a bunch of random samples and pick the one that seems least biased (a process that is, by definition, subject to selection bias), she has to rely on the goodwill of the computer, which is another way of saying that she relies on luck! The snippet below makes this concrete.
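As a quick illustration (my own, not from any particular workflow), the following snippet draws same-size samples under several seeds and measures how far each sample's mean drifts from the population's. Fixing a seed makes the result reproducible, but it says nothing about how biased that particular sample happens to be.

```python
import numpy as np

# One skewed "population" and several seeded samples from it.
data = np.random.default_rng(0).exponential(scale=2.0, size=50_000)

for seed in range(5):
    sample = np.random.default_rng(seed).choice(data, size=200, replace=False)
    drift = abs(sample.mean() - data.mean()) / data.mean()
    # Reproducible per seed, yet the drift varies noticeably across seeds.
    print(f"seed={seed}: relative drift of the mean = {drift:.3%}")
```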

Obtaining a representative sample without using randomness isn't easy, which is why Statistics doesn't offer (and probably never will offer) a solution. For better or for worse, Statistics is based on probabilities, and the latter are innately linked to randomness. What if there were a way to sample a dataset properly, ticking all four boxes in the list above?

Enter the BROOM framework, a data-driven library for data engineering tasks. One of its methods involves the deterministic sampling of a variable, and recently I've extended it to handle whole datasets (i.e., collections of variables, which may or may not relate to each other; as long as they have the same number of data points, they can be merged into a matrix and processed by the method). The ofsd function, as it's called, manages to shrink the dataset methodically, taking into account all of the correlations among the variables at hand. As a bonus, it handles any potential outliers too (that's the "of" part of its name, which stands for "outlier-free"; I'll let you figure out what the "sd" part stands for!).
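The ofsd implementation itself isn't shown in this post, so here is a minimal sketch of the simplest flavor of the underlying idea, for a single variable only: walking the sorted values at evenly spaced quantiles yields a sample that is deterministic, size-customizable, and distribution-tracking by construction. The function name quantile_sample and this particular approach are my own illustration, not the BROOM code; the actual ofsd method additionally handles whole matrices, correlations among variables, and outliers.

```python
import numpy as np

def quantile_sample(x: np.ndarray, k: int) -> np.ndarray:
    """Pick k points of a single variable deterministically, by taking the
    sorted values at k evenly spaced positions (i.e., evenly spaced
    quantiles). The same input always yields the same sample."""
    order = np.argsort(x, kind="stable")  # stable sort: ties break deterministically
    positions = np.linspace(0, len(x) - 1, num=k).round().astype(int)
    return x[order[positions]]

x = np.random.default_rng(7).normal(loc=5.0, scale=2.0, size=10_000)
s = quantile_sample(x, 100)
print(f"full:   mean={x.mean():.3f} std={x.std():.3f}")
print(f"sample: mean={s.mean():.3f} std={s.std():.3f}")  # close by construction
```

Run it twice and you get the identical sample, which also makes any downstream analysis reproducible without any seed bookkeeping.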

A deterministic sample may not sound like a big deal, but this isn't an easy problem to solve. Alongside other such problems, it makes data engineering (a.k.a. data preprocessing) a challenging matter that is next to impossible to automate without giving up on transparency. That's why many of us advocate that this part of the data science pipeline be done by a human instead of A.I. models (we can use the A.I. models in the next part of the pipeline). What are your thoughts on this? How would you perform this sort of task? Let me know in the comments below, or feel free to DM me. Cheers!
