Zacharias 🐝 Voulgaris

1 year ago · 2 min. reading time

Summing Up the Data at Hand (Sampling Data Optimally)



Summarizing information is one of the most fundamental processes of a sentient being. You'd think we would have figured out how to automate it in the data world by now, producing a reliable summary of a dataset for further analysis. After all, with data so abundant in many areas today, doing so would make the work of analysts and data scientists much easier. Nevertheless, most people still fall back on random sampling, which is a haphazard way of performing this task. The assumption is that if the data points are selected at random, you end up with a representative sample, since randomness doesn't have any biases. Or so the statisticians would have us believe. When has Statistics ever been wrong?

In practice, having a reliable and succinct representation of the data involves the following requirements:

  • the sample size needs to be customizable
  • the sample needs to be free of biases, i.e., yield more or less the same descriptive metrics as the original dataset
  • the sample needs to be the same every time the sampling method is run
  • the process needs to be scalable and not too computationally expensive

Many people work around the obvious shortcoming of a random sample (it fails the third requirement) by fixing the seed of the random number generator used for the sampling. This approach, however, is just a cop-out: nothing guarantees that the particular (pseudo-random) sample is a good one, as it may well be more biased than other samples. Since the analyst has better things to do than collect a bunch of random samples and pick the one that seems least biased (a process that is by definition subject to selection bias), she has to rely on the goodwill of the computer, which is another way of saying that she relies on luck!
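To make the point about seeds concrete, here is a minimal Python sketch using only the standard library. The population, the sample size, and the helper names (seeded_sample, mean_gap) are all made up for illustration. It shows that a fixed seed makes a sample repeatable, while the gap between the sample mean and the population mean still depends on which seed you happened to pick.

```python
import random
import statistics


def seeded_sample(data, k, seed):
    """Draw k items without replacement; the same seed gives the same sample."""
    rng = random.Random(seed)
    return rng.sample(data, k)


# Hypothetical skewed population, built with its own fixed seed so the
# whole script is repeatable.
pop_rng = random.Random(42)
population = [pop_rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]
pop_mean = statistics.fmean(population)


def mean_gap(sample):
    """Relative difference between the sample mean and the population mean."""
    return abs(statistics.fmean(sample) - pop_mean) / pop_mean


# Same sampling procedure, twenty different seeds: each sample is
# reproducible, but how biased it is varies from seed to seed.
gaps = {seed: mean_gap(seeded_sample(population, 50, seed)) for seed in range(20)}

for seed, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"seed={seed:2d}  relative gap in mean={gap:.3f}")
```

Picking the seed with the smallest gap from this listing would be exactly the kind of cherry-picking described above.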

Obtaining a representative sample without using randomness isn't easy, which is why Statistics doesn't offer (and probably never will offer) a solution. For better or for worse, Statistics is based on probabilities, and the latter are innately linked to randomness. What if there were a way to sample a dataset properly, ticking all four boxes of the previous list of bullet points?

Enter the BROOM framework, a data-driven library for data engineering tasks. One of its methods performs deterministic sampling of a variable, and recently I've extended it to handle whole datasets (i.e., a collection of variables, which may or may not relate to each other; as long as they have the same number of data points, they can be merged into a matrix and processed by the method). The ofsd function, as it's called, manages to shrink the dataset methodically, taking into account all of the correlations of the variables at hand. As a bonus, it handles all potential outliers too (that's the "of" part of its name, which stands for "outlier-free"; I'll let you figure out what the "sd" part stands for!).
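The post doesn't describe how ofsd works internally, so the following is not its algorithm. It is only a rough Python sketch of what deterministic sampling of a single variable could look like, using a simple stand-in technique: take the values at evenly spaced ranks of the sorted data. The function name deterministic_sample and the toy data are assumptions made for illustration, and the sketch handles neither outliers nor correlations between variables.

```python
import statistics


def deterministic_sample(values, k):
    """Return k values taken at evenly spaced ranks of the sorted data.

    No random numbers are involved, so the same input always yields the
    same sample, regardless of the order in which the values arrive.
    """
    if not 0 < k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    ordered = sorted(values)
    n = len(ordered)
    # Split the ranks into k equal-width bins and take the middle of each.
    return [ordered[int((i + 0.5) * n / k)] for i in range(k)]


# Hypothetical skewed variable, generated without any randomness.
data = [((i * 37) % 101) ** 1.5 for i in range(1000)]
sample = deterministic_sample(data, 20)

print("population mean:", round(statistics.fmean(data), 2))
print("sample mean:    ", round(statistics.fmean(sample), 2))
```

This ticks the first, third, and fourth boxes for a single variable (customizable size, repeatable, cheap to compute) and keeps the sample's quantiles close to the original ones. Extending the idea to a whole matrix while preserving correlations is the hard part, which is where a method like ofsd would come in.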

A deterministic sample may not sound like a big deal, but this isn't an easy problem to solve. Together with other such problems, it turns data engineering (aka preprocessing of the data) into a challenging matter that is next to impossible to automate without giving up any hope of transparency. That's why many of us advocate that this part of the data science pipeline be done by a human instead of A.I. models (we can use the A.I. models in the next part of the pipeline). What are your thoughts on this? How would you perform this sort of task? Let me know in the comments below, or feel free to DM me. Cheers!
