Zacharias 🐝 Voulgaris

2 years ago · 2 min. reading time

Summing Up the Data at Hand (Sampling Data Optimally)

Source: pixabay.com

Summarizing information is one of the most fundamental processes of a sentient being. You'd think that by now we'd have figured out how to automate it in the data world and create a reliable summary of a dataset for further analysis. After all, with data so abundant in many areas today, being able to do that would make the work of analysts and data scientists much easier. Nevertheless, most people still fall back on random sampling, which is a haphazard way of performing this task. The assumption is that if the data points are selected randomly, you end up with a representative sample, since randomness doesn't have any biases. Or so the statisticians would have us believe. When has Statistics ever been wrong?

In practice, a reliable and succinct representation of the data has to meet the following requirements:

the sample size needs to be customizable
the sample needs to be void of biases, i.e., yield more or less the same descriptive metrics as the original dataset
the sample needs to be the same every time the sampling method is run
the process needs to be scalable and not too computationally expensive

Many people bypass the obvious shortcoming of a random sample (its failure to meet the third requirement above) by fixing the seed of the random number generator used for the sampling. This approach, however, is a cop-out: it makes the sample reproducible, but there is no guarantee whatsoever that this particular (pseudo-random) sample is a good one, and it may well be more biased than other samples. As the analyst has better things to do than draw a bunch of random samples and pick the one that seems least biased (a process that is itself subject to selection bias), she has to rely on the goodwill of the computer, which is another way of saying that she relies on luck!
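To make the point about seeds concrete, here is a minimal sketch in Python (standard library only). The data is synthetic and the helper name `seeded_sample_means` is made up for this illustration. It shows that a fixed seed makes a sample repeatable, yet different seeds can give noticeably different sample means for the same dataset.

```python
import random
import statistics

def seeded_sample_means(data, k, seeds):
    """Return the mean of a size-k random sample of `data` for each seed."""
    means = {}
    for seed in seeds:
        rng = random.Random(seed)      # reproducible, but an arbitrary choice of sample
        sample = rng.sample(data, k)   # sampling without replacement
        means[seed] = statistics.mean(sample)
    return means

# Synthetic, skewed data, used purely for illustration
gen = random.Random(0)
population = [gen.expovariate(1.0) for _ in range(10_000)]

means = seeded_sample_means(population, k=50, seeds=range(20))
true_mean = statistics.mean(population)
worst_seed = max(means, key=lambda s: abs(means[s] - true_mean))

print(f"population mean: {true_mean:.3f}")
print(f"sample means range from {min(means.values()):.3f} to {max(means.values()):.3f}")
print(f"seed furthest from the population mean: {worst_seed}")
```

Whichever seed ends up hard-coded in a script, nothing in the procedure tells you whether it is one of the better ones or one of the worse ones.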

Obtaining a representative sample without using randomness isn't easy, which is why Statistics doesn't offer (and probably never will offer) a solution. For better or for worse, Statistics is based on probabilities, and the latter are innately linked to randomness. What if there were a way to sample a dataset properly, ticking all four boxes in the list above?

Enter the BROOM framework, a data-driven library for data engineering tasks. One of its methods performs deterministic sampling of a variable, and recently I've extended it to handle whole datasets (i.e., a collection of variables, which may or may not relate to each other; as long as they have the same number of data points, they can be merged into a matrix and processed by the method). The ofsd function, as it's called, shrinks the dataset methodically, taking into account all of the correlations among the variables at hand. As a bonus, it handles all potential outliers too (that's the "of" part of its name, which stands for "outlier-free"; I'll let you figure out what the "sd" part stands for!).
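The internals of ofsd aren't described here, so the following is not BROOM's method. It is only a rough sketch, in Python, of what deterministic sampling of a single variable can look like under two assumptions of my choosing: outliers are removed with Tukey's IQR fences, and the sample consists of evenly spaced order statistics of the sorted data. The names `drop_outliers_iqr` and `deterministic_sample` are hypothetical, and the sketch does not attempt the multi-variable, correlation-aware part.

```python
import statistics

def drop_outliers_iqr(values, k=1.5):
    """Keep values within k * IQR of the quartiles (Tukey's fences).

    Assumption for illustration only; not how BROOM defines outliers.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if lo <= v <= hi]

def deterministic_sample(values, size):
    """Pick `size` evenly spaced order statistics from the sorted data.

    The same input values always give the same sample, whatever their order.
    """
    ordered = sorted(values)
    n = len(ordered)
    if not 0 < size <= n:
        raise ValueError("size must be between 1 and len(values)")
    # Take the midpoint of each of `size` equal-width slices of the sorted positions
    return [ordered[(2 * i + 1) * n // (2 * size)] for i in range(size)]

values = list(range(1, 101)) + [10_000]   # 1..100 plus one extreme value
cleaned = drop_outliers_iqr(values)       # the extreme value is filtered out
sample = deterministic_sample(cleaned, 5)

print(sample)                                               # [11, 31, 51, 71, 91]
print(statistics.mean(sample), statistics.mean(cleaned))    # 51 vs 50.5
```

Even this naive version has a customizable sample size, returns the same sample every run, and costs little more than a sort. Whether it is free of bias depends on the data and on which descriptive metrics you care about, and that is exactly where the hard part of the problem lies.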

A deterministic sample may not sound like a big deal, but this isn't an easy problem to solve. Together with other problems of this kind, it makes data engineering (aka preprocessing of the data) a challenging matter, one that's next to impossible to automate without giving up on transparency. That's why many of us advocate that this part of the data science pipeline be done by a human instead of A.I. models (we can use the A.I. models in the next part of the pipeline). What are your thoughts on this? How would you perform this sort of task? Let me know in the comments below, or feel free to DM me. Cheers!

