Zacharias 🐝 Voulgaris

2 months ago · 2 min. reading time

A Modern Data Pipeline

Source: Semantix Brasil 


I generally don't opt for fancy animations and such, but sometimes they are the only way to convey a process's complexity and sophistication. In this case, it's the data science process, often referred to as a data pipeline. This isn't the only such pipeline; it's just one of many potential implementations of the concept. Although it may seem overwhelming, this is the day-to-day work of a data scientist or a data science team. Let's delve into it.

For starters, we have a collection of data sources, depicted as the circles on the left. More often than not, these are databases, SQL or otherwise. However, they can be anything containing data, depending on the application at hand. If, for example, the data pipeline of a particular process involves gathering data from sensors, as in an IoT system, the source is usually some form of file transferred via the internet. In other scenarios, it can be a web application, an API, or some other computer program. The data science process is quite flexible in that regard.
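To make the variety of sources a bit more concrete, here is a minimal sketch of how records from a SQL database, a sensor file, and an API response could be brought into one common shape. Everything in it is an assumption made for illustration: the table and column names, the sample values, and the helper functions (`read_sql`, `read_csv`, `read_json`, `load_all`) are hypothetical and not part of the pipeline in the animation.

```python
import csv
import io
import json
import sqlite3
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, object]


def read_sql(conn: sqlite3.Connection, query: str) -> Iterator[Record]:
    """Yield the rows of a SQL query as dictionaries."""
    cursor = conn.execute(query)
    columns = [col[0] for col in cursor.description]
    for row in cursor:
        yield dict(zip(columns, row))


def read_csv(text: str) -> Iterator[Record]:
    """Yield the rows of CSV text (e.g. a file sent by a sensor) as dictionaries."""
    yield from csv.DictReader(io.StringIO(text))


def read_json(payload: str) -> Iterator[Record]:
    """Yield the items of a JSON array (e.g. the body of an API response)."""
    yield from json.loads(payload)


def load_all(sources: Dict[str, Iterable[Record]]) -> List[Record]:
    """Collect records from every source, tagging each with where it came from."""
    records = []
    for name, rows in sources.items():
        for row in rows:
            records.append({"source": name, **row})
    return records


# Made-up stand-ins for the three kinds of sources (all values are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device TEXT, temp REAL)")
conn.execute("INSERT INTO readings VALUES ('db-1', 21.5)")

sources = {
    "database": read_sql(conn, "SELECT device, temp FROM readings"),
    "sensor_file": read_csv("device,temp\niot-7,19.0\n"),
    "api": read_json('[{"device": "web-3", "temp": 22.1}]'),
}
records = load_all(sources)
```

Note that the CSV reader hands back every value as text, so the sensor reading arrives as the string "19.0" rather than a number. Sorting out such inconsistencies is exactly the sort of thing the next stage of the pipeline deals with.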

Once the data source is configured, often through the invaluable help of a data engineer, its content becomes available via a data loader or a data acquisition process. Often, this data is combined with archived data, usually in the form of a data lake. This latter data storage location is well within the domain of data architects or data modelers (professionals who work closely with data scientists and who are responsible for organizing the data and securing it in the most appropriate location). Note that this location can be in the cloud or in an organization's private data center. During this phase of the pipeline, often referred to as data engineering, the data is extracted, transformed, and loaded (ETL) into the right places. Usually, a lot of data exploration also takes place at this stage to understand the data better.
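The extract, transform, and load steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the destination, a plain list stands in for the archived data, and the field names, sample values, and function names are all hypothetical.

```python
import sqlite3


def extract(new_rows, archived_rows):
    """Combine freshly acquired rows with rows from the archive."""
    return list(archived_rows) + list(new_rows)


def transform(rows):
    """Normalize device names and convert readings to numbers, skipping bad rows."""
    cleaned = []
    for row in rows:
        try:
            device = str(row["device"]).strip().lower()
            temp = float(row["temp"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing field or a non-numeric reading
        cleaned.append((device, temp))
    return cleaned


def load(conn, rows):
    """Write the cleaned rows to the destination table and return its row count."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (device TEXT, temp REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]


# Made-up sample data: one archived row, one valid new row, one unusable new row.
archived = [{"device": "IoT-7", "temp": "18.4"}]
fresh = [{"device": " iot-7 ", "temp": 19.0}, {"device": "iot-9", "temp": "n/a"}]

conn = sqlite3.connect(":memory:")
cleaned = transform(extract(fresh, archived))
loaded = load(conn, cleaned)

# A first bit of data exploration: how many readings per device, and their average.
summary = conn.execute(
    "SELECT device, COUNT(*), AVG(temp) FROM readings GROUP BY device"
).fetchall()
```

In a real setting each of these steps would be far more involved, and the destination would be whatever storage the data architects have chosen, but the overall shape of the work tends to look like this.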

Following that, the data can be shared with other teams (e.g., developers who wish to use it as input for their applications, or for some other database), it can be visualized (so that the management has an idea of what it can do with it), and it can be utilized through machine learning models. These models are usually predictive and, more often than not, involve some form of A.I.
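As a rough illustration of these three uses, the sketch below takes one small, made-up set of readings and sends it three ways: a JSON export for another team, summary figures that a chart or dashboard could display, and a very simple predictive model. A straight-line least-squares fit stands in here for the machine learning model; real models are usually far more sophisticated. All names and numbers are hypothetical.

```python
import json
from statistics import mean

# Made-up (hour, temperature) readings; purely illustrative.
readings = [(0, 18.0), (1, 19.0), (2, 20.0), (3, 21.0)]


def export_for_other_teams(rows):
    """Serialize the data as JSON so another application can consume it."""
    return json.dumps([{"hour": h, "temp": t} for h, t in rows])


def summarize_for_dashboard(rows):
    """Compute the handful of figures a chart or dashboard might display."""
    temps = [t for _, t in rows]
    return {"min": min(temps), "max": max(temps), "mean": mean(temps)}


def fit_line(rows):
    """Fit temp = slope * hour + intercept by ordinary least squares."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in rows) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return slope, intercept


shared = export_for_other_teams(readings)
dashboard = summarize_for_dashboard(readings)
slope, intercept = fit_line(readings)
forecast = slope * 4 + intercept  # predicted temperature for the next hour
```

The point is not the model itself but the fact that the same prepared data feeds all three branches of the pipeline.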

Whenever the machine learning option is leveraged, data scientists are more heavily involved, though they often handle data visualization too, depending on the team. However, visualization is a task that can also be done by a data analyst or a data visualization specialist.

Beyond this simple diagram, there is plenty more that's related to the data models built. However, this can get quite specialized, and it's better suited for a technical book or video. Note that even though it's not explicitly mentioned, certain Cybersecurity protocols come into play throughout this process. This is especially the case when the data is in transit or stored in a location that other people can access, for example, in a database. So, even if it's tacit, the presence of encryption and PII-protecting processes is there. Fortunately, this is often handled by specialized professionals, though in smaller organizations, a data scientist may need to deal with this too.
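One common PII-protecting step is pseudonymization, where identifying fields are replaced with keyed hashes before the data is stored or shared. Below is a minimal sketch of that idea using Python's standard `hmac` and `hashlib` modules. Note that this is keyed hashing, not encryption, and it is just one of several possible techniques. The key, the list of PII fields, and the sample record are hypothetical; in practice the key would come from a secrets manager rather than from the code.

```python
import hashlib
import hmac

# Hypothetical values for illustration only.
SECRET_KEY = b"replace-with-a-key-from-a-secrets-manager"
PII_FIELDS = {"name", "email"}


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace a value with its HMAC-SHA256 digest (one-way, repeatable per key)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def protect(record: dict) -> dict:
    """Return a copy of the record with its PII fields pseudonymized."""
    return {
        field: pseudonymize(str(value)) if field in PII_FIELDS else value
        for field, value in record.items()
    }


raw = {"name": "Ada Example", "email": "ada@example.com", "temp": 21.5}
safe = protect(raw)
```

Because the same input and key always give the same digest, records belonging to the same person can still be linked for analysis without exposing who that person is.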

So, next time someone tells you that a data scientist is just a Stats professional who also knows some programming or a programmer who knows some Stats, do what I do and roll your eyes in contempt!

If you enjoy this sort of article, where I explore technical topics at a level that's easier to comprehend, abstaining from too much jargon and Math, you'd definitely like my blog. There I explore various topics related to data science, A.I., and Cybersecurity. Check it out when you have a moment. Cheers!

Jerry Fletcher

2 months ago #1

Zacharias, Nice view from 30,000 feet. Completely understandable.
