Zacharias 🐝 Voulgaris

2 years ago · 2 min. reading time

A Modern Data Pipeline

[Animated diagram of a data pipeline. Source: Semantix Brasil]

I generally don't opt for fancy animations and such, but sometimes this is the only way to convey a process's complexity and sophistication. In this case, it's the data science process, often referred to as a data pipeline. It's not the only such process, and this particular diagram shows just one of many potential implementations of the concept. Although it may seem overwhelming, this is the day-to-day work of a data scientist or a data science team. Let's delve into it.

For starters, we have a collection of data sources, depicted as the circles on the left. More often than not, these are databases, SQL or otherwise. However, they can be anything containing data, depending on the application at hand. If, for example, the pipeline involves gathering data from sensors, as in an IoT system, the source is usually some form of file transferred over the internet. In other scenarios, it can be a web application, an API, or some other computer program. The data science process is quite flexible in that regard.

Once the data source is configured, often through the invaluable help of a data engineer, its content becomes available via a data loader or a data acquisition process. Often, this data is combined with archived data, usually kept in a data lake. The data lake is well within the domain of data architects or data modelers, professionals who work closely with data scientists and who are responsible for organizing the data and securing it in the most appropriate location. Note that this location can be in the cloud or in an organization’s private data center. During this phase of the pipeline, often referred to as data engineering, the data is extracted, transformed, and loaded (ETL) into the right places. Usually, a lot of data exploration also takes place at this stage to understand the data better.
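To make the extract-transform-load idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the sample sensor data, the column names, the Celsius-to-Fahrenheit transformation, and the use of an in-memory SQLite database as the destination. A real pipeline would likely rely on dedicated tooling and far more validation.

```python
import csv
import io
import sqlite3

# Hypothetical raw sensor export; in practice this would come from a file,
# an API, or a database rather than a string in the script.
RAW_CSV = """sensor_id,reading_c,recorded_at
a1,21.5,2021-03-01T10:00:00
a1,,2021-03-01T10:05:00
b2,19.0,2021-03-01T10:00:00
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: skip rows with a missing reading, convert Celsius to Fahrenheit."""
    cleaned = []
    for row in rows:
        if not row["reading_c"]:
            continue  # missing value, so the row is dropped
        celsius = float(row["reading_c"])
        fahrenheit = celsius * 9 / 5 + 32
        cleaned.append((row["sensor_id"], fahrenheit, row["recorded_at"]))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a (hypothetical) readings table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(sensor_id TEXT, reading_f REAL, recorded_at TEXT)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(conn):
    """Run all three steps and return the number of rows now stored."""
    load(transform(extract(RAW_CSV)), conn)
    return conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print("rows loaded:", run_pipeline(connection))
```

Keeping the three steps as separate functions mirrors the way the phases are usually described, and it makes each one easier to test or swap out on its own.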

Following that, the data can be shared with other teams (e.g., developers who wish to use it as inputs for their applications, or some other database), it can be visualized (so that the management has an idea of what it can do with it), and it can be utilized through machine learning models. These models are usually predictive and, more often than not, involve some form of A.I.

Whenever machine learning is involved, data scientists play a larger role, though they are often utilized for data visualization too, depending on the team. That said, visualization is a task that can also be done by a data analyst or a data visualization specialist.

Beyond this simple diagram, there is plenty more that's related to the data models built. However, this can get quite specialized, and it's better suited for a technical book or video. Note that even though it's not explicitly shown, certain Cybersecurity protocols come into play throughout this process. This is especially the case when the data is in transit or stored in a location that other people can access, for example, a database. So, even if it’s tacit, encryption and PII-protecting processes are present. Fortunately, this is often handled by specialized professionals, though in smaller organizations, a data scientist may need to deal with it too.
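As a rough illustration of what a PII-protecting step might look like, the sketch below replaces personal fields with keyed hashes before a record moves further down the pipeline. Note that this is pseudonymization rather than encryption, and that the field names, the sample record, and the hard-coded key are all hypothetical. In practice the key would come from a secrets manager, and the choice of technique would be up to the security specialists.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; never hard-code a real secret.
SECRET_KEY = b"replace-with-a-managed-secret"

# Hypothetical set of fields treated as personally identifiable information.
PII_FIELDS = {"name", "email"}


def pseudonymize(value, key=SECRET_KEY):
    """Return a keyed SHA-256 digest (HMAC) of a string value as hex text."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def protect_record(record, pii_fields=PII_FIELDS):
    """Copy a record, replacing the values of PII fields with their digests."""
    return {
        field: pseudonymize(value) if field in pii_fields else value
        for field, value in record.items()
    }


if __name__ == "__main__":
    sample = {"name": "Ada", "email": "ada@example.com", "purchase_total": 42.5}
    print(protect_record(sample))
```

Because the same input always maps to the same digest under a given key, records can still be joined or counted per person without exposing the original values.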

So, next time someone tells you that a data scientist is just a Stats professional who also knows some programming or a programmer who knows some Stats, do what I do and roll your eyes in contempt!


If you enjoy this sort of article, where I explore technical topics at a level that's easier to comprehend, abstaining from too much jargon and Math, you'd definitely like my blog, foxydatascience.com. There I explore various topics related to data science, A.I., and Cybersecurity. Check it out when you have a moment. Cheers!

Comments

Jerry Fletcher

2 years ago #1

Zacharias, Nice view from 30,000 feet. Completely understandable.
