Zacharias 🐝 Voulgaris

1 year ago · 2 min. reading time · ~100


A Modern Data Pipeline

Source: Semantix Brasil

I generally don't opt for fancy animations and such, but sometimes they are the only way to convey a process's complexity and sophistication. In this case, the process is the data science process, often referred to as a data pipeline (this particular diagram is just one of many potential implementations of the concept). Although it may seem overwhelming, this is the day-to-day work of a data scientist or a data science team. Let's delve into it.

For starters, we have a collection of data sources, depicted as the circles on the left. More often than not, these are databases, SQL or otherwise. However, they can be anything containing data, depending on the application at hand. If, for example, the data pipeline involves gathering data from sensors, as in an IoT system, the source is usually some form of file transferred over the internet. In other scenarios, it can be a web application, an API, or some other computer program. The data science process is quite flexible in that regard.

Once the data source is configured, often through the invaluable help of a data engineer, its content becomes available via a data loader or a data acquisition process. Often, this data is combined with archived data, usually in the form of a data lake. The data lake is well within the domain of data architects or data modelers, professionals who work closely with data scientists and who are responsible for organizing the data and securing it in the most appropriate location. Note that this location can be on the cloud or in an organization's private data center. During this phase of the pipeline, often referred to as data engineering, the data is extracted, transformed, and loaded (ETL) into the right places. Usually, a lot of data exploration takes place to understand the data better.
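To make the extract, transform, and load steps a bit more tangible, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the sensor CSV, the `readings` table, and the function names are hypothetical, and a real pipeline would typically rely on dedicated tooling and a proper data store rather than an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical raw sensor export; in practice this would arrive as a file
# or an API response rather than a string in the code.
RAW_CSV = """sensor_id,reading_c,recorded_at
s1,21.5,2023-01-01T00:00:00
s2,,2023-01-01T00:00:00
s1,22.1,2023-01-01T01:00:00
"""


def extract(text):
    """Extract: parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing reading and convert types."""
    cleaned = []
    for row in rows:
        if not row["reading_c"]:
            continue  # skip incomplete measurements
        cleaned.append(
            (row["sensor_id"], float(row["reading_c"]), row["recorded_at"])
        )
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a (hypothetical) readings table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(sensor_id TEXT, reading_c REAL, recorded_at TEXT)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(text, conn):
    """Run all three steps and return the number of rows now stored."""
    load(transform(extract(text)), conn)
    return conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

Calling `run_pipeline(RAW_CSV, sqlite3.connect(":memory:"))` stores the two complete readings and skips the one with a missing value. Keeping the three steps as separate functions mirrors how the work is often split between a data engineer (extract, load) and a data scientist (transform, explore).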

Following that, the data can be shared with other teams (e.g., developers who wish to use it as input for their applications, or for some other database), it can be visualized (so that management has an idea of what it can do with it), and it can be utilized through machine learning models. These models are usually predictive and, more often than not, involve some form of A.I.

Whenever machine learning models are used, data scientists are involved more, though they are often called on for data visualization too, depending on the team. Visualization, however, is a task that can also be done by a data analyst or a data visualization specialist.

Beyond this simple diagram, there is plenty more that's related to the data models built. However, this can get quite specialized, and it's better suited for a technical book or video. Note that even though it's not explicitly shown, certain Cybersecurity protocols come into play throughout this process. This is especially the case when the data is in transit or stored in a location that other people can access, for example, in a database. So, even if it's tacit, the presence of encryption and PII-protecting processes is there. Fortunately, this is often handled by specialized professionals, though in smaller organizations, a data scientist may need to deal with this too.
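As a rough illustration of what a PII-protecting step might look like, the sketch below replaces personal fields with keyed hashes before a record moves further down the pipeline. It is only an assumption of one possible approach (pseudonymization with HMAC-SHA256, which is not the same thing as encryption), and the field names, the key, and the function names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; a real system would fetch this from a secrets manager
# instead of hard-coding it.
SECRET_KEY = b"replace-with-a-key-from-a-secrets-manager"

# Hypothetical set of fields treated as PII in this example.
PII_FIELDS = {"name", "email"}


def pseudonymize(value, key=SECRET_KEY):
    """Return a keyed SHA-256 digest (hex) standing in for the original value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def protect_record(record, pii_fields=PII_FIELDS):
    """Copy a record, replacing PII string fields with their pseudonyms."""
    return {
        field: pseudonymize(value) if field in pii_fields else value
        for field, value in record.items()
    }
```

Because the same input always maps to the same digest under a given key, records belonging to one person can still be joined or counted without the raw name or email ever reaching the analysis stage.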

So, next time someone tells you that a data scientist is just a Stats professional who also knows some programming or a programmer who knows some Stats, do what I do and roll your eyes in contempt!

If you enjoy this sort of article, where I explore technical topics at a level that's easier to comprehend, abstaining from too much jargon and Math, you'd definitely like my blog. There I explore various topics related to data science, A.I., and Cybersecurity. Check it out when you have a moment. Cheers!


Jerry Fletcher

1 year ago #1

Zacharias, Nice view from 30,000 feet. Completely understandable.
