Make a career shift and become a Data Engineer

Data is eating the world. Humans and IoT devices generate zettabytes of data! This data evolution created new opportunities for the software engineers, so far reduced in their data interactions to the punctual processing of megabytes of data. If you want to be a part of this revolution, mastering data engineering skills is necessary.

If you have some coding experience and want to make a career shift to adapt to the new data-driven companies, then you should check Become a Data Engineer course where in 11 modules you will learn data engineering concepts. Every module is composed of theoretical and demo lessons terminated by the practical homework coding exercise. It means that by the end you should be able to write streaming and batch pipelines and also understand better what is going on in the data world.

About your instructor

My name is Bartosz Konieczny and I am a data engineer working with software since 2009. I'm also an Apache Spark enthusiast, AWS and GCP certified cloud user, blogger, and speaker. I like to share and you can discover it on my waitingforcode.com blog or conferences like Spark+AI Summit 2019 or Data+AI Summit 2020.
Check me on social media: Bartosz Konieczny Twitter Github Stack Overflow Facebook YouTube

What I will learn?

Module 1
Data engineering 101
Theory Homework
Introduction to data engineering. The module covers all basic concepts that should help you work with the next modules.
⏲️ 130 minutes
FREE

Enroll in ➡️


  • History
  • Data engineer
  • Data pipeline
  • Batch processing
  • Streaming processing
  • Data architectures
  • Data stores
  • Standalone data processing 💻 (theory + demo)
  • Distributed data processing
  • "Data" in a nutshell
Module 2
Data ingestion - offline
Theory Demos GitHub code Homework
Data ingestion is the first task of each data system. In this module you'll see how to perform an offline ingestion. Demo tools: Airbyte, Apache Spark, Elasticsearch, Great Expectations
⏲️ 143 minutes
$21

Enroll in ➡️


  • Data ingestion types 💻 (theory + demo)
  • Data migration 💻
  • Scalability 💻
  • Auto-scaling
  • Eventual consistency
  • Bulk operations 💻
  • Compression 💻
  • Low Code
  • Data quality 💻
Module 3
Data ingestion - real-time
Theory Demos GitHub code Homework
Offline data ingestion from the previous module comes with some latency. The real-time approach tends to ingest the data as soon as possible, very often within seconds. Demo tools: Apache Kafka, Apache Spark, Debezium, Kafka Connect, KSQL
⏲️ 136 minutes
$21

Enroll in ➡️


  • API Gateway vs. direct ingestion
  • API Gateway 💻 (theory + demo)
  • Change Data Capture 💻
  • Files streaming 💻
  • Delivery semantics 💻
  • Idempotency 💻
  • Batch layer
  • Real-time ad-hoc querying 💻
  • Polyglot persistence
Module 4
Data cleansing
Theory Demos GitHub code Homework
You're doing great so far! The data is ingested to our system but unfortunately, it has some issues. In this module you'll see how to prepare it for further usage. Demo tools: Apache Avro, Apache Kafka, Apache Spark, Elasticsearch
70%
The module is not ready yet. Release period: May 2023

Inform me about the release >


  • Data enrichment 💻 (theory + demo)
  • Data anonymization
  • Deduplication 💻
  • Schema
  • Schema registry 💻
  • Schema management 💻
  • Metadata 💻
  • Binary file formats 💻
  • Monitoring and alerting 💻
Module 5
Streaming processing
Theory Demos GitHub code Homework
You've noticed it probably. Yes, you've implemented a streaming data processing pipeline in the previous module. But it was dedicated to the data cleansing and you haven't had a chance to see streaming concepts in depth. It'll be done in this module! Demo tools: Apache Flink, Apache Kafka, Apache Spark
70%
The module is not ready yet. Release period: May 2023

Inform me about the release >


  • Patterns 💻 (theory + demo)
  • Architectures
  • Transformations 💻
  • Event time vs. processing time 💻
  • Scalability 💻
  • Auto-scaling 💻
  • Reprocessing 💻
  • Messaging patterns
  • Backpressure 💻
  • Debugging
Module 6
Streaming stateful processing
Theory Demos GitHub code Homework
Streaming processing has multiple facets. Stateful is probably the most challenging one since it requires dealing with a state store besides the classical streaming data processing logic. Demo tools: Apache Flink, Apache Kafka, Apache Spark
25%
The module is not ready yet. Release period: June 2023

Inform me about the release >


  • Stateless vs. stateful
  • State store 💻 (theory + demo)
  • Incremental and full state
  • Triggers 💻
  • Watermarks and late data 💻
  • Fault-tolerance 💻
  • Idempotency 💻
  • Aggregations 💻
  • Arbitrary stateful processing 💻
  • Joins 💻
  • Windows 💻
  • Complex Event Processing 💻
Module 7
Batch processing - ETL
Theory Demos GitHub code Homework
Even though you went into streaming, the batch is still there! It's time to see the first pattern for writing batch pipelines called Extract Transform Load (ETL).
1%
The module is not ready yet. Release period: July 2023

Inform me about the release >


  • Staging area
  • Data pipeline steps 💻 (theory + demo)
  • Data orchestration 💻
  • Patterns 💻
  • Triggers 💻
  • Tasks 💻
  • Alerting
  • Idempotency: append data 💻
  • Idempotency: immutable data 💻
  • Data reprocessing 💻
  • Best practices
Module 8
Batch processing - ELT
Theory Demos GitHub code Homework
To complete the ETL approach, you can also use an alternative called Extract Load Transform (ELT). There is a subtle difference in the steps that involves a few other things.
1%
The module is not ready yet. Release period: July 2023

Inform me about the release >


  • Definition
  • Staging area
  • Data pipeline steps 💻 (theory + demo)
  • Data extraction 💻
  • Data loading 💻
  • Execution plans 💻
  • Idempotency 💻
  • Tests 💻
  • ETL or ELT?
Module 9
Data warehouse
Theory Demos GitHub code Homework
Data warehouse is still the most popular approach to deal with analytics data. However, the alternatives like data lakehouse exist and they will complete the lessons.
1%
The module is not ready yet. Release period: September 2023

Inform me about the release >


  • Data warehousing
  • Columnar format vs row format 💻 (theory + demo)
  • Data modeling 💻
  • Data marts
  • Denormalization
  • Encoding
  • Ad-hoc querying (SQL) 💻
  • Ad-hoc querying (JOINs) 💻
  • Ad-hoc querying (analytics) 💻
  • Approximate algorithms 💻
  • Data lakehouse 💻
Module 10
Data management
Theory Demos GitHub code Homework
Unmanaged data is often a sign of misknown data. The module will show a few important aspects in data management.
1%
The module is not ready yet. Release period: October 2023

Inform me about the release >


  • Data catalog 💻 (theory + demo)
  • Data lineage 💻
  • Data security - authentication 💻
  • Data security - authorization 💻
  • Data security - encryption 💻
  • Data privacy
  • Self-serve data platform
  • Audit
Module 11
Going further
Theory Demos GitHub code Homework
The course prepares you just to start. This module extends the basic knowledge and introduces the concepts you might see in the future follow-up parts of Become a Data Engineer!
1%
The module is not ready yet. Release period: October 2023

Inform me about the release >


  • Cloud computing
  • AWS cloud data services
  • GCP cloud data services
  • Azure cloud data services
  • Docker 💻 (theory + demo)
  • Kubernetes
  • Software engineering best practices 💻
  • Tests and data processing 💻
  • DevOps 💻
  • Data processing frameworks - going distributed 💻
  • Serverless
  • Data engineers and data science
  • Data engineers and data analysts

Money back guarantee

I did my best to give you a valuable content. If for whatever reason you're not satisfied, you can use my "5-Day Money-Back guarantee". Just reply to your purchase receipt email and I will issue a refund.

Do you have some questions?

What you should know before starting? Check What do I need to follow the course?. For other questions (and answers!), please check the FAQ page or write directly at
contact@becomedataengineer.com