Make a career shift and become a Data Engineer

You're trying to learn data engineering but there is... too much material. Hadoop, Cloud, NoSQL, RDBMS, data warehouse, data lake, ... Are you feeling lost?

Learning data engineering is fascinating, but it can become a real problem if, as a beginner, you don't know how to identify the material that will help you reach your goal. You will spend a lot of time and, in the end, still be confused about which data engineering skills to master.

You are not alone. Other people have faced the same problems before, including myself! Based on that experience, Become a Data Engineer offers an 11-module course covering all the important data concepts and patterns that you should know before starting your new job.

About your instructor

My name is Bartosz Konieczny and I am a data engineer who has been working in software since 2009. I'm also an Apache Spark enthusiast, AWS and GCP certified cloud user, blogger, and speaker. I like to share what I learn, as you can see on my blog or at conferences like Spark+AI Summit 2019 or Data+AI Summit 2020.
Find me on social media: Twitter, GitHub, Stack Overflow, Facebook, YouTube

What will I learn?

Data ingestion - offline
Theory Demos GitHub code Homework
Data ingestion is the first task of every data system. In this module you'll see how to perform offline ingestion. Demo tools: Airbyte, Apache Spark, Elasticsearch, Great Expectations
⏲️ 143 minutes

Enroll in ➡️

  • Data ingestion types 💻 (theory + demo)
  • Data migration 💻
  • Scalability 💻
  • Auto-scaling
  • Eventual consistency
  • Bulk operations 💻
  • Compression 💻
  • Low Code
  • Data quality 💻
Data ingestion - real-time
Theory Demos GitHub code Homework
The offline data ingestion from the previous module comes with some latency. The real-time approach aims to ingest the data as soon as possible, very often within seconds. Demo tools: Apache Kafka, Apache Spark, Debezium, Kafka Connect, KSQL
⏲️ 136 minutes

Enroll in ➡️

  • API Gateway vs. direct ingestion
  • API Gateway 💻 (theory + demo)
  • Change Data Capture 💻
  • File streaming 💻
  • Delivery semantics 💻
  • Idempotency 💻
  • Batch layer
  • Real-time ad-hoc querying 💻
  • Polyglot persistence
Data cleansing
Theory Demos GitHub code Homework
You're doing great so far! The data is ingested into our system but, unfortunately, it has some issues. In this module you'll see how to prepare it for further use. Demo tools: Apache Avro, Apache Kafka, Apache Spark, Elasticsearch
⏲️ 140 minutes

Enroll in ➡️

  • Data enrichment 💻 (theory + demo)
  • Data anonymization
  • Deduplication 💻
  • Schema
  • Schema registry 💻
  • Schema management 💻
  • Metadata 💻
  • Binary file formats 💻
  • Monitoring and alerting 💻
Stream processing
Theory Demos GitHub code Homework
You've probably noticed it already: you implemented a streaming data processing pipeline in the previous module. But it was dedicated to data cleansing, and you haven't had a chance to see streaming concepts in depth. That's what this module is for! Demo tools: Apache Flink, Apache Kafka, Apache Spark, Delta Lake
⏲️ 156 minutes

Enroll in ➡️

  • Patterns 💻 (theory + demo)
  • Architectures
  • Transformations 💻
  • Event time vs. processing time 💻
  • Scalability 💻
  • Auto-scaling 💻
  • Reprocessing 💻
  • Messaging patterns
  • Backpressure 💻
  • Debugging
Stateful stream processing
Theory Demos GitHub code Homework
Stream processing has multiple facets. Stateful processing is probably the most challenging one, since it requires dealing with a state store on top of the classical streaming data processing logic. Demo tools: Apache Flink, Apache Kafka, Apache Spark, ScyllaDB
⏲️ 197 minutes

Enroll in ➡️

  • Stateless vs. stateful 💻 (theory + demo)
  • State store 💻
  • Incremental and full state 💻
  • Triggers 💻
  • Watermarks and late data 💻
  • Fault-tolerance 💻
  • Idempotency 💻
  • Aggregations 💻
  • Arbitrary stateful processing 💻
  • Joins 💻
  • Windows 💻
  • Complex Event Processing 💻
Batch processing - ETL
Theory Demos GitHub code Homework
Even though you've moved into streaming, batch is still there! It's time to see the first pattern for writing batch pipelines, called Extract Transform Load (ETL). Demo tools:
The module is not ready yet. Release period: September 2023

Inform me about the release >

  • ETL
  • ETL Steps 💻 (theory + demo)
  • Staging area 💻
  • Patterns - data flow 💻
  • Patterns - data exposition 💻
  • Data orchestration 💻
  • Triggers 💻
  • Late data
  • Tasks 💻
  • Idempotency: metadata strategy 💻
  • Idempotency: merge strategy 💻
  • Idempotency: proxy strategy 💻
  • Data backfilling 💻
  • Monitoring and alerting 💻
  • Best practices
Batch processing - ELT
Theory Demos GitHub code Homework
To complement the ETL approach, you can also use an alternative called Extract Load Transform (ELT). The subtle difference in the order of the steps has a few other implications. Demo tools:
The module is not ready yet. Release period: September 2023

Inform me about the release >

  • ELT
  • ELT Steps 💻 (theory + demo)
  • Staging area 💻
  • Data extraction 💻
  • Data loading 💻
  • Raw data
  • Data transformation 💻
  • Idempotency 💻
  • Tests 💻
  • Resources optimization
  • Execution plans 💻
  • ETL or ELT?
Data warehouse
Theory Demos GitHub code Homework
The data warehouse is still the most popular approach for dealing with analytics data. However, alternatives such as the data lakehouse exist, and they round out the lessons.
The module is not ready yet. Release period: October 2023

Inform me about the release >

  • Data warehousing
  • Columnar format vs row format 💻 (theory + demo)
  • Data modeling 💻
  • Data marts
  • Denormalization
  • Encoding
  • Ad-hoc querying (SQL) 💻
  • Ad-hoc querying (JOINs) 💻
  • Ad-hoc querying (analytics) 💻
  • Approximate algorithms 💻
  • Reverse ETL
  • Data lakehouse 💻
Going further
Theory Demos GitHub code Homework
The course prepares you to get started. This module extends that basic knowledge and introduces concepts you might see in future follow-up parts of Become a Data Engineer!
The module is not ready yet. Release period: November 2023

Inform me about the release >

  • Cloud computing
  • AWS cloud data services
  • GCP cloud data services
  • Azure cloud data services
  • Docker 💻 (theory + demo)
  • Kubernetes
  • DevOps 💻
  • Software engineering best practices 💻
  • Tests and data processing 💻
  • Data processing frameworks - going distributed 💻
  • Serverless
  • Data catalog 💻
  • Data lineage 💻
  • Data security - authentication 💻
  • Data security - authorization 💻
  • Data security - encryption 💻
  • Data privacy
  • Data engineers and data science
  • Data engineers and data analysts

Money back guarantee

I did my best to give you valuable content. If, for whatever reason, you're not satisfied, you can use my "5-Day Money-Back Guarantee". Just reply to your purchase receipt email and I will issue a refund.

Do you have some questions?

What should you know before starting? Check "What do I need to follow the course?". For other questions (and answers!), please check the FAQ page or write directly at