Data Processing

Body of Knowledge

| Topic | Description | Relevance | Career Tracks |
| --- | --- | --- | --- |
| Batch Processing | Large-scale data processing, Spark, MapReduce paradigm, distributed computing | Critical | Data Engineer, ML Engineer |
| Real-Time Processing | Stream processing architectures, latency requirements, micro-batch vs true streaming | High | Data Engineer, Backend Developer |
| Apache Spark | DataFrames, RDDs, Spark SQL, cluster management, optimization techniques | High | Data Engineer, ML Engineer |
| Apache Kafka | Topics, partitions, consumer groups, producers, Kafka Connect, Schema Registry | High | Data Engineer, Backend Developer, Platform Engineer |
| Data Serialization | Avro, Parquet, ORC, Protocol Buffers, serialization performance, schema evolution | High | Data Engineer, Backend Developer |
| Distributed Computing | Partitioning strategies, shuffle operations, data locality, fault tolerance | High | Data Engineer, ML Engineer |
| pandas & Data Manipulation | DataFrames, vectorized operations, groupby, merge, performance optimization | High | Data Engineer, Data Scientist, Analytics Engineer |
| Data Cleaning | Missing data handling, deduplication, standardization, outlier detection | High | Data Engineer, Data Scientist |
| Data Transformation | Aggregations, joins, pivots, unpivots, window functions, complex transformations | Critical | Data Engineer, Analytics Engineer |
| Python Data Stack | NumPy, pandas, Polars, DuckDB, performance considerations | High | Data Engineer, Data Scientist |
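
Two of the topics listed above, the MapReduce paradigm (Batch Processing) and partitioning strategies with shuffle operations (Distributed Computing), can be illustrated with a small in-process sketch. This is not Spark or Hadoop code: it is a single-machine toy that uses only the Python standard library, and every function name in it (`partition_for`, `map_phase`, `shuffle`, `reduce_phase`) is hypothetical, chosen for illustration only.

```python
from collections import defaultdict
from zlib import crc32


def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash partitioning: the same key always maps to the same partition."""
    # crc32 is used instead of hash() because str hashes vary between Python runs.
    return crc32(key.encode("utf-8")) % num_partitions


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs, num_partitions):
    """Shuffle: route each pair to a partition by key and group values per key."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_for(key, num_partitions)][key].append(value)
    return partitions


def reduce_phase(partitions):
    """Reduce: sum each key's values independently within its partition."""
    return [{key: sum(values) for key, values in part.items()} for part in partitions]


# Illustrative input; in a real cluster each partition would live on a different worker.
records = ["spark kafka spark", "kafka avro", "spark"]
reduced = reduce_phase(shuffle(map_phase(records), num_partitions=2))

# Merging is safe because hash partitioning places each key in exactly one partition.
totals = {key: count for part in reduced for key, count in part.items()}
print(totals)
```

Because all values for a key land in a single partition, each partition can be reduced without coordinating with the others. That same property is why shuffles are expensive in distributed engines: every record may have to cross the network to reach the partition that owns its key.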

Personal Status

| Topic | Level | Evidence | Active Projects | Gaps |
| --- | --- | --- | --- | --- |
| To be populated | — | — | — | — |