Data Processing
Body of Knowledge
| Topic | Description | Relevance | Career Tracks |
|---|---|---|---|
| Batch Processing | Large-scale data processing, Spark, MapReduce paradigm, distributed computing | Critical | Data Engineer, ML Engineer |
| Real-Time Processing | Stream processing architectures, latency requirements, micro-batch vs true streaming | High | Data Engineer, Backend Developer |
| Apache Spark | DataFrames, RDDs, Spark SQL, cluster management, optimization techniques | High | Data Engineer, ML Engineer |
| Apache Kafka | Topics, partitions, consumer groups, producers, Kafka Connect, Schema Registry | High | Data Engineer, Backend Developer, Platform Engineer |
| Data Serialization | Avro, Parquet, ORC, Protocol Buffers, serialization performance, schema evolution | High | Data Engineer, Backend Developer |
| Distributed Computing | Partitioning strategies, shuffle operations, data locality, fault tolerance | High | Data Engineer, ML Engineer |
| pandas & Data Manipulation | DataFrames, vectorized operations, groupby, merge, performance optimization | High | Data Engineer, Data Scientist, Analytics Engineer |
| Data Cleaning | Missing data handling, deduplication, standardization, outlier detection | High | Data Engineer, Data Scientist |
| Data Transformation | Aggregations, joins, pivots, unpivots, window functions, complex transformations | Critical | Data Engineer, Analytics Engineer |
| Python Data Stack | NumPy, pandas, Polars, DuckDB, performance considerations | High | Data Engineer, Data Scientist |
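
As a rough illustration of the Data Cleaning and Data Transformation rows above, the following is a minimal sketch in plain Python (standard library only, not pandas or Spark). The record fields (`id`, `region`, `amount`) and the helper names `clean` and `total_by_region` are hypothetical, chosen only for this example; a real pipeline would likely use one of the libraries listed in the table.

```python
from collections import defaultdict

# Hypothetical raw records; the field names are illustrative assumptions.
raw = [
    {"id": 1, "region": " EU ", "amount": 10.0},
    {"id": 1, "region": " EU ", "amount": 10.0},  # duplicate id
    {"id": 2, "region": "eu", "amount": None},    # missing amount
    {"id": 3, "region": "US", "amount": 5.0},
]


def clean(records):
    """Drop rows with a missing amount, deduplicate by id, standardize region."""
    seen_ids = set()
    cleaned = []
    for rec in records:
        if rec["amount"] is None:
            continue  # missing data handling: drop the row
        if rec["id"] in seen_ids:
            continue  # deduplication: keep the first occurrence of each id
        seen_ids.add(rec["id"])
        # standardization: trim whitespace and upper-case the region code
        cleaned.append({**rec, "region": rec["region"].strip().upper()})
    return cleaned


def total_by_region(records):
    """Aggregation: sum amount per region (a simple group-by)."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["region"]] += rec["amount"]
    return dict(totals)


cleaned = clean(raw)
totals = total_by_region(cleaned)
```

Dropping rows with missing values is only one possible policy; imputation or flagging may fit better depending on the dataset.
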
Personal Status
| Topic | Level | Evidence | Active Projects | Gaps |
|---|---|---|---|---|
| To be populated | — | — | — | — |