Distributed Data Storage Systems (partitioned data storage, sorting key, SerDes, data replication, caching and persistence)
Distributed Data Processing Systems (partitioning, predicate pushdown, sort by partition, managing shuffle block size, window functions, using all available cores and memory in the cluster to improve concurrency)
Stream Data Processing (Real-time, Stream and Batch Processing)
Tools
Data Pipelines and Automation (Airbyte)
Data Ingestion with Message Queues
Data Wrangling Operations (pandas, NumPy, re)
Data Scraping (requests, BeautifulSoup, lxml, Scrapy)
Interacting with External APIs and other Data Sources, Logging