Key Responsibilities:
- Design, develop, and maintain end-to-end ETL pipelines for large-scale data processing (gigabytes to terabytes).
- Work with Apache Spark and Hadoop to process and transform massive datasets.
- Handle structured and semi-structured data stored in Parquet, ORC, and CSV formats.
- Write optimized SQL queries with complex joins for efficient data retrieval and transformation.
- Use Scala and Python to implement scalable data engineering solutions; Java is a plus.
- Practice test-driven development (TDD) using JUnit (for Java) and Python testing frameworks.
- Work with Git, GitLab CI/CD, and GitHub Actions for version control and automated deployments.
- Manage and deploy data processing applications using Docker, Kubernetes, and Helm.
- Collaborate within an Agile development environment, ensuring iterative improvements and quick feature delivery.
- Monitor data pipelines, optimize their performance, and troubleshoot issues proactively.
- Work with Trino (formerly PrestoSQL) for distributed SQL query execution.