Key Responsibilities:
- Design, develop, and maintain end-to-end ETL pipelines for large-scale data processing (GBs to TBs).
- Work with Apache Spark and Hadoop to process and transform massive datasets.
- Handle structured and semi-structured data stored in Parquet, ORC, and CSV formats.
- Write optimized SQL queries with complex joins for efficient data retrieval and transformation.
- Use Python, Scala, and Java (a plus) to implement scalable data engineering solutions.
- Practice test-driven development (TDD) using JUnit (for Java) and Python testing frameworks.
- Work with Git, GitLab CI/CD, and GitHub Actions for version control and automated deployments.
- Manage and deploy data processing applications using Docker, Kubernetes, and Helm.
- Collaborate within an Agile development environment, ensuring iterative improvements and quick feature delivery.
- Monitor data pipelines, optimize their performance, and troubleshoot issues proactively.
- Work with Trino (formerly PrestoSQL) for distributed SQL query execution.
Required Skills & Qualifications:
- 1+ years of hands-on experience in data engineering and ETL pipeline development.
- Strong knowledge of data structures, algorithms, and analytical problem-solving.
- Proficiency in Python.
- Experience with Scala or Java is a plus.
- Strong expertise in SQL, particularly in complex joins and query optimization.
- Experience working with file formats like Parquet, ORC, and CSV is a plus.
- Hands-on experience with Apache Spark and Hadoop for large-scale data processing is an advantage.
- Familiarity with Trino (formerly PrestoSQL) is a plus.
- Hands-on experience with Git, GitLab CI/CD, and GitHub Actions.
- Experience in test-driven development (TDD) using JUnit and Python testing frameworks.
- Knowledge of Docker, Kubernetes, and Helm for containerization and orchestration.
- Comfortable working in an Agile development environment.
- Strong problem-solving, debugging, and performance-tuning skills.
- A self-starter mindset with the ability to take ownership of tasks and drive them to completion.
Preferred Qualifications:
- Experience with workflow orchestration tools like Apache Airflow.
- Cloud experience with AWS, GCP, or Azure.
- Familiarity with Lakehouse architectures and modern data platforms.