We are seeking a highly skilled and motivated Data Engineer to join our dynamic team. The ideal candidate will have extensive experience in ETL, data modeling, and data architecture, along with strong proficiency in optimizing ETL processes and designing big data solutions in Python.
Key Responsibilities
- Develop and maintain a comprehensive data platform, including data lakes, cloud data warehouses, APIs, and both batch and streaming data pipelines.
- Design and implement scalable data pipelines and applications that process large datasets efficiently and with low latency using Apache Spark and Apache Hive (see the Spark sketch following this list).
- Use orchestration tools such as Airflow to automate and manage complex data workflows (see the Airflow sketch following this list).
- Use project management tools such as Jira and Confluence to track project progress and improve team communication.
- Build data processing workflows in Spark, SQL/PL/SQL, and Python that transform and cleanse raw data into usable formats, storing results in columnar formats such as Parquet and ORC.
- Implement containerization with Docker and orchestration with Kubernetes for data applications.
- Optimize data storage and retrieval performance through effective data modeling techniques, including Relational, Dimensional, and E-R modeling.
- Ensure data integrity and quality by implementing robust validation and error-handling mechanisms within ETL processes (see the validation sketch following this list).
- Automate deployment processes using CI/CD tools such as Jenkins and Spinnaker to deliver reliable, consistent releases.
- Monitor and troubleshoot data pipelines with tools such as Datadog and Splunk to identify performance bottlenecks and maintain system reliability.
- Work within Agile frameworks such as Scrum and Kanban, participating in sprint planning, daily stand-ups, and retrospectives.
- Conduct code reviews to uphold coding standards, best practices, and scalability considerations.
- Maintain clear, comprehensive documentation of data pipelines, schemas, and processes in Confluence.
- Provide on-call support for production data pipelines, responding to incidents and resolving issues promptly.
- Collaborate with cross-functional teams, including developers, data scientists, and operations, to tackle complex data engineering challenges.
- Stay informed on emerging technologies and industry trends to continuously enhance data engineering processes and tools.
- Contribute to developing reusable components and frameworks to streamline data engineering tasks across various projects.
- Utilize version control systems like Git for effective codebase management and team collaboration.
- Leverage IDEs like IntelliJ IDEA for efficient development and debugging of data engineering code.
- Adhere to security best practices when handling sensitive data, and implement access controls within the data lake environment.
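Illustrative Code Sketches

To give candidates a concrete feel for the work, the sketches below illustrate three of the responsibilities above. They are hedged examples, not excerpts from Infogain systems: every path, column name, and threshold is a hypothetical placeholder.

First, a minimal PySpark batch job that cleanses raw data and writes it to the data lake as Parquet, in the spirit of the Spark and Parquet/ORC responsibilities above:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical input/output paths; real data lake locations would come
# from configuration, not constants.
RAW_PATH = "s3a://example-data-lake/raw/events/"
CURATED_PATH = "s3a://example-data-lake/curated/events/"

spark = (
    SparkSession.builder
    .appName("events-cleanse")
    .getOrCreate()
)

# Read raw CSV data; in practice the schema would be declared explicitly.
raw = spark.read.option("header", True).csv(RAW_PATH)

# Basic cleansing: drop rows missing the key, normalize strings,
# parse the timestamp, and deduplicate on the event id.
cleansed = (
    raw
    .filter(F.col("event_id").isNotNull())
    .withColumn("event_type", F.lower(F.trim(F.col("event_type"))))
    .withColumn("event_ts", F.to_timestamp("event_ts"))
    .dropDuplicates(["event_id"])
)

# Write as Parquet, partitioned by date for efficient downstream reads.
(
    cleansed
    .withColumn("event_date", F.to_date("event_ts"))
    .write.mode("overwrite")
    .partitionBy("event_date")
    .parquet(CURATED_PATH)
)

spark.stop()
```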
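Next, a minimal Airflow DAG sketch showing how such a job might be orchestrated; the DAG name, task callables, schedule, and retry policy are illustrative assumptions:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical stand-ins for real extract/transform/load logic.
def extract():
    print("pull raw data from source systems")

def transform():
    print("cleanse and reshape the raw data")

def load():
    print("publish curated data to the warehouse")

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_events_etl",       # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # Airflow 2.4+ style schedule argument
    default_args=default_args,
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Declare the dependency chain: extract -> transform -> load.
    extract_task >> transform_task >> load_task
```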
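Finally, a sketch of the kind of validation and error handling expected inside ETL steps; the checks, column names, and thresholds are assumptions and would normally be externalized as configuration:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

class DataQualityError(Exception):
    """Raised when a batch fails validation and must not be published."""

def validate_batch(df: DataFrame, min_rows: int = 1) -> None:
    """Fail fast on empty batches, null keys, or duplicate keys.

    The checks here are illustrative; real pipelines would typically
    drive them from configuration and emit metrics on each check.
    """
    row_count = df.count()
    if row_count < min_rows:
        raise DataQualityError(f"expected at least {min_rows} rows, got {row_count}")

    null_keys = df.filter(F.col("event_id").isNull()).count()
    if null_keys > 0:
        raise DataQualityError(f"{null_keys} rows have a null event_id")

    distinct_keys = df.select("event_id").distinct().count()
    if distinct_keys != row_count:
        raise DataQualityError("duplicate event_id values detected")
```

A pipeline would call validate_batch on the cleansed DataFrame just before publishing, so that failures surface as alerts rather than as bad data downstream.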
Job Requirements
- Experience: 6-8 years in data engineering or related fields.
- Programming Languages: Proficiency in Python and Bash scripting in Unix/Linux environments.
- Big Data Technologies: Experience with Apache Spark and Apache Hive.
- Cloud Services: Familiarity with AWS services, including EC2, ECS, S3, SNS, and CloudWatch.
- Databases: Proficiency in PostgreSQL.
- Application Development: Experience with the RCP Framework.
- Containerization & Orchestration: Hands-on experience with Docker and Kubernetes.
- CI/CD Tools: Proficiency with GitHub, Jenkins, and Spinnaker.
- Additional Skills: Knowledge of Scala and Maven is a plus.
Join Infogain and be part of a forward-thinking team that drives innovation and transforms businesses through cutting-edge technology solutions. If you are passionate about data engineering and eager to make a significant impact, we would love to hear from you!