Job Description
Responsibilities
Deployment & Automation
- Implement and maintain CI/CD pipelines using tools such as GitHub Actions, AWS CodePipeline, and Jenkins.
- Automate infrastructure provisioning and management using Infrastructure-as-Code (IaC) with Terraform, CloudFormation, or AWS CDK.
- Develop robust automation scripts and self-service tooling to minimize toil and enhance operational efficiency.
Capacity, Performance & Cost Optimization
- Lead and implement operational cost optimization initiatives across cloud infrastructure and data platforms.
- Configure, maintain, and tune auto-scaling policies and performance thresholds.
- Develop and execute resiliency test plans and provide critical support for performance testing efforts.
Incident Management & SRE Principles
- Serve as a production on-call responder, employing strong troubleshooting skills to quickly resolve complex incidents.
- Apply ITIL framework concepts and use ITSM tools (e.g., ServiceNow) for incident and change management.
- Write high-quality Root Cause Analysis (RCA) documentation and knowledge articles to prevent recurrence.
- Implement and enforce SRE principles, including the definition and tracking of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.
Observability & Monitoring
- Manage and leverage advanced observability platforms such as Dynatrace (preferred), AppDynamics, and ELK.
- Implement distributed tracing with accurate context propagation across data services and applications.
- Optimize monitoring queries and configure actionable dashboards, alerts, and anomaly detectors using tools like Dynatrace and Kibana.
Data Analytics Platform Reliability
- Ensure the reliability of Databricks clusters and data pipelines, including performance tuning and access control.
- Maintain Informatica workflow orchestration, connector reliability, and error handling for critical data flows.
- Manage Power BI gateway health and access control, and ensure reliable data refresh processes.
Security & Compliance
- Manage service accounts, access permissions, and roles following the principle of least privilege.
- Create, deploy, and manage digital certificates and TLS/SSL configurations.
- Execute effective remediation tasks and respond to security incidents as part of the operational team.
Qualifications
Education & Experience
- Bachelor's degree in Computer Science, Engineering, or a related technical field.
- 2 to 4 years of hands-on experience in a DevOps, Site Reliability Engineering (SRE), or Cloud Infrastructure role.
- Practical, working experience with major cloud platforms, specifically AWS and Azure.
Technical Skills
- Mid-level proficiency in Python or another scripting or programming language (e.g., Bash, Go) for automation tasks.
- Mid-level proficiency with Configuration Management tools, including Ansible.
- Strong knowledge of containerization technologies (Docker, Kubernetes/ECS).
- Solid understanding of Linux systems and networking fundamentals (TCP/IP, DNS, Load Balancing).
- Working knowledge of relational, cloud-native (e.g., Amazon RDS), and NoSQL database technologies.
- Direct hands-on experience supporting and maintaining data platforms like Databricks, Informatica, or Power BI is highly desirable.
Professional Attributes
- Excellent written and verbal communication skills, with a proven ability to document complex systems.
- Demonstrated ability to work independently, manage shifting priorities, and drive initiatives to completion.
- Availability for on-call duties and for work outside standard business hours as required to support a 24/7 production environment.
Job Tags
Work experience placement, Shift work