- Be responsible for the cloud infrastructure in terms of: scalability, availability, performance, (cost & resource) efficiency, capacity planning;
- Spend less than 50% of your time working through our vendors to carry out day-to-day IT operations such as performance monitoring, attending to issues, manual intervention and service requests (infrastructure provisioning, deployment, data/system backups, patching and disaster recovery);
- Be responsible for system administration tasks such as OS & application patching, software upgrades, backup, restore, etc. for our cloud infrastructure (AWS, Azure);
- Drive troubleshooting, incident response/resolution and blameless post-mortems;
- Be responsible for maintaining services once they go live by measuring and monitoring availability, performance, and overall system health;
- Strive to streamline and secure cloud infrastructure management by proactively monitoring and protecting system boundaries, tracking application deployment and release status, and automating manual tasks.
- Spend most of your time on development tasks such as writing Infrastructure-as-Code (IaC), continuous improvement, and driving initiatives to improve automation, scalability and reliability:
- Develop automation code for change control, configuration management, deployment and maintenance of infrastructure and applications through CI/CD pipelines;
- Improve service resiliency through high levels of automation, to effectively detect, predict and prevent issues in the environment;
- Develop and fine-tune change & incident management processes across teams;
- Scale systems sustainably through mechanisms like automation; evolve systems by pushing for changes that improve reliability and velocity;
- Automate provisioning using IaC and Configuration-as-Code (CaC);
- Review resources and workloads to optimise cost.
- Enjoyment of cloud, systems and security management, and of relentlessly automating work.
- Related cloud experience attested by SysOps or DevOps certifications (AWS preferred).
- Three or more years of hands-on experience operating and maintaining systems running on Cloud infrastructure (AWS preferred).
- Proven experience with various IaC/CaC tools (e.g. Ansible, Terraform).
- Hands-on experience administering Unix & Windows operating systems as well as automating with shell scripts.
- Deep appreciation of infrastructure and application monitoring, logging, alerting, release and configuration management.
- Deep understanding of networking (e.g. HTTP, TCP/IP, routing tables, network topology, load balancers, DNS, NTP).
- Experience with standard IT security practices (e.g. encryption, certificates, key management).
YOU WILL CATCH OUR ATTENTION IF YOU:
- Have proven experience in Site Reliability Engineering (SRE), DevSecOps or FinOps practices and methodologies.
- Have experience in operating containerised workloads (using Docker, Kubernetes).
- Have experience operating internet-facing 24/7 high-load applications (e.g. eCommerce).