SITE RELIABILITY ENGINEER, AUSTIN, TX (HYBRID)
Contract: 6 Months (3 Yrs Ext)
Deadline: 4/7/26
Job Description:
The Site Reliability Engineer will be responsible for ensuring the reliability, availability, performance, and scalability of production systems by applying software engineering practices to infrastructure and operations. The role partners with development teams to build resilient, observable, and automated platforms that meet defined service level objectives (SLOs).
Required Skills:
8 Yrs of experience in systems engineering, DevOps, or site reliability engineering roles
8 Yrs of strong experience with Linux/Unix systems and system internals
8 Yrs of proficiency in one or more programming/scripting languages (Python, Go, Java, Bash)
8 Yrs of experience designing and operating highly available, distributed systems
8 Yrs of strong knowledge of cloud platforms (AWS or GCP) and cloud-native services
8 Yrs of experience with containerization and orchestration (Docker, Kubernetes)
8 Yrs of strong understanding of monitoring, alerting, and logging concepts
8 Yrs of experience defining and managing SLIs, SLOs, and error budgets
8 Yrs of familiarity with incident management, root cause analysis (RCA), and postmortems
8 Yrs of experience integrating security and compliance into operational workflows
Preferred Skills:
4 Yrs of familiarity with observability tools (Prometheus, Grafana, Application Insights, Datadog, Splunk)
4 Yrs of experience operating 24x7 production environments with on-call rotations
4 Yrs of experience with chaos engineering and resiliency testing
4 Yrs of experience with feature flags, canary deployments, and progressive delivery
4 Yrs of strong documentation skills for runbooks, dashboards, and operational standards