Opportunity Description
MANTECH seeks a motivated, career- and customer-oriented **Site Reliability Engineer (SRE)** for a new initiative. This effort supports the rapid design, deployment, operation, and sustainment of enterprise-scale AI, data, and mission platform capabilities across cloud, edge, and classified operational environments.
This role supports the operational reliability, scalability, monitoring, and incident response for the enterprise AI systems. You will focus on operational outcomes and optimizing system performance.
**Responsibilities include but are not limited to:**
+ Apply core reliability engineering principles to ensure high availability and stability of production systems.
+ Manage incident response, root cause analysis, and post-mortem processes for the AI platform.
+ Implement and optimize observability operations using OpenTelemetry, Prometheus, Grafana, Loki, or Tempo.
+ Oversee capacity planning, performance optimization, and FinOps practices.