NextEra announces a Site Reliability Engineer opening in Riyadh

Site Reliability Engineer

🏢 NextEra

🕒 Posted: 1 July 2026 (today) 📍 Riyadh | Engineering and Technology Jobs

Job Details

NextEra has announced an opening for a Site Reliability Engineer - L2 Support in Riyadh, Saudi Arabia, within the Cloud & Infrastructure Operations / Managed Services team. The engineer will be responsible for maintaining the reliability, availability, performance, and stability of applications and services hosted across GCP, OCI, Kubernetes, network services, Kafka, Redis, S3-compatible storage, event bus, databases, and open-source observability platforms.

Duties and Responsibilities

Provide Level 2 (L2) support for incidents, service requests, alerts, and operational issues across GCP, OCI, Kubernetes, network services, Kafka, Redis, S3-compatible storage, event bus, databases, and open-source observability platforms.
Manage incident triage, technical investigation, service restoration, escalation coordination, and closure documentation for priority incidents within agreed service levels.
Perform in-depth troubleshooting using metrics, logs, traces, dashboards, events, Kubernetes health data, cloud service telemetry, database alerts, Kafka lag indicators, Redis metrics, storage events, and network diagnostics.
Monitor service availability, latency, error rates, saturation, capacity trends, pod health, node health, cluster resource utilization, Kafka consumer lag, Redis performance, database availability, object storage health, event bus failures, and network connectivity using Grafana, Prometheus, Loki, and Tempo.
Support major incident management by providing technical analysis, impact assessment, workaround recommendations, timeline inputs, and post-incident review data for cloud, container, messaging, caching, storage, database, and network service issues.
Coordinate with L1 teams on alert validation, ticket enrichment, runbook adherence, initial diagnostics, and handover of evidence before L2 engagement.
Escalate complex issues to engineering teams, platform management, Kubernetes administrators, cloud operations teams, database administrators, network teams, or product vendors with clear technical evidence and impact details.
Execute approved operational tasks including service restarts, pod restarts, deployment validation, namespace checks, log review, configuration verification, certificate checks, storage access validation, Kafka topic and consumer checks, Redis health checks, database connectivity checks, network route validation, and scheduled health checks.
Identify recurring incidents and contribute to problem management by performing root cause analysis, documenting known errors, creating corrective actions, and recommending permanent fixes.
Create, update, and maintain standard operating procedures, knowledge articles, troubleshooting guides, operational runbooks, alert response guides, and shift handover documentation for the defined technology stack.
Support automation and standardization of repetitive operational tasks such as health checks, alert enrichment, log collection, Kubernetes diagnostics, cloud resource checks, Kafka status checks, Redis checks, storage validation, and database connectivity validation.
Participate in change implementation support, pre-change validation, post-change monitoring, deployment verification, rollback support, and production readiness checks for GCP, OCI, Kubernetes, messaging, caching, storage, database, and observability changes.
Ensure compliance with incident, problem, change, release, access, security, and operational governance processes within the defined SRE operating model.
Prepare daily, weekly, and monthly operational reports covering incident trends, SLA performance, recurring issues, capacity observations, observability gaps, automation opportunities, and service improvement actions.
Work in rotational shifts and provide on-call support, weekend support, or extended support windows based on business and client requirements.

Required Skills

Strong understanding of SRE principles including reliability, availability, scalability, observability, toil reduction, incident response, service-level objectives, and error-budget awareness.
Hands-on experience in L2 production support, platform support, cloud operations, Kubernetes operations, or managed services support.
Working knowledge of GCP and OCI services, including compute, storage, networking, identity, monitoring, logging, service health, and operational troubleshooting.
Hands-on understanding of Kubernetes concepts including clusters, nodes, namespaces, pods, deployments, services, ingress, config maps, secrets, persistent volumes, resource limits, events, logs, and basic troubleshooting commands.
Knowledge of network services including DNS, load balancers, firewalls, routing, connectivity checks, certificates, ports, latency, packet loss, and service reachability troubleshooting.
Experience supporting Kafka, including topic validation, broker health checks, consumer group status, consumer lag analysis, producer and consumer connectivity, and event streaming issue triage.
Experience supporting Redis, including cache availability checks, memory utilization, latency, connection issues, keyspace checks, replication status, and basic performance troubleshooting.
Understanding of S3-compatible storage, including bucket access, object availability, permissions, lifecycle behavior, storage events, data transfer failures, and application integration issues.
Working knowledge of event bus platforms and event-driven architectures, including event publishing, subscription failures, message routing, retry behavior, dead-letter handling, and integration troubleshooting.
Basic understanding of relational and NoSQL databases, including connectivity checks, query execution validation, job status validation, performance alerts, backup status, replication status, and database availability analysis.
Hands-on experience with Grafana for dashboard review, alert visualization, service health monitoring, and operational reporting.
Experience using Prometheus for metrics collection, alert rule validation, service target health, scrape status checks, and metric-based troubleshooting.
Experience using Loki for log search, log correlation, application log analysis, infrastructure log review, and incident investigation.
Experience using Tempo for distributed trace analysis, request flow review, latency investigation, dependency mapping, and transaction-level troubleshooting.

Original text of the announcement

Dear All,

NextEra is looking for dynamic candidates for an IT Site Reliability Engineer role.

Role Title: Site Reliability Engineer - L2 Support

Function: Cloud & Infrastructure Operations / Managed Services

Role Level: L2 Technical Support

Location: Riyadh

The SRE L2 Support Engineer will be responsible for maintaining the reliability, availability, performance, and operational stability of business-critical applications and platform services hosted across GCP, OCI, Kubernetes, network services, Kafka, Redis, S3-compatible storage, event bus, databases, and open-source observability platforms. The role requires hands-on experience in incident management, monitoring, troubleshooting, service restoration, automation support, and operational governance using Grafana, Tempo, Loki, Prometheus, and related open-source ecosystem tools. The engineer will work closely with L1 support, L3/platform engineering teams, application teams, cloud teams, database teams, network teams, and customer stakeholders to ensure timely incident resolution, proactive problem management, and continuous improvement of service reliability.

Key Responsibilities

Provide L2-level support for incidents, service requests, alerts, and operational issues across GCP, OCI, Kubernetes, network services, Kafka, Redis, S3-compatible storage, event bus, databases, and open-source observability platforms.
Own incident triage, technical investigation, service restoration, escalation coordination, and closure documentation for priority incidents within agreed service levels.
Perform deep-dive troubleshooting using metrics, logs, traces, dashboards, events, Kubernetes health data, cloud service telemetry, database alerts, Kafka lag indicators, Redis metrics, storage events, and network diagnostics.
Monitor service availability, latency, error rates, saturation, capacity trends, pod health, node health, cluster resource utilization, Kafka consumer lag, Redis performance, database availability, object storage health, event bus failures, and network connectivity using Grafana, Prometheus, Loki, and Tempo.
Support major incident management by providing technical analysis, impact assessment, workaround recommendations, timeline inputs, and post-incident review data for cloud, container, messaging, caching, storage, database, and network service issues.
Coordinate with L1 teams for alert validation, ticket enrichment, runbook adherence, initial diagnostics, and handover of evidence before L2 engagement.
Escalate complex issues to L3 engineering, platform engineering, Kubernetes administrators, cloud operations teams, database administrators, network teams, or product vendors with clear technical evidence and impact details.
Execute approved operational tasks including service restarts, pod restarts, deployment validation, namespace checks, log review, configuration verification, certificate checks, storage access validation, Kafka topic and consumer checks, Redis health checks, database connectivity checks, network route validation, and scheduled health checks.
Identify recurring incidents and contribute to problem management by performing root cause analysis, documenting known errors, creating corrective actions, and recommending permanent fixes.
Create, update, and maintain SOPs, knowledge articles, troubleshooting guides, operational runbooks, alert response guides, and shift handover documentation for the defined technology stack.
Support automation and standardization of repetitive operational tasks such as health checks, alert enrichment, log collection, Kubernetes diagnostics, cloud resource checks, Kafka status checks, Redis checks, storage validation, and database connectivity validation.
Participate in change implementation support, pre-change validation, post-change monitoring, deployment verification, rollback support, and production readiness checks for GCP, OCI, Kubernetes, messaging, caching, storage, database, and observability changes.
Ensure compliance with incident, problem, change, release, access, security, and operational governance processes within the defined SRE operating model.
Prepare daily, weekly, and monthly operational reports covering incident trends, SLA performance, recurring issues, capacity observations, observability gaps, automation opportunities, and service improvement actions.
Work in rotational shifts, on-call support, weekend support, or extended support windows based on business and client requirements.
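
Several of the responsibilities above mention Kafka consumer lag. As a rough illustration of what that metric means, the sketch below computes per-partition lag as the partition's log-end offset minus the consumer group's committed offset. The function names and the sample offsets are hypothetical, and treating a partition with no committed offset as starting from 0 is a simplifying assumption; this is not NextEra's tooling or a Kafka client API.

```python
from typing import Dict


def consumer_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Return lag per partition: log-end offset minus committed offset."""
    lag: Dict[int, int] = {}
    for partition, end in end_offsets.items():
        # Assumption: a partition with no committed offset counts from 0.
        done = committed.get(partition, 0)
        # Clamp at zero so a stale end-offset reading never yields negative lag.
        lag[partition] = max(end - done, 0)
    return lag


def total_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> int:
    """Sum the lag over all partitions of a topic."""
    return sum(consumer_lag(end_offsets, committed).values())


if __name__ == "__main__":
    # Hypothetical offsets for a three-partition topic.
    end = {0: 1500, 1: 980, 2: 2100}
    done = {0: 1500, 1: 900}
    print(consumer_lag(end, done))
    print(total_lag(end, done))
```

With the sample values, partition 0 is fully caught up, partition 1 is 80 messages behind, and partition 2 has never been committed, so its whole log counts as lag.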

Required Technical Skills

Strong understanding of SRE principles including reliability, availability, scalability, observability, toil reduction, incident response, service-level objectives, and error-budget awareness.
Hands-on experience in L2 production support, platform support, cloud operations, Kubernetes operations, or managed services support.
Working knowledge of GCP and OCI services, including compute, storage, networking, identity, monitoring, logging, service health, and operational troubleshooting.
Hands-on understanding of Kubernetes concepts including clusters, nodes, namespaces, pods, deployments, services, ingress, config maps, secrets, persistent volumes, resource limits, events, logs, and basic troubleshooting commands.
Knowledge of network services including DNS, load balancers, firewalls, routing, connectivity checks, certificates, ports, latency, packet loss, and service reachability troubleshooting.
Experience supporting Kafka, including topic validation, broker health checks, consumer group status, consumer lag analysis, producer and consumer connectivity, and event streaming issue triage.
Experience supporting Redis, including cache availability checks, memory utilization, latency, connection issues, keyspace checks, replication status, and basic performance troubleshooting.
Understanding of S3-compatible storage, including bucket access, object availability, permissions, lifecycle behavior, storage events, data transfer failures, and application integration issues.
Working knowledge of event bus platforms and event-driven architectures, including event publishing, subscription failures, message routing, retry behaviour, dead-letter handling, and integration troubleshooting.
Basic understanding of relational and NoSQL databases, including connectivity checks, query execution validation, job status validation, performance alerts, backup status, replication status, and database availability analysis.
Hands-on experience with Grafana for dashboard review, alert visualization, service health monitoring, and operational reporting.
Experience using Prometheus for metrics collection, alert rule validation, service target health, scrape status checks, and metric-based troubleshooting.
Experience using Loki for log search, log correlation, application log analysis, infrastructure log review, and incident investigation.
Experience using Tempo for distributed trace analysis, request flow review, latency investigation, dependency mapping, and transaction-level troubleshooting.
Good working knowledge of the open-source ecosystem used in cloud-native operations, observability, automation, container platforms, logging, tracing, monitoring, and service reliability engineering.
Ability to analyze logs, metrics, traces, alerts, dashboards, events, and performance indicators to isolate technical issues and support timely service restoration.
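
To make the "service-level objectives and error-budget awareness" item above concrete, here is a minimal sketch of the usual arithmetic: an availability SLO of 99.9% over a 30-day window leaves roughly 43.2 minutes of allowed downtime. The function names and the 30-day default window are illustrative assumptions, not part of the role description.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 2))
    print(round(budget_remaining(0.999, downtime_minutes=10.8), 2))
```

A negative result from `budget_remaining` means the window's budget is already overspent, which is typically the signal to slow down risky changes.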

Experience and Qualifications

3-6 years of experience in SRE, platform support, cloud operations, Kubernetes operations, application production support, or managed services support.
Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related discipline.
Experience supporting business-critical production environments with defined SLA, OLA, escalation, and operational governance processes.
Exposure to 24x7 support models, shift operations, major incident calls, client-facing operations, and cross-functional technical coordination.
Preferred certifications include Google Cloud Associate Cloud Engineer, Google Cloud Professional Cloud DevOps Engineer, OCI Foundations Associate, OCI Architect Associate, Certified Kubernetes Administrator, Certified Kubernetes Application Developer, Kubernetes and Cloud Native Associate, Prometheus/Grafana observability training, SRE Foundation, or ITIL Foundation.

Behavioral and Operational Competencies

Strong analytical thinking and structured troubleshooting approach across cloud, Kubernetes, network, database, messaging, caching, storage, and observability layers.
Ability to work under pressure during high-severity incidents, production outages, cloud service degradation, Kubernetes failures, Kafka issues, Redis issues, database issues, storage access failures, and network connectivity problems.
Ability to prioritize multiple incidents, alerts, requests, platform issues, and operational tasks in a time-sensitive environment.
Good collaboration skills across L1, L2, L3, platform engineering, cloud, Kubernetes, network, database, application, and vendor teams.
Willingness to learn and continuously improve skills across GCP, OCI, Kubernetes, Kafka, Redis, S3-compatible storage, event bus, databases, Grafana, Tempo, Loki, Prometheus, and the wider open-source ecosystem.

Key Deliverables and Success Measures

Timely resolution of L2 incidents within agreed SLA and OLA targets across GCP, OCI, Kubernetes, network services, Kafka, Redis, S3-compatible storage, event bus, databases, and observability platforms.
Improved service availability, reduced recurring incidents, and faster mean time to restore service across the defined SRE support landscape.
High-quality incident documentation, RCA inputs, runbooks, SOPs, alert response guides, and knowledge articles for the supported technology stack.
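
The "mean time to restore service" measure above can be illustrated with a short sketch: average the gap between when each incident started and when service was restored. The function name and the sample timestamps are hypothetical, and the choice of start and end events (detection versus first impact, restoration versus ticket closure) is an assumption that each team defines for itself.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_restore(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (restored - started) over a list of incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    # Sum the durations, starting from a zero timedelta.
    total = sum((restored - started for started, restored in incidents), timedelta())
    return total / len(incidents)


if __name__ == "__main__":
    # Hypothetical incidents lasting 30 and 90 minutes.
    sample = [
        (datetime(2026, 7, 1, 9, 0), datetime(2026, 7, 1, 9, 30)),
        (datetime(2026, 7, 2, 14, 0), datetime(2026, 7, 2, 15, 30)),
    ]
    print(mean_time_to_restore(sample))
```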

Source: LinkedIn - added to the site on 1 July 2026

