Location
D05 Clementi New Town, Hong Leong Garden, Pasir Panjang
Job Type
Full-time
Experience
Mid
Category
General
Salary
$12,000 - $17,000
Posted
1 week ago
Expires
Apr 24, 2026

Job Details

Vacancies

1 position

Experience Required

No experience required

Job Description

Key Responsibilities

  • Lead and mentor a team of system engineers responsible for delivery, operations, escalations, and technical improvement.
  • Manage and optimise OS lifecycle for GPU and CPU nodes, including patching, kernel tuning, driver and firmware updates, configuration hardening, and automation.
  • Oversee bare-metal provisioning and deployment for GPU platforms, including NVIDIA stack components such as CUDA, drivers, NCCL, and container runtimes.
  • Manage Kubernetes (k8s) clusters supporting GPU workload orchestration, including autoscaling, scheduling, node health, multi-tenant resource isolation, and capacity allocation.
  • Run and enhance container platforms (Docker/CRI-O), including image management, registry security, runtime troubleshooting, and performance optimisation.
  • Integrate and operate monitoring and telemetry systems, such as DCGM, Prometheus, node exporters, Weka telemetry, and alert pipelines.
  • Drive continuous improvement in GPU utilisation efficiency, benchmarking, platform stability, and cost/performance optimisation.
  • Own operational workflows including incident, problem, and change management, RCA execution, and improvement tracking.
  • Lead capacity planning across compute, GPU, network, and storage layers to support scale-up and customer growth.
  • Maintain complete system documentation including SOPs, runbooks, KB articles, architecture diagrams, configuration standards, and platform records.
  • Oversee the ticketing lifecycle across internal operations, customer interfaces, and vendor escalations, including RMA tracking and replacement management.
  • Meet SLA commitments and support customer interactions through accurate troubleshooting and triage across GPU, Kubernetes, and OS environments.
  • Support ISO27001 and SOC2 compliance through configuration standards, access controls, logging, vulnerability remediation, and platform security practices.
  • Maintain audit readiness and evidence collection for operational and security compliance.
  • Collaborate with vendors, partners, and engineering teams to resolve systemic GPU, container, or orchestration issues.
  • Support budgeting and forecasting related to GPU expansion, licensing, storage growth, and platform evolution.

Skills and Experience

  • Bachelor’s degree in Computer Science, Engineering, or a related discipline.
  • 15+ years’ experience in solution architecture, cloud engineering, HPC, or AI infrastructure.
  • Deep hands-on experience with Linux systems, GPU platforms, Kubernetes orchestration, and container runtimes.
  • Strong technical knowledge across drivers, firmware, OS tuning, and performance benchmarking.
  • Practical experience supporting large-scale GPU clusters or HPC environments.
  • Practical experience with monitoring and telemetry platforms such as DCGM, Prometheus, Grafana, and Weka.
  • Good understanding of platform automation and infrastructure-as-code tooling (e.g., Ansible, Terraform).
  • Strong knowledge of troubleshooting processes across complex stack layers (OS, container, GPU, network, storage).
  • Excellent communication skills to work effectively across technical and non-technical stakeholders.
  • Strong documentation discipline and ability to translate technical concepts into clear written content.
  • Knowledge of ticketing platforms and RMA management processes in large-scale compute environments.
  • Excellent diagramming abilities, including architecture diagrams.
  • Self-driven, analytical, and detail-oriented.

YTL POWERSERAYA PTE. LIMITED

This is a direct application to YTL POWERSERAYA PTE. LIMITED. No recruitment agencies involved.
