AI Systems & Platform Lead
Location
D05 Clementi New Town, Hong Leong Garden, Pasir Panjang
Job Type
Full-time
Experience
Mid
Category
General
Salary
$12,000 - $17,000
Posted
1 week ago
Expires
Apr 24, 2026
Job Details
Vacancies
1 position
Experience Required
No experience required
Job Description
Key Responsibilities
- Lead and mentor a team of system engineers responsible for delivery, operations, escalations, and technical improvement.
- Manage and optimise OS lifecycle for GPU and CPU nodes, including patching, kernel tuning, driver and firmware updates, configuration hardening, and automation.
- Oversee bare-metal provisioning and deployment for GPU platforms, including NVIDIA stack components such as CUDA, drivers, NCCL, and container runtimes.
- Manage Kubernetes (k8s) clusters supporting GPU workload orchestration, including autoscaling, scheduling, node health, multi-tenant resource isolation, and capacity allocation.
- Run and enhance container platforms (Docker/CRI-O), including image management, registry security, runtime troubleshooting, and performance optimisation.
- Integrate and operate monitoring and telemetry systems, such as DCGM, Prometheus, node exporters, Weka telemetry, and alert pipelines.
- Drive continuous improvement in GPU utilisation efficiency, benchmarking, platform stability, and cost/performance optimisation.
- Own operational workflows including incident, problem, and change management, RCA execution, and improvement tracking.
- Lead capacity planning across compute, GPU, network, and storage layers to support scale-up and customer growth.
- Maintain complete system documentation including SOPs, runbooks, KB articles, architecture diagrams, configuration standards, and platform records.
- Oversee the ticketing lifecycle across internal operations, customer interfaces, and vendor escalation, including RMA tracking and replacement management.
- Ensure strong SLA alignment and customer interaction through accurate troubleshooting and triage across GPU, Kubernetes, and OS environments.
- Support ISO27001 and SOC2 compliance through configuration standards, access controls, logging, vulnerability remediation, and platform security practices.
- Maintain audit readiness and evidence collection for operational and security compliance.
- Collaborate with vendors, partners, and engineering teams to resolve systemic GPU, container, or orchestration issues.
- Support budgeting and forecasting related to GPU expansion, licensing, storage growth, and platform evolution.
Skills and Experience
- Bachelor’s degree in Computer Science, Engineering, or a related discipline.
- 15+ years’ experience in solution architecture, cloud engineering, HPC, or AI infrastructure.
- Deep hands-on experience with Linux systems, GPU platforms, Kubernetes orchestration, and container runtimes.
- Strong technical knowledge across drivers, firmware, OS tuning, and performance benchmarking.
- Practical experience supporting large-scale GPU clusters or HPC environments.
- Practical experience with monitoring and telemetry platforms such as DCGM, Prometheus, Grafana, and Weka.
- Good understanding of platform automation and infrastructure-as-code tooling (e.g., Ansible, Terraform).
- Strong knowledge of troubleshooting processes across complex stack layers (OS, container, GPU, network, storage).
- Excellent communication skills to work effectively across technical and non-technical stakeholders.
- Strong documentation discipline and ability to translate technical concepts into clear written content.
- Knowledge of ticketing platforms and RMA management processes in large-scale compute environments.
- Excellent documentation and diagramming abilities.
- Self-driven, analytical, and detail-oriented.
YTL POWERSERAYA PTE. LIMITED
This is a direct application to YTL POWERSERAYA PTE. LIMITED. No recruitment agencies involved.