Site Reliability Engineer, Machine Learning Operations
Job Details
Vacancies
1 position
Experience Required
No experience required
Job Description
Purpose of Role:
- Frontline On-Call Ownership: Serve as the primary responder for the Applied Machine Learning Engine, taking ownership of system availability, health monitoring, and immediate incident response to ensure high reliability.
- Incident Lifecycle Management: Manage the end-to-end feedback loop for incidents, including rapid triage, effective resolution, and the facilitation of post-incident reviews to ensure closure and prevent recurrence.
- SOP Execution & Optimization: Execute upgrades and deployments strictly adhering to Standard Operating Procedures (SOPs), while actively leveraging Machine Learning and Infrastructure expertise to refine, automate, and improve these processes for greater efficiency.
Responsibilities:
- Analyse user needs related to the machine learning systems provided by the AML department, through on-call shifts or other mechanisms, and propose customer-oriented solutions.
- Work with other software engineers to implement and deploy customer-oriented solutions for machine learning frameworks, whether proposed by yourself or by others.
- Update software, enhance existing software capabilities, and develop or deploy software testing, deployment, capacity management, and validation procedures.
- Work with computer hardware engineers to integrate hardware and software systems and to troubleshoot issues against specifications and performance requirements.
Minimum requirements:
- Bachelor’s degree in Computer Science or equivalent, with 3+ years of relevant experience.
- Proven experience in analyzing and troubleshooting distributed systems.
- Prior experience designing or maintaining large-scale systems.
- Scripting skills in at least one major language (Python, Go, or Shell/Bash) to automate repetitive operational tasks.
Nice to have:
- Experience defining and managing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets, and practicing Chaos Engineering.
- Experience operating MLOps platforms and toolkits such as Kubeflow, MLflow, Feast, or Ray.
- Deep understanding of Linux operating system internals or container technologies (Docker/containerd) and orchestration platforms (Kubernetes) in a production environment.
- Basic understanding of Machine Learning concepts and familiarity with model-serving frameworks such as TensorFlow Serving, TorchServe, or Triton Inference Server.
MANPOWER STAFFING SERVICES (SINGAPORE) PTE LTD
About MANPOWER STAFFING SERVICES (SINGAPORE) PTE LTD
Manpower is the global leader in contingent and permanent recruitment workforce solutions. It is part...
Ready to Apply?
This is a direct application to MANPOWER STAFFING SERVICES (SINGAPORE) PTE LTD. No recruitment agencies involved.