4 Remote NVIDIA Engineers

Work from home Full-time role Hiring

About the position

We are seeking a highly skilled AI Infrastructure & Kubernetes Platform Engineer with a proven track record in deploying and managing NVIDIA DGX-based AI clusters, orchestrating containerized AI workloads using Kubernetes, and ensuring secure, high-throughput operations across InfiniBand-powered networks. The ideal candidate will hold a combination of Kubernetes certifications (CKA, CKAD, CKS) and NVIDIA certifications (NCA-AIIO, NCP-AIO, NCP-AII, NCP-AIN), coupled with hands-on training in DGX, BlueField, and high-speed network operations. This position plays a key role in supporting AI/ML infrastructure at scale, enabling efficient training and inference for complex models, and integrating NVIDIA's cutting-edge compute, storage, and fabric solutions with modern DevOps practices.

Responsibilities

  • Deploy and manage NVIDIA DGX BasePODs and SuperPODs for high-performance AI workloads.
  • Oversee DGX system lifecycle operations including provisioning, monitoring, firmware upgrades, and capacity planning.
  • Operate Base Command Manager to manage GPU clusters, schedule workloads, and integrate with MLOps tools.
  • Perform DGX node health validation, NCCL interconnect testing, and NVLink topology verification following new deployments or hardware changes.
  • Architect secure and scalable Kubernetes clusters optimized for GPU-accelerated workloads using NVIDIA GPU Operator.
  • Leverage expertise from CKA/CKAD/CKS to develop, deploy, and secure AI applications on Kubernetes.
  • Implement CI/CD pipelines and GitOps methodologies for deploying and managing ML workflows.
  • Administer InfiniBand networks and BlueField DPUs using Unified Fabric Manager (UFM).
  • Enable NVLink/NVSwitch performance across GPU nodes and tune fabric configurations for minimal latency and maximum throughput.
  • Use BlueField for offloading storage, firewalling, and telemetry, enhancing AI workload security and performance.
  • Apply best practices from the CKS certification to secure containerized AI environments.
  • Configure runtime security, secrets management, network segmentation, and auditing using DPU-enhanced Kubernetes deployments.
  • Support zero-trust architecture initiatives by enforcing workload identity, RBAC policies, and supply chain integrity across AI container images and model artifacts.
  • Monitor GPU, CPU, and I/O performance using NVIDIA DCGM, Prometheus, Grafana, and Base Command APIs.
  • Tune system performance and model training pipelines for cost-efficiency and throughput.
  • Build and maintain operational runbooks, incident response playbooks, and SLA reporting dashboards covering GPU utilization, thermal thresholds, and fabric health.

Requirements

  • NVIDIA certification is required; candidates without one will not be interviewed
  • Kubernetes certifications (CKA, CKAD, CKS)
  • NVIDIA certifications (NCA-AIIO, NCP-AIO, NCP-AII, NCP-AIN)
  • Hands-on training in DGX, BlueField, and high-speed network operations
  • Expertise with DGX System, BasePOD, and SuperPOD Administration
  • Expertise with BlueField DPU Configuration & Operations
  • Expertise with InfiniBand Fabric and UFM Management
  • Expertise with Base Command Manager for workload orchestration

