Back to Jobs
Development 3h ago

Datacenter Hardware Operations Technician Lead, Industrial Compute

United StatesUnited States
Full-time
$86,400 - $228,000
Senior

Job Description

Key Skills Required

Master these to land this role

DevOpsBestseller 🔥
Learn in 63 Hours

Want to know if you're a match for this job?

Calculate My Match Score

About OpenAI & the Stargate Program: OpenAI is a premier, internationally recognized artificial intelligence research and deployment company operating on an absolute mission to ensure that general-purpose artificial intelligence benefits all of humanity. By pushing the absolute structural boundaries of machine learning capabilities, OpenAI safely deploys frontier models to millions of global users through its industry-defining product arrays. In close collaboration with elite global capital partners, OpenAI monumental Stargate program develops, builds, and deploys the world’s most advanced hyperscale AI infrastructure ecosystems. These large-scale AI campuses are engineered from the bare metal up to support the next generation of frontier model training and high-concurrency inference workloads. Driven by a high-performance, engineer-led corporate philosophy centered around absolute personal agency, safety, and rigorous scientific verification, the organization equips hardware pioneers with an uncompromised canvas to leverage breakthroughs in computing architecture, manipulate massive fleet logistics, and deploy robust hardware operations frameworks globally.

Position Overview

We are seeking a highly analytical, detail-obsessed, and systems-minded Datacenter Hardware Operations Technician Lead, Industrial Compute to join our core centralized Scaling division in a full-time capacity base-stationed on-site at our flagship AI campus in Abilene, Texas. Operating as OpenAI’s senior on-site technical authority for hardware reliability and fleet health, you will step up to claim true individual operational and strategic accountability over the availability, lifecycle integrity, and sustaining engineering of our compute infrastructure. Shifting completely away from routine software dashboard setup, minor remote configuration tweaks, simple database logging, or basic administrative folder typing, you will run an active bare-metal hardware triage, forensic failure analysis, and physical supercomputing engineering laboratory—partnering in lockstep with centralized Fleet Health Engineering, Data Center Operations, Network Infrastructure, Manufacturing, and OEM vendor teams. This position requires an internet technology or hyperscale infrastructure veteran with 8+ years of craft depth who maps out comprehensive system deployments fluidly natively using DevOps reliability standards, diagnoses complex server and rack-level breakdowns cleanly natively leveraging DevOps system configurations, and commands multi-thousand GPU clusters confidently under high-priority production parameters.

Key Responsibilities

  • On-Site Senior Hardware Governance: Serve as OpenAI’s primary on-site technical authority and operational lead for server, GPU, storage, and rack-integrated infrastructure, safeguarding hardware readiness cleanly natively utilizing DevOps principles.
  • Complex Failure Technical Triage: Direct and execute immediate technical triage, forensic hardware diagnostics, and full-stack repair resolutions for complex equipment failures impacting production clusters.
  • Sustaining Fleet Reliability Mapping: Interlock closely with Fleet Health Engineering teams to investigate recurring hardware bottlenecks, isolate failure patterns, and upgrade global fleet uptime metrics.
  • Root Cause Analysis (RCA) Leadership: Lead comprehensive root cause investigations for critical hardware incidents, engineering detailed corrective and preventive action (CAPA) plans natively leveraging DevOps diagnostic methodologies (including FRACAS, RCCA, 5-Why, and Fishbone frameworks).
  • Multi-Vendor Repair Coordination: Collaborate directly with Oracle operations groups and OEM hardware providers to coordinate system repairs, replacements, component upgrades, and full lifecycle lifecycle procedures.
  • Operational Runbook Standardization: Author, update, and continuamente improve rigorous hardware maintenance standards, technical troubleshooting guidelines, and scalable operational runbooks.
  • Telemetry and Analytics Assessment: Analyze hardware failure trends, component defect rates, and operational metrics to proactively identify reliability risks across the environment.
  • NPI Validation and Readiness Support: Support new hardware introductions (NPI), validating next-generation server clusters and driving strict production readiness reviews.
  • Supply Chain and Spare Parts Strategy: Coordinate with global supply chain networks to establish spare parts strategies, inventory planning models, and component allocations.
  • Technician Cohort Mentorship: Mentor junior technicians and distributed vendor partner teams on advanced bare-metal troubleshooting methodologies and hardware operational excellence.

Required Skills & Qualifications

  • Mandatory Physical On-Site Prerequisite: Continuous professional availability and absolute capacity to sit on-site at our flagship campus in Abilene, Texas 5 days per week.
  • A minimum of 8 years of verified professional history running advanced datacenter hardware support, site reliability operations, sustaining engineering management, hardware electronics diagnostics, or infrastructure technology consulting.
  • Expert-tier capability diagnosing hardware component failures, managing rack integration systems, and tuning bare-metal hardware layers natively utilizing DevOps execution rules.
  • Practical operational familiarity parsing system alerts, evaluating infrastructure dependencies, and structuring performance runbooks natively using DevOps engineering benchmarks across massive compute and storage architectures (specifically focusing on advanced server nodes and GPU systems).
  • Demonstrated experience successfully executing root cause analysis models and driving long-term corrective actions across live production environments.
  • Outstanding written, verbal, and scannable technical communication attributes in business-fluent English to influence high-level technical and operational procurement decisions.
  • Comfort operating independently in high-priority, high-concurrency production environments backed by heavy individual operational responsibility.
  • Location Context: Position open to qualified hardware operations leaders based permanently within the United States who can sit on-site in Abilene, Texas full-time, with flexibility for occasional travel to support new campus deployments.

Preferred Strategic Indicators (Nice to Have)

  • Prior technical deployment history operating explicitly within hyperscale cloud computing data centers, high-performance computing (HPC) settings, or large-scale AI/ML GPU clusters.
  • Familiarity with Linux system administration, bash/python shell scripts, or parsing telemetry dashboards from automated hardware monitoring tools.
  • Possession of industry-recognized certifications such as CompTIA Server+, OEM hardware certifications, or equivalent technical training.
  • Practical knowledge navigating Environmental Health and Safety (EHS) regulations within mission-critical datacenter infrastructure zones.

What We Offer

  • Vetted, Frontier AI-Infrastructure Salaried Blueprint: A highly competitive full-time baseline annual corporate salary package structured transparently between $86,400 and $228,000 USD, calibrated precisely against your individual market location, hardware craftsmanship, and domain depth, supplemented by exceptionally generous startup equity options and performance bonuses.
  • The spectacular professional canvas to claim an elite infrastructure leadership seat, directly managing the physical compute nodes that train the next generation of general-purpose AI.
  • Access to elite personal and family protection layers, including comprehensive Medical, Dental, and Vision insurance plans with direct employer contributions to Health Savings Accounts (HSA).
  • Progressive Personal Lifestyle Framework: Access to an Unlimited/Flexible PTO setup for exempt employees, 13+ paid company holidays, and multiple coordinated company-wide office closures throughout the year to focus and recharge.
  • Uncompromised financial and career upscaling benefits, featuring pre-tax commuter and FSA accounts, a matching 401(k) retirement system, up to 24 weeks of paid parental leave, an annual learning and development stipend, meal delivery credits, and full corporate relocation support.

How would you rate this job post?

See what other professionals think about this role.

Is this company safe?

Ask Hyrizon AI to scan this company for potential red flags before you apply.

Safety First

  • Never pay for a job application.
  • Do not share sensitive bank info.
  • Verify the client before starting work.
Learn More