Berliner Boersenzeitung - Clockwork.io Launches The Industry's First Contractual Commitment to End GPU Waste in AI Training

Clockwork.io Launches The Industry's First Contractual Commitment to End GPU Waste in AI Training

"You Only Compute Once" (YOCO) guarantees to resolve 90% of AI training failures with no lost progress, or customers get credit

PALO ALTO, CA / ACCESS Newswire / July 1, 2026 / Clockwork.io, pioneer of Software-Driven AI Fabrics™ and the company behind TorchPass AI fault tolerance, today announced the YOCO Guarantee - the industry's first contractual commitment to dramatically reduce the hidden, compounding cost of training failure in large-scale AI infrastructure. The announcement marks a turning point in how the AI industry measures infrastructure reliability - moving beyond uptime metrics designed for a previous era toward the outcome AI teams value most: whether the job finishes on time without losing work.

Under the YOCO (You Only Compute Once) Guarantee, Clockwork.io commits that at least 90% of training failures on supported TorchPass workloads will be resolved through live GPU migration, with no lost training progress, no checkpoint rollback, and no recompute. If Clockwork.io falls short of that commitment in any contract year, customers receive a 25% credit against their next TorchPass renewal or expansion.
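
As a rough illustration of the terms just described, the guarantee can be modeled as a threshold check over one contract year. This is a minimal sketch, not Clockwork.io's contract logic: the 90% and 25% figures come from the announcement, while the `ContractYear` fields, the function names, and the handling of a year with zero failures are assumptions made for the example.

```python
from dataclasses import dataclass

# Figures quoted in the announcement.
GUARANTEED_RESOLUTION_RATE = 0.90  # at least 90% of failures resolved via live migration
RENEWAL_CREDIT_RATE = 0.25         # 25% credit against the next renewal or expansion


@dataclass
class ContractYear:
    # Hypothetical record; the real contract's accounting is not described in the text.
    total_failures: int          # training failures on supported workloads
    resolved_by_migration: int   # failures resolved with no lost training progress


def resolution_rate(year: ContractYear) -> float:
    """Share of failures resolved by live migration during the contract year."""
    if year.total_failures == 0:
        return 1.0  # assumption: a year without failures meets the commitment
    return year.resolved_by_migration / year.total_failures


def renewal_credit(year: ContractYear, renewal_value: float) -> float:
    """Credit owed on the next renewal if the commitment was missed, else 0."""
    if resolution_rate(year) >= GUARANTEED_RESOLUTION_RATE:
        return 0.0
    return RENEWAL_CREDIT_RATE * renewal_value
```

Under these assumptions, a year with 100 failures of which 89 were resolved by migration would produce a credit of a quarter of the renewal value, while 90 resolved would produce none.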

"We built TorchPass to make training failure irrelevant," said Suresh Vasudevan, CEO of Clockwork.io. "The YOCO Guarantee is a line in the contract. We're putting skin in the game because we know TorchPass delivers, and we want our customers to know it too."

The Hidden Tax on AI Progress

Every AI organization training at scale faces the same brutal math: GPU clusters fail constantly, and every failure triggers an expensive restart cycle. According to research published by Meta FAIR at HPCA 2025, a 1,024-GPU cluster experiences a mean time to failure of just 7.9 hours - and at 16,384 GPUs, that drops to 1.8 hours. Each failure forces teams to provision replacement nodes, restore from the last checkpoint, and recompute every training step since that checkpoint was taken. That recomputed work costs full GPU dollars - compute you already paid for, run again from scratch. The cycle typically costs three or more hours of progress per failure event, with losses accumulating daily.
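
To put those failure rates in daily terms, the quoted mean-time-to-failure figures can be converted into expected failures per day and GPU-hours lost per failure. The sketch below is back-of-the-envelope arithmetic only: it assumes the job runs around the clock at a steady failure rate, and it takes three hours as the loss per failure, the lower bound mentioned above. The function names are illustrative.

```python
HOURS_PER_DAY = 24.0

# Mean time to failure in hours by cluster size, as quoted from the Meta FAIR research.
MTTF_HOURS = {1024: 7.9, 16384: 1.8}

# "Three or more hours" of progress per failure; three is used as the lower bound.
HOURS_LOST_PER_FAILURE = 3.0


def failures_per_day(mttf_hours: float) -> float:
    """Expected failures per day, assuming continuous operation at a steady rate."""
    return HOURS_PER_DAY / mttf_hours


def gpu_hours_lost_per_failure(num_gpus: int, hours_lost: float) -> float:
    """GPU-hours of progress lost when the whole job stalls for `hours_lost`."""
    return num_gpus * hours_lost


if __name__ == "__main__":
    for gpus, mttf in MTTF_HOURS.items():
        rate = failures_per_day(mttf)
        lost = gpu_hours_lost_per_failure(gpus, HOURS_LOST_PER_FAILURE)
        print(f"{gpus} GPUs: {rate:.1f} failures/day, {lost:,.0f} GPU-hours per failure")
```

The model ignores the fact that a job stalled in recovery is not accumulating new failures, so it should be read as an order-of-magnitude guide rather than a forecast.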

The consequence is that current GPU clusters effectively operate at 30-50% of their theoretical performance - not because the hardware is slow, but because the reliability framework governing it was never designed for workloads of this nature, duration, or scale.

"AI teams need their models to be done, not their nodes to be up. The industry has been measuring node uptime and calling it reliability. YOCO holds us accountable for the only thing that matters - your model, done," said Vasudevan.

The financial toll is severe. In a typical 2,048-GPU H200 deployment, failure-driven restarts drain over $6 million per year in wasted compute - hundreds of thousands of GPU-hours lost to cascading retries, idle recovery time, and recomputed training steps. For AI builders, the real unit of value is not GPU uptime but time to trained model - yet the infrastructure contract they've been buying guarantees node availability, not job continuity. For AI operators, the gap is equally costly: when a customer's training job fails, restarts, and loses days of progress, the experience is one of unreliability - regardless of what the SLA technically said.

"Recompute and restart is the hidden tax of large-scale training," said Vasudevan. "Most teams treat it as a fact of life. It isn't."

The YOCO Guarantee changes that contract.

TorchPass: Reliability Redefined in Software

Clockwork.io's answer is to make reliability a software-defined property rather than a function of hardware uptime - a fundamental architectural rethink that decouples job continuity from the failure rate of any individual component.

TorchPass addresses failure at its root through live GPU migration - when a fault occurs, TorchPass transfers the training job's full in-memory state, including model weights, gradients, and optimizer state, to a healthy spare node. Training continues from exactly where it stopped, typically completing recovery in approximately three minutes. No checkpoint restore. No recompute. No lost progress.

TorchPass handles three classes of failure: unplanned migration for sudden, catastrophic faults - kernel crashes, power failures, GPU failures - where state is reconstructed from healthy replicas; pre-emptive migration triggered by early warning signals like rising ECC error rates or thermal thresholds, enabling a controlled handoff before failure occurs; and planned migration for proactive maintenance, security patching, and firmware updates, allowing infrastructure hygiene without interrupting training. Across all three scenarios, the job never stops.
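
The three classes above can be pictured as a mapping from a triggering event to a migration type. The sketch below is purely illustrative and is not TorchPass code or its API: the event names and the `classify` function are hypothetical labels derived from the examples in the paragraph.

```python
from enum import Enum


class MigrationKind(Enum):
    UNPLANNED = "unplanned"      # sudden fault; state rebuilt from healthy replicas
    PREEMPTIVE = "pre-emptive"   # early warning; controlled handoff before failure
    PLANNED = "planned"          # scheduled maintenance; no training interruption


# Hypothetical event labels, taken from the examples named in the text.
HARD_FAULTS = {"kernel_crash", "power_failure", "gpu_failure"}
EARLY_WARNINGS = {"rising_ecc_errors", "thermal_threshold"}
MAINTENANCE = {"maintenance", "security_patch", "firmware_update"}


def classify(event: str) -> MigrationKind:
    """Map an event label to the migration class the text associates with it."""
    if event in HARD_FAULTS:
        return MigrationKind.UNPLANNED
    if event in EARLY_WARNINGS:
        return MigrationKind.PREEMPTIVE
    if event in MAINTENANCE:
        return MigrationKind.PLANNED
    raise ValueError(f"unknown event: {event!r}")
```

For example, `classify("rising_ecc_errors")` returns `MigrationKind.PREEMPTIVE`, matching the controlled-handoff case described above.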

This approach reduces wasted training progress by 90%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster - meaning research teams no longer discover hours of progress silently erased, and model release timelines become predictable rather than probabilistic.
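
The figures in that claim can be checked with simple arithmetic. The sketch below uses the numbers quoted above (about three hours of lost time per day, reduced to under ten minutes, on 1,024 GPUs) and treats ten minutes as an upper bound, so the computed reduction is a floor. The function names are illustrative.

```python
def reduction_fraction(before_minutes: float, after_minutes: float) -> float:
    """Fraction of lost time eliminated when the loss drops from before to after."""
    return (before_minutes - after_minutes) / before_minutes


def gpu_hours_recovered_per_day(num_gpus: int, before_minutes: float,
                                after_minutes: float) -> float:
    """GPU-hours per day returned to useful training across the whole cluster."""
    return num_gpus * (before_minutes - after_minutes) / 60.0


if __name__ == "__main__":
    before, after, gpus = 180.0, 10.0, 1024  # figures quoted in the text
    print(f"reduction: {reduction_fraction(before, after):.1%}")
    print(f"recovered: {gpu_hours_recovered_per_day(gpus, before, after):,.0f} GPU-hours/day")
```

With these inputs the reduction works out to roughly 94%, which is consistent with the 90% figure read as a minimum, and to about 2,900 GPU-hours recovered per day.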

In independent testing conducted by SemiAnalysis, a leading AI infrastructure research firm, TorchPass outperformed every competing fault-tolerance framework - the only solution that "maintains the same training performance as jobs without fault tolerance."

TorchPass is 100% software-based, runs in cloud and on-premises environments, and supports popular training frameworks including TorchTitan, Megatron-LM, and DeepSpeed, on schedulers including Kubernetes and Slurm. It works across NVIDIA and AMD hardware, and across InfiniBand, RoCE, and Ethernet fabrics - with no hardware lock-in of any kind.

Why the Guarantee Changes the Market

For AI builders, it redefines the SLA they should demand. The question is no longer "what is your node uptime?" but "what percentage of my training failures will be resolved without losing progress?" - a metric tied directly to GPU ROI, not an availability percentage that has historically had little relationship to whether models get trained on time. The YOCO Guarantee makes that question answerable and auditable.

For AI operators, it raises the competitive bar. AI Cloud operators and infrastructure providers who can offer job-level continuity guarantees - backed by contractual credits - will command premium pricing, win customers burned by restart-driven losses, and protect their margins by dramatically reducing their GPU idle time. Those who cannot will find themselves competing only on raw GPU price in a commoditizing market.

And for the industry as a whole, it establishes a new accountability standard. The AI infrastructure market has long accepted vendor claims about fault tolerance at face value, with no contractual obligation behind them. The YOCO Guarantee - measurable and contractually backed - introduces a standard the market will increasingly expect others to match or explain why they cannot.

"There's a big difference between a vendor making a slide that says their product works and them writing it into a contract," said Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX at SemiAnalysis. "In our testing, TorchPass delivered the fastest and most efficient fault-tolerant performance for a GPT-OSS-120B training run on a 64x H200 cluster when compared to checkpoint-restart on job completion time. TorchPass also outperformed TorchFT (in terms of MFU and tokens/sec/GPU) for this job, while matching its recovery time. The YOCO Guarantee just reflects what we saw in testing, and makes it contractual."

"Every enterprise running large-scale AI training knows the cost of a failed job: hours of progress lost, recomputes billed, model timelines slipping. Every product decision we make at Scaleway comes back to one question: are we making our customers' outcomes more predictable? Node uptime answers a different question entirely. The YOCO Guarantee is the first infrastructure commitment we've seen built around the right metric - whether progress is protected and the jobs keep running to completion, not whether the hardware stays up. That's the accountability model the AI infrastructure market has been missing," said Fred Bardolle, Head of Products and AI at Scaleway.

Availability

The YOCO Guarantee is available to new and renewing TorchPass customers effective August 3, 2026. Existing TorchPass customers should contact their Clockwork.io account team to discuss adding the guarantee to their current agreement. To learn more or get started, visit clockwork.io/yoco.

Clockwork.io will be at RAISE Summit in Paris, France, July 8-9, Booth #27A. Suresh Vasudevan, CEO of Clockwork.io, will also take part in the panel "Infrastructure as Destiny: The Compute-Capital-Cloud Trinity" on July 8th at 10:40 a.m. local time on the Main Stage.

About Clockwork.io

Clockwork.io pioneers Software-Driven AI Fabrics™ - a programmable layer between hardware and workload that delivers nanosecond-accurate telemetry, AI fault tolerance, and performance optimization across any accelerator, network, or deployment model. Modern AI workloads need the whole cluster to act as one machine, but failures and infrastructure bottlenecks severely compromise efficiency. Clockwork.io's FleetIQ platform recovers that lost capacity, letting enterprises train, deploy, and serve the world's most demanding AI workloads faster, more reliably, and at lower cost - across any Ethernet, RoCE, or InfiniBand fabric, without hardware lock-in. TorchPass, Clockwork.io's AI fault tolerance product, is independently benchmarked by SemiAnalysis as the only solution that maintains full training throughput during failures, outperforming checkpoint-restart and leading open-source frameworks. Uber, Wells Fargo, DCAI, Nebius, NScale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

© 2026 Clockwork Systems Inc. TorchPass and YOCO Guarantee are trademarks of Clockwork Systems Inc. All other trademarks are the property of their respective owners.

Media Contact

Dana Trismen
[email protected]
650-269-7478

SOURCE: Clockwork



(Y.Yildiz--BBZ)