Tensora AI
Managed GPU Infrastructure for AI Model Training
The Challenge
Tensora AI, a 40-person startup backed by $22M in Series A funding, develops generative AI models for enterprise content creation. Their research team trains both diffusion models for image generation and large language models for specialized text tasks. The computational demands of this work are extraordinary — a single training run for their flagship LLM requires hundreds of GPU-hours on NVIDIA H100 accelerators, and the research team runs dozens of experiments per week.
The company had been managing their own GPU infrastructure on AWS, but the operational complexity was overwhelming a team whose expertise was in machine learning, not cloud infrastructure. The problems were compounding:
- H100 procurement delays: Demand for P5 instances (powered by H100 GPUs) consistently exceeded availability. Tensora’s researchers frequently waited three to five days for capacity in their preferred region, stalling experiment timelines. Capacity requests were managed manually through the AWS console, with no automated fallback strategy when the primary instance type or region was unavailable.
- Idle GPU costs draining runway: GPU instances are among the most expensive resources in AWS, and Tensora was paying for them whether they were training or not. Researchers would provision a cluster for a training run, and the instances would remain running through nights and weekends between experiments because nobody wanted to risk losing capacity. At peak, Tensora was spending $130K per month on GPU compute with an effective utilization rate below 55%.
- Training pipeline fragility: Multi-node distributed training runs using PyTorch and DeepSpeed failed at a rate that the team estimated at 28%. Failures were caused by a combination of spot instance interruptions without graceful handling, EFA networking issues during checkpoint synchronization, storage throughput bottlenecks during data loading, and misconfigured NCCL parameters. Each failed run wasted hours of expensive compute and set back research timelines.
- MLOps complexity: There was no standardized process for experiment tracking, model versioning, or artifact management. Researchers maintained their own scripts for launching training jobs, and configuration was passed through a combination of environment variables, config files, and command-line arguments. Reproducing a previous experiment’s exact configuration was unreliable, and comparing results across experiments required manual spreadsheet work.
Tensora needed GPU infrastructure that was reliable, cost-efficient, and fast to provision — managed by someone who understood the specific demands of distributed AI training workloads.
The Solution
Remangu designed and built a managed GPU training platform on AWS that addressed every dimension of Tensora’s infrastructure challenges: cost, reliability, provisioning speed, and operational workflow.
Intelligent GPU Scheduling and Spot Integration
The highest-impact change was replacing Tensora’s static GPU provisioning with an intelligent scheduling system that matched compute allocation to actual training demand.
Spot instance integration was implemented with a sophisticated fallback strategy. Training jobs were submitted to a scheduling layer that first attempted to provision spot P5 instances, falling back to spot P4d instances, and finally to on-demand capacity only when spot was unavailable. The scheduler considered real-time spot pricing and interruption frequency data across multiple regions and availability zones, automatically selecting the most cost-effective placement.
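As a concrete illustration, the sketch below shows the shape of that spot-first fallback, assuming boto3 and a simple price-based region choice. The region pool, tier order, AMI handling, and the direct use of run_instances are illustrative stand-ins for the actual scheduler, which also weighed interruption-frequency data.

```python
"""Spot-first fallback sketch: try spot P5, then spot P4d, then on-demand.
Region pool, tier order, and AMI handling are illustrative, not the
production scheduler, which also considered interruption-frequency data."""
from datetime import datetime, timezone

import boto3
from botocore.exceptions import ClientError

REGIONS = ["us-east-1", "us-west-2", "eu-west-1", "ap-northeast-1"]  # hypothetical pool
TIERS = [  # (instance type, market) in order of preference
    ("p5.48xlarge", "spot"),
    ("p4d.24xlarge", "spot"),
    ("p5.48xlarge", "on-demand"),
]


def cheapest_spot_region(instance_type: str) -> str | None:
    """Return the region with the lowest current spot price for the instance type."""
    now = datetime.now(timezone.utc)
    best_region, best_price = None, float("inf")
    for region in REGIONS:
        ec2 = boto3.client("ec2", region_name=region)
        prices = ec2.describe_spot_price_history(
            InstanceTypes=[instance_type],
            ProductDescriptions=["Linux/UNIX"],
            StartTime=now,
            EndTime=now,  # current price only, one entry per availability zone
        )["SpotPriceHistory"]
        for entry in prices:
            if float(entry["SpotPrice"]) < best_price:
                best_region, best_price = region, float(entry["SpotPrice"])
    return best_region


def provision_cluster(ami_id: str, node_count: int) -> list[str]:
    """Walk the tiers and return instance IDs from the first one with capacity."""
    for instance_type, market in TIERS:
        # On-demand falls back to the primary region in this sketch.
        region = cheapest_spot_region(instance_type) if market == "spot" else REGIONS[0]
        if region is None:
            continue
        ec2 = boto3.client("ec2", region_name=region)
        params = dict(
            ImageId=ami_id,  # in practice the AMI would be resolved per region
            InstanceType=instance_type,
            MinCount=node_count,
            MaxCount=node_count,
        )
        if market == "spot":
            params["InstanceMarketOptions"] = {"MarketType": "spot"}
        try:
            reservation = ec2.run_instances(**params)
            return [i["InstanceId"] for i in reservation["Instances"]]
        except ClientError:
            continue  # e.g. InsufficientInstanceCapacity: fall through to the next tier
    raise RuntimeError("No GPU capacity available in any tier or region")
```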
Automated cluster lifecycle management eliminated idle GPU costs entirely. Training clusters were provisioned when a job was submitted and terminated within minutes of job completion. Checkpointing ensured that no work was lost during teardown, and the provisioning system could restore a training run from its latest checkpoint on fresh instances in under 15 minutes.
Multi-region capacity pooling addressed the H100 availability problem. Rather than depending on a single region, the scheduling system could provision GPU clusters across four AWS regions. Training data was replicated to FSx for Lustre file systems in each region, ensuring that data locality didn’t constrain placement decisions. The system tracked available capacity across regions and automatically routed jobs to wherever P5 or P4d instances were available.
Preemption handling for spot instances was redesigned from the ground up. When AWS issued a two-minute spot termination notice, the system triggered an immediate checkpoint save, gracefully disconnected the node from the training ring, and initiated replacement instance provisioning. For DeepSpeed-based training runs, elastic training configuration allowed the job to continue with reduced parallelism while replacement nodes joined, minimizing the blast radius of a single spot interruption.
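A minimal sketch of the interruption watcher behind that flow is shown below: it polls the standard EC2 instance metadata endpoint (IMDSv2) for a spot/instance-action notice and hands off to a callback. The save_checkpoint_and_drain() callback is a hypothetical stand-in for the job's actual checkpoint-and-drain logic.

```python
"""Spot interruption watcher sketch (IMDSv2). The checkpoint/drain callback
is hypothetical; the metadata endpoints are the standard EC2 ones."""
import time

import requests

IMDS = "http://169.254.169.254"


def imds_token(ttl: int = 21600) -> str:
    """Fetch an IMDSv2 session token (valid up to six hours; a long-lived
    watcher would refresh it periodically)."""
    resp = requests.put(
        f"{IMDS}/latest/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl)},
        timeout=2,
    )
    resp.raise_for_status()
    return resp.text


def watch_for_interruption(on_notice, poll_seconds: float = 5.0) -> None:
    """Poll spot/instance-action; it returns 404 until a ~2-minute notice is issued."""
    token = imds_token()
    while True:
        resp = requests.get(
            f"{IMDS}/latest/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
            timeout=2,
        )
        if resp.status_code == 200:
            on_notice(resp.json())  # e.g. {"action": "terminate", "time": "..."}
            return
        time.sleep(poll_seconds)


def save_checkpoint_and_drain(notice: dict) -> None:
    """Hypothetical callback: flush an async checkpoint and remove this node
    from the training ring before the termination deadline."""
    print(f"Spot notice received: {notice}; checkpointing and draining node")


if __name__ == "__main__":
    watch_for_interruption(save_checkpoint_and_drain)
```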
High-Performance Storage with FSx for Lustre
Training data throughput was a critical bottleneck that manifested as GPU idle time during data loading phases. We replaced the previous EBS-based storage with Amazon FSx for Lustre, a high-performance parallel file system designed for exactly this workload.
FSx for Lustre file systems were provisioned with throughput capacity matched to the GPU cluster size, ensuring that data loading could saturate the network bandwidth available to training instances. For a 16-node P5 training cluster, the FSx file system was configured to deliver aggregate throughput exceeding 100 GB/s, eliminating data loading as a bottleneck.
S3 integration through FSx for Lustre’s lazy loading capability meant that datasets stored in S3 were transparently accessible through the POSIX file system interface. New datasets could be added to S3 and immediately accessed by training jobs without manual data staging. Completed model checkpoints were automatically exported back to S3 for durable storage.
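The sketch below shows how a file system with these characteristics might be created with boto3, assuming a PERSISTENT_2 deployment and an S3 data repository association for the lazy-loading and export behavior described above. The subnet, security group, bucket, and sizing values are placeholders rather than Tensora's actual configuration.

```python
"""FSx for Lustre provisioning sketch. Subnet, security group, and bucket
values are placeholders; sizing is illustrative (~100 TiB at 1,000 MB/s
per TiB of per-unit throughput gives roughly 100 GB/s aggregate)."""
import boto3

fsx = boto3.client("fsx", region_name="us-east-1")

fs = fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=103200,  # GiB; PERSISTENT_2 capacity grows in 2,400 GiB increments
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    LustreConfiguration={
        "DeploymentType": "PERSISTENT_2",
        "PerUnitStorageThroughput": 1000,  # MB/s per TiB of storage
        "DataCompressionType": "LZ4",
    },
)["FileSystem"]

# Link the file system to S3 so objects are lazily loaded on first read and
# files written under /data are automatically exported back to the bucket.
fsx.create_data_repository_association(
    FileSystemId=fs["FileSystemId"],
    FileSystemPath="/data",
    DataRepositoryPath="s3://example-training-datasets",  # placeholder bucket
    BatchImportMetaDataOnCreate=True,
    S3={
        "AutoImportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
        "AutoExportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
    },
)
```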
Elastic Fabric Adapter (EFA) networking was configured on all GPU instances to provide the low-latency, high-bandwidth inter-node communication required for distributed training. NCCL parameters were tuned specifically for the EFA network topology, including ring and tree allreduce algorithms optimized for the P5 instance’s network characteristics. These optimizations alone improved multi-node training throughput by 18% compared to Tensora’s previous configuration.
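A per-node launch wrapper in the spirit of that configuration is sketched below. The environment values shown are common starting points for EFA-attached instances rather than the exact tuning applied here, and the training script and config path are placeholders.

```python
"""Per-node launch wrapper sketch: export EFA/NCCL settings, then invoke
torchrun. Values are common EFA starting points, not the exact tuning."""
import os
import subprocess
import sys

EFA_NCCL_ENV = {
    "FI_PROVIDER": "efa",                  # use the EFA libfabric provider
    "FI_EFA_USE_DEVICE_RDMA": "1",         # enable GPUDirect RDMA over EFA
    "NCCL_ALGO": "Ring,Tree",              # restrict allreduce algorithm selection
    "NCCL_SOCKET_IFNAME": "^lo,docker0",   # skip loopback/docker interfaces for bootstrap
    "NCCL_DEBUG": "WARN",
}


def launch(nnodes: int, node_rank: int, master_addr: str, script: str, *script_args: str) -> int:
    """Run one node's worth of workers with the tuned environment applied."""
    env = {**os.environ, **EFA_NCCL_ENV}
    cmd = [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        "--nproc_per_node=8",              # 8 GPUs per P5/P4d node
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={master_addr}:29500",
        script,
        *script_args,
    ]
    return subprocess.call(cmd, env=env)


if __name__ == "__main__":
    sys.exit(launch(16, int(os.environ.get("NODE_RANK", "0")),
                    os.environ.get("MASTER_ADDR", "localhost"),
                    "train.py", "--config", "configs/llm_base.yaml"))  # placeholders
```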
Reliable Distributed Training Pipelines
Pipeline reliability was addressed through systematic engineering at every layer where failures had been occurring.
Checkpoint management was standardized across all training frameworks. DeepSpeed’s checkpoint system was configured with both synchronous checkpoints at regular intervals and asynchronous checkpoints triggered by preemption signals. Checkpoints were written to FSx for Lustre for speed and automatically synced to S3 for durability. The checkpoint retention policy kept the three most recent checkpoints locally and archived older checkpoints to S3 Glacier for cost efficiency.
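The sketch below illustrates that policy inside a DeepSpeed training loop, assuming the engine's save_checkpoint API: a synchronous save on a fixed step cadence, an immediate save when a preemption flag is set, and a mirror of the checkpoint directory to S3. Paths, bucket, and cadence are placeholders, and the retention and Glacier tiering are omitted.

```python
"""Checkpoint policy sketch for a DeepSpeed training loop. The preemption
event would be set by a watcher like the one above; paths and bucket are
placeholders."""
import os
import threading

import boto3

CHECKPOINT_DIR = "/fsx/checkpoints/llm-base"   # FSx for Lustre mount (placeholder)
S3_BUCKET = "example-model-checkpoints"        # placeholder durable copy
SAVE_EVERY = 500                               # steps between synchronous saves

s3 = boto3.client("s3")
preempted = threading.Event()  # set by the spot-interruption watcher


def upload_checkpoint(tag: str) -> None:
    """Mirror one checkpoint directory from FSx to S3 for durability."""
    root = os.path.join(CHECKPOINT_DIR, tag)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            local = os.path.join(dirpath, name)
            key = os.path.relpath(local, CHECKPOINT_DIR)
            s3.upload_file(local, S3_BUCKET, key)


def maybe_checkpoint(engine, step: int) -> None:
    """Save on the regular cadence, or immediately if a preemption notice arrived."""
    if step % SAVE_EVERY == 0 or preempted.is_set():
        tag = f"step-{step}"
        engine.save_checkpoint(CHECKPOINT_DIR, tag=tag)  # DeepSpeed engine API
        upload_checkpoint(tag)
```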
Health monitoring for training jobs went beyond basic instance health checks. Custom metrics tracked GPU utilization per device, gradient norm convergence, learning rate schedules, loss curves, and inter-node communication latency. Anomaly detection identified training runs that were diverging or exhibiting hardware issues — such as a single GPU with degraded memory bandwidth — before they consumed hours of compute on a doomed run.
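As one example of those custom metrics, the sketch below reports per-GPU utilization through nvidia-ml-py and CloudWatch custom metrics. The namespace, dimensions, and job ID are illustrative, and metrics such as gradient norm or inter-node latency would be pushed from the training loop in the same way.

```python
"""Per-GPU utilization reporting sketch using nvidia-ml-py and CloudWatch
custom metrics; namespace, dimensions, and job ID are illustrative."""
import time

import boto3
import pynvml  # pip install nvidia-ml-py

cloudwatch = boto3.client("cloudwatch")


def report_gpu_utilization(job_id: str, interval_seconds: int = 60) -> None:
    """Publish one GPUUtilization datapoint per device on a fixed interval."""
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()
    try:
        while True:
            metric_data = []
            for index in range(device_count):
                handle = pynvml.nvmlDeviceGetHandleByIndex(index)
                util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                metric_data.append({
                    "MetricName": "GPUUtilization",
                    "Dimensions": [
                        {"Name": "TrainingJobId", "Value": job_id},
                        {"Name": "GpuIndex", "Value": str(index)},
                    ],
                    "Unit": "Percent",
                    "Value": float(util.gpu),
                })
            cloudwatch.put_metric_data(Namespace="TrainingPlatform", MetricData=metric_data)
            time.sleep(interval_seconds)
    finally:
        pynvml.nvmlShutdown()
```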
Automated failure recovery handled the common failure modes that had previously required manual intervention. If a node failed during training, the system automatically checkpointed surviving nodes, terminated the failed instance, provisioned a replacement, loaded the checkpoint, and resumed training — all without researcher involvement. This process completed in under 10 minutes for most failure scenarios.
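A simplified sketch of the replacement step is shown below, assuming the training nodes are defined by an EC2 launch template. The resume step is a hypothetical stand-in, since in practice it reuses the checkpoint and provisioning machinery described earlier.

```python
"""Node-replacement sketch: terminate a failed instance, launch a like-for-like
replacement, and signal the job to resume. Region, launch template, and the
resume helper are illustrative."""
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region placeholder


def replace_failed_node(instance_id: str, launch_template_id: str) -> str:
    """Terminate the failed node and provision a replacement from the same template."""
    ec2.terminate_instances(InstanceIds=[instance_id])

    reservation = ec2.run_instances(
        LaunchTemplate={"LaunchTemplateId": launch_template_id},
        MinCount=1,
        MaxCount=1,
    )
    replacement_id = reservation["Instances"][0]["InstanceId"]

    # Wait until the replacement passes status checks before it rejoins the ring.
    ec2.get_waiter("instance_status_ok").wait(InstanceIds=[replacement_id])
    return replacement_id


def resume_from_latest_checkpoint(job_id: str) -> None:
    """Hypothetical: reload the most recent checkpoint and restart training."""
    print(f"Resuming job {job_id} from its latest checkpoint")
```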
MLOps Platform with SageMaker and MLflow
MLflow was deployed as the experiment tracking and model registry platform. Every training job automatically logged its configuration, hyperparameters, metrics, and output artifacts to MLflow. Researchers could compare experiments across any dimension, reproduce previous configurations exactly, and trace the lineage of any model back to its training data and parameters.
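A sketch of the per-run logging pattern, using the standard MLflow client API, is shown below; the tracking URI, experiment name, hyperparameters, and metric values are placeholders.

```python
"""MLflow experiment-tracking sketch; tracking URI, experiment name, and
hyperparameters are placeholders."""
import mlflow

mlflow.set_tracking_uri("http://mlflow.internal.example.com")  # placeholder
mlflow.set_experiment("llm-base-pretraining")

with mlflow.start_run(run_name="p5-16node-spot") as run:
    # Configuration and hyperparameters logged once at job start.
    mlflow.log_params({
        "model": "llm-base",
        "nodes": 16,
        "global_batch_size": 2048,
        "learning_rate": 3e-4,
        "dataset": "s3://example-training-datasets/v7",
    })

    # Metrics logged per step from the training loop (dummy values shown).
    for step, loss in enumerate([2.31, 2.18, 2.07], start=1):
        mlflow.log_metric("train_loss", loss, step=step)

    # Output artifacts: final checkpoint, tokenizer, evaluation report, etc.
    mlflow.log_artifacts("/fsx/checkpoints/llm-base/step-1500", artifact_path="checkpoint")

    # Model registry entry; in practice the checkpoint would be logged in
    # MLflow model format before registration.
    mlflow.register_model(f"runs:/{run.info.run_id}/checkpoint", "llm-base")
```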
SageMaker was integrated for model evaluation and deployment workflows. Trained models registered in MLflow could be deployed to SageMaker endpoints for inference testing with a single command, streamlining the path from experiment to evaluation.
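That promotion step might look like the sketch below using the SageMaker Python SDK, assuming a PyTorch model artifact in S3; the role ARN, artifact path, inference handler, framework version, and instance type are placeholders.

```python
"""Deploying a trained model to a SageMaker endpoint for inference testing;
role ARN, artifact location, handler, and instance type are placeholders."""
from sagemaker.deserializers import JSONDeserializer
from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import JSONSerializer

model = PyTorchModel(
    model_data="s3://example-model-checkpoints/llm-base/step-1500/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    entry_point="inference.py",      # custom request handler for the model
    framework_version="2.1",
    py_version="py310",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    endpoint_name="llm-base-eval",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

print(predictor.predict({"prompt": "Summarize the quarterly report in one paragraph."}))
```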
Job submission was standardized through a CLI and API that accepted a training configuration file specifying model architecture, dataset, hyperparameters, and compute requirements. The system handled all infrastructure provisioning, environment setup, and monitoring automatically. Researchers submitted jobs and monitored progress through dashboards — they never logged into an EC2 instance or debugged CUDA driver issues.
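To illustrate the shape of such a submission, the sketch below shows a hypothetical job configuration and client call; the schema, endpoint, and submit_job() helper are illustrative and not the platform's actual interface.

```python
"""Hypothetical job-submission sketch: the configuration schema and the
submit_job() helper below are illustrative, not the platform's actual API."""
import json
import urllib.request

JOB_CONFIG = {
    "name": "llm-base-ablation-17",
    "model": {"architecture": "llm-base", "config": "configs/llm_base.yaml"},
    "dataset": "s3://example-training-datasets/v7",
    "hyperparameters": {"learning_rate": 3e-4, "global_batch_size": 2048, "max_steps": 150000},
    "compute": {"instance_type": "p5.48xlarge", "nodes": 16, "capacity": "spot-with-fallback"},
    "checkpoint_every_steps": 500,
}


def submit_job(config: dict, endpoint: str = "http://scheduler.internal.example.com/jobs") -> str:
    """POST the config to the (hypothetical) scheduling API and return the job ID."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(config).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["job_id"]


if __name__ == "__main__":
    print("Submitted job:", submit_job(JOB_CONFIG))
```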
The Results
The managed GPU platform transformed Tensora’s ability to conduct AI research efficiently and economically.
45% GPU cost reduction brought monthly compute spend from $130K down to approximately $72K despite a significant increase in training volume. The savings came from spot instance integration (roughly 60% of the total), automated cluster lifecycle management that eliminated idle compute (25%), and multi-region placement that optimized for price (15%). Cost per training run for their flagship LLM decreased from approximately $4,200 to $2,300.
99.2% training pipeline reliability replaced the previous 72% completion rate. Over a three-month measurement period, 487 of 491 training jobs completed successfully without manual intervention. The four failures were caused by transient AWS service issues that exceeded the automated recovery system’s retry limits. The elimination of wasted compute from failed runs saved an estimated additional $18K per month beyond the direct cost optimizations.
15-minute job provisioning replaced the previous three-to-five-day wait for GPU capacity. Researchers submit a job configuration and the system provisions a GPU cluster, stages data, configures the training environment, and begins training — typically within 12 to 18 minutes depending on cluster size and spot availability. Multi-region capacity pooling means that a P5 or P4d cluster has been available for every job submission since the platform launched, with zero queuing delays.
3x faster experiment iteration was measured by tracking the number of completed training experiments per researcher per week. The combination of instant provisioning, reliable pipelines, and standardized MLOps workflows enabled researchers to run experiments that would have previously required days of infrastructure setup and manual monitoring. Tensora’s research team completed more experiments in the first quarter on the new platform than in the previous two quarters combined, directly accelerating their model development roadmap.
Tech Stack
Amazon EC2 P5 (NVIDIA H100) and P4d GPU instances, EC2 Spot Instances, Elastic Fabric Adapter (EFA), Amazon FSx for Lustre, Amazon S3 and S3 Glacier, Amazon SageMaker, PyTorch, DeepSpeed, NCCL, MLflow
GPU infrastructure was consuming our best engineers' time and our runway. Remangu gave us enterprise-grade training infrastructure that actually costs less than the fragile setup we were managing ourselves. Our researchers went from waiting days for compute to running experiments within minutes.
Priya Raghavan
Head of AI Infrastructure, Tensora AI
Similar Challenge?
Let's discuss how we can help your team achieve similar results.
Talk to an Expert