Operational Requirements (Aethir Earth)
1. Introduction & Objectives
At Aethir, our mission is to deliver enterprise-grade GPU compute performance through our distributed cloud network. Aethir offers all Enterprise AI customers a robust Service Level Agreement (SLA) as well as several Service Level Objectives (SLOs) to ensure high-quality performance, reliable service, and attentive customer support. Aethir strives to ensure that every Cloud Host meets our standards for high performance, reliability, and security. This document outlines the quality and performance benchmarks for GPUs, detailing our service level objectives, scope of support, escalation protocols, and excluded services.
2. Hardware Quality & Performance Standards
To guarantee optimal compute performance, all Cloud Hosts must adhere to the following minimum hardware requirements:
Approved GPU Server Locations: GPUs sourced from the following regions are not accepted: China, and OFAC/UN-sanctioned countries and regions, including Cuba, Iran, North Korea, Syria, and the Crimea, Donetsk People’s Republic, and Luhansk People’s Republic regions.
Performance Benchmarks: Only enterprise-grade GPUs that meet our benchmarked performance metrics, aligned with NVIDIA's official standards, are accepted. These metrics include:
Compute Performance (TFLOPs)
Memory and Bandwidth
Power Efficiency and Thermal Performance
Key Performance Indicators (KPIs): Uptime, latency, throughput, and error rates are continuously measured to ensure compliance with our performance guarantees, as defined in our attached Service Level Objectives (SLOs).
3. Service Level Objectives (SLO)
Aethir’s SLOs ensure high-quality service from Cloud Hosts.
Cloud Hosts are expected to reach:
99% Monthly Uptime Rate (MUR)
Daily Cumulative Offline Time < 30 minutes
Failure to meet these expectations will result in a slashing penalty as defined here.
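The two uptime expectations above can be checked with a few lines of arithmetic. The following is a minimal sketch only: the document does not define how MUR is calculated, so the formula used here (online minutes divided by total minutes across the days supplied) is an assumption, as are the function names and the input shape.

```python
from datetime import date

MUR_TARGET = 0.99             # 99% Monthly Uptime Rate
DAILY_OFFLINE_LIMIT_MIN = 30  # daily cumulative offline time must stay below this
MINUTES_PER_DAY = 1440


def monthly_uptime_rate(offline_minutes_by_day: dict[date, float]) -> float:
    """Assumed formula: (total minutes - offline minutes) / total minutes."""
    if not offline_minutes_by_day:
        raise ValueError("at least one day of data is required")
    total = MINUTES_PER_DAY * len(offline_minutes_by_day)
    offline = sum(offline_minutes_by_day.values())
    return (total - offline) / total


def slo_violations(offline_minutes_by_day: dict[date, float]) -> list[str]:
    """Return one human-readable message per missed expectation."""
    violations = []
    for day, minutes in sorted(offline_minutes_by_day.items()):
        if minutes >= DAILY_OFFLINE_LIMIT_MIN:
            violations.append(f"{day}: offline {minutes} min (must be < 30)")
    mur = monthly_uptime_rate(offline_minutes_by_day)
    if mur < MUR_TARGET:
        violations.append(f"MUR {mur:.2%} is below the 99% target")
    return violations
```

Under these assumptions the two limits are not interchangeable: a host that is offline 20 minutes on every day of a 30-day month passes each daily check but accumulates 600 offline minutes, which puts its monthly rate at roughly 98.6%.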
When Aethir’s 24/7 Operations Team discovers a technical issue, it will contact the Cloud Host. Cloud Hosts are expected to respond within 15 minutes and to resolve the issue within the timelines in the table below, based on the pre-defined priority level, to ensure appropriate reliability and performance.
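| Priority Level | Service Hours | Issue Description | Response Time | Resolution Time | Update Time |
| --- | --- | --- | --- | --- | --- |
| P0 | 24/7 | Service outage | 15 minutes | Up to 4 hours | Every 15 minutes |
| P1 | 12/7* | Significant service degradation | 15 minutes | Up to 8 hours | Every 30 minutes |
| P2 | 12/7* | 1 – Service reconfiguration; 2 – Requirement change affecting production service | 15 minutes | Up to 24 hours | Every 2 hours |
| P3 | 8/7* | Information or feature request | 15 minutes | Up to 72 hours | Every 12 hours |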
*Service hours depend on Cloud Host’s timezone for standard operational windows, with extended coverage for critical issues.
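As an illustration, the response, resolution, and update targets for each priority level can be turned into concrete deadlines for a reported issue. This is a sketch under stated assumptions: the clock is assumed to start when the issue is reported and to run in wall-clock time, ignoring the 12/7 and 8/7 service-hour windows, and the names `PriorityTargets`, `TARGETS`, and `deadlines` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PriorityTargets:
    response: timedelta      # time to first response
    resolution: timedelta    # maximum time to resolution
    update_every: timedelta  # interval between status updates


# Values copied from the priority table; the structure itself is illustrative.
TARGETS = {
    "P0": PriorityTargets(timedelta(minutes=15), timedelta(hours=4), timedelta(minutes=15)),
    "P1": PriorityTargets(timedelta(minutes=15), timedelta(hours=8), timedelta(minutes=30)),
    "P2": PriorityTargets(timedelta(minutes=15), timedelta(hours=24), timedelta(hours=2)),
    "P3": PriorityTargets(timedelta(minutes=15), timedelta(hours=72), timedelta(hours=12)),
}


def deadlines(priority: str, reported_at: datetime) -> dict[str, datetime]:
    """Assumption: all clocks start at reported_at and ignore service hours."""
    t = TARGETS[priority]
    return {
        "respond_by": reported_at + t.response,
        "resolve_by": reported_at + t.resolution,
        "first_update_by": reported_at + t.update_every,
    }
```

For example, a P1 issue reported at 12:00 UTC would need a response by 12:15, a first update by 12:30, and resolution by 20:00 the same day. A production version would also need to account for the Cloud Host's timezone and service-hour window, which this sketch leaves out.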
Cloud Hosts are responsible for notifying Aethir Operations of any planned downtime or migrations with at least seventy-two (72) hours’ advance notice. Further, when a P0 or P1 incident occurs, the Cloud Host must provide a Root Cause Analysis (RCA) detailing what happened and when, why it happened, and what will be done to mitigate and/or prevent recurrences of the incident in question.
Some examples (not exhaustive) of different operational processes (and their associated Priority Level) that will require Cloud Host involvement include:
Restoring SSH access - P0
Reprovisioning a node that is out of service - P0
Recovering from network degradation due to congestion or fiber cuts - P1
Opening up ports on a firewall - P2
Server Provisioning/Onboarding - P2
Server Replacement - P1 if service is degraded, P2 if no service impact
Storage/Mounting Issues - P2
Bare Metal Agent Issues - P2
Reinstalling or rolling back the Operating System (OS) - P2
Reinstalling or rolling back CUDA drivers - P2
Recovering from GPU performance degradation due to hardware failure - P2
Generating a Root Cause Analysis document as a post-mortem activity after a P0 or P1 incident - P3
Reprovisioning and cleaning a node after a Client has completed their use of the service, to ensure the machine is ready for the next Client - P3
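A ticketing integration could encode the examples above as a simple lookup. The sketch below is illustrative only: the process keys are hypothetical identifiers invented for this example, and the list is no more exhaustive than the examples it mirrors. Server replacement is handled separately because its priority depends on whether service is degraded.

```python
# Hypothetical process keys mapped to the priority levels given in the examples.
EXAMPLE_PRIORITIES = {
    "restore_ssh_access": "P0",
    "reprovision_out_of_service_node": "P0",
    "recover_network_degradation": "P1",
    "open_firewall_ports": "P2",
    "server_provisioning": "P2",
    "storage_mounting_issue": "P2",
    "bare_metal_agent_issue": "P2",
    "reinstall_or_rollback_os": "P2",
    "reinstall_or_rollback_cuda_drivers": "P2",
    "recover_gpu_hardware_degradation": "P2",
    "generate_rca": "P3",
    "post_client_node_cleanup": "P3",
}


def priority_for(process: str, service_degraded: bool = False) -> str:
    """Look up the example priority; raises KeyError for unlisted processes."""
    if process == "server_replacement":
        # P1 if service is degraded, P2 if there is no service impact.
        return "P1" if service_degraded else "P2"
    return EXAMPLE_PRIORITIES[process]
```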
4. Operational Scope of Services
Aethir’s Operational Support is structured into three levels to guarantee fast response times and efficient problem resolution (priority level labeled for every service).
Proactive Monitoring and First-Level Support
(P0) First-Level Debugging: Identify the root cause of issues and resolve issues where possible. Escalate to relevant Cloud Hosts if necessary.
(P0) Proactive Infrastructure Monitoring: Continuous monitoring of GPU infrastructure, including network health and server availability. Automatic ticket creation upon detection of incidents, with immediate notifications sent to Cloud Hosts via the agreed upon communication channel.
Intermediate Support and Server Management
(P1) Server Rebooting: Perform server reboots or rebuilds based on client requests.
(P2) New User Setup: Ensure that all new client information is safely stored and users gain secure access to servers. Verify that login issues are resolved before closing tickets.
(P3) Server Cleanup: After the client's usage period ends, the server will be reset to its original state and prepared for the next client.
Advanced Support and Server Provisioning
(P2) Network Configuration and Firewall Setup: Configure network policies, firewall rules, and SSH keys to ensure secure and uninterrupted access.
(P2) GPU Driver Support: Install and maintain NVIDIA GPU drivers, ensuring compatibility and functionality with customer workloads.
(P2) Server Provisioning: Provision servers based on client requirements. Configure hardware to meet the performance needs of GPU-intensive applications.
(P3) Operating System Installation: Install Ubuntu 22.04/24.04 LTS and apply all necessary security patches and updates during the initial setup.
(P3) BIOS/Firmware Updates: Ensure BIOS and firmware are always up-to-date to maintain performance and security.
5. Escalation Pipeline
If issues remain unresolved within the defined resolution times, the Cloud Host is required to provide an escalation pipeline for better problem solving:
Designated Contacts: Specific names, WhatsApp/Telegram/Slack contact, email addresses, timezone and operating hours are required for each escalation tier.
Communication Protocols: Regular updates will be provided as per the Update Time frequency for each priority level.
Escalation Steps: Issues not resolved within the target timeframe will automatically be escalated to higher level management and technical teams for immediate action.
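One way a Cloud Host might keep its escalation pipeline complete is to validate each tier's contact record and flag overdue issues automatically. This is a minimal sketch, not a prescribed format: the `EscalationContact` fields follow the required contact details listed above, while the class name, field names, and helper functions are assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timedelta


@dataclass
class EscalationContact:
    tier: int
    name: str
    chat_handle: str      # WhatsApp, Telegram, or Slack contact
    email: str
    timezone: str         # e.g. "UTC"
    operating_hours: str  # free text, e.g. "09:00-17:00"


def missing_fields(contact: EscalationContact) -> list[str]:
    """Names of required text fields that are empty or whitespace only."""
    return [
        f.name
        for f in fields(contact)
        if f.name != "tier" and not str(getattr(contact, f.name)).strip()
    ]


def should_escalate(reported_at: datetime, resolution_target: timedelta,
                    now: datetime, resolved: bool = False) -> bool:
    """True when an unresolved issue has passed its resolution target."""
    return not resolved and now > reported_at + resolution_target
```

A host could run `missing_fields` over every tier before submitting its contact list, and call `should_escalate` on open tickets at each update interval.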
6. Out of Scope Services
For clarity, Aethir defines the following services as out of scope for Cloud Hosts in each client engagement:
Orchestration and Management Software: The host and Aethir are not responsible for the installation, configuration, or management of orchestration software such as Slurm, Kubernetes, or Nomad. These tools require specific configurations based on individual workload requirements and will be the responsibility of the client to implement.
Ongoing Maintenance of Client-Deployed/Modified Software: Technical support and maintenance for software or services that are deployed or modified by the client are out of scope for both the host and Aethir. This includes troubleshooting and assisting with issues that arise in client-deployed software.
Training and Usage: Neither the host nor Aethir will provide the client with training on how to use the GPU clusters or consultation on best practices for optimizing GPU utilization. The client is expected to understand the behavior of the GPU cluster themselves, including its workload management and optimization techniques.
Third-party Software Licensing: Any licensing required for operating systems or other software utilized in the GPU clusters is the responsibility of the client and will not be supported by either the host or Aethir.
7. Monitoring, Testing, & Reporting
For clarity, Aethir performs monitoring and testing on the machines, and also encourages each Cloud Host to carry out the following monitoring, testing, and reporting activities:
A. Real-Time Monitoring
Alerting: Automated alerts trigger if any metric deviates from predefined SLA targets, such as availability or connectivity.
B. Testing & Validation Protocols
Pre-Onboarding Tests: Standardized stress, benchmark, and load tests are conducted prior to onboarding a Cloud Host or Client.
Regular Validation: Scheduled testing cycles ensure sustained GPU performance over time.
C. Reporting
Frequency & Content: Performance reports are generated at regular intervals and include detailed uptime statistics, latency, throughput metrics, and incident logs.
8. Data Incidents
In the event of a data security incident, the following terms apply. For this section, the Cloud Host is responsible for incident notification, and Aethir is the recipient:
Incident Notification: The Cloud Host will notify Aethir promptly and without undue delay upon becoming aware of a data security incident and will take all commercially reasonable steps to minimize harm and secure Client Data.
Details of Data Security Incident: The Cloud Host’s notification will describe, to the extent possible, the nature of the incident, the measures taken to mitigate potential risks, and any recommended steps Aethir should take to address the incident.
Delivery of Notification: All notifications of data security incidents will be delivered to the email address provided by Aethir.
No Assessment of Client Data: The Cloud Host has no obligation to assess Client Data to identify information subject to any specific legal requirements.
No Acknowledgement of Fault: The Cloud Host’s notification of or response to a data security incident under this section will not be construed as an acknowledgement of any fault or liability regarding the incident.
To ensure continuous improvement and relevance of our SLA and SLOs, Aethir will regularly review performance and support to address any changes in technology, client needs, or operational requirements.