Operational Requirements (Aethir Earth)

1. Introduction & Objectives

At Aethir, our mission is to deliver enterprise-grade GPU compute performance through our distributed cloud network. Aethir offers all Enterprise AI customers a robust Service Level Agreement (SLA) as well as several Service Level Objectives (SLOs) to ensure high-quality performance, reliable service, and attentive customer support. Aethir strives to ensure that every Cloud Host aligns with our standards for high performance, reliability, and security. This document outlines the quality and performance benchmarks for GPUs and details our service level objectives, scope of support, escalation protocols, and excluded services.

2. Hardware Quality & Performance Standards

To guarantee optimal compute performance, all Cloud Hosts must adhere to the following minimum hardware requirements:

  • Approved GPU Server Locations: GPUs sourced from China or from OFAC/UN-sanctioned countries and regions (including Cuba, Iran, North Korea, Syria, Crimea, the Donetsk People’s Republic, and the Luhansk People’s Republic) are not accepted.

  • Server Specifications Requirements

    • CPU: Servers must utilize x86 architecture.

    • GPU: For servers equipped with more than one GPU, all GPUs must be of the same model to ensure hardware uniformity and optimized parallel computing.

    • Infrastructure Type: The infrastructure must consist of bare-metal servers (i.e., non-virtualized physical machines).

    • Operating System: Ubuntu 20.04 LTS or Ubuntu 22.04 LTS. Debian and other derivative or similar distributions are not supported. (Assistance with server clean-up and OS installation is available upon request.)

  • Network Accessibility

    • Each server must be assigned a publicly routable IP address.

    • SSH access must be enabled for remote management and monitoring.

  • Performance Benchmarks: Only enterprise-grade GPUs that meet our benchmarked performance metrics, aligned with NVIDIA's official standards, are accepted. These metrics include:

    • Compute Performance (TFLOPs)

    • Memory and Bandwidth

    • Power Efficiency and Thermal Performance

  • Key Performance Indicators (KPIs): Uptime, latency, throughput, and error rates are continuously measured to ensure compliance with our performance guarantees, as defined in our attached Service Level Objectives (SLOs).
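As an illustration only, the server specification and network requirements above could be turned into a simple pre-onboarding self-check. The sketch below is not an Aethir tool: the function names (`parse_os_release`, `check_host`) and the set of accepted architecture strings are assumptions made for this example. It covers the x86, Ubuntu version, GPU uniformity, and public IP requirements, and does not attempt to verify that the machine is bare metal.

```python
import ipaddress

# Supported releases, taken from the Operating System requirement above.
SUPPORTED_UBUNTU = {"20.04", "22.04"}
# Assumption: "x86" is read as 64-bit x86, as reported by most Linux hosts.
X86_MACHINES = {"x86_64", "amd64"}


def parse_os_release(text):
    """Parse the KEY=value lines of an /etc/os-release file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"')
    return fields


def check_host(machine, os_release_text, gpu_names, ip):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if machine.lower() not in X86_MACHINES:
        problems.append(f"CPU architecture {machine!r} is not x86")
    os_info = parse_os_release(os_release_text)
    if os_info.get("ID") != "ubuntu" or os_info.get("VERSION_ID") not in SUPPORTED_UBUNTU:
        problems.append("OS must be Ubuntu 20.04 LTS or Ubuntu 22.04 LTS")
    if not gpu_names:
        problems.append("no GPU detected")
    elif len(set(gpu_names)) > 1:
        problems.append(f"mixed GPU models: {sorted(set(gpu_names))}")
    if not ipaddress.ip_address(ip).is_global:
        problems.append(f"IP address {ip} is not publicly routable")
    return problems
```

On a real host, the inputs could be gathered with `platform.machine()`, the contents of `/etc/os-release`, and the output of `nvidia-smi --query-gpu=name --format=csv,noheader` (one GPU name per line). Passing the values in as arguments keeps the checks themselves easy to test without GPU hardware.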

3. Service Level Objectives (SLO)

Aethir’s SLOs ensure that Cloud Hosts provide high-quality service.

  1. Cloud Hosts are expected to reach:

    1. 99% Monthly Uptime Rate (MUR)

    2. Daily Cumulative Offline Time < 30 minutes

  2. When an issue is reported by Aethir’s 24/7 Operations Team, Cloud Hosts must adhere to the following response and resolution timelines based on the pre-defined priority level. Our Operations team will contact Cloud Hosts when a technical issue is discovered. We expect Cloud Hosts to respond within 15 minutes and to resolve the issue within the time given in the table below to ensure appropriate reliability and performance.

| Priority Level | Service Hours | Description | Response Time | Resolution Time | Update Time |
| --- | --- | --- | --- | --- | --- |
| P0 | 24/7 | Service outage | 15 minutes | Up to 4 hours | Every 15 minutes |
| P1 | 12/7* | Significant service degradation | 15 minutes | Up to 8 hours | Every 30 minutes |
| P2 | 12/7* | 1. Service reconfiguration; 2. Requirement change affecting production service | 15 minutes | Up to 24 hours | Every 2 hours |
| P3 | 8/7* | Information or feature request | 15 minutes | Up to 72 hours | Every 12 hours |

*Service hours depend on Cloud Host’s timezone for standard operational windows, with extended coverage for critical issues.
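To make the timelines concrete, here is a minimal sketch that derives the response, resolution, and first-update deadlines for an incident from the figures in the SLO table. The names `SLO_TARGETS` and `deadlines` are hypothetical. The sketch also assumes that every timer starts when the issue is reported and runs continuously; it does not model the 12/7 and 8/7 service-hour windows, which depend on the Cloud Host's timezone.

```python
from datetime import datetime, timedelta, timezone

# Response, resolution, and update intervals copied from the SLO table.
SLO_TARGETS = {
    "P0": {"response": timedelta(minutes=15), "resolution": timedelta(hours=4),
           "update": timedelta(minutes=15)},
    "P1": {"response": timedelta(minutes=15), "resolution": timedelta(hours=8),
           "update": timedelta(minutes=30)},
    "P2": {"response": timedelta(minutes=15), "resolution": timedelta(hours=24),
           "update": timedelta(hours=2)},
    "P3": {"response": timedelta(minutes=15), "resolution": timedelta(hours=72),
           "update": timedelta(hours=12)},
}


def deadlines(priority, reported_at):
    """Return the deadlines for an incident reported at `reported_at`.

    Assumption: timers run continuously from the report time; service-hour
    windows are ignored in this sketch.
    """
    target = SLO_TARGETS[priority]
    return {
        "respond_by": reported_at + target["response"],
        "resolve_by": reported_at + target["resolution"],
        "first_update_by": reported_at + target["update"],
    }


if __name__ == "__main__":
    reported = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    for name, when in deadlines("P1", reported).items():
        print(name, when.isoformat())
```

Under these assumptions, a P1 incident reported at 12:00 UTC should be acknowledged by 12:15, receive its first update by 12:30, and be resolved by 20:00 the same day.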

  1. Cloud Hosts are responsible for notifying Aethir Operations of any planned downtime or migrations with at least seventy-two (72) hours’ advance notice. Further, when a P0 or P1 incident occurs, the Cloud Host must provide a Root Cause Analysis (RCA) detailing what happened and when, why it happened, and what will be done to mitigate and/or prevent recurrences of the incident.

  2. Some examples (not exhaustive) of different operational processes (and their associated Priority Level) that will require Cloud Host involvement include:

    1. Restoring SSH access - P0

    2. Reprovisioning a node that is out of service - P0

    3. Recovering from network degradation due to congestion or fiber cuts - P1

    4. Opening up ports on a firewall - P2

    5. Server Provisioning/Onboarding - P2

    6. Server Replacement - P1 if service is degraded, P2 if no service impact

    7. Storage/Mounting Issues - P2

    8. Bare Metal Agent Issues - P2

    9. Reinstalling or rolling back the Operating System (OS) - P2

    10. Reinstalling or rolling back CUDA drivers - P2

    11. Recovering from GPU performance degradation due to hardware failure - P2

    12. Generating a Root Cause Analysis document as a post-mortem activity after a P0 or P1 incident - P3

    13. Reprovisioning and cleaning a node after a Client has completed their use of the service, to ensure the machine is ready for the next Client - P3
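The two uptime targets stated at the start of this section (99% Monthly Uptime Rate and less than 30 minutes of cumulative offline time per day) can be checked with simple arithmetic. The sketch below is illustrative: this document does not give a formula for MUR, so the code assumes it is the share of online minutes among all minutes in the period, and the function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def monthly_uptime_rate(offline_minutes_per_day):
    """Uptime as a percentage of all minutes in the period.

    Assumption: MUR = (total minutes - offline minutes) / total minutes.
    """
    total = MINUTES_PER_DAY * len(offline_minutes_per_day)
    return 100.0 * (total - sum(offline_minutes_per_day)) / total


def slo_violations(offline_minutes_per_day, min_uptime=99.0, max_daily_offline=30):
    """Return a list of violated targets; an empty list means both are met."""
    violations = []
    rate = monthly_uptime_rate(offline_minutes_per_day)
    if rate < min_uptime:
        violations.append(f"monthly uptime {rate:.2f}% is below {min_uptime}%")
    for day, minutes in enumerate(offline_minutes_per_day, start=1):
        # The daily target is strictly less than 30 minutes offline.
        if minutes >= max_daily_offline:
            violations.append(f"day {day}: {minutes} minutes offline")
    return violations
```

Under this assumed definition, a 30-day month has 43,200 minutes, so 99% uptime leaves a budget of 432 offline minutes. The two targets constrain different things: a host that is offline 29 minutes every day passes each daily check but accumulates 870 offline minutes (about 97.99% uptime), while a single 45-minute outage in an otherwise clean month keeps uptime near 99.9% but breaks the daily limit.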

4. Operational Scope of Services

Aethir’s Operational Support is structured into three levels to guarantee fast response times and efficient problem resolution (priority level labeled for every service).

Proactive Monitoring and First-Level Support

  • (P0) First-Level Debugging: Identify the root cause of issues and resolve issues where possible. Escalate to relevant Cloud Hosts if necessary.

  • (P0) Proactive Infrastructure Monitoring: Continuous monitoring of GPU infrastructure, including network health and server availability. Automatic ticket creation upon detection of incidents, with immediate notifications sent to Cloud Hosts via the agreed upon communication channel.

Intermediate Support and Server Management

  • (P1) Server Rebooting: Perform server reboots or rebuilds based on client requests.

  • (P2) New User Setup: Ensure that all new client information is safely stored and users gain secure access to servers. Verify that login issues are resolved before closing tickets.

  • (P3) Server Cleanup: After the client's usage period ends, the server will be reset to its original state and prepared for the next client.

Advanced Support and Server Provisioning

  • (P2) Network Configuration and Firewall Setup: Configure network policies, firewall rules, and SSH keys to ensure secure and uninterrupted access.

  • (P2) GPU Driver Support: Install and maintain NVIDIA GPU drivers, ensuring compatibility and functionality with customer workloads.

  • (P2) Server Provisioning: Provision servers based on client requirements. Configure hardware to meet the performance needs of GPU-intensive applications.

  • (P3) Operating System Installation: Install Ubuntu 20.04/22.04 LTS and apply all necessary security patches and updates during the initial setup.

  • (P3) BIOS/Firmware Updates: Ensure BIOS and firmware are always up-to-date to maintain performance and security.

5. Escalation Pipeline

If issues remain unresolved within the defined resolution times, the Cloud Host is required to provide an escalation pipeline for better problem solving:

  • Designated Contacts: Specific names, WhatsApp/Telegram/Slack contact, email addresses, timezone and operating hours are required for each escalation tier.

  • Communication Protocols: Regular updates will be provided as per the Update Time frequency for each priority level.

  • Escalation Steps: Issues not resolved within the target timeframe will automatically be escalated to higher level management and technical teams for immediate action.
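One possible way to record the designated contacts listed above is a small structured record per escalation tier, checked for completeness before it is shared with Aethir. This is a sketch only: the `EscalationContact` class, its field names, and the `contact_problems` helper are assumptions for illustration and do not describe a format that Aethir prescribes.

```python
from dataclasses import dataclass, asdict

# Messaging channels named in the Designated Contacts requirement.
ALLOWED_CHANNELS = {"WhatsApp", "Telegram", "Slack"}


@dataclass
class EscalationContact:
    tier: int              # escalation tier, e.g. 1 for the first point of contact
    name: str
    channel: str           # one of ALLOWED_CHANNELS
    handle: str            # phone number or username on that channel
    email: str
    timezone: str          # free-form here, e.g. "UTC+8"
    operating_hours: str   # free-form here, e.g. "09:00-21:00"


def contact_problems(contact):
    """Return a list of missing or malformed fields for one contact."""
    problems = [
        f"{field} is empty"
        for field, value in asdict(contact).items()
        if isinstance(value, str) and not value.strip()
    ]
    if contact.channel.strip() and contact.channel not in ALLOWED_CHANNELS:
        problems.append(f"channel {contact.channel!r} is not WhatsApp, Telegram, or Slack")
    if contact.email.strip() and "@" not in contact.email:
        problems.append("email does not look like an email address")
    return problems
```

Running `contact_problems` over every tier before onboarding gives a quick check that no escalation tier is missing a name, messaging contact, email address, timezone, or operating hours.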

6. Out of Scope Services

For clarity, Aethir defines the following services as out of scope for Cloud Hosts when managing any client:

  • Orchestration and Management Software: The host and Aethir are not responsible for the installation, configuration, or management of orchestration software such as Slurm, Kubernetes, or Nomad. These tools require specific configurations based on individual workload requirements and will be the responsibility of the client to implement.

  • Ongoing Maintenance of Client-Deployed/Modified Software: The host and Aethir are not responsible for technical support or maintenance of software or services that the client deploys or modifies. This includes troubleshooting and assisting with issues that arise in client-deployed software.

  • Training and Usage: Neither the host nor Aethir provides the client with training on how to use the GPU clusters or consultation on best practices for optimizing GPU utilization. The client is expected to understand the behavior of the GPU cluster, including its workload management and optimization techniques.

  • Third-party Software Licensing: Any licensing required for operating systems or other software utilized in the GPU clusters is the responsibility of the client and is not supported by either the host or Aethir.

7. Monitoring, Testing, & Reporting

For clarity, Aethir monitors and tests the machines itself, and also welcomes each Cloud Host to carry out the following monitoring, testing, and reporting activities:

A. Real-Time Monitoring

  • Alerting: Automated alerts trigger if any metric deviates from predefined SLA targets, such as availability or connectivity.

B. Testing & Validation Protocols

  • Pre-Onboarding Tests: Standardized stress, benchmark, and load tests are conducted prior to onboarding a Cloud Host or Client.

  • Regular Validation: Scheduled testing cycles ensure sustained GPU performance over time.

C. Reporting

  • Frequency & Content: Performance reports are generated at regular intervals and include detailed uptime statistics, latency, throughput metrics, and incident logs.

8. Data Incidents

In the event of a data security incident, the following terms apply. For this section, the Cloud Host is responsible for incident notification, and Aethir is the recipient:

  • Incident Notification: The Cloud Host will notify Aethir promptly and without undue delay upon becoming aware of a data security incident and will take all commercially reasonable steps to minimize harm and secure Client Data.

  • Details of Data Security Incident: The Cloud Host’s notification will describe, to the extent possible, the nature of the incident, the measures taken to mitigate potential risks, and any recommended steps Aethir should take to address the incident.

  • Delivery of Notification: All notifications of data security incidents will be delivered to the email address provided by Aethir.

  • No Assessment of Client Data: The Cloud Host has no obligation to assess Client Data to identify information subject to any specific legal requirements.

  • No Acknowledgement of Fault: The Cloud Host’s notification of or response to a data security incident under this section will not be construed as an acknowledgement of any fault or liability regarding the incident.

To ensure continuous improvement and relevance of our SLA and SLOs, Aethir will regularly review performance and support to address any changes in technology, client needs, or operational requirements.


Last updated 1 month ago

Failure to meet these expectations will result in a slashing penalty as defined here.