Functionality
In this section, we introduce the utility functions used to calculate HyperTool-specific attributes for Native Nodes. For each function, we specify how the resulting attribute is exposed in Kubernetes (as a label, annotation, etc.), along with its dependencies, parameters, example output, and error handling.
interface_name_and_type
The interface_name_and_type annotations describe the network interface name and type used to reach a specific IP destination, such as the Kubernetes API server. The lookup is performed with the pyroute2 library: the system's routing table is queried to determine which interface would carry traffic to the specified IP address, and the output reports that interface's name and network type (e.g., ethernet, wireless).
Kubernetes type: Annotation
Dependencies:
pyroute2 : IPRoute class for routing table queries
Parameters:
ip(str): The IP address of the Kubernetes API server.
Example Output:
hyperai.eu/node-interface: eth0
hyperai.eu/node-network-type: ethernet
Error Handling:
If no route is found or an error occurs, the function returns default values:
hyperai.eu/node-interface: "unknown"
hyperai.eu/node-network-type: "unknown"
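The lookup can be sketched as follows. This is an illustrative implementation, not HyperTool's actual code: it assumes pyroute2's IPRoute API and uses the presence of a `wireless` directory in sysfs to distinguish wireless interfaces.

```python
# Sketch of the interface_name_and_type lookup, assuming pyroute2 is
# installed; names here are illustrative, not the actual HyperTool code.
import os

def interface_name_and_type(ip):
    """Return (interface_name, network_type) for the route to `ip`."""
    try:
        from pyroute2 import IPRoute
        with IPRoute() as ipr:
            # Ask the kernel which route it would use for this destination.
            route = ipr.route("get", dst=ip)[0]
            oif = route.get_attr("RTA_OIF")        # outgoing interface index
            link = ipr.get_links(oif)[0]
            name = link.get_attr("IFLA_IFNAME")
        # A wireless interface exposes a `wireless` directory in sysfs.
        is_wireless = os.path.isdir(f"/sys/class/net/{name}/wireless")
        return name, "wireless" if is_wireless else "ethernet"
    except Exception:
        # Mirrors the documented error handling: fall back to "unknown".
        return "unknown", "unknown"
```

If pyroute2 is missing or no route exists, the function degrades to the documented `"unknown"` defaults rather than raising.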
node_available_interfaces
The node_available_interfaces annotation provides a list of all network interfaces available on the node. This information is useful for understanding the node's network configuration.
Kubernetes type: Annotation
Dependencies:
netifaces : Provides a list of the available network interfaces on the system.
Parameters:
None
Example Output:
hyperai.eu/node-available-interfaces: eth0, eth1, wlan0
Error Handling:
There is no specific error handling for this function.
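A minimal sketch of this function, assuming the netifaces package; the `/sys/class/net` fallback is an assumption for Linux hosts where netifaces is unavailable:

```python
# Sketch of node_available_interfaces; the comma-separated string matches
# the annotation format shown in the example output above.
import os

def node_available_interfaces():
    try:
        import netifaces
        interfaces = netifaces.interfaces()
    except ImportError:
        # Linux-only fallback when netifaces is not installed (assumption).
        interfaces = sorted(os.listdir("/sys/class/net"))
    return ", ".join(interfaces)
```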
latency
The latency annotation describes the grade of the network latency to the node’s control plane.
We measure the round-trip time (RTT) of a ping to the control plane's API server and classify the latency using the following scale:
A (Excellent) : < 500ms
B (Good) : < 1000ms
C (Fair) : < 2000ms
D (Poor) : >= 2000ms
Kubernetes type: Annotation
Dependencies:
ping3 : a pure-Python ICMP ping implementation using raw sockets.
Parameters:
ip(str): The IP address of the Kubernetes API server.
Example Output:
hyperai.eu/node-latency: A
Error Handling:
If no route is found or an error occurs, the function returns the worst grade:
hyperai.eu/node-latency: D
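The grading step can be sketched as a pure function; the thresholds are taken directly from the scale above. The RTT itself would come from ping3 (whose `ping` call returns `None` on timeout and `False` on errors), so both failure values map to the worst grade:

```python
# Latency grading sketch; thresholds mirror the documented scale.

def grade_latency(rtt_ms):
    """Map an RTT in milliseconds (or None/False on failure) to a grade."""
    # ping3.ping returns None on timeout and False on errors.
    if rtt_ms is None or rtt_ms is False:
        return "D"          # documented worst-case fallback
    if rtt_ms < 500:
        return "A"
    if rtt_ms < 1000:
        return "B"
    if rtt_ms < 2000:
        return "C"
    return "D"
```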
bandwidth
The bandwidth annotations provide grades for the node's download and upload speeds.
Kubernetes type: Annotations
Dependencies:
speedtest-cli : Used to measure the network bandwidth.
Parameters:
None
Example Output:
hyperai.eu/node-download-speed: A
hyperai.eu/node-upload-speed: B
Error Handling:
If the bandwidth cannot be determined, the function returns default values:
hyperai.eu/node-download-speed: "unknown"
hyperai.eu/node-upload-speed: "unknown"
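The measurement can be sketched with the speedtest-cli package (imported as `speedtest`). Note that the document does not specify the speed-to-grade thresholds, so the ones below are purely illustrative:

```python
# Bandwidth measurement sketch using speedtest-cli. The A-D thresholds
# are hypothetical; the actual grading scale is not documented here.

def measure_bandwidth():
    """Return (download_mbps, upload_mbps), or (None, None) on failure."""
    try:
        import speedtest
        st = speedtest.Speedtest()
        st.get_best_server()
        download = st.download() / 1e6   # bits/s -> Mbit/s
        upload = st.upload() / 1e6
        return download, upload
    except Exception:
        return None, None

def grade_speed(mbps):
    if mbps is None:
        return "unknown"                 # documented fallback value
    if mbps >= 100:                      # hypothetical thresholds below
        return "A"
    if mbps >= 50:
        return "B"
    if mbps >= 10:
        return "C"
    return "D"
```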
packetLoss
The packetLoss annotation measures the percentage of packets lost during transmission to a target host. The loss is categorized into grades based on the percentage of lost packets, providing a quick assessment of network reliability.
The function uses the system’s ping command to send ICMP packets and analyzes the output to extract packet loss percentage. The loss percentage is then classified into one of six grades:
Grade A: 0% packet loss (perfect)
Grade B: ≤ 20% packet loss
Grade C: ≤ 40% packet loss
Grade D: ≤ 60% packet loss
Grade E: ≤ 80% packet loss
Grade F: > 80% packet loss
Kubernetes type: Annotation
Dependencies:
subprocess : Used to execute the system ping command
socket : Used for network socket operations
re : Used to parse ping output for packet loss percentage
Parameters:
ip(str): The IP address of the Kubernetes API server.
Example Output:
hyperai.eu/node-packet-loss: A (0% loss)
hyperai.eu/node-packet-loss: B (1-20% loss)
hyperai.eu/node-packet-loss: F (>80% loss)
Error Handling:
If packet loss cannot be determined, the function defaults to 100% packet loss and assigns the worst grade:
hyperai.eu/node-packet-loss: F
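The measure-and-grade flow described above can be sketched as follows; the regex matches the "X% packet loss" summary line produced by Linux ping, and the grade boundaries follow the documented scale:

```python
# Packet-loss sketch: run the system ping, parse the loss percentage,
# and grade it. On any failure, assume 100% loss (documented fallback).
import re
import subprocess

def measure_packet_loss(ip, count=4):
    """Return the packet-loss percentage, or 100.0 on any failure."""
    try:
        out = subprocess.run(
            ["ping", "-c", str(count), ip],
            capture_output=True, text=True, timeout=30,
        ).stdout
        match = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
        return float(match.group(1)) if match else 100.0
    except Exception:
        return 100.0

def grade_packet_loss(pct):
    if pct == 0:
        return "A"
    for grade, limit in (("B", 20), ("C", 40), ("D", 60), ("E", 80)):
        if pct <= limit:
            return grade
    return "F"
```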
nodeCategory
The nodeCategory annotation provides a composite classification of a node’s computational capacity based on its CPU cores and RAM. This categorization enables quick identification of node capabilities for workload placement and resource allocation decisions.
CPU Grading Scale:
Grade F: 1 core
Grade E: 2-3 cores
Grade D: 4-7 cores
Grade C: 8-15 cores
Grade B: 16-31 cores
Grade A: 32+ cores
RAM Grading Scale:
Grade E: 0-8 GB
Grade D: 8-16 GB
Grade C: 16-32 GB
Grade B: 32-64 GB
Grade A: 64+ GB
Kubernetes type: Annotation
Dependencies:
os : os.cpu_count() for CPU core detection
psutil : psutil.virtual_memory() for total RAM detection
Parameters:
None
Example Output:
hyperai.eu/node-category: CPU_C_RAM_B (8-15 cores, 32-64 GB RAM)
hyperai.eu/node-category: CPU_A_RAM_A (32+ cores, 64+ GB RAM)
hyperai.eu/node-category: CPU_D_RAM_C (4-7 cores, 16-32 GB RAM)
Error Handling:
The function uses fallback values if detection fails:
CPU cores default to 1 if os.cpu_count() returns None
RAM is retrieved via psutil.virtual_memory().total
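The grading scales above can be sketched as pure functions. Since the RAM ranges overlap at their edges (e.g., exactly 8 GB), the sketch assumes a value on a boundary falls into the lower grade:

```python
# nodeCategory sketch implementing the documented CPU and RAM scales.
import os

def cpu_grade(cores):
    if cores >= 32: return "A"
    if cores >= 16: return "B"
    if cores >= 8:  return "C"
    if cores >= 4:  return "D"
    if cores >= 2:  return "E"
    return "F"

def ram_grade(gb):
    # Boundary values round down to the lower grade (assumption).
    if gb > 64: return "A"
    if gb > 32: return "B"
    if gb > 16: return "C"
    if gb > 8:  return "D"
    return "E"

def node_category():
    cores = os.cpu_count() or 1          # documented fallback: 1 core
    try:
        import psutil
        ram_gb = psutil.virtual_memory().total / 2**30
    except ImportError:
        ram_gb = 0
    return f"CPU_{cpu_grade(cores)}_RAM_{ram_grade(ram_gb)}"
```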
uptime
The uptime label provides a grade indicating how long a Kubernetes node has been running since its last boot. This metric is useful for identifying stable, long-running nodes versus recently restarted ones, which can inform scheduling decisions and maintenance planning.
The uptime is calculated by measuring the time elapsed since system boot and categorizing it into grades:
Grade A: ≥ 7 days (10,080 minutes) - Oldest/most stable nodes
Grade B: 2-7 days (2,880-10,079 minutes)
Grade C: 1-2 days (1,440-2,879 minutes)
Grade D: < 1 day (< 1,440 minutes) - Newest nodes
Kubernetes type: Label
Dependencies:
Parameters:
None
Example Output:
hyperai.eu/node-uptime: A (node running for 7+ days)
hyperai.eu/node-uptime: B (node running for 2-7 days)
hyperai.eu/node-uptime: D (node running for less than 1 day)
Error Handling:
If the boot time cannot be determined, the function will return grade D.
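The grading can be sketched as below. Reading the boot time via psutil is an assumption (the document does not list this function's dependency); the minute thresholds mirror the ranges above:

```python
# Uptime grading sketch; psutil.boot_time() is an assumed data source.
import time

def grade_uptime_minutes(minutes):
    if minutes >= 10_080: return "A"     # >= 7 days
    if minutes >= 2_880:  return "B"     # 2-7 days
    if minutes >= 1_440:  return "C"     # 1-2 days
    return "D"                           # < 1 day (also the error fallback)

def uptime_grade():
    try:
        import psutil
        minutes = (time.time() - psutil.boot_time()) / 60
        return grade_uptime_minutes(minutes)
    except Exception:
        return "D"                       # documented fallback grade
```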
TPU
The TPU label provides the total number of TPU (Tensor Processing Unit) devices available on the node.
Kubernetes type: Label
Dependencies:
pathlib : Used to check for TPU device files
subprocess : Used to execute lspci command (optional fallback)
Parameters:
None
Example Output:
hyperai.eu/node-tpu-capacity: 4 (4 TPU devices detected)
Error Handling:
If no TPU devices are found or an error occurs during detection, the function returns None and no label is added to the node.
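A detection sketch under stated assumptions: the device paths (`/dev/accel*` on Cloud TPU VMs, `/dev/apex_*` for Edge TPUs) are guesses at what the pathlib check looks for, and the fallback simply counts lspci lines mentioning "TPU":

```python
# TPU detection sketch; device-file patterns are assumptions.
import subprocess
from pathlib import Path

def tpu_count():
    """Return the number of TPU devices, or None if none are found."""
    devices = list(Path("/dev").glob("accel*")) + list(Path("/dev").glob("apex_*"))
    if devices:
        return len(devices)
    try:
        out = subprocess.run(["lspci"], capture_output=True, text=True).stdout
        count = sum(1 for line in out.splitlines() if "TPU" in line)
        return count or None
    except Exception:
        return None                      # no label is added in this case
```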
Accelerators
The Accelerators label provides the total count of hardware accelerators (GPUs and other compute accelerators) available for workload scheduling on the node.
Kubernetes type: Label
Dependencies:
pathlib : Used to scan for accelerator device files
Parameters:
None
Example Output:
hyperai.eu/node-allocatable-accelerators: 2 (2 accelerator devices detected)
Error Handling:
If no accelerator devices are found or an error occurs during detection, the function returns None and no label is added to the node.
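A sketch of the device scan; the `/dev/nvidia[0-9]*` pattern is an assumption that covers NVIDIA GPUs only, whereas the real function may scan additional accelerator device files:

```python
# Accelerator-count sketch scanning /dev for GPU device files.
from pathlib import Path

def accelerator_count():
    """Return the number of accelerator devices, or None if none exist."""
    gpus = [p for p in Path("/dev").glob("nvidia[0-9]*") if p.is_char_device()]
    return len(gpus) or None             # None -> no label is added
```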
NodePool
The NodePool label assigns a logical pool or role to a Kubernetes node (or removes it), enabling grouping of nodes with similar characteristics (e.g., ML-optimized nodes, CPU-only nodes, GPU nodes) for workload scheduling and placement policies.
Kubernetes type: Label
Dependencies:
None (the label is applied/removed via HyperTool CLI)
Parameters:
None at runtime; the label value is provided manually by the operator.
Example Output:
hyperai.eu/node-pool: ml-node
hyperai.eu/node-pool: cpu-node
hyperai.eu/node-pool: gpu-node
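The HyperTool CLI applies this label; the equivalent low-level operation with plain kubectl would be (illustrative, with a placeholder node name):

```shell
# Assign a node to the GPU pool
kubectl label node <node-name> hyperai.eu/node-pool=gpu-node
# Remove the pool label (a trailing dash deletes a label)
kubectl label node <node-name> hyperai.eu/node-pool-
```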
geolocation
The geolocation labels provide the geographical location of the node based on its public IP address.
Kubernetes type: Label
Dependencies:
geocoder : Used to retrieve the geolocation information based on the public IP address.
Parameters:
None
Example Output:
hyperai.eu/node-geolocation-city: Athens
hyperai.eu/node-geolocation-region: Attica
hyperai.eu/node-geolocation-country: GR
Error Handling:
If the geolocation cannot be determined, the function returns default values:
hyperai.eu/node-geolocation-city: "unknown"
hyperai.eu/node-geolocation-region: "unknown"
hyperai.eu/node-geolocation-country: "unknown"
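A sketch using the geocoder package: `geocoder.ip("me")` resolves the node's own public IP, and `city`, `state`, and `country` are geocoder's result fields (mapping `state` to the region label is an assumption):

```python
# Geolocation sketch; falls back to the documented "unknown" defaults.

def node_geolocation():
    keys = (
        "hyperai.eu/node-geolocation-city",
        "hyperai.eu/node-geolocation-region",
        "hyperai.eu/node-geolocation-country",
    )
    try:
        import geocoder
        g = geocoder.ip("me")            # look up the node's public IP
        values = (g.city, g.state, g.country)
        return {k: (v or "unknown") for k, v in zip(keys, values)}
    except Exception:
        return {k: "unknown" for k in keys}
```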
get_monetary_cost_annotation
To infer the monetary cost category of a node (e.g., very low, low, medium, high, very high), we adopt a K-Means clustering approach trained on publicly available cloud instance pricing data.
Data Collection and Preprocessing
We collect on-demand instance data from AWS, available at: https://aws.amazon.com/ec2/pricing/on-demand/
This dataset includes the following attributes:
vCPU count
Memory (in GiB)
On-Demand Hourly Rate (in USD)
The data is cleaned to strip units (e.g., “GiB”, “$”) and converted into numeric format. These three numerical attributes are then normalized using the StandardScaler from the scikit-learn library.
Clustering
We use K-Means clustering with k = 5 to group instances into cost-based clusters.
The feature vector x used during training is defined as:
x = [vCPU, Memory, Price]
The output of clustering is a set of k cluster centroids:
μ₁, μ₂, ..., μₖ ∈ ℝ³
Each cluster is ranked by average price and assigned a qualitative label:
very low, low, medium, high, very high
Cost Category Inference
At runtime, we infer the monetary cost category of a node based on its CPU and memory specifications:
x_inference = [vCPU, Memory, 0.0]
Key Insight: Although price was used during clustering to shape the cost groupings, it is excluded during inference. This is essential because the actual price of the current node is unknown.
We compute the Euclidean distance between the normalized input vector and the stored cluster centroids:
ŷ = argmin_i ||(x_inference - μᵢ) / σ||
Here, μᵢ is the i th centroid (trained with price included), and σ is the standard deviation vector used for normalization.
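The inference step above can be sketched with NumPy. The centroids, mean, and σ below would come from the trained K-Means model; the toy values in the usage example are illustrative only, with clusters already ordered by average price:

```python
# Monetary-cost inference sketch: nearest centroid with price zeroed out.
import numpy as np

COST_LABELS = ["very low", "low", "medium", "high", "very high"]

def infer_cost_category(vcpu, memory_gib, centroids, mean, sigma):
    """centroids: (5, 3) array (price-ordered); mean/sigma: (3,) arrays."""
    x = np.array([vcpu, memory_gib, 0.0])        # price unknown at inference
    x_norm = (x - mean) / sigma
    distances = np.linalg.norm(centroids - x_norm, axis=1)
    return COST_LABELS[int(np.argmin(distances))]
```

With identity normalization and toy centroids at increasing sizes, a small node lands in "very low" and a large one near the fourth centroid in "high".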
Kubernetes type: Annotation
Dependencies:
numpy : Used to load the K-Means model and perform distance calculations.
Parameters:
None
Example Output:
hyperai.eu/node-monetary-cost-category: very low | low | medium | high | very high
Error Handling:
If the node's CPU or memory specifications are not available, the function returns default values:
hyperai.eu/node-monetary-cost-category: "unknown"
energy_efficiency_annotation
The energy_efficiency_annotation provides a label for each Kubernetes node indicating its energy efficiency level, derived from runtime performance and CPU characteristics. This is critical for scheduling policies that prioritize energy-aware deployment of workloads across heterogeneous nodes.
The annotation is computed based on estimated FLOPs per second from a dot product workload and the node’s TDP (Thermal Design Power). TDP is fetched from a local CPU database (Intel and AMD), and if unavailable, a trained regression model is used as a fallback.
Annotation Process :
Get CPU Model: Extracts CPU string from platform.uname().
Lookup TDP: Matches model in intel_cpus.csv or amd_cpus.csv. If not found, continues to fallback.
Fallback Regression:
Runs lscpu to extract clock speed, logical cores, and L3 cache.
Feeds the data into a trained RandomForestRegressor model (tdp_regressor.pkl) to estimate TDP.
Estimate FLOPs/sec: Executes multiple NumPy dot products on 1 million-length vectors.
Calculate GFLOPs per Joule:
\text{GFLOPs/Joule} = \frac{2 \times 10^6 \times 10}{\text{execution time} \times \text{TDP} \times 10^9}
Assign Label Based on Range:
very low: < 0.5
low: 0.5–1.0
medium: 1.0–2.0
high: 2.0–5.0
very high: > 5.0
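The measurement and labeling steps can be sketched as follows. A dot product of two length-n vectors costs roughly 2n FLOPs, and repeating it 10 times on length-10⁶ vectors gives the 2 × 10⁶ × 10 numerator of the formula above; `tdp_watts` would come from the CPU database or the regression fallback:

```python
# Energy-efficiency sketch: time the synthetic dot-product workload and
# apply the documented GFLOPs/Joule formula and label ranges.
import time
import numpy as np

def gflops_per_joule(tdp_watts, n=1_000_000, reps=10):
    a = np.random.rand(n)
    b = np.random.rand(n)
    start = time.perf_counter()
    for _ in range(reps):
        np.dot(a, b)                     # ~2*n FLOPs per dot product
    elapsed = time.perf_counter() - start
    return (2 * n * reps) / (elapsed * tdp_watts * 1e9)

def efficiency_label(value):
    if value < 0.5: return "very low"
    if value < 1.0: return "low"
    if value < 2.0: return "medium"
    if value < 5.0: return "high"
    return "very high"
```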
Model Training Note:
A RandomForestRegressor was trained using combined Intel and AMD datasets. The model was trained on the following features:
Clock (GHz)
Logical Cores
L3 Cache (MB)
Stored in tdp_regressor.pkl. Missing CPU models are inferred using this model.
References:
Intel ARK: https://ark.intel.com
FLOPs Estimation Logic: https://numpy.org/doc/stable/reference/generated/numpy.dot.html
Kubernetes type: Annotation
Dependencies:
platform : to extract CPU model
pandas : to load CPU datasets
numpy : for synthetic FLOPs calculation
time : to measure execution duration
sklearn : to load regression model
subprocess : to call lscpu for fallback TDP prediction
Parameters:
None
Example Output:
hyperai.eu/node-energy-efficiency: very low | low | medium | high | very high
Error Handling:
If the TDP cannot be inferred from both the static dataset and the regression model, the fallback result is:
hyperai.eu/node-energy-efficiency: "unknown"
This typically occurs when the CPU model cannot be parsed or lscpu fails to return required features. Logs are recorded for such failures with logging.error.
flops_per_sec
The flops_per_sec annotation reports the node's estimated floating-point performance (FLOPs/sec), measured with a synthetic dot-product workload. Combined with the CPU's Thermal Design Power (TDP), which is either retrieved from static datasets or predicted using a regression model when not found, this measurement also underpins the node's energy-efficiency grade.
Kubernetes type: Annotation
Dependencies:
platform : to extract CPU model
pandas : to load CPU datasets
numpy : for synthetic FLOPs calculation
time : to measure execution duration
sklearn : to load regression model
subprocess : to call lscpu for fallback TDP prediction
Parameters:
None
Example Output:
hyperai.eu/node-flops-per-sec: 11545015139.0
Error Handling:
If the TDP cannot be inferred from both the static dataset and the regression model, the fallback result is:
hyperai.eu/node-flops-per-sec: "unknown"
This typically occurs when the CPU model cannot be parsed or lscpu fails to return required features. Logs are recorded for such failures with logging.error.