**License: Pro** - Requires a Pro or Enterprise license.
Overview#
The Redfish probe monitors server and storage hardware through the DMTF Redfish API, providing comprehensive health, capacity, and performance metrics. It supports a wide range of hardware platforms including Dell iDRAC, Dell PowerVault ME series, HPE iLO, Lenovo XClarity, and Cisco UCS.
The probe automatically detects the hardware vendor via the Redfish API and adapts its metric collection accordingly. No vendor-specific configuration is required.
Collected Data:
- Controller, drive, pool, and volume health status
- Storage capacity (total, allocated, used, free) for pools and volumes
- I/O performance metrics (reads, writes, latency, throughput)
- Hardware event logs (critical, warning, informational entries)
- Drive failure predictions and hotspare status
- Encryption status for volumes and drives
Supported Hardware:
- Dell iDRAC (PowerEdge servers)
- Dell PowerVault ME series (ME5024, etc.)
- HPE iLO (ProLiant servers)
- Lenovo XClarity (ThinkSystem servers)
- Cisco UCS (Unified Computing System)
- Any hardware implementing the DMTF Redfish standard
Quick Start#
Basic Configuration#
probes:
- name: "Hardware Server01"
type: redfish
params:
base_url: "https://idrac-server01.company.com"
username: "monitoring"
password: "SecurePassword123"
interval: 300
tls:
verify_ssl: falseImportant notes:
base_url: The Redfish API endpoint (iDRAC, iLO, or BMC management address)interval: 300 seconds (5 minutes) is recommended for hardware monitoringverify_ssl: Set tofalsefor self-signed certificates commonly used on BMC interfaces
Multiple Servers#
Monitor multiple hardware targets with separate probe instances:
probes:
- name: "Dell Storage ME5024"
type: redfish
params:
base_url: "https://dell-me5024.company.com"
username: "admin"
password: "StoragePassword"
interval: 300
tls:
verify_ssl: false
- name: "HPE ProLiant DL380"
type: redfish
params:
base_url: "https://ilo-dl380.company.com"
username: "monitoring"
password: "ServerPassword"
interval: 300
tls:
verify_ssl: falseMetrics Collected#
Health Metrics#
Monitor the health status of hardware components. Health values use a standard scale: 0=OK, 1=Warning, 2=Critical, 3=Unknown.
Storage Controllers#
| Metric Name | Description | Type |
|---|---|---|
hardware.storage.controller.health | Controller health status | Gauge |
hardware.storage.redundancy.health | Controller redundancy health | Gauge |
hardware.storage.redundancy.controllers_active | Number of active controllers | Gauge |
hardware.storage.redundancy.controllers_min | Minimum controllers required | Gauge |
hardware.storage.redundancy.controllers_max | Maximum controllers supported | Gauge |
Drives#
| Metric Name | Description | Type |
|---|---|---|
hardware.storage.drive.health | Drive health status | Gauge |
hardware.storage.drive.failure_predicted | Failure prediction (1=failure predicted) | Gauge |
hardware.storage.drive.hotspare | Hotspare status (1=active hotspare) | Gauge |
Storage Pools#
| Metric Name | Description | Type |
|---|---|---|
hardware.storage.pool.health | Pool health status | Gauge |
Volumes#
| Metric Name | Description | Type |
|---|---|---|
hardware.storage.volume.health | Volume health status | Gauge |
hardware.storage.volume.encrypted | Encryption status (1=encrypted) | Gauge |
Events and Logs#
| Metric Name | Description | Type |
|---|---|---|
hardware.logs.entries.total | Total log entries | Gauge |
hardware.logs.entries.critical | Critical log entries | Gauge |
hardware.logs.entries.warning | Warning log entries | Gauge |
hardware.logs.entries.info | Informational log entries | Gauge |
hardware.logs.entries.last_24h | Events in the last 24 hours | Gauge |
hardware.logs.entries.last_7d | Events in the last 7 days | Gauge |
hardware.eventservice.health | Event service health status | Gauge |
hardware.eventservice.subscriptions | Number of event subscriptions | Gauge |
Capacity Metrics#
Storage Pools#
| Metric Name | Description | Unit |
|---|---|---|
hardware.storage.pool.capacity.total | Total pool capacity | bytes |
hardware.storage.pool.capacity.allocated | Allocated space in pool | bytes |
hardware.storage.pool.capacity.allocated_percent | Allocated space percentage | % |
hardware.storage.pool.capacity.used | Actually consumed space | bytes |
hardware.storage.pool.capacity.used_percent | Consumed space percentage | % |
hardware.storage.pool.capacity.free | Free space | bytes |
hardware.storage.pool.capacity.free_percent | Free space percentage | % |
hardware.storage.pool.capacity.volumes | Space allocated to volumes | bytes |
hardware.storage.pool.capacity.snapshots | Space allocated to snapshots | bytes |
hardware.storage.pool.capacity.committed | Total committed space | bytes |
hardware.storage.pool.capacity.overcommit | Over-allocated space (thin provisioning) | bytes |
Volumes#
| Metric Name | Description | Unit |
|---|---|---|
hardware.storage.volume.capacity.total | Total volume capacity | bytes |
hardware.storage.volume.capacity.allocated | Allocated space | bytes |
hardware.storage.volume.capacity.allocated_percent | Allocated space percentage | % |
hardware.storage.volume.capacity.used | Actually used space | bytes |
hardware.storage.volume.capacity.used_percent | Used space percentage | % |
hardware.storage.volume.capacity.free | Free space | bytes |
hardware.storage.volume.capacity.free_percent | Free space percentage | % |
Drives#
| Metric Name | Description | Unit |
|---|---|---|
hardware.storage.drive.capacity.total | Total drive capacity | bytes |
Performance Metrics#
Volume I/O#
| Metric Name | Description | Unit |
|---|---|---|
hardware.storage.volume.io.total_ops | Total I/O operations | count |
hardware.storage.volume.io.reads | Read operations | count |
hardware.storage.volume.io.writes | Write operations | count |
hardware.storage.volume.io.total_bytes | Total data transferred | bytes |
hardware.storage.volume.io.read.bytes | Data read | bytes |
hardware.storage.volume.io.write.bytes | Data written | bytes |
hardware.storage.volume.io.read.latency | Read latency | ms |
hardware.storage.volume.io.write.latency | Write latency | ms |
Pool I/O#
| Metric Name | Description | Unit |
|---|---|---|
hardware.storage.pool.io.reads | Read operations | count |
hardware.storage.pool.io.writes | Write operations | count |
hardware.storage.pool.io.read.bytes | Data read | bytes |
hardware.storage.pool.io.write.bytes | Data written | bytes |
Operations Metrics#
| Metric Name | Description | Type |
|---|---|---|
hardware.storage.drive.has_operations | Operations in progress (1=yes) | Gauge |
hardware.storage.drive.operation.progress | Operation progress | % |
Tags#
All metrics include contextual tags for filtering and grouping.
Controller Tags#
| Tag | Description | Example |
|---|---|---|
controller_id | Controller identifier | A |
controller_name | Controller name | Controller A |
controller | Controller letter | A, B |
controller_type | Controller type | storage |
host | Host system name | me5024-prod |
manufacturer | Controller manufacturer | Dell |
model | Controller model | PERC H740P |
serial_number | Controller serial number | ABC123 |
Drive Tags#
| Tag | Description | Example |
|---|---|---|
drive_id | Drive identifier | Disk.Bay.0:Enclosure.Internal.0-1 |
drive_name | Drive name | Disk 0 |
model | Drive model | ST1200MM0009 |
drive_manufacturer | Drive manufacturer | Seagate |
serial_number | Drive serial number | WFK12345 |
media_type | Media type | SSD, HDD |
protocol | Communication protocol | SAS, SATA |
hotspare_type | Hotspare type | Global, Dedicated |
encryption_ability | Encryption capability | SelfEncryptingDrive |
encryption_status | Encryption status | Unlocked |
service_label | Service label | Bay 0 |
location_type | Location type | Slot |
location_ordinal | Location ordinal value | 0 |
operation_name | Current operation name | Rebuild |
Pool Tags#
| Tag | Description | Example |
|---|---|---|
pool_id | Pool identifier | A |
pool_name | Pool name | Pool A |
description | Pool description | Virtual storage pool |
supported_raid_types | Supported RAID types | RAID1, RAID5, RAID6 |
max_block_size_bytes | Maximum block size | 512 |
thin_provisioned | Thin provisioning indicator | true |
Volume Tags#
| Tag | Description | Example |
|---|---|---|
volume_id | Volume identifier | VD1 |
volume_name | Volume name | Production-Vol1 |
pool_id | Associated pool identifier | A |
raid_type | RAID type | RAID5 |
write_cache_policy | Write cache policy | WriteBack |
block_size_bytes | Block size | 512 |
access_capabilities | Access capabilities | Read, Write |
encryption_type | Encryption type | NativeDriveEncryption |
Event and Log Tags#
| Tag | Description | Example |
|---|---|---|
host | Host system name | me5024-prod |
manager_id | Manager identifier | BMC |
manager_name | Manager name | iDRAC |
model | Manager model | iDRAC9 |
log_service_id | Log service identifier | Sel |
log_service_name | Log service name | System Event Log |
Recommended Alerting#
Essential Health Alerts#
- Monitor
hardware.storage.controller.healthfor controller failures - Monitor
hardware.storage.redundancy.healthfor redundancy issues - Monitor
hardware.storage.drive.failure_predictedfor drives with predicted failures - Monitor
hardware.storage.drive.has_operationsfor ongoing maintenance operations - Monitor
hardware.logs.entries.criticalfor critical system events
Capacity Alerts#
- Monitor
hardware.storage.pool.capacity.free_percentfor available space - Monitor
hardware.storage.volume.capacity.used_percentfor volume utilization
Performance Alerts#
- Monitor
hardware.storage.volume.io.total_opsfor general I/O activity - Monitor
hardware.storage.volume.io.read.latencyandhardware.storage.volume.io.write.latencyfor performance issues
Event Monitoring#
- Monitor
hardware.logs.entries.criticalandhardware.logs.entries.warningfor system issues - Use
hardware.logs.entries.last_24hto track recent system activity - Compare trends between
hardware.logs.entries.last_24handhardware.logs.entries.last_7dto identify event spikes - Use
hardware.eventservice.healthto verify the event service is operating correctly
Troubleshooting#
Connection Issues#
Symptom: Cannot connect to the Redfish endpoint
Diagnosis:
- Verify the BMC/iDRAC/iLO management interface is reachable from the agent host:
curl -k https://idrac-server01.company.com/redfish/v1/ - Verify the management interface is powered on and network-connected
- Check for firewall rules blocking HTTPS (port 443) between the agent and the BMC
TLS/SSL Errors#
Symptom: TLS handshake failures or certificate errors
Resolution:
BMC management interfaces typically use self-signed certificates. Set verify_ssl: false in the probe configuration:
params:
tls:
verify_ssl: falseIf your environment uses properly signed certificates, ensure the CA chain is trusted by the system running the agent.
Authentication Failures#
Symptom: 401 Unauthorized or login failures
Diagnosis:
- Verify credentials by logging in to the BMC web interface manually
- Check that the account is not locked out due to failed login attempts
- Verify the account has sufficient privileges for Redfish API access (read-only access is sufficient)
- Some BMCs limit concurrent sessions – ensure the session limit is not reached
Dell ME Capacity Shows Zero#
Symptom: Pool or volume capacity metrics return 0
Explanation: Dell PowerVault ME series systems may return CapacityBytes=0 in the standard Redfish response. The agent automatically detects this and uses Capacity.Data.AllocatedBytes as the effective capacity. Ensure you are running a recent version of the agent for this workaround to be active.
Debug Logging#
Enable debug logging for the Redfish probe:
# Runtime log level change
curl -X POST http://localhost:8080/api/{key}/debug/logs \
-H "Content-Type: application/json" \
-d '{"module_levels": [{"module": "probe.redfish", "level": "debug"}]}'
# Or start agent with verbose logging
./senhub-agent run --authentication-key KEY --verbose --debug-modules probe.redfishLicense Requirements#
The Redfish probe requires a Pro or Enterprise license.
| Tier | Redfish Probe |
|---|---|
| Free | Not available |
| Pro | Included |
| Enterprise | Included |