| 2026-06-09 |
skydick/monitoring: drop InfluxDB zpool_io, add Prometheus pool-member map
The Grafana dashboards (skynet-web on door1) query Prometheus, and nothing
queries skydick's InfluxDB, so the zpool_io telegraf collector fed nothing.
Replace it with a node-exporter textfile metric,
skyw_zpool_member{pool,device}, refreshed from live `zpool status` on each
run. Grafana joins it onto node_disk_*_completed_total to get honest
*physical* per-pool IOPS, which stays correct across kernel device renames
(unlike a hardcoded device regex).
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
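A minimal sketch of how such a member map could be generated, assuming the usual `zpool status` layout (a `pool:` line, then a `NAME` table whose rows include the pool itself, grouping vdevs such as `mirror-0`, and leaf devices). The function names, the constant sample value 1, and the grouping-row regex are illustrative assumptions, not the collector's actual code. A real script may also need to translate by-id names to kernel names before the join.

```python
import re

# Rows in the `zpool status` config table that group devices rather than
# naming a physical member (assumed list, not exhaustive).
_GROUP_ROW = re.compile(
    r"^(mirror|raidz\d*|spare|replacing)-\d+$|^(logs|cache|spares|special|dedup)$"
)


def pool_members(status_text):
    """Return (pool, device) pairs for leaf devices in `zpool status` output."""
    members = []
    pool = None
    in_config = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("pool:"):
            pool = stripped.split(":", 1)[1].strip()
            in_config = False
        elif stripped.startswith("NAME") and pool:
            in_config = True  # header row of the config table
        elif stripped.startswith("errors:"):
            in_config = False  # config table has ended
        elif in_config and stripped:
            name = stripped.split()[0]
            if name != pool and not _GROUP_ROW.match(name):
                members.append((pool, name))
    return members


def render_textfile(members):
    """Render the pairs as Prometheus text-format lines (hypothetical value 1)."""
    return "".join(
        f'skyw_zpool_member{{pool="{p}",device="{d}"}} 1\n' for p, d in members
    )
```

Because the map is rebuilt from live output on every run, a renamed device simply shows up under its new name in the next sample, which is the property the commit message relies on.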
|
skydick/monitoring: add zpool_io physical pool IOPS collector
Telegraf inputs.zfs poolMetrics emits per-objset *logical* read/write
counters that include ARC cache hits, so RAM-served reads appear as huge
pool IOPS with zero disk activity (misled a 2026-05-30 mountd probe). Add
a zpool-iostat exec collector emitting real vdev-layer ops as measurement
zpool_io for an honest physical-IOPS view.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
|
| 2026-05-09 |
monitoring: smart_sas_info{vendor,product,revision,serial} for alert enrichment
ldx committed on 9 May
|
monitoring: SAS SMART + ZFS pool textfile collectors for skydick
Closes the parity gap with door1 telegraf. node-exporter parses only the
SATA/NVMe attribute tables, not SAS-specific smartctl output (predictive
failure: grown defects, non-medium errors, pending defects, ECC totals),
and the zfs collector exposes ARC and pool I/O but not the pool health enum.
Adds skyw-textfile-collectors.service + .timer (5min cadence) that emit:
  smart_sas_power_on_hours{device}
  smart_sas_grown_defects{device}
  smart_sas_non_medium_errors{device}
  smart_sas_pending_defects{device}
  smart_sas_read_uncorrected{device}
  smart_sas_write_uncorrected{device}
  zpool_health{pool,state} 0=ONLINE 1=DEGRADED 2=FAULTED ...
Files are chmod 0644 so the node-exporter user can read them via the
textfile collector.
(Findings: sdd and sde on skydick already at 445 grown defects each.)
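The health enum above could be produced along these lines. This is a sketch that assumes the input comes from `zpool list -H -o name,health` (tab-separated, no header). Only the three codes named in the message are mapped. The `UNMAPPED` placeholder and the function name are hypothetical, since the remaining codes are elided in the source.

```python
# Codes stated in the commit message; further states are elided there,
# so they are deliberately not guessed here.
HEALTH_CODES = {"ONLINE": 0, "DEGRADED": 1, "FAULTED": 2}
UNMAPPED = -1  # hypothetical placeholder for states the message does not list


def zpool_health_lines(list_output):
    """Turn `name<TAB>health` rows into zpool_health{pool,state} samples."""
    lines = []
    for row in list_output.splitlines():
        if not row.strip():
            continue
        pool, state = (field.strip() for field in row.split("\t")[:2])
        code = HEALTH_CODES.get(state, UNMAPPED)
        lines.append(f'zpool_health{{pool="{pool}",state="{state}"}} {code}')
    return lines
```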
ldx committed on 9 May
|
| 2026-05-08 |
monitoring: add nodeExporter option, enable on skydick
Replaces telegraf-as-only-monitoring with a declarative node-exporter that
the skyw-gw Prometheus scrapes directly. Telegraf->InfluxDB(door1) keeps
running until door1 retirement so the legacy skydick.json grafana
dashboard does not go dark mid-migration.
ldx committed on 8 May
|
| 2026-04-07 |
sas-smart: reduce exec interval from 30m to 5m
With round_interval=true and a 30m interval, the next gather waits for the
next 30m wall-clock boundary, which can leave a gap of up to 30 min after a
restart. 5m gives near-real-time visibility into defect counts, which
matters during resilver operations, when new defects might appear.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
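The gap arithmetic can be checked with a small sketch. It models only the wall-clock alignment described above. The function name is hypothetical, and treating a restart exactly on a boundary as a full-interval wait is an assumption of this model, not documented Telegraf behavior.

```python
def seconds_until_next_gather(now_s, interval_s):
    """Seconds from `now_s` (seconds since midnight) to the next aligned boundary."""
    return interval_s - (now_s % interval_s)


restart = 12 * 3600 + 60  # a restart at 12:01:00
wait_30m = seconds_until_next_gather(restart, 30 * 60)  # 1740 s, i.e. 29 min
wait_5m = seconds_until_next_gather(restart, 5 * 60)    # 240 s, i.e. 4 min
```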
|
sas-smart: use /run/wrappers/bin/sudo instead of Nix store sudo
The Nix store sudo binary lacks the setuid bit (Nix store is not
setuid-capable), so calling it as the telegraf user fails silently
with "must be owned by uid 0 and have the setuid bit set". This
caused the sas-smart exec to emit nothing and smart_sas data never
refreshed after the initial manual write.
Switch to the NixOS security wrapper at /run/wrappers/bin/sudo
which is the proper setuid wrapper.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
smart plugin: nocheck=never so spun-down drives still report
Telegraf's inputs.smart uses smartctl -n standby by default, which
returns exit(2) for drives in low-power mode and Telegraf records no
data for them. On skydick this caused sdd/sde (drive1, ZKL05VPS...FMAC)
to be silently missing from smart_device metrics — the exact drive
that accumulated 63 grown defects and had sg_format failures during
initial setup.
Setting nocheck=never forces smartctl to wake spun-down drives. In a
ZFS pool with active mirrors, drives shouldn't be spinning down
anyway, so the 30-min wakeup overhead is negligible.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
Fix SAS SMART parsing for pending defects and alt power-on format
- pending_defects: was matching the word "Pending" instead of the number
  in "Pending defect count:0 Pending Defects"; use sed to extract the
  digits between the colon and the space
- power_on_hours: some SAS drives report "number of hours powered up"
  instead of "Accumulated power on time"; try both formats
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
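The two parsing fixes can be illustrated with regular expressions. This sketch uses Python's `re` rather than the sed approach the commit describes. The exact line layouts (`hours:minutes H:M` and `= H.hh`) are assumptions about typical `smartctl -a` output on SAS drives, and the function names are hypothetical.

```python
import re


def pending_defects(smartctl_text):
    """Extract the number after 'Pending defect count:', not the word 'Pending'."""
    m = re.search(r"Pending defect count:\s*(\d+)", smartctl_text)
    return int(m.group(1)) if m else None


def power_on_hours(smartctl_text):
    """Try the 'Accumulated power on time' format, then 'number of hours powered up'."""
    m = re.search(
        r"Accumulated power on time, hours:minutes\s+(\d+):\d+", smartctl_text
    )
    if m:
        return int(m.group(1))
    m = re.search(r"number of hours powered up\s*=\s*(\d+)", smartctl_text)
    return int(m.group(1)) if m else None
```

Returning `None` when neither format matches lets the caller skip the field instead of emitting a misleading zero.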
|
Add SAS SMART collector for skydick predictive failure metrics
Telegraf's inputs.smart parses the SATA/NVMe attribute table but ignores
the SAS-specific sections of `smartctl -a` output. The 18 SAS HDDs on
skydick were therefore reporting only health/temp, with no visibility
into power-on hours, grown defects, non-medium errors, pending defects,
or read/write uncorrected errors.
New sasSmartScript walks /dev/sd?, filters to SAS drives by transport
protocol, and emits a smart_sas line per device with the predictive
failure fields. Wired into telegraf via inputs.exec at 30m interval.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
| 2026-03-29 |
Add InfluxDB v2 on skydick for fleet monitoring
- New modules/influxdb.nix: declarative InfluxDB v2 with ZFS-backed
  storage (dick/system/influxdb, bind-mounted to /var/lib/influxdb2)
- monitoring.nix: make influxUrl configurable (default: skydick)
- skydick/default.nix: enable influxdb, point telegraf to localhost
- datapool.nix: document influxdb dataset in hierarchy + creation cmds
Consolidates all monitoring data (door1 + skydick + IoT sensors) into
a single InfluxDB on the ZFS storage server for infinite retention.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
| 2026-03-15 |
monitoring: increase SMART polling frequency to 30m
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
| 2026-03-14 |
monitoring: add sudo to Telegraf PATH for SMART collection
Telegraf's SMART plugin with use_sudo=true needs sudo in PATH.
On NixOS, sudo lives at /run/wrappers/bin/ which wasn't included.
This caused all SMART queries to fail with exit_status=1.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
monitoring: auto-discover SMART devices instead of hardcoding
Remove smartDevices option and per-host device lists. Telegraf will
now scan all block devices automatically, so disks can be added or
removed without config changes.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
monitoring: add ZFS pool health exec input
Custom script reports zpool health as numeric metric (0=ONLINE,
1=DEGRADED, 2=FAULTED, etc.) via Telegraf inputs.exec, enabling
Grafana alerting on pool degradation.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
monitoring: fix InfluxDB URL and add nvme-cli to Telegraf PATH
Use door1's LAN IP (10.0.91.30) instead of WireGuard IP (172.16.1.1)
for InfluxDB endpoint. Add nvme-cli to Telegraf's PATH for NVMe SMART
attribute collection.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
monitoring: add lm_sensors and smartmontools to Telegraf PATH
Telegraf inputs.sensors needs the `sensors` binary in PATH.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
skydick: add Telegraf monitoring with SMART, ZFS, and system metrics
Sends metrics to door1 InfluxDB (bucket: skydick) via Telegraf.
Monitors all 5 Mach2 SAS drives, NVMe P4500, and boot SSD via SMART.
InfluxDB token encrypted with agenix.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|