| 2026-04-07 |
sas-smart: reduce exec interval from 30m to 5m
...
With round_interval=true and 30m, the next gather happens at the next
30m wall-clock boundary, which can mean up to 30 min of gaps after a
restart. 5m gives near-real-time visibility into defect counts —
relevant during resilver operations where new defects might appear.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
sas-smart: use /run/wrappers/bin/sudo instead of Nix store sudo
...
The Nix store sudo binary lacks the setuid bit (Nix store is not
setuid-capable), so calling it as the telegraf user fails silently
with "must be owned by uid 0 and have the setuid bit set". This
caused the sas-smart exec to emit nothing and smart_sas data never
refreshed after the initial manual write.
Switch to the NixOS security wrapper at /run/wrappers/bin/sudo
which is the proper setuid wrapper.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
smart plugin: nocheck=never so spun-down drives still report
...
Telegraf's inputs.smart uses smartctl -n standby by default, which
returns exit(2) for drives in low-power mode and Telegraf records no
data for them. On skydick this caused sdd/sde (drive1, ZKL05VPS...FMAC)
to be silently missing from smart_device metrics — the exact drive
that accumulated 63 grown defects and had sg_format failures during
initial setup.
Setting nocheck=never forces smartctl to wake spun-down drives. In a
ZFS pool with active mirrors, drives shouldn't be spinning down
anyway, so the 30-min wakeup overhead is negligible.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
Fix SAS SMART parsing for pending defects and alt power-on format
...
- pending_defects: was matching word "Pending" instead of the number
in "Pending defect count:0 Pending Defects" — use sed to extract
digits between colon and space
- power_on_hours: some SAS drives report "number of hours powered up"
instead of "Accumulated power on time" — try both formats
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
Add SAS SMART collector for skydick predictive failure metrics
...
Telegraf's inputs.smart parses the SATA/NVMe attribute table but ignores
the SAS-specific sections of `smartctl -a` output. The 18 SAS HDDs on
skydick were therefore reporting only health/temp, with no visibility
into power-on hours, grown defects, non-medium errors, pending defects,
or read/write uncorrected errors.
New sasSmartScript walks /dev/sd?, filters to SAS drives by transport
protocol, and emits a smart_sas line per device with the predictive
failure fields. Wired into telegraf via inputs.exec at 30m interval.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
| 2026-03-29 |
Add InfluxDB v2 on skydick for fleet monitoring
...
- New modules/influxdb.nix: declarative InfluxDB v2 with ZFS-backed
storage (dick/system/influxdb, bind-mounted to /var/lib/influxdb2)
- monitoring.nix: make influxUrl configurable (default: skydick)
- skydick/default.nix: enable influxdb, point telegraf to localhost
- datapool.nix: document influxdb dataset in hierarchy + creation cmds
Consolidates all monitoring data (door1 + skydick + IoT sensors) into
a single InfluxDB on the ZFS storage server for infinite retention.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
| 2026-03-15 |
monitoring: increase SMART polling frequency to 30m
...
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
| 2026-03-14 |
monitoring: add sudo to Telegraf PATH for SMART collection
...
Telegraf's SMART plugin with use_sudo=true needs sudo in PATH.
On NixOS, sudo lives at /run/wrappers/bin/ which wasn't included.
This caused all SMART queries to fail with exit_status=1.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
monitoring: auto-discover SMART devices instead of hardcoding
...
Remove smartDevices option and per-host device lists. Telegraf will
now scan all block devices automatically, so disks can be added or
removed without config changes.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
monitoring: add ZFS pool health exec input
...
Custom script reports zpool health as numeric metric (0=ONLINE,
1=DEGRADED, 2=FAULTED, etc.) via Telegraf inputs.exec, enabling
Grafana alerting on pool degradation.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
monitoring: fix InfluxDB URL and add nvme-cli to Telegraf PATH
...
Use door1's LAN IP (10.0.91.30) instead of WireGuard IP (172.16.1.1)
for InfluxDB endpoint. Add nvme-cli to Telegraf's PATH for NVMe SMART
attribute collection.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
monitoring: add lm_sensors and smartmontools to Telegraf PATH
...
Telegraf inputs.sensors needs the `sensors` binary in PATH.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
skydick: add Telegraf monitoring with SMART, ZFS, and system metrics
...
Sends metrics to door1 InfluxDB (bucket: skydick) via Telegraf.
Monitors all 5 Mach2 SAS drives, NVMe P4500, and boot SSD via SMART.
InfluxDB token encrypted with agenix.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|