Skyworks/skyworks-Nix-infra

History for skyworks-Nix-infra / modules / monitoring.nix

2026-06-09	b45bd98 skydick/monitoring: drop InfluxDB zpool_io, add Prometheus pool-member map ... The Grafana dashboards (skynet-web on door1) query Prometheus, not InfluxDB (nothing queries skydick's InfluxDB), so the zpool_io telegraf collector fed nothing. Replace it with a node-exporter textfile metric skyw_zpool_member{pool, device}, refreshed from live `zpool status` each run. Grafana joins it onto node_disk_*_completed_total to get honest *physical* per-pool IOPS, drift-proof against kernel device renames (unlike a hardcoded device regex). Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]> Dixiao-L committed 4 days ago
2026-06-09	69a0199 skydick/monitoring: add zpool_io physical pool IOPS collector ... Telegraf inputs.zfs poolMetrics emits per-objset logical read/write counters that include ARC cache hits, so RAM-served reads appear as huge pool IOPS with zero disk activity (misled a 2026-05-30 mountd probe). Add a zpool-iostat exec collector emitting real vdev-layer ops as measurement zpool_io for an honest physical-IOPS view. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]> Dixiao-L committed 4 days ago
2026-05-09	a9843f6 monitoring: smart_sas_info{vendor,product,revision,serial} for alert enrichment ldx committed on 9 May
2026-05-09	110eaea monitoring: SAS SMART + ZFS pool textfile collectors for skydick ... Closes the parity gap with door1 telegraf. node-exporter does not parse SAS-specific smartctl output (predictive failure: grown defects, non-medium errors, pending defects, ECC totals), only SATA/NVMe attribute tables. And the zfs collector exposes ARC + pool I/O but not pool health enum. Adds skyw-textfile-collectors.service + .timer (5min cadence) that emits: smart_sas_power_on_hours{device}, smart_sas_grown_defects{device}, smart_sas_non_medium_errors{device}, smart_sas_pending_defects{device}, smart_sas_read_uncorrected{device}, smart_sas_write_uncorrected{device}, zpool_health{pool,state} (0=ONLINE 1=DEGRADED 2=FAULTED ...). Files chmod 0644 so node-exporter user can read them via the textfile collector. (Findings: sdd and sde on skydick already at 445 grown defects each.) ldx committed on 9 May
2026-05-08	b92295c monitoring: add nodeExporter option, enable on skydick ... Replaces telegraf-as-only-monitoring with a declarative node-exporter that the skyw-gw Prometheus scrapes directly. Telegraf->InfluxDB(door1) keeps running until door1 retirement so the legacy skydick.json grafana dashboard does not go dark mid-migration. ldx committed on 8 May
2026-04-07	dd38237 sas-smart: reduce exec interval from 30m to 5m ... With round_interval=true and 30m, the next gather happens at the next 30m wall-clock boundary, which can mean up to 30 min of gaps after a restart. 5m gives near-real-time visibility into defect counts, which is relevant during resilver operations where new defects might appear. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Dixiao-L committed on 7 Apr
	789049f sas-smart: use /run/wrappers/bin/sudo instead of Nix store sudo ... The Nix store sudo binary lacks the setuid bit (Nix store is not setuid-capable), so calling it as the telegraf user fails silently with "must be owned by uid 0 and have the setuid bit set". This caused the sas-smart exec to emit nothing and smart_sas data never refreshed after the initial manual write. Switch to the NixOS security wrapper at /run/wrappers/bin/sudo which is the proper setuid wrapper. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Dixiao-L committed on 7 Apr
	fabe4c4 smart plugin: nocheck=never so spun-down drives still report ... Telegraf's inputs.smart uses smartctl -n standby by default, which returns exit(2) for drives in low-power mode and Telegraf records no data for them. On skydick this caused sdd/sde (drive1, ZKL05VPS...FMAC) to be silently missing from smart_device metrics, the exact drive that accumulated 63 grown defects and had sg_format failures during initial setup. Setting nocheck=never forces smartctl to wake spun-down drives. In a ZFS pool with active mirrors, drives shouldn't be spinning down anyway, so the 30-min wakeup overhead is negligible. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Dixiao-L committed on 7 Apr
	ff4f5fc Fix SAS SMART parsing for pending defects and alt power-on format ... - pending_defects: was matching word "Pending" instead of the number in "Pending defect count:0 Pending Defects"; use sed to extract digits between colon and space - power_on_hours: some SAS drives report "number of hours powered up" instead of "Accumulated power on time"; try both formats Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Dixiao-L committed on 7 Apr
	141fe24 Add SAS SMART collector for skydick predictive failure metrics ... Telegraf's inputs.smart parses the SATA/NVMe attribute table but ignores the SAS-specific sections of `smartctl -a` output. The 18 SAS HDDs on skydick were therefore reporting only health/temp, with no visibility into power-on hours, grown defects, non-medium errors, pending defects, or read/write uncorrected errors. New sasSmartScript walks /dev/sd?, filters to SAS drives by transport protocol, and emits a smart_sas line per device with the predictive failure fields. Wired into telegraf via inputs.exec at 30m interval. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Dixiao-L committed on 7 Apr
2026-03-29	af0e4f9 Add InfluxDB v2 on skydick for fleet monitoring ... - New modules/influxdb.nix: declarative InfluxDB v2 with ZFS-backed storage (dick/system/influxdb, bind-mounted to /var/lib/influxdb2) - monitoring.nix: make influxUrl configurable (default: skydick) - skydick/default.nix: enable influxdb, point telegraf to localhost - datapool.nix: document influxdb dataset in hierarchy + creation cmds Consolidates all monitoring data (door1 + skydick + IoT sensors) into a single InfluxDB on the ZFS storage server for infinite retention. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Dixiao-L committed on 29 Mar
2026-03-15	f68cc9c monitoring: increase SMART polling frequency to 30m ... Co-Authored-By: Claude Opus 4.6 <[email protected]> Dixiao-L committed on 15 Mar
2026-03-14	d3f41c3 monitoring: add sudo to Telegraf PATH for SMART collection ... Telegraf's SMART plugin with use_sudo=true needs sudo in PATH. On NixOS, sudo lives at /run/wrappers/bin/ which wasn't included. This caused all SMART queries to fail with exit_status=1. Co-Authored-By: Claude Opus 4.6 <[email protected]> Dixiao-L committed on 14 Mar
	26d54cc monitoring: auto-discover SMART devices instead of hardcoding ... Remove smartDevices option and per-host device lists. Telegraf will now scan all block devices automatically, so disks can be added or removed without config changes. Co-Authored-By: Claude Opus 4.6 <[email protected]> Dixiao-L committed on 14 Mar
	71dec76 monitoring: add ZFS pool health exec input ... Custom script reports zpool health as numeric metric (0=ONLINE, 1=DEGRADED, 2=FAULTED, etc.) via Telegraf inputs.exec, enabling Grafana alerting on pool degradation. Co-Authored-By: Claude Opus 4.6 <[email protected]> Dixiao-L committed on 14 Mar
	2ca6967 monitoring: fix InfluxDB URL and add nvme-cli to Telegraf PATH ... Use door1's LAN IP (10.0.91.30) instead of WireGuard IP (172.16.1.1) for InfluxDB endpoint. Add nvme-cli to Telegraf's PATH for NVMe SMART attribute collection. Co-Authored-By: Claude Opus 4.6 <[email protected]> Dixiao-L committed on 14 Mar
	dfb0924 monitoring: add lm_sensors and smartmontools to Telegraf PATH ... Telegraf inputs.sensors needs the `sensors` binary in PATH. Co-Authored-By: Claude Opus 4.6 <[email protected]> Dixiao-L committed on 14 Mar
	10fdb5b skydick: add Telegraf monitoring with SMART, ZFS, and system metrics ... Sends metrics to door1 InfluxDB (bucket: skydick) via Telegraf. Monitors all 5 Mach2 SAS drives, NVMe P4500, and boot SSD via SMART. InfluxDB token encrypted with agenix. Co-Authored-By: Claude Opus 4.6 <[email protected]> Dixiao-L committed on 14 Mar
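
Note on ff4f5fc (SAS SMART parsing): the two extraction rules that commit describes can be sketched in Python roughly as below. This is an illustration only, not the repository's sasSmartScript, which per the log uses sed. The function names are hypothetical, and the text assumed around the two power-on phrases (an "hours:minutes" value in one, an "= value" in the other) is an assumption about smartctl output. Only the quoted phrases themselves come from the commit message.

```python
import re


def parse_pending_defects(text):
    """Return the pending defect count, or None if the line is absent.

    Anchors on the colon so the digits are captured, not the word "Pending".
    """
    m = re.search(r"Pending defect count:\s*(\d+)", text)
    return int(m.group(1)) if m else None


def parse_power_on_hours(text):
    """Return whole power-on hours, trying both wordings named in the commit."""
    # Wording 1 (assumed layout): "Accumulated power on time, hours:minutes H:M"
    m = re.search(r"Accumulated power on time, hours:minutes\s+(\d+):\d+", text)
    if m:
        return int(m.group(1))
    # Wording 2 (assumed layout): "number of hours powered up = H.hh"
    m = re.search(r"number of hours powered up\s*=\s*(\d+)", text)
    if m:
        return int(m.group(1))
    return None
```

A caller would feed each function the captured `smartctl -a` text for one device and skip any field that comes back as None, so a drive that reports neither wording produces no bogus zero.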