| 2026-04-07 |
sas-smart: use /run/wrappers/bin/sudo instead of Nix store sudo
...
The Nix store sudo binary lacks the setuid bit (the Nix store cannot
hold setuid binaries), so calling it as the telegraf user fails with
"must be owned by uid 0 and have the setuid bit set". The failure went
unnoticed: the sas-smart exec emitted nothing, and smart_sas data never
refreshed after the initial manual write.
Switch to the NixOS security wrapper at /run/wrappers/bin/sudo,
which is the proper setuid wrapper.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
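A minimal sketch of the check behind this fix, assuming a POSIX shell: prefer the first candidate path that actually carries the setuid bit. The helper names `has_setuid` and `pick_sudo` are hypothetical and are not part of the committed script.

```shell
# Hypothetical helper: true when the path is executable and has the setuid bit.
has_setuid() {
    # -u tests the set-user-ID bit; -x tests that the file is executable
    [ -u "$1" ] && [ -x "$1" ]
}

# Hypothetical helper: print the first candidate that is a setuid executable.
# Returns non-zero when none of the candidates qualifies.
pick_sudo() {
    for candidate in "$@"; do
        if has_setuid "$candidate"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

Under these assumptions, `pick_sudo /run/wrappers/bin/sudo "$(command -v sudo)"` would select the wrapper on NixOS and skip a store path that lacks the bit.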
|
smart plugin: nocheck=never so spun-down drives still report
...
Telegraf's inputs.smart runs smartctl -n standby by default, which
exits with status 2 for drives in low-power mode, so Telegraf records
no data for them. On skydick this caused sdd/sde (drive1, ZKL05VPS...FMAC)
to be silently missing from smart_device metrics. That is the exact
drive that accumulated 63 grown defects and had sg_format failures
during initial setup.
Setting nocheck=never forces smartctl to wake spun-down drives. In a
ZFS pool with active mirrors, drives shouldn't be spinning down
anyway, so the 30-min wakeup overhead is negligible.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
Fix SAS SMART parsing for pending defects and alt power-on format
...
- pending_defects: was matching the word "Pending" instead of the number
in "Pending defect count:0 Pending Defects"; use sed to extract the
digits between the colon and the space
- power_on_hours: some SAS drives report "number of hours powered up"
instead of "Accumulated power on time"; try both formats
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
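The two parsing fixes above can be sketched as follows. This is an illustration, not the committed script: the function names are hypothetical, and the exact layout of the two power-on lines (a comma-separated "hours:minutes" form and an "= <hours>" form) is an assumption about smartctl output rather than something the commit states.

```shell
# Hypothetical helper: extract the digits between "count:" and the next space,
# e.g. from "Pending defect count:0 Pending Defects".
parse_pending_defects() {
    printf '%s\n' "$1" \
        | sed -n 's/.*Pending defect count:\([0-9][0-9]*\) .*/\1/p' \
        | head -n 1
}

# Hypothetical helper: try the "Accumulated power on time" format first,
# then fall back to "number of hours powered up". Prints whole hours only.
parse_power_on_hours() {
    hours=$(printf '%s\n' "$1" \
        | sed -n 's/.*Accumulated power on time[^0-9]*\([0-9][0-9]*\).*/\1/p' \
        | head -n 1)
    if [ -z "$hours" ]; then
        hours=$(printf '%s\n' "$1" \
            | sed -n 's/.*number of hours powered up[^0-9]*\([0-9][0-9]*\).*/\1/p' \
            | head -n 1)
    fi
    # Prints an empty line when neither format is present
    printf '%s\n' "$hours"
}
```

Matching on the digits after the label, rather than on the first word that follows it, is what avoids picking up "Pending" again.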
|
Add SAS SMART collector for skydick predictive failure metrics
...
Telegraf's inputs.smart parses the SATA/NVMe attribute table but ignores
the SAS-specific sections of `smartctl -a` output. The 18 SAS HDDs on
skydick were therefore reporting only health/temp, with no visibility
into power-on hours, grown defects, non-medium errors, pending defects,
or read/write uncorrected errors.
New sasSmartScript walks /dev/sd?, filters to SAS drives by transport
protocol, and emits a smart_sas line per device with the predictive
failure fields. Wired into telegraf via inputs.exec at a 30m interval.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
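A rough sketch of the filter-and-emit step described above, working on the text of one `smartctl -a` report. The function names, the tag and field names (`device`, `grown_defects`, `non_medium_errors`), and the choice to skip a drive when a counter is missing are all assumptions for illustration; the real sasSmartScript may differ.

```shell
# Hypothetical helper: true when the report names SAS as its transport protocol.
is_sas() {
    printf '%s\n' "$1" | grep -q '^Transport protocol:.*SAS'
}

# Hypothetical helper: print the integer following "<label>:" ($1) in report ($2).
field_after_colon() {
    printf '%s\n' "$2" \
        | sed -n "s/^$1:[[:space:]]*\([0-9][0-9]*\).*/\1/p" \
        | head -n 1
}

# Emit one line-protocol record for device $1 from report text $2.
# Non-SAS drives and reports with missing counters produce no output.
emit_smart_sas() {
    is_sas "$2" || return 0
    grown=$(field_after_colon 'Elements in grown defect list' "$2")
    non_medium=$(field_after_colon 'Non-medium error count' "$2")
    if [ -z "$grown" ] || [ -z "$non_medium" ]; then
        return 0
    fi
    printf 'smart_sas,device=%s grown_defects=%si,non_medium_errors=%si\n' \
        "$1" "$grown" "$non_medium"
}
```

A driver loop along the lines of `for dev in /dev/sd?; do emit_smart_sas "${dev##*/}" "$(sudo smartctl -a "$dev")"; done` would then produce one record per SAS device.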
|
| 2026-03-29 |
Add InfluxDB v2 on skydick for fleet monitoring
...
- New modules/influxdb.nix: declarative InfluxDB v2 with ZFS-backed
storage (dick/system/influxdb, bind-mounted to /var/lib/influxdb2)
- monitoring.nix: make influxUrl configurable (default: skydick)
- skydick/default.nix: enable influxdb, point telegraf to localhost
- datapool.nix: document influxdb dataset in hierarchy + creation cmds
Consolidates all monitoring data (door1 + skydick + IoT sensors) into
a single InfluxDB on the ZFS storage server for infinite retention.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
| 2026-03-25 |
add ylw ed25519 key: agenix access, SSH auth, rekey all secrets
...
- Add ylw's ed25519 public key to secrets.nix admins list
- Re-encrypt all .age secrets so ylw can decrypt
- Add ed25519 key to ye-lw21 authorized SSH keys
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
| 2026-03-24 |
harden and fix: nftables input chain, sudo, agenix, ZFS, NAT priority
...
- Add inet input_filter table to xlab-gateway (policy drop on WAN)
- Restrict NOPASSWD sudo to ldx only; ylw uses password sudo via wheel
- Restructure secrets.nix with admins list, prepare for ylw ed25519 key
- Add ye-lw21 to trusted-users in common.nix
- Remove contradictory relatime=on when atime=off on rpool
- Fix NAT postrouting priority: filter → srcnat
- Remove duplicate nixpkgs.hostPlatform from xlab-gateway hardware-configuration
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
| 2026-03-15 |
monitoring: increase SMART polling frequency to 30m
...
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
skydick: switch Samba to ldapsam, rename ylw→ye-lw21, drop legacy datasets
...
- Samba passdb backend changed from tdbsam to ldapsam:ldap://10.0.0.1
- Added samba-ldap-admin-password oneshot to seed LDAP admin cred before smbd
- Pinned storage group to GID 997 to match LDAP posixGroup
- Renamed ylw to ye-lw21 across all hosts (users.nix, skydick, xlab-gateway)
- Removed legacy tmpfiles and NFS exports (share/backup/torrent/vm destroyed)
- Added bootstrap LDIF for sambaDomain, storage group, machines OU
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
| 2026-03-14 |
monitoring: add sudo to Telegraf PATH for SMART collection
...
Telegraf's SMART plugin with use_sudo=true needs sudo in PATH.
On NixOS, sudo lives in /run/wrappers/bin/, which wasn't included.
This caused all SMART queries to fail with exit_status=1.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
monitoring: auto-discover SMART devices instead of hardcoding
...
Remove smartDevices option and per-host device lists. Telegraf will
now scan all block devices automatically, so disks can be added or
removed without config changes.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
common: disable global flake registry fetch
...
channels.nixos.org is unreachable from CN, causing 25s of
retries on every nix-shell/nix run invocation.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
common: add TUNA mirror as primary Nix substituter, add btop
...
cache.nixos.org has ~1.1s latency from CN. TUNA mirror responds
in ~29ms (38x faster). Set connect-timeout=5 and
stalled-download-timeout=15 to fail fast on unreachable mirrors.
Also add btop to skydick monitoring packages.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
monitoring: add ZFS pool health exec input
...
Custom script reports zpool health as numeric metric (0=ONLINE,
1=DEGRADED, 2=FAULTED, etc.) via Telegraf inputs.exec, enabling
Grafana alerting on pool degradation.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
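A minimal sketch of the state-to-number mapping, assuming input in the tab-separated form printed by `zpool list -H -o name,health`. Only the codes 0, 1 and 2 come from the commit message; the fallback code 99, the measurement name `zpool_health`, and the field name `health_code` are assumptions made for this example.

```shell
# Map a zpool health string to a numeric code.
zpool_health_code() {
    case "$1" in
        ONLINE)   echo 0 ;;
        DEGRADED) echo 1 ;;
        FAULTED)  echo 2 ;;
        *)        echo 99 ;;  # assumption: the commit only says "etc." for other states
    esac
}

# Read "name<TAB>health" lines on stdin and print one line-protocol record each.
emit_zpool_health() {
    while IFS="$(printf '\t')" read -r pool health; do
        printf 'zpool_health,pool=%s health_code=%si\n' \
            "$pool" "$(zpool_health_code "$health")"
    done
}
```

Piping `zpool list -H -o name,health` into `emit_zpool_health` would give Telegraf an integer field that a Grafana alert can compare against 0.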
|
monitoring: fix InfluxDB URL and add nvme-cli to Telegraf PATH
...
Use door1's LAN IP (10.0.91.30) instead of WireGuard IP (172.16.1.1)
for InfluxDB endpoint. Add nvme-cli to Telegraf's PATH for NVMe SMART
attribute collection.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
monitoring: add lm_sensors and smartmontools to Telegraf PATH
...
Telegraf inputs.sensors needs the `sensors` binary in PATH.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
skydick: add Telegraf monitoring with SMART, ZFS, and system metrics
...
Sends metrics to door1 InfluxDB (bucket: skydick) via Telegraf.
Monitors all 5 Mach2 SAS drives, NVMe P4500, and boot SSD via SMART.
InfluxDB token encrypted with agenix.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
| 2026-03-11 |
users: equalize ldx and ylw permissions
...
- Add ylw to NOPASSWD sudo rule (matching ldx for deploy-rs)
- Add ldx hashedPassword on xlab-gateway (matching ylw)
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
users: unify ylw as common admin, keep host-specific passwords and groups
...
Move ylw base identity (isNormalUser, wheel, SSH key) to modules/users.nix
alongside ldx. Host configs retain only extra groups and hashedPassword.
Also renames ye-lw21 to ylw on skydick.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
nix: add ldx as trusted-user for deploy-rs unsigned store paths
...
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
| 2026-03-09 |
users: grant ldx full NOPASSWD sudo for deploy-rs
...
deploy-rs runs activate-rs, nix-env, switch-to-configuration, and
confirmation commands through separate non-interactive SSH sessions.
Per-command NOPASSWD rules cannot cover all paths it uses. Full
NOPASSWD is the intended deploy-rs setup.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
add systemctl and reboot to NOPASSWD sudo rules
...
Needed for restarting services (systemd-networkd, nftables) after
deploy when switch-to-configuration doesn't detect unit changes.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
| 2026-03-07 |
Add switch-to-configuration to NOPASSWD sudo rules
...
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
Add skydick SSH key, set xlab-gateway deploy to LAN IP
...
- Authorize ldx@skydick ed25519 key for cross-machine deploy-rs
- Change xlab-gateway deploy hostname to 10.253.254.1 (LAN, reachable
from skydick)
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
deploy-rs: update xlab-gateway hostname, add NOPASSWD sudo
...
- Change xlab-gateway deploy hostname to WAN IP (166.111.98.29)
- Add NOPASSWD sudo rules for deploy-rs activation commands
(nix-env, activate scripts)
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
Initial skyworks infrastructure flake
...
Unified NixOS configuration for skydick (storage server) and
xlab-gateway (lab router). Flat module structure with shared
common/users/ssh modules, agenix secrets, disko, and deploy-rs.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|