Skyworks/skyworks-Nix-infra


2026-05-08	b92295c monitoring: add nodeExporter option, enable on skydick. Replaces the Telegraf-only monitoring setup with a declarative node-exporter that the skyw-gw Prometheus scrapes directly. Telegraf->InfluxDB(door1) keeps running until door1 is retired, so the legacy skydick.json Grafana dashboard does not go dark mid-migration. ldx committed on 8 May
2026-05-06	5090711 xlab-gw: fix MSS clamp (match SYN-ACK too, use rt mtu). The old rule `tcp flags & (syn|ack) == syn` matched only a plain SYN. A SYN-ACK from the server has both SYN and ACK set, so masking with syn|ack and comparing against syn fails for it. As a result, server responses came back unclamped, full-MTU TCP segments overflowed the WG path's effective MTU (1420 inner), and large pages silently stalled: YouTube didn't load, Microsoft pages loaded partially, Google was slow. Browsers retried indefinitely, which looked like "the network is broken" from a user's perspective. Replaced with `& (syn|rst) == syn`, which matches both plain SYN and SYN-ACK (it excludes only RST, which carries no data). Combined with `set rt mtu` instead of the hard-coded 1280, this lets the kernel pick the right MSS per egress interface (wg-to-wgnet → 1380 v4 / 1360 v6) instead of pessimistically clamping everything. The user's commented-out line had the right idea (rt mtu) but the wrong flag mask; both are fixed at once. ldx committed on 6 May
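
The flag-mask reasoning in 5090711 can be checked with a little bit arithmetic. The sketch below is illustrative only and is not taken from the repository: it models a match of the form `tcp flags & mask == value` as a bitwise test using the standard TCP flag bit values, and it reproduces the 1380 / 1360 figures under the assumption of option-free headers (20 bytes IPv4 or 40 bytes IPv6, plus 20 bytes TCP) on a 1420-byte MTU. The function names `old_rule`, `new_rule`, and `clamped_mss` are hypothetical.

```python
# Standard TCP header flag bits.
FIN, SYN, RST, PSH, ACK = 0x01, 0x02, 0x04, 0x08, 0x10


def matches(flags: int, mask: int, value: int) -> bool:
    """Model of a 'tcp flags & mask == value' match."""
    return (flags & mask) == value


def old_rule(flags: int) -> bool:
    # Old mask: SYN must be set and ACK must be clear, so SYN-ACK is rejected.
    return matches(flags, SYN | ACK, SYN)


def new_rule(flags: int) -> bool:
    # New mask: SYN must be set and RST must be clear; ACK is ignored.
    return matches(flags, SYN | RST, SYN)


def clamped_mss(mtu: int, ipv6: bool = False) -> int:
    """MSS for a given MTU, assuming headers without options (an assumption)."""
    ip_header = 40 if ipv6 else 20
    tcp_header = 20
    return mtu - ip_header - tcp_header


if __name__ == "__main__":
    for name, flags in [("SYN", SYN), ("SYN-ACK", SYN | ACK), ("ACK", ACK)]:
        print(name, "old:", old_rule(flags), "new:", new_rule(flags))
    print("MSS v4:", clamped_mss(1420), "MSS v6:", clamped_mss(1420, ipv6=True))
```

Under this model the old mask accepts a plain SYN but rejects SYN-ACK, while the new mask accepts both and still rejects a bare ACK, which matches the behaviour the commit message describes.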
	6a0b5c5 add README (host roles, deploy, DNS gotchas). ldx committed on 6 May
	bd0a31d skydick: also disable RA in systemd-networkd userspace. The accept_ra=0 sysctl only stops the kernel; systemd-networkd does its own RA processing in userspace and was caching the link DNS even after the kernel sysctl was applied. Override the auto-generated 40-bond0.network with networkConfig.IPv6AcceptRA=false. ldx committed on 6 May
	f6e3dbd skydick: suppress IPv6 RA processing on bond0. `networking.enableIPv6 = false` only disables IPv6 forwarding/use; the kernel still accepts router advertisements unless told otherwise. The gateway's radvd was seeding fd99:23eb:1682::1 as a per-link DNS on bond0, which then took precedence in systemd-resolved for AAAA queries, making blocked names fail with 'Connection refused' instead of returning a clean NXDOMAIN through 10.0.0.1's mosdns. Set accept_ra=0 globally and on bond0 explicitly. The existing 'enableIPv6 = false' continues to handle the higher-level disable. ldx committed on 6 May
	bed764c skydick: route DNS via 10.0.0.1 only, AliDNS as fallback. Was: nameservers = [ "10.0.0.1" "223.5.5.5" ]; both were treated as primary by systemd-resolved, which then load-balanced to AliDNS and bypassed mosdns's analytics blocking (resolvectl confirmed hm.baidu.com / google-analytics.com leaking through). Now: 10.0.0.1 is the only primary, and AliDNS is demoted to fallbackDns so it activates only when 10.0.0.1 is unreachable. ldx committed on 6 May
	587be46 xlab-gateway: route DNS via local mosdns at 10.0.0.1. Adds services.resolved with primary DNS 10.0.0.1 (network-local mosdns) and Cloudflare as fallback. Removes the hardcoded DNS=166.111.8.28/29 on the wan99.0 link; those Tsinghua resolvers are subject to GFW poisoning, and per-link DNS overrode the global resolved policy. When 10.0.0.1 is reachable, this host inherits CN-aware split routing and the network analytics-blocking policy. When 10.0.0.1 is down, resolved transparently falls back to Cloudflare so internet access keeps working; queries return to 10.0.0.1 once it responds again. ldx authored on 6 May, Dixiao-L committed on 6 May
2026-04-07	dd38237 sas-smart: reduce exec interval from 30m to 5m. With round_interval=true and a 30m interval, the next gather happens at the next 30m wall-clock boundary, which can leave a gap of up to 30 min after a restart. 5m gives near-real-time visibility into defect counts, which matters during resilver operations where new defects might appear. Co-Authored-By: Claude Opus 4.6 (1M context). Dixiao-L committed on 7 Apr
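
The wall-clock alignment described in dd38237 can be illustrated with a simplified model. The sketch below is an assumption-laden illustration rather than Telegraf's actual scheduler: it treats "rounded" collection as happening at multiples of the interval counted from midnight, and the function name `next_aligned_gather` and the restart time are made up for the example.

```python
from datetime import datetime, timedelta


def next_aligned_gather(now: datetime, interval: timedelta) -> datetime:
    """Next boundary strictly after `now` that is a whole multiple of
    `interval` since midnight (simplified model of interval rounding)."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = now - midnight
    periods = elapsed // interval + 1  # timedelta // timedelta yields an int
    return midnight + periods * interval


if __name__ == "__main__":
    restart = datetime(2026, 4, 7, 10, 1)  # hypothetical restart time
    for minutes in (30, 5):
        nxt = next_aligned_gather(restart, timedelta(minutes=minutes))
        print(f"{minutes}m interval: next gather {nxt:%H:%M}, wait {nxt - restart}")
```

In this model a restart just after a boundary waits almost a full interval for the next gather, which is why shortening the interval shrinks the worst-case gap.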
	789049f sas-smart: use /run/wrappers/bin/sudo instead of Nix store sudo. The Nix store sudo binary lacks the setuid bit (the Nix store is not setuid-capable), so calling it as the telegraf user fails silently with "must be owned by uid 0 and have the setuid bit set". This caused the sas-smart exec to emit nothing, and smart_sas data never refreshed after the initial manual write. Switch to the NixOS security wrapper at /run/wrappers/bin/sudo, which is the proper setuid wrapper. Co-Authored-By: Claude Opus 4.6 (1M context). Dixiao-L committed on 7 Apr
	fabe4c4 smart plugin: nocheck=never so spun-down drives still report. Telegraf's inputs.smart uses smartctl -n standby by default, which returns exit(2) for drives in low-power mode, and Telegraf records no data for them. On skydick this caused sdd/sde (drive1, ZKL05VPS...FMAC) to be silently missing from smart_device metrics, the exact drive that accumulated 63 grown defects and had sg_format failures during initial setup. Setting nocheck=never forces smartctl to wake spun-down drives. In a ZFS pool with active mirrors, drives shouldn't be spinning down anyway, so the 30-min wakeup overhead is negligible. Co-Authored-By: Claude Opus 4.6 (1M context). Dixiao-L committed on 7 Apr
	ff4f5fc Fix SAS SMART parsing for pending defects and alt power-on format. pending_defects: was matching the word "Pending" instead of the number in "Pending defect count:0 Pending Defects"; now uses sed to extract the digits between the colon and the space. power_on_hours: some SAS drives report "number of hours powered up" instead of "Accumulated power on time"; now tries both formats. Co-Authored-By: Claude Opus 4.6 (1M context). Dixiao-L committed on 7 Apr
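
The two parsing fixes in ff4f5fc can be sketched in Python. This is not the repository's script, which the commit says uses sed; it only shows the same idea. The "Pending defect count" line is quoted from the commit message, while the exact layouts of the two power-on lines and all sample values are assumptions for illustration, and the function names are hypothetical.

```python
import re
from typing import Optional


def pending_defects(line: str) -> Optional[int]:
    """Take the digits that follow the colon, not a whitespace-split field."""
    m = re.search(r"Pending defect count:\s*(\d+)", line)
    return int(m.group(1)) if m else None


def power_on_hours(text: str) -> Optional[int]:
    """Try both label phrases named in the commit message.
    The surrounding line layouts are assumed, not taken from the source."""
    m = re.search(r"Accumulated power on time, hours:minutes\s+(\d+):\d+", text)
    if m:
        return int(m.group(1))
    m = re.search(r"number of hours powered up\s*=\s*(\d+)", text)
    if m:
        return int(m.group(1))
    return None


if __name__ == "__main__":
    line = "Pending defect count:0 Pending Defects"
    # One plausible way the old bug arises: the 4th whitespace field is a word.
    print("naive field:", line.split()[3])
    print("pending_defects:", pending_defects(line))
    # Made-up sample values in the two assumed layouts.
    print(power_on_hours("Accumulated power on time, hours:minutes 38521:12"))
    print(power_on_hours("number of hours powered up = 1234.56"))
```

Anchoring on the label and the colon keeps the extraction independent of how many words follow the number on the line.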
	141fe24 Add SAS SMART collector for skydick predictive failure metrics. Telegraf's inputs.smart parses the SATA/NVMe attribute table but ignores the SAS-specific sections of `smartctl -a` output. The 18 SAS HDDs on skydick were therefore reporting only health/temp, with no visibility into power-on hours, grown defects, non-medium errors, pending defects, or read/write uncorrected errors. New sasSmartScript walks /dev/sd?, filters to SAS drives by transport protocol, and emits a smart_sas line per device with the predictive failure fields. Wired into telegraf via inputs.exec at a 30m interval. Co-Authored-By: Claude Opus 4.6 (1M context). Dixiao-L committed on 7 Apr
2026-04-01	f926857 skydick: use async NFS export for media dataset. Media data is re-downloadable torrents, so sync write guarantees are unnecessary. Switching to async bypasses SLOG round-trips and improves write throughput from 358 to 490 MB/s. All other exports remain sync. Co-Authored-By: Claude Opus 4.6 (1M context). Dixiao-L committed on 1 Apr
2026-04-01	dd3841f skydick: add mirrored NVMe special vdev + mirrored SLOG. Replaced single-drive SLOG + L2ARC with a dual-Optane mirrored setup: 690G mirrored special vdev for metadata + files ≤128K; 8G mirrored SLOG for sync writes; special_small_blocks=128K set in the ZFS properties service; nvme1 formatted to 4Kn to match nvme0. The special vdev is the biggest performance win for an HDD pool: all metadata lookups, directory listings, and small files now hit NVMe. Co-Authored-By: Claude Opus 4.6 (1M context). Dixiao-L committed on 1 Apr
2026-03-30	21191a6 Update skydick README with InfluxDB and monitoring docs. Documents the fleet monitoring architecture: InfluxDB on ZFS, Telegraf data sources, Grafana datasource layout, and ZFS dataset management. Co-Authored-By: Claude Opus 4.6 (1M context). Dixiao-L committed on 30 Mar
2026-03-29	6b26b45 Fix influxdb-token encryption (was empty). Re-encrypted with rage directly instead of the agenix EDITOR flow, which silently produced an empty ciphertext. Co-Authored-By: Claude Opus 4.6 (1M context). Dixiao-L committed on 29 Mar
	1c61ec3 Update influxdb-token for skydick InfluxDB instance. Token now authenticates against the local InfluxDB on skydick instead of the old door1 instance. Co-Authored-By: Claude Opus 4.6 (1M context). Dixiao-L committed on 29 Mar
	af0e4f9 Add InfluxDB v2 on skydick for fleet monitoring. New modules/influxdb.nix: declarative InfluxDB v2 with ZFS-backed storage (dick/system/influxdb, bind-mounted to /var/lib/influxdb2); monitoring.nix: make influxUrl configurable (default: skydick); skydick/default.nix: enable influxdb, point telegraf to localhost; datapool.nix: document influxdb dataset in hierarchy + creation cmds. Consolidates all monitoring data (door1 + skydick + IoT sensors) into a single InfluxDB on the ZFS storage server for infinite retention. Co-Authored-By: Claude Opus 4.6 (1M context). Dixiao-L committed on 29 Mar
2026-03-25	6c84552 seems only MTU 1280 works for RDP. physicsdolphin committed on 25 Mar
	c7581bf Fix MSS. physicsdolphin committed on 25 Mar
	61d8971 fix xlab-gateway host key in secrets.nix and rekey. The active host key on xlab-gateway is the original one (AAAAII+EKDpU...), not the replacement. Corrected and rekeyed. Co-Authored-By: Claude Opus 4.6 (1M context). Dixiao-L committed on 25 Mar
	8f61461 add ylw ed25519 key: agenix access, SSH auth, rekey all secrets. Add ylw's ed25519 public key to the secrets.nix admins list; re-encrypt all .age secrets so ylw can decrypt; add the ed25519 key to ye-lw21's authorized SSH keys. Co-Authored-By: Claude Opus 4.6 (1M context). Dixiao-L committed on 25 Mar
2026-03-24	257a2b8 harden and fix: nftables input chain, sudo, agenix, ZFS, NAT priority. Add inet input_filter table to xlab-gateway (policy drop on WAN); restrict NOPASSWD sudo to ldx only, ylw uses password sudo via wheel; restructure secrets.nix with an admins list, prepare for ylw's ed25519 key; add ye-lw21 to trusted-users in common.nix; remove contradictory relatime=on when atime=off on rpool; fix NAT postrouting priority: filter → srcnat; remove duplicate nixpkgs.hostPlatform from xlab-gateway hardware-configuration. Co-Authored-By: Claude Opus 4.6 (1M context). Dixiao-L committed on 24 Mar
2026-03-24	95f22e0 skydick: document drive10 added as second hot spare. sg_format completed on drive10 (c9bcfa0f). Both LUNs added as spares, bringing the pool to 8 mirrors + 2 hot spares (4 spare LUNs total). Co-Authored-By: Claude Opus 4.6 (1M context). Dixiao-L committed on 24 Mar
2026-03-23	88b128c clamp MSS to PMTU, fix v6. physicsdolphin committed on 23 Mar
	9270993 skydick: document pool expansion to 8 mirrors (~50.9T). Added 4 new SAS Mach2 drives (drive6-9) as 4 mirror vdevs. Updated drive inventory, layout diagram, expansion commands, and runbook with sg_format/wipefs steps. drive10 pending sg_format completion. Co-Authored-By: Claude Opus 4.6 (1M context). Dixiao-L committed on 23 Mar
	234e1a5 Update networking.nix. physicsdolphin committed on 23 Mar
	408ef51 Update networking.nix. physicsdolphin committed on 23 Mar
	8ffd49b add route from subnet to phicomm mgmt. physicsdolphin committed on 23 Mar
2026-03-21	c5a1147 Damned BitLocker decryption, upgraded version. physicsdolphin committed on 21 Mar