| 2026-05-08 |
monitoring: add nodeExporter option, enable on skydick
Replaces telegraf-as-only-monitoring with a declarative node-exporter that
the skyw-gw Prometheus scrapes directly. Telegraf->InfluxDB(door1) keeps
running until door1 retirement so the legacy skydick.json grafana
dashboard does not go dark mid-migration.
ldx committed 1 day ago
|
| 2026-05-06 |
xlab-gw: fix MSS clamp — match SYN-ACK too, use rt mtu
The old rule `tcp flags & (syn|ack) == syn` only matched plain SYN.
A SYN-ACK from the server has both SYN and ACK set, so masking with
syn|ack and comparing against syn never matched it. As a result,
server responses came back unclamped, full-MTU TCP segments overflowed
the WG path's effective MTU (1420 inner), and large pages silently
stalled: YouTube didn't load, Microsoft pages partially loaded, and
Google was slow. Browsers retried indefinitely, so from a user's
perspective it looked like "the network is broken".
Replaced with `& (syn|rst) == syn`, which matches any segment that has
SYN set and RST clear, so both plain SYN and SYN-ACK are clamped (RST
segments are excluded). Combined with `set rt mtu` instead of the hard
1280, this lets the kernel pick the right MSS per egress interface
(wg-to-wgnet → 1380 v4 / 1360 v6) instead of pessimistically clamping
everything.
User's commented-out line had the right idea (rt mtu) but wrong
flag mask; fixed both at once.
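The flag-mask bug is easy to check by hand. The following is a minimal Python sketch, not the nftables rule itself: it mirrors the two mask expressions on the standard TCP flag bits and reproduces the MSS figures quoted above. The function names are illustrative, and the MSS arithmetic assumes a 1420-byte MTU with option-free IP and TCP headers.

```python
# TCP flag bits as laid out in the TCP header.
FIN, SYN, RST, PSH, ACK = 0x01, 0x02, 0x04, 0x08, 0x10


def old_rule(flags: int) -> bool:
    """Mirror of `tcp flags & (syn|ack) == syn` (hypothetical helper)."""
    return (flags & (SYN | ACK)) == SYN


def new_rule(flags: int) -> bool:
    """Mirror of `tcp flags & (syn|rst) == syn` (hypothetical helper)."""
    return (flags & (SYN | RST)) == SYN


def mss_for_mtu(mtu: int, ipv6: bool = False) -> int:
    """MSS = MTU - IP header (20 bytes v4, 40 bytes v6) - 20-byte TCP header."""
    return mtu - (40 if ipv6 else 20) - 20


segments = {
    "SYN": SYN,
    "SYN-ACK": SYN | ACK,
    "ACK": ACK,
    "RST": RST,
    "RST-ACK": RST | ACK,
}
for name, flags in segments.items():
    print(f"{name:8} old={old_rule(flags)!s:5} new={new_rule(flags)}")

# Assumed inner MTU of 1420, as stated in the commit message.
print(mss_for_mtu(1420), mss_for_mtu(1420, ipv6=True))  # 1380 1360
```

Only the SYN-ACK row differs between the two rules: the old mask rejects it and the new one accepts it.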
ldx committed 3 days ago
|
add README — host roles + deploy + DNS gotchas
ldx committed 3 days ago
|
skydick: also disable RA in systemd-networkd userspace
Sysctl accept_ra=0 only stops the kernel — systemd-networkd does
its own RA processing in userspace and was caching the link-DNS
even after the kernel sysctl was applied. Override the auto-
generated 40-bond0.network with networkConfig.IPv6AcceptRA=false.
ldx committed 3 days ago
|
skydick: suppress IPv6 RA processing on bond0
`networking.enableIPv6 = false` only disables IPv6 forwarding/use;
the kernel still accepts router advertisements unless told otherwise.
The gateway's radvd was seeding fd99:23eb:1682::1 as a per-link DNS
on bond0, which then took precedence in systemd-resolved for AAAA
queries — making blocked names error as 'Connection refused' instead
of returning a clean NXDOMAIN through 10.0.0.1's mosdns.
Set accept_ra=0 globally + on bond0 explicitly. Existing 'enableIPv6
= false' continues to handle the higher-level disable.
ldx committed 3 days ago
|
skydick: route DNS via 10.0.0.1 only, AliDNS as fallback
Was: nameservers = [ "10.0.0.1" "223.5.5.5" ] — both treated as
primary by systemd-resolved, which then load-balanced to AliDNS
and bypassed mosdns's analytics blocking (resolvectl confirmed
hm.baidu.com / google-analytics.com leaking through).
Now: 10.0.0.1 only as primary, AliDNS demoted to fallbackDns so
it activates only when 10.0.0.1 is unreachable.
ldx committed 3 days ago
|
xlab-gateway: route DNS via local mosdns at 10.0.0.1
Adds services.resolved with primary DNS 10.0.0.1 (network-local mosdns)
and Cloudflare as fallback. Removes the hardcoded DNS=166.111.8.28/29
on the wan99.0 link — those Tsinghua resolvers are subject to GFW
poisoning, and per-link DNS overrode the global resolved policy.
When 10.0.0.1 is reachable, this host inherits CN-aware split routing
and the network analytics-blocking policy. When 10.0.0.1 is down,
resolved transparently falls back to Cloudflare so internet keeps
working; queries return to 10.0.0.1 once it responds again.
ldx authored 3 days ago
Dixiao-L committed 3 days ago
|
| 2026-04-07 |
sas-smart: reduce exec interval from 30m to 5m
With round_interval=true and a 30m interval, the next gather happens at
the next 30m wall-clock boundary, which can leave a gap of up to 30 min
after a restart. 5m gives near-real-time visibility into defect counts,
which matters during resilver operations where new defects might appear.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
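The effect of the interval on the post-restart gap can be sketched with a little date arithmetic. This is a simplified model of wall-clock alignment, not Telegraf's actual scheduler: it assumes gathers fire on multiples of the interval counted from midnight, and `next_gather` is an illustrative name.

```python
from datetime import datetime, timedelta


def next_gather(now: datetime, interval: timedelta) -> datetime:
    """Next aligned boundary strictly after `now` (simplified model)."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # timedelta // timedelta yields an int: completed intervals since midnight.
    ticks = (now - midnight) // interval + 1
    return midnight + ticks * interval


# Hypothetical restart one minute past a 30m boundary.
restart = datetime(2026, 4, 7, 10, 1)
for minutes in (30, 5):
    wait = next_gather(restart, timedelta(minutes=minutes)) - restart
    print(f"{minutes}m interval: first gather after {wait}")
# 30m interval: first gather after 0:29:00
# 5m interval: first gather after 0:04:00
```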
|
sas-smart: use /run/wrappers/bin/sudo instead of Nix store sudo
The Nix store sudo binary lacks the setuid bit (Nix store is not
setuid-capable), so calling it as the telegraf user fails silently
with "must be owned by uid 0 and have the setuid bit set". This
caused the sas-smart exec to emit nothing and smart_sas data never
refreshed after the initial manual write.
Switch to the NixOS security wrapper at /run/wrappers/bin/sudo
which is the proper setuid wrapper.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
smart plugin: nocheck=never so spun-down drives still report
Telegraf's inputs.smart runs smartctl with -n standby by default, which
exits with status 2 for drives in low-power mode, so Telegraf records no
data for them. On skydick this caused sdd/sde (drive1, ZKL05VPS...FMAC)
to be silently missing from smart_device metrics. That is the exact
drive that accumulated 63 grown defects and had sg_format failures
during initial setup.
Setting nocheck=never forces smartctl to wake spun-down drives. In a
ZFS pool with active mirrors, drives shouldn't be spinning down
anyway, so the 30-min wakeup overhead is negligible.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
Fix SAS SMART parsing for pending defects and alt power-on format
- pending_defects: was matching word "Pending" instead of the number
in "Pending defect count:0 Pending Defects" — use sed to extract
digits between colon and space
- power_on_hours: some SAS drives report "number of hours powered up"
instead of "Accumulated power on time" — try both formats
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
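The two parsing fixes can be illustrated outside the collector script. The sketch below uses Python regular expressions rather than the sed approach the commit describes, so it is an analogue and not the script's code. The pending-defects sample line is quoted from the commit; the two power-on sample lines and the helper names are assumptions about what `smartctl -a` prints for SAS drives.

```python
import re
from typing import Optional


def pending_defects(text: str) -> Optional[int]:
    """Extract the number after 'Pending defect count:', not the word 'Pending'."""
    m = re.search(r"Pending defect count:\s*(\d+)", text)
    return int(m.group(1)) if m else None


def power_on_hours(text: str) -> Optional[int]:
    """Try the 'Accumulated power on time' format first, then the alternate one."""
    m = re.search(r"Accumulated power on time, hours:minutes\s+(\d+):\d+", text)
    if m:
        return int(m.group(1))
    m = re.search(r"number of hours powered up\s*=\s*(\d+)", text)
    return int(m.group(1)) if m else None


# Sample lines; the last two are assumed formats, not taken from the commit.
print(pending_defects("Pending defect count:0 Pending Defects"))
print(power_on_hours("Accumulated power on time, hours:minutes 41346:12"))
print(power_on_hours("  number of hours powered up = 6434.65"))
```

Both helpers return `None` when no known format matches, so a caller can skip the field instead of emitting a bogus value.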
|
Add SAS SMART collector for skydick predictive failure metrics
Telegraf's inputs.smart parses the SATA/NVMe attribute table but ignores
the SAS-specific sections of `smartctl -a` output. The 18 SAS HDDs on
skydick were therefore reporting only health/temp, with no visibility
into power-on hours, grown defects, non-medium errors, pending defects,
or read/write uncorrected errors.
New sasSmartScript walks /dev/sd?, filters to SAS drives by transport
protocol, and emits a smart_sas line per device with the predictive
failure fields. Wired into telegraf via inputs.exec at 30m interval.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
| 2026-04-01 |
skydick: use async NFS export for media dataset
Media data is re-downloadable torrents — sync write guarantees are
unnecessary. Switching to async bypasses SLOG round-trips and improves
write throughput from 358 to 490 MB/s. All other exports remain sync.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
skydick: add mirrored NVMe special vdev + mirrored SLOG
Replaced single-drive SLOG + L2ARC with dual-Optane mirrored setup:
- 690G mirrored special vdev for metadata + files ≤128K
- 8G mirrored SLOG for sync writes
- special_small_blocks=128K set in ZFS properties service
- nvme1 formatted to 4Kn to match nvme0
The special vdev is the biggest performance win for an HDD pool: all
metadata lookups, directory listings, and small files now hit NVMe.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
| 2026-03-30 |
Update skydick README with InfluxDB and monitoring docs
Documents the fleet monitoring architecture: InfluxDB on ZFS,
Telegraf data sources, Grafana datasource layout, and ZFS
dataset management.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
| 2026-03-29 |
Fix influxdb-token encryption (was empty)
Re-encrypted with rage directly instead of agenix EDITOR flow
which silently produced an empty ciphertext.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
Update influxdb-token for skydick InfluxDB instance
Token now authenticates against the local InfluxDB on skydick
instead of the old door1 instance.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
Add InfluxDB v2 on skydick for fleet monitoring
- New modules/influxdb.nix: declarative InfluxDB v2 with ZFS-backed
storage (dick/system/influxdb, bind-mounted to /var/lib/influxdb2)
- monitoring.nix: make influxUrl configurable (default: skydick)
- skydick/default.nix: enable influxdb, point telegraf to localhost
- datapool.nix: document influxdb dataset in hierarchy + creation cmds
Consolidates all monitoring data (door1 + skydick + IoT sensors) into
a single InfluxDB on the ZFS storage server for infinite retention.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
| 2026-03-25 |
seems only mtu 1280 works for rdp
|
fix xlab-gateway host key in secrets.nix and rekey
The active host key on xlab-gateway is the original one
(AAAAII+EKDpU...), not the replacement. Corrected and rekeyed.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
add ylw ed25519 key: agenix access, SSH auth, rekey all secrets
- Add ylw's ed25519 public key to secrets.nix admins list
- Re-encrypt all .age secrets so ylw can decrypt
- Add ed25519 key to ye-lw21 authorized SSH keys
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
| 2026-03-24 |
harden and fix: nftables input chain, sudo, agenix, ZFS, NAT priority
- Add inet input_filter table to xlab-gateway (policy drop on WAN)
- Restrict NOPASSWD sudo to ldx only; ylw uses password sudo via wheel
- Restructure secrets.nix with admins list, prepare for ylw ed25519 key
- Add ye-lw21 to trusted-users in common.nix
- Remove contradictory relatime=on when atime=off on rpool
- Fix NAT postrouting priority: filter → srcnat
- Remove duplicate nixpkgs.hostPlatform from xlab-gateway hardware-configuration
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
skydick: document drive10 added as second hot spare
sg_format completed on drive10 (c9bcfa0f). Both LUNs added as spares,
bringing the pool to 8 mirrors + 2 hot spares (4 spare LUNs total).
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
| 2026-03-23 |
skydick: document pool expansion to 8 mirrors (~50.9T)
Added 4 new SAS Mach2 drives (drive6-9) as 4 mirror vdevs. Updated
drive inventory, layout diagram, expansion commands, and runbook
with sg_format/wipefs steps. drive10 pending sg_format completion.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
|
add route from subnet to phicomm mgmt
|
| 2026-03-21 |
|