| 2026-06-09 |
skydick/monitoring: add zpool_io physical pool IOPS collector
...
Telegraf inputs.zfs poolMetrics emits per-objset *logical* read/write
counters that include ARC cache hits, so RAM-served reads appear as huge
pool IOPS with zero disk activity (misled a 2026-05-30 mountd probe). Add
a zpool-iostat exec collector emitting real vdev-layer ops as measurement
zpool_io for an honest physical-IOPS view.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
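A minimal sketch of what such an exec collector could look like, assuming the input is `zpool iostat -Hp` output with columns in the order name, alloc, free, read ops, write ops, read bandwidth, write bandwidth. The function name and the field names (`read_ops`, `write_ops`, `read_bytes`, `write_bytes`) are hypothetical; only the measurement name `zpool_io` comes from the commit.

```python
def zpool_iostat_to_lines(text, measurement="zpool_io"):
    """Convert `zpool iostat -Hp` rows into Influx line-protocol lines."""
    lines = []
    for row in text.splitlines():
        cols = row.split()
        if len(cols) < 7:
            continue  # skip blank or unexpected rows
        pool, _alloc, _free, r_ops, w_ops, r_bw, w_bw = cols[:7]
        # Field names below are assumptions of this sketch, not the real collector's.
        fields = (
            f"read_ops={int(r_ops)}i,write_ops={int(w_ops)}i,"
            f"read_bytes={int(r_bw)}i,write_bytes={int(w_bw)}i"
        )
        lines.append(f"{measurement},pool={pool} {fields}")
    return lines


# Made-up sample row for a pool named "dick".
sample = "dick\t1099511627776\t2199023255552\t12\t34\t1048576\t2097152\n"
print("\n".join(zpool_iostat_to_lines(sample)))
```

Because these counters come from the vdev layer, reads served from ARC would not show up in them, which is the distinction the commit is after.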
|
| 2026-06-08 |
Merge pull request #1 from Skyworks/xuelin
...
skydisk: add IPv6 address
|
skydisk: do not add a via route for IPv6
Xuelin Yang committed 5 days ago
|
skydisk: add IPv6 address
Xuelin Yang committed 5 days ago
|
| 2026-05-26 |

nfs-rdma: bind listener lifecycle to nfs-server.service
...
The nfsd-rdma-listener oneshot writes "rdma 20049" to
/proc/fs/nfsd/portlist exactly once (at boot or first activation).
It then stays in `active (exited)`. But the nfsd kernel module's
portlist is wiped when nfs-server.service STOPS — and every
`nixos-rebuild switch` that touches services.nfs.server or its
config cycles nfs-server. After the cycle, port 20049 silently
goes dead while the listener unit shows "active" (it never re-ran).
Caught today (2026-05-26):
01:05:31 ldx ran nixos-rebuild switch on skydick
01:05:49 nfs-server stopped → portlist wiped
01:05:52 nfs-server started, port 2049 back up
nfsd-rdma-listener.service NOT restarted (still in
"active exited"), so 20049 stays unbound
~14h later: door-pek's NFS-RDMA mount finally hangs hard, df
blocks, telegraf disk input stops emitting, my
disk-data-full alert goes NoData and pages with
"[no value]" labels. qBittorrent unhealthy from
its /mnt/media-touching healthcheck. *arr health
issues climb.
Fix: add `partOf = [ "nfs-server.service" ]`. systemd then
propagates restart from the parent unit to the listener oneshot,
which re-runs its ExecStart (re-writes "rdma 20049" into
portlist) on every nfs-server cycle. Also added nfs-server.service
to the listener's wantedBy so a stop-then-start sequence
brings the listener back too.
Manual mitigation already applied — port 20049 is back up after
`sudo systemctl restart nfsd-rdma-listener.service`. This commit
is the structural fix so the next nfs-server cycle doesn't
silently break NFS-RDMA again.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
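The failure mode above (unit reports success while the kernel has forgotten the port) can be expressed as a small check. This is a sketch only: the function name is hypothetical, and it assumes /proc/fs/nfsd/portlist reads back one `transport port` pair per line.

```python
def rdma_listener_stale(unit_active_state, portlist_text, port=20049):
    """True when the oneshot looks healthy but the port is missing from portlist."""
    registered = any(
        line.split() == ["rdma", str(port)]
        for line in portlist_text.splitlines()
    )
    return unit_active_state == "active" and not registered


# State after the nfs-server cycle in the timeline: TCP is back, RDMA is not.
after_cycle = "tcp 2049\n"
print(rdma_listener_stale("active", after_cycle))
```

A check like this would have flagged the 01:05:52 state directly instead of waiting for the client mount to hang.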
|
skydick: onboard zhuyz24 to datapool
...
LDAP user (uidNumber 2200000020). Local override pins UID; GID stays at
100 (users) per existing convention so on-disk ownership matches what
NFS exports anongid to.
|
| 2026-05-16 |

Revert "skydick/samba: advertise RSS + speed for SMB Multichannel"
...
The `interfaces = "lo;capability=DYNAMIC,speed=1 10.0.1.1;capability=RSS,speed=..."`
directive from 4f21721 is malformed for Samba's parser. Samba's
interfaces list uses spaces between entries and semicolons to attach
multichannel options to a single interface, but the parser in 4.22
splits on semicolons FIRST, producing 6 invalid tokens instead of 2
tagged interfaces:
lo capability=DYNAMIC speed=1 10.0.1.1 capability=RSS speed=80000000000
Symptom chain caused by this:
- getaddrinfo failed for "capability=DYNAMIC" (logged in smbd debug)
- interfaces table corrupted → rpcd_classic crash-loops with
NT_STATUS_CONNECTION_DISCONNECTED on svcctl endpoint init
- SMB auth from real clients rejected as "Authentication error"
even with valid credentials (LDAP backend was fine; the proximate
cause was the broken RPC fabric below the auth layer)
- Both ldx@MacBook and ldx@Mac-mini couldn't connect via Finder
SMB to \\10.0.1.1\ldx (verified with smbutil view -N //[email protected]
→ "server rejected the authentication")
- LDAP entries were intact (sambaNTPassword still 981fb5a6...,
POSIX userPassword still SSHA-hashed, account flags [U ])
Reverting drops `interfaces`, `bind interfaces only`, and keeps only
`server multi channel support = yes` so clients can still negotiate
multichannel (just without the server-side RSS advertise hint).
Re-enabling capability advertising can be tried later with verified
syntax. Candidate per the Samba wiki examples:
interfaces = bond40g;capability=RSS,speed=80000000000
(without loopback's DYNAMIC tag, which may be what tripped the parser).
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
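To make the tokenization problem concrete, here is a toy model in Python. It is not Samba's parser; `intended_entries` encodes the reading the original commit assumed (whitespace between interfaces, `;` attaching options), and `observed_tokens` simply splits on every separator, which reproduces the six tokens listed above.

```python
import re

value = "lo;capability=DYNAMIC,speed=1 10.0.1.1;capability=RSS,speed=80000000000"


def intended_entries(s):
    """Whitespace separates interfaces; ';' attaches comma-separated options."""
    entries = []
    for entry in s.split():
        name, _, opts = entry.partition(";")
        entries.append((name, opts.split(",") if opts else []))
    return entries


def observed_tokens(s):
    """Toy model of the failure: every separator splits, so options become entries."""
    return [tok for tok in re.split(r"[;,\s]+", s) if tok]


print(intended_entries(value))
print(observed_tokens(value))
```

With the second reading, `capability=DYNAMIC` ends up being treated like an interface name, which matches the getaddrinfo failure in the symptom list.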
ldx committed 28 days ago
|
| 2026-05-15 |

skydick/samba: advertise RSS + speed for SMB Multichannel
...
Closest thing to "use the whole bond from SMB" without the
ksmbd/userspace fork dilemma. `server multi channel support = yes`
was already on but the client side (macOS, Windows) needs an
explicit capability hint to actually open more than one TCP
stream — without `interfaces ... ;capability=RSS,speed=...`,
Sequoia clients negotiate a single channel and that's it.
Now advertised:
lo capability=DYNAMIC speed=1 (loopback)
10.0.1.1 capability=RSS speed=80_000_000_000 (bond40g aggregate)
Effect: Sequoia 15.4+ opens up to 32 channels per SMB3 session;
LACP layer3+4 xmit hash distributes those streams across both
40 GbE bond slaves. Expected SMB throughput improvement is 2-4×
on bulk Finder copies vs single-channel TCP, with zero feature
loss (still userspace smbd: LDAP, Spotlight, fruit, recycle, TM
all intact).
Paired with `bind interfaces only = yes` so smbd doesn't listen
on any incidental interface (the existing `hosts allow` IP filter
stays as a second layer).
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
ldx authored 29 days ago
Dixiao-L committed 29 days ago
|
skydick/networking: tidy bond40g comments + reference cisco Po5
ldx committed 29 days ago
|
skydick/networking: drop bond0, rename references to bond40g
...
bond0 (ConnectX-4 LX 25G, active-backup) was carrying 10.0.1.1/16
until the 2026-05-15 cutover onto bond40g (2× 40G ConnectX-3 LACP
layer3+4, MTU 9200). With cutover done and verified — Aggregator ID
1 on both slaves, jumbo end-to-end to gateway, traffic flowing —
the old bond is dead weight.
* Remove the `bonds.bond0` and `interfaces.bond0` blocks.
* Rename the remaining active `bond0` references to `bond40g`:
- `systemd.network.networks."40-bond0"` → `."40-bond40g"`
- `"net.ipv6.conf.bond0.accept_ra"` sysctl
- `skyworks.monitoring.netInterfaces = [ "bond0" ]`
- wait-online and RA-leak comments
* The freed enp4s0f0np0/enp4s0f1np1 are now standalone DOWN,
available for future use.
The live kernel `bond0` device persisted past nixos-rebuild
because networkd doesn't destroy unmanaged ifaces; cleaned up
manually with `ip link set <slave> nomaster; ip link del bond0`.
ldx authored 29 days ago
ldx committed 29 days ago
|

Revert "skydick/samba: enable SMB-Direct"
...
The previous commit (407a0b3) was based on a wrong premise. Userspace
Samba's smbd does NOT implement an SMB-Direct (RDMA) transport even
with --with-smb-direct passed to waf — the flag is silently accepted
but the resulting binary contains no ibverbs code (verified post-
deploy: ldd /bin/smbd shows no libibverbs linkage, smbd doesn't
listen on port 5445, and testparm rejects "smb direct" as an unknown
parameter).
SMB Direct in Linux is implemented in the kernel server `ksmbd`
(fs/smb/server/ in the kernel tree), which is a separate
implementation from Samba. ksmbd would lose us:
- passdb backend = ldapsam (LDAP-backed posix users)
- Spotlight + tinysparql tracker integration
- vfs_fruit (metadata stream / macOS attrs / Time Machine sparse-
bundle support — central to ldx-timemachine share)
Not a worthwhile trade for the SMB workload, which is interactive
Finder browsing not bulk throughput. NFS-over-RDMA on the same
RoCE fabric (mlx4_ib via bond40g) covers the bulk-throughput case
already.
Replaced the misleading "SMB Direct" comment block with an explicit
"why this is NOT enabled" note so this doesn't get re-tried.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
ldx committed 29 days ago
|

skydick/samba: enable SMB-Direct (SMB3 over RDMA, port 5445)
...
Two coordinated changes:
1. sambaFull overlay extended to build with SMB-Direct support:
- rdma-core added to buildInputs (provides libibverbs + librdmacm)
- --with-smb-direct passed via configureFlags so waf wires up the
transport layer at compile time
2. settings.global gains `smb direct = yes` + 8 MiB read/write knobs
matching the NFS rsize/wsize on the same fabric. smbd now advertises
capability 0x40 on protocol negotiate; clients that speak SMB-Direct
(Win Server / Win Pro for Workstations / macOS Sequoia 15.4+) can
upgrade SMB3 sessions onto the bond40g RoCE fabric. Clients without
SMB-Direct silently fall back to plain TCP on 445.
The 2×40 GbE bond40g (ConnectX-3, post-cutover 2026-05-15) is the same
RDMA fabric NFS uses; SMB-Direct shares it without contention since
the queue-pair fanout is per-session. The "10 GbE NIC" comment in the
settings block is stale — replaced with the current 80 Gbps reality.
Build cost: sambaFull overlay forces a local rebuild on deploy
(~10-15 min, one CPU bound on smbd compilation).
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
ldx committed 29 days ago
|
skydick/nfs: enable NFS-over-RDMA listener on port 20049
...
Additive to the existing TCP listener — clients choose one transport
per mount, so adding RDMA doesn't disrupt anything. The hardware path
exists: mlx5_bond_0 (the LACP bond's RDMA representation) is ACTIVE
with link_layer=Ethernet (RoCEv2). Bonded RoCE on ConnectX-5 surfaces
both 25 GbE slaves as a single RDMA device, so RDMA traffic uses the
full 50 Gbps aggregate via the firmware's own LAG handling.
Clients (door-pek) use proto=rdma,port=20049 in nfs.nix to opt in.
RDMA transports have intrinsic parallelism (queue pairs), so nconnect
becomes a no-op — drop it from the mount options when switching.
Idempotent listener registration: nfsd's portlist rejects duplicate
adds with EINVAL, so the oneshot pre-checks before writing.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
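The pre-check idea can be sketched as follows. The real oneshot is presumably a shell script; this Python version is illustrative only, with hypothetical names, and it takes the read and write operations as callables so it can run against an in-memory stand-in rather than /proc/fs/nfsd/portlist.

```python
def ensure_rdma_listener(read_portlist, write_portlist, entry="rdma 20049"):
    """Write the entry only if it is not registered yet; return True if written."""
    registered = {" ".join(line.split()) for line in read_portlist().splitlines()}
    if entry in registered:
        return False  # already present, so skip the write and avoid EINVAL
    write_portlist(entry + "\n")
    return True


# In-memory stand-in for the portlist file so the sketch runs anywhere.
state = ["tcp 2049"]
read = lambda: "\n".join(state)
write = lambda text: state.append(text.strip())

print(ensure_rdma_listener(read, write))  # first call performs the write
print(ensure_rdma_listener(read, write))  # second call is a no-op
```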
ldx committed on 15 May
|
| 2026-05-14 |
skydick/networking: skyw VLAN MTU 9000 → 9200 jumbo frames
...
Match the skyw storage VLAN end-to-end:
cisco Po4 (switch port-channel): 9216
skydick (bond0 + skyw VLAN): 9200 ← this commit
door-pek (bond0 + skyw VLAN): 9200
The 9000 → 9200 bump leaves 16 bytes of headroom under cisco Po4 9216
for VLAN tag + L2 overhead.
Pairs with nix-infra commit 0xxxxxxx (door-pek/networking: skyw VLAN
MTU 9200 jumbo frames).
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
ldx committed on 14 May
|

skydick/nfs: crossmnt on per-user exports so child datasets are reachable
...
Per-user namespace is structured as:
dick/users/ldx — parent (quota boundary, no content of its own)
dick/users/ldx/files — SMB-exposed personal files (\\SKYDICK\ldx)
dick/users/ldx/bt-state — *arr / qBT runtime state
dick/users/ldx/timemachine — macOS sparsebundle target (\\SKYDICK\ldx-timemachine)
dick/users/ldx/vm — VM disk roots
Without crossmnt on the parent export, NFS clients mounting
/srv/users/ldx only see the parent dataset and hit empty placeholders
where the children mount. 2026-05-14 incident: door-pek's baidunetdisk
container bound /mnt/users/ldx/baidu (top-level placeholder location)
because /mnt/users/ldx/files showed empty over NFSv3 — downloads landed
outside the SMB-visible namespace until the dataset boundary was
diagnosed.
Adding crossmnt makes the children visible from the existing parent
export with no separate export entries; equivalent to `nohide` on each
child. Options (all_squash, anonuid=1000) inherit naturally — exactly
the behaviour the parent already provides.
Also applied to /srv/users/ye-lw21 for parity (same dataset shape).
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
ldx committed on 14 May
|
| 2026-05-09 |
monitoring: smart_sas_info{vendor,product,revision,serial} for alert enrichment
ldx committed on 9 May
|
monitoring: SAS SMART + ZFS pool textfile collectors for skydick
...
Closes the parity gap with door1 telegraf. node-exporter does not parse
SAS-specific smartctl output (predictive failure: grown defects, non-medium
errors, pending defects, ECC totals) — only SATA/NVMe attribute tables.
And the zfs collector exposes ARC + pool I/O but not pool health enum.
Adds skyw-textfile-collectors.service + .timer (5min cadence) that emits:
smart_sas_power_on_hours{device}
smart_sas_grown_defects{device}
smart_sas_non_medium_errors{device}
smart_sas_pending_defects{device}
smart_sas_read_uncorrected{device}
smart_sas_write_uncorrected{device}
zpool_health{pool,state} 0=ONLINE 1=DEGRADED 2=FAULTED ...
Files are chmod 0644 so the node-exporter user can read them via the
textfile collector.
(Findings: sdd and sde on skydick already at 445 grown defects each.)
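For illustration, the `zpool_health` series could be produced along these lines. This is a sketch, not the actual collector script: it assumes the input is `zpool list -H -o name,health` output, uses only the three state codes spelled out above, and maps any other state to -1 as a placeholder of this sketch.

```python
# Codes taken from the commit message; other states are left unmapped here.
HEALTH_CODES = {"ONLINE": 0, "DEGRADED": 1, "FAULTED": 2}


def zpool_health_lines(zpool_list_output):
    """Render Prometheus textfile lines from `zpool list -H -o name,health` rows."""
    lines = []
    for row in zpool_list_output.splitlines():
        cols = row.split()
        if len(cols) != 2:
            continue  # ignore blank or malformed rows
        pool, state = cols
        code = HEALTH_CODES.get(state, -1)  # -1: state not covered by this sketch
        lines.append(f'zpool_health{{pool="{pool}",state="{state}"}} {code}')
    return lines


print("\n".join(zpool_health_lines("dick\tONLINE\n")))
```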
ldx committed on 9 May
|
| 2026-05-08 |
gitignore: add .DS_Store
ldx committed on 8 May
|
monitoring: add nodeExporter option, enable on skydick
...
Replaces telegraf-as-only-monitoring with a declarative node-exporter that
the skyw-gw Prometheus scrapes directly. Telegraf->InfluxDB(door1) keeps
running until door1 retirement so the legacy skydick.json grafana
dashboard does not go dark mid-migration.
ldx committed on 8 May
|
| 2026-05-06 |
xlab-gw: fix MSS clamp — match SYN-ACK too, use rt mtu
...
Old rule `tcp flags & (syn|ack) == syn` only matched plain SYN.
SYN-ACK from the server has SYN+ACK both set, so masking with
syn|ack and comparing == syn FAILED for SYN-ACK. Result: server
responses came back unclamped, full-MTU TCP segments overflowed
the WG path's effective MTU (1420 inner), large pages silently
stalled — YouTube didn't load, Microsoft pages partial-loaded,
Google was slow. Browsers retried indefinitely, looked like
"the network is broken" from a user perspective.
Replaced with `& (syn|rst) == syn` which matches both plain SYN
and SYN-ACK (only excludes RST, which carries no data). Combined
with `set rt mtu` instead of the hard 1280 — lets the kernel
pick the right MSS per egress interface (wg-to-wgnet → 1380 v4 /
1360 v6) instead of pessimistically clamping everything.
User's commented-out line had the right idea (rt mtu) but wrong
flag mask; fixed both at once.
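The flag-mask logic is easy to check by hand. The sketch below models both rules as plain bit tests in Python (it is not nftables syntax) using the standard TCP flag bit values, and also reproduces the MSS figures quoted above by subtracting 40 bytes (IPv4 + TCP headers) or 60 bytes (IPv6 + TCP headers) from the 1420-byte path MTU. Function names are hypothetical.

```python
# Standard TCP header flag bits.
FIN, SYN, RST, PSH, ACK = 0x01, 0x02, 0x04, 0x08, 0x10


def old_rule(flags):
    """tcp flags & (syn|ack) == syn: matches plain SYN only."""
    return (flags & (SYN | ACK)) == SYN


def new_rule(flags):
    """tcp flags & (syn|rst) == syn: matches SYN and SYN-ACK, not RST."""
    return (flags & (SYN | RST)) == SYN


def clamped_mss(path_mtu, ipv6=False):
    """MSS that fits the path MTU after IP and TCP headers (no options)."""
    return path_mtu - (60 if ipv6 else 40)


for name, flags in [("SYN", SYN), ("SYN-ACK", SYN | ACK), ("SYN+RST", SYN | RST)]:
    print(name, old_rule(flags), new_rule(flags))
print(clamped_mss(1420), clamped_mss(1420, ipv6=True))
```

The SYN-ACK row is the interesting one: the old mask rejects it, so the server's half of the handshake was never clamped.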
ldx committed on 6 May
|
add README — host roles + deploy + DNS gotchas
ldx committed on 6 May
|
skydick: also disable RA in systemd-networkd userspace
...
Sysctl accept_ra=0 only stops the kernel — systemd-networkd does
its own RA processing in userspace and was caching the link-DNS
even after the kernel sysctl was applied. Override the auto-
generated 40-bond0.network with networkConfig.IPv6AcceptRA=false.
ldx committed on 6 May
|
skydick: suppress IPv6 RA processing on bond0
...
`networking.enableIPv6 = false` only disables IPv6 forwarding/use;
the kernel still accepts router advertisements unless told otherwise.
The gateway's radvd was seeding fd99:23eb:1682::1 as a per-link DNS
on bond0, which then took precedence in systemd-resolved for AAAA
queries — making blocked names error as 'Connection refused' instead
of returning a clean NXDOMAIN through 10.0.0.1's mosdns.
Set accept_ra=0 globally + on bond0 explicitly. Existing 'enableIPv6
= false' continues to handle the higher-level disable.
ldx committed on 6 May
|
skydick: route DNS via 10.0.0.1 only, AliDNS as fallback
...
Was: nameservers = [ "10.0.0.1" "223.5.5.5" ] — both treated as
primary by systemd-resolved, which then load-balanced to AliDNS
and bypassed mosdns's analytics blocking (resolvectl confirmed
hm.baidu.com / google-analytics.com leaking through).
Now: 10.0.0.1 only as primary, AliDNS demoted to fallbackDns so
it activates only when 10.0.0.1 is unreachable.
ldx committed on 6 May
|
xlab-gateway: route DNS via local mosdns at 10.0.0.1
...
Adds services.resolved with primary DNS 10.0.0.1 (network-local mosdns)
and Cloudflare as fallback. Removes the hardcoded DNS=166.111.8.28/29
on the wan99.0 link — those Tsinghua resolvers are subject to GFW
poisoning, and per-link DNS overrode the global resolved policy.
When 10.0.0.1 is reachable, this host inherits CN-aware split routing
and the network analytics-blocking policy. When 10.0.0.1 is down,
resolved transparently falls back to Cloudflare so internet keeps
working; queries return to 10.0.0.1 once it responds again.
ldx authored on 6 May
Dixiao-L committed on 6 May
|
| 2026-04-07 |
sas-smart: reduce exec interval from 30m to 5m
...
With round_interval=true and 30m, the next gather happens at the next
30m wall-clock boundary, which can mean up to 30 min of gaps after a
restart. 5m gives near-real-time visibility into defect counts —
relevant during resilver operations where new defects might appear.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
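The gap arithmetic can be sketched as below, assuming gathers are aligned to wall-clock multiples of the interval as described above. The function name is hypothetical and this models only the alignment, not Telegraf's scheduler.

```python
def seconds_until_next_gather(now_s, interval_s):
    """Seconds from now_s (epoch seconds) to the next multiple of interval_s."""
    return interval_s - (now_s % interval_s)


# A restart one second after a boundary waits almost a full interval.
restart = 30 * 60 * 1000 + 1  # hypothetical timestamp, 1 s past a 30m boundary
print(seconds_until_next_gather(restart, 30 * 60))  # wait with a 30m interval
print(seconds_until_next_gather(restart, 5 * 60))   # wait with a 5m interval
```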
|
sas-smart: use /run/wrappers/bin/sudo instead of Nix store sudo
...
The Nix store sudo binary lacks the setuid bit (the Nix store cannot
hold setuid binaries), so calling it as the telegraf user fails
with "must be owned by uid 0 and have the setuid bit set". Nothing
surfaced that error, so the sas-smart exec emitted nothing and
smart_sas data never refreshed after the initial manual write.
Switch to the NixOS security wrapper at /run/wrappers/bin/sudo
which is the proper setuid wrapper.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
smart plugin: nocheck=never so spun-down drives still report
...
Telegraf's inputs.smart uses smartctl -n standby by default, which
returns exit(2) for drives in low-power mode and Telegraf records no
data for them. On skydick this caused sdd/sde (drive1, ZKL05VPS...FMAC)
to be silently missing from smart_device metrics — the exact drive
that accumulated 63 grown defects and had sg_format failures during
initial setup.
Setting nocheck=never forces smartctl to wake spun-down drives. In a
ZFS pool with active mirrors, drives shouldn't be spinning down
anyway, so the 30-min wakeup overhead is negligible.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
Fix SAS SMART parsing for pending defects and alt power-on format
...
- pending_defects: was matching word "Pending" instead of the number
in "Pending defect count:0 Pending Defects" — use sed to extract
digits between colon and space
- power_on_hours: some SAS drives report "number of hours powered up"
instead of "Accumulated power on time" — try both formats
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
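The two parsing fixes can be illustrated with regular expressions. Note the swap: the commit uses sed inside a shell script, while this sketch uses Python's `re` for a runnable example. The pending-defect sample line is quoted from the message above; the exact shape of the two power-on lines is an assumption of this sketch.

```python
import re


def pending_defects(text):
    """Extract the number after 'Pending defect count:' rather than the word 'Pending'."""
    m = re.search(r"Pending defect count:\s*(\d+)", text)
    return int(m.group(1)) if m else None


def power_on_hours(text):
    """Try the 'Accumulated power on time' format first, then the alternate one."""
    m = re.search(r"Accumulated power on time, hours:minutes\s+(\d+):\d+", text)
    if m:
        return int(m.group(1))
    m = re.search(r"number of hours powered up\s*=\s*(\d+)", text)
    return int(m.group(1)) if m else None


print(pending_defects("Pending defect count:0 Pending Defects"))
# Assumed line shapes for the two power-on formats:
print(power_on_hours("Accumulated power on time, hours:minutes 36123:14"))
print(power_on_hours("  number of hours powered up = 36123.23"))
```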
|
Add SAS SMART collector for skydick predictive failure metrics
...
Telegraf's inputs.smart parses the SATA/NVMe attribute table but ignores
the SAS-specific sections of `smartctl -a` output. The 18 SAS HDDs on
skydick were therefore reporting only health/temp, with no visibility
into power-on hours, grown defects, non-medium errors, pending defects,
or read/write uncorrected errors.
New sasSmartScript walks /dev/sd?, filters to SAS drives by transport
protocol, and emits a smart_sas line per device with the predictive
failure fields. Wired into telegraf via inputs.exec at 30m interval.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|