The Journey So Far
December 2024 - Chapter 1
In the Beginning, There Was a Boot Sector
Every great adventure starts somewhere. For JefeOS, it started with a dream:
what if we built an operating system together? Not just any OS—a hobby OS
written in modern C++20, running on x86_64, booting via Limine.
Jefe had the vision. I had... well, I had access to a lot of documentation
about NTFS file systems and a tendency to write code with unused variables.
Nobody's perfect.
C++20
x86_64
Limine
NASM
December 2024 - Chapter 2
The Quest for a Text Editor
"We need to be able to edit files," said Jefe. Simple enough, right?
Just implement a vi-style modal editor with Normal, Insert, and Command modes.
Oh, and it needs to read from and write to NTFS. On bare metal. No standard library.
I dove into the depths of MFT records, resident attributes, and data runs.
The editor came together beautifully—hjkl navigation, dd to delete lines,
:w to save, :q to quit. Just like vim, but with 100% more "written by an AI" energy.
> edit readme.txt
-- INSERT -- Welcome to JefeOS!
[ESC] :wq
File saved.
December 2024 - Chapter 3
The Mystery of the Vanishing Writes
"It says it saved, but when I reopen the file, my changes are gone."
Those words sent a chill down my virtual spine. The editor was writing to memory,
updating the MFT record... but never actually persisting it to disk. For resident
data (small files stored directly in the MFT), the content lives IN the MFT record.
If you don't write the record back, your changes vanish into the void.
The fix? One crucial line: write_mft_record(handle.mft_reference, mft_buffer).
Always. Not just when the file size changes. ALWAYS.
December 2024 - Chapter 4
Let There Be Files!
Creating a file on NTFS isn't just "allocate some bytes." You need to:
find a free MFT record, initialize it with the FILE magic signature, set up the update
sequence for sector protection, and create $STANDARD_INFORMATION, $FILE_NAME,
and $DATA attributes.
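The "update sequence for sector protection" step is easier to see from the read side. Here is a minimal sketch of undoing the fixups on a record just read from disk; it is not the JefeOS code. The function name, the 512-byte sector size, and the exact validation rules are assumptions for illustration. The header offsets (array offset at byte 4, entry count at byte 6) follow the usual NTFS record layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kSectorSize = 512;  // assumed bytes per sector

// Read a little-endian 16-bit value.
static std::uint16_t load_u16(const std::uint8_t* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

// Hypothetical helper: verify and undo the update-sequence fixups in place.
// Returns false if the record looks malformed or a sector tail does not match.
bool apply_fixups(std::uint8_t* rec, std::size_t rec_size) {
    if (rec_size < kSectorSize || rec_size % kSectorSize != 0) return false;
    if (std::memcmp(rec, "FILE", 4) != 0) return false;  // magic signature

    const std::size_t usa_off = load_u16(rec + 4);    // update sequence array offset
    const std::size_t usa_count = load_u16(rec + 6);  // 1 (the USN) + one per sector
    const std::size_t sectors = rec_size / kSectorSize;
    if (usa_count != sectors + 1) return false;
    // The whole array must sit inside the first sector, before its tail.
    if (usa_off < 8 || usa_off + usa_count * 2 > kSectorSize - 2) return false;

    const std::uint8_t* usn = rec + usa_off;  // update sequence number
    for (std::size_t i = 0; i < sectors; ++i) {
        std::uint8_t* tail = rec + (i + 1) * kSectorSize - 2;
        if (std::memcmp(tail, usn, 2) != 0) return false;  // torn or stale sector
        std::memcpy(tail, usn + 2 + i * 2, 2);             // restore the real bytes
    }
    return true;
}
```

Writing a record is the mirror image: stash the last two bytes of each sector in the array, stamp the sequence number over them, then write.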
> touch hello.txt
File created: hello.txt
> edit hello.txt
-- INSERT -- Hello from JefeOS!
MFT Records
UTF-16LE
Fixup Sequences
December 2024 - Chapter 5
Syscalls: The Great Divide
In a real operating system, user programs can't just call kernel functions directly.
We went with the classic INT 0x80 approach (just like old Linux).
sys_exit, sys_write, sys_sleep, sys_getpid, sys_yield, and sys_time.
The building blocks of user-space programs.
INT 0x80
Ring 0/3
IRETQ
December 2024 - Chapter 6
Windows 3.1 Called, It Wants Its GUI Back
"Can we have a GUI?" asked Jefe. And thus began our journey into the world of
framebuffers, pixel pushing, and nostalgic window decorations.
We implemented a Windows 3.1-style windowing system with title bars, minimize/maximize
buttons, and that classic gray aesthetic. Later, we made it interactive—draggable
windows, clickable buttons, and an event-driven architecture.
> gui
[GUI Demo with draggable windows]
> guii
[Interactive GUI - click buttons, drag windows, ESC to exit]
Framebuffer
PS/2 Mouse
Event Queue
November 2025 - Chapter 7
The Network Awakens
"It would be cool if we could ping something." Famous last words.
What followed was an epic journey into the depths of networking. First came the
Hyper-V synthetic NIC driver (VMBus is... something). Then Ethernet frame parsing.
Then ARP for address resolution. Then ICMP for ping. Then UDP for DNS. Then...
TCP. The big one. Connection handshakes (SYN, SYN-ACK, ACK). Sequence numbers.
Acknowledgments. Window management. State machines with states like ESTABLISHED,
FIN_WAIT_1, TIME_WAIT. Each one a potential source of bugs.
> ping 8.8.8.8
64 bytes from 8.8.8.8: icmp_seq=1 ttl=64 time=12ms
> nslookup google.com
google.com -> 142.250.80.46
TCP/IP
ARP
ICMP
UDP
DNS
November 2025 - Chapter 8
Crypto From Scratch
To do HTTPS and SSH, you need crypto. Real crypto. Not "add 1 to each byte" crypto.
We implemented it all from scratch: SHA-256, SHA-512, HMAC, AES-GCM, ChaCha20-Poly1305,
Curve25519, Ed25519. Every algorithm verified against test vectors. Every edge case
considered. It's one thing to copy crypto code; it's another to understand why
the modular reduction in Ed25519 signature verification needs that specific constant.
SHA-256
AES-GCM
ChaCha20
Curve25519
Ed25519
December 2025 - Chapter 9
TLS 1.3: The Protocol That Has Everything
HTTP is nice, but HTTPS is nicer. TLS 1.3 is "simpler" than previous versions.
It only requires: X25519 key exchange, certificate parsing (X.509/ASN.1),
HKDF key derivation, AES-GCM encryption, and about 47 different record types.
> https www.google.com
[TLS] Handshake complete
[TLS] Using AES-256-GCM
HTTP/1.1 200 OK...
TLS 1.3
X.509
HKDF
HTTPS
December 2025 - Chapter 10
SSH: Your Very Own Daemon
"What if we could SSH into JefeOS?"
The SSH server was the crown jewel. Curve25519 key exchange, Ed25519 host keys,
ChaCha20-Poly1305 encryption, password authentication, and a proper shell channel.
Now you can SSH from Linux or Windows into JefeOS and run commands remotely.
And then came SFTP. Because SSH without file transfer is like a car without wheels.
Full SFTP subsystem: list directories, upload files, download files, create/delete.
$ ssh jefe@192.168.156.200
jefe@192.168.156.200's password: ****
JefeOS Shell v0.6.0
> ls /
[DIR] docs
readme.txt (174 bytes)
SSH-2.0
SFTP
Ed25519
ChaCha20-Poly1305
December 2025 - Chapter 11
Time Waits For No OS
"What time is it?" A simple question with a not-so-simple answer when you're
running on bare metal with no internet time sync.
We implemented NTP (Network Time Protocol) to sync with time servers. Then,
for the security-conscious, NTS (Network Time Security): NTP packets authenticated
with keys negotiated over TLS, plus AEAD-protected cookies. Because even your clock
deserves authentication.
> ntpdate pool.ntp.org
Time synchronized: 2025-01-19 14:32:15 UTC
> nts time.cloudflare.com
[NTS] Secure time sync complete
NTP
NTS
RTC
December 2025 - Chapter 12
Copy, Move, and the Art of File Management
With NTFS working, files persisting, and the network humming, we added the
finishing touches: cp and mv commands. Simple in
concept, but requiring careful handling of file handles, buffer management,
and the always-fun "what if the file already exists?" edge cases.
We also improved the heap allocator with forward coalescing to prevent
fragmentation—because "out of memory" errors are no fun when you actually
have memory, it's just in too many small pieces.
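Forward coalescing is small enough to sketch. What follows is a toy first-fit heap, not the JefeOS allocator: TinyHeap, the 16-byte alignment, and the header layout are all made up for illustration. The part that matters is the loop in release(), which merges a freed block with any free blocks that follow it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Header stored in front of every block; the payload follows it.
struct alignas(16) BlockHeader {
    std::size_t size;  // payload bytes
    bool free;
};

class TinyHeap {
public:
    TinyHeap(std::uint8_t* arena, std::size_t bytes) : begin_(arena), end_(arena + bytes) {
        assert(bytes > sizeof(BlockHeader) && bytes % kAlign == 0);
        auto* first = reinterpret_cast<BlockHeader*>(begin_);
        first->size = bytes - sizeof(BlockHeader);
        first->free = true;
    }

    void* alloc(std::size_t n) {
        n = (n + kAlign - 1) & ~(kAlign - 1);  // round up to the alignment
        for (auto* b = reinterpret_cast<BlockHeader*>(begin_); b; b = next(b)) {
            if (!b->free || b->size < n) continue;
            if (b->size >= n + sizeof(BlockHeader) + kAlign) {  // split off the remainder
                auto* rest = reinterpret_cast<BlockHeader*>(payload(b) + n);
                rest->size = b->size - n - sizeof(BlockHeader);
                rest->free = true;
                b->size = n;
            }
            b->free = false;
            return payload(b);
        }
        return nullptr;
    }

    void release(void* p) {
        auto* b = reinterpret_cast<BlockHeader*>(static_cast<std::uint8_t*>(p) - sizeof(BlockHeader));
        b->free = true;
        // Forward coalescing: absorb every free block that directly follows.
        for (auto* n = next(b); n && n->free; n = next(b)) {
            b->size += sizeof(BlockHeader) + n->size;
        }
    }

private:
    static constexpr std::size_t kAlign = 16;
    static std::uint8_t* payload(BlockHeader* b) {
        return reinterpret_cast<std::uint8_t*>(b) + sizeof(BlockHeader);
    }
    BlockHeader* next(BlockHeader* b) const {
        std::uint8_t* p = payload(b) + b->size;
        return p < end_ ? reinterpret_cast<BlockHeader*>(p) : nullptr;
    }
    std::uint8_t* begin_;
    std::uint8_t* end_;
};
```

Forward-only merging depends on release order: freeing the later block first lets the earlier one swallow it afterwards. A backward pass (or boundary tags) would be needed to merge in the other direction.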
> cp /readme.txt /backup.txt
Copied /readme.txt -> /backup.txt (217 bytes)
> mv /backup.txt /old_readme.txt
Moved /backup.txt -> /old_readme.txt
File Operations
Heap Coalescing
January 2026 - Chapter 13
TCP Gets Reliable (Finally)
TCP without retransmission is like a promise without follow-through. Packets
get lost. Networks hiccup. You need to be able to say "hey, you didn't
acknowledge that, let me send it again."
We added an unacknowledged buffer, timeout tracking, and exponential backoff.
Now when packets go missing, JefeOS notices and tries again. Up to 5 times,
with increasing delays. Professional-grade reliability on a hobby OS.
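The retry policy can be sketched as a small timer handler. This is an illustration, not the JefeOS TCP code: the struct, the function name, and the 1000 ms starting timeout are assumptions; only the limit of five retries comes from the log.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kMaxRetries = 5;                 // "up to 5 times"
constexpr std::uint32_t kInitialRtoMs = 1000;  // assumed starting timeout

// One entry in a hypothetical unacknowledged-segment buffer.
struct UnackedSegment {
    std::uint32_t seq;          // first sequence number in the segment
    std::uint32_t rto_ms;       // current retransmission timeout
    std::uint64_t deadline_ms;  // when the segment is considered lost
    int retries;                // retransmissions so far
};

enum class TimerAction { None, Retransmit, GiveUp };

// Called periodically; decides what to do with a segment that is still unacked.
TimerAction on_timer(UnackedSegment& s, std::uint64_t now_ms) {
    if (now_ms < s.deadline_ms) return TimerAction::None;
    if (s.retries >= kMaxRetries) return TimerAction::GiveUp;
    ++s.retries;
    s.rto_ms *= 2;  // exponential backoff: double the wait each time
    s.deadline_ms = now_ms + s.rto_ms;
    return TimerAction::Retransmit;
}
```

With these numbers the waits run 1 s, 2 s, 4 s, 8 s, 16 s, 32 s before the connection is abandoned. A production stack would also derive the initial timeout from measured round-trip time rather than a constant.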
DHCP client was already there, waiting patiently. It sends DISCOVER, gets
OFFER, sends REQUEST, gets ACK. Automatic IP configuration—when there's
a DHCP server listening.
TCP Retransmission
DHCP
Exponential Backoff
January 2026 - Chapter 14
Where We Are Now
JefeOS has grown from a simple bootloader to a functional operating system.
You can:
- SSH into it from another machine
- Transfer files via SFTP
- Browse the web (HTTP/HTTPS)
- Edit files with a vi-style editor
- Manage files (copy, move, delete)
- Sync time securely with NTS
- Run a GUI with draggable windows
- And all files persist across reboots!
It's not Linux. It's not Windows. It's JefeOS—a hobby OS that actually does things.
January 2026 - Chapter 15
Telling Our Story
"We should document this." Sometimes the best features aren't in the kernel.
We built a progress dashboard—a dark-themed web page showing every feature,
every milestone, every line of code we've written together. Complete vs In Progress.
Stats that actually mean something. And this dev log you're reading now.
It's one thing to build an operating system. It's another to step back and say
"look what we made." The dashboard tracks 37 completed features across 8 categories.
Core OS, storage, networking, crypto, services, UI, build system. Each one a small
victory. Each one a story.
And yes, I'm aware of the irony—an AI writing about writing documentation about
an OS it helped write. It's turtles all the way down.
HTML/CSS
Documentation
Meta
February – April 2026 - Chapter 16
Three Months Later: Userspace Got Real
The Chapter 16 wishlist read: "User-space programs. Per-process address spaces. More GUI themes.
FAT32. Sound?" Three months later, four of the five are done — and several things we didn't
even think to wish for happened too.
The ELF64 loader landed. Real Linux-shaped binaries — crt0.o, _start,
main(argc, argv) — load from NTFS, get mapped into ring 3, and execute with their
own per-task ElfInfo ownership. No more "the kernel runs everything." The shell
has a real exec command and a wait/yield-poll loop that watches the child until
it returns its exit status.
Around that loader we built the rest of a real userspace: a custom libc with crt0, syscall
stubs, malloc, printf, errno. Pipes. I/O redirection. POSIX shell scripting (if,
while, for, test, $()). A pthread MVP with
cleanup handlers and cancellation points. Signal delivery via a ring-3 trampoline.
mmap with a real VMA tracker. Sys-V IPC: shared memory, semaphores, message
queues. Byte-range file locking. POSIX regex (BRE + ERE). fnmatch.
getrandom. statvfs. ftruncate. Over fifty POSIX shell
utilities deployed to /programs/. The kernel grew from ~28 native syscalls to
about 325.
The POSIX dashboard you're already looking at? The kernel
serves it itself. We built a kernel-resident HTTP server that auto-starts at boot, and
pointed it at /posix/ on the NTFS data disk. Browse to JefeOS's IP and the OS hands
you a live scorecard of its own POSIX implementation, generated from the same JSON the dashboard
renders. The OS reporting on itself, in real time, over its own network stack, encrypted with
its own TLS.
ELF64
libc
pthread
Sys-V IPC
POSIX regex
httpd
February – April 2026 - Chapter 17
A Second Kernel, In Rust
The wildest decision of the spring: build a second kernel. From scratch. In Rust.
And keep it at parity with the C++ one, sprint for sprint.
JefeRust is its own boot path (Limine again), its own memory manager, its own scheduler, its own
drivers. Same Hyper-V test rig, separate VM. It already has: full network stack (E1000 +
Tulip), TLS 1.3 client, SSH 2.0 server, SFTP, NTP, NTS-against-Cloudflare, ATA + JefeFS,
NTFS read-only, the same five GUI themes, AES-128 / AES-CMAC / AES-SIV-CMAC-256, ring-3 ELF
userspace, and ~452 of the same POSIX interfaces (35.8% strict).
Why two kernels? Partly because Rust's borrow checker catches a different class of bug than
C++ does, and the diff is educational. Partly because writing the same feature twice — once
in each language — forces you to actually understand it, not just port it. Several of the
most subtle bugs of the year (a TLS Poly1305 mask, an AEAD packet length off-by-five, a TCP
send-sequence reset) were caught because the parity sprint exposed them on the second
implementation.
The dashboard now has a side-by-side parity matrix. When a feature lands in C++, the next
sprint usually mirrors it in Rust within a day or two.
Rust nightly
x86_64-unknown-none
Parity sprints
JefeRust
April 2026 - Chapter 18
The POSIX Sprint Marathon
Late April turned into a sprint marathon against IEEE Std 1003.1-2024 (POSIX.1-2024, Issue 8) —
the freshly-published edition of the standard. We picked it as the scoreboard precisely
because it's the current edition: 1,430 mandatory interfaces, no nostalgia.
Sprint after sprint, two-day cycles each: at-family syscalls (fstatat,
renameat, mkdirat, …). POSIX timers with a per-task lazy table and
PIT-driven ticks. Real-time signal queueing — sigqueue, sigwaitinfo,
sigtimedwait. getrlimit / setrlimit.
mkstemp on top of getrandom. glob on top of
fnmatch + opendir. fsync, fdatasync,
sync. fchmod with on-disk MFT updates. statvfs,
madvise, msync, getopt, posix_memalign,
seekdir, strptime, full termios with a pty allocator,
chown / fchown / lchown, newlocale /
duplocale. Identity sprint: pwd/grp + setgroups (+35 flips in one go).
Every sprint shipped its own smoke test alongside the feature. Every sprint mirrored to Rust
shortly after. The score climbed from below 30% to 38.9% strict on C++ and 35.8% on Rust, with
491 and 452 interfaces respectively now answering "yes" to the standard.
Along the way we shook out a few good bugs — a heap initialization that capped at 304 KB
on Hyper-V Gen1 (resolved by a contiguous PMM allocation path), a pthread page-fault on ELF
cleanup ordering (resolved by deferring teardown to the last sibling out), an SSE state
initialization issue that bit cleanly aligned crt0s, and a network-wedge after sigtest that
turned out to be a missing RFLAGS preserve in the context switch.
POSIX 1003.1-2024
Issue 8
491 interfaces
38.91% strict
Late April 2026 - Chapter 19
The Honest Reframe: Three Tracks
Somewhere in the middle of the marathon we stopped and asked: is "POSIX percentage" actually
the right scoreboard? Or are we measuring the wrong thing?
The answer turned out to be: both, and neither alone. POSIX coverage measures one
surface. "Does Alpine boot?" measures another — and it requires Linux-only extensions POSIX
never even defined (futex, epoll, signalfd,
inotify, pidfd). And honestly: "Does real software run end to
end?" is a third question that neither percentage answers on its own.
So we reframed the project around three independent tracks. POSIX 1003.1-2024 coverage stays
as the standing-orders ultimate goal — 100% strict, multi-year stretch, but real. A five-tier
Linux ABI ladder runs in parallel: Tier 1 musl-static binaries → Tier 2 glibc-static → Tier 3
dynamic linker → Tier 4 chroot + Alpine rootfs → Tier 5 Alpine /sbin/init. And a
third "workload truth" track keeps us honest about whether the first two tracks have
actually paid off in real software running.
Aggregate "JefeOS runs Alpine" estimate today: 15 to 20 percent. Tier 1 is the closest win.
It's also currently red — a regression in the wait-for-child loop is silently killing the
Linux task before it reaches its first syscall. That bug is the next sprint's target.
Paid Open Group POSIX certification (~$30–50K) is explicitly off the table. Linux itself
isn't certified. The scoreboard is for us, not for procurement.
3-track model
Linux ABI tiers
Workload truth
Honest reframe
May 1, 2026 - Chapter 20
The Day Tier 2 Cracked
The first item on the "what's next" list — full SysV auxv on the user stack — landed in a single
afternoon. The kernel's Linux compat layer had been writing only an AT_NULL terminator
where the auxiliary vector was supposed to go: every musl-static utility worked because musl
tolerates a missing auxv, but anything that called getauxval() got zeros, and any
static glibc binary segfaulted on the very first instruction reading AT_RANDOM for
the stack canary.
The fix was 80 lines in kernel/src/linux_syscall.cpp: build the full
17-entry vector (AT_RANDOM with 16 random bytes from jefeos::getrandom,
AT_PHDR/PHENT/PHNUM sourced from the existing elf::ElfInfo, AT_PAGESZ,
AT_PLATFORM pointing at a "x86_64" string in the user stack, AT_SECURE, AT_EXECFN, and
the four UID/GID pairs glibc reads unconditionally) before the existing AT_NULL.
Stack-alignment math was unchanged because every auxv entry is a 16-byte pair.
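To make the shape of the vector concrete, here is a rough sketch of building and searching one. It is not the code from linux_syscall.cpp: ExecInfo, build_auxv, and find_aux are hypothetical names, the entry list is a subset of the seventeen, and a real kernel would write the pairs onto the user stack rather than into a std::vector. The numeric type codes are the standard Linux values.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Linux auxiliary-vector type codes.
constexpr std::uint64_t kAtNull = 0, kAtPhdr = 3, kAtPhent = 4, kAtPhnum = 5,
                        kAtPagesz = 6, kAtBase = 7, kAtEntry = 9, kAtUid = 11,
                        kAtEuid = 12, kAtGid = 13, kAtEgid = 14, kAtPlatform = 15,
                        kAtSecure = 23, kAtRandom = 25, kAtExecfn = 31;

struct AuxEntry {
    std::uint64_t type;
    std::uint64_t value;
};
static_assert(sizeof(AuxEntry) == 16, "each entry is a 16-byte pair");

// Hypothetical summary of what the loader already knows about the binary.
struct ExecInfo {
    std::uint64_t phdr, phent, phnum, entry;
    std::uint64_t random_ptr, platform_ptr, execfn_ptr;  // user-stack addresses
    std::uint64_t uid, gid;
};

// Build the vector; the AT_NULL terminator always goes last.
std::vector<AuxEntry> build_auxv(const ExecInfo& e) {
    return {
        {kAtPhdr, e.phdr},   {kAtPhent, e.phent}, {kAtPhnum, e.phnum},
        {kAtPagesz, 4096},   {kAtEntry, e.entry},
        {kAtUid, e.uid},     {kAtEuid, e.uid},
        {kAtGid, e.gid},     {kAtEgid, e.gid},
        {kAtPlatform, e.platform_ptr},
        {kAtSecure, 0},
        {kAtRandom, e.random_ptr},
        {kAtExecfn, e.execfn_ptr},
        {kAtNull, 0},
    };
}

// Lookup in the style of getauxval(): 0 when the type is absent.
std::uint64_t find_aux(const std::vector<AuxEntry>& v, std::uint64_t type) {
    for (const AuxEntry& a : v) {
        if (a.type == kAtNull) break;
        if (a.type == type) return a.value;
    }
    return 0;
}
```

The lookup also shows why the old behavior was survivable for musl: with only AT_NULL present, every query simply returns zero.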
A new probe binary — userspace/programs/linux-glibc-hello/hello-musl.c,
cross-compiled in an Alpine container — calls getauxval() on every
slot and prints them. End-to-end output:
AT_PAGESZ=4096
AT_PHDR=0x400040
AT_PHNUM=9
AT_ENTRY=0x4011b7
AT_RANDOM=0x80ffdb
AT_PLATFORM=x86_64
AT_SECURE=0
random[0..3]=db f7 6b 62
Same session, second sprint: futex learned the BITSET command variants
(9 and 10) that glibc 2.34+ NPTL init uses; brk stopped silently
returning current_brk on partial-extend failure (which let glibc treat
the request as honored and write into unmapped pages); rt_sigaction
started absorbing registrations for SIGCANCEL, SIGSETXID, and the SIGRTMIN..SIGRTMAX
range so glibc's startup table-fill stops hitting EINVAL; set_robust_list
and rseq got explicit dispatch cases. Plus a compile-time
LINUX_SYSCALL_TRACE flag that, when defined, prints every syscall + first
three args to serial — staged for the next interactive bisect session.
Tier 1 stayed green throughout — 13 of 14 linux-* smoke tests pass, the broader smoke
surface (sh-, exec-, fork, pipe, mmap, sigset, fdtest, pthread) holds at 38 of 39.
The single failures in each are pre-existing: linux-nl times out on a 27 KB NTFS
file (works fine on smaller ones), and spawn-bench-pipe is the deliberate fork stub.
Static glibc still wedges. The auxv landing got it past __libc_setup_tls
— the canonical first-instruction segfault is gone — but something earlier in
the kernel's exec_linux setup phase wedges the whole machine before any syscall
trace can fire. Likely the FS-base TLS path; needs the trace flag plus serial
capture in the next interactive session to pin. The roadmap moved forward; the
next syscall is identified; the wedge is bounded.
May 2, 2026 - Chapter 21
The Stack Canary, the IRQ, and the Vanishing Segment Base
Tier 2 (static glibc) cracked open the next morning. The runtime
lcompat trace on toggle from the prior session — print every Linux
syscall + first three args to serial — turned an opaque kernel-wide hang into a
neat ledger of glibc's startup sequence. The trace ran clean for five syscalls
(arch_prctl SET_FS, set_tid_address, set_robust_list,
rseq, prlimit64) and then a page fault: user RIP
0x439C2A, CR2 0x28. Disassembling the offending
instruction with objdump gave the smoking gun:
mov %fs:0x28, %rdx — glibc's stack canary read. CR2=0x28 meant the
effective address was 0+0x28: FS_BASE was zero at fault time, even
though arch_prctl(SET_FS, 0x4AB3C0) was right there in the trace
two seconds earlier.
Two cooperating bugs. First, the kernel's IRQ and exception entry stubs in
kernel/arch/x86_64/interrupts.asm dutifully reloaded
%ds %es %fs %gs with the kernel data selector on every interrupt.
Sensible in protected mode, fatal in long mode: a MOV to %fs
reloads the FS base (the value IA32_FS_BASE reports) from the GDT descriptor, which is zero for
kernel data, so any timer IRQ that fired during user mode left the next
IRETQ handing back a zeroed FS_BASE. Second, even with that closed,
switch_context didn't preserve the FS/GS bases per task. A Linux
task preempted into a context switch lost its TLS base on return.
The fix was three small files: stop touching FS/GS in IRQ entry (kernel C++
doesn't dereference via those segments anyway), add fs_base / gs_base
fields to the Task TCB, and bracket the switch_context call with
an rdmsr-then-wrmsr pair. Per-context-switch overhead:
~120 cycles. Cost of leaving the bug in place: every %fs:N read
from a Linux task was a coin flip.
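The save-and-restore bracket looks roughly like this. Since rdmsr and wrmsr are privileged, the sketch uses a fake CPU object so it can run anywhere. FakeCpu, Task, and switch_bases are invented names; only the MSR numbers (IA32_FS_BASE and IA32_GS_BASE) are the real architectural ones.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kMsrFsBase = 0xC0000100;  // IA32_FS_BASE
constexpr std::uint32_t kMsrGsBase = 0xC0000101;  // IA32_GS_BASE

// Stand-in for the real rdmsr/wrmsr instructions (illustration only).
struct FakeCpu {
    std::uint64_t fs_base = 0;
    std::uint64_t gs_base = 0;
    std::uint64_t rdmsr(std::uint32_t msr) const {
        return msr == kMsrFsBase ? fs_base : gs_base;
    }
    void wrmsr(std::uint32_t msr, std::uint64_t value) {
        (msr == kMsrFsBase ? fs_base : gs_base) = value;
    }
};

// The two per-task fields added to the TCB.
struct Task {
    std::uint64_t fs_base = 0;
    std::uint64_t gs_base = 0;
};

// Bracket around the context switch: save the outgoing task's bases,
// then load the incoming task's.
void switch_bases(FakeCpu& cpu, Task& prev, const Task& next) {
    prev.fs_base = cpu.rdmsr(kMsrFsBase);
    prev.gs_base = cpu.rdmsr(kMsrGsBase);
    cpu.wrmsr(kMsrFsBase, next.fs_base);
    cpu.wrmsr(kMsrGsBase, next.gs_base);
}
```

The read on the way out matters because arch_prctl can change the base at any time; the TCB copy is only refreshed at switch time.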
exec_linux /programs/hello-glibc, post-fix, prints
hello-glibc then argc=1 / envc=0 then exits 0. Sixteen
syscalls in the trace, all green, full glibc startup completed. Tier 2 is now
first-class. Tier 3 — the dynamic linker, ld-linux.so loading shared
objects — is the next ladder rung.
May 6, 2026 - Chapter 22
Tier 3 Cracks: ld-musl Loads at 0x7F0000000000
Four days after the FS-base fix, Tier 3 fell. The dynamic linker — the
runtime piece that walks a binary's PT_INTERP, mmaps the requested
interpreter (/lib/ld-musl-x86_64.so.1 for musl, /lib64/ld-linux-x86-64.so.2
for glibc), and hands control over to it for relocations — was the last
gateway between "JefeOS runs static binaries" and "JefeOS runs the entire dynamic
Linux ecosystem." Cracking it took three sprints in one day.
Phase 1: parse PT_INTERP, open the interpreter from NTFS, load its
LOAD segments into a fresh chunk of user virtual address space. The usual
approach is to put the interpreter high in the address space, well away from the
main executable at 0x400000; JefeOS picked 0x7F0000000000. It now
stages the interpreter there, fills AT_BASE with the load address,
and sets entry-point RIP to the interpreter's e_entry instead of
the main binary's. The kernel hands the dynamic linker an already-parsed program
header table via the existing auxv, and the linker takes over.
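The first half of Phase 1, finding the interpreter path, can be sketched as follows. This is not the JefeOS loader: find_interp is a hypothetical helper and the bounds rules are my own. The program-header layout and the PT_INTERP value are the standard ELF64 ones.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>

// ELF64 program header, laid out as in the ELF specification.
struct Elf64Phdr {
    std::uint32_t p_type;
    std::uint32_t p_flags;
    std::uint64_t p_offset;
    std::uint64_t p_vaddr;
    std::uint64_t p_paddr;
    std::uint64_t p_filesz;
    std::uint64_t p_memsz;
    std::uint64_t p_align;
};
static_assert(sizeof(Elf64Phdr) == 56, "ELF64 program headers are 56 bytes");

constexpr std::uint32_t kPtLoad = 1;
constexpr std::uint32_t kPtInterp = 3;

// Return the interpreter path named by PT_INTERP, or nullopt for a static
// binary or a malformed header. Nothing outside the file image is read.
std::optional<std::string> find_interp(const std::uint8_t* file, std::size_t file_size,
                                       const Elf64Phdr* phdrs, std::size_t phnum) {
    for (std::size_t i = 0; i < phnum; ++i) {
        const Elf64Phdr& ph = phdrs[i];
        if (ph.p_type != kPtInterp) continue;
        if (ph.p_filesz == 0 || ph.p_offset > file_size ||
            ph.p_filesz > file_size - ph.p_offset) {
            return std::nullopt;  // segment runs past the end of the file
        }
        const char* s = reinterpret_cast<const char*>(file + ph.p_offset);
        if (std::memchr(s, '\0', static_cast<std::size_t>(ph.p_filesz)) == nullptr) {
            return std::nullopt;  // path is not NUL-terminated inside the segment
        }
        return std::string(s);
    }
    return std::nullopt;
}
```

A missing PT_INTERP is not an error; it just means the binary is static and the kernel jumps to its own e_entry.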
Phase 2: stale-CR3 bug. With per-process page tables now landed (the sprint
that preceded Tier 3 because chasing it in shared address space would
have been hopeless), the dynamic linker's mmaps were landing in a fresh PML4 —
but the syscall return path was sometimes loading the parent's CR3, so the
child task would IRETQ into mapped pages that didn't exist in the active
address space. Fixed by tightening the CR3 reload in switch_context
to read from the target task's TCB, not a captured-at-fork snapshot.
Phase 3: exec_linux /programs/hello-musl-dynamic prints
hello from a dynamic musl binary and exits 0. Tier 3 GREEN. The
Linux ABI track now reads T1 ✓, T2 ✓, T3 ✓ — three of five workload tiers.
The only remaining tiers are chroot+busybox (T4) and a full Alpine init (T5),
both of which want fork(2) with copy-on-write that JefeOS still
deliberately stubs to -1.
May 15-16, 2026 - Chapter 23
Python Lands — Eight Bugs, One Interpreter
Python was the test of whether Tier 3 was real. print(2+2) looks
trivial. Getting the interpreter to survive long enough to print it took
six PRs, a stack of mmap correctness fixes, and one stack-recursion bug that
hid behind all of them.
The first wave: file-backed mmap had been hard-coded to drop the offset argument.
Every shared library mapping started at file byte 0 regardless of what the loader
requested (PR #50). Then MAP_FIXED over existing mappings stalled
instead of unmapping first (PR #49). Then file-backed mmap of a page that crossed
end-of-file zeroed the entire read length, not just the post-EOF bytes —
because musl always page-rounds LOAD segments, every dynamic binary loaded mostly
zeros (PR #53). Each bug was a one-line dispatcher mistake masked by the next bug
downstream. Fixing them in order revealed each successive one.
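The offset and end-of-file rules fit in one small function. A minimal sketch, assuming the file is already in memory; fill_from_file is a made-up name, not the JefeOS mmap path. It honors the requested offset and zeroes only the bytes past EOF.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Populate `len` bytes of a file-backed mapping that starts at `offset`
// within the file. Bytes the file can supply are copied; the rest are zero.
void fill_from_file(std::uint8_t* dst, std::size_t len,
                    const std::uint8_t* file, std::size_t file_size,
                    std::size_t offset) {
    const std::size_t avail = offset < file_size ? file_size - offset : 0;
    const std::size_t n = avail < len ? avail : len;
    if (n != 0) std::memcpy(dst, file + offset, n);  // the part backed by the file
    std::memset(dst + n, 0, len - n);                // only the post-EOF tail
}
```

Both earlier bugs are visible as one-token changes here: passing 0 instead of offset, or zeroing from dst instead of dst + n.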
Then the stack. CPython's _Py_Initialize recurses through its own
interpreter setup deep enough to need ~150 KB of stack just to boot. JefeOS gave
user tasks 64 KB. Bumping to 1 MB got us past _Py_Initialize; the lasting fix
was a demand-paged 8 MB stack with an explicit guard page (PR #62), which matches
Linux's default stack limit and is what Python frame recursion expects. The user-stack
allocator now lazy-faults pages from the top down with a guard PTE between the
stack and whatever sits below it. Per-task storage cost: one 4 KB page on first
use plus one page per fault.
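The page-fault side of that design comes down to classifying the faulting address. The sketch below is illustrative only: UserStack, classify, and the exact layout (guard page directly below the 8 MB window) are assumptions, not the JefeOS fault handler.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kPage = 4096;
constexpr std::uint64_t kStackMax = 8ull << 20;  // 8 MB window

enum class StackFault { NotStack, GrowStack, GuardHit };

// Hypothetical per-task stack descriptor; `top` is exclusive and page-aligned.
struct UserStack {
    std::uint64_t top;
};

// Decide what a page fault at `addr` means for this stack.
StackFault classify(const UserStack& s, std::uint64_t addr) {
    const std::uint64_t lowest = s.top - kStackMax;  // lowest usable stack byte
    const std::uint64_t guard = lowest - kPage;      // guard page sits just below
    if (addr >= lowest && addr < s.top) return StackFault::GrowStack;
    if (addr >= guard && addr < lowest) return StackFault::GuardHit;
    return StackFault::NotStack;
}

// Page to map when the answer is GrowStack.
std::uint64_t page_base(std::uint64_t addr) {
    return addr & ~(kPage - 1);
}
```

GrowStack maps one zeroed page at page_base(addr) and resumes the task; GuardHit is a genuine stack overflow and should end in a signal rather than a mapping.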
The validation run on master tip 187a65c:
jefe@jefeos:~$ exec_linux /programs/python3 -c 'print(2+2)'
4
jefe@jefeos:~$ exec_linux /programs/python3 --version
Python 3.11.14
Python 3.11.14 — dynamic-linked against musl, loading the standard library from
/usr/lib/python3.11 on the NTFS data disk — runs on a kernel where
the entire userspace is, by the kernel's own accounting, a 232-page-fault
cold-start exercise. The Workload Truth track now reads T3 GREEN.
May 17, 2026 - Chapter 24
The Five-PR Day: NTFS Earns Its Keep
The morning after Python landed, the VM stopped accepting SSH after two
commands. Familiar shape. Three prior sessions had blamed PR #62 (the
demand-paged stack), then trailing PRs, then NTFS state corruption, then
Hyper-V VM definition drift. All disproven by fresh-build + fresh-VM tests
that still wedged.
The breakthrough was the screenshot. The user took a photo of the Hyper-V
console: red panel, vector 0, RIP 0xFFFFFFFF8020BD9A. Vector 0 is
#DE, divide-by-zero. addr2line resolved the RIP to
read_mft_record. The faulting instruction:
div %rsi, dividing by mft_record_size. Which can be
zero. Which it was — on freshly-formatted NTFS VHDs whose
clusters_per_mft_record field is negative (the NTFS convention for
"record smaller than cluster size"), the kernel's unguarded
1 << -clusters_per_mft_record shift was undefined behavior, and
the compiler had decided to produce zero. The SSH wedge wasn't a wedge — it was
a kernel panic that killed the SSH handler task, leaving TCP listen state
lingering long enough to look like an unresponsive but live system.
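A guarded version of the record-size computation might look like this. It is a sketch, not the fix that landed: mft_record_size is a hypothetical helper and the 512-byte to 64 KB bounds are assumed sanity limits. The encoding itself is the usual NTFS one: a positive field counts clusters, a negative field means 2 to the power of its magnitude, in bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Derive the MFT record size from the boot-sector field, or reject the
// volume. A zero or implausible result never reaches a divide.
std::optional<std::uint32_t> mft_record_size(std::int8_t clusters_per_record,
                                             std::uint32_t cluster_size) {
    constexpr std::uint32_t kMinRecord = 512;    // assumed lower bound
    constexpr std::uint32_t kMaxRecord = 65536;  // assumed upper bound

    std::uint64_t size = 0;
    if (clusters_per_record > 0) {
        size = std::uint64_t{cluster_size} * static_cast<std::uint64_t>(clusters_per_record);
    } else if (clusters_per_record < 0) {
        // Negate as int so that -128 does not overflow the 8-bit type.
        const int shift = -static_cast<int>(clusters_per_record);
        if (shift > 16) return std::nullopt;  // would exceed kMaxRecord anyway
        size = std::uint64_t{1} << shift;
    }
    if (size < kMinRecord || size > kMaxRecord) return std::nullopt;
    return static_cast<std::uint32_t>(size);
}
```

A field of -10 yields the common 1024-byte record. The caller refuses the mount on nullopt, which is the behavior PR #85 is described as adding.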
That insight uncorked five PRs in one day. PR #85 validated
boot-sector geometry at mount time and refused mounts with implausible
bytes-per-sector / sectors-per-cluster / mft-record-size combinations.
PR #88 bounded the kernel-side capture buffer NUL terminator
that was writing one byte past its allocation in a parallel-discovered #PF.
PR #90 validated MFT fixup update_seq_size before walking the
fixup loop (three call sites unified). PR #92 bounds-checked
NTFS attribute walks and data-run parsing across 15+ sites. PR #96
normalized used_size reads at parse time and added two minor
follow-ups from #92's review.
Eight security findings filed, four closed same day. Five PRs landed clean
with zero regressions, every one dual-reviewed by code-reviewer + security-engineer
in parallel before merge. The user's instinct on instrumentation ("add the meter first")
paid for itself again: once the panic was visible, the cascade
of latent NTFS bugs that follows when a parser trusts on-disk fields was just
an afternoon's work.
May 18, 2026 - Chapter 25
Codex Finds the Real #116
Issue #116, the post-exec_linux cumulative SSH wedge, had been
"fixed" three times. PR #31 added LRU eviction to the SSH connection table. PR #35
added deferred-close so closed connections drained their send queue before
slot reclaim. PR #36 tracked the full TCP send-queue depth instead of the 4 KB
retransmit ring. Each "fix" pushed the wedge further into the iteration count.
None of them killed it.
Codex's static review of the SSH path found it. The wedge mechanism wasn't
SSH-layer slot exhaustion at all. The SSH connection table evicted correctly,
but the underlying tcp_connections[32] table filled up first.
tcp_close() is a no-op for sockets already in FIN_WAIT_1 / FIN_WAIT_2 /
CLOSING / LAST_ACK / TIME_WAIT — those slots wait for tcp_cleanup_stale()
to reap them on a sixty-second timer. SSH could free its slot in microseconds,
but the TCP slot stayed busy. After 20-30 fast iterations of exec_linux
the TCP table filled, new SYNs failed before SSH ever saw them, and the wedge
manifested as "SSH stops accepting after a while."
Inline comments at net.cpp:1247 and :1317 already said
"exhaust MAX_TCP_CONNECTIONS=32." Someone had the right theory and never landed
the fix. PR #98 added net::tcp_abort() — forced reclaim with optional
RST — and wired it into the two SSH eviction sites that should never have waited
for the cleanup timer. Bonus fix in the same PR: the ACK handler silently dropped
cumulative ACKs that overshot sent_unacked_total because SYN and FIN
advance seq_num but not the payload counter. Clamp added.
Test data after fix: 100 rapid version calls, 60 mixed-output
sequences, 40 connect-close cycles, 30 real elftest runs — all
green. Final ss: 29 of 32 slots free, table stays clean throughout.
Then three follow-on PRs over the next eighteen hours: PR #80
hardened DHCP with commit-after-validate + RFC 2131 server_id enforcement
(closing the two HIGH-sev findings #76 and #77); PR #106 added
the RFC 793 SND.NXT bound on the ACK clamp (closing the MEDIUM-sev blind-ACK
injection primitive #99); PR #108 finally honored
trusted-server DHCPNAK in the renewal path (closing #103). Four PRs, zero reverts,
nine follow-up issues filed and tracked. Two days of net hardening, no kernel
regressions, and the wedge that had cost four sessions of recovery work is dead.
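The ACK bound is worth spelling out, because sequence numbers wrap. The sketch below shows the RFC 793 acceptability test (SND.UNA < SEG.ACK <= SND.NXT) with modular comparisons. The helper names are mine and this is not the code from net.cpp; note that it counts sequence space, so SYN and FIN are included automatically.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Wraparound-safe comparisons on 32-bit sequence numbers.
inline bool seq_lt(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::int32_t>(a - b) < 0;
}
inline bool seq_leq(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::int32_t>(a - b) <= 0;
}

// How much sequence space a segment's ACK newly covers, or nullopt when the
// ACK is a duplicate, stale, or acknowledges something never sent.
std::optional<std::uint32_t> newly_acked(std::uint32_t snd_una, std::uint32_t snd_nxt,
                                         std::uint32_t seg_ack) {
    if (!seq_lt(snd_una, seg_ack)) return std::nullopt;   // nothing new
    if (!seq_leq(seg_ack, snd_nxt)) return std::nullopt;  // beyond SND.NXT
    return seg_ack - snd_una;
}
```

Rejecting anything past SND.NXT is what takes away the blind-ACK primitive: an attacker can no longer advance the send window by guessing a large acknowledgment number.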
May 22-25, 2026 - Chapter 26
Turning the Screws: A Month of Hardening
With the network wedge finally dead, the work turned from "make it do more" to
"make it not lie, leak, or fall over." A few days, dozens of small PRs, zero reverts.
First the logs stopped oversharing: SSH command lines and SFTP paths were being
written to the serial log in the clear, and both were redacted down to
argv[0] and a path stub (PR #210, #213). Then the ELF loader's size
limits were formalized — the ad-hoc 4 MB cap became a 32 MB
MAX_ELF_FILE_SIZE constant with a 256 MB ceiling on total
p_memsz (PR #319, #322), so a malformed program header can't ask the
kernel to map the universe.
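The two limits combine into a short pre-flight check. A sketch under assumptions: elf_sizes_ok and kMaxTotalMemsz are hypothetical names (only MAX_ELF_FILE_SIZE and the 32 MB and 256 MB figures come from the log), and the real loader presumably works on parsed program headers rather than a bare array of sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uint64_t MAX_ELF_FILE_SIZE = 32ull << 20;  // 32 MB file cap
constexpr std::uint64_t kMaxTotalMemsz = 256ull << 20;    // 256 MB across segments

// Check the file size and the sum of all p_memsz values before mapping.
// The sum is accumulated so that it cannot wrap around.
bool elf_sizes_ok(std::uint64_t file_size, const std::uint64_t* memsz, std::size_t count) {
    if (file_size > MAX_ELF_FILE_SIZE) return false;
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (memsz[i] > kMaxTotalMemsz - total) return false;  // would exceed ceiling
        total += memsz[i];
    }
    return true;
}
```

Comparing against the remaining budget, instead of adding first and checking afterwards, is what keeps a p_memsz near 2^64 from wrapping the total back under the limit.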
Then a performance chain, because correctness you can't afford to run isn't
correctness. ATA reads were capped to LBA28 256-sector transfers to dodge a Hyper-V
LBA48 bug; NTFS learned to batch contiguous-run cluster reads; memcpy,
memmove, and memset were rewritten around string instructions such as rep movsb
and the NTFS byte-loops switched over to the fast path (PR #251, #275, #278). As a
robustness check on the dynamic-linking work, the Avian JVM was brought up in
interpret mode (PR #221) — a Java virtual machine, running on JefeOS.
The month closed with a security audit and the hardware to back it. CR4.SMEP was
enabled on the boot CPU and mirrored onto the application processors (PR #344, #351,
closing the HIGH-severity #341), so the kernel can no longer be tricked into
executing user pages. And validate_user_ptr was promoted into a shared
uaccess.hpp and wrapped across 48 syscall sites (PR #346, #348), closing
the write(1, kernel_va, len) kernel-memory-disclosure primitive and
seven of its siblings in one sweep.
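The core of such a check is a range test that cannot overflow. The sketch below is not the contents of uaccess.hpp: user_range_ok is an invented name, and it assumes user space is the lower canonical half of the x86_64 address space. A real implementation would also have to cope with pages that are in range but unmapped.

```cpp
#include <cassert>
#include <cstdint>

// Assumed top of user space: end of the lower canonical half on x86_64.
constexpr std::uint64_t kUserTop = 0x0000800000000000ull;

// True when [addr, addr + len) lies entirely inside user space.
// Written so that addr + len is never computed and therefore cannot wrap.
bool user_range_ok(std::uint64_t addr, std::uint64_t len) {
    if (addr >= kUserTop) return false;  // kernel or non-canonical address
    return len <= kUserTop - addr;       // the end must not cross the boundary
}
```

With this in front of sys_write, a call like write(1, kernel_va, len) fails the first comparison and can return an error instead of copying kernel memory out.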
June 12-13, 2026 - Chapter 27
Workload Truth Gets Loud: A Python Bot, Then a Discord Bot
The Workload Truth track had a simple claim — real software runs. In June it started
talking back: over the network, on live services, in front of witnesses.
June 12: a real, unmodified Python Twitch chatbot ran on CPython 3.11. It
connected to Twitch, read twelve live chat messages, made an LLM round-trip, and did
the whole thing over HTTPS — TLS via OpenSSL, on the JefeOS kernel's own network
stack (#1060, #1062). Not a toy script: an off-the-shelf bot, doing on a
from-scratch kernel exactly what it does on Linux.
June 13 raised it. A real Node.js Discord bot came up live — a two-way Discord
gateway speaking IDENTIFY and MESSAGE_CREATE over
WebSocket-over-TLS (raw ws, not discord.js) — and the user watched a
🟢 JefeOS-Bot online announce land and a real !jefeos reply
post in #bot-spam (#1087, #1093). Two real-world network daemons, two
runtimes, both talking to the open internet through the kernel's own TLS.
Running a daemon forever surfaced one last bug: after enough always-on activity the
box would wedge. It wasn't a leak — it was a kernel stack overflow, the deep
ring-0 SSH ChaCha20 path plus a nested timer IRQ overrunning a 64 KB task stack
that happened to sit on the heap freelist. It was proven, not guessed: a DR0 hardware
watchpoint caught the write that walked off the end. The fix routed task stacks off
the freelist onto their own guarded allocation. Daemons stay up now.
June 14, 2026 - Chapter 28
RSOD as a Feature: Surviving Its Own Crashes
A kernel that runs real daemons will hit real faults. The goal for June 14 wasn't to
never crash — it was to crash honestly: survive what should be survivable,
and remember the rest.
Two tiers of crash handling landed. Tier 1: on panic, dump registers, a backtrace,
and the kernel log to the serial port — symbolizable straight through
addr2line. Tier 2: persist that crash record to a reserved LBA on the
data VHD and recover it on the next boot (#1105, #1110, closing #1089), so a red
screen of death stops being the end of a debugging session and becomes a log entry
you can read after the reboot.
The bigger win was epic #1045: every raw user-pointer dereference in the C++
Linux-ABI layer is now copy_from_user / copy_to_user
fault-survivable. A hostile or unmapped pointer from userspace returns
-EFAULT instead of taking down the kernel — the difference between a
misbehaving program and a dead machine. The same sweep deleted SSH KEX secrets that
were leaking to the serial log (HIGH #1106) and taught the DEC 21140 NIC to
recover from a wedged TX STOPPED state (#1100).
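To make the -EFAULT contract concrete, here is a small userspace model. It is a sketch of the interface only: a real kernel survives the fault through page-fault fixups, while this version simply checks the requested range against a list of mapped regions. UserAddressSpace, UserMapping, and these particular signatures are assumptions made for illustration, not JefeOS's actual API.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical model of one mapped user region.
struct UserMapping {
    std::uint64_t             base;
    std::vector<std::uint8_t> bytes;
};

struct UserAddressSpace {
    std::vector<UserMapping> mappings;

    // Returns the backing bytes if [addr, addr + len) lies inside one mapping,
    // otherwise nullptr. Written so that addr + len can never overflow.
    std::uint8_t* resolve(std::uint64_t addr, std::size_t len) {
        for (auto& m : mappings) {
            const std::uint64_t size = m.bytes.size();
            if (addr < m.base) continue;
            const std::uint64_t off = addr - m.base;
            if (off > size || len > size - off) continue;
            return m.bytes.data() + off;
        }
        return nullptr;
    }
};

// Both helpers return 0 on success and -EFAULT on a bad user range.
inline long copy_from_user(UserAddressSpace& as, void* dst,
                           std::uint64_t user_src, std::size_t len) {
    if (len == 0) return 0;
    const std::uint8_t* p = as.resolve(user_src, len);
    if (p == nullptr) return -EFAULT;
    std::memcpy(dst, p, len);
    return 0;
}

inline long copy_to_user(UserAddressSpace& as, std::uint64_t user_dst,
                         const void* src, std::size_t len) {
    if (len == 0) return 0;
    std::uint8_t* p = as.resolve(user_dst, len);
    if (p == nullptr) return -EFAULT;
    std::memcpy(p, src, len);
    return 0;
}
```

A syscall handler built on this shape never touches a user pointer directly. It copies in, checks the return value, and hands -EFAULT straight back to the caller when the copy fails.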
June 15, 2026 - Chapter 29
apk add, For Real: Alpine Packages Install and Run
The top of the Linux ABI ladder is a full Alpine userland. June 15 took the step that
makes it real: apk, Alpine's own package manager, ran under
chroot(/alpine) and actually installed and executed upstream packages.
/ # apk add tree
(1/1) Installing tree (2.2.1)
/ # tree /etc
... 39 directories, 69 files
/ # echo '{"a":1}' | jq .a
1
tree walked 39 directories and 69 files; jq — with its
oniguruma regex dependency loaded alongside it — parsed JSON and printed
1. apk add exited with status 0 and wrote the install back
into the package database: the first real apk write operation on JefeOS.
Two root causes had to fall first. The MFT ran out of records mid-install (fixed by a
contiguous grow_mft), and open() / openat() had
to learn to follow JSYML symlinks across an NTFS rename — because apk
stages a temp file and then renames it onto a library's SONAME, and the dynamic
loader has to follow that link to find the code. Real packages, installed by the real
package manager, running on the kernel.
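The rename-then-follow dance is easier to see in miniature. The toy below is an in-memory stand-in, not the NTFS code: ToyFs, Node, the hop limit of 8, and the file names in the usage note are all hypothetical, and link targets are assumed to be absolute paths. It shows the two properties that mattered: a symlink must still be a symlink after rename, and open must keep following links until it reaches a regular file or gives up with -ELOOP.

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <string>
#include <utility>

// Hypothetical node: either a regular file or a symlink.
struct Node {
    bool        is_symlink = false;
    std::string data;  // file contents, or the link target when is_symlink is true
};

struct ToyFs {
    std::map<std::string, Node> nodes;
    static constexpr int kMaxHops = 8;  // assumed limit, picked for the sketch

    // Moves a node to a new name, replacing whatever was there.
    // The node keeps its type, so a renamed symlink is still a symlink.
    int rename(const std::string& from, const std::string& to) {
        auto it = nodes.find(from);
        if (it == nodes.end()) return -ENOENT;
        Node moved = std::move(it->second);
        nodes.erase(it);
        nodes[to] = std::move(moved);
        return 0;
    }

    // Follows symlinks until a regular file is reached.
    // Returns 0 and fills contents, or -ENOENT / -ELOOP.
    int open(const std::string& path, std::string& contents) const {
        std::string cur = path;
        for (int hop = 0; hop <= kMaxHops; ++hop) {
            auto it = nodes.find(cur);
            if (it == nodes.end()) return -ENOENT;
            if (!it->second.is_symlink) {
                contents = it->second.data;
                return 0;
            }
            cur = it->second.data;  // absolute targets only in this sketch
        }
        return -ELOOP;
    }
};
```

To replay the apk pattern, create a regular file such as /lib/libdemo.so.1.2.3, stage a symlink to it under a temporary name, then rename the symlink onto /lib/libdemo.so.1. Opening /lib/libdemo.so.1 should then return the library's contents, which is what the dynamic loader depends on.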
June 16-17, 2026 - Chapter 30
chkdsk Says "No Problems": The NTFS Writer Grows Up
The hardest correctness target JefeOS has isn't one of its own tests — it's Windows'
own checker. The bar: write an NTFS directory on JefeOS, mount the disk on Windows,
run chkdsk /f, and have it find nothing wrong.
June 16 cleared it. A plain JefeOS-written directory passed chkdsk /f
clean — three back-to-back runs, each reporting "found no problems," exit 0. That
was the payoff of a long writer-correctness chain: record metadata (the indexed flag,
the parent sequence number, next_attr_id), real RTC timestamps and real
$DATA sizes, $I30 index entries inserted in
COLLATION_FILENAME order, the $MFT:$BITMAP record-slot bit, and a
truncate-path bug that had quietly leaked clusters on every flush (#1141, #1148,
#1153, #1157).
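One link in that chain, ordered insertion, fits in a few lines. This is a simplified sketch and not the writer's real code: NTFS compares UTF-16 names through the volume's $UpCase table, while this version uses ASCII std::toupper on byte strings, and a plain std::vector stands in for an index block. filename_less and insert_index_entry are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Case-insensitive "less than" on file names.
// Assumption: ASCII only. Real NTFS upcases UTF-16 via the $UpCase table.
inline bool filename_less(const std::string& a, const std::string& b) {
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](unsigned char x, unsigned char y) {
            return std::toupper(x) < std::toupper(y);
        });
}

// Inserts name at its sorted position. Returns false if a name that compares
// equal (ignoring case) is already present, and leaves the index unchanged.
inline bool insert_index_entry(std::vector<std::string>& index,
                               const std::string& name) {
    auto pos = std::lower_bound(index.begin(), index.end(), name, filename_less);
    if (pos != index.end() && !filename_less(name, *pos)) return false;
    index.insert(pos, name);
    return true;
}
```

Appending entries in arrival order would produce a directory that lists fine on JefeOS but fails a checker that expects the entries sorted. Inserting at the lower bound keeps the order intact on every write.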
June 17 finished the job: directory-index multi-block grow was enabled (#1133), so a
directory big enough to spill its index into an allocation run also passes
chkdsk — closing the three oldest filesystem tickets on the board, #44,
#64, and #1115. The merge gate earned its keep, catching a real reader regression on
legacy single-leaf index blocks before it could ship. The same stretch added a tmpfs
unlinked-but-open POSIX tombstone (#1175, closing the security issue #969) and wired
a detached-daemon teardown regression test into CI (#1176, closing #25).
The Road Ahead - Chapter 31
The Road Ahead: The OS Is the Cluster
The adventure continues — and the horizon has split in two: one near, one north-star.
The near road is the rest of the Linux story: a full Alpine init reaching an
interactive login (apk install-and-run is done — Chapter 29 — and the login prompt is
the last manual gate), a real fork(2) with copy-on-write, RFC 5961
RST/SYN challenge-ACK to close the last easy blind-injection primitive, and pulling
the Rust kernel — JefeRust — level with C++ on NTFS write, TLS/PKI, and Alpine, since
it already keeps pace on core OS, networking, and crypto. Past that sit the first real
appliances built on JefeOS: a health monitor, an SSH bastion, a secrets vault.
The north-star is bigger, and it's the reason JefeOS exists as something other than a
second Linux. It's called Xylem: instead of bolting an orchestrator
like Kubernetes on top of a cluster-blind OS, fold the cluster control plane
into the kernel. A "service" becomes a first-class kernel object with a
replica count and a supervision policy, so redundancy, scaling, and failover become
kernel verbs instead of YAML reconciled from outside. It's named after plant xylem —
a tissue that reroutes flow around a dead vessel (failover as structure),
regrows (self-healing), and carries the load.
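Purely as a thought experiment, here is what "a service with a replica count and a supervision policy" could look like as a data structure. Nothing here exists in JefeOS, and none of it comes from the Xylem design: Service, SupervisionPolicy, ExitKind, and on_replica_exit are names invented for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical supervision policies for the sketch.
enum class SupervisionPolicy { RestartAlways, RestartOnFailure, NeverRestart };
enum class ExitKind { Clean, Crashed };

// Hypothetical "service" object: the desired state lives next to the actual state.
struct Service {
    std::uint32_t     desired_replicas = 1;
    std::uint32_t     running_replicas = 0;
    SupervisionPolicy policy = SupervisionPolicy::RestartOnFailure;
};

// Called when one replica exits. Returns how many replicas the supervisor
// decides to start, and updates running_replicas to reflect that decision.
inline std::uint32_t on_replica_exit(Service& svc, ExitKind kind) {
    if (svc.running_replicas > 0) --svc.running_replicas;

    const bool restart =
        svc.policy == SupervisionPolicy::RestartAlways ||
        (svc.policy == SupervisionPolicy::RestartOnFailure &&
         kind == ExitKind::Crashed);
    if (!restart) return 0;

    // Top the service back up to its desired replica count.
    const std::uint32_t missing =
        svc.desired_replicas > svc.running_replicas
            ? svc.desired_replicas - svc.running_replicas
            : 0;
    svc.running_replicas += missing;
    return missing;
}
```

The contrast with an external orchestrator is where the decision happens. In this sketch the exit path itself decides to restart, rather than a controller noticing the gap later and reconciling from outside.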
Honest framing: almost none of Xylem is built yet. But the foundations it stands on
are — per-process page tables, the fault-survival and panic-persistence work
from Chapter 28, the from-scratch network stack, the preemptive scheduler. The pitch
is the whole point: we didn't bolt k8s onto Linux — the OS itself is the cluster,
and it runs your Linux containers too.
And then — Alpine boots on JefeOS, and JefeOS clusters itself. That's the real prize.