Yep, I did that.
I was bored, plus I randomly came across Wojbie mentioning the idea while searching for TLCO stuff in the Discord.
Basically, we get a handle to the top thread (the one running the BIOS itself), and then muck around in the insides of parallel's runUntilLimit to add a thread to its thread list.
Both methods for getting the top thread handle abuse the function environment of rednet.run (which is how RedRun works). By redefining os.pullEventRaw there, we can run code in the context of the rednet.run thread (the "rednet thread", as I'll call it from now on); since our replacement is a closure with upvalues from our program, it can hand whatever it finds back to us.
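In code, the environment trick looks roughly like this (a minimal sketch, assuming getfenv(rednet.run) hands back rednet's API environment and that it falls through to _G; all the names here are mine):

```lua
-- Shadow `os` inside rednet.run's environment only. The proxy forwards
-- everything to the real os except pullEventRaw, which we hijack once.
local env = getfenv(rednet.run)
local rednet_thread -- our upvalue, written to from the rednet thread

env.os = setmetatable({
    pullEventRaw = function(...)
        -- This body runs *on* the rednet thread: rednet.run called what it
        -- thinks is os.pullEventRaw, so the running coroutine is rednet's.
        rednet_thread = coroutine.running()
        env.os = nil -- un-shadow; rednet sees the real os table again
        return os.pullEventRaw(...)
    end,
}, { __index = os })
```

Note that rednet.run is normally parked inside the real os.pullEventRaw already, so the proxy is only hit once the next event wakes it up; queueing a dummy event and yielding is enough to make that happen.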
The original "exploit", if you can call it that, grabs a handle to the rednet thread before restoring the original os table (it temporarily replaces os in the rednet.run environment with a proxy object that redefines pullEventRaw and forwards any other access to the real os) and then awaits the event as it should. On top of this, the exploit redefines coroutine.resume globally. If we haven't checked in with the rednet thread yet, the replacement just passes through to the original coroutine.resume. Once we have the rednet thread's handle and the thread being resumed is the rednet thread, we must be running in the top thread, so we grab a handle to it with coroutine.running().
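The resume hook, in the same hypothetical style:

```lua
-- Globally hook coroutine.resume. Only the top thread ever resumes the
-- rednet thread (rednet.run sits directly under the BIOS's parallel call),
-- so whoever is running when that resume happens must be the top thread.
local top_thread
local real_resume = coroutine.resume
coroutine.resume = function(co, ...)
    if not top_thread and co == rednet_thread then
        top_thread = coroutine.running()
        coroutine.resume = real_resume -- got what we came for; restore
    end
    return real_resume(co, ...)
end
```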
After 051c70a, however, the exploit becomes a lot easier. That commit implemented the can_wrap_errors function in the cc.internal.exception library, which parallel uses to decide whether it should wrap exceptions. As part of this, parallel calls a "try barrier" function, which stores information on the stack that cc.internal.exception can later use to work out whether parallel was pcall'd. That information comes in the form of a handle to the parent thread, so we just nab it off the stack and we're good.
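Here's a sketch of that, with the caveat that the frame layout is my reading of it: run this from the rednet thread (via the pullEventRaw trick above) and walk down our own stack until we hit a thread-typed local, which should be the handle the try barrier stashed.

```lua
-- Scan our own call stack for the parent-thread handle left by parallel's
-- try barrier. Assumes the barrier's handle is the only `thread` local
-- below us, which holds up in practice for the rednet thread.
local function find_parent_thread()
    local level = 2 -- level 1 is this function itself
    while debug.getinfo(level, "f") do
        local i = 1
        while true do
            local name, value = debug.getlocal(level, i)
            if name == nil then break end
            if type(value) == "thread" then return value end
            i = i + 1
        end
        level = level + 1
    end
end
```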
Pre-88cb03b, runUntilLimit was passed a list of coroutines (called _routines) and a limit (called _limit; runUntilLimit returns once the number of living coroutines drops to _limit or below, though we don't touch it here). It stores the number of coroutines it starts with (called count), creates a table of event filters keyed by the coroutine objects themselves (called tFilters), and tracks the living coroutines (called living, initialized to count). It then loops for i = 1, count, resuming each coroutine that still exists (i.e. isn't nil) whenever the event matches its filter. Dead coroutines are replaced with nil and decrement living; once living is less than or equal to _limit, it returns. (It did this in a weird way, split across two loops for some reason, but that's not important to understanding how this works.) Post-88cb03b it works mostly the same, with some renames (_routines became threads and _limit became limit), and tFilters is gone: each coroutine and its filter now live together in a small table, and those tables are what threads holds.
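For reference, the pre-88cb03b loop is shaped roughly like this (paraphrased, not the exact CC:T source, and with its two loops folded into one):

```lua
local count = #_routines
local living = count
local tFilters = {} -- event filter for each coroutine, keyed by the coroutine
local event = { n = 0 } -- first pass resumes everything with no arguments

while true do
    for i = 1, count do
        local r = _routines[i]
        -- Resume only live coroutines whose filter matches the event
        -- (terminate always gets through).
        if r and (tFilters[r] == nil or tFilters[r] == event[1] or event[1] == "terminate") then
            local ok, filter = coroutine.resume(r, table.unpack(event, 1, event.n))
            if not ok then error(filter, 0) end
            tFilters[r] = filter
            if coroutine.status(r) == "dead" then
                _routines[i] = nil
                living = living - 1
                if living <= _limit then return i end
            end
        end
    end
    event = table.pack(os.pullEventRaw())
end
```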
Basically, we create the coroutine, resume it for the first time, store its filter wherever the version of parallel we're running expects it, and then bump count and living to account for the new coroutine.
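Put together, the injection looks something like this (hypothetical: inject is my name, the stack scan assumes the local names above, the post-88cb03b field names are guesses, and it assumes the debug library exposes getlocal and setlocal):

```lua
-- Inject `fn` as a new thread into the runUntilLimit suspended on `parent`
-- (the top-thread handle from either method above).
local function inject(parent, fn)
    local co = coroutine.create(fn)
    local ok, filter = coroutine.resume(co) -- first resume; collect its filter
    if not ok then error(filter, 0) end

    for level = 0, 100 do
        if not debug.getinfo(parent, level, "f") then break end
        -- Scan this frame's locals for runUntilLimit's state.
        local routines, filters
        local count, count_idx, living, living_idx
        local i = 1
        while true do
            local name, value = debug.getlocal(parent, level, i)
            if name == nil then break end
            if name == "_routines" or name == "threads" then routines = value
            elseif name == "tFilters" then filters = value
            elseif name == "count" then count, count_idx = value, i
            elseif name == "living" then living, living_idx = value, i
            end
            i = i + 1
        end
        if routines and count then
            if filters then
                -- pre-88cb03b: bare coroutines plus a separate filter table
                routines[count + 1] = co
                filters[co] = filter
            else
                -- post-88cb03b: coroutine and filter share a table
                -- (field names assumed)
                routines[count + 1] = { coroutine = co, filter = filter }
            end
            -- Bump count and living so the loop picks up the new thread.
            debug.setlocal(parent, level, count_idx, count + 1)
            debug.setlocal(parent, level, living_idx, living + 1)
            return co
        end
    end
    error("couldn't find runUntilLimit on the parent thread", 0)
end
```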