Forked from nicowilliams/fork-is-evil-vfork-is-good-afork-would-be-better.md
Created
May 6, 2025 20:01
Revisions
nicowilliams revised this gist
Jul 21, 2023 . 1 changed file with 1 addition and 1 deletion.
@@ -17,7 +17,7 @@ So here goes.

Long ago, I, like many Unix fans, thought that [`fork(2)`](https://en.wikipedia.org/wiki/Fork_(system_call)) and the [fork-exec process spawning model](https://en.wikipedia.org/wiki/Fork-exec) were the greatest thing, and that Windows sucked for only having [`exec*()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html) and [`_spawn*()`](https://msdn.microsoft.com/en-us/library/20y988d2.aspx), the last being a Windows-ism.

After many years of experience, I learned that [`fork(2)`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html) is in fact evil. And `vfork(2)`, [long said to be evil](https://www.google.com/?q=vfork+considered+harmful), is in fact [*goodness*](https://www.google.com/?q=vfork+is+good). A slight variant of `vfork(2)` that avoids the need to block the parent would be even better (see below).

Extraordinary statements require explanation, so allow me to explain.
nicowilliams revised this gist
Mar 2, 2022 . 1 changed file with 1 addition and 1 deletion.
@@ -2,7 +2,7 @@ I recently happened upon a very interesting implementation of [`popen()`](http:/

> This is not a paper. I assume reader familiarity with `fork()` in particular and Unix in general, though, of course, I link to relevant wiki pages, so if the unfamiliar reader is willing to go down the rabbit hole, they should be able to come out far more knowledgeable on these topics.

> This gist [got posted on Hacker News](https://news.ycombinator.com/item?id=30502392) and was on the front page for a few hours, and there is a lot of interesting commentary there. And, yes, the topic of `vfork(2)` is [always](https://tldp.org/HOWTO/Secure-Programs-HOWTO/avoid-vfork.html) [rather](https://www.openwall.com/lists/musl/2020/10/12/5) [controversial](https://ewontfix.com/7/) -- readers should know that there are those who strongly disagree with the take I put forth in this gist.

> Microsoft published a very relevant paper on this topic, [A Fork in the Road](https://www.microsoft.com/en-us/research/publication/a-fork-in-the-road/) a couple of years after I wrote this gist. I recommend it. It too was discussed [on HN](https://news.ycombinator.com/item?id=19621799).
nicowilliams revised this gist
Mar 2, 2022 . 1 changed file with 2 additions and 2 deletions.
@@ -4,12 +4,12 @@ I recently happened upon a very interesting implementation of [`popen()`](http:/

> This gist [got posted on Hacker News](https://news.ycombinator.com/item?id=30502392) and was on the front page for a few hours, and there is a lot of interesting commentary there. And, yes, the topic of `vfork(2)` is [always](https://tldp.org/HOWTO/Secure-Programs-HOWTO/avoid-vfork.html) [controversial](https://ewontfix.com/7/) -- readers should know that there are those who strongly disagree with the take I put forth in this gist.

> Microsoft published a very relevant paper on this topic, [A Fork in the Road](https://www.microsoft.com/en-us/research/publication/a-fork-in-the-road/) a couple of years after I wrote this gist. I recommend it. It too was discussed [on HN](https://news.ycombinator.com/item?id=19621799).

> Some additional links I've found that might be of interest to readers:
> - https://blog.famzah.net/tag/fork-vfork-popen-clone-performance/
> - A [very similar idea](https://developers.redhat.com/blog/2015/08/19/launching-helper-process-under-memory-and-latency-constraints-pthread_create-and-vfork) to my afork() idea, from RedHat, from 2 years before I published this gist
> - A relevant [thread](https://inbox.vuxu.org/tuhs/CAEoi9W6HFL3UcnWkKoqka8Dt16MWskKd6yEJr3HYCcCT9pMTig@mail.gmail.com/T/) about the Microsoft paper.
> - An [RHEL7 bug report](https://bugzilla.redhat.com/show_bug.cgi?id=682922) with a particularly interesting attachment named [vfork-safely.c](https://bugzilla.redhat.com/attachment.cgi?id=941229)
> - The [`vfork` tag](https://stackoverflow.com/questions/tagged/vfork) in Stack Overflow.
nicowilliams revised this gist
Mar 2, 2022 . 1 changed file with 9 additions and 2 deletions.
@@ -1,11 +1,18 @@

I recently happened upon a very interesting implementation of [`popen()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/popen.html) (different API, same idea) called [popen-noshell](https://github.com/famzah/popen-noshell) using [`clone(2)`](http://man7.org/linux/man-pages/man2/clone.2.html), and so I opened an [issue](https://github.com/famzah/popen-noshell/issues/11) requesting use of [`vfork(2)`](http://pubs.opengroup.org/onlinepubs/009695399/functions/vfork.html) or [`posix_spawn()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_spawn.html) for portability. It turns out that on Linux there's an important advantage to using `clone(2)`. I think I should capture the things I wrote there in a better place. A gist, a blog, whatever.

> This is not a paper. I assume reader familiarity with `fork()` in particular and Unix in general, though, of course, I link to relevant wiki pages, so if the unfamiliar reader is willing to go down the rabbit hole, they should be able to come out far more knowledgeable on these topics.

> This gist [got posted on Hacker News](https://news.ycombinator.com/item?id=30502392) and was on the front page for a few hours, and there is a lot of interesting commentary there. And, yes, the topic of `vfork(2)` is [always](https://tldp.org/HOWTO/Secure-Programs-HOWTO/avoid-vfork.html) [controversial](https://ewontfix.com/7/) -- readers should know that there are those who strongly disagree with the take I put forth in this gist.

> Microsoft published a [very relevant paper on this topic](https://www.microsoft.com/en-us/research/publication/a-fork-in-the-road/) a couple of years after I wrote this gist. I recommend it. It too was discussed [on HN](https://news.ycombinator.com/item?id=19621799).

> Some additional links I've found that might be of interest to readers:
> - https://blog.famzah.net/tag/fork-vfork-popen-clone-performance/
> - A [very similar idea](https://developers.redhat.com/blog/2015/08/19/launching-helper-process-under-memory-and-latency-constraints-pthread_create-and-vfork) to my afork() idea, from RedHat, from 2 years before I published this gist
> - A relevant [thread](https://inbox.vuxu.org/tuhs/CAEoi9W6HFL3UcnWkKoqka8Dt16MWskKd6yEJr3HYCcCT9pMTig@mail.gmail.com/T/)
> - An [RHEL7 bug report](https://bugzilla.redhat.com/show_bug.cgi?id=682922) with a particularly interesting attachment named [vfork-safely.c](https://bugzilla.redhat.com/attachment.cgi?id=941229)
> - The [`vfork` tag](https://stackoverflow.com/questions/tagged/vfork) in Stack Overflow.

So here goes.

Long ago, I, like many Unix fans, thought that [`fork(2)`](https://en.wikipedia.org/wiki/Fork_(system_call)) and the [fork-exec process spawning model](https://en.wikipedia.org/wiki/Fork-exec) were the greatest thing, and that Windows sucked for only having [`exec*()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html) and [`_spawn*()`](https://msdn.microsoft.com/en-us/library/20y988d2.aspx), the last being a Windows-ism.
nicowilliams revised this gist
Mar 2, 2022 . 1 changed file with 33 additions and 13 deletions.
@@ -1,5 +1,11 @@

I recently happened upon an implementation of [`popen()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/popen.html) (different API, same idea) using [`clone(2)`](http://man7.org/linux/man-pages/man2/clone.2.html), and so I opened an [issue](https://github.com/famzah/popen-noshell/issues/11) requesting use of [`vfork(2)`](http://pubs.opengroup.org/onlinepubs/009695399/functions/vfork.html) or [`posix_spawn()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_spawn.html) for portability. It turns out that on Linux there's an important advantage to using `clone(2)`. I think I should capture the things I wrote there in a better place. A gist, a blog, whatever.

> This is not a paper. I assume reader familiarity with `fork()` in particular and Unix in general, though, of course, I link to relevant wiki pages, so if the unfamiliar reader is willing to go down the rabbit hole, they should be able to come out far more knowledgeable on these topics.

> This gist [got posted on Hacker News](https://news.ycombinator.com/item?id=30502392) and was on the front page for a few hours, and there is a lot of interesting commentary there. And, yes, the topic of `vfork(2)` is always [controversial](https://ewontfix.com/7/) -- readers should know that there are those who strongly disagree with the take I put forth in this gist.

> Microsoft published a [very relevant paper on this topic](https://www.microsoft.com/en-us/research/publication/a-fork-in-the-road/) a couple of years after I wrote this gist. I recommend it. It too was discussed [on HN](https://news.ycombinator.com/item?id=19621799).

So here goes.
Long ago, I, like many Unix fans, thought that [`fork(2)`](https://en.wikipedia.org/wiki/Fork_(system_call)) and the [fork-exec process spawning model](https://en.wikipedia.org/wiki/Fork-exec) were the greatest thing, and that Windows sucked for only having [`exec*()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html) and [`_spawn*()`](https://msdn.microsoft.com/en-us/library/20y988d2.aspx), the last being a Windows-ism.

@@ -8,21 +14,25 @@ After many years of experience, I learned that [`fork(2)`](http://pubs.opengroup

Extraordinary statements require explanation, so allow me to explain.

I won't bother explaining what [`fork(2)`](https://en.wikipedia.org/wiki/Fork_(system_call)) is -- if you're reading this, I assume you know, but if not, see the linked Wikipedia page. But I'll explain `vfork(2)` and why it was said to be harmful. `vfork(2)` is very similar to `fork(2)`, but the new process it creates runs in the same address space as the parent as if it were a thread, even sharing the same stack as the thread that called `vfork(2)`! Two threads can't really share a stack, so the parent is stopped while the child does its thing: either `exec*(2)` or `_exit(2)`. The two threads do share a stack though, so the child has to be careful not to corrupt the stack for the parent.

Now, 3BSD added `vfork(2)`, and a few years later 4.4BSD removed it as it was by then considered harmful. (**I cannot find the link to the article or paper from the 80s that declared `vfork(2)` harmful, but I swear I remember seeing it.** I would appreciate a link.) Most subsequent man pages say as much. But the derivatives of 4.4BSD restored it and do not call it harmful. There's a reason for this: `vfork(2)` is *much* cheaper than `fork(2)` -- much, much, much cheaper. That's because `fork(2)` has to either copy the parent's address space, or arrange for copy-on-write (CoW), which is supposed to be an optimization to avoid unnecessary copies.
But even CoW is very expensive because it requires modifying memory mappings, doing TLB shootdowns if the parent is multi-threaded, taking expensive page faults, and so on. Modern kernels tend to seed the child with a copy of the parent's resident set, but if the parent has a large memory footprint (e.g., is a JVM), then the RSS will be huge. So `fork(2)` is inescapably expensive except for small programs with small footprints (e.g., a shell). So you begin to see why `fork(2)` is evil. And I haven't yet gotten to fork-safety perils! Fork-safety considerations are a lot like thread-safety, but it is harder to make libraries fork-safe than thread-safe. I'm not going to go into fork-safety here: it's not necessary. > Before I go on I should admit to hypocrisy: I do write code that uses `fork(2)`, often for multi-processing daemons -- as opposed to multi-threading, though I often do the latter as well. But the forks there happen very early on when nothing fork-unsafe has happened yet and the address space is _small_, thus avoiding most evils of `fork(2)`. `vfork(2)` cannot be used for this purpose. On Windows one would have to `CreateProcess()` or `_spawn()` to implement multi-processed daemons, which is a huge pain in the neck. Why did I ever think `fork(2)` was elegant then? It was the same reason that everyone else did and does: `CreateProcess*()`, `_spawn()` and `posix_spawn()`, and related APIs, are extremely complex. They _have_ to be because there is an enormous number of things one might do between `fork()` and `exec()` in, say, a shell. That complexity makes `fork()`+`exec()` _look_ good. With `fork()` and `exec()` one does not need a language or API that can express all those things: the host language will do! `fork(2)` gave the Unix's creators the ability to move all that complexity out of kernel-land into user-land, where it's much easier to develop software -- it made them more productive, perhaps much more so. 
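To make the "host language will do" point concrete, here is a minimal sketch of the classic pattern: the child runs ordinary C between `fork()` and `exec()` to set up a redirection. The helper name `run_redirected()` and the exit codes 126/127 are assumptions made for this illustration (they mimic shell convention), not anything taken from a particular shell or library.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): run argv[0], found via PATH,
 * with its stdout redirected to `path`.
 * Returns the child's exit status, or -1 on error.
 */
int
run_redirected(const char *path, char *const argv[])
{
    pid_t pid;
    int status;

    if ((pid = fork()) == -1)
        return -1;

    if (pid == 0) {
        /* Child: a private copy of the parent, so any C code may run here. */
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd == -1)
            _exit(126);
        if (fd != STDOUT_FILENO) {
            if (dup2(fd, STDOUT_FILENO) == -1)
                _exit(126);
            (void) close(fd);
        }
        execvp(argv[0], argv);
        _exit(127);             /* only reached if exec failed */
    }

    /* Parent: wait for the child, retrying if interrupted by a signal. */
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Everything between `fork()` and `execvp()` is plain C, which is exactly the flexibility that a spawn-style API has to encode as parameters or file actions instead.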
The price Unix's creators paid for that elegance was the need to copy address spaces. Since back then programs and processes were _small_ that inelegance was easy to overlook or ignore. But now processes tend to be *huge* and multi-threaded, and that makes copying even just a parent's resident set, and page table fiddling for the rest, extremely expensive. But `vfork()` has all that elegance, and none of the downsides of `fork()`! `vfork()` does have one downside: that the parent (specifically: the thread in the parent that calls `vfork()`) and child share a stack, necessitating that the parent (thread) be stopped until the child `exec()`s or `_exit()`s. (This can be forgiven due to `vfork(2)`'s long preceding threads -- when threads came along the need for a separate stack for each new thread became utterly clear and unavoidable. The fix for threading was to use a new stack for the new thread and use a callback function and argument as the `main()`-alike for that new stack.) But blocking is bad because _synchronous behavior_ is bad, especially when `vfork(2)` (or `clone(2)`, used like `vfork(2)`) is the only performant alternative to `fork(2)`, yet it could have been better. An asynchronous version of `vfork(2)` would have to run the child in a new/alternate stack, much like a thread. Let's call it `afork()`, or maybe `avfork()`. Now, `afork()` would have to look a lot like `pthread_create()`: it has to take a function to call on a new stack, as well as an argument to pass to that function. I should mention that _all_ the `vfork(2)` man pages I've seen say that the parent _process_ is stopped until the child exits/execs, but this predates threads. Linux, for example, only stops the one thread in the parent that called `vfork()`, not all threads. I believe that is the correct thing to do, but IIRC other OSes stop all threads in the parent process (which is a mistake, IMO). 
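For reference, the basic `vfork(2)` calling pattern discussed above might look roughly like this. It is a sketch under stated assumptions: `spawn_vfork()` is a name invented here, and it deliberately uses `execv()` with an absolute path so that the child does as little as possible while it borrows the parent's stack and address space.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): run the program at the
 * absolute path argv[0] using vfork() + execv().
 * Returns the child's exit status, or -1 on error.
 */
int
spawn_vfork(char *const argv[])
{
    pid_t pid;
    int status;

    pid = vfork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /*
         * Child: runs in the parent's address space, on the calling
         * thread's stack. It must only exec or _exit, and must never
         * return from this function.
         */
        execv(argv[0], argv);
        _exit(127);             /* only reached if exec failed */
    }

    /*
     * Parent: the calling thread resumes here only after the child has
     * exec'd or exited. Now reap the child, retrying on EINTR.
     */
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that no address space is copied and no page tables are marked copy-on-write here, which is where the cost advantage over `fork(2)` comes from. The price is the blocking of the calling thread between `vfork()` and the child's `execv()` or `_exit()`.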
> Some years ago I successfully talked NetBSD developers out of making `vfork(2)` stop all threads in the parent. An `afork()` would allow a `popen()` like API to return very quickly with appropriate pipes for I/O with the child(ren). If anything goes wrong on the child side then the child(ren) will exit and their output pipe (if any) will evince EOF, and/or writes to the child's input will get `EPIPE` and/or will raise `SIGPIPE`, at which point the caller of `popen()` will be able to check for errors. @@ -33,9 +43,11 @@ pid_t afork(int (*start_routine)(void *), void *arg); pid_t aforkx(int flags /* FORK_NOSIGCHLD and/or FORK_WAITPID */, int (*fn)(void *), void *arg); ``` It turns out that `afork()` is easy to implement on Linux: it's just a `clone(<function>, <stack>, CLONE_VM | CLONE_SETTLS, <argument>)` call. (One might want to request that `SIGCHLD` be sent to the parent when the child dies, but this is decidedly _not_ desirable in a `popen()` implementation, as otherwise the program might reap it before `pclose()` can reap it, then `pclose()` could not return the correct result. For [more on this](https://illumos.org/man/2/fork) go look at Illumos.) > See the comments on this gist. In particular, see @NobodyXu's [comment](https://gist.github.com/nicowilliams/a8a07b0fc75df05f684c23c18d7db234?permalink_comment_id=3467835#gistcomment-3467835) about his [`aspawn`](https://github.com/NobodyXu/aspawn)! One can also implement something like `afork()` (minus the Illumos `forkx()` flags) on POSIX systems by using `pthread_create()` to start a thread that will block in `vfork()` while the `afork()` caller continues its business. Add a `taskq` to pre-create as many such worker threads as needed, and you'll have a very fast `afork()`. 
However, an `afork()` implemented this way won't be able to return a `PID` unless the threads in the taskq pre-vfork (good idea!), instead it would need a completion callback, something like this: ``` int emulated_afork(int (*start_routine)(void *), void *arg, void (*cb)(pid_t) /* may be NULL */); @@ -175,14 +187,22 @@ avfork(int (*start_routine)(void *), void *arg) } ``` The title also says that [`clone(2)`](http://man7.org/linux/man-pages/man2/clone.2.html) is stupid. Allow me to address that. > Now, I don't mean to offend. It's not really stupid. Calling it stupid in the title is just a rhetorical device ("made you look!"). A cheap rhetorical device, yes. Not exactly a professional rhetorical device either. But this is a _gist_, not a paper, and I never expected it to go viral. In retrospect I should have used a softer word, though, if I were to contribute an `afork(2)` to Linux, I expect much stronger language might be used in discussing my patches -- it's standard operating procedure on the Linux kernel lists, so I expect no Linux kernel developer is offended here! `clone(2)` was originally added as an alternative to proper POSIX threads that could be used to implement POSIX threads. It seems to have been inspired by the [Plan 9 `rfork(2)`](https://9fans.github.io/plan9port/man/man3/rfork.html). The idea was that lots of variations on `fork()` would be nice, and as we see here, there are lots of variations on it (`forkx(2)`, `vfork(2)`, `vforkx(2)`, `rfork(2)`)! Perhaps Linux should have had a thread creation system call -- Linux might have then saved itself the pain of the first pthread implementation for Linux. (A lot of mistakes were made on the way to the [NPTL](https://en.wikipedia.org/wiki/Native_POSIX_Thread_Library).) Linux should have learned from Solaris/SVR4, where emulation of BSD sockets via `libsocket` on top of STREAMS proved to be a mistake that took a long time and much expense to fix. 
Emulating one API from another API with impedance mismatches is usually difficult at best.

Since then `clone(2)` has become a Swiss army knife -- it has evolved to have zone/jail entering features, but only sort of: Linux doesn't have proper zones/jails, instead Linux added new `clone(2)` flags to indicate namespaces that should not be shared with the parent. And as new container-related `clone(2)` flags are added that old code might wish it had used, one will have to modify and rebuild the `clone(2)`-calling world, and _that_ is decidedly not elegant.

Linux should have had first-class `fork()`, `vfork()`, `avfork()`, `thread_create()`, and `container_create()` type system calls. The fork family could have been one system call with options, but threads are not processes, and neither are containers (though containers may have processes, and may have a minder/init process). Conflating all of those onto one system call seems a bit much, though even that would be OK if there was just one flag for container entry/start/fork/whatever-metaphor-applies-to-containers. But the `clone(2)` design encourages a proliferation of flags, which means one must constantly pay attention to the possible need to add new flags at existing call sites.

Now, my friends tell me, and I read around too, that "nah, containers aren't zones/jails, they're not meant to be used like that", and I don't care for that line of argument. The world needs zones/jails and Linux containers *really* want to be zones/jails. And zones/jails need to start life maximally isolated, and sharing needs to be added explicitly from the host. Doing it the other way around is broken because every time isolation is increased one has to go patch `clone(2)` calls.

> Counterpoint: doing it the Solaris way also requires patching the call sites when new types of namespaces are added, so maybe my argument falls flat.
Perhaps zone creation should have a profile name parameter that allows patching to be applied to configuration files rather than code. That's not a good approach to security for an OS that is not integrated top-to-bottom (on Linux everything has different maintainers and communities: the kernel, the C libraries, every important system library, the shells, the init system, all the user-land programs one expects -- everything). In a world like that containers need to start maximally isolated -- in my opinion anyways. I could go on. I could talk about fork-safety. I could discuss all of the functions that are generally, or in specific cases, safe to call in a child of `fork()`, versus the child of `vfork()`, versus the child of `afork()` (if we had one), or a child of a `clone()` call (but I'd have to consider quite a few flag combinations!). I could go into why 4.4BSD removed `vfork()` (I'd have to do a bit more digging though). I think this post's length is probably just right, so I'll leave it here. -
nicowilliams revised this gist
Apr 5, 2017 . 1 changed file with 1 addition and 1 deletion.
@@ -129,7 +129,7 @@ worker_start_routine(void *arg)
    }

    pid_t
    avfork(int (*start_routine)(void *), void *arg)
    {
        static pthread_once_t once = PTHREAD_ONCE_INIT;
        struct worker_s *worker;
nicowilliams revised this gist
Mar 31, 2017 . 1 changed file with 1 addition and 1 deletion.
@@ -183,6 +183,6 @@ Since then `clone()` has become a swiss army knife -- it has evolved to have zon

Linux should have had first-class `fork()`, `vfork()`, `avfork()`, `thread_create()`, and `container_create()` system calls. The fork family could have been one system call with options, but threads are not processes, and neither are containers (though containers may have processes, and may have a minder/init process). Conflating all of those onto one system call seems a bit much, though even that would be OK if there was just one flag for container entry/start/fork/whatever-metaphor-applies-to-containers. But the `clone(2)` design, or its maintainers, encourages a proliferation of flags, which means one must constantly pay attention to the possible need to add new flags at existing call sites.

Now, my friends tell me, and I read around too, that "nah, containers aren't zones/jails, they're not meant to be used like that", and I don't care about that line of argument. The world needs zones/jails and Linux containers *really* want to be zones/jails. They do. And zones/jails need to start life maximally isolated, and sharing needs to be added explicitly from the host. Doing it the other way around is badly broken, because every time isolation is increased one has to go patch `clone(2)` calls. That's not a good approach to security for an OS that is not integrated top-to-bottom (on Linux everything has different maintainers and communities: the kernel, the C libraries, every important system library, the shells, the init system, all the user-land programs one expects -- everything). In a world like that containers need to start maximally isolated.
I could go on. I could talk about fork-safety. I could discuss all of the functions that are generally, or in specific cases, safe to call in a child of `fork()`, versus the child of `vfork()`, versus the child of `afork()` (if we had one), or a child of a `clone()` call (but I'd have to consider quite a few flag combinations!). I could go into why 4.4BSD removed `vfork()` (I'd have to do a bit more digging though). I think this post's length is probably just right, so I'll leave it here. -
nicowilliams revised this gist
Mar 31, 2017 . 1 changed file with 2 additions and 0 deletions.
@@ -183,4 +183,6 @@ Since then `clone()` has become a swiss army knife -- it has evolved to have zon

Linux should have had first-class `fork()`, `vfork()`, `avfork()`, `thread_create()`, and `container_create()` system calls. The fork family could have been one system call with options, but threads are not processes, and neither are containers (though containers may have processes, and may have a minder/init process). Conflating all of those onto one system call seems a bit much, though even that would be OK if there was just one flag for container entry/start/fork/whatever-metaphor-applies-to-containers. But the `clone(2)` design, or its maintainers, encourages a proliferation of flags, which means one must constantly pay attention to the possible need to add new flags at existing call sites.

Now, my friends tell me, and I read around too, that "nah, containers aren't zones/jails, they're not meant to be used like that", and I don't care about that line of argument. The world needs zones/jails and Linux containers *really* want to be zones/jails. They do. And zones/jails need to start life maximally isolated, and sharing needs to be added explicitly from the host. Doing it the other way around is badly broken, because every time isolation is increased one has to go patch `clone(2)` calls. That's not a good approach to security for an OS that is not integrated top-to-bottom (on Linux everything has different maintainers and communities: the kernel, the C library, every important system library, the shells, the init system, all the user-land programs one expects -- everything). In a world like that containers need to start maximally isolated.

I could go on.
I could talk about fork-safety. I could discuss all of the functions that are generally, or in specific cases, safe to call in a child of `fork()`, versus the child of `vfork()`, versus the child of `afork()` (if we had one), or a child of a `clone()` call (but I'd have to consider quite a few flag combinations!). I could go into why 4.4BSD removed `vfork()` (I'd have to do a bit more digging though). I think this post's length is probably just right, so I'll leave it here. -
nicowilliams revised this gist
Mar 31, 2017 . 1 changed file with 1 addition and 1 deletion.
@@ -10,7 +10,7 @@ Extraordinary statements require explanation, so allow me to explain.

I won't bother explaining what [`fork(2)`](https://en.wikipedia.org/wiki/Fork_(system_call)) is -- if you're reading this, I assume you know. But I'll explain `vfork(2)` and why it was said to be harmful. `vfork(2)` is very similar to `fork(2)`, but the new process it creates runs in the same address space as the parent as if it were a thread, even sharing the same stack as the thread that called `vfork(2)`! Two threads can't share a stack, so the parent is stopped while the child does its thing: either `exec*(2)` or `_exit(2)`.

Now, 3BSD added `vfork(2)`, and a few years later 4.4BSD removed it as it was by then considered harmful. Most subsequent man pages say as much. But the derivatives of 4.4BSD restored it and do not call it harmful. There's a reason for this: `vfork(2)` is *much* cheaper than `fork(2)` -- much, much cheaper. That's because `fork(2)` has to either copy the parent's address space, or arrange for copy-on-write (which is supposed to be an optimization to avoid unnecessary copies). But even COW is very expensive because it requires modifying memory mappings, taking expensive page faults, and so on. Modern kernels tend to seed the child with a copy of the parent's resident set, but if the parent has a large memory footprint (e.g., is a JVM), then the RSS will be huge. So `fork(2)` is inescapably expensive except for small programs with small footprints (e.g., a shell).

So you begin to see why `fork(2)` is evil. And I haven't yet gotten to fork-safety perils!
Fork-safety considerations are a lot like thread-safety, but it is harder to make libraries fork-safe than thread-safe. I'm not going to go into fork-safety here: it's not necessary. -
nicowilliams revised this gist
Mar 29, 2017 . 1 changed file with 2 additions and 2 deletions.
@@ -16,11 +16,11 @@ So you begin to see why `fork(2)` is evil. And I haven't yet gotten to fork-saf

(Before I go on I should admit to hypocrisy: I do write code that uses `fork(2)`, often for multi-processing daemons -- as opposed to multi-threading, though I often do the latter as well. But the forks there happen very early on when nothing fork-unsafe has happened yet and the address space is _small_, thus avoiding most evils of `fork(2)`. `vfork(2)` cannot be used for this purpose. On Windows one would have to `CreateProcess()` or `_spawn()` to implement multi-processed daemons, which is a huge pain in the neck.)

Why did I ever think `fork(2)` was elegant then? It was the same reason that everyone else did and does: `CreateProcess*()`, `_spawn()` and `posix_spawn()` and such functions are extremely complex, and they have to be because there is an enormous number of things one might do between `fork()` and `exec()` in, say, a shell. But with `fork()` and `exec()` one does not need a language or API that can express all those things: the host language will do! `fork(2)` gave the Unix's creators the ability to move all that complexity out of kernel-land into user-land, where it's much easier to develop software -- it made them more productive, perhaps much more so.

The price Unix's creators paid for that elegance was the need to copy address spaces. Since back then programs and processes were _small_ that inelegance was easy to overlook. But now processes tend to be *huge*, and that makes copying even just a parent's resident set, and page table fiddling for the rest, extremely expensive.
But `vfork()` has all that elegance, and none of the downsides of `fork()`! `vfork()` does have one downside: that the parent (specifically: the thread in the parent that calls `vfork()`) and child share a stack, necessitating that the parent (thread) be stopped until the child `exec()`s or `_exit()`s. (This can be forgiven due to `vfork(2)`'s long preceding threads -- when threads came along the need for a separate stack for each new thread became utterly clear and unavoidable. The fix for threading was to use a new stack for the new thread and use a callback function and argument as the `main()`-alike for that new stack.) But blocking is bad because _synchronous behavior_ is bad, especially when it's the only option yet it could have been better. An asynchronous version of `vfork()` would have to run the child in a new/alternate stack. Let's call it `afork()`, or `avfork()`. Now, `afork()` would have to look a lot like `pthread_create()`: it has to take a function to call on a new stack, as well as an argument to pass to that function. I should mention that _all_ the `vfork()` man pages I've seen say that the parent _process_ is stopped until the child exits/execs, but this predates threads. Linux, for example, only stops the one thread in the parent that called `vfork()`, not all threads. I believe that is the correct thing to do, but IIRC other OSes stop all threads in the parent process (which is a mistake, IMO). -
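To make the stack-sharing constraint concrete, here is a minimal sketch of the plain `vfork()`-then-`exec()` pattern. The helper name `spawn_and_wait()` and the exit code 127 for a failed `exec` are choices made for this illustration, not something from the gist. The sketch assumes a POSIX system that still provides `vfork()`. The child does nothing but `execv()` or `_exit()`, because it is borrowing the calling thread's stack and address space.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] via vfork() + execv() and return the
 * child's exit status, or -1 if the child could not be created or did not
 * exit normally.
 */
static int
spawn_and_wait(char *const argv[])
{
    pid_t pid;
    int status;

    assert(argv != NULL && argv[0] != NULL);

    pid = vfork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /*
         * Child side of vfork(): we are running on the parent's stack,
         * so do nothing here except exec or _exit().
         */
        execv(argv[0], argv);
        _exit(127); /* exec failed; 127 is just the convention used here */
    }

    /*
     * The calling thread only gets here after the child has exec'd or
     * exited, which is the synchronous behavior discussed above.
     */
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called with `{ "/bin/sh", "-c", "exit 7", NULL }` this returns 7, and a path that cannot be executed comes back as the 127 chosen above.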
nicowilliams revised this gist
Mar 29, 2017 . 1 changed file with 3 additions and 3 deletions.
@@ -2,7 +2,7 @@ I recently happened upon an implementation of [`popen()`](http://pubs.opengroup.
So here goes.
Long ago, I, like many Unix fans, thought that [`fork(2)`](https://en.wikipedia.org/wiki/Fork_(system_call)) and the [fork-exec process spawning model](https://en.wikipedia.org/wiki/Fork-exec) were the greatest thing, and that Windows sucked for only having [`exec*()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html) and [`_spawn*()`](https://msdn.microsoft.com/en-us/library/20y988d2.aspx), the last being a Windows-ism.
After many years of experience, I learned that [`fork(2)`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html) is in fact evil. And `vfork(2)`, [long said to be evil](https://www.google.com/#q=vfork+considered+harmful), is in fact [*goodness*](https://www.google.com/#q=vfork+is+good). A slight variant of `vfork(2)` that avoids the need to block the parent would be even better (see below).
@@ -14,9 +14,9 @@ Now, 4.3BSD added `vfork(2)`, and a few years later 4.4BSD removed it as it was
So you begin to see why `fork(2)` is evil. And I haven't yet gotten to fork-safety perils! Fork-safety considerations are a lot like thread-safety, but it is harder to make libraries fork-safe than thread-safe. I'm not going to go into fork-safety here: it's not necessary.
(Before I go on I should admit to hypocrisy: I do write code that uses `fork(2)`, often for multi-processing daemons -- as opposed to multi-threading, though I often do the latter as well. But the forks there happen very early on when nothing fork-unsafe has happened yet and the address space is _small_, thus avoiding most evils of `fork(2)`. `vfork(2)` cannot be used for this purpose. On Windows one would have to use `CreateProcess()` or `_spawn()` to implement multi-processed daemons, which is a huge pain in the neck.)
Why did I ever think `fork(2)` was elegant then? It was the same reason that everyone else did and does: `CreateProcess*()`, `_spawn()`, `posix_spawn()`, and similar functions are extremely complex, and they have to be because there is an enormous number of things one might do between `fork()` and `exec()` in, say, a shell, but with `fork()` and `exec()` one does not need a language or API that can express all those things. It gave Unix's creators the ability to move all that complexity into user-land, where it's much easier to develop -- it made them more productive, perhaps much more so. The price Unix's creators paid for that elegance was the need to copy address spaces -- back then programs and processes were _small_, but now they're *huge*, and that makes copying even just a parent's resident set, and page table fiddling for the rest, extremely expensive.
But `vfork()` has all that elegance, and none of the downsides of `fork()`!
nicowilliams revised this gist
Mar 29, 2017 . 1 changed file with 2 additions and 0 deletions.
@@ -14,6 +14,8 @@ Now, 4.3BSD added `vfork(2)`, and a few years later 4.4BSD removed it as it was
So you begin to see why `fork(2)` is evil. And I haven't yet gotten to fork-safety perils! Fork-safety considerations are a lot like thread-safety, but it is harder to make libraries fork-safe than thread-safe. I'm not going to go into fork-safety here: it's not necessary.
(Before I go on I should admit to hypocrisy: I do write code that uses `fork(2)`, often for multi-processing daemons (as opposed to multi-threading, though I often do the latter as well), but the forks there happen very early on when nothing fork-unsafe has happened yet and the address space is _small_.)
Why did I ever think `fork(2)` was elegant then? It was the same reason that everyone else did and does: `spawn()`, `posix_spawn()`, and similar functions are extremely complex, and they have to be because there is an enormous number of things one might do between `fork()` and `exec()` in, say, a shell, but with `fork()` and `exec()` one does not need a language or API that can express all those things. It gave Unix's creators the ability to move all that complexity into user-land, where it's much easier to develop -- it made them more productive, perhaps much more so. The price Unix's creators paid for that elegance was the need to copy address spaces -- back then programs and processes were _small_, but now they're *huge*, and that makes copying even just a parent's resident set, and page table fiddling for the rest, extremely expensive.
But `vfork()` has all that elegance, and none of the downsides of `fork()`!
nicowilliams revised this gist
Mar 28, 2017 . 1 changed file with 8 additions and 1 deletion.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -18,7 +18,9 @@ Why did I ever think `fork(2)` was elegant then? It was the same reason that ev But `vfork()` has all that elegance, and none of the downsides of `fork()`! `vfork()` does have one bit of inelegance: that the parent (specifically: the thread in the parent that calls `vfork()`) and child share a stack, necessitating that the parent (thread) be stopped until the child `exec()`s or `_exit()`s. (This can be forgiven due to `vfork(2)`'s long preceding threads -- when threads came along the need for a separate stack for each new thread became utterly clear and unavoidable. The fix for threading was to use a new stack for the new thread and use a callback function and argument as the `main()`-alike for that new stack.) But blocking is bad because _synchronous behavior_ is bad, especially when it's the only option yet it could have been better. An asynchronous version of `vfork()` would have to run the child in a new/alternate stack. Let's call it `afork()`, or `avfork()`. Now, `afork()` would have to look a lot like `pthread_create()`: it has to take a function to call on a new stack, as well as an argument to pass to that function. I should mention that _all_ the `vfork()` man pages I've seen say that the parent _process_ is stopped until the child exits/execs, but this predates threads. Linux, for example, only stops the one thread in the parent that called `vfork()`, not all threads. I believe that is the correct thing to do, but IIRC other OSes stop all threads in the parent process (which is a mistake, IMO). An `afork()` would allow a `popen()` like API to return very quickly with appropriate pipes for I/O with the child(ren). 
If anything goes wrong on the child side then the child(ren) will exit and their output pipe (if any) will evince EOF, and/or writes to the child's input will get `EPIPE` and/or will raise `SIGPIPE`, at which point the caller of `popen()` will be able to check for errors. @@ -40,6 +42,11 @@ int emulated_afork(int (*start_routine)(void *), void *arg, void (*cb)(pid_t) /* If the threads pre-vfork, then a PID-returning `afork()` can be implemented, though communicating a task to a pre-vforked thread might be tricky: `pthread_cond_wait()` might not work in the child, so one would have to use a pipe into which to write a pointer to the dispatched request. (Pipes _are_ safe to use on the child side of `vfork()`. That is, `read()` and `write()` on pipes are safe in the child of `vfork()`.) Here's how that would work: ``` // This only works if vfork() only stops the one thread in the // parent that called vfork(), not all threads. E.g., as on Linux. // Otherwise this fails outright and there is no way to implement // avfork(). Of course, on Linux one can just use clone(2). static struct avfork_taskq_s { /* elided */ ... } *avfork_taskq; static void -
nicowilliams revised this gist
Mar 27, 2017 . 1 changed file with 5 additions and 1 deletion.
@@ -166,7 +166,11 @@ afork(int (*start_routine)(void *), void *arg) } ```
The title also says that [`clone(2)`](http://man7.org/linux/man-pages/man2/clone.2.html) is stupid. Allow me to address that.
`clone(2)` was originally added as an alternative to proper POSIX threads that could be used to implement POSIX threads. The idea was that lots of variations on `fork()` would be nice, and as we see here, that's actually true as to `avfork()`! `avfork()` was not the motivation, however. A lot of mistakes were made on the way to the [NPTL](https://en.wikipedia.org/wiki/Native_POSIX_Thread_Library). Linux should have had a thread creation system call -- it would then have saved itself the pain of the first pthread implementation for Linux. Linux should have learned from Solaris/SVR4, where emulation of BSD sockets via `libsocket` on top of STREAMS proved to be a very long and costly mistake. Emulating one API on top of another when there are impedance mismatches between them is difficult at best.
Since then `clone()` has become a Swiss Army knife -- it has evolved to have zone/jail entering features, but only sort of: Linux doesn't have proper zones/jails, instead adding new "namespaces" and new `clone(2)` flags to go with them. And as new container-related `clone(2)` flags are added that old code might wish it had used, one will have to modify and rebuild the `clone(2)`-calling world, and _that_ is decidedly not elegant.
Linux should have had first-class `fork()`, `vfork()`, `avfork()`, `thread_create()`, and `container_create()` system calls.
The fork family could have been one system call with options, but threads are not processes, and neither are containers (though containers may have processes, and may have a minder/init process). Conflating all of those onto one system call seems a bit much, though even that would be OK if there was just one flag for container entry/start/fork/whatever-metaphor-applies-to-containers. But the `clone(2)` design, or its maintainers, encourages a proliferation of flags, which means one must constantly pay attention to the possible need to add new flags at existing call sites. -
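For readers wondering what an `avfork()` might look like on Linux today, here is a rough sketch built on glibc's `clone()` wrapper with `CLONE_VM` and without `CLONE_VFORK`, so the child shares the address space but runs on its own stack and the caller is not blocked. Everything named here (`avfork_sketch()`, `avfork_reap()`, `struct avfork_child`, `exec_argv_child()`, the 64 KiB stack size) is hypothetical and chosen for illustration. It is Linux-only, and it glosses over real hazards: the child shares the caller's thread-local storage (including `errno`), and the stack must stay mapped until the child has exec'd or exited, which this sketch handles by only unmapping it after `waitpid()`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Arbitrary child stack size picked for this sketch. */
#define AVFORK_STACK_SIZE (64 * 1024)

/* Hypothetical bookkeeping for one child; not an existing API. */
struct avfork_child {
    pid_t pid;
    void *stack; /* must stay mapped until the child has exec'd or exited */
};

/* Example start routine: arg is an argv vector; exec it or give up. */
static int
exec_argv_child(void *arg)
{
    char *const *argv = arg;

    execv(argv[0], argv);
    _exit(127); /* exec failed */
}

/* Start start_routine(arg) in a child that shares our address space. */
static int
avfork_sketch(int (*start_routine)(void *), void *arg,
              struct avfork_child *out)
{
    void *stack;
    pid_t pid;

    assert(start_routine != NULL && out != NULL);

    stack = mmap(NULL, AVFORK_STACK_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    /* Assumes a downward-growing stack, so pass the top of the mapping. */
    pid = clone(start_routine, (char *)stack + AVFORK_STACK_SIZE,
                CLONE_VM | SIGCHLD, arg);
    if (pid == -1) {
        int saved_errno = errno;

        (void) munmap(stack, AVFORK_STACK_SIZE);
        errno = saved_errno;
        return -1;
    }
    out->pid = pid;
    out->stack = stack;
    return 0;
}

/* Wait for the child, then release its stack. Returns its exit status. */
static int
avfork_reap(struct avfork_child *child)
{
    int status;

    while (waitpid(child->pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    (void) munmap(child->stack, AVFORK_STACK_SIZE);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The start routine follows the same discipline as a `vfork()` child: async-signal-safe calls only, ending in an `exec` or `_exit()`.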
nicowilliams revised this gist
Mar 27, 2017 . 1 changed file with 2 additions and 2 deletions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -166,8 +166,8 @@ afork(int (*start_routine)(void *), void *arg) } ``` The title also says that [`clone(2)`](http://man7.org/linux/man-pages/man2/clone.2.html) is stupid. Allow me to address that. `clone(2)` was originally added as an alternative to proper POSIX threads that could be used to implement POSIX threads. The idea was that lots of variations on `fork()` would be nice, and as we see here, that's actually true as to `avfork()`! `avfork()` was not the motivation, however. A lot of mistakes were made on the way to the [NPTL](https://en.wikipedia.org/wiki/Native_POSIX_Thread_Library). Linux should have had a thread creation system call -- it would have then saved itself the pain of the first pthread implementation for Linux. Linux should have learned from Solaris/SVR4, where emulation of BSD sockets via `libsocket` on top of STREAMS proved to be a very long and costly mistake. Since then `clone()` has become a swiss army knife -- it has evolved to have zone/jail entering features, but only sort of: Linux doesn't have proper zones/jails, instead adding new "namespaces" and new `clone(2)` flags to go with them. And as new container-related `clone(2)` flags are added that old code might wish it had used them... one will have to modify and rebuild the `clone(2)`-calling world, and _that_ is decidedly not elegant. Linux should have had first-class `fork()`, `vfork()`, `avfork()`, `thread_create()`, and `container_create()` system calls. The fork family could have been one system call with options, but threads are not processes, and neither are containers (though containers may have processes, and may have a minder/init process). 
Conflating all of those onto one system call seems a bit much, though even that would be OK if there was just one flag for container entry/start/fork/whatever-metaphor-applies-to-containers. But the `clone(2)` design, or its maintainers, encourages a proliferation of flags, which means one must constantly pay attention to the possible need to add new flags at existing call sites. I could go on. I could talk about fork-safety. I could discuss all of the functions that are generally, or in specific cases, safe to call in a child of `fork()`, versus the child of `vfork()`, versus the child of `afork()` (if we had one), or a child of a `clone()` call (but I'd have to consider quite a few flag combinations!). I could go into why 4.4BSD removed `vfork()` (I'd have to do a bit more digging though). I think this post's length is probably just right, so I'll leave it here. -
nicowilliams revised this gist
Mar 27, 2017 . 1 changed file with 8 additions and 7 deletions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -42,6 +42,14 @@ If the threads pre-vfork, then a PID-returning `afork()` can be implemented, tho ``` static struct avfork_taskq_s { /* elided */ ... } *avfork_taskq; static void avfork_taskq_init(void) { // Elided, left as exercise for the reader ... } // Other taskq functions called below also elided here // taskq thread create start routine static void * worker_start_routine(void *arg) @@ -111,13 +119,6 @@ worker_start_routine(void *arg) return NULL; } pid_t afork(int (*start_routine)(void *), void *arg) { -
nicowilliams revised this gist
Mar 27, 2017 . 1 changed file with 2 additions and 2 deletions.
@@ -51,7 +51,7 @@ worker_start_routine(void *arg)
    // Register the worker and pthread_cond_signal() up to one thread
    // that might be waiting for a worker.
    avfork_taskq_add_worker(avfork_taskq, me);

    do {
        if ((job = calloc(1, sizeof(*job))) == NULL ||
            pipe2(job->dispatch_pipe, O_CLOEXEC) == -1 ||
@@ -106,7 +106,7 @@ worker_start_routine(void *arg)
        // Do the job
        _exit(job->descr->start_routine(job->descr->arg));
    } while(!avfork_taskq->terminated); // Perhaps this gets set via atexit()

    return NULL;
}
nicowilliams revised this gist
Mar 27, 2017 . 1 changed file with 9 additions and 5 deletions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -40,7 +40,7 @@ int emulated_afork(int (*start_routine)(void *), void *arg, void (*cb)(pid_t) /* If the threads pre-vfork, then a PID-returning `afork()` can be implemented, though communicating a task to a pre-vforked thread might be tricky: `pthread_cond_wait()` might not work in the child, so one would have to use a pipe into which to write a pointer to the dispatched request. (Pipes _are_ safe to use on the child side of `vfork()`. That is, `read()` and `write()` on pipes are safe in the child of `vfork()`.) Here's how that would work: ``` static struct avfork_taskq_s { /* elided */ ... } *avfork_taskq; // taskq thread create start routine static void * @@ -57,7 +57,7 @@ worker_start_routine(void *arg) pipe2(job->dispatch_pipe, O_CLOEXEC) == -1 || pipe2(job->ready_pipe, O_CLOEXEC) == -1 || (pid = vfork()) == -1) { avfork_taskq_remove(avfork_taskq, me, errno); // We're out! break; } if (pid != 0) { @@ -70,9 +70,11 @@ worker_start_routine(void *arg) // later. // This also marks this worker as available and signals // up to one thread that might be waiting for a worker. avfork_taskq_record_child(avfork_taskq, me, job, pid); if (avfork_taskq_too_big_p(avfork_taskq)) break; // Dynamically shrink the taskq continue; } @@ -89,6 +91,8 @@ worker_start_routine(void *arg) // to implement posix_spawn(), a better popen(), better system(), // and so on. // Note too that the child does not refer to the taskq at all. // Get a job if (net_read(me->dispatch_pipe[0], &job->descr, sizeof(job->descr)) != sizeof(job->descr)) { job->errno = errno ? 
errno : EINVAL; @@ -134,7 +138,7 @@ afork(int (*start_routine)(void *), void *arg) job.start_routine = start_routine; job.arg = arg; worker = avfork_taskq_get_worker(avfork_taskq); // Lockless when possible; starts a worker if needed // Send the worker our job. If we're lucky, we only wait for an already // pre-vfork()ed child to read our job and indicate readiness. If we're -
nicowilliams revised this gist
Mar 27, 2017 . 1 changed file with 5 additions and 3 deletions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -146,15 +146,17 @@ afork(int (*start_routine)(void *), void *arg) // Perhaps it could grow without bounds when demand is great, then shrink // when demand is low (see worker_start_routine()). if (net_write(worker->dispatch_pipe[1], &jobp, sizeof(jobp)) != sizeof(jobp) || net_read(worker->ready_pipe[0], &c, sizeof(c)) != sizeof(c)) job.errno = errno ? errno : EINVAL; // Cleanup (void) close(worker->dispatch_pipe[0]); (void) close(worker->dispatch_pipe[1]); (void) close(worker->ready_pipe[0]); (void) close(worker->ready_pipe[1]); if (job.errno) return -1; return job.pid; // when the read returns the PID is in pid } ``` -
nicowilliams revised this gist
Mar 27, 2017 . 1 changed file with 20 additions and 4 deletions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -78,6 +78,17 @@ worker_start_routine(void *arg) // This is the child // Notice that only read(2), write(2), _exit(2), and the start_routine // from the avfork() call are called here. The avfork() start_routine() // should only call async-signal-safe functions and should not call // anything that's not safe on the child-side of vfork(). Depending // on the OS or C library it may not be possible to use some or any // kind of locks, condition variables, allocators, RTLD, etc... At least // dup2(2), close(2), sigaction(2), signal masking functions, exec(2), // and _exit(2) are safe to call in start_routine(), and that's enough // to implement posix_spawn(), a better popen(), better system(), // and so on. // Get a job if (net_read(me->dispatch_pipe[0], &job->descr, sizeof(job->descr)) != sizeof(job->descr)) { job->errno = errno ? errno : EINVAL; @@ -134,11 +145,16 @@ afork(int (*start_routine)(void *), void *arg) // workers, and to grow dynamically so that at first there are no workers. // Perhaps it could grow without bounds when demand is great, then shrink // when demand is low (see worker_start_routine()). if (net_write(worker->dispatch_pipe[1], &jobp, sizeof(jobp)) != sizeof(jobp) || net_read(worker->ready_pipe[0], &c, sizeof(c)) != sizeof(c)) { return errno ? errno : EINVAL; // Cleanup (void) close(worker->dispatch_pipe[0]); (void) close(worker->dispatch_pipe[1]); (void) close(worker->ready_pipe[0]); (void) close(worker->ready_pipe[1]); errno = 0; return job.pid; // when the read returns the PID is in pid } ``` -
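The listing above calls `net_read()` and `net_write()` without defining them. One plausible reading, and it is only an assumption, is that they retry until the whole buffer has been transferred, so that a pointer-sized message is never split across short reads or writes. The names `read_all()` and `write_all()` below are hypothetical stand-ins written under that assumption. They use only `read(2)` and `write(2)`, which keeps them usable on the child side of `vfork()`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to len bytes, retrying on EINTR and short reads.
 * Returns the number of bytes read (less than len only at EOF),
 * or -1 on error with errno set.
 */
static ssize_t
read_all(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t done = 0;

    assert(buf != NULL || len == 0);
    while (done < len) {
        ssize_t n = read(fd, p + done, len - done);

        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break; /* EOF: the peer closed its end */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/*
 * Write exactly len bytes, retrying on EINTR and short writes.
 * Returns len on success, or -1 on error with errno set.
 */
static ssize_t
write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t done = 0;

    assert(buf != NULL || len == 0);
    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);

        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

With helpers like these, comparing the return value against `sizeof(jobp)` distinguishes a complete message from an error or an early EOF.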
nicowilliams revised this gist
Mar 27, 2017 . 1 changed file with 1 addition and 1 deletion.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -143,7 +143,7 @@ afork(int (*start_routine)(void *), void *arg) } ``` The title also says that [`clone(2)`](http://man7.org/linux/man-pages/man2/clone.2.html) is stupid. Allow me to address that. `clone(2)` was originally added as an alternative to proper POSIX threads that could be used to implement POSIX threads. The idea was that lots of variations on `fork()` would be nice, and as we see here, that's actually true as to `avfork()`! `avfork()` was not the motivation, however. A lot of mistakes were made on the way to the [NPTL](https://en.wikipedia.org/wiki/Native_POSIX_Thread_Library). Linux should have had a thread creation system call -- it would have then saved itself the pain of the first pthread implementation for Linux. Linux should have learned from Solaris/SVR4, where emulation of BSD sockets on top of STREAMS proved to be a very long, costly mistake. Since then `clone()` has become a swiss army knife -- it has evolved to have zone/jail entering features, but only sort of: Linux doesn't have proper zones/jails, instead adding new "namespaces" and new `clone(2)` flags to go with them. And as new container-related `clone(2)` flags are added that old code might wish it had used them... one will have to modify and rebuild the `clone(2)`-calling world, and _that_ is decidedly not elegant. Linux should have had first-class `fork()`, `vfork()`, `avfork()`, `thread_create()`, and container-enter/create system calls. The fork family could have been one system call with options, but threads are not processes, and neither are containers (though containers may have processes, and may have a minder/init process). 
Conflating all of those onto one system call seems a bit much, though even that would be OK if there was just one flag for container entry/start/fork/whatever-metaphor-applies-to-containers. But the `clone(2)` design, or its maintainers, encourages a proliferation of flags, which means one must constantly pay attention to the possible need to add new flags at existing call sites. -
nicowilliams revised this gist
Mar 27, 2017 . 1 changed file with 2 additions and 2 deletions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -22,7 +22,7 @@ But `vfork()` has all that elegance, and none of the downsides of `fork()`! An `afork()` would allow a `popen()` like API to return very quickly with appropriate pipes for I/O with the child(ren). If anything goes wrong on the child side then the child(ren) will exit and their output pipe (if any) will evince EOF, and/or writes to the child's input will get `EPIPE` and/or will raise `SIGPIPE`, at which point the caller of `popen()` will be able to check for errors. One might as well borrow the [Illumos](https://wiki.illumos.org/display/illumos/illumos+Home) [`forkx()`/`vforkx()`](https://illumos.org/man/2/forkx) flags, and make `afork()` look like this: ``` pid_t afork(int (*start_routine)(void *), void *arg); @@ -99,7 +99,7 @@ worker_start_routine(void *arg) static void avfork_taskq_init(void) { // Elided, left as exercise for the reader ... } -
nicowilliams revised this gist
Mar 27, 2017 . 1 changed file with 1 addition and 1 deletion.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -18,7 +18,7 @@ Why did I ever think `fork(2)` was elegant then? It was the same reason that ev But `vfork()` has all that elegance, and none of the downsides of `fork()`! `vfork()` does have one bit of inelegance: that the parent and child share a stack, necessitating that the parent be stopped until the child `exec()`s or `_exit()`s. (This can be forgiven due to `vfork(2)`'s long preceding threads -- when threads came along the need for a separate stack for each new thread became utterly clear and unavoidable. The fix for threading was to use a new stack for the new thread and use a callback function and argument as the `main()`-alike for that new stack.) But blocking is bad because _synchronous behavior_ is bad, especially when it's the only option yet it could have been better. An asynchronous version of `vfork()` would have to run the child in a new/alternate stack. Let's call it `afork()`, or `avfork()`. Now, `afork()` would have to look a lot like `pthread_create()`: it has to take a function to call on a new stack, as well as an argument to pass to that function. An `afork()` would allow a `popen()` like API to return very quickly with appropriate pipes for I/O with the child(ren). If anything goes wrong on the child side then the child(ren) will exit and their output pipe (if any) will evince EOF, and/or writes to the child's input will get `EPIPE` and/or will raise `SIGPIPE`, at which point the caller of `popen()` will be able to check for errors. -
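The error-reporting idea in the `popen()` paragraph above can be illustrated with today's blocking `vfork()`. The sketch below is a hypothetical read-only `popen()`-like helper (the name `popen_read_sketch()` and the 127 exit code are inventions for this example): the child's stdout is a pipe, the parent keeps only the read end, and any failure on the child side shows up later as EOF on the pipe plus a non-zero status from `waitpid()`. A real `afork()`-based version would return without waiting for the child to reach `exec`, which this sketch cannot do.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: start argv[0] with its stdout connected to a pipe.
 * Returns the read end of the pipe and stores the child's PID in *pidp,
 * or returns -1 on failure.
 */
static int
popen_read_sketch(char *const argv[], pid_t *pidp)
{
    int fds[2];
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL && pidp != NULL);

    if (pipe(fds) == -1)
        return -1;

    pid = vfork();
    if (pid == -1) {
        int saved_errno = errno;

        (void) close(fds[0]);
        (void) close(fds[1]);
        errno = saved_errno;
        return -1;
    }

    if (pid == 0) {
        /*
         * Child: async-signal-safe calls only. The descriptor table is
         * the child's own copy, so these dup2()/close() calls do not
         * disturb the parent.
         */
        if (fds[1] != STDOUT_FILENO) {
            if (dup2(fds[1], STDOUT_FILENO) == -1)
                _exit(127);
            (void) close(fds[1]);
        }
        (void) close(fds[0]);
        execv(argv[0], argv);
        _exit(127); /* exec failed; the parent will see EOF and status 127 */
    }

    /* Parent: drop the write end so EOF arrives once the child is done. */
    (void) close(fds[1]);
    *pidp = pid;
    return fds[0];
}
```

The caller reads until EOF and then checks the status with `waitpid()`; an immediate EOF followed by status 127 means the `exec` itself failed.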
nicowilliams revised this gist
Mar 27, 2017 . 1 changed file with 16 additions and 4 deletions.
@@ -40,6 +40,8 @@ int emulated_afork(int (*start_routine)(void *), void *arg, void (*cb)(pid_t) /*

If the threads pre-vfork, then a PID-returning `afork()` can be implemented, though communicating a task to a pre-vforked thread might be tricky: `pthread_cond_wait()` might not work in the child, so one would have to use a pipe into which to write a pointer to the dispatched request. (Pipes _are_ safe to use on the child side of `vfork()`. That is, `read()` and `write()` on pipes are safe in the child of `vfork()`.) Here's how that would work:

```
static struct avfork_taskq_s { /* elided */ ... } taskq;

// taskq thread create start routine
static void *
worker_start_routine(void *arg)
@@ -78,12 +80,12 @@ worker_start_routine(void *arg)
        // Get a job
        if (net_read(me->dispatch_pipe[0], &job->descr, sizeof(job->descr)) != sizeof(job->descr)) {
            job->errno = errno ? errno : EINVAL;
            _exit(1);
        }
        job->descr->pid = getpid(); // Save the pid where a thread in the parent can see it
        if(net_write(me->ready_pipe[1], "", sizeof("")) != sizeof("")) {
            job->errno = errno;
            _exit(1);
        }
@@ -94,9 +96,17 @@ worker_start_routine(void *arg)
    return NULL;
}

static void
avfork_taskq_init(void)
{
    // Elided
    ...
}

pid_t
afork(int (*start_routine)(void *), void *arg)
{
    static pthread_once_t once = PTHREAD_ONCE_INIT;
    struct worker_s *worker;
    struct job_descr_s job;
    struct job_descr_s *jobp = &job;
@@ -106,7 +116,7 @@ afork(int (*start_routine)(void *), void *arg)
    // looks like. It might grow up to N worker threads, and thereafter
    // if there are no available workers then taskq.get_worker() blocks
    // in a pthread_cond_wait() until a worker is ready.
    pthread_once(&once, avfork_taskq_init);

    // Describe the job
    memset(&job, 0, sizeof(job));
@@ -133,6 +143,8 @@ afork(int (*start_routine)(void *), void *arg)
}
```

The title also says that `clone()` is stupid. `clone(2)` was originally added as an alternative to proper POSIX threads that could be used to implement POSIX threads. The idea was that lots of variations on `fork()` would be nice, and as we see here, that's actually true as to `avfork()`! `avfork()` was not the motivation, however. A lot of mistakes were made on the way to the [NPTL](https://en.wikipedia.org/wiki/Native_POSIX_Thread_Library). Linux should have had a thread creation system call -- it would have then saved itself the pain of the first pthread implementation for Linux. Linux should have learned from Solaris/SVR4, where emulation of BSD sockets on top of STREAMS proved to be a very long, costly mistake. Since then `clone()` has become a swiss army knife -- it has evolved to have zone/jail entering features, but only sort of: Linux doesn't have proper zones/jails, instead adding new "namespaces" and new `clone(2)` flags to go with them. And as new container-related `clone(2)` flags are added that old code might wish it had used them... one will have to modify and rebuild the `clone(2)`-calling world, and _that_ is decidedly not elegant.

Linux should have had first-class `fork()`, `vfork()`, `avfork()`, `thread_create()`, and container-enter/create system calls. The fork family could have been one system call with options, but threads are not processes, and neither are containers (though containers may have processes, and may have a minder/init process). Conflating all of those onto one system call seems a bit much, though even that would be OK if there was just one flag for container entry/start/fork/whatever-metaphor-applies-to-containers. But the `clone(2)` design, or its maintainers, encourages a proliferation of flags, which means one must constantly pay attention to the possible need to add new flags at existing call sites.

I could go on. I could talk about fork-safety. I could discuss all of the functions that are generally, or in specific cases, safe to call in a child of `fork()`, versus the child of `vfork()`, versus the child of `afork()` (if we had one), or a child of a `clone()` call (but I'd have to consider quite a few flag combinations!). I could go into why 4.4BSD removed `vfork()` (I'd have to do a bit more digging though). I think this post's length is probably just right, so I'll leave it here.
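The parenthetical above claims that `read()` and `write()` on pipes are safe in the child of `vfork()`. The following is a minimal sketch of that handshake in isolation, not the gist's `afork()`: the helper name `vfork_handshake()` and its parameters are hypothetical, and it assumes a Linux or similar system where `vfork()` suspends the parent until the child calls `_exit()` or an `exec` function.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* exposes vfork() on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not from the gist: vfork() a child that writes one
 * byte to a pipe and then _exit()s with exit_code. Returns the child's exit
 * status, or -1 on error. The byte the child wrote is stored in *byte_out.
 */
static int
vfork_handshake(int exit_code, char *byte_out)
{
    int fds[2];
    int status;
    pid_t pid;

    assert(byte_out != NULL);
    if (pipe(fds) == -1)
        return -1;

    pid = vfork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: it borrows the parent's address space, so it only makes
         * system calls and modifies no parent state. */
        if (write(fds[1], "R", 1) != 1)
            _exit(127);
        _exit(exit_code);
    }

    /* Parent: execution resumes here once the child has called _exit(). */
    close(fds[1]);
    if (read(fds[0], byte_out, 1) != 1)
        *byte_out = '\0';
    close(fds[0]);

    /* Reap the child, retrying if a signal interrupts the wait. */
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Under those assumptions, calling `vfork_handshake(7, &b)` is expected to return 7 and leave `'R'` in `b`. The one-byte write plays the same role as the ready-pipe write in the worker sketch: it tells the parent side that the child got as far as it needed to before exiting or exec'ing.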
nicowilliams revised this gist
Mar 27, 2017 . 1 changed file with 7 additions and 0 deletions.
@@ -69,6 +69,8 @@ worker_start_routine(void *arg)
                // This also marks this worker as available and signals
                // up to one thread that might be waiting for a worker.
                me->taskq->record_child(job, pid);
            if (me->taskq->too_big_p())
                break; // Dynamically shrink the taskq
            continue;
        }
@@ -117,6 +119,11 @@ afork(int (*start_routine)(void *), void *arg)
    // pre-vfork()ed child to read our job and indicate readiness. If we're
    // unlucky then the worker we got is busy going through vfork(). Worker
    // threads really don't do much though, so we should usually get lucky.
    //
    // The taskq should be sized so that there isn't too much contention for
    // workers, and to grow dynamically so that at first there are no workers.
    // Perhaps it could grow without bounds when demand is great, then shrink
    // when demand is low (see worker_start_routine()).
    errno = 0;
    if (net_write(worker->dispatch_pipe[1], &jobp, sizeof(jobp)) != sizeof(jobp) ||
        net_read(worker->ready_pipe, &c, sizeof(c)) != sizeof(c)) {
nicowilliams revised this gist
Mar 27, 2017 . 1 changed file with 20 additions and 9 deletions.
@@ -47,13 +47,15 @@ worker_start_routine(void *arg)
    struct worker_s *me = arg;
    struct job_s *job;

    // Register the worker and pthread_cond_signal() up to one thread
    // that might be waiting for a worker.
    me->taskq->add_worker(me);
    do {
        if ((job = calloc(1, sizeof(*job))) == NULL ||
            pipe2(job->dispatch_pipe, O_CLOEXEC) == -1 ||
            pipe2(job->ready_pipe, O_CLOEXEC) == -1 ||
            (pid = vfork()) == -1) {
            me->taskq->remove(me, errno); // We're out!
            break;
        }
        if (pid != 0) {
@@ -63,7 +65,9 @@ worker_start_routine(void *arg)
                reap_child(pid);
            else
                // The child took a job; record it so we can reap it
                // later.
                // This also marks this worker as available and signals
                // up to one thread that might be waiting for a worker.
                me->taskq->record_child(job, pid);
            continue;
        }
@@ -96,21 +100,28 @@ afork(int (*start_routine)(void *), void *arg)
    struct job_descr_s *jobp = &job;
    char c;

    // avfork_taskq_init() is elided here, but one can imagine what it
    // looks like. It might grow up to N worker threads, and thereafter
    // if there are no available workers then taskq.get_worker() blocks
    // in a pthread_cond_wait() until a worker is ready.
    pthread_once(&avfork_taskq_init_once_control, avfork_taskq_init);

    // Describe the job
    memset(&job, 0, sizeof(job));
    job.start_routine = start_routine;
    job.arg = arg;

    worker = taskq.get_worker(); // Lockless when possible; starts a worker if needed

    // Send the worker our job. If we're lucky, we only wait for an already
    // pre-vfork()ed child to read our job and indicate readiness. If we're
    // unlucky then the worker we got is busy going through vfork(). Worker
    // threads really don't do much though, so we should usually get lucky.
    errno = 0;
    if (net_write(worker->dispatch_pipe[1], &jobp, sizeof(jobp)) != sizeof(jobp) ||
        net_read(worker->ready_pipe, &c, sizeof(c)) != sizeof(c)) {
        return -1;
    return job.pid; // when the read returns the PID is in pid
}
```
nicowilliams revised this gist
Mar 27, 2017 . 1 changed file with 1 addition and 1 deletion.
@@ -37,7 +37,7 @@ One can also implement something like `afork()` (minus the Illumos `forkx()` fla
int emulated_afork(int (*start_routine)(void *), void *arg, void (*cb)(pid_t) /* may be NULL */);
```

If the threads pre-vfork, then a PID-returning `afork()` can be implemented, though communicating a task to a pre-vforked thread might be tricky: `pthread_cond_wait()` might not work in the child, so one would have to use a pipe into which to write a pointer to the dispatched request. (Pipes _are_ safe to use on the child side of `vfork()`. That is, `read()` and `write()` on pipes are safe in the child of `vfork()`.) Here's how that would work:

```
// taskq thread create start routine
nicowilliams revised this gist
Mar 27, 2017 . 1 changed file with 10 additions and 8 deletions.
@@ -54,7 +54,7 @@ worker_start_routine(void *arg)
            pipe2(job->ready_pipe, O_CLOEXEC) == -1 ||
            (pid = vfork()) == -1) {
            me->taskq->remove(me, errno);
            break;
        }
        if (pid != 0) {
            // The child exited or exec'ed
@@ -84,7 +84,8 @@ worker_start_routine(void *arg)
        // Do the job
        _exit(job->descr->start_routine(job->descr->arg));
    } while(!me->taskq->terminated);
    return NULL;
}

pid_t
@@ -95,15 +96,16 @@ afork(int (*start_routine)(void *), void *arg)
    struct job_descr_s *jobp = &job;
    char c;

    pthread_once(&avfork_taskq_init_once_control, avfork_taskq_init);

    // Describe the job
    memset(&job, 0, sizeof(job));
    job.start_routine = start_routine;
    job.arg = arg;

again:
    worker = taskq.get_worker(); // Lockless when possible; starts a worker if needed

    // Send the worker our job
    if (net_write(worker->dispatch_pipe[1], &jobp, sizeof(jobp)) != sizeof(jobp)
        goto again; // Add infinite loop protection
    if (net_read(worker->ready_pipe, &c, sizeof(c)) != sizeof(c))
nicowilliams revised this gist
Mar 27, 2017 . 1 changed file with 75 additions and 1 deletion.
@@ -37,7 +37,81 @@ One can also implement something like `afork()` (minus the Illumos `forkx()` fla
int emulated_afork(int (*start_routine)(void *), void *arg, void (*cb)(pid_t) /* may be NULL */);
```

If the threads pre-vfork, then a PID-returning `afork()` can be implemented, though communicating a task to a pre-vforked thread might be tricky: `pthread_cond_wait()` might not work in the child, so one would have to use a pipe into which to write a pointer to the dispatched request. Here's how that would work:

```
// taskq thread create start routine
static void *
worker_start_routine(void *arg)
{
    struct worker_s *me = arg;
    struct job_s *job;

    me->taskq->add_worker(me);
    do {
        if ((job = calloc(1, sizeof(*job))) == NULL ||
            pipe2(job->dispatch_pipe, O_CLOEXEC) == -1 ||
            pipe2(job->ready_pipe, O_CLOEXEC) == -1 ||
            (pid = vfork()) == -1) {
            me->taskq->remove(me, errno);
            return NULL;
        }
        if (pid != 0) {
            // The child exited or exec'ed
            if (job->errno)
                // The child failed to get a job
                reap_child(pid);
            else
                // The child took a job; record it so we can reap it
                // later
                me->taskq->record_child(job, pid);
            continue;
        }

        // This is the child

        // Get a job
        if (net_read(me->dispatch_pipe[0], &job->descr, sizeof(job->descr)) != sizeof(job->descr)) {
            job->errno = errno ? errno : EWHATEVER;
            _exit(1);
        }
        job->descr->pid = getpid(); // Save the pid where a thread in the parent can see it
        if(net_write(me->ready_pipe[1], "", sizeof("")) != sizeof("")) {
            job->errno = errno ? errno : EWHATEVER;
            _exit(1);
        }

        // Do the job
        _exit(job->descr->start_routine(job->descr->arg));
    } while(!me->taskq->terminated);
}

pid_t
afork(int (*start_routine)(void *), void *arg)
{
    struct worker_s *worker;
    struct job_descr_s job;
    struct job_descr_s *jobp = &job;
    char c;

    job.start_routine = start_routine;
    job.arg = arg;

again:
    // Lock-lessly walk taskq to find a free pre-vforked thread
    // and take ownership of it.
    while (...) {
        ...
    }
    if (worker == NULL)
        // Add a taken worker
        ...

    if (net_write(worker->dispatch_pipe[1], &jobp, sizeof(jobp)) != sizeof(jobp)
        goto again; // Add infinite loop protection
    if (net_read(worker->ready_pipe, &c, sizeof(c)) != sizeof(c))
        // Cleanup, maybe goto again
        ...

    return job.pid; // when the read returns the PID is in pid
}
```

The title also says that `clone()` is stupid. Linux should have had a thread creation system call -- it would have then saved itself the pain of the first pthread implementation for Linux. `clone()` has become a swiss army knife -- it has evolved to have zone/jail entering features, but only sort of: Linux doesn't have proper zones/jails, and as new flags are added that old code might wish it had used... one will have to modify and rebuild the `clone()`-calling world, and that's decidedly not elegant.
nicowilliams revised this gist
Mar 24, 2017 . 1 changed file with 1 addition and 1 deletion.
@@ -37,7 +37,7 @@ One can also implement something like `afork()` (minus the Illumos `forkx()` fla
int emulated_afork(int (*start_routine)(void *), void *arg, void (*cb)(pid_t) /* may be NULL */);
```

If the threads pre-vfork, then a PID-returning `afork()` can be implemented, though communicating a task to a pre-vforked thread might be tricky: `pthread_cond_wait()` might not work in the child, so one would have to use a pipe into which to write a pointer to the dispatched request.

The title also says that `clone()` is stupid. Linux should have had a thread creation system call -- it would have then saved itself the pain of the first pthread implementation for Linux. `clone()` has become a swiss army knife -- it has evolved to have zone/jail entering features, but only sort of: Linux doesn't have proper zones/jails, and as new flags are added that old code might wish it had used... one will have to modify and rebuild the `clone()`-calling world, and that's decidedly not elegant.