TODO list for distcc   -*- mode: indented-text; fill-column: 78; -*-

See also TODO comments in source files

static linking

  cachegrind shows that a large fraction of client runtime is spent in
  the dynamic linker, which is kind of a waste. In principle using
  dietlibc might reduce the fixed overhead of the client. However, the
  nsswitch functions are always dynamically linked: even if we try to
  produce a static client it will include dlopen and eventually
  indirectly get libc, so it's probably not practical.

testing

  How to use Debian's make-kpkg with distcc? Does it work with the
  masquerade feature?

  Add --die-after= option to distccd to make sure it gets killed off in
  testing even if the test script is killed.

ccache

  Add an uncached fd to ccache, so that we can describe e.g. network
  failures that shouldn't be remembered. Export this as CCACHE_ERR_FD
  or something.

kernel 2.5 bug

  Andrew Morton has a situation where distcc on Linux 2.5 causes a TCP
  bug. (One machine thinks the socket is ESTABLISHED but the other
  thinks it is CLOSED. This should never happen.)

  Need to install 2.5 on two machines and run compilations until it
  hangs.

slow networks

  Use Linux Traffic Control to simulate compilation across a slow
  network. Woohoo!

"single queue multi server" scheduler

  research this more.

read hosts from file

compression

manpages

  The GNU project considers manpages to be deprecated, and they are
  certainly harder to maintain than a proper manual, but many people
  still find them useful. It might be nice to update the manual pages
  to contain quick-reference information that is smaller than the user
  manual but larger than what is available from --help. Is that ever
  really needed?

  The manpages should be reasonably small both because that suits the
  format, and also because I don't want to need to keep too much
  duplicated information up to date.

  This might be a nice small bit of work for somebody who wants to
  contribute.
  http://www.debian.org/doc/debian-policy/ch-docs.html

User Manual

  We tried using Docbook in release 1.2, but the tools for it seem much
  less mature than those for Linuxdoc, so I'm going to skip this for a
  while.

  - Add some documentation of the benchmark system. Does this belong in
    the manual, or in a separate manual?

  - Note that mixed gcc versions might cause different optimizations,
    which may be a problem. In addition, files which test the gcc
    version either in the configure script or in the preprocessor will
    have trouble. The kernel is one such program, and it needs to be
    built with the same version of gcc on all machines.

  - FAQ: Can't you check the gcc version? No, because gcc programs
    which report the same version number can have different
    behaviours, perhaps due to vendor/distributor patches.

  - Actually, distcc might use flock, lockf, or something else,
    depending on the platform.

  - Note that LSB requires init scripts to reset PATH, etc, so as to be
    independent of user settings if started interactively.

  - Discuss dietlibc.

Just cpp and linker?

  - Is it easy to describe how to install only the bits of gcc needed
    for distcc clients? Basically the driver, header, linker, and
    specs. Would this save much space?

Preforking

  - The daemon might "pre-fork" children, which will each accept
    connections and do their thing, as in Apache. This might reduce the
    number of fork() calls incurred, and perhaps also the latency in
    accepting a connection. I'm not sure if it's really justified, but
    if we have server-side concurrency limits it might fall out nicely.

-g support

  Perhaps detect the -g option, and then absolutify filenames passed to
  the compiler. This will cause absolute filenames to appear in error
  messages, but I don't see any easy way to have both correct stabs
  info and also correct error messages.

  Is anything else wrong with this approach?
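
  As a rough illustration of the -g idea above, here is a minimal sketch
  in Python (not distcc's own C code). It rewrites source-file arguments
  to absolute paths only when a debugging option is present. The helper
  names, the suffix list and the test for "-g" style options are all
  assumptions made for this sketch; a real argument scanner would need
  to be more careful.

```python
import posixpath

# Hypothetical list of suffixes treated as source files (an assumption).
SOURCE_SUFFIXES = (".c", ".cc", ".cpp", ".cxx", ".C", ".m", ".i", ".ii")


def has_debug_flag(argv):
    """Approximate check: treat -g, -g3, -ggdb and similar as debug options.

    -g0 is excluded because it turns debug information off. Other options
    that merely start with "-g" would be misdetected by this sketch.
    """
    return any(arg.startswith("-g") and arg != "-g0" for arg in argv)


def absolutify_sources(argv, cwd):
    """Return a copy of argv with source filenames made absolute.

    The command line is left untouched when no debug option is present,
    so error messages keep their relative names in the common case.
    """
    if not has_debug_flag(argv):
        return list(argv)
    result = []
    for arg in argv:
        if not arg.startswith("-") and arg.endswith(SOURCE_SUFFIXES):
            # join() leaves an already-absolute name unchanged.
            result.append(posixpath.normpath(posixpath.join(cwd, arg)))
        else:
            result.append(arg)
    return result
```

  For example, absolutify_sources(["gcc", "-g", "-c", "src/foo.c"],
  "/home/u/proj") would pass "/home/u/proj/src/foo.c" to the compiler,
  which is also the name that would then show up in error messages.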

--enable-final option for KDE

  Bernardo Innocenti <email@example.com> says

    Using the --enable-final configure option of KDE makes distcc
    almost useless.

  What is he talking about?

  > Moin Martin,

  Hello!

  > --enable-final makes the build system concatenate all sourcefiles in a
  > directory (say, Konqueror's sourcefiles) into one big file.

  Thanks for explaining that. I'd wondered about that approach, so it's
  interesting to hear that KDE has done it. The SGI compiler does
  something similar, but by writing a bytecode into the .o files and
  then doing global optimization at "link" time.

  > Technically, this is achieved by creating a dummy file which simply
  > includes every C++ sourcefile. The advantage of this is that the
  > compile a) takes less time since there is only little scattered file
  > opening involved and b) produces usually more optimized code, since
  > the compiler can see more code at once and thus there are more
  > chances to optimize. Of course this eats a lot more memory, but that
  > is not an issue nowadays.
  >
  > Now, it's clear why this makes distcc useless: there is just one huge file
  > per project, and outsourcing that file via distcc to other nodes will just
  > delay the build since the sourcecode (and it's a lot) has to be transferred
  > over the network, and there is no way to parallelize this.

  Yes, that does seem to make it non-parallelizable. However, I suppose
  you might profitably still use distcc to build different
  libraries/programs in different directories at the same time, if the
  makefile can do that. Or at least you might use it to shift work from
  a slow machine to a faster one.

Statistics

  Accumulate statistics on how many jobs are built on various machines.

  Want to be able to do something like "watch ccache -s".

Compression

  Use LZO. Very cheap compared to cpp.

  Perhaps statically link minilzo into the program -- simpler and
  possibly faster than using a shared library.

kill compiler

  If the client is killed, it will close the connection.
  The server ought to kill the compiler so as to prevent runaway
  processes on the server. This probably involves calling select() for
  read on the connection.

  The compilation will complete relatively soon anyhow, so it's not
  worth doing this unless there is a simple implementation.

Scheduling

  Scheduler needs to include port number when naming machines

New "supermarket" scheduler

  There's rarely any point sending two files to any one machine at the
  same time. Presumably the network can be completely filled by one of
  them.

  Other processes queue up behind whoever is waiting for the
  connection. Try to keep them in order.

  Implement this by a series of locks

  How to correctly allow for make running other tasks on the machine at
  the same time?

  Tasks that are waiting for resources should *not* be bound to any
  particular resource: they should wait for *anything* to be free, and
  then take that.

  Perhaps we need a semaphore initialized to the number of remote
  slots?

  What's the best portable way to do semaphores? SysVSem, or something
  built from scratch? Cygwin has POSIX semaphores.

  Doing this for an inconsistent list of hosts might be tricky, though.

  Alternatively, create a client-side daemon that queues up clients and
  sends them to the next-available machine.

  Alternatively, use select() to wait on many files. Can this wait for
  locks? It would be a shame if it can't.

  Alternatively, just sleep for 0.1s and then try to acquire a lock
  again. Ugly, but simple and it would probably work. Not very
  expensive compared to actually running the compiler, and probably
  cheap compared to running a compiler in the wrong place.

  Probably don't do load limitation on remote hosts by default: just
  send everything and let the daemon accept as it wishes.

corks

  Can corks cause data to be sent in the SYN or ACK packet?

tcp fiddling

  I wonder if increasing the maximum window size
  (net.core.wmem_default, etc) will help anything? It's probably
  dominated by scheduling inefficiency at the moment.
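
  Going back to the "supermarket" scheduler notes: the "just sleep for
  0.1s and then try to acquire a lock again" variant could look roughly
  like the Python sketch below. It assumes one lock file per host in a
  lock directory and uses flock() through Python's fcntl module, so it
  is Unix-only. The function names, the "slot_" file naming and the
  max_polls parameter (there only so that a caller can bound the wait)
  are hypothetical and not part of distcc.

```python
import fcntl
import os
import time


def try_lock(path):
    """Try to take an exclusive lock without blocking.

    Returns an open file descriptor holding the lock, or None if some
    other holder already has it.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    return fd


def acquire_any_slot(lock_dir, hosts, poll_interval=0.1, max_polls=None):
    """Wait for *any* host slot to become free, then take it.

    The waiting task is not bound to one host: every pass tries all of
    them in order. Returns (host, fd), or (None, None) if max_polls
    passes went by without success.
    """
    polls = 0
    while True:
        for host in hosts:
            fd = try_lock(os.path.join(lock_dir, "slot_" + host))
            if fd is not None:
                return host, fd
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return None, None
        time.sleep(poll_interval)


def release_slot(fd):
    """Give the slot back so that a polling client can pick it up."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

  This does not keep waiters in order, which the notes above ask for; it
  only shows that waiting for "anything to be free" is cheap to
  approximate with polling.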

benchmark

  Try aspell and xmms, which may have strange Makefiles.

compression when needed

  Compression is probably only useful when we're network-bound. We can
  roughly detect this by seeing whether we had to wait to acquire the
  mutex to send data to a particular machine. If we're waiting for the
  network to be free, we might as well start compressing.

Load balancing

  Perhaps rely on external balancer

    http://balance.sourceforge.net/

  Perhaps we can adapt its ideas.

rsync-like caching

  Send source as an rdiff against the previous version.

  Needs to be able to fall back to just sending plain text of course.

  Perhaps use different compression for source and binary.

--ping option

  It would be nice to have a --ping client option to contact all the
  remote servers, and perhaps return some kind of interesting
  information.

  Output should be machine-parseable e.g. to use in removing
  unreachable machines from the host list.

host specification

  Perhaps look in /etc/distcc/hosts and ~/.distcc/hosts by default, if
  the DISTCC_HOSTS is not set. If it is set, perhaps allow it to
  contain filenames which cause those files to be read.

Implicit usage:

  Take CC name from environment variable DISTCC_CC. Document this.
  Perhaps have a separate one for CXX? (Though really I don't see any
  point in this, because we could only distinguish the two by looking
  at the prefix of the source file, which is the same as what gcc will
  do.)

  We need to reliably distinguish the three cases, and then also
  implement each one correctly.

  Plenty of room for interesting test cases here.

  Three methods of usage:

    "Explicit" compiler name.

      distcc gcc -c foo.c

      Nice and simple!! Name of the real compiler is simply taken from
      argv.

    "Implicit" compiler.

      distcc -c foo.c
      distcc foo.o -o foo

      First argument is not a compiler name.

    "Intercepted" compiler

      ln -s distcc /usr/local/bin/cc
      cc -c foo.c
      cc foo.o -o

      The command line looks like an implicit compiler invocation, in
      that the first word is not the name of the compiler.
      However, rather than using DISTCC_CC, we need to find the "real"
      underlying compiler.

      Want to set a _DISTCC_SAFEGUARD environment variable to protect
      against accidentally invoking distcc recursively.

  I'm not sure what the precedence should be between DISTCC_CC and an
  intercepted compiler name. On the whole I think using the intercepted
  name is probably better.

  So the decision tree is probably like this:

    if a compiler name is explicitly specified
      run the named compiler
    otherwise, if we intercepted the call
      work out the name of the real compiler
    otherwise,
      use DISTCC_CC

  So how to work out if the compiler name was explicitly specified?

    1- Look to see whether it looks like a source file or option. But
       this is a problem for some linker invocations...

    2- Look along the PATH to see if the file exists and is executable.

  When checking the path:

    If the filename is absolute, then just check it directly.

    Otherwise, check every directory of the path, looking for a file of
    that name. Check if it's executable.

    We can't rely on the contents of the path being the same on the
    server, but it should not be necessary to evaluate this on the
    server.

    If random files in the build are executable and either on the path
    or explicitly named on the command line then we may have trouble.

  How to tell if we've intercepted the compiler?

    One way is just to check whether the last component of argv[0] is
    "distcc"; if it is anything else, we were invoked under an
    intercepted name. This is what ccache does, and it probably works
    pretty well.

  How to find the real compiler?

    We might try looking for the first program of the same name that's
    not a symlink, but that will cause trouble on machines where there
    is a link like "gcc -> gcc-3.2", which is common.

    On the server, distcc may be on the path. Sending an absolute path
    to the compiler is undesirable, particularly since we might be
    cross-compiling (and have some programs in /usr/local), or running
    to a different distribution.
    Another problem is that people will probably end up with the hook
    on their distccd's path, and therefore it will recurse.

    See http://bugs.gentoo.org/show_bug.cgi?id=13625#c2

Protocol

  Perhaps rather than getting the server to reinterpret the command
  line, we should mark the input and output parameters on the client.
  So what's sent across the network might be

    distcc -c @@INPUT@@ -o @@OUTPUT@@

  It's probably better to add additional protocol sections to say which
  words should be the input and output files than to use magic values.

  The attraction is that this would allow a particularly knotty part of
  code to be included only in the client and run only once. If any bugs
  are fixed in this, then only the client will need to be upgraded.
  This might remove most of the gcc-specific knowledge from the server.

  Different clients might be used to support various very different
  distributable jobs.

  We ought to allow for running commands that don't take an input or
  output file, in case we want to run "gcc --version".

  The drawback is that probably new servers need to be installed to
  handle the new protocol version. I don't know if there's really a
  compelling reason to do this.

  If the argument parser depends on things that can only be seen on the
  client, such as checking whether files exist, then this may be
  needed.

gcc weirdnesses:

  distcc needs to handle $COMPILER_PATH and $GCC_EXEC_PREFIX in some
  sensible way, if there is one. Not urgent because I have never heard
  of them being used.

compiler versioning:

  distcc might usefully verify that the compiler versions and critical
  parameters are compatible on all machines, for example by running -V.
  This really should be done in a way that preserves the simplicity of
  the protocol: we don't want to interactively query the server on each
  request.

  Perhaps distcc ought to add -b and -V options to the compiler, based
  on whatever is present on the current machine? Or perhaps the user
  should just do this.
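
  To make the Protocol idea above more concrete, here is a minimal
  Python sketch of marking the input and output words on the client by
  index rather than by magic values, with the server only substituting
  its own filenames. The function names, the suffix list and the
  handling of "-o" as a separate word are assumptions for illustration;
  this is not the existing distcc argument scanner, and forms such as
  "-ofoo.o" are not handled.

```python
# Hypothetical list of suffixes treated as source files (an assumption).
SOURCE_SUFFIXES = (".c", ".cc", ".cpp", ".cxx", ".C", ".m", ".i", ".ii")


def mark_input_output(argv):
    """Client side: return (input_index, output_index) into argv.

    Either index may be None, which allows commands such as
    "gcc --version" that have no input or output file.
    """
    input_index = None
    output_index = None
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "-o" and i + 1 < len(argv):
            # The word after -o names the output file.
            output_index = i + 1
            i += 2
            continue
        if not arg.startswith("-") and arg.endswith(SOURCE_SUFFIXES):
            input_index = i
        i += 1
    return input_index, output_index


def substitute(argv, input_index, output_index, server_input, server_output):
    """Server side: swap in server-local filenames at the marked positions.

    The server needs no knowledge of compiler options here; it only
    trusts the indexes sent by the client.
    """
    out = list(argv)
    if input_index is not None:
        out[input_index] = server_input
    if output_index is not None:
        out[output_index] = server_output
    return out
```

  With this split, only mark_input_output() contains compiler-specific
  knowledge, which matches the attraction described above: fixes to the
  parsing would need a client upgrade only.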

networking timeouts:

  distcc waits for too long on unreachable hosts. We probably need to
  timeout after about a second and build locally.

  Probably this should be implemented by connect() in non-blocking
  mode, bounded by a select.

  Also we want a timeout for name resolution. The GNU resolver has a
  specific feature to do this. On other systems we probably need to use
  alarm(), but that might be more trouble than it is worth. Jonas
  Jensen says:

    Timing out the connect call could be done easier than this, just by
    interrupting it with a SIGALRM, but that's not enough to abort
    gethostbyname. This method of longjmp'ing from a signal handler is
    what they use in curl, so it should be ok.

  The client should have a medium-term local cache about unusable
  servers, to avoid always retrying connections.

  Several different cases (unreachable, host down, server down, server
  broken) will produce slightly different errors.

ssh

  Running distcc across OpenSSH has several security advantages and
  should be supported in the future. They include:

    Volunteer machines will not need to open an additional
    network-facing service.

    Only authenticated users can use a volunteer machine.

    Clients have some guarantees that their connections to a volunteer
    are not being spoofed.

    Using SSH is greatly preferable to developing and maintaining a
    custom security protocol.

  If the client or volunteer is subverted, then the other party is not
  protected. (For example, if the administrator of the volunteer is
  malicious, or if the volunteer has been compromised, then compilation
  results might contain trojans.) However, this is the case for
  practically every Internet protocol.

  Using SSH will consume some CPU cycles in computation on both client
  and volunteer.

  A simple implementation would be trivial, since the daemon already
  works on stdin/stdout. However, this might perform poorly because SSH
  takes quite a long time to open a connection.

  Connections should be hoarded by the client.
  If the client doesn't already have an ssh connection to the server,
  distcc should fork, with a background task holding the connection
  open and coordinating access.

waitstatus

  Make sure that native waitstatus formats are the same as the
  Unix/Linux/BSD formats used on the wire. (See
  <http://www.opengroup.org/onlinepubs/007904975/functions/wait.html>,
  which says they may only be interpreted by macros.) I don't know of
  any system where they're different.

gui

  A GUI to show progress of compilation and distribution of load would
  be neat. Probably the most sensible way is to make it parse
  $DISTCC_LOG.

override compiler name

  distcc could support cross-compilation by a per-volunteer option to
  override the compiler name. On the local host, it might invoke gcc
  directly, but on some volunteers it might be necessary to specify a
  more detailed description of the compiler to get the appropriate
  cross tool. This might be insufficient for Makefiles that need to
  call several different compilers, perhaps gcc and g++ or different
  versions of gcc. Perhaps they can make do with changing the DISTCC
  host settings at appropriate times.

  I'm not convinced this complexity is justified.

IPv6

  distcc could easily handle IPv6, but it doesn't yet. The new sockets
  API does not work properly on all systems, so we need to detect it
  and fall back to the old API as necessary.

LNX-BBC

  It would be nice to put distcc and appropriate compilers on the
  LNX-BBC. This could be pretty small because only the compiler would
  be required, not header files or libraries.

#pragma implementation

  We might keep the same file basename, and put the files in a
  temporary subdirectory. This might avoid some problems with C++
  looking at the filename for #pragma implementation stuff.

  This is also a potential fix for the -MD stuff: we could run the
  whole compile in a subdirectory, and then grab any .d files
  generated.
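
  The temporary-subdirectory idea above might be sketched like this in
  Python. The function names and the "distcc_" directory prefix are
  made up for the example, and the compile step itself is left out; the
  sketch only shows staging a file under its original basename and
  collecting any .d files afterwards.

```python
import os
import tempfile


def stage_with_same_basename(original_name, contents):
    """Write the received source into a fresh temporary subdirectory.

    The file keeps its original basename, so anything in the compiler
    that looks at the filename sees the same name as on the client.
    Returns (workdir, staged_path).
    """
    workdir = tempfile.mkdtemp(prefix="distcc_")
    staged = os.path.join(workdir, os.path.basename(original_name))
    with open(staged, "wb") as f:
        f.write(contents)
    return workdir, staged


def collect_dep_files(workdir):
    """List the .d files left in the working directory after a compile."""
    return sorted(name for name in os.listdir(workdir) if name.endswith(".d"))
```

  The caller would run the compiler with workdir as its current
  directory, send back whatever collect_dep_files() finds, and then
  remove the directory.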

Installable package for Windows

  Also, it would be nice to have an easily installable package for
  Windows that makes the machine be a Cygwin-based compile volunteer.
  It probably needs to include cross-compilers for Linux (or whatever),
  or at least simple instructions for building them.

  Automatic detection ("zero configuration") of compile volunteers is
  probably not a good idea, because it might be complicated to
  implement, and would possibly cause breakage by distributing to
  machines which are not properly configured.

  Notwithstanding the previous point, centralized configuration for a
  site would be good, and probably quite practical. Setting up a list
  of machines centrally rather than configuring each one sounds more
  friendly. The most likely design is to use DNS SRV records (RFC2052),
  or perhaps multi-RR A records. For example, compile.ozlabs.foo.com
  would resolve to all relevant machines. Another possibility would be
  to use SLP, the Service Location Protocol, but that adds a larger
  dependency and it seems not to be widely deployed.

Large-scale Distribution

  distcc in its present form works well on small numbers of close
  machines owned by the same people. It might be an interesting project
  to investigate scaling up to large numbers of machines, which
  potentially do not trust each other. This would make distcc somewhat
  more like other "peer-to-peer" systems like Freenet and Napster.

Load Balancing

  When running a job locally (such as cpp or ld), distcc ought to count
  that against the load of localhost. At the moment it is biased
  towards too much local load.

  distcc needs a way to know that some machines have multiple CPUs, and
  should accept a proportionally larger number of jobs at the same
  time. It's not clear whether multiprocessor machines should be
  completely filled before moving on to another machine.

  If there are more parallel invocations of distcc than available CPUs
  it's not clear what behaviour would be best.
  Options include having the remaining children sleep; distributing
  multiple jobs across available machines; or running all the overflow
  jobs locally.

  In fact, on Linux it seems that running two tasks on a CPU is not
  much slower than running a single task, because the task-switching
  overhead is pretty low.

  Problems tend to occur when we run more jobs than will fit into
  available physical memory. It might be nice if there was a "batch
  mode" scheduler that would finish one before running the next, but in
  the absence of that we have to do it ourselves. I can't see any clean
  and portable way to determine when the compiler is using too much
  memory: it would depend on the RSS of the compiler (which depends on
  the source file), on the amount of memory and swap, and on what other
  tasks are running. In addition, on some small boxes compiling large
  code, you may actually want (or need) to have it swap sometimes.

  In addition, it might be nice to have a --max-load option, as for GNU
  Make, to tell it not to accept more than one job (or more than zero?)
  when the machine's load average is above that number. We can try
  calling getloadavg(), which should exist on Linux and BSD, but
  apparently not on Solaris. Can take patches later.

  A server-side administrative restriction on the number of consecutive
  tasks would probably be a sufficient approximation.

  Oscar Esteban suggests that when the server is limiting accepted
  jobs, it may be better to have it accept source, but defer compiling
  it. This implies not using fifos, even if they would otherwise be
  appropriate. This may smooth out network utilization. There may be
  some undesirable transient effects where we're waiting for one small
  box to finish all the jobs it has queued.

  The scheduler would ideally also take into account the special
  distribution required for non-parallel parts of the build: the most
  common case is running configure, where many small jobs will be run
  sequentially.
  In general the best solution is to run them locally, but if the local
  machine is very slow that may not be true.

  Perhaps some kind of adaptive system based on measuring the
  performance of all available machines would make sense.

distributed caching

  Look in the remote machine's cache as well.

  Perhaps use a SQUID-like broadcast of the file digest and other
  critical details to find out if any machine in the workgroup has the
  file cached. Perhaps this could be built on top of a more general
  file-caching mechanism that maps from hash to body. At the moment
  this sounds like premature optimization.
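
  For what it's worth, the "maps from hash to body" mechanism mentioned
  above could be as small as the following Python sketch. The choice of
  SHA-256, the inputs that go into the digest, and all class and
  function names are assumptions for illustration. The broadcast is
  replaced here by simply asking each known cache in turn, which is a
  different and much simpler technique than a SQUID-like query.

```python
import hashlib


def cache_key(preprocessed_source, compiler_args, compiler_version):
    """Digest of everything assumed to determine the compiler output.

    Each part is length-prefixed so that different splits of the same
    bytes cannot collide.
    """
    h = hashlib.sha256()
    parts = (
        compiler_version.encode(),
        "\0".join(compiler_args).encode(),
        preprocessed_source,
    )
    for part in parts:
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


class HashToBodyCache:
    """General-purpose cache mapping a digest to an opaque body."""

    def __init__(self):
        self._bodies = {}

    def put(self, key, body):
        self._bodies[key] = body

    def get(self, key):
        return self._bodies.get(key)


def lookup_in_workgroup(key, caches):
    """Ask each cache in turn; return the first body found, else None."""
    for cache in caches:
        body = cache.get(key)
        if body is not None:
            return body
    return None
```

  A client would compute cache_key() after running cpp, try
  lookup_in_workgroup(), and only fall back to a real compile on a
  miss.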