TODO list for distcc            -*- mode: indented-text; fill-column: 78; -*-

See also TODO comments in source files


static linking
    
    cachegrind shows that a large fraction of client runtime is spent in the
    dynamic linker, which is kind of a waste.  In principle using dietlibc
    might reduce the fixed overhead of the client.  However, the nsswitch
    functions are always dynamically linked: even if we try to produce a
    static client it will include dlopen and eventually indirectly get libc,
    so it's probably not practical.

testing

    How to use Debian's make-kpkg with distcc?  Does it work with the
    masquerade feature?

    Add --die-after= option to distccd to make sure it gets killed off in
    testing even if the test script is killed.
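
    A minimal sketch of how this might look, assuming the option takes a
    whole number of seconds.  The names parse_die_after and arm_die_after
    are hypothetical; nothing like them exists in distccd yet.  It leans on
    the default action of SIGALRM, which is to terminate the process.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Parse a hypothetical "--die-after=SECONDS" argument.  Returns 0 and
 * stores the value on success, -1 if ARG is not this option or the value
 * is malformed or zero. */
int parse_die_after(const char *arg, unsigned int *seconds)
{
    static const char prefix[] = "--die-after=";
    size_t plen = sizeof prefix - 1;
    char *end;
    unsigned long val;

    if (strncmp(arg, prefix, plen) != 0)
        return -1;
    if (arg[plen] < '0' || arg[plen] > '9')
        return -1;              /* rejects empty values, signs and spaces */
    errno = 0;
    val = strtoul(arg + plen, &end, 10);
    if (errno != 0 || *end != '\0' || val == 0 || val > UINT_MAX)
        return -1;
    *seconds = (unsigned int) val;
    return 0;
}

/* Arm the timer.  SIGALRM is left at its default disposition, which
 * terminates the process when the timer expires. */
void arm_die_after(unsigned int seconds)
{
    alarm(seconds);
}
```

    A pending alarm is not inherited across fork(), so a forking daemon
    would also have to arm it in (or otherwise clean up) its children.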

ccache
    
    Add an uncached fd to ccache, so that we can describe e.g. network
    failures that shouldn't be remembered.  Export this as CCACHE_ERR_FD or
    something. 

kernel 2.5 bug

    Andrew Morton has a situation where distcc on Linux 2.5 causes a TCP bug.
    (One machine thinks the socket is ESTABLISHED but the other thinks it is
    CLOSED.  This should never happen.)

    Need to install 2.5 on two machines and run compilations until it hangs. 


slow networks

    Use Linux Traffic Control to simulate compilation across a slow network.
    Woohoo!

"single queue multi server" scheduler

    research this more.

read hosts from file

compression

manpages

    The GNU project considers manpages to be deprecated, and they are
    certainly harder to maintain than a proper manual, but many people still
    find them useful.

    It might be nice to update the manual pages to contain quick-reference
    information that is smaller than the user manual but larger than what is
    available from --help.  Is that ever really needed?

    The manpages should be reasonably small both because that suits the
    format, and also because I don't want to need to keep too much duplicated
    information up to date.

    This might be a nice small bit of work for somebody who wants to
    contribute.
    
    http://www.debian.org/doc/debian-policy/ch-docs.html

User Manual

    We tried using Docbook in release 1.2, but the tools for it seem much less
    mature than those for Linuxdoc, so I'm going to skip this for a while.

 - Add some documentation of the benchmark system.  Does this belong
   in the manual, or in a separate manual?

 - Note that mixed gcc versions might cause different optimizations, which may
   be a problem.  In addition, files which test the gcc version either in the
   configure script or in the preprocessor will have trouble.  The kernel is
   one such program, and it needs to be built with the same version of gcc on
   all machines.

 - FAQ: Can't you check the gcc version?  No, because gcc programs which
   report the same version number can have different behaviours, perhaps due
   to vendor/distributor patches.

 - Actually, distcc might use flock, lockf, or something else, depending on
   the platform.

 - Note that LSB requires init scripts to reset PATH, etc, so as to be
   independent of user settings if started interactively.

 - Discuss dietlibc.


Just cpp and linker?

 - Is it easy to describe how to install only the bits of gcc needed for
   distcc clients?  Basically the driver, header, linker, and specs.  Would
   this save much space?


Preforking

 - The daemon might "pre-fork" children, which will each accept
   connections and do their thing, as in Apache.  This might reduce
   the number of fork() calls incurred, and perhaps also the latency
   in accepting a connection.  I'm not sure if it's really justified,
   but if we have server-side concurrency limits it might fall out
   nicely.


-g support

    Perhaps detect the -g option, and then absolutify filenames passed to the
    compiler.  This will cause absolute filenames to appear in error messages,
    but I don't see any easy way to have both correct stabs info and also
    correct error messages.

    Is anything else wrong with this approach?
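
    The absolutifying step itself could be as small as the sketch below.
    dcc_absolutify is a hypothetical name, and it deliberately does not
    collapse "." or ".." components; a real caller would presumably pass
    the result of getcwd() as the base.

```c
#include <stdio.h>
#include <string.h>

/* Write an absolute form of NAME into OUT, using CWD as the base for
 * relative names.  Returns 0 on success, -1 if OUT is too small.
 * No attempt is made to collapse "." or ".." components. */
int dcc_absolutify(const char *cwd, const char *name,
                   char *out, size_t outlen)
{
    int n;

    if (name[0] == '/') {
        n = snprintf(out, outlen, "%s", name);
    } else {
        size_t clen = strlen(cwd);
        /* Avoid a doubled slash when CWD already ends in one. */
        const char *sep = (clen > 0 && cwd[clen - 1] == '/') ? "" : "/";
        n = snprintf(out, outlen, "%s%s%s", cwd, sep, name);
    }
    return (n < 0 || (size_t) n >= outlen) ? -1 : 0;
}
```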

  
--enable-final option for KDE

  Bernardo Innocenti <bernie@develer.com> says

    Using the --enable-final configure option of KDE makes distcc almost
    useless.  

  What is he talking about?

> Moin Martin,

Hello!

> --enable-final makes the build system concatenate all sourcefiles in a
> directory (say, Konqueror's sourcefiles) into one big file.

Thanks for explaining that.  I'd wondered about that approach, so it's
interesting to hear that KDE has done it.  The SGI compiler does
something similar, but by writing a bytecode into the .o files and
then doing global optimization at "link" time.

> Technically, this is achieved by creating a dummy file which simply
> includes every C++ sourcefile. The advantage of this is that the
> compile a) takes less time since there is only little scattered file
> opening involved and b) produces usually more optimized code, since
> the compiler can see more code at once and thus there are more
> chances to optimize. Of course this eats a lot more memory, but that
> is not an issue nowadays.
> 
> Now, it's clear why this makes distcc useless: there is just one huge file
> per project, and outsourcing that file via distcc to other nodes will just
> delay the build since the sourcecode (and it's a lot) has to be transferred
> over the network, and there is no way to pararellize this.

Yes, that does seem to make it non-parallelizable.  However, I suppose
you might profitably still use distcc to build different
libraries/programs in different directories at the same time, if the
makefile can do that.  Or at least you might use it to shift work from
a slow machine to a faster one.


Statistics

    Accumulate statistics on how many jobs are built on various machines.

    Want to be able to do something like "watch ccache -s".


Compression

  Use LZO.  Very cheap compared to cpp.

  Perhaps statically link minilzo into the program -- simpler and
  possibly faster than using a shared library.


kill compiler

    If the client is killed, it will close the connection.  The server ought
    to kill the compiler so as to prevent runaway processes on the server. 

    This probably involves using select() to watch the connection for read.

    The compilation will complete relatively soon anyhow, so it's not worth
    doing this unless there is a simple implementation.
    

Scheduling

  Scheduler needs to include port number when naming machines

  New "supermarket" scheduler

    There's rarely any point sending two files to any one machine at
    the same time.  Presumably the network can be completely filled by
    one of them.

    Other processes queue up behind whoever is waiting for the
    connection.  Try to keep them in order.

    Implement this by a series of locks

    How to correctly allow for make running other tasks on the machine at the
    same time?

    Tasks that are waiting for resources should *not* be bound to any
    particular resource: they should wait for *anything* to be free, and then
    take that.  Perhaps we need a semaphore initialized to the number of
    remote slots?  

    What's the best portable way to do semaphores?  SysVSem, or something
    built from scratch?  Cygwin has POSIX semaphores.  

    Doing this for an inconsistent list of hosts might be tricky, though.

    Alternatively, create a client-side daemon that queues up clients and
    sends them to the next-available machine.

    Alternatively, use select() to wait on many files.  Can this wait for
    locks?  It would be a shame if it can't.

    Alternatively, just sleep for 0.1s and then try to acquire a lock again.
    Ugly, but simple and it would probably work.  Not very expensive compared
    to actually running the compiler, and probably cheap compared to running a
    compiler in the wrong place.
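
    The sleep-and-retry idea might be sketched as below, assuming one lock
    file per remote slot and that flock() is available.  The function names
    and the lock-file layout are made up for illustration.  Note that a
    waiting task is not bound to any particular slot: each round tries all
    of them.

```c
#define _DEFAULT_SOURCE         /* glibc: flock(), nanosleep() in strict modes */
#include <fcntl.h>
#include <stddef.h>
#include <sys/file.h>
#include <time.h>
#include <unistd.h>

/* Try each lock file once without blocking.  Returns the index of the
 * slot obtained and stores the locked descriptor in *FD_OUT, or returns
 * -1 if every slot is busy.  Closing the descriptor releases the slot. */
static int try_any_slot(const char *const *lockfiles, size_t n, int *fd_out)
{
    size_t i;

    for (i = 0; i < n; i++) {
        int fd = open(lockfiles[i], O_WRONLY | O_CREAT, 0600);
        if (fd < 0)
            continue;
        if (flock(fd, LOCK_EX | LOCK_NB) == 0) {
            *fd_out = fd;
            return (int) i;
        }
        close(fd);              /* busy (or error): try the next one */
    }
    return -1;
}

/* Poll for any free slot, sleeping 0.1s between rounds, for at most
 * MAX_ROUNDS rounds.  Returns the slot index, or -1 on giving up. */
int dcc_acquire_any_slot(const char *const *lockfiles, size_t n,
                         int max_rounds, int *fd_out)
{
    const struct timespec delay = { 0, 100 * 1000 * 1000 };
    int round;

    for (round = 0; round < max_rounds; round++) {
        int slot = try_any_slot(lockfiles, n, fd_out);
        if (slot >= 0)
            return slot;
        if (round + 1 < max_rounds)
            nanosleep(&delay, NULL);
    }
    return -1;
}
```

    This does nothing to keep waiters in order, so it is not really the
    "supermarket" queue, just the cheap approximation.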

    Probably don't do load limitation on remote hosts by default: just send
    everything and let the daemon accept as it wishes.


corks

    Can corks cause data to be sent in the SYN or ACK packet?


tcp fiddling

    I wonder if increasing the maximum window size (net.core.wmem_default,
    etc) will help anything?  It's probably dominated by scheduling
    inefficiency at the moment.


benchmark

    Try aspell and xmms, which may have strange Makefiles.


compression when needed

    Compression is probably only useful when we're network-bound.  We can
    roughly detect this by seeing whether we had to wait to acquire the mutex
    to send data to a particular machine.  If we're waiting for the network to
    be free, we might as well start compressing.


Load balancing

  Perhaps rely on external balancer

  http://balance.sourceforge.net/

  Perhaps we can adapt its ideas.

rsync-like caching

  Send source as an rdiff against the previous version.

  Needs to be able to fall back to just sending plain text of course.

  Perhaps use different compression for source and binary.

--ping option
       
  It would be nice to have a --ping client option to contact
  all the remote servers, and perhaps return some kind of interesting
  information.  

  Output should be machine-parseable e.g. to use in removing
  unreachable machines from the host list.

host specification

    Perhaps look in /etc/distcc/hosts and ~/.distcc/hosts by default, if the
    DISTCC_HOSTS is not set.

    If it is set, perhaps allow it to contain filenames which cause those
    files to be read.


Implicit usage:

    Take CC name from environment variable DISTCC_CC.  Document this.
    Perhaps have a separate one for CXX?  (Though really I don't see
    any point in this, because we could only distinguish the two by
    looking at the suffix of the source file, which is the same as
    what gcc will do.)

    We need to reliably distinguish the three cases, and then also
    implement each one correctly.  Plenty of room for interesting test
    cases here.

    Three methods of usage:

	"Explicit" compiler name.
	    distcc gcc -c foo.c  

	    Nice and simple!!  Name of the real compiler is simply taken
	    from argv[1].

	"Implicit" compiler.
	    distcc -c foo.c
	    distcc foo.o -o foo
	    
	    First argument is not a compiler name.  

	"Intercepted" compiler
	    ln -s distcc /usr/local/bin/cc
	    cc -c foo.c
	    cc foo.o -o foo

	    The command line looks like an implicit compiler invocation, in
	    that the first word is not the name of the compiler.  

	    However, rather than using DISTCC_CC, we need to find the
	    "real" underlying compiler.

	    Want to set a _DISTCC_SAFEGUARD environment variable to
	    protect against accidentally invoking distcc recursively.
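
    The safeguard could be as simple as this sketch.  The function name is
    hypothetical, and what distcc should do when the guard trips (give up,
    or go looking further along the path) is left open.

```c
#include <stdlib.h>

/* Returns 0 after marking the environment if no distcc is above us in
 * the process tree.  Returns -1 if the marker is already present, which
 * means distcc has directly or indirectly invoked itself, or if the
 * marker could not be set. */
int dcc_recursion_safeguard(void)
{
    if (getenv("_DISTCC_SAFEGUARD") != NULL)
        return -1;
    return setenv("_DISTCC_SAFEGUARD", "1", 1) == 0 ? 0 : -1;
}
```

    Because the environment is inherited across exec, a distcc reached by
    mistake from the compiler we run would see the marker.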

   I'm not sure what the precedence should be between DISTCC_CC and an
   intercepted compiler name.  On the whole I think using the
   intercepted name is probably better.

   So the decision tree is probably like this:

   if a compiler name is explicitly specified
	run the named compiler
   otherwise, if we intercepted the call
        work out the name of the real compiler
   otherwise, 
        use DISTCC_CC

   So how to work out if the compiler name was explicitly specified?  
  
        1- Look to see whether it looks like a source file or option.
	But this is a problem for some linker invocations...
	
	2- Look along the PATH to see if the file exists and is
	executable. 

    When checking the path:

        If the filename is absolute, then just check it directly.
        Otherwise, check every directory of the path, looking for a
        file of that name.  Check if it's executable.
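
    Roughly, as a sketch (dcc_find_executable is a hypothetical name).  It
    treats an empty path element as the current directory, which is the
    traditional shell behaviour, and it insists on a regular file so that
    directories with the x bit set are not mistaken for programs.

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return 1 if FILE is a regular file that we may execute. */
static int is_executable_file(const char *file)
{
    struct stat st;

    return stat(file, &st) == 0 && S_ISREG(st.st_mode)
        && access(file, X_OK) == 0;
}

/* An absolute NAME is checked directly; anything else is looked for in
 * each directory of PATH (a colon-separated list, may be NULL).  On
 * success the full name is written to OUT and 0 is returned; otherwise
 * -1 is returned. */
int dcc_find_executable(const char *name, const char *path,
                        char *out, size_t outlen)
{
    const char *dir = path;

    if (name[0] == '/') {
        int n = snprintf(out, outlen, "%s", name);
        if (n < 0 || (size_t) n >= outlen)
            return -1;
        return is_executable_file(out) ? 0 : -1;
    }

    while (dir != NULL) {
        const char *colon = strchr(dir, ':');
        size_t dlen = colon ? (size_t) (colon - dir) : strlen(dir);
        int n;

        if (dlen == 0)          /* empty element: current directory */
            n = snprintf(out, outlen, "./%s", name);
        else
            n = snprintf(out, outlen, "%.*s/%s", (int) dlen, dir, name);
        if (n >= 0 && (size_t) n < outlen && is_executable_file(out))
            return 0;
        dir = colon ? colon + 1 : NULL;
    }
    return -1;
}
```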

    We can't rely on the contents of the path being the same on the
    server, but it should not be necessary to evaluate this on the
    server. 

    If random files in the build are executable and either on the path
    or explicitly named on the command line then we may have trouble.

    How to tell if we've intercepted the compiler?  One way is just to
    check if the last component of argv[0] is "distcc".  This is what
    ccache does, and it probably works pretty well.
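
    Putting the decision tree together with the argv[0] test gives
    something like the sketch below.  It uses the first heuristic (does
    the word look like an option or a file?) rather than the path search,
    so it shares that heuristic's problem with some linker invocations.
    The enum, the function names and the suffix list are all illustrative
    assumptions.

```c
#include <string.h>

enum dcc_usage { DCC_EXPLICIT, DCC_INTERCEPTED, DCC_IMPLICIT };

/* Last path component of NAME. */
static const char *last_component(const char *name)
{
    const char *slash = strrchr(name, '/');

    return slash ? slash + 1 : name;
}

/* Does ARG look like an option, a source file or an object file rather
 * than a compiler name?  The suffix list is illustrative, not complete. */
static int looks_like_option_or_file(const char *arg)
{
    static const char *const suffixes[] = {
        ".c", ".cc", ".cpp", ".cxx", ".C", ".i", ".ii", ".s", ".S",
        ".o", ".a", ".so", NULL
    };
    const char *dot;
    int i;

    if (arg[0] == '-')
        return 1;
    dot = strrchr(last_component(arg), '.');
    if (dot == NULL)
        return 0;
    for (i = 0; suffixes[i] != NULL; i++)
        if (strcmp(dot, suffixes[i]) == 0)
            return 1;
    return 0;
}

/* Explicit name first, then interception, then fall back to DISTCC_CC. */
enum dcc_usage dcc_classify(int argc, char *const argv[])
{
    if (argc > 1 && !looks_like_option_or_file(argv[1]))
        return DCC_EXPLICIT;
    if (strcmp(last_component(argv[0]), "distcc") != 0)
        return DCC_INTERCEPTED;
    return DCC_IMPLICIT;
}
```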

    How to find the real compiler?  We might try looking for the first
    program of the same name that's not a symlink, but that will cause
    trouble on machines where there is a link like "gcc -> gcc-3.2",
    which is common.

    On the server, distcc may be on the path.  Sending an absolute
    path to the compiler is undesirable, particularly since we might
    be cross-compiling (and have some programs in /usr/local), or
    running on a different distribution.

    Another problem is that people will probably end up with the hook on their
    distccd's path, and therefore it will recurse.  See
    http://bugs.gentoo.org/show_bug.cgi?id=13625#c2


Protocol
  Perhaps rather than getting the server to reinterpret the command
  line, we should mark the input and output parameters on the client.
  So what's sent across the network might be

    distcc -c @@INPUT@@ -o @@OUTPUT@@

  It's probably better to add additional protocol sections to say
  which words should be the input and output files than to use magic
  values.
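
  As a sketch of the client side, the function below works out which
  words are the input and output so that their positions could be sent
  in separate protocol sections.  dcc_mark_files is a hypothetical name;
  it only understands "-o" as a separate word and a short, illustrative
  list of source suffixes.

```c
#include <string.h>

/* Does ARG name a source file?  Illustrative suffix list only. */
static int is_source(const char *arg)
{
    static const char *const suffixes[] = {
        ".c", ".cc", ".cpp", ".cxx", ".C", ".i", ".ii", NULL
    };
    const char *dot = strrchr(arg, '.');
    int i;

    if (arg[0] == '-' || dot == NULL)
        return 0;
    for (i = 0; suffixes[i] != NULL; i++)
        if (strcmp(dot, suffixes[i]) == 0)
            return 1;
    return 0;
}

/* Find which words of a gcc-style command line name the input and output
 * files.  Sets *INPUT and *OUTPUT to argv indices, or to -1 when the
 * command has no such word (as with "gcc --version").  Returns 0, or -1
 * if more than one source file is named. */
int dcc_mark_files(int argc, char *const argv[], int *input, int *output)
{
    int i;

    *input = *output = -1;
    for (i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-o") == 0 && i + 1 < argc) {
            *output = ++i;      /* the word after -o; skip over it */
        } else if (is_source(argv[i])) {
            if (*input != -1)
                return -1;      /* several sources: not a single job */
            *input = i;
        }
    }
    return 0;
}
```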

  The attraction is that this would allow a particularly knotty part
  of code to be included only in the client and run only once.  If any
  bugs are fixed in this, then only the client will need to be
  upgraded.  This might remove most of the gcc-specific knowledge from
  the server.

  Different clients might be used to support various very different
  distributable jobs.

  We ought to allow for running commands that don't take an input or
  output file, in case we want to run "gcc --version".

  The drawback is that probably new servers need to be installed to
  handle the new protocol version.

  I don't know if there's really a compelling reason to do this.  If
  the argument parser depends on things that can only be seen on the
  client, such as checking whether files exist, then this may be
  needed.

gcc weirdnesses:

    distcc needs to handle $COMPILER_PATH and $GCC_EXEC_PREFIX in some
    sensible way, if there is one.  Not urgent because I have never
    heard of them being used.

compiler versioning:

    distcc might usefully verify that the compiler versions and
    critical parameters are compatible on all machines, for example by
    running -V.  This really should be done in a way that preserves
    the simplicity of the protocol: we don't want to interactively
    query the server on each request.  Perhaps distcc ought to add
    -b and -V options to the compiler, based on
    whatever is present on the current machine?  Or perhaps the user
    should just do this.

networking timeouts:

    distcc waits for too long on unreachable hosts.  We probably need
    to timeout after about a second and build locally.  Probably this
    should be implemented by connect() in non-blocking mode, bounded
    by a select.

    Also we want a timeout for name resolution.  The GNU resolver has
    a specific feature to do this.  On other systems we probably need
    to use alarm(), but that might be more trouble than it is worth.  Jonas
    Jensen says:

	Timing out the connect call could be done easier than this, just by
	interrupting it with a SIGALRM, but that's not enough to abort
	gethostbyname. This method of longjmp'ing from a signal handler is what
	they use in curl, so it should be ok.

    The client should have a medium-term local cache about unusable
    servers, to avoid always retrying connections.  Several different
    cases (unreachable, host down, server down, server broken) will
    produce slightly different errors.


ssh

    Running distcc across OpenSSH has several security advantages and
    should be supported in the future.  They include:

	Volunteer machines will not need to open an additional
	network-facing service.

	Only authenticated users can use a volunteer machine.

	Clients have some guarantees that their connections to a
	volunteer are not being spoofed.

    Using SSH is greatly preferable to developing and maintaining a
    custom security protocol.

    If the client or volunteer is subverted, then the other party is
    not protected.  (For example, if the administrator of the
    volunteer is malicious, or if the volunteer has been compromised,
    then compilation results might contain trojans.)  However, this is
    the case for practically every Internet protocol.

    Using SSH will consume some CPU cycles in computation on both
    client and volunteer.

    A simple implementation would be trivial, since the daemon already
    works on stdin/stdout.  However, this might perform poorly because
    SSH takes quite a long time to open a connection.

    Connections should be hoarded by the client.  If the client
    doesn't already have an ssh connection to the server, distcc
    should fork, with a background task holding the connection open
    and coordinating access.


waitstatus

    Make sure that native waitstatus formats are the same as the
    Unix/Linux/BSD formats used on the wire.  (See
    <http://www.opengroup.org/onlinepubs/007904975/functions/wait.html>,
    which says they may only be interpreted by macros.)  I don't know
    of any system where they're different.


gui

    A GUI to show progress of compilation and distribution of load would be
    neat.  Probably the most sensible way is to make it parse $DISTCC_LOG.

override compiler name
	    
    distcc could support cross-compilation by a per-volunteer option to
    override the compiler name.  On the local host, it might invoke gcc
    directly, but on some volunteers it might be necessary to specify a more
    detailed description of the compiler to get the appropriate cross tool.
    This might be insufficient for Makefiles that need to call several
    different compilers, perhaps gcc and g++ or different versions of gcc.
    Perhaps they can make do with changing the DISTCC host settings at
    appropriate times.

    I'm not convinced this complexity is justified.
	    
IPv6

    distcc could easily handle IPv6, but it doesn't yet.  The new sockets API
    does not work properly on all systems, so we need to detect it and fall
    back to the old API as necessary.

LNX-BBC

    It would be nice to put distcc and appropriate compilers on the LNX-BBC.
    This could be pretty small because only the compiler would be required,
    not header files or libraries.

#pragma implementation
  
    We might keep the same file basename, and put the files in a temporary
    subdirectory.  This might avoid some problems with C++ looking at the
    filename for #pragma implementation stuff.

    This is also a potential fix for the -MD stuff: we could run the whole
    compile in a subdirectory, and then grab any .d files generated.

Installable package for Windows

    Also, it would be nice to have an easily installable package for Windows
    that makes the machine be a Cygwin-based compile volunteer.  It probably
    needs to include cross-compilers for Linux (or whatever), or at least
    simple instructions for building them.


    Automatic detection ("zero configuration") of compile volunteers is
    probably not a good idea, because it might be complicated to implement,
    and would possibly cause breakage by distributing to machines which are
    not properly configured.


    Notwithstanding the previous point, centralized configuration for a site
    would be good, and probably quite practical.  Setting up a list of
    machines centrally rather than configuring each one sounds more friendly.
    The most likely design is to use DNS SRV records (RFC2052), or perhaps
    multi-RR A records.  For example, compile.ozlabs.foo.com would resolve to
    all relevant machines.  Another possibility would be to use SLP, the
    Service Location Protocol, but that adds a larger dependency and it seems
    not to be widely deployed.


Large-scale Distribution

    distcc in its present form works well on small numbers of close machines
    owned by the same people.  It might be an interesting project to
    investigate scaling up to large numbers of machines, which potentially do
    not trust each other.  This would make distcc somewhat more like other
    "peer-to-peer" systems like Freenet and Napster.

Load Balancing

    When running a job locally (such as cpp or ld), distcc ought to count that
    against the load of localhost.  At the moment it is biased towards too
    much local load.

    distcc needs a way to know that some machines have multiple CPUs, and
    should accept a proportionally larger number of jobs at the same time.
    It's not clear whether multiprocessor machines should be completely filled
    before moving on to another machine.


    If there are more parallel invocations of distcc than available CPUs it's
    not clear what behaviour would be best.  Options include having the
    remaining children sleep; distributing multiple jobs across available
    machines; or running all the overflow jobs locally.


    In fact, on Linux it seems that running two tasks on a CPU is not much
    slower than running a single task, because the task-switching overhead is
    pretty low.


    Problems tend to occur when we run more jobs than will fit into available
    physical memory.  It might be nice if there was a "batch mode" scheduler
    that would finish one before running the next, but in the absence of that
    we have to do it ourselves.  I can't see any clean and portable way to
    determine when the compiler is using too much memory: it would depend on
    the RSS of the compiler (which depends on the source file), on the amount
    of memory and swap, and on what other tasks are running.  In addition, on
    some small boxes compiling large code, you may actually want (or need) to
    have it swap sometimes.


    In addition, it might be nice to have a --max-load option, as for GNU
    Make, to tell it not to accept more than one job (or more than zero?) when
    the machine's load average is above that number.  We can try calling
    getloadavg(), which should exist on Linux and BSD, but apparently not on
    Solaris.  Can take patches later.
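
    The check itself is small.  In this sketch both function names are
    hypothetical, only the one-minute average is used, and a failure to
    read the load average simply means the limit is not enforced.

```c
#define _DEFAULT_SOURCE         /* glibc: expose getloadavg() in strict modes */
#include <stdlib.h>

/* Pure policy: accept a job unless the load is above the limit. */
int dcc_load_allows_job(double load1, double max_load)
{
    return load1 <= max_load;
}

/* Returns 1 if a new job may be accepted under a --max-load style limit.
 * If the load average cannot be read, the limit is not enforced. */
int dcc_may_accept_job(double max_load)
{
    double load[1];

    if (getloadavg(load, 1) < 1)
        return 1;
    return dcc_load_allows_job(load[0], max_load);
}
```

    Whether a loaded machine should still take one job, or none, is the
    open question above; this version takes none.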


    A server-side administrative restriction on the number of consecutive
    tasks would probably be a sufficient approximation.


    Oscar Esteban suggests that when the server is limiting accepted jobs, it
    may be better to have it accept source, but defer compiling it.  This
    implies not using fifos, even if they would otherwise be appropriate.
    This may smooth out network utilization.  There may be some undesirable
    transient effects where we're waiting for one small box to finish all the
    jobs it has queued.


    The scheduler would ideally also take into account the special
    distribution required for non-parallel parts of the build: the most common
    case is running configure, where many small jobs will be run sequentially.
    In general the best solution is to run them locally, but if the local
    machine is very slow that may not be true.  Perhaps some kind of adaptive
    system based on measuring the performance of all available machines would
    make sense.

distributed caching

    Look in the remote machine's cache as well.

    Perhaps use a SQUID-like broadcast of the file digest and other critical
    details to find out if any machine in the workgroup has the file cached.
    Perhaps this could be built on top of a more general file-caching
    mechanism that maps from hash to body.  At the moment this sounds like
    premature optimization.