This file describes the format produced by 'svnadmin dump' and
consumed by 'svnadmin load'.  

The format has undergone revisions over time.  They are presented in
reverse chronological order here.  You may wish to start with the
VERSION 1 description in order to get a baseline understanding first.


VERSION 3 FORMAT

(generated by SVN versions 1.1.0-present, if requested by the user)

This format is equivalent to the VERSION 2 format except for the
following:

1.) The format starts with the new version number of the dump format
    ("SVN-fs-dump-format-version: 3\n").

2.) There are several new optional headers for node changes:

[Text-delta: true|false]
[Prop-delta: true|false]
[Text-delta-base-md5: blob]
[Text-delta-base-sha1: blob]
[Text-copy-source-sha1: blob]
[Text-content-sha1: blob]

    The default value for the boolean headers is "false".  If the value is
    set to "true", then the text and property contents will be treated
    as deltas against the previous contents of the node (as determined
    by copy history for adds with history, or by the value in the
    previous revision for changes--just as with commits).

Property deltas have the same format as regular property lists except
that (1) properties with the same value as in the previous contents of
the node are not printed, and (2) deleted properties will be written
out as

D <name length>
<name>

just as a regular property is printed, but with the "K " changed to a
"D " and with no value part.

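As a rough illustration of the property list and property delta layout
described above, here is a minimal Python sketch of a reader.  The
function name parse_props and the sample property names are hypothetical,
and the sketch assumes that the name of a deleted property follows its
"D" line on a line of its own, exactly as it would after a "K" line.

```python
def parse_props(data: bytes):
    """Parse a property block; return (props, deleted, bytes_consumed).

    Hypothetical helper, not part of Subversion.  Handles "K"/"V" pairs
    and the "D" records used by property deltas.
    """
    props, deleted = {}, []
    pos = 0
    while True:
        eol = data.index(b"\n", pos)
        line = data[pos:eol]
        pos = eol + 1
        if line == b"PROPS-END":
            return props, deleted, pos
        tag, length = line.split(b" ")
        n = int(length)
        name = data[pos:pos + n]
        pos += n + 1                      # skip the name and its newline
        if tag == b"D":
            deleted.append(name)          # deleted property: no value part
            continue
        if tag != b"K":
            raise ValueError("unexpected record tag: %r" % tag)
        eol = data.index(b"\n", pos)
        vtag, vlen = data[pos:eol].split(b" ")
        if vtag != b"V":
            raise ValueError("expected a V line after a K record")
        pos = eol + 1
        m = int(vlen)
        props[name] = data[pos:pos + m]
        pos += m + 1                      # skip the value and its newline


# Made-up property delta: one changed property, one deleted property.
block = b"K 5\ncolor\nV 3\nred\nD 4\nsize\nPROPS-END\n"
props, deleted, used = parse_props(block)
```

Lengths are byte counts, so the sketch slices by length rather than
splitting on newlines; a property value may itself contain newlines.
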
Text deltas are written out as a series of svndiff0 windows.  If
Text-delta-base-md5 is provided, it is the checksum of the base to
which the text delta is applied; note that older versions (pre-1.5) of
'svnadmin load' may ignore the checksum.

Text-delta-base-sha1, Text-copy-source-sha1, and Text-content-sha1 are not
currently used by the loader.  They are written by 1.6-and-later versions of
Subversion so that future loaders can optionally choose which checksum to
use for checking for corruption.
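
A loader that did want to check one of these checksums could do so along
the following lines.  This is only a sketch: the function name
checksum_matches is hypothetical, and it assumes that the "blob" in the
header is the hexadecimal digest of the text in question.

```python
import hashlib


def checksum_matches(data: bytes, expected_hex: str, algorithm: str = "md5") -> bool:
    """Compare the digest of data with a header value.

    Hypothetical helper; assumes the header carries a hex digest.
    algorithm is "md5" for the *-md5 headers, "sha1" for the *-sha1 ones.
    """
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest == expected_hex.strip().lower()


# The digests of the empty string are well known, so use them as a check.
ok_md5 = checksum_matches(b"", "d41d8cd98f00b204e9800998ecf8427e")
ok_sha1 = checksum_matches(b"", "DA39A3EE5E6B4B0D3255BFEF95601890AFD80709", "sha1")
```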


VERSION 2 FORMAT

(generated by SVN versions 0.18.0-present, by default)

This format is equivalent to the VERSION 1 format in every respect,
except for the following:

1.) The format starts with the new version number of the dump format
    ("SVN-fs-dump-format-version: 2\n").

2.) In addition to "Revision Records", another sort of record is supported:
    the "UUID" record, which should be of the form:

UUID: 7bf7a5ef-cabf-0310-b7d4-93df341afa7e

    This should be used to indicate the UUID of the originating repository.
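
    As an illustration, the version line and the optional UUID record
    could be read as in the Python sketch below.  The function name
    read_preamble is hypothetical, and the sketch assumes that blank
    lines may separate the version line, the UUID record, and the first
    revision record.

```python
import io


def read_preamble(stream):
    """Return (format_version, uuid_or_None) from the start of a dump.

    Hypothetical helper.  The stream must be seekable, because a line
    that turns out not to be a UUID record is pushed back for the caller.
    """
    line = stream.readline()
    key, sep, value = line.rstrip(b"\n").partition(b": ")
    if key != b"SVN-fs-dump-format-version" or not sep:
        raise ValueError("not a dump stream")
    version = int(value)
    uuid = None
    while True:
        mark = stream.tell()
        line = stream.readline()
        if line == b"\n":
            continue                      # skip blank separator lines
        if version >= 2 and line.startswith(b"UUID: "):
            uuid = line[len(b"UUID: "):].strip().decode("ascii")
        else:
            stream.seek(mark)             # not a UUID record; leave it unread
        return version, uuid


dump = io.BytesIO(
    b"SVN-fs-dump-format-version: 2\n\n"
    b"UUID: 7bf7a5ef-cabf-0310-b7d4-93df341afa7e\n\n"
    b"Revision-number: 0\n"
)
version, uuid = read_preamble(dump)
```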


VERSION 1 FORMAT

(generated by SVN versions prior to 0.18.0)

The binary format starts with the version number of the dump format
("SVN-fs-dump-format-version: 1\n"), followed by a series of revision
records.  Each revision record starts with information about the
revision, followed by a variable number of node changes for that
revision.  Fields in [brackets] are optional, and unknown headers are
always ignored, for backwards compatibility.

Revision-number: N
Prop-content-length: P
Content-length: L

   ...P bytes of property data.  Properties are stored in the same
   human-readable hashdump format used by working copy property files,
   except that they end with "PROPS-END\n" for better readability.

Node-path: absolute/path/to/node/in/filesystem
Node-kind: file | dir  (1)
Node-action: change | add | delete | replace
[Node-copyfrom-rev: X]
[Node-copyfrom-path: path]
[Text-copy-source-md5: blob] (2)
[Text-content-md5: blob]
[Text-content-length: T]
[Prop-content-length: P]
Content-length: Y (3)

   ... Y bytes of content data, divided into P bytes of "property"
   data and T bytes of "text" data.  The properties come first; their
   total length (including formatting) is Prop-content-length, and is
   included in Content-length.  The "PROPS-END\n" line always
   terminates the property section if there are props.  The remainder
   of the Y bytes (expected to be equivalent to Text-content-length)
   represent the contents of the node.


   (1) If the node represents a deletion, this field is optional.

   (2) This is a checksum of the source of the copy.  A loader process
       can use this checksum to determine that the copyfrom path/rev
       already present in a filesystem is really the *correct* one to
       use.

   (3) The Content-length header is technically unnecessary, since the
       information it holds (and more) can be found in the
       Prop-content-length and Text-content-length fields.  Though
       Subversion itself does not make use of the header when reading
       a dumpfile, we include it for compatibility with generic RFC822
       parsers.

   (4) There are actually two types of version 1 dump streams.  The
       regular ones have been generated since r2634 (svn 0.14.0).  Older
       ones also claim to be version 1, but lack the Prop-content-length
       and Text-content-length fields in the block header.  In those
       days there was *always* a properties block.
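
The record layout above can be read with a loop over header lines
followed by two length-delimited reads.  The Python sketch below is one
possible reading of the description, not the actual loader; the function
name read_record and the sample record are hypothetical, and it assumes
that a blank line ends the header block and that missing length headers
mean zero bytes.

```python
import io


def read_record(stream):
    """Read one record; return (headers, prop_bytes, text_bytes) or None.

    Hypothetical helper.  Unknown headers are simply kept in the dict,
    which mirrors the rule that unknown headers are ignored.
    """
    headers = {}
    while True:
        line = stream.readline()
        if not line:                      # end of stream
            if not headers:
                return None
            break
        if line == b"\n":
            if headers:
                break                     # blank line ends the header block
            continue                      # blank lines between records
        name, sep, value = line.rstrip(b"\n").partition(b": ")
        if not sep:
            raise ValueError("malformed header line: %r" % line)
        headers[name.decode("utf-8")] = value.decode("utf-8")
    p = int(headers.get("Prop-content-length", 0))
    t = int(headers.get("Text-content-length", 0))
    if "Content-length" in headers and int(headers["Content-length"]) != p + t:
        raise ValueError("Content-length does not equal P + T")
    props = stream.read(p)
    text = stream.read(t)
    if len(props) != p or len(text) != t:
        raise ValueError("truncated record body")
    return headers, props, text


# Made-up node record: an empty property list followed by six bytes of text.
sample = io.BytesIO(
    b"Node-path: bar/foo.c\nNode-kind: file\nNode-action: add\n"
    b"Prop-content-length: 10\nText-content-length: 6\nContent-length: 16\n\n"
    b"PROPS-END\nhello\n\n"
)
headers, props, text = read_record(sample)
```

The sketch relies on Prop-content-length and Text-content-length for the
split, as note (3) says Subversion does, and uses Content-length only as
a consistency check.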

Here's an example of revision 1422, whereby I added a new directory
"baz", added a new file "bop" inside it, and modified the file "foo.c":

Revision-number: 1422
Prop-content-length: 80
Content-length: 80

K 6
V 7
K 3
V 33
Added two files, changed a third.

Node-path: bar/baz
Node-kind: dir
Node-action: add
Prop-content-length: 35
Content-length: 35

K 10
V 4

Node-path: bar/baz/bop
Node-kind: file
Node-action: add
Prop-content-length: 76
Text-content-length: 54
Content-length: 130

K 14
V 2
K 12
V 15
Here is the text of the newly added 'bop' file.

Node-path: bar/foo.c
Node-kind: file
Node-action: change
Text-content-length: 102
Content-length: 102

Here is the fulltext of my change to an existing /bar/foo.c.
Notice that this file has no properties.
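
The length headers in an example like this one can be derived
mechanically from the content.  The Python sketch below shows one way a
writer might do it; the function names dump_props and node_record are
hypothetical, and the property name and value are illustrative, chosen
only so that their lengths match the "K 10", "V 4" and 35 figures used
for bar/baz above.

```python
def dump_props(props):
    """Serialize a {name: value} mapping of bytes into hashdump form."""
    out = bytearray()
    for name, value in props.items():
        out += b"K %d\n%s\n" % (len(name), name)
        out += b"V %d\n%s\n" % (len(value), value)
    out += b"PROPS-END\n"
    return bytes(out)


def node_record(path, kind, action, props=None, text=None):
    """Build one node record with its length headers filled in.

    Hypothetical helper: props is a dict of bytes or None, text is bytes
    or None.  Content-length is the sum of the two parts.
    """
    headers = [("Node-path", path), ("Node-kind", kind), ("Node-action", action)]
    body = b""
    if props is not None:
        pdata = dump_props(props)
        headers.append(("Prop-content-length", str(len(pdata))))
        body += pdata
    if text is not None:
        headers.append(("Text-content-length", str(len(text))))
        body += text
    headers.append(("Content-length", str(len(body))))
    head = "".join("%s: %s\n" % (k, v) for k, v in headers).encode("utf-8")
    return head + b"\n" + body + b"\n"


# Illustrative values only; they are not taken from a real dump.
pdata = dump_props({b"svn:ignore": b"TAGS"})
record = node_record("bar/baz", "dir", "add", props={b"svn:ignore": b"TAGS"})
```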

-*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-

Old discussion: 

(This file started as a proposal, preserved here for posterity.)

A proposal for an svn filesystem dump/restore format.

Two problems we want to solve

 1.  When we change our node-id schema, we need to migrate all of our
     data (by dumping and restoring).

 2.  We need a backup format, one that other software tools could
     also read.

Design Goals

 A.  Written as two new public functions in svn_fs.h.  To be invoked
     by new 'svnadmin' subcommands.

 B.  Format uses only timeless fs concepts.

     The dump format needs to reference concepts that we *know* are
     general enough to never change.  These concepts must exist
     independently of any internal node-id schema, or any DB storage
     backend.  In other words, we're talking about the basic ideas in
     our original "design spec" from May 2000.

Format Semantics

Here are the timeless semantics of our fs design -- the things that
would be stored in our dump format.

  - A filesystem is an array of trees.
    Each tree is called a "revision" and has unversioned properties attached.

  - A revision has a tree of "nodes" hanging off of it.
    Actually, the nodes in the filesystem form a DAG.  A revision
    always points to an initial node that represents the 'root' of some tree.

  - The majority of a tree's nodes are hard-links (references) to
    nodes that were created in earlier trees.

  - A node contains 

        - versioned text
        - versioned properties
        - predecessor history:  "which node am I a variant of?"
        - copy history:  "which node am I a copy of?"

    The history values can be non-existent (meaning the node is
    completely new), or can have a value of {revision, path}.

Refinement of proposal #2:  (after discussion with gstein)

Each node starts with RFC822-style headers at the top.  The final
header is a 'Content-length:', followed by the content, so record
boundaries can be inferred.

The content section has two implicit parts: a property hash, and the
fulltext.  The division between these two sections is implied by the
"PROPS-END\n" tag at the end of the prophash.  In the case of a
directory node or a revision, only the prophash is present.