The *file-index* is a part of the Mercurial store that tracks all file-paths used
within a repository store.

WARNING: The "file-index" is an experimental feature still in development.
While the file format (`fileindex-v1`) is frozen and you can thus expect it to be supported,
the feature itself may still have bugs or need API/config changes.


The file index supersedes the fncache (see the "Difference with FnCache"
section) and is independent of the working copy, only tracking the file-paths
relevant to the history.

The file index is used by operations that need to iterate over all the
information in the store, such as local and stream cloning, repository upgrades,
or various debug commands computing statistics.

The file index also provides a bidirectional mapping between "file-path" and
"file-token". A "file-token" is an internal, opaque, fixed-size identifier that
can be used by algorithms for efficiency.

The file index is designed for efficient querying and updating.

The file index is currently incompatible with the experimental `tree manifest`
feature.

This feature is guarded by the `fileindex-v1` requirement. See
:hg:`help internals.requirements` for details.

High level design
=================

Functionality
-------------

The file-index features revolve around two main concepts:

* file-path: The path of a file for which we have file revisions stored in the
  store of a repository.

  The "file-path" is the same as the path initially tracked and committed from
  the working copy without any transformation or encoding, except for the path
  separator being normalised to "/". The "file-path" is a relative path (from
  the working copy root).

  When using a narrow repository, filtered file-paths whose file-revisions are
  missing from the store are not stored in the file-index.

* file-token: An opaque, fixed size identifier that uniquely identifies a
  file-path within a given repository. The file tokens are local only and not
  consistent between clones.

  In practice, a file index *token* is a nonzero unsigned 31-bit integer: it
  fits in 32 bits but we reserve the value 0 and all values with the high bit
  set for other uses. However, this is not a core guarantee for "file-token" and
  might change in the future.
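
The current token range can be expressed as a small check. This is only a
sketch: `is_valid_token` and `TOKEN_MAX` are hypothetical names, not part of
Mercurial, and the range itself is the current practice described above rather
than a guarantee.

```python
# Hypothetical helper, not part of Mercurial's API.
TOKEN_MAX = 0x7FFFFFFF  # largest 32-bit value without the high bit set


def is_valid_token(value: int) -> bool:
    """Return True if `value` is a nonzero unsigned 31-bit integer."""
    return 0 < value <= TOKEN_MAX
```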


The feature envelope of the file-index is:

    * It records the full set of "file-path" for which we have file-revisions
      stored in the repository store.

      For a standard repository, this means all the "file-path" ever seen in
      the history.

      For a narrow repository, this means all the "file-path" seen in the
      history that are also selected by the narrow spec.

    * It provides an efficient mapping from "file-path" to "file-token".

    * It provides an efficient mapping from "file-token" to "file-path".

    * It lists the set of all stored "file-path" efficiently.

    * It adds new "file-path" efficiently, without affecting the validity of
      existing "file-token". A set of additions can be done in an "atomic" and
      transactional way.

      - A series of updates is either not visible at all, or becomes visible
        all at once on commit.

      - An ongoing update can be aborted, discarding the new "file-path"
        recorded so far.

      - "pending visibility": external processes, like hooks, can be made to
        see the changes from the in-progress transaction.

    * It can remove some "file-path" from the file-index. However, such an
      operation:

      - may affect the validity of existing "file-token",
      - is not guaranteed to be efficient,
      - doesn't ensure transactionality.

    * It is mmap friendly.

    * Using this feature or not should not affect exchange between peers.

    * Once loaded, read-only accesses can be done without memory allocations.

As the feature is still experimental (even though the file format itself is
frozen), this feature set is subject to change.


Implementation summary
----------------------

There are four main components to the `file-index`.

Three are related to the data themselves:

- the "path-index" that maps "file-path" to "file-token",
- the "token-index" that maps "file-token" to "file-path",
- the "file-path-data" that contains all the "file-path" themselves.

The last one is the "docket". It stores various metadata and provides
transactionality, mmap safety, and other lower level details. See the "Internal
filesystem representation" section for details on the docket.

The "file-path-data" contains all the "file-path" stored in the repository. They
are each stored in full, to be able to access them directly without processing
or memory allocation. This content is "append only" and its size complexity is
`O(N)`.

The "token-index" is a linear index that can resolve each "file-token" to a
"file-path" in `O(1)` time. It also contains extra metadata to efficiently split
the "basename" of a "file-path" from the directories containing it. This content
is append only and its size complexity is `O(N)`.

The "path-index" is a prefix tree that maps a "file-path" to a "file-token" (if
the "file-path" is known to the `file-index`). The prefix tree relies on the
content of the "file-path-data" to encode its prefixes, which allows it to use
fixed size nodes. Searching for or adding a "file-path" takes `O(log(N))` time.
To preserve the append-only property on disk, this prefix tree is stored as a
"persistent" tree: inserting data only adds new nodes pointing to existing
ones and never overwrites existing nodes. As we will eventually vacuum the
"dead" nodes, its size complexity remains `O(N)`.
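
The "persistent" insertion described above can be illustrated with a toy
in-memory model. This sketch makes several simplifying assumptions that differ
from the real format: it uses one node per byte instead of compressed labels, a
Python list stands in for the append-only tree file, and the returned root index
stands in for the root address kept in the docket. The names `insert` and
`lookup` are hypothetical.

```python
from typing import Optional

# A node is a tuple (token or None, {byte: child index}).
# Nodes are appended to the list and never modified afterwards.


def insert(nodes: list, root: int, path: bytes, token: int) -> int:
    """Append new nodes for `path` and return the index of the new root."""
    # chain[d] is the existing node for path[:d], or None if there is none.
    chain = [root]
    for byte in path:
        parent = chain[-1]
        chain.append(None if parent is None else nodes[parent][1].get(byte))
    # New node for the full path, keeping the children of the old one if any.
    old = chain[-1]
    nodes.append((token, {} if old is None else nodes[old][1]))
    child = len(nodes) - 1
    # Copy every ancestor up to the root, pointing each copy at the new child.
    for depth in range(len(path) - 1, -1, -1):
        old = chain[depth]
        old_token, old_children = (None, {}) if old is None else nodes[old]
        nodes.append((old_token, {**old_children, path[depth]: child}))
        child = len(nodes) - 1
    return child


def lookup(nodes: list, root: int, path: bytes) -> Optional[int]:
    """Return the token stored for `path` as seen from `root`, if any."""
    node = root
    for byte in path:
        node = nodes[node][1].get(byte)
        if node is None:
            return None
    return nodes[node][0]
```

Starting from `nodes = [(None, {})]` and root index 0, each call to `insert`
returns a new root while older roots keep describing the older content. The
copies left behind by earlier insertions are the "dead" nodes that a later
rewrite can vacuum.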

Difference with FnCache
-----------------------

The `fncache` and the `file-index` store similar but different information.
While the `file-index` stores plain `file-path`, the `fncache` stores the paths
of filelog related files (stored on disk in `.hg/store/data/`).

So when committing a "Foo/Bar/luz.txt" file:

- the `file-index` stores "Foo/Bar/luz.txt"
- the `fncache` stores "data/Foo/Bar/luz.txt.i"
  (and possibly "data/Foo/Bar/luz.txt.d")

Note that the paths stored in the `fncache` are "unencoded": they are the paths
before any "path encoding" happens. In practice, the actual file system file
will likely be "data/_foo/_bar/luz.txt.i", and some of the filelogs may live in
`.hg/store/dh/`.

See :hg:`help internals.revlogs` for details on the revlog related files and
path encoding.
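
The contrast between the two can be sketched as a tiny function deriving the
`fncache` entries from a "file-path". `fncache_entries` is a hypothetical name
used for illustration only. No path encoding is applied, since the `fncache`
stores unencoded paths.

```python
def fncache_entries(file_path: str, has_data_file: bool = False) -> list:
    """Hypothetical helper: unencoded fncache entries for one file-path."""
    entries = [f"data/{file_path}.i"]
    if has_data_file:
        # A separate ".d" file is only possibly present.
        entries.append(f"data/{file_path}.d")
    return entries
```

With the file-index, only the "file-path" itself ("Foo/Bar/luz.txt" in the
example above) is recorded.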


By leaking revlog's implementation details, the `fncache` offers less
flexibility to the revlog, and to the storage layer in general. A repository
using `fncache` must use one filelog per "file-path" and its revlogs cannot use
flexible filenames. On the other hand, a `file-index` repository stores
higher-level information from which the filelog paths can be computed. So it
offers the same features while offering more flexibility.

The only "drawback" is that the `fncache` provides a way to know that a given
filelog is not inlined (even after lock release), while with the `file-index` we
need to open the filelog to get that information. However, the `fncache` doesn't
provide a way to know that a given filelog is inlined, once the lock is
released. In addition, revlog's inlining is overall a quite flawed feature that
we are slowly moving away from, so this is not expected to be a major problem.


Another significant difference is that the `fncache` doesn't provide
`file-token`.

Unlike the `file-index`, the `fncache` is currently not visible during "pending"
operations.

Last but not least, the `fncache` file format is not tailored for performance.
Stored as a flat, line-based listing of paths, the `fncache` needs to load all
the paths from disk to find out whether a given path exists in it, slowing down
searches and updates. This affects the performance of higher-level operations
like `commit` or `pull`.

Internal filesystem representation
==================================

File organization
-----------------

The file index storage consists of four files:

* `.hg/store/fileindex`: the docket file
* `.hg/store/fileindex-list.{ID1}`: the list file, containing the "file-path-data"
* `.hg/store/fileindex-meta.{ID2}`: the meta file, containing the "token-index"
* `.hg/store/fileindex-tree.{ID3}`: the tree file, containing the "path-index"

The docket is a small file that functions similarly to the `dirstate-v2` docket.
It stores metadata about the other files, including their active sizes and IDs.
The files with ID suffixes are append-only and never truncated. This provides us
with "transactionality" and "mmap safety". Whenever we need to rewrite their
content, we write to a file with a new ID and update the docket to point to it.
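
As an illustration of the naming scheme, the following sketch builds the four
paths from a store directory and the three IDs. `data_file_paths` is a
hypothetical helper, and the IDs are assumed to have already been read from the
docket.

```python
import posixpath


def data_file_paths(store: str, list_id: str, meta_id: str, tree_id: str) -> dict:
    """Hypothetical helper: locations of the file index files in `store`."""
    return {
        "docket": posixpath.join(store, "fileindex"),
        "list": posixpath.join(store, f"fileindex-list.{list_id}"),
        "meta": posixpath.join(store, f"fileindex-meta.{meta_id}"),
        "tree": posixpath.join(store, f"fileindex-tree.{tree_id}"),
    }
```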

The docket file format
----------------------

The purpose of the docket file is to provide a consistent view of the file
index. Changes to the other files are only visible to other processes when the
docket is updated on disk, hence only the docket needs to be rolled back when a
transaction is aborted.

The docket file contains the following fields at fixed byte offsets counting
from the start of the file:

* Offset 0:
  The 12-byte marker string "fileindex-v1". This makes the format easier to
  identify in case we introduce a new one in the future, although it is not
  strictly necessary since `.hg/store/requires` determines which format to use.

The following three "used size" fields are stored as 32-bit big-endian integers.
The actual size of the respective files may be larger (if another Mercurial
process is appending but has not updated the docket yet). That extra data at the
end of the files must be ignored.

* Offset 12: The used size of the list file in bytes.
* Offset 16: The used size of the meta file in bytes.
* Offset 20: The used size of the tree file in bytes.

The following three "ID" fields are stored as 8-byte strings. They indicate
where the corresponding data file is stored. For example, if the list file ID is
"ab19b7c0", then the list file is stored at `.hg/store/fileindex-list.ab19b7c0`.

* Offset 24: The list file ID.
* Offset 32: The meta file ID.
* Offset 40: The tree file ID.

* Offset 48: Prefix tree root node address

  Pseudo-pointer to the root node in the prefix tree, counted in bytes from the
  start of the tree file, as a 32-bit big-endian integer.

* Offset 52: Amount of "dead" data

  How many bytes of the tree file (within its used size) are unused, as a 32-bit
  big-endian integer. When appending to an existing tree file, some existing
  nodes can be unreachable from the new root but they still take up space. This
  counter is used to decide when to write a new tree file from scratch instead
  of appending to an existing one, effectively vacuuming the "dead" data.

* Offset 56: Four flag bytes, currently ignored and reset to zero when saving
  the docket file.

* Offset 60: Number of garbage entries as a 32-bit big-endian integer.

* Offset 64: Size of the garbage path buffer in bytes, as a 32-bit big-endian
  integer.

* Offset 68:
  Array of garbage entries. Each has the following 12-byte layout:

  * Offset 0:
    Time-to-live (TTL) as a 16-bit big-endian integer. This is the remaining
    number of transactions the file must be kept around for.
  * Offset 2:
    Timestamp when the entry was added, in seconds since the Unix epoch, as a
    32-bit big-endian integer.
  * Offset 6:
    Offset of the path within the path buffer, as a 32-bit big-endian integer.
  * Offset 10:
    Length of the path in bytes, as a 16-bit big-endian integer.

* Next offset (68 + 12 * number of garbage entries):
  The garbage path buffer, containing file paths relative to `.hg/store/`
  terminated by null bytes, one after the other. The presence of the null bytes
  is optional and should not be relied on.
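
The layout above maps directly onto Python's `struct` module. The following is
a minimal sketch of a reader, not Mercurial's actual implementation:
`parse_docket` and the dictionary keys are hypothetical names, and the only
validation performed is a check of the marker string.

```python
import struct

# Fixed-size part of the docket (68 bytes), in the order listed above:
# marker, three used sizes, three IDs, tree root address, dead byte count,
# four flag bytes, garbage entry count, garbage path buffer size.
DOCKET_HEADER = struct.Struct(">12s3I8s8s8sII4sII")
# One garbage entry (12 bytes): TTL, timestamp, path offset, path length.
GARBAGE_ENTRY = struct.Struct(">HIIH")


def parse_docket(data: bytes) -> dict:
    """Hypothetical helper: decode a docket into a plain dictionary."""
    (marker, list_size, meta_size, tree_size,
     list_id, meta_id, tree_id,
     tree_root, tree_dead, _flags,
     garbage_count, garbage_buf_size) = DOCKET_HEADER.unpack_from(data, 0)
    if marker != b"fileindex-v1":
        raise ValueError("not a fileindex-v1 docket")
    # The path buffer follows the array of garbage entries.
    buf_start = DOCKET_HEADER.size + GARBAGE_ENTRY.size * garbage_count
    path_buf = data[buf_start:buf_start + garbage_buf_size]
    garbage = []
    for i in range(garbage_count):
        entry_start = DOCKET_HEADER.size + GARBAGE_ENTRY.size * i
        ttl, timestamp, offset, length = GARBAGE_ENTRY.unpack_from(
            data, entry_start)
        # Use the explicit length: null terminators are optional.
        path = path_buf[offset:offset + length]
        garbage.append({"ttl": ttl, "timestamp": timestamp, "path": path})
    return {
        "used_sizes": {"list": list_size, "meta": meta_size, "tree": tree_size},
        "ids": {
            "list": list_id.decode("ascii"),
            "meta": meta_id.decode("ascii"),
            "tree": tree_id.decode("ascii"),
        },
        "tree_root": tree_root,
        "tree_dead": tree_dead,
        "garbage": garbage,
    }
```

Reading paths through their recorded offset and length, rather than scanning
for null bytes, follows the rule that the terminators must not be relied on.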

The list file format
---------------------

The purpose of the list file is to allow the meta file and tree file to
reference paths from fixed-length structures. It also enables code to construct
file paths without allocation, assuming the list file is mapped in memory.

Because the list file is append-only, to remove paths from it, we must write a
new list file from scratch. This is necessary in rare cases such as when
narrowing a repository or stripping changesets.

The list file contains all the file paths terminated by null bytes, one after
the other. The order of the paths is arbitrary and should not be relied on.
The presence of the null bytes is optional and should not be relied on.
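
Since terminators cannot be relied on, a reader needs both an offset and a
length to extract a path. A minimal sketch, where `read_path` is a hypothetical
name and the offset and length are assumed to come from the meta file:

```python
def read_path(list_data, offset: int, length: int) -> memoryview:
    """Hypothetical helper: view of one path, without copying its bytes."""
    return memoryview(list_data)[offset:offset + length]
```

`list_data` can be any object supporting the buffer protocol, for example a
`bytes` object or an `mmap.mmap` mapping of the list file.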

The meta file format
--------------------

The purpose of the meta file is to store information for each token (in
particular, its file path) with constant time lookup.

Because the meta file is append-only, to remove elements from it, we must write
a new meta file from scratch. This is necessary in rare cases such as when
narrowing a repository or stripping changesets.

The meta file contains an array of 8-byte elements. The element for token T is
found at byte offset T*8. Each element has the following layout:

* Offset 0:
  Pseudo-pointer to the start of the token's file path, counted in bytes from
  the start of the list file, as a 32-bit big-endian integer.

* Offset 4:
  Length of the token's file path in bytes, as a 16-bit big-endian integer.
  This does not include the potential null terminator.

* Offset 6:
  Length of the path's directory name prefix in bytes, as a 16-bit big-endian
  integer. This is the byte index of the final "/" character in the path. If
  there is no "/", meaning this represents a file in the root of the repository,
  then this field is zero.
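
Resolving a token then amounts to one fixed-offset read in the meta file
followed by a slice of the list file. The sketch below assumes both files are
already loaded (or mapped) as bytes-like objects. `lookup_token` and
`split_path` are hypothetical names.

```python
import struct

# One meta element (8 bytes): list file offset, path length, dirname length.
META_ENTRY = struct.Struct(">IHH")


def lookup_token(meta_data: bytes, list_data: bytes, token: int):
    """Return (path, dirname_length) for `token`."""
    offset, length, dirname_length = META_ENTRY.unpack_from(
        meta_data, token * META_ENTRY.size)
    return list_data[offset:offset + length], dirname_length


def split_path(path: bytes, dirname_length: int):
    """Split `path` into (directory, basename) using the stored length."""
    if dirname_length == 0:
        # No "/" in the path: the file lives in the repository root.
        return b"", path
    # dirname_length is the index of the final "/", which is skipped.
    return path[:dirname_length], path[dirname_length + 1:]
```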

The tree file format
--------------------

The purpose of the tree file is to store information for each file path (in
particular, its token) and support fast lookup.

The tree file contains nodes that form a prefix tree. The docket file indicates
which node is the root node. Because the tree file is append-only, every time we
insert a path into the tree, we must create new internal nodes all the way up to
the root. This results in a growing number of unreachable nodes over time. When
this unused space exceeds a threshold, we rebuild the tree file from scratch.
Note that we don't do this process for each individual insertion, but only when
we flush the current inserted batch to disk.

Each node has a string *label*, and a *prefix* obtained by concatenating labels
from the root to the node. The node does not store its label directly, only its
first character, its length, and a file token. The label is a substring of the
file path associated with the token, where the starting index is obtained by
adding label lengths from the root to the node.

There must not be a common prefix between any pair of sibling node labels. This
limits the number of children to 256 based on initial bytes 0x00 to 0xff, and
further to 253 since "\n", "\r", and "\0" are forbidden in file paths. This
limit allows us to store the number of children in an 8-bit integer.

Each node has the following 6-byte header:

* Offset 0: Token as a 32-bit big-endian integer.
* Offset 4: Label length as an 8-bit integer.
* Offset 5: Number of children as an 8-bit integer.

The root node has token 0 and label length 0. For non-root nodes, the token and
the label length must be nonzero.

The header is followed by two arrays:

* Offset 6:
  For each child, its label's first character as an 8-bit integer.
  As explained above, these characters are all distinct.

* Next offset (6 + number of children):
  For each child, a 32-bit big-endian integer value.

  If the high bit is 0, then the value is a pseudo-pointer to the child node,
  counting in bytes from the start of the tree file.

  If the high bit is 1, then the child is a leaf node and the remaining 31 bits
  are its token. Its label length is implicit: the label extends to the end of
  the file path associated with the token.
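
A lookup over this layout can be sketched as follows. This is an illustration
under stated assumptions rather than Mercurial's actual algorithm: the descent
only follows first characters and label lengths, then confirms the candidate by
comparing its full path. It also assumes that when a path ends exactly at an
internal node, that node's token refers to that path, which the description
above does not spell out. `find_candidate_token`, `lookup`, and the `resolve`
callback (mapping a token to its path, for example through the meta and list
files) are hypothetical names.

```python
import struct
from typing import Callable, Optional

NODE_HEADER = struct.Struct(">IBB")  # token, label length, child count
CHILD_VALUE = struct.Struct(">I")
LEAF_FLAG = 0x80000000
TOKEN_MASK = 0x7FFFFFFF


def find_candidate_token(tree: bytes, root: int, path: bytes) -> Optional[int]:
    """Descend from `root`; return a token whose path might equal `path`."""
    node = root
    depth = 0  # number of bytes of `path` covered by labels so far
    while True:
        token, label_len, num_children = NODE_HEADER.unpack_from(tree, node)
        depth += label_len
        if depth > len(path):
            return None  # the label extends past the end of `path`
        if depth == len(path):
            return token or None  # the root carries the reserved token 0
        chars_start = node + NODE_HEADER.size
        first_chars = tree[chars_start:chars_start + num_children]
        index = first_chars.find(path[depth:depth + 1])
        if index < 0:
            return None  # no child label starts with the next byte
        values_start = chars_start + num_children
        (value,) = CHILD_VALUE.unpack_from(tree, values_start + 4 * index)
        if value & LEAF_FLAG:
            return value & TOKEN_MASK  # leaf: the remaining bits are the token
        node = value  # internal node: follow the pseudo-pointer


def lookup(tree: bytes, root: int, path: bytes,
           resolve: Callable[[int], bytes]) -> Optional[int]:
    """Return the token of `path`, or None if it is not in the tree."""
    token = find_candidate_token(tree, root, path)
    if token is not None and resolve(token) == path:
        return token
    return None
```

The final comparison is what makes skipping the label bytes during the descent
safe: a path that differs from the stored one somewhere inside a label is
rejected at the end.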
