yjs/INTERNALS.md

# Yjs Internals

This document roughly explains how Yjs works internally. There is a complete
walkthrough of the Yjs codebase available as a recording:
https://youtu.be/0l5XgnQ6rB4

The Yjs CRDT algorithm is described in the [YATA
paper](https://www.researchgate.net/publication/310212186_Near_Real-Time_Peer-to-Peer_Shared_Editing_on_Extensible_Data_Types)
from 2016. For an algorithmic view of how it works, the paper is a reasonable
place to start. There are a handful of small improvements implemented in Yjs
which aren't described in the paper. The most notable is that items have an
`originRight` as well as an `origin` property, which improves performance when
many concurrent inserts happen after the same character.

At its heart, Yjs is a list CRDT. Everything is squeezed into a list in order to
reuse the CRDT resolution algorithm:

- Arrays are easy - they're lists of arbitrary items.
- Text is a list of characters, optionally punctuated by formatting markers and
  embeds for rich text support. Several characters can be wrapped in a single
linked list `Item` (this is also known as the compound representation of
CRDTs). More information about this in [this blog
article](https://blog.kevinjahns.de/are-crdts-suitable-for-shared-editing/).
- Maps are lists of entries. The last inserted entry for each key is used, and
  all other duplicates for each key are flagged as deleted.

Each client is assigned a unique *clientID* property on first insert. This is a
random 53-bit integer (53 bits because that fits in the javascript safe integer
range \[JavaScript uses IEEE 754 floats\]).

## List items

Each item in a Yjs list is made up of two objects:

- An `Item` (*src/structs/Item.js*). This is used to relate the item to other
  adjacent items.
- An object in the `AbstractType` hierarchy (subclasses of
  *src/types/AbstractType.js* - eg `YText`). This stores the actual content in
the Yjs document.

The item and type object pair have a 1-1 mapping. The item's `content` field
references the AbstractType object and the AbstractType object's `_item` field
references the item.

Everything inserted in a Yjs document is given a unique ID, formed from a
*ID(clientID, clock)* pair (also known as a [Lamport
Timestamp](https://en.wikipedia.org/wiki/Lamport_timestamp)). The clock counts
up from 0 with the first inserted character or item a client makes. This is
similar to automerge's operation IDs, but note that the clock is only
incremented by inserts. Deletes are handled in a very different way (see
below).

If a run of characters is inserted into a document (eg `"abc"`), the clock will
be incremented for each character (eg 3 times here). But Yjs will only add a
single `Item` into the list. This has no effect on the core CRDT algorithm, but
the optimization dramatically decreases the number of javascript objects
created during normal text editing. This optimization only applies if the
characters share the same clientID, they're inserted in order, and all
characters have either been deleted or all characters are not deleted. The item
will be split if the run is interrupted for any reason (eg a character in the
middle of the run is deleted).

When an item is created, it stores a reference to the IDs of the preceding and
succeeding item. These are stored in the item's `origin` and `originRight`
fields, respectively. These are used when peers concurrently insert at the same
location in a document. Though quite rare in practice, Yjs needs to make sure
the list items always resolve to the same order on all peers. The actual logic
is relatively simple - its only a couple dozen lines of code and it lives in
the `Item#integrate()` method. The YATA paper has much more detail on this
algorithm.

### Item Storage

The items themselves are stored in two data structures and a cache:

- The items are stored in a tree of doubly-linked lists in *document order*.
  Each item has `left` and `right` properties linking to its siblings in the
document. Items also have a `parent` property to reference their parent in the
document tree (null at the root). (And you can access an item's children, if
any, through `item.content`).
- All items are referenced in *insertion order* inside the struct store
  (*src/utils/StructStore.js*). This references the list of items inserted by
for each client, in chronological order. This is used to find an item in the
tree with a given ID (using a binary search). It is also used to efficiently
gather the operations a peer is missing during sync (more on this below).

When a local insert happens, Yjs needs to map the insert position in the
document (eg position 1000) to an ID. With just the linked list, this would
require a slow O(n) linear scan of the list. But when editing a document, most
inserts are either at the same position as the last insert, or nearby. To
improve performance, Yjs stores a cache of the 80 most recently looked up
insert positions in the document. This is consulted and updated when a position
is looked up to improve performance in the average case. The cache is updated
using a heuristic that is still changing (currently, it is updated when a new
position significantly diverges from existing markers in the cache). Internally
this is referred to as the skip list / fast search marker.

### Deletions

Deletions in Yjs are treated very differently from insertions. Insertions are
implemented as a sequential operation based CRDT, but deletions are treated as
a simpler state based CRDT.

When an item has been deleted by any peer, at any point in history, it is
flagged as deleted on the item. (Internally Yjs uses the `info` bitfield.) Yjs
does not record metadata about a deletion:

- No data is kept on *when* an item was deleted, or which user deleted it.
- The struct store does not contain deletion records
- The clientID's clock is not incremented

If garbage collection is enabled in Yjs, when an object is deleted its content
is discarded. If a deleted object contains children (eg a field is deleted in
an object), the content is replaced with a `GC` object (*src/structs/GC.js*).
This is a very lightweight structure - it only stores the length of the removed
content.

Yjs has some special logic to share which content in a document has been
deleted:

- When a delete happens, as well as marking the item, the deleted IDs are
  listed locally within the transaction. (See below for more information about
transactions.) When a transaction has been committed locally, the set of
deleted items is appended to a transaction's update message.
- A snapshot (a marked point in time in the Yjs history) is specified using
  both the set of (clientID, clock) pairs *and* the set of all deleted item
IDs. The deleted set is O(n), but because deletions usually happen in runs,
this data set is usually tiny in practice. (The real world editing trace from
the B4 benchmark document contains 182k inserts and 77k deleted characters. The
deleted set size in a snapshot is only 4.5Kb).

## Transactions

All updates in Yjs happen within a *transaction*. (Defined in
*src/utils/Transaction.js*.)

The transaction collects a set of updates to the Yjs document to be applied on
remote peers atomically. Once a transaction has been committed locally, it
generates a compressed *update message* which is broadcast to synchronized
remote peers to notify them of the local change. The update message contains:

- The set of newly inserted items
- The set of items deleted within the transaction.

## Network protocol

The network protocol is not really a part of Yjs. There are a few relevant
concepts that can be used to create a custom network protocol:

* `update`: The Yjs document can be encoded to an *update* object that can be
  parsed to reconstruct the document. Also every change on the document fires
an incremental document update that allows clients to sync with each other.
The update object is a Uint8Array that efficiently encodes `Item` objects and
the delete set.
* `state vector`: A state vector defines the known state of each user (a set of
  tuples `(client, clock)`). This object is also efficiently encoded as a
Uint8Array.

The client can ask a remote client for missing document updates by sending
their state vector (often referred to as *sync step 1*). The remote peer can
compute the missing `Item` objects using the `clocks` of the respective clients
and compute a minimal update message that reflects all missing updates (sync
step 2).

An implementation of the syncing process is in
[y-protocols](https://github.com/yjs/y-protocols).

## Snapshots

A snapshot can be used to restore an old document state. It is a `state vector`
\+ `delete set`. A client can restore an old document state by iterating through
the sequence CRDT and ignoring all Items that have an `id.clock >
stateVector[id.client].clock`. Instead of using `item.deleted` the client will
use the delete set to find out if an item was deleted or not.

It is not recommended to restore an old document state using snapshots,
although that would certainly be possible. Instead, the old state should be
computed by iterating through the newest state and using the additional
information from the state vector.