Merge branch 'josephg-main' into main
This commit is contained in:
commit
dadc08597d
179
INTERNALS.md
Normal file
179
INTERNALS.md
Normal file
@ -0,0 +1,179 @@
|
||||
# Yjs Internals
|
||||
|
||||
This document roughly explains how Yjs works internally. There is a complete
|
||||
walkthrough of the Yjs codebase available as a recording:
|
||||
https://youtu.be/0l5XgnQ6rB4
|
||||
|
||||
The Yjs CRDT algorithm is described in the [YATA
|
||||
paper](https://www.researchgate.net/publication/310212186_Near_Real-Time_Peer-to-Peer_Shared_Editing_on_Extensible_Data_Types)
|
||||
from 2016. For an algorithmic view of how it works, the paper is a reasonable
|
||||
place to start. There are a handful of small improvements implemented in Yjs
|
||||
which aren't described in the paper. The most notable is that items have an
|
||||
`originRight` as well as an `origin` property, which improves performance when
|
||||
many concurrent inserts happen after the same character.
|
||||
|
||||
At it heart, Yjs is a list CRDT. Everything is squeezed into a list in order to
|
||||
reuse the CRDT resolution algorithm:
|
||||
|
||||
- Arrays are easy - they're lists of arbitrary items.
|
||||
- Text is a list of characters, optionally punctuated by formatting markers and
|
||||
embeds for rich text support. Several characters can be wrapped in a single
|
||||
linked list `Item` (this is also known as the compound representation of
|
||||
CRDTs). More information about this in [this blog
|
||||
article](https://blog.kevinjahns.de/are-crdts-suitable-for-shared-editing/).
|
||||
- Maps are lists of entries. The last inserted entry for each key is used, and
|
||||
all other duplicates for each key are flagged as deleted.
|
||||
|
||||
Each client is assigned a unique *clientID* property on first insert. This is a
|
||||
random 53-bit integer (53 bits because that fits in the javascript safe integer
|
||||
range).
|
||||
|
||||
## List items
|
||||
|
||||
Each item in a Yjs list is made up of two objects:
|
||||
|
||||
- An `Item` (*src/structs/Item.js*). This is used to relate the item to other
|
||||
adjacent items.
|
||||
- An object in the `AbstractType` heirachy (subclasses of
|
||||
*src/types/AbstractType.js* - eg `YText`). This stores the actual content in
|
||||
the Yjs document.
|
||||
|
||||
The item and type object pair have a 1-1 mapping. The item's `content` field
|
||||
references the AbstractType object and the AbstractType object's `_item` field
|
||||
references the item.
|
||||
|
||||
Everything inserted in a Yjs document is given a unique ID, formed from a
|
||||
*ID(clientID, clock)* pair (also known as a [Lamport
|
||||
Timestamp](https://en.wikipedia.org/wiki/Lamport_timestamp)). The clock counts
|
||||
up from 0 with the first inserted character or item a client makes. This is
|
||||
similar to automerge's operation IDs, but note that the clock is only
|
||||
incremented by inserts. Deletes are handled in a very different way (see
|
||||
below).
|
||||
|
||||
If a run of characters is inserted into a document (eg `"abc"`), the clock will
|
||||
be incremented for each character (eg 3 times here). But Yjs will only add a
|
||||
single `Item` into the list. This has no effect on the core CRDT algorithm, but
|
||||
the optimization dramatically decreases the number of javascript objects
|
||||
created during normal text editing. This optimization only applies if the
|
||||
characters share the same clientID, they're inserted in order, and all
|
||||
characters have either been deleted or all characters are not deleted. The item
|
||||
will be split if the run is interrupted for any reason (eg a character in the
|
||||
middle of the run is deleted).
|
||||
|
||||
When an item is created, it stores a reference to the IDs of the preceeding and
|
||||
succeeding item. These are stored in the item's `origin` and `originRight`
|
||||
fields, respectively. These are used when peers concurrently insert at the same
|
||||
location in a document. Though quite rare in practice, Yjs needs to make sure
|
||||
the list items always resolve to the same order on all peers. The actual logic
|
||||
is relatively simple - its only a couple dozen lines of code and it lives in
|
||||
the `Item#integrate()` method. The YATA paper has much more detail on the this
|
||||
algorithm.
|
||||
|
||||
### Item Storage
|
||||
|
||||
The items themselves are stored in two data structures and a cache:
|
||||
|
||||
- The items are stored in a tree of doubly-linked lists in *document order*.
|
||||
Each item has `left` and `right` properties linking to its siblings in the
|
||||
document. Items also have a `parent` property to reference their parent in the
|
||||
document tree (null at the root). (And you can access an item's children, if
|
||||
any, through `item.content`).
|
||||
- All items are referenced in *insertion order* inside the struct store
|
||||
(*src/utils/StructStore.js*). This references the list of items inserted by
|
||||
for each client, in chronological order. This is used to find an item in the
|
||||
tree with a given ID (using a binary search). It is also used to efficiently
|
||||
gather the operations a peer is missing during sync (more on this below).
|
||||
|
||||
When a local insert happens, Yjs needs to map the insert position in the
|
||||
document (eg position 1000) to an ID. With just the linked list, this would
|
||||
require a slow O(n) linear scan of the list. But when editing a document, most
|
||||
inserts are either at the same position as the last insert, or nearby. To
|
||||
improve performance, Yjs stores a cache of the 10 most recently looked up
|
||||
insert positions in the document. This is consulted and updated when a position
|
||||
is looked up to improve performance in the average case. The cache is updated
|
||||
using a heuristic that is still changing (currently, it is updated when a new
|
||||
position significantly diverges from existing markers in the cache). Internally
|
||||
this is referred to as the skip list / fast search marker.
|
||||
|
||||
### Deletions
|
||||
|
||||
Deletions in Yjs are treated very differently from insertions. Insertions are
|
||||
implemented as a sequential operation based CRDT, but deletions are treated as
|
||||
a simpler state based CRDT.
|
||||
|
||||
When an item has been deleted by any peer, at any point in history, it is
|
||||
flagged as deleted on the item. (Internally Yjs uses the `info` bitfield.) Yjs
|
||||
does not record metadata about a deletion:
|
||||
|
||||
- No data is kept on *when* an item was deleted, or which user deleted it.
|
||||
- The struct store does not contain deletion records
|
||||
- The clientID's clock is not incremented
|
||||
|
||||
If garbage collection is enabled in Yjs, when an object is deleted its content
|
||||
is discarded. If a deleted object contains children (eg a field is deleted in
|
||||
an object), the content is replaced with a `GC` object (*src/structs/GC.js*).
|
||||
This is a very lightweight structure - it only stores the length of the removed
|
||||
content.
|
||||
|
||||
Yjs has some special logic to share which content in a document has been
|
||||
deleted:
|
||||
|
||||
- When a delete happens, as well as marking the item, the deleted IDs are
|
||||
listed locally within the transaction. (See below for more information about
|
||||
transactions.) When a transaction has been committed locally, the set of
|
||||
deleted items is appended to a transaction's update message.
|
||||
- A snapshot (a marked point in time in the Yjs history) is specified using
|
||||
both the set of (clientID, clock) pairs *and* the set of all deleted item
|
||||
IDs. The deleted set is O(n), but because deletions usually happen in runs,
|
||||
this data set is usually tiny in practice. (The real world editing trace from
|
||||
the B4 benchmark document contains 182k inserts and 77k deleted characters. The
|
||||
deleted set size in a snapshot is only 4.5Kb).
|
||||
|
||||
## Transactions
|
||||
|
||||
All updates in Yjs happen within a *transaction*. (Defined in
|
||||
*src/utils/Transaction.js*.)
|
||||
|
||||
The transaction collects a set of updates to the Yjs document to be applied on
|
||||
remote peers atomically. Once a transaction has been committed locally, it
|
||||
generates a compressed *update message* which is broadcast to synchronized
|
||||
remote peers to notify them of the local change. The update message contains:
|
||||
|
||||
- The set of newly inserted items
|
||||
- The set of items deleted within the transaction.
|
||||
|
||||
## Network protocol
|
||||
|
||||
The network protocol is not really a part of Yjs. There are a few relevant
|
||||
concepts that can be used to create a custom network protocol:
|
||||
|
||||
* `update`: The Yjs document can be encoded to an *update* object that can be
|
||||
parsed to reconstruct the document. Also every change on the document fires
|
||||
an incremental document updates that allows clients to sync with each other.
|
||||
The update object is an Uint8Array that efficiently encodes `Item` objects and
|
||||
the delete set.
|
||||
* `state vector`: A state vector defines the know state of each user (a set of
|
||||
tubles `(client, clock)`). This object is also efficiently encoded as a
|
||||
Uint8Array.
|
||||
|
||||
The client can ask a remote client for missing document updates by sending
|
||||
their state vector (often referred to as *sync step 1*). The remote peer can
|
||||
compute the missing `Item` objects using the `clocks` of the respective clients
|
||||
and compute a minimal update message that reflects all missing updates (sync
|
||||
step 2).
|
||||
|
||||
An implementation of the syncing process is in
|
||||
[y-protocols](https://github.com/yjs/y-protocols).
|
||||
|
||||
## Snapshots
|
||||
|
||||
A snapshot can be used to restore an old document state. It is a `state vector`
|
||||
+ `delete set`. I client can restore an old document state by iterating through
|
||||
the sequence CRDT and ignoring all Items that have an `id.clock >
|
||||
stateVector[id.client].clock`. Instead of using `item.deleted` the client will
|
||||
use the delete set to find out if an item was deleted or not.
|
||||
|
||||
It is not recommended to restore an old document state using snapshots,
|
||||
although that would certainly be possible. Instead, the old state should be
|
||||
computed by iterating through the newest state and using the additional
|
||||
information from the state vector.
|
11
README.md
11
README.md
@ -892,11 +892,14 @@ do not require a central source of truth.
|
||||
|
||||
Yjs implements a modified version of the algorithm described in [this
|
||||
paper](https://www.researchgate.net/publication/310212186_Near_Real-Time_Peer-to-Peer_Shared_Editing_on_Extensible_Data_Types).
|
||||
I will eventually publish a paper that describes why this approach works so well
|
||||
in practice. Note: Since operations make up the document structure, we prefer
|
||||
the term *struct* now.
|
||||
This [article](https://blog.kevinjahns.de/are-crdts-suitable-for-shared-editing/)
|
||||
explains a simple optimization on the CRDT model and
|
||||
gives more insight about the performance characteristics in Yjs.
|
||||
More information about the specific implementation is available in
|
||||
[INTERNALS.md](./INTERNALS.md) and in
|
||||
[this walkthrough of the Yjs codebase](https://youtu.be/0l5XgnQ6rB4).
|
||||
|
||||
CRDTs suitable for shared text editing suffer from the fact that they only grow
|
||||
CRDTs that suitable for shared text editing suffer from the fact that they only grow
|
||||
in size. There are CRDTs that do not grow in size, but they do not have the
|
||||
characteristics that are benificial for shared text editing (like intention
|
||||
preservation). Yjs implements many improvements to the original algorithm that
|
||||
|
Loading…
x
Reference in New Issue
Block a user