Ropes Data Structure: From the Original Paper to Implementation

AI Usage Disclaimer

One day, I was looking for asymptotic complexity of ECMAScript String's += or concat() operation out of curiosity, and wondered why JS doesn't have a traditional StringBuilder since the strings are immutable. Turns out, the JS spec doesn't specify any Big-O complexity guarantees and they are left to the implementers, but still I couldn't wrap my head around how concat() operations could happen in near-constant time. Coincidentally, almost all popular browser-based JS engines - V8, SpiderMonkey, JavaScriptCore and LibJS - implement this data structure internally and I still don't see people talk about it more often: Ropes.

Tradeoffs

Of course, there are tradeoffs. Otherwise ropes would have been the de facto standard for implementing strings in any programming language. The reason they're not quite as popular is because they solve a very niche problem involving large immutable strings, frequent concatenation, substring sharing, or editor-like updates. As a result, almost all of the popular engines use different underlying representations for strings, based on the length. One possible use case that I can think of is React's renderToString().

Ropes actually perform worse for short strings because of the pointer overhead and cache locality, and in a typical workload, most strings are usually small. As a result, engines use plain one-byte (ASCII) or two-byte (UTF-16) strings, until the length of the string exceeds a threshold, at which point a rope-like (extending the idea of ropes from the original paper) structure is used which uses flatten-on-demand. It means concat()s are O(1) since it modifies ~2 nodes, and when the tree is rebalanced (rope is flattened) is engine-specific. Mostly it's flattened when there's a request for linear char buffer, indexing, native interop, or certain substring paths. It gives the best of both worlds, concat()s are faster, and we don't have to spend time rebalancing until we actually need it. The amount of ad-hoc optimizations these engines do is fascinating.

Implementation

Alright, let's go through the original paper. The idea of storing data in the leaf nodes and information about the data in internal nodes of the tree isn't new, B+ trees existed since early 1970s. However, B trees self-balance on insertion, with at most O(log n) node updates, but in case of ropes, rebalancing could cost O(rope_len) in the worst-case, which means worst-case adversarial input can be generated that triggers rebalancing after every concat().

The paper mentions length of rope should be at least F(n+2) for a depth of n. But how do fibonacci numbers come in the picture while describing the definition of a balanced binary tree? The usual definition of a balanced binary tree is abs(h(left)-h(right)) <=1 (i.e. heights of left and right subtrees of every node in the tree differ by at most 1), which means for a tree to be balanced -

For depth=0, min_leaves_required=1
For depth=1, min_leaves_required=2
For depth=2, min_leaves_required=3
For depth=3, min_leaves_required=5
For depth=4, min_leaves_required=8 ... and so on

which forms a sequence 1,2,3,5,8... which is a standard fibonacci sequence shifted by 2, hence the bound ofF(n+2). You can verify these values by drawing a balanced binary tree on paper. So, given a rope's total length and current tree depth, we can efficiently get a rough estimate of whether the tree needs rebalancing after a concat() operation.

I also like how the paper just assumes you have a garbage collector at your disposal, and even hints at using reference counting if you're implementing in a language like C. Let's not implement a GC here, maybe some other day, in another post. Let's use Go and for simplicity we will only consider ASCII strings for now. We'll see how things might change for UTF-16 later.

The API contract looks something like this -

type Rope interface {
  Len() int
  String() string
  Concat(other Rope) Rope
  At(index int) byte
  root() *node
}
type rope struct {
  root *node
}

// expose root via root()
func (r *rope) root() *node {
  if r == nil {
    return r
  }
  return r.root;
}

The paper specifies, we require some notion of "sequence of characters" for leaf nodes. Let's use strings for that. It also mentions tree indexed by position, it means we need to store some information about the length of string represented by a node. Let's use totalLen for that. The node will look something like this -

type node struct {
  left *node
  right *node
  leaf string
  totalLen int // length of substring represented by this subtree
}

Let's start with implementing easier and straight-forward Len() and Concat() first. Since we're maintaining totalLen for each node, Len() is trivial. For Concat(), we just create a new parent root node, and attach roots of the two ropes as left and right children of this newly created node. Note that since we're not doing any rebalancing, this operation is O(1).

// length
func (r *rope) Len() int {
  if r == nil || r.root == nil {
    return 0
  }
  return r.root.totalLen
}

// concat
func (r *rope) Concat(other Rope) Rope {
  return &rope{root:concat(r.root, other.root())}
}

// helper
func concat(a *node, b *node) *node {
  if a == nil {
    return b
  }
  if b == nil {
    return a
  }
  return &node{
    left:a,
    right:b,
    totalLen: a.totalLen + b.totalLen,
  }
}

Let's implement String() now, which is a standard in-order tree traversal.

func (r *rope) String() string {
  if r == nil {
    return ""
  }
  return r.root.String()
}

func (n *node) String() string {
  if n == nil {
    return ""
  }
  if n.left == nil && n.right == nil {
    return n.leaf
  }
  return n.left.String() + n.right.String()
}

The idea for implementing At(x) is to do a DFS from root node, since we store total length represented by subtree under the node, it tells us if we need to explore left or right subtree. Once we are at the leaf node, we just return the character in the leaf string chunk. For simplicity, let's ignore out-of-bounds errors.

func (r *rope) At(index int) byte {
  return r.root.at(index)
}

func (n *node) at(index int) byte {
  if n.left == nil && n.right == nil {
    return n.leaf[index]
  }
  leftLen := 0
  if n.left != nil {
    leftLen = n.left.totalLen
  }
  if index < leftLen {
    return n.left.at(index)
  }
  return n.right.at(index-leftLen)
}

Now the only challenging part remaining is rebalancing the tree. The idea is to rebuild a balanced tree. Since the left-to-right order of leaves gives the canonical representation of the text we are storing, we first collect leaves by traversing similar to how we did it in String() implementation. After collecting the leaves, we build the new tree by recursively splitting the collected leaves in half and combine the two halves with concat().

            root                            root
           /    \                          /    \
        node     "gh"       ---->       node     node
       /   \                           /   \    /   \
    node    "ef"                    "ab"  "cd" "ef" "gh"
   /   \
"ab"  "cd"

collected:["ab","cd", "ef", "gh"]

Let's implement rebalance() now. We can call this rebalance() whenever it's optimal to do so.

func (r *rope) rebalance() *rope {
  if r == nil || r.root() == nil {
    return r
  }
  return &rope{root:rebalance(r.root())}
}

// helper
func rebalance(n *node) *node {
  if n == nil {
    return n
  }
  if n.left == nil && n.right  == nil {
    return n
  }
  leaves := make([]*node, 0)
  collectLeaves(n, &leaves)
  return buildBalanced(leaves, 0, len(leaves))
}

// collect leaves in order
func collectLeaves(n *node, leaves *[]*node){
  if n == nil {
    return
  }
  if n.left == nil && n.right == nil {
    *leaves = append(*leaves, &node{leaf:n.leaf, totalLen:n.totalLen})
    return
  }
  collectLeaves(n.left, leaves)
  collectLeaves(n.right, leaves) }

// build a new balanced tree
func buildBalanced(leaves []*node, L, R int) *node {
  if L >= R {
    return nil
  }
  if R-L == 1 {
    return leaves[L]
  }
  M := L + (R-L)/2
  left := buildBalanced(leaves, L, M)
  right := buildBalanced(leaves, M, R)
  return concat(left, right)
}

This wraps-up our basic implementation of ropes. Of course we could re-use leaf nodes and do a ton of other optimizations, but hopefully the basic idea is clear now.

References

For further reading: