The BigList class grew out of my desire to have a List-like class, but that supported efficient insertions and removals from anywhere in the list. Because List is implemented as a simple buffer, removing or inserting an item in the middle of a List require moving all of the items after the insertion point. If the list becomes large (hundreds or even thousands of items), this can be very slow. I knew that implementing this would probably be best done with some sort of tree structure, and I described my initial thoughts and investigations in this post.

Eventually my research led me to a paper called "Ropes: an Alternative to Strings", by Boehm, Atkinson, and Plass, in "Software--Practice and Experience", Vol. 25(12), 1315–1330 (December 1995). This paper describes a "Rope" data structure that the authors used for storing text fragments instead of conventional strings. I also looked at a white paper published by SGI on their STL extension, also called ropes, and based on the same Boehm, Atkinson and Plass paper.

Both of these implementations use a binary tree with two kinds of nodes. Leaf nodes hold a small array of items that are in the list. Concatination nodes don't hold items themselves, but point to a left and right node, which might be either leaf nodes or more concatination nodes. They also hold a pre-computed field with the count of items in their children.

Finding an item with a particular index involves traversing the tree, going left or right at each level based on the count field. Thus, finding a node with a particular index takes O(log N) time, assuming the tree is roughly balanced. Inserting and deleting items is also O(log N): the leaf with the insertion/removal point is found, and the operation takes place on the array with that leaf (and the count fields of all parents are updated). If a leaf becomes larger than a particular limit, it is split into a concatination node and two leaves. If a leaf becomes empty, it is removed from the tree.

One interesting part about both the papers I looked at was that they also had the notion that multiple "ropes" could share common parts among themselves. This allowed operations like copying, concatinating, and getting a sub-list to be very fast, since the items involved wouldn't have to be copied. I really liked this idea. The interesting issue was how to deal with changes: if changes were made in the naive way, then changes to one list would change any other lists that shared that part: clearly unacceptable!

In the Boehm et. al. paper, ropes were immutable data structure (like .NET Strings), so they didn't really have to deal with this issue. The SGI paper used reference counts for memory management and dealt with changes that way. That didn't make sense in a garbage-collected situation. However, I still wanted lists that didn't share parts to be efficiently indexable and changable.

The solution I settled on was this. Each node has a boolean field called shared. Initially, this field is set to false. However, if an operation causes that node to be shared by more than one list, then the shared field is set to true. Once the shared field is set to true, it always remains true, and that node is never modified again. In addition, if a concatination node has the shared bit set to true, all of it's children, grandchildren, etc. are considered shared (even though, for efficiency, their bits aren't actually changed).

If an update needs to be made to a shared node, then the node is copied to a similar non-shared node, and changes are made to that node instead. This gives the best of both worlds. If a list never shares any parts, then updates proceed as efficiently as you would expected. If a copy is made, it is made efficiently, and items are only copies when needed. This strategy is often called "copy on write".

Also the algorithm sounds straightfoward enough, making sure that the shared bit was correctly set and respected in all the cases proved tricky. Bugs in this case were hard to find -- if the shared bit wasn't set right, the only ill effect might be that a change to one list would cause another list to change invisibly. I came up with a test plan that worked well: I used a set of 8 BigLists, along with 8 parallel regular Lists that always were to hold the same set of items. Every iteration, I would perform a random operation one or more of the lists, which might include adding an item, removing an item, changing an item, making a copy, concantinating two lists, getting a sub-list: all the basic operations I supported. I would do the same operation on the regular Lists as well, and after each iteration, I checked that all 8 BigLists and Lists were the same. When I had a bug with the shared bit, typically that showed up as a change to one list that had a strange unexpected affect on one of the other lists as well. I ran many thousands of iterations: often a particular small bug would show up only in a rare case, and would only appear on iteration 3712. Then the debugging fun began!

You can look at the documentation for BigList here -- you'll notice that it looks pretty much like a regular List except for the performance of operations; all the internal tree structure and shared bits is completely hidden behind the facade of a ordinary list. If you want to look at the code itself, download PowerCollections here and look at BigList.cs.