Wintellect  

Browse by Tags

All Tags » collections   (RSS)

This is the last installment in a three part series about using collections in C#.

The entire series can be accessed here:

We've now covered the interfaces and some concrete instances of collections provided by the .NET Framework. Now you are interested in moving things to the next level. What if the provided collections simply don't meet your business requirements? What are some ways you can use the collections concept to build your own classes to solve business problems?

Yield to Iterators

The first important thing to understand when you begin building your custom collection is the concept of iterators in .NET and the yield statement. I'm surprised that many people use the language without truly understanding this statement, what it exists and how it can be used.

You might have encountered yield in your journeys. If you've built custom AJAX client controls, you probably implemented IScriptControl. One method asks for IEnumerable<ScriptReference>. The implementation is usually presented as:

...
ScriptReference sr = new ScriptReference("~/MyUserControl.js");
yield return sr;
...

You could alternatively have created a List or any other collection of ScriptReference and returned that. What does yield really do for us?

To better understand, I've created a short little console application. You can create a new console project and simply paste this code to build and run.

using System;
using System.Collections;
using System.Collections.Generic;

namespace Yield
{
    internal class Program
    {
        private delegate bool DoSomething();

        private sealed class Doer
        {
            private readonly DoSomething _doSomething;
            private readonly string _msg;

            public Doer(DoSomething doSomething, string message)
            {
                _doSomething = doSomething;
                _msg = message;
                Console.WriteLine(string.Format("{0}: Ctor()", _msg));
            }

            public bool Do()
            {
                Console.WriteLine(string.Format("{0}: Do()", _msg));
                return _doSomething();
            }
        }

        private sealed class DoerCollection : IEnumerable<Doer>
        {
            public IEnumerator<Doer> GetEnumerator()
            {
                yield return new Doer(() => true, "1");
                yield return new Doer(() => false, "2");
                yield return new Doer(() => true, "3");
                yield break;
            }

            IEnumerator IEnumerable.GetEnumerator()
            {
                return GetEnumerator();
            }
        }

        private static void _DoIt(IEnumerable<Doer> doerCollection)
        {
            foreach (Doer doer in doerCollection)
            {
                if (!doer.Do())
                {
                    break;
                }
                Console.WriteLine(".");                
            }
            Console.WriteLine("..");                
        }

        private static void Main(string[] args)
        {
            _DoIt(new DoerCollection());
            _DoIt(new List<Doer>
                      {
                          new Doer(() => true, "4"),
                          new Doer(() => false, "5"),
                          new Doer(() => true, "6")
                      });

            Console.ReadLine();
        }
    }
}

So let's walk through the code.

First, I define a delegate called DoSomething that simply states "I want a method that takes no parameters and returns a boolean." This is a contrived example, of course, but in the "real world" you may have a pipeline or chain of responsibility that performs actions and then returns a status indicating that the process should continue or there is another node to consider, etc. I encapsulated the delegate in the class Doer. The constructor takes a "message" and an implementation of the delegate. The only reason I pass in the message is to track which object is doing what. What's important here is to see when the classes are created compared to when the main method is called, which simply invokes the delegate.

Next, I created my custom collection, DoerCollection. This is a collection of "activities" to perform. Obviously I am simply returning true or false in the example, but again, in a real-world scenario this could be a file system processor that iterates through a directory and returns files until no more can be found, or calls a web service and returns the status ... you get the idea. Notice that I simply yield return different instances of Doer that I pass the delegate implementation and a unique message identifier. If you recall from the first article in this series, this class is a collection because it implements IEnumerable.

The DoIt method takes any collection typed to the Doer class, and loops through the classes calling their "do" method until false is returned. It also emits some output just to demonstrate how it is looping, etc.

Finally, we get to implementation. The whole point of this example is to demonstrate how the yield command operates. We perform the exact same function on two very similar collections. The first pass uses an instance of my custom collection. The second pass creates a list and passes that into the method. What do you expect the output to look like? Compile the program and run it, and if you guessed correctly, you have a strong grasp of IEnumerator and yield.

Both collections were wired to contain three instances. Both had an instance return true, then false, then true, so the expected result would be to make it through two items and then break out of our loop. This is exactly what happens, but it's the output that is interesting. It turns out that using the List forced me to create everything up front, even if I wasn't going to use it (and who knows when that garbage collector will come by). The custom class using yield however only created two instances. The third class was never created!

Yield is nothing more than syntactic sugar for a state engine.

But wait, we are simply spinning through a collection. What do I mean by state engine??

If you recall in part 1, the key to collections is the Enumerator. An enumerator is a state engine. The "current state" represents either nothing (empty collection or already iterated through the entire collection) or an instance within the collection. The only transition this state engine can make is to move to the next item or end up in an uninitialized state.

This is what the program outputs to the console:

Yield

Now we'll pull out ildasm to peek beneath the hood. I've highlighted the DoerCollection class.

Yield IL

You'll notice that the GetEnumerator implementation actually creates a nested class behind the scenes. That class is our state engine. In red you can see the key pieces of that engine: a state, a current Doer instance, and the reference to the parent class. Highlighted is the key method called to transition state, MoveNext.

What is really interesting is pulling open the MoveNext method. I've used RedGate's free Reflector tool to reverse engineer the code. This will take the generated IL and provide a C# representation, so we can see what the actual underlying algorithm for the enumerator is.

private bool MoveNext()
{
    switch (this.<>1__state)
    {
        case 0:
            this.<>1__state = -1;
            if (Program.DoerCollection.CS$<>9__CachedAnonymousMethodDelegatea == null)
            {
                Program.DoerCollection.CS$<>9__CachedAnonymousMethodDelegatea = new Program.DoSomething(Program.DoerCollection.<GetEnumerator>b__7);
            }
            this.<>2__current = new Program.Doer(Program.DoerCollection.CS$<>9__CachedAnonymousMethodDelegatea, "1");
            this.<>1__state = 1;
            return true;

        case 1:
            this.<>1__state = -1;
            if (Program.DoerCollection.CS$<>9__CachedAnonymousMethodDelegateb == null)
            {
                Program.DoerCollection.CS$<>9__CachedAnonymousMethodDelegateb = new Program.DoSomething(Program.DoerCollection.<GetEnumerator>b__8);
            }
            this.<>2__current = new Program.Doer(Program.DoerCollection.CS$<>9__CachedAnonymousMethodDelegateb, "2");
            this.<>1__state = 2;
            return true;

        case 2:
            this.<>1__state = -1;
            if (Program.DoerCollection.CS$<>9__CachedAnonymousMethodDelegatec == null)
            {
                Program.DoerCollection.CS$<>9__CachedAnonymousMethodDelegatec = new Program.DoSomething(Program.DoerCollection.<GetEnumerator>b__9);
            }
            this.<>2__current = new Program.Doer(Program.DoerCollection.CS$<>9__CachedAnonymousMethodDelegatec, "3");
            this.<>1__state = 3;
            return true;

        case 3:
            this.<>1__state = -1;
            break;
    }
    return false;
}

You can quickly see that what is generated is really a massive switch statement. Based on the current state, it updates the current reference and changes the state. Most importantly, however, is the fact that the results of the yield are executed "on demand." In other words, it is not creating a large list, filling it with instances, and then iterating. Instead, the classes are instantiated "on demand" and then referenced for re-use later in case the collection is iterated again.

The whole key to this process is that the enumerator hides the underlying implementation. The consuming code simply knows there is a collection to iterate through. How that collection is built is up to the enumerator, which leads to very interesting possibilities. In the case of the ASP.NET page, this means that controls can be called iteratively and yield their script references and descriptors. The "master" code is simply iterating through the collection and wiring up the script references.

Thinking of collections as different ways of grouping objects is certainly valuable and can pertain to many different business situations. Understanding that Enumerator is really a state machine, however, allows you to start thinking of collections as processes. They aren't necessarily pools of instances, but can be algorithms or other processes as well. The key is that the use of the enumerator hides the implementation so that the consuming code simply iterates through something without having to understand the underlying implementation of how something is provided.

Jeremy Likness

The collection is a powerful construct that allows a developer to logically group related elements and navigate through them. In this article, we'll explore some concrete implementations of collections that are part of the base .NET framework.

The entire series can be accessed here:

(If you enjoy this article, please vote for it at the Code Project by clicking here, then scroll to the bottom to vote!)

There are two basic name spaces that provide rich collection functionality "out-of-the-box" and are useful in a number of ways. There are System.Collections for non-generic collections and System.Collections.Generic. For more specialized collections, we'll also look at System.Collections.ObjectModel. (Extra Credit: we won't cover it here, but after reading this article you may want to investigate System.Collections.Specialized).

An Out of the Box Interview Answer

A very common interview question is to explain the difference between ArrayList and List. If you got that one correct, you probably mentioned something about boxing, or taking a value type and converting it to an object so it essentially becomes part of the heap instead of the local stack. This operation is expensive. Because ArrayList is not generically typed, it must box and unbox value types. For this reason, any type of collection that deals with value types (and for that matter, structs) should focus on the List<T> implementation. Just how expensive is the boxing operation? Try this little console program and see for yourself:

using System;
using System.Collections;
using System.Collections.Generic;

namespace Arrays
{
    internal class Program
    {
        private static void Main()
        {
            const int ITERATIONS = 9999999;

            DateTime startBuild = DateTime.UtcNow;

            ArrayList integers = new ArrayList();

            for (int x = 0; x < ITERATIONS; x++)
            {
                integers.Add(x);
            }

            DateTime endBuild = DateTime.UtcNow;

            for (int x = 0; x < ITERATIONS; x++)
            {
                int y = (int) integers[x];
            }

            DateTime endParse = DateTime.UtcNow;

            TimeSpan buildArray = endBuild - startBuild;
            TimeSpan parseArray = endParse - endBuild;

            startBuild = DateTime.UtcNow;

            List<int> integerList = new List<int>();

            for (int x = 0; x < ITERATIONS; x++)
            {
                integerList.Add(x);
            }

            endBuild = DateTime.UtcNow;

            for (int x = 0; x < ITERATIONS; x++)
            {
                int y = integerList[x];
            }

            endParse = DateTime.UtcNow;

            TimeSpan buildList = endBuild - startBuild;
            TimeSpan parseList = endParse - endBuild;

            double build = (double) buildArray.Ticks/(double) buildList.Ticks;
            double parse = (double) parseArray.Ticks/(double) parseList.Ticks;
            double total = (double) (buildArray.Ticks + parseArray.Ticks)/(double) (buildList.Ticks + parseList.Ticks);

            Console.WriteLine(string.Format("Build Array: {0} List: {1} {2}", buildArray, buildList, build));
            Console.WriteLine(string.Format("Parse Array: {0} List: {1} {2}", parseArray, parseList, parse));
            Console.WriteLine(string.Format("Total Array: {0} List: {1} {2}", buildArray + parseArray, buildList + parseList, total));

            Console.ReadLine();
        }
    }
}

It basically spins through a list of integers, storing them in both an ArrayList and a List. On my machine, the ArrayList takes over 7 times longer to load, and 1.2 times longer to retrieve values, than the strongly typed List implementation. That is something important to keep in mind when considering collections.

I'm Just Not Your Type

The first collections we'll look at are not generically typed. That doesn't mean they aren't typed ... some in fact are designed for explicit types, but they don't support generics. We already covered the ArrayList, which I believe is there for backwards compatibility to the versions that didn't support generics, as I cannot imagine a situation when I would use that over a List.

These classes derive from CollectionBase and DictionaryBase which are abstract classes that implement ICollection and, in the dictionary, IDictionary.

BitArray

Use this class when manipulating bits. It exposes the bits as an array of bool, so you can do something fun like:

...
if (myArray[x])
{
blah blah
}
...

The underlying storage is done at a bit level for compact storage. What's nice is that you can initialize the collection with a byte array and perform bitwise operations (logical NOT, AND, OR, XOR) between two arrays (great for masks, etc).

Hashtable

The Hashtable serves and important function. It makes large collections of objects easier to parse and search based on the implementation of the hashcode algorithm. One important decision to make is whether you will use a Hashtable or a Dictionary. What's the difference?

The dictionary maps a key to a value. Each key is unique. Different keys might have the same value, but if you are searching for a specific key, you will get exactly one entry. What's more important to note with the Dictionary type is that it is defined with a generic type. Therefore, there is no boxing or unboxing and it will, in general, perform faster and better than a hash table when you are using value types.

The hash table requires that its object implement the hashcode algorithm (or that an algorithm is injected into the constructor). The idea is that objects will have a "mostly" unique key per the hashcode algorithm. However, hash tables will allow multiple objects to exist for the same hash code because the algorithm does not guarantee uniqueness. Hash tables are most often used when there is not a well-defined key to map to the value. The hash code function is used to resolve a "region" of objects, then that subset of objects can be further scanned to complete the algorithm. Using a hashcode when you have a well-defined key is also more expensive because it only stores objects, not generic types, so boxing and unboxing will occur if the targets are value types.

Queue: First in Line, First to Eat!

The Queue is often compared to a physical line. In the lunch line, the first person in the line is also the first person to leave the line (usually). The queue functions this way. To put someone in line, you call the Enqueue method. To get the person at the front of the line (the next one to "leave") you call the Dequeue method.

For an idea of how the Queue collection could be used, consider this practical example: syslog. Syslog is a standard way for network equipment to broadcast status. By default, syslog messages are sent to a host via the UDP protocol on port 514. UDP, unlike TCP/IP, is a disconnected protocol (it doesn't wait for nor require a response, and does not support routing of large packets that must be broken into chunks and reassembled). While you can configure what hardware sends for syslog, some equipment can be incredibly verbose and send out dozens of status updates every second.

Imagine writing a syslog server that retrieves these values from a listening UDP port. The thread listening to the port must be incredibly fast or it will block the port and miss important messages. In order to keep the listen port open, you could implement a synchronized queue. The listener would simply Enqueue the incoming message, then go back and listen to the next message. A background thread (or even several threads running simultaneously) could then call Dequeue to perform processing on those messages.

Most of the time you'll want to use the generically typed equivalent for the Queue to avoid boxing and unboxing.

SortedList

The sorted list is a hybrid between the List and the Dictionary. The keys in the list are sorted, so after adding values to the list, you can enumerate the keys and get the value in the sort order of the key. This might be useful to enumerate countries based on the country or perhaps files based on a key that includes their directory, etc.

Just Put it on the Stack

The stack is a very popular pattern for collections. It is a last-in first-out (LIFO) collection, compared to the queue which is FIFI (first-in, first-out). Stacks are important for composite operations that require a history of state. Calculators work by pushing operands and operators on the stack, computing the values, then popping those values to integrate into the next operation. Stacks are also important in recursive functions — if you wanted to recurse without using a method call, you'd loop instead and place your values on the stack, then pop them off until the stack is empty.

Many years ago in the days of VB6 I helped build a complex web application that had many multi-page transactions. To enable the user to navigate these transactions, we used a custom stack. Each navigation involved pushing the parameters and page directives onto the stack, then the target pages would pop these values and use them. A multi-page transaction would only pop the final values when the transaction was complete. This allowed us to rollback transactions, as well as nest transactions (for example, if you were in the middle of transaction A, then navigated to "B" and hit cancel, you'd pop back into A instead of some generic menu).

Again, you will more often than not use the generically-typed version of the Stack to get the job done.

Generics are Less Expensive

Many of the collections we discussed have generically-typed equivalents that eliminate the need for boxing and un-boxing. When it comes to value types, generically typed classes are almost always less expensive and provide better performance. In addition to generically typed versions of the collections we've already discussed, System.Collections.Generic provides some unique collections only available as strongly-typed implementations.

Dictionary Lookup

By far one of the more commonly used collections, the dictionary has a strongly typed key that maps to a strongly typed value. This is the classic use for mapping one item to another, whether it's an image name to the bytes of the actual image or a security key to a security context object.

Ready, HashSet, Go!

The HashSet class does what the name implies: manages sets. Sets are different than typical lists in a few ways. Sets are loose collections of objects: order is not important. Each object must be unique. If you do not require the classes you are collecting be in a particular order, hash sets exhibit very good performance benefits over indexed and ordered lists. The hash set also provides set operations such as union and intersection. According to Kim Hamilton's article introducing the Hash set, the preferred name for this would have been simply set (you can see the article to learn why the hash part was added).

LinkedList

The linked list is a powerful linked list implementation that provides nodes that link both forward (to the next node) and backwards (to the previous node). The list maintains an internal count. Inserting, deleting, and counting nodes are all O(1) operations. An O(1) operation is an operation that takes the same amount of time regardless of the size of the data it is being performed against ... this means that the list performans just as well when adding or removing nodes as a small or a large list.

Type<T>

The remaining items in this namespace are counterparts to the collections and implementations of the interfaces we've discussed. The caveat is that they are all strongly typed which means better performance in almost all cases involving value types and often for reference types as well. This really leads us to the last collection to be discussed (remember, I left the specialized namespace for homework). This also takes us into a new namespace!

Just an Observation

The System.Collections.ObjectModel namespace is for object-based operations that belong in reusable libraries. It relates to classes which have methods that return or consume collections. Perhaps the most often used collection here is the ObservableCollection.

The ObservableCollection provides a collection that implements INotifyCollectionChanged, which is similar to INotifiyPropertyChanged but at the collection level. In short, whenever an object is added to or removed from the collection, or items within the collection are refreshed, the collection will raise the CollectionChanged event. This is important when there is a dependency on the collection that should be notified whenever the underlying collection changes.

Of course, the most common implementation of this is for databound user interface elements. Objects like lists and grids need to refresh when the underlying lists change. Technologies like Windows Presentation Foundation (WPF) and Silverlight rely on observable collections to optimize the UI and only refresh the elements when there is a requirement, such as the list changing. In fact, these frameworks automatically hook into the events when databound to refresh, so whenever you are dealing with lists that change, you should consider using the observable collection instead of one of the other collection types for binding.

Conclusion

That is a lot of information to cover but hopefully provided insights into the various types of collections and some uses for them. In the next and final installment, we'll consider custom collections and how to tap into IEnumerable for more advanced functionality.

Jeremy Likness