Encoded Streams

Stream Classes For Encoding Binary Data

Email is a text-based mechanism. Email cannot contain binary data. However, email can contain MIME attachments, which are binary attachments encoded to a text format. Encoding will expand the size of the attachment because the printable character set is much smaller than the 256 possible values that an 8-bit byte can hold. Some encoding schemes use a variable number of bytes, some use a fixed number. For example, base64 and uuencoding convert 3 bytes of binary data into 4 octets of printable characters, whereas Yenc encoding encodes one byte to one octet for most values, and has 'escaped' octets for the rest. Some encoding schemes produce line-delimited data, others just produce a block of data.
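To make the 'escaped octets' idea concrete, here is a minimal sketch of the core Yenc byte rule as I understand it from the Yenc specification: add 42 to each byte (modulo 256) and escape the few results that would be troublesome in a message. The class and method names are hypothetical, and line splitting and the header and trailer lines are left out.

```csharp
using System;
using System.IO;

public static class YencSketch
{
    // Applies the core Yenc rule to a block of raw bytes: each byte has 42
    // added to it (mod 256). If the result is NUL, LF, CR or '=', the output
    // is '=' followed by the result plus 64 (mod 256). Illustration only:
    // no line breaks, no =ybegin/=yend lines, no CRC.
    public static byte[] EncodeBytes(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            foreach (byte b in data)
            {
                byte e = unchecked((byte)(b + 42));
                if (e == 0x00 || e == 0x0A || e == 0x0D || e == 0x3D)
                {
                    output.WriteByte(0x3D);                      // escape marker '='
                    output.WriteByte(unchecked((byte)(e + 64))); // shifted value
                }
                else
                {
                    output.WriteByte(e);                         // one byte in, one octet out
                }
            }
            return output.ToArray();
        }
    }
}
```

Most input bytes produce exactly one output octet, so the expansion is far smaller than the fixed 3-to-4 ratio of base64.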

The .NET framework provides code to encode and decode base64. The Convert class provides ToBase64String to convert a byte array to a base64 encoded string and FromBase64String to convert a string back to a binary byte array. The framework also provides the FromBase64Transform and ToBase64Transform classes, but these merely use Convert.
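As a quick illustration of the two Convert methods, the following sketch round-trips a small byte array. The wrapper class and its method names are mine, purely for illustration; only the Convert calls come from the framework.

```csharp
using System;
using System.Text;

public static class ConvertRoundTrip
{
    // Encodes ASCII text to base64 with Convert.ToBase64String.
    public static string EncodeAscii(string text)
    {
        byte[] raw = Encoding.ASCII.GetBytes(text);
        return Convert.ToBase64String(raw);
    }

    // Decodes base64 back to ASCII text with Convert.FromBase64String.
    public static string DecodeAscii(string base64)
    {
        byte[] raw = Convert.FromBase64String(base64);
        return Encoding.ASCII.GetString(raw);
    }
}
```

For example, EncodeAscii("Hello") gives "SGVsbG8=": five input bytes become eight characters, the last of which is padding.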

There are many problems with the Convert class's implementation. For a start, the Convert methods are static, which means that they cannot cache data over to a subsequent call. This means that only whole packets of data (3 bytes for raw data, 4 characters for encoded data) can be converted at a time. If you read binary data from a stream, you must make sure that you read an exact multiple of three bytes and pass this to ToBase64String. If you pass a count that is not a multiple of three then the routine will pad the data with zeros before converting it. The base64 routine treats each three byte input block as a 24-bit number that it splits into four 6-bit groups. Each group is then converted to a character. The groups that contain bits entirely from the padding are converted to an = character. This character can only appear at the end of the data (although the end of the data may not have the = character).
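The 24-bit split described above can be sketched by hand for one complete three byte block. This is only an illustration of the arithmetic using the standard base64 alphabet; the class name is hypothetical and it is not how Convert is implemented internally.

```csharp
using System;

public static class Base64Block
{
    private const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Encodes exactly three bytes as four base64 characters.
    public static string EncodeTriple(byte b0, byte b1, byte b2)
    {
        // Build the 24-bit number from the three input bytes.
        int n = (b0 << 16) | (b1 << 8) | b2;

        // Split it into four 6-bit groups, most significant group first,
        // and look each group up in the alphabet.
        char[] chars = new char[4];
        chars[0] = Alphabet[(n >> 18) & 0x3F];
        chars[1] = Alphabet[(n >> 12) & 0x3F];
        chars[2] = Alphabet[(n >> 6) & 0x3F];
        chars[3] = Alphabet[n & 0x3F];
        return new string(chars);
    }
}
```

Passing the ASCII bytes of "Man" (77, 97, 110) gives "TWFu", which matches what Convert.ToBase64String returns for the same three bytes.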

If you are converting data from a stream you may be tempted to read in blocks from the stream, pass them to ToBase64String and then add together the output strings. However, if you are part of the way through a data stream and are not stringent about passing an exact multiple of three bytes to ToBase64String, you may end up with a character string that has = padding characters in the middle of it, which is not valid base64. Furthermore, you will have the overhead of concatenating strings. Correspondingly, if you read encoded data you must read an exact multiple of four characters and pass this to FromBase64String. If you pass another count then an exception will be thrown.
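The block-by-block approach can be made to work if you are strict about the block size. The sketch below is one way to do that with Convert, assuming a hypothetical helper class of my own naming: it refuses a chunk size that is not a multiple of three and keeps reading until each block is full, because a single call to Read may return fewer bytes than requested.

```csharp
using System;
using System.IO;
using System.Text;

public static class ChunkedBase64
{
    // Encodes a stream block by block. chunkSize must be a multiple of three
    // so that only the final block can produce '=' padding.
    public static string Encode(Stream input, int chunkSize)
    {
        if (chunkSize <= 0 || chunkSize % 3 != 0)
            throw new ArgumentException("chunkSize must be a positive multiple of 3");

        var result = new StringBuilder();
        byte[] buffer = new byte[chunkSize];
        while (true)
        {
            // Fill the buffer completely unless the stream ends first.
            int filled = 0;
            while (filled < chunkSize)
            {
                int n = input.Read(buffer, filled, chunkSize - filled);
                if (n == 0) break;          // end of stream
                filled += n;
            }
            if (filled == 0) break;         // nothing left to encode

            result.Append(Convert.ToBase64String(buffer, 0, filled));
            if (filled < chunkSize) break;  // that was the final, short block
        }
        return result.ToString();
    }
}
```

To see what goes wrong otherwise: encoding the ASCII bytes of "ABCD" and "EF" separately and joining the results gives "QUJDRA==RUY=", whereas encoding "ABCDEF" in one call gives "QUJDREVG".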

This brings me to another problem with the methods in Convert. RFC 2045, which defines MIME attachments, states that base64 data should be represented as lines of no more than 76 characters. FromBase64String will ignore whitespace and so can convert a MIME attachment that includes newlines, but ToBase64String does not allow you to specify that you want the data split into lines. If you want base64 data split over lines you have to call ToBase64String and split the lines yourself. Again, this involves creating additional strings.
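Splitting the lines yourself might look something like the sketch below. The helper class is hypothetical, and the choice of CRLF as the line terminator is my assumption based on MIME conventions. Note the extra string and StringBuilder that the approach needs.

```csharp
using System;
using System.Text;

public static class Base64Lines
{
    // Encodes data to base64 and breaks the result into lines of at most
    // lineLength characters, each terminated with CRLF.
    public static string Encode(byte[] data, int lineLength)
    {
        if (lineLength <= 0) throw new ArgumentOutOfRangeException("lineLength");

        string encoded = Convert.ToBase64String(data);   // first temporary string
        var sb = new StringBuilder(encoded.Length + 2 * (encoded.Length / lineLength + 1));
        for (int i = 0; i < encoded.Length; i += lineLength)
        {
            int len = Math.Min(lineLength, encoded.Length - i);
            sb.Append(encoded, i, len);                  // copy one line's worth
            sb.Append("\r\n");
        }
        return sb.ToString();                            // second string
    }
}
```

Calling Base64Lines.Encode(data, 76) produces output that FromBase64String can read back directly, since it ignores the line breaks.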

The FromBase64Transform and ToBase64Transform classes have instance methods, so it is possible for them to cache partial blocks of data. However, both of these classes use the methods in Convert to convert to and from base64 and hence inherit their problems. FromBase64Transform strips out whitespace from the encoded stream, which is a waste of processing time because Convert.FromBase64CharArray will ignore whitespace. ToBase64Transform does not have the facility to split the resulting encoded data into lines.

Closer inspection of the transform classes shows other issues, and these occur largely because of code reuse. The first issue is that the transform methods of these classes perform lots of allocations. Now, I know that in .NET memory allocation is cheap, but it is more expensive than not doing it at all. In some cases memory allocations can occur even when the allocated arrays will not be used. A further issue arises from the fact that FromBase64CharArray takes an array of Char and ToBase64CharArray writes to a Char array (doh!) but the methods that use them (TransformBlock and TransformFinalBlock on the FromBase64Transform and ToBase64Transform classes) handle byte arrays. This means that there will always be a call to Encoding.ASCII to convert between these two array types. This involves more array allocation and iterations through the values in the various input buffers: yet more CPU cycles are burned.

In addition, since so many temporary buffers are used, a lot of copying must occur between all of these buffers. The library code does make a concession to optimisation here because instead of using the generic Array.Copy routine the library methods use the Buffer class. Array.Copy and Buffer.BlockCopy are internalcall, which means that they are implemented in unmanaged C++, and essentially involve a call to memmove.

Version 1.1 of the framework only has code to convert base64 streams, and although this is a popular encoding, it is not the only one in use. I decided that I would create classes that would do base64 encoding, uuencoding and Yenc encoding.

I wanted to fix all of these problems. I reasoned that the data that would be converted would be made available through a stream (for example a NetworkStream or a FileStream) so it made sense for me to write my classes as stream classes. The framework's CryptoStream class is interesting because instances are based on another stream instance, in effect chaining the streams. I liked this paradigm and decided that I would make my own stream classes work this way. I wanted to make these streams unbuffered. The reason is that it should be the developer's choice whether buffering is used, and in any case, FileStream contains buffering, and the winsock implementation that is wrapped by NetworkStream also has buffering, so any buffering in my classes would be unnecessary.
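For readers who have not seen the chaining paradigm, here is a small sketch that uses only framework classes: a CryptoStream wrapped around a MemoryStream, with ToBase64Transform doing the conversion. The helper class and method name are hypothetical. It shows the style of use I was aiming for, not the code in my library.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class ChainedBase64
{
    // Writes raw bytes into a CryptoStream that is chained onto a
    // MemoryStream; the base64 text ends up in the MemoryStream.
    public static string EncodeThroughStream(byte[] data)
    {
        using (var sink = new MemoryStream())
        {
            using (var encoder = new CryptoStream(
                sink, new ToBase64Transform(), CryptoStreamMode.Write))
            {
                encoder.Write(data, 0, data.Length);
                encoder.FlushFinalBlock();   // emit the last, possibly padded, block
            }
            // ToArray still works after the MemoryStream has been closed.
            return Encoding.ASCII.GetString(sink.ToArray());
        }
    }
}
```

The caller just writes binary data to the outer stream and the encoded text appears in the inner one. The inner stream could equally be a FileStream or a NetworkStream.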

The download for this article is a library that contains four classes shown here:

Class          Description
EncodedStream  Abstract base class containing the common code for all the classes.
Base64Stream   Allows you to encode and decode base64 data in a stream. You can tell the class to split the output data into lines.
UUStream       Standard Unix file encoding (uuencode).
YencStream     The Yenc encoding. This supports data being split over multiple streams.

The Base64Stream class has the following constructors:

public Base64Stream(Stream stream);
public Base64Stream(Stream stream, bool read);
public Base64Stream(Stream stream, int lineLen);

The first constructor can create a read or write stream. The first time you access the stream you determine what type of stream it is. If you call a read method (Read or ReadByte) then the stream will be a read-only stream and any attempt to write to it will throw an exception. If the first call you make is to a write method (Write or WriteByte) then the stream will be a write-only stream. The second constructor has a Boolean which you can use to indicate whether the stream is read or write. The final constructor is a write-only stream that will split the output over lines.
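The 'first access decides the direction' behaviour can be sketched with a simple pass-through stream. This is a hypothetical illustration of the idea only: the class name is made up, it does no encoding, and it is not the code in Base64Stream.

```csharp
using System;
using System.IO;

public class DirectionLatchingStream : Stream
{
    private enum Direction { Undecided, Read, Write }

    private readonly Stream inner;
    private Direction direction = Direction.Undecided;

    public DirectionLatchingStream(Stream inner) { this.inner = inner; }

    // The first call fixes the direction; a later call in the other
    // direction throws.
    private void Latch(Direction wanted)
    {
        if (direction == Direction.Undecided) direction = wanted;
        else if (direction != wanted)
            throw new NotSupportedException("Stream is already " + direction + "-only.");
    }

    // Stream.ReadByte and Stream.WriteByte call these by default,
    // so they latch the direction too.
    public override int Read(byte[] buffer, int offset, int count)
    { Latch(Direction.Read); return inner.Read(buffer, offset, count); }

    public override void Write(byte[] buffer, int offset, int count)
    { Latch(Direction.Write); inner.Write(buffer, offset, count); }

    public override bool CanRead { get { return direction != Direction.Write && inner.CanRead; } }
    public override bool CanWrite { get { return direction != Direction.Read && inner.CanWrite; } }
    public override bool CanSeek { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { inner.Flush(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
}
```

With this sketch, calling WriteByte first and then ReadByte throws NotSupportedException, and CanRead reports false once the stream has been written to.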

The UUStream class performs standard Unix encoding and decoding. It has the following constructors:

public UUStream(Stream stream);
public UUStream(Stream stream, string name, string mode);

The first constructor is read-only. The stream will extract the header information and provide that through two read-only string properties called FileName and Mode. The second constructor is write-only and the user provides the name of the file and mode that will be placed at the beginning of the output stream.
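The header in question is the first line of a uuencoded file, which conventionally has the form "begin 644 picture.jpg": the word begin, the Unix file mode, then the file name. A minimal sketch of pulling those two values out of such a line might look like this. The helper is hypothetical and is not the code inside UUStream.

```csharp
using System;

public static class UUHeader
{
    // Parses a uuencode header line of the form "begin <mode> <file name>".
    // Returns false if the line does not look like a header.
    public static bool TryParse(string line, out string mode, out string fileName)
    {
        mode = null;
        fileName = null;
        if (line == null) return false;

        // Split into at most three parts so that a file name containing
        // spaces stays in one piece.
        string[] parts = line.TrimEnd('\r', '\n').Split(new char[] { ' ' }, 3);
        if (parts.Length != 3 || parts[0] != "begin") return false;

        mode = parts[1];
        fileName = parts[2];
        return true;
    }
}
```

For the example line above this yields a mode of "644" and a file name of "picture.jpg", the kind of values that the Mode and FileName properties expose.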

Finally, the YencStream class performs Yenc encoding. This mechanism allows you to convert a binary stream to one or more output streams. This is reflected in the constructors:

public YencStream(Stream stream, string name, int size);
public YencStream(Stream stream, int byteCount, uint crc);
public YencStream(Stream stream, string name, int size, int part,
   int pbegin, int totalSize, int totalParts, uint crc);

The first constructor is a write-only stream and outputs a single part. The size of the input data and the name of the file are parameters because they have to be written to the header part of the output stream. The second constructor is a read-only stream that can be part of a multi-part set of data. The name of the file will be read from the stream and made available through the Name read-only string property. The byteCount parameter is a count of the data in this part; the Size read-only property will give the total size of all the data in all the parts of the file.

The final constructor is a write-only stream that can take multiple parts. You need to create a YencStream for each part that you want to create (and therefore it is your responsibility to calculate the size of each part). The name of the file is passed in the name parameter, the size of the entire file (i.e. the size of all the parts) is passed in totalSize and the number of parts is passed in totalParts. For each part you pass the size of the part in size, the part number in part and the start position in the file in pbegin.

Yenc ensures data integrity by providing cyclic redundancy checks. Each part has a CRC. If you have a multi-part file then you must provide the CRC from the previous part as an initialization parameter. When you have written data to the stream, you can get the CRC by reading the read-only CRC property.
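The reason a CRC can be handed from one part to the next is that a CRC-32 calculation can be resumed from an earlier result. The sketch below shows a standard table-driven CRC-32 with that property. The class is hypothetical and I am assuming the usual CRC-32 polynomial here; it illustrates the chaining idea and is not the code in YencStream.

```csharp
using System;

public static class Crc32Sketch
{
    private static readonly uint[] Table = BuildTable();

    // Builds the lookup table for the reflected polynomial 0xEDB88320.
    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[i] = c;
        }
        return table;
    }

    // Continues a CRC-32 calculation. Pass 0 for the first block of data and
    // the previously returned value for each block after that.
    public static uint Update(uint crc, byte[] data, int offset, int count)
    {
        uint c = crc ^ 0xFFFFFFFFu;
        for (int i = offset; i < offset + count; i++)
            c = Table[(c ^ data[i]) & 0xFF] ^ (c >> 8);
        return c ^ 0xFFFFFFFFu;
    }
}
```

Feeding the first half of a buffer through Update, and then the second half with the first result as the starting value, gives the same CRC as processing the whole buffer in one call.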

The download for this page is provided as a binary file only. I do not have the time to document the source code, and so I will not provide it. If you use this library you must acknowledge me in your product's documentation and in your product's About box.

  (c) Richard Grimes 2006, all rights reserved