Konves.Testing

I wanted to be able to test non-public classes and members without having to use that awful generated stuff that MSTest builds.  I also didn’t want to have to reflect into an assembly in every single unit test.  The result ended up being a nice little rabbit trail project which proxies public or non-public classes to provide access and assertions for public and non-public members.  I’m currently using it to test some of the internal serialization classes in my NBT Library project.
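
For context, this is roughly the reflection boilerplate the project is meant to hide — just a sketch, with made-up type and member names, not the Konves.Testing API:

using System;
using System.Reflection;

class ManualReflectionExample
{
    static void Main()
    {
        // Hypothetical internal type and method names, purely for illustration.
        Assembly assembly = Assembly.Load("AssemblyUnderTest");
        Type type = assembly.GetType("AssemblyUnderTest.InternalWidget");

        // Instantiate via the non-public constructor...
        object instance = Activator.CreateInstance(type, nonPublic: true);

        // ...then find and invoke a non-public instance method.
        MethodInfo method = type.GetMethod("DoWork", BindingFlags.Instance | BindingFlags.NonPublic);
        object result = method.Invoke(instance, new object[0]);

        Console.WriteLine(result);
    }
}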

The code is available on Github: https://github.com/skonves/Konves.Testing

There is a NuGet package available: https://www.nuget.org/packages/KonvesTesting

And it’s released under my very own GSD license! http://blog.stevekonves.com/2011/11/gsd/

Posted on March 2, 2014 at 9:31 pm by Steve Konves
In: Uncategorized

C# Named Binary Tag (NBT) library

In order to more elegantly hack Minecraft data files, I have created a .Net library for reading and writing named binary tag (NBT) files.  Sure, this sort of thing already exists, but I had a few specific goals in mind:

  1. Understand the file format by immersing myself in implementing the spec.
  2. Create a library with a clean, .Net style naming convention and architecture.
  3. Personally own the rights to the code.
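
In the spirit of goal 1, here is what the start of an uncompressed NBT stream looks like — a sketch of the spec itself, not of this library’s API (real Minecraft files are usually GZip-compressed, so they would need to be unwrapped first):

using System;
using System.IO;
using System.Text;

class NbtHeaderExample
{
    static void Main()
    {
        using (FileStream file = File.OpenRead("level.uncompressed.nbt"))
        {
            // The root tag is a TAG_Compound: one type byte (0x0A)...
            int tagType = file.ReadByte();

            // ...followed by a big-endian unsigned short name length and the UTF-8 name.
            int nameLength = (file.ReadByte() << 8) | file.ReadByte();
            byte[] nameBytes = new byte[nameLength];
            file.Read(nameBytes, 0, nameLength);

            Console.WriteLine("Tag type: 0x{0:X2}, name: \"{1}\"", tagType, Encoding.UTF8.GetString(nameBytes));
        }
    }
}
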
The source is available on GitHub and the binaries are available through the public NuGet feed.
Get it. Use it. Love it.
Posted on February 19, 2014 at 11:35 pm by Steve Konves
In: C#, Development, Projects

My C# Wishlist

Below are a few things that I wouldn’t mind seeing in the .Net framework in the future.

Intuitive “select many” in Linq query notation.

This should be self-explanatory.  Maybe throw in intuitive group and join, too.  Maybe you’ll get a bulk discount.
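
For anyone who hasn’t hit this: flattening a sequence of sequences today means either stacking a second from clause in query syntax or falling back to method syntax, as in this quick example:

using System;
using System.Collections.Generic;
using System.Linq;

class SelectManyExample
{
    static void Main()
    {
        var teams = new List<List<string>>
        {
            new List<string> { "Alice", "Bob" },
            new List<string> { "Carol" }
        };

        // Query syntax: the "select many" is implied by the second from clause.
        var byQuery = from team in teams
                      from member in team
                      select member;

        // Method syntax: at least the intent is named.
        var byMethod = teams.SelectMany(team => team);

        Console.WriteLine(string.Join(", ", byQuery)); // Alice, Bob, Carol
    }
}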

A “params” option for generic type constraints

This:

public delegate TResult Func<in T, out TResult>(T arg);
public delegate TResult Func<in T1, in T2, out TResult>(T1 arg1, T2 arg2);
public delegate TResult Func<in T1, in T2, in T3, out TResult>(T1 arg1, T2 arg2, T3 arg3);
public delegate TResult Func<in T1, in T2, in T3, in T4, out TResult>(T1 arg1, T2 arg2, T3 arg3, T4 arg4);
// and so on ...

Could be something like:

public delegate TResult Func<in params T[], out TResult>(params T[] args);
Posted on July 31, 2013 at 3:36 pm by Steve Konves
In: C#, Development

Interval Dictionary Library Published

I have published an alpha version of my new project: the Interval Dictionary.  This collection is similar in function to System.Collections.Generic.Dictionary<TKey,TValue>, but rather than associating values with keys, values are associated with intervals.  This allows O(log n) retrieval of a value based on a key that falls within its interval.

Consider the following example code:

IntervalDictionary<int, string> months = new IntervalDictionary<int, string>();

// Add the month's first and last day-of-year values as the interval bounds and the month name as the value
months.Add(1, 31, "January");
months.Add(32, 59, "February");
months.Add(60, 90, "March");
months.Add(91, 120, "April");
months.Add(121, 151, "May");
months.Add(152, 181, "June");
months.Add(182, 212, "July");
months.Add(213, 243, "August");
months.Add(244, 273, "September");
months.Add(274, 304, "October");
months.Add(305, 334, "November");
months.Add(335, 365, "December");

// Gets the name of the month which is associated with the 264th day of the year
string month = months[264];

In the example, an interval dictionary provides the functionality to look up the name of a month (the element value) based on a day-of-year value, because each month is defined by a day-of-year interval and a month-name value.  Without this type of data structure, worst-case evaluation would be O(n) because a user would have to iterate through each month, evaluating whether the key (264 in the example) is contained in the interval for that month.  On the back end, the Interval Dictionary is implemented as a self-balancing AVL tree.  Each tree node represents an interval defined by an upper and lower bound, and a value can be associated with each node.  Thus, one can search the tree in O(log n) for the interval that contains a specified key and then get the associated value.
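
To make the lookup concrete, here is a minimal sketch of the search (not the library’s actual implementation — insertion and AVL rebalancing are omitted):

using System;

class IntervalNode<TValue>
{
    public int Lower, Upper;
    public TValue Value;
    public IntervalNode<TValue> Left, Right;
}

static class IntervalSearch
{
    // Walk the tree like a BST: go left if the key is below the node's interval,
    // right if it is above, and stop when the interval contains the key.
    public static bool TryFind<TValue>(IntervalNode<TValue> node, int key, out TValue value)
    {
        while (node != null)
        {
            if (key < node.Lower)
                node = node.Left;
            else if (key > node.Upper)
                node = node.Right;
            else
            {
                value = node.Value;
                return true;
            }
        }

        value = default(TValue);
        return false;
    }
}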

The source code for this project is published on CodePlex. Installation is available through NuGet.

Posted on March 23, 2013 at 1:49 pm by Steve Konves
In: Interval Dictionary, Projects

Nerd Sniped by Large Text Files

So I got nerd sniped pretty hard today with a software problem:

You have a text file such that:
• The file is too large to be loaded into memory
• Any record may be duplicated
Find:
• The top n most frequently occurring records in the file.

The solution for a small file is trivial. One simply needs to open a stream reader and load each line into a Dictionary such that if the key does not exist, add a key for the current record and set its value to 1; if the key does exist, increment the value for the current record. At the end of the file, sort the dictionary descending by value and then take the first n records. Bam balam. Easy-peasy.
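
That small-file version looks roughly like this (a quick sketch of the approach just described, not production code):

using System.Collections.Generic;
using System.IO;
using System.Linq;

static class SmallFileTopN
{
    public static List<KeyValuePair<string, int>> GetTopRecords(string path, int n)
    {
        var counts = new Dictionary<string, int>();

        // Count each line as it streams by.
        foreach (string line in File.ReadLines(path))
        {
            int count;
            counts[line] = counts.TryGetValue(line, out count) ? count + 1 : 1;
        }

        // Sort descending by count and keep the first n.
        return counts.OrderByDescending(kv => kv.Value).Take(n).ToList();
    }
}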

The solution for a large file is not so trivial, but I wanted to find an efficient method nevertheless. The first challenge was actually creating a file that was too large to load into memory, yet contained records whose relative frequency distributions were known. Simply put, I needed a file where I already knew the top n most frequent records so that I could assert that my code not only compiled and ran without errors, but actually came to the correct result.

My approach to creating the file was to first create a seed file which contained 100,000 purely random records. Each record would contain between 3 and 7 “words” and each word would contain between 3 and 14 letters. Statistically, each record would only appear once in this file.

        private const string letters = "qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM0123456789";
        public static void CreateSeedFile(string filePath, int lineCount, int wordsMin, int wordsMax, int lettersMin, int lettersMax)
        {
            Random r = new Random();

            using (StreamWriter s = new StreamWriter(filePath))
            {
                s.AutoFlush = false;

                for (int line = 0; line < lineCount; line++)
                {
                    int words = r.Next(wordsMin, wordsMax + 1);

                    for ( int word = 0; word < words; word++)
                    {
                        int wordLen = r.Next(lettersMin, lettersMax + 1);

                        for (int letter = 0; letter < wordLen; letter++)
                        {
                            s.Write(letters[r.Next(0, letters.Length)]);
                        }

                        if (word < words -1)
                            s.Write(' ');
                    }

                    if(line < lineCount -1)
                        s.WriteLine();
                }

                s.Flush();
                s.Close();
            }
        }

In a loop, I then selected a random number between 1 and the number of seed records and wrote that many records, starting from the beginning of the seed file, to the “large file” for a total of 100,000,000 records. Every loop would write the first line of the seed file to the huge file, and the probability of each successive seed record being written decreased linearly until the end of the file, where the probability was near 0. This ensures that the number of times the first seed record is written to the large file is always equal to or greater than the number of times the second seed record is written, and this holds true for any two consecutive seed records. Thus, to assert the validity of my code, the resulting n most frequently occurring records in the large file should be, in order, the first n records of the seed file.

        public static void CreateHugeFile(string filePath, string seedFilePath, int seedCount, int lineCount)
        {
            Random r = new Random();

            // The using blocks flush and dispose both streams even if something throws.
            using (StreamReader seedfile = new StreamReader(seedFilePath))
            using (StreamWriter hugefile = new StreamWriter(filePath))
            {
                for (int line = 0; line < lineCount; )
                {
                    // Write a random-length prefix of the seed file, then rewind it.
                    // Earlier seed records are therefore written at least as often as later ones.
                    int seeds = r.Next(1, seedCount);

                    for (int i = 0; i < seeds && line < lineCount; i++, line++)
                    {
                        hugefile.WriteLine(seedfile.ReadLine());
                    }

                    seedfile.BaseStream.Position = 0;
                    seedfile.DiscardBufferedData();
                }
            }
        }

My approach to analyzing the large file consisted of four high-level steps:
1. Break the file down into mutually exclusive sets
2. Solve for the top n most frequently occurring record hashes in each set
3. Solve for the overall top n most frequently occurring hashes across all sets
4. Traverse the file one last time to resolve each of the hashes to its originating record.

Step 1
In order to create mutually exclusive sets, each distinct record must be in exactly one set. Thus, splitting the file in half, for example, would allow records in the first half to also be duplicated in the last half. I chose to segregate the sets by grouping records by their first character. In theory, one could balance the workload by grouping those sets further (e.g. a-d, f-m, n-o, etc.), which would reduce the number of times the file would need to be read in step two. For the sake of time, I did not pursue that feature.

        public static Dictionary<char, int> GetCharacterDistribution(string path)
        {
            Dictionary<char, int> sets = new Dictionary<char, int>();

            using (StreamReader r = new StreamReader(path))
            {
                char c;
                while (!r.EndOfStream)
                {
                    c = r.ReadLine()[0];

                    if (sets.ContainsKey(c))
                        sets[c]++;
                    else
                        sets.Add(c, 1);
                }
            }

            return sets;
        }

Step 2
Next I read the file line by line, once for each set. Lines that are not part of the set are ignored. I set up a Dictionary whose key is the hash code of each line and whose value is the number of times the record which produces that hash appears in the file. For each line, the hash is evaluated and then added if not already present; otherwise, the value is incremented. Once the file has been read for a set, the dictionary can be sorted descending by value and the top n records stored. This is done for each set.

        public static Dictionary<int, int> GetTopHashCounts(char set, int n, string path)
        {
            Dictionary<int, int> hashes = new Dictionary<int, int>();

            using (StreamReader r = new StreamReader(path))
            {
                string record;
                int hash;
                while (!r.EndOfStream)
                {
                    record = r.ReadLine();

                    if (record[0] == set)
                    {
                        hash = record.GetHashCode();

                        if (hashes.ContainsKey(hash))
                            hashes[hash]++;
                        else
                            hashes.Add(hash, 1);
                    }
                }
            }

            var q = hashes.OrderByDescending(h => h.Value).Take(n).ToDictionary(h=>h.Key, h=>h.Value);

            return q;
        }

Step 3
Once the top n hashes have been retrieved for each set, all of the sets' hash results can be sorted descending (as one list) and the overall top n hashes can be determined.

            var topNHashes =
                setTopHashes
                .SelectMany(h => h.Value)
                .OrderByDescending(h => h.Value)
                .Take(n)
                .ToDictionary(h => h.Key, h => h.Value);

Step 4
At this point, we only know the most common hashes, not the most common records. Because a hash is a one-way function, the file needs to be traversed one last time to resolve the hashes back to their original records’ string values.

        public static Dictionary<string, int> GetTopRecords(string path, Dictionary<int, int> hashCounts)
        {
            Dictionary<string, int> recordCounts = new Dictionary<string, int>();

            using (StreamReader r = new StreamReader(path))
            {
                string record;
                int hash;
                while (!r.EndOfStream)
                {
                    record = r.ReadLine();

                    hash = record.GetHashCode();

                    if (hashCounts.Keys.Contains(hash))
                    {
                        recordCounts.Add(record, hashCounts[hash]);
                        hashCounts.Remove(hash);
                        if (!hashCounts.Any())
                            break;
                    }
                }
            }

            return recordCounts.OrderByDescending(h => h.Value).ToDictionary(h => h.Key, h => h.Value);
        }

After running the final project, the result matched the top records of my seed list in the original order.  <put success-baby meme here>

Points to consider
The size of the file to be read can theoretically be very large as long as it is broken up into enough sets: the total work is one full pass over the file per set (plus one pass to find the sets and one final pass to resolve hashes), while memory only ever has to hold the hash counts for a single set at a time. With 62 sets (a-z, A-Z, 0-9), one billion records could be analyzed with about 16-17 million records per set (assuming even set distribution), which is well within the memory limitations of a .Net application.

This example uses .GetHashCode() which is not guaranteed to produce unique hashes; however, during the course of my testing, I never ran into any collisions.

If the records are, on average, significantly smaller than the hashes, then skipping the hashing and just using the record value would be a viable solution.

Full project:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace ConsoleApplication1256
{
    class Program
    {
        static void Main(string[] args)
        {
            int seeds = (int)1e5;
            int lines = (int)1e8;
            CreateSeedFile(@"e:\Users\Stephen\Desktop\seedfile-1e9.txt", seeds, 3, 7, 3, 14);
            CreateHugeFile(@"e:\Users\Stephen\Desktop\hugefile-1e9.txt", @"e:\Users\Stephen\Desktop\seedfile-1e9.txt", seeds, lines);

            string path = @"e:\Users\Stephen\Desktop\hugefile-1e9.txt";

            var sets = GetCharacterDistribution(path);

            Dictionary<char, Dictionary<int, int>> setTopHashes = new Dictionary<char, Dictionary<int, int>>();

            int n = 10;

            foreach (var set in sets)
            {
                setTopHashes.Add(set.Key, GetTopHashCounts(set.Key, n, path));
            }

            var topNHashes =
                setTopHashes
                .SelectMany(h => h.Value)
                .OrderByDescending(h => h.Value)
                .Take(n)
                .ToDictionary(h => h.Key, h => h.Value);

            var result = GetTopRecords(path, topNHashes);
        }

        public static Dictionary<string, int> GetTopRecords(string path, Dictionary<int, int> hashCounts)
        {
            Dictionary<string, int> recordCounts = new Dictionary<string, int>();

            using (StreamReader r = new StreamReader(path))
            {
                string record;
                int hash;
                while (!r.EndOfStream)
                {
                    record = r.ReadLine();

                    hash = record.GetHashCode();

                    if (hashCounts.Keys.Contains(hash))
                    {
                        recordCounts.Add(record, hashCounts[hash]);
                        hashCounts.Remove(hash);
                        if (!hashCounts.Any())
                            break;
                    }
                }
            }

            return recordCounts.OrderByDescending(h => h.Value).ToDictionary(h => h.Key, h => h.Value);
        }

        public static Dictionary<int, int> GetTopHashCounts(char set, int n, string path)
        {
            Dictionary<int, int> hashes = new Dictionary<int, int>();

            using (StreamReader r = new StreamReader(path))
            {
                string record;
                int hash;
                while (!r.EndOfStream)
                {
                    record = r.ReadLine();

                    if (record[0] == set)
                    {
                        hash = record.GetHashCode();

                        if (hashes.ContainsKey(hash))
                            hashes[hash]++;
                        else
                            hashes.Add(hash, 1);
                    }
                }
            }

            var q = hashes.OrderByDescending(h => h.Value).Take(n).ToDictionary(h=>h.Key, h=>h.Value);

            return q;
        }

        public static Dictionary<char, int> GetCharacterDistribution(string path)
        {
            Dictionary<char, int> sets = new Dictionary<char, int>();

            using (StreamReader r = new StreamReader(path))
            {
                char c;
                while (!r.EndOfStream)
                {
                    c = r.ReadLine()[0];

                    if (sets.ContainsKey(c))
                        sets[c]++;
                    else
                        sets.Add(c, 1);
                }
            }

            return sets;
        }

        public static void CreateHugeFile(string filePath, string seedFilePath, int seedCount, int lineCount)
        {
            Random r = new Random();

            // The using blocks flush and dispose both streams even if something throws.
            using (StreamReader seedfile = new StreamReader(seedFilePath))
            using (StreamWriter hugefile = new StreamWriter(filePath))
            {
                for (int line = 0; line < lineCount; )
                {
                    // Write a random-length prefix of the seed file, then rewind it.
                    // Earlier seed records are therefore written at least as often as later ones.
                    int seeds = r.Next(1, seedCount);

                    for (int i = 0; i < seeds && line < lineCount; i++, line++)
                    {
                        hugefile.WriteLine(seedfile.ReadLine());
                    }

                    seedfile.BaseStream.Position = 0;
                    seedfile.DiscardBufferedData();
                }
            }
        }

        private const string letters = "qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM0123456789";
        public static void CreateSeedFile(string filePath, int lineCount, int wordsMin, int wordsMax, int lettersMin, int lettersMax)
        {
            Random r = new Random();

            using (StreamWriter s = new StreamWriter(filePath))
            {
                s.AutoFlush = false;

                for (int line = 0; line < lineCount; line++)
                {
                    int words = r.Next(wordsMin, wordsMax + 1);

                    for ( int word = 0; word < words; word++)
                    {
                        int wordLen = r.Next(lettersMin, lettersMax + 1);

                        for (int letter = 0; letter < wordLen; letter++)
                        {
                            s.Write(letters[r.Next(0, letters.Length)]);
                        }

                        if (word < words -1)
                            s.Write(' ');
                    }

                    if(line < lineCount -1)
                        s.WriteLine();
                }

                s.Flush();
                s.Close();
            }
        }
    }
}
Posted on March 4, 2013 at 11:20 pm by Steve Konves
In: Uncategorized

Slow Cheetah (VS Plugin)

http://visualstudiogallery.msdn.microsoft.com/69023d00-a4f9-4a34-a6cd-7e854ba318b5

Do less with more! (Less effort with more projects … ok, lame.) This plugin supports config file transforms for all project types, not just web applications. Want to change configurations for dev, test, and production? It does that! Want to preview the transform before deploying? Yup, it does that too. Want to instantly write all of your code based on the config file? Well, maybe you can talk them into adding that in a future release.
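
For the curious, SlowCheetah applies the same XDT transform syntax used by web.config transforms; a minimal App.Release.config that swaps a connection string for production might look something like this (the names here are illustrative):

<?xml version="1.0" encoding="utf-8"?>
<!-- App.Release.config: merged over App.config when building the Release configuration. -->
<configuration xmlns:xdt="http://schemas.microsoft.com/XML-Document-Transform">
  <connectionStrings>
    <add name="Main"
         connectionString="Server=prod-sql;Database=App;Integrated Security=True"
         xdt:Transform="SetAttributes" xdt:Locator="Match(name)" />
  </connectionStrings>
</configuration>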

I maintain a list of my favorite dev tools in my post It’s all about Productivity.

Posted on November 29, 2012 at 11:13 am by Steve Konves
In: Uncategorized

Psalm Infographic

For the past year or so I have been compiling data from the Book of Psalms into an infographic which should be pretty self-explanatory. If you have any feedback as far as typographical errors, poorly worded explanations, or just bad kerning, please feel free to post a comment. Enjoy!

This document is sized as an 18″x24″ poster. If you would like a printed copy, it is currently available for purchase for $14.95.

Posted on November 28, 2012 at 9:09 pm by Steve Konves
In: Uncategorized

C# MongoDB snippet

Here’s a handy snippet I use for building out data models for MongoDB.  Feel free to use, modify, share, and plagiarize as needed.

<?xml version="1.0" encoding="utf-8"?>
<CodeSnippets xmlns="http://schemas.microsoft.com/VisualStudio/2005/CodeSnippet">
  <CodeSnippet Format="1.0.0">
    <Header>
      <SnippetTypes>
        <SnippetType>Expansion</SnippetType>
      </SnippetTypes>
      <Title>BSON Serialization Property</Title>
      <Author>Steve Konves</Author>
      <Description>
      </Description>
      <HelpUrl>
      </HelpUrl>
      <Shortcut>bprop</Shortcut>
    </Header>
    <Snippet>
      <Declarations>
        <Literal Editable="true">
          <ID>Property Name</ID>
          <ToolTip>
          </ToolTip>
          <Default>BsonProperty</Default>
          <Function>
          </Function>
        </Literal>
        <Literal Editable="true">
          <ID>Type</ID>
          <ToolTip>
          </ToolTip>
          <Default>string</Default>
          <Function>
          </Function>
        </Literal>
        <Literal Editable="true">
          <ID>Abbr</ID>
          <ToolTip>
          </ToolTip>
          <Default>bp</Default>
          <Function>
          </Function>
        </Literal>
        <Literal Editable="true">
          <ID>BsonProperty</ID>
          <ToolTip>BsonProperty</ToolTip>
          <Default>BsonProperty</Default>
          <Function>
          </Function>
        </Literal>
      </Declarations>
      <Code Language="csharp"><![CDATA[[BsonRequired]
        [BsonElement(_$BsonProperty$ElementName)]
        public $Type$ $BsonProperty$ { get; set; }
        private const string _$BsonProperty$ElementName = "$Abbr$";
        public static readonly string $BsonProperty$ElementName = _$BsonProperty$ElementName;]]></Code>
    </Snippet>
  </CodeSnippet>
</CodeSnippets>
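
For reference, expanding the snippet with Type = string, Property Name = FirstName, and Abbr = “fn” (placeholder values I picked for illustration) produces a property shaped like this:

// Example expansion of the "bprop" snippet (FirstName and "fn" are illustrative values).
using MongoDB.Bson.Serialization.Attributes;

public class Person
{
    [BsonRequired]
    [BsonElement(_FirstNameElementName)]
    public string FirstName { get; set; }
    private const string _FirstNameElementName = "fn";
    public static readonly string FirstNameElementName = _FirstNameElementName;
}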

 

Posted on September 27, 2012 at 6:25 pm by Steve Konves
In: C#, Development

Scripture Reference Parser library is published

My last post was in March.  That is because I have been working feverishly on a new project, and it has NOTHING to do with the fact that my wife and I had a beautiful baby girl in April.  I kid.  And I digress.

Off and on since last year (more off than on, but whatever) I have been working on a library to parse scripture verses.  It started while working on SermonsToGrow.com for a friend of mine and deals with the inherently complicated parsing of scripture references.  References come in various flavors with combinations of abbreviations, different ways to delimit lists, and so on.  Consider the following cases:

  1. John 3:16
  2. jn 3.16
  3. Jn 3:16-7,20
  4. jon 5-6
Each of the above formats is common, yet requires more than a trivial regex to parse.  Also, not all books or verses are present in every religious text.  Different Bible translations contain or omit apocryphal books, and the Book of Mormon, for example, has a completely different set altogether.  The library uses embedded XML files containing the definitions of each supported text, which provide fast, powerful parsing for virtually any reference from any scripture text.
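
For a sense of why, a regex that only handles the first case might look like the sketch below (not the library’s approach); abbreviations, verse ranges, and comma-delimited lists all pile on top of this:

using System;
using System.Text.RegularExpressions;

class NaiveReferenceParser
{
    static void Main()
    {
        // Handles "John 3:16" (and "jn 3.16" at a stretch), but not ranges or lists.
        Match m = Regex.Match("John 3:16", @"^(?<book>[1-3]?\s?[A-Za-z]+)\s+(?<chapter>\d+)[:.](?<verse>\d+)$");

        if (m.Success)
            Console.WriteLine("{0}, chapter {1}, verse {2}", m.Groups["book"].Value, m.Groups["chapter"].Value, m.Groups["verse"].Value);
    }
}
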
This project is available on codeplex.com: http://scripturereference.codeplex.com/. Feel free to contact me about supporting this project with updates or code review.
Posted on September 13, 2012 at 10:23 pm by Steve Konves
In: C#, Development, Scripture Reference Parser

Useful Equation

Observe the following:

1) Comedy = Tragedy + Time
2) Code = Caffeine + Time

The two statements are demonstrated as true by a combination of Portal 2 and your average software shop.  Bonus points for being a software developer who’s played the game. But I digress.

Assuming equations 1 and 2 are correct (solve each for Time and set the results equal), it can be proven that having too much fun while consuming caffeine will result in bad code:

3) Comedy + Caffeine = Code + Tragedy

In conclusion, do not code under the influence of video games and energy drinks.

Posted on March 12, 2012 at 7:34 pm by Steve Konves
In: Development