Interval Dictionary Library Published
I have published an alpha version of my new project: the Interval Dictionary. This collection is similar in function to System.Collections.Generic.Dictionary<TKey,TValue> but rather than associating values to keys, a value is associated to an interval. This allows O(log n) retrieval of values based on a key that would fall within an interval.
Consider the following example code:
IntervalDictionary<int, string> months = new IntervalDictionary<int, string>(); // Add the month's first and last day-of-year values as the interval bounds and the month name as the value months.Add(1, 31, "January"); months.Add(32, 59, "February"); months.Add(60, 90, "March"); months.Add(91, 120, "April"); months.Add(121, 151, "May"); months.Add(152, 181, "June"); months.Add(182, 212, "July"); months.Add(213, 243, "August"); months.Add(244, 273, "September"); months.Add(274, 304, "October"); months.Add(305, 334, "November"); months.Add(335, 365, "December"); // Gets the name of the month which is associated with the 264th day of the year string month = months[264];
In the example, an interval dictionary provides the functionality to look up the name of a month (element value) based on a day-of-year value because each month is defined by a day-of-year interval and a month name value. Without this type of data structure, worst case evaluation would be O(n) because a user would have to iterate through each month evaluating whether the key (264 in the example) in contained in the interval for the month. On the back end, the Interval Dictionary implements a self-balancing AVL-tree. Each tree node represents an interval defined by an upper and lower bound. A value can then be associated with each tree node. Thus, one can search the tree in O(log n) for an interval that contains a specified key and then get the associated value.
The source code for this project is published on CodePlex. Installation is available through NuGet.
In: Interval Dictionary, Projects
Nerd Sniped by Large Text Files
So I got nerd sniped pretty hard today with a software problem:
You have a text file such that:
• The file is too large to be loaded into memory
• Any record may be duplicated
Find:
• The top n most frequently occurring records in the file.
The solution for a small file is trivial. One simply needs to open a stream reader and load each line into a Dictionarysuch that if the key does not exist, add one for the current record add set the value to 1; if the key does exist, increment the value for the current record. At the end of the file, sort descending the dictionary by the values and then take the first n records. Bam balam. Easy-peasy.
The solution for a large file is not so trivial, but I wanted to find an efficient method nevertheless. The first challenge was actually creating a file that was too large to load into memory, yet that contained a dataset that contained records whose relative frequency distributions where known. Simply put, I needed a file where I already know the top n most frequent records to that I could assert that my code not only compiled and ran without errors, but that I actually came to the correct result.
My approach to creating the file was to first create a seed file which contained 100,000 purely random records. Each record would contain between 3 to 7 “words” and each word would contain between 3 to 14 letters. Statistically, each record would only appear once in this file.
private const string letters = "qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM0123456789";
public static void CreateSeedFile(string filePath, int lineCount, int wordsMin, int wordsMax, int lettersMin, int lettersMax)
{
Random r = new Random();
using (StreamWriter s = new StreamWriter(filePath))
{
s.AutoFlush = false;
for (int line = 0; line < lineCount; line++)
{
int words = r.Next(wordsMin, wordsMax + 1);
for ( int word = 0; word < words; word++)
{
int wordLen = r.Next(lettersMin, lettersMax + 1);
for (int letter = 0; letter < wordLen; letter++)
{
s.Write(letters[r.Next(0, letters.Length)]);
}
if (word < words -1)
s.Write(' ');
}
if(line < lineCount -1)
s.WriteLine();
}
s.Flush();
s.Close();
}
}
In a loop, I then selected a random number between 1 and 1,000,000 and wrote than many records, starting from the beginning of the seed file, to the “large file” for a total of 100,000,000 records. Every loop would write the first line of the seed file to the huge file and the probability of each successive seed record being written decreased linearly until the end of the file where the probability was near 0. This ensures that the number of times the first seed record would be written to the large file would always be equal to or greater than the number of times the second file would be written. This holds true for any two consecutive seed records. Thus, to assert the validity of my code, the resulting n most frequently occurring records in the large file should be, in order, the first n records of the seed file.
public static void CreateHugeFile(string filePath, string seedFilePath, int seedCount, int lineCount)
{
StreamReader seedfile = null;
StreamWriter hugefile = null;
try
{
seedfile = new StreamReader(seedFilePath);
hugefile = new StreamWriter(filePath);
Random r = new Random();
for (int line = 0; line < lineCount; )
{
int seeds = r.Next(1, seedCount);
for (int i = 0; i < seeds && line < lineCount; i++, line++)
{
hugefile.WriteLine(seedfile.ReadLine());
}
seedfile.BaseStream.Position = 0;
seedfile.DiscardBufferedData();
}
}
finally
{
seedfile.Close();
hugefile.Flush();
hugefile.Close();
(seedfile as IDisposable).Dispose();
(hugefile as IDisposable).Dispose();
}
}
My approach to analyzing the large file was four high level steps:
1. Break the file down into mutually exclusive sets
2. Solve for the top n most frequently occurring record hashes in each set
3. Solve for the overall top n most frequently occurring hashes across all sets
4. Traverse the file one last time to resolve each of the hashes to its originating record.
Step 1
In order to create mutually exclusive sets, each distinct record must be in exactly one set. Thus, splitting the file in half, for example, would allow for records in the first half to also be duplicated in the last half. I chose to segregate sets by grouping them by the first character of each record. In theory, one could balance the workload by grouping those sets further (ie. a-d, f-m, n-o, etc.) which would reduce the number of times the file would need to be read in step two. For the sake of time, I did not pursue that feature.
public static Dictionary<char, int> GetCharacterDistribution(string path)
{
Dictionary<char, int> sets = new Dictionary<char, int>();
using (StreamReader r = new StreamReader(path))
{
char c;
while (!r.EndOfStream)
{
c = r.ReadLine()[0];
if (sets.ContainsKey(c))
sets[c]++;
else
sets.Add(c, 1);
}
}
return sets;
}
Step 2
Next I read the file line by line once for each set. Lines that are not part of the set are ignored. I set up a Dictionary whose key is the hash code for each line and whose value is the number of times the record which produces that hash appears in the file. For each line, the hash is evaluated and then added if not already present, otherwise the value is incremented. One the file is read for one set, the dictionary can be sorted descending by value, and the top n records stored. This is done for each set.
public static Dictionary<int, int> GetTopHashCounts(char set, int n, string path)
{
Dictionary<int, int> hashes = new Dictionary<int, int>();
using (StreamReader r = new StreamReader(path))
{
string record;
int hash;
while (!r.EndOfStream)
{
record = r.ReadLine();
if (record[0] == set)
{
hash = record.GetHashCode();
if (hashes.ContainsKey(hash))
hashes[hash]++;
else
hashes.Add(hash, 1);
}
}
}
var q = hashes.OrderByDescending(h => h.Value).Take(n).ToDictionary(h=>h.Key, h=>h.Value);
return q;
}
Step 3
Once the top n hashes have been retrieved for each set, all of the sets hash results can be sorted descending (as one list) and the top n hashes can be determined.
var topNHashes =
setTopHashes
.SelectMany(h => h.Value)
.OrderByDescending(h => h.Value)
.Take(n)
.ToDictionary(h => h.Key, h => h.Value);
Step 4
At this point, we only know the most common hashes, not the most common records. Because a hash is a one way function, the file needs to be traversed on last time to resolve the hashes back to their original records’ string values.
public static Dictionary<string, int> GetTopRecords(string path, Dictionary<int, int> hashCounts)
{
Dictionary<string, int> recordCounts = new Dictionary<string, int>();
using (StreamReader r = new StreamReader(path))
{
string record;
int hash;
while (!r.EndOfStream)
{
record = r.ReadLine();
hash = record.GetHashCode();
if (hashCounts.Keys.Contains(hash))
{
recordCounts.Add(record, hashCounts[hash]);
hashCounts.Remove(hash);
if (!hashCounts.Any())
break;
}
}
}
return recordCounts.OrderByDescending(h => h.Value).ToDictionary(h => h.Key, h => h.Value);
}
After running the final project, the result matched the top records of my seed list in the original order. <put success-baby meme here>
Points to consider
The size of the file to be read can theoretically be very large as long as it is broken up into enough sets. Thus the efficiency is roughly O(logx n) where x is very large making it virtually linear. With 62 sets (a-z, A-Z, 0-9) one billion records could be analyzed with about 16-17 million records per set (assuming even set distribution) which is well within the memory limitations of a .Net application.
This example uses .GetHashCode() which is not guaranteed to produce unique hashes; however, during the course of my testing, I never ran into any collisions.
If the records are, on average, significantly smaller that the hashes, then skipping the hashing and just using the record value would be a viable solution.
Full project:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace ConsoleApplication1256
{
class Program
{
static void Main(string[] args)
{
int seeds = (int)1e5;
int lines = (int)1e8;
CreateSeedFile(@"e:\Users\Stephen\Desktop\seedfile-1e9.txt", seeds, 3, 7, 3, 14);
CreateHugeFile(@"e:\Users\Stephen\Desktop\hugefile-1e9.txt", @"e:\Users\Stephen\Desktop\seedfile-1e9.txt", seeds, lines);
string path = @"e:\Users\Stephen\Desktop\hugefile-1e9.txt";
var sets = GetCharacterDistribution(path);
Dictionary<char, Dictionary<int, int>> setTopHashes = new Dictionary<char, Dictionary<int, int>>();
int n = 10;
foreach (var set in sets)
{
setTopHashes.Add(set.Key, GetTopHashCounts(set.Key, n, path));
}
var topNHashes =
setTopHashes
.SelectMany(h => h.Value)
.OrderByDescending(h => h.Value)
.Take(n)
.ToDictionary(h => h.Key, h => h.Value);
var result = GetTopRecords(path, topNHashes);
}
public static Dictionary<string, int> GetTopRecords(string path, Dictionary<int, int> hashCounts)
{
Dictionary<string, int> recordCounts = new Dictionary<string, int>();
using (StreamReader r = new StreamReader(path))
{
string record;
int hash;
while (!r.EndOfStream)
{
record = r.ReadLine();
hash = record.GetHashCode();
if (hashCounts.Keys.Contains(hash))
{
recordCounts.Add(record, hashCounts[hash]);
hashCounts.Remove(hash);
if (!hashCounts.Any())
break;
}
}
}
return recordCounts.OrderByDescending(h => h.Value).ToDictionary(h => h.Key, h => h.Value);
}
public static Dictionary<int, int> GetTopHashCounts(char set, int n, string path)
{
Dictionary<int, int> hashes = new Dictionary<int, int>();
using (StreamReader r = new StreamReader(path))
{
string record;
int hash;
while (!r.EndOfStream)
{
record = r.ReadLine();
if (record[0] == set)
{
hash = record.GetHashCode();
if (hashes.ContainsKey(hash))
hashes[hash]++;
else
hashes.Add(hash, 1);
}
}
}
var q = hashes.OrderByDescending(h => h.Value).Take(n).ToDictionary(h=>h.Key, h=>h.Value);
return q;
}
public static Dictionary<char, int> GetCharacterDistribution(string path)
{
Dictionary<char, int> sets = new Dictionary<char, int>();
using (StreamReader r = new StreamReader(path))
{
char c;
while (!r.EndOfStream)
{
c = r.ReadLine()[0];
if (sets.ContainsKey(c))
sets[c]++;
else
sets.Add(c, 1);
}
}
return sets;
}
public static void CreateHugeFile(string filePath, string seedFilePath, int seedCount, int lineCount)
{
StreamReader seedfile = null;
StreamWriter hugefile = null;
try
{
seedfile = new StreamReader(seedFilePath);
hugefile = new StreamWriter(filePath);
Random r = new Random();
for (int line = 0; line < lineCount; )
{
int seeds = r.Next(1, seedCount);
for (int i = 0; i < seeds && line < lineCount; i++, line++)
{
hugefile.WriteLine(seedfile.ReadLine());
}
seedfile.BaseStream.Position = 0;
seedfile.DiscardBufferedData();
}
}
finally
{
seedfile.Close();
hugefile.Flush();
hugefile.Close();
(seedfile as IDisposable).Dispose();
(hugefile as IDisposable).Dispose();
}
}
private const string letters = "qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM0123456789";
public static void CreateSeedFile(string filePath, int lineCount, int wordsMin, int wordsMax, int lettersMin, int lettersMax)
{
Random r = new Random();
using (StreamWriter s = new StreamWriter(filePath))
{
s.AutoFlush = false;
for (int line = 0; line < lineCount; line++)
{
int words = r.Next(wordsMin, wordsMax + 1);
for ( int word = 0; word < words; word++)
{
int wordLen = r.Next(lettersMin, lettersMax + 1);
for (int letter = 0; letter < wordLen; letter++)
{
s.Write(letters[r.Next(0, letters.Length)]);
}
if (word < words -1)
s.Write(' ');
}
if(line < lineCount -1)
s.WriteLine();
}
s.Flush();
s.Close();
}
}
}
}
Slow Cheetah (VS Plugin)
http://visualstudiogallery.msdn.microsoft.com/69023d00-a4f9-4a34-a6cd-7e854ba318b5
Do less with more! (Less effort with more projects … ok, lame.) This plugin supports config file transforms for all project types, not just web applications. Want to change configurations for dev, test, and production. It does that! Want to preview the transform before deploying. Yup it does that too. Want to instantly write all of your code based on the config file. Well, maybe you can talk them into adding that in a future release.
I maintain a list of my favorite dev tools in my post It’s all about Productivity
In: Uncategorized
Psalm Infographic
For the past year or so I have been compiling data from the Book of Psalms into an infographic which should be pretty self-explanatory. If you have any feedback as far as typographical errors, poorly worded explanations, or just bad kerning, please feel free to post a comment. Enjoy!
This document sized as a 18″x24″ poster. If you would like a printed copy, it is currently available for purchase at for $14.95.
In: Uncategorized
C# MongoDB snippet
Here’s a handy snippet I use for building out data models for MongoDB. Feel free to use, modify, share, and plagiarize as needed.
<?xml version="1.0" encoding="utf-8"?>
<CodeSnippets xmlns="http://schemas.microsoft.com/VisualStudio/2005/CodeSnippet">
<CodeSnippet Format="1.0.0">
<Header>
<SnippetTypes>
<SnippetType>Expansion</SnippetType>
</SnippetTypes>
<Title>BSON Serialization Property</Title>
<Author>Steve Konves</Author>
<Description>
</Description>
<HelpUrl>
</HelpUrl>
<Shortcut>bprop</Shortcut>
</Header>
<Snippet>
<Declarations>
<Literal Editable="true">
<ID>Property Name</ID>
<ToolTip>
</ToolTip>
<Default>BsonProperty</Default>
<Function>
</Function>
</Literal>
<Literal Editable="true">
<ID>Type</ID>
<ToolTip>
</ToolTip>
<Default>string</Default>
<Function>
</Function>
</Literal>
<Literal Editable="true">
<ID>Abbr</ID>
<ToolTip>
</ToolTip>
<Default>bp</Default>
<Function>
</Function>
</Literal>
<Literal Editable="true">
<ID>BsonProperty</ID>
<ToolTip>BsonProperty</ToolTip>
<Default>BsonProperty</Default>
<Function>
</Function>
</Literal>
</Declarations>
<Code Language="csharp"><![CDATA[[BsonRequired]
[BsonElement(_$BsonProperty$ElementName)]
public $Type$ $BsonProperty$ { get; set; }
private const string _$BsonProperty$ElementName = "$Abbr$";
public static readonly string $BsonProperty$ElementName = _$BsonProperty$ElementName;]]></Code>
</Snippet>
</CodeSnippet>
</CodeSnippets>
In: C#, Development
Scripture Reference Parser library is published
My last post was in March. That is because I have been working feverishly on a new project and has NOTHING to do with the fact that my wife and I had a beautiful baby girl in April. I kid. And I digress.
Off and on since last year (more off then on, but whatever) I have been working on a library to parse scripture verses. It started while working on SermonsToGrow.com for a friend of mine and deals with the inherently complicated parsing of scripture references. References come if various flavors with combinations of abbreviations, different ways to delimit lists, and so on. Consider the following cases:
- John 3:16
- jn 3.16
- Jn 3:16-7,20
- jon 5-6
In: C#, Development, Scripture Reference Parser
Useful Equation
Observe the following:
1) Comedy = Tragedy + Time
2) Code = Caffeine + Time
The two statements are demonstrated as true by a combination of Portal 2 and your average software shop. Bonus points for being a software developer who’s played the game. But I digress.
Assuming equations 1 and 2 are correct, then it can be proven that having too much fun while consuming caffeine will result in bad code using the following:
3) Comedy + Caffeine = Code + Tragedy
In conclusion, do not code under the influence of video games and energy drinks.
Back to the drawing board
do
{
ViewDrawingBoard(newIdeas);
}
An Even Better .NET Regular Expression Tester
For a long time, I have been a fan of Derek Slager’s “Better .NET Regular Expression Tester”. But as an avid user of Visual Studio, I have gotten used to using F5 to quickly debug code; however, when using Derek’s tester, F5 refreshes the page and I loose all my work. Thus, I have created an “even better” tester in Silverlight which evaluates the input text against the pattern in real-time as well as displaying error messages about malformed patterns. Check it out.
This project is available on CodePlex at slregextester.codeplex.com.
In: C#, Development, Regex Tester, Silverlight
New Project Idea: Internet Based Grammar
It you have ever spent any time online, you have probably spent some time browsing comments – forum comments, new article comments, other comments. I will give every reader the benefit of the doubt that they have only ever posted useful, constructive, pertinent things in comment thread or have held their tongues. However one thing you will notice is that every discussion will always digress to a comment about grammar, at which point what ever the objective validity of the subject being discussed, all previous comments are in error and thus completely null and void. Could you imagine if people interacted like this in real life?
Cashier: “That comes to $45.63. Please swipe your card and enter your PIN number”
Shopper: “YOU ABSOLUTE STUPID MORONN CANT YOU SEE THAT YOU, OBBVIOUSLY ARE A TOTAL MORON, AND ARE, IN FACT, A MORON!!!!!!!!!111ONE!!!! ITS NOT PIN NUMBER ITS JUST PIN YOU ARE SO STUPID!!!!!!!”
Cashier: “Why do you always yell at me in caps lock every time that you come through my line? Every time you shop here you do it. Do you need a key bored that is not broken?”
Shopper: “ITS NOT BORED ITS BOARD!!!1 BORED IS WHEN YOU DONT HAVE ANYTHING LEFT TO DO IN YOU’RE MOMS BASEMENT!!!!!!!! YOU ARE SUCH A MORON!!!!”
Cashier: “Shoptard its not you’re its your.”
As sad as this type of interaction is, the amazing thing is that, for the most part, the rules of grammar being thrown back and forth are correct. Every day thousands upon thousands of entries are added to the world’s largest grammar repository and, so far, are going unused. Enter the new idea: Internet Based Grammar.
Overview:
Basically, the system would scrape websites for comment threads and identify potential misuses of grammar. (btw that’s the easy part as they are so ubiquitous) Then the service would identify later comments that discuss the former misuse. The results would be distilled into a tentative rule and added to a database. Common patterns in the database would could then be compiled as a thorough grammar rule book.
Feel free to comment on this post.
