Nerd Sniped by Large Text Files

So I got nerd sniped pretty hard today with a software problem:

You have a text file such that:
• The file is too large to be loaded into memory
• Any record may be duplicated
Find:
• The top n most frequently occurring records in the file.

The solution for a small file is trivial. One simply needs to open a stream reader and load each line into a Dictionary such that if the key does not exist, add an entry for the current record and set its value to 1; if the key does exist, increment the value for the current record. At the end of the file, sort the dictionary descending by value and take the first n records. Bam balam. Easy-peasy.
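
For reference, a minimal sketch of that small-file approach might look something like the following (the method name GetTopRecordsSmallFile is just for illustration and isn't part of the project below):

        public static IEnumerable<KeyValuePair<string, int>> GetTopRecordsSmallFile(string path, int n)
        {
            // Count occurrences of each line; fine as long as the distinct records fit in memory.
            Dictionary<string, int> counts = new Dictionary<string, int>();

            using (StreamReader r = new StreamReader(path))
            {
                string record;
                while ((record = r.ReadLine()) != null)
                {
                    if (counts.ContainsKey(record))
                        counts[record]++;
                    else
                        counts.Add(record, 1);
                }
            }

            // Sort descending by count and take the first n records.
            return counts.OrderByDescending(c => c.Value).Take(n);
        }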

The solution for a large file is not so trivial, but I wanted to find an efficient method nevertheless. The first challenge was actually creating a file that was too large to load into memory, yet contained records whose relative frequencies were known. Simply put, I needed a file where I already knew the top n most frequent records so that I could assert that my code not only compiled and ran without errors, but actually arrived at the correct result.

My approach to creating the file was to first create a seed file which contained 100,000 purely random records. Each record would contain between 3 and 7 “words” and each word would contain between 3 and 14 letters. Statistically, each record would almost certainly appear only once in this file.

        private const string letters = "qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM0123456789";
        public static void CreateSeedFile(string filePath, int lineCount, int wordsMin, int wordsMax, int lettersMin, int lettersMax)
        {
            Random r = new Random();

            using (StreamWriter s = new StreamWriter(filePath))
            {
                s.AutoFlush = false;

                for (int line = 0; line < lineCount; line++)
                {
                    // Each record contains a random number of "words".
                    int words = r.Next(wordsMin, wordsMax + 1);

                    for (int word = 0; word < words; word++)
                    {
                        // Each word contains a random number of random characters.
                        int wordLen = r.Next(lettersMin, lettersMax + 1);

                        for (int letter = 0; letter < wordLen; letter++)
                        {
                            s.Write(letters[r.Next(0, letters.Length)]);
                        }

                        if (word < words - 1)
                            s.Write(' ');
                    }

                    // No newline after the last record.
                    if (line < lineCount - 1)
                        s.WriteLine();
                }

                s.Flush();
            }
        }

In a loop, I then selected a random number between 1 and the seed record count (100,000) and wrote that many records, starting from the beginning of the seed file, to the “large file” for a total of 100,000,000 records. Every pass would write the first line of the seed file to the huge file, and the probability of each successive seed record being written decreased linearly until the end of the seed file, where the probability was near 0. This ensures that the number of times the first seed record is written to the large file is always greater than or equal to the number of times the second seed record is written, and the same holds for any two consecutive seed records. Thus, to assert the validity of my code, the n most frequently occurring records in the large file should be, in order, the first n records of the seed file.

        public static void CreateHugeFile(string filePath, string seedFilePath, int seedCount, int lineCount)
        {
            using (StreamReader seedfile = new StreamReader(seedFilePath))
            using (StreamWriter hugefile = new StreamWriter(filePath))
            {
                Random r = new Random();

                for (int line = 0; line < lineCount; )
                {
                    // Copy a random-length prefix of the seed file into the huge file.
                    int seeds = r.Next(1, seedCount);

                    for (int i = 0; i < seeds && line < lineCount; i++, line++)
                    {
                        hugefile.WriteLine(seedfile.ReadLine());
                    }

                    // Rewind the seed file for the next pass.
                    seedfile.BaseStream.Position = 0;
                    seedfile.DiscardBufferedData();
                }

                hugefile.Flush();
            }
        }

My approach to analyzing the large file was four high level steps:
1. Break the file down into mutually exclusive sets
2. Solve for the top n most frequently occurring record hashes in each set
3. Solve for the overall top n most frequently occurring hashes across all sets
4. Traverse the file one last time to resolve each of the hashes to its originating record.

Step 1
In order to create mutually exclusive sets, each distinct record must be in exactly one set. Splitting the file in half, for example, would not work because a record in the first half could also be duplicated in the second half. I chose to segregate records into sets by the first character of each record. In theory, one could balance the workload by grouping those sets further (e.g. a-d, f-m, n-o, etc.), which would reduce the number of times the file would need to be read in step two; a rough sketch of that grouping idea follows the code below. For the sake of time, I did not pursue that feature.

        public static Dictionary<char, int> GetCharacterDistribution(string path)
        {
            Dictionary<char, int> sets = new Dictionary<char, int>();

            using (StreamReader r = new StreamReader(path))
            {
                char c;
                while (!r.EndOfStream)
                {
                    c = r.ReadLine()[0];

                    if (sets.ContainsKey(c))
                        sets[c]++;
                    else
                        sets.Add(c, 1);
                }
            }

            return sets;
        }
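
As an aside, the further grouping mentioned above could be sketched roughly like the method below: sort the first characters by how many records start with them and greedily assign each one to the currently lightest bucket so the bucket totals stay roughly even. The method name and bucket count are hypothetical, and GetTopHashCounts (below) would then need to accept a collection of characters rather than a single char.

        // Hypothetical sketch: greedily pack first characters into a fixed number of
        // roughly equal-sized buckets so that each bucket can be processed in one pass.
        public static List<List<char>> GroupSetsIntoBuckets(Dictionary<char, int> sets, int bucketCount)
        {
            var buckets = new List<List<char>>();
            var bucketTotals = new List<long>();
            for (int i = 0; i < bucketCount; i++)
            {
                buckets.Add(new List<char>());
                bucketTotals.Add(0);
            }

            // Largest sets first, each assigned to the currently lightest bucket.
            foreach (var set in sets.OrderByDescending(s => s.Value))
            {
                int lightest = 0;
                for (int i = 1; i < bucketCount; i++)
                {
                    if (bucketTotals[i] < bucketTotals[lightest])
                        lightest = i;
                }

                buckets[lightest].Add(set.Key);
                bucketTotals[lightest] += set.Value;
            }

            return buckets;
        }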

Step 2
Next, I read the file line by line once for each set; lines that are not part of the set are ignored. I set up a Dictionary whose key is the hash code of each line and whose value is the number of times the record that produces that hash appears in the file. For each line in the set, the hash is computed and added if not already present; otherwise its count is incremented. Once the file has been read for one set, the dictionary can be sorted descending by value and the top n entries stored. This is done for each set.

        public static Dictionary<int, int> GetTopHashCounts(char set, int n, string path)
        {
            Dictionary<int, int> hashes = new Dictionary<int, int>();

            using (StreamReader r = new StreamReader(path))
            {
                string record;
                int hash;
                while (!r.EndOfStream)
                {
                    record = r.ReadLine();

                    if (record[0] == set)
                    {
                        hash = record.GetHashCode();

                        if (hashes.ContainsKey(hash))
                            hashes[hash]++;
                        else
                            hashes.Add(hash, 1);
                    }
                }
            }

            var q = hashes.OrderByDescending(h => h.Value).Take(n).ToDictionary(h => h.Key, h => h.Value);

            return q;
        }

Step 3
Once the top n hashes have been retrieved for each set, all of the sets' hash results can be combined into one list, sorted descending by count, and the overall top n hashes determined.

            var topNHashes =
                setTopHashes
                .SelectMany(h => h.Value)
                .OrderByDescending(h => h.Value)
                .Take(n)
                .ToDictionary(h => h.Key, h => h.Value);

Step 4
At this point, we only know the most common hashes, not the most common records. Because a hash is a one-way function, the file needs to be traversed one last time to resolve the hashes back to their original records’ string values.

        public static Dictionary<string, int> GetTopRecords(string path, Dictionary<int, int> hashCounts)
        {
            Dictionary<string, int> recordCounts = new Dictionary<string, int>();

            using (StreamReader r = new StreamReader(path))
            {
                string record;
                int hash;
                while (!r.EndOfStream)
                {
                    record = r.ReadLine();

                    hash = record.GetHashCode();

                    if (hashCounts.ContainsKey(hash))
                    {
                        // The first record seen with this hash claims the count; stop once all hashes are resolved.
                        recordCounts.Add(record, hashCounts[hash]);
                        hashCounts.Remove(hash);
                        if (!hashCounts.Any())
                            break;
                    }
                }
            }

            return recordCounts.OrderByDescending(h => h.Value).ToDictionary(h => h.Key, h => h.Value);
        }

After running the final project, the result matched the top records of my seed list in the original order.  <put success-baby meme here>

Points to consider
The size of the file to be read can theoretically be very large as long as it is broken up into enough sets. The file is read once to get the character distribution, once per set, and once more to resolve hashes, so the total I/O is proportional to the number of sets plus two full passes; the memory footprint is bounded by the number of distinct records in the largest set rather than in the whole file. With 62 sets (a-z, A-Z, 0-9), one billion records could be analyzed with about 16-17 million records per set (assuming an even set distribution), and a Dictionary<int, int> of roughly that size is on the order of a few hundred megabytes, which is well within the memory limitations of a .Net application.

This example uses .GetHashCode() which is not guaranteed to produce unique hashes; however, during the course of my testing, I never ran into any collisions.
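
If collisions did become a concern, one option (not something I implemented) would be to swap in a wider hash, for example a 64-bit FNV-1a over each record's characters, and key the dictionaries on long instead of int. A rough sketch:

        // Rough sketch of a 64-bit FNV-1a hash over a record's characters, as a
        // lower-collision alternative to string.GetHashCode(). Illustrative only.
        public static long GetFnv1a64(string record)
        {
            unchecked
            {
                const ulong offsetBasis = 14695981039346656037;
                const ulong prime = 1099511628211;

                ulong hash = offsetBasis;
                foreach (char c in record)
                {
                    hash ^= c;      // fold each UTF-16 code unit into the hash
                    hash *= prime;
                }

                return (long)hash;
            }
        }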

If the records are, on average, significantly smaller than the hashes, then skipping the hashing and just using the record value would be a viable solution.

Full project:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace ConsoleApplication1256
{
    class Program
    {
        static void Main(string[] args)
        {
            int seeds = (int)1e5;
            int lines = (int)1e8;

            string seedPath = @"e:\Users\Stephen\Desktop\seedfile-1e9.txt";
            string path = @"e:\Users\Stephen\Desktop\hugefile-1e9.txt";

            CreateSeedFile(seedPath, seeds, 3, 7, 3, 14);
            CreateHugeFile(path, seedPath, seeds, lines);

            // Step 1: break the file into sets by first character.
            var sets = GetCharacterDistribution(path);

            Dictionary<char, Dictionary<int, int>> setTopHashes = new Dictionary<char, Dictionary<int, int>>();

            int n = 10;

            // Step 2: top n hashes per set.
            foreach (var set in sets)
            {
                setTopHashes.Add(set.Key, GetTopHashCounts(set.Key, n, path));
            }

            // Step 3: overall top n hashes across all sets.
            var topNHashes =
                setTopHashes
                .SelectMany(h => h.Value)
                .OrderByDescending(h => h.Value)
                .Take(n)
                .ToDictionary(h => h.Key, h => h.Value);

            // Step 4: resolve the hashes back to records and print them.
            var result = GetTopRecords(path, topNHashes);

            foreach (var record in result)
            {
                Console.WriteLine("{0}: {1}", record.Value, record.Key);
            }
        }

        public static Dictionary<string, int> GetTopRecords(string path, Dictionary<int, int> hashCounts)
        {
            Dictionary<string, int> recordCounts = new Dictionary<string, int>();

            using (StreamReader r = new StreamReader(path))
            {
                string record;
                int hash;
                while (!r.EndOfStream)
                {
                    record = r.ReadLine();

                    hash = record.GetHashCode();

                    if (hashCounts.ContainsKey(hash))
                    {
                        // The first record seen with this hash claims the count; stop once all hashes are resolved.
                        recordCounts.Add(record, hashCounts[hash]);
                        hashCounts.Remove(hash);
                        if (!hashCounts.Any())
                            break;
                    }
                }
            }

            return recordCounts.OrderByDescending(h => h.Value).ToDictionary(h => h.Key, h => h.Value);
        }

        public static Dictionary<int, int> GetTopHashCounts(char set, int n, string path)
        {
            Dictionary<int, int> hashes = new Dictionary<int, int>();

            using (StreamReader r = new StreamReader(path))
            {
                string record;
                int hash;
                while (!r.EndOfStream)
                {
                    record = r.ReadLine();

                    if (record[0] == set)
                    {
                        hash = record.GetHashCode();

                        if (hashes.ContainsKey(hash))
                            hashes[hash]++;
                        else
                            hashes.Add(hash, 1);
                    }
                }
            }

            var q = hashes.OrderByDescending(h => h.Value).Take(n).ToDictionary(h => h.Key, h => h.Value);

            return q;
        }

        public static Dictionary<char, int> GetCharacterDistribution(string path)
        {
            Dictionary<char, int> sets = new Dictionary<char, int>();

            using (StreamReader r = new StreamReader(path))
            {
                char c;
                while (!r.EndOfStream)
                {
                    c = r.ReadLine()[0];

                    if (sets.ContainsKey(c))
                        sets[c]++;
                    else
                        sets.Add(c, 1);
                }
            }

            return sets;
        }

        public static void CreateHugeFile(string filePath, string seedFilePath, int seedCount, int lineCount)
        {
            using (StreamReader seedfile = new StreamReader(seedFilePath))
            using (StreamWriter hugefile = new StreamWriter(filePath))
            {
                Random r = new Random();

                for (int line = 0; line < lineCount; )
                {
                    // Copy a random-length prefix of the seed file into the huge file.
                    int seeds = r.Next(1, seedCount);

                    for (int i = 0; i < seeds && line < lineCount; i++, line++)
                    {
                        hugefile.WriteLine(seedfile.ReadLine());
                    }

                    // Rewind the seed file for the next pass.
                    seedfile.BaseStream.Position = 0;
                    seedfile.DiscardBufferedData();
                }

                hugefile.Flush();
            }
        }

        private const string letters = "qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM0123456789";
        public static void CreateSeedFile(string filePath, int lineCount, int wordsMin, int wordsMax, int lettersMin, int lettersMax)
        {
            Random r = new Random();

            using (StreamWriter s = new StreamWriter(filePath))
            {
                s.AutoFlush = false;

                for (int line = 0; line < lineCount; line++)
                {
                    // Each record contains a random number of "words".
                    int words = r.Next(wordsMin, wordsMax + 1);

                    for (int word = 0; word < words; word++)
                    {
                        // Each word contains a random number of random characters.
                        int wordLen = r.Next(lettersMin, lettersMax + 1);

                        for (int letter = 0; letter < wordLen; letter++)
                        {
                            s.Write(letters[r.Next(0, letters.Length)]);
                        }

                        if (word < words - 1)
                            s.Write(' ');
                    }

                    // No newline after the last record.
                    if (line < lineCount - 1)
                        s.WriteLine();
                }

                s.Flush();
            }
        }
    }
}
Posted on March 4, 2013 at 11:20 pm by Steve Konves