Sunday, July 15, 2012

The Bentley-Knuth problem and solutions

By Vasudev Ram

I recently saw this post about an interesting programming problem on the Web (apparently initially posed by Jon Bentley to Donald Knuth.

For lack of a better term (and also because the name is somewhat memorable), I'm calling it the Bentley-Knuth problem: More shell, less egg

The problem description, from the above post:

The program Bentley asked Knuth to write is one that’s become familiar to people who use languages with serious text-handling capabilities: Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.

The post is interesting in itself - read it. For fun, I decided to write solutions to the problem in Python and also in UNIX shell.

My initial Python solution is below. The code is not very Pythonic / refactored / tested, but it works, and does have some minimal error checking. See this Python sorting HOWTO page for some ways it could be improved. UNIX shell solution coming in a while.

UPDATE: Unix shell solution added below the Python one.

Note: I should mention that neither my Python nor UNIX shell solution works exactly the same as the McIlroy shell solution, since that one converts upper case letters to lower case, and also, uses a strict "English dictionary"-style definition of a "word", i.e. only alphabetic characters, whereas my two solutions use the definition of a word as "a sequence of non-blank characters", as is more commonly used in parsing computer programs. But I could add both of the tr invocations to the front of my shell pipeline and get the same result as McIlroy.

# Author: Vasudev Ram -
# Version: 0.1

# The problem this program tries to solve is from the page:

# Description: The program Bentley asked Knuth to write:

# Read a file of text, determine the n most frequently 
# used words, and print out a sorted list of those words 
# along with their frequencies.

import sys
import os
import string

sys_argv = sys.argv

def usage():
 sys.stderr.write("Usage: %s n file\n" % sys_argv[0])
 sys.stderr.write("where n is the number of most frequently\n")
 sys.stderr.write("used words you want to find, and \n")
 sys.stderr.write("file is the name of the file in which to look.\n")

if len(sys_argv) < 3:

 n = int(sys_argv[1])
except ValueError:
 sys.stderr.write("%s: Error: %s is not a decimal numeric value" % (sys_argv[0], 

print "n =", n
if n < 1:
 sys.stderr.write("%s: Error: %s is not a positive value" % 
  (sys_argv[0], sys_argv[1]))

in_filename = sys.argv[2]
print "%s: Finding %d most frequent words in file %s" % \
 (sys_argv[0], n, in_filename)

 fil_in = open(in_filename)
except IOError:
 sys.stderr.write("%s: ERROR: Could not open in_filename %s\n" % \
  (sys_argv[0], in_filename))

word_freq_dict = {}

for lin in fil_in:
 words_in_line = lin.split()
 for word in words_in_line:
  if word_freq_dict.has_key(word):
   word_freq_dict[word] += 1
   word_freq_dict[word] = 1

word_freq_list = []
for item in word_freq_dict.items():

wfl = sorted(word_freq_list, 
 key=lambda word_freq_list: word_freq_list[1], reverse=True)
print "The %d most frequent words sorted by decreasing frequency:" % n
len_wfl = len(wfl)
if n > len_wfl:
 print "n = %d, file has only %d unique words," % (n, len_wfl)
 print "so printing %d words" % len_wfl
print "Word: Frequency"
m = min(n, len_wfl)
for i in range(m):
 print wfl[i][0], ": ", wfl[i][1]


And here is my initial solution in UNIX shell:


# Usage:
# ./ n file
# where "n" is the number of most frequent words 
# you want to find in "file".

awk '
        for (i = 1; i <= NF; i++)
END     {
            for (i in word_freq)
                print i, word_freq[i]
' < $2 | sort -nr +1 | sed $1q
- Vasudev Ram - Dancing Bison Enterprises

1 comment:

Veky said...

lambda f,n:collections.Counter(re.findall(r"\w+",open(f).read().lower())).most_common(n)