Tuesday, August 16, 2016

Count line frequencies with OrderedDict in Python

By Vasudev Ram


Python programs that count the frequencies of words in a string or a file are common examples, and they are usually written with dicts. Here is a small program that instead counts the frequencies of the lines in its input. This functionality has some uses. I will show those, and also compare and contrast this program with other tools, later.

The program uses an OrderedDict from the collections module of the Python standard library.

The program could also be written using either a regular dict or a defaultdict (also from the collections module), or a collections.Counter, with slightly different code in each of those cases.
"""
linefreq.py
A program to find the frequencies of input lines.
Author: Vasudev Ram
Copyright 2016 Vasudev Ram
Web site: https://vasudevram.github.io
Blog: http://jugad2.blogspot.com
Product store: http://gumroad.com/vasudevram
"""
from __future__ import print_function
import sys
from collections import OrderedDict

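def linefreq(in_fil):
    # Count each distinct line, remembering first-seen order.
    counts = OrderedDict()
    for line in in_fil:
        counts[line] = counts.get(line, 0) + 1
    print("Freq".rjust(8) + ": Line")
    # Each line keeps its trailing newline, hence end="".
    for line, freq in counts.items():
        print(str(freq).rjust(8) + ": " + line, end="")
    if not counts:
        return  # Empty input: max() below would raise ValueError.
    print('-' * (10 + max(map(len, counts))))
    # Same report again, in reverse insertion order.
    for line, freq in reversed(counts.items()):
        print(str(freq).rjust(8) + ": " + line, end="")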

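def main():
    sa, lsa = sys.argv, len(sys.argv)
    if lsa == 1:
        # No filename given: read from standard input.
        linefreq(sys.stdin)
    elif lsa == 2:
        with open(sa[1], "r") as in_fil:
            linefreq(in_fil)
    else:
        # Report usage errors on stderr and exit with a non-zero status.
        print("Only one filename argument supported.", file=sys.stderr)
        sys.exit(1)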

if __name__ == '__main__':
    main()
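
As mentioned before the listing, the same counting can be done with a defaultdict or a collections.Counter. Here is a minimal sketch of both. The function names linefreq_counter and linefreq_defaultdict and the sample list are my own, made up for illustration; they are not part of linefreq.py. Note that a Counter only keeps first-seen order on Python 3.7 and later, where plain dicts preserve insertion order.

```python
from collections import Counter, defaultdict

def linefreq_counter(lines):
    """Count identical lines using collections.Counter."""
    return Counter(lines)

def linefreq_defaultdict(lines):
    """Count identical lines using a defaultdict with int() as the default."""
    counts = defaultdict(int)
    for line in lines:
        counts[line] += 1  # missing keys start at 0
    return counts

# Hypothetical sample input, shaped like lines read from a file.
sample = ["line 1\n", "line 2\n", "line 2\n",
          "line 3\n", "line 3\n", "line 3\n"]

by_counter = linefreq_counter(sample)
by_defaultdict = linefreq_defaultdict(sample)

# Counter also offers most_common() for the highest counts.
print(by_counter.most_common(1))
```

Both versions drop the counts.get(line, 0) idiom; Counter is the shortest, while defaultdict stays closest to the original loop.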
I ran linefreq.py on this input file:
line 1
line 2
line 2
line 3
line 3
line 3
line 4
line 4
line 4
line 4
where "line 1" occurs once, "line 2" occurs twice, etc., with this command:
$ python linefreq.py infile1.txt
and got this output:
    Freq: Line
       1: line 1
       2: line 2
       3: line 3
       4: line 4
-----------------
       4: line 4
       3: line 3
       2: line 2
       1: line 1
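The reversed lines are output just to show that reversed() can be used on an OrderedDict (here, on its items), unlike on a regular dict at the time of writing. Plain dicts only gained reversed() support in Python 3.8.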
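
To see the reversal on its own, here is a small sketch, separate from linefreq.py. The variable names and keys are made up for illustration. It wraps items() in list() so that the second part runs on any Python version, including Python 3 releases older than 3.5, where the items view of an OrderedDict could not be reversed directly.

```python
from collections import OrderedDict

od = OrderedDict()
for key in ["line 1", "line 2", "line 3"]:
    od[key] = od.get(key, 0) + 1

# reversed() on the OrderedDict itself yields its keys, last inserted first.
reversed_keys = list(reversed(od))

# Reversing the (key, count) pairs; list() keeps this portable.
reversed_items = list(reversed(list(od.items())))

print(reversed_keys)
print(reversed_items)
```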
As expected, linefreq.py produced the same output when I piped the file to it on standard input:
$ cat infile1.txt | python linefreq.py
This line:
    print('-' * (10 + max(map(len, counts))))
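is used to print a row of dashes as long as the longest output line above it: 8 columns for the right-justified count, 2 for the ": " separator, plus the length of the longest input line. Since each input line still carries its trailing newline, the row comes out one character longer than the longest visible line.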
The length of the longest line can also be computed inline in the first for loop.
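
Here is a rough sketch of that inline variant. The function name count_lines and the sample data are hypothetical, not taken from linefreq.py. The longest length is tracked while counting, so no second pass with max() is needed, and empty input simply gives a length of 0 instead of raising ValueError.

```python
from collections import OrderedDict

def count_lines(lines):
    """Return (counts, longest): per-line counts and the longest line length."""
    counts = OrderedDict()
    longest = 0
    for line in lines:
        counts[line] = counts.get(line, 0) + 1
        # Track the maximum length in the same loop that does the counting.
        longest = max(longest, len(line))
    return counts, longest

# Hypothetical sample input, shaped like lines read from a file.
sample = ["line 1\n", "line 2\n", "line 2\n", "a longer line\n"]
counts, longest = count_lines(sample)

# Same dash row as in linefreq.py, using the precomputed length.
print('-' * (10 + longest))
```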


- Vasudev Ram - Online Python training and consulting
