Sunday, November 1, 2015

data_dump, a Python tool like Unix od (octal dump)

By Vasudev Ram




The Unix od command, whose name stands for octal dump, will be familiar to most regular Unix users. Although the name includes the word octal (for historical reasons) [1], od supports other number bases as well; see below.

[1] See the Wikipedia page for od, which says that "od is one of the earliest Unix programs, having appeared in version 1 AT&T Unix."

od is a handy tool. It dumps the contents of a file (or standard input) to standard output in unambiguous forms: it can show the contents as numeric values (byte values, which for plain text are ASCII codes), interpreted as single bytes, two-byte words, and so on, in octal, decimal or hexadecimal. It can also show the content as characters. Since the Unix cat command already does that, od is more often used to show characters along with their numeric codes. In the left column of its output, od also shows the byte offset (from the start of the file) of the first byte on each output line (every 16th byte by default), so the user can keep track of where any content occurs in the file.
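To make the idea of showing the same byte in several number bases concrete, here is a minimal sketch. It is not part of od or of data_dump.py, and the name show_bases is made up for illustration. It also shows each value in binary, purely to illustrate the bit pattern.

```python
from __future__ import print_function


def show_bases(data):
    '''
    Returns one formatted line per byte of data, showing the byte's
    offset, its printable character (or '.'), and its value in octal,
    decimal, hexadecimal and binary.
    show_bases is a hypothetical helper, for illustration only.
    '''
    lines = []
    # Iterating over a bytearray yields integers in Python 2 and 3.
    for offset, b in enumerate(bytearray(data)):
        ch = chr(b) if 32 <= b <= 126 else '.'
        lines.append('{:04d}  {}  {:03o}  {:3d}  {:02X}  {:08b}'.format(
            offset, ch, b, b, b, b))
    return lines


if __name__ == '__main__':
    for line in show_bases(b'Hi\n'):
        print(line)
```

For the input b'Hi\n', the first line reads "0000  H  110   72  48  01001000": the letter H is 110 in octal, 72 in decimal and 48 in hex.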

All this is useful because it lets Unix users (programmers and system administrators as well as end users) inspect the contents of files in different ways (hex, octal, character, etc.). The files inspected can be text files or binary files of any kind. Programmers often use the output of od to debug an application: they view a file that their program reads from or writes to, either to verify that it contains what they expect, or to discover that it contains something they do not expect. The latter can be due either to invalid input or to a bug in their program causing incorrect output.
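As a small illustration of that kind of debugging, suppose a program expects Unix line endings but its input may contain carriage returns. The sketch below lists the offsets at which a given byte value occurs, which is the same information one would read off a dump by eye. The function name find_byte_offsets is an assumption for this example and does not exist in od or data_dump.py.

```python
from __future__ import print_function


def find_byte_offsets(data, value):
    '''
    Returns the 0-based offsets at which the byte value occurs in data.
    find_byte_offsets is a hypothetical helper, for illustration only.
    '''
    # Iterating over a bytearray yields integers in Python 2 and 3.
    return [i for i, b in enumerate(bytearray(data)) if b == value]


if __name__ == '__main__':
    sample = b'dog.\r\nThe quick\r\n'
    # 0x0D is the ASCII code for carriage return.
    print(find_byte_offsets(sample, 0x0D))
```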

I needed to use od recently. Doing so made me think of writing a simple version of it in Python, for fun and practice. So I did it. I named it data_dump.py. Here is the code for it:

'''
Program name: data_dump.py
Author: Vasudev Ram.
Copyright 2015 Vasudev Ram.
Purpose: To dump the contents of a specified file or standard input, 
to the standard output, in one or more formats, such as:
    - as characters
    - as decimal numbers
    - as hexadecimal numbers
    - as octal numbers
    
Inspired by the od (octal dump) command of Unix, and intended to work,
very roughly, like it. Will not attempt to replicate od exactly or even 
closely. May diverge from od's way of doing things, as desired.
'''

# Imports:

from __future__ import print_function
import sys

# Global constants:

# Maximum number of characters (from the input) to output per line.
MAX_CHARS_PER_LINE = 16

# Global variables:

# Functions:

def data_dump(infil, line_len=MAX_CHARS_PER_LINE, options=None):
    '''
    Dumps the data from the input source infil to the standard output.
    '''
    byte_addr = 0
    buf = infil.read(line_len)
    # While not EOF (read() returns an empty string at EOF).
    while buf:
        # Print the offset of the first character to be output on this line.
        # The offset refers to the offset of that character in the input,
        # not in the output. The offset is 0-based.
        sys.stdout.write("{:08d}: ".format(byte_addr))

        # Print buf in character form, with . for control characters.
        # TODO: Change to use \n for line feed, \t for tab, etc., for 
        # those control characters which have unambiguous C escape 
        # sequences.
        byte_addr += len(buf)
        for c in buf:
            sys.stdout.write('  ') # Left padding before c as char.
            if (0 <= ord(c) <= 31) or (ord(c) == 127):
                sys.stdout.write('.')
            else:
                sys.stdout.write(c)
        sys.stdout.write('\n')

        # Now print buf in hex form.
        sys.stdout.write(' ' * 10) # Padding to match that of byte_addr above.
        for c in buf:
            sys.stdout.write(' ') # Left padding before c in hex.
            sys.stdout.write('{:02X}'.format(ord(c)))
        sys.stdout.write('\n')
        buf = infil.read(line_len)
    infil.close()


def main():
    '''
    Checks the arguments, sets option flags, sets input source.
    Then calls data_dump() function with the input source and options.
    '''
    try:
        lsa = len(sys.argv)
        if lsa == 1:
            # Input from standard input.
            infil = sys.stdin
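        elif lsa == 2:
            # Input from a file.
            infil = open(sys.argv[1], "rb")
        else:
            # Too many arguments; without this branch infil is undefined.
            print("Usage: " + sys.argv[0] + " [file]", file=sys.stderr)
            sys.exit(1)
        data_dump(infil)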
        sys.exit(0)
    except IOError as ioe:
        print("Error: IOError: " + str(ioe))
        sys.exit(1)

if __name__ == '__main__':
    main()

And here is the output of a sample run, on a small text file:
$ data_dump.py t3
00000000:   T  h  e     q  u  i  c  k     b  r  o  w  n
           54 68 65 20 71 75 69 63 6B 20 62 72 6F 77 6E 20
00000016:   f  o  x     j  u  m  p  e  d     o  v  e  r
           66 6F 78 20 6A 75 6D 70 65 64 20 6F 76 65 72 20
00000032:   t  h  e     l  a  z  y     d  o  g  .  .  .  T
           74 68 65 20 6C 61 7A 79 20 64 6F 67 2E 0D 0A 54
00000048:   h  e     q  u  i  c  k     b  r  o  w  n     f
           68 65 20 71 75 69 63 6B 20 62 72 6F 77 6E 20 66
00000064:   o  x     j  u  m  p  e  d     o  v  e  r     t
           6F 78 20 6A 75 6D 70 65 64 20 6F 76 65 72 20 74
00000080:   h  e     l  a  z  y     d  o  g  .  .  .  T  h
           68 65 20 6C 61 7A 79 20 64 6F 67 2E 0D 0A 54 68
00000096:   e     q  u  i  c  k     b  r  o  w  n     f  o
           65 20 71 75 69 63 6B 20 62 72 6F 77 6E 20 66 6F
00000112:   x     j  u  m  p  e  d     o  v  e  r     t  h
           78 20 6A 75 6D 70 65 64 20 6F 76 65 72 20 74 68
00000128:   e     l  a  z  y     d  o  g  .
           65 20 6C 61 7A 79 20 64 6F 67 2E

$
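The two-line layout in the run above (an 8-digit decimal offset and the characters on one line, the hex codes underneath) can also be sketched as a small function that returns the two lines instead of writing them. This is only a sketch: format_line is a hypothetical name, not part of data_dump.py, and unlike the program above it also shows bytes of 128 and higher as dots.

```python
from __future__ import print_function


def format_line(byte_addr, buf):
    '''
    Returns (char_line, hex_line) for one chunk of input bytes.
    format_line is a hypothetical helper, for illustration only.
    '''
    codes = bytearray(buf)  # Integers 0..255 in Python 2 and 3.
    # 8-digit, zero-padded decimal offset, as in the sample run.
    char_line = '{:08d}: '.format(byte_addr)
    for code in codes:
        # Printable ASCII is shown as-is, everything else as a dot.
        char_line += '  ' + (chr(code) if 32 <= code <= 126 else '.')
    # Ten spaces line the hex codes up under the characters.
    hex_line = ' ' * 10 + ''.join(' {:02X}'.format(code) for code in codes)
    return char_line, hex_line


if __name__ == '__main__':
    for text_line in format_line(0, b'The quick brown '):
        print(text_line)
```

Returning the lines rather than printing them makes the formatting easy to check in a test, independently of where the input comes from.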
Note that I currently replace control / non-printable characters with a dot in the output. Another option would be to replace (at least some of) them with C escape sequences, such as \r (carriage return, ASCII 13), \n (line feed, ASCII 10), etc. That is how od does it when asked for character output with its -c option.
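A minimal sketch of that alternative follows. The table C_ESCAPES and the function char_repr are names invented for this example; they are not in data_dump.py, and the exact set of escapes to support is a design choice.

```python
from __future__ import print_function

# Control characters that have a conventional C escape sequence.
# (Hypothetical table, for illustration only.)
C_ESCAPES = {
    0: '\\0', 7: '\\a', 8: '\\b', 9: '\\t',
    10: '\\n', 11: '\\v', 12: '\\f', 13: '\\r',
}


def char_repr(code):
    '''
    Returns a display string for the byte value code: a C escape
    sequence if one exists, the character itself if it is printable
    ASCII, and a dot otherwise.
    '''
    if code in C_ESCAPES:
        return C_ESCAPES[code]
    if 32 <= code <= 126:
        return chr(code)
    return '.'


if __name__ == '__main__':
    # bytearray yields integers in Python 2 and 3.
    print(' '.join(char_repr(b) for b in bytearray(b'dog.\r\nThe')))
```

Because an escape sequence is two characters wide, the column padding in the character line would also need to change to keep it aligned with the hex line.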

In a future post, I'll make some improvements, and also show and discuss some interesting and possibly anomalous results that I got when testing data_dump.py with different inputs.

Happy dumping! :)



- Vasudev Ram - Online Python training and programming
