Showing posts with label file-formats. Show all posts
Showing posts with label file-formats. Show all posts

Friday, November 25, 2016

Processing DSV data (Delimiter-Separated Values) with Python

By Vasudev Ram



I needed to process some DSV files recently. Here is a Python program I wrote for it, with a few changes over the original. E.g. I do not show the processing of the data here; I only read and print it. Also, I support two different command-line options (and ways) to specify the delimiter character.

DSV (Delimiter-separated values) is a common tabular text data format, with one record per line, and some number of fields per record, where the fields are separated or delimited by some specific character. Some common delimiter characters used in DSV files are tab (which makes them TSV files, Tab-Separated Values, common on Unix), comma (CSV files, Comma-Separated Values, a common spreadsheet and database import-export format), the pipe character (|), the colon (:) and others.

They - DSV files - are described in this section, Data File Metaformats, under Chapter 5: Textuality, of Eric Raymond (ESR)'s book, The Art of Unix Programming, which is a recommended read for anyone interested in Unix (one of the longest-lived operating systems [1]) and in software design and development.

[1] And, speaking a bit loosely, nowaday Unix is also the most widely used OS in the world, by a fair margin, due to its use (as a variant) in Android and iOS based mobile devices, both of which are Unix-based, not to mention Apple MacOS and Linux computers, which also are. Android devices alone number in the billions.

The program, read_dsv.py, is a command-line utility, written in Python, that allows you to specify the delimiter character in one of two ways:

- with a "-c delim_char" option, in which delim_char is an ASCII character,
- with a "-n delim_code" option, in which delim_code is an ASCII code.

It then reads either the files specified as command-line arguments after the -n or -c option, or if no files are given, it reads its standard input.

Here is the code for read_dsv.py:
from __future__ import print_function

"""
read_dsv.py
Author: Vasudev Ram
Web site: https://vasudevram.github.io
Blog: https://jugad2.blogspot.com
Product store: https://gumroad.com/vasudevram
Purpose: Shows how to read DSV data, i.e. 

https://en.wikipedia.org/wiki/Delimiter-separated_values 

from either files or standard input, split the fields of each
line on the delimiter, and process the fields in some way.
The delimiter character is configurable by the user and can
be specified as either a character or its ASCII code.

Reference:
TAOUP (The Art Of Unix Programming): Data File Metaformats:
http://www.catb.org/esr/writings/taoup/html/ch05s02.html
ASCII table: http://www.asciitable.com/
"""

import sys
import string

def err_write(message):
    sys.stderr.write(message)

def error_exit(message):
    err_write(message)
    sys.exit(1)

def usage(argv, verbose=False):
    usage1 = \
        "{}: read and process DSV (Delimiter-Separated-Values) data.\n".format(argv[0])
    usage2 = "Usage: python" + \
        " {} [ -c delim_char | -n delim_code ] [ dsv_file ] ...\n".format(argv[0])
    usage3 = [
        "where one of either the -c or -n option must be given,\n",  
        "delim_char is a single ASCII delimiter character, and\n", 
        "delim_code is a delimiter character's ASCII code.\n", 
        "Text lines will be read from specified DSV file(s) or\n", 
        "from standard input, split on the specified delimiter\n", 
        "specified by either the -c or -n option, processed, and\n", 
        "written to standard output.\n", 
    ]
    err_write(usage1)
    err_write(usage2)
    if verbose:
        for line in usage3:
            err_write(line)

def str_to_int(s):
    try:
        return int(s)
    except ValueError as ve:
        error_exit(repr(ve))

def valid_delimiter(delim_code):
    return not invalid_delimiter(delim_code)

def invalid_delimiter(delim_code):
    # Non-ASCII codes not allowed, i.e. codes outside
    # the range 0 to 255.
    if delim_code < 0 or delim_code > 255:
        return True
    # Also, don't allow some specific ASCII codes;
    # add more, if it turns out they are needed.
    if delim_code in (10, 13):
        return True
    return False

def read_dsv(dsv_fil, delim_char):
    for idx, lin in enumerate(dsv_fil):
        fields = lin.split(delim_char)
        assert len(fields) > 0
        # Knock off the newline at the end of the last field,
        # since it is the line terminator, not part of the field.
        if fields[-1][-1] == '\n':
            fields[-1] = fields[-1][:-1]
        # Treat a blank line as a line with one field,
        # an empty string (that is what split returns).
        print("Line", idx, "fields:")
        for idx2, field in enumerate(fields):
            print(str(idx2) + ":", "|" + field + "|")

def main():
    # Get and check validity of arguments.
    sa = sys.argv
    lsa = len(sa)
    if lsa == 1:
        usage(sa)
        sys.exit(0)
    if lsa == 2:
        # Allow the help option with any letter case.
        if sa[1].lower() in ("-h", "--help"):
            usage(sa, verbose=True)
            sys.exit(0)
        else:
            usage(sa)
            sys.exit(0)

    # If we reach here, lsa is >= 3.
    # Check for valid mandatory options (sic).
    if not sa[1] in ("-c", "-n"):
        usage(sa, verbose=True)
        sys.exit(0)

    # If -c option given ...
    if sa[1] == "-c":
        # If next token is not a single character ...
        if len(sa[2]) != 1:
            error_exit(
            "{}: Error: -c option needs a single character after it.".format(sa[0]))
        if not sa[2] in string.printable:
            error_exit(
            "{}: Error: -c option needs a printable ASCII character after it.".format(\
            sa[0]))
        delim_char = sa[2]
    # else if -n option given ...
    elif sa[1] == "-n":
        delim_code = str_to_int(sa[2])
        if invalid_delimiter(delim_code):
            error_exit(
            "{}: Error: invalid delimiter code {} given for -n option.".format(\
            sa[0], delim_code))
        delim_char = chr(delim_code)
    else:
        # Checking for what should not happen ... a bit of defensive programming here.
        error_exit("{}: Program error: neither -c nor -n option given.".format(sa[0]))

    try:
        # If no filenames given, read sys.stdin ...
        if lsa == 3:
            print("processing sys.stdin")
            dsv_fil = sys.stdin
            read_dsv(dsv_fil, delim_char)
            dsv_fil.close()
        # else (filenames given), read them ...
        else:
            for dsv_filename in sa[3:]:
                print("processing file:", dsv_filename)
                dsv_fil = open(dsv_filename, 'r')
                read_dsv(dsv_fil, delim_char)
                dsv_fil.close()
    except IOError as ioe:
        error_exit("{}: Error: {}".format(sa[0], repr(ioe)))
        
if __name__ == '__main__':
    main()

Here are test runs of the program (both valid and invalid), and the results of each one:

Run it without any arguments. Gives a brief usage message.
$ python read_dsv.py
read_dsv.py: read and process DSV (Delimiter-Separated-Values) data.
Usage: python read_dsv.py [ -c delim_char | -n delim_code ] [ dsv_file ] ...
Run it with a -h option (for help). Gives the verbose usage message.
$ python read_dsv.py -h
read_dsv.py: read and process DSV (Delimiter-Separated-Values) data.
Usage: python read_dsv.py [ -c delim_char | -n delim_code ] [ dsv_file ] ...
where one of either the -c or -n option must be given,
delim_char is a single ASCII delimiter character, and
delim_code is a delimiter character's ASCII code.
Text lines will be read from specified DSV file(s) or
from standard input, split on the specified delimiter
specified by either the -c or -n option, processed, and
written to standard output.
Run it with a -v option (invalid run). Gives the brief usage message.
$ python read_dsv.py -v
read_dsv.py: read and process DSV (Delimiter-Separated-Values) data.
Usage: python read_dsv.py [ -c delim_char | -n delim_code ] [ dsv_file ] ...
Run it with a -c option but no ASCII character argument (invalid run). Gives the brief usage message.
$ python read_dsv.py -c
read_dsv.py: read and process DSV (Delimiter-Separated-Values) data.
Usage: python read_dsv.py [ -c delim_char | -n delim_code ] [ dsv_file ] ...
Run it with a -c option followed by the pipe character (invalid run). The OS (here, Windows) gives an error message because the pipe character cannot be used to end a pipeline.
$ python read_dsv.py -c |
The syntax of the command is incorrect.
Run it with the -c option and the pipe character as the delimiter character, but protected (by double quotes) from interpretation by the OS shell (CMD).
$ python read_dsv.py -c "|" file1.dsv
processing file: file1.dsv
Line 0 fields:
0: |1|
1: |2|
2: |3|
3: |4|
4: |5|
5: |6|
6: |7|
Line 1 fields:
0: |field1|
1: |fld2|
2: | fld3 with spaces around it |
3: |    fld4 with leading spaces|
4: |fld5 with trailing spaces     |
5: |next field is empty|
6: ||
7: |last field|
Line 2 fields:
0: ||
1: |1|
2: |22|
3: |333|
4: |4444|
5: |55555|
6: |666666|
7: |7777777|
8: |88888888|
Line 3 fields:
0: ||
1: ||
2: ||
3: ||
4: ||
5: |                      |
6: ||
7: ||
8: ||
9: ||
10: ||
Run it with the -n option followed by 124, the ASCII code for the pipe character as the delimiter.
$ python read_dsv.py -n 124 file1.dsv
[Gives exact same output as the run above, as it should, because both use the same delimiter and read the same input file.]
Copy file1.dsv to file3.dsv. Change all the pipe characters (delimiters) to colons:
Run it with the -n option followed by 58, the ASCII code for the colon character.
$ python read_dsv.py -n 58 file3.dsv
[Gives exact same output as the run above, as it should, because other than the delimiters (pipe versus colon), the input is the same.]

I added support for the -n option to the program because it makes it more flexible, since you can specify any ASCII character as the delimiter (that makes sense), by giving its ASCII code.

And of course, to find out the values of the ASCII codes for these delimiter characters, I used the char_to_ascii_code.py program from my recent post:

Trapping KeyboardInterrupt and EOFError for program cleanup

You may have noticed that I mentioned delimiter characters and DSV files in that post too. The char_to_ascii_code.py utility shown in that post was created to find the ASCII code for any character (without having to look it up on the web each time).

- Enjoy.

- Vasudev Ram - Online Python training and consulting

- Black Flyday at Flywheel Wordpress Managed Hosting - get 3 months free on the annual plan.

Get updates on my software products / ebooks / courses.

Jump to posts: Python   DLang   xtopdf

Subscribe to my blog by email

My ActiveState recipes



Monday, September 29, 2014

CommonMark, a pure Python Markdown parser and renderer


By Vasudev Ram

I got to know about CommonMark.org via this post on the Python Reddit:

CommonMark.py - pure Python Markdown parser and renderer

From what I could gather, CommonMark is, or aims to be, two things:

1. "A standard, unambiguous syntax specification for Markdown, along with a suite of comprehensive tests".

2. A Python parser and renderer for the CommonMark Markdown spec.

CommonMark on PyPI, the Python Package Index.

Excerpts from the CommonMark.org site:

[ We propose a standard, unambiguous syntax specification for Markdown, along with a suite of comprehensive tests to validate Markdown implementations against this specification. We believe this is necessary, even essential, for the future of Markdown. ]

[ Who are you?
We're a group of Markdown fans who either work at companies with industrial scale deployments of Markdown, have written Markdown parsers, have extensive experience supporting Markdown with end users – or all of the above.

John MacFarlane
David Greenspan
Vicent Marti
Neil Williams
Benjamin Dumke-von der Ehe
Jeff Atwood ]

So I installed the Python library for it with:
pip install commonmark
Then modified this snippet of example code from the CommonMark PyPI site:
import CommonMark
parser = CommonMark.DocParser()
renderer = CommonMark.HTMLRenderer()
print(renderer.render(parser.parse("Hello *World*")))
on my local machine, to add a few more types of Markdown syntax:
import CommonMark
parser = CommonMark.DocParser()
renderer = CommonMark.HTMLRenderer()
markdown_string = \
"""
Heading
=======
 
Sub-heading
-----------
 
# Atx-style H1 heading.
## Atx-style H2 heading.
### Atx-style H3 heading.
#### Atx-style H4 heading.
##### Atx-style H5 heading.
###### Atx-style H6 heading.
 
Paragraphs are separated
by a blank line.
 
Let 2 spaces at the end of a line to do a  
line break
 
Text attributes *italic*, **bold**, `monospace`.
 
A [link](http://example.com).
 
Shopping list:
 
  * apples
  * oranges
  * pears
 
Numbered list:
 
  1. apples
  2. oranges
  3. pears
 
"""
print(renderer.render(parser.parse(markdown_string)))
Here is a screenshot of the output HTML generated by CommonMark, loaded in Google Chrome:


Reddit user bracewel, who seems to be a CommonMark team member, said on the Py Reddit thread:

eventually we'd like to add a few more renderers, PDF/RTF being the first....

So CommonMark looks interesting and worth keeping an eye on, IMO.

- Vasudev Ram - Dancing Bison Enterprises - Python training and consulting

Dancing Bison - Contact Page

Thursday, March 20, 2014

JSONLint.com, an online JSON validator

By Vasudev Ram



JSON page on Wikipedia.

JSON, as most developers nowadays know, has become useful as a data format both for web client-server communication and for data interchange between different languages, since most popular programming languages have support for it (see the lower part of the JSON home page linked above in this sentence).

While searching for information about some specific aspects of JSON for some Python consulting work, I came across this site:

JSONLint.com

JSONLint.com is an online JSON validator. It is from the Arc90 Lab. (Arc90 is the creator of Readability, a tool that removes the clutter from web pages and makes a clean view for reading now or later on your computer, smartphone, or tablet.)

You paste some JSON data into a text box on the site and then click the Validate button, and it tells you whether the JSON is valid or not.

JSONLint.com is a useful resource for any language with JSON support, including Python.

P.S. Arc90 is being acquired by SFX Entertainment, Inc. (NASDAQ:SFXE).

- Vasudev Ram - Python consulting and training

Contact Page

Monday, November 5, 2012

PDFBuilder can now take multiple input files from command line


By Vasudev Ram

PDFBuilder, which I blogged about recently, can now build a composite PDF from an arbitrary number [1] of input files (CSV and TDV) [2] specified on the command line. (I've removed the hard-coding in the first version.)

I've also cleaned up and refactored the PDFBuilder code some, though I still need to do some more.

UPDATE: I've pasted a few code snippets from PDFBuilder.py at the end of this post.

This version of PDBBuilder can be downloaded here, as a part of xtopdf v1.4, from the Bitbucket repository.

[1] Arbitrary number, that is, subject to the limitations of the length of the command line supported by your OS, of course - whether Unix / Linux, Mac OS X or Windows. However, there is a solution for that.

[2] The design of PDFBuilder allows for easily adding support for other input file formats that are row-oriented. See the method next_row() in the file CSVReader.py in the source package, for an example of how to add support for other compatible input formats. You just have to write a reader class (analogous to CSVReader) for that other format, called, say, FooReader, and provide an open() method and a next_row() method as in the CSVReader class, but adapted to handle Foo data.

Some code snippets from PDFBuilder.py:

The PDFBuilder class:
class PDFBuilder:
 """
 Class to build a composite PDF out of multiple input sources.
 """

 def __init__(self, pdf_filename, font, font_size, 
    header, footer, input_filenames):
  """
  PDFBuilder __init__ method.
  """
  self._pdf_filename = pdf_filename
  self._input_filenames = input_filenames

  # Create a PDFWriter instance.
  self._pw = PDFWriter(pdf_filename)

  # Set its font.
  self._pw.setFont(font, font_size)

  # Set its header and footer.
  self._pw.setHeader(header)
  self._pw.setFooter(footer)
  
 def build_pdf(self, input_filenames):
  """
  PDFBuilder.build_pdf method.
  Builds the PDF using contents of the given input_filenames.
  """

  # Loop over all names in input_filenames.
  # Instantiate the appropriate reader for each filename, 
  # based on the filename extension.

  # For each reader, get each row, and for each row,
  # combine all the columns into a string separated by a space,
  # and write that string to the PDF file.

  # Start a new PDF page after each reader's content is written
  # to the PDF file.

  for input_filename in input_filenames:
   # Check if name ends in ".csv", ignoring upper/lower case
   if input_filename[-4:].lower() == ".csv":
    reader = CSVReader(input_filename)
   # Check if name ends in ".tdv", ignoring upper/lower case
   elif input_filename[-4:].lower() == ".tdv":
    reader = TDVReader(input_filename)
   else:
    sys.stderr.write("Error: Invalid input file. Exiting\n")
    sys.exit(0)

   hdr_str = "Data from reader: " + \
    reader.get_description()
   self._pw.writeLine(hdr_str)
   self._pw.writeLine('-' * len(hdr_str))

   reader.open()
   try:
    while True:
     row = reader.next_row()
     s = ""
     for item in row:
      s = s + item + " "
     self._pw.writeLine(s)
   except StopIteration:
    # Close this reader, save this PDF page, and 
    # start a new one for next reader.
    reader.close()
    self._pw.savePage()
    #continue

 def close(self):
  self._pw.close()
The main() function that uses the PDFBuilder class to create a composite PDF:
def main():

 # global variables

 # program name for error messages
 global prog_name
 # debug flag - if true, print debug messages, else don't
 global DEBUGGING
 
 # Set the debug flag based on environment variable DEBUG, 
 # if it exists.
 debug_env_var = os.getenv("DEBUG")
 if debug_env_var == "1":
  DEBUGGING = True

 # Save program filename for error messages
 prog_name = sys.argv[0]

 # check for right args
 if len(sys.argv) < 2:
  usage()
  sys.exit(1)

 # Get output PDF filename from the command line.
 pdf_filename = sys.argv[1]

 # Get the input filenames from the command line.
 input_filenames = sys.argv[2:]

 # Create a PDFBuilder instance.
 pdf_builder = PDFBuilder(pdf_filename, "Courier", 10, 
       "Composite PDF", "Composite PDF", 
       input_filenames)

 # Build the PDF using the inputs.
 pdf_builder.build_pdf(input_filenames)

 pdf_builder.close()

 sys.exit(0)
And a batch file, run.bat, calls the program with input filename arguments:
@echo off
python PDFBuilder.py %1 file1.csv file1.tdv file2.csv file2.tdv file1-repeats5.csv
Run the batch file like this:
C:> run composite.pdf
which will create a PDF file, composite.pdf, from the input CSV and TDV files given as command-line arguments.

Enjoy.

- Vasudev Ram - Dancing Bison Enterprises