Showing posts with label parsing. Show all posts

Sunday, April 15, 2018

compilerbook.org - Introduction to Compilers and Language Design by Prof. Douglas Thain

By Vasudev Ram

Came across this book today:

Introduction to Compilers and Language Design (compilerbook.org)

(a free online textbook by Douglas Thain)

Prof. Douglas Thain is an Associate Professor of Computer Science and Engineering at the University of Notre Dame, USA.

Excerpts from the page:

[ This online textbook is being released chapter-by-chapter during 2017. The complete book will be available for purchase in the spring 2018 semester.

This textbook is suitable for a one semester undergraduate course in compilers. Guided by this book, students can undertake construction of a compiler which accepts a C-like language and produces working X86 code. The textbook and materials have been developed by Prof. Douglas Thain as part of the CSE 40243 compilers class at the University of Notre Dame.

You are free to download, use, and print these PDFs for personal and academic use. Commercial printing or distribution is prohibited. Instead of copying PDFs, please point students to this page (compilerbook.org) so that they can access the latest version. If you enjoy holding a physical book (like I do!) you will be able to order an inexpensive hardcover edition in 2018. ]

I have read only a bit of the book so far, but it seems quite good.

- Vasudev Ram - Online Python training and consulting




Friday, February 13, 2015

Splitting a string on multiple different delimiters

By Vasudev Ram

Just recently I was working on some ideas related to my text file indexing program - which I had blogged about earlier, here:

A simple text file indexing program in Python

As part of that work, I was using Python's string split() method and ran into a limitation: the separator can be only one string (though that string can comprise more than one character).
Trying to work out a solution for that (i.e. the ability to split a string on any one of a set of separator / delimiter characters), I gave these commands interactively in the Python interpreter:

>>> print "".split.__doc__
S.split([sep [,maxsplit]]) -> list of strings

Return a list of the words in the string S, using sep as the
delimiter string. If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are removed
from the result.

>>> s = "abc.def;ghi"
>>> s.split(".;")
['abc.def;ghi']
>>> s.split(".")
['abc', 'def;ghi']
>>> '---'.join(s.split("."))
'abc---def;ghi'
>>> '---'.join(s.split(".")).split(";")
['abc---def', 'ghi']
>>> "---".join('---'.join(s.split(".")).split(";"))
'abc---def---ghi'
>>> "---".join('---'.join(s.split(".")).split(";")).split('---')
['abc', 'def', 'ghi']
>>>

So you can see that by doing repeated manual split()'s and join()'s, I was able to split the original string the way I wanted, i.e. on both the period and semicolon as delimiters. I'll work out a function or class to do it and then blog it in a sequel to this post.

(Using regular expressions to match the delimiters, and extracting all but the matched parts, may be one way to do it, but I'll try another approach. There probably are many ways to go about it).
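One possible shape for such a function is sketched below. It is only an illustration, not the version planned for the sequel. The names split_multi and split_multi_re are hypothetical. The first function generalizes the split/join session above: it rewrites every delimiter character to the first one and then calls split() once, so no intermediate marker like '---' is needed. The second shows the regular-expression route for comparison. Both assume the delimiters are single characters passed together as one string, and both should run under Python 2.7 and Python 3.

```python
import re


def split_multi(s, delims):
    """Split s on any single character found in delims (hypothetical helper)."""
    if not delims:
        return [s]
    first = delims[0]
    # Rewrite every other delimiter to the first one, then split once.
    for d in delims[1:]:
        s = s.replace(d, first)
    return s.split(first)


def split_multi_re(s, delims):
    """Same result, using a regex character class built from delims."""
    if not delims:
        return [s]
    pattern = "[" + re.escape(delims) + "]"
    return re.split(pattern, s)


print(split_multi("abc.def;ghi", ".;"))     # ['abc', 'def', 'ghi']
print(split_multi_re("abc.def;ghi", ".;"))  # ['abc', 'def', 'ghi']
```

Like str.split() with an explicit separator, both functions keep the empty strings produced by adjacent delimiters.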

- Vasudev Ram - Dancing Bison Enterprises


Tuesday, February 10, 2015

Recursively dumping the structure of an HTML5 document

By Vasudev Ram





A while ago I had written this post,

The html5lib Python library (and Animatron :-)

which shows basic usage of a Python library called html5lib, that lets you parse HTML5 documents and then walk through their structure.

That post uses this HTML5 document as input for the program shown in it:


Yesterday I modified the program (test_html5lib.py) shown in that earlier post to make it recursive, thereby simplifying it. Here is the code for the resulting program, html5_dump.py:

# Demo program to show how to dump the structure of 
# an HTML5 document to text, using html5lib.
# Author: Vasudev Ram.
# Copyright 2015 Vasudev Ram - http://www.dancingbison.com

import html5lib

# Define a function to dump HTML5 element info recursively, 
# given a top-level element.
def print_element(elem, indent, level):
    for sub_elem in elem:
        print "{}{}".format(indent * level, sub_elem)
        # Recursive call to print_element().
        print_element(sub_elem, indent, level + 1)

f = open("html5doc.html")
# Parse the HTML document.
tree = html5lib.parse(f)
indent = '----'
level = 0
print_element(tree, indent, level)

I ran the program with:

$ py html5_dump.py

where py in the command refers to py, the Python Launcher for Windows.

Here is the program output, which is basically the same as that of the previous version, but produced using recursion:

<Element u'{http://www.w3.org/1999/xhtml}head' at 0x02978938>
<Element u'{http://www.w3.org/1999/xhtml}body' at 0x02978968>
----<Element u'{http://www.w3.org/1999/xhtml}header' at 0x02978980>
--------<Element u'{http://www.w3.org/1999/xhtml}h1' at 0x02978920>
--------<Element u'{http://www.w3.org/1999/xhtml}h2' at 0x02978B00>
--------<Element u'{http://www.w3.org/1999/xhtml}h3' at 0x02978AB8>
----<Element u'{http://www.w3.org/1999/xhtml}p' at 0x02978AE8>
----<Element u'{http://www.w3.org/2000/svg}svg' at 0x02978788>
--------<Element u'{http://www.w3.org/2000/svg}defs' at 0x02A12050>
--------<Element u'{http://www.w3.org/2000/svg}rect' at 0x02A12020>
--------<Element u'{http://www.w3.org/2000/svg}text' at 0x02A12068>
----<Element u'{http://www.w3.org/1999/xhtml}footer' at 0x02A12080>

The recursion helps in two ways: 1) it prints sub-elements at every depth, and 2) we do not have to keep track of the indentation level ourselves, because the Python interpreter's handling of nested calls, and of returning from them, takes care of that for us. See the line:

print_element(sub_elem, indent, level + 1)

However, with deeply nested documents, we have to keep Python's recursion depth limit in mind.
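
If recursion depth does become a concern (CPython's default limit is 1000 frames; see sys.getrecursionlimit()), the same traversal can be written with an explicit stack. The sketch below is an assumption-laden illustration, not part of html5_dump.py. The function name dump_lines is hypothetical. It uses the standard library's xml.etree.ElementTree on a small made-up document so that it runs without html5lib, and it collects element tags rather than printing the element objects themselves. It relies only on elements being iterable over their children, which is how the program above already uses the tree returned by html5lib.parse().

```python
import xml.etree.ElementTree as ET


def dump_lines(root, indent="----"):
    """Return one line per descendant of root, in document order,
    prefixed by indent repeated once per nesting level (hypothetical helper)."""
    lines = []
    # Stack of (element, level) pairs. Children are pushed in reverse,
    # so that the first child is popped, and hence visited, first.
    stack = [(child, 0) for child in reversed(list(root))]
    while stack:
        elem, level = stack.pop()
        lines.append("{}{}".format(indent * level, elem.tag))
        for child in reversed(list(elem)):
            stack.append((child, level + 1))
    return lines


# A tiny made-up document, standing in for the parsed HTML5 tree.
doc = ET.fromstring(
    "<html><head/><body><header><h1/><h2/></header><p/></body></html>"
)
for line in dump_lines(doc):
    print(line)
```

The trade-off is that the indentation level now has to be carried along explicitly in the stack, which is exactly the bookkeeping the recursive version got for free.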

Enjoy.

- Vasudev Ram - online Python trainer and freelance programmer


Tuesday, May 21, 2013

A partial crossword solver in Python

A Cryptic Crossword Clue Solver

Saw this via Twitter.

It is a partial crossword solver, because it only helps solve a particular category of crossword clues: those in which the clue (which is usually a sentence or phrase) contains both a "definition" of the answer as well as a hint of some kind that leads to the same answer. This solver tries to compute the answer using both the definition and the hint, and checks whether the results match. Ingenious.

I found it interesting because this is a somewhat difficult problem, and yet the author managed to create a solution (involving NLTK and parsing) that works in many, if not all cases.

Also, long ago, in my college days, I had written another kind of partial crossword solver (in BASIC). It was much simpler and used a brute-force method: it helped solve the kind of crossword clue in which the answer is a permutation of a substring of the characters comprising the clue sentence or phrase. The program would generate, and display on the screen, all possible permutations of all substrings of the sentence that were of the same length as the answer. Then you had to view those permutations and guess whether any of them was the right answer, based on the clue.

I wrote the permutation-generation code by hand, but saw recently that the Python itertools module has functions to generate permutations (as well as combinations) from sequences:

http://docs.python.org/2/library/itertools.html

http://en.m.wikipedia.org/wiki/Permutation

http://en.wikipedia.org/wiki/Crossword
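
For illustration, the brute-force idea described above could be sketched with itertools.permutations roughly as follows. This is not a reconstruction of the old BASIC program. The function name candidate_answers is hypothetical, and two choices here are my assumptions: non-letter characters are dropped from the clue, and "substring" is taken to mean a contiguous run of the remaining letters.

```python
from itertools import permutations


def candidate_answers(clue, length):
    """Yield each distinct permutation of every contiguous run of
    `length` letters in the clue (hypothetical helper)."""
    # Assumption: keep only letters, lowercased; spaces and punctuation go.
    letters = [c.lower() for c in clue if c.isalpha()]
    seen = set()
    for start in range(len(letters) - length + 1):
        window = letters[start:start + length]
        for perm in permutations(window):
            word = "".join(perm)
            # Repeated letters and overlapping windows produce duplicates.
            if word not in seen:
                seen.add(word)
                yield word


for word in candidate_answers("a cat", 3):
    print(word)
```

The number of candidates grows factorially with the answer length, so in practice the output would need filtering, for example against a word list, rather than being read by eye.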

- Vasudev Ram
dancingbison.com