Recently, I thought of extracting the text from HTML documents and writing that text to PDF. So I did it :)
Here's how:

"""
HTMLTextToPDF.py
A demo program to show how to convert the text extracted from HTML
content, to PDF. It uses the Beautiful Soup library, v4, to
parse the HTML, and the xtopdf library to generate the PDF output.
Beautiful Soup is at: http://www.crummy.com/software/BeautifulSoup/
xtopdf is at: https://bitbucket.org/vasudevram/xtopdf
Guide to using and installing xtopdf:
http://jugad2.blogspot.in/2012/07/guide-to-installing-and-using-xtopdf.html
Author: Vasudev Ram - http://www.dancingbison.com
Copyright 2015 Vasudev Ram
"""

import sys
from bs4 import BeautifulSoup
from PDFWriter import PDFWriter

def usage():
    sys.stderr.write("Usage: python " + sys.argv[0] + " html_file pdf_file\n")
    sys.stderr.write("which will extract only the text from html_file and\n")
    sys.stderr.write("write it to pdf_file\n")

def main():

    # Create some HTML for testing conversion of its text to PDF.
    html_doc = """
    <html>
        <head>
            <title>
            Test file for HTMLTextToPDF
            </title>
        </head>
        <body>
        This is text within the body element but outside any paragraph.
            <p>
            This is a paragraph of text. Hey there, how do you do?
            The quick red fox jumped over the slow blue cow.
            </p>
            <p>
            This is another paragraph of text. Don't mind what it contains.
            What is mind? Not matter. What is matter? Never mind.
            </p>
        This is also text within the body element but not within any paragraph.
        </body>
    </html>
    """

    pw = PDFWriter("HTMLTextTo.pdf")
    pw.setFont("Courier", 10)
    pw.setHeader("Conversion of HTML text to PDF")
    pw.setFooter("Generated by xtopdf: http://slid.es/vasudevram/xtopdf")

    # Use method chaining this time.
    for line in BeautifulSoup(html_doc).get_text().split("\n"):
        pw.writeLine(line)
    pw.savePage()
    pw.close()

if __name__ == '__main__':
    main()
The program uses the Beautiful Soup library for parsing and extracting information from HTML, and xtopdf, my Python library for PDF generation.
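To look at the text-extraction step on its own, without generating a PDF, here is a minimal sketch. It is not part of the program above: the function name extract_text_lines and the shortened sample HTML are made up for illustration. It also names Python's built-in "html.parser" explicitly and skips blank lines, which the original program does not do.

```python
from bs4 import BeautifulSoup

def extract_text_lines(html_doc):
    """Return the non-blank text lines found in an HTML string.

    Hypothetical helper for illustration; not part of HTMLTextToPDF.py.
    """
    # Naming the parser explicitly avoids Beautiful Soup's "no parser was
    # explicitly specified" warning and keeps results consistent.
    soup = BeautifulSoup(html_doc, "html.parser")
    text = soup.get_text()
    # Assumption: blank lines are not wanted, so drop them and trim the rest.
    return [line.strip() for line in text.split("\n") if line.strip()]

# A shortened, made-up sample in the same shape as the test HTML above.
sample = """
<html>
<head><title>Test</title></head>
<body>
Text outside any paragraph.
<p>
A paragraph of text.
</p>
</body>
</html>
"""

for line in extract_text_lines(sample):
    print(line)
```

Each string returned could then be passed to pw.writeLine() in the same way as in the loop in the main program.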
Run it with:
python HTMLTextToPDF.py
and the output will be in the file HTMLTextTo.pdf.
[Screenshot of the generated PDF]
- Vasudev Ram - Python training and programming - Dancing Bison Enterprises
1 comment:
My apologies to anyone who sees this post twice via Planet Python or other aggregators / feed readers. I made a last minute edit to the post, adding links to relevant libraries, due to which duplicate posts may be seen.