Tuesday, November 3, 2015

Using the wikipedia Python library (to search for oranges :)

By Vasudev Ram



Orange and orange juice image from Wikimedia Commons.

I had come across the wikipedia Python library some time ago. Note that I said "wikipedia Python library", not "Wikipedia Python API". That's because wikipedia is a Python library that wraps the Wikipedia API, providing a somewhat higher-level / easier-to-use interface to the programmer.

Here are a few basic ways of using this library:

First, install it with this command at your OS command line:
$ pip install wikipedia
(I am using $ as the command line prompt, so don't type it.)

Now the Python code snippets:
Import the library:
import wikipedia
Use the .page() method to search for a Wikipedia page:
print "1: Searching Wikipedia for 'Orange'"
try:
    print wikipedia.page('Orange')
except wikipedia.exceptions.DisambiguationError as e:
    print str(e)
    print 'DisambiguationError: The page name is ambiguous'
print
The output (partly truncated) is:
1: Searching Wikipedia for 'Orange'
"Orange" may refer to:
Orange (colour)
Orange (fruit)
Some other citrus or citrus-like fruit
Orange (manga)
Orange (2010 film)
Orange (2012 film)
Oranges (film)
The Oranges (film)
Orange Record Label
Orange (band)
Orange (Al Stewart album)
Orange (Jon Spencer Blues Explosion album)
"Orange" (song)
Between the Eyes
"L'Orange" (song)
DisambiguationError: The page name is ambiguous
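
The exception object also carries the candidate titles, so a program can pick one of them instead of only reporting the ambiguity. Here is a minimal sketch of that idea, assuming the options attribute (a list of title strings) that the library's DisambiguationError provides. The names pick_option and page_or_best_match are hypothetical helpers of mine, not part of the wikipedia library, and the demonstration at the end runs offline on a few titles copied from the output above:

```python
def pick_option(options, keyword):
    """Return the first candidate title containing keyword (case-insensitive), or None."""
    keyword = keyword.lower()
    for title in options:
        if keyword in title.lower():
            return title
    return None


def page_or_best_match(title, keyword):
    """Fetch a page; on ambiguity, retry with the first option matching keyword."""
    # Imported here so that pick_option can be tried without the library or a network.
    import wikipedia

    try:
        return wikipedia.page(title)
    except wikipedia.exceptions.DisambiguationError as e:
        choice = pick_option(e.options, keyword)
        if choice is None:
            raise  # nothing matched, so let the caller see the original error
        return wikipedia.page(choice)


# Offline demonstration with a few titles from the output above.
candidates = ['Orange (colour)', 'Orange (fruit)', 'Orange (manga)']
print(pick_option(candidates, 'fruit'))
```

A call such as page_or_best_match('Orange', 'fruit') would need network access, and the retry can itself fail if the chosen title is ambiguous too, so treat this as a starting point only.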
Next, use the .page() method with one of the results from above, which are actual page titles:
print "2: Searching Wikipedia for 'Orange_(fruit)'"
print wikipedia.page('Orange_(fruit)')
The output may not be what one expects:
2: Searching Wikipedia for 'Orange_(fruit)'
<WikipediaPage 'Orange (fruit)'>
That's because the return value from the above call is a WikipediaPage object, not the page content itself. To get the content we want, we have to access the 'content' attribute of the WikipediaPage object:
#print wikipedia.page('Orange_(fruit)').content
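However, if we print it directly, we may get a Unicode error (a UnicodeEncodeError, raised when the console encoding cannot represent a character in the text, such as the × in "Citrus × sinensis"), so we encode it to UTF-8: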
result = wikipedia.page('Orange_(fruit)').content.encode('UTF8')
print "3: Result of searching Wikipedia for 'Orange_(fruit)':"
print result
orange_count = result.count('orange')
print
print "The Wikipedia page for 'Orange_(fruit)' has " + \
    "{} occurrences of the word 'orange'".format(orange_count)
Here are the first few lines of the output, followed by the count at the end:
3: Result of searching Wikipedia for 'Orange_(fruit)':
The orange (specifically, the sweet orange) is the fruit of the citrus species Citrus × sinensis in the family Rutaceae.
The fruit of the Citrus × sinensis is considered a sweet orange, whereas the fruit of the Citrus × aurantium is considered a bitter orange. The sweet orange reproduces asexually (apomixis through nucellar embryony); varieties of sweet orange arise through mutations.
The orange is a hybrid, between pomelo (Citrus maxima) and mandarin (Citrus reticulata). It has genes that are ~25% pomelo and ~75% mandarin; however, it is not a simple backcrossed BC1 hybrid, but hybridized over multiple generations. The chloroplast genes, and therefore the maternal line, seem to be pomelo. The sweet orange has had its full genome sequenced. Earlier estimates of the percentage of pomelo genes varying from ~50% to 6% have been reported.
Sweet oranges were mentioned in Chinese literature in 314 BC. As of 1987, orange trees were found to be the most cultivated fruit tree in the world. Orange trees are widely grown in tropical and subtropical climates for their sweet fruit. The fruit of the orange tree can be eaten fresh, or processed for its juice or fragrant peel. As of 2012, sweet oranges accounted for approximately 70% of citrus production.
In 2013, 71.4 million metric tons of oranges were grown worldwide, production being highest in Brazil and the U.S. states of Florida and California.

The Wikipedia page for 'Orange_(fruit)' has 172 occurrences of the word 'orange'
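
Note that str.count() counts substrings and is case-sensitive, so the figure above includes "oranges" but misses "Orange" at the start of a sentence. Here is a minimal sketch of a stricter, whole-word and case-insensitive count using the standard re module. The name count_word is a hypothetical helper, and the sample sentence is made up for illustration rather than taken from the Wikipedia page:

```python
import re


def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of word in text."""
    # \b matches at word boundaries, so 'oranges' does not count as 'orange'.
    pattern = r'\b' + re.escape(word) + r'\b'
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


sample = "Sweet oranges and bitter oranges differ; the orange is a hybrid."

print(sample.count('orange'))        # substring count, case-sensitive: 3
print(count_word(sample, 'orange'))  # whole-word count, case-insensitive: 1
```

Which of the two counts is the right one depends on what you want to measure; to count the plural as well, call count_word again with 'oranges' and add the results.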
- Enjoy.

- Vasudev Ram - Online Python training and programming


2 comments:

Vasudev Ram said...

>That's because wikipedia is a Python library that wraps the Wikipedia API, providing a somewhat higher-level / easier-to-use interface to the programmer.

Like my xtopdf library does for ReportLab:

Search Google for xtopdf

Online presentation about xtopdf

Vasudev Ram said...


The wikipedia library produces output on both sys.stdout (standard output) and sys.stderr (standard error output). You can redirect those two streams separately, like this, assuming the entire program in the post above is in the file wikipedia_orange.py:

On Unix:

$ python wikipedia_orange.py >out 2>err
(to redirect both streams to files)
or

$ python wikipedia_orange.py >out 2>/dev/null
(to redirect only stdout to a file, and throw away stderr)

On Windows:

$ python wikipedia_orange.py >out 2>err
(to redirect both streams to files)

or

$ python wikipedia_orange.py >out 2>NUL
(to redirect only stdout to a file, and throw away stderr)
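
To see what each redirection captures, here is a small stand-in script that writes one line to each stream. It is only a sketch and does not use the wikipedia library; report is a hypothetical helper, and the file name streams_demo.py mentioned below is likewise made up for illustration:

```python
import sys


def report(message, warning, out=None, err=None):
    """Write message to standard output and warning to standard error."""
    # Default to the real streams; other file-like objects can be passed in.
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    out.write(message + '\n')
    err.write(warning + '\n')


report('page content goes here', 'a warning goes here')
```

If this is saved as streams_demo.py and run as python streams_demo.py >out 2>err, the first line should land in the file out and the second in the file err.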