
Saturday, March 31, 2018

Checking if web sites are online with Python

By Vasudev Ram

Hi readers,

Recently, I thought of writing a small program to check if one or more web sites are online or not. I used the requests Python library with the HTTP HEAD method. I also checked out PycURL for this. It is a thin wrapper over libcurl, the library that powers the well-known and widely used curl command line tool. While PycURL looks powerful and fast (since it is a thin wrapper that exposes most or all of the functionality of libcurl), I decided to use requests for this version of the program. The code for the program is straightforward, but I found a few interesting things while running it with a few different sites as arguments. I mention those points below.
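
For comparison, here is a minimal, untested sketch of the same kind of HEAD check done with PycURL (NOBODY is libcurl's option for making a HEAD request, and FOLLOWLOCATION roughly corresponds to the allow_redirects flag used below):

# Minimal PycURL sketch of a HEAD request, for comparison; not used
# in the final program.
import pycurl

c = pycurl.Curl()
c.setopt(pycurl.URL, "http://google.com")
c.setopt(pycurl.NOBODY, True)           # send a HEAD request (no body)
c.setopt(pycurl.FOLLOWLOCATION, True)   # follow redirects
c.perform()
print("status:", c.getinfo(pycurl.RESPONSE_CODE))
c.close()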

Here is the tool, which I named is_site_online.py:

"""
is_site_online.py
Purpose: A Python program to check if a site is online or not.
Uses the requests library and the HTTP HEAD method.
Tries both with and without HTTP redirects.
Author: Vasudev Ram
Copyright 2018 Vasudev Ram
Web site: https://vasudevram.github.io
Blog: https://jugad2.blogspot.com
Product store: https://gumroad.com/vasudevram
"""

from __future__ import print_function
import sys
import requests
import time

if len(sys.argv) < 2:
    sys.stderr.write("Usage: {} site ...\n".format(sys.argv[0]))
    sys.stderr.write("Checks if the given site(s) are online or not.\n")
    sys.exit(1)

print("Checking if these sites are online or not:")
print("   ".join(sys.argv[1:]))

print("-" * 60)
try:
    for site in sys.argv[1:]:
        for allow_redirects in (False, True):
            tc1 = time.time()  # wall-clock time; time.clock() is gone in Python 3.8+
            r = requests.head(site, allow_redirects=allow_redirects)
            tc2 = time.time()
            print("Site:", site)
            print("Check with allow_redirects =", allow_redirects)
            print("Results:")
            print("r.ok:", r.ok)
            print("r.status_code:", r.status_code)
            print("request time:", round(tc2 - tc1, 3), "secs")
            print("-" * 60)
except requests.ConnectionError as ce:
    print("Error: ConnectionError: {}".format(ce))
    sys.exit(1)
except requests.exceptions.MissingSchema as ms:
    print("Error: MissingSchema: {}".format(ms))
    sys.exit(1)
except Exception as e:
    print("Error: Exception: {}".format(e))
    sys.exit(1)
The results of some runs of the program:

Check for Google and Yahoo!:

$ python is_site_online.py http://google.com http://yahoo.com
Checking if these sites are online or not:
http://google.com   http://yahoo.com
------------------------------------------------------------
Site: http://google.com
Check with allow_redirects = False
Results:
r.ok: True
r.status_code: 302
request time: 0.217 secs
------------------------------------------------------------
Site: http://google.com
Check with allow_redirects = True
Results:
r.ok: True
r.status_code: 200
request time: 0.36 secs
------------------------------------------------------------
Site: http://yahoo.com
Check with allow_redirects = False
Results:
r.ok: True
r.status_code: 301
request time: 2.837 secs
------------------------------------------------------------
Site: http://yahoo.com
Check with allow_redirects = True
Results:
r.ok: True
r.status_code: 200
request time: 1.852 secs
------------------------------------------------------------
In the cases where allow_redirects is False, google.com gives a status code of 302 and yahoo.com gives a status code of 301. The 3xx series of codes relates to HTTP redirection.
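
As an aside, when allow_redirects is True, requests keeps the intermediate responses, so you can see where a redirect chain goes; a small sketch using r.history (a documented attribute of the Response object):

# Sketch: inspecting the redirect chain that requests followed.
import requests

r = requests.head("http://google.com", allow_redirects=True)
for resp in r.history:        # one entry per redirect hop
    print(resp.status_code, resp.url)
print(r.status_code, r.url)   # final response and final URL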

After seeing this, I looked up HTTP status code information on a few sites, such as Wikipedia and the official site www.w3.org (the World Wide Web Consortium), and found a point worth noting. See the part about "302 Found" in the Related links section at the end of this post, where it says: "This is an example of industry practice contradicting the standard."

Now let's check for some error cases:

One error case: the user does not give an http:// prefix (imagine a novice user who is mixed up about schemes and paths) and types a garbled site name, say http.om:

$ python is_site_online.py http.om
Checking if these sites are online or not:
http.om
------------------------------------------------------------
Traceback (most recent call last):
  File "is_site_online.py", line 32, in 
    r = requests.head(site, allow_redirects=allow_redirects)
[snip long traceback]
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL 'http.om':
No schema supplied. Perhaps you meant http://http.om?
This traceback tells us that when no HTTP 'scheme' [1][2] is given, requests raises a MissingSchema exception. So we need to catch that exception too, by adding another except clause to the try statement - which I later did, in the program you see in this post. In general, this technique is useful when using a new Python library for the first time: don't handle any exceptions at the beginning, use the library a few times with variations in input or mode of use, and see what sorts of exceptions it raises; then add code to handle them. (A related sketch follows the two links below.)

[1] The components of a URL

[2] Parts of URL
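
Incidentally, all of requests' own exceptions derive from the documented base class requests.exceptions.RequestException, so if you just want a catch-all for the library, you can catch that instead of discovering exceptions one at a time; a minimal sketch:

# Sketch: catching the common base class of requests' exceptions.
import requests

try:
    r = requests.head("http.om")  # the garbled name from the example above
except requests.exceptions.RequestException as e:
    print("Error: {}: {}".format(type(e).__name__, e))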

Another error case - a made-up site name that does not exist:

$ python is_site_online.py http://abcd.efg
Checking if these sites are online or not:
http://abcd.efg
------------------------------------------------------------
Caught ConnectionError: HTTPConnectionPool(host='abcd.efg',
port=80): Max retries exceeded with url: / (Caused
by NewConnectionError('<urllib3.connection.HTTPConnection
object at 0x...>: Failed to establish a new connection:
[Errno 11004] getaddrinfo failed',))
From the above error we can see or figure out a few things:

- the requests library defines a ConnectionError exception. I first ran the above command without catching ConnectionError in the program; it gave that error, then I added the handler for it.

- requests uses an HTTP connection pool

- requests does some retries when you try to get() or head() a URL (a site name); there is a sketch below showing how to tune this

- requests uses the urllib3 library (a third-party library, bundled with requests, not part of the Python standard library) under the hood

I had discovered that last point earlier too; see this post:

urllib3, the library used by the Python requests library

And as I mentioned in that post, urllib3 itself uses httplib.
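
As an aside, the retry behaviour mentioned in the list above can be tuned by mounting a transport adapter on a session; a minimal sketch using requests' documented HTTPAdapter (the retry count of 3 is just an example):

# Sketch: controlling how many connection retries requests (via
# urllib3's connection pool) will make.
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=3))
session.mount("https://", HTTPAdapter(max_retries=3))
r = session.head("http://google.com")
print(r.status_code)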

Now let's check for some sites that are misspellings of the site google.com:

$ python is_site_online.py http://gogle.com
Checking ...
------------------------------------------------------------
Site: http://gogle.com
With allow_redirects: False
Results:
r.ok: True
r.status_code: 301
request time: 3.377
------------------------------------------------------------
Site: http://gogle.com
With allow_redirects: True
Results:
r.ok: True
r.status_code: 200
request time: 1.982
------------------------------------------------------------

$ python is_site_online.py http://gooogle.com
Checking ...
------------------------------------------------------------
Site: http://gooogle.com
With allow_redirects: False
Results:
r.ok: True
r.status_code: 301
request time: 0.425
------------------------------------------------------------
Site: http://gooogle.com
With allow_redirects: True
Results:
r.ok: True
r.status_code: 200
request time: 1.216
------------------------------------------------------------

Interestingly, the results show that both those misspellings of google.com exist as sites.

It is known that some people register domains that are similar in spelling to well-known / popular domain names - a practice called typosquatting - perhaps hoping to capture some of the traffic from users mistyping the famous ones. Although I did not plan it that way, I realized from the above two results for gogle.com and gooogle.com that this tool can be used to detect the existence of such sites (if they are online when you check, of course). A sketch of that idea follows.
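
Here is a hypothetical sketch along those lines (typo_variants is a made-up helper, and dropping or doubling one character covers just two of many possible typo patterns; note that gogle and gooogle are among the variants it generates for google):

# Hypothetical sketch: generate naive one-character misspellings of a
# domain name and check which of them respond to a HEAD request.
import requests

def typo_variants(name):
    for i in range(len(name)):
        yield name[:i] + name[i+1:]          # drop one character
        yield name[:i] + name[i] + name[i:]  # double one character

for variant in sorted(set(typo_variants("google"))):
    url = "http://{}.com".format(variant)
    try:
        r = requests.head(url, allow_redirects=True, timeout=5)
        print(url, "->", r.status_code)
    except requests.exceptions.RequestException:
        print(url, "-> not reachable")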

Related links:

Wikipedia: List_of_HTTP_status_codes

This excerpt from the above Wikipedia page is interesting:

[ 302 Found This is an example of industry practice contradicting the standard. The HTTP/1.0 specification (RFC 1945) required the client to perform a temporary redirect (the original describing phrase was "Moved Temporarily"),[22] but popular browsers implemented 302 with the functionality of a 303 See Other. Therefore, HTTP/1.1 added status codes 303 and 307 to distinguish between the two behaviours.[23] However, some Web applications and frameworks use the 302 status code as if it were the 303.[24] ]

3xx Redirection

W3C: Status Codes

URL redirection

requests docs: redirection section

IBM Knowledge Center: HTTP Status codes and reason phrases

Enjoy.

- Vasudev Ram - Online Python training and consulting



Friday, February 26, 2016

PrintFriendly.com, site to print any web page well

By Vasudev Ram

PrintFriendly.com is a web site that claims to let you make any web page print friendly.

From their site:

[ PrintFriendly cleans and formats web pages for perfect print experience. PrintFriendly removes Ads, Navigation and web page junk, so you save paper and ink when you print. It's free and easy to use. Perfect to use at home, the office, or whenever you need to print a web page. ]

I tried out PrintFriendly on the About page of my blog, jugad2. It seems to have worked well.

Here are screenshots of the top, roughly the middle, and the bottom of the resulting page preview (the last with the image of my mascot showing). All three parts seem to have come out well.

Interestingly, the site also has a feature by which, in the print preview, if you highlight certain parts with your mouse, it shows a small popup that lets you delete the selected part - before printing, I guess.

So PrintFriendly is eco-friendly.

The site reminds me slightly of my selpg Linux utility, which lets you print selected pages of a text file to PDF (with the help of xtopdf, which is also by me).

- Vasudev Ram - Online Python training and programming


Thursday, February 18, 2016

Web dev utilities: HTML Encode

By Vasudev Ram


HTML Encode, an online HTML encoding utility.

Saw it on the Browserling site, via a post on catonmat.net.

Tried it a little. Useful.
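
For comparison, the same kind of encoding can be done locally with the Python 3 standard library; a minimal sketch:

# Sketch: HTML-encoding a string locally with Python's standard
# library html module, instead of an online utility.
import html

print(html.escape('<a href="x">Tom & Jerry</a>'))
# Prints: &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;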

- Vasudev Ram - Online Python training and programming



Saturday, July 23, 2011

gplus.to makes short URL for your Google Plus page

By Vasudev Ram - dancingbison.com | @vasudevram | jugad2.blogspot.com

Saw this via someone's Google Plus page:

The site gplus.to lets you create a short URL for your Google Plus (G+) page. (gplus.to is not a Google product).

I created a gplus.to URL for myself at gplus.to/vasudevram. (Yes, I know the handle vasudevram is a bit long and sort of defeats the purpose of a URL shortener; I could have used, say, vram. But I like to be consistent: it is easier to remember one user id than one web account called vram, another called vasudevram, a third called something else, and so on. So, as far as possible, I use the user id vasudevram for all web services that I become a user of.)

(I haven't used G+ much yet, only got the account some days ago, but will probably use it more over time. So not much to see on my G+ page right now).

- Vasudev Ram - dancingbison.com


Monday, July 11, 2011

WebStatusCodes, a web development utility by Brian Jones

By Vasudev Ram - dancingbison.com | @vasudevram | jugad2.blogspot.com

I think this tool can be useful to web developers:

WebStatusCodes - http://webstatuscodes.appspot.com/ - is a web development utility written by Brian Jones, using Google App Engine. Excerpts from the main page of the site, which convey what it is about:

[
This was put here as a handy testing tool for apps that need to test how their code deals with various HTTP status codes, and as a very basic reference for those who know HTTP but need an occasional reminder of whether 302 or 301 is "permanent", or can't remember which code means "Not Modified".
...
Request a valid status code by putting the status code as the first part of the URL path. For example, requesting
http://webstatuscodes.appspot.com/403 will return a 403 forbidden error, and the body of the response will contain the standard message describing that response.
...
Currently, this site doesn't do anything other than return the error. 401 doesn't issue a challenge, for example.
...
This site is also a handy reference. Not only is the list of supported codes and their accompanying short  description messages in a table below (in case you *don't* know what you're looking for), but
requesting a status code in a browser includes the description in the body of the response (in case you *do* know  what you're looking for).
]

Below that last excerpted paragraph, the site shows a table of HTTP status codes and their accompanying short descriptions.
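
If the service is still online, you could exercise it from Python with requests; a minimal sketch using the URL pattern documented above (passing allow_redirects=False so 3xx codes show up as themselves):

# Sketch: requesting a few status codes from the service described
# above (assuming the site is still up).
import requests

for code in (200, 301, 403, 404, 500):
    r = requests.get("http://webstatuscodes.appspot.com/{}".format(code),
                     allow_redirects=False)  # show raw 3xx codes too
    print(code, "->", r.status_code, r.reason)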

Posted via email
- Vasudev Ram - Dancing Bison Enterprises