Hi readers,
Recently, I thought of writing a small program to check if one or more web sites are online or not. I used the requests Python library with the HTTP HEAD method. I also checked out PycURL for this. It is a thin wrapper over libcurl, the library that powers the well-known and widely used curl command line tool. While PycURL looks powerful and fast (since it is a thin wrapper that exposes most or all of the functionality of libcurl), I decided to use requests for this version of the program. The code for the program is straightforward, but I found a few interesting things while running it with a few different sites as arguments. I mention those points below.
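For the record, a HEAD check done with PycURL would look roughly like the sketch below. This is just an illustration of the thin-wrapper style (option constants are the libcurl ones that pycurl exposes), not code from this program, and I have not used it in the tool:

# Hypothetical sketch of a HEAD check using PycURL (not the code used in this post).
# NOBODY=True makes libcurl send a HEAD request; FOLLOWLOCATION controls redirects.
import pycurl

def head_status(url, follow_redirects=False):
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.NOBODY, True)             # HEAD request: headers only, no body
    c.setopt(pycurl.FOLLOWLOCATION, follow_redirects)
    c.perform()
    status = c.getinfo(pycurl.RESPONSE_CODE)  # HTTP status code of the (final) response
    c.close()
    return status

print(head_status('http://google.com', follow_redirects=True))

For this post, though, everything below uses requests.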
Here is the tool, which I named is_site_online.py:
""" is_site_online.py Purpose: A Python program to check if a site is online or not. Uses the requests library and the HTTP HEAD method. Tries both with and without HTTP redirects. Author: Vasudev Ram Copyright 2018 Vasudev Ram Web site: https://vasudevram.github.io Blog: https://jugad2.blogspot.com Product store: https://gumroad.com/vasudevram """ from __future__ import print_function import sys import requests import time if len(sys.argv) < 2: sys.stderr.write("Usage: {} site ...".format(sys.argv[0])) sys.stderr.write("Checks if the given site(s) are online or not.") sys.exit(0) print("Checking if these sites are online or not:") print(" ".join(sys.argv[1:])) print("-" * 60) try: for site in sys.argv[1:]: for allow_redirects in (False, True): tc1 = time.clock() r = requests.head(site, allow_redirects=allow_redirects) tc2 = time.clock() print("Site:", site) print("Check with allow_redirects =", allow_redirects) print("Results:") print("r.ok:", r.ok) print("r.status_code:", r.status_code) print("request time:", round(tc2 - tc1, 3), "secs") print("-" * 60) except requests.ConnectionError as ce: print("Error: ConnectionError: {}".format(ce)) sys.exit(1) except requests.exceptions.MissingSchema as ms: print("Error: MissingSchema: {}".format(ms)) sys.exit(1) except Exception as e: print("Error: Exception: {}".format(e)) sys.exit(1)The results of some runs of the program: Check for Google and Yahoo!:
The results of some runs of the program follow. First, a check of Google and Yahoo!:

$ python is_site_online.py http://google.com http://yahoo.com
Checking if these sites are online or not:
http://google.com http://yahoo.com
------------------------------------------------------------
Site: http://google.com
Check with allow_redirects = False
Results:
r.ok: True
r.status_code: 302
request time: 0.217 secs
------------------------------------------------------------
Site: http://google.com
Check with allow_redirects = True
Results:
r.ok: True
r.status_code: 200
request time: 0.36 secs
------------------------------------------------------------
Site: http://yahoo.com
Check with allow_redirects = False
Results:
r.ok: True
r.status_code: 301
request time: 2.837 secs
------------------------------------------------------------
Site: http://yahoo.com
Check with allow_redirects = True
Results:
r.ok: True
r.status_code: 200
request time: 1.852 secs
------------------------------------------------------------

In the cases where allow_redirects is False, google.com gives a status code of 302 and yahoo.com gives a status code of 301. The 3xx series of codes is related to HTTP redirection. After seeing this, I looked up HTTP status code information on a few sites such as Wikipedia and the official site www.w3.org (the World Wide Web Consortium), and found a point worth noting. See the excerpt about "302 Found" in the Related links section at the end of this post, where it says: "This is an example of industry practice contradicting the standard."
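When allow_redirects is True, requests also records the intermediate responses in r.history, so the program could be extended to show the redirect chain it followed. A minimal sketch (the print format is just illustrative):

# Sketch: showing the redirect chain that requests followed for a HEAD request.
import requests

r = requests.head('http://yahoo.com', allow_redirects=True)
for resp in r.history:          # intermediate (redirect) responses, in order
    print(resp.status_code, resp.url)
print(r.status_code, r.url)     # the final response after following redirects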
Now let's check some error cases. One error case: no http:// prefix is given (imagine a novice user who is mixed up about schemes and paths), and the site name gets garbled into something like http.om:

$ python is_site_online.py http.om
Checking if these sites are online or not:
http.om
------------------------------------------------------------
Traceback (most recent call last):
  File "is_site_online.py", line 32, in <module>
    r = requests.head(site, allow_redirects=allow_redirects)
[snip long traceback]
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL 'http.om': No schema supplied. Perhaps you meant http://http.om?

This traceback tells us that when no HTTP 'scheme' [1][2] is given, requests raises a MissingSchema exception. So we now know that we need to catch that exception in our code, by adding another except clause to the try statement, which I later did, in the program you see in this post. In general, this technique can be useful when using a new Python library: don't handle any exceptions in the beginning, use the library a few times with variations in input or modes of use, and see what sorts of exceptions it throws. Then add code to handle them.

[1] The components of a URL
[2] Parts of URL
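By the way, if we wanted the program to be a bit more forgiving about a missing scheme instead of erroring out, one option is to prepend a default scheme before calling requests. A minimal sketch, where the 'http://' default and the helper name are my choice, not part of the program above:

# Sketch: prepend a default scheme if the user did not supply one,
# so requests does not raise MissingSchema for input like 'google.com'.
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2

def with_default_scheme(site, default='http://'):
    return site if urlparse(site).scheme else default + site

print(with_default_scheme('google.com'))         # http://google.com
print(with_default_scheme('https://yahoo.com'))  # unchanged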
Another error case: a made-up site name that does not exist:

$ python is_site_online.py http://abcd.efg
Checking if these sites are online or not:
http://abcd.efg
------------------------------------------------------------
Caught ConnectionError: HTTPConnectionPool(host='abcd.efg', port=80): Max retries exceeded with url: /
(Caused by NewConnectionError('Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))

From the above error we can see or figure out a few things:

- the requests library defines a ConnectionError exception. I first ran the above command without catching ConnectionError in the program; it gave that error, and then I added a handler for it.
- requests uses an HTTP connection pool.
- requests does some retries when you try to get() or head() a URL (a site name).
- requests uses urllib3 (a third-party library bundled with requests, not part of the Python standard library) under the hood.

I had discovered that last point earlier too; see this post: urllib3, the library used by the Python requests library. And as I mentioned in that post, urllib3 itself uses httplib.
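Relatedly, the retry behaviour noted in the list above can be tuned: requests lets you mount an HTTPAdapter with a urllib3 Retry policy on a Session. A rough sketch with example values (not recommendations), assuming urllib3 is importable directly (it is bundled with requests and is also reachable as requests.packages.urllib3):

# Sketch: controlling how many times requests/urllib3 retries a connection.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=2, backoff_factor=0.5)   # example values only
adapter = HTTPAdapter(max_retries=retries)
session.mount('http://', adapter)
session.mount('https://', adapter)

r = session.head('http://google.com', allow_redirects=True)
print(r.status_code)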
Now let's check some sites that are misspellings of google.com:

$ python is_site_online.py http://gogle.com
Checking ...
------------------------------------------------------------
Site: http://gogle.com
With allow_redirects: False
Results:
r.ok: True
r.status_code: 301
request time: 3.377
------------------------------------------------------------
Site: http://gogle.com
With allow_redirects: True
Results:
r.ok: True
r.status_code: 200
request time: 1.982
------------------------------------------------------------

$ python is_site_online.py http://gooogle.com
Checking ...
------------------------------------------------------------
Site: http://gooogle.com
With allow_redirects: False
Results:
r.ok: True
r.status_code: 301
request time: 0.425
------------------------------------------------------------
Site: http://gooogle.com
With allow_redirects: True
Results:
r.ok: True
r.status_code: 200
request time: 1.216
------------------------------------------------------------

Interestingly, the results show that both of those misspellings of google.com exist as sites. It is known that some people register domains that are similar in spelling to well-known / popular / famous domain names, perhaps hoping to capture some of the traffic resulting from users mistyping the famous ones. Although I did not plan it that way, I realized from the above two results for gogle.com and gooogle.com that this tool can be used to detect the existence of such sites (if they are online when you check, of course).

Related links:

Wikipedia: List_of_HTTP_status_codes

This excerpt from the above Wikipedia page is interesting:

[ 302 Found
This is an example of industry practice contradicting the standard. The HTTP/1.0 specification (RFC 1945) required the client to perform a temporary redirect (the original describing phrase was "Moved Temporarily"),[22] but popular browsers implemented 302 with the functionality of a 303 See Other. Therefore, HTTP/1.1 added status codes 303 and 307 to distinguish between the two behaviours.[23] However, some Web applications and frameworks use the 302 status code as if it were the 303.[24] ]

3xx Redirection
W3C: Status Codes
URL redirection
requests docs: redirection section
IBM Knowledge Center: HTTP Status codes and reason phrases

Enjoy.

- Vasudev Ram - Online Python training and consulting