Overview
While the title of this post is Urllib2, we will also show a few
examples of where to use urllib, because the two modules are often used together.
This will be an introductory post on urllib2, in which we will
focus on fetching URLs, requests, POST data, user agents, and error handling.
For more information, see the official documentation.
Note that this article was written for Python version 2.x.
HTTP is based on requests and responses: the client makes requests and
the server sends responses.
A program on the Internet can act as a client (accessing resources) or as
a server (providing services).
A URL identifies a resource on the Internet.
What is Urllib2?
urllib2 is a Python module that can be used to fetch URLs.
It defines functions and classes that support URL actions (basic and digest
authentication, redirects, cookies, and so on).
The magic starts with importing the urllib2 module.
What is the difference between urllib and urllib2?
While both modules do URL-request-related things, they have different
functionality.
urllib2 can accept a Request object to set the headers for a URL request;
urllib accepts only a URL.
urllib provides the urlencode method, which is used to generate
GET query strings; urllib2 does not have such a function.
Because of this, urllib and urllib2 are often used together.
See the documentation for more information.
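To make that split concrete, here is a minimal sketch that uses both modules together (the URL and header value are just placeholders):

import urllib
import urllib2

# urllib has urlencode; urllib2 does not
params = urllib.urlencode({'q': 'python'})

# urllib2 can take a Request object with custom headers;
# urllib.urlopen accepts only a plain URL string
request = urllib2.Request('http://python.org/?' + params,
                          headers={'User-Agent': 'Mozilla 5.10'})
response = urllib2.urlopen(request)
print response.code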
What is Urlopen?
urllib2 provides a very simple interface in the form of the urlopen function.
This function is capable of fetching URLs using a variety of protocols
(HTTP, FTP, ...).
Just pass a URL to urlopen() to get a file-like handle to the remote data.
Additionally, urllib2 provides an interface for handling common situations,
like basic authentication, cookies, proxies, and so on.
These are provided by objects called handlers and openers.
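As a rough sketch of how handlers and openers fit together, here is what installing a basic-authentication handler might look like (the URL, realm, and credentials are made up for illustration):

import urllib2

# A handler knows how to deal with one situation, e.g. basic authentication
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password(realm='Example Realm',
                          uri='http://example.com/protected/',
                          user='username',
                          passwd='password')

# An opener chains handlers together; install_opener makes it the default
opener = urllib2.build_opener(auth_handler)
urllib2.install_opener(opener)

# From now on, plain urlopen calls go through the installed opener
response = urllib2.urlopen('http://example.com/protected/')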
Get URLs
This is the most basic way to use the library.
Below is how to make a simple request with urllib2:
Start by importing the urllib2 module.
Store the response in a variable (response).
The response is now a file-like object.
Read the data from the response into a string (html).
Do something with that string.
Note: if the URL contains a space, you must encode it first (there is a sketch of this at the end of this section).
Let's see an example of how this works.
import urllib2

response = urllib2.urlopen('https://www.pythonforbeginners.com/')
print response.info()
html = response.read()
# do something
response.close()  # best practice to close the file

Note: you can also use a URL starting with "ftp:", "file:", etc.
The remote server accepts the incoming values and formats a plain-text response
to send back.
The return value of urlopen() gives access to the headers from the HTTP server
through the info() method, and to the data of the remote resource via methods
like read() and readlines().
Additionally, the file object returned by urlopen() is iterable.
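About the note on spaces above: here is a minimal sketch, assuming a hypothetical page name containing a space, that escapes it with urllib.quote before fetching:

import urllib
import urllib2

# quote() percent-encodes unsafe characters such as spaces in the path
url = 'https://www.pythonforbeginners.com/' + urllib.quote('some page.html')
response = urllib2.urlopen(url)  # the space is sent as %20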
Simple urllib2 script
Let's look at another example of a simple urllib2 script.
import urllib2

response = urllib2.urlopen('http://python.org/')
print "Response:", response

# Get the URL. This gets the real URL.
print "The URL is: ", response.geturl()

# Get the HTTP status code
print "This gets the code: ", response.code

# Get the headers.
# This returns a dictionary-like object that describes the page fetched,
# particularly the headers sent by the server.
print "The Date is: ", response.info()['date']

# Get the server part of the headers
print "The Server is: ", response.info()['server']

# Get all data
html = response.read()
print "Get all data: ", html

# Get only the length
print "Get the length :", len(html)

# Showing that the file object is iterable
for line in response:
    print line.rstrip()

# Note that the rstrip strips the trailing newlines and carriage returns
# before printing the output.
Download files with Urllib2
This little script downloads a file from the pythonforbeginners.com website
import urllib2

# file to write to
file = "downloaded_file.html"
url = "https://www.pythonforbeginners.com/"
response = urllib2.urlopen(url)

# open the file for writing
fh = open(file, "w")

# read from the response while writing to the file
fh.write(response.read())
fh.close()

# You can also use the with statement, which closes the file for you.
# Re-fetch the URL first, since read() above already consumed the data:
response = urllib2.urlopen(url)
with open(file, 'w') as f:
    f.write(response.read())
The difference in the next script is that we open the file with "wb", which
means we write the file in binary mode.
import urllib2

mp3file = urllib2.urlopen("http://www.example.com/songs/mp3.mp3")
output = open('test.mp3', 'wb')
output.write(mp3file.read())
output.close()
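For large files you may not want to hold the whole download in memory at once. A hedged variant of the script above (the chunk size is an arbitrary choice) reads and writes in chunks:

import urllib2

response = urllib2.urlopen("http://www.example.com/songs/mp3.mp3")
with open('test.mp3', 'wb') as output:
    while True:
        chunk = response.read(16 * 1024)  # read 16 KB at a time
        if not chunk:
            break  # read() returns an empty string at the end of the data
        output.write(chunk)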
Urllib2 requests
The Request object represents the HTTP request you are making.
In its simplest form, you create a Request object that specifies the URL you
want to fetch.
Calling urlopen with this Request object returns a response object for the URL
requested.
The Request class in the urllib2 module accepts both a URL and a data parameter.
If you don't include the data (and pass only the URL), the request made
is a GET request.
If you include the data, the request made is a POST request, where the
URL is your POST URL and the data parameter is the HTTP POST content.
Let's look at the example below
import urllib2
import urllib

# Specify the url
url = 'https://www.pythonforbeginners.com'

# This packs the request (it doesn't send it yet)
request = urllib2.Request(url)

# Send the request and catch the response
response = urllib2.urlopen(request)

# Extract the response
html = response.read()

# Print it out
print html
You can set the outgoing data on the request to be sent to the server.
Additionally, you can pass extra information ("metadata") about the data or
about the request itself to the server; this information is sent as HTTP
headers.
If you want to POST data, you must first create the data in a dictionary.
Make sure you understand what the code is doing.
# Prepare the data
query_args = { 'q':'query string', 'foo':'bar' }

# This url-encodes your data (that's why we need to import urllib above)
data = urllib.urlencode(query_args)

# Send the HTTP POST request
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
html = response.read()

# Print the result
print html
User Agents
The way a browser identifies itself is through the User-Agent header.
By default, urllib2 identifies itself as Python-urllib/x.y,
where x and y are the major and minor version numbers of the Python release.
This could confuse the site, or simply not work.
With urllib2 you can add your own headers to the request.
The reason you would want to do that is that some websites don't like being
browsed by programs.
When you create an application that accesses other people's web resources,
it is courteous to include real user-agent information in your requests,
so that they can identify the source of the hits more easily.
When you create the Request object, you can add your headers in a dictionary,
or use add_header() to set the user-agent value before opening the request.
That would look something like this:
# Import the module
import urllib2

# Define the url
url = 'http://www.google.com/#q=my_search'

# Add your headers
headers = {'User-Agent' : 'Mozilla 5.10'}

# Build the request
request = urllib2.Request(url, None, headers)

# Get the response
response = urllib2.urlopen(request)

# Print the headers
print response.headers
You can also add headers with add_header().
Syntax: Request.add_header(key, val)
The example below uses Mozilla 5.10 as the user agent, which is also what
will show up in the web server's log file.
import urllib2

req = urllib2.Request('http://192.168.1.2/')
req.add_header('User-agent', 'Mozilla 5.10')
res = urllib2.urlopen(req)
html = res.read()
print html
This is what shows up in the log file:
"GET / HTTP/1.1" 200 151 "-" "Mozilla 5.10"
urlparse
The urlparse module provides functions for parsing URL strings.
It defines a standard interface to break Uniform Resource Locator (URL)
strings up into several optional parts, called components, known as
scheme, location, path, query, and fragment.
Suppose you have this URL:
http://www.python.org:80/index.html
The scheme would be http.
The location would be www.python.org:80.
The path is /index.html.
We have no query and no fragment, as the sketch below confirms.
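Here is a small sketch that pulls those components out of that exact URL with the urlparse module (attribute access on the result requires Python 2.5 or later):

import urlparse

parts = urlparse.urlparse('http://www.python.org:80/index.html')
print "Scheme:  ", parts.scheme    # http
print "Location:", parts.netloc    # www.python.org:80
print "Path:    ", parts.path      # /index.html
print "Query:   ", parts.query     # empty string
print "Fragment:", parts.fragment  # empty string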
The most common functions are urljoin and urlsplit.
import urlparse

url = "http://python.org"
domain = urlparse.urlsplit(url)[1].split(':')[0]
print "The domain name of the URL is: ", domain
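Since urljoin was mentioned as well, here is a small sketch of it (the base URL and relative path are made up):

import urlparse

# urljoin resolves a relative link against a base URL
print urlparse.urljoin('http://python.org/about/', 'help.html')
# prints: http://python.org/about/help.html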
For more information on urlparse, see the official documentation.
urllib.urlencode
If you are passing information through a URL, you need to make sure it uses
only certain allowed characters.
Allowed characters are any alphabetic characters, digits, and a few special
characters that have meaning in the URL string.
The most commonly encoded character is the space character.
You see this character whenever you see a plus sign (+) in a URL;
the plus sign acts as a special character representing a space.
Arguments can be passed to the server by encoding them and appending them
to the URL.
Let's look at the example below.
import urllib
import urllib2

query_args = { 'q':'query string', 'foo':'bar' }  # you have to pass in a dictionary
encoded_args = urllib.urlencode(query_args)
print 'Encoded:', encoded_args

url = 'http://python.org/?' + encoded_args
print urllib2.urlopen(url).read()
If we print this now, we get an encoded string like this:
q=query+string&foo=bar
Python's urlencode takes variable/value pairs and creates a properly escaped
query string:
from urllib import urlencode

artist = "Kruder & Dorfmeister"
artist = urlencode({'ArtistSearch': artist})
This sets the artist variable to:
Output: ArtistSearch=Kruder+%26+Dorfmeister
Error Handling
This section on error handling is based on information from the great
voidspace.org.uk article "Urllib2 - The Missing Manual".
urlopen raises URLError when it cannot handle a response.
HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.
URLError
URLError is often raised because there is no network connection,
or the specified server doesn't exist.
In this case, the exception raised has a 'reason' attribute,
which is a tuple containing an error code and a text error message.
Example of URLError
import urllib2
from urllib2 import URLError

req = urllib2.Request('http://www.pretend_server.org')
try:
    urllib2.urlopen(req)
except URLError, e:
    print e.reason

This prints something like:
(4, 'getaddrinfo failed')
HTTPError
Each HTTP response from the server includes a numeric "status code".
Sometimes the status code indicates that the server cannot fulfill
the request.
The default handlers handle some of these responses for you (e.g.
if the response is a "redirect" asking the client to fetch the document
from another URL, urllib2 will do that for you).
For those it can't handle, urlopen throws an HTTPError.
Typical errors are "404" (page not found), "403" (request forbidden),
and "401" (requires authentication).
When an error is raised, the server responds by returning an HTTP error code
and an error page.
You can use the HTTPError instance as a response object for the page returned.
This means that in addition to the code attribute, it also has read, geturl,
and info methods.
import urllib2

req = urllib2.Request('http://www.python.org/fish.html')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.code
    print e.read()
from urllib2 import Request, urlopen, URLError

req = Request(someurl)
try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'reason'):
        print 'We could not reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server could not fulfill the request.'
        print 'Error code: ', e.code
else:
    # everything is fine
    print 'Request succeeded.'
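The voidspace article also describes a second approach: catch HTTPError before URLError, since HTTPError is the more specific subclass. A sketch of that pattern (someurl is still a placeholder):

from urllib2 import Request, urlopen, URLError, HTTPError

req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError, e:
    # HTTPError must be caught first, because it is a subclass of URLError
    print 'The server could not fulfill the request.'
    print 'Error code: ', e.code
except URLError, e:
    print 'We could not reach a server.'
    print 'Reason: ', e.reason
else:
    # everything is fine
    print 'Request succeeded.'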
Please take a look at the links below to learn more about the urllib2 library.
Sources and further reading
http://pymotw.com/2/urllib2/
http://www.kentsjohnson.com/
http://www.voidspace.org.uk/python/articles/urllib2.shtml
http://techmalt.com/
http://www.hacksparrow.com/
http://docs.python.org/2/howto/urllib2.html
http://www.stackoverflow.com
http://www.oreillynet.com/