urllib is a Python 3 library for making HTTP requests. It is part of the Python Standard Library.
The urllib library has gone through a couple of iterations, starting life as a Python 2 library. Documentation can therefore be tricky to find, as it’s not always clear which version of urllib the documentation is referring to.
This is further compounded by the existence of urllib3, which is totally unrelated to the built-in urllib library. You can learn more about the origins of urllib, and a comparison between urllib and the popular requests library, in this article.
This article will cover the basics, and show you how to use urllib in your application.
Importing urllib
The first step to using urllib is to import it into your application.
You can do this by adding the following import statements.
from urllib import error, parse
from urllib.request import Request, urlopen
These statements import the Request class and the urlopen function needed to build and send a request, along with the parse and error modules, which are used to encode request data and to handle errors.
As urllib is part of the Python standard library, you shouldn’t need to install any additional packages.
The rest of the code examples in this article will assume you’re using the import statements listed above.
Initialising the Request
The simplest way to create a request is to use urllib.request.urlopen. Pass in a string containing the URL to access. The response can be read using read().
with urlopen(url) as response:
    html = response.read()
For more advanced queries, you can create a Request object, and pass this to urlopen instead. This allows a more customised query.
For example, using Request allows headers and body content to be set. Note that Request uses the data attribute to hold data to send in the request body.
req = Request(url, headers=headers, data=body)
with urlopen(req) as response:
    # Parse response
    html = response.read()
urllib not only supports HTTP URLs, but can also connect using a variety of different protocols, such as FTP.
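As a small illustration of a non-HTTP scheme, the sketch below opens a data: URL, which urllib decodes locally without touching the network. The URL and its payload are made up for this example.

```python
from urllib.request import urlopen

# Hypothetical data: URL. The text after the comma is the percent-encoded payload.
url = "data:text/plain;charset=utf-8,hello%20world"

with urlopen(url) as response:
    body = response.read()  # read() returns bytes
    content_type = response.headers.get_content_type()

print(body)          # b'hello world'
print(content_type)  # text/plain
```

Because nothing is sent over the wire, this is also a handy way to experiment with urlopen and the response object offline.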
Setting Headers
There are two ways to add headers to a urllib Request object.
The first is to pass a dictionary containing the required headers to the Request constructor.
headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Authorization': 'Basic'
}
req = Request(url, headers=headers)
This is useful if you’re adding a large number of headers.
The alternative is to use the add_header() method, which takes the name of the header, followed by the value. This method alters an existing Request object.
req = Request(url)
req.add_header('Content-Type', 'application/x-www-form-urlencoded')
req.add_header('Authorization', 'Basic')
This can be repeated as many times as required to add all of your headers. You might want to use this method when conditionally adding headers.
Note that the header name and value are passed here as two separate arguments, rather than as a dictionary.
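To compare the two approaches side by side, here is a minimal sketch that builds the same request both ways and never sends it. The URL, the api_token variable and the Bearer scheme are assumptions made purely for illustration.

```python
from urllib.request import Request

url = "https://example.com/api"  # placeholder URL, nothing is sent
api_token = "secret-token"       # hypothetical value, could also be None

# Approach 1: pass every header to the constructor as a dictionary.
req_a = Request(url, headers={
    "Content-Type": "application/x-www-form-urlencoded",
    "Authorization": f"Bearer {api_token}",
})

# Approach 2: add headers one at a time, some of them conditionally.
req_b = Request(url)
req_b.add_header("Content-Type", "application/x-www-form-urlencoded")
if api_token is not None:
    req_b.add_header("Authorization", f"Bearer {api_token}")

# Request stores header names using str.capitalize(),
# so "Content-Type" has to be looked up as "Content-type".
content_type = req_b.get_header("Content-type")
same_headers = sorted(req_a.header_items()) == sorted(req_b.header_items())
print(content_type)
print(same_headers)
```

The capitalisation detail only matters when reading headers back from the Request object. HTTP header names are case-insensitive, so the server treats both spellings the same.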
Setting the Request Body
When sending a POST request, you’ll probably want to add data to the request body. This can be done in the Request object constructor, similar to setting the headers.
This time, though, you can’t just pass in a dictionary. The data needs to be encoded correctly, using a combination of parse.urlencode() and str.encode().
First, pass your dictionary to parse.urlencode().
body = parse.urlencode({
    'colour': 'brown',
    'size': 9,
    'material': 'leather'
})
parse.urlencode() converts the dictionary to a string containing key=value pairs, joined by & characters.
The resulting string then needs to be converted to bytes. Calling encode() with no arguments uses UTF-8.
body = body.encode()
This can then be passed to the Request constructor. urllib refers to this attribute as data.
req = Request(url, data=body)
When the data attribute is set to a value other than None (which is the default), the request type is automatically changed to POST.
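Putting those steps together, the following sketch builds the body and inspects the resulting Request without sending it. The URL and the extra search field are placeholders chosen for the example.

```python
from urllib import parse
from urllib.request import Request

url = "https://example.com/shoes"  # placeholder URL, nothing is sent

fields = {"colour": "brown", "size": 9, "material": "leather"}

encoded = parse.urlencode(fields)  # str of key=value pairs
body = encoded.encode("utf-8")     # bytes, as required for the data argument
req = Request(url, data=body)

print(encoded)           # colour=brown&size=9&material=leather
print(req.get_method())  # POST

# Reserved characters are escaped, so values may safely contain spaces or '&'.
escaped = parse.urlencode({"q": "a b&c"})
print(escaped)           # q=a+b%26c
```

Note that the integer 9 is converted to the string "9" during encoding, so the server receives every value as text.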
Setting the Method
As specified above, the method will default to GET if the data attribute is set to None, and POST otherwise.
It’s also possible to set it manually in the Request constructor, by adding a value for the method attribute. For example, the following will create a PUT request.
req = Request(url, data=body, method='PUT')
The method can be set to any HTTP method name, such as HEAD, PATCH or DELETE.
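The defaulting rules can be checked with Request.get_method(), which reports the method a request would use. The sketch below only constructs Request objects; the URL and body are placeholders.

```python
from urllib.request import Request

url = "https://example.com/shoes/1"  # placeholder URL, nothing is sent
body = b"colour=black"

default_without_data = Request(url).get_method()
default_with_data = Request(url, data=body).get_method()
explicit_put = Request(url, data=body, method="PUT").get_method()
explicit_delete = Request(url, method="DELETE").get_method()

print(default_without_data)  # GET
print(default_with_data)     # POST
print(explicit_put)          # PUT
print(explicit_delete)       # DELETE
```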
Parsing the Response
The simplest way to read the response from a urllib request is to use read().
req = Request(url, headers=headers, data=body)
with urlopen(req) as response:
    response_string = response.read()
The response body is returned as a bytes object. Call decode() on it if you need a string.
If you’re expecting a JSON response, you’ll need to use json.loads to decode it. Simply pass the result from response.read() to json.loads(). Be sure to add an import statement for loads to your code.
from json import loads
req = Request(url, headers=headers, data=body)
with urlopen(req) as response:
    json = loads(response.read())
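To try the JSON handling without a real API, the sketch below serves a small JSON document from a data: URL, which urllib resolves locally. The payload is invented for the example; with a real endpoint only the URL would change.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical JSON payload, wrapped in a data: URL so no network is needed.
payload = '{"colour": "brown", "size": 9}'
url = "data:application/json," + quote(payload)

with urlopen(url) as response:
    raw = response.read()  # bytes

text = raw.decode("utf-8")  # str, if you need the raw text
parsed = json.loads(raw)    # json.loads accepts bytes as well as str

print(type(raw).__name__)   # bytes
print(parsed["size"])       # 9
```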
Handling Errors
HTTPError is raised when a request returns an error response. The HTTPError contains the response code, a reason string, and the full response headers.
The example below checks the error reason attribute, and prints it.
from urllib import error
try:
    req = Request(url, headers=headers, data=body)
    with urlopen(req) as response:
        html = response.read()
except error.HTTPError as e:
    print(e.reason)
You should always catch this error, so that failed requests are handled properly.
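For a self-contained way to see this in action, the sketch below starts a throwaway local server that answers every request with 404, then fetches from it. The AlwaysNotFound handler and the fetch() helper are hypothetical names invented for this example. It also catches URLError, the parent class of HTTPError, which urllib raises for failures such as an unreachable host, so HTTPError has to be listed first.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error
from urllib.request import ProxyHandler, Request, build_opener, install_opener, urlopen


class AlwaysNotFound(BaseHTTPRequestHandler):
    """Hypothetical demo handler that answers every GET with a 404."""

    def do_GET(self):
        self.send_error(404, "Not Found")

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch(url):
    """Return (status, body) on success, or (code, reason) on failure."""
    try:
        with urlopen(Request(url), timeout=5) as response:
            return response.status, response.read()
    except error.HTTPError as e:  # the server replied with an error status
        return e.code, e.reason
    except error.URLError as e:   # no usable reply at all
        return None, e.reason


# Ignore any system proxy so the request reaches the local demo server.
# Note that install_opener() changes the opener used by every later urlopen() call.
install_opener(build_opener(ProxyHandler({})))

server = HTTPServer(("127.0.0.1", 0), AlwaysNotFound)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    port = server.server_address[1]
    result = fetch(f"http://127.0.0.1:{port}/missing")
finally:
    server.shutdown()
    server.server_close()

print(result)  # (404, 'Not Found')
```

Returning a tuple keeps the example short. In an application you might prefer to log the error and re-raise it, or map status codes to your own exception types.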