In my previous post I covered how to use the basehttp
Module. Now let's go to a higher level and see how to use urllib3. Then we will reach even higher horizons as we learn about inquiries. But first, a quick definition of urllib and urllib3.
The backstory
Once upon a time, when people rocked Python 2, there were these libraries called httplib and urllib2. Then Python 3 happened.
In Python 3, httplib has been redesigned to http.client, which you saw in Part 1, and urllib2 has been split into several submodules in a new module called urllib. urllib2 and urllib contained a high-level HTTP interface that didn't bother you with the details of http.client (formerly httplib). Aside from that, this new URLIB was missing a long list of critical features, such as:
- thread security
- connection pooling
- Client-side SSL/TLS inspection
- File uploads with multipart encoding
- Helpers for retrying requests and handling HTTP redirects
- Support for gzip and deflate encoding
- Proxy support for HTTP and SOCKS
To address these issues, urllib3 was created by the community. It's not a core Python module (and probably never will be), but it doesn't need to maintain compatibility with urllib.
urllib is not covered here because urllib3 can do almost everything it does and has some additional features, and the vast majority of programmers use urllib3 and queries.
Now that you know the difference between urllib and urllib3, here is an urllib example (the only one here) that uses thehttp.cookiejar.CookieJar
Class from Part 1:
>>> import urllib.request>>> import http.cookiejar>>> Politics = http.Cookie tins.DefaultCookiePolicy(... blocked_domains=["anzeigen.net", ".ads.net"])>>> cj = http.Cookie tins.CookieJar(Politics)>>> opener = screaming.Inquiry.build_opener(screaming.Inquiry.HTTPCookieProcessor(cj))>>> R = opener.open("http://beispiel.com")>>> Str(Typ(R))"<Klasse 'http.client.HTTPResponse'>"
Installation
Neither urllib3 nor requests are included in a standard Python installation (if your Python was packaged from a distribution, they may be present). So they need to be installed with pip.pip3 install 'urllib3[secure, socks]' 'requests[socks]'
should install it for you. Thesecure
part installs certificate-related packages required by urllib3 andParish
installs SOCKS protocol related packages.
urllib3
Of course you have to import it firstimport urllib3
, and for those of you reading Part 1, this is where it gets interesting. Instead of creating a connection directly, create onePoolManager
Object. This does the connection pooling and thread safety for you. There's one tooProxyManager
Object for forwarding requests via an HTTP/HTTPS proxy, and aSOCKSProxyManager
for SOCKS4 and SOCKS5 proxies. It looks like this:
>>> import urllib3>>> out of urllib3.post.socks import SOCKSProxyManager>>> Proxy = urllib3.ProxyManager('http://localhost:3128/')>>> Proxy.Inquiry('RECEIVE', 'http://google.com/')>>> Proxy = SOCKSProxyManager('socks5://localhost:8889/')
Note that HTTPS proxies cannot connect to HTTP websites.
urllib3 also has a logger that logs many messages. You can optimize verbosity by importing and invoking the Logger moduleLogging.getLogger("urllib3").setLevel(dein_Level)
.
As aHTTP connection
imhttp
module, urllib3 has aInquiry()
Method. It is called aspoolmanager.request('GET', 'http://httpbin.org/robots.txt')
. Similar tohttp
, this method also returns a class namedHTTPAntwort
.But don't be fooled!This is nohttp.client.HTTPResponse
. That is aurllib3.response.HTTPResponse
. The urllib3 version has some methods not defined inhttp
, and these will prove very useful and convenient.
As this explainsInquiry()
method returns aHTTPAntwort
Object. It has aData
Member representing the response content in a JSON string (encoded as UTF-8 bytes). To check it you can use:
import jsonpress(json.charges(Answer.Data.decode('utf-8'))
Creating a query parameter
A query parameter looks likehttp://httpbin.org/get?arg=value
. The easiest way to construct something like this is to have a string containing everything up to and including the question mark, and then pass the argument/value pairs to as a dictionaryurllib.parse.urlencode()
(And,screaming
) and concatenate this with your original string.
Here is a summary. Any parameter in this table that can be specified must be a dictionary. The response contains several JSON keys, some of which include:
parametersInquiry() | JSON key as response |
---|---|
N / A | "Origin" |
headers | "headers" |
Felder (HEAD/GET/DELETE) | "Arguments" |
codedURL Parameter (POST/PUT) | "Arguments" |
Felder (POST/PUT) | "form" |
codedBody mit Content-Type application/json inheaders | "json" |
'filefield': (filename, filedata, mime_type) InFelder Parameter | "files" |
binary dataBody with any content type inheaders Parameter | "Data" |
HTTPS in urllib3
There is additional boilerplate code that needs to be added to use certificates and hence HTTPS in aPoolManager
, but has the benefit of throwing an error if the connection can't be secured for some reason:
>>> import certificate>>> import urllib3>>> Swimming pool = urllib3.PoolManager(... cert_reqs='CERT_REQUIRED',... ca_certs=certificate.Wo())>>> Swimming pool.Inquiry('RECEIVE', 'https://google.com')(NO Exception)>>> Swimming pool.Inquiry('RECEIVE', 'https://abgelaufen.badssl.com')(throws urllib3.exceptions.SSLError)
Some extra goodies
Similar tohttp
,urllib3
Connections support request timeouts. For even more control, you can atime out
object to specify separate connection and read timeouts (all exceptions are obtained aturllib3.Exceptions
):
>>> Swimming pool.Inquiry(... 'RECEIVE', 'http://httpbin.org/delay/3', time out=2.5)MaxRetryError caused von ReadTimeoutError>>> Swimming pool.Inquiry(... 'RECEIVE',... 'http://httpbin.org/delay/3',... time out=urllib3.time out(connect=1.0))<urllib3.Answer.HTTPAntwort>>>> Swimming pool.Inquiry(... 'RECEIVE',... 'http://httpbin.org/delay/3',... time out=urllib3.time out(connect=1.0, read=2.0))MaxRetryError caused von ReadTimeoutError
Something, thathttp
does not have is repeating requests. urllib3 has this because it is a high-level library. It isDocumentationcouldn't explain it better:
urllib3 can automatically retry idempotent requests. The same mechanism also handles redirects. You can control the retries with the request() retries parameter. By default, urllib3 retries requests 3 times and follows up to 3 redirects.
To change the number of repetitions, simply specify an integer:
>>> Swimming pool.Requests('RECEIVE', 'http://httpbin.org/ip', repeated=10)
To disable all retries and redirection logic, specify retries=False :
>>> Swimming pool.Inquiry(... 'RECEIVE', 'http://nxdomain.example.com', repeated=INCORRECT)New connection error>>> R = Swimming pool.Inquiry(... 'RECEIVE', 'http://httpbin.org/redirect/1', repeated=INCORRECT)>>> R.Status302
To disable forwards but keep the retry logic, specify forward=False :
>>> R = Swimming pool.Inquiry(... 'RECEIVE', 'http://httpbin.org/redirect/1', redirect=INCORRECT)>>> R.Status302
Similar totime out
, there is one tooRepeat
-Object to set the maximum retries and redirects separately. It's done like this:retries=urllib3.Retry(3, redirect=2)
. The request is triggeredMaxRetryError
when too many requests are made.
Instead of passing aRepeat
Object for each request you can also specifyRepeat
object in thePoolManager
constructor so that it applies to all requests. The same applies totime out
.
Requests
Requests uses urllib3 under the hood and makes it even easier to make requests and get data. For one, keep-alive is 100% automatic, compared to urllib3 where it is not. It also has event hooks that invoke a callback function when an event is raised, e.g. B. when receiving a reply (but that is an advanced feature and is not covered here).
For requests, each request type has its own function. So instead of creating a connection or pool, you get (for example) a URL directly. Many of the keyword parameters used in urllib3 (see table above) can also be used identically for queries. See all exceptions belowRequests.Exceptions
.
import RequestsR = Requests.receive('https://httpbin.org/get')R = Requests.Post('https://httpbin.org/post', Data={'Taste':'Wert'})R = Requests.set('https://httpbin.org/put', Data={'Taste':'Wert'})R = Requests.extinguish('https://httpbin.org/delete')R = Requests.Kopf('https://httpbin.org/get')R = Requests.options('https://httpbin.org/get')# You can disable redirects if you wantR = Requests.options('https://httpbin.org/get', Allow_redirects=INCORRECT)# Or set a timeout for the number of seconds a server has to respondR = Requests.options('https://httpbin.org/get', time out=0,001)# Set connection and reading timeouts at the same timeR = Requests.options('https://httpbin.org/get', time out=(3.05, 27))# How to pass query parameters ('None' keys are not added to the request):R = Requests.receive('https://httpbin.org/get', Parameter={'key1': 'value1', 'key2': 'Wert2'})# If a key has a list value, add a key/value pair for each value in the list:R = Requests.receive('https://httpbin.org/get', Parameter={'key1': 'value1', 'key2': ['Wert2', 'Wert3']})# Headers can also be added:R = Requests.receive('https://httpbin.org/get', headers={'User-Agent': 'my-app/0.0.1'})# And only in requests (not urllib3) is there a cookies keyword argument.R = Requests.receive('https://httpbin.org/get', Cookies=Diktat(cookies_are='Work'))
The value returned by these calls is yet another type of response object. This time it's oneInquiries.Answer
(At least it wasn't anyone elseHTTPAntwort
🙂). This object contains a wealth of information, e.g. B. the time the request took, the JSON of the response, whether the page was redirected and even your ownCookieJar
Type. Here is a running list of the most useful members:
r.status_code
Andr.ground
: Numeric status code and human-readable reason.URL
: The canonical URL used in the request.Text
: The text retrieved from the request.Contents
: Die Bytes-Version vonText
.json()
: Attempts to return the JSON ofText
. ElevatedValueError
if this is not possible.coding
: If you know the correct encoding forText
, put it like this hereText
can read properly.apparent_encoding
: The encoding that requests guessed it.raise_for_status()
: ElevatedRequests.Exceptions.HTTPError
if the request encountered one.OK
: True ifStatuscode
is less than 400, otherwise False.is_redirect
Andis_permanent_redirect
: Whether the status code was a redirect or whether it was a permanent redirect.headers
: headers in the response.Cookies
: Cookies in the response.Story
: All response objects from redirected URLs to get to the current URL, sorted from oldest to newest.
Here's how you would save the response output to a file:
with open(file names, 'wb') if fd: for Piece In R.iter_content(chunk_size=128): fd.write(Piece)
Here's how to stream uploads without reading the entire file:
with open('solid body', 'rb') if F: Requests.Post('http://some.url/streamed', Data=F)
In the event of a network error, requests are triggeredconnection error
. When the request timeout expires, it is triggeredtime out
. And when too many redirects have been made, it increasesTooManyRedirects
.
Proxys
HTTP, HTTPS, and SOCKS proxies are supported. Inquiries is also sensitive to theHTTP-PROXY
AndHTTPS_PROXY
Environment variables, and when set, requests automatically use these values as proxies. Within Python, you can specify the proxies to use in the parameter:
# Instead of socks5 you could also use http and https.Proxys = { 'http': 'socks5://user:pass@host:port', 'https': 'socks5://user:pass@host:port'}Requests.receive('http://beispiel.org', Proxys=Proxys)
session objects
Ameeting
can persist cookies and some parameters across requests and reuse the underlying HTTP connection for the requests. It uses a urllib3PoolManager
, which significantly increases the performance of HTTP requests to the same host. It also has all the main request API methods (all the request methods you saw above). They can also be used as context managers:
with Requests.meeting() if S: S.receive('https://httpbin.org/cookies/set/sessioncookie/123456789')
And we're done
This concludes the Python HTTP series. Are there any mistakes here? Let me know so I can fix them.