Essential for web scraping: urllib & requests with Python (2023)

Web scraping offers people and businesses a way to understand what can be achieved with a reasonable amount of data. Good analysis of data scraped from the internet can help you challenge your competitors and even surpass them. From an individual point of view, if you are looking for a job, automated web scraping can gather every job posted on the web into a spreadsheet, where you can filter them based on your skills and experience. And if you currently spend hours hunting for the information you want, a web scraping script can replace all of those hours of manual labor.

There is so much information out there, and new data is generated every second, that manual scraping and research cannot keep up. That's why we need automated web scraping to achieve our goals.

Web scraping has become an essential tool for businesses, individuals, and even governments.


Challenges

There are also some challenges in the web scraping field, such as the constant change of websites: a scraper that works today may not work the next time we run it.

Another problem with web scraping is the diversity of all websites in design and coding structure, so we cannot use a single web scraping script everywhere to get results. There needs to be a continuous change of code as the website changes.


Today we are going to discuss some of the libraries that can cut down the time it takes to build your web scraper. They are essential for web scraping because they are the building blocks on which everything else rests.

Urllib

Urllib is a package that bundles several modules for preprocessing URLs. In simple terms, it is an HTTP client for the Python programming language. The latest version is urllib3 1.26.2, which supports thread-safe connection pooling, client-side SSL/TLS verification, multipart encoding, gzip support, and brotli encoding. It brings many important features that are missing from the traditional Python standard library.

Urllib3 is one of the most downloaded packages on PyPI, it is among the first things to run in a web scraping script, and it is available under the MIT license.
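To see a couple of those features in use, here is a minimal sketch of a pooled, retrying client; the retry counts and timeouts are arbitrary choices for illustration, not recommendations:

import urllib3
from urllib3.util.retry import Retry

# One PoolManager reuses connections across requests (connection pooling)
http = urllib3.PoolManager(
    retries=Retry(total=3, backoff_factor=0.5),      # retry transient failures
    timeout=urllib3.Timeout(connect=2.0, read=5.0),  # fail fast on slow hosts
)

r = http.request('GET', 'https://httpbin.org/gzip')  # gzip responses are decoded automatically
print(r.status, r.headers.get('Content-Type'))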


Using urllib.request, we can easily open and read URLs.

urllib.error defines the exceptions and errors raised by urllib.request.

urllib.parse is used to parse URLs.

urllib.robotparser is used for parsing robots.txt files.
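A short sketch that touches each of these submodules (the target site is just an example):

import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser

# urllib.parse: split a URL into its components
parts = urllib.parse.urlparse('https://analyticsindiamag.com/?s=Web+Scraping')
print(parts.netloc, parts.query)

# urllib.robotparser: check whether robots.txt allows fetching a path
rp = urllib.robotparser.RobotFileParser('https://analyticsindiamag.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://analyticsindiamag.com/'))

# urllib.request + urllib.error: open a URL and handle failures
try:
    with urllib.request.urlopen('https://analyticsindiamag.com/') as resp:
        print(resp.status)
except urllib.error.URLError as e:
    print('request failed:', e.reason)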

Installation

pip install urllib3

Alternatively, you can install it from the source code:

git clone git://github.com/urllib3/urllib3.git
python setup.py install

Quick Start

import urllib3

http = urllib3.PoolManager()
r = http.request('GET', 'http://httpbin.org/robots.txt')
r.status
r.data

Output

[Output screenshot]

Let's scrape a website using urllib and regular expressions

# 1. Import the required libraries
import urllib.request
import urllib.parse
import re

# 2. Search URL and values
url = 'https://analyticsindiamag.com/'
values = {'s': 'Web Scraping', 'submit': 'search'}

# 3. Parse, encode, and request
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
req = urllib.request.Request(url, data)
resp = urllib.request.urlopen(req)
respData = resp.read()

# 4. Extract with regular expressions
document = re.findall(r'<p>(.*?)</p>', str(respData))
for line in document:
    print(line)
[Output screenshot]

We can retrieve the data without using any other modules, only urllib and re (regular expressions).

Let's walk through the code above:

  1. First we import the required modules, i.e. re and urllib.
  2. We define a URL, i.e. Analytics India Magazine, and some test search values that we want to extract.
  3. In the first line we URL-encode the search values, and in the second line we encode the data into bytes so that it can be understood by the machine.
    • In the third line, we build a request with the data for the predefined URL.
    • Next, urlopen() is used to open the HTML document.
    • read() is used to read the contents of that document.
  4. We use the re module to look up values with regular expressions. In this case, our regular expression scrapes whatever data sits inside paragraph tags.

We can point the findall function at a span tag in the regular expression instead, to extract all the article titles as we did in the Beautiful Soup tutorial, but now with only the two lightweight modules urllib and re.
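As a sketch of that idea, assuming the titles sit inside <span> tags (the pattern would need adjusting to the site's actual markup):

import re
import urllib.request

resp = urllib.request.urlopen('https://analyticsindiamag.com/')
html = resp.read().decode('utf-8', errors='replace')

# Grab the text inside every <span>...</span>; a real page usually needs
# a more specific pattern (or an HTML parser) than this
titles = re.findall(r'<span[^>]*>(.*?)</span>', html)
for title in titles:
    print(title)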

Requests

Requests is an open-source Python library that makes HTTP requests more user-friendly and easier to work with. It is developed by Kenneth Reitz, Cory Benfield, Ian Stapleton Cordasco, and Nate Prewitt, with an initial release in February 2011.

The Requests library is Apache2-licensed and written in Python.


That sounds pretty much like urllib, so why do we need it?

Because Requests supports a fully RESTful API and is easier to use and more accessible.

Although the Requests library is powered by urllib3, it is still more widely used today because of its readability, its simple handling of POST/GET, and much more.

Also, the urllib API is thoroughly broken: it was built for a different time and web structure, and it requires more work than Requests for even the simplest task. So we need a more flexible HTTP client, i.e. Requests.
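As a rough comparison, here is the same GET request with a query string in both libraries (httpbin.org is just a convenient test host):

import urllib.parse
import urllib.request

# urllib: build and encode the query string yourself
params = urllib.parse.urlencode({'s': 'Web Scraping'})
with urllib.request.urlopen('https://httpbin.org/get?' + params) as resp:
    body = resp.read().decode('utf-8')

# Requests: pass a dict and let the library handle the encoding
import requests

resp = requests.get('https://httpbin.org/get', params={'s': 'Web Scraping'})
body = resp.text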

Advantages:

  • The Requests library is easy to use for retrieving information
  • It is widely used for scraping data from websites
  • It is also used for web API requests
  • With Requests we can GET, POST, PUT, and DELETE data at a specified URL
  • It has authentication module support
  • It handles cookies and sessions very reliably (see the sketch after this list)
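A minimal sketch of that session, authentication, and cookie support; the credentials and URLs are placeholders:

import requests

with requests.Session() as session:
    session.auth = ('user', 'passwd')  # HTTP basic auth sent on every request

    # Cookies set by the server are stored on the session...
    session.get('https://httpbin.org/cookies/set/token/abc123')

    # ...and sent back automatically on later requests
    resp = session.get('https://httpbin.org/cookies')
    print(resp.json())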

Features:

  • Support for international domains and URLs
  • SSL verification
  • JSON decoder
  • .netrc support
  • Multiple file uploads (sketched below)
  • Thread safety
  • Unicode response bodies
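A couple of these features in action; the file names are placeholders and httpbin.org is just a test host:

import requests

# SSL certificates are verified by default on https URLs
resp = requests.get('https://httpbin.org/get')

# Multiple file uploads: pass several name/file pairs
files = [
    ('files', open('report1.txt', 'rb')),
    ('files', open('report2.txt', 'rb')),
]
resp = requests.post('https://httpbin.org/post', files=files)
print(resp.json())  # the built-in JSON decoder again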

Installation

pip install requests

Making requests for the different HTTP methods is then a one-liner each:

import requests

resp = requests.get('http://www.yourwebsite.com/user')
resp = requests.post('http://www.yourwebsite.com/user')
resp = requests.put('http://www.yourwebsite.com/user/put')
resp = requests.delete('http://www.yourwebsite.com/user/delete')

There is no need to encode parameters as with urllib3; just pass a dictionary as an argument and you are good to go:

attributes = {"firstname": "John", "lastname": "Edison", "password": "jeddie123"}
resp = requests.post('http://www.yourwebsite.com/user', data=attributes)

It also has its own JSON decoder:

resp.json()

Or, if the response is text, use:

resp.text
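Putting those two together, a small sketch that picks the decoder based on the response's declared content type:

import requests

resp = requests.get('https://httpbin.org/json')

if 'application/json' in resp.headers.get('Content-Type', ''):
    print(resp.json())  # built-in JSON decoder
else:
    print(resp.text)    # Unicode text body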

Web Scraping with "Requests"

We use Requests and Beautiful Soup to fetch and find information, or we can use regular expressions as shown in the urllib demonstration above.

For this demonstration we use Requests with Beautiful Soup, and we scrape the articles from a website.

# 1. Import modules
import requests
from bs4 import BeautifulSoup

# 2. Request the page with .get()
res = requests.get('https://analyticsindiamag.com/')

# 3. Extract the data with Beautiful Soup
soup = BeautifulSoup(res.text, 'html.parser')
article_block = soup.find_all('div', class_='post-title')
for titles in article_block:
    title = titles.find('span').get_text()
    print(title)
[Output screenshot]

Explanation

  1. Import Requests and Beautiful Soup.
  2. requests.get() performs an HTTP request to the specified URL and returns the HTML data.
  3. Beautiful Soup parses that data with its HTML parser and then performs other operations, like find_all on the post-title class; see the Beautiful Soup documentation for more.
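The same pattern extends to other elements; for instance, a sketch that also collects each article's link, assuming an anchor tag sits inside the same post-title block:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://analyticsindiamag.com/')
soup = BeautifulSoup(res.text, 'html.parser')

for block in soup.find_all('div', class_='post-title'):
    link = block.find('a')  # first anchor inside the block
    if link is not None:
        print(link.get_text(strip=True), link.get('href'))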

A Requests use case other than web scraping

We can use the Requests module to query a web API and get responses, as in this case where we POST to the web API: https://loan5.herokuapp.com/api

This API is used to predict loan approval. It returns 1 or 0, i.e. approved or denied, when passed attributes such as gender, credit history, married, and so on.

# 1. Imports
import json
import requests

url = 'https://loan5.herokuapp.com/api'

# 2. Sample data
data = {'Gender': 1, 'Married': 1, 'Dependents': 2, 'Education': 0,
        'Self_Employed': 1, 'Credit_History': 0, 'Property_Area': 1, 'Income': 1}
data = json.dumps(data)

# 3. Send the data to the web API; it returns the answer
send_req = requests.post(url, data)
print(send_req.json())
[Output screenshot]

Conclusion

We learned how two Python modules, urllib and Requests, can help with web scraping from scratch. There are many ways to build your web scraper: in a previous article we used Selenium for web scraping, then we combined Selenium with Beautiful Soup, and now we have integrated the Requests module with Beautiful Soup instead of Selenium.

It all depends on the use case. If your scraper requires HTTP or web API communication, you should start by fetching your URLs with Requests; otherwise, for pages that must be interacted with in real time while scraping, use Selenium.
