Web scraping offers individuals and businesses a way to understand what can be achieved with a reasonable amount of data. You can challenge your competitors and surpass them simply by doing good analysis of, and research on, data scraped from the web. From an individual's point of view, if you are looking for a job, automated web scraping can help you pull every job published on the web into a spreadsheet, where you can filter postings by your skills and experience. And if you currently spend hours digging for the information you want, a web scraping script can do all of that manual labor for you.
There is so much information out there, and new data is generated every second, that manual scraping and research simply cannot keep up. That is why we need automated web scraping to achieve our goals.
Web scraping has become an essential tool for businesses, individuals, and even governments.
Challenges
There are also some challenges in the web scraping field, e.g. websites change constantly, so a scraper that works today may not work the next time we run it.
Another problem with web scraping is the diversity of websites in design and code structure, so we cannot use a single web scraping script everywhere and expect results. The code has to change continuously as the websites change.
Today we are going to discuss some of the libraries that can cut down the time it takes to build your web scraper. They are essential for web scraping because they are the building blocks on which everything else rests.
Urllib
Urllib is a package that combines several modules for working with URLs. In simple terms, it is an HTTP client for the Python programming language. The latest version of urllib3 is 1.26.2, which supports thread-safe connection pooling, client-side verification with SSL/TLS, multipart encoding, gzip support, and brotli encoding. It brings many important features that are missing from the traditional Python libraries.
Urllib3 is one of the most downloaded packages on PyPI and one of the first things to run in a web scraping script; it is available under the MIT license.
- urllib.request lets us easily open and read URLs.
- urllib.error defines the exceptions and errors thrown by urllib.request.
- urllib.parse is used to parse URLs.
- urllib.robotparser is used for parsing robots.txt files (a sketch using all four submodules follows below).
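Here is a minimal sketch of the four submodules working together; example.com is a placeholder domain, so swap in the site you actually want to scrape:

import urllib.request
import urllib.error
import urllib.parse
import urllib.robotparser

# urllib.parse: attach an encoded query string to a base URL
query = urllib.parse.urlencode({'s': 'web scraping'})
url = 'https://example.com/?' + query

# urllib.robotparser: check robots.txt before fetching
rp = urllib.robotparser.RobotFileParser('https://example.com/robots.txt')
rp.read()
print(rp.can_fetch('*', url))

# urllib.request + urllib.error: open the URL and handle failures
try:
    with urllib.request.urlopen(url) as resp:
        print(resp.status)
except urllib.error.URLError as e:
    print('Request failed:', e.reason)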
Installation
pip install urllib3
Alternatively, you can install it from the source code:
git clone git://github.com/urllib3/urllib3.git
python setup.py install
Quick Start
import urllib3
http = urllib3.PoolManager()
r = http.request('GET', 'http://httpbin.org/robots.txt')
r.status
r.data
Output
Let's scrape a website using urllib and regular expressions
#1 import the required libraries
import urllib.request
import urllib.parse
import re

#2 search
url = 'https://analyticsindiamag.com/'
values = {'s': 'Web Scraping', 'submit': 'search'}

#3 parse
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
req = urllib.request.Request(url, data)
resp = urllib.request.urlopen(req)
respData = resp.read()

#4 extract with regular expressions
document = re.findall(r'<p>(.*?)</p>', str(respData))
for line in document:
    print(line)
We can easily retrieve data without using any other module; only urllib and re (regular expressions) are needed.
Let's understand the code explained above:
- First we imported the required modules, i.e. re and urllib.
- We defined a URL, i.e. Analytics India Magazine, and some test search values that we want to extract.
- Then we URL-encoded the search values and encoded the resulting data so that the machine can understand it.
- Next, we built a request with that data for the predefined URL.
- urlopen() is used to open the HTML document.
- read() is used to read that document.
- Finally, the re module looks up values with a regular expression. In this case, our regular expression scrapes any data that sits inside paragraph tags.
We could instead point the regular expression in findall at the span tags to extract all the article titles, as we did in the Beautiful Soup tutorial, but now with the help of only the two lightest modules, urllib and re; a sketch follows below.
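Here is a rough sketch of that idea. The post-title div and the inner span are assumptions borrowed from the Beautiful Soup example later in this article, so adjust the pattern to the page's actual markup:

import re
import urllib.request

resp = urllib.request.urlopen('https://analyticsindiamag.com/')
html = resp.read().decode('utf-8', errors='ignore')

# capture the text of the <span> inside each post-title block;
# re.S lets '.' match across line breaks
titles = re.findall(r'<div class="post-title[^"]*">.*?<span[^>]*>(.*?)</span>', html, re.S)
for title in titles:
    print(title)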
Requests
Requests is an open-source Python library that makes HTTP requests more user-friendly and easier to use. It is developed by Kenneth Reitz, Cory Benfield, Ian Stapleton Cordasco, and Nate Prewitt, with an initial release in February 2011.
The Requests library is Apache2-licensed and written in Python.
That sounds pretty much like urllib, so why do we need it?
Because Requests supports a fully RESTful API and is easier to use and more accessible.
Although the Requests library is powered by urllib3, it is more widely used today because of its readability, its straightforward POST/GET handling, and much more.
Also, the urllib API is thoroughly broken; it was built for a different time and a different web, and it requires more work than Requests for even the simplest task. So we need a more flexible HTTP client, i.e. Requests.
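As a quick comparison, here is a sketch of the same form POST written first with urllib and then with Requests, reusing the URL and search field from the earlier example:

import urllib.parse
import urllib.request
import requests

# urllib: encode the form data by hand, build a Request, then open it
data = urllib.parse.urlencode({'s': 'Web Scraping'}).encode('utf-8')
req = urllib.request.Request('https://analyticsindiamag.com/', data)
with urllib.request.urlopen(req) as resp:
    body = resp.read().decode('utf-8')

# Requests: pass a plain dictionary and read the decoded text directly
resp = requests.post('https://analyticsindiamag.com/', data={'s': 'Web Scraping'})
body = resp.text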
Advantages:
- The Requests library makes it easy to retrieve information.
- It is widely used for scraping data from websites.
- It is also used for web API requests.
- With Requests we can GET, POST, PUT, and DELETE data for the specified URL.
- It has authentication module support.
- It handles cookies and sessions very reliably (see the sketch right after this list).
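To illustrate the last two points, here is a small sketch, using httpbin.org as a stand-in server, of HTTP basic authentication and of a Session object carrying cookies across requests:

import requests

# basic authentication: pass the credentials as a tuple
r = requests.get('https://httpbin.org/basic-auth/user/passwd', auth=('user', 'passwd'))
print(r.status_code)  # 200 when the credentials are accepted

# sessions: cookies set by one response are resent on later requests
s = requests.Session()
s.get('https://httpbin.org/cookies/set/sessioncookie/123456')
r = s.get('https://httpbin.org/cookies')
print(r.json())  # {'cookies': {'sessioncookie': '123456'}}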
Features:
- Support for international domains and URLs.
- SSL verification.
- JSON decoder.
- .netrc support.
- Multipart file uploads.
- Thread safety.
- Unicode response bodies.
Installation
pip install requests
Quick Start
import requests
resp = requests.get('http://www.yourwebsite.com/user')
resp = requests.post('http://www.yourwebsite.com/user')
resp = requests.put('http://www.yourwebsite.com/user/put')
resp = requests.delete('http://www.yourwebsite.com/user/delete')
There is no need to encode parameters as with urllib3; just pass a dictionary as an argument and you are good to go:
attributes = {"firstname": "John", "lastname": "Edison", "password": "jeddie123"}
resp = requests.post('http://www.yourwebsite.com/user', data=attributes)
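The same dictionary style works for query strings on GET requests; a small sketch, keeping the placeholder URL from above:

import requests

# pass the query string as a dictionary; Requests encodes it for you
resp = requests.get('http://www.yourwebsite.com/user', params={'firstname': 'John'})
print(resp.url)  # http://www.yourwebsite.com/user?firstname=John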
It also has its own JSON decoder:
resp.json()
Or, if the response is text, use:
resp.text
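A common pattern is to check the response's Content-Type header before deciding which accessor to use; a small sketch against httpbin.org:

import requests

resp = requests.get('https://httpbin.org/get')
if 'application/json' in resp.headers.get('Content-Type', ''):
    print(resp.json())  # body parsed into a Python dict
else:
    print(resp.text)    # decoded Unicode body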
Web Scraping with "Requests"
We use Requests together with Beautiful Soup to fetch and sift through information, or we can use regular expressions as in the urllib3 demonstration above.
For this demonstration we use Requests with Beautiful Soup, and we scrape the articles from a website.
#1 import modules
import requests
from bs4 import BeautifulSoup

#2 fetch the page with .get()
res = requests.get('https://analyticsindiamag.com/')

#3 extract the data with Beautiful Soup
soup = BeautifulSoup(res.text, 'html.parser')
article_block = soup.find_all('div', class_='post-title')
for titles in article_block:
    title = titles.find('span').get_text()
    print(title)
Explanation
- We imported requests and Beautiful Soup.
- requests.get() performs an HTTP request to the specified URL and returns the HTML data.
- Beautiful Soup parses this data with its HTML parser and then performs other operations, like find_all on the post-title class; see more about Beautiful Soup here.
A use case for Requests other than web scraping
We can use the Requests module to query our web API and get responses, as in this case where we POST to the web API: https://loan5.herokuapp.com/api
This API is used to predict loan approval. It returns 1 or 0, i.e. approved or denied, when passed attributes such as gender, credit history, married, etc.
#1 import modules
import json
import requests

url = 'https://loan5.herokuapp.com/api'

#2 sample data
data = {'Gender': 1, 'Married': 1, 'Dependents': 2, 'Education': 0,
        'Self_Employed': 1, 'Credit_History': 0, 'Property_Area': 1, 'Income': 1}
data = json.dumps(data)

#3 send a request with the data to the web API; it returns the answer
send_req = requests.post(url, data)
print(send_req.json())
Conclusion
We learned how two Python modules, urllib and Requests, can help with web scraping from scratch. There are many ways to build a web scraper: in a previous article we used Selenium for web scraping, we then combined Selenium with Beautiful Soup, and now we have integrated the Requests module with Beautiful Soup instead of Selenium.
It all depends on the use case. If your scraper only needs HTTP and web API communication, start fetching your URLs with Requests; if the pages have to be rendered and interacted with in real time while scraping, use Selenium.