Web Scraping in Python with Beautiful Soup and Requests
This tutorial is mainly based on the tutorial Build a Web Scraper with Python in 5 Minutes by Natassha Selvaraj, as well as on the Beautiful Soup documentation.
In this tutorial, you will learn how to:
- Scrape the web page “Quotes to Scrape” using Requests.
- Pull data out of HTML using Beautiful Soup.
- Use SelectorGadget to inspect the CSS of the web page.
- Store the scraped data in a pandas dataframe.
Prerequisites
To start this tutorial, you need:
- A basic understanding of HTML, CSS, and CSS selectors.
- Google’s web browser Chrome and the Chrome extension SelectorGadget.
- Familiarity with Chrome DevTools.
To learn more about HTML, CSS, Chrome DevTools and SelectorGadget, follow the instructions in this web scraping basics tutorial.
Setup
First of all, we use Anaconda to create a new environment called webscraping. We also install Python 3.11 and pip inside this new environment. Open your terminal (macOS) or your Anaconda Command Prompt (Windows) and enter:
conda create -n webscraping python=3.11 pip
Activate the environment:
conda activate webscraping
Let’s install some packages into our new environment:
pip install ipykernel jupyter pandas requests beautifulsoup4
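To verify that the packages are available in the new environment, you can run a quick import check from the same terminal (a minimal sketch; it prints nothing if everything is installed correctly):
python -c "import pandas, requests, bs4"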
If you are using Visual Studio Code, you first need to restart VS Code before you can select the new environment in your kernel picker.
Import the modules:
import pandas as pd
import requests
from bs4 import BeautifulSoup
Scrape the website with Requests
First, we use requests to scrape the website (using a GET request). requests.get() fetches all the content from a particular website and returns a response object (we call it html):
url = 'http://quotes.toscrape.com/'
html = requests.get(url)
- Check if the response was successful:
html
- Response 200 means that the request has succeeded.
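You can also check the status explicitly; a minimal sketch using the standard Requests API:
print(html.status_code)   # numeric status code, 200 means OK
html.raise_for_status()   # raises an HTTPError for 4xx/5xx responses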
Investigate HTML with Beautiful Soup
We can use the response object to access certain features such as content, text, headers, etc.
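For example (a short sketch; headers, encoding and content are standard attributes of a Requests response object):
print(html.headers['Content-Type'])  # media type reported by the server
print(html.encoding)                 # text encoding Requests uses to decode the body
print(type(html.content))            # the raw response body as bytes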
In our example, we only want to obtain text from the object. Therefore, we use html.text, which only returns the text of the response. Running html.text through Beautiful Soup using the html.parser gives us a Beautiful Soup object:
soup = BeautifulSoup(html.text, 'html.parser')
soup represents the document as a nested data structure:
print(soup.prettify())
Next, we take a look at some ways to navigate that data structure.
Get all text
- A common task is extracting all the text from a page (since the output is quite large, we don’t actually print the output of the following function):
# print(soup.get_text())
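If you need cleaner output, get_text() accepts optional arguments; the following sketch joins text fragments with spaces and strips leading and trailing whitespace (separator and strip are part of the Beautiful Soup API):
# print(soup.get_text(separator=' ', strip=True))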
Investigate title
- Print the complete HTML title:
soup.title
- Show name of the title tag:
soup.title.name
- Only print the text of the title:
soup.title.string
- Show the name of the parent tag of title:
soup.title.parent.name
Investigate hyperlinks
- Show the first hyperlink in the document:
soup.a
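Note that soup.a only returns the first <a> tag. To collect the URLs of all hyperlinks on the page, a common pattern from the Beautiful Soup documentation is to combine find_all (covered in more detail below) with the href attribute:
# Extract the href attribute of every <a> tag on the page
for link in soup.find_all('a'):
    print(link.get('href'))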
Investigate a text element
soup.span.text
Extract specific elements with find and find_all
Since there are many div tags in HTML, we can’t use the previous approaches to extract relevant information.
Instead, we need to use the find and find_all methods, which you can use to extract specific HTML tags from the web page. These methods retrieve all the elements on the page that match our specifications.
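In a nutshell, find returns only the first matching element, while find_all returns a list of all matches. A minimal sketch:
first_span = soup.find('span')      # first <span> element only
all_spans = soup.find_all('span')   # list of all <span> elements
print(len(all_spans))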
Let’s say our goal is to obtain all quotes, authors and tags from the website “Quotes to Scrape”.
We want to store all information in a pandas dataframe (every row should contain a quote as well as the corresponding author and tags).
First, we use SelectorGadget in Google Chrome to inspect the website.
Review the web scraping basics tutorial to learn how to inspect websites.
Extract all quotes
Task: Extract all quotes
- First, we use the div class “quote” to retrieve all relevant information regarding the quotes:
quotes = soup.find_all('div', {'class': 'quote'})
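As a quick sanity check (a sketch), you can confirm how many quote containers were captured; the front page of “Quotes to Scrape” shows ten quotes, so this should print 10:
print(len(quotes))  # number of quote containers found on the page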
- Next, we can iterate through our quotes object and extract the text of all quotes (the text of each quote is available in the <span> tag with class="text"):
for i in quotes:
    print(i.find('span', {'class': 'text'}).text)
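Building on this loop, the following sketch collects quotes, authors and tags into a pandas dataframe, as stated in our goal above. It assumes that the author name sits in a <small> tag with class "author" and that each tag sits in an <a> tag with class "tag", which you can verify with SelectorGadget:
rows = []
for i in quotes:
    rows.append({
        'quote': i.find('span', {'class': 'text'}).text,
        'author': i.find('small', {'class': 'author'}).text,           # assumes <small class="author">
        'tags': [t.text for t in i.find_all('a', {'class': 'tag'})],   # assumes <a class="tag">
    })
df = pd.DataFrame(rows)
df.head()
Every row of df now contains one quote together with its author and a list of tags.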