Web scraping is the process of automatically extracting data from websites. It involves fetching the content of web pages and then parsing that content to extract pieces of information such as text, links, and images. In this article, we will explore the Python libraries commonly used for web scraping.
Note: Web scraping may be illegal if used on websites that prohibit it, so make sure you have permission before proceeding. For learning purposes, I will scrape a website created specifically for practicing web scraping.
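One quick way to check a site's published crawling rules is to read its robots.txt file. The snippet below is a minimal sketch using Python's built-in urllib.robotparser; the '*' user agent and the URLs are only illustrative, and robots.txt is not a substitute for the site's terms of use.
# A minimal sketch: check a site's robots.txt before scraping (illustrative URLs)
from urllib import robotparser
robots = robotparser.RobotFileParser()
robots.set_url('https://www.scrapethissite.com/robots.txt')
robots.read()
# can_fetch() returns True if the given user agent may fetch the URL
print(robots.can_fetch('*', 'https://www.scrapethissite.com/pages/simple/'))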
In this example, we will extract the main heading (title), the description, and the list of countries shown on the webpage. Before writing the actual code, we need to understand the HTML structure of the web page, so use your browser's developer tools to inspect its elements.
Let’s scrape the website using Python by following the steps below. I am using a Jupyter Notebook for this example.
First, import the libraries required for web scraping.
# Import the requests and BeautifulSoup library
import requests
from bs4 import BeautifulSoup as bs
Get the response content from the URL using the requests library. The URL I have used here is intended for learning purposes.
# The URL of the webpage to be scraped
url = 'https://www.scrapethissite.com/pages/simple/'
response = requests.get(url)
# Access the raw content of the response (HTML of the webpage)
response.content
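Before parsing the content, it is worth confirming that the request actually succeeded. This is a small optional check using the requests library:
# Check the HTTP status code (200 means the request was successful)
print(response.status_code)
# Or raise an exception automatically for any 4xx/5xx response
response.raise_for_status()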
Now, parse the HTML content of the HTTP response using BeautifulSoup with the 'html.parser' parser. This creates a BeautifulSoup object that allows for easy navigation and extraction of data from the HTML structure.
responseHTML = bs(response.content, 'html.parser')
responseHTML
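The raw BeautifulSoup object prints as one long block of HTML. While exploring the page, you may find a pretty-printed version easier to read; this optional snippet is only for inspection:
# Print an indented version of the parsed HTML (first 1000 characters only)
print(responseHTML.prettify()[:1000])
# The parsed tree can also be navigated directly, e.g. the <title> tag
print(responseHTML.title.get_text(strip=True))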
Select all <div> elements with the class 'row' that are direct children of <div> elements with the class 'container', which are themselves direct children of the <section> element with the id 'countries'.
rowsHTML = responseHTML.select('section#countries > div.container > div.row')
rowsHTML
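Before indexing into 'rowsHTML', it can help to confirm how many rows the selector matched. A quick sanity check, assuming the page structure has not changed:
# Number of <div class="row"> elements matched by the selector
print(len(rowsHTML))
# Preview the first 60 characters of text in each row
for row in rowsHTML:
    print(row.get_text(strip=True)[:60])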
To find the heading, we need to find the first <h1> heading element within the first <div> in the rowsHTML list.
rowWithHeading = rowsHTML[0].find('h1').get_text(strip=True)
rowWithHeading
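As an alternative, the same heading can be fetched in one step with a CSS selector using select_one(). This is just another way to express the same lookup; the variable name 'headingText' is only illustrative.
# Equivalent lookup using a CSS selector; select_one() returns the first match
headingText = responseHTML.select_one('section#countries h1').get_text(strip=True)
headingText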
Now, let’s scrape the description. Select the second element in the list 'rowsHTML', which contains the description. Then find the first <p> tag within 'rowWithDescription' that has the class 'lead' and extract its text content, removing any leading or trailing whitespace.
rowWithDescription = rowsHTML[1]
descriptionText = rowWithDescription.find('p', {'class': 'lead'}).text.strip()
descriptionText
Now, find the rows that we need to iterate over to scrape the country data. Select a subset of elements from 'rowsHTML', starting from the fourth element onward, which contains the country data.
rowsContainingCountries = rowsHTML[3:]
rowsContainingCountries
# Iterate through each row in 'rowsContainingCountries'
for row in rowsContainingCountries:
    # Each row holds several country blocks in <div class="col-md-4 country"> elements
    countriesPerRow = row.find_all('div', {'class': 'col-md-4 country'})
    for countryDiv in countriesPerRow:
        # Print a separator line for better readability
        print('--------')
        # Extract and print the country name, which is inside an <h3> tag with the class 'country-name'
        print('Country Name: ' + countryDiv.find('h3', {'class': 'country-name'}).get_text(strip=True))
        # Extract and print the capital city, which is inside a <span> tag with the class 'country-capital'
        print('Capital: ' + countryDiv.find('span', {'class': 'country-capital'}).get_text(strip=True))
        # Extract and print the population, which is inside a <span> tag with the class 'country-population'
        print('Population: ' + countryDiv.find('span', {'class': 'country-population'}).get_text(strip=True))
        # Extract and print the area in square kilometers, which is inside a <span> tag with the class 'country-area'
        print('Area (Km Sq.): ' + countryDiv.find('span', {'class': 'country-area'}).get_text(strip=True))
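Printing the values is fine for a quick look, but in practice you usually want the scraped data in a structured form. The sketch below reuses the same tags and classes to collect each country into a dictionary and write the results to a CSV file; the 'countries.csv' filename is only an example.
# Collect the scraped values into a list of dictionaries
import csv
countries = []
for row in rowsContainingCountries:
    for countryDiv in row.find_all('div', {'class': 'col-md-4 country'}):
        countries.append({
            'name': countryDiv.find('h3', {'class': 'country-name'}).get_text(strip=True),
            'capital': countryDiv.find('span', {'class': 'country-capital'}).get_text(strip=True),
            'population': countryDiv.find('span', {'class': 'country-population'}).get_text(strip=True),
            'area_km_sq': countryDiv.find('span', {'class': 'country-area'}).get_text(strip=True),
        })
# Write the collected rows to a CSV file (the filename is just an example)
with open('countries.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'capital', 'population', 'area_km_sq'])
    writer.writeheader()
    writer.writerows(countries)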
For learning purposes, I have also attached an interactive Python notebook so you can quickly refer to it. I hope you enjoyed this article. Thank you.