Introduction
Data is the most valuable asset for any organization. It helps organizations learn about their operational activities, the needs of the market, and their competitors' publicly available data, which in turn helps them plan for the future. In this article, we are going to learn one of the most in-demand skills on the internet, one that helps many institutions take their business to the next level: collecting data from a webpage/website, known as "Web Scraping", using one of the most popular programming languages, Python.
Definition
- The process of extracting HTML data from a webpage/website.
- Transforming unstructured HTML data into structured data, such as an Excel sheet or a dataset.
- Let's study this concept with an example: extracting the names of the weblinks available on the home page of the www.c-sharpcorner.com website.
Step 1
To start with web scraping, we need two libraries: BeautifulSoup from the bs4 package and request from urllib. Import both of these Python packages.
from bs4 import BeautifulSoup
import urllib.request
Step 2
Select the URL to extract its HTML elements.
url = "https://www.c-sharpcorner.com"
Step 3
Access the content of this webpage and save its HTML in myUrl by using the urlopen() function from urllib.request.
myUrl = urllib.request.urlopen(url)
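Some sites reject requests that carry urllib's default User-Agent, and the network call itself can fail, so a more defensive fetch can help. The sketch below wraps urlopen() in a try/except and sends a browser-like User-Agent header; a data: URL stands in for a real site so the example runs offline (swap in the real URL in practice):

```python
import urllib.request
import urllib.error

# A data: URL stands in for a live site so this sketch runs offline;
# in practice use "https://www.c-sharpcorner.com" instead.
url = "data:text/html,<html><title>Demo</title></html>"

# Some servers block urllib's default User-Agent, so send a
# browser-like one (headers are simply ignored for data: URLs).
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

try:
    with urllib.request.urlopen(req) as response:
        html = response.read().decode("utf-8")
        print(html)
except urllib.error.URLError as err:
    print("Failed to fetch the page:", err)
```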
Step 4
Create a BeautifulSoup object to further extract the webpage's element data, using its various built-in functions.
soup = BeautifulSoup(myUrl, 'html.parser')

print(soup.title)              # the <title> tag
print(soup.title.name)         # the tag's name: 'title'
print(soup.title.string)       # the text inside the <title> tag
print(soup.title.parent.name)  # the name of the title tag's parent
print(soup.p)                  # the first <p> tag on the page
print(soup.prettify())         # the whole document, nicely indented
Step 5
Locate and scrape the services. Using the soup.find_all() function, extract specific HTML element tags from the entire webpage or from a specific portion of it.
We need to find the target HTML elements on this web page, extract them, and store them. Each element on a web page usually has a unique HTML "id" or "class" attribute. To check an element's id or class, inspect the element in the browser (right-click it and choose Inspect).
div_list = soup.find_all('div')
print(div_list)
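Since each element's id or class is what lets us narrow a search, here is a minimal sketch of passing attribute filters to find_all() and find(). The inline HTML snippet and its id/class values are made up for illustration:

```python
from bs4 import BeautifulSoup

# A tiny, hypothetical HTML sample to demonstrate filtering by
# class and id attributes instead of scanning every tag.
html = """
<div id="main"><p class="intro">Hello</p><p>World</p></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Filter by class attribute: only <p class="intro"> matches.
intro = soup.find_all('p', {'class': 'intro'})
print([p.get_text() for p in intro])   # ['Hello']

# Filter by id attribute: ids are unique, so find() is enough.
main_div = soup.find('div', {'id': 'main'})
print(main_div.get_text())             # 'HelloWorld'
```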
Step 6
On inspecting the web page to extract all the weblink names on the www.c-sharpcorner.com website, we located the ul tag with the class value 'headerMenu' as the parent node.
To extract all of its child nodes, which hold the weblink names we are after, we located the li tag as the target node.
weblinks = []

for i in soup.find_all('ul', {'class': 'headerMenu'}):
    for j in i.find_all('li'):
        per_service = j.find('a')
        print(per_service.get_text())
        weblinks.append(per_service.get_text())
Output of the above code:
TECHNOLOGIES
ANSWERS
LEARN
NEWS
BLOGS
VIDEOS
INTERVIEW PREP
BOOKS
EVENTS
CAREER
MEMBERS
JOBS
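The definition above mentions transforming scraped data into a structured form such as Excel or a dataset. As a minimal sketch, the weblinks list collected by the loop can be written to a CSV file with the standard csv module; the filename weblinks.csv is an arbitrary choice, and the list is shown here with a few values from the sample output so the sketch is self-contained:

```python
import csv

# The list collected by the scraping loop above; populated here with
# values from the sample output so this sketch runs on its own.
weblinks = ["TECHNOLOGIES", "ANSWERS", "LEARN", "NEWS", "BLOGS"]

# Write one link name per row, under a header, to a CSV file that
# Excel (or pandas) can open directly.
with open("weblinks.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["link_name"])
    for name in weblinks:
        writer.writerow([name])
```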
Summary
This article covered the basics of extracting HTML element data from any given URL.
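The steps above can be combined into one short, reusable function. This sketch also guards against list items that contain no anchor (j.find('a') returns None in that case, which the walkthrough's loop would crash on). The offline demo uses a tiny HTML snippet shaped like the site's menu; the 'headerMenu' class is only as current as this article, since the site's markup may change:

```python
from bs4 import BeautifulSoup

def extract_menu_links(html):
    """Collect the text of every <li><a> inside ul.headerMenu."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for ul in soup.find_all('ul', {'class': 'headerMenu'}):
        for li in ul.find_all('li'):
            anchor = li.find('a')
            if anchor is not None:   # skip list items without a link
                links.append(anchor.get_text())
    return links

# Offline demo with a tiny sample shaped like the site's menu.
# In practice, fetch the page first:
#   html = urllib.request.urlopen("https://www.c-sharpcorner.com").read()
sample = """
<ul class="headerMenu">
  <li><a href="/technologies">TECHNOLOGIES</a></li>
  <li><a href="/answers">ANSWERS</a></li>
</ul>
"""
print(extract_menu_links(sample))   # ['TECHNOLOGIES', 'ANSWERS']
```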