Introduction
Web scraping is one of the most widely used techniques for gathering data from the Internet. It is the practice of extracting the specific data you need from a webpage so that it can serve as input for further computation and analysis. In this article, we will learn how to collect email addresses published on any webpage. We use Python, one of the most popular programming languages, because its rich set of libraries makes this kind of extraction straightforward.
The following steps show how to find email addresses on any webpage.
Step 1
We need to import all the essential libraries for our program.
- BeautifulSoup: A Python library for extracting data out of HTML and XML files.
- requests: Lets us send HTTP requests from Python.
- urllib.parse: Provides functions for manipulating URLs and their component parts, either breaking them down or building them up.
- collections: Provides specialized container datatypes; we use its deque as the queue of URLs to crawl.
- re: The module for working with regular expressions, which we use to match email addresses.
-
- from bs4 import BeautifulSoup
- import requests
- import requests.exceptions
- from urllib.parse import urlsplit
- from collections import deque
- import re
Step 2
Choose the starting URL from which to extract email addresses. We place it in a deque so that additional URLs can be queued as the crawl proceeds.
-
- new_urls = deque(['https://www.gtu.ac.in/page.aspx?p=ContactUsA'])
Step 3
Each URL should be processed only once, so keep track of the URLs you have already handled.
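A minimal initialization sketch, consistent with the processed_urls.add(url) call in the crawl loop below:
-
- processed_urls = set()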
Step 4
While crawling, we may encounter more than one email address, so collect them in a set.
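Likewise, a one-line sketch matching the emails.update(...) call in Step 7:
-
- emails = set()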
Step 5
Time to start crawling. Take URLs from the queue one at a time, record each as processed, and fetch its page content. If an error is encountered, move on to the next URL. Note that the code in Steps 6 through 9 runs inside this while loop.
-
- while len(new_urls):
-     url = new_urls.popleft()
-     processed_urls.add(url)
-
-     print("Processing %s" % url)
-     try:
-         response = requests.get(url)
-     except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
-         continue
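If some pages respond slowly, the fetch can hang; requests.get also accepts a timeout argument (in seconds). A hedged variant of the same try/except, additionally catching the timeout exception:
-
- try:
-     response = requests.get(url, timeout=10)  # 10 s is an arbitrary example value
- except (requests.exceptions.MissingSchema,
-         requests.exceptions.ConnectionError,
-         requests.exceptions.Timeout):
-     continue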
Step 6
Now we need to extract the base parts of the current URL, which are essential for turning relative links found in the document into absolute ones:
-
- parts = urlsplit(url)
- base_url = "{0.scheme}://{0.netloc}".format(parts)
- path = url[:url.rfind('/')+1] if '/' in parts.path else url
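For the sample URL from Step 2, these lines evaluate as follows (shown here as comments for illustration):
-
- # parts    = urlsplit('https://www.gtu.ac.in/page.aspx?p=ContactUsA')
- # parts.scheme = 'https', parts.netloc = 'www.gtu.ac.in', parts.path = '/page.aspx'
- # base_url = 'https://www.gtu.ac.in'
- # path     = 'https://www.gtu.ac.in/'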
Step 7
From the page content, extract any email addresses and add them to the emails set.
-
- new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
- emails.update(new_emails)
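To see what the pattern matches, here is a quick, self-contained check against a made-up string (the address is purely illustrative):
-
- sample = "You can write to info@example.com for details."
- print(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", sample, re.I))
- # ['info@example.com']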
Step 8
Once the current page has been processed, it's time to search for links to other pages and add them to the URL queue (that's the magic of crawling). Create a BeautifulSoup object for parsing the HTML page.
-
- soup = BeautifulSoup(response.text, "html.parser")
Step 9
The soup object contains the parsed HTML. Now find all the anchor tags, read their href attributes, resolve relative links into absolute ones, and queue any URL that has not already been seen.
-
- for anchor in soup.find_all("a"):
-     link = anchor.attrs["href"] if "href" in anchor.attrs else ''
-
-     if link.startswith('/'):
-         link = base_url + link
-     elif not link.startswith('http'):
-         link = path + link
-
-     if link not in new_urls and link not in processed_urls:
-         new_urls.append(link)
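With the base_url and path computed in Step 6, the resolution behaves like this (the href values below are hypothetical examples):
-
- # '/contact.aspx'         -> 'https://www.gtu.ac.in/contact.aspx'     (starts with '/')
- # 'admissions.aspx'       -> 'https://www.gtu.ac.in/admissions.aspx'  (relative, joined to path)
- # 'https://example.com/x' -> unchanged                                (already absolute)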
Step 10
Finally, print all the email addresses extracted during the crawl. This loop runs after the while loop has finished.
- for email in emails:
- print(email)
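Putting it all together, here is the full script assembled from the steps above, with the loop structure made explicit. It is the same code, shown in one place so the indentation is unambiguous:
-
- from bs4 import BeautifulSoup
- import requests
- import requests.exceptions
- from urllib.parse import urlsplit
- from collections import deque
- import re
-
- new_urls = deque(['https://www.gtu.ac.in/page.aspx?p=ContactUsA'])
- processed_urls = set()
- emails = set()
-
- while len(new_urls):
-     url = new_urls.popleft()
-     processed_urls.add(url)
-     print("Processing %s" % url)
-     try:
-         response = requests.get(url)
-     except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
-         continue
-
-     parts = urlsplit(url)
-     base_url = "{0.scheme}://{0.netloc}".format(parts)
-     path = url[:url.rfind('/')+1] if '/' in parts.path else url
-
-     new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
-     emails.update(new_emails)
-
-     soup = BeautifulSoup(response.text, "html.parser")
-     for anchor in soup.find_all("a"):
-         link = anchor.attrs["href"] if "href" in anchor.attrs else ''
-         if link.startswith('/'):
-             link = base_url + link
-         elif not link.startswith('http'):
-             link = path + link
-         if link not in new_urls and link not in processed_urls:
-             new_urls.append(link)
-
- for email in emails:
-     print(email)
Note that, as written, the crawl stops only when the URL queue is empty, so on a large site it can run for a long time; for a real run you may want to cap the number of pages processed.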
Summary
This article showed how to perform web scraping, in particular how to extract email addresses from an HTML page using Python packages such as BeautifulSoup, requests, re, collections, and urllib.parse.