Introduction
Web scraping is one of the most widely used techniques for gathering data from the Internet. It is the practice of extracting the specific data you need from a webpage so that it can serve as input for further computation and analysis. In this article, we will learn how to collect email addresses published on any webpage. We use Python, one of the most popular programming languages, because its rich set of libraries makes this kind of extraction straightforward.
The following steps show how to find email addresses on any webpage.
Step 1
We need to import all the essential libraries for our program.
- BeautifulSoup: A Python library for extracting data out of HTML and XML files.
- requests: Lets us send HTTP requests from Python.
- urllib.parse: Provides functions for manipulating URLs and their component parts, either breaking them down or building them up.
- collections: Provides specialized container datatypes; we use its deque as the queue of URLs to crawl.
- re: The module for working with regular expressions, which we use to match email addresses.
-
- from bs4 import BeautifulSoup
- import requests
- import requests.exceptions
- from urllib.parse import urlsplit
- from collections import deque
- import re
Step 2
Choose the starting URL from which to extract email addresses. We place it in a deque so that additional URLs can be queued as the crawl proceeds.
-
- new_urls = deque(['https://www.gtu.ac.in/page.aspx?p=ContactUsA'])
Step 3
Each URL should be processed only once, so keep track of the URLs you have already handled.
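A minimal initialization sketch, consistent with the processed_urls.add(url) call in the crawl loop below:
-
- processed_urls = set()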
Step 4
While crawling, we may encounter more than one email address, so collect them in a set.
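Likewise, a one-line sketch matching the emails.update(...) call in Step 7:
-
- emails = set()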
Step 5
Time to start crawling. Take URLs from the queue one at a time, record each as processed, and fetch its page content. If an error is encountered, move on to the next URL. Note that the code in Steps 6 through 9 runs inside this while loop.
-
- while len(new_urls):
-     url = new_urls.popleft()
-     processed_urls.add(url)
-
-     print("Processing %s" % url)
-     try:
-         response = requests.get(url)
-     except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
-         continue
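If some pages respond slowly, the fetch can hang; requests.get also accepts a timeout argument (in seconds). A hedged variant of the same try/except, additionally catching the timeout exception:
-
- try:
-     response = requests.get(url, timeout=10)  # 10 s is an arbitrary example value
- except (requests.exceptions.MissingSchema,
-         requests.exceptions.ConnectionError,
-         requests.exceptions.Timeout):
-     continue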
Step 6
Now we need to extract the base parts of the current URL, which are essential for turning relative links found in the document into absolute ones:
-
- parts = urlsplit(url)
- base_url = "{0.scheme}://{0.netloc}".format(parts)
- path = url[:url.rfind('/')+1] if '/' in parts.path else url
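For the sample URL from Step 2, these lines evaluate as follows (shown here as comments for illustration):
-
- # parts    = urlsplit('https://www.gtu.ac.in/page.aspx?p=ContactUsA')
- # parts.scheme = 'https', parts.netloc = 'www.gtu.ac.in', parts.path = '/page.aspx'
- # base_url = 'https://www.gtu.ac.in'
- # path     = 'https://www.gtu.ac.in/'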
Step 7
From the page content, extract any email addresses and add them to the emails set.
-
- new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
- emails.update(new_emails)
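To see what the pattern matches, here is a quick, self-contained check against a made-up string (the address is purely illustrative):
-
- sample = "You can write to info@example.com for details."
- print(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", sample, re.I))
- # ['info@example.com']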
Step 8
Once the current page has been processed, it's time to search for links to other pages and add them to the URL queue (that's the magic of crawling). Create a BeautifulSoup object for parsing the HTML page.
-
- soup = BeautifulSoup(response.text, "html.parser")
Step 9
The soup object contains the parsed HTML. Now find all the anchor tags, read their href attributes, resolve relative links into absolute ones, and queue any URL that has not already been seen.
-
- for anchor in soup.find_all("a"):
-     link = anchor.attrs["href"] if "href" in anchor.attrs else ''
-
-     if link.startswith('/'):
-         link = base_url + link
-     elif not link.startswith('http'):
-         link = path + link
-
-     if link not in new_urls and link not in processed_urls:
-         new_urls.append(link)
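With the base_url and path computed in Step 6, the resolution behaves like this (the href values below are hypothetical examples):
-
- # '/contact.aspx'         -> 'https://www.gtu.ac.in/contact.aspx'     (starts with '/')
- # 'admissions.aspx'       -> 'https://www.gtu.ac.in/admissions.aspx'  (relative, joined to path)
- # 'https://example.com/x' -> unchanged                                (already absolute)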
Step 10
Finally, print all the email addresses extracted during the crawl. This loop runs after the while loop has finished.
- for email in emails:
- print(email)
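Putting it all together, here is the full script assembled from the steps above, with the loop structure made explicit. It is the same code, shown in one place so the indentation is unambiguous:
-
- from bs4 import BeautifulSoup
- import requests
- import requests.exceptions
- from urllib.parse import urlsplit
- from collections import deque
- import re
-
- new_urls = deque(['https://www.gtu.ac.in/page.aspx?p=ContactUsA'])
- processed_urls = set()
- emails = set()
-
- while len(new_urls):
-     url = new_urls.popleft()
-     processed_urls.add(url)
-     print("Processing %s" % url)
-     try:
-         response = requests.get(url)
-     except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
-         continue
-
-     parts = urlsplit(url)
-     base_url = "{0.scheme}://{0.netloc}".format(parts)
-     path = url[:url.rfind('/')+1] if '/' in parts.path else url
-
-     new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
-     emails.update(new_emails)
-
-     soup = BeautifulSoup(response.text, "html.parser")
-     for anchor in soup.find_all("a"):
-         link = anchor.attrs["href"] if "href" in anchor.attrs else ''
-         if link.startswith('/'):
-             link = base_url + link
-         elif not link.startswith('http'):
-             link = path + link
-         if link not in new_urls and link not in processed_urls:
-             new_urls.append(link)
-
- for email in emails:
-     print(email)
Note that, as written, the crawl stops only when the URL queue is empty, so on a large site it can run for a long time; for a real run you may want to cap the number of pages processed.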
Summary
This article showed how to perform web scraping, in particular how to extract email addresses from an HTML page using Python packages such as BeautifulSoup, requests, re, collections, and urllib.parse.