This article is part of a 30-day Python challenge series. You can find the links to all the previous posts of the series here.
Web scraping is the technique of extracting data from a website by crawling it. It is mainly used to collect meaningful data from websites, especially when no APIs are available for extracting the information. Today I explored the basics of web scraping with Python and would like to share my experience.
Scraping is a form of scripting that lets us automate the extraction of large amounts of unstructured data from websites and organize it in a structured way for purposes such as gathering emails, product prices, stock prices, flight data or any other relevant information. Doing this manually takes a lot of time and effort. Python has some amazing libraries that make web scraping an easy and fun task. I mainly explored the most basic and popular library, Beautiful Soup, to familiarize myself with the concept.
Good Practices
Web scraping is extremely powerful, and there is a lot of debate over its uses. Most websites have a robots.txt file which specifies which URLs may be crawled (scraped) and which should not be. This file mainly serves as an instruction to search engine bots such as Googlebot, Bingbot and Yahoo's crawler, telling them which pages they should crawl and index. In that sense, all search engine crawlers are themselves web scrapers that extract data from websites in order to rank them against relevant keywords.
However, robots.txt cannot actually stop a web scraping program from crawling data that it disallows. It is a good and ethical practice to go through a website's robots.txt file, if present, and extract data only from the URLs it permits, to avoid any issues.
Scraping using Beautiful Soup
For today's session, I decided to try extracting data from the Hacker News website, an extremely popular site in the dev community. These are the rules defined in its robots.txt file:
```
User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30
```
So we are allowed to crawl and fetch data from the news page https://news.ycombinator.com/newest, which lists the latest articles from the development world. The goal is to crawl the first 5 pages and extract the articles with at least 100 points along with their links. This can be pretty useful for automatically fetching all the highly voted items and reading them from the terminal itself, without having to visit the Hacker News website and manually search for popular posts.
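As a side note, these rules can also be checked programmatically with Python's built-in urllib.robotparser module before crawling. A minimal sketch (the URLs are only examples based on the rules listed above):

```python
from urllib import robotparser

# Load Hacker News' robots.txt
rp = robotparser.RobotFileParser()
rp.set_url('https://news.ycombinator.com/robots.txt')
rp.read()

# '*' means "any user agent"; these checks follow the rules listed above
print(rp.can_fetch('*', 'https://news.ycombinator.com/newest'))         # True - not disallowed
print(rp.can_fetch('*', 'https://news.ycombinator.com/threads?id=pg'))  # False - /threads? is disallowed
print(rp.crawl_delay('*'))                                              # 30
```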
First, two libraries need to be installed: requests for making HTTP requests and beautifulsoup4 for scraping the website.
```
pip install requests
pip install beautifulsoup4
```
hacker_news_scraper.py
```python
import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://news.ycombinator.com'
response = requests.get(BASE_URL)

# Raw HTML text of the page
response_text = response.text

# Parse the HTML so that elements can be selected from it
soup = BeautifulSoup(response_text, 'html.parser')
print(soup.prettify())
```
The Beautiful Soup documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) showcases its various use cases. Using the browser's inspect element tool, the selectors for the required elements can be found and then used to extract the data. In this case, all the article titles have the storylink class and their associated points have the score class. These selectors can now be used to fetch the respective data and combine them.
```python
# Select all article title links and all score elements on the page
links_list = soup.select('.storylink')
points_list = soup.select('.score')
```
While looping through the links, the title, link and points of each post can be combined into a dictionary object and appended to a popular posts list. Note that the enumerate function is used to get the index of each link, which is needed to look up the corresponding points, as the points are not contained within the link elements. Only posts with a minimum of 100 points are appended to the popular posts list.
```python
popular_posts = []

for idx, link in enumerate(links_list):
    # Title and URL of the post
    post_title = link.get_text()
    post_href = link.get('href')

    # Points are stored in a separate list, matched by index
    post_points = int(points_list[idx].get_text().replace(' points', ''))

    # Keep only posts with at least 100 points
    if post_points >= 100:
        popular_posts.append(
            {'title': post_title, 'link': post_href, 'points': post_points})
```
Python has a useful built-in library, pprint, that prints data in the console in a more readable format. It can be used to view the popular posts list:

```python
import pprint

pprint.pprint(popular_posts)
```
The above script only fetches the popular posts from the first page of Hacker News. However, as per the desired goal, we need to fetch the posts from the top five pages (or any given number of pages), so the script needs to be modified accordingly.
Here is the final script to scrape the popular posts. The code can also be found in the GitHub repository.
```python
import requests
from bs4 import BeautifulSoup
import pprint
import time

BASE_URL = 'https://news.ycombinator.com'


def get_lists_and_points(soup):
    # Select all article title links and their associated score elements
    links_list = soup.select('.storylink')
    points_list = soup.select('.score')
    return (links_list, points_list)


def parse_response(response):
    # Parse the raw HTML of the response with Beautiful Soup
    response_text = response.text
    soup = BeautifulSoup(response_text, 'html.parser')
    return soup


def get_paginated_data(pages):
    total_links_list = []
    total_points_list = []
    for page in range(pages):
        URL = BASE_URL + f'?p={page + 1}'
        response = requests.get(URL)
        soup = parse_response(response)
        links_list, points_list = get_lists_and_points(soup)
        for link in links_list:
            total_links_list.append(link)
        for point in points_list:
            total_points_list.append(point)
        # Respect the 30 second crawl delay specified in robots.txt
        time.sleep(30)
    return (total_links_list, total_points_list)


def generate_popular_posts(links_list, points_list):
    # Create an empty popular posts list
    popular_posts = []

    # Loop through all links and collect the title, URL and points of each post
    for idx, link in enumerate(links_list):
        post_title = link.get_text()
        post_href = link.get('href')

        # Some posts (e.g. job listings) have no score, so fall back to 0
        try:
            post_points = int(
                points_list[idx].get_text().replace(' points', ''))
        except (IndexError, ValueError):
            post_points = 0

        # Keep only posts with at least 100 points
        if post_points >= 100:
            popular_posts.append(
                {'title': post_title, 'link': post_href, 'points': post_points})
    return popular_posts


def sort_posts_by_points(posts):
    # Sort the posts in descending order of points
    return sorted(posts, key=lambda x: x['points'], reverse=True)


def main():
    total_links_list, total_points_list = get_paginated_data(5)
    popular_posts = generate_popular_posts(total_links_list, total_points_list)
    sorted_posts = sort_posts_by_points(popular_posts)

    pprint.pprint(sorted_posts)


if __name__ == '__main__':
    main()
```
Now, using this script, we don't even need to visit Hacker News and search for popular news; we can run it from the console and get the latest news delivered there. Feel free to tweak the script as per your needs, experiment with it, or try scraping data from your own favourite website.
We can do a lot of things with the above data, such as:
- Create an API to use it for an app or website (a small sketch follows this list)
- Use it for analysing trends using keywords
- Create a news aggregator website and more
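As an illustration of the first idea, here is a hypothetical sketch of a tiny JSON API built with Flask. Flask is not part of the original script and would need to be installed separately, and the /popular route name is made up; the sketch simply imports the functions from hacker_news_scraper.py and serves the scraped posts:

```python
# Hypothetical sketch: expose the scraped posts as a small JSON API.
# Assumes Flask is installed (pip install flask) and that this file sits
# next to hacker_news_scraper.py from the script above.
from flask import Flask, jsonify

from hacker_news_scraper import (get_paginated_data, generate_popular_posts,
                                 sort_posts_by_points)

app = Flask(__name__)


@app.route('/popular')
def popular():
    # Scrape two pages per request; real code should cache the result
    links, points = get_paginated_data(2)
    posts = sort_posts_by_points(generate_popular_posts(links, points))
    return jsonify(posts)


if __name__ == '__main__':
    app.run(debug=True)
```

In practice the results should be cached, since the scraper sleeps 30 seconds per page to respect the crawl delay and would otherwise make each request very slow.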
Popular Scraping Libraries
Beautiful Soup has its limitations when it comes to scraping data from websites. It is quite simple to use, but for scraping data from complex websites that are rendered on the client side (for example Angular or React based websites), the HTML markup won't be available when the page first loads. To fetch data from such websites, more advanced libraries and browser automation tools can be used. Here are some popular libraries and frameworks for Python (a short Selenium sketch follows the list):
- lxml
- Selenium
- Scrapy - a complete framework for web scraping
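As a quick illustration of the browser automation approach, here is a minimal Selenium sketch. It assumes Selenium is installed (pip install selenium) and a Chrome driver is available on the system; the URL is just an example. The browser renders the page, including any client-side JavaScript, and the resulting HTML is then handed to Beautiful Soup:

```python
from bs4 import BeautifulSoup
from selenium import webdriver

# Launch a real Chrome browser (requires a Chrome driver on the system)
driver = webdriver.Chrome()
driver.get('https://news.ycombinator.com/newest')  # example URL

# page_source holds the HTML after the browser has rendered the page
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title.get_text())

driver.quit()
```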
References
- https://realpython.com/beautiful-soup-web-scraper-python/
- https://realpython.com/python-web-scraping-practical-introduction/
- https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3
Web scraping is a vast field, and with Beautiful Soup we have probably just scratched the surface. There are a whole lot of possibilities in this domain, which I will dig into while learning more about data analysis with Python. Hopefully, I have been able to cover the basic concepts needed for further exploration.
Tomorrow I shall be going over the concepts of Web Development with Python.
Have a great one!