30 Days Of Python πŸ‘¨β€πŸ’» - Day 23 - Web Scraping

This article is part of a 30-day Python challenge series. You can find the links to all the previous posts of this series here.
Web Scraping is the technique of extracting data from a website by crawling it. It is mainly used to collect meaningful data from websites, especially when there is no API available for extracting the information. Today I explored the basics of web scraping with Python and would like to share my experience.
 
Scraping is a form of scripting that allows us to automate the extraction of large amounts of unstructured data from websites and organize it in a structured way for purposes such as gathering emails, product prices, stock prices, flight data or any other relevant information. Doing such things manually takes a lot of time and effort. Python has some amazing libraries that make web scraping an easier and more enjoyable task. I mainly explored the most basic and popular library, Beautiful Soup, to familiarize myself with the concept.
 

Good Practices

 
Web Scraping is extremely powerful and there is a lot of debate over its uses. Most websites have a robots.txt file which specifies which URLs may be crawled (scraped) and which may not. This file is mainly an instruction to search engine bots such as Googlebot and Bingbot about which pages they should or should not crawl for indexing. In that sense, all search engine crawlers are web scrapers that extract data from websites in order to rank them against relevant keywords.
 
However, a website cannot technically prevent a scraping program from crawling its data, even if a URL is disallowed in the robots.txt file. It is good and ethical practice to go over a website’s robots.txt file, if present, and extract data only from the URLs it allows, to avoid any legal or ethical issues.
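To make this practical, Python’s built-in urllib.robotparser module can check a site’s robots.txt rules programmatically before any crawling is done. Here is a minimal sketch, using the Hacker News robots.txt (discussed in the next section) purely as an example:

    from urllib import robotparser

    # download and parse the site's robots.txt file
    parser = robotparser.RobotFileParser()
    parser.set_url('https://news.ycombinator.com/robots.txt')
    parser.read()

    # check whether a generic crawler ('*') may fetch specific URLs
    print(parser.can_fetch('*', 'https://news.ycombinator.com/newest'))         # expected: True (allowed)
    print(parser.can_fetch('*', 'https://news.ycombinator.com/threads?id=pg'))  # expected: False (disallowed)

    # honour the Crawl-delay directive if one is present
    print(parser.crawl_delay('*'))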
 

Scraping using Beautiful Soup

 
For today’s session, I decided to try extracting data from the Hacker News website, which is extremely popular in the dev community. These are the rules defined in its robots.txt file:
    User-Agent: *
    Disallow: /x?
    Disallow: /vote?
    Disallow: /reply?
    Disallow: /submitted?
    Disallow: /submitlink?
    Disallow: /threads?
    Crawl-delay: 30
So we are allowed to crawl and fetch data from the news page https://news.ycombinator.com/newest, which lists the latest articles from the development world. The goal is to crawl the first 5 pages and extract the articles with at least 100 points along with their links. This can be pretty useful for automatically fetching all the highly voted items and reading them right from the terminal, without having to visit the Hacker News website and manually look for popular posts.
 
First, two libraries need to be installed: requests for making HTTP requests and beautifulsoup4 for scraping the website.
 
    pip install requests
    pip install beautifulsoup4
 
hacker_news_scraper.py
    import requests
    from bs4 import BeautifulSoup

    BASE_URL = 'https://news.ycombinator.com'
    response = requests.get(BASE_URL)

    # extract the text content of the web page
    response_text = response.text
    # parse HTML
    soup = BeautifulSoup(response_text, 'html.parser')
    print(soup.prettify())  # prints the HTML content in a readable format
The documentation for Beautiful Soup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) showcases the various use cases. Using the browser’s inspect element tools, the selectors for the required elements can be identified and then used to extract the data. In this case, all the article links have the class storylink and their associated points have the class score. These selectors can now be used to fetch the respective data and combine them.
    # extract all the links using the class selector
    links_list = soup.select('.storylink')

    # extract all the points using the class selector
    points_list = soup.select('.score')
While looping through the links, the associated title, link and points can be combined into a dictionary object and appended to a list of popular posts.
 
Note that the enumerate function is used to get the index of each link, which is then used to fetch the corresponding points, since the points are not contained within the link’s container.
 
Only posts with a minimum of 100 points are appended to the popular posts list.
    # create an empty popular posts list
    popular_posts = []

    # loop through all links
    for idx, link in enumerate(links_list):
        # fetch the title of the post
        post_title = link.get_text()
        # fetch the link of the post
        post_href = link.get('href')
        # fetch the points text using the index of the link
        # and convert it to an integer
        post_points = int(points_list[idx].get_text().replace(' points', ''))
        # append to popular posts as a dictionary object if points is at least 100
        if post_points >= 100:
            popular_posts.append(
                {'title': post_title, 'link': post_href, 'points': post_points})
There is a useful built-in Python library, pprint, that prints data to the console in a more readable format.
    import pprint
It can then be used to view the popular posts list:

    # print the popular posts in a readable format
    pprint.pprint(popular_posts)
The above script only fetches the popular posts from the first page of Hacker News. However, as per the desired goal, we need to fetch the posts from the first five pages, or any given number of pages. So the script can be modified accordingly.
 
Here is the final script to scrape the popular posts. The code can also be found in the GitHub repository.
    import requests
    from bs4 import BeautifulSoup
    import pprint
    import time

    BASE_URL = 'https://news.ycombinator.com'


    def get_lists_and_points(soup):
        # extract all the links using the class selector
        links_list = soup.select('.storylink')

        # extract all the points using the class selector
        points_list = soup.select('.score')

        return (links_list, points_list)


    def parse_response(response):
        # extract the text content of the web page
        response_text = response.text
        # parse HTML
        soup = BeautifulSoup(response_text, 'html.parser')
        return soup


    def get_paginated_data(pages):
        total_links_list = []
        total_points_list = []
        for page in range(pages):
            URL = BASE_URL + f'?p={page+1}'
            response = requests.get(URL)
            soup = parse_response(response)
            links_list, points_list = get_lists_and_points(soup)
            for link in links_list:
                total_links_list.append(link)
            for point in points_list:
                total_points_list.append(point)
            # add a 30 second delay as per the Hacker News robots.txt rules
            time.sleep(30)
        return (total_links_list, total_points_list)


    def generate_popular_posts(links_list, points_list):
        # create an empty popular posts list
        popular_posts = []

        # loop through all links
        for idx, link in enumerate(links_list):
            # fetch the title of the post
            post_title = link.get_text()
            # fetch the link of the post
            post_href = link.get('href')
            # fetch the points text using the index of the link
            # and convert it to an integer
            # if the points data is not available, default to 0
            try:
                post_points = int(
                    points_list[idx].get_text().replace(' points', ''))
            except (IndexError, ValueError):
                post_points = 0
            # append to popular posts as a dictionary object if points is at least 100
            if post_points >= 100:
                popular_posts.append(
                    {'title': post_title, 'link': post_href, 'points': post_points})
        return popular_posts


    def sort_posts_by_points(posts):
        return sorted(posts, key=lambda x: x['points'], reverse=True)


    def main():
        total_links_list, total_points_list = get_paginated_data(5)
        popular_posts = generate_popular_posts(total_links_list, total_points_list)
        sorted_posts = sort_posts_by_points(popular_posts)
        # print the posts sorted from highest to lowest points
        pprint.pprint(sorted_posts)


    if __name__ == '__main__':
        main()
Now, using this script, we don’t even need to visit Hacker News and search for popular news. We can run it from the console and get the latest news delivered there. Feel free to tweak the script as per your needs and experiment with it, or try scraping data from your own favourite website.
 
We can possibly do a lot of things with the above data, such as
  • Create an API to use it for an app or website (see the sketch after this list)
  • Use it for analysing trends using keywords
  • Create a news aggregator website, and more
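For instance, here is a quick hypothetical sketch of persisting the data for such uses, assuming the sorted_posts list produced by the script above: it dumps the posts to a JSON file that an app, API or aggregator page could then consume.

    import json

    # persist the scraped posts so another app or an API can read them later
    # `sorted_posts` is assumed to be the list of dictionaries returned by sort_posts_by_points
    with open('popular_posts.json', 'w') as f:
        json.dump(sorted_posts, f, indent=2)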

Popular Scraping Libraries

 
Beautiful Soup has its limitations when it comes to scraping data from websites. It is quite simple to use, but for scraping data from complex websites that are rendered on the client side (Angular or React based websites), the full HTML markup is not available when the page is first fetched. To fetch data from such websites, more advanced tools can be used; a minimal Selenium sketch follows the list below. Here are some popular libraries and frameworks for Python.
  • lxml
  • Selenium
  • Scrapy - a complete framework for web scraping
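As a rough sketch of how such client-rendered pages could be handled (assuming Selenium and a Chrome driver are installed; the URL is a placeholder), a real browser renders the page first, and the resulting markup can then be parsed with Beautiful Soup exactly as before:

    from bs4 import BeautifulSoup
    from selenium import webdriver

    # launch a real browser so that client-side JavaScript gets executed
    driver = webdriver.Chrome()
    driver.get('https://example.com/js-rendered-page')  # placeholder URL

    # the fully rendered markup, including content injected by JavaScript
    rendered_html = driver.page_source
    driver.quit()

    # parse the rendered HTML just like a normal response
    soup = BeautifulSoup(rendered_html, 'html.parser')
    print(soup.prettify())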
References
  • https://realpython.com/beautiful-soup-web-scraper-python/
  • https://realpython.com/python-web-scraping-practical-introduction/
  • https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3
Web Scraping is a vast field. Using Beautiful Soup, we have probably just scratched the surface. There are a whole lot of possibilities in this domain, which I intend to explore further while digging into data analysis with Python. Hopefully, I have been able to cover the basic concepts needed for further exploration.
 
Tomorrow I shall be going over the concepts of Web Development with Python.
 
Have a great one!

