Scraping Web site Dynamic Data using WATIN

Introduction

The main objective of this article is to demonstrate how to scrape web pages using a testing tool such as Watin.

Generally, web pages are scraped with the HttpWebRequest and HttpWebResponse classes of the .NET Framework. However, when the application performs server-side navigation (for example, paging driven by postbacks), fetching page data with HttpWebRequest becomes more difficult, and some tricks are needed to fetch the data of the next page.

The same thing can be done very easily and quickly with the Watin tool. My objective here is not to challenge the HttpWebRequest and HttpWebResponse classes but to show how effectively web site scraping can be done using testing tools like Watin.

Background

In this article, I have used third-party tools like NUnit and Watin to demonstrate this example. Please refer to the following brief introduction for each tool and respective URL for further reference.

  • About Watin: Watin is a third-party web application testing tool designed for .NET. You can obtain more information about this tool by visiting this site: http://www.watin.org/
  • About NUnit: NUnit is a third-party unit-testing framework for all .NET languages. More information can be gathered by visiting this site: http://www.nunit.org

Using the code

This article consists of two applications. Brief details of each are given below.

The first application is a web-based application created in Visual Studio 2010 (.NET 4.0). This is a demonstration website with category and item listing pages. This website needs to be deployed on a local or remote IIS server.

The second application is a Windows class library project created using Visual Studio 2010 (.NET 4.0) and the Watin DLL.

Please refer to the following pre-requisite software required to execute this demonstration.

  1. .NET Framework 4.0
  2. NUnit 2.6.2

Configure the web application

Please perform the following procedure to configure the web application.

  • Deploy the web application in IIS, assign .NET Framework 4.0 to it, and check that the application runs correctly on your workstation.

Configure the Web Scraping application

Perform the following procedure to configure the Web Scraping application.

  • Open the configuration file (App.config or WatinWebScraping.dll.config) of this application and change the values of the following configuration keys as specified.
    • WebApplicationPath: This is the application path where the demonstration web site is deployed. In the current application, I have deployed it on the localhost, therefore I have given the path http://localhost/WebApplication/CategoryListing.aspx.
    • ScraperEnginePhysicalLocation: This is the physical location of the scraper application.

I have defined the value of this path as "D:\WebScraping\Web Scraper\WatinWebScraping\WatinWebScraping". I am using this path to store the scraped data in the text file.
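
As a rough illustration of how these two keys might be read at run time, here is a minimal sketch. It assumes the keys live in the standard appSettings section of the configuration file, which the article does not state explicitly. The ScraperSettings helper class is hypothetical and not part of the original project.

```csharp
using System.Configuration; // needs a reference to System.Configuration.dll
using System.IO;

// Hypothetical helper, assuming the configuration file contains:
//   <appSettings>
//     <add key="WebApplicationPath" value="http://localhost/WebApplication/CategoryListing.aspx" />
//     <add key="ScraperEnginePhysicalLocation" value="D:\WebScraping\Web Scraper\WatinWebScraping\WatinWebScraping" />
//   </appSettings>
public static class ScraperSettings
{
    // URL of the category listing page that the scraper opens first.
    public static string WebApplicationPath
    {
        get { return ConfigurationManager.AppSettings["WebApplicationPath"]; }
    }

    // Full path of the text file that receives the scraped data.
    // Path.Combine throws if the key is missing, because AppSettings returns null.
    public static string OutputFilePath
    {
        get
        {
            string folder = ConfigurationManager.AppSettings["ScraperEnginePhysicalLocation"];
            return Path.Combine(folder, "Output.txt");
        }
    }
}
```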

Code snippet of demo web application

The following is a brief overview of the code that resides on the respective pages.

  • CategoryListing.aspx: contains just a listing of categories in the form of hyperlinks.
  • ItemListing.aspx: This page uses a GridView control bound to an XmlDataSource instead of a database (for easy configuration).

Sample code snippet

Refer to the following code snippet for reference.

<asp:XmlDataSource ID="xmlSource" runat="server" DataFile="~/XMLDataBase/MenFashion.xml">
</asp:XmlDataSource>
<asp:GridView ID="gvItemListing" runat="server" DataSourceID="xmlSource" AutoGenerateColumns="false"
    AllowPaging="true" PageSize="5" Width="100%" PagerSettings-Position="Bottom">
    <%-- Column definitions are omitted in this excerpt --%>
</asp:GridView>

Code Snippet of WatinWebScraper Application

Here I will explain the following things.

  1. Initialization of Watin and NUnit in the application.
  2. Using Watin for website navigation.
  3. Using Regular Expression features (RegEx and MatchCollection) of .NET to fetch respective data from the HTML page source.
  4. Execute this application using NUnit.

Please refer to the following explanation of the respective sections.

Initializing Watin and NUnit in the application

To use Watin and NUnit in the application, add references to "nunit.framework.dll", "Interop.SHDocVw.dll" and "WatiN.Core.dll".

Next, import the "NUnit.Framework" and "WatiN.Core" namespaces in this project.

Because NUnit is used to run the scraper, the class must be marked with "[TestFixture]", and the scraping method must be marked with "[Test]" and "[STAThread]".

You can get more details of these attributes by referring to the http://nunit.org web site.
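
The attribute placement described above can be sketched as follows. The namespace, class, and method names are hypothetical placeholders, and the method body is left empty on purpose. This is only a skeleton under those assumptions, not the article's actual source.

```csharp
using System;
using NUnit.Framework;
using WatiN.Core; // used by the scraping steps that go inside the test method

namespace WatinWebScraping
{
    // Hypothetical class name; [TestFixture] lets NUnit discover the class.
    [TestFixture]
    public class WebScraperFixture
    {
        // [Test] makes the method runnable from the NUnit runner.
        // [STAThread] requests a single-threaded apartment, which
        // Internet Explorer automation needs.
        [Test]
        [STAThread]
        public void ScrapeCategoriesAndItems()
        {
            // The scraping steps from the following sections go here.
        }
    }
}
```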

Using Watin for web scraping

// Create an instance of IE browser
IE ieInstance = new IE(webSitePath);
// This opens the IE browser in maximized mode
ieInstance.ShowWindow(WatiN.Core.Native.Windows.NativeMethods.WindowShowStyle.ShowMaximized);

The Watin browser window can be hidden from the user during scraping with the following code snippet. The line that hides the window is currently commented out; un-comment it to hide the window.

// ieInstance.Visible = false;
// This will wait for the browser to complete loading of the page
ieInstance.WaitForComplete();
// This will store page source in categoryPageSource variable
string categoryPageSource = ieInstance.Html;
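
The steps above can also be wrapped in a using block so that the browser instance is disposed when scraping finishes, which is the pattern WatiN's own samples follow. The sketch below assumes a hypothetical helper named PageSourceReader; only the IE constructor, WaitForComplete, and Html calls come from the article.

```csharp
using WatiN.Core;

// Hypothetical helper; call it from an STA thread, for example from a
// test method marked with [STAThread].
public static class PageSourceReader
{
    public static string FetchPageSource(string webSitePath)
    {
        // Disposing the IE instance releases the browser when the block exits.
        using (IE ieInstance = new IE(webSitePath))
        {
            // Wait until the page has finished loading.
            ieInstance.WaitForComplete();

            // Return the HTML source of the loaded page.
            return ieInstance.Html;
        }
    }
}
```

With this shape, a browser window is not left behind if a later step throws an exception.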

Using Regular Expression features (RegEx and MatchCollection) of .NET to fetch respective data from the HTML page source

I have used regular expressions to fetch the categories, to fetch the items of each category in a loop, and to find the pager links so that every page of a category's items is visited.

Please refer to the following regular expressions used for Category, Item fetching, and page navigation respectively.

  1. Category Regular Expression: The following regular expression fetches all the category URLs from the CategoryListing.aspx page; the application then navigates to each of them in a loop.

    <A\S.*?class=bold\s.*?href="(?<href>.*?)"></span>
  2. Item Regular Expression: The following regular expression will fetch ProductID, Product Name, and Product Price for the respective item residing in the given page.

    <P\s*id=.*?>ProductID:\s*<B>(?<ProductID>.*?)</B>.*?</P>\s*.*?<P\s*id=.*?>ProductName:\s*<B>(?<ProductName>.*?)</B>.*?</P>\s*.*?<P\s*id=.*?>ProductPrice:\s*<B>(?<ProductPrice>.*?)</B>.*?</P>
  3. Paging Regular Expression: The following regular expression will fetch respective pages from the ItemListing.aspx page.

    (?(?=<SPAN>.*?</SPAN>)<SPAN>(?<PageNumber>.*?)\s*</SPAN>|<A\s*href="javascript.*?>(?<PageNumber>.*?)\s*</A>)
    

To use these regular expressions in the application, I have used the "Regex" class of the "System.Text.RegularExpressions" namespace. Each pattern is constructed with the options "RegexOptions.Compiled", "RegexOptions.IgnoreCase", "RegexOptions.IgnorePatternWhitespace" and "RegexOptions.CultureInvariant".

Refer to the following code snippet for that.

// Regular expression for Category listing page
private const string _categoryRegEx = @"<A\S.*?class=bold\s.*?href=""(?<href>.*?)""></span>";
Regex categoryMatches = new Regex(_categoryRegEx, RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant);

To fetch records based on the regular expression, use a MatchCollection, which holds the list of successful matches found in the HTML source (string categoryPageSource = ieInstance.Html;) obtained in step 2 above. Refer to the following code for reference.

MatchCollection categoryMatchCollection = categoryMatches.Matches(categoryPageSource);

Use a foreach loop to process each match in turn. Refer to the following code for reference.

foreach (Match categoryMatch in categoryMatchCollection)

If the regular expression contains groups, use the Groups property of the match, which returns a GroupCollection, to fetch the groups of the respective result. Refer to the following code for reference.

GroupCollection categoryGroup = categoryMatch.Groups;

A Group collection contains multiple associated groups. To fetch a category, I have used "href" as a group. Refer to the following code for reference.

string itemListingURL = Convert.ToString(categoryGroup["href"].Value);
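
The Regex, MatchCollection, and GroupCollection steps above can be tried without a browser. The sketch below runs them against a hard-coded string. Both the sample markup and the simplified pattern are assumptions made for illustration. The pattern is not the article's category expression.

```csharp
using System;
using System.Text.RegularExpressions;

class CategoryMatchDemo
{
    static void Main()
    {
        // Hypothetical markup standing in for ieInstance.Html.
        string categoryPageSource =
            "<A class=bold href=\"ItemListing.aspx?cat=men\">Men</A>" +
            "<A class=bold href=\"ItemListing.aspx?cat=women\">Women</A>";

        // Simplified illustrative pattern with one named group, "href".
        Regex categoryRegex = new Regex(
            @"<a\s[^>]*?href=""(?<href>[^""]*)""",
            RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

        // All successful matches in the page source.
        MatchCollection categoryMatchCollection = categoryRegex.Matches(categoryPageSource);

        foreach (Match categoryMatch in categoryMatchCollection)
        {
            // The named group gives direct access to the captured URL.
            GroupCollection categoryGroup = categoryMatch.Groups;
            Console.WriteLine(categoryGroup["href"].Value);
        }
        // Prints:
        // ItemListing.aspx?cat=men
        // ItemListing.aspx?cat=women
    }
}
```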

The itemListingURL variable now contains the href of the respective category. The itemListingPath variable contains the full path of the item listing page for that category, and Watin navigates to it as follows.

ieInstance.GoTo(itemListingPath);

I have used the "WaitForComplete" method to wait until the respective page has loaded completely. Refer to the following code for reference.

ieInstance.WaitForComplete();
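
The article does not show how itemListingPath is built from itemListingURL. One possible approach, assuming the captured href is relative to the category listing page, is to resolve it with the Uri class. The query string below is a made-up example.

```csharp
using System;

class ItemListingPathDemo
{
    static void Main()
    {
        // Page the href was scraped from (the WebApplicationPath value).
        Uri categoryPage = new Uri("http://localhost/WebApplication/CategoryListing.aspx");

        // Hypothetical relative href captured by the "href" group.
        string itemListingURL = "ItemListing.aspx?category=Men";

        // Resolve the relative href against the page it came from.
        Uri itemListingUri = new Uri(categoryPage, itemListingURL);
        string itemListingPath = itemListingUri.AbsoluteUri;

        Console.WriteLine(itemListingPath);
        // Prints: http://localhost/WebApplication/ItemListing.aspx?category=Men
    }
}
```

If the captured href is already absolute, the same constructor returns it unchanged, so the call is safe in both cases.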

Using the code above, the application will navigate to the item listing page. A similar operation needs to be performed for fetching items.
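
To make the item step concrete, the sketch below applies the item regular expression from this article to a hard-coded string and reads its three named groups. The sample markup is hypothetical and only mimics the shape the pattern expects.

```csharp
using System;
using System.Text.RegularExpressions;

class ItemMatchDemo
{
    // Item pattern copied from this article, split across lines for readability.
    private const string ItemPattern =
        @"<P\s*id=.*?>ProductID:\s*<B>(?<ProductID>.*?)</B>.*?</P>\s*.*?" +
        @"<P\s*id=.*?>ProductName:\s*<B>(?<ProductName>.*?)</B>.*?</P>\s*.*?" +
        @"<P\s*id=.*?>ProductPrice:\s*<B>(?<ProductPrice>.*?)</B>.*?</P>";

    static void Main()
    {
        // Hypothetical markup standing in for the item listing page source.
        string itemListingPageSource =
            "<P id=p1>ProductID: <B>101</B></P>" +
            "<P id=p2>ProductName: <B>Shirt</B></P>" +
            "<P id=p3>ProductPrice: <B>25.00</B></P>";

        Regex itemRegex = new Regex(
            ItemPattern,
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

        foreach (Match itemMatch in itemRegex.Matches(itemListingPageSource))
        {
            // Each match carries one item; its fields are the named groups.
            GroupCollection itemGroup = itemMatch.Groups;
            Console.WriteLine("{0} | {1} | {2}",
                itemGroup["ProductID"].Value,
                itemGroup["ProductName"].Value,
                itemGroup["ProductPrice"].Value);
        }
        // Prints: 101 | Shirt | 25.00
    }
}
```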

Once all the items on the current page have been fetched, the application needs to move to the next page. Watin provides a Click method for clicking a specific page link.

The element to click can also be located by other criteria, which Watin exposes through its Find class, for example Find.ById, Find.ByName, and Find.ByUrl.

You can get more information on these criteria by visiting http://watin.org.

In this article, I have used "Find.ByText" to find a link by its text and then click it. A regular expression can also be passed to these criteria.

Refer to the following code for reference.

// Fetches the page number of the current page.
string linkText = Convert.ToString(pagingGroup[_pageNumber].Value);
// Clicks the given link. For example, if linkText contains "2",
// Watin clicks the link for the second page.
ieInstance.Link(Find.ByText(linkText)).Click();
// Wait for the operation to complete
ieInstance.WaitForComplete();
// Store the result of the page in the itemListingPageSource variable
itemListingPageSource = ieInstance.Html;
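
The link text passed to Find.ByText comes from the PageNumber group of the paging regular expression. The sketch below runs that expression, copied from this article, against a hypothetical pager fragment. In the fragment the current page is a SPAN and the other pages are javascript links. It only lists the page labels and does not drive a browser.

```csharp
using System;
using System.Text.RegularExpressions;

class PagingMatchDemo
{
    // Paging pattern copied from this article. It is a conditional expression:
    // if a SPAN starts at the current position, the first branch is used,
    // otherwise the branch for javascript links is used.
    private const string PagingPattern =
        @"(?(?=<SPAN>.*?</SPAN>)<SPAN>(?<PageNumber>.*?)\s*</SPAN>|<A\s*href=""javascript.*?>(?<PageNumber>.*?)\s*</A>)";

    static void Main()
    {
        // Hypothetical pager markup; the postback arguments are made up.
        string pagerSource =
            "<SPAN>1</SPAN>" +
            "<A href=\"javascript:__doPostBack('gvItemListing','Page$2')\">2</A>" +
            "<A href=\"javascript:__doPostBack('gvItemListing','Page$3')\">3</A>";

        Regex pagingRegex = new Regex(
            PagingPattern,
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

        foreach (Match pagingMatch in pagingRegex.Matches(pagerSource))
        {
            // This value is what the article passes to Find.ByText.
            string linkText = pagingMatch.Groups["PageNumber"].Value;
            Console.WriteLine(linkText);
        }
    }
}
```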

Once the respective items in the web page are scraped, the current application will store respective items in "Output.txt" using StreamWriter of System.IO namespace.
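
The article does not show the writing step itself. A minimal sketch is given below, with a temporary folder in place of the configured ScraperEnginePhysicalLocation. The item values and the line format are made up for illustration.

```csharp
using System;
using System.IO;

class OutputWriterDemo
{
    static void Main()
    {
        // Hypothetical location; the article writes to the folder configured
        // in ScraperEnginePhysicalLocation instead.
        string outputPath = Path.Combine(Path.GetTempPath(), "Output.txt");

        // Hypothetical scraped items: ProductID, ProductName, ProductPrice.
        string[][] items =
        {
            new[] { "101", "Shirt", "25.00" },
            new[] { "102", "Jeans", "40.00" }
        };

        // The second argument (append: false) overwrites any previous run.
        using (StreamWriter writer = new StreamWriter(outputPath, false))
        {
            foreach (string[] item in items)
            {
                writer.WriteLine("ProductID: {0}, ProductName: {1}, ProductPrice: {2}",
                    item[0], item[1], item[2]);
            }
        }

        Console.WriteLine("Wrote {0} items to {1}", items.Length, outputPath);
    }
}
```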

Now open the "Output.txt" file and observe that it contains all the items of Men, Women, and Children Categories.

Execute this application using NUnit

To execute the WatinWebScraper application, perform the following procedure.

  1. Open the NUnit application.
  2. Now click on "File" -- "Open Project" and navigate to the DLL file ("WatinWebScraping.dll") of the Watin Web Scraper application. Refer to the following image for reference.
  3. Now click on the "Run" button as shown in the image above. Observe that the application starts scraping the demo web application by navigating to each category, and all the respective items are stored in the "Output.txt" file.