Introduction
In this blog, we will learn how to extract all the links from a Web page in a C# Windows Forms application, using a WebBrowser control. Let's dive directly into the code.
Step 1
We are creating a link grabber, and it is always a good idea to clarify the logic before building anything. Let's define it.
- We need the address of the page to crawl. We can read it from a TextBox.
- Next, we download the Web page. We can use either a Web client or a WebBrowser control for this; the code in this blog uses a WebBrowser control.
- Once the page has loaded, we have an HTML document from which to extract the links.
- Most of the useful links are contained in the href attribute of the anchor (<a>) tags.
- So we need to grab the anchor elements of the page, which we can do using GetElementsByTagName("a").
- This gives us a collection of all the anchor elements.
- The next step is to read the href attribute of each anchor and add it to a list. Here, the list is a CheckedListBox.
- At this point, we have all the extracted links.
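The steps above can also be sketched without any UI, using the Web client route mentioned in the list. This is a minimal sketch and not the code this blog builds in Step 2: the class name LinkSketch, the methods ExtractLinks and Grab, and the regular expression are all hypothetical names and choices made for illustration. The pattern only handles quoted href values and is no substitute for a real HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

static class LinkSketch
{
    // Assumption: a deliberately simple pattern for quoted href values on <a> tags.
    static readonly Regex HrefPattern = new Regex(
        "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']*)[\"']",
        RegexOptions.IgnoreCase);

    // Returns every anchor href found in the HTML, resolved against the page address.
    public static List<string> ExtractLinks(string html, Uri baseUri)
    {
        var links = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            Uri resolved;
            // Relative hrefs such as "/about" become absolute; malformed ones are skipped.
            if (Uri.TryCreate(baseUri, m.Groups[1].Value, out resolved))
            {
                links.Add(resolved.AbsoluteUri);
            }
        }
        return links;
    }

    // Downloads the page with WebClient and extracts its links.
    public static List<string> Grab(string pageUrl)
    {
        using (var client = new WebClient())
        {
            string html = client.DownloadString(pageUrl);
            return ExtractLinks(html, new Uri(pageUrl));
        }
    }
}
```

Calling LinkSketch.Grab with a page address returns the links as strings, which could then be added to any list control. Unlike the WebBrowser control, WebClient does not run scripts, so links that a page creates with JavaScript will not appear.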
Step 2
Open Visual Studio and choose "New project".
- Choose "Visual C#" -> Windows -> "Windows Forms Application".
- Drag a TextBox from the Toolbox onto the form.
- Drag a Button from the Toolbox onto the form and name it "grab".
- Drag a CheckedListBox from the Toolbox onto the form.
- Double-click the button to generate its click handler.
- Add the code below to the click handler.
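The complete code for the form is shown below.

    using System;
    using System.Windows.Forms;

    namespace linkGrabber {
        // The class name must match the partial class in the designer file (Form1 by default).
        public partial class Form1 : Form {
            public Form1() {
                InitializeComponent();
            }

            // Click handler of the "grab" button. textBox and checkedListBox are the
            // control names assumed here; adjust them to match your designer.
            private void button_Click(object sender, EventArgs e) {
                Uri url;
                // new Uri(...) throws on malformed input, so validate the text first.
                if (!Uri.TryCreate(textBox.Text, UriKind.Absolute, out url)) {
                    MessageBox.Show("Please enter an absolute URL, e.g. http://example.com/");
                    return;
                }
                checkedListBox.Items.Clear();
                WebBrowser wb = new WebBrowser();
                wb.ScriptErrorsSuppressed = true;
                // Subscribe before navigating so the event cannot be missed.
                wb.DocumentCompleted += wb_DocumentCompleted;
                wb.Url = url;
            }

            // Raised when a document finishes loading; it can fire once per frame,
            // so wait until the whole page reports that it is complete.
            void wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e) {
                WebBrowser wb = (WebBrowser) sender;
                if (wb.ReadyState != WebBrowserReadyState.Complete) return;
                extractLink(wb.Document);
            }

            // Adds the href of every anchor element to the checked list box.
            private void extractLink(HtmlDocument source) {
                HtmlElementCollection anchorList = source.GetElementsByTagName("a");
                foreach (HtmlElement item in anchorList) {
                    string href = item.GetAttribute("href");
                    if (!string.IsNullOrEmpty(href)) {
                        checkedListBox.Items.Add(href);
                    }
                }
            }
        }
    }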
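The handler above adds every href it finds, so the list can contain duplicates and entries such as mailto: or javascript: links. One possible filtering step is sketched below. The class LinkFilter and the method KeepWebLinks are hypothetical names introduced here, and keeping only http and https links is an assumption about what counts as a useful link.

```csharp
using System;
using System.Collections.Generic;

static class LinkFilter
{
    // Hypothetical helper: keeps distinct, absolute http/https links in their original order.
    public static List<string> KeepWebLinks(IEnumerable<string> hrefs)
    {
        var seen = new HashSet<string>();
        var result = new List<string>();
        foreach (string href in hrefs)
        {
            Uri uri;
            if (string.IsNullOrWhiteSpace(href)) continue;
            // Skip anything that is not a well-formed absolute URI.
            if (!Uri.TryCreate(href, UriKind.Absolute, out uri)) continue;
            // Assumption: only http and https links are of interest.
            if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps) continue;
            // HashSet.Add returns false for a link that was already seen.
            if (seen.Add(uri.AbsoluteUri)) result.Add(uri.AbsoluteUri);
        }
        return result;
    }
}
```

To use it with the form, collect the href strings into a List<string> first, pass them through LinkFilter.KeepWebLinks, and add only the returned items to the checked list box.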
Conclusion
In this blog, we learned how to create a link extractor in C# that loads a page with the WebBrowser control and lists the href of every anchor in a CheckedListBox.