In this article, I would like to share a piece of code that might be useful to some developers.
We can find a lot of C# code that parses the HTTP URLs in a given string. But it is difficult to find code that will:
- Accept a URL as an argument and parse the site content
- Fetch all URLs in the site content and parse the site content of each URL
- Repeat the above process until all URLs are fetched
Scenario
Taking the website http://valuestocks.in (a stock market site) as an example, I would like to get all the URLs inside the website recursively.
Design
The main class is SpiderLogic, which contains all the necessary methods and properties.
The GetUrls() method is used to parse the website and return the URLs. There are two overloads for this method.
The first one takes two arguments: the URL and a Boolean indicating whether recursive parsing is needed.
E.g.: GetUrls("http://www.google.com", true);
The second one takes three arguments: the URL, the base URL, and the recursive Boolean. This overload is intended for the case where the URL is a sub-level of the base URL and the web page contains relative paths; in order to construct valid absolute URLs, the second argument is necessary.
E.g.: GetUrls("http://www.whereincity.com/india-kids/baby-names/", "http://www.whereincity.com/", true);
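Putting the two together, a plausible sketch of the first overload is shown below; it can simply delegate to the three-argument version, using the URL itself as the base URL (the shipped source may differ):

public IList<string> GetUrls(string url, bool recursive)
{
    // The url doubles as the base url when none is supplied
    return GetUrls(url, url, recursive);
}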
Method Body of GetUrls()
public IList<string> GetUrls(string url, string baseUrl, bool recursive)
{
    if (recursive)
    {
        _urls.Clear();
        RecursivelyGenerateUrls(url, baseUrl);
        return _urls;
    }
    else
        return InternalGetUrls(url, baseUrl);
}
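RecursivelyGenerateUrls() and the _urls field are not listed in the article. A minimal sketch of how they might work, assuming _urls doubles as the visited-set so that no page is fetched twice:

private readonly List<string> _urls = new List<string>();

// Hypothetical sketch: collect a page's urls, then recurse into each new one.
// Checking _urls before recursing prevents re-visiting pages and infinite loops.
private void RecursivelyGenerateUrls(string baseUrl, string absoluteBaseUrl)
{
    foreach (string url in InternalGetUrls(baseUrl, absoluteBaseUrl))
    {
        if (!_urls.Contains(url))
        {
            _urls.Add(url);
            RecursivelyGenerateUrls(url, absoluteBaseUrl);
        }
    }
}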
InternalGetUrls()
Another method of interest is InternalGetUrls(), which fetches the content of the URL, parses the URLs inside it, and constructs the absolute URLs.
private IList<string> InternalGetUrls(string baseUrl, string absoluteBaseUrl)
{
    IList<string> list = new List<string>();

    Uri uri = null;
    if (!Uri.TryCreate(baseUrl, UriKind.RelativeOrAbsolute, out uri))
        return list;

    // Get the http content
    string siteContent = GetHttpResponse(baseUrl);

    var allUrls = GetAllUrls(siteContent);
    foreach (string uriString in allUrls)
    {
        uri = null;
        if (Uri.TryCreate(uriString, UriKind.RelativeOrAbsolute, out uri))
        {
            if (uri.IsAbsoluteUri)
            {
                // If urls from a different domain / javascript: urls are needed, exclude this check
                if (uri.OriginalString.StartsWith(absoluteBaseUrl))
                {
                    list.Add(uriString);
                }
            }
            else
            {
                string newUri = GetAbsoluteUri(uri, absoluteBaseUrl, uriString);
                if (!string.IsNullOrEmpty(newUri))
                    list.Add(newUri);
            }
        }
        else
        {
            if (!uriString.StartsWith(absoluteBaseUrl))
            {
                string newUri = GetAbsoluteUri(uri, absoluteBaseUrl, uriString);
                if (!string.IsNullOrEmpty(newUri))
                    list.Add(newUri);
            }
        }
    }

    return list;
}
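The helpers referenced above (GetHttpResponse(), GetAllUrls() and GetAbsoluteUri()) ship with the article's source and are not listed here. A minimal sketch of how they could be implemented, assuming a WebClient download and a simple regex over href attributes, follows; the use of OnException assumes the delegate described in the next section accepts an Exception.

// Downloads the raw HTML of the page; errors are reported via OnException
private string GetHttpResponse(string url)
{
    try
    {
        using (var client = new System.Net.WebClient())
            return client.DownloadString(url);
    }
    catch (Exception ex)
    {
        if (OnException != null)
            OnException(ex);
        return string.Empty;
    }
}

// Extracts the value of every href="..." attribute from the HTML
private IList<string> GetAllUrls(string siteContent)
{
    var urls = new List<string>();
    var matches = System.Text.RegularExpressions.Regex.Matches(
        siteContent, "href\\s*=\\s*\"([^\"]+)\"",
        System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    foreach (System.Text.RegularExpressions.Match match in matches)
        urls.Add(match.Groups[1].Value);
    return urls;
}

// Resolves a relative uri against the absolute base url; the parsed uri
// parameter is kept only to match the call sites above
private string GetAbsoluteUri(Uri uri, string absoluteBaseUrl, string uriString)
{
    Uri baseUri, result;
    if (Uri.TryCreate(absoluteBaseUrl, UriKind.Absolute, out baseUri) &&
        Uri.TryCreate(baseUri, uriString, out result))
        return result.ToString();
    return string.Empty;
}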
Handling Exceptions
There is an OnException delegate that can be used to receive the exceptions that occur while parsing.
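The delegate's exact signature isn't listed in the article; assuming it is an Action<Exception> exposed by SpiderLogic, hooking it up might look like:

var spider = new SpiderLogic();

// Hypothetical wiring, assuming OnException is an Action<Exception>
spider.OnException = ex => Console.WriteLine("Error while parsing: " + ex.Message);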
Tester Application
A tester Windows application is included with the source code of the article. You can try executing it.
The form accepts a base URL as input; clicking the Go button parses the content of the URL and extracts all the URLs in it. If you need recursive parsing, check the Is Recursive check box.
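For reference, the same flow the tester form performs can be driven from code as well (a hypothetical console driver, not the bundled Windows Forms tester):

// Equivalent of entering a base url and clicking Go with Is Recursive checked
var spider = new SpiderLogic();
IList<string> urls = spider.GetUrls("http://valuestocks.in", true);

foreach (string url in urls)
    Console.WriteLine(url);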
Next Part
In the next part of the article, I would like to create a URL verifier website that verifies all the URLs in a website. I agree that a quick search will turn up free providers that do this; my aim is to learn and develop custom code that could be extensible and reusable across multiple projects by the community.