Search Content Source to Website
In this article we will create a Search Content Source for a web site in SharePoint 2013.
What is the Goal?
Our goal is to make the content in the following blog searchable from SharePoint 2013.
Note that the web site above is only an example. You can use any web site of your own, provided it has a valid robots.txt file.
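Because the crawler honors robots.txt, it can help to confirm that the file does not block the pages you want indexed. The following is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content and the example.com URLs are hypothetical, and the check uses the generic "*" rule rather than the specific user agent that SharePoint sends.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; replace it with the file served by your site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/blog/post-1", "http://example.com/private/draft"):
        verdict = "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, url) else "blocked"
        print(url, "->", verdict)
```

If a page you expect in the search results is reported as blocked, the robots.txt file is the first thing to adjust before running a crawl.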
Procedure
The following is the procedure involved.
Step 1: Create Content Source
Open Central Administration, then select "Service Applications" > "Search Service Application" > "Content Sources".
Create a new Content Source and enter the following information.
Click OK to save the changes.
Step 2: Crawl
Now choose the Full Crawl option for the content source.
Wait a few minutes for the crawl to complete.
SharePoint accesses the home page through the URL, parses its content, reads the metadata, extracts the URLs it finds and follows them to discover more content. Everything collected this way is then indexed.
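The crawl behavior described above can be illustrated with a small sketch. This is not SharePoint's implementation, only a simplified breadth-first crawler under stated assumptions: the pages live in a hypothetical in-memory dictionary (SITE) instead of being fetched over HTTP, and the "index" is just a mapping from URL to page title.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web site": URL -> HTML. A real crawler fetches over HTTP.
SITE = {
    "http://example.com/": '<title>Home</title><a href="/post-1">One</a><a href="/post-2">Two</a>',
    "http://example.com/post-1": '<title>Post 1</title><a href="/">Home</a>',
    "http://example.com/post-2": '<title>Post 2</title><a href="/post-3">Three</a>',
    "http://example.com/post-3": "<title>Post 3</title>",
}


class PageParser(HTMLParser):
    """Collects the page title and the href of every anchor tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, max_depth):
    """Breadth-first crawl from start_url; returns {url: title} for indexed pages."""
    index = {}
    queue = deque([(start_url, 0)])
    seen = {start_url}
    while queue:
        url, depth = queue.popleft()
        html = SITE.get(url)
        if html is None:
            continue  # page not found; a real crawler would log a crawl error
        parser = PageParser()
        parser.feed(html)
        index[url] = parser.title
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for href in parser.links:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return index
```

Calling crawl("http://example.com/", 1) indexes the home page and the two pages it links to, but not the page that is two hops away. This mirrors how a depth limit on a content source keeps deeper pages out of the index.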
Step 3: View Log
You can check the Content Source for any Crawl Errors or Warnings that prevent the content from showing.
You will get the following page.
You can click on the links to view each error or warning. The non-serious ones can be ignored.
Step 4: Search
Open the Enterprise Search Center site and type in the following text.
The results should include pages from the blog URL above. This confirms that our Web Content Source is configured correctly.
Challenges
In real-world scenarios things rarely go this smoothly. You may encounter the following issues; the links below can help to resolve them.
You can view these errors from the Content Source > View Crawl Log menu.
Items might not be crawled due to one of the following reasons: Preventive crawl rule; specified content source hops/depth exceeded; URL has query string parameter; required protocol handler not found; preventive robots directive.
Solution 1: If the URL contains query strings, then configure Crawl Rules for it > http://bit.ly/1k1sIKt
Solution 2: If the source is hosted on the same machine, then review the loopback check setting > http://support.microsoft.com/kb/896861/en-us
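When many URLs are affected, it can help to narrow down which of the listed reasons applies to each one. The sketch below covers three of them (crawl rule, depth, query string). Every name in it (EXCLUDE_PATTERNS, MAX_DEPTH, skip_reason) is hypothetical and only mimics the checks. It does not call any SharePoint API, and the real values must be read from your own content source and crawl rule settings.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical settings mirroring what you configured; not read from SharePoint.
EXCLUDE_PATTERNS = ["http://example.com/private/*"]
MAX_DEPTH = 2


def skip_reason(url, depth, exclude_patterns=EXCLUDE_PATTERNS, max_depth=MAX_DEPTH):
    """Return a likely reason the URL would be skipped, or None if none applies."""
    if any(fnmatchcase(url, pattern) for pattern in exclude_patterns):
        return "preventive crawl rule"
    if depth > max_depth:
        return "hops/depth exceeded"
    if urlsplit(url).query:
        return "URL has query string parameter"
    return None


if __name__ == "__main__":
    samples = [
        ("http://example.com/private/a", 0),
        ("http://example.com/a", 3),
        ("http://example.com/a?id=1", 1),
        ("http://example.com/a", 1),
    ]
    for url, depth in samples:
        print(url, depth, "->", skip_reason(url, depth))
```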
The content for this address was excluded by the crawler because this item was marked with a no-index meta-tag. To index this item, remove the meta-tag and recrawl.
Solution 1: If the source is an external web site, then check its robots.txt file > http://bit.ly/PomtFg
Solution 2: If the source is a SharePoint site or library, then see http://bit.ly/1i99dBs
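To confirm that a page really carries the meta-tag named in this error, you can inspect its HTML. The following is a minimal sketch using Python's standard html.parser module. It assumes you already have the page source as a string, and the sample markup is hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records the directives of any <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def has_noindex(html):
    """Return True if the page has a robots meta-tag containing noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    # Hypothetical page source for illustration.
    sample = '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW"></head></html>'
    print(has_noindex(sample))
```

If the function returns True for a page that should be searchable, remove the meta-tag at the source and run the crawl again.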
As a general measure, I recommend applying the latest SharePoint Cumulative Updates and Operating System Service Packs to the machines.
References
http://technet.microsoft.com/en-us/library/jj219808(v=office.15).aspx
Summary
In this article we have explored how to create a Web Content Source in SharePoint 2013.