Andrew Wan
QUERY: comparing website contents
Feb 15 2008 10:02 AM
I've got two websites: one original, the other based on the original.
I'd like to compare the two sites with an automatic diff tool to see what text/information has changed. The problem is that the HTML code and layout have been changed drastically, so I can't do a straight text-file compare. What I'm interested in is purely the raw content (paragraphs, sentences, etc.). The original site has no JavaScript, onmouseover hovers, etc.; the new revamped website has JavaScript, onmouseover hovers, popups, and so on.
How can I create a script (Perl? C++?) that extracts the main text body from both sites? I guess I'd also have to specify starting and ending delimiters. Once extracted, it would need to convert <p></p> paragraph tags and strip out <a onmouseover...> anchor links (while keeping the words between the anchor tags, of course). The new website uses two spaces after each full stop while the old website uses one space. Will this matter?
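Roughly, I imagine the extraction step looking something like this Python sketch (all class and function names are my own, and it assumes the interesting text lives inside <body>; it skips <script>/<style> blocks, keeps the text inside anchor tags, and collapses whitespace so the one-space-vs-two-spaces difference disappears):

```python
from html.parser import HTMLParser
import re

class TextExtractor(HTMLParser):
    """Collect text inside <body>, skipping <script> and <style> blocks.
    Anchor tags are dropped as tags, so the words between them are kept."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.chunks.append(data)

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    # Collapse every whitespace run to a single space, so "one space
    # vs. two spaces after a full stop" never shows up in the diff.
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()
```

So the whitespace question answers itself: normalise it away before diffing and it won't matter.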
Once we have the plain text, how do we wrap the paragraphs at 80 characters per line, so that we can easily do file compares?
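For the wrapping and comparing, Python's standard textwrap and difflib modules could cover it; a minimal sketch (function names are mine):

```python
import difflib
import textwrap

def wrap_text(text, width=80):
    """Wrap already-extracted plain text at `width` columns so that
    line-oriented diff tools produce readable output."""
    return textwrap.fill(text, width=width)

def diff_texts(old, new):
    """Return a unified diff of two plain texts, wrapped at 80 columns."""
    return "\n".join(difflib.unified_diff(
        wrap_text(old).splitlines(),
        wrap_text(new).splitlines(),
        fromfile="original", tofile="revamped", lineterm=""))
```

If the two texts are identical after wrapping, `diff_texts` returns an empty string, which makes "has anything changed?" a simple truth test.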
And please do not suggest copying and pasting the text into Notepad or Word. I said 'website', which means dozens of HTML files (probably hundreds). Plus, I'd like a script to automate the compare process so I can repeat it in the future and remind myself of the diffs.
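The automation part could just walk both site directories and diff each pair of same-named HTML files. A sketch of what I mean (directory layout and names are assumptions; `extract_text` is whatever HTML-to-text function we settle on):

```python
import difflib
import os
import textwrap

def _read(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()

def compare_sites(old_root, new_root, extract_text, width=80):
    """Diff every .html/.htm file under old_root against the file at the
    same relative path under new_root. Returns (relpath, report) pairs."""
    reports = []
    for dirpath, _dirs, files in os.walk(old_root):
        for name in files:
            if not name.lower().endswith((".html", ".htm")):
                continue
            old_path = os.path.join(dirpath, name)
            rel = os.path.relpath(old_path, old_root)
            new_path = os.path.join(new_root, rel)
            if not os.path.exists(new_path):
                reports.append((rel, "MISSING in new site"))
                continue
            old = textwrap.fill(extract_text(_read(old_path)), width)
            new = textwrap.fill(extract_text(_read(new_path)), width)
            diff = "\n".join(difflib.unified_diff(
                old.splitlines(), new.splitlines(),
                fromfile=old_path, tofile=new_path, lineterm=""))
            if diff:  # only report files whose text actually changed
                reports.append((rel, diff))
    return reports
```

Run once, save the output, and re-run any time the new site changes.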
Answers (1)