Naveen HS

Extracting Text from table using Regex

Jul 28 2010 3:04 AM
Hello everyone,

I have a static HTML page containing a table with some information in it, and I am trying to extract the contents of that table.

HTML Table:-

<table border="0" cellpadding="5" cellspacing="0" width="165">
<tr>
<td nowrap>
<div class="saadirname">

Pradeep

G
</div>
<span class="saadirtext">

State College
<br>



First Grade
<br>


Library
<br>

</span><br>

<span class="saadirheader">Mailing Address:</span><br>

<span class="saadirtext">
<!--If using company address-->


Welcome Society
<br>



Library Arch
<br>


# 20 State Street
<br>


Mail Road
,
WI
<img src="/images/spacer.gif" alt="" height="1" width="5" border="0">
5000-1000
<img src="/images/spacer.gif" alt="" height="1" width="5" border="0">
IND
</span>
<p>


<b class="saadirheader">Phone:</b> <span class="saadirtext">(916) 060-6480</span><br>


<b class="saadirheader">Fax:</b> <span class="saadirtext">(916) 264-6336</span><br>





<b class="saadirheader">Email:</b>
<a href="MailTo:[email protected] "><span class="saadirtext">
[email protected]</span></a><br>



<br>
<b class="saadirheader">Membership Type:</b>
<span class="saadirtext">Individual</span><br>
<br>
</p></td>
</tr>
</table>

Program :-

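using System;
using System.IO;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Read the whole page into one string and release the file handle afterwards.
        string SFile;
        using (StreamReader str = new StreamReader("C:\\member.html"))
        {
            SFile = str.ReadToEnd();
        }

        // Intended to match a <tr> whose <td> cells each wrap their content in a <div>.
        // The named group "value" should capture the text inside the div, with any
        // leading or trailing HTML comments skipped. Whitespace inside the pattern is
        // ignored because of RegexOptions.IgnorePatternWhitespace.
        Regex regex = new Regex(
            @"<tr>
            (
            \s*
            <td[^>]*>
            \s*<div[^>]*>\s*
            (\s*<!--((?!-->).)*-->)*\s*
            (?<value>.*?)
            (\s*<!--((?!-->).)*-->)*\s*
            </div>\s*
            </td>
            )+
            \s*</tr>
            ",
            RegexOptions.Singleline | RegexOptions.ExplicitCapture |
            RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

        foreach (Match m in regex.Matches(SFile))
        {
            // One capture per matched cell in the row.
            foreach (Capture item in m.Groups["value"].Captures)
            {
                Console.WriteLine(item.Value);
            }

            Console.WriteLine();
        }

        // Keep the console window open.
        Console.ReadLine();
    }
}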

I am getting the output shown below: the entire table is printed, tags included. Is there a way to handle the span and br tags so that only the text is returned?

There is one more table in the page footer. Can we start the process after this line?
<!--START FOOTER FILE-->
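
To make the two questions above concrete, here is a minimal sketch of one possible approach. It does not repair the pattern above. Instead it swaps in a simpler technique: cut the page at the marker, then strip comments and tags with Regex.Replace. The class name TableText and the helpers BeforeMarker and ExtractLines are hypothetical names made up for this illustration. The sketch also assumes that the goal is to ignore the footer table, so it keeps the text before the marker; if the part after the marker is wanted instead, the Substring call would need to change.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, introduced only for this illustration.
static class TableText
{
    // Returns the part of the page that precedes the marker
    // (assumption: the footer table after the marker should be ignored).
    // If the marker is absent, the whole page is returned unchanged.
    public static string BeforeMarker(string html, string marker)
    {
        int i = html.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
        return i < 0 ? html : html.Substring(0, i);
    }

    // Removes HTML comments, turns every remaining tag (span, br, img, ...)
    // into a line break, and returns the non-empty text lines.
    public static List<string> ExtractLines(string html)
    {
        // Drop comments such as <!--If using company address-->.
        string s = Regex.Replace(html, @"<!--.*?-->", "", RegexOptions.Singleline);

        // Replace each tag with a newline so adjacent text fragments stay separate.
        s = Regex.Replace(s, @"<[^>]+>", "\n");

        var lines = new List<string>();
        foreach (string raw in s.Split('\n'))
        {
            // Collapse runs of whitespace and skip lines that end up empty.
            string line = Regex.Replace(raw, @"\s+", " ").Trim();
            if (line.Length > 0)
            {
                lines.Add(line);
            }
        }
        return lines;
    }
}
```

With the page already read into SFile, calling TableText.ExtractLines(TableText.BeforeMarker(SFile, "<!--START FOOTER FILE-->")) and printing each element would list the text fragments one per line. Text that spans several source lines (for example the first and last name in the saadirname div) comes back as separate entries, and a tag whose attribute value contains a ">" character would not be handled. For anything less regular than this page, an HTML parser is likely the safer choice.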

Output :-

 Pradeep

G
</div>
<span class="saadirtext">

State College
<br>



First Grade
<br>


Library
<br>

</span><br>

<span class="saadirheader">Mailing Address:</span><br>

<span class="saadirtext">
<!--If using company address-->


Welcome Society

<br>



Library Arch
<br>


# 20 State Street
<br>


Mail Road
,
WI
<img src="Newrecord_files/spacer.gif" alt="" border="0" height="1" width="5">
5000-1000
<img src="Newrecord_files/spacer.gif" alt="" border="0" height="1" width="5">
IND
</span>
<p>


<b class="saadirheader">Phone:</b> <span class="saadirtext">(916) 060-6480</span><br>


<b class="saadirheader">Fax:</b> <span class="saadirtext">(916) 264-6336</span><br>





<b class="saadirheader">Email:</b>
<a href="mailto:[email protected]"><span class="saadirtext">
[email protected]</span></a><br>



<br>
<b class="saadirheader">Membership Type:</b>
<span class="saadirtext">Individual</span><br>
<br>
</p></td>
</tr>
</tbody></table>

</td>

</tr>
</tbody></table>

<!--START FOOTER FILE-->

<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody><tr>
<td><img src="Newrecord_files/transparent.gif" alt="" border="0" height="0" hspace="0" vspace="0" width="0"></td>
<td align="center" width="771">
<table border="0" cellpadding="0" cellspacing="0" width="771">
<tbody><tr>

<td>
<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody><tr>
<td>

<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody><tr>
<td valign="top"><img src="Newrecord_files/transparent.gif" alt="" border="0" height="25" hspace="0" vspace="0" width="165"></td>
<td valign="top"><img src="Newrecord_files/transparent.gif" alt="" border="0" height="25" hspace="0" vspace="0" width="10"></td>
</tr>

<tr>
<td valign="top" width="165"><img src="Newrecord_files/transparent.gif" alt="" border="0" height="25" hspace="0" vspace="0" width="165"></td>
<td valign="top">
<div id="footer" style="border-top: 1px solid rgb(204, 204, 204); padding: 10px 0pt 20px;">
<p>© The Archivists</p>
<ul>