Many services on the internet parse postal addresses: given an address in
US postal format, they extract its individual parts. This is useful in
many programs, for example in marketing, to gather data about people in a
specific state.
In this article I show how to parse US-format addresses from a text file
or any other source. I wrote this code to help a friend, and I am sharing it here.
The parser relies on two main concepts: RegEx and text processing.
Here are the different US postal address formats:
JEREMY MARTINSON
455 LARKSPUR DR
CALIFORNIA SPRINGS CA 92926-4601
----
MARY ROE
MOBILE SYSTEMS
455 E DRAGMAN K
TUCSON AZ 85705-4589
USA
----
MARY ROE
MOBILE SYSTEMS
500 E DRAGMAN SUITE 5A
TUCSON AZ 85705-4601
USA
----
JOHN DOE
CENTER FOR FINANCIAL ASSISTANCE TO DEPOSED
NIGERIAN ROYALTY
421 E DRACHMAN
TUCSON AZ 85705-7445
USA
Now we need a program that can identify the various parts of these
addresses and store them in a consistent CSV-like format, so that it is
easy to sort or search the data by any part of the address, for example
by postal code or area.
I will store the parts of each address in a class called Address, and
then write every address found in the text file to another text file,
with the fields separated by the pipe '|' symbol. You can use a database
here instead if you want.
Since this is a text-processing program, RegEx and some text-processing thinking serve well here.
We can start by examining the address formats in the text file from which we want to parse data into
another text file, in a cleaner format that other programs can read.
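To examine the formats, it helps to first split the input into address blocks and then into lines. Here is a minimal sketch of that step, assuming the addresses are separated by one or more blank lines. The AddressBlocks class and its Split method are hypothetical names used only for illustration. They are not part of the program presented in this article.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original program.
public static class AddressBlocks
{
    // Splits raw text into address blocks on blank lines,
    // then splits each block into trimmed, non-empty lines.
    public static List<string[]> Split(string text)
    {
        var blocks = new List<string[]>();
        // Normalize Windows ("\r\n") and bare "\r" line endings to "\n".
        string normalized = text.Replace("\r\n", "\n").Replace('\r', '\n');
        // A line break followed by one or more empty or whitespace-only
        // lines separates two addresses.
        foreach (string block in Regex.Split(normalized, @"\n(?:[ \t]*\n)+"))
        {
            var lines = new List<string>();
            foreach (string line in block.Split('\n'))
            {
                string trimmed = line.Trim();
                if (trimmed.Length > 0) lines.Add(trimmed);
            }
            if (lines.Count > 0) blocks.Add(lines.ToArray());
        }
        return blocks;
    }
}
```

Each returned array holds the lines of one address, so its Length tells you right away which line format the address uses.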
The first step is to check whether the address contains a US postal code.
If it does not, it is not a standard address and we cannot process it.
If it does, we identify which line format it uses (the code below handles
the 3-line and 5-line formats) and then proceed to the next steps.
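The postal code check can be sketched with a single regular expression that also captures the city and state. This is an alternative illustration, not the approach used in the code that follows. It assumes the line has the shape "CITY ST ZIP" with an uppercase two-letter state, and LastLineParser and TryParse are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original program.
public static class LastLineParser
{
    // City, two-letter state, then ZIP or ZIP+4, anchored to the whole line.
    // Assumption: the state is written as two uppercase letters.
    private static readonly Regex CityStateZip = new Regex(
        @"^(?<city>.+?)\s+(?<state>[A-Z]{2})\s+(?<zip>[0-9]{5}(?:-[0-9]{4})?)$");

    // Returns true and fills the out parameters when the line matches;
    // otherwise returns false and sets them to empty strings.
    public static bool TryParse(string line, out string city, out string state, out string zip)
    {
        Match m = CityStateZip.Match(line.Trim());
        city = m.Success ? m.Groups["city"].Value : "";
        state = m.Success ? m.Groups["state"].Value : "";
        zip = m.Success ? m.Groups["zip"].Value : "";
        return m.Success;
    }
}
```

With named groups, a multi-word city such as CALIFORNIA SPRINGS is captured whole and the state comes back without surrounding spaces.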
Here is my code, which identifies the address formats mentioned above
and parses the parts of each address from a text file containing many addresses.
I wrote an Address class to hold the information:
using System;
using System.IO;

public class Address
{
    // Public fields, kept so callers can set them directly.
    public string Street;
    public string Locality;
    public string City;
    public string State;
    public string PostalCode;
    public string Country;

    public Address()
    {
        ClearObject();
    }

    // Resets every field to an empty string so the object can be reused.
    public void ClearObject()
    {
        Street = "";
        Locality = "";
        City = "";
        State = "";
        PostalCode = "";
        Country = "";
    }

    // Property wrappers around the fields above.
    public string _Street { get { return Street; } set { Street = value; } }
    public string _Country { get { return Country; } set { Country = value; } }
    public string _PostalCode { get { return PostalCode; } set { PostalCode = value; } }
    public string _State { get { return State; } set { State = value; } }
    public string _Locality { get { return Locality; } set { Locality = value; } }

    // Appends the address as one pipe-separated line to formatted_data.txt.
    public void WriteAddress()
    {
        // "using" closes the file even if the write throws.
        using (StreamWriter sw = new StreamWriter("formatted_data.txt", true))
        {
            sw.Write(String.Format("{0}|{1}|{2}|{3}|{4}|{5}\r\n", Street, Locality, City, State, PostalCode, Country));
        }
    }
}
Here the WriteAddress method appends the parsed address, in a consistent
format, to another text file called formatted_data.txt.
Now comes the main code, which parses addresses separated by blank lines
in just one click.
If the code looks hard to follow, check the example, as some of it is
not easy to understand.
// Requires: using System; using System.IO; using System.Text.RegularExpressions;
private void button1_Click(object sender, EventArgs e)
{
    /* read the file */
    string Data = File.ReadAllText("sample.txt");
    /* replace each blank line between addresses with a single delimiter */
    Data = Data.Replace("\r\n\r\n", "|");
    string[] AddressList = Data.Split('|');
    Address obj = new Address();
    /* US postal code: 5 digits with an optional 4-digit extension */
    Regex rex = new Regex(@"\b[0-9]{5}(?:-[0-9]{4})?\b");
    /* note: Length - 1 means the final entry is not processed */
    for (int i = 0; i < AddressList.Length - 1; i++)
    {
        AddressList[i] = AddressList[i].Replace("\r\n", "|");
        string[] Fields = AddressList[i].Split('|');
        /* continue only if the address contains a US postal code */
        if (rex.IsMatch(AddressList[i]))
        {
            obj.ClearObject();
            obj._Country = "USA";
            if (Fields.Length > 2 && rex.IsMatch(Fields[2]))
            {
                Match m = rex.Match(Fields[2]);
                obj.PostalCode = m.Value;
                /* the three characters before the postal code, e.g. "CA " */
                obj._State = Fields[2].Substring(m.Index - 3, 3);
                /* get locality: the first word of the line */
                string[] x = Fields[2].Split(' ');
                obj._Locality = x[0];
            }
            else if (Fields.Length > 3 && rex.IsMatch(Fields[3]))
            {
                Match m = rex.Match(Fields[3]);
                obj.PostalCode = m.Value;
                obj._State = Fields[3].Substring(m.Index - 3, 3);
                /* get locality: the first word of the line */
                string[] x = Fields[3].Split(' ');
                obj._Locality = x[0];
            }
            /* the street line depends on the line format */
            if (Fields.Length == 5)
            {
                obj._Street = Fields[2];
            }
            else if (Fields.Length == 3)
            {
                obj._Street = Fields[1];
            }
            obj.WriteAddress();
        }
    }
}
When you click the button, all the fields of each US address are written, separated by "|", to the formatted_data.txt file in the folder where the exe is. The result looks like this:
455 LARKSPUR DR|CALIFORNIA||CA |92926-4601|USA
455 E DRAGMAN K|TUCSON||AZ |85705-4589|USA
500 E DRAGMAN SUITE 5A|TUCSON||AZ |85705-4601|USA
Now it is easy to get any field of an address from the text file as needed, since the file is consistent and all addresses are in the same format :)
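As an illustration of using the formatted file, here is a minimal sketch that sorts its lines by postal code. It assumes the column order written by WriteAddress (Street|Locality|City|State|PostalCode|Country). FormattedData and SortByPostalCode are hypothetical names, not part of the program above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original program.
public static class FormattedData
{
    // Column order follows WriteAddress:
    // Street|Locality|City|State|PostalCode|Country
    public const int PostalCodeColumn = 4;

    // Splits each line on '|', skips blank or malformed lines,
    // and returns the rows ordered by postal code (ordinal comparison).
    public static List<string[]> SortByPostalCode(IEnumerable<string> lines)
    {
        return lines
            .Where(l => !string.IsNullOrWhiteSpace(l))
            .Select(l => l.Split('|'))
            .Where(f => f.Length == 6)
            .OrderBy(f => f[PostalCodeColumn], StringComparer.Ordinal)
            .ToList();
    }
}
```

To use it, pass the lines of the output file, for example FormattedData.SortByPostalCode(System.IO.File.ReadLines("formatted_data.txt")). Sorting by another field only requires a different column index.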