Address Capitalization - street-address

I'm looking into using a CASS-Certified address validation service to correct user-provided street addresses at the time of entry. (Specifically, I'm looking at SmartyStreets' LiveAddress.) However, USPS dictates that a correct address must be in all caps, so CASS services almost uniformly return addresses that way. When mailing to the client at that address, though, it would be preferable to use a more humane, conventional casing.
The question, of course, is how to make that happen. I know there's no such thing as a perfect solution that doesn't involve a complete nationwide database of correctly capitalized street and city names. A set of passable heuristics might be good enough, though, since we will probably be kicking the corrected address back to the user, ultimately leaving it up to them.
A short list of problems that I was able to come up with after a few minutes of thought:
SW FIRST ST should be SW First St, not Sw First St.
MCDOUGLE ST should be McDougle St, not Mcdougle St.
MACDOUGLE ST should probably be Macdougle St rather than MacDougle St, since Macoroni Rd should usually not be MacOroni Rd.
1ST ST should be 1st St, not 1St St.
Not knowing whether a street name is based on a surname, we probably can't safely turn VAN into van, but VON can probably become von.
Are there any existing libraries that could at least get me started? Addresses are complicated and fickle things, and I'd rather not home-brew the whole thing if I don't have to. I'm using C#, but I'm open to porting code from another language.
Barring that, does anyone have a decent reference of common capitalization exceptions for street and/or city names?

Great to see that you're using the LiveAddress service to facilitate address verification and standardization. There is one thing you may want to be aware of that will help you significantly in the process of applying casing rules to your standardized address:
We recently introduced a new REST+JSON endpoint that returns the standardized form of the address as well as various component parts of the address. Because of this, it's very easy to apply your casing rules to "street_name" and "city_name" values returned independent of the street suffix and pre/post-directionals.
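A minimal Python sketch of that approach, applying casing heuristics per component. Only "street_name" and "city_name" are named above; the other dictionary keys, the function names, and the heuristics themselves are assumptions for illustration, not LiveAddress behavior. The logic is small enough to port to C#.

```python
import re

# Heuristics taken from the question's list: ordinals stay lowercase after
# the digits, MC gets an inner capital, MAC is left alone.
ORDINAL = re.compile(r"^\d+(ST|ND|RD|TH)$")
MC_PREFIX = re.compile(r"^MC[A-Z]{2,}$")


def case_word(word):
    """Apply casing heuristics to one uppercase word of a street or city name."""
    if ORDINAL.match(word):
        return word.lower()                      # 1ST -> 1st
    if MC_PREFIX.match(word):
        return "Mc" + word[2:].capitalize()      # MCDOUGLE -> McDougle
    return word.capitalize()                     # MACDOUGLE -> Macdougle


def case_name(name):
    """Case a multi-word street or city name."""
    return " ".join(case_word(w) for w in name.split())


def case_street_line(components):
    """Rebuild a street line from parsed components (key names are assumed)."""
    parts = [
        components.get("primary_number", ""),
        components.get("street_predirection", ""),       # SW stays SW
        case_name(components.get("street_name", "")),
        components.get("street_suffix", "").capitalize(),  # ST -> St
        components.get("street_postdirection", ""),
    ]
    return " ".join(p for p in parts if p)
```

Because the directionals and the suffix arrive as separate values, SW never gets run through the word-casing rules, which avoids the "Sw First St" problem without a special case.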
You're welcome to contact SmartyStreets support for additional help with this issue in addition to questions here on Stack Overflow (which we monitor continually). I should probably also mention that I'm the founder of SmartyStreets. Lastly, we're working on being able to return properly cased addresses, but I don't have any kind of release time frame on it yet.

Not a library, but you could probably solve the problem with the Google Maps API depending on your requirements.

Related

Why Google's Webservice is returning approximate vs. rooftop or other location types?

Regarding Google's Geocoding Webservice: Is there documentation (beyond Google's documentation), articles, or anything out there regarding how to format addresses to get accurate results.
Some of my locations have names preceding the address. If it recognizes a street address in the string, I can usually get rooftop back. In some cases having a single quote or special character in the name will cause it to just recognize the address and geocode rooftop.
However, I am also seeing cases where it finds the exact place correctly as an establishment, a church for example, but still says the 'location_type' is approximate. Other cases where having words preceding the address causes it to only seem to recognize the zip code and it just geocodes the zip.
I am wondering if anyone has insight into how Google's Geocoding webservice API recognizes/parses locations? What causes it to see the address in one case, but only see a zip in another?
Also, is there maybe a better way to interpret accuracy than just that 'location_type' field?
So basically you have to follow the standard mailing address format that is used by the respective country the address falls in and refrain from using apt/house/suite numbers. Use street number for the building/complex/entity rather than names.
This is what you need to go through. However, please do your research first and ask a question as a last resort.

How is geocoding from an address done?

I was wondering how Google geocodes an address? Does it work like a DNS lookup, where they have a big table of addresses which is a hash to a geocode, or is there any fun geometry that goes into it? If it is a big hash table how did they go about gathering all that data?
Busbina, I work for SmartyStreets where we verify and geocode street addresses -- so I'll tell you what I know, and link you to further sources for your own research.
To answer your question: It is both.
There are suppliers of massive databases (for example, those like TIGER Data) which contain relational, geo-political information including coordinates, streets, boundaries, and names. For US data, it is possible to obtain at least ZIP-level accuracy through tables like these simply by doing a lookup. For more accuracy, though, append the +4 code and you may narrow it down to a city block or floor of a high building.
To attempt further accuracy (i.e., knowing where precisely on the street a building is located), Google and others perform what is called interpolation, where they take the known boundaries from their datasets and the known range of primary numbers from the start of that block or street to the end of it, and they solve a ratio. If the correct primary number is known, and for straight streets in an ideal setting, a simple ratio like this works:
(primary number - starting primary number) / (ending primary number - starting primary number) =
(x - starting boundary coordinate) / (ending boundary coordinate - starting boundary coordinate)
Where x is a close guess to the actual location on the street - but only a guess. Accurate building-level data can be very expensive and I think is only available for some urban areas.
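The ratio can be sketched in Python as a linear interpolation between the two ends of a block. This is a minimal sketch, not how Google or any particular geocoder implements it: the function name, parameter names, and sample coordinates are invented for illustration, and it assumes a straight street with evenly spaced numbers, normalizing by the length of the number range.

```python
def interpolate_position(primary, start_number, end_number, start_coord, end_coord):
    """Estimate (lat, lon) for a house number on a straight block.

    start_coord and end_coord are (lat, lon) tuples for the block's two ends;
    start_number and end_number are the primary numbers at those ends.
    """
    if end_number == start_number:
        # Degenerate range: nothing to interpolate, fall back to the start point.
        return start_coord
    # Fraction of the way along the block, from the primary-number ratio.
    t = (primary - start_number) / (end_number - start_number)
    lat = start_coord[0] + t * (end_coord[0] - start_coord[0])
    lon = start_coord[1] + t * (end_coord[1] - start_coord[1])
    return (lat, lon)
```

For example, number 150 on a block numbered 100 to 200 lands at the midpoint of the segment. The result is still only a guess: if 150 does not actually exist, the function happily returns a coordinate anyway, which is the approximation-versus-verification point made in this answer.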
The key is to get the right primary number and accurate, up-to-date data. Maintaining this can be time-consuming and expensive because of all the overhead involved with so much information.
Note that Google and similar map services only perform address approximation, not address verification, and thus are liable to make mistakes (even if the geocoding algorithm is very precise) because the primary number may be wrong or may not even exist. So when that matters to you (or you aren't showing a Google Map and must honor the Terms of Service), something like LiveAddress, as a starting point, is certified by the USPS and won't return bad addresses.
So there are some things to consider.
More information:
http://en.wikipedia.org/wiki/Geocoding#Address_interpolation
http://www.ncgia.ucsb.edu/giscc/units/u016/u016.html
** I'll add a note, since I have had this question a lot: rooftop- or building-level accuracy is very expensive information. I know of very few providers who offer this, and they have mined and collected that data themselves. For example, Google has the Street View project, from which they've obtained accurate coordinates for approximate addresses, and they can provide such precision. But most geocoders use the same data from official sources, they just interpolate differently. If you want extremely precise coordinates like building-level, you can expect to pay mightily for it, or go collect the data yourself. (Yes, Google's is free to a point -- unless you intend to use the information for more than just showing a map, basically.)
Another service that is very similar is GeoNames which is a US Government run database of location names. This service is better tailored towards points of interest, like an airport or landmark. This is just a database of names, locations, and some meta data.
http://www.geonames.org/

How to defend against users with Multiple Accounts?

We have a service where we literally give away free money.
Naturally said service is ripe for abuse. To defend against this we do the following:
log ip address
use unique email addresses (only 1 acct/email addy)
collect more info like st. address, phone number, etc.
use signup captcha
BHOs (I've seen poker rooms use these)
Now, let's get real here -- NONE of this will stop a determined user.
Obviously ip addresses can be changed via a proxy (which could be blacklisted via akismet) but change anyways if the user has a dynamic ip or if more than one user is behind a NAT'd network (can we say almost everyone?)
I can sign up for thousands of unique email addresses each hour -- this is no defense.
I can put in fake information taken from lists for street addresses and phone numbers.
I can buy captchas from captcha solving services (1k for $5).
bhos seem only effective for downloadable software -- this is a website
What are some other ways to prevent multiple users from abusing the service? How do all the PPC people control click fraud?
I know we could actually call the person but I don't think we are trying to do that anytime soon.
Thanks,
It's pretty difficult to generate lots of fake phone numbers that can send and receive SMS messages. SMS verification could go a long way towards cutting down on fraud. Of course, it also limits you to giving away free money to cell phone owners.
I think the only way is to bind your users' accounts to 'real world' information, like a passport number, for instance. Of course, you'll need to make sure that information is securely stored and find some way to validate it.
Re: signing up for new email accounts...
A user doesn't even need to do that. Please feel free to send your mail to brian_s#mailinator.com, or feydr.asks.a.question#spamherelots.com, or stackoverflow#safetymail.info, or my_arbitrary_username#zippymail.info. I haven't registered any of those email addresses, but all of them will work.
Those domains are owned by ManyBrain, and they (and probably others as well) set the domain to accept any email user. ManyBrain in particular then makes the inboxes for those emails publicly accessible without any registration (stripping everything but text from the email and deleting old mail). Check it out: admin#mailinator.com's email inbox!
Others have mentioned ways to try and keep user identities unique. This is just one more reason to not trust email addresses.
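One partial mitigation is to reject signups from known catch-all domains. The sketch below is a Python illustration under stated assumptions: the blocklist contains only the four domains named in this answer, the function name is made up, and a real list would be far longer and need constant maintenance, so treat it as a speed bump rather than a defense.

```python
# Domains named in this answer; a real blocklist would be much larger.
DISPOSABLE_DOMAINS = {
    "mailinator.com",
    "spamherelots.com",
    "safetymail.info",
    "zippymail.info",
}


def is_disposable(email):
    """Return True if the address's domain is, or is a subdomain of, a listed domain."""
    _, _, domain = email.rpartition("@")
    domain = domain.strip().lower()
    return any(
        domain == listed or domain.endswith("." + listed)
        for listed in DISPOSABLE_DOMAINS
    )
```

A determined user simply points a fresh domain at the same kind of catch-all mailbox, so a check like this only filters out the laziest abuse.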
First, I suppose (hope) that you don't literally give away free money but rather give it to use your service or something like that.
That matters as there is a big difference between users trying to just get free money from you they can spend on buying expensive cars vs only spending on your service which would be much more limited.
Obviously many more users will try to fool the system in the former than in the latter case.
Why does it matter? Because it is all about the balance between your control and your users' annoyance. I see many answers concentrating on the control part, so let's go through annoyance, shall we?
Log IP address. What if I am the next guy on the computer in say internet shop and the guy before me already used that IP? The other guy left your hot page that I now see but I am screwed because the IP is blocked. Yes, I can go to another computer but it is annoyance and I may have other things to do.
Collecting physical addresses. For what??? Are you going to visit me? Or start sending me spam letters? Let me guess, more often than not you get addresses with misprints at best and fake ones at worst. In fact, it is much less hassle for me to give you a fake address than to deal with whatever spam letters I'd have to recycle in an environment-friendly way. :)
Collecting phone numbers. Again, why shall I trust your site? This is the real story. I gave my phone nr to obscure site, then later I started receiving occasional messages full of nonsense like "hit the fly". That I simply deleted. Only later and by accident to discover that I was actually charged 2 euros to receive each of those messages!!! Do I want to get those hassles? Obviously not! So no, buddy, sorry to disappoint but I will not give your site my phone number unless your company is called Facebook or Google. :)
Use signup captcha. I love that :). So what are we trying to achieve here? Will the user who is determined to abuse your service have problems typing in a couple of captchas? I doubt it. But what about the "good user"? Are you aware how annoying captchas are for many users??? What about users with impaired vision? But even without it, most captchas are so bad that they make you feel like you have impaired vision! The best advice I can give - if you care about user experience, avoid captchas like the plague! If you have any doubts, do your online research first!
See here more discussion about control vs annoyance and here some more thoughts about being user-friendly.
You have to bind their information to something that is 'real world', as Rubens says. Of course, you also need to be able to verify this information (I can just make up passport numbers all day if you don't check to make sure they're correct).
How do you deliver the money? Perhaps you can index this off the paypal account, mailing address, or whatever you're sending the money to?
Sometimes the only way to prevent people abusing a system is to not have the system in the first place.
If you're doing what you say you're doing, "giving away money to people", then surprise surprise, there will be tons of people with more time available to try to find ways to game the system than you will have to fix it.
I guess it will never be possible to have an identification system which identifies fake identities that is:
cheap to run (I think it's called "operational cost"?)
cheap to implement (ideally one time cost - how do you call that?)
has no Type-I/Type-II errors
is scalable
But I think you could prevent users from having too many (to say a quite random number: more than 50) accounts.
You might combine the following approaches:
IP address: can be bypassed with VPN
CAPTCHA: can be bypassed with human farms (see this article, for example - although they claim that their test can't be that easily passed to other humans, I doubt this is true)
Ability-based identification: can be faked when you know what is stored and how exactly the identification works by randomly (but with a given distribution) acting (example: brainauth.com)
Real-world interaction: This might be the best one, but I guess it is expensive and not many users will accept it. Also, for some users/countries it might not be possible. (example: Postident in Germany, where the Post wants to see your identity card. I guess this can only be faced in massive scale by the government.)
Other sites/resources: This basically transforms the problem for other sites. You can use services, where it is not allowed/uncommon/expensive to have much more than 1 account
Email
Phone number: e.g. by using SMS, see Multi-factor authentication
Bank account: PayPal; transfer not much money or ask them to transfer a random (small) amount to you (which you will send back).
Social based
When you take the social graph (vertices are people, edges are connections), you will expect some distribution. You know that you are a single human and you know some other people. So you have a "network of trust" (in quotes, because I think this might be used in other contexts as well). Now you might not trust people / networks who interact heavily with your service, but are either isolated (no connection) or who connect a large group with another large group ("articulation points"). You also might not trust fast growing, heavily interacting new, isolated graphs.
When a user provides content that is liked by many other users (who you trust), this might be an indicator that there is a real human creating it.
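To make the graph idea concrete, here is a minimal Python sketch that flags the two suspicious shapes mentioned above: isolated accounts and articulation points. The adjacency-dict representation and the function names are assumptions for illustration; the articulation-point search is the standard depth-first (Tarjan-style) method, written recursively, so it suits small graphs only.

```python
def isolated_vertices(graph):
    """Accounts with no connections at all. graph maps vertex -> set of neighbors."""
    return {v for v, neighbors in graph.items() if not neighbors}


def articulation_points(graph):
    """Vertices whose removal would disconnect part of an undirected graph."""
    disc, low, result = {}, {}, set()
    counter = 0

    def dfs(u, parent):
        nonlocal counter
        disc[u] = low[u] = counter   # discovery time and lowest reachable time
        counter += 1
        children = 0
        for v in graph[u]:
            if v not in disc:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # Non-root u is a cut vertex if v's subtree cannot reach above u.
                if parent is not None and low[v] >= disc[u]:
                    result.add(u)
            elif v != parent:
                low[u] = min(low[u], disc[v])   # back edge
        # The DFS root is a cut vertex only if it has several DFS children.
        if parent is None and children > 1:
            result.add(u)

    for u in graph:
        if u not in disc:
            dfs(u, None)
    return result
```

Being an articulation point is not proof of fraud (a real person can bridge two friend groups), so the output is better used as one signal among several than as a ban list.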
We had a similar issue recently on our website, it is really a hassle to solve this issue if you are providing a business over one time or monthly recurring free credits system.
We are using a fraud detection solution https://fraudradar.io for a while and that helped us a lot to clean out most of the spam activities. It is pretty customizable with:
IP checks
Email domain validity
Regex rules
Whitelisting options per IP, email domain etc.
Simple API to communicate through
I would suggest to check that out.

Best Practice / Standard for storing an Address in a SQL Database

I am wondering if there is some sort of "standard" for storing US addresses in a database? It seems this is a common task, and there should be some sort of a standard.
What I am looking for is a specific schema of how the database tables should work and interact, already in third normal form, including data types (MySQL). A good UML document would work.
Maybe I'm just being lazy, but this is a very common task, and I am sure someone has published an efficient way to do this somewhere. I just don't know where to look and Google isn't helping. Please point me to the resource. Thanks.
EDIT
Although this is more of a general question, I would like to clarify my specific needs.
Addresses will be used to specify road addresses of locations of events. These addresses will need to be in a format that can be best broken down and searched, and also used by any third-party applications I may end up linking my data source to.
ALSO. Data will be geo-coded (long, lat) on entry and stored separately, so it must fit the (yet undecided) protocol of whatever geocoder / application / library does that.
For international addresses, refer to the Universal Postal Union's Postal Addressing Systems database.
For U.S. addresses, refer to USPS Publication 28 "Postal Addressing Standards".
The USPS wants the following unpunctuated address components concatenated on a single line:
house number
predirectional (N, SE, etc.)
street
suffix (AVE, BLVD, etc.)
postdirectional (SW, E, etc.)
unit (APT, STE, etc.)
apartment/suite number
E.g. 102 N MAIN ST SE APT B
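The component order above can be sketched as a small Python helper that builds the single delivery address line. The dictionary keys and the function name are invented for illustration, and the punctuation handling is deliberately crude (it only drops periods and commas), so this is a sketch of the ordering, not a Pub 28 implementation.

```python
# Component order from the list above; key names are assumptions.
FIELD_ORDER = (
    "house_number",
    "predirectional",
    "street",
    "suffix",
    "postdirectional",
    "unit",
    "unit_number",
)

# Translation table that deletes periods and commas.
STRIP_PUNCTUATION = str.maketrans("", "", ".,")


def delivery_line(parts):
    """Concatenate the present components, uppercased and unpunctuated."""
    tokens = (str(parts.get(field, "")).strip() for field in FIELD_ORDER)
    line = " ".join(token for token in tokens if token)
    return line.translate(STRIP_PUNCTUATION).upper()
```

Missing components are simply skipped, so an address with no directionals or unit still comes out as a clean single line.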
If you keep the entire address line as a single field in your database, input and editing is easy, but searches can be more difficult (e.g., in the case of SOUTH EAST LANE, is the street EAST, as in S EAST LN, or is it LANE, as in SE LANE ST?).
If you keep the address parsed into separate fields, searches for components like street name or apartments become easier, but you have to append everything together for output, you need CASS software to parse correctly, and PO boxes, rural route addresses, and APO/FPO addresses have special parsings.
A physical location with multiple addresses at that location is either a multiunit building, in which case letters/numbers after units like APT and STE designate the address, or it's a Commercial Mail Receiving Agency (eg, UPS store) and a maildrop/private mailbox number is appended (like 100 MAIN ST STE B PMB 102), or it's a business with one USPS delivery point and mail is routed after USPS delivery (which usually requires a separate mailstop field which the company might need but the USPS won't want on the address line).
A contact with more than one physical address is usually a business or person with a street address and a PO box. Note that it's common for each address to have a different ZIP code.
It's quite typical that one business transaction might have a shipping address and a billing address (again, with different ZIP codes). The information I keep for EACH address is:
name prefix (DR, MS, etc)
first name and initial
last name
name suffix (III, PHD, etc)
mail stop
company name
address (one line only per Pub 28 for USA)
city
state/province
ZIP/postal code
country
I typically print mail stops somewhere between the person's name and company because the country contains the state/ZIP which contains the city which contains the address which contains the company which contains the mail stop which contains the person. I use CASS software to validate and standardize addresses when entered or edited.
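As a rough illustration, the per-address fields listed above and the smallest-to-largest print order could be modeled like this in Python. The class name, attribute names, and default country are assumptions, and this is a sketch of the record shape rather than a database schema; the mail stop is printed between the person's name and the company, as described.

```python
from dataclasses import dataclass


@dataclass
class MailingAddress:
    # Field list follows the answer above; attribute names are assumed.
    first_name: str            # first name and initial
    last_name: str
    address: str               # one line only, per Pub 28 for USA
    city: str
    state: str                 # state/province
    postal_code: str           # ZIP/postal code
    country: str = "USA"
    name_prefix: str = ""      # DR, MS, etc.
    name_suffix: str = ""      # III, PHD, etc.
    mail_stop: str = ""
    company: str = ""

    def label_lines(self):
        """Return label lines ordered from the person out to the country."""
        name = " ".join(
            part
            for part in (self.name_prefix, self.first_name, self.last_name, self.name_suffix)
            if part
        )
        lines = [
            name,
            self.mail_stop,
            self.company,
            self.address,
            f"{self.city} {self.state} {self.postal_code}",
            self.country,
        ]
        return [line for line in lines if line]
```

A transaction with separate shipping and billing addresses would then just hold two such records.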
First, as a person who spends most of their professional day working with addresses, I can tell you they are hard to manage from a data perspective.
If you ask 5 people what address they live at, you will find that you get 5 different answers. While you and I can tell that 123 Main Street Apt 1 and Apt 1 123 Main Street are the same address, the database program will have a challenge.
If you are using United States-centric addresses, CASS-certified software from almost any vendor will standardize your addresses reasonably well. I would recommend a simple format as follows:
Address 1
Address 2
Address 3
City
State
Zip
Zip+4 (I would carry this so lookups are easier when checking for duplicates)
However, if you want a universal address I would look at the ADIS standard from IdeaAlliance. This standard can be used to breakdown (parse) addresses from almost any country into the relevant parts. Then they can be put back together using templates/components based on the Universal Postal Union standards (UPU S42 Standard on International Postal Address Components and Templates).
The big plus of this format is that addresses that don't exist in a postal database like CASS can be entered and stored as separate parts.
Very similar questions have been asked before.
Addresses are messy - at best.
It partly depends on what you want to do with the addresses. If you're going to use them to mail things to people, then you simply need to record the image that will appear on the address label in a convenient form. If you're going to analyze the address, you have to work a lot harder.
Remember that the first time you have to deal with someone outside the US, all previous rules go astray. You may be strictly US-only, but beware.
I looked into this a while ago, but for international addresses. I didn't find much in the way of a consensus. However, for the US, I found the succinctly named United States Thoroughfare, Landmark, and Postal Address Data Standard (Draft):
http://www.fgdc.gov/standards/projects/FGDC-standards-projects/street-address/index_html
I don't think that they actually provide any specific database schema ideas, but it might be a good starting point.
First, the "best" means of storing an address depends greatly on how it will be used. Is it just for reference or searches on say city? Do you plan on addressing envelopes? Are you going to integrate with a shipping system like FedEx or UPS? Will you store non-US addresses? Once you get into the realm of integrating with something that ships, you should start looking at CASS. This is a specification for handling the USPS addresses. There are applications out there that are CASS certified which will store and verify addresses. Thus, the second best practice would be to try to avoid reinventing the wheel and see if there is a system out there that will solve your problem especially if you are going to go international. You want to leverage the fact that someone else has worked out all the details about how to properly and efficiently store addresses for many countries around the world instead of having to do that investigation yourself.
I've had to try to do this before and I'd found this document that gives you some pointers. I ended up shelving my schema since my application does have to deal with international addresses.

How do you deal with duplicate street suffixes?

I have a system where users need to enter addresses. I am trying to limit duplicates of course and something I started noticing was becoming a big problem was some users putting in "Road" and others "Rd", therefore duplicates were creeping in.
I looked up the list of USPS street suffix abbreviations but I still have a question which I can't find an answer to. Can I replace all words in a street address with the USPS standard abbreviation? An example would be "123 Forest Hill Road". If I were to replace it with the abbreviations it would then be "123 Frst Hl Rd", or does the "street suffix" that USPS is referring to mean they only want you to go as far as "123 Forest Hill Rd"?
USPS has an API that can get you properly formatted addresses.
You would have to ask the USPS to be sure, but I imagine that your app and data would be in trouble if you started replacing "123 Forest Hill Rd" with "123 Frst Hl Rd".
I have done some work with addresses and let me tell you it is very complicated and time consuming to do even remotely correctly. In most cases you would be better off making use of existing packages out there. For example, you would be surprised what you can achieve with a few simple calls to the free Google Maps API.
Can you avoid the whole problem by expanding all of the terms rather than attempting to abbreviate any?
On the duplicates, just wondering if you'd be better to make the Users choose from a drop-down of Address Types. Take it out of the User's hands.
On the abbreviation, are you asking this because USPS needs the Address in some specific format? Just wondering what purpose there is in the abbreviating. Apologies if I've missed the mark.
You could also take a look at the USPS Postal Addressing Standards which has explanation of the preferred and acceptable formats for various address examples.
http://pe.usps.gov/text/pub28/pub28c2_toc.htm
In the example case, the relevant section is 23 Delivery Address Line.
http://pe.usps.gov/text/pub28/pub28c2_012.htm
The trouble with trying to expand/contract addresses yourself is that oftentimes abbreviations can be part of the street or even city name. For example: "100 Avenue A", where Avenue isn't supposed to be abbreviated. Or "900 St Louis Loop". In this case St doesn't mean street, it means Saint.
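If you do attempt it yourself, one conservative heuristic is to abbreviate only the final word of the street line and leave everything before it alone. The Python sketch below illustrates that idea under stated assumptions: the function name is made up, the suffix table is a tiny subset of the USPS list, and it does not handle trailing units or postdirectionals (for example "123 Main Street Apt 4"), which is exactly where a verification service earns its keep.

```python
# A small subset of USPS suffix abbreviations; the full list is in Pub 28.
SUFFIXES = {
    "ROAD": "RD",
    "STREET": "ST",
    "AVENUE": "AVE",
    "BOULEVARD": "BLVD",
    "LANE": "LN",
    "DRIVE": "DR",
}


def abbreviate_suffix(street_line):
    """Abbreviate only a trailing street suffix, leaving name words untouched."""
    words = street_line.split()
    # Require at least number + name + suffix so a bare name is never rewritten.
    if len(words) >= 3:
        last = words[-1].upper().rstrip(".")
        if last in SUFFIXES:
            abbreviation = SUFFIXES[last]
            # Match the input's casing style: ROAD -> RD, Road -> Rd.
            words[-1] = abbreviation if words[-1].isupper() else abbreviation.capitalize()
    return " ".join(words)
```

With this rule "123 Forest Hill Road" becomes "123 Forest Hill Rd", while "100 Avenue A" and "900 St Louis Loop" pass through unchanged because their last word is not in the table.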
Within the USA, there is a component of a certified address called a delivery point barcode (DPBC). It's a unique 12-digit value that can serve as the unique identifier of an address. To get this value you'll want to use an address verification or address standardization web service API, which can cost about $20/mo depending upon the volume of requests you make to it. Using this you can easily prevent duplicates or do fraud prevention/detection, etc.
In the interest of full disclosure, I'm the founder of SmartyStreets. We offer just such an address verification web service API called LiveAddress. You're more than welcome to contact me personally with any questions you have.