I'm trying to convert a zip code to a street address that resides roughly at the center of that zip code

I've got a database of addresses provided by users. In some cases, the user only entered a zip code that, for a time, we thought would be adequate for our purposes. Now we know that a zip code doesn't cut it and we need an actual address.
So, we're looking to convert each zip code to an address that's close enough to that location, and then let the users know what we did so they can go in and change those addresses if need be.
Does anyone know of anything that can provide the data I'm looking for?

I was a developer at SmartyStreets (I have a lot of experience with street addresses).
Unfortunately, ZIP codes cannot be converted into an address.
US addresses are identified (almost uniquely) by an 11-digit number (12 with the check digit) called a Delivery Point Barcode. The ZIP code is typically the first 5 digits of this code. The remaining 6 digits provide even more granularity to pinpoint the address. Note that this does not pinpoint an address geographically, but only logically, according to the USPS' address assignment system. (They assign ZIP codes and addresses based on what makes the most sense for administration and delivery.) Without those other digits, accurate interpolation is highly, highly unlikely.
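For context, the check digit itself is mechanical: it is chosen so that all 12 digits sum to a multiple of 10 (the POSTNET rule). A minimal sketch, where the ZIP+4 add-on and delivery point are invented for illustration:

```python
def dpbc_check_digit(digits11):
    """Check digit that makes the sum of all 12 digits a multiple of 10."""
    total = sum(int(d) for d in digits11)
    return (10 - total % 10) % 10

# Made-up example: ZIP 90210, +4 add-on 1234, delivery point 56
code = "90210" + "1234" + "56"
print(code + str(dpbc_check_digit(code)))  # 902101234567
```

The point is that the 6 digits after the ZIP are assigned by the USPS per address; nothing in the ZIP alone lets you reconstruct them.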
Trying to resolve an address from a ZIP code is like trying to scale a 100px bitmap up into a 100,000px image; the exact ratio varies by ZIP code, but the missing detail simply cannot be recovered.
Technical unfeasibility aside, I would advise against finding an address then asking users to correct it. From a UX and DBA perspective, why not simply ask your users to fill in the correct address? If you try to guess the address instead, users will be confused, and the database admins will probably lose track of which addresses are bogus and which ones were corrected. In other words, your data will lie to you.
You'd probably rather have incomplete data and know it than have incorrect data and not know it.
Even if users don't fill out their address, you can, with more reasonable certainty, assume a city and state from the ZIP code. SmartyStreets has an API for that. At least that way you'd have an approximate coordinate and city/state instead of completely wrong address data; the odds of guessing a full address correctly are tens of thousands to one against.

You can use https://thezipcodes.com/. Get the API key from the account section, then send a GET request:
https://thezipcodes.com/api/v1/search?zipCode=302019&countryCode={2digitCountryCode}&apiKey={apiKey}
{
  "success": true,
  "location": [
    {
      "country": "RU",
      "countryCode2": "RU",
      "countryCode3": "",
      "state": "Орловская Область",
      "stateCode2": "56",
      "latitude": "53.0747",
      "longitude": "36.2468",
      "zipCode": "302019",
      "city": "Орел 19"
    },
    {
      "country": "India",
      "countryCode2": "IN",
      "countryCode3": "IND",
      "state": "Rajasthan",
      "stateCode2": "24",
      "latitude": "26.7865",
      "longitude": "75.5809",
      "zipCode": "302019",
      "city": "Shyam Nagar (Jaipur)",
      "timeZone": "Indian Standard Time"
    }
  ]
}
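Note that the same ZIP can exist in several countries, so filter by country code when consuming the response. A sketch using Python's standard library against an abbreviated version of the payload above:

```python
import json

# Abbreviated copy of the response shown above
response_body = '''
{
  "success": true,
  "location": [
    {"countryCode2": "RU", "latitude": "53.0747", "longitude": "36.2468",
     "zipCode": "302019", "city": "Орел 19"},
    {"countryCode2": "IN", "latitude": "26.7865", "longitude": "75.5809",
     "zipCode": "302019", "city": "Shyam Nagar (Jaipur)"}
  ]
}
'''
data = json.loads(response_body)

# Pick the match for the country we actually care about
india = next(loc for loc in data["location"] if loc["countryCode2"] == "IN")
lat, lon = float(india["latitude"]), float(india["longitude"])
print(india["city"], lat, lon)  # Shyam Nagar (Jaipur) 26.7865 75.5809
```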

Related

Seeking Insight About MongoDB Performance for Specific Use Case

I'm in the beginning phases of a complete redesign of my existing youth sports league website (www.parochialathleticleague.org), using Vue.js for the frontend, and need to make a final decision on the backend structure, so I can move forward with the project. For the past 5 years, we've used a MySQL database with custom PHP scripts as the intermediary to send, receive and edit data in the league schedules and standings, etc.
I'm confident I could use MySQL effectively again, perhaps with a smarter table structure, but I'm very intrigued by the possibility of using MongoDB for this project instead as a NoSQL experiment. I know, I know... there's a lot of hate out there for Mongo and it's best suited for very specific use cases, but I think it would be an interesting thing to become acquainted with for future reference.
That said, if I decide to go with Mongo, I would really appreciate some insight from those with far more experience about what can be expected from performance of the database in this proposed use case. Specifically, there are three potential options that I would consider for organizing the schedule data. Please see below (I've included an example of one game's data for each option):
Option 1 - Collection: "2018 Boys Basketball - Coastal 2A"
{
  "date": "September 7",
  "time": "3:30 PM",
  "home": "Blessed Sacrament",
  "h_score": 0,
  "visitor": "St. Columban",
  "v_score": 0,
  "location": "Blessed Sacrament",
  "game_id": 260
}
In the above proposed solution, I would make a separate collection for each division. This past season, there were 34 total divisions in the sport of basketball, so that would equal 34 collections in this example. That means more organizational overhead and more time to configure, but logic would dictate it's the fastest possible solution for the end user because any one division holds no more than about 50 games. That's also how I've had it set up in our existing website's MySQL solution: one table per division.
Option 2 - Collection: "2018 Boys Basketball"
{
  "date": "September 7",
  "time": "3:30 PM",
  "home": "Blessed Sacrament",
  "h_score": 0,
  "visitor": "St. Columban",
  "v_score": 0,
  "location": "Blessed Sacrament",
  "game_id": 260,
  "division": "Coastal 2A"
}
In the above proposed solution, I would make a separate collection for each gender, resulting in only two collections - one for all of the boys' basketball games and one for the girls'. In terms of organization and setup, this seems far more desirable than having to create 34 collections as in the above example. However, I'm unsure of how much slower this will be for the end user because now the collection may hold up to 500 games, rather than 50. So, parsing through 500-ish games to locate the 50 or so in one particular division before rendering that division's schedule in the browser could be an issue - or, hopefully, not at all!
Option 3 - Collection: "2018 Basketball"
{
  "date": "September 7",
  "time": "3:30 PM",
  "home": "Blessed Sacrament",
  "h_score": 0,
  "visitor": "St. Columban",
  "v_score": 0,
  "location": "Blessed Sacrament",
  "game_id": 260,
  "division": "Coastal 2A",
  "gender": "boys"
}
In the above proposed solution, I would make a single collection for all of the games in the particular season. This could mean up to 1000 games total and I'm thinking that speed for the end user might become a real concern.
Your thoughts and suggestions are much appreciated. Would really value some real world feedback before making a final decision. Also, if you think I'm a complete idiot (I don't necessarily disagree!) and should definitely stick with SQL for this project, I would pose the same question about the above three options, just in table/row structure. Many thanks!
One important thing you need to have in mind is that schema design in MongoDB (and most document store solutions) usually revolves around embedding documents. So, you would be better off sticking the games themselves in the division docs. And then games would probably be better off containing, as well, the details of teams, players, etc.
Of course you can normalize data (like you do in a relational database) but you will probably be relying on anti-patterns and bump into annoying performance issues.
MongoDB also has a 16MB cap for each document, so you would need to have that in mind when designing the schema.
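For illustration, the embedding idea might look like this; the field names here are just guesses combining the options above into one document per division (shown as a plain Python dict, which is what you would hand to a driver like pymongo):

```python
# One document per division, with the games embedded inside it.
division_doc = {
    "_id": "2018-boys-basketball-coastal-2a",
    "season": 2018,
    "sport": "basketball",
    "gender": "boys",
    "division": "Coastal 2A",
    "games": [
        {
            "game_id": 260,
            "date": "September 7",
            "time": "3:30 PM",
            "home": "Blessed Sacrament",
            "h_score": 0,
            "visitor": "St. Columban",
            "v_score": 0,
            "location": "Blessed Sacrament",
        },
        # ... roughly 50 games per division, far below the 16 MB cap
    ],
}

# With pymongo, storing and fetching a division would be roughly:
#   db.divisions.insert_one(division_doc)
#   db.divisions.find_one({"division": "Coastal 2A", "gender": "boys"})
print(len(division_doc["games"]))  # 1
```

One fetch then returns a whole division's schedule, which matches how the pages are rendered.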
As a side note, since you mentioned you have a MySQL background: nowadays MySQL also has a document store offering similar to MongoDB (but with some differences). There would probably be a lot less to learn in your case, so maybe you should give it a look.
Disclaimer: I work at the MySQL connectors team at Oracle.

Pull username from input text in Watson Conversation

As part of our Watson Conversation flow, we have one node that asks for usernames that we put into a context variable.
Most people just give us their username (janedoe12), and that is handled fine using this code.
{
  "context": {
    "username": "<?input_text?>"
  },
  "output": {
    "text": {
      "values": [
        "Hi $username!"
      ]
    }
  }
}
The issue is when people say "My username is janedoe12" that I can't seem to solve - I'm still trying to just pull out 'janedoe12'. We haven't often seen many other ways people give us their usernames (i.e. no one says "username janedoe12" or "my sn is janedoe12"), so we're not too worried about handling cases besides this one.
Here's what I have:
{
  "context": {
    "username": "<?input.text.extract('(?<=username is ).*')?>"
  },
  "output": {
    "text": {
      "values": [
        "Hi $username!"
      ]
    }
  }
}
I feel like the problem is likely that I'm using the wrong RegEx style but I'm not quite sure how to go from here. I can get to this node fine, but it returns $username as null. Anyone have any suggestions on how to design this?
I blogged about this a long time ago.
Thankfully things have changed somewhat to make it easier for you. There is a BETA system entity called #sys-person. It can pick up a person's name.
Once you detect the entity, you can reference the input directly like so.
"username": "<? #sys-person ?>"
To expand on this: because you are asking for a username rather than a person's name, you are going to suffer the same issues as mentioned in the blog post. Those being:
Answering with "My name is"
Refusing to give a name.
Asking why would you want the name.
Playing with the system to get a joke response.
There may be other conditions.
Here are two ways to approach it.
First, the application layer can hand the username over to Conversation if the user had to log in to talk to the chat bot. This means you don't need to ask the person, and the value will be coming from a trusted source.
Second, sometimes simply rewording how you speak to the user can negate any complex coding. If I say "What is your username?", someone can respond as you mentioned.
However, if I say "Please enter your 6 character user ID", then people are more likely to only enter their ID.
At that point you can scan by using the following.
"username": "<? input.text.extract('^[A-Za-z0-9_]{6}') ?>"
If you don't detect the user ID you can then prompt them to type it in correctly.
As Wiktor Stribiżew suggested in a comment, a capture group also works instead of the lookbehind: ('username\s+is\s+(.*)', 1)
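The capture-group logic can be sanity-checked outside Watson. A rough Python equivalent (Watson's extract uses Java regex, but the pattern behaves the same; I've narrowed `.*` to `\S+` here as an assumption, to stop at the first whitespace):

```python
import re

def extract_username(text):
    # Take the token after "username is" if present;
    # otherwise assume the whole input is the username.
    m = re.search(r"username\s+is\s+(\S+)", text, re.IGNORECASE)
    return m.group(1) if m else text.strip()

print(extract_username("My username is janedoe12"))  # janedoe12
print(extract_username("janedoe12"))                 # janedoe12
```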

How to create a view in Couchbase with multiple WHERE and OR clauses

I'm new to Couchbase, and I'm looking for a solution to scale my social network. Couchbase looks interesting, especially its easy-to-scale features.
But I'm struggling about creating a view for a specific kind of document.
My documents look like this:
{
  "id": 9476182,
  "authorid": 86498,
  "content": "some text here",
  "uid": 41,
  "accepted": "N",
  "time": "2014-12-09 09:58:03",
  "type": "testimonial"
}
{
  "id": 9476183,
  "authorid": 85490,
  "content": "some text here",
  "uid": 41,
  "accepted": "Y",
  "time": "2014-12-09 10:44:01",
  "type": "testimonial"
}
What I'm looking for is for a view that would be equivalent to this SQL query.
SELECT * FROM bucket WHERE (uid='$uid' AND accepted='Y') OR
(uid='$uid' AND authorid='$logginid')
This way I could fetch all of a user's testimonials, even the ones not yet approved, when the person viewing the testimonials page is its owner. Otherwise, I'd show all of the given user's accepted ("Y") testimonials, plus any not-yet-approved testimonials written by the user who is viewing the page.
If you could give me some tips about this I'll be very grateful.
Unlike SQL you cannot directly pass input parameters into views; however, you can emulate this to some extent by filtering ranges.
While not exactly matching SQL, I would suggest you simply filter testimonials based on the user ID, and then do the filtering on the client side. I am making the assumption that in most cases there will not even be any pending testimonials, and therefore you will not really end up with a lot of unnecessary data.
Note that it is possible to filter this using views entirely, however it would require:
Bigger keys OR
Multiple views OR
Multiple queries
In general it is recommended to make the emitted keys smaller, as this increases performance; so better stick with the above-mentioned solution.
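The suggested pattern (query the view by user ID, then filter on the client) can be sketched in Python; the field names come from the example documents above:

```python
def visible_testimonials(docs, page_uid, viewer_id):
    """Client-side equivalent of the SQL above: accepted testimonials
    on this page, plus the viewer's own pending ones."""
    return [
        d for d in docs
        if d["uid"] == page_uid
        and (d["accepted"] == "Y" or d["authorid"] == viewer_id)
    ]

# docs as returned by a view keyed on uid
docs = [
    {"id": 9476182, "authorid": 86498, "uid": 41, "accepted": "N"},
    {"id": 9476183, "authorid": 85490, "uid": 41, "accepted": "Y"},
]
print([d["id"] for d in visible_testimonials(docs, 41, 86498)])
# [9476182, 9476183] - the author of the pending one sees it too
```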

Map or Array for RESTful design of finite, unordered collection?

A coworker and I are in a heated debate regarding the design of a REST service. For most of our API, GET calls to collections return something like this:
GET /resource
[
  { "id": 1, ... },
  { "id": 2, ... },
  { "id": 3, ... },
  ...
]
We now must implement a call to a collection of properties whose identifying attribute is "name" (not "id" as in the example above). Furthermore, there is a finite set of properties and the order in which they are sent will never matter. The spec I came up with looks like this:
GET /properties
[
  { "name": "{PROPERTY_NAME}", "value": "{PROPERTY_VALUE}", "description": "{PROPERTY_DESCRIPTION}" },
  { "name": "{PROPERTY_NAME}", "value": "{PROPERTY_VALUE}", "description": "{PROPERTY_DESCRIPTION}" },
  { "name": "{PROPERTY_NAME}", "value": "{PROPERTY_VALUE}", "description": "{PROPERTY_DESCRIPTION}" },
  ...
]
My coworker thinks it should be a map:
GET /properties
{
  "{PROPERTY_NAME}": { "value": "{PROPERTY_VALUE}", "description": "{PROPERTY_DESCRIPTION}" },
  "{PROPERTY_NAME}": { "value": "{PROPERTY_VALUE}", "description": "{PROPERTY_DESCRIPTION}" },
  "{PROPERTY_NAME}": { "value": "{PROPERTY_VALUE}", "description": "{PROPERTY_DESCRIPTION}" },
  ...
}
I cite consistency with the rest of the API as the reason to format the response collection my way, while he cites that this particular collection is finite and the order does not matter. My question is, which design best adheres to RESTful design and why?
IIRC how you return the properties of a resource does not matter in a RESTful approach.
http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm
From an API client point of view I would prefer your solution, considering it is explicitly stating that the name of a property is XYZ.
Whereas your coworker's solution implies it is the name, but how would I know for sure (without reading the API documentation)? Try not to assume anything regarding your consuming clients; just because you know what it means (and it is probably easy enough to guess) does not mean it is obvious to your clients.
On top of that, it could break consuming clients if you ever decide to change that value from a name back to an ID, which in this case has already happened once in the past. Then all the clients would need to change their code, whereas with your solution they would not have to, unless they need the newly added id (or some other property).
To me the approach would depend on how you need to use the data. Are the property names known beforehand by the consuming system, such that a map lookup could directly access the record you want without iterating over each item? Would there be a method such as...
GET /properties/{PROPERTY_NAME}
If you need to look up properties by name and that sort of method is NOT available, then I would agree with the map approach, otherwise, I would go with the array approach to provide consistent results when querying the resource for a full collection.
I think returning a map is fine as long as the result is not paginated or sorted server side.
If you need the result to be paginated and sorted on the server side, going for the list approach is a much safer bet, as not all clients might preserve the order of a map.
In fact in JavaScript there is no built in guarantee that maps will stay sorted (see also https://stackoverflow.com/a/5467142/817385).
The client would need to implement some logic to restore the sort order, which can become especially painful when server and client are using different collations for sorting.
Example
// server sent response sorted with german collation
var map = {
  'ä': {'first': 'first'},
  'z': {'second': 'second'}
}
// but we sort the keys with the default unicode collation algorithm
Object.keys(map).sort().forEach(function(key){ console.log(map[key]) })
// Object {second: "second"}
// Object {first: "first"}
A bit late to the party, but for whoever stumbles upon this with similar struggles...
I would definitely agree that consistency is very important and would generally say that an array is the most appropriate way to represent a list. Also APIs should be designed to be useful in general, preferably without optimizing for a specific use-case. Sure, it could make implementing the use-case you're facing today a bit easier but it will probably make you want to hit yourself when you're implementing a different one tomorrow. All that being said, of course for quite some applications the map-formed response would just be easier (and possibly faster) to work with.
Consider:
GET /properties
[
  { "name": "{PROPERTY_NAME}", "value": "{PROPERTY_VALUE}", "description": "{PROPERTY_DESCRIPTION}" },
  ...
]
and
GET /properties/*
{
  "{PROPERTY_NAME}": { "value": "{PROPERTY_VALUE}", "description": "{PROPERTY_DESCRIPTION}" },
  ...
}
So / gives you a list whereas /* gives you a map. You might read the * in /* as a wildcard for the identifier, so you're actually requesting the entities rather than the collection. The keys in the response map are simply the expansions of that wildcard.
This way you can maintain consistency across your API while the client can still enjoy the map-format response when preferred. Also you could probably implement both options with very little extra code on your server side.
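However the endpoints are split, the map shape is cheap to derive from the canonical list, so supporting both costs little. A Python sketch with made-up property names:

```python
# Canonical list representation (what GET /properties returns)
properties = [
    {"name": "color", "value": "red", "description": "Accent color"},
    {"name": "size", "value": "XL", "description": "Garment size"},
]

# Map representation (what GET /properties/* would return): key by
# "name" and drop it from each value object.
properties_map = {
    p["name"]: {k: v for k, v in p.items() if k != "name"}
    for p in properties
}
print(properties_map["color"])
```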

How should I populate city/state fields based on the zip?

I'm aware there are databases for zip codes, but how would I grab the city/state fields based on that? Do these databases contain the city/states or do I have to do some sort of lookup to a webservice?
\begin{been-there-done-that}
Important realization: There is not a one-to-one mapping between cities/counties and ZIP codes. A ZIP code is not based on a political area but instead a distribution area as defined for the USPS's internal use. It doesn't make sense to look up a city based on a ZIP code unless you have the +4 or the entire street address to match a record in the USPS address database; otherwise, you won't know if it's RICHMOND or HENRICO, DALLAS or FORT WORTH, there's just not enough information to tell.
This is why, for example, many e-commerce vendors find dealing with New York state sales tax frustrating, since that tax scheme is based on county, e-commerce systems typically don't ask for the county, and ZIP codes (the only information they provide instead) in New York can span county lines.
The USPS updates its address database every month, and access costs real money, so pretty much any list that you find freely available on the Internet is going to be out of date, especially with the USPS closing post offices to save money.
One ZIP code may span multiple place names, and one city often uses several (but not necessarily whole) ZIP codes. Finally, the city name listed in the ZIP code file may not actually be representative of the place in which the addressee actually lives; instead, it represents the location of their post office. Our office mail is addressed to ASHLAND, but we work about 7 miles from the town's actual political limits. ASHLAND just happens to be where our carrier's route originates from.
For guesstimating someone's location, such as for a search of nearby points of interest, these sources and City/State/ZIP sets are probably fine; they don't need to be exact. But for address validation in a data entry scenario? Absolutely not: validate the whole address or don't bother at all.
Just a friendly reminder to take a step back and remember the data source's intended use!
\end{been-there-done-that}
Modern ZIP code databases contain columns for the city and state fields.
http://sourceforge.net/projects/zips/
http://www.populardata.com/
Using the Ziptastic HTTP/JSON API
This is a pretty new service, but according to their documentation, it looks like all you need to do is send a GET request to http://ziptasticapi.com, like so:
GET http://ziptasticapi.com/48867
And they will return a JSON object along the lines of:
{"country": "US", "state": "MI", "city": "OWOSSO"}
Indeed, it works. You can test this from a command line by doing something like:
curl http://ziptasticapi.com/48867
Using the US Postal Service HTTP/XML API
According to this page on the US Postal Service website which documents their XML based web API, specifically Section 4.0 (page 22) of this PDF document, they have a URL where you can send an XML request containing a 5 digit Zip Code and they will respond with an XML document containing the corresponding City and State.
According to their documentation, here's what you would send:
http://SERVERNAME/ShippingAPITest.dll?API=CityStateLookup&XML=<CityStateLookupRequest%20USERID="xxxxxxx"><ZipCode ID= "0"><Zip5>90210</Zip5></ZipCode></CityStateLookupRequest>
And here's what you would receive back:
<?xml version="1.0"?>
<CityStateLookupResponse>
  <ZipCode ID="0">
    <Zip5>90210</Zip5>
    <City>BEVERLY HILLS</City>
    <State>CA</State>
  </ZipCode>
</CityStateLookupResponse>
USPS does require that you register with them before you can use the API, but, as far as I could tell, there is no charge for access. By the way, their API has some other features: you can do Address Standardization and Zip Code Lookup, as well as the whole suite of tracking, shipping, labels, etc.
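A response shaped like that CityStateLookupResponse can be parsed with nothing but the standard library; a rough Python sketch:

```python
import xml.etree.ElementTree as ET

# Hard-coded copy of the response shown above
xml_body = """<?xml version="1.0"?>
<CityStateLookupResponse>
  <ZipCode ID="0">
    <Zip5>90210</Zip5>
    <City>BEVERLY HILLS</City>
    <State>CA</State>
  </ZipCode>
</CityStateLookupResponse>"""

root = ET.fromstring(xml_body)
zip_el = root.find("ZipCode")
city = zip_el.findtext("City")
state = zip_el.findtext("State")
print(city, state)  # BEVERLY HILLS CA
```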
I'll try to answer the question "HOW should I populate...", and not "SHOULD I populate..."
Assuming you are going to do this more than once, you would want to build your own database. This could be nothing more than a text file you downloaded from any of the many sources (see Pentium10 reply here). When you need a city name, you search for the ZIP, and extract the city/state text. To speed things up, you would sort the table in numeric order by ZIP, build an index of lines, and use a binary search.
If your ZIP database looked like this (from sourceforge):
"zip code", "state abbreviation", "latitude", "longitude", "city", "state"
"35004", "AL", " 33.606379", " -86.50249", "Moody", "Alabama"
"35005", "AL", " 33.592585", " -86.95969", "Adamsville", "Alabama"
"35006", "AL", " 33.451714", " -87.23957", "Adger", "Alabama"
The most simple-minded extraction from the text would go something like
$zipLine = lookup($ZIP);            // e.g. a binary search over the sorted file
if ($zipLine) {
    $fields = str_getcsv($zipLine); // handles the quoted, comma-separated fields
    $city   = trim($fields[4]);
    $state  = trim($fields[5]);
} else {
    die("$ZIP not found");
}
If you are just playing with text in PHP, that's all you need. But if you have a database application, you would do everything in SQL. Further details on your application may elicit more detailed responses.
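For comparison, an equivalent lookup-table sketch in Python using the sourceforge-style rows shown above; loading everything into a dict trades a little memory for O(1) lookups instead of a binary search:

```python
import csv
import io

# Stand-in for the downloaded ZIP file (header row omitted)
zip_csv = io.StringIO(
    '"35004", "AL", " 33.606379", " -86.50249", "Moody", "Alabama"\n'
    '"35005", "AL", " 33.592585", " -86.95969", "Adamsville", "Alabama"\n'
)

# Build zip -> (city, state) once at startup
table = {}
for row in csv.reader(zip_csv, skipinitialspace=True):
    zip_code, state_abbr, lat, lon, city, state = row
    table[zip_code] = (city.strip(), state.strip())

print(table.get("35005"))  # ('Adamsville', 'Alabama')
print(table.get("99999"))  # None - ZIP not found
```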