Understanding OpenAddresses data format

I have downloaded US-West geolocation data (postal addresses) from openaddresses.io. Some of the addresses in the datasets are not complete, i.e., some of them don't have info like the ZIP code. Is there a way to retrieve it, or is the data incomplete?
I have tried searching the other files hoping to find any related info, but the complete dataset doesn't contain anything related to it. The City of Mesa, AZ has multiple ZIP codes, so it is hard to assign one to an address. Is there any way to address this problem?
This is how the data looks (City of Mesa, AZ):
LON,LAT,NUMBER,STREET,UNIT,CITY,DISTRICT,REGION,POSTCODE,ID,HASH
-111.8747353,33.456605,790,N DOBSON RD,,SRPMIC,,,,,dc0c53196298eb8d
-111.8886227,33.4295194,2630,W RIO SALADO PKWY,,MESA,,,,,c38b700309e1e9ce
-111.8867018,33.4290795,2401,E RIO SALADO PKWY,,TEMPE,,,,,9b912eb2b1300a27
-111.8832045,33.4232903,700,S EVERGREEN RD,,TEMPE,,,,,3435b99ab3f4f828
-111.8761202,33.4296416,2100,W RIO SALADO PKWY,,MESA,,,,,b74349c833f7ee18
-111.8775844,33.4347782,1102,N RIVERVIEW,,MESA,,,,,17d0cf1542c66083

Short answer: the data is incomplete.
The data in OpenAddresses.io is only as complete as the data source it pulls from. OpenAddresses is just an aggregation of publicly available datasets, and there's no real consistency between the government agencies that make their data available. As a result, other sections of the OpenAddresses dataset might have city names or ZIP codes, but there's often something missing.
If you're looking to fill in the missing data, take a look at how projects like Pelias use multiple data sources to augment missing data.
Personally, I always end up going back to OpenStreetMap (OSM). One could argue that OpenAddresses is better quality because it comes from official sources and doesn't try to fill in data using approximations, but the large gaps of missing data make it far less useful, at least on its own.
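If you want to approximate the missing POSTCODE values yourself, one common approach is a point-in-polygon join against the Census Bureau's ZCTA boundaries (ZCTAs approximate, but do not exactly match, USPS ZIP codes). Here is a minimal sketch with geopandas; the file names and the ZCTA5CE10 column are assumptions for illustration, so check your shapefile's schema:

import pandas as pd
import geopandas as gpd

# Load the OpenAddresses CSV and turn LON/LAT into point geometries.
addrs = pd.read_csv("mesa.csv")
points = gpd.GeoDataFrame(
    addrs,
    geometry=gpd.points_from_xy(addrs["LON"], addrs["LAT"]),
    crs="EPSG:4326",
)

# Census TIGER/Line ZCTA polygons (an approximation of ZIP code areas).
zcta = gpd.read_file("tl_2019_us_zcta510.shp").to_crs("EPSG:4326")

# Spatial join: which ZCTA polygon does each address point fall in?
# (ZCTAs don't overlap, so each point matches at most one polygon;
# on geopandas older than 0.10, use op= instead of predicate=.)
joined = gpd.sjoin(points, zcta[["ZCTA5CE10", "geometry"]],
                   how="left", predicate="within")

# Fill in only the rows whose POSTCODE is missing.
addrs["POSTCODE"] = addrs["POSTCODE"].fillna(joined["ZCTA5CE10"])
addrs.to_csv("mesa_with_postcodes.csv", index=False)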

Related

How to analyze/process a lot of CSV logfiles automatically

I have a bunch of logfiles from machines at my work, and I want to analyze them. The files are named like #increasingnumber_date_time.csv, and altogether there are roughly 10,000 of them (0.5-2 MB each), containing information about temperatures, pressures, and the status of actuators like vents, pumps, etc.
I want a script that performs some calculations (integrating the pressures when special conditions are met) on each of these CSV files and stores the result of every calculation in a CSV/Excel file, so that I end up with a list of the logfile names and the corresponding result of each calculation...
There is no need to combine all these files into one super-big megafile; it is totally fine if the files are opened and processed one after another, as long as every result is written to the result file so that I can match the results to the corresponding CSV files...
How can I do this? I am no programming expert, but I use Python/pandas sometimes...
Thank you
I tried to do it file by file using Excel :-))
I expect that this is a batch job that can be automated (but I do not know how) :))
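For what it's worth, here is a minimal sketch of such a batch job with pandas and numpy. It assumes each logfile has a "time" column (in seconds) and a "pressure" column, and that the "special condition" is a simple threshold; all of those names and the condition are placeholders to adapt to your files:

import glob
import numpy as np
import pandas as pd

results = []
for path in sorted(glob.glob("logs/*.csv")):
    df = pd.read_csv(path)
    # Placeholder condition: pressure above some threshold.
    mask = df["pressure"] > 2.5
    # Trapezoidal integral of pressure over time where the condition holds.
    integral = np.trapz(df.loc[mask, "pressure"], df.loc[mask, "time"])
    results.append({"logfile": path, "integral": integral})

# One row per logfile: the file name plus the calculated result.
pd.DataFrame(results).to_csv("results.csv", index=False)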

Designing REST - save big set of related entities

In my system, I have an entity (sales) that can serve people who have certain ZIP codes.
So, each sales account can have thousands of ZIP codes bound to it.
I need to develop a REST API that allows loading and editing the list of a sales account's ZIP codes.
Basically I have 2 options:
1) Create 2 resources: Sales and SalesZip. Submit the Sales data, and then submit a SalesZip record for each supported ZIP code.
2) Create the Sales entity and load the list of supported ZIP codes with it, like this:
{
  "id": 1,
  "name": "John",
  "zip": [
    "90231",
    "12341",
    ...
  ]
}
And submit the ZIP codes as an array:
zip[]=90231,12341
Both ways have some disadvantages.
If I use the first option, I may need to submit too many separate HTTP requests.
If I use the second option, I may need to send a quite big PUT/POST request.
Question
Which option should I use?
What's the best practice for designing such functionality?
What exactly is "quite big"?
As a rough estimate: if each character is 2 bytes and your ZIP codes have 5 characters, each code is 10 bytes. Given that the US has 41,741 ZIP codes, the worst-case US scenario, a salesman who sells across the whole country, would need a payload of around 417,410 bytes, or about 407.6 KB.
On average, how many ZIP codes does a salesman have? How are they distributed? How often do you get these requests? You may discover that it is not that bad after all.
There is not enough data to make a decision, but it seems that the second option is not bad.
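As an illustration of the second option, the whole list can be replaced in a single request. A sketch using Python's requests library (the /sales/{id}/zips endpoint is hypothetical):

import requests

zips = ["90231", "12341"]  # in practice, potentially thousands of codes

resp = requests.put(
    "https://api.example.com/sales/1/zips",  # hypothetical endpoint
    json={"zip": zips},
    timeout=30,
)
resp.raise_for_status()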

What data format is this?

I was checking a share trading site's AJAX response, and below is what showed up in the Firebug Response tab of the XHR section. Can anyone explain to me what format this is and how it is parsed?
<ST=tat>
<SI=0>
<TB=txtSearch>
<560v=Tata Motors Ltdv=TATMOT>
<566v=Tata Steel Ltdv=TATSTE>
<3199v=Ashram Online.com Ltdv=ASHONL>
<4866v=Kreon Finnancial Services Ltdv=KREFIN>
<552v=Tata Chemicals Ltdv=TATCHE>
<554v=Tata Power Company Ltdv=TATPOW>
<2986v=Tata Metaliks Ltdv=TATMET>
<300v=Tata Sponge Iron Ltdv=TATSPO>
<121v=Tata Coffee Ltdv=TATCOF>
<2295v=Tata Communications Ltdv=TATCOM>
<0v=Time In Milli-Secondsv=0>
I think what we are dealing with here is some proprietary format, likely an Eldritch SGML Horror of some sort.
Banking in general has all sorts of Eldritch horrors running about.
On a related note, this is very much not XML.
Edit:
A quick analysis* indicates that this is a format consisting of a series of statements bracketed by <>, with the parts of each statement separated by = or v=. = seems to indicate a parameter to a control statement, indicated by a two-letter code (<ST=tat>), while v= seems to indicate an assignment or coupling of some kind (short for "value"?), or perhaps just a field separator.
<ST appears to be short for "search term"; <TB appears to be short for "(source) table". The meaning of <SI eludes me. It is possible that <TB terminates the metadata section, but it's equally possible that the metadata section has a fixed number of terms.
As nothing refers to the number of fields in each statement in the data section, and they are all of the same length (3 fields), it is likely that the number of fields is fixed, but it might derive from the value of <TB, or even <SI, in some way.
What is abundantly clear, however, is that this data is not intended for consumption by other applications than the one that supplies it.
*Caveat: Without a much larger sample it's impossible to tell if this analysis is valid.
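To make the guess concrete, here is a speculative parsing sketch in Python under the assumptions above (two-letter control statements, three-field data statements); it will break if a name ever contains "v=":

import re

sample = """<ST=tat>
<SI=0>
<TB=txtSearch>
<560v=Tata Motors Ltdv=TATMOT>
<0v=Time In Milli-Secondsv=0>"""

control = re.compile(r"<([A-Z]{2})=([^>]*)>")      # e.g. <ST=tat>
record = re.compile(r"<(\d+)v=(.*?)v=([^>]*)>")    # e.g. <560v=...v=TATMOT>

for line in sample.splitlines():
    m = record.match(line)
    if m:
        rec_id, name, code = m.groups()
        print("record:", rec_id, name, code)
    else:
        m = control.match(line)
        if m:
            print("control:", m.group(1), m.group(2))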
It is not a commonly used "web format".
It is probably a proprietary format used by that site and will be parsed by their custom JavaScript.

When could a CSV record *not* have the same number of fields?

I am storing a series of events in a CSV file; each event type comes with a different set of data.
To illustrate, say I have two events (there will be many more):
Running, which has a data set containing speed and incline.
Sleeping, which has a data set containing snores.
There are two options to store this data in CSV records:
Option A
Storing each possible item of data in its own field...
speed, incline, snores
therefore...
15mph, 20%,
, , 12
16mph, 20%,
14mph, 20%,
Option B
Storing each event in its own record...
event, value1...
therefore...
running, 15mph, 20%
sleeping, 12
running, 16mph, 20%
running, 14mph, 20%
Without a specific CSV specification, the consensus seems to be:
Each record "should" contain the same number of comma-separated fields.
Context
There are a number of events which each have a large & different set of data values.
CSV data is to be of use to other developers (I will/could/should/won't use either structure).
The 'other developers' are expected to be toward the novice end of the spectrum and/or using resource-limited systems. CSV is accessible.
The CSV format is being provided non-exclusively, as a feature rather than a requirement. Although, if the application is going to provide a CSV file, it should be provided in the correct manner from now on.
Question
Would it be valid, in this case, to go with Option B?
Thoughts
Option B maintains a level of human readability, which is an advantage if, say, the CSV is read by a human rather than a machine. Neither method is more complex to parse using a custom parser, but will Option B void the usefulness of the CSV format with other libraries, frameworks, applications, et al.? With Option A, future changes/versions to an individual event's data set may break the CSV structure (zombie ", ," fields to maintain forwards compatibility), whereas Option B will fail gracefully.
edit
This may be aimed at students and frameworks like openFrameworks, Plask, Processing, et al., where CSV is easier to implement.
Any "other frameworks, libraries and applications" I've ever used all handle CSV parsing differently, so trying to conform to one or many of these standards might over-complicate your end result. My recommendation would be to keep it simple and use what works for your specific task. If human readbility is a requirement, then CSV in the form of Option B would work fine. Otherwise, you may want to consider JSON or XML.
As you say, there is no "CSV standard" with regard to contents. The real answer depends on what you are doing and why. You mention "other frameworks, libraries and applications". The one thing I've learnt is "Don't over-engineer", i.e., don't write reams of code today on the assumption that you will plug it into some other framework tomorrow.
I'd say option B is fine, unless you have specific requirements to use other apps etc.
< edit >
Having re-read your context, I'd probably pick one output format and use it, and forget about having multiple formats:
Having multiple output formats is a source of inconsistency (e.g. bug in one format but not another).
Having multiple formats means more code that needs to be tested, documented, and supported.
< /edit >
Is there any reason you can't use XML? Yes, it's slightly more difficult to parse, at least for novices, but if so they probably need the practice. File size would be much greater, of course, but it's compressible.

How should I populate city/state fields based on the zip?

I'm aware there are databases for zip codes, but how would I grab the city/state fields based on that? Do these databases contain the city/states or do I have to do some sort of lookup to a webservice?
\begin{been-there-done-that}
Important realization: There is not a one-to-one mapping between cities/counties and ZIP codes. A ZIP code is not based on a political area but instead a distribution area as defined for the USPS's internal use. It doesn't make sense to look up a city based on a ZIP code unless you have the +4 or the entire street address to match a record in the USPS address database; otherwise, you won't know if it's RICHMOND or HENRICO, DALLAS or FORT WORTH, there's just not enough information to tell.
This is why, for example, many e-commerce vendors find dealing with New York state sales tax frustrating, since that tax scheme is based on county, e-commerce systems typically don't ask for the county, and ZIP codes (the only information they provide instead) in New York can span county lines.
The USPS updates its address database every month, and access costs real money, so pretty much any list that you find freely available on the Internet is going to be out of date, especially with the USPS closing post offices to save money.
One ZIP code may span multiple place names, and one city often uses several (but not necessarily whole) ZIP codes. Finally, the city name listed in the ZIP code file may not actually be representative of the place where the addressee lives; instead, it represents the location of their post office. Our office mail is addressed to ASHLAND, but we work about 7 miles from the town's actual political limits. ASHLAND just happens to be where our carrier's route originates.
For guesstimating someone's location, such as for a search of nearby points of interest, these sources and City/State/ZIP sets are probably fine; they don't need to be exact. But for address validation in a data-entry scenario? Absolutely not: validate the whole address or don't bother at all.
Just a friendly reminder to take a step back and remember the data source's intended use!
\end{been-there-done-that}
Modern ZIP code databases contain City and State columns.
http://sourceforge.net/projects/zips/
http://www.populardata.com/
Using the Ziptastic HTTP/JSON API
This is a pretty new service, but according to their documentation, it looks like all you need to do is send a GET request to http://ziptasticapi.com, like so:
GET http://ziptasticapi.com/48867
And they will return a JSON object along the lines of:
{"country": "US", "state": "MI", "city": "OWOSSO"}
Indeed, it works. You can test this from a command line by doing something like:
curl http://ziptasticapi.com/48867
Using the US Postal Service HTTP/XML API
According to this page on the US Postal Service website, which documents their XML-based web API, specifically Section 4.0 (page 22) of this PDF document, they have a URL where you can send an XML request containing a 5-digit ZIP Code, and they will respond with an XML document containing the corresponding City and State.
According to their documentation, here's what you would send:
http://SERVERNAME/ShippingAPITest.dll?API=CityStateLookup&XML=<CityStateLookupRequest%20USERID="xxxxxxx"><ZipCode ID= "0"><Zip5>90210</Zip5></ZipCode></CityStateLookupRequest>
And here's what you would receive back:
<?xml version="1.0"?>
<CityStateLookupResponse>
  <ZipCode ID="0">
    <Zip5>90210</Zip5>
    <City>BEVERLY HILLS</City>
    <State>CA</State>
  </ZipCode>
</CityStateLookupResponse>
USPS does require that you register with them before you can use the API, but, as far as I could tell, there is no charge for access. By the way, their API has some other features: you can do Address Standardization and Zip Code Lookup, as well as the whole suite of tracking, shipping, labels, etc.
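If you end up using it, the request/response above translates to a few lines of Python; SERVERNAME and the USERID placeholder are whatever USPS assigns you at registration:

import requests
import xml.etree.ElementTree as ET

xml_req = ('<CityStateLookupRequest USERID="xxxxxxx">'
           '<ZipCode ID="0"><Zip5>90210</Zip5></ZipCode>'
           '</CityStateLookupRequest>')

resp = requests.get(
    "http://SERVERNAME/ShippingAPITest.dll",  # placeholder host from the docs
    params={"API": "CityStateLookup", "XML": xml_req},
    timeout=30,
)
# Pull City and State out of the XML response shown above.
zipcode = ET.fromstring(resp.text).find("ZipCode")
print(zipcode.findtext("City"), zipcode.findtext("State"))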
I'll try to answer the question "HOW should I populate...", and not "SHOULD I populate..."
Assuming you are going to do this more than once, you would want to build your own database. This could be nothing more than a text file you downloaded from any of the many sources (see Pentium10 reply here). When you need a city name, you search for the ZIP, and extract the city/state text. To speed things up, you would sort the table in numeric order by ZIP, build an index of lines, and use a binary search.
If your ZIP database looked like this (from sourceforge):
"zip code", "state abbreviation", "latitude", "longitude", "city", "state"
"35004", "AL", " 33.606379", " -86.50249", "Moody", "Alabama"
"35005", "AL", " 33.592585", " -86.95969", "Adamsville", "Alabama"
"35006", "AL", " 33.451714", " -87.23957", "Adger", "Alabama"
The most simple-minded extraction from the text would go something like:
$zipLine = lookup($ZIP);             // assumed helper: returns the matching line, e.g. via the binary search above
if ($zipLine) {
    $fields = str_getcsv($zipLine);  // str_getcsv strips the surrounding quotes
    $city   = $fields[4];
    $state  = $fields[5];
} else {
    die("$ZIP not found");
}
If you are just playing with text in PHP, that's all you need. But if you have a database application, you would do everything in SQL. Further details on your application may elicit more detailed responses.
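In the database case, the equivalent is a one-line query. A sketch with Python's built-in sqlite3, where the zips table and its column names are assumptions mirroring the file above:

import sqlite3

conn = sqlite3.connect("zips.db")
row = conn.execute(
    "SELECT city, state FROM zips WHERE zip_code = ?",
    ("35004",),
).fetchone()
if row:
    city, state = row
else:
    raise SystemExit("35004 not found")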