In my system, I have an entity (Sales) representing a salesperson who can serve people in certain ZIP codes.
So each salesperson can have thousands of ZIP codes bound to their account.
I need to develop a REST API that allows loading and editing the list of a salesperson's ZIP codes.
Basically, I have two options:
1) Create two resources: Sales and SalesZip. Submit the Sales data, and then submit a SalesZip record for each supported ZIP code.
2) Create the Sales entity and load the list of supported ZIP codes like this:
{
  "id": 1,
  "name": "John",
  "zip": [
    "90231",
    "12341",
    ...
  ]
}
And submit the ZIP codes as an array:
zip[]=90231,12341
Both ways have some disadvantages.
If I use the first option, I may need to submit too many separate HTTP requests.
If I use the second option, I may need to send a quite big PUT/POST request.
Question
Which option should I use?
What's the best practice for designing such functionality?
What exactly is "quite big"?
As a rough estimate, if each character is 2 bytes and your ZIP codes have 5 characters, each code is 10 bytes. Assuming the US has 41,741 ZIP codes, then in the worst case a salesperson who sells across the whole country would need a payload of around 417,410 bytes, or about 407.6 KB.
On average, how many ZIP codes does a salesperson have? How are they distributed? How often do you get these requests? You may discover that it's not that bad after all.
There is not enough data to make a decision, but it seems that the second option is not bad.
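If the second option wins out, the whole list can be replaced in one request. Here is a minimal client-side sketch of what that could look like; the https://api.example.com/sales/1 URL and the field names are assumptions for illustration, not something from the question:

<?php
// Hypothetical endpoint for salesperson #1
$url = "https://api.example.com/sales/1";

// The full list of supported ZIP codes goes out as a single JSON array
$payload = json_encode([
    "name" => "John",
    "zip"  => ["90231", "12341"],
]);

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "PUT");   // PUT replaces the stored list
curl_setopt($ch, CURLOPT_POSTFIELDS, $payload);
curl_setopt($ch, CURLOPT_HTTPHEADER, ["Content-Type: application/json"]);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);

Whether a few hundred kilobytes in a single PUT is acceptable comes back to the traffic questions above.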
Related
I have downloaded us-west geolocation data (postal addresses) from openaddresses.io. Some of the addresses in the dataset are not complete, i.e., some of them don't have info like zip_code. Is there a way to retrieve it, or is the data incomplete?
I have tried searching the other files hoping to find any related info, but the complete dataset doesn't contain anything related to it. The city of Mesa, AZ has multiple ZIP codes, so it is hard to assign one to an address. Is there any way to address this problem?
This is how the data looks (City of Mesa, AZ):
LON,LAT,NUMBER,STREET,UNIT,CITY,DISTRICT,REGION,POSTCODE,ID,HASH
-111.8747353,33.456605,790,N DOBSON RD,,SRPMIC,,,,,dc0c53196298eb8d
-111.8886227,33.4295194,2630,W RIO SALADO PKWY,,MESA,,,,,c38b700309e1e9ce
-111.8867018,33.4290795,2401,E RIO SALADO PKWY,,TEMPE,,,,,9b912eb2b1300a27
-111.8832045,33.4232903,700,S EVERGREEN RD,,TEMPE,,,,,3435b99ab3f4f828
-111.8761202,33.4296416,2100,W RIO SALADO PKWY,,MESA,,,,,b74349c833f7ee18
-111.8775844,33.4347782,1102,N RIVERVIEW,,MESA,,,,,17d0cf1542c66083
Short answer: The data is incomplete.
The data in OpenAddresses.io is only as complete as the data sources it pulls from. OpenAddresses is just an aggregation of publicly available datasets, and there's no real consistency between the government agencies that make their data available. As a result, other sections of the OpenAddresses dataset might have city names or ZIP codes, but there's often something missing.
If you're looking to fill in the missing data, take a look at how projects like Pelias use multiple data sources to augment missing data.
Personally, I always end up going back to OpenStreetMaps (OSM). One could argue that OpenAddresses is better quality because it comes from official sources and doesn't try to fill in data using approximations, but the large gaps of missing data make it far less useful, at least on its own.
I am a student using Rapidminer, and I am doing a project using Yummly's What's Cooking dataset (https://www.kaggle.com/c/whats-cooking/data). The dataset has 20 different cuisine types (e.g. Italian, Chinese, Indian, etc.).
Our goal is to develop a data mining model that identifies the cuisine type of future dishes by analyzing the ingredient list of the dish. We are using association rules to do so. However, I keep getting "no rules found" and have no idea why. I am thinking this has something to do with my attributes being formatted as text and not using the nominal to binominal operator, but am not sure how to fix it.
Currently, my process looks like this:
data -> select attributes -> FP growth -> create association rules
Can you help?
According to the documentation for the FP-Growth operator, all the attributes in the example set need to be binomial.
I'll admit--I haven't looked at the data directly because I didn't want to register an account on kaggle, so I'm not sure exactly how it's formatted, but you would probably want to set the type of cuisine as a label and then have each of the remaining attributes represent each ingredient that is included in one or more of the recipes. Each dish would have a 1 in the column if the ingredient is used and a 0 if it's not used. (Depending on the original format of the data, since you mentioned it's text, you may want to check out the text processing extension, which can create an example set like what I just described.) Then, if you convert the 0s and 1s to binomial, you should be able to use FP-Growth.
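Outside of RapidMiner, here is a rough scripting sketch of the shape of the example set I'm describing (the train.json file name and the output file are assumptions based on the Kaggle description, where each recipe has an id, a cuisine label, and an ingredients list); inside RapidMiner, the text processing extension mentioned above can get you to the same layout:

<?php
// Each Kaggle recipe has an "id", a "cuisine" label, and an "ingredients" list
$recipes = json_decode(file_get_contents("train.json"), true);

// Collect every distinct ingredient; these become the attribute columns
$ingredients = [];
foreach ($recipes as $r) {
    foreach ($r["ingredients"] as $ing) {
        $ingredients[$ing] = true;
    }
}
$ingredients = array_keys($ingredients);

// One row per recipe: the cuisine label, then a 1/0 flag for each ingredient
$out = fopen("recipes_binary.csv", "w");
fputcsv($out, array_merge(["cuisine"], $ingredients));
foreach ($recipes as $r) {
    $used = array_flip($r["ingredients"]);
    $row  = [$r["cuisine"]];
    foreach ($ingredients as $ing) {
        $row[] = isset($used[$ing]) ? 1 : 0;
    }
    fputcsv($out, $row);
}
fclose($out);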
Hi, I am working on a site and integrating the authorize.net payment gateway. I am thinking of adding a dropdown for country names. Will passing "United States Of America" as the country variable work, or should I use "US"? Should I use ISO codes for every country? I tried it on a test developer account, but it seems to accept everything I pass to it as correct!
~Ajit
I know authorize.net doesn't require country names. A simple way to see if they even validate them would be to run a transaction through the production gateway, pass a nonsense value and see if the transaction still goes through.
If you do standardize to support authorize.net (or for another reason), I'd suggest country codes over full names. Codes seem to change less often, and they can also be useful as identifiers. For example, I have an application that presents data for roughly 200 countries; I have flag icons (multiple sizes for each country) that use the two-letter country code in their file names. Using codes made this fairly easy to implement and maintain.
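To make the "codes as identifiers" point concrete, this is the sort of thing that becomes trivial once you store the two-letter code (the icon naming scheme here is made up for illustration):

<?php
// Hypothetical naming scheme: flags/us-16.png, flags/us-32.png, ...
function flagIcon(string $isoCode, int $size): string {
    return "flags/" . strtolower($isoCode) . "-" . $size . ".png";
}

echo flagIcon("US", 32);   // flags/us-32.png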
According to their AIM Guide:
x_country: Optional
Value: The country of the customer’s billing
Format: Up to 60 characters (no symbols)
My company has a problem: we suspect that the NACHA files we are receiving from one of our application service providers (which we use to draw money from our clients) are incorrect.
We have all of the ACH agreements and legal mumbo-jumbo in place, so it's not a problem with our use of the ACH network, and we're not receiving word from the banks that things are going wrong, so we suspect that when the file is built from the sales information, it's missing some transactions that we are still getting charged for by our service provider.
My task: Take several months worth of NACHA files and decipher them to find out what was drawn from each customer and what was deposited to our accounts, and then compare them to sales data, bank statements, and other information via Access/Excel. Use MySQL for data.
At this point, awk (or similar Linux command line tool) is the tool that I have; I'm not proficient with 'actual' programming tools or practice, I'm more of a system-and-database administrator. I'm not afraid to get my hands dirty, I just don't have a lot of programming experience in reading this sort of thing with, say, C#.
My chief difficulty is in working with the actual NACHA file format: it's 94 characters wide, with fields determined by their position only, no delimiters. Using awk is (in my previous experience) dependent on the field separator variable, which is either whitespace or some other delimiter... but I have been unsuccessful in using it to tease out fields by position. I need something like awk because of the different record types in each file: there are 5 different line types, namely 1, 5, 6, 8, and 9. Types 1 and 9 are the outer group, with header info; 5 and 8 are batch header lines; and type 6 lines are the details. My original plan was to read the header info into variables and then duplicate it on each line, basically de-normalizing the file into a large table (or CSV, in the interim) with one record for each individual transaction, associated with all the header info from the batch and the day, so:
[transaction data1, data2],[batch data1, data2],[file info1, info2, etc]
[transaction data1, data2],[batch data1, data2],[file info1, info2, etc]
[transaction data1, data2],[batch data1, data2],[file info1, info2, etc]
I favor building a tool that can do this on a continual basis going forward because it will become part of the data-monitoring we do on a daily/weekly basis.
So, how can I denormalize a NACHA file using awk or some similar tool? If there's a better tool for the job, I'm more than happy to hear about it. I haven't found anything in my searching online, unfortunately.
If you look at the gawk info file (info gawk), there is a section called "3.6 Reading Fixed-Width Data". That may provide the information you need if you're using gawk.
From that file:
The splitting of an input record into fixed-width fields is specified by assigning a string containing space-separated numbers to the built-in variable `FIELDWIDTHS'.
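Since you said you're open to other tools: the same "carry the header values down onto each detail line" plan is also easy to sketch in a scripting language. Here's a rough PHP illustration; the substr() offsets are placeholders rather than the real NACHA column positions, so you would swap in the widths from the spec, and achfile.txt is just a stand-in file name:

<?php
// Rough sketch: carry the current file and batch header values down onto
// every type-6 detail record, emitting one denormalized CSV row per detail.
$fileHeader  = [];
$batchHeader = [];
$out = fopen("denormalized.csv", "w");

foreach (file("achfile.txt") as $line) {
    $recordType = substr($line, 0, 1);          // first character is the record type

    if ($recordType === "1") {                  // file header
        $fileHeader = [substr($line, 1, 10), substr($line, 11, 10)];   // placeholder fields
    } elseif ($recordType === "5") {            // batch header
        $batchHeader = [substr($line, 1, 16), substr($line, 17, 10)];  // placeholder fields
    } elseif ($recordType === "6") {            // entry detail: emit one denormalized row
        $detail = [substr($line, 1, 2), substr($line, 29, 10)];        // placeholder fields
        fputcsv($out, array_merge($detail, $batchHeader, $fileHeader));
    }
    // Types 8 and 9 (the trailer lines) are ignored in this sketch.
}
fclose($out);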
I'm aware there are databases for zip codes, but how would I grab the city/state fields based on that? Do these databases contain the city/states or do I have to do some sort of lookup to a webservice?
\begin{been-there-done-that}
Important realization: There is not a one-to-one mapping between cities/counties and ZIP codes. A ZIP code is not based on a political area but on a distribution area defined for the USPS's internal use. It doesn't make sense to look up a city based on a ZIP code unless you have the +4 or the entire street address to match a record in the USPS address database; otherwise, you won't know whether it's RICHMOND or HENRICO, DALLAS or FORT WORTH; there's just not enough information to tell.
This is why, for example, many e-commerce vendors find dealing with New York state sales tax frustrating: that tax scheme is based on county, e-commerce systems typically don't ask for the county, and the ZIP codes they collect instead can span county lines in New York.
The USPS updates its address database every month, and access to it costs real money, so pretty much any list you find freely available on the Internet is going to be out of date, especially with the USPS closing post offices to save money.
One ZIP code may span multiple place names, and one city often uses several (but not necessarily whole) ZIP codes. Finally, the city name listed in the ZIP code file may not actually be representative of the place in which the addressee actually lives; instead, it represents the location of their post office. Our office mail is addressed to ASHLAND, but we work about 7 miles from the town's actual political limits. ASHLAND just happens to be where our carrier's route originates from.
For guesstimating someone's location, such as for a search of nearby points of interest, these sources and City/State/ZIP sets are probably fine; they don't need to be exact. But for address validation in a data entry scenario? Absolutely not--validate the whole address or don't bother at all.
Just a friendly reminder to take a step back and remember the data source's intended use!
\end{been-there-done-that}
Modern ZIP code databases contain columns for the City and State fields.
http://sourceforge.net/projects/zips/
http://www.populardata.com/
Using the Ziptastic HTTP/JSON API
This is a pretty new service, but according to their documentation, it looks like all you need to do is send a GET request to http://ziptasticapi.com, like so:
GET http://ziptasticapi.com/48867
And they will return a JSON object along the lines of:
{"country": "US", "state": "MI", "city": "OWOSSO"}
Indeed, it works. You can test this from a command line by doing something like:
curl http://ziptasticapi.com/48867
Using the US Postal Service HTTP/XML API
According to this page on the US Postal Service website which documents their XML based web API, specifically Section 4.0 (page 22) of this PDF document, they have a URL where you can send an XML request containing a 5 digit Zip Code and they will respond with an XML document containing the corresponding City and State.
According to their documentation, here's what you would send:
http://SERVERNAME/ShippingAPITest.dll?API=CityStateLookup&XML=<CityStateLookupRequest%20USERID="xxxxxxx"><ZipCode ID= "0"><Zip5>90210</Zip5></ZipCode></CityStateLookupRequest>
And here's what you would receive back:
<?xml version="1.0"?>
<CityStateLookupResponse>
  <ZipCode ID="0">
    <Zip5>90210</Zip5>
    <City>BEVERLY HILLS</City>
    <State>CA</State>
  </ZipCode>
</CityStateLookupResponse>
USPS does require that you register with them before you can use the API, but, as far as I could tell, there is no charge for access. By the way, their API has some other features: you can do Address Standardization and Zip Code Lookup, as well as the whole suite of tracking, shipping, labels, etc.
I'll try to answer the question "HOW should I populate...", and not "SHOULD I populate..."
Assuming you are going to do this more than once, you will want to build your own database. This could be nothing more than a text file you download from any of the many sources (see Pentium10's reply here). When you need a city name, you search for the ZIP and extract the city/state text. To speed things up, you would sort the table in numeric order by ZIP, build an index of lines, and use a binary search.
If your ZIP database looked like this (from SourceForge):
"zip code", "state abbreviation", "latitude", "longitude", "city", "state"
"35004", "AL", " 33.606379", " -86.50249", "Moody", "Alabama"
"35005", "AL", " 33.592585", " -86.95969", "Adamsville", "Alabama"
"35006", "AL", " 33.451714", " -87.23957", "Adger", "Alabama"
The most simple-minded extraction from the text would go something like this:
$zipLine = lookup($ZIP);               // find the line whose first field matches $ZIP
if ($zipLine) {
    $fields = explode(", ", $zipLine);
    $city   = trim($fields[4], '"');   // strip the surrounding quotes
    $state  = trim($fields[5], "\"\n");
} else {
    die("$ZIP not found");
}
If you are just playing with text in PHP, that's all you need. But if you have a database application, you would do everything in SQL. Further details on your application may elicit more detailed responses.
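For completeness, here is one way the lookup() call above could be written, following the sort-by-ZIP-and-binary-search idea: a minimal sketch that assumes the SourceForge-style CSV shown earlier (header row first, quoted ZIP as the first field, file already sorted by ZIP and small enough to hold in memory; zipcode.csv is a stand-in file name):

<?php
// Minimal sketch of lookup(): binary search over the lines of the ZIP CSV.
function lookup(string $zip): ?string {
    static $lines = null;
    if ($lines === null) {
        $lines = file("zipcode.csv", FILE_IGNORE_NEW_LINES);
        array_shift($lines);                       // drop the header row
    }

    $lo = 0;
    $hi = count($lines) - 1;
    while ($lo <= $hi) {
        $mid    = intdiv($lo + $hi, 2);
        $midZip = trim(explode(",", $lines[$mid])[0], ' "');   // first field, quotes stripped
        if ($midZip === $zip) {
            return $lines[$mid];
        } elseif ($midZip < $zip) {
            $lo = $mid + 1;
        } else {
            $hi = $mid - 1;
        }
    }
    return null;                                   // not found; the caller's if ($zipLine) handles this
}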