How to handle complex addresses with schema.org/PostalAddress? - street-address

Let's say I have an address which includes a parent building or other additional address info, for example:
Barnes & Noble
Union Square
33 E 17th St
New York, NY 10003
OR
Koi Restaurant
Bryant Park Hotel
40 W 40th St
New York, NY 10018
How should I mark up the "Union Square" or "Bryant Park Hotel" part of the address using schema.org? Is this considered part of the street address? Yelp seems to put it all in the street address, e.g.:
<span itemprop="streetAddress">Union Square<br>33 E 17th St</span>

That extra info is called the Firm Name, and can sometimes yield a better result when searching for a particular address or verifying it (trust me).
However, there seem to be two lines of "Firm Name" in each of the above addresses, which is extraneous and wouldn't yield helpful results.
As far as microdata goes, they don't have a "Firm Name" field, but they do have a "Name" field. The Schema.org example is thus:
<div itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="name">Google Inc.</span>
P.O. Box<span itemprop="postOfficeBoxNumber">1234</span>
<span itemprop="addressLocality">Mountain View</span>,
<span itemprop="addressRegion">CA</span>
<span itemprop="postalCode">94043</span>
<span itemprop="addressCountry">United States</span>
</div>
Notice that Google Inc. is in the "Name" field. That's where I'd put the extra info (if you have to have two lines of firm name, then do it there... but to use it for actual mailing or verification, take out the extra name.)

Matt is correct about the firm name.
In both your examples, it looks like the input data is consistent, with the first line being more precise than the second line. If that is always the case with your data then you are golden. Stick with the first line as the recipient (or firm name) and the second line as "extra data". That "extra data" is really irrelevant from an address validation perspective since the address would get to the location regardless of the "extra data". The USPS relies on the address data more than "referential data" (except in the case of the North Pole, which we all know has only one valid address).
I took the liberty of submitting some variations of your addresses to be validated against the USPS data. I was looking to see if the USPS had the Firm Name attached to the address in either of these two cases. Nope.
barnes & noble / union square / 33 e 17th st / 10003
barnes & noble union square / 33 e 17th st / 10003
barnes & noble / 33 e 17th st / 10003
union square barnes & noble / 33 e 17th st / 10003
Koi Restaurant / Bryant Park Hotel / 40 W 40th St / 10018
Bryant Park Hotel / Koi Restaurant / 40 W 40th St / 10018
Bryant Park Hotel Koi Restaurant / 40 W 40th St / 10018
In each case the address was parsed out correctly while the addressee AND the "extra data" were ignored. I hope that gives a little insight into how the USPS address validation process works.

Related

How do i split a JSON file by a specific amount of objects in separate files?

import json
content = []
with open("articles.jsonl", "rt") as file:
    for a in file:
        out = json.loads(a)
        content.append(out)
file.close()
count = 0
file_count = 1
with open("articles" + str(file_count) + ".jsonl", "wt") as fp:
    for a in content:
        json.dump(a, fp)
        fp.write("\n")
        count +=1
        if count == 2000:
            file_count +=1
            count = 0
            continue
fp.close()
{"id": "f7ca322d-c3e8-40d2-841f-9d7250ac72ca", "content": "VETERANS saluted Worcester's first ever breakfast club for ex-soldiers which won over hearts, minds and bellies. \n \nThe Worcester Breakfast Club for HM Forces Veterans met at the Postal Order in Foregate Street at 10am on Saturday. \n \nThe club is designed to allow veterans a place to meet, socialise, eat and drink, giving hunger and loneliness their marching orders. \n \nFather-of-two Dave Carney, aged 43, of Merrimans Hill, Worcester, set up the club after being inspired by other similar clubs across the country. \n \nHe said: \"As you can see from the picture, we had a good response. Five out of the 10 that attended said they saw the article in the newspaper and turned up. \n \n\"We even had an old chap travel from Droitwich and he was late on parade by three hours. \n \n\"It's generated a lot of interest and I estimate (from other veterans who saw the article) that next month's meeting will attract about 20 people. Onwards and upwards.\" \n \nHe said the management at the pub had been extremely hospitable to them. \n \nMr Carney said: \"They bent over backwards for us. They really looked after us well. That is the best choice of venue I could have made. They even put 'reserved for the armed forces'. \n Promoted stories \nThe reserve veteran with the Royal Engineers wanted to go to a breakfast club but found the nearest ones were in Bromsgrove and Gloucester so he decided to set up his own, closer to home. \n \nHe was influenced by Derek Hardman who set up a breakfast club for veterans in Hull and Andy Wilson who set one up in Newcastle. He said the idea has snowballed and there were now 70 similar clubs across the country and even some in Germany. \n \nMr Carney said with many Royal British Legion clubs closing he wanted veterans and serving personnel to feel they had somewhere they could go for good grub, beer and banter to recapture the comradery of being in the forces. 
\n \nThe Postal Order was chosen because of its central location and its proximity to the railway station and hotels and reasonably priced food and drink. \n \nThe management of the pub have even given the veterans a designated area within the pub. \n \n Share article \n \nThe next meeting is at the Postal Order on Saturday, October 3 at 10am. \n \nThe breakfast club meets on the first Saturday of each month for those who want to attend in future.", "title": "Worcester breakfast club for veterans gives hunger its marching orders", "media-type": "News", "source": "Redditch Advertiser", "published": "2015-09-07T10:16:14Z"}
Above is a small sample of the articles.jsonl file.
This just writes everything to a single file called articles1.jsonl instead of multiple files with a specific set of objects. Any suggestions?
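One possible fix, offered as a sketch rather than a definitive answer: the posted code opens the output file once, before the loop, so incrementing file_count never switches to a new file. Opening a fresh file for each batch avoids that. The function name split_jsonl and its chunk_size and prefix parameters are made up for this illustration.

```python
import json
from itertools import islice


def split_jsonl(src, chunk_size=2000, prefix="articles"):
    """Split a JSON Lines file into numbered files of at most chunk_size objects.

    Hypothetical helper: the name and parameters are illustrative only.
    Returns the list of output file names that were written.
    """
    written = []
    with open(src, "rt", encoding="utf-8") as infile:
        # Skip blank lines so json.loads never sees an empty string.
        lines = (line for line in infile if line.strip())
        file_count = 1
        while True:
            # Pull the next batch of up to chunk_size lines.
            chunk = list(islice(lines, chunk_size))
            if not chunk:
                break
            out_name = f"{prefix}{file_count}.jsonl"
            # A new output file is opened for every batch.
            with open(out_name, "wt", encoding="utf-8") as fp:
                for line in chunk:
                    json.dump(json.loads(line), fp)
                    fp.write("\n")
            written.append(out_name)
            file_count += 1
    return written
```

Calling split_jsonl("articles.jsonl") would then write articles1.jsonl, articles2.jsonl, and so on, with up to 2000 objects per file. Reading the input lazily also avoids holding the whole file in memory.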

Clojure: parse json and extract values

I'm making an API call and using Cheshire to parse the JSON:
(defn fetch_headlines [source]
  (let [articlesUrl (str "https://newsapi.org/v2/top-headlines?sources="
                         source
                         "&apiKey=a688e6494c444902b1fc9cb93c61d6987")]
    (-> articlesUrl
        client/get
        generate-string
        parse-string)))
The JSON payload:
{"status" 200, "headers" {"access-control-allow-headers" "x-api-key,
authorization", "content-type" "application/json; charset=utf-8",
"access-control-allow-origin" "*", "content-length" "7434",
"connection" "close", "pragma" "no-cache", "expires" "-1",
"access-control-allow-methods" "GET", "date" "Thu, 28 Mar 2019
20:22:16 GMT", "x-cached-result" "false", "cache-control" "no-cache"},
"body"
"{\"status\":\"ok\",\"totalResults\":10,\"articles\":[{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":null,\"title\":\"Trump:
Mueller probe was 'attempted takeover' of government - CNN
Video\",\"description\":\"In a Fox News interview with Sean Hannity,
President Trump called special counsel Robert Mueller's probe an
\\"attempted takeover of our
government.\\"\",\"url\":\"http://us.cnn.com/videos/politics/2019/03/28/trump-mueller-probe-attempted-takeover-hannity-cpt-sot-vpx.cnn\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190324191527-06-trump-mueller-reaction-0324-super-tease.jpg\",\"publishedAt\":\"2019-03-28T20:09:04.1891948Z\",\"content\":\"Chat
with us in Facebook Messenger. Find out what's happening in the world
as it
unfolds.\"},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":null,\"title\":\"James
Clapper reacts to call he should be investigated - CNN
Video\",\"description\":\"Former Director of National Intelligence
James Clapper reacts to White House press secretary Sarah Sanders
saying he and other former intelligence officials should be
investigated after special counsel Robert Mueller did not establish
collusion between the
Tr…\",\"url\":\"http://us.cnn.com/videos/politics/2019/03/26/james-clapper-reponse-mueller-report-sarah-sanders-criticism-bts-ac360-vpx.cnn\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190325211210-james-clapper-ac360-03252019-super-tease.jpg\",\"publishedAt\":\"2019-03-28T20:08:43.1736236Z\",\"content\":\"Chat
with us in Facebook Messenger. Find out what's happening in the world
as it
unfolds.\"},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Maegan
Vazquez, CNN\",\"title\":\"Trump set for first rally since Mueller
investigation ended\",\"description\":\"President Donald Trump, making
his first appearance before supporters since Robert Mueller ended his
investigation, is set to speak during a rally in Grand Rapids,
Michigan Thursday
night.\",\"url\":\"http://us.cnn.com/2019/03/28/politics/donald-trump-grand-rapids-rally/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190321115403-07-donald-trump-lead-image-super-tease.jpg\",\"publishedAt\":\"2019-03-28T19:49:26Z\",\"content\":\"Washington
(CNN)President Donald Trump, making his first appearance before
supporters since Robert Mueller ended his investigation, is set to
speak during a rally in Grand Rapids, Michigan Thursday
night.\r\nThe rally follows a chaotic week in Washington, preci…
[+2099
chars]\"},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Katelyn
Polantz, CNN\",\"title\":\"Judge orders Justice Dept. to turn over
Comey memos\",\"description\":\"A federal judge has ordered that the
James Comey memos are turned over, in a court case brought by CNN and
other media organizations for access to the documents memorializing
former FBI Director's interactions with President Donald
Trump.\",\"url\":\"http://us.cnn.com/2019/03/28/politics/james-comey-memo-lawsuit/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/181209143047-comey-1207-super-tease.jpg\",\"publishedAt\":\"2019-03-28T19:14:45Z\",\"content\":\"Washington
(CNN)A federal judge has ordered that the Justice Department and FBI
submit James Comey's memos in full to the court under seal, in a court
case brought by CNN and other media organizations for access to the
documents memorializing the former FBI d… [+1043
chars]\"},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Clare
Foran and Manu Raju, CNN\",\"title\":\"Pelosi calls AG's summary of
Mueller report 'arrogant'\",\"description\":\"House Speaker Nancy
Pelosi on Thursday criticized Attorney General William Barr's summary
of special counsel Robert Mueller's report, calling it
\\"condescending\\" and \\"arrogant\\" and saying \\"it wasn't
the right thing to
do.\\"\",\"url\":\"http://us.cnn.com/2019/03/28/politics/pelosi-mueller-report-congress-barr-summary/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190328130240-02-nancy-pelosi-03282019-super-tease.jpg\",\"publishedAt\":\"2019-03-28T18:48:25Z\",\"content\":null},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Analysis
by Chris Cillizza, CNN Editor-at-large\",\"title\":\"The 43 most
outrageous lines from Donald Trump's phone interview with Sean
Hannity\",\"description\":\"There's no \\"reporter\\" that President
Donald Trump likes more than Fox News' Sean Hannity -- largely due to
Hannity's unwavering, puppy dog-like support for the President. Trump
likes to reward people who play nice with him, which brings us to the
45-minute
ph…\",\"url\":\"http://us.cnn.com/2019/03/28/politics/sean-hannity-donald-trump-mueller/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190328140149-01-hannity-trump-file-super-tease.jpg\",\"publishedAt\":\"2019-03-28T18:44:21Z\",\"content\":\"(CNN)There's
no \\"reporter\\" that President Donald Trump likes more than Fox
News' Sean Hannity -- largely due to Hannity's unwavering, puppy
dog-like support for the President. Trump likes to reward people who
play nice with him, which brings us to the 45-minu… [+14785
chars]\"},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":null,\"title\":\"Puerto
Rico Gov.: I'll punch the bully in the mouth - CNN
Video\",\"description\":\"In an exclusive interview with CNN, Puerto
Rico Governor Ricardo Rosselló said he would not sit back and allow
his officials to be bullied by the White
House.\",\"url\":\"http://us.cnn.com/videos/politics/2019/03/28/ricardo-rossello-trump-bully-puerto-rico-sot-vpx.cnn\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190328123504-puerto-rico-gov-ricardo-rosello-super-tease.jpg\",\"publishedAt\":\"2019-03-28T18:08:33.7312458Z\",\"content\":\"Chat
with us in Facebook Messenger. Find out what's happening in the world
as it
unfolds.\"},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Jeremy
Herb, Manu Raju and Ted Barrett, CNN\",\"title\":\"Jared Kushner
interviewed by Senate Intelligence
Committee\",\"description\":\"President Donald Trump's son-in-law
Jared Kushner returned to the Senate Intelligence Committee for a
closed door interview Thursday as part of the committee's Russia
investigation.\",\"url\":\"http://us.cnn.com/2019/03/28/politics/jared-kushner-senate-intelligence/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/180302124221-30-jared-kushner-super-tease.jpg\",\"publishedAt\":\"2019-03-28T16:21:29Z\",\"content\":null},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Jeremy
Herb and Laura Jarrett, CNN\",\"title\":\"Mueller report more than 300
pages, sources say\",\"description\":\"Special counsel Robert
Mueller's confidential report on the Russia investigation is more than
300 pages, according to a Justice Department official and a second
source with knowledge of the
matter.\",\"url\":\"http://us.cnn.com/2019/03/28/politics/mueller-report-pages/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190324130054-05-russia-investigation-0324-super-tease.jpg\",\"publishedAt\":\"2019-03-28T15:52:01Z\",\"content\":null},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Jim
Acosta and Kevin Liptak, CNN\",\"title\":\"Exclusive: Puerto Rico
governor warns White House over funding\",\"description\":\"Tensions
are escalating between President Donald Trump and Puerto Rico's
governor over disaster relief efforts that have been slow in coming
for the still-battered island after Hurricane
Maria.\",\"url\":\"http://us.cnn.com/2019/03/28/politics/ricardo-rossell-donald-trump-puerto-rico-funding/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/180920230539-pr-storm-of-controversy-rossello-trump-super-tease.jpg\",\"publishedAt\":\"2019-03-28T15:19:39Z\",\"content\":null}]}",
"trace-redirects"
["https://newsapi.org/v2/top-headlines?sources=cnn&apiKey=a688e6494c444902b1fc9cb93c61d687"]}
I'd like to extract the URLs from the returned JSON payload. I've tried this:
(defn fetch_headlines [source]
  (let [articlesUrl (str "https://newsapi.org/v2/top-headlines?sources="
                         source
                         "&apiKey=a688e6494c444902b1fc9cb93c61d697")]
    (-> articlesUrl
        client/get
        generate-string
        parse-string
        (get-in ["source" "url"]))))
But I get a nil result, any ideas?
SOLUTION based on user feedback:
(defn fetch-headlines [source]
  (let [articlesUrl (str "https://newsapi.org/v2/top-headlines?sources="
                         source
                         "&apiKey=a688e6494c444902b1fc9cb93c61d697")]
    (-> articlesUrl
        client/get
        :body
        parse-string
        (get-in ["articles" 0 "url"]))))
What you need is inside the body key, but the value corresponding to that key is still a string and not yet a Clojure map. When you look for source, you get nil back because that key doesn't exist at the top level (it lives inside body, and is only reachable after that string has been parsed as JSON).
Once you've properly parsed the body value, it should be something like:
(let [index-of-article 0]
  (get-in response ["body" "articles" index-of-article "url"]))
where index-of-article is the positional index of the article you want, since articles contains a vector of articles.
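The same pitfall, a JSON document carried as a string inside an outer structure, can be illustrated in Python (used here only for illustration; the question itself is about Clojure). The response dict below is a hand-made stand-in for an HTTP client result, not real NewsAPI output, and article_urls is a hypothetical helper.

```python
import json

# Hand-made stand-in for an HTTP response: "body" is still a JSON string.
response = {
    "status": 200,
    "body": json.dumps({
        "status": "ok",
        "articles": [
            {"source": {"id": "cnn", "name": "CNN"}, "url": "http://example.com/a"},
            {"source": {"id": "cnn", "name": "CNN"}, "url": "http://example.com/b"},
        ],
    }),
}


def article_urls(resp):
    """Parse the body string first, then walk into the articles list."""
    body = json.loads(resp["body"])  # string -> dict
    return [article["url"] for article in body["articles"]]


print(article_urls(response))
```

Looking up "articles" directly on response would fail here for the same reason the original get-in returned nil: the key only exists once the body string has been parsed.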

Geocode returns postal_code null for certain address

Geocode returns postal_code value null for certain addresses and I am not able to do reverse address lookup to retrieve the zip at that level.
An example address is "Peachtree Dunwoody Road, Atlanta, GA, United States"
There is no street number, and Dunwoody is also the name of a nearby city. This does not happen for all two-word street names, only when one of the words (the second one in the street name) is also a city name.
It works for most of the cases but just a few certain types ie "Peachtree Street Northwest, Atlanta, GA, United States"
The search is for "address".
geoLocationScript: "https://maps.googleapis.com/maps/api/js?",
geoLocationSensor: "sensor=false",
Is it a Google glitch, and is there any workaround?
Zip codes actually only correspond to mailing routes. "Peachtree Dunwoody Road, Atlanta, GA, United States" isn't a mailing address and as such Google is trying to give you the best results it can, balancing exactness with usefulness. It's likely that "Peachtree Dunwoody Road" traverses multiple zip codes, and Google returns a pin at a geometric center for the road (try the search in Maps) but doesn't try to guess a zip code. Zip codes can be complicated and it's probably best not to make a guess unless the entire street is contained in one zip code. For instance, sometimes, the east side of a road has one zip code but the west side has a different zip code.
As for whether there is a workaround or not, I believe the answer is no. To illustrate, you might look at the SmartyStreets demo site and fill in the address components as much as possible. I just tried "Peachtree Dunwoody Road, Atlanta, GA, United States." While I was typing, SmartyStreets suggested the following three results:
Peachtree Dunwoody Rd, Atlanta GA
W Peachtree Dunwoody Rd, Atlanta GA
Peachtree Dunwoody Rd NE, Atlanta GA
Full disclosure: I worked for SmartyStreets, an address validation company.
Some location entries returned by the geocoder won't have postal_codes. The response for "Peachtree Dunwoody Road, Atlanta, GA, United States" is of type "GEOMETRIC_CENTER".
A "road" doesn't necessarily have a postal code (it isn't a postal address).
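Because some results carry no postal_code component, client code generally has to treat it as optional. A minimal Python sketch follows. The sample dictionaries imitate the shape of a geocoder JSON result (address_components, types, geometry.location_type) but are hand-written, and the postal code value in the second one is a placeholder rather than the result of a real lookup.

```python
def postal_code(result):
    """Return the postal_code component's long_name, or None if absent."""
    for component in result.get("address_components", []):
        if "postal_code" in component.get("types", []):
            return component.get("long_name")
    return None


# Hand-written sample: a road-level result without a postal_code component.
road_result = {
    "address_components": [
        {"long_name": "Peachtree Dunwoody Road", "types": ["route"]},
        {"long_name": "Atlanta", "types": ["locality", "political"]},
    ],
    "geometry": {"location_type": "GEOMETRIC_CENTER"},
}

# Hand-written sample: a result that does include a (placeholder) postal code.
street_address_result = {
    "address_components": [
        {"long_name": "00000", "types": ["postal_code"]},
    ],
    "geometry": {"location_type": "ROOFTOP"},
}

for result in (road_result, street_address_result):
    print(result["geometry"]["location_type"], postal_code(result))
```

Checking location_type alongside the missing component makes it easier to tell "no postal code because this is a road" apart from a genuinely incomplete result.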

google geocode cannot find address but google maps can

I am trying to resolve an address using Google geocode. The address is imported from elsewhere or user-entered and not very orderly.
With the address spaced out like this:
O/s NCP Car Park, Coram St Jn Woburn Place London WC1
or spaces replaced with "+"
O/s+NCP+Car+Park,Coram+St+Jn+Woburn+Place+London+WC1
My code will attempt to match the address in full and then word by word will remove leading words until a match occurs.
In Google Maps this resolves fine, but even when my address is reduced to
"Woburn Place London WC1"
geocode will not resolve it.
This seems to me to be a pretty tidy address with capital city and postcode.
Should I be using some other maps function?
I already tried autocomplete+places and that was resolving even less of my addresses list.
thanks all,
jON
The Google Maps Places API textSearch finds two results for that string ("O/s NCP Car Park,Coram St Jn Woburn Place London WC1"):
Found 2 results for O/s NCP Car Park,Coram St Jn Woburn Place London WC1
[ 0 ]: London, United Kingdom (51.523813, -0.12609699999995883)
[ 1 ]: Woburn Place, Coram St, London WC1H 0ND, United Kingdom (51.5238677, -0.12642860000005385)
To use the geocoding API you need to use "WC1H" for the postcode:
Found 1 results for O/s NCP Car Park,Coram St Jn Woburn Place London WC1H
[ 0 ]: Woburn Pl, London WC1H, UK (51.5237701, -0.12688589999993383)
url: https://maps.googleapis.com/maps/api/geocode/json?address=O/s%20NCP%20Car%20Park,Coram%20St%20Jn%20Woburn%20Place%20London%20WC1H
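The retry strategy described in the question (drop leading words until something matches) and the URL encoding of the address can be sketched in Python. The helper names candidate_queries and geocode_url are hypothetical; the endpoint is the one shown in the URL above, and the optional key parameter is an assumption about how a request would be authenticated.

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def candidate_queries(address):
    """Yield the full address, then versions with leading words removed."""
    words = address.split()
    for start in range(len(words)):
        yield " ".join(words[start:])


def geocode_url(address, key=None):
    """Build a request URL; urlencode escapes spaces, commas and slashes."""
    params = {"address": address}
    if key is not None:
        params["key"] = key  # assumed authentication parameter
    return f"{GEOCODE_ENDPOINT}?{urlencode(params)}"


for query in candidate_queries("Woburn Place London WC1H"):
    print(geocode_url(query))
```

Letting urlencode do the escaping avoids hand-replacing spaces with "+" and also covers characters such as "/" and "," in the raw address.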

Extracting city,province, country etc from address in SSIS

I am doing ETL and want to extract city and province from thousands of addresses. I have tried the SUBSTRING( «character_expression», «start», «length» ) transformation in SSIS but could not get the result because I don't know, for each address, where the substring starts or how long each required substring is.
Following are the examples from which I want "M.D.A. ROAD" , "MULTAN" etc.
PAKISTAN COTTON GINNERS' ASSOCIATION PCGA HOUSE, M.D.A. ROAD, MULTAN
PAKISTAN CROP PROTECTION ASSOCIATION 2-A, INDUSTRIAL ESTATE ROOMY COTTON FACTORY, MULTAN
PAKISTAN AGRICULTURE & DAIRY FARMERS ASSOCIATION 16-C, PEOPLES COLONY, FAISALABAD
THE FAISALABAD CHAMBER OF COMMERCE & INDUSTRY FCCI COMPLEX, EAST, CANAL ROAD, FAISALABAD
Thanks in advance.
There are two strategies that you could use, depending on whether the address structure is regular or not:
1. Use the location of the "," characters to break up the address into components
2. Use a lookup table of known streets and provinces to identify them in the address
Option 1 is the simplest, but relies on the address format being consistent: i.e. all components are separated by a ",", province is always the last component, city is always the second last.
You'd use the TOKEN function to break the address up by ",".
For province (the last component) it would be something like:
TRIM(TOKEN([Address], ",", TOKENCOUNT([Address], ",")))
For city (the second last component), it would be:
TRIM(TOKEN([Address], ",", TOKENCOUNT([Address], ",") - 1))
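For readers who want to check the logic outside SSIS, the same "split on commas, count from the end" idea can be sketched in Python. This is only an illustration of Option 1, under the same assumption that the components are consistently comma-separated; last_components is a hypothetical helper name.

```python
def last_components(address, count=2):
    """Return the last `count` comma-separated parts, whitespace-trimmed."""
    parts = [part.strip() for part in address.split(",")]
    return parts[-count:]


address = "PAKISTAN COTTON GINNERS' ASSOCIATION PCGA HOUSE, M.D.A. ROAD, MULTAN"
second_last, last = last_components(address)
print(second_last)  # M.D.A. ROAD
print(last)         # MULTAN
```

In the sample rows from the question the last component is a city and the second-last is a road or area, so which position maps to which field depends on the actual data.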