Osmosis - Removing businesses from OSM data for geocoding usage

I'm trying to set up a Nominatim database for address geocoding. The database will be used by komoot's Photon, but I guess that's not so important.
The problem is that the OSM XML/PBF files I have contain not just addresses but a whole bunch of other things, like bars, various offices and so on, which I'm trying to remove.
The idea is to go with something like this until I get the desired result set:
osmosis --read-xml us-northeast-latest.osm.bz2 \
--tf reject-nodes landuse=* \
--tf reject-nodes amenity=* \
--tf reject-nodes office=* \
--tf reject-nodes shop=* \
--tf reject-nodes place=house \
--write-xml output.osm
However, after importing the resulting file, I still get those nodes (which should have been excluded) in the search results:
{
  properties: {
    osm_key: "office",
    osm_value: "ngo",
    extent: [
      -73.9494926,
      40.6998938,
      -73.9482012,
      40.6994192
    ],
    street: "Flushing Avenue",
    name: "Public Lab NYC",
    state: "New York",
    osm_id: 250328718,
    osm_type: "W",
    housenumber: "630",
    postcode: "11206",
    city: "New York City",
    country: "United States of America"
  },
  type: "Feature",
  geometry: {
    type: "Point",
    coordinates: [
      -73.9490215989286,
      40.699639649999995
    ]
  }
}
Note the osm_key and value.
I'm unsure what I'm doing wrong here. Any help would be appreciated.

I don't think you are familiar enough with OSM's elements and tags yet.
Dropping entire nodes (or ways or relations) that contain specific tags is definitely not what you want. Instead, you want to either drop specific tags, or keep only specific tags and drop all the others – without dropping the objects themselves.
To understand the difference, you have to know that addresses in OSM are modeled in two different ways: either as a separate address node, or attached to an already existing feature such as a building, a shop, a restaurant, etc. The second way is the important part here, because your approach drops all of those addresses.
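For illustration, here are the two styles side by side in simplified OSM XML (the IDs, coordinates and the second feature are made up); a rule like reject-nodes amenity=* would throw away the second node, address and all:

<!-- style 1: the address is its own node -->
<node id="1" lat="40.69" lon="-73.94">
  <tag k="addr:housenumber" v="630"/>
  <tag k="addr:street" v="Flushing Avenue"/>
</node>

<!-- style 2: the address rides on an existing feature -->
<node id="2" lat="40.70" lon="-73.95">
  <tag k="amenity" v="restaurant"/>
  <tag k="addr:housenumber" v="12"/>
  <tag k="addr:street" v="Example Street"/>
</node>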
Therefore you want to keep elements even if they are "just" a shop or a restaurant, because they can still carry an address. But you are free to drop all non-address tags from these elements, and to drop all elements that don't contain any address tags at all. This should be possible with osmosis; however, I'm not familiar enough with it to provide the required parameters.
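For what it's worth, a rough, untested sketch of that approach with osmosis's tag-filter tasks might look like this (it assumes --tf, --used-node and --merge behave as documented; stripping the remaining non-address tags from the kept elements would additionally need the tagtransform plugin):

# pass 1: keep standalone address nodes
osmosis --read-xml us-northeast-latest.osm.bz2 \
  --tf accept-nodes addr:housenumber=* \
  --tf reject-ways --tf reject-relations \
  --write-xml address-nodes.osm

# pass 2: keep ways (e.g. buildings) carrying an address, plus their nodes
osmosis --read-xml us-northeast-latest.osm.bz2 \
  --tf accept-ways addr:housenumber=* \
  --tf reject-relations --used-node \
  --write-xml address-ways.osm

# merge the two extracts
osmosis --read-xml address-nodes.osm --read-xml address-ways.osm \
  --merge --write-xml addresses.osm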
Yet I'm not sure this is really a good idea, because more than one object can share the same name. Imagine a river, a mountain peak, a small village and a large village all sharing the same name. If you drop the additional tags required for distinguishing a river from a peak and a small village from a large one, then you will run into trouble when trying to decide which name to choose from the list of search results.

Related

When creating a file format, is it better to create multiple formats or have several optional sections?

Let's say I am the technical lead for a software company specializing in writing applications for chefs that help organize recipes. Starting out, we are developing one app for bakers who make cakes and another for sushi chefs. One of the requirements is to create a standard file format for importing and exporting recipes. (This file format will go on to be an industry standard, with other companies using it to interface with our products.) We are faced with two options: make a standard recipe format (let's say .recipe), which uses common properties where applicable and optional properties where they differ, or make independent formats for each application (let's say .sushi and .cake).
Imagine the file format would look something like this for sushi:
{
  "name": "Big California",
  "type": "sushi",
  "spiciness": 0,
  "ingredients": [
    {
      "name": "rice",
      "amount": 20.0,
      "units": "ounces"
    },
    {
      ...
    }
  ]
}
and imagine the file format would look something like this for cakes:
{
  "name": "Wedding Cake",
  "type": "cake",
  "layers": 3,
  "ingredients": [
    {
      "name": "flour",
      "amount": 40.0,
      "units": "ounces"
    },
    {
      ...
    }
  ]
}
Notice the file formats are very similar, with only the spiciness and layers properties differing between them. Undoubtedly, as the applications grow in complexity and sophistication, many more specialized properties will be added, and more applications will be added to the suite for other types of chefs. With this context:
Is it wiser to have each application read/write .recipe files that adhere to a somewhat standardized interface, or is it wiser to remove all interdependence and have each application read/write their respective .sushi and .cake file types?
This kind of thing can be a very, very deep thing to get right. I think a lot of it depends on what you want to be able to do with recipes beyond simply displaying them for use by a chef.
For example, does one want to normalise data across the whole system? That is, when a recipe specifies "flour", what do you want to say about that flour, and how standardised do you want that to be? Imagine a chef is preparing an entire menu, and wants to know how much "plain white high gluten zero additives flour" is used by all the recipes in that menu. They might want to know this so they know how much to buy. There's actually quite a lot you can say about just flour that means simply having "flour" as a data item in a recipe may not go far enough.
The "modern" way of going about these things is to simply have plain text fields, and rely on some kind of flexible search to make equivalency associations between fields like "flour, white, plain, strong" and "high gluten white flour". That's what Google does...
The "proper" way to do it is to come up with a rigid schema that allows "flour" to be fully specified. It's going to be hard to come up with a file / database format schema that can exhaustively and unambiguously describe every single possible aspect of "flour". And if you take it too far, then you have the problem of making associations between two different records of "flour" that, for all reasonable purposes are identical, but differ in some minor aspect. Suppose you had a field for particle size; the search would have to be clever enough to realise that there's no real difference in flours that differ by, for example, 0.5 micrometer in average particle size.
We've discussed the possible extent of the definition of a "flour". One also has to consider the method by which the ingredient is prepared. That adds a whole new dimension of difficulty. And then one would have to describe all the conceivable kitchen utensils too. One can see the attractions of free text...
With that in mind, I would aim to have a single file format (.recipe), but not break down the data too much. I would forget about trying to categorise each and every ingredient down to the last possible level of detail. Instead, for each ingredient I'd have a free-text description, then perhaps a well-structured quantity field (e.g. a number and a unit: 1 and cup), and finally a piece of free text describing the ingredient preparation (e.g. sieved). Then I'd have something that describes a preparation step, referencing the ingredients; that would have some free-text fields and structured fields ("sieve the", , "into a bowl"). The file will contain a list of these. You might also have a list of required utensils, and a general description field too. You'll also want structured fields for recipe metadata (e.g. 'cake', or 'sushi', or serves 2); something like the sketch below.
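To make that concrete, here is a sketch of a single .recipe file along those lines (illustrative only; every field name here is an assumption, not a proposed standard):

{
  "name": "Basic Sponge",
  "type": "cake",
  "serves": 2,
  "description": "A plain sponge cake.",
  "ingredients": [
    {
      "description": "plain white flour",
      "quantity": { "amount": 1, "unit": "cup" },
      "preparation": "sieved"
    }
  ],
  "utensils": [ "bowl", "sieve" ],
  "steps": [
    { "text": "sieve the flour into a bowl", "ingredients": [0] }
  ]
}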
Or something like that. Having some structure allows some additional functionality to be implemented (e.g. tidy layout of the recipe on a screen). Just having a single free-text field for the whole thing means that it'd be difficult to add, say, an ingredient ordering feature - who is to say what lines in the text represent ingredients?
Having separate file formats would involve coming up with a large number of schemas. It would be even more unmanageable.

How to extract needed information from text?

I have a lot of publications from which I want to parse and extract the needed and useful information.
Suppose I have this publication A
2 places available tomorrow at 12AM from California to Alaska. Cost is 100$. And this is my phone number 814141243.
Another one B
One place available to Texas. We will be leaving at 13PM today. Cost will be discussed. Tel: 2323575456.
I want to find the best way to extract data from these publications using an algorithm with linear complexity.
For each publication, the algorithm must produce this:
{ "publication": [
{ "id":"A",
"date":"26/01/2016",
"time":"12AM",
"from":"California",
"to":"Alaska",
"cost":"100$",
"nbrOfPlaces":"2",
"tel":"814141243" },
{ "id":"B",
"date":"25/01/2016",
"time":"13PM",
"from":"",
"to":"Texas",
"cost":"",
"nbrOfPlaces":"1",
"tel":"2323575456" }
]
}
I want to extract as much information as possible from these publications. But obviously the problem is the words chosen by the writer of the publication and how they are structured: publications simply don't have a common structure, so I can't easily parse and extract the needed information.
Are there any concepts or paradigms that deal with this kind of problem?
Note: I can't force the publications' writers to respect a precise structure for the text.
It seems all the comments are discouraging you from trying to do this. However, the variation in the text seems quite limited; I can see a simple algorithm finding the info in most (but obviously not all) input. I'd try something like this (a rough code sketch follows at the end of this answer):
Split the text into parts on punctuation (.;?!()) and then look at the text line by line; this will help determine context.
Use a list of often-used words and abbreviations to determine where each bit of info is located.
Date: look for the names of days or months, "today", "tomorrow" or typical notations of dates like "12/31".
Time: look for combinations with "AM", "PM", "morning", "noon" etc., or typical time notations like "12:30"
Route: look for "from" and "to", possibly combined with "going", "driving", "traveling" etc. and maybe look for capital letters to find the place names (and/or use a list of often-used destinations).
Cost: look for a line that contains "$" or "cost" or "price" or similar, and find the number, or typical "to be discussed" or "to be determined" phrasing.
Places: look for "places", "seats", "people" and find the number, or "place", "seat" or "person" and conclude there is 1 place.
Phone: look for a sequence of digits of a certain length, with maybe spaces or ./() between them.
If you're certain that you've found a part of the info, mark it so that it isn't used again; e.g. if you find "8.30" together with "AM", it's obviously a time. However, if you just find "8.30" it could be a date or a time, or even $8.30.
You'll have to take into account that a small percentage of input will never be machine-readable; something like "off to the big apple at the crak-o-dawn, wanna come with? you pay the gas-moh-nay!" will always need human interpretation.
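A minimal sketch of those heuristics in PHP (the regexes and field list are illustrative and deliberately simplistic, not a complete parser):

<?php
// One regex per field; the first match wins. Non-capturing groups (?:...)
// are used where we want the whole match rather than a sub-part.
$extractors = [
    'tel'  => '/\b\d[\d ().\/-]{7,}\d\b/',             // long digit sequence
    'time' => '/\b\d{1,2}(?::\d{2})?\s*(?:AM|PM)\b/i', // "12AM", "8:30 PM"
    'cost' => '/\b\d+(?:\.\d+)?\s*\$/',                // "100$"
    'from' => '/\bfrom\s+([A-Z][a-z]+)/',              // "from California"
    'to'   => '/\bto\s+([A-Z][a-z]+)/',                // "to Alaska"
];

function extract_fields(string $text, array $extractors): array {
    $found = [];
    foreach ($extractors as $field => $regex) {
        if (preg_match($regex, $text, $m)) {
            $found[$field] = $m[1] ?? $m[0]; // use the capture group if present
        }
    }
    return $found;
}

print_r(extract_fields(
    '2 places available tomorrow at 12AM from California to Alaska. ' .
    'Cost is 100$. And this is my phone number 814141243.',
    $extractors
));

Marking consumed matches and resolving ambiguities (is "8.30" a time, a date, or a price?) would sit on top of this, as described above.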

Creating a Family Tree with Neo4j

I have a set of data for a family tree in Neo4J and am trying to build a Cypher query that produces a JSON data set similar to the following:
{Name: "Bob",
parents: [
{Name: "Roger",
parents: [
Name: "Robert",
Name: "Jessica"
]},
{Name: "Susan",
parents: [
Name: "George",
Name: "Susan"
]}
]}
My graph has a PARENT relationship between MEMBER nodes (i.e. MATCH (p:Member)-[:PARENT]->(c:Member)). I found "Nested has_many relationships in cypher" and "neo4j cypher nested collect", which end up grouping all parents together for the main child node I am searching for.
Adding some clarity based on feedback:
Every member has a unique identifier. The unions are currently all associated with the PARENT relationship. Everything is indexed so that performance will not suffer. When I run a query to just get back the node graph, I get the results I expect. I'm trying to return an output that I can use for visualization purposes with D3. Ideally this will be done with a Cypher query, as I'm using the API to access Neo4j from the frontend being built.
Adding a sample query:
MATCH (p:Person)-[:PARENT*1..5]->(c:Person)
WHERE c.FirstName = 'Bob'
RETURN p.FirstName, c.FirstName
This query returns a list of each parent for five generations, but instead of showing the hierarchy, it's listing 'Bob' as the child for each relationship. Is there a Cypher query that would show each relationship in the data at least? I can format it as I need to from there...
Genealogical data might comply with the GEDCOM standard and include two types of nodes: Person and Union. A Person node has its identifier and the usual demographic facts; a Union node has a union_id and the facts about the union. In GEDCOM, Family is a third element bringing these two together, but in Neo4j I found it suitable to also include the union_id in Person nodes. I used 5 relationships: father, mother, husband, wife and child. The family is then two parents with an inward vector and each child with an outward vector. The image illustrates this.
This is very handy for visualizing connections and generating hypotheses. For example, consider the attached picture and my ancestor Edward G Campbell, the product of union 1917, where three brothers married three Vaught sisters from union 8944 and two married Gaither sisters from union 2945. Also note, in the upper left, how Mahala Campbell married her step-brother John Greer Armstrong. Next to Mahala is an Elizabeth Campbell who is connected by marriage to other Campbells, but is likely directly related to them. Similarly, you can hypothesize about Rachael Jacobs in the upper right and how she might relate to the other Jacobs.
I use bulk inserts, which can populate ~30,000 Person nodes and ~100,000 relationships in just over a minute. I have a small .NET function that returns the JSON from a dataview; this generic solution works with any dataview, so it is scalable. I'm now working on adding other data, such as locations (lat/long), documentation (particularly documents linking folks, such as a census), etc.
You might also have a look at Rik van Bruggen's blog on his family data.
Regarding your query
You already create a path pattern here: (p:Person)-[:PARENT*1..5]->(c:Person). You can assign it to a variable, say tree, and then operate on that variable, e.g. returning the tree, or nodes(tree), or rels(tree), or operating on those collections in other ways:
MATCH tree = (p:Person)-[:PARENT*1..5]->(c:Person)
WHERE c.FirstName = 'Bob'
RETURN nodes(tree), rels(tree), tree, length(tree),
[n in nodes(tree) | n.FirstName] as names
See also the cypher reference card: http://neo4j.com/docs/stable/cypher-refcard and the online training http://neo4j.com/online-training to learn more about Cypher.
Don't forget to
create index on :Person(FirstName);
I'd suggest building a method to flatten out your data into an array. If the objects don't have UUIDs, you would probably want to give them IDs as you flatten, and then have a parent_id key for each record.
You can then run it as a set of Cypher queries (either making multiple requests to the query REST API, or using the batch REST API), or alternatively dump the data to CSV and use Cypher's LOAD CSV command to load the objects.
An example cypher command with params would be:
CREATE (:Member {uuid: {uuid}, name: {name}})
And then running through the list again with the parent and child IDs:
MATCH (m1:Member {uuid: {uuid1}}), (m2:Member {uuid: {uuid2}})
CREATE (m1)<-[:PARENT]-(m2)
Make sure to have an index on the ID for members!
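For example (using the same older-style Cypher as above; the CSV file name and column names are hypothetical):

CREATE INDEX ON :Member(uuid);

// the LOAD CSV route mentioned above, if you dump the flattened data to CSV:
LOAD CSV WITH HEADERS FROM 'file:///members.csv' AS row
CREATE (:Member {uuid: row.uuid, name: row.name});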
The only way I have found thus far to get the data I am looking for is to actually return the relationship information, like so:
MATCH ft = (person {firstName: 'Bob'})<-[:PARENT*]-(p:Person)
RETURN EXTRACT(n IN nodes(ft) | {firstName: n.firstName}) AS parentage
ORDER BY length(ft);
Which will return a dataset I am then able to morph:
["Bob", "Roger"]
["Bob", "Susan"]
["Bob", "Roger", "Robert"]
["Bob", "Susan", "George"]
["Bob", "Roger", "Jessica"]
["Bob", "Susan", "Susan"]

How is the tilde escaping in the JSON Patch RFC supposed to operate?

Referencing https://www.rfc-editor.org/rfc/rfc6902#appendix-A.14:
A.14. ~ Escape Ordering
An example target JSON document:
{
  "/": 9,
  "~1": 10
}
A JSON Patch document:
[
  {"op": "test", "path": "/~01", "value": 10}
]
The resulting JSON document:
{
  "/": 9,
  "~1": 10
}
I'm writing an implementation of this RFC, and I'm stuck on this. What is this trying to achieve, and how is it supposed to work?
Assuming the answer to the first part is "Allowing json key names containing /s to be referenced," how would you do that?
The ~ character is a special character in JSON Pointer, hence we need to "encode" it as ~0. To quote jsonpatch.com:
If you need to refer to a key with ~ or / in its name, you must escape the characters with ~0 and ~1 respectively. For example, to get "baz" from { "foo/bar~": "baz" } you’d use the pointer /foo~1bar~0
So essentially,
[
  {"op": "test", "path": "/~01", "value": 10}
]
when decoded yields
[
  {"op": "test", "path": "/~1", "value": 10}
]
~0 expands to ~, so /~01 expands to /~1.
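In code, this comes down to replacement order. RFC 6901 (section 4) says to unescape each reference token by replacing ~1 first and ~0 second, so the ~ produced by ~0 can never be re-expanded; a minimal PHP sketch:

<?php
// Decode one JSON Pointer reference token (RFC 6901, section 4).
// "~1" must be replaced before "~0": doing it the other way around
// would turn "~01" into "~1" and then wrongly into "/".
function unescape_token(string $token): string {
    return str_replace('~0', '~', str_replace('~1', '/', $token));
}

var_dump(unescape_token('~01')); // string(2) "~1" -- matches the "~1" key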
I guess they mean that you shouldn't "double expand": the expanded /~1 should not be expanded again to //, and thus must not match the document's "/" key (which is what would happen if you double-expanded). Neither should you expand literals in the source document, so the "~1" key is literally that and not equivalent to the expanded "/". But I repeat, that's my guess about the intention of this example; the real intention may be different.
The example is indeed really bad, in particular since it uses a "test" operation and doesn't specify the result of that operation. Other examples, like the next one at A.15, at least say the test operation must fail; A.14 doesn't tell you whether the operation should succeed or not. I assume they meant it to succeed, which implies /~01 should match the "~1" key. That's probably all there is to this example.
If I were to write an implementation, I'd probably not worry too much about this example and just look at what other implementations do, to check if I'm compatible with them. It's also a good idea to look for test suites of other projects; for example, I found one from http://jsonpatch.com/ at https://github.com/json-patch/json-patch-tests
I think the example provided in the RFC isn't exactly well thought out, especially since it tries to document a feature only through an example, which is vague at best, without providing any kind of commentary.
You might be interested in the interpretation presented in the following documents:
Documentation of Rackspace API
Documentation of OpenStack API
These seem awfully similar, and I think that's due to the nature of the relation between Rackspace and OpenStack:
OpenStack began in 2010 as a joint project of Rackspace Hosting and NASA (...)
It actually provides some useful details, including the grammar it accepts and the rationale behind introducing these tokens, which the RFC itself does not.
Edit: it seems that JSON Pointers have a separate RFC, RFC 6901, and the OpenStack and Rackspace specifications above are consistent with it.

How should I populate city/state fields based on the ZIP code?

I'm aware there are databases for ZIP codes, but how would I grab the city/state fields based on that? Do these databases contain the cities/states, or do I have to do some sort of lookup to a web service?
\begin{been-there-done-that}
Important realization: There is not a one-to-one mapping between cities/counties and ZIP codes. A ZIP code is not based on a political area but instead a distribution area as defined for the USPS's internal use. It doesn't make sense to look up a city based on a ZIP code unless you have the +4 or the entire street address to match a record in the USPS address database; otherwise, you won't know if it's RICHMOND or HENRICO, DALLAS or FORT WORTH, there's just not enough information to tell.
This is why, for example, many e-commerce vendors find dealing with New York state sales tax frustrating, since that tax scheme is based on county, e-commerce systems typically don't ask for the county, and ZIP codes (the only information they provide instead) in New York can span county lines.
The USPS updates its address database every month, and it costs real money, so pretty much any list that you find freely available on the Internet is going to be out of date, especially with the USPS closing post offices to save money.
One ZIP code may span multiple place names, and one city often uses several (but not necessarily whole) ZIP codes. Finally, the city name listed in the ZIP code file may not actually be representative of the place in which the addressee actually lives; instead, it represents the location of their post office. Our office mail is addressed to ASHLAND, but we work about 7 miles from the town's actual political limits. ASHLAND just happens to be where our carrier's route originates from.
For guesstimating someone's location, such as for a search of nearby points of interest, these sources and City/State/ZIP sets are probably fine; they don't need to be exact. But for address validation in a data entry scenario? Absolutely not -- validate the whole address or don't bother at all.
Just a friendly reminder to take a step back and remember the data source's intended use!
\end{been-there-done-that}
Modern ZIP code databases contain columns for the City and State fields.
http://sourceforge.net/projects/zips/
http://www.populardata.com/
Using the Ziptastic HTTP/JSON API
This is a pretty new service, but according to their documentation, it looks like all you need to do is send a GET request to http://ziptasticapi.com, like so:
GET http://ziptasticapi.com/48867
And they will return a JSON object along the lines of:
{"country": "US", "state": "MI", "city": "OWOSSO"}
Indeed, it works. You can test this from a command line by doing something like:
curl http://ziptasticapi.com/48867
Using the US Postal Service HTTP/XML API
According to this page on the US Postal Service website, which documents their XML-based web API (specifically Section 4.0, page 22, of this PDF document), they have a URL where you can send an XML request containing a 5-digit ZIP code, and they will respond with an XML document containing the corresponding City and State.
According to their documentation, here's what you would send:
http://SERVERNAME/ShippingAPITest.dll?API=CityStateLookup&XML=<CityStateLookupRequest%20USERID="xxxxxxx"><ZipCode ID= "0"><Zip5>90210</Zip5></ZipCode></CityStateLookupRequest>
And here's what you would receive back:
<?xml version="1.0"?>
<CityStateLookupResponse>
<ZipCode ID="0">
<Zip5>90210</Zip5>
<City>BEVERLY HILLS</City>
<State>CA</State>
</ZipCode>
</CityStateLookupResponse>
USPS does require that you register with them before you can use the API, but, as far as I could tell, there is no charge for access. By the way, their API has some other features: you can do Address Standardization and Zip Code Lookup, as well as the whole suite of tracking, shipping, labels, etc.
I'll try to answer the question "HOW should I populate...", and not "SHOULD I populate..."
Assuming you are going to do this more than once, you would want to build your own database. This could be nothing more than a text file downloaded from any of the many sources (see Pentium10's reply here). When you need a city name, you search for the ZIP and extract the city/state text. To speed things up, you would sort the table in numeric order by ZIP, build an index of lines, and use a binary search.
If your ZIP database looked like this (from sourceforge):
"zip code", "state abbreviation", "latitude", "longitude", "city", "state"
"35004", "AL", " 33.606379", " -86.50249", "Moody", "Alabama"
"35005", "AL", " 33.592585", " -86.95969", "Adamsville", "Alabama"
"35006", "AL", " 33.451714", " -87.23957", "Adger", "Alabama"
The most simple-minded extraction from the text would go something like
$zipLine = lookup($ZIP); // the binary search described above; a sketch follows below
if ($zipLine) {
    // naive split; these CSV fields keep their surrounding quotes
    $fields = explode(", ", $zipLine);
    $city   = trim($fields[4], '"');
    $state  = trim($fields[5], "\"\n");
} else {
    die("$ZIP not found");
}
If you are just playing with text in PHP, that's all you need. But if you have a database application, you would do everything in SQL. Further details on your application may elicit more detailed responses.
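The lookup() call above was left undefined; a minimal sketch of the binary search it implies might look like this (shown here taking the line array explicitly; it assumes the whole file is loaded into an array of lines sorted by ZIP, so string comparison of the fixed-width 5-digit codes matches numeric order):

<?php
// Hypothetical lookup(): binary search over CSV lines sorted by ZIP.
// $lines could come from, e.g., $lines = file('zips.csv');
function lookup(array $lines, string $zip): ?string {
    $lo = 0;
    $hi = count($lines) - 1;
    while ($lo <= $hi) {
        $mid = intdiv($lo + $hi, 2);
        // each line starts with the quoted ZIP, e.g. "35004", ...
        $midZip = trim(explode(",", $lines[$mid])[0], "\" ");
        if ($midZip === $zip) {
            return $lines[$mid];
        } elseif ($midZip < $zip) {
            $lo = $mid + 1;
        } else {
            $hi = $mid - 1;
        }
    }
    return null; // ZIP not in the table
}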