Visualizing two sets of data on one map in D3: Is there a way to map two datasets to a range -1 to 1?

I have two csv files containing countries and values that correspond to each country.
The data from CSV 1 denotes the number of times a country has been attacked on their own soil.
The data from CSV 2 denotes the number of times a country has attacked another country abroad.
There is overlap between the two sets of data, and I intend to display values from both data sets on a single grey-scale range on a choropleth map.
I have some (obviously) phony data below to demonstrate what I'm working with.
TARGET.csv
country, code, value
Iran, IRN, 5
Russia, RUS, 4
United States, USA, 0
Egypt, EGY, 2
Spain, ESP, 1
ATTACKER.csv
country, code, value
Iran, IRN, 3
Russia, RUS, 9
United States, USA, 4
Egypt, EGY, 0
Spain, ESP, 0
There are more targets than attackers.
I want to ensure that I represent the data accurately, but do not know how I would create a normalized range of values between -1 and 1.
It is my understanding that displaying the data this way would best represent the reality, but I feel like I may be wrong.
In summation:
1) Am I thinking about this problem properly? Is this even the right way to think about displaying the data?
2) What is the proper language used to describe my question?
I am usually able to figure these things out but I'm stumped with dead-end search queries.
3) How do I make sure that my range is normalized? Notice that the USA above is the only attacker that has never been a target. Would that make the USA the value nearest +1, despite Russia's larger number of attacks?
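For concreteness, here is the kind of normalization I have been toying with - a rough Python sketch (the file and column names match my CSVs above; I compute a net score per country and divide by the largest absolute value):

import csv

def load(path):
    # Read "country, code, value" rows into {code: value}.
    with open(path, newline="") as f:
        reader = csv.DictReader(f, skipinitialspace=True)
        return {row["code"]: int(row["value"]) for row in reader}

target = load("TARGET.csv")      # times attacked on their own soil
attacker = load("ATTACKER.csv")  # times attacking abroad

codes = set(target) | set(attacker)
net = {c: attacker.get(c, 0) - target.get(c, 0) for c in codes}
peak = max(abs(v) for v in net.values()) or 1   # avoid dividing by zero
scaled = {c: v / peak for c, v in net.items()}  # every value now in [-1, 1]

With my phony data this gives Russia (net +5) the value 1.0 and the USA (net +4) the value 0.8, so under this scheme Russia, not the USA, would sit nearest +1.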
I would appreciate whatever input you all can offer.


Google maps sum of unique road distance in given area

I would like to calculate the number of unique kilometers of roadways in my city. More generally, I wish to sum the distance of every road within a boundary; for simplicity, a rectangle will do.
Is this possible using the Google Maps suite of APIs? If so, how would you go about doing it? If anyone has any resources related to this type of problem, I would be interested in reading them regardless of language (or even solutions with other mapping tools).
Bonus points: A general solution to this problem that can be applied to the preset "cities" (example) that appear in Google Maps with well-defined city limits.
You can use OpenStreetMap to calculate the total road length of a specific country or geographic area. There are multiple solutions available, based on multiple similar questions already asked.
Approach 1 from Total road length in Kilometers for a country at help.openstreetmap.org:
Use the Perl script osm-length-2.pl. There is an example at a mailing list post.
Approach 2 from Actual road length of exported map at help.openstreetmap.org:
Import your data (the planet, or a country or area extract) into a PostGIS database, then use the following queries proposed by Frederik Ramm:
-- 1. Pull the city boundary out of the polygon table.
SELECT way AS clip
INTO clipping_polygon
FROM planet_osm_polygon
WHERE boundary='administrative' AND admin_level='8' AND name='My City';

-- 2. Clip every highway to that boundary (aliased AS way so the next query can use it).
SELECT name, highway, ST_INTERSECTION(way, clip) AS way
INTO clipped_roads
FROM planet_osm_line, clipping_polygon
WHERE ST_INTERSECTS(way, clip) AND highway IS NOT NULL;

-- 3. Sum the clipped lengths (in metres), grouped by highway type.
SELECT highway, SUM(ST_LENGTH(way::geography))
FROM clipped_roads
GROUP BY highway;

Retrieving a fully qualified street address from ZIP / postal code

I have a form in which my users need to enter the following location data:
Full address line (street address, apartment, suite, unit, building, floor)
House number
City
State / province / region
ZIP / Postal code
Country
To simplify the completion of this form, I would like to automatically fill in the fully qualified address (address line, city, state/province, etc.) by letting the user enter only his country, ZIP code, and house number.
Is it correct that these 3 items are sufficient to look up the address in the United States? Or is more or less information necessary? And is the answer to this question different for every country? Moreover, is there a service, API, or library that can be utilized for this purpose (e.g. Google Maps or OpenStreetMap)?
Great questions!
Is it correct that these 3 items are sufficient to look up the address in the United States?
No. Unfortunately these three will get you down to ~hundreds of possible addresses in the US.
Is the answer to this question different for every country?
Yes! Postal systems vary greatly from country to country, and your users will have different expectations about what they are asked to supply - Brits don't expect to have to enter a full address, for example.
In the UK, Canada, and Australia you can usually get to a single address from the house number and postcode. BUT you cannot guarantee this. There may be sub-premise or business information which requires a bit of interaction with the user to check you have the right address.
Some countries, such as France, do not have complete premise-number coverage. With these you can take the premise number and postcode, but depending upon the town you have to alter your behavior to either trust and accept the input or prompt the user for a correction.
Another important consideration when planning your workflow is the need to allow for people who do not know their postcode/ZIP. It does not happen often, but sometimes people have just moved, or occasionally a property's postcode/ZIP changes, so it is important to be flexible about the information you require.
Is there a service, API, or library that can be utilized for this purpose?
Yes - there are several solutions around that offer the ability to capture global addresses. Experian Data Quality (my company) offers a hosted or on-premise solution that allows for this.
Try it out here - on the right-hand side, under "Do you want to know more?", you can switch countries; the prompt updates and the interaction occurs if needed.
I can only answer about US addresses (I work at SmartyStreets), but the answer is no, that won't work.
Kudos for your desires to improve the user experience. Unfortunately, I would not recommend trying this, and here's why:
A US ZIP code, in its entirety, is actually 11 digits long (12 with the check digit):
The first three digits are the SCF (Sectional Center Facility), kind of like a region code
The first five digits are your typical 5-digit ZIP code that specifies a set of carrier routes
The next 4 digits are more precise, often narrowing down an address to block-level.
The next 2 digits are seldom used except in barcodes, but they indicate the delivery point. In theory, this would specify a particular house, apartment, or mailbox, but in reality, sometimes the 11-digit code is ambiguous (common in large complexes, street blocks, or PO facilities). It's typical for the delivery point to correlate to the house or apartment number, but not always.
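To make that structure concrete, here is a throwaway Python sketch that splits a full 11-digit code into those parts (purely illustrative):

def split_zip11(zip11):
    # e.g. "12345-6789-01" or "12345678901" -> the pieces described above
    digits = "".join(ch for ch in zip11 if ch.isdigit())
    return {
        "scf": digits[:3],              # Sectional Center Facility / region
        "zip5": digits[:5],             # the familiar 5-digit ZIP
        "plus4": digits[5:9],           # often block-level
        "delivery_point": digits[9:11], # house/apartment/mailbox, in theory
    }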
So in your situation:
Knowing the country narrows down the possibilities to just 350,000,000+ addresses
Knowing the 5-digit ZIP code narrows it down to somewhere around 10,000+ addresses (important note: not everyone knows their 5-digit ZIP code, and ZIP codes change. What's more, they may not be sure whether to enter their PO box ZIP code or their house ZIP code. And what if their house doesn't receive mail? Or what if they're in the military and their 5-digit ZIP is in flux?)
Knowing the house number may narrow down the address candidates to anywhere from 1 to 1,000. It depends how "big" the ZIP code is. (But ZIP codes are not polygons.)
So no, it is not sufficient to know these three parts of the address. The country is practically worthless at that point, and the ZIP code is locality/city-specific at best. The house number might appear dozens, if not hundreds, of times in a ZIP code. (I grew up in the boonies where our house number was unique, but that's rare.)
And yes, the answer to this question varies country to country, but this reasoning holds true for most developed countries. Less developed countries don't have such organization to their postal system.
Is there a service that can do this? Not if you don't want your users to scroll through dozens or hundreds of results. If they have to look through more than just a couple, you're better off just asking them to type their full address.
I answered a very similar question just the other day. You might find it useful.
So now that I've rained doomsday on your idea, how about an alternative? Of course I'm partial to SmartyStreets' autocomplete, which suggests addresses, geo-located close to the user, as they're typing. I should mention that it's free. It doesn't actually verify the address until the user is finished or has chosen one of the suggestions, but it does reduce keystrokes.
Further on this UX vein, I'd recommend putting country as the first field of your address form. This way, you can alter the form's format based on the country they choose. If you use a service like LiveAddress, you can have the user type their address in a format comfortable to them in a single field, rather than across multiple text boxes in your arbitrary order, since LiveAddress can parse their input.
You could easily achieve this by using the Google Maps reverse geocoding API. Here's a link to its documentation: link
I don't know of any country where there is a one-to-one mapping between a post code and a street address. Except Singapore. Postal Codes in SG
In that particular case you can use the post code to fill in the remaining fields; in any other case you can derive the city name and the street address, but not likely the house number.
Example 1: (derive full street address from post code)
https://geocode.xyz/339696?geoit=xml
<geodata>
<latt>1.32035</latt>
<longt>103.87430</longt>
<elevation/>
<standard>
<stnumber>88</stnumber>
<addresst>88 GEYLANG BAHRU</addresst>
<postal>339696</postal>
<city>Singapore</city>
<prov>SG</prov>
<countryname>Singapore</countryname>
<confidence>0.5</confidence>
</standard>
</geodata>
Example 2: (Get most common street address, and other variations of city name)
https://geocode.xyz/27777?region=DE&geoit=xml
<geodata>
<latt>53.06060</latt>
<longt>8.58388</longt>
<elevation/>
<standard>
<stnumber>20</stnumber>
<addresst>20 Bokenbusch</addresst>
<postal>27777</postal>
<city>Ganderlesee</city>
<prov>DE</prov>
<countryname>Germany</countryname>
<confidence>0.5</confidence>
</standard>
<alt>
<loc>
<city>Ganderkesee</city>
<latt>53.06868</latt>
<longt>8.57437</longt>
<cc>951</cc>
</loc>
<loc>
<city>Bremen</city>
<latt>53.07675</latt>
<longt>8.57559</longt>
<cc>172</cc>
</loc>
<loc>
<city>Schierbrok</city>
<latt>53.08639</latt>
<longt>8.58037</longt>
<cc>166</cc>
</loc>
</alt>
</geodata>
The number in "cc" indicates how many street addresses in that city share the given post code.
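For what it's worth, here is a minimal Python sketch of calling this endpoint and pulling out the <standard> block (no error handling; the requests library is assumed):

import requests
import xml.etree.ElementTree as ET

def lookup_postcode(postcode, region=None):
    # Same URL format as the examples above, e.g. https://geocode.xyz/339696?geoit=xml
    params = {"geoit": "xml"}
    if region:
        params["region"] = region
    response = requests.get("https://geocode.xyz/" + str(postcode), params=params)
    standard = ET.fromstring(response.text).find("standard")
    return {child.tag: child.text for child in standard} if standard is not None else None

print(lookup_postcode(339696))              # Singapore example above
print(lookup_postcode(27777, region="DE"))  # German example above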
Good luck!

Report builder 2.0 expression to sort values highest to lowest on a chart

I have a 3D cylinder chart that I am having some problems with. I want to sort the cylinders so that the highest value is at the back and the lowest value is at the front; otherwise the tallest cylinders cover the smallest ones.
I have tried sorting both a-z and z-a, but I really need it to be dynamic based on the values. I have also tried sorting by the actual value field, both a-z and z-a, but this seems to return completely random results.
The data in the database looks like the example below. I use a parameter to separate by supplier.
Date        category_Type  cost  supplier
01/01/2013  apple          $5    abc
01/01/2013  pear           $10   def
01/01/2013  banana         $15   cgi
01/02/2013  apple          $7    etc
01/02/2013  pear           $12   etc
01/02/2013  banana         $18   etc
I believe I need some form of expression that sorts the values based on cost, as both a-z and z-a in this instance produce cylinders that block other cylinders.
I have tried sorting the series group by =Sum(Fields!cost.Value, "DataSet1") and by =Fields!cost.Value, but this seems to return random results.
I would be happy even if I could achieve a custom sort such as "banana, pear, apple", although for some suppliers this would still cause me an issue.
Edit 1: Strangely enough, this works with a line chart but not a 3D cylinder.
Edit 2: Attached is an example. I want the tallest cylinders at the back, but the methods mentioned above do not work.
In Chart Area Properties -> 3D Options, enable series clustering:
Series clustering: Choose this option to cluster series groups. When multiple series for bar or column charts are clustered, they are displayed along two distinct rows in the chart area. If series are not clustered, their corresponding data points are displayed adjacent to each other in one row. This option is applicable only to bar and column charts.
Also try changing the rotation and inclination degrees to get a better look, and decrease the wall thickness as well.

How can I filter out fictional locations (ex. "under a rock", "hiding") from Google Maps API geocode results?

Google Maps API does a great job trying to locate a match for nearly every query. But if I'm only interested in real locations, how can I filter out Google's guesses?
For example, according to Google, "under a rock" is located at "The Rock, Shifnal, Shropshire TF11, UK". But a person who answers the question, "Where are you?" with "Under a rock" does not mean to indicate that they are in Shropshire, UK. Instead they just don't want to tell you — well, either that or they are in real trouble, thankfully with web access, stuck under some rock.
I have several million user generated location strings that I'm attempting to find coordinates for. If someone writes "under a rock" I'd rather just leave the coordinates null instead of putting an obviously wrong point in Shropshire, UK.
Here are some other examples:
under a rock => Shropshire, UK
planet earth => Cheshire, UK
nowhere => Scituate, RI, USA
travelling => Madrid, Spain
hiding => Anderson, CA, USA
global => Midland, TX, USA
on the web => North Part, ON, Canada
internet => Frisco, TX, USA
worldwide => Mie Prefecture, Japan
Ultimately I'm after a solid way to return coordinates from a string but return false if the location is like the above.
I need to build a function that returns the following:
Twin Cities => Return the colloquial coordinates of Minneapolis-St. Paul
right behind you => false [Google gets this one "right" -- at least for my purposes]
under a rock => false
nowhere => false
Canada => Return coordinates
Mission District San Francisco => Return coordinates
Chicago => Return coordinates
a galaxy far far away => false [Google also gets this "right" — zero results]
What do you recommend?
Here's a comma-delimited array for you to play at home:
'twin cities','right behind you','under a rock','nowhere','canada','mission district san francisco','chicago','a galaxy far far away','london, england','1600 pennsylvania ave, washington, d.c.','california','41.87194,12.56738','global','worldwide','on the internet','mars'
And here's the url format:
'http://maps.googleapis.com/maps/api/geocode/json?address=' + query + '&sensor=false'
ex: http://maps.googleapis.com/maps/api/geocode/json?address=twin+cities&sensor=false
It seems most of your incorrect results have a "partial_match" attribute set to "true".
e.g.
Twin Cities, no partial match:
http://maps.googleapis.com/maps/api/geocode/json?address=Twin%20Cities&sensor=false
under a rock, 10+ results, all with partial match:
http://maps.googleapis.com/maps/api/geocode/json?address=under%20a%20rock&sensor=false
Though the original purpose of this attribute is not to tell whether a locality is correct or not, it's still pretty accurate on the dataset you provided.
From Google Maps API documentation:
partial_match indicates that the geocoder did not return an exact match for the original request, though it was able to match part of the requested address. You may wish to examine the original request for misspellings and/or an incomplete address.
Partial matches most often occur for street addresses that do not exist within the locality you pass in the request. Partial matches may also be returned when a request matches two or more locations in the same locality. For example, "21 Henr St, Bristol, UK" will return a partial match for both Henry Street and Henrietta Street. Note that if a request includes a misspelled address component, the geocoding service may suggest an alternate address. Suggestions triggered in this way will not be marked as a partial match.
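Here is a rough sketch of that filter in Python, using the URL format from the question (I only trust a result when at least one match is not partial; the requests library is assumed):

import requests

GEOCODE_URL = "http://maps.googleapis.com/maps/api/geocode/json"

def geocode_or_false(query):
    # Returns (lat, lng) for the first non-partial match, otherwise False.
    data = requests.get(GEOCODE_URL, params={"address": query, "sensor": "false"}).json()
    for result in data.get("results", []):
        if not result.get("partial_match", False):
            location = result["geometry"]["location"]
            return location["lat"], location["lng"]
    return False  # zero results, or partial matches only

print(geocode_or_false("twin cities"))   # coordinates
print(geocode_or_false("under a rock"))  # False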
This might not be the direct answer to your question.
If you are currently going through thousands of user inputs saved in a DB and filtering out the invalid ones, I think it is too late and not feasible. The output can only be as good as the input.
The better way is to make the input as valid as possible; end users don't always know what they want.
I would suggest that users enter their address through autocomplete, so that you will always have a valid address:
The user enters text and selects one of the suggestions.
A marker and info window are shown.
When the user confirms the input, you save the info window text as the user input, not the raw text input.
By doing this way, you don't need to validate or filter user input.
I know there are Bayes classifier implementations in JavaScript. I've never tried them, though; I currently use a Ruby implementation which works correctly.
You could have two classifications (Real and Unreal), training each of them with as many samples as you want (30 or 50 samples each?). If your classifier is well trained, it will be more accurate.
Then you'd test each location against the classifier before calling the Google Maps API, to filter out Unreal locations.
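For illustration, here is the shape of the idea as a Python sketch using scikit-learn rather than JavaScript (the training samples below come from the question's list and are far too few - you'd want your 30-50 per class):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

real = ["chicago", "canada", "london, england", "mission district san francisco"]
unreal = ["under a rock", "nowhere", "planet earth", "on the web", "hiding"]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(real + unreal, ["real"] * len(real) + ["unreal"] * len(unreal))

def should_geocode(location):
    # Only spend a geocoding call on strings classified as Real.
    return model.predict([location.lower()])[0] == "real"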
To truly succeed here you are going to have to build a database-driven system that facilitates both positive and negative lookups, with AI that gets smarter over time, just like Google did. I don't believe there is a single algorithm that will filter out results based on cosmetics alone.
I looked around and found a site that contains every city in the world. Unfortunately, it doesn't provide the data as a single list, so you'd have to spend a bit of time harvesting it. The site is http://www.fallingrain.com/world/index.html.
They seem to use individual directories to organize countries, states, and cities, broken down further by alphabet. It is, however, the only comprehensive source I could find.
If you manage to get all of these locations into a database, you will have the beginnings of a positive lookup system for your queries. You'll also need to start building separate lists of bi-, tri-, and quad-city areas, as well as popular destinations and landmarks.
You should also store a negative lookup table for all known mismatches. People tend to generate similar false data and typos across large populations, so the most popular "nowhere" and "planet earth" answers will be repeated over and over again, in every language you can think of.
One of the benefits of this strategy is that you can run relational queries against your data to get matches in bulk as well as one at a time. Since some false negatives will occur at the beginning, your main decision is what to do with unmatched items. You may want to adopt a strategy where you can both reject non-matches and substitute partial matches with the nearest actual match.
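A Python sketch of that lookup flow (the db methods and the geocoder callable are hypothetical placeholders):

def resolve_location(raw, db, geocoder):
    # 1. Known-bad strings ("nowhere", "planet earth", ...) fail fast.
    if db.negative_lookup(raw):
        return None
    # 2. Known-good strings return cached coordinates.
    cached = db.positive_lookup(raw)
    if cached:
        return cached
    # 3. Otherwise ask the geocoder once, and remember the verdict either way.
    coords = geocoder(raw)
    if coords:
        db.add_positive(raw, coords)
    else:
        db.add_negative(raw)
    return coords or None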
Anyhow, I hope this helps. It is a bit of effort, but if it's important it will be worth it. Who knows, you may end up with a database that's actually worth something - maybe even a Google Maps gateway service for companies/developers who need the same functionality. (:
Take care.

Human name comparison: ways to approach this task

I'm not a Natural Language Processing student, yet I know it's not a trivial strcmp(n1,n2).
Here's what I've learned so far:
Comparing personal names can't be solved 100%.
There are ways to achieve a certain degree of accuracy.
The answer will be locale-specific; that's OK.
I'm not looking for spelling alternatives! The assumption is that the input's spelling is correct.
For example, all the names below can refer to the same person:
Berry Tsakala
Bernard Tsakala
Berry J. Tsakala
Tsakala, Berry
I'm trying to:
build (or copy) an algorithm which grades the relationship between 2 input names
find an indexing method (for names in my database, for hash tables, etc.)
note:
My task isn't about finding names in text, but about comparing 2 names, e.g.
name_compare( "James Brown", "Brown, James", "en-US" ) ---> 99.0%
I used the Tanimoto coefficient for a quick (but not great) solution, in Python:
"""
Formula:
Na = number of set A elements
Nb = number of set B elements
Nc = number of common items
T = Nc / (Na + Nb - Nc)
"""
def tanimoto(a, b):
c = [v for v in a if v in b]
return float(len(c)) / (len(a)+len(b)-len(c))
def name_compare(name1, name2):
return tanimoto(name1, name2)
>>> name_compare("James Brown", "Brown, James")
0.91666666666666663
>>> name_compare("Berry Tsakala", "Bernard Tsakala")
0.75
>>>
Edit: A link to a good and useful book.
Soundex is sometimes used to compare similar names. It doesn't deal with first name/last name ordering, but you could probably just have your code look for the comma to solve that problem.
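The Python standard library doesn't ship a Soundex, so here is a compact (slightly simplified) sketch of the algorithm:

def soundex(name):
    # American Soundex: keep the first letter, code the rest, collapse adjacent
    # duplicate codes, ignore vowels/y (which break runs) and h/w (which don't),
    # then pad with zeros to 4 characters.
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    out, prev = letters[0].upper(), codes.get(letters[0], "")
    for c in letters[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            out += code
        if c not in "hw":  # h and w do not break a run of identical codes
            prev = code
    return (out + "000")[:4]

print(soundex("Tsakala"), soundex("Tsakalla"))  # both T224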
We've been doing this sort of work non-stop lately, and the approach we've taken is to use a look-up table or alias list. If you can discount misspellings/misheard/non-English names, then the difficult part is taken away. In your examples we would assume that the first word and the last word are the forename and the surname; anything in between would be discarded (middle names, initials). Berry and Bernard would be in the alias list - and when Tsakala did not match Berry, we would flip the word order around and then get the match.
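A small Python sketch of that flow (the alias pairs are just the ones mentioned in this thread):

ALIASES = {("berry", "bernard"), ("dick", "richard"), ("steve", "stephen")}

def same_forename(a, b):
    a, b = a.lower(), b.lower()
    return a == b or (a, b) in ALIASES or (b, a) in ALIASES

def name_match(n1, n2):
    # First word = forename, last word = surname; anything in between is discarded.
    def ends(n):
        words = n.replace(",", " ").split()
        return words[0], words[-1]
    f1, s1 = ends(n1)
    f2, s2 = ends(n2)
    if same_forename(f1, f2) and s1.lower() == s2.lower():
        return True
    # Flip the word order and retry ("Tsakala, Berry" vs "Berry Tsakala").
    return same_forename(s1, f2) and f1.lower() == s2.lower()

print(name_match("Berry J. Tsakala", "Bernard Tsakala"))  # True
print(name_match("Tsakala, Berry", "Berry Tsakala"))      # True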
One thing you need to understand is the database/people lists you are dealing with. In the English-speaking world, middle names are inconsistently recorded, so you can't make or deny a match based on the middle name or middle initial. Soundex will not help you with common name aliases such as "Dick" and "Richard", "Berry" and "Bernard", and possibly "Steve" and "Stephen". In some communities it is quite common for 2 or 3 generations with the same name to live at the same address, and the only way you can separate them is by date of birth - which may or may not be recorded. If you have the clout, you should probably make recording the date of birth mandatory; a lot of "people databases" either don't record dates of birth or won't give them out for privacy reasons.
Effectively, people-name matching is not that complicated; it's entirely based on the quality of the data supplied. What happens in practice is that a lot of records remain unmatched - and even a human looking at them can't resolve the mismatch. A human may notice name aliases not recorded in the alias list, or may be able to look up details of the person on the internet - but you can't really expect your program to do that.
Banks, credit rating organisations and the government have a lot of detailed information about us. Previous addresses, date of birth etc. And that helps them join up names. But for us normal programmers there is no magic bullet.
Analyzing name order and the existence of middle names/initials is trivial, of course, so it looks like the real challenge is knowing common name alternatives. I doubt this can be done without using some sort of nickname lookup table. This list is a good starting point. It doesn't map Bernard to Berry, but it would probably catch the most common cases. Perhaps an even more exhaustive list can be found elsewhere, but I definitely think that a locale-specific lookup table is the way to go.
I had real problems with Tanimoto using UTF-8.
What works for languages that use diacritical signs is difflib.SequenceMatcher().
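For example:

from difflib import SequenceMatcher

def name_similarity(a, b):
    # Ratio in [0, 1]; SequenceMatcher works on Unicode strings directly,
    # so accented characters cause no byte-level trouble.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(name_similarity("Berry Tsakala", "Bernard Tsakala"))  # ~0.86
print(name_similarity("José García", "Jose Garcia"))        # high, but not 1.0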