I have a dataset, currently just stored in a JSON file, which contains about 40k different geolocations. It looks something like this:
[
{"title": "Place 1", "loc": {"x": "00.000", "y": "00.00000"}},
{"title": "Place 2", "loc": {"x": "00.000", "y": "00.00000"}},
]
where a place's loc is just its coordinates.
I'd like to be able to run queries on this data so, for any given user-inputted loc I can get the n nearest Places.
Or in other words I'd like to write some function f so that this works:
def f(loc, n): ...
f({"x": "5", "y": "5"}, 3) #=> [{"title": "Place 1", "distance": 7.073}, {"title": "Place 2": "distance": 7.073}, {"title": "Place 3", "distance": 7.073}]
if there is a place 1, 2 and 3 all at {x: 0, y: 0}.
I have no idea what the standard way of solving an issue like this is. Using an SQL DB with an index on precomputed distances doesn't work, because the supplied loc is arbitrary. Running through the entire database and calculating distances for everything is far too inefficient, and far too slow. (I need < 30ms response times.)
The only solution that makes sense would be to somehow make "buckets" of close locations (within some r of each other), and then to compute the distance between the user-given loc and each bucket's loc to narrow down the options first. But I feel like creating such a solution myself would be similar to not using databases at all; there must be a more efficient/industry standard approach. Is there one?
This is a generalized form of the nearest neighbor problem (more formally known as k-nearest neighbor). You're right, the solution that makes sense uses buckets. You could store the buckets in the database which allows you to leverage SQL, just filter out all points not in the appropriate buckets. Depending on your database, this actually may already be implemented for you, which would be the "industry standard" approach you suggested.
Otherwise, writing it yourself is pretty efficient and can be done without deviating too much from the database.
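If an in-memory index is acceptable, a k-d tree gives you the same effect as bucketing without writing the bucketing yourself. Here's a hedged sketch using SciPy (the sample data is illustrative, and this is an alternative to the database-backed bucket approach rather than that approach itself):

from scipy.spatial import cKDTree

# Illustrative data in the question's format; in practice this is the 40k places.
places = [
    {"title": "Place 1", "loc": {"x": "0.000", "y": "0.00000"}},
    {"title": "Place 2", "loc": {"x": "3.000", "y": "4.00000"}},
    {"title": "Place 3", "loc": {"x": "10.000", "y": "10.00000"}},
]
coords = [(float(p["loc"]["x"]), float(p["loc"]["y"])) for p in places]
tree = cKDTree(coords)  # built once up front, queried many times

def f(loc, n):
    # query returns the n smallest distances and the matching indices
    dists, idxs = tree.query((float(loc["x"]), float(loc["y"])), k=n)
    return [{"title": places[i]["title"], "distance": float(d)}
            for d, i in zip(dists, idxs)]

print(f({"x": "5", "y": "5"}, 2))

For 40k points, building the tree takes milliseconds and each query is far below the 30 ms budget; note this treats x/y as a flat plane, as the question does.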
Oracle provides Spatial data capabilities. It has a built-in nearest-neighbour function, SDO_NN, which will do the job for you. The only overhead is putting all the data into the database; the rest is taken care of by Oracle.
You can use a database with a point data type and a spatial index, like MySQL. You can also use a quadkey or a quadtree; they subdivide the plane and reduce the dimension. You can download my PHP class Hilbert-curve from phpclasses.org. It uses a quadkey and can help organize locations into buckets and build a proximity search. A quadkey can also reduce overlapping searches.
I'm new to programming/ray and have a simple question about which parameters can be specified when using Ray Tune. In particular, the ray tune documentation says that all of the auto-filled fields (steps_this_iter, episodes_this_iter, etc.) can be used as stopping conditions or in the Scheduler/Search Algorithm specification.
However, the following only works once I remove the "episodes_this_iter" specification. Does this work only as part of the stopping criteria?
ray.init()
tune.run(
    PPOTrainer,
    stop={"training_iteration": 1000},
    config={
        "env": qsdm.QSDEnv,
        "env_config": defaultconfig,
        "num_gpus": 0,
        "num_workers": 1,
        "lr": tune.grid_search([0.00005, 0.00001, 0.0001]),
        "episodes_this_iter": 2500,
    },
)
tune.run() is the one filling in those fields so that we can use them elsewhere, and the stopping criterion is just one of the places where we can use them.
To see why the example doesn't work, consider a simpler analogue:
episodes_total: 100
The trainer itself is the one incrementing the episode count so the rest of the system knows how far along we are. It doesn't work if we try to change that count or pin it to a particular value from the outside. The same reasoning applies to the other fields in the list.
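For instance, here is a minimal sketch reusing the names from the question (PPOTrainer, qsdm.QSDEnv and defaultconfig are assumed to be imported/defined as in the snippet above); the auto-filled metric is only read, which is why it works as a stopping criterion but not as a config value:

# "episodes_total" is reported by the trainer; Tune only reads it here.
tune.run(
    PPOTrainer,
    stop={"episodes_total": 100},  # stop once the trainer has reported 100 episodes
    config={"env": qsdm.QSDEnv, "env_config": defaultconfig},
)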
As for schedulers and search algorithms, I have no experience with them. But what we want to do is put those conditions inside the scheduler or search algorithm itself, and not in the trainer directly.
Here's an example with Bayesian optimisation search, although I don't know what it would mean to do this:
from ray.tune.suggest.bayesopt import BayesOptSearch
tune.run(
# ...
# 10 trials
num_samples=10,
search_alg=BayesOptSearch(
# look for learning rates within this range:
{'lr': (0.00001, 0.1)},
# optimise for this metric:
metric='episodes_this_iter', # <------- auto-filled field here
mode='max',
utility_kwargs={
'kind': 'ucb',
'kappa': 2.5,
'xi': 0.0
}
)
)
Let's say I have created a model with ~30 items for each of 10 categories. I've taken all of the defaults that were provided to me.
The Average F1 Score for the model is 0.875 (I have 2 categories that are very closely related, so that's hurting accuracy a bit).
If I do a real-time prediction for a piece of text that should match positively for category 3 and 8, I get this result:
{
"Prediction": {
"details": {
"Algorithm": "SGD",
"PredictiveModelType": "MULTICLASS"
},
"predictedLabel": "8",
"predictedScores": {
"1": 0.002642059000208974,
"2": 0.010648942552506924,
"3": 0.41401588916778564,
"4": 0.02918998710811138,
"5": 0.008376320824027061,
"6": 0.009010250680148602,
"7": 0.006029266398400068,
"8": 0.4628857374191284,
"9": 0.04102163389325142,
"10": 0.01617990992963314
}
}
}
What I'm wondering is whether 3 & 8 both had effectively an ~80% certainty, but because they both matched the certainty was split between the two. If you sum all the predictedScores, you get .999999997, which has me questioning whether there's a total 1.0 score that gets split amongst each of the available categories...
If I instead set up 10 different models, and did binary matches against each of them independently, would I see that 3 & 8 would score higher (e.g. something closer to 0.8)?
I guess a related question, that I don't really need answered but might help clarify the overall question, is ... If I had a theoretical piece of text that definitely fit all 10 categories, could Amazon Machine Learning respond with a predictedScore value of 1.0 for each category? Or, because the maximum predictedScore is 1.0, would it return 0.1 for each category?
Amazon ML returns probabilities for each category known from the input set. Because they are true modeled probabilities, they must sum up to 1. In other words, you are correct when you say "there's a total 1.0 score that gets split amongst each of the available categories..."
Here is a reference page that answers this and some of your other questions:
http://docs.aws.amazon.com/machine-learning/latest/dg/reading-the-batchprediction-output-files.html#interpreting-the-contents-of-batch-prediction-files-for-a-multiclass-classification-ml-model
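For what it's worth, here's a quick illustrative check in plain Python (using the scores from the question, not an AWS API call): the predictedScores behave like a probability distribution, so they sum to roughly 1.0 and you can rank categories directly.

scores = {
    "1": 0.002642059000208974, "2": 0.010648942552506924, "3": 0.41401588916778564,
    "4": 0.02918998710811138,  "5": 0.008376320824027061, "6": 0.009010250680148602,
    "7": 0.006029266398400068, "8": 0.4628857374191284,   "9": 0.04102163389325142,
    "10": 0.01617990992963314,
}
print(sum(scores.values()))                              # approximately 1.0
print(sorted(scores, key=scores.get, reverse=True)[:2])  # ['8', '3']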
Assume I have a structure of ranges, and their associated data, for instance:
data = [
[ [0, 100], "New York"],
[ [101, 200], "Boston"],
...
]
I want a function that receives N as an argument and returns the entry whose range (the left element) contains N.
for instance,
> 103
< "Boston"
What will be the best structure to transform the above to achieve the fastest lookup time?
If your data set needs to be dynamic, use an interval tree.
I would suggest you try a B+ tree, although I haven't personally tried it on this problem either. A B+ tree node can hold an array of keys, so you could set the data value for the 0-100 range to New York, with 101 pointing to the second child in the tree.
Read about B+ trees here
If your data is small, I would recommend you take the straightforward approach instead.
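If the ranges are static, sorted, and non-overlapping, the straightforward approach can be binary search over the lower bounds, which gives O(log n) lookups. A hedged sketch (the function name is illustrative):

import bisect

data = [
    ((0, 100), "New York"),
    ((101, 200), "Boston"),
]
starts = [lo for (lo, _hi), _label in data]  # sorted lower bounds

def lookup(n):
    # find the rightmost range whose lower bound is <= n, then confirm n fits
    i = bisect.bisect_right(starts, n) - 1
    if i >= 0:
        (lo, hi), label = data[i]
        if lo <= n <= hi:
            return label
    return None

print(lookup(103))  # -> "Boston"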
Bearing in mind various quirks of the data types, and localization, what is the best way for a web service to communicate monetary values to and from applications? Is there a standard somewhere?
My first thought was to simply use the number type. For example
"amount": 1234.56
I have seen many arguments about issues with a lack of precision and rounding errors when using floating point data types for monetary calculations--however, we are just transmitting the value, not calculating, so that shouldn't matter.
EventBrite's JSON currency specifications specify something like this:
{
"currency": "USD",
"value": 432,
"display": "$4.32"
}
Bravo for avoiding floating point values, but now we run into another issue: what's the largest number we can hold?
One comment (I don’t know if it’s true, but seems reasonable) claims that, since number implementations vary in JSON, the best you can expect is a 32-bit signed integer. The largest value a 32-bit signed integer can hold is 2147483647. If we represent values in the minor unit, that’s $21,474,836.47. $21 million seems like a huge number, but it’s not inconceivable that some application may need to work with a value larger than that. The problem gets worse with currencies where 1,000 of the minor unit make a major unit, or where the currency is worth less than the US dollar. For example, a Tunisian Dinar is divided into 1,000 milim. 2147483647 milim, or 2147483.647 TND is $1,124,492.04. It's even more likely values over $1 million may be worked with in some cases. Another example: the subunits of the Vietnamese dong have been rendered useless by inflation, so let’s just use major units. 2147483647 VND is $98,526.55. I’m sure many use cases (bank balances, real estate values, etc.) are substantially higher than that. (EventBrite probably doesn’t have to worry about ticket prices being that high, though!)
If we avoid that problem by communicating the value as a string, how should the string be formatted? Different countries/locales have drastically different formats—different currency symbols, whether the symbol occurs before or after the amount, whether or not there is a space between the symbol and amount, if a comma or period is used to separate the decimal, if commas are used as a thousands separator, parentheses or a minus sign to indicate negative values, and possibly more that I’m not aware of.
Should the app know what locale/currency it's working with, communicate values like
"amount": "1234.56"
back and forth, and trust the app to correctly format the amount? (Also: should the decimal value be avoided, and the value specified in terms of the smallest monetary unit? Or should the major and minor unit be listed in different properties?)
Or should the server provide the raw value and the formatted value?
"amount": "1234.56"
"displayAmount": "$1,234.56"
Or should the server provide the raw value and the currency code, and let the app format it?
"amount": "1234.56"
"currencyCode": "USD"
I assume whichever method is used should be used in both directions, transmitting to and from the server.
I have been unable to find a standard--do you have an answer, or can you point me to a resource that defines this? It seems like a common issue.
I don't know if it's the best solution, but what I'm trying now is to just pass values as strings unformatted except for a decimal point, like so:
"amount": "1234.56"
The app could easily parse that (and convert it to a double, BigDecimal, int, or whatever type the app developer feels is best for the arithmetic involved). The app would be responsible for formatting the value for display according to locale and currency.
This format could accommodate other currency values, whether highly inflated large numbers, numbers with three digits after the decimal point, numbers with no fractional values at all, etc.
Of course, this would assume the app already knows the locale and currency used (from another call, an app setting, or local device values). If those need to be specified per call, another option would be:
"amount": "1234.56",
"currency": "USD",
"locale": "en_US"
I'm tempted to roll these into one JSON object, but a JSON feed may have multiple amounts for different purposes, and then would only need to specify currency settings once. Of course, if it could vary for each amount listed, then it would be best to encapsulate them together, like so:
{
"amount": "1234.56",
"currency": "USD",
"locale": "en_US"
}
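For example, a client could parse and format such an object like this (a hedged Python sketch, not part of the proposal above; whether a given locale is available depends on the platform):

from decimal import Decimal
import locale

payload = {"amount": "1234.56", "currency": "USD", "locale": "en_US"}

value = Decimal(payload["amount"])  # exact decimal, no binary rounding
locale.setlocale(locale.LC_ALL, payload["locale"] + ".UTF-8")  # assumes the locale is installed
print(locale.currency(float(value), grouping=True))  # "$1,234.56" on an en_US system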
Another debatable approach is for the server to provide the raw amount and the formatted amount. (If so, I would suggest encapsulating it as an object, instead of having multiple properties in a feed that all define the same concept):
{
"displayAmount":"$1,234.56",
"calculationAmount":"1234.56"
}
Here, more of the work is offloaded to the server. It also ensures consistency across different platforms and apps in how the numbers are displayed, while still providing an easily parseable value for conditional testing and the like.
However, it does leave a problem--what if the app needs to perform calculations and then show the results to the user? It will still need to format the number for display. Might as well go with the first example at the top of this answer and give the app control over the formatting.
Those are my thoughts, at least. I've been unable to find any solid best practices or research in this area, so I welcome better solutions or potential pitfalls I haven't pointed out.
AFAIK, there is no "currency" standard in JSON - it is a standard based on rudimentary types. Things you might want to consider are that some currencies do not have a decimal part (Guinean Franc, Indonesian Rupiah) and some can be divided into thousandths (Bahraini Dinar) - hence you don't want to assume two decimal places. For the Iranian Rial, $2 million is not going to get you far, so I would expect you need to deal with doubles, not integers. If you are looking for a general international model then you will need a currency code, as countries with hyperinflation often change currencies every year or two to divide the value by 1,000,000 (or 100 million). Historically Brazil and Iran have both done this, I think.
If you need a reference for currency codes (and a bit of other good information) then take a look here: https://gist.github.com/Fluidbyte/2973986
Amounts of money should be represented as strings.
The idea behind using a string is that any client consuming the JSON should parse it into a decimal type such as BigDecimal to avoid floating-point imprecision.
However, this is only meaningful if every part of the system avoids floating point too. Even if the backend is only passing the data along and not doing any calculation, using floating point somewhere would eventually mean that what you see (in the program) is not what you get (in the JSON).
And assuming the source is a database, it is important to store the data with the right type. If the data is already stored as floating point, then any subsequent conversion or casting is meaningless, as it would technically just be passing the imprecision around.
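To illustrate the imprecision point (a minimal, generic Python example, not specific to any backend):

from decimal import Decimal

print(0.1 + 0.2)                        # 0.30000000000000004 (binary floating point)
print(Decimal("0.1") + Decimal("0.2"))  # 0.3 (exact, parsed from strings)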
On the Dev Portal - API Guidelines - Currencies page you may find some interesting suggestions:
"price" : {
"amount": 40,
"currency": "EUR"
}
It's a bit harder to produce and format than just a string, but I feel this is the cleanest and most meaningful way to achieve it:
uncouple the amount and the currency
use the number JSON type
Here is the suggested JSON format:
https://pattern.yaas.io/v2/schema-monetary-amount.json
{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"title": "Monetary Amount",
"description":"Schema defining monetary amount in given currency.",
"properties": {
"amount": {
"type": "number",
"description": "The amount in the specified currency"
},
"currency": {
"type": "string",
"pattern": "^[a-zA-Z]{3}$",
"description": "ISO 4217 currency code, e.g.: USD, EUR, CHF"
}
},
"required": [
"amount",
"currency"
]
}
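As a quick check, a payload can be validated against a schema like the one above; this is a hedged sketch assuming the third-party jsonschema package is available:

from jsonschema import validate

monetary_amount_schema = {
    "type": "object",
    "properties": {
        "amount": {"type": "number"},
        "currency": {"type": "string", "pattern": "^[a-zA-Z]{3}$"},
    },
    "required": ["amount", "currency"],
}

# passes silently; an invalid payload would raise ValidationError
validate(instance={"amount": 40, "currency": "EUR"}, schema=monetary_amount_schema)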
Another question related to currency formats pointed out, rightly or wrongly, that the common practice looks more like a string with base units:
{
"price": "40.0"
}
There probably isn't any official standard. We are using the following structure for our products:
"amount": {
"currency": "EUR",
"scale": 2,
"value": 875
}
The example above represents the amount €8.75.
Currency is defined as a string (values should correspond to ISO 4217); scale and value are integers, and the amount equals value × 10^-scale, so a scale of 2 means two decimal places. This structure solves many of the problems with currencies not having fractions, having non-standard fractions, etc.
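For instance (a hedged sketch using the structure above), a consumer can reconstruct the exact decimal amount without ever touching floating point:

from decimal import Decimal

amount = {"currency": "EUR", "scale": 2, "value": 875}
exact = Decimal(amount["value"]).scaleb(-amount["scale"])  # value * 10^-scale
print(exact, amount["currency"])  # 8.75 EUR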
Suppose I want to match address records (or person names or whatever) against each other to merge records that are most likely referring to the same address. Basically, I guess I would like to calculate some kind of correlation between the text values and merge the records if this value is over a certain threshold.
Example:
"West Lawnmower Drive 54 A" is probably the same as "W. Lawn Mower Dr. 54A" but different from "East Lawnmower Drive 54 A".
How would you approach this problem? Would it be necessary to have some kind of context-based dictionary that knows, in the address case, that "W", "W." and "West" are the same? What about misspellings ("mover" instead of "mower" etc)?
I think this is a tricky one - perhaps there are some well-known algorithms out there?
A good baseline, probably an impractical one in terms of its relatively high computational cost and, more importantly, its production of many false positives, would be generic string distance algorithms such as
Edit distance (aka Levenshtein distance)
Ratcliff/Obershelp
Depending on the level of accuracy required (which, BTW, should be specified both in terms of its recall and precision, i.e. generally expressing whether it is more important to miss a correlation than to falsely identify one), a home-grown process based on [some of] the following heuristics and ideas could do the trick:
tokenize the input, i.e. see the input as an array of words rather than a string
tokenization should also keep the line number info
normalize the input with the use of a short dictionary of common substitutions (such as "dr" at the end of a line = "drive", "Jack" = "John", "Bill" = "William"..., "W." at the beginning of a line = "West", etc.)
Identify (a bit like tagging, as in POS tagging) the nature of some entities (for example ZIP code, extended ZIP code, and also city)
Identify (lookup) some of these entities (for example a relatively short database table can include all the cities/towns in the targeted area)
Identify (lookup) some domain-related entities (if all/many of the addresses deal with, say, folks in the legal profession, a lookup of law firm names or of federal buildings may be of help)
Generally, put more weight on tokens that come from the last line of the address
Put more (or less) weight on tokens with a particular entity type (e.g. "Drive", "Street", "Court" should weigh much less than the tokens which precede them)
Consider a modified SOUNDEX algorithm to help with normalization
With the above in mind, implement a rule-based evaluator. Tentatively, the rules could be implemented as visitors to a tree/array-like structure where the input is parsed initially (Visitor design pattern).
The advantage of the rule-based framework is that each heuristic is in its own function, and rules can be prioritized, i.e. some rules can be placed early in the chain, allowing the evaluation to be aborted early when a strong heuristic fires (e.g. different City => Correlation = 0, level of confidence = 95%, etc.).
An important consideration when searching for correlations is the need to a priori compare every single item (here, an address) with every other item, hence requiring as many as 1/2 n^2 item-level comparisons. Because of this, it may be useful to store the reference items in a way where they are pre-processed (parsed, normalized...) and also to maybe keep a digest/key of sorts that can be used as a [very rough] indicator of a possible correlation (for example a key made of the 5-digit ZIP code followed by the SOUNDEX value of the "primary" name).
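A tiny sketch of the tokenize-and-normalize steps above (the abbreviation dictionary and function names are illustrative and far from complete):

import re

ABBREVIATIONS = {"w": "west", "e": "east", "dr": "drive", "st": "street"}

def normalize(address):
    # lowercase, split into word/number tokens, expand known abbreviations
    tokens = re.findall(r"[a-z]+|\d+", address.lower())
    return [ABBREVIATIONS.get(t, t) for t in tokens]

print(normalize("W. Lawn Mower Dr. 54A"))      # ['west', 'lawn', 'mower', 'drive', '54', 'a']
print(normalize("West Lawnmower Drive 54 A"))  # ['west', 'lawnmower', 'drive', '54', 'a']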
I would look at producing a similarity comparison metric that, given two objects (strings perhaps), returns "distance" between them.
If you fulfil the following criteria then it helps:
distance between an object and itself is zero (reflexive)
distance from a to b is the same as from b to a (symmetric)
distance from a to c is not more than the distance from a to b plus the distance from b to c (triangle inequality)
If your metric obeys these then you can arrange your objects in a metric space, which means you can run queries like:
Which other object is most like this one?
Give me the 5 objects most like this one.
There's a good book about it here. Once you've set up the infrastructure for hosting objects and running the queries you can simply plug in different comparison algorithms, compare their performance and then tune them.
I did this for geographic data at university and it was quite fun trying to tune the comparison algorithms.
I'm sure you could come up with something more advanced but you could start with something simple like reducing the address line to the digits and the first letter of each word and then compare the result of that using a longest common subsequence algorithm.
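A minimal sketch of that simple approach (the reduction and names are illustrative, and the LCS here is the standard dynamic-programming version):

import re

def reduce_address(line):
    # keep digits, and the first letter of each word
    return "".join(t[0] if t[0].isalpha() else t
                   for t in re.findall(r"[a-z]+|\d+", line.lower()))

def lcs_len(a, b):
    # classic longest-common-subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

a = reduce_address("West Lawnmower Drive 54 A")  # "wld54a"
b = reduce_address("W. Lawn Mower Dr. 54A")      # "wlmd54a"
print(lcs_len(a, b))  # 6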
Hope that helps in some way.
You can use Levenshtein edit distance to find strings that differ by only a few characters. BK Trees can help speed up the matching process.
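A minimal sketch of the classic dynamic-programming edit distance (the BK-tree part is omitted; this is just the distance function you would plug into it):

def levenshtein(a, b):
    # rolling single-row implementation of the standard DP
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

print(levenshtein("lawnmower", "lawnmover"))  # 1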
Disclaimer: I don't know of any algorithm that does this, but would really be interested in knowing one if it exists. This answer is a naive attempt at solving the problem, with no previous knowledge whatsoever. Comments welcome, please don't laugh too loud.
If you try doing it by hand, I would suggest applying some kind of "normalization" to your strings : lowercase them, remove punctuation, maybe replace common abbreviations with the full words (Dr. => drive, St => street, etc...).
Then, you can try different alignments between the two strings you compare, and compute the correlation by averaging the absolute differences between corresponding letters (e.g. a = 1, b = 2, etc., so corr(a, b) = |a - b| = 1):
west lawnmover drive
w lawnmower street
Thus, even if some letters are different, the correlation would be high. Then, simply keep the maximal correlation you found, and decide that they are the same if the correlation is above a given threshold.
When I had to modify a proprietary program doing this, back in the early 90s, it took many thousands of lines of code in multiple modules, built up over years of experience. Modern machine-learning techniques ought to make it easier, and perhaps you don't need to perform as well (it was my employer's bread and butter).
So if you're talking about merging lists of actual mailing addresses, I'd do it by outsourcing if I can.
The USPS had some tests to measure quality of address standardization programs. I don't remember anything about how that worked, but you might check if they still do it -- maybe you can get some good training data.