Converting strings, based of units in MongoDB field - json

I am using MongoDB to store different values based of units.
For example I have a speed field:
"Speed":"1 m/s"
or
"Speed":"1 mph"
I also have distance field, like this:
"Distance": "1 ft"
or
"Distance":"1 meter"
I have about 20 different field types, like speed, distance, power, area, angle, and others I would like to store all the fields of different units types in the same units, so I can compare them. I am not sure if it would be best to do this on input or when I am reading from the database, but either is an option.
I am planning on storing a field unit type, I.E. this field is a speed, and an equation to get to the base unit, I.E. if the speed field has m/s and the base field in ft/sec multiple by 3.28, but I am not sure how to structure this. So ideally the fields above would be something like:
{"Speed":"1 m/s"},
{"Speed":"1 mph"},
{"Distance": "1 ft"},
{"Distance":"1 meter"}
Would become
{"Speed":{"base(ft/sec)":3.28,"orig_val":1,"orig_unit":"m/s"},
{"Speed":{"base(ft/sec)":1.47,"orig_val":1,"orig_unit":"mph"},
{"Distance":{"base(in)":12,"orig_val":1,"orig_unit":"ft"},
{"Distance":{"base(in)":39.37,"orig_val":1,"orig_unit":"meter"}

Some thoughts.
Store a specific field as the same unit, irrespective of how you capture - e.g., distance is always stored as feet. The field would be stored as { distance_feet: 120 }.
Store a specific field as it is captured, i.e., a field can be captured with different units. This will have an additional field specifying the "units". For example, { distance: 120, units: "feet" }. In this case, the field units can be either "feet" or "meters" .
In both the cases, the application (or program) logic can take care of the conversion from feet to meters or vice-versa.
An additional field called as "conversion_factor" can be stored in the collection (e.g., for feet and meters conversion, { conversion_factor: 3.38084 } ). This involves storage of same information in the database many times and with large amount of data this adds to the storage space and memory when data is read - a factor to consider.
What info to store and how depends upon factors:
What is purpose of this data and how it is used in the application?
Is it queried? How and in what format? Captured in what format? How
often? What kind of computations (calculations, comparisons, etc.)
happen upon this data?
I think the application requirements or functionality should drive how you design the data, not how you have to program it. You should know at this stage what are the important things you will be doing with the data.
I have about 20 different field types, like speed, distance, power,
area, angle, and others I would like to store all the fields of
different units types in the same units, so I can compare them.
I think, comparing is a computation and the program can take care of data like "conversion factor" and converting from one unit to another.

Related

roundings with Access

With Microsoft Access 2010, I have two Single fields:
A = 1.1
B = 2.1
I create a query where I have defined C=A*B
Microsoft Access says that C = 2.30999994277954
but, in reality, C =2.31
How can I get the right result (2.31)?
Slightly off results from operations performed on decimal values can happen if your numeric field size is single or double rather than decimal. Single and double (or floating point) numbers are very close approximations of the "true" numbers, but should not be relied upon if accuracy in operations is required. A related stackoverflow question has more information about this issue: Access comparing floating-point numbers "incorrectly"
If it's possible to modify the underlying table's design, you should change the field size property for the "A" and "B" fields from single to decimal. After changing the field size BUT BEFORE saving the table, you will also need to adjust the Scale property for "A" and "B" from 0 to whatever number of places to the right of the decimal point you might require. You will likely still have a notice about losing data, but if you adjust the field properties correctly before saving the table, this shouldn't be a problem. You should probably make a copy of the table before doing this so that you can verify that there was no data loss. After saving your table and verifying the changes did not result in data loss, your query should represent A * B accurately.

How do I store decimals with USER-specified precision in MySQL?

I have an application in which users are typing measured amounts. I would like to respect the precision they are entering when storing the values in MySQL, i.e. if they type 0.050 I don't want that to become 0.05 since that is loosing information on how exact the measurement was done. Is there a way other than storing the value as a string?
0.050 is equal to 0.05 . If you want 3 digits after comma, you have to implement this feature in application.
As per my knowledge for handling precision point we are using FLOAT, DOUBLE or DECIMAL data types. In your case if you are not using any function like SUM(),AVG(),etc then you can use VARCHAR.
Add always a "control" digit, lets say "1", save to database, then get data from database and always trim ONE time the "control" digit "1" from what you've got.
example:
user inputs 0,000050
save as 0,0000501
get the 0,0000501
trim the last "1" (only the last one be carefull)
k.i.s.s. people :D
edit: proper solution of course, add a column to store precision and right-pad zeroes if needed

Comparision of data in huge databases

I have a database in mysql which has a collection of attributes (ex. 'weight', 'height', 'no of pages' etc) and attribute values (ex. '30 tons', '12 inches', '2 pgs' etc) and mapped with the respective product ids.
The data has been collected from different sites and hence the attribute values have different formats (ex. '222 pgs' or '222 pages' or '222') (ex2. '12 inches', '12 meters', '12 cms').
What I need to do is that I have to compare the values of same attributes of different products. So I have to compare '222 pgs' with '222 pages' for all the attributes which differ in formats.
There are around 4000 attributes and the number will increase further. Is there any way to compare these without having to assign each attribute a specific type individually? Or what is the fastest way to compare these?
Well, until they invent a clairvoyant computer, a human being will have to tell it that pgs and pages mean the same thing and that inches and meters are convertible.
You'll have to sanitize the data one way or another. I'd probably start by identifying units that measure the same dimension1 and common aliases2 for each unit, then parse the data to split the quantity from the unit and normalize3 the unit. Once you have done that, the data becomes directly comparable.
But all this is really just a remedy for the problem that should not have been there in the first place, were the database designed properly.
1 A "mass" is a dimension measured by units such as kg, t, lb etc. A "length" is a dimension measured by m, km, in etc.
2 E.g. an in and inch denote exactly the same unit, pgs and pages are the same etc.
3 I.e. make sure a particular dimension is always represented by the same unit: for example convert all lengths to m, all masses to kg, all pages to pages etc.
You haven't explained what you want to do after you find out that attributes for a pair of products differ (while still meaning the same thing).
I.e.: if I see that in Instance A has field Length set to "12 pgs" and Instance B has Length reporting "12 pages" what do you do?
List this? Autocorrect? Drop one of the two values? Open a window for a human user to correct?
Personally I'd go for a "select attribute,count(*) from X group by attribute" so that you can find out the most common spelling of the unit, and then you can also write corrective scripts that may automatically convert ".. pgs" to " pages" as soon as you have decided the correct representation.
Of course this will not help at all unless you enforce correct spelling of the units, and this requires for sure better input-output filters, including the main UI, but also any kind of bulk uploader utility you may use to create or update products.
A redesign of the DB to add "Unit" as an extra, categorized attribute for each measure would also help a lot.

Best practice for storing weights in a SQL database?

An application I'm working on needs to store weights of the format X pounds, y.y ounces. The database is MySQL, but I imagine this is DB agnostic.
I can think of three ways to do this:
Convert the weight to decimal pounds and store in a single field. (5 lbs 6.2 oz = 5.33671875 lbs)
Convert the weight to decimal ounces and store in a single field. (5 lbs 6.2 oz = 86.2 oz)
Store the pounds portion as an integer and the ounces portion as a decimal, in two fields.
I'm thinking that #1 is not such a good idea, since decimal pounds will produce numbers of arbitrary precision, which would need to be stored as a float, which could lead to inaccuracies which are inherent in floating point numbers.
Is there a compelling reason to choose #2 over #3 or vise-versa?
TL;DR
Choose either option #1 or option #2—there's no difference between them. Don't use option #3, because it's awkward to work with.
You claim that there are inherent inaccuracies in floating point numbers. I think that this deserves to be explored a little first.
When deciding upon a numeral system for representing a number (whether on a piece of paper, in a computer circuit, or elsewhere), there are two separate issues to consider:
its basis; and
its format.
Pick a base, any base…
Limited by finite space, one cannot represent an arbitrary member of an infinite set. For example: no matter how much paper you buy or how small your handwriting, it'd always be possible to find an integer that won't fit in the given space (you could just keep appending extra digits until the paper runs out). So, with integers, we usually restrict our finite space to representing only those that fall within some particular interval—e.g. if we have space for the positive/negative sign and three digits, we might restrict ourselves to the interval [-999,+999].
Every non-empty interval contains an infinite set of real numbers. In other words, no matter what interval one takes over the real numbers—be it [-999,+999], [0,1], [0.000001,0.000002] or anything else—there is still an infinite set of reals within that interval (one need only keep appending (non-zero) fractional digits)! Therefore arbitrary real numbers must always be "rounded" to something that can be represented in finite space.
The set of real numbers that can be represented in finite space depends upon the numeral system that is used. In our (familiar) positional base-10 system, finite space will suffice for one-half (0.510) but not for one-third (0.33333…10); by contrast, in the (less familiar) positional base-9 system, it is the other way around (those same numbers are respectively 0.44444…9 and 0.39). The consequence of all this is that some numbers that can be represented using only a small amount of space in positional base-10 (and therefore appear to be very "round" to us humans), e.g. one-tenth, would actually require infinite binary circuits to be stored precisely (and therefore don't appear to be very "round" to our digital friends)! Notably, since 2 is a factor of 10, the same is not true in reverse: any number that can be represented with finite binary can also be represented with finite decimal.
We can't do any better for continuous quantities. Ultimately such quantities must use a finite representation in some numeral system: it's arbitrary whether that system happens to be easy on computer circuits, on human fingers, on something else or on nothing at all—whichever system is used, the value must be rounded and therefore it always results in "representation error".
In other words, even if one has a perfectly accurate measuring instrument (which is physically impossible), then any measurement it reports will already have been rounded to a number that happens to fit on its display (in whatever base it uses—typically decimal, for obvious reasons). So, "86.2 oz" is never actually "86.2 oz" but rather a representation of "something between 86.1500000... oz and 86.2499999... oz". (Actually, because in reality the instrument is imperfect, all we can ever really say is that we have some degree of confidence that the actual value falls within that interval—but that is definitely departing some way from the point here).
But we can do better for discrete quantities. Such values are not "arbitrary real numbers" and therefore none of the above applies to them: they can be represented exactly in the numeral system in which they were defined—and indeed, should be (as converting to another numeral system and truncating to a finite length would result in rounding to an inexact number). Computers can (inefficiently) handle such situations by representing the number as a string: e.g. consider ASCII or BCD encoding.
Apply a format…
Since it's a property of the numeral system's (somewhat arbitrary) basis, whether or not a value appears to be "round" has no bearing on its precision. That's a really important observation, which runs counter to many people's intuition (and it's the reason I spent so much time explaining numerical basis above).
Precision is instead determined by how many significant figures a representation has. We need a storage format that is capable of recording our values to at least as many significant figures as we consider them to be correct. Taking by way of example values that we consider to be correct when stated as 86.2 and 0.0000862, the two most common options are:
Fixed point, where the number of significant figures depends on magnitude: e.g. in fixed 5-decimal-point representation, our values would be stored as 86.20000 and 0.00009 (and therefore have 7 and 1 significant figures of precision respectively). In this example, precision has been lost in the latter value (and indeed, it wouldn't take much more for us to have been totally unable to represent anything of significance); and the former value stored false precision, which is a waste of our finite space (and indeed, it wouldn't take much more for the value to become so large that it overflows the storage capacity).
A common example of when this format might be appropriate is for an accounting system: monetary sums must usually be tracked to the penny irrespective of their magnitude (therefore less precision is required for small values, and more precision is required for large values). As it happens, currency is usually also considered to be discrete (pennies are indivisible), so this is also a good example of a situation where a particular basis (decimal for most modern currencies) is desirable to avoid the representation errors discussed above.
One usually implements fixed point storage by treating one's values as quotients over a common denominator and storing the numerator as an integer. In our example, the common denominator could be 105, so instead of 86.20000 and 0.00009 one would store the integers 8620000 and 9 and remember that they must be divided by 100000.
Floating point, where the number of significant figures is constant irrespective of magnitude: e.g. in 5-significant-figure decimal representation, our values would be stored as 86.200 and 0.000086200 (and, by definition, have 5 significant figures of precision both times). In this example, both values have been stored without any loss of precision; and they both also have the same amount of false precision, which is less wasteful (and we can therefore use our finite space to represent a far greater range of values—both large and small).
A common example of when this format might be appropriate is for recording any real world measurements: the precision of measuring instruments (which all suffer from both systematic and random errors) is fairly constant irrespective of scale so, given sufficient significant figures (typically around 3 or 4 digits), absolutely no precision is lost even if a change of base resulted in rounding to a different number.
One usually implements floating point storage by treating one's values as integer significands with integer exponents. In our example, the significand could be 86200 for both values whereupon the (base-10) exponents would be -4 and -9 respectively.
But how precise are the floating point storage formats used by our computers?
An IEEE754 single precision (binary32) floating point number has 24 bits, or log10(224) (over 7) digits, of significance—i.e. it has a tolerance of less than ±0.000006%. In other words, it is more precise than saying "86.20000".
An IEEE754 double precision (binary64) floating point number has 53 bits, or log10(253) (almost 16) digits, of significance—i.e. it has a tolerance of just over ±0.00000000000001%. In other words, it is more precise than saying "86.2000000000000".
The most important thing to realise is that these formats are, respectively, over ten thousand and over one trillion times more precise than saying "86.2"—even though exact conversions of the binary back into decimal happens to include erroneous false precision (which we must ignore: more on this shortly)!
Notice also that both fixed and floating point formats will result in loss of precision when a value is known more precisely than the format supports. Such rounding errors can propagate in arithmetic operations to yield apparently erroneous results (which no doubt explains your reference to the "inherent inaccuracies" of floating point numbers): for example, 1⁄3 × 3000 in 5-place fixed point would yield 999.99000 rather than 1000.00000; and 1⁄7 − 7⁄50 in 5-significant figure floating point would yield 0.0028600 rather than 0.0028571.
The field of numerical analysis is dedicated to understanding these effects, but it is important to realise that any usable system (even performing calculations in your head) is vulnerable to such problems because no method of calculation that is guaranteed to terminate can ever offer infinite precision: consider, for example, how to calculate the area of a circle—there will necessarily be loss of precision in the value used for π, which will propagate into the result.
Conclusion
Real world measurements should use binary floating point: it's fast, compact, extremely precise and no worse than anything else (including the decimal version from which you started). Since MySQL's floating-point datatypes are IEEE754, this is exactly what they offer.
Currency applications should use denary fixed point: whilst it's slow and wastes memory, it ensures both that values are not rounded to inexact quantities and that pennies are not lost on large monetary sums. Since MySQL's fixed-point datatypes are BCD-encoded strings, this is exactly what they offer.
Finally, bear in mind that programming languages usually represent fractional values using binary floating-point types: so if your database stores values in another format, you need to be careful how they are brought into your application or else they may get converted (with all the ensuing issues that entails) at the interface.
Which option is best in this case?
Hopefully I've convinced you that your values can safely (and should) be stored in floating point types without worrying too much about any "inaccuracies"? Remember, they're more precise than your flimsy 3-significant-digit decimal representation ever was: you just have to ignore false precision (but one must always do that anyway, even if using a fixed-point decimal format).
As for your question: choose either option 1 or 2 over option 3—it makes comparisons easier (for example, to find the maximal mass, one could just use MAX(mass), whereas to do it efficiently across two columns would require some nesting).
Between those two, it doesn’t matter which one chooses—floating point numbers are stored with a constant number of significant bits irrespective of their scale.
Furthermore, whilst in the general case it could happen that some values are rounded to binary numbers that are closer to their original decimal representation using option 1 whilst simultaneously others are rounded to binary numbers that are closer to their original decimal representation using option 2, as we shall shortly see such representation errors only manifest within the false precision that should always be ignored.
However, in this case, because it happens that there are 16 ounces to 1 pound (and 16 is a power of 2), the relative differences between original decimal values and stored binary numbers using the two approaches is identical:
5.387510 (not 5.3367187510 as stated in your question) would be stored in a binary32 float as 101.0110001100110011001102 (which is 5.3874998092651367187510): this is 0.0000036% from the original value (but, as discussed above, the "original value" was already a pretty lousy representation of the physical quantity it represents).
Knowing that a binary32 float stores only 7 decimal digits of precision, our compiler knows for certain that everything from the 8th digit onwards is definitely false precision and therefore must be ignored in every case—thus, provided that our input value didn't require more precision than that (and if it did, binary32 was obviously the wrong choice of format), this guarantees a return to a decimal value that looks just as round as that from which we started: 5.38750010. However, we should really apply domain knowledge at this point (as we should with any storage format) to discard any further false precision that might exist, such as those two trailing zeroes.
86.210 would be stored in a binary32 float as 1010110.001100110011001102 (which is 86.199996948242187510): this is also 0.0000036% from the original value. As before, we then ignore false precision to return to our original input.
Notice how the binary representations of the numbers are identical, except for the placement of the radix point (which is four bits apart):
101.0110 00110011001100110
101 0110.00110011001100110
This is because 5.3875 × 24 = 86.2.
As an aside: being European (albeit British), I also have a strong aversion to imperial units of measurement—handling values of different scales is just so messy. I'd almost certainly store masses in SI units (e.g. kilograms or grams) and then perform conversions to imperial units as required within the presentation layer of my application. Plus rigidly adhering to SI units might one day save you from losing $125m.
I’d be tempted to store it in a metric unit, as they tend to be simple decimals and not complex values like pounds and ounces. That way, you can just store the one value (i.e. 103.25 kg) rather than the pounds–ounces equivalent, and it’s easier to perform conversions.
This is something I’ve dealt with in the past. I do a lot of work on pro wrestling and mixed martial arts (MMA) websites where fighters’ heights and weights need to be recorded. They tend to be displayed as feet and inches and pounds and ounces, but I still store the values in their centimetres and kilogram equivalents, and then do the conversion when displaying on the site.
First, I had not known about how floating point numbers were inaccurate - thankfully a search latter helps me understand: Floating Point Inaccuracy Examples
I would fully agree with #eggyal - keep the data in a single format in a single column. This allows you to expose it to the application and let the application deal with the presentation of it - be it in lbs/oz, rounded up lbs, whatever.
The database should keep the raw data while the presentation layer dictates the layout.
You can use decimal data type for weight column.
decimal('weight', 8, 2); // precision = 8, scale = 2
Storage size:
Precision 1-9 5 Bytes
Precision 10-19 9 Bytes
Precision 20-28 13 Bytes
Precision 29-38 17 Bytes

Text-correlation in MySQL [duplicate]

Suppose I want to match address records (or person names or whatever) against each other to merge records that are most likely referring to the same address. Basically, I guess I would like to calculate some kind of correlation between the text values and merge the records if this value is over a certain threshold.
Example:
"West Lawnmower Drive 54 A" is probably the same as "W. Lawn Mower Dr. 54A" but different from "East Lawnmower Drive 54 A".
How would you approach this problem? Would it be necessary to have some kind of context-based dictionary that knows, in the address case, that "W", "W." and "West" are the same? What about misspellings ("mover" instead of "mower" etc)?
I think this is a tricky one - perhaps there are some well-known algorithms out there?
A good baseline, probably an impractical one in terms of its relatively high computational cost and more importantly its production of many false positive, would be generic string distance algorithms such as
Edit distance (aka Levenshtein distance)
Ratcliff/Obershelp
Depending on the level of accuracy required (which, BTW, should be specified both in terms of its recall and precision, i.e. generally expressing whether it is more important to miss a correlation than to falsely identify one), a home-grown process based on [some of] the following heuristics and ideas could do the trick:
tokenize the input, i.e. see the input as an array of words rather than a string
tokenization should also keep the line number info
normalize the input with the use of a short dictionary of common substituions (such as "dr" at the end of a line = "drive", "Jack" = "John", "Bill" = "William"..., "W." at the begining of a line is "West" etc.
Identify (a bit like tagging, as in POS tagging) the nature of some entities (for example ZIP Code, and Extended ZIP code, and also city
Identify (lookup) some of these entities (for example a relative short database table can include all the Cities / town in the targeted area
Identify (lookup) some domain-related entities (if all/many of the address deal with say folks in the legal profession, a lookup of law firm names or of federal buildings may be of help.
Generally, put more weight on tokens that come from the last line of the address
Put more (or less) weight on tokens with a particular entity type (ex: "Drive", "Street", "Court" should with much less than the tokens which precede them.
Consider a modified SOUNDEX algorithm to help with normalization of
With the above in mind, implement a rule-based evaluator. Tentatively, the rules could be implemented as visitors to a tree/array-like structure where the input is parsed initially (Visitor design pattern).
The advantage of the rule-based framework, is that each heuristic is in its own function and rules can be prioritized, i.e. placing some rules early in the chain, allowing to abort the evaluation early, with some strong heuristics (eg: different City => Correlation = 0, level of confidence = 95% etc...).
An important consideration with search for correlations is the need to a priori compare every single item (here address) with every other item, hence requiring as many as 1/2 n^2 item-level comparisons. Because of this, it may be useful to store the reference items in a way where they are pre-processed (parsed, normalized...) and also to maybe have a digest/key of sort that can be used as [very rough] indicator of a possible correlation (for example a key made of the 5 digit ZIP-Code followed by the SOUNDEX value of the "primary" name).
I would look at producing a similarity comparison metric that, given two objects (strings perhaps), returns "distance" between them.
If you fulfil the following criteria then it helps:
distance between an object and
itself is zero. (reflexive)
distance from a to b is the same in
both directions (transitive)
distance from a to c is not more
than distance from a to b plus
distance from a to c. (triangle
rule)
If your metric obeys these they you can arrange your objects in metric space which means you can run queries like:
Which other object is most like
this one
Give me the 5 objects
most like this one.
There's a good book about it here. Once you've set up the infrastructure for hosting objects and running the queries you can simply plug in different comparison algorithms, compare their performance and then tune them.
I did this for geographic data at university and it was quite fun trying to tune the comparison algorithms.
I'm sure you could come up with something more advanced but you could start with something simple like reducing the address line to the digits and the first letter of each word and then compare the result of that using a longest common subsequence algorithm.
Hope that helps in some way.
You can use Levenshtein edit distance to find strings that differ by only a few characters. BK Trees can help speed up the matching process.
Disclaimer: I don't know any algorithm that does that, but would really be interested in knowing one if it exists. This answer is a naive attempt of trying to solve the problem, with no previous knowledge whatsoever. Comments welcome, please don't laugh too laud.
If you try doing it by hand, I would suggest applying some kind of "normalization" to your strings : lowercase them, remove punctuation, maybe replace common abbreviations with the full words (Dr. => drive, St => street, etc...).
Then, you can try different alignments between the two strings you compare, and compute the correlation by averaging the absolute differences between corresponding letters (eg a = 1, b = 2, etc.. and corr(a, b) = |a - b| = 1) :
west lawnmover drive
w lawnmower street
Thus, even if some letters are different, the correlation would be high. Then, simply keep the maximal correlation you found, and decide that their are the same if the correlation is above a given threshold.
When I had to modify a proprietary program doing this, back in the early 90s, it took many thousands of lines of code in multiple modules, built up over years of experience. Modern machine-learning techniques ought to make it easier, and perhaps you don't need to perform as well (it was my employer's bread and butter).
So if you're talking about merging lists of actual mailing addresses, I'd do it by outsourcing if I can.
The USPS had some tests to measure quality of address standardization programs. I don't remember anything about how that worked, but you might check if they still do it -- maybe you can get some good training data.