Comparison of data in huge databases - MySQL

I have a MySQL database with a collection of attributes (ex. 'weight', 'height', 'no. of pages') and attribute values (ex. '30 tons', '12 inches', '2 pgs'), mapped to the respective product IDs.
The data has been collected from different sites and hence the attribute values have different formats (ex. '222 pgs' or '222 pages' or '222') (ex2. '12 inches', '12 meters', '12 cms').
What I need to do is compare the values of the same attribute across different products. So I have to compare '222 pgs' with '222 pages', and likewise for every attribute whose values differ in format.
There are around 4000 attributes and the number will increase further. Is there any way to compare these without having to assign each attribute a specific type individually? Or what is the fastest way to compare these?

Well, until they invent a clairvoyant computer, a human being will have to tell it that pgs and pages mean the same thing and that inches and meters are convertible.
You'll have to sanitize the data one way or another. I'd probably start by identifying units that measure the same dimension[1] and common aliases[2] for each unit, then parse the data to split the quantity from the unit and normalize[3] the unit (see the sketch after the footnotes). Once you have done that, the data becomes directly comparable.
But all this is really just a remedy for the problem that should not have been there in the first place, were the database designed properly.
1 A "mass" is a dimension measured by units such as kg, t, lb etc. A "length" is a dimension measured by m, km, in etc.
2 E.g. an in and inch denote exactly the same unit, pgs and pages are the same etc.
3 I.e. make sure a particular dimension is always represented by the same unit: for example convert all lengths to m, all masses to kg, all pages to pages etc.
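To make this concrete, here is a minimal normalization sketch in Python, assuming hand-curated alias and conversion tables (ALIASES, UNITS and normalize are my names, not anything from the question). Parsing splits the quantity from the unit, the alias table picks a canonical unit, and the factor converts the value into the dimension's base unit:

import re

ALIASES = {  # alias -> canonical unit; a human has to curate this
    "pgs": "pages", "page": "pages", "pages": "pages",
    "in": "in", "inch": "in", "inches": "in",
    "cm": "cm", "cms": "cm", "m": "m", "meter": "m", "meters": "m",
    "kg": "kg", "t": "t", "ton": "t", "tons": "t",
}
UNITS = {    # canonical unit -> (dimension, factor into the dimension's base unit)
    "pages": ("count", 1.0),
    "in": ("length", 0.0254), "cm": ("length", 0.01), "m": ("length", 1.0),
    "kg": ("mass", 1.0), "t": ("mass", 1000.0),
}

def normalize(raw):
    """Split quantity from unit and convert to the dimension's base unit."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*", raw)
    if m is None:
        return None                 # bare number or garbage: flag for manual review
    unit = ALIASES.get(m.group(2).lower())
    if unit is None:
        return None                 # unknown unit: extend the tables
    dimension, factor = UNITS[unit]
    return dimension, float(m.group(1)) * factor

assert normalize("222 pgs") == normalize("222 pages") == ("count", 222.0)
assert normalize("12 inches") == ("length", 12 * 0.0254)

Values that come back as None go into a manual-review queue, which is also where new aliases for the tables tend to be discovered.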

You haven't explained what you want to do after you find out that attributes for a pair of products differ (while still meaning the same thing).
I.e.: if I see that Instance A has field Length set to "12 pgs" and Instance B has Length reporting "12 pages", what do you do?
List this? Autocorrect? Drop one of the two values? Open a window for a human user to correct?
Personally I'd go for a "select attribute, count(*) from X group by attribute" so that you can find out the most common spelling of each unit, and then you can also write corrective scripts that automatically convert ".. pgs" to " pages" as soon as you have decided the correct representation (see the sketch below).
Of course this will not help at all unless you enforce correct spelling of the units, and that certainly requires better input/output filters, including in the main UI but also in any bulk-uploader utility you may use to create or update products.
A redesign of the DB to add "Unit" as an extra, categorized attribute for each measure would also help a lot.
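Along those lines, a sketch of the audit-then-fix workflow, assuming a table product_attributes(product_id, attribute_name, attribute_value) and the mysql-connector-python package (both assumptions on my part):

import mysql.connector  # assumes the mysql-connector-python package

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="catalog")
cur = conn.cursor()

# Audit: list every spelling in use per attribute, most common first,
# so a human can pick the canonical representation.
cur.execute("""
    SELECT attribute_name, attribute_value, COUNT(*) AS n
    FROM product_attributes
    GROUP BY attribute_name, attribute_value
    ORDER BY attribute_name, n DESC
""")
for name, value, n in cur:
    print(name, value, n)

# Corrective script: once 'pages' is chosen as canonical, rewrite the alias.
cur.execute("""
    UPDATE product_attributes
    SET attribute_value = REPLACE(attribute_value, 'pgs', 'pages')
    WHERE attribute_value LIKE '%pgs%'
""")
conn.commit()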

Related

Converting strings, based on units in MongoDB field

I am using MongoDB to store different values based on units.
For example I have a speed field:
"Speed":"1 m/s"
or
"Speed":"1 mph"
I also have distance field, like this:
"Distance": "1 ft"
or
"Distance":"1 meter"
I have about 20 different field types, like speed, distance, power, area, angle, and others. I would like to store all the fields of different unit types in the same units, so I can compare them. I am not sure if it would be best to do this on input or when I am reading from the database, but either is an option.
I am planning on storing a field unit type (i.e. this field is a speed) and an equation to get to the base unit (i.e. if the speed field has m/s and the base unit is ft/sec, multiply by 3.28), but I am not sure how to structure this. So ideally the fields above would be something like:
{"Speed":"1 m/s"},
{"Speed":"1 mph"},
{"Distance": "1 ft"},
{"Distance":"1 meter"}
Would become
{"Speed":{"base(ft/sec)":3.28,"orig_val":1,"orig_unit":"m/s"},
{"Speed":{"base(ft/sec)":1.47,"orig_val":1,"orig_unit":"mph"},
{"Distance":{"base(in)":12,"orig_val":1,"orig_unit":"ft"},
{"Distance":{"base(in)":39.37,"orig_val":1,"orig_unit":"meter"}
Some thoughts.
Store a specific field as the same unit, irrespective of how it is captured - e.g., distance is always stored as feet. The field would be stored as { distance_feet: 120 }.
Store a specific field as it is captured, i.e., a field can be captured with different units. This will have an additional field specifying the "units". For example, { distance: 120, units: "feet" }. In this case, the field units can be either "feet" or "meters" .
In both the cases, the application (or program) logic can take care of the conversion from feet to meters or vice-versa.
An additional field called "conversion_factor" can be stored in the collection (e.g., for feet and meters conversion, { conversion_factor: 3.28084 }). This involves storing the same information in the database many times; with a large amount of data that adds to the storage space and to memory when the data is read - a factor to consider.
What info to store and how depends upon factors like:
What is the purpose of this data and how is it used in the application?
Is it queried? How and in what format? Captured in what format? How often? What kind of computations (calculations, comparisons, etc.) happen upon this data?
I think the application requirements or functionality should drive how you design the data, not how you have to program it. You should know at this stage what are the important things you will be doing with the data.
"I have about 20 different field types, like speed, distance, power, area, angle, and others. I would like to store all the fields of different unit types in the same units, so I can compare them."
I think comparing is a computation, and the program can take care of data like the "conversion factor" and of converting from one unit to another.
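As a sketch of the second option combined with application-side conversion (the field names and the FACTORS table here are illustrative; the base units follow the question's ft/sec and inches):

FACTORS = {  # (field type, original unit) -> multiplier into the base unit
    ("speed", "m/s"): 3.28084,     # base unit: ft/sec
    ("speed", "mph"): 1.46667,     # base unit: ft/sec
    ("distance", "ft"): 12.0,      # base unit: inches
    ("distance", "meter"): 39.3701,
}

def to_base(field_type, value, unit):
    """Convert a captured value into the field type's base unit."""
    return value * FACTORS[(field_type, unit)]

doc = {"Distance": 1, "units": "meter"}                    # option-2 document shape
base = to_base("distance", doc["Distance"], doc["units"])  # 39.3701 inches

Because the factors live in code rather than in every document, nothing is stored twice and the storage-space concern above disappears.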

AS3 - Massive Numbers/Integers, Beyond MAX_VALUE

Can anyone help me write a class, e.g. BigNumber.as (or BigInt.as) which will:
Allow for really really big numbers/integers.
Include a method to express a number in format "1.54 Million", "1.98 Vigintillion" and so on...
Allow the maximum number to stop only at the last number word (e.g. Million, Vigintillion, etc) in the defined list. (e.g. list built from here: https://en.wikipedia.org/wiki/Names_of_large_numbers under Standard dictionary numbers [Short scale])
I had an idea to have a class which contains 2 Number values ("value" and "timesMaxedOut"). When "value" >= Number.MAX_VALUE, it would then increment "timesMaxedOut" by 1 and reset "value" back to the difference that the value went over by.
The problem? It seems if you hit or surpass "MAX_VALUE" then the Number will reset to 0. I'm also sure it would then be difficult to properly multiply or divide numbers with this approach, as it would need to take into account "timesMaxedOut" just for the calculations to work correctly.
My goal is to write a game which would allow players to reach really big numbers, and play indefinitely essentially, but AS3 lacks very large number support it seems.
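Not an AS3 answer, but the naming part of this is mechanical. A sketch in Python for illustration, with a deliberately abbreviated short-scale list (extend it from the Wikipedia page up to Vigintillion):

NAMES = ["", " Thousand", " Million", " Billion", " Trillion",
         " Quadrillion", " Quintillion"]   # abbreviated; extend to Vigintillion

def format_big(n):
    """Render e.g. 1540000 as '1.54 Million' (short scale)."""
    tier = 0
    while abs(n) >= 1000 and tier < len(NAMES) - 1:
        n /= 1000.0
        tier += 1
    return "%.2f%s" % (n, NAMES[tier]) if tier else "%g" % n

print(format_big(1540000))   # 1.54 Million

Storing the pair (n, tier) - in effect a mantissa and an exponent in steps of 1000 - is a common way such games dodge MAX_VALUE, and it multiplies more easily than the value/timesMaxedOut scheme because exponents simply add.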

SSRS Format to display as percent

I've gone through quite a few examples on here and I apologize if I'm asking a repeat question, as far as I can tell, I am not.
I have an SSRS report made that shows gross sales for certain aspects of our sales departments. They are broken down, by row, into "cost, gross profit, gross profit %, order count, total sales." The columns are the aspects of our sales: web sales, phone sales, etc.
In the tablix I can format a text box to display the results as numbers, but as you can see, I have also Percentage and Count in there. I don't know how to format those within the context of the original text box format. So I know I have everything that shows under there as a number already, but how do I handle getting the percentage to show as a percentage and the count to show as a count?
For example, all the percentages currently show as "$0.35" and various other numbers that follow that form. The counts currently appear as currency too.
I've used an example I found on here, "=Iif ( Me.Value = Floor ( Me.Value ) , "0%" , "0.00%" )," but all that did was make everything that showed up in that column, "0.00%" I am fairly new to SSRS and have been cramming consistently for the past two weeks, but I just cannot find help on this. Thank you in advance for anything you can offer.
Update: =IIF(Fields!LVS_Web.Value=0.00, "0%", format(Fields!LVS_Web.Value, "P"))
That worked... to a degree, but now everything is a percent.... thinking ELSE here but I don't know how ELSE goes in, I've not once seen the word ELSE.
Update 2: The thing that I've noticed is that in the statement, where it says "=0.00, "0%"," that doesn't even really apply. I've just put that there because I'm new to this and I just needed an argument involved. I took the 0% and changed it to N under the condition that the number was < .99, hoping I would just catch all of the decimals that fell below the value of 1. Like "$.23", which later became 23.45%, so I COULD do that, but what I don't understand is it made everything else "N" instead of a number. Why is that? Why doesn't it make everything else "P"?
I'm losing my damned mind.
There is also the fact that this is information being pulled from a stored procedure. I don't really know too much about those quite yet; I get assigned simple tasks every so often as a stepping stone for learning. I don't really know what the query was, and I couldn't edit it if I wanted to. This can be done with expression formatting, but my expression is too broad, and I get mixed results using Greater or Less than, and it's probably not the wisest thing to use since these numbers are not set in stone. My day is almost done, I've made very very little progress, but I had a good lunch. So success.
So I provided my own answer for this problem, and it works. Thanks, me. Thanks to all that tried to help me and did help as well. I appreciate the effort strangers will put out for each other.
I've had a new problem develop, I need to display a time relative to the data being pulled. I can put NOW in there and get today's date, but if someone is pulling information from FEB, they may be a little off-put by the current date. I'll probably get this figured out soon, but if anyone can help in the meantime, I would appreciate it.
A standard principle is to separate data from display, so use the Value property to store the data in its native data type and use the Format property to display it how you want. So rather than use an expression formatting the Value property, such as =Format(Fields!SomeField.Value, "0.00%"), leave the Value as =Fields!SomeField.Value and set the Format property to P2.
This is especially important when exporting your report to Excel because if you have the right data type for your data it will export to Excel as the right data type. If you use the Format function it will export as text, making sorting and formula not work properly.
The easiest thing to do to control the formatting is to use the standard numeric formats. Click on the cell or range of cells that you want to have a certain format and set the Format property. A standard format is a specifier letter followed by an optional digit for precision (the number of decimal places). Some useful ones are:
C Currency with 2 decimal places (by default)
N4 Number with 4 decimal places
P0 Percentage with no decimal places
Click on the link above for the full list. Format the number cells as numbers and the percents as percents - you don't need to try to make one format string fit every cell.
These standard numeric formats also respect regional settings. You should set your report's Language property to =User!Language to use the user's regional settings rather than the report server's.
If the number is already * 100 eg. 9.5 should be shown as 9.5% then use the format:
0.00\%
9.5 -> 9.5%
0.34 -> 0.34%
This way you can use the standard number formatting and just add the % to the end. The \ escapes the %, preventing the ×100 in formatting (which would make 9.5 show as 950%).
=IIF(Fields!Metric.Value = "Gross Profit %",
     Format(Fields!LVS_Web.Value, "P"),
     IIF(Fields!Metric.Value = "Order Count",
         Format(Fields!LVS_Web.Value, "G4"),
         Format(Fields!LVS_Web.Value, "C")))
This is what saved me and did what I wanted. There is another error, but it's my bosses fault, so now I get to laugh at him. Thanks everyone.
Source:
https://technet.microsoft.com/en-us/library/bb630415(v=sql.100).aspx
This is simple to use: percent of (the sum of line item totals for the current scope) / (the sum of line item totals for the dataset).
This value is formatted using FormatPercent, specifying one decimal place.
="Percentage contributing to all sales: " & FormatPercent(Sum(Field!LineTotal.Value)/Sum(Field!LineTotal.Value,"Sales"),1)

Saving user's height and weight

How should I store a user's height and weight in a MySQL database such that I can use the information to find users within a certain height or weight? Also, I will need to be able to display this information in either English or metric system.
My idea is to store the information for height in centimeters and weight in kilograms (I prefer metric over English). I can even let the user enter their information in the English system, but do the conversion to metric before saving. I think converting kilograms to pounds might be easy to do in SQL, but I'm not sure how easy it would be to convert 178 centimeters to 5'10" (rounded slightly down).
Should I be saving English and metric values in the database so that I don't need to do conversions when I do my queries? Sounds like a bad idea to store derived/computed values.
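The conversion itself is only a couple of lines at the application layer. A sketch (the function names are mine):

def cm_to_feet_inches(cm):
    """178 cm -> (5, 10), i.e. 5 feet 10 inches, rounded to the nearest inch."""
    feet, inches = divmod(round(cm / 2.54), 12)
    return feet, inches

def kg_to_pounds(kg):
    return kg * 2.20462

feet, inches = cm_to_feet_inches(178)
print("%d'%d\"" % (feet, inches))   # 5'10"
print(kg_to_pounds(80))             # 176.3696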
There are several ways... one is to just have two numeric columns, one for height, one for weight, then do the conversions (if necessary) at display time. Another is to create a "height" table and a "weight" table, each with a primary key that is linked from another table. Then you can store both English and metric values in these tables (along with any other meta info you want):
CREATE TABLE height (
id SERIAL PRIMARY KEY,
english VARCHAR(32),
inches INT,
cm INT,
hands INT -- as in, the height of a horse
);
INSERT INTO height VALUES
(1, '4 feet', 48, 122, 12),
(2, '4 feet, 1 inch', 49, 124, 12),
(3, '4 feet, 2 inches', 50, 127, 12),
(4, '4 feet, 3 inches', 51, 130, 12),
....
You get the idea...
Then your users table will reference the height and weight tables--and possibly many other dimension tables--astrological sign, marital status, etc.
CREATE TABLE users (
uid SERIAL PRIMARY KEY,
height INT REFERENCES height(id),
weight INT REFERENCES weight(id),
sign INT REFERENCES sign(id),
...
);
Then to do a search for users between 4 and 5 feet:
SELECT *
FROM users
JOIN height ON users.height = height.id
WHERE height.inches >= 48 AND height.inches <= 60;
Several advantages to this method:
You don't have to duplicate the "effort" (as if it were any real work) to do the conversion on display--just select the format you wish to display!
It makes populating drop-down boxes in an HTML select super easy--just SELECT english FROM height ORDER BY inches, for instance.
It makes your logic for various dimensions--including non-numerical ones (like astrological signs)--obviously similar: you don't have special-case code all over the place for each data type.
It scales really well
It makes it easy to add new representations of your data (for instance, to add the 'hands' column to the height table)
I would do it the way that you have said you would like to, but for the converting part: you would not convert 178 centimeters to 5'10", you would convert it to 70", and then, if need be, convert that into 5'10".
Think of 5'10" as either 70" or 5.8333333'. In that case, converting between 70" and 5.8333333' is just a multiplication, so it's easy to store in the db as centimeters if you so choose.
The issue of what the user sees is a presentation issue and nothing to do with the database.
I agree that storing computed values in this case is not ok. Your choices are perfect.
However, I would do the computations at the application level and query the DB with those values - depending on the language your application is written in, I am sure there are plenty of libraries/modules that can compute those transformations.
Edit - to address the issue of storing computed values in DB:
While this is considered to be a bad practice in working with DBs, I usually am not 100% against this practice - just 90%.
I tend to store computed values in DB only when the computations are complex and would take enormous resources to get to the result wanted - this is clearly not the case.
If you were to store computed values here, you would get only the disadvantages of this technique: when modifying a record, you would have to modify the data in multiple places to keep your DB consistent.

Text-correlation in MySQL [duplicate]

Suppose I want to match address records (or person names or whatever) against each other to merge records that are most likely referring to the same address. Basically, I guess I would like to calculate some kind of correlation between the text values and merge the records if this value is over a certain threshold.
Example:
"West Lawnmower Drive 54 A" is probably the same as "W. Lawn Mower Dr. 54A" but different from "East Lawnmower Drive 54 A".
How would you approach this problem? Would it be necessary to have some kind of context-based dictionary that knows, in the address case, that "W", "W." and "West" are the same? What about misspellings ("mover" instead of "mower" etc)?
I think this is a tricky one - perhaps there are some well-known algorithms out there?
A good baseline, though probably an impractical one in terms of its relatively high computational cost and, more importantly, its production of many false positives, would be generic string distance algorithms such as the two below (a quick sketch follows the list):
Edit distance (aka Levenshtein distance)
Ratcliff/Obershelp
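For a quick feel of this baseline (including its false-positive problem), Python's difflib.SequenceMatcher computes a Ratcliff/Obershelp-style ratio out of the box:

from difflib import SequenceMatcher

def similarity(a, b):
    # ratio in [0, 1]: 2 * matching characters / total characters
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# The East/West pair scores *higher* than the abbreviated-but-identical pair:
# exactly the kind of false positive a generic string distance produces.
print(similarity("West Lawnmower Drive 54 A", "W. Lawn Mower Dr. 54A"))
print(similarity("West Lawnmower Drive 54 A", "East Lawnmower Drive 54 A"))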
Depending on the level of accuracy required (which, BTW, should be specified both in terms of its recall and precision, i.e. generally expressing whether it is more important to miss a correlation than to falsely identify one), a home-grown process based on [some of] the following heuristics and ideas could do the trick:
tokenize the input, i.e. see the input as an array of words rather than a string
tokenization should also keep the line number info
normalize the input with the use of a short dictionary of common substitutions (such as "dr" at the end of a line = "drive", "Jack" = "John", "Bill" = "William"..., "W." at the beginning of a line = "West", etc.)
Identify (a bit like tagging, as in POS tagging) the nature of some entities (for example ZIP Code, Extended ZIP Code, and also city)
Identify (lookup) some of these entities (for example a relatively short database table can include all the cities/towns in the targeted area)
Identify (lookup) some domain-related entities (if all/many of the addresses deal with, say, folks in the legal profession, a lookup of law firm names or of federal buildings may be of help)
Generally, put more weight on tokens that come from the last line of the address
Put more (or less) weight on tokens with a particular entity type (e.g. "Drive", "Street", "Court" should weigh much less than the tokens which precede them)
Consider a modified SOUNDEX algorithm to help with normalization of similar-sounding spellings
With the above in mind, implement a rule-based evaluator. Tentatively, the rules could be implemented as visitors to a tree/array-like structure where the input is parsed initially (Visitor design pattern).
The advantage of the rule-based framework is that each heuristic is in its own function and rules can be prioritized, i.e. placing some rules early in the chain allows the evaluation to be aborted early on some strong heuristics (eg: different City => Correlation = 0, level of confidence = 95% etc...).
An important consideration with searching for correlations is the need to compare, a priori, every single item (here, address) with every other item, hence requiring as many as 1/2 n^2 item-level comparisons. Because of this, it may be useful to store the reference items in a way where they are pre-processed (parsed, normalized...) and also to maybe have a digest/key of sorts that can be used as a [very rough] indicator of a possible correlation (for example a key made of the 5-digit ZIP Code followed by the SOUNDEX value of the "primary" name).
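A sketch of such a digest key, pairing the ZIP code with a plain SOUNDEX of one token (the function names are mine, and a real blocking key would need tuning; it is only a rough pre-filter):

import re
from collections import defaultdict

def soundex(word):
    """Standard American Soundex, e.g. 'Robert' -> 'R163'."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    word = [c for c in word.lower() if c.isalpha()]
    if not word:
        return "0000"
    digits, prev = [], codes.get(word[0], "")
    for c in word[1:]:
        if c in "hw":
            continue                 # h and w do not break a run of equal codes
        code = codes.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code
    return (word[0].upper() + "".join(digits) + "000")[:4]

def blocking_key(address):
    """5-digit ZIP + Soundex of the longest word."""
    m = re.search(r"\b(\d{5})(?:-\d{4})?\b", address)
    zip5 = m.group(1) if m else "00000"
    words = re.findall(r"[A-Za-z]+", address)
    return zip5 + soundex(max(words, key=len) if words else "")

# Only pairs that share a bucket get the expensive item-level comparison.
# (Abbreviation variants like "W. Lawn Mower Dr." must be normalized
# before keying, or they will land in different buckets.)
buckets = defaultdict(list)
for addr in ("West Lawnmower Drive 54 A, 12345",
             "East Lawnmower Drive 54 A, 12345"):
    buckets[blocking_key(addr)].append(addr)   # same bucket: compared in detail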
I would look at producing a similarity comparison metric that, given two objects (strings perhaps), returns "distance" between them.
It helps if your metric fulfils the following criteria:
the distance between an object and itself is zero (reflexive);
the distance from a to b is the same in both directions (symmetric);
the distance from a to c is not more than the distance from a to b plus the distance from b to c (the triangle inequality).
If your metric obeys these then you can arrange your objects in metric space, which means you can run queries like:
Which other object is most like this one?
Give me the 5 objects most like this one.
There's a good book about it here. Once you've set up the infrastructure for hosting objects and running the queries you can simply plug in different comparison algorithms, compare their performance and then tune them.
I did this for geographic data at university and it was quite fun trying to tune the comparison algorithms.
I'm sure you could come up with something more advanced, but you could start with something simple like reducing the address line to the digits and the first letter of each word and then comparing the results using a longest-common-subsequence algorithm (sketched below).
Hope that helps in some way.
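That simple reduction plus an LCS score might look like this (the names are mine):

def reduce_address(line):
    """'West Lawnmower Drive 54 A' -> 'wld54a'."""
    parts = []
    for word in line.split():
        if word[0].isdigit():
            parts.append("".join(ch for ch in word if ch.isdigit()))
        else:
            parts.append(word[0].lower())
    return "".join(parts)

def lcs_len(a, b):
    """Length of the longest common subsequence, O(len(a) * len(b))."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

a = reduce_address("West Lawnmower Drive 54 A")   # 'wld54a'
b = reduce_address("W. Lawn Mower Dr. 54A")       # 'wlmd54'
score = lcs_len(a, b) / max(len(a), len(b))       # 5/6, a strong match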
You can use Levenshtein edit distance to find strings that differ by only a few characters. BK Trees can help speed up the matching process.
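A compact sketch of both pieces: a dynamic-programming Levenshtein and a BK-tree whose triangle-inequality pruning keeps "everything within distance d" lookups cheap:

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

class BKTree:
    """Burkhard-Keller tree: find all stored strings within max_dist of a query."""
    def __init__(self):
        self.root = None                # node = [word, {distance: child node}]

    def add(self, word):
        if self.root is None:
            self.root = [word, {}]
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = [word, {}]
                return

    def search(self, word, max_dist):
        hits, stack = [], [self.root] if self.root else []
        while stack:
            node_word, children = stack.pop()
            d = levenshtein(word, node_word)
            if d <= max_dist:
                hits.append((d, node_word))
            # the triangle inequality rules out children outside this band
            for child_d, child in children.items():
                if d - max_dist <= child_d <= d + max_dist:
                    stack.append(child)
        return sorted(hits)

tree = BKTree()
for w in ("lawnmower", "lawn", "mower", "drive", "street"):
    tree.add(w)
print(tree.search("lawnmover", 1))   # [(1, 'lawnmower')]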
Disclaimer: I don't know of any algorithm that does this, but would really be interested in knowing one if it exists. This answer is a naive attempt at solving the problem, with no previous knowledge whatsoever. Comments welcome; please don't laugh too loud.
If you try doing it by hand, I would suggest applying some kind of "normalization" to your strings : lowercase them, remove punctuation, maybe replace common abbreviations with the full words (Dr. => drive, St => street, etc...).
Then, you can try different alignments between the two strings you compare, and compute the correlation by averaging the absolute differences between corresponding letters (e.g. a = 1, b = 2, etc., so corr(a, b) = |a - b| = 1):
west lawnmover drive
w lawnmower street
Thus, even if some letters are different, the correlation would be high. Then, simply keep the maximal correlation you found, and decide that the strings are the same if that correlation is above a given threshold.
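A naive rendering of that idea (my names; each relative shift counts as one "alignment", and short overlaps are penalized by dividing by the longer length):

def naive_corr(a, b):
    """Best average letter closeness over all relative shifts, in [0, 1]."""
    best = 0.0
    for shift in range(-len(b) + 1, len(a)):
        pairs = list(zip(a[max(shift, 0):], b[max(-shift, 0):]))
        if not pairs:
            continue
        closeness = [1 - min(abs(ord(x) - ord(y)), 25) / 25.0 for x, y in pairs]
        best = max(best, sum(closeness) / max(len(a), len(b)))
    return best

print(naive_corr("west lawnmover drive", "w lawnmower street"))
print(naive_corr("west lawnmover drive", "something else entirely"))
# the first pair scores noticeably higher than the second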
When I had to modify a proprietary program doing this, back in the early 90s, it took many thousands of lines of code in multiple modules, built up over years of experience. Modern machine-learning techniques ought to make it easier, and perhaps you don't need to perform as well (it was my employer's bread and butter).
So if you're talking about merging lists of actual mailing addresses, I'd do it by outsourcing if I could.
The USPS had some tests to measure quality of address standardization programs. I don't remember anything about how that worked, but you might check if they still do it -- maybe you can get some good training data.