SQL - How to find optimal performance numbers for query - mysql

First time here, so forgive me for any faux pas. I have a question about the limitations of SQL; I am new to it, and what I need is, I believe, rather complex.
Is it possible to automate finding the optimal data for a specific query? For example, say I have the following columns:
1) Vehicle type (text) e.g. car, bike, bus
2) Number of passengers (Numeric) e.g. 0-7
3) Was in an accident (Boolean) e.g. t or f
From here, I would like to get percentages. So if I were to select only cars with 3 passengers, what percentage of the total accidents does that account for?
I understand how to get this as a one-off or calculate it mathematically; my question is how to automate this process to find the optimum number.
So, keeping with this example, say I look at just cars, what number of passengers covers the highest percentage of accidents?
At the moment I am going through and testing number by number. Is there a way to 'find' the optimal number? It is easy when it is just 0-7 as in the example, but I would naturally like to deal with a larger range, and even multiple ranges. For example, say we add another variable titled:
4) Number of doors (numeric) e.g. 0-3
Would there be a way of finding the best combination of numbers from these two variables that cover the highest percentage of accidents?
So say we took: car, >2 passengers, <3 doors on the vehicle. Out of the accidents variable, 50% were true.
But if we change that to: car, >4 passengers, <3 doors, then out of the accidents variable, 80% were true.
I hope I have explained this well. I understand that this is most likely not possible with SQL; however, is there another way to find these optimum numbers?
Thanks in advance

Here's an example that will give you an answer for all possibilities. You could add a LIMIT clause to show only the top answer, or add to the WHERE clause to restrict it to specific terms.
SELECT
    `vehicle_type`,
    `num_passengers`,
    SUM(IF(`in_accident`, 1, 0)) AS `num_accidents`,
    COUNT(*) AS `num_in_group`,
    SUM(IF(`in_accident`, 1, 0)) / COUNT(*) AS `percent_accidents`
FROM `accidents`
GROUP BY `vehicle_type`, `num_passengers`
ORDER BY `percent_accidents` DESC   -- highest accident rate first, so LIMIT 1 gives the top combination
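To answer the two-variable version of the question (passengers and doors together), group by both numeric columns and take the top row. This is only a sketch: the `num_doors` column name is assumed, since it does not appear in the original table, and the HAVING threshold is an arbitrary guard against tiny groups producing misleading 100% rates.

SELECT
    `vehicle_type`,
    `num_passengers`,
    `num_doors`,
    COUNT(*) AS `num_in_group`,
    SUM(IF(`in_accident`, 1, 0)) / COUNT(*) AS `percent_accidents`
FROM `accidents`
WHERE `vehicle_type` = 'car'            -- optional: restrict to one vehicle type
GROUP BY `vehicle_type`, `num_passengers`, `num_doors`
HAVING COUNT(*) >= 20                   -- arbitrary minimum group size
ORDER BY `percent_accidents` DESC
LIMIT 1;

This finds the single best exact combination; ranges such as ">4 passengers" would still need to be explored by bucketing the columns (e.g. with CASE expressions) or in application code.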

Related

Smart Queries That Deal With NULL Values

I recently inherited a table similar to the one in the image below. I don't have the resources to do what should be done in the allotted time, which is obviously to normalize the data: break it into a few smaller tables to eliminate redundancy, etc.
My current idea for a short-term solution is to create a query for each product type and store it in a new table based on ParentSKU. In the image below, a different query would be necessary for each of the 3 example ParentSKUs. This will work okay, but if new attributes are added to a SKU the query needs to be adjusted manually. What would be ideal in the short term (but probably not very likely) is to be able to come up with a query that would only include and display attributes where there weren't any NULL values. The desired results for each of the three ParentSKUs would be the same as they are in the examples below. If there were only 3 queries total, that would be easy enough, but there are dozens of combinations based on the products and categories of each product.
I'm certainly not the man for the job, but there are scores of people way smarter than I am that frequent this site every day that may be able to steer me in a better direction. I realize I'm probably asking for the impossible here, but as the saying goes, "There are no stupid questions, only ill-advised questions that deservedly and/or inadvertently draw the ire of StackOverflow users for various reasons." Okay, I embellished a tad, but you get my point...
I should probably add that this is currently a MySQL database.
Thanks in advance to anyone that attempts to help!
First create SKUTypes with the result of
SELECT ParentSKU, COUNT(Attr1) AS Attr1, ...
FROM tbl_attr
GROUP BY ParentSKU;
Then create a script that generates an SQL query for every row of SKUTypes, taking every AttrN column whose value is > 0.
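A minimal sketch of both steps, assuming just three attribute columns named Attr1 through Attr3 (the real table will have more), with the kind of query the script might emit shown for a hypothetical ParentSKU:

CREATE TABLE SKUTypes AS
SELECT ParentSKU,
       COUNT(Attr1) AS Attr1,   -- COUNT() skips NULLs, so > 0 means the attribute is used
       COUNT(Attr2) AS Attr2,
       COUNT(Attr3) AS Attr3
FROM tbl_attr
GROUP BY ParentSKU;

-- Example of a query the script could generate for a ParentSKU whose
-- SKUTypes row shows Attr1 and Attr3 used but Attr2 always NULL:
SELECT ParentSKU, Attr1, Attr3
FROM tbl_attr
WHERE ParentSKU = 'EXAMPLE-SKU';   -- hypothetical value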

MySQL architecture for n * (n - 1) / 2 algorithm

I'm currently developing a website where users can search for other users based on attributes (age, height, town, education, etc.). I now want to implement some kind of rating between user profiles. The rating is calculated by its own algorithm based on the similarity between the two given profiles. For example, User A has a "match rating" of 85 with User B and 79 with User C; B and C have a rating of 94 with each other, and so on.
The user should be able to search for certain attributes and filter the results by rating.
Since the rating differs from profile to profile and also depends on the user doing the search, I can't simply add a field to my users table and use ORDER BY. So far I have come up with two solutions:
My first solution was to have a nightly batch job, that calculates the rating for every possible user combination and stores it in a separate table (user1, user2, rating). I then can join this table with the user table and order the result by rating. After doing some math I figured that this solution doesn't scale that well.
Based on the formula n * (n - 1) / 2, there are 45 possible combinations for 10 users. For 1,000 users I suddenly have to insert 499,500 rating combinations into my rating table.
The second solution was to leave MySQL be and just calculate the rating on the fly within my application. This also doesn't scale well. Let's say the search should only return 100 results to the UI (with the highest rated on top). If I have 10,000 users and I want to do a search for every user living in New York sorted by rating, I have to load EVERY user that is living in NY into my app (let's say 3,000), apply the algorithm, and then return only the top 100 to the user. This way I have loaded 2,900 useless user objects from the DB and wasted CPU on the algorithm without ever doing anything with the results.
Any ideas how I can design this in my MySQL db or web app so that a user can have an individual rating with every other user in a way that the system scales beyond a couple thousand users?
If you have to match every user against every other user, the algorithm is O(N^2), whatever you do.
If you can exploit some sort of 1-dimensional "metric", then you can try and associate each user with a single synthetic value. But that's awkward and could be impossible.
But what you can do is note which users' profiles have changed (that is, whenever any of the parameters the matching is based on changes). At that point you can batch-recalculate the table for those users only, thus working in O(N): if you have 10,000 users and only 10 require recalculation, you have to examine 100,000 records instead of 100,000,000.
Other strategies would be to only run the main algorithm for records which have a greater chance of being compared: in your example, "same city". Or, when updating records (this would require storing (user_1, user_2, ranking, last_calculated)), only recalculate those with a high ranking, a very old last_calculated, or no calculation at all. The lowest-ranked matches aren't likely to change so much that they float to the top in a short time.
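A sketch of what such a table and the "rows worth recalculating" selection could look like; the column names follow the tuple above, and the two thresholds are placeholders to be tuned:

CREATE TABLE user_match (
    user_1          INT NOT NULL,
    user_2          INT NOT NULL,
    ranking         INT NOT NULL,
    last_calculated DATETIME NULL,        -- NULL = never calculated
    PRIMARY KEY (user_1, user_2)
);

-- Candidates for the next batch recalculation.
SELECT user_1, user_2
FROM user_match
WHERE ranking > 80                                    -- placeholder: "high ranking"
   OR last_calculated IS NULL
   OR last_calculated < NOW() - INTERVAL 30 DAY;      -- placeholder: "very old"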
UPDATE
The problem is also one of O(N^2) storage space.
How do we reduce this space? I can see two approaches. One is to not put some information in the match table at all. The "match" function makes more sense the more rigid and steep it is; having ten thousand "good matches" would mean that matching means very little. So we would still need lots of recalculations when User1 changes some key data, in case it brings some of User1's "no-no" matches back into the "maybe" zone. But we would keep a smaller clique of active matches for each user.
Storage would still grow quadratically, but less steeply.
Another strategy would be to calculate matches on the fly. Then we would need some method for quickly selecting which users are likely to be a good match (thus limiting the number of rows retrieved by the JOIN), and some method for quickly calculating a match, which could entail somehow rewriting the match between User1 and User2 as a very simple function of a subset of DataUser1 and DataUser2 (maybe using ancillary columns).
The challenge would be to leverage MySQL's capabilities and offload some calculations to the MySQL engine.
For this purpose you might "map" some data at input time (therefore in O(k)) to spatial information, or to strings, and employ Levenshtein distance.
The storage for a single user would grow, but it would grow linearly, not quadratically, and MySQL SPATIAL indexes are very efficient.
If the search should only return the top 100 best matches, then why not just store those? It sounds like you would never want to search the bottom end of the results anyway, so just don't calculate them.
That way, your storage space is only O(n), rather than O(n^2), and updates should be as well. If someone really wants to see matches past the first 100 (and you want to let them), then you have the option of running the query in real time at that point.
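A sketch of that per-user top-N table; the table and column names are made up, and pruning anything beyond each user's top 100 would be left to the batch job or the application:

CREATE TABLE user_top_match (
    user_id       INT NOT NULL,
    match_user_id INT NOT NULL,
    ranking       INT NOT NULL,
    PRIMARY KEY (user_id, match_user_id),
    KEY idx_user_ranking (user_id, ranking)   -- supports the per-user ORDER BY below
);

-- Fetch a user's stored matches, best first.
SELECT match_user_id, ranking
FROM user_top_match
WHERE user_id = 42        -- hypothetical logged-in user
ORDER BY ranking DESC
LIMIT 100;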
I agree with everything #Iserni says.
If you have a web app and users need to "login", then you might have an opportunity to create that user's rankings at that time and stash them into a temporary table (or rows in an existing table).
This will work in a reasonable amount of time (a few seconds) if all the data needed for the calculation fits into memory. The database engine should then be doing a full table scan and creating all the ratings.
This should work reasonably well for one user logging in. Passably for two... but it is not going to scale very well if you have, say, a dozen users logging in within one second.
Fundamentally, though, your rating does not scale well. You have to do a comparison of all users to all users to get the results. Whether this is batch (at night) or real-time (when someone has a query) doesn't change the nature of the problem. It is going to use a lot of computing resources, and multiple users making requests at the same time will be a bottleneck.
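A sketch of the login-time idea described above; the similarity expression is only a placeholder (the real algorithm lives in the application), and the users table columns (id, age) are assumed:

CREATE TEMPORARY TABLE my_rankings (
    other_user_id INT NOT NULL PRIMARY KEY,
    ranking       INT NOT NULL,
    KEY idx_ranking (ranking)
);

INSERT INTO my_rankings (other_user_id, ranking)
SELECT u.id,
       100 - ABS(u.age - me.age)        -- placeholder similarity, NOT the real algorithm
FROM users u
JOIN users me ON me.id = 42             -- hypothetical id of the user who just logged in
WHERE u.id <> me.id;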

Which of these would be safer/better to run?

I have 451 cities with coordinates. Now I want to calculate the distance between each city and then order some results by that distance. Now I have 2 options:
I can run a loop that calculates the distance for every possible combination of cities and stores them in a table, which would result in roughly 200k rows.
Or, I can skip the pre-calculation and, when results are displayed (about 30 per page), calculate the distance for each city separately.
I don't know which would be better for performance, but I would prefer option one, in which case I have another concern: is there a way I could end up with as few rows as possible? Currently I would count the possibilities as 451^2, but I think I could divide that by 2, since the distance City1-City2 is the same as City2-City1.
Thanks
If your table of cities is more or less static, then you should definitely pre-calculate all distances and store them in a separate table. In this case you will have 451 * 450 / 2 = 101,475 rows (just make sure that the id of City1 is always lower than the id of City2, or the other way round; it doesn't really matter).
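A sketch of that pre-calculation, assuming a cities table with id, lat and lng columns (names assumed) and MySQL 5.7+ for ST_Distance_Sphere:

CREATE TABLE city_distances (
    city1_id   INT NOT NULL,
    city2_id   INT NOT NULL,
    distance_m DOUBLE NOT NULL,              -- metres
    PRIMARY KEY (city1_id, city2_id)
);

INSERT INTO city_distances (city1_id, city2_id, distance_m)
SELECT a.id,
       b.id,
       ST_Distance_Sphere(POINT(a.lng, a.lat), POINT(b.lng, b.lat))
FROM cities a
JOIN cities b ON a.id < b.id;                -- each unordered pair stored once

Looking a distance up then needs the ids in the same order, e.g. WHERE city1_id = LEAST(x, y) AND city2_id = GREATEST(x, y).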
Normally the cost of a single MySQL query is quite high and the cost of mathematical operations is really low. Especially if the scale of your map is small and the required precision is low, so that you can work with a fixed distance per degree, you will be faster calculating on the fly.
Furthermore, you would have a problem if the number of cities grows because of a change in your project, and therefore the number of combinations you'd have to store in the DB exceeds reasonable limits.
So you'd probably be better off without pre-calculating.

Birthday Paradox: How to programmatically estimate the probability of 3, and N, people sharing a birthday

There are extensive resources on the internet discussing the famous Birthday Paradox. It is clear to me how you calculate the probability of two people sharing a birthday, i.e. P(same) = 1 - P(different). However, if I ask myself something apparently simpler, I stall. Firstly, let's say I generate two random birthdays. Getting the same birthday is like tossing a coin: either the two persons share a birthday (heads) or they don't (tails). Run this 500 times and the end result (#Heads/500) will somehow be close to 0.5.
Q1) But how do I think about this if I generate three random birthdays? How can I estimate the probability then? Obviously my coin analogy won't be applicable.
Q2) Once I have figured out the above, I will need to scale it up and generate 30 or 50 birthdays. Is there a recommended technique or algorithm for isolating identical birthdays in a large set? Should I put them into arrays and loop through them?
Here's what I think I need:
Q1)
r = 25 i.e. each trial run generates 25 birthdays
Trial 1 >
3 duplicates: 0
Trial 2 >
3 duplicates: 0
Trial 3 >
3 duplicates: 2
Trial 4 >
3 duplicates: 1
...
Trial 100 >
3 duplicates: 2
estimated probability of 3 persons sharing a birthday in a room of 25 = (0+0+2+1+...+2)/100
Q2)
Create an array for 2 duplicates, an array for 3 duplicates, and one for more than 3 duplicates.
Add each generated birthday one by one into the first array. But before doing so, loop through the array to see if it's in there already. If so, add it to the second array, but before doing so repeat the above process, and so on.
It doesn't seem to be a very efficient algorithm though :) Any suggestions to improve the Big O here?
Create an integer array of length 365, initialized to 0. Then generate N (in your case 25) random numbers between 1-365 and increment that slot in the array (i.e. bdays[random_value]++). Since you are only interested in a collision happening, right after incrementing the number in the array check whether it is greater than 2 (if it is, there is a second collision, which means there are 3 people with the same birthday). Keep track of collisions and execute this as many times as you wish (say 1,000).
In the end, the ratio collisions/1000 will be your requested value.
And no, the coin-tossing analogy is wrong: for two random birthdays the probability of a match is 1/365, not 1/2.
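Since the rest of this digest is MySQL-flavoured, here is the same counting idea expressed as a single query rather than an array. This is purely a sketch, assuming MySQL 8+ for recursive CTEs; note that the default cte_max_recursion_depth of 1000 caps how many trials you can generate this way.

WITH RECURSIVE
trials AS (                               -- 100 independent trials
    SELECT 1 AS t
    UNION ALL
    SELECT t + 1 FROM trials WHERE t < 100
),
people AS (                               -- 25 people per trial
    SELECT 1 AS p
    UNION ALL
    SELECT p + 1 FROM people WHERE p < 25
),
draws AS (                                -- one random birthday per person per trial
    SELECT t, FLOOR(1 + RAND() * 365) AS birthday
    FROM trials CROSS JOIN people
),
per_trial AS (                            -- did any birthday occur 3+ times in this trial?
    SELECT t, MAX(cnt >= 3) AS has_triple
    FROM (
        SELECT t, birthday, COUNT(*) AS cnt
        FROM draws
        GROUP BY t, birthday
    ) AS counts
    GROUP BY t
)
SELECT AVG(has_triple) AS estimated_probability
FROM per_trial;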
Check this similar question and its answers on CrossValidated, but I think it is really worth thinking about the classic Birthday problem again to get the basics.
For the second part of your question: it depends on the language you use. I definitely suggest using R to solve a problem like that, as checking for identical birthdays in a list/vector/data frame can easily be done with a simple call to unique. To run such a simple MC simulation, R is again really handy; check the second answer at the link above.
Sounds like your first task will be to create a method that will generate random birthdays. To keep things simple, you can use the numbers 1-365 to denote unique birthdays.
Store however many random birthdays you need (2 in the first case, more later) in an ArrayList as Strings. You will want to use a loop to call the random number function and store the value in your list.
Then make a function to search the ArrayList for duplicates. If there are any duplicates (no matter how many) then that's a Heads result. If there are no matches then it's a Tails.
Your probabilities will be far different from 50/50 until you get to 20 or so.

How to convert the following MySQL schema into CouchDB?

I am not sure how I can design the following problem in CouchDB.
I have a logger web app that keeps track of how many items are in a warehouse. To simplify the problem, we just need to know the total number of items currently in the warehouse and how long each item stays in the warehouse before it ships. Let's say the warehouse only has shoes, but each shoe has a different id and needs to be tracked by id.
The MySQL schema looks like this:
id  name  date-in     date-out
1   shoe  08/0/2010   null
2   shoe  07/20/2010  08/01/2010
The output will be
Number of shoes in warehouse: 1
Average time in warehouse: 14 days
Thanks
jhs' answer is great, but I just wanted to add something:
To use the built-in reduce function for the avg calculation (_stats in your case), you have to use two "separate" views. But if your map function is exactly the same, CouchDB will detect that and not generate a whole new index for the second view. This way you can have one map function feeding multiple reduce functions.
If each shoe is a document, with a date_in and a date_out, then your reduce function adds 1 if date_out is null, and 0 (no change) if date_out is not null. That will give you the total count of shoes in the warehouse.
To compute the average time, for each shoe, you know the time in the warehouse. So the reduce function simply accumulates the average. Since reduce functions must be commutative and associative, you use a different average algorithm. The easiest way is to reduce to a [sum, count] array, where sum is an accumulator of all time for all shoes, and count is a counter for the number of shoes counted. Then the client simply divides sum / count to compute the final average.
I think you could combine both of these into one big reduce if you want, perhaps building up a {"shoes in warehouse": 1, "average time in warehouse": [253, 15]} kind of object.
However, if you can accept two different views for this data, then there is a shortcut for the average. In the map, emit(null, time) where time is the time spent in the warehouse. In the reduce, set the entire reduce value to _stats (see Built-in reduce functions). The view output will be an object with the sum and count already computed.
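For comparison, the two figures requested in the question can be computed in the original MySQL schema with a single aggregate query. This is only a reference sketch: the table name is made up, and the hyphenated column names from the sample are assumed to be date_in and date_out (hyphens aren't valid in unquoted MySQL identifiers):

SELECT
    SUM(date_out IS NULL)             AS shoes_in_warehouse,      -- rows not yet shipped
    AVG(DATEDIFF(date_out, date_in))  AS avg_days_in_warehouse    -- AVG ignores NULLs, so only shipped shoes count
FROM warehouse_items;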