Before I start my question, let me briefly cover what the problem is:
I have a table that stores around 4 million 'parameter' values. These values have an id, a simulation id and a parameter id.
The parameter id maps to a parameter table that basically just maps the id to a textual representation of the parameter: x, y, etc.
The simulation table has around 170k entries that map parameter values to a job.
There is also a score table which stores the score of each simulation. Simulations have varying numbers of scores; for example, one might have one score while another might have three. The score table has a simulation_id column for selecting this.
Each job has an id and an objective.
Currently I'm trying to select all the parameter_values whose parameter is 'x' and where the job id is 17, and fetch the score of each. The variables of the select will change, but in principle these are the only things I'm interested in.
Currently I'm using this statement:
SELECT simulation.id, value, name,
       ( SELECT GROUP_CONCAT(score) FROM score
         WHERE score.simulation_id = simulation.id ) AS score
FROM simulation, parameter_value, parameter
WHERE simulation.id = parameter_value.simulation_id
  AND simulation.job_id = 17
  AND parameter_value.parameter_id = parameter.id
  AND parameter.name = "$x1"
This works nicely, except it takes around 3 seconds to execute. Can this be done any faster?
I don't know if it would be faster to run a query first that pre-calculates the parameter_ids I'm searching for, and then use WHERE parameter_id IN (1, 2, 3, 4) etc.
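Roughly what I mean (the ids 1, 2, 3, 4 here are made up; they would come from the first query):

SELECT id FROM parameter WHERE name = '$x1';

SELECT simulation.id, value,
       ( SELECT GROUP_CONCAT(score) FROM score
         WHERE score.simulation_id = simulation.id ) AS score
FROM simulation, parameter_value
WHERE simulation.id = parameter_value.simulation_id
  AND simulation.job_id = 17
  AND parameter_value.parameter_id IN (1, 2, 3, 4)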
But I was under the impression SQL would optimize this anyway?
I have created indexes wherever possible but can't get below the 2.7 second mark.
So my questions would be:
Should I pre-calculate some values and avoid the joins?
Is there an alternative to GROUP_CONCAT for fetching the scores?
And are there any other optimizations I could make to this?
I should also add that the scores must be in the same row, or at least returned sorted, so I can easily read them from the result set.
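For what it's worth, I assume something like this inside the subquery would keep them sorted, since GROUP_CONCAT accepts an ORDER BY:

( SELECT GROUP_CONCAT(score ORDER BY score) FROM score
  WHERE score.simulation_id = simulation.id ) AS score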
Thanks,
Lewis
Imagine I would like to develop something similar to Tinder. I have a database with roughly 170k rows (=persons) and I would like to present them on my website. After the user's response, the next person is shown etc.
Once a person has been shown, this is marked in the column 'seen' with a 1. The order in which the persons are shown should be random and only persons that have not been seen yet should be shown.
At the moment, I have this solution. However, this is rather slow and takes too much time for a smooth experience. What would be a more efficient approach to this problem? What is the gold standard for such problems?
SELECT * FROM data WHERE (seen = 0) ORDER BY RAND() LIMIT 1
Add a non-clustered index on the 'seen' column together with the PK column, which will improve queries filtering on them.
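A sketch of that index, assuming MySQL and the table from the question (the index name is my own):

ALTER TABLE data ADD INDEX idx_seen_id (seen, id);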
If the primary id is sequential and you know the limits of the records, you can generate a random number between the min and max values and query like this:
SELECT *
FROM data
WHERE seen = 0 AND id >= random_id
ORDER BY id
LIMIT 1
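A self-contained sketch combining the two steps (my own assumption of how random_id would be produced; if the random id lands past the last unseen row, the query returns nothing, so retry on an empty result):

SELECT d.*
FROM data AS d
JOIN (
    -- Pick a random point inside the current id range.
    SELECT FLOOR(MIN(id) + RAND() * (MAX(id) - MIN(id))) AS rid
    FROM data
) AS r ON d.id >= r.rid
WHERE d.seen = 0
ORDER BY d.id
LIMIT 1;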
I am trying to extract a random article that has a picture from a database.
SELECT FLOOR(MAX(id) * RAND()) FROM `table` WHERE `picture` IS NOT NULL
My table is 33 MB big and has 1,006,394 articles but just 816 with pictures.
My problem is that this query takes 0.4640 sec.
I need this to be much, much faster.
Any idea is welcome.
P.S.
1. Of course I have an index on id.
2. There is no index on the picture field. Should I add one?
3. The product name is unique, as is the product number, but that's out of the question.
RESULT OF TESTING SESSION:
#cHao's solution is faster when I use it to select one of the random entries with a picture (less than 0.1 sec).
But it's slower if I try to do the opposite, selecting a random article without a picture: 2-3 sec.
#Kickstart's solution is a bit slower when trying to find an entry with a picture, but is almost the same speed when trying to find an entry without a picture: 0.149 sec on average.
#bob-kruithof's solution doesn't work for me:
when trying to find an entry with a picture, it selects an entry without a picture.
And #ganesh-bora, yes you are right; in my case the speed difference is about 5 to 15 times.
I want to thank you all for your help, and I decided for #Kickstart.
You need to get a range of values with matching records and then find a matching record within that range.
Something like this:
SELECT r1.id
FROM `table` AS r1
INNER JOIN (
    SELECT RAND() * (MAX(id) - MIN(id)) + MIN(id) AS id
    FROM `table`
    WHERE `picture` IS NOT NULL
) AS r2
ON r1.id >= r2.id
WHERE `picture` IS NOT NULL
ORDER BY r1.id ASC
LIMIT 1
However, for any hope of efficiency you need an index on the field being checked (i.e., picture in your example).
Just an explanation of how this works.
The sub select finds a random id between the min and max ids of picture records. This random id may or may not belong to a picture.
The resulting id from the sub select is joined back against the main table, using >= and with a WHERE clause specifying that the record is a picture record. Hence it joins against all picture records whose id is greater than or equal to the random id. The highest possible random id is the id of the picture record with the highest id, so the join will always find a record (if there are any picture records). The ORDER BY / LIMIT is then used to bring back that single row.
Note that there is an obvious flaw to this, but most of the time it will be irrelevant: the record retrieved may not be entirely random. The picture with the lowest id is unlikely to be returned (it will only be returned if RAND() returns exactly 0), but if this is important it is easy enough to fix by rounding the resulting random id. The other flaw is that if the ids are not roughly evenly distributed across the full range, some will be returned more often than others. For example, take the situation where the first 1000 ids were pictures, with no more until the last (33 millionth) record. The random id could be any of those 33 million, but unless it is less than or equal to 1000, it will be the 33 millionth record that is returned.
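One way to do that rounding (my own sketch, intended as a drop-in replacement for the sub select above):

-- FLOOR makes every id in [MIN(id), MAX(id)] equally likely,
-- including the lowest picture id.
SELECT FLOOR(RAND() * (MAX(id) - MIN(id) + 1)) + MIN(id) AS id
FROM `table`
WHERE `picture` IS NOT NULL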
You might try attaching a random number to each row, then sorting by that. The row with the lowest number will be at the top.
SELECT `table`.`id`, RAND() as `order`
FROM `table`
WHERE `picture` IS NOT NULL
ORDER BY `order`
LIMIT 1;
This is of course slower than just magicking up an ID with RAND(), but (1) it'll always give you a valid ID (as long as there's a record with a non-null picture field in the table, anyway), and (2) the WTF ratio is pretty low; most people can tell what's going on here. :) Its performance rivals Kickstart's solution with a decently indexed table, when the number of items to select from is relatively small (around 1%). Definitely don't try to select from a whole huge table like this; limit it first with a WHERE clause on some indexed field(s).
Performance-wise, if you have a long-running app (i.e., not PHP; I'm talking about Java, .NET, etc., where the app stays alive even between requests), you might try keeping a list of all the IDs of items with pictures, selecting a random ID from that list, and loading the article. You could do that in PHP too, if you wanted. It might not work as well when you have to query all the IDs each time, but it could be very useful if you can cache the list of IDs in APC or something.
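A minimal sketch of that idea, assuming the id list fits comfortably in memory:

-- Run once (or on a schedule) and cache the result in the application:
SELECT id FROM `table` WHERE `picture` IS NOT NULL;

-- Later, after picking one cached id at random in application code:
SELECT * FROM `table` WHERE id = ?;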
For performance, you can first add an index on the picture column so the 816 picture records can be picked out directly while executing the query, and then fire your query.
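For example (assuming picture is not a BLOB/TEXT column, which would need a prefix length; the index name is my own):

ALTER TABLE `table` ADD INDEX idx_picture (`picture`);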
How has someone else solved the problem?
I would suggest looking at this article about different possible ways of selecting random rows in MySQL.
Modified example from the article
SELECT name
FROM random
JOIN (
    SELECT CEIL( RAND() * (
        SELECT MAX( id ) FROM random WHERE picture IS NOT NULL
    ) ) AS id
) AS r2 USING ( id );
This might work in your case.
Efficiency
As user Kickstart mentioned: do you have an index on the column picture? This might help get you the results a bit faster.
Are your tables optimized?
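If not, these maintenance statements may be worth a try (table name taken from the example above):

OPTIMIZE TABLE random;
ANALYZE TABLE random;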
I have a table named countwronganswer with columns cwa_id and question_num. How can I generate, with a query, a table that shows two columns: one column listing all the question_num values, and a second column listing the number of cwa_id entries related to each question_num?
Question Number | Total # of Mistakes
1               | 12
2               | 22
...etc
ATTENTION: This question was asked without awareness of the existence of COUNT() or GROUP BY, because of my knowledge level at that time. COUNT() and GROUP BY were the key to generating the second column of totals, which I was completely unaware of; therefore any attempt, at that point in time, to write the code for the data would have been close to meaningless. Vote up if you think it's useful or it resolved your issue.
Probably something like this
SELECT question_num, COUNT(cwa_id) total_mistakes
FROM countwronganswer
GROUP BY question_num
SELECT question_num, COUNT(cwa_id)
FROM tableName
GROUP BY question_num
SELECT DISTINCT `Stock`.`ProductNumber`, `Stock`.`Description`,
       `TComponent_Status`.`component`, `TComponent_Status`.`certificate`,
       `TComponent_Status`.`status`, `TComponent_Status`.`date_created`
FROM Stock, TBOM, TComponent_Status
WHERE `TBOM`.`Component` = `TComponent_Status`.`component`
  AND `Stock`.`ProductNumber` = `TBOM`.`Product`
Basically, table TBOM has:
24,588,820 rows
The query is ridiculously slow, and I'm not too sure what I can do to make it better. I have indexed all the other tables in the query, but TBOM has a few duplicates in its columns so I can't even run that command. I'm a little baffled.
To start, index the following fields:
TBOM.Component
TBOM.Product
TComponent_Status.component
Stock.ProductNumber
Not all of the above indexes may be necessary (e.g., the last two), but it is a good start.
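A sketch of those indexes (the index names are my own):

ALTER TABLE TBOM ADD INDEX idx_tbom_component (Component);
ALTER TABLE TBOM ADD INDEX idx_tbom_product (Product);
ALTER TABLE TComponent_Status ADD INDEX idx_tcs_component (component);
ALTER TABLE Stock ADD INDEX idx_stock_productnumber (ProductNumber);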
Also, remove the DISTINCT if you don't absolutely need it.
The only thing I can really think of is having an index on your Stock table on
(ProductNumber, Description)
This can help in two ways. Since you are only using those two fields in the query, the engine won't be required to go to the full data row of each stock record, since both parts are in the index; it can use that. Additionally, you are doing DISTINCT, so having the index available to help optimize the DISTINCT should also help.
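For example (the index name is my own):

ALTER TABLE Stock ADD INDEX idx_stock_pn_desc (ProductNumber, Description);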
Now, the other issue for time. Since you are doing a distinct from stock to product to product status, you are asking for all 24 million TBOM items (I assume a bill of materials), and since each BOM component could have multiple statuses created, you are getting every BOM for EVERY component change.
If what you are really looking for is something like the most recent change of any component item, you might want to do it in reverse... Something like...
SELECT DISTINCT
      Stock.ProductNumber,
      Stock.Description,
      JustThese.component,
      JustThese.certificate,
      JustThese.`status`,
      JustThese.date_created
FROM
      ( SELECT DISTINCT
              TCS.component,
              TCS.certificate,
              TCS.`status`,
              TCS.date_created
        FROM
              TComponent_Status TCS
        WHERE
              TCS.date_created >= 'some date you want to limit based upon' ) AS JustThese
JOIN TBOM
      ON JustThese.component = TBOM.Component
JOIN Stock
      ON TBOM.Product = Stock.ProductNumber
If this is the case, I would ensure an index on the component status table, something like
( date_created, component, certificate, status ) as the index. This way the WHERE clause would be optimized, and the DISTINCT would be too, since those pieces are already part of the index.
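In DDL form, something like this (the index name is my own):

ALTER TABLE TComponent_Status
    ADD INDEX idx_tcs_created (date_created, component, certificate, `status`);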
But, how you currently have it, if you have 10 TBOM entries for a single "component", and that component has 100 changes, you now have 10 * 100 = 1,000 entries in your result set. Scale that across 24 million rows, and it's definitely not going to look good.
I am trying to find a way to get a random selection from a large dataset.
We expect the set to grow to ~500K records, so it is important to find a way that keeps performing well while the set grows.
I tried a technique from http://forums.mysql.com/read.php?24,163940,262235#msg-262235 but it's not exactly random, and it doesn't play well with a LIMIT clause; you don't always get the number of records that you want.
So I thought, since the PK is auto_increment, I could just generate a list of random ids and use an IN clause to select the rows I want. The problem with that approach is that sometimes I need a random set of records having a specific status, a status that is found in at most 5% of the total set. To make that work I would first need to find out which IDs have that specific status, so that's not going to work either.
I am using mysql 5.1.46, MyISAM storage engine.
It might be important to know that the query to select the random rows is going to be run very often and the table it is selecting from is appended to frequently.
Any help would be greatly appreciated!
You could solve this with some denormalization:
Build a secondary table that contains the same pkeys and statuses as your data table
Add and populate a status group column which will be a kind of sub-pkey that you auto number yourself (1-based autoincrement relative to a single status)
Pkey   Status   StatusPkey
1      A        1
2      A        2
3      B        1
4      B        2
5      C        1
...    C        ...
n      C        m    (where m = # of C statuses)
When you don't need to filter, you can generate random numbers against the pkey as you mentioned above. When you do need to filter, generate them against the StatusPkeys of the particular status you're interested in.
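A sketch of the filtered case (the table and column names are my own, matching the layout above; COUNT(*) equals the highest StatusPkey because the numbering is dense):

SELECT d.*
FROM (
    -- Evaluated once: a random StatusPkey in [1, m] for status 'C'.
    SELECT FLOOR(1 + RAND() * (SELECT COUNT(*) FROM status_map WHERE Status = 'C')) AS pick
) AS r
JOIN status_map sm ON sm.Status = 'C' AND sm.StatusPkey = r.pick
JOIN data d ON d.Pkey = sm.Pkey;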
There are several ways to build this table. You could have a procedure that runs on an interval, or you could do it live. The latter would be a performance hit, though, since calculating the StatusPkey could get expensive.
Check out this article by Jan Kneschke... It does a great job at explaining the pros and cons of different approaches to this problem...
You can do this efficiently, but you have to do it in two queries.
First get a random offset scaled by the number of rows that match your 5% conditions:
SELECT FLOOR(RAND() * (SELECT COUNT(*) FROM MyTable WHERE ...conditions...))
This returns an integer. Next, use the integer as an offset in a LIMIT expression:
SELECT * FROM MyTable WHERE ...conditions... LIMIT 1 OFFSET ?
Not every problem must be solved in a single SQL query.