MySQL query using WHERE clause with 24 million rows

SELECT DISTINCT `Stock`.`ProductNumber`,`Stock`.`Description`,`TComponent_Status`.`component`, `TComponent_Status`.`certificate`,`TComponent_Status`.`status`,`TComponent_Status`.`date_created`
FROM Stock , TBOM , TComponent_Status
WHERE `TBOM`.`Component` = `TComponent_Status`.`component`
AND `Stock`.`ProductNumber` = `TBOM`.`Product`
Basically, table TBOM has:
24,588,820 rows
The query is ridiculously slow, and I'm not sure what I can do to make it better. I have indexed all the other tables in the query, but TBOM has a few duplicate values in its columns, so I can't even run that command. I'm a little baffled.

To start, index the following fields:
TBOM.Component
TBOM.Product
TComponent_Status.component
Stock.ProductNumber
Not all of the above indexes may be necessary (e.g., the last two), but they are a good start; a sketch of the corresponding statements is below.
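A minimal sketch of those statements, with illustrative index names (note that a plain, non-unique index is fine even where the column contains duplicates):
ALTER TABLE TBOM ADD INDEX idx_tbom_component (Component);
ALTER TABLE TBOM ADD INDEX idx_tbom_product (Product);
ALTER TABLE TComponent_Status ADD INDEX idx_tcs_component (component);
ALTER TABLE Stock ADD INDEX idx_stock_productnumber (ProductNumber);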
Also, remove the DISTINCT if you don't absolutely need it.

The only thing I can really think of is having an index on your Stock table on
(ProductNumber, Description)
This can help in two ways. Since you are only using those two fields in the query, the engine won't need to read the full data row of each Stock record; both columns are in the index, so it can use the index alone. Additionally, since you are doing DISTINCT, having the index available to optimize the DISTINCT should also help.
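For example (the index name is illustrative):
ALTER TABLE Stock ADD INDEX idx_stock_prod_desc (ProductNumber, Description);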
Now, the other issue: time. Since you are going distinct from stock to product to product status, you are asking for all 24 million TBOM items (assuming a bill of materials), and since each BOM component could have multiple status rows created, you are getting every BOM row for EVERY component change.
If what you are really looking for is something like the most recent change of any component item, you might want to do it in reverse... Something like...
SELECT DISTINCT
    Stock.ProductNumber,
    Stock.Description,
    JustThese.component,
    JustThese.certificate,
    JustThese.`status`,
    JustThese.date_created
FROM
    ( SELECT DISTINCT
          TCS.component,
          TCS.certificate,
          TCS.`status`,
          TCS.date_created
      FROM
          TComponent_Status TCS
      WHERE
          TCS.date_created >= 'some date you want to limit based upon' ) AS JustThese
JOIN TBOM
    ON JustThese.component = TBOM.Component
JOIN Stock
    ON TBOM.Product = Stock.ProductNumber
If this is the case, I would ensure an index on the component status table, something like (date_created, component, certificate, status). This way the WHERE clause would be optimized, and the DISTINCT would be too, since those pieces are already part of the index.
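For example (the index name is illustrative):
ALTER TABLE TComponent_Status
    ADD INDEX idx_tcs_created (date_created, component, certificate, status);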
But the way you currently have it, if you have 10 TBOM entries for a single "component", and that component has 100 changes, you now have 10 * 100 = 1,000 entries in your result set. Scale that across 24 million rows, and it's definitely not going to look good.

Related

I always have a "WHERE date" in all my SQL queries. Speed up?

I have a large table with hundreds of thousands of rows, but only about 50,000 of them are actually "active" and part of my queries, because I only select the rows that have been updated in the last 14 days with WHERE crdate > "2014-08-10". So, to speed up queries against the table, I'm considering the following options (or maybe you have another suggestion?); which one is the best?
I can delete all old entries and insert them into a "history" table with a cronjob running every day/week. However, this will still leave the history table slow if I want to query it.
I can make an index on my "crdate" column. However, my dates are in the format "2014-08-10 06:32:59", so I guess that because it stores so many different values, the index will be quite large(?) and potentially slow(?).
Do you have any other suggestions for how I can speed up queries against this table? Is it a bad idea to set an index on a date column that has so many different values?
First rule of databases: always have indexes on the columns you are filtering on.
So yes, put an index on crdate.
You can also keep a history table in parallel, but make sure you put an index on the crdate column in the history table too. Having the history table will allow you to have a smaller index in the main table.
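For example (the index name is illustrative, and since the question doesn't name the table, MyTable is a placeholder):
ALTER TABLE MyTable ADD INDEX idx_crdate (crdate);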
I wanted to add to this for future Googlers: if you are querying a datetime column, a more specific literal will result in a more efficient query. For example:
SELECT * FROM MyTable WHERE MyDateTime = '01/01/2015 00:00:00'
Will be faster than:
SELECT * FROM MyTable WHERE MyDateTime = '01/01/2015'
I tested this repeatedly on a view of 5 million rows indexed by datetime; the more specific query gave me a response about 1 second quicker.

What SQL indexes to put on a big table

I have two big tables from which I mostly select, but complex queries with 2 joins are extremely slow.
First table is GameHistory in which I store records for every finished game (I have 15 games in separate table).
Fields: id, date_end, game_id, ..
Second table is GameHistoryParticipants in which I store records for every player participated in certain game.
Fields: player_id, history_id, is_winner
The query to get today's top players is very slow (20+ seconds).
Query:
SELECT p.nickname, count(ghp.player_id) as num_games_today
FROM `GameHistory` as gh
INNER JOIN GameHistoryParticipants as ghp ON gh.id=ghp.history_id
INNER JOIN Players as p ON p.id=ghp.player_id
WHERE TIMESTAMPDIFF(DAY, gh.date_end, NOW())=0 AND gh.game_id='scrabble'
GROUP BY ghp.player_id ORDER BY count(ghp.player_id) DESC LIMIT 10
First table has 1.5 million records and the second one 3.5 million.
What indexes should I put on them? (I tried some and it was all slow.)
You are only interested in today's records. However, you search the whole GameHistory table with TIMESTAMPDIFF to detect those records. Even if you have an index on that column, it cannot be used, because you apply a function to the field.
You should have an index on the two fields game_id and date_end together. Then ask for the date_end value directly:
WHERE gh.date_end >= DATE(NOW())
AND gh.date_end < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
AND gh.game_id = 'scrabble'
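The composite index could be created like this (the name is illustrative; game_id comes first because it is compared for equality, date_end second because it is a range):
ALTER TABLE GameHistory ADD INDEX idx_game_end (game_id, date_end);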
It would be even better to have an index on date_end's date part rather than on the full date-and-time value. That is not possible in MySQL, however, so consider adding another column, trunc_date_end, holding the date part alone, which you'd fill with a before-insert trigger. Then you'd have an index on trunc_date_end and game_id, which should help you find the desired records in no time (a sketch follows the WHERE clause below).
WHERE gh.trunc_date_end = DATE(NOW())
AND gh.game_id = 'scrabble'
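A sketch of the extra column, trigger, and index (all names are illustrative; existing rows would need a one-off backfill such as UPDATE GameHistory SET trunc_date_end = DATE(date_end);):
ALTER TABLE GameHistory ADD COLUMN trunc_date_end DATE;

CREATE TRIGGER trg_gh_trunc_date
BEFORE INSERT ON GameHistory
FOR EACH ROW
SET NEW.trunc_date_end = DATE(NEW.date_end);

ALTER TABLE GameHistory ADD INDEX idx_trunc_game (trunc_date_end, game_id);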
Add the EXPLAIN keyword at the beginning of your query, then run it in a database viewer (e.g., SQLyog) and you will see the details of how the query executes. Look at the 'rows' column; you will see different integer values showing how many rows each step examines. Index the table columns that the EXPLAIN result shows scanning large numbers of rows.
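For example, prefixing the query from the question:
EXPLAIN
SELECT p.nickname, count(ghp.player_id) as num_games_today
FROM `GameHistory` as gh
INNER JOIN GameHistoryParticipants as ghp ON gh.id=ghp.history_id
INNER JOIN Players as p ON p.id=ghp.player_id
WHERE TIMESTAMPDIFF(DAY, gh.date_end, NOW())=0 AND gh.game_id='scrabble'
GROUP BY ghp.player_id ORDER BY count(ghp.player_id) DESC LIMIT 10;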
(I think my explanation is kinda messy; you can ask for clarification.)

I want to extract a random id from a MySQL database

I am trying to extract a random article that has a picture from a database.
SELECT FLOOR(MAX(id) * RAND()) FROM `table` WHERE `picture` IS NOT NULL
My table is 33 MB and has 1,006,394 articles, but just 816 with pictures.
My problem is that this query takes 0.4640 sec.
I need this to be much, much faster.
Any idea is welcome.
P.S.
1. Of course I have an index on id.
2. There is no index on the picture field; should I add one?
3. The product name is unique, as is the product number, but that's out of the question.
RESULT OF TESTING SESSION:
@cHao's solution is faster when I use it to select one of the random entries with a picture (less than 0.1 sec), but it's slower if I try to do the opposite and select a random article without a picture: 2-3 sec.
@Kickstart's solution is a bit slower when trying to find an entry with a picture, but is almost the same speed when trying to find an entry without a picture: 0.149 sec on average.
@bob-kruithof's solution doesn't work for me: when trying to find an entry with a picture, it selects an entry without a picture.
And @ganesh-bora, yes, you are right; in my case the speed difference is about 5-15 times.
I want to thank you all for your help, and I decided on @Kickstart's solution.
You need to get a range of values with matching records and then find a matching record within that range.
Something like this:-
SELECT r1.id
FROM `table` AS r1
INNER JOIN (
SELECT RAND( ) * ( MAX( id ) - MIN( id ) ) + MIN( id ) AS id
FROM `table`
WHERE `picture` IS NOT NULL
) AS r2
ON r1.id >= r2.id
WHERE `picture` IS NOT NULL
ORDER BY r1.id ASC
LIMIT 1
However, for any hope of efficiency you need an index on the field being checked (i.e., picture in your example).
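A possible statement, with an illustrative index name (if picture is a TEXT or BLOB column, MySQL requires a prefix length, e.g. picture(8)):
ALTER TABLE `table` ADD INDEX idx_picture (picture);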
Just an explanation of how this works.
The sub-select finds a random id from the table which is between the min and max ids of picture records. This random id may or may not belong to a picture.
The resulting id from this sub-select is joined back against the main table, but using >= and with a WHERE clause specifying that the record is a picture record. Hence it joins against all picture records whose id is greater than or equal to the random id. The highest possible random id is the one belonging to the picture record with the highest id, so it will always find a record (if there are any picture records). The ORDER BY / LIMIT is then used to bring back that single id.
Note that there is an obvious flaw to this, but most of the time it will be irrelevant: the record retrieved is not entirely random. The picture with the lowest id is unlikely to be returned (it will only be returned if RAND() returns exactly 0), but if this matters it is easy enough to fix by rounding the resulting random id. The other flaw is that if the ids are not roughly evenly distributed across the full range of ids, then some will be returned more often than others. For example, take the situation where the first 1,000 ids are pictures, with no more until the last (33 millionth) record. The random id could be any of those 33 million, but unless it is less than or equal to 1,000, it will be the 33 millionth record that is returned.
You might try attaching a random number to each row, then sorting by that. The row with the lowest number will be at the top.
SELECT `table`.`id`, RAND() as `order`
FROM `table`
WHERE `picture` IS NOT NULL
ORDER BY `order`
LIMIT 1;
This is of course slower than just magicking up an ID with RAND(), but (1) it'll always give you a valid ID (as long as there's a record with a non-null picture field in the table, anyway), and (2) the WTF ratio is pretty low; most people can tell what's going on here. :) Its performance rivals Kickstart's solution on a decently indexed table when the number of items to select from is relatively small (around 1%). Definitely don't try to select from a whole huge table like this; limit it first with a WHERE clause on some indexed field(s).
Performance-wise, if you have a long-running app (i.e., not PHP; I'm talking about Java, .NET, etc., where the app stays alive even between requests), you might try keeping a list of all the IDs of items with pictures, selecting a random ID from that list, and loading the article. You could do that in PHP too, if you wanted. It might not work as well when you have to query all the IDs each time, but it could be very useful if you can cache the list of IDs in APC or something.
For performance, you can first add an index on the picture column, so the ~800 records with pictures can be picked out directly when the query executes, and then fire your query.
How has someone else solved the problem?
I would suggest looking at this article about the different possible ways of selecting random rows in MySQL.
Modified example from the article
SELECT name
FROM random
JOIN ( SELECT CEIL( RAND() * (
           SELECT MAX( id ) FROM random WHERE picture IS NOT NULL
       ) ) AS id
     ) AS r2
USING ( id );
This might work in your case.
Efficiency
As user Kickstart mentioned: do you have an index on the column picture? This might help you get the results a bit faster.
Are your tables optimized?

Very slow query, any other ways to format this with better performance?

I have this query (I didn't write it) that was working fine for a client until the table got more than a few thousand rows in it; now it's taking 40+ seconds on only 4200 rows.
Any suggestions on how to optimize it and get the same result?
I've tried a few other methods, but they didn't return the correct result that this slower query does...
SELECT COUNT(*) AS num
FROM `fl_events`
WHERE id IN(
SELECT DISTINCT (e2.id)
FROM `fl_events` AS e1, fl_events AS e2
WHERE e1.startdate >= now() AND e1.startdate = e2.startdate
)
ORDER BY `startdate`
Any help would be greatly appreciated!
Apart from the obvious indexes needed, I don't really get why you are joining your table with itself to build the IN condition. The ORDER BY is also not needed. Are you sure your query can't be written just like this?:
SELECT COUNT(*) AS num
FROM `fl_events` AS e1
WHERE e1.startdate >= now()
I don't think rewriting the query will help. The key to your question is "until the table got more than a few thousand rows." This implies that important columns aren't indexed. Below a certain number of records, all the data fits on a single memory block; beyond that point it takes multiple blocks, and an index is the only way to speed up the search.
First, check that the id in fl_events is actually marked as a primary key. That physically orders the records, and without it you can see data corruption and occasionally super-slow results. The use of DISTINCT in the query makes it look like id might NOT be a unique value, which would pose a problem.
Then, make sure to add an index on startdate.
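A minimal sketch (the index name is illustrative):
ALTER TABLE fl_events ADD INDEX idx_startdate (startdate);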
The slowness is probably related to the join of the event table with itself, and possibly startdate not having an index.

SQL ORDER BY performance

I have a table with more than 1 million records. The problem is that the query takes too much time, around 5 minutes. The ORDER BY is my problem: I need the expression in the ORDER BY to rank the most popular videos, but because it is an expression I can't create an index on it.
How can I resolve this problem?
Thanks.
SELECT DISTINCT
`v`.`id`,`v`.`url`, `v`.`title`, `v`.`hits`, `v`.`created`, ROUND((r.likes*100)/(r.likes+r.dislikes),0) AS `vote`
FROM
`videos` AS `v`
INNER JOIN
`votes` AS `r` ON v.id = r.id_video
ORDER BY
(v.hits+((r.likes-r.dislikes)*(r.likes-r.dislikes))/2*v.hits)/DATEDIFF(NOW(),v.created) DESC
Does the most-popular list have to be calculated every time? I doubt the answer is yes. Some operations will take a long time to run no matter how efficient your query is.
Also bear in mind that you have 1 million rows now; you might have 10 million in the next few months. So the query might work now but not in a month; the solution needs to be scalable.
I would create a job that runs every couple of hours to calculate and store this information in a different table (a sketch is below). This might not be the answer you are looking for, but I just had to say it.
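A sketch of that approach, assuming the MySQL event scheduler is enabled (event_scheduler = ON) and using illustrative names; GREATEST(..., 1) guards against dividing by zero for videos created today:
CREATE TABLE video_popularity (
    id INT PRIMARY KEY,
    score DOUBLE,
    INDEX idx_score (score)
);

CREATE EVENT refresh_video_popularity
ON SCHEDULE EVERY 2 HOUR
DO
    REPLACE INTO video_popularity (id, score)
    SELECT v.id,
           (v.hits + ((r.likes - r.dislikes) * (r.likes - r.dislikes)) / 2 * v.hits)
               / GREATEST(DATEDIFF(NOW(), v.created), 1)
    FROM videos AS v
    INNER JOIN votes AS r ON v.id = r.id_video;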
What I have done in the past is to create a voting system based on Integers.
Nothing will outperform integers.
The voting system table has 2 Columns:
ProductID
VoteCount (INT)
The VoteCount column stores the running total of all votes that are submitted:
Like = +1
Dislike = -1
Create an index on the vote table based on ProductID.
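A minimal sketch of that table (names are illustrative; the PRIMARY KEY doubles as the index on the ID):
CREATE TABLE ProductVotes (
    ProductID INT PRIMARY KEY,
    VoteCount INT NOT NULL DEFAULT 0
);

-- A like adds 1, a dislike subtracts 1 (ProductID 42 is a placeholder).
UPDATE ProductVotes SET VoteCount = VoteCount + 1 WHERE ProductID = 42;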
You have two alternatives to improve this:
1) Create a new column with the needed value pre-calculated.
2) Create a second table that holds the videos' primary key and the result of the calculation.
The first could be a calculated column; either way, modify your app or add triggers to keep it in sync (you'd need to load it manually the first time, and later let your program keep it updated). A sketch of the first alternative follows.
If you use the second option, your key could be composed of the finalRating plus the primary key of the videos table. This way your searches would be hugely improved.
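A sketch of the first alternative (column and index names are illustrative; the app or a trigger must refresh final_rating whenever hits or votes change):
ALTER TABLE videos ADD COLUMN final_rating DOUBLE;
ALTER TABLE videos ADD INDEX idx_final_rating (final_rating);

-- The ORDER BY can then walk the index instead of sorting:
-- SELECT id, url, title FROM videos ORDER BY final_rating DESC LIMIT 10;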
Have you tried moving the arithmetic of the ORDER BY into your SELECT, and then ordering by the virtual column? Such as:
SELECT (col1+col2) AS a
FROM TABLE
ORDER BY a
Arithmetic during the sort is expensive.