MySQL Performance: Which of the queries will take more time?

I have two tables:
1. user table with around 10 million rows
columns: token_type, cust_id (primary key)
2. pm_tmp table with 200k rows
columns: id (primary key, auto-increment), user_id
user_id is a foreign key referencing cust_id
1st Approach/Query:
update user set token_type='PRIME'
where cust_id in (select user_id from pm_tmp where id between 1 AND 60000);
2nd Approach/Query: Here we would run the query below individually for each cust_id, repeating it for 60,000 records:
update user set token_type='PRIME' where cust_id='1111110';

Theoretically, the first query should take less time, as it involves fewer commits and in turn fewer index rebuilds. However, I would recommend going with the second option, since it is more controlled, will appear to take less time, and you can even consider executing two separate sets in parallel.
Note: The first query will need sufficient memory provisioned for MySQL buffers to execute quickly. The second approach, being a set of independent single-transaction queries, needs comparatively less memory and hence will appear faster when executed in limited-memory environments.
Well, you could rewrite the first query this way too:
update user u, pm_tmp p set u.token_type='PRIME' where u.cust_id=p.user_id and p.id between 1 and 60000;

Some versions of MySQL have trouble optimizing IN with a subquery. I would recommend:
update user u join
pm_tmp pt
on u.cust_id = pt.user_id and pt.id between 1 AND 60000
set u.token_type = 'PRIME';
(Note: This assumes that user_id is not repeated in pm_tmp. If that is possible, you will want a SELECT DISTINCT subquery.)
Your second version would normally be considerably slower, because it requires executing thousands of queries instead of one. One consideration might be the update itself: perhaps the logging and locking get more complicated as the number of updates increases. I don't know enough about MySQL internals to say whether that would have a significant impact on performance.

IN ( SELECT ... ) is poorly optimized. (I can't provide specifics because both UPDATE and IN have been better optimized in some recent version(s) of MySQL.) Suffice it to say "avoid IN ( SELECT ... )".
Your first sentence should say "rows" instead of "columns".
Back to the rest of the question. 60K is too big a chunk; I recommend only 1000. Aside from that, Gordon's answer is probably the best.
But... You did not use OFFSET; Do not be tempted to use it; it will kill performance as you go farther and farther into the table.
Another thing: COMMIT after each chunk. Otherwise you build up a huge undo log, which adds to the cost. (This is one reason why 1K chunks may be faster than 60K.)
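As a rough sketch of that chunking (the boundaries are illustrative and assume pm_tmp.id is reasonably dense; the loop would be driven from application code or a stored procedure):
update user u join pm_tmp pt on u.cust_id = pt.user_id
set u.token_type = 'PRIME'
where pt.id between 1 and 1000;     -- chunk 1
commit;
update user u join pm_tmp pt on u.cust_id = pt.user_id
set u.token_type = 'PRIME'
where pt.id between 1001 and 2000;  -- chunk 2
commit;
-- ...continue in 1000-row chunks up to 60000, committing after each chunk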
But wait! Why are you updating a huge table? That is usually a sign of bad schema design. Please explain the data flow.
Perhaps you have computed which items to flag as 'prime'? Well, you could keep that list around and do JOINs in the SELECTs to discover prime-ness when reading. This completely eliminates the UPDATE in question. Sure, the JOIN costs something, but not much.
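A minimal sketch of that read-time join, assuming the "prime" list stays in pm_tmp (the derived column name is illustrative):
select u.cust_id,
       if(pt.user_id is null, u.token_type, 'PRIME') as effective_token_type
from user u
left join pm_tmp pt on pt.user_id = u.cust_id;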

Related

Efficiency of query using NOT IN()?

I have a query that runs on my server:
DELETE FROM pairing WHERE id NOT IN (SELECT f.id FROM info f)
It takes two different tables, pairing and info, and deletes all entries from pairing whose id is not present in info.
I've run into an issue on the server where this is beginning to take too long to execute, and I believe it has to do with the efficiency of the query (or the lack of constraints in the SELECT statement).
However, I took a look at the MySQL slow_log and the number of compared entries is actually LOWER than it should be. From my understanding, this should be O(mn) time where m is the number of entries in pairing and n is the number of entries in info. The number of entries in pairing is 26,868 and in info is 34,976.
This should add up to 939,735,168 comparisons. But the slow_log is saying there are only 543,916,401: a bit more than half that amount.
I was wondering if someone could please explain to me how the efficiency of this specific query works. I realize the fact that it's performing quicker than I think it should is a blessing in this case, but I still need to understand where the optimization comes from so that I can further improve upon it.
I haven't used the slow query log much (at all), but isn't it possible that the difference can just be chalked up to simple averaging? Basically, 939,735,168 is the theoretical worst-case scenario, where every lookup checks every single row before finding the one it needs. Realistically, with a roughly even distribution (and no use of indexing), a check of a row in pairing will on average compare against half the rows in info.
It looks like your real-world performance is only about 15% worse than what would be expected from that "average comparisons" figure.
Edit: Actually, "worse than expected" should be expected when you have rows in pairing that are not in info: those rows have to be compared against every row in info rather than half of it, which skews the number of comparisons upward.
...which is still not great. If you have id indexed in both tables, something like this should work a lot faster.
DELETE pairing
FROM pairing
LEFT JOIN info ON pairing.id = info.id
WHERE info.id IS NULL;
This should take advantage of an index on id to bring the comparisons needed down to something like O(N log M).
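If id is not already the primary key in both tables, the supporting indexes could look like this (index names are illustrative):
ALTER TABLE pairing ADD INDEX idx_pairing_id (id);
ALTER TABLE info ADD INDEX idx_info_id (id);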

How to speed up InnoDB count(*) query?

There are a number of similar questions on here, but a lot of the answers say to force the use of an index and that doesn't seem to speed anything up for me.
I am wanting to show a "live" counter on my website showing the number of rows in a table. Kind of like how some websites show the number of registered users, or some other statistic, in "real time" (i.e. updated frequently using ajax or websockets).
My table has about 5M rows. It's growing fairly quickly and there is a high volume of inserts and deletes on it. Running
select count(*) from my_table
This takes 1.367 seconds, which is unacceptable because I need my application to get an updated row count about once per second.
I tried what many of the answers on here suggest and changed the query to:
select count(*) from my_table use index(my_index)
where my_index is a normal BTREE index on a BIGINT column. But the time actually increased, to 1.414 seconds.
Why doesn't using an index speed up the query as many answers on here said it would?
Another option some answers suggest is to put a trigger on the table that increments a column in another table. So I could create a stats table and whenever a row is inserted or deleted in my_table have a trigger increment or decrement a column in the stats table. Is this the only other option, since using an index doesn't seem to work?
EDIT: Here's a perfect example of the type of thing I'm trying to accomplish: https://www.freelancer.com. Scroll to the bottom of the page and you'll see a set of live counters. Those numbers update every second or so.
It takes time to read 5 million records and count them -- whether in an index or in the raw data form.
If a "fast-and-dirty" solution is acceptable, you can use metadata:
SELECT table_rows
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = <whatever> and TABLE_NAME = <whatever2>;
Note that this can get out-of-sync.
Another possibility is to partition the table into smaller chunks. One advantage is that if the inserts and deletes tend to be to one partition, you can just count that and use metadata for the other partitions.
A trigger may or may not help in this situation, depending on the insert/delete load. If you are doing multiple inserts per minute, then a trigger is a no-brainer -- a fine solution. If you are doing dozens or hundreds of changes per second, then the overhead of the trigger might slow down the server.
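For reference, a minimal sketch of that trigger-based counter (the stats table and trigger names are assumptions, not part of the question):
CREATE TABLE my_table_stats (row_count BIGINT NOT NULL);
INSERT INTO my_table_stats (row_count) SELECT COUNT(*) FROM my_table;

CREATE TRIGGER my_table_count_ins AFTER INSERT ON my_table
FOR EACH ROW UPDATE my_table_stats SET row_count = row_count + 1;

CREATE TRIGGER my_table_count_del AFTER DELETE ON my_table
FOR EACH ROW UPDATE my_table_stats SET row_count = row_count - 1;

-- Reading the count is then a single-row lookup:
SELECT row_count FROM my_table_stats;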
If your system is so busy that the counting is having too much impact, then probably the INSERTing/DELETEing is also having impact. One way to improve INSERT/DELETE is to do them in 'batches' instead of one at a time.
Gather the INSERTs, preferably in the app, but optionally in a 'staging' table. Then, once a second (or whatever) copy them into the real table using an INSERT..SELECT, or (if needed) INSERT..ON DUPLICATE KEY UPDATE. DELETEs can go into the same table (with a flag) or a separate table.
The COUNT(*) can be done at the end of the batch. Or it could be dead reckoned (at much lower cost) by knowing what the count was, then adjusting by what the staging table(s) will change it by.
This is a major upheaval to your app code, so don't embark on it unless you have spikes of, say, >100 INSERTs/DELETEs per second. (A steady 100 INSERTs/sec = 3 billion rows per year.)
For more details on the "staging table" approach, see http://mysql.rjweb.org/doc.php/staging_table. Note that that blog advocates flip-flopping between a pair of staging tables, so as to minimize locks on them and to allow multiple clients to coexist.
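A rough sketch of that batching idea, under the assumptions above (table names are illustrative):
CREATE TABLE my_table_staging LIKE my_table;

-- Application INSERTs go to my_table_staging. Once a second, flush the batch:
-- (in practice, swap in a fresh staging table with RENAME TABLE first, as the
--  blog's flip-flop scheme does, so no rows arrive between the copy and the delete)
INSERT INTO my_table
SELECT * FROM my_table_staging;
DELETE FROM my_table_staging;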
Have a job running in the background that does the following; then use its table for getting the count:
Loop:
INSERT INTO Counter (ct_my_table)
SELECT COUNT(*) FROM my_table;
sleep 1 second
end loop
At worst, it will be a couple of seconds out of date. Also note that INSERTs and DELETEs interfere with (read: slow down) the SELECT COUNT(*), hence the "sleep".
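If you prefer to keep that loop inside MySQL itself, the event scheduler can run it; a hedged sketch (assumes the Counter table above exists and event_scheduler is ON; the event name is illustrative):
CREATE EVENT refresh_row_count
ON SCHEDULE EVERY 1 SECOND
DO
  INSERT INTO Counter (ct_my_table)
  SELECT COUNT(*) FROM my_table;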
Have you noticed that some UIs say "About 120,000 thingies"? They are using even cruder estimations. But it is usually good enough for the users.
Take the inaccurate value from information_schema, as Gordon Linoff suggested.
Another inaccurate source of the row count is SELECT MAX(id) - MIN(id).
Or create a table my_table_count where you store the row count of my_table and keep it updated with triggers.
In many cases you don't need an accurate value. Who cares if you show 36,400 users instead of the accurate 36,454?

How to deal with a rapidly growing MySQL table

I have two tables in a MySQL database:
tbl_comments
tbl_votes
When a user clicks a Like or Dislike button under a comment, a new row is inserted into tbl_votes with comment_id, user_id and vote_type. That means if 100 users click Like or Dislike on 100 comments per day, 10,000 rows per day are inserted into tbl_votes. So, with an increasing number of users and votes, tbl_votes will grow rapidly, and once there are, say, 100,000,000 rows in tbl_votes, won't that affect performance and slow down the SQL queries?
How should I deal with this design, or is there a better solution?
This is a perfectly fine solution.
As long as you have the indexes set correctly, it's okay (an index on the primary key, and one on comment_id).
Take, for example, Stack Overflow: every post, reply and comment has its own voting system, up or down, and remembers who voted; they have about 200 million+ posts and replies, each with their own votes, and the site still responds quickly.
As long as the indexes are set correctly, it should perform just fine. I might suggest using a BIGINT for the primary key though...
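For concreteness, a hedged sketch of such a table (the column types, the vote_type encoding, and the unique key are assumptions for illustration, not the poster's schema):
CREATE TABLE tbl_votes (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  comment_id INT UNSIGNED NOT NULL,
  user_id INT UNSIGNED NOT NULL,
  vote_type ENUM('like','dislike') NOT NULL,
  UNIQUE KEY uq_comment_user (comment_id, user_id),  -- one vote per user per comment (assumed rule)
  KEY idx_user (user_id)
) ENGINE=InnoDB;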
I would not worry about application performance with 1 billion rows on a machine that can keep the indexes in memory.
Performance depends on:
How many joins those queries do
How well your indexes are set up
How much RAM is in the machine
Speed and number of processors
Type and spindle speed of hard drives
Size of the row / amount of data returned in the query
Some conclusions:
If you go with an RDBMS:
It doesn't really matter how many rows you insert into the table if it is correctly indexed for selecting the total number of likes for a comment; of course, you'll need to keep the result cached.
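For example, a sketch of that per-comment lookup, assuming an index on (comment_id, vote_type) and the ENUM encoding sketched earlier (the id 42 is a placeholder):
SELECT SUM(vote_type = 'like')    AS likes,
       SUM(vote_type = 'dislike') AS dislikes
FROM tbl_votes
WHERE comment_id = 42;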
Another way to get fast reads is to keep some vote data aggregated, so that when a user votes a comment up there is one insert/delete in your votes table and an update on another table like:
comment_id
rate
Then you select the rate for whatever comment you need, and the aggregated table has far fewer rows in total.
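A minimal sketch of that aggregation (the table name and upsert pattern are illustrative; the ids are placeholders):
CREATE TABLE tbl_comment_rate (
  comment_id INT UNSIGNED NOT NULL PRIMARY KEY,
  rate INT NOT NULL DEFAULT 0
);

-- On an upvote: record the raw vote, then bump the aggregate.
INSERT INTO tbl_votes (comment_id, user_id, vote_type) VALUES (42, 7, 'like');
INSERT INTO tbl_comment_rate (comment_id, rate) VALUES (42, 1)
  ON DUPLICATE KEY UPDATE rate = rate + 1;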
Another good way is to use a key-value store. Suppose your key is comment_id and the stored value is the raw data:
user_id
vote_type
Depending on the NoSQL storage you select, the data may be kept entirely in memory and all select/update operations will be really fast.
It is not totally true that the size of the table does not affect the SELECT query.
For big tables, I would suggest TokuDB.
In both cases the problem will arise when you want to DELETE some data.
At that point, you have two choices: clustered keys, or start thinking about different architectures (horizontal sharding could be a good way).

MySQL takes 10 seconds for a count with conditions on 100k records

SELECT COUNT(*) AS count_all, products.id AS products_id
FROM `products`
INNER JOIN `product_device_facilities`
ON `product_device_facilities`.`product_id` = `products`.`id`
INNER JOIN `product_vendors`
ON `product_vendors`.`ProductId` = `products`.`id`
INNER JOIN `screen_shots`
ON `screen_shots`.`ProductVendorId` = `product_vendors`.`id`
WHERE ( (DownloadCount >= 10 or DownloadCount is NULL)
and (Minsdk <= 10 or Minsdk is null))
GROUP BY products.id
HAVING GROUP_CONCAT(device_facility_id ORDER BY device_facility_id ASC ) IN (0)
This is taking 10 seconds for 100k records.
How to improve the performance?
There are a few things that you can try.
Use persistent connections to the database to avoid connection overhead
Check that all of your tables have primary keys on the key columns, e.g. product_id.
Use less RAM per row by declaring columns only as large as they need to be to hold the values stored in them. Also, as #manurajhada said, use count(primary key) rather than count(*).
Using simpler permissions when you issue GRANT statements enables MySQL to reduce permission-checking overhead.
Use indexes on the columns referenced between different tables. Just remember not to index too many columns; a simple rule of thumb: if you never refer to a column in comparisons, there's no need to index it.
Try using ANALYZE TABLE to help MySQL better optimize the query.
You can speed up a query a tiny bit by making sure all columns which never contain NULL are declared NOT NULL; this speeds up table traversal a bit.
Tune MySQL caching: allocate enough memory for the buffer (e.g. SET GLOBAL query_cache_size = 1000000) and define query_cache_min_res_unit depending on average query result set size.
I know it sounds counter intuitive but sometimes it is worth de-normalising tables i.e. duplicate some data in several tables to avoid JOINs which are expensive. You can support data integrity with foreign keys or triggers.
And if all else fails:
Upgrade your hardware if you can; more RAM and a faster HDD can make a significant difference to the speed of the database, and once you have done that, allocate more memory to MySQL.
EDIT
Another option, if you do not require the results live, is what #ask-bjorn-hansen suggested: use a background task (cron job) to run the query once a day and store the result in a separate table. Then, in your application, all you have to do is read that table for the stored result. That way you avoid querying 100k rows on demand, and you could even run queries that take hours without overly impacting your users.
Add indexes on the join columns of the tables, and instead of count(*) use count(some indexed primary key column).
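For instance, a hedged sketch of those join-column indexes (index names are illustrative; products.id is presumably already the primary key):
ALTER TABLE product_device_facilities ADD INDEX idx_pdf_product_id (product_id);
ALTER TABLE product_vendors ADD INDEX idx_pv_product_id (ProductId);
ALTER TABLE screen_shots ADD INDEX idx_ss_product_vendor_id (ProductVendorId);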
Are Minsdk and DownloadCount in the same table? If so, adding an index on those two columns might help.
It could be that it's just a hard/impossible query to do quickly. Without seeing your full schema and the data it's hard to be specific, but it's possible that splitting it up into several easier-to-execute queries would be faster. Or, as Amadeus suggested, maybe denormalize the data a bit.
Another variation would be to just live with it taking 10 seconds, but make sure it's always done periodically in the background (with cron or similar) and never while a user is waiting. Then take the time to fix it if/when it takes minutes instead of seconds or otherwise puts an unacceptable burden on your user experience or servers.

MySQL index performance on small "fast-moving" tables

We've got a table we use as a queue. Entries are constantly being added, constantly updated, and then deleted. Though we might be adding 3 entries/sec, the table never grows to more than a few hundred rows.
To get entries out of the table we do a simple select:
SELECT * FROM queue_table WHERE some_id = ?
We are debating adding an index on some_id. I think the small size and speed at which we are adding and removing rows would say no, but conventionally, it seems we should have an index.
Any thoughts?
If you are using InnoDB (which you should be, for a table like this) and the table is accessed concurrently, then you should definitely create the index.
When performing DML operations, InnoDB locks all rows it scans, not only those that match the WHERE clause conditions.
This means that without an index, a query like this:
DELETE
FROM mytable
WHERE some_id = ?
will have to do a full table scan and lock all rows.
This kills all concurrency (even if the threads access different some_id's they'll still have to wait for each other), and may even result in deadlocks.
With only 3 transactions per second, maintaining the index should not be a problem, so just create it.
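A one-line sketch of that (the index name is illustrative):
ALTER TABLE queue_table ADD INDEX idx_some_id (some_id);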
To be sure, a benchmark using both techniques would be needed.
But generally, if the access is 50% reads and 50% writes, the penalty of updating an index could well not be worth it. However, as the number of rows increases, that weighs on both read and write performance to the point where an index should be used.
The only way to know for sure would be to do some benchmarks in actual/real conditions; for example, measure the time each query takes, and:
for one day, collect that information each time the query is run -- without the index
and for another day, do exactly the same -- with the index.
For a table with a few hundred rows doing both lots of inserts/deletes and selects/updates, the difference should not be that big, so I think you can test in your production environment (and in real conditions) without much danger.
Yes, I know, testing in production is bad; but in this case, it's the best way to know for sure: those conditions are probably too hard to replicate in a testing environment...