I have a MySQL database which consists of 13 tables. One table, transactions, will eventually store a lot of data (nearly one million records). This table uses the InnoDB storage engine. Business rules require knowing the total number of records in this table. So my question is: what is the fastest way to count all of these records?
First
Of course I can use something like this:
SELECT COUNT(*) FROM transaction
but obviously this is not the best solution.
Second
I can create an additional table where I store an incrementable counter,
and add a trigger that fires when a row is inserted into the transaction table:
CREATE TRIGGER update_counter AFTER INSERT ON transaction
FOR EACH ROW
    UPDATE counter SET count_var = count_var + 1;
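For reference, a minimal sketch of the counter table this trigger assumes (the table and column names are illustrative):
CREATE TABLE counter (
    count_var BIGINT UNSIGNED NOT NULL DEFAULT 0
);
-- seed it once with the current count, before creating the trigger
INSERT INTO counter (count_var)
SELECT COUNT(*) FROM transaction;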
but what happens if 10 entries are added at the same time, for example?
And the last solution is to use information_schema, something like this:
SELECT TABLE_ROWS
FROM information_schema.tables
WHERE table_name = 'transaction'
So what is the most appropriate way to resolve this situation?
A "business rule" that requires the exact value of a number around a million? Send that to Dilbert; the pointy-hair boss will love it.
Remember when search engines would show you the exact number of hits, yet they would return the value so fast that it was suspect? Then they got a little more honest and said "hits 1-20 out of more than 120,000"? Now they don't even bother.
You should ask a serious question -- Why do you need the exact number? Will an approximate number do? Will the number as of last night suffice?
With those answers, we can help design a "good enough" computation that is also "fast enough".
Related
I have the following situation:
A user can have a maximum number of partnerships, for example 40,000.
Question:
If a user wants to add a new partnership, what is the faster way to check the current number of partnerships?
Solution 1:
Using a count(*) statement ?
Solution 2:
Storing the value in a separate column of the user table, and whenever a new partnership needs to be added, reading that column and then incrementing it?
Personal remarks:
Is there any better solution to check the total number of rows?
Does anyone have statistics on how performance changes over time? I suppose that solution 1 is faster when there is a limited number of rows, but once there are many rows it makes more sense to use solution 2. For example, after what period of time (amount of rows) does solution 2 become better than solution 1?
I would of course prefer solution 1, because it gives me more control. Bugs might happen and the column from solution 2 might not get incremented, in which case the number would not be correct.
Solution 2 is an example of denormalization, storing an aggregated value instead of relying on the base data. Querying this denormalized value is practically guaranteed to be faster than counting the base data, even for small numbers of rows.
But it comes at a cost for maintaining the stored value. You have to account for errors, which were discussed in the comments above. How will you know when there's an error? Answer: you have to run the count query and compare that to the value stored in the denormalized column.
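A minimal sketch of such a reconciliation query, assuming a users table with a partnership_count column and a partnerships table with a user_id column (all names are illustrative):
SELECT u.user_id, u.partnership_count, COUNT(p.user_id) AS real_count
FROM users u
LEFT JOIN partnerships p ON p.user_id = u.user_id
GROUP BY u.user_id, u.partnership_count
HAVING u.partnership_count <> COUNT(p.user_id);
Any rows returned are users whose stored count has drifted from the base data.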
How frequently do you need to verify the counts? Perhaps after every update? In that case, it's just as costly to verify the stored count as to calculate the real count from base data. In fact more costly, because you have to count and also update the user row.
Then it becomes a balance between how frequently you need to recalculate the counts versus how frequently you only query the stored count value. Every time you query between updates, you benefit from some cost savings, and if queries are a lot more frequent than updates, then you get a lot of savings. But if you update as frequently as you query, then you get no savings.
I'll vote for Solution 2 (keep an exact count elsewhere).
This will be much faster than COUNT(*), but there are things that can go wrong. Adding/deleting a partnership implies incrementing/decrementing the counter. And is there some case that is not exactly an INSERT/DELETE?
The check and counter update should be done in a transaction. For "adding":
START TRANSACTION;
SELECT p_count FROM Users WHERE user_id = 123 FOR UPDATE;
-- if p_count >= 40000, ROLLBACK and abort the add
INSERT INTO partnerships ...;
UPDATE Users SET p_count = p_count+1 WHERE user_id = 123;
COMMIT;
The overhead that is involved might be as much as 10ms. Counting to 40K would be much slower.
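The delete path would mirror this; a minimal sketch, assuming the same Users.p_count column and an illustrative partnership_id key:
START TRANSACTION;
DELETE FROM partnerships WHERE partnership_id = 456;
-- decrement only if the DELETE actually removed a row (check ROW_COUNT())
UPDATE Users SET p_count = p_count - 1 WHERE user_id = 123;
COMMIT;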
I am working on a database that has a table user having columns user_id and user_service_id. My application needs to fetch all the users whose user_service_id is a particular value. Normally I would add an index to the user_service_id column and run a query like this :
select user_id from user where user_service_id = 2;
Since the cardinality of the user_service_id column is very low (around 3-4 distinct values) and the table has around 10M entries, the query will end up scanning almost the entire table.
I was wondering what the recommendation is for such use cases. Also, would it make more sense to move the data to a NoSQL datastore, since this doesn't seem to be an efficient use case for MySQL or any SQL datastore? I tried to search for this but couldn't find any recommendations. Can someone please help or provide the necessary references?
Thanks in advance.
That query needs this index, which is both "composite" and "covering":
INDEX(user_service_id, user_id) -- in this order
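Concretely, that would be something like the following (assuming the table is named user; the index name is illustrative):
ALTER TABLE user ADD INDEX idx_service_user (user_service_id, user_id);
Because the index contains both columns, the query can be answered from the index alone, without touching the table rows.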
But what will you do with the millions of rows that you get? Sounds like it will choke the client, whether it comes fast or slow.
See my Index Cookbook
"very dynamic" -- Not a problem.
"cache" -- the dynamic nature defeats caching.
"cardinality" -- not important, except to point out that there will be millions of rows.
"millions of rows" -- that takes time to deliver to the client. The number of rows delivered is the biggest factor in cost.
"select entire table, then filter in client" -- That will be even slower! (See "millions of rows".)
There are a number of similar questions on here, but a lot of the answers say to force the use of an index and that doesn't seem to speed anything up for me.
I want to show a "live" counter on my website with the number of rows in a table, kind of like how some websites show the number of registered users, or some other statistic, in "real time" (i.e. updated frequently using ajax or websockets).
My table has about 5M rows. It's growing fairly quickly and there is a high volume of inserts and deletes on it. Running
select count(*) from my_table
Takes 1.367 seconds, which is unacceptable because I need my application to get the new row count about once per second.
I tried what many of the answers on here suggest and changed the query to:
select count(*) from my_table use index(my_index)
where my_index is a normal BTREE index on a BIGINT column. But the time actually increased to 1.414 seconds.
Why doesn't using an index speed up the query as many answers on here said it would?
Another option some answers suggest is to put a trigger on the table that increments a column in another table. So I could create a stats table and whenever a row is inserted or deleted in my_table have a trigger increment or decrement a column in the stats table. Is this the only other option, since using an index doesn't seem to work?
EDIT: Here's a perfect example of the type of thing I'm trying to accomplish: https://www.freelancer.com. Scroll to the bottom of the page and you'll see:
Those numbers update every second or so.
It takes time to read 5 million records and count them -- whether in an index or in the raw data form.
If a "fast-and-dirty" solution is acceptable, you can use metadata:
SELECT table_rows
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = <whatever> and TABLE_NAME = <whatever2>;
Note that this can get out-of-sync.
Another possibility is to partition the table into smaller chunks. One advantage is that if the inserts and deletes tend to be to one partition, you can just count that and use metadata for the other partitions.
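A hedged sketch of that idea, assuming my_table has an AUTO_INCREMENT primary key id and new rows land in the last partition (partition names and ranges are illustrative):
ALTER TABLE my_table
PARTITION BY RANGE (id) (
    PARTITION p0 VALUES LESS THAN (1000000),
    PARTITION p1 VALUES LESS THAN (2000000),
    PARTITION p_hot VALUES LESS THAN MAXVALUE
);

-- exact count of the "hot" partition only
SELECT COUNT(*) FROM my_table PARTITION (p_hot);

-- estimated counts of the colder partitions from metadata
SELECT PARTITION_NAME, TABLE_ROWS
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'my_table';
Note that the partitioning column must be part of every unique key on the table.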
A trigger may or may not help in this situation, depending on the insert/delete load. If you are doing multiple inserts per minute, then a trigger is a no-brainer -- a fine solution. If you are doing dozens or hundreds of changes per second, then the overhead of the trigger might slow down the server.
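A minimal sketch of that trigger approach, assuming a single-row stats table (all names are illustrative):
CREATE TABLE stats (id TINYINT PRIMARY KEY, row_count BIGINT NOT NULL);
INSERT INTO stats VALUES (1, (SELECT COUNT(*) FROM my_table));

CREATE TRIGGER my_table_ai AFTER INSERT ON my_table
FOR EACH ROW UPDATE stats SET row_count = row_count + 1 WHERE id = 1;

CREATE TRIGGER my_table_ad AFTER DELETE ON my_table
FOR EACH ROW UPDATE stats SET row_count = row_count - 1 WHERE id = 1;
Reading the count is then a primary-key lookup on a single row.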
If your system is so busy that the counting is having too much impact, then probably the INSERTing/DELETEing is also having impact. One way to improve INSERT/DELETE is to do them in 'batches' instead of one at a time.
Gather the INSERTs, preferably in the app, but optionally in a 'staging' table. Then, once a second (or whatever) copy them into the real table using an INSERT..SELECT, or (if needed) INSERT..ON DUPLICATE KEY UPDATE. DELETEs can go into the same table (with a flag) or a separate table.
The COUNT(*) can be done at the end of the batch. Or it could be dead reckoned (at much lower cost) by knowing what the count was, then adjusting by what the staging table(s) will change it by.
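A hedged sketch of such a batch flush, assuming a my_table_staging table with the same columns (col1, col2 are placeholders) and a single-row stats table maintained only by this job (no triggers):
START TRANSACTION;

-- move the batch into the real table
INSERT INTO my_table (col1, col2)
SELECT col1, col2 FROM my_table_staging;

-- dead-reckon the counter instead of recounting the base table
UPDATE stats
SET row_count = row_count + (SELECT COUNT(*) FROM my_table_staging)
WHERE id = 1;

DELETE FROM my_table_staging;

COMMIT;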
This is a major upheaval to your app code, so don't embark on it unless you have spikes of, say, >100 INSERTs/DELETEs per second. (A steady 100 INSERTs/sec is about 3 billion rows per year.)
For more details on "staging table", see http://mysql.rjweb.org/doc.php/staging_table Note that that blog advocates flip-flopping between a pair of staging tables, so as to minimize locks on them, and to allow multiple clients to coexist.
Have a job running in the background that does the following; then use its table for getting the count:
Loop:
INSERT INTO Counter (ct_my_table)
SELECT COUNT(*) FROM my_table;
sleep 1 second
end loop
At worst, it will be a couple of seconds out of date. Also note that INSERTs and DELETEs interfere with (read: slow down) the SELECT COUNT(*), hence the "sleep".
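If an external job is inconvenient, the MySQL event scheduler can do much the same thing; a minimal sketch that keeps a single row up to date rather than appending one per second (assumes event_scheduler=ON; names are illustrative):
CREATE TABLE Counter (id TINYINT PRIMARY KEY, ct_my_table BIGINT NOT NULL);
INSERT INTO Counter VALUES (1, 0);

CREATE EVENT refresh_my_table_count
ON SCHEDULE EVERY 1 SECOND
DO
    UPDATE Counter
    SET ct_my_table = (SELECT COUNT(*) FROM my_table)
    WHERE id = 1;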
Have you noticed that some UIs say "About 120,000 thingies"? They are using even cruder estimations. But it is usually good enough for the users.
Take the inaccurate value from information_schema, as Gordon Linoff suggested
Another inaccurate source of the row count is SELECT MAX(id) - MIN(id)
Create a table my_table_count where you store the row count of my_table and update it with triggers
In many cases you don't need an accurate value. Who cares if you show 36,400 users instead of the accurate 36,454?
Is it possible to set the maximum number of rows that a table can accommodate in MySQL?
I don't want to use any Java code; I want to do this using pure MySQL scripts.
I wouldn't recommend trying to limit the number of rows in a SQL table unless you have a very good reason to do so. It seems you would be better off using a query like:
select entityID, entityName from TableName limit 1000
rather than physically limiting the rows of the table.
However, if you really want to limit it to 1000 rows:
delete from TableName where entityID not in
    (select entityID from (select entityID from TableName order by entityID limit 1000) as keep_rows)
MySQL supports a MAX_ROWS option when creating (and altering) a table. http://dev.mysql.com/doc/refman/5.0/en/create-table.html
Edit: Sadly it turns out this is only a hint for optimization
"The maximum number of rows you plan to store in the table. This is not a hard limit, but rather a hint to the storage engine that the table must be able to store at least this many rows."
Your question implied that scripts are OK; is it ridiculous to make one as simple as a cron job that regularly drops table rows above a given ID? It's not nearly as elegant as having MySQL throw errors when something tries to add one row too many, but it would do the job, as sketched below. You may also be able to have your application check whether its ID is too high and warn the user or relevant party.
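A hedged sketch of such a cleanup statement (the cutoff of 1000 is a placeholder for whatever ID your job picks; it could be run from cron or the event scheduler):
delete from TableName where entityID > 1000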