I have the following situation:
A user can have a maximum number of partnerships, for example 40,000.
Question:
When a user wants to add a new partnership, what is the fastest way to check the current number of partnerships?
Solution 1:
Use a COUNT(*) statement.
Solution 2:
Store the count in a separate column of the user table. Whenever a new partnership needs to be added, read that column and then increment it.
Personal remarks:
Is there any better solution for checking the total number of rows?
Does anyone have statistics on how performance changes over time? I suppose solution 1 is faster when there is a limited number of rows, but with many rows solution 2 makes more sense. At roughly what number of rows does solution 2 become better than solution 1?
Of course I would prefer solution 1, because it gives me more control: bugs might cause the column from solution 2 not to be incremented, and in such cases the stored number would be incorrect.
Solution 2 is an example of denormalization, storing an aggregated value instead of relying on the base data. Querying this denormalized value is practically guaranteed to be faster than counting the base data, even for small numbers of rows.
But it comes at a cost for maintaining the stored value. You have to account for errors, which were discussed in the comments above. How will you know when there's an error? Answer: you have to run the count query and compare that to the value stored in the denormalized column.
How frequently do you need to verify the counts? Perhaps after every update? In that case, it's just as costly to verify the stored count as to calculate the real count from base data. In fact more costly, because you have to count and also update the user row.
Then it becomes a balance between how frequently you need to recalculate the counts versus how frequently you only query the stored count value. Every time you query between updates, you benefit from some cost savings, and if queries are a lot more frequent than updates, then you get a lot of savings. But if you update as frequently as you query, then you get no savings.
I'll vote for Solution 2 (keep an exact count elsewhere).
This will be much faster than COUNT(*), but there are things that can go wrong. Adding/deleting a partnership implies incrementing/decrementing the counter. And is there some case that is not exactly an INSERT/DELETE?
The count should be done in a transaction. For "adding":
START TRANSACTION;
SELECT p_count FROM Users WHERE user_id = 123 FOR UPDATE;
-- if p_count >= 40000, ROLLBACK and stop here
INSERT INTO partnerships ...;
UPDATE Users SET p_count = p_count + 1 WHERE user_id = 123;
COMMIT;
The overhead involved might be as much as 10 ms. Counting 40K rows would be much slower.
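For illustration, here is a runnable sketch of that transaction in Python, with SQLite standing in for MySQL (BEGIN IMMEDIATE plays the role InnoDB's SELECT ... FOR UPDATE row lock plays in the answer above; table and column names follow the answer):

```python
import sqlite3

MAX_PARTNERSHIPS = 40_000  # the cap from the question

con = sqlite3.connect(":memory:", isolation_level=None)  # autocommit; we manage txns
con.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, p_count INTEGER NOT NULL DEFAULT 0);
CREATE TABLE partnerships (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL);
INSERT INTO users (user_id, p_count) VALUES (123, 0);
""")

def add_partnership(user_id):
    # BEGIN IMMEDIATE takes a write lock up front, standing in for
    # MySQL's SELECT ... FOR UPDATE row lock on the user row.
    con.execute("BEGIN IMMEDIATE")
    (count,) = con.execute(
        "SELECT p_count FROM users WHERE user_id = ?", (user_id,)).fetchone()
    if count >= MAX_PARTNERSHIPS:
        con.execute("ROLLBACK")  # cap reached: close the transaction
        return False
    con.execute("INSERT INTO partnerships (user_id) VALUES (?)", (user_id,))
    con.execute("UPDATE users SET p_count = p_count + 1 WHERE user_id = ?",
                (user_id,))
    con.execute("COMMIT")
    return True

add_partnership(123)
print(con.execute("SELECT p_count FROM users WHERE user_id = 123").fetchone()[0])
```

Because the count check, insert, and increment all happen under one lock, two concurrent adds cannot both slip past the cap.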
Related
I have two tables:
1. user table with around 10 million rows
columns: token_type, cust_id(Primary)
2. pm_tmp table with 200k rows
columns: id(Primary | AutoIncrement), user_id
user_id is a foreign key referencing cust_id
1st Approach/Query:
update user set token_type='PRIME'
where cust_id in (select user_id from pm_tmp where id between 1 AND 60000);
2nd Approach/Query: Here we run the query below for each cust_id individually, for 60,000 records:
update user set token_type='PRIME' where cust_id='1111110';
Theoretically the first query will take less time, as it involves fewer commits and, in turn, fewer index rebuilds. But I would recommend going with the second option, since it is more controlled, will appear to take less time, and you can even consider executing two separate sets in parallel.
Note: The first query will need sufficient memory provisioned for MySQL buffers to execute quickly. The second query, being a set of independent single-transaction queries, will need comparatively less memory and hence will appear faster in limited-memory environments.
Well, you may rewrite the first query this way too.
update user u, pm_tmp p set u.token_type='PRIME' where u.cust_id=p.user_id and p.id < 60000;
Some versions of MySQL have trouble optimizing IN. I would recommend:
update user u join
pm_tmp pt
on u.cust_id = pt.user_id and pt.id between 1 AND 60000
set u.token_type = 'PRIME' ;
(Note: This assumes that cust_id is not repeated in pm_tmp. If that is possible, you will want a SELECT DISTINCT subquery.)
Your second version would normally be considerably slower, because it requires executing thousands of queries instead of one. One consideration might be the update. Perhaps the logging and locking get more complicated as the number of updates increases. I don't actually know enough about MySQL internals to know if this would have a significant impact on performance.
IN ( SELECT ... ) is poorly optimized. (I can't provide specifics because both UPDATE and IN have been better optimized in some recent version(s) of MySQL.) Suffice it to say "avoid IN ( SELECT ... )".
Your first sentence should say "rows" instead of "columns".
Back to the rest of the question. 60K is too big of a chunk. I recommend only 1000. Aside from that, Gordon's Answer is probably the best.
But... You did not use OFFSET; Do not be tempted to use it; it will kill performance as you go farther and farther into the table.
Another thing. COMMIT after each chunk. Else you build up a huge undo log; this adds to the cost. (And is a reason why 1K is possibly faster than 60K.)
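A runnable sketch of that chunking pattern, in Python with SQLite standing in for MySQL: walk the primary key in fixed ranges rather than using OFFSET, and COMMIT after each chunk. (SQLite handles IN (SELECT ...) fine; in MySQL you would prefer the JOIN form from Gordon's answer.)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user (cust_id INTEGER PRIMARY KEY,"
            " token_type TEXT NOT NULL DEFAULT '')")
con.execute("CREATE TABLE pm_tmp (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL)")
con.executemany("INSERT INTO user (cust_id) VALUES (?)",
                [(i,) for i in range(1, 5001)])
con.executemany("INSERT INTO pm_tmp (id, user_id) VALUES (?, ?)",
                [(i, i) for i in range(1, 3001)])
con.commit()

CHUNK = 1000  # the chunk size suggested above
(max_id,) = con.execute("SELECT MAX(id) FROM pm_tmp").fetchone()

# Walk the primary key in fixed ranges -- no OFFSET -- committing after
# each chunk so the undo log stays small.
for low in range(1, max_id + 1, CHUNK):
    con.execute(
        """UPDATE user SET token_type = 'PRIME'
           WHERE cust_id IN (SELECT user_id FROM pm_tmp
                             WHERE id >= ? AND id < ?)""",
        (low, low + CHUNK))
    con.commit()

print(con.execute(
    "SELECT COUNT(*) FROM user WHERE token_type = 'PRIME'").fetchone()[0])
```

The table sizes here are scaled down for illustration; the range-walk pattern is the same at 60K or 10M rows.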
But wait! Why are you updating a huge table? That is usually a sign of bad schema design. Please explain the data flow.
Perhaps you have computed which items to flag as 'prime'? Well, you could keep that list around and do JOINs in the SELECTs to discover prime-ness when reading. This completely eliminates the UPDATE in question. Sure, the JOIN costs something, but not much.
I have a page on my site that keeps track of the number of people accessing it; another part displays data about the users that access this page, showing only about 10 at a time.
The problem is that I need to create pagination, so I need to know how much data is in my table at any given time. This causes the display page to take 2-3 seconds to load, sometimes 7-10, because I have millions of records. I am wondering how to get this page to load faster.
Select COUNT(*) as Count from visits
My first response is . . . if you are paging records 10 at a time, why do you need the total count of more than a million?
Second, counting a million rows should not take very long, unless your rows are wide (lots of columns or wide columns). If that is the case, then:
select count(id) from t;
can help, because it will explicitly use an index. Note that the first run may be slower than subsequent runs because of caching.
If you decide that you do need an exact row count, then your only real option for speeding it up using MySQL is to create triggers to maintain the count in another table. However, that will slow down inserts and deletions, which might not be a good idea.
The best answer is to say "About 1,234,000 visits", not the exact number. Then calculate it daily (or whatever).
But if you must have the exact count, ...
If this table is "write only", then there is a solution. It involves treating it as a "Fact" table in a Data Warehouse. Then create and maintain a "Summary table" with a row for, say, each hour. Then the COUNT becomes:
SELECT SUM(hourly_count) FROM SummaryTable;
This will be much faster because there is much less to scan. However, there is a problem in that it does not include the count for the last (partial) hour. But that can be solved if you use INSERT ... ON DUPLICATE KEY UPDATE ... to increment the counter for the current hour or insert a new row with a "1".
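A minimal sketch of that hourly summary table, in Python with SQLite (SQLite's INSERT ... ON CONFLICT ... DO UPDATE stands in for MySQL's INSERT ... ON DUPLICATE KEY UPDATE; the table name summary is assumed):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE summary (hour TEXT PRIMARY KEY,"
            " hourly_count INTEGER NOT NULL)")

def record_visit(hour):
    # Upsert: insert a "1" for a new hour, otherwise increment the
    # counter for the current (partial) hour.
    con.execute("""
        INSERT INTO summary (hour, hourly_count) VALUES (?, 1)
        ON CONFLICT(hour) DO UPDATE SET hourly_count = hourly_count + 1
    """, (hour,))

for h in ["2024-01-01 10:00", "2024-01-01 10:00", "2024-01-01 11:00"]:
    record_visit(h)

# The total is now a scan over a few rows per day, not millions of visits.
print(con.execute("SELECT SUM(hourly_count) FROM summary").fetchone()[0])
```

Because the current hour is maintained incrementally, the SUM already includes the partial hour, addressing the problem noted above.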
But, before we take this too far, please inform us of how often a new 'visit' occurs.
You cannot make that query faster without changing the server's hardware or adding more servers to run it in parallel. In the latter case it would be better to move to a NoSQL database.
My approach would be to reduce the number of records. You could do that by having a temporary table where you record the access logs for the past hour/day, and after that time run a cronjob that deletes the data or moves it to another table for long-term storage.
You usually do not need to know the exact number of rows for pagination.
SELECT COUNT(*) FROM
(SELECT 1 FROM visits LIMIT 10000) AS v
would tell you that there are at least 1000 pages. In most cases you do not need to know more.
You can store the total count somewhere and update it from time to time if you want a reasonable estimate. If you need the exact number, you can use a trigger to keep it current. The more up to date the info, the more expensive, of course.
Decide on a limit (say, the last 1000 records) from a practical (business requirements) point of view. Have an auto_increment index (id) or a timestamp (createdon). Grab at most the last 1000 records:
select count(*) from (select id from visits order by id desc limit 1000) as t
or grab all 1000, count them, and paginate on the client side (PHP); even if you paginate in MySQL, it will still go through those records:
select * from visits order by id desc limit 1000
I have a MySQL database which consists of 13 tables. One table, transactions, will in future store a lot of data (nearly one million records). This table uses the InnoDB storage engine. Business rules require knowing the count of all records in this table. So, my question is: what is the fastest way to count all of these records?
First
Of course I can use something like this:
SELECT COUNT(*) FROM transaction
but obviously this is not the best solution.
Second
I can create an additional table where I store an incrementing counter,
and add a trigger that executes when a row is inserted into the transaction table.
CREATE TRIGGER update_counter AFTER INSERT ON transaction
FOR EACH ROW
  UPDATE counter SET count_var = count_var + 1;
But what happens if 10 entries are added at the same time, for example?
And the last solution is to use information_schema, something like this:
SELECT TABLE_ROWS
FROM information_schema.tables
WHERE table_name = 'transaction'
So what is the most appropriate way to resolve this situation?
A "business rule" that requires the exact value of a number around a million? Send that to Dilbert; the pointy-hair boss will love it.
Remember when search engines would show you the exact number of hits, yet they would return the value so fast that it was suspect? Then they got a little more honest and said "hits 1-20 out of more than 120,000"? Now they don't even bother.
You should ask a serious question -- why do you need the exact number? Will an approximate number do? Will the number as of last night suffice?
With those answers, we can help design a "good enough" computation that is also "fast enough".
There are a number of similar questions on here, but a lot of the answers say to force the use of an index and that doesn't seem to speed anything up for me.
I want to show a "live" counter on my website showing the number of rows in a table, kind of like how some websites show the number of registered users, or some other statistic, in "real time" (i.e. updated frequently using ajax or websockets).
My table has about 5M rows. It's growing fairly quickly and there is a high volume of inserts and deletes on it. Running
select count(*) from my_table
takes 1.367 seconds, which is unacceptable because I need my application to get the new row count about once per second.
I tried what many of the answers on here suggest and changed the query to:
select count(*) from my_table use index(my_index)
where my_index is a normal BTREE index on a bigint field. But the time actually increased to 1.414 seconds.
Why doesn't using an index speed up the query as many answers on here said it would?
Another option some answers suggest is to put a trigger on the table that increments a column in another table. So I could create a stats table and whenever a row is inserted or deleted in my_table have a trigger increment or decrement a column in the stats table. Is this the only other option, since using an index doesn't seem to work?
EDIT: Here's a perfect example of the type of thing I'm trying to accomplish: https://www.freelancer.com. Scroll to the bottom of the page and you'll see live counters; those numbers update every second or so.
It takes time to read 5 million records and count them -- whether in an index or in the raw data form.
If a "fast-and-dirty" solution is acceptable, you can use metadata:
SELECT table_rows
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = <whatever> and TABLE_NAME = <whatever2>;
Note that this can get out-of-sync.
Another possibility is to partition the table into smaller chunks. One advantage is that if the inserts and deletes tend to be to one partition, you can just count that and use metadata for the other partitions.
A trigger may or may not help in this situation, depending on the insert/delete load. If you are doing multiple inserts per minute, then a trigger is a no-brainer -- a fine solution. If you are doing dozens or hundreds of changes per second, then the overhead of the trigger might slow down the server.
If your system is so busy that the counting is having too much impact, then probably the INSERTing/DELETEing is also having impact. One way to improve INSERT/DELETE is to do them in 'batches' instead of one at a time.
Gather the INSERTs, preferably in the app, but optionally in a 'staging' table. Then, once a second (or whatever) copy them into the real table using an INSERT..SELECT, or (if needed) INSERT..ON DUPLICATE KEY UPDATE. DELETEs can go into the same table (with a flag) or a separate table.
The COUNT(*) can be done at the end of the batch. Or it could be dead reckoned (at much lower cost) by knowing what the count was, then adjusting by what the staging table(s) will change it by.
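The dead-reckoning arithmetic can be sketched like this, in Python with SQLite (the staging table layout with an 'I'/'D' op flag is an assumption for illustration, not prescribed above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE my_table (id INTEGER PRIMARY KEY);
CREATE TABLE staging (id INTEGER PRIMARY KEY, op TEXT NOT NULL);  -- 'I' or 'D'
""")
con.executemany("INSERT INTO my_table (id) VALUES (?)",
                [(i,) for i in range(100)])

last_known = 100  # COUNT(*) taken when the last batch was flushed

# Between batch flushes, changes accumulate in the staging table
# instead of hitting my_table one at a time.
con.executemany("INSERT INTO staging (id, op) VALUES (?, ?)",
                [(100, 'I'), (101, 'I'), (5, 'D')])

# Dead-reckoned count: the last known count, adjusted by what the
# pending staging rows will change it by. No scan of my_table needed.
(delta,) = con.execute(
    "SELECT SUM(CASE op WHEN 'I' THEN 1 ELSE -1 END) FROM staging").fetchone()
print(last_known + delta)
```

The adjustment is a scan of the (small) staging table only, which is the "much lower cost" referred to above.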
This is a major upheaval to your app code, so don't embark on it unless you have spikes of, say, >100 INSERTs/DELETEs per second. (A steady 100 INSERTs/sec = 3 billion rows per year.)
For more details on "staging table", see http://mysql.rjweb.org/doc.php/staging_table . Note that the blog advocates flip-flopping between a pair of staging tables, so as to minimize locks on them and to allow multiple clients to coexist.
Have a job running in the background that does the following; then use its table for getting the count:
Loop:
INSERT INTO Counter (ct_my_table)
SELECT COUNT(*) FROM my_table;
sleep 1 second
end loop
At worst, it will be a couple of seconds out of date. Also note that INSERTs and DELETEs interfere with (read: slow down) the SELECT COUNT(*), hence the "sleep".
Have you noticed that some UIs say "About 120,000 thingies"? They are using even cruder estimations. But it is usually good enough for the users.
Take the inaccurate value from information_schema, as Gordon Linoff suggested.
Another inaccurate source of the row count is SELECT MAX(id) - MIN(id).
Create a table my_table_count where you store the row count of my_table, and update it with triggers.
In many cases you don't need an accurate value. Who cares if you show 36,400 users instead of the accurate 36,454?
In an effort to add statistics and tracking for users on my site, I've been thinking about the best way to keep counters of pageviews and other very frequently occurring events. Now, my site obviously isn't the size of Facebook to warrant some of the strategies they've implemented (sharding isn't even necessary, for example), but I'd like to avoid any blatantly stupid blunders.
It seems the easiest way to keep track is to just have an integer column in the table. For example, each page has a pageview column that just gets incremented by 1 for each pageview. This seems like it might be an issue if people hit the page faster than the database can write.
If two people hit the page at the same time, for example, then the previous_pageview count would be the same prior to both updates, and each update would update it to be previous_pageview+1 rather than +2. However, assuming a database write speed of 10ms (which is really high, I believe) you'd need on the order of a hundred pageviews per second, or millions of pageviews per day.
Is it okay, then, for me to be just incrementing a column? The exact number isn't too important, so some error here and there is tolerable. Does an update statement on one column slow down if there are many columns for the same row? (My guess is no.)
I had this plan to use a separate No-SQL database to store pk_[stat]->value pairs for each stat, incrementing those rapidly, and then running a cron job to periodically update the MySQL values. This feels like overkill; somebody please reassure me that it is.
UPDATE foo SET counter = counter + 1 is atomic. It will work as expected even if two people hit at the same time.
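To see why the atomic form matters, a small demonstration in Python with SQLite (the effect is the same in MySQL). The first loop uses the atomic increment; the second simulates two clients that each read the counter before either writes, which loses one update:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, counter INTEGER NOT NULL)")
con.execute("INSERT INTO foo VALUES (1, 0)")

# Atomic form: the read and the write happen inside one statement,
# so two "simultaneous" hits each add 1.
for _ in range(2):
    con.execute("UPDATE foo SET counter = counter + 1 WHERE id = 1")
print(con.execute("SELECT counter FROM foo WHERE id = 1").fetchone()[0])  # 2

# Lost-update form: both hits read the same stale value first,
# then each writes stale + 1, so one increment is lost.
stale_a = con.execute("SELECT counter FROM foo WHERE id = 1").fetchone()[0]
stale_b = con.execute("SELECT counter FROM foo WHERE id = 1").fetchone()[0]
con.execute("UPDATE foo SET counter = ? WHERE id = 1", (stale_a + 1,))
con.execute("UPDATE foo SET counter = ? WHERE id = 1", (stale_b + 1,))
print(con.execute("SELECT counter FROM foo WHERE id = 1").fetchone()[0])  # 3, not 4
```

This is exactly the previous_pageview+1 race described in the question, and why counter = counter + 1 avoids it.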
It's also common to throw view counts into a secondary table, and then update the actual counts nightly (or at some interval).
INSERT INTO page_view (page_id) VALUES (1);
...
UPDATE page SET views = views + new_views WHERE id = 1;
This should be a little faster than X = X + 1, but requires a bit more work.