In an effort to add statistics and tracking for users on my site, I've been thinking about the best way to keep counters of pageviews and other very frequently occurring events. Now, my site obviously isn't the size of Facebook to warrant some of the strategies they've implemented (sharding isn't even necessary, for example), but I'd like to avoid any blatantly stupid blunders.
It seems the easiest way to keep track is to just have an integer column in the table. For example, each page has a pageview column that just gets incremented by 1 for each pageview. This seems like it might be an issue if people hit the page faster than the database can write.
If two people hit the page at the same time, for example, then the previous_pageview count would be the same prior to both updates, and each update would update it to be previous_pageview+1 rather than +2. However, assuming a database write speed of 10ms (which is really high, I believe) you'd need on the order of a hundred pageviews per second, or millions of pageviews per day.
Is it okay, then, for me to be just incrementing a column? The exact number isn't too important, so some error here and there is tolerable. Does an update statement on one column slow down if there are many columns for the same row? (My guess is no.)
I had this plan to use a separate No-SQL database to store pk_[stat]->value pairs for each stat, incrementing those rapidly, and then running a cron job to periodically update the MySQL values. This feels like overkill; somebody please reassure me that it is.
UPDATE foo SET counter = counter + 1 is atomic. It will work as expected even if two people hit at the same time.
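A minimal sketch of that single-statement increment, assuming Python's built-in sqlite3 module as a stand-in for MySQL so the example runs anywhere. The table foo and column counter come from the answer; the id column and the record_hit helper are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE foo (id INTEGER PRIMARY KEY, counter INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO foo (id) VALUES (1)")
conn.commit()


def record_hit(conn, row_id):
    """Increment the counter inside one statement (hypothetical helper)."""
    # The database reads and writes the value within the same UPDATE, so there
    # is no gap between a separate SELECT and UPDATE in which a hit can be lost.
    with conn:
        conn.execute("UPDATE foo SET counter = counter + 1 WHERE id = ?", (row_id,))


for _ in range(1000):
    record_hit(conn, 1)

total = conn.execute("SELECT counter FROM foo WHERE id = 1").fetchone()[0]
```

The lost-update scenario in the question only arises if the application reads the value, adds 1 itself, and writes the result back in a second statement.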
It's also common to throw view counts into a secondary table, and then update the actual counts nightly (or at some interval).
INSERT INTO page_view (page_id) VALUES (1);
...
UPDATE page SET views = views + new_views WHERE id = 1;
This should be a little faster than X = X + 1, but requires a bit more work.
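The log-then-roll-up pattern above could look roughly like this. It is a sketch under assumptions: sqlite3 stands in for MySQL, the view_id column is hypothetical, and roll_up would be called from a scheduled job rather than by hand.

```python
import sqlite3


def make_db():
    """Create the two tables in an in-memory database (stand-in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (id INTEGER PRIMARY KEY, views INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE page_view (view_id INTEGER PRIMARY KEY, page_id INTEGER NOT NULL);
        """
    )
    return conn


def log_view(conn, page_id):
    """Cheap append on every hit; no existing row is touched."""
    with conn:
        conn.execute("INSERT INTO page_view (page_id) VALUES (?)", (page_id,))


def roll_up(conn):
    """Fold the logged views into page.views, then discard the folded rows."""
    with conn:
        # Fix an upper bound first so rows logged after this point are left
        # for the next run instead of being deleted uncounted.
        (max_id,) = conn.execute(
            "SELECT COALESCE(MAX(view_id), 0) FROM page_view"
        ).fetchone()
        conn.execute(
            """
            UPDATE page SET views = views + (
                SELECT COUNT(*) FROM page_view
                WHERE page_view.page_id = page.id AND page_view.view_id <= ?
            )
            """,
            (max_id,),
        )
        conn.execute("DELETE FROM page_view WHERE view_id <= ?", (max_id,))
```

Between runs the displayed total lags by whatever is still sitting in page_view, which is the trade-off the answer describes.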
I have the following situation:
A user can have a maximum number of partnerships, for example 40,000.
Question:
When a user wants to add a new partnership, what is the fastest way to check the current number of partnerships?
Solution 1:
Use a count(*) statement?
Solution 2:
Store the value in a separate column on the user, then read and increment that column whenever a new partnership is added?
Personal remarks:
Is there a better way to check the total number of rows?
Does anyone have statistics on how performance changes over time? I suppose that solution 1 is faster when there are few rows, and that solution 2 makes more sense when there are many. At what point (what number of rows) does solution 2 become better than solution 1?
I would of course prefer solution 1, because it gives me more control. A bug might leave the column from solution 2 unincremented, and in such cases the number would be incorrect.
Solution 2 is an example of denormalization, storing an aggregated value instead of relying on the base data. Querying this denormalized value is practically guaranteed to be faster than counting the base data, even for small numbers of rows.
But it comes at a cost for maintaining the stored value. You have to account for errors, which were discussed in the comments above. How will you know when there's an error? Answer: you have to run the count query and compare that to the value stored in the denormalized column.
How frequently do you need to verify the counts? Perhaps after every update? In that case, it's just as costly to verify the stored count as to calculate the real count from base data. In fact more costly, because you have to count and also update the user row.
Then it becomes a balance between how frequently you need to recalculate the counts versus how frequently you only query the stored count value. Every time you query between updates, you benefit from some cost savings, and if queries are a lot more frequent than updates, then you get a lot of savings. But if you update as frequently as you query, then you get no savings.
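The verification step described above (run the count query and compare it to the stored value) might be sketched as one query that lists only the rows that have drifted. The table and column names (Users, p_count, partnerships, user_id) are assumptions for illustration, and sqlite3 stands in for MySQL.

```python
import sqlite3


def find_drift(conn):
    """Return (user_id, stored_count, actual_count) for every user whose
    denormalized p_count disagrees with the base data."""
    return conn.execute(
        """
        SELECT u.user_id, u.p_count, COUNT(p.user_id) AS actual
        FROM Users u
        LEFT JOIN partnerships p ON p.user_id = u.user_id
        GROUP BY u.user_id, u.p_count
        HAVING u.p_count <> COUNT(p.user_id)
        ORDER BY u.user_id
        """
    ).fetchall()
```

Running this as an occasional audit, rather than after every update, keeps the cheap read path for normal queries while still catching bugs in the code that maintains the counter.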
I'll vote for Solution 2 (keep an exact count elsewhere).
This will be much faster than COUNT(*), but there are things that can go wrong. Adding/deleting a partnership implies incrementing/decrementing the counter. And is there some case that is not exactly an INSERT/DELETE?
The count should be done in a transaction. For "adding":
START TRANSACTION;
SELECT p_count FROM Users WHERE user_id = 123 FOR UPDATE;
if p_count >= 40000, ROLLBACK and stop here
INSERT INTO partnerships ...;
UPDATE Users SET p_count = p_count+1 WHERE user_id = 123;
COMMIT;
The overhead that is involved might be as much as 10ms. Counting to 40K would be much slower.
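A runnable sketch of the check-then-insert transaction above, with loud assumptions: sqlite3 stands in for MySQL, SQLite has no SELECT ... FOR UPDATE so BEGIN IMMEDIATE is used to take the write lock up front, and the partner_id column and add_partnership helper are hypothetical.

```python
import sqlite3

MAX_PARTNERSHIPS = 40_000


def add_partnership(conn, user_id, partner_id, limit=MAX_PARTNERSHIPS):
    """Insert one partnership unless the user is at the limit.

    Returns True if the row was added. `conn` must be opened with
    isolation_level=None so that BEGIN/COMMIT below are the only
    transaction control in play.
    """
    # Assumption: BEGIN IMMEDIATE plays the role of FOR UPDATE here by
    # blocking other writers until COMMIT or ROLLBACK.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT p_count FROM Users WHERE user_id = ?", (user_id,)
        ).fetchone()
        if row is None or row[0] >= limit:
            conn.execute("ROLLBACK")
            return False
        conn.execute(
            "INSERT INTO partnerships (user_id, partner_id) VALUES (?, ?)",
            (user_id, partner_id),
        )
        conn.execute(
            "UPDATE Users SET p_count = p_count + 1 WHERE user_id = ?", (user_id,)
        )
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Because the insert and the counter update commit together, a failure in either one leaves both tables unchanged.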
I am trying to understand how a huge volume of updates to a table affects data availability for users. I have been going through various posts (fastest-way-to-update-120-million-records, Avoid locking while updating) which walk through different mechanisms for large updates, such as populating a completely new table if the work can be done offline, or doing batch updates if it cannot.
I am trying to understand how these large updates affect the table's availability to users, and what the best way is to do large updates while making sure the table remains available for reads.
Use case: Updating transaction details based on Primary key (Like updating the stock holding due to stock split.)
It is unclear what you need to do.
Replace the entire table -- populate new table, then swap
Change one column for all rows -- Sounds like sloppy design. Please elaborate on what you are doing.
Change one column for some rows -- ditto.
Adding a new column and initializing it -- Consider creating a parallel table, etc. This will have zero blockage but adds some complexity to your code.
The values are computed from other columns -- consider a "generated" column. (What version of MySQL are you using?)
Here is a discussion of how to walk through a table using the PRIMARY KEY and have minimal impact on other queries: http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks (It is written with DELETE in mind, but the principle applies to UPDATE, too.)
Table availability
When any operation occurs, the rows involved are "locked" to prevent other queries from modifying them at the same time. ("Locking" involves multi-version control, etc.) They need to stay locked until the entire "transaction" is completed. Meanwhile, any changes need to be recorded in case the server crashes or the user decides to "roll back" the changes.
So, if millions of rows are being changed, then millions of locks are being held. That takes time.
My blog recommends doing only 1000 rows at a time; this is usually a small enough number to have very little interference with other tasks, yet large enough to get the task finished in a reasonable amount of time.
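Walking a table in chunks of 1000 might be sketched as follows. This is an illustration, not the blog's code: sqlite3 stands in for MySQL, and the table t with a positive integer primary key id and a price column is a hypothetical example.

```python
import sqlite3


def update_in_chunks(conn, chunk_size=1000):
    """Double `price` on every row of t, committing after each chunk.

    Returns the total number of rows updated. Assumes ids are positive.
    """
    last_id = 0
    total = 0
    while True:
        # Walk the primary key to find where the next chunk ends.
        ids = conn.execute(
            "SELECT id FROM t WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not ids:
            return total
        upper = ids[-1][0]
        # One short transaction per chunk, so locks are released quickly.
        with conn:
            cur = conn.execute(
                "UPDATE t SET price = price * 2 WHERE id > ? AND id <= ?",
                (last_id, upper),
            )
        total += cur.rowcount
        last_id = upper
```

Each chunk commits on its own, so other queries can interleave between chunks, at the cost of the table being only partly updated while the loop runs.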
Stock Split
Assuming the desired query (against a huge table) is something like
UPDATE t
SET price = 2 * price
WHERE date < '...'
AND ticker = '...'
You need an index (or possibly the PRIMARY KEY) to be (ticker, date). Most writes are date-oriented, but most reads are ticker-oriented? Given this, the following may be optimal:
PRIMARY KEY(ticker, date),
INDEX(date, ticker)
With that the rows that need modifying by the UPDATE are 'clustered' (consecutive) in the data's BTree. Hence there is some degree of efficiency. If, however, that is not "good enough", then it should be pretty easy to write code something like:
date_a = SELECT MIN(date) FROM t WHERE ticker = ?
SET AUTOCOMMIT=ON
Loop
date_z = date_a + 1 month
UPDATE t
SET price = 2 * price
WHERE date >= ? -- put date_a here
AND date < ? -- put date_z here
AND ticker = '...'
check for deadlock; if found, re-run the UPDATE
set date_a = date_z
exit loop when finished
End Loop
This will be reasonably fast, and have little impact on other queries. However, if someone looks at that ticker over a range of days while the loop is running, the prices may not be consistently updated. (If this concerns you, we can discuss further.)
There are a number of similar questions on here, but a lot of the answers say to force the use of an index and that doesn't seem to speed anything up for me.
I am wanting to show a "live" counter on my website showing the number of rows in a table. Kind of like how some websites show the number of registered users, or some other statistic, in "real time" (i.e. updated frequently using ajax or websockets).
My table has about 5M rows. It's growing fairly quickly and there is a high volume of inserts and deletes on it. Running
select count(*) from my_table
takes 1.367 seconds, which is unacceptable because I need my application to get the new row count about once per second.
I tried what many of the answers on here suggest and changed the query to:
select count(*) from my_table use index(my_index)
Where my_index is Normal, BTREE on a bigint field. But the time actually increased to 1.414 seconds.
Why doesn't using an index speed up the query as many answers on here said it would?
Another option some answers suggest is to put a trigger on the table that increments a column in another table. So I could create a stats table and whenever a row is inserted or deleted in my_table have a trigger increment or decrement a column in the stats table. Is this the only other option, since using an index doesn't seem to work?
EDIT: Here's a perfect example of the type of thing I'm trying to accomplish: https://www.freelancer.com. Scroll to the bottom of the page and you'll see:
Those numbers update every second or so.
It takes time to read 5 million records and count them -- whether in an index or in the raw data form.
If a "fast-and-dirty" solution is acceptable, you can use metadata:
SELECT table_rows
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = <whatever> and TABLE_NAME = <whatever2>;
Note that this can get out-of-sync.
Another possibility is to partition the table into smaller chunks. One advantage is that if the inserts and deletes tend to be to one partition, you can just count that and use metadata for the other partitions.
A trigger may or may not help in this situation, depending on the insert/delete load. If you are doing multiple inserts per minute, then a trigger is a no-brainer -- a fine solution. If you are doing dozens or hundreds of changes per second, then the overhead of the trigger might slow down the server.
If your system is so busy that the counting is having too much impact, then probably the INSERTing/DELETEing is also having impact. One way to improve INSERT/DELETE is to do them in 'batches' instead of one at a time.
Gather the INSERTs, preferably in the app, but optionally in a 'staging' table. Then, once a second (or whatever) copy them into the real table using an INSERT..SELECT, or (if needed) INSERT..ON DUPLICATE KEY UPDATE. DELETEs can go into the same table (with a flag) or a separate table.
The COUNT(*) can be done at the end of the batch. Or it could be dead reckoned (at much lower cost) by knowing what the count was, then adjusting by what the staging table(s) will change it by.
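The batch flush with a dead-reckoned count could look something like this. It is a sketch under assumptions: sqlite3 stands in for MySQL, the staging_insert and staging_delete tables and the payload column are hypothetical, and the flip-flopping between a pair of staging tables is left out.

```python
import sqlite3


def flush_staging(conn, known_count):
    """Apply staged inserts and deletes in one batch.

    Returns the new row count of my_table, derived from the previous count
    plus the rows actually inserted minus the rows actually deleted, so no
    COUNT(*) over my_table is needed.
    """
    with conn:
        inserted = conn.execute(
            "INSERT INTO my_table (id, payload) "
            "SELECT id, payload FROM staging_insert"
        ).rowcount
        deleted = conn.execute(
            "DELETE FROM my_table WHERE id IN (SELECT id FROM staging_delete)"
        ).rowcount
        # Empty the staging tables inside the same transaction.
        conn.execute("DELETE FROM staging_insert")
        conn.execute("DELETE FROM staging_delete")
    return known_count + inserted - deleted
```

Using the affected-row counts rather than the size of the staging tables means a staged delete for a row that no longer exists does not skew the result.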
This is a major upheaval to your app code, so don't embark on it unless you have spikes of, say, >100 INSERTs/DELETEs per second. (A steady 100 INSERTs/sec = 3 billion rows per year.)
For more details on "staging table", see http://mysql.rjweb.org/doc.php/staging_table Note that that blog advocates flip-flopping between a pair of staging tables, so as to minimize locks on them, and to allow multiple clients to coexist.
Have a job running in the background that does the following; then use its table for getting the count:
Loop:
INSERT INTO Counter (ct_my_table)
SELECT COUNT(*) FROM my_table;
sleep 1 second
end loop
At worst, it will be a couple of seconds out of date. Also note that INSERTs and DELETEs interfere with (read: slow down) the SELECT COUNT(*), hence the "sleep".
Have you noticed that some UIs say "About 120,000 thingies"? They are using even cruder estimations. But it is usually good enough for the users.
Take the inaccurate value from information_schema, as Gordon Linoff suggested
Another inaccurate source of the row count is SELECT MAX(id) - MIN(id)
Create a table my_table_count that stores the row count of my_table, and update it with triggers
In many cases you don't need an accurate value. Who cares if you show 36,400 users instead of the accurate 36,454?
I want to start counting the numbers of times a webpage is viewed and hence need some kind of simple counter. What is the best scalable method of doing this?
Suppose I have a table Frobs where each row corresponds to a page - some obvious options are:
Have an unsigned int NumViews field in the Frobs table which gets updated upon each view using UPDATE Frobs SET NumViews = NumViews + 1. Simple but not so good at scaling as I understand it.
Have a separate table FrobViews where a new row is inserted for each view. To display the number of views, you then need to do a simple SELECT COUNT(*) AS NumViews FROM FrobViews WHERE FrobId = '%d' GROUP BY FrobId. This doesn't involve any updates so can avoid table locking in MyISAM tables - however, the read performance will suffer if you want to display the number of views on each page.
How do you do it?
There's some good advice here:
http://www.mysqlperformanceblog.com/2007/07/01/implementing-efficient-counters-with-mysql/
but I'd like to hear the views of the SO community.
I'm using InnoDb at the moment, but am interested in answers for both InnoDb and MyISAM.
If scalability is more important to you than absolute accuracy of the figures then you could cache the view count in your application for a short time rather than hitting the database on every page view - eg, only update the database once every 100 views.
If your application crashes between database updates then obviously you'll lose some of your data, but if you can tolerate a certain amount of inaccuracy then this might be a useful approach.
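An in-application buffer of this kind might look like the sketch below. Assumptions: sqlite3 stands in for MySQL, the Frobs table has a FrobId key column, the BufferedViewCounter class is hypothetical, and the code is not thread-safe as written.

```python
import sqlite3
from collections import Counter


class BufferedViewCounter:
    """Accumulate views in memory and write them out every `flush_every` hits."""

    def __init__(self, conn, flush_every=100):
        self.conn = conn
        self.flush_every = flush_every
        self.pending = Counter()  # page id -> views not yet written
        self.hits_since_flush = 0

    def hit(self, page_id):
        self.pending[page_id] += 1
        self.hits_since_flush += 1
        if self.hits_since_flush >= self.flush_every:
            self.flush()

    def flush(self):
        """Write all pending views in one transaction, then reset the buffer."""
        with self.conn:
            self.conn.executemany(
                "UPDATE Frobs SET NumViews = NumViews + ? WHERE FrobId = ?",
                [(n, page_id) for page_id, n in self.pending.items()],
            )
        self.pending.clear()
        self.hits_since_flush = 0
```

Anything still in pending when the process dies is lost, which is the inaccuracy the answer says you must be willing to tolerate. Calling flush() on shutdown narrows that window.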
Inserting into a Database is not something you want to do on page views. You are likely to run into problems with updating your slave databases with all of the inserts since replication is single threaded on MySQL.
At my company we serve 25M page views a day and we have taken a tiered approach.
The view counter is stored in a separate table with 2 columns (profileId, viewCounter) both are unsigned integers.
For items that are infrequently viewed we update the table on page view.
For frequently viewed items we update MySQL about 1/10 of the time. For both types we update Memcache on every hit.
int Memcache::increment ( string $key [, int $value = 1 ] )
if (pageViews < 10000) { UPDATE page_view SET viewCounter = viewCounter + 1 WHERE profileId = ? }
else if (rand(1, 10) == 1) { UPDATE page_view SET viewCounter = ? WHERE profileId = ? } // the first ? is the value cached in Memcache
Doing count(*) is very inefficient in InnoDB (MyISAM keeps count stats in the index), but MyISAM will lock the table on reads, reducing concurrency. Doing a count(*) for 50,000 or 100,000 rows is going to take a long time. Doing a select on a PK will be very fast.
If you require more scalability, you might want to look at redis
I would take your second approach and aggregate the data into the table from your first solution on a regular basis. This way you get the advantages of both solutions. To be clearer:
On every hit you insert a row into a table (let's name it hit_counters). This table has only one field (the pageid). Every x seconds you run a script (via a cron job) which aggregates the data from the hit_counters table and puts it into a second table (let's name it hits). That table has two fields: the pageid and the total hits.
I'm not sure, but I don't think InnoDB helps you very much with solution 1 if you get many hits on the same page: InnoDB locks the row while updating, so all other updates to that row will be delayed.
Depending on what your program is written in, you could also batch the updates together by counting in your application and updating the database only every x seconds. This only works if you use a programming language where state persists between requests (like Java Servlets but not PHP).
What I do, and it may not apply to your scenario, is in the stored procedure that prepares/returns the data that is displayed on the page, I make the table counter update at the same time it returns the data - that way, there is only one call to the server that both gets the data, and updates the counter in the same call.
If you are not using SPs (or if there is no database data on your page), this option may not be available to you, but if you are, it's something to consider.
I am working on a website with a simple normalized database.
There is a table called Pages and a table called Views. Each time a Page is viewed, a unique record of that View is recorded in the Views table.
When displaying a Page on the site, I use a simple MySQL COUNT() to total up the number of Views for display.
Database design seems fine, except for this problem: I am at a loss for how to retrieve the top 10 most viewed pages among thousands.
Should I denormalize the Pages table by adding a Pages.views column to hold the total number of views for each page? Or is there an efficient way to query for the top 10 most viewed pages?
SELECT p.pageid, count(*) as viewcount FROM
pages p
inner join views v on p.pageid = v.pageid
group by p.pageid
order by count(*) desc
LIMIT 10 OFFSET 0;
I can't test this, but something along those lines. I would not store the value unless I have to due to performance constraints (I just learned the term "premature optimization", and it seems to apply if you do).
It depends on the level of information you are trying to maintain. If you want to record who viewed and when, then the separate table is fine. Otherwise, a column for Views is the way to go. Also, if you keep a separate column, you'll find that the table will be locked more often, since each page view will try to update the column for its corresponding row.
Select pageid, Count(*) as countCol from Views
group by pageid order by countCol DESC
LIMIT 10 OFFSET 0;
Database normalization is all about the most efficient / least redundant way to store data. This is good for transaction processing, but often directly conflicts with the need to efficiently get the data out again. The problem is usually addressed by having derived tables (indexes, materialized views, rollup tables...) with more accessible, pre-processed data. The (slightly dated) buzzword here is Data Warehousing.
I think you want to keep your Pages table normalized, but have an extra table with the totals. Depending on how recent those counts need to be, you can update the table when you update the original table, or you can have a background job to periodically recalculate the totals.
You also want to do this only if you really run into a performance problem, which you will not unless you have a very large number of records, or a very large number of concurrent accesses. Keep your code flexible to be able to switch between having the table and not having it.
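One way to keep that switch cheap is to hide the choice behind a single function. The sketch below assumes sqlite3 as a stand-in for MySQL and a hypothetical page_totals(pageid, viewcount) rollup table; top_pages and rebuild_totals are illustrative names.

```python
import sqlite3


def top_pages(conn, limit=10, use_rollup=False):
    """Return [(pageid, viewcount), ...] for the most viewed pages.

    With use_rollup=False the counts come straight from the views table;
    with use_rollup=True they come from the precomputed page_totals table.
    """
    if use_rollup:
        sql = (
            "SELECT pageid, viewcount FROM page_totals "
            "ORDER BY viewcount DESC, pageid LIMIT ?"
        )
    else:
        sql = (
            "SELECT pageid, COUNT(*) AS viewcount FROM views "
            "GROUP BY pageid ORDER BY viewcount DESC, pageid LIMIT ?"
        )
    return conn.execute(sql, (limit,)).fetchall()


def rebuild_totals(conn):
    """Recalculate page_totals from scratch, e.g. from a background job."""
    with conn:
        conn.execute("DELETE FROM page_totals")
        conn.execute(
            "INSERT INTO page_totals (pageid, viewcount) "
            "SELECT pageid, COUNT(*) FROM views GROUP BY pageid"
        )
```

Callers never see which source was used, so the rollup table can be introduced later, only if the grouped query turns out to be too slow.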
I would probably include the views column in the Pages table.
It seems like a perfectly reasonable breaking of normalization to me. Especially since I can't imagine you deleting views so you wouldn't expect the count to get out of whack. Referential integrity doesn't seem super-critical in this case.
Denormalizing would definitely work in this case. Your loss is the extra storage room used up by the extra column.
Alternatively you could set up a scheduled job to populate this information nightly, or at whatever interval suits you, at a time when your traffic is low.
In this case you would be losing the ability to instantly know your page counts unless you run this query manually.
Denormalization can definitely be employed to increase performance.
--Kris
While this is an old question, I'd like to add my answer because I find the accepted one to be misguided.
It is one thing to have COUNT for a single selected row; it is quite another to sort by the COUNT of ALL rows.
Even if you have just 1000 rows, each counted with some join, you can easily involve reading tens of thousands if not millions of rows.
It can be ok if you only call this occasionally, but it is very costly otherwise.
What you can do is to add a TRIGGER:
CREATE TRIGGER ins AFTER INSERT ON table1 FOR EACH ROW
UPDATE table2
SET count = count + 1
WHERE CONDITION
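A complete version of such a trigger, as a sketch only: sqlite3 stands in for MySQL (whose trigger syntax differs slightly), and the pages and views tables, the view_count column, and the trigger names are hypothetical. A matching DELETE trigger is included so the count cannot drift if views are ever removed.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pages (pageid INTEGER PRIMARY KEY, view_count INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE views (viewid INTEGER PRIMARY KEY, pageid INTEGER NOT NULL);

    -- Keep pages.view_count in step with the rows of the views table.
    CREATE TRIGGER views_ins AFTER INSERT ON views FOR EACH ROW
    BEGIN
        UPDATE pages SET view_count = view_count + 1 WHERE pageid = NEW.pageid;
    END;

    CREATE TRIGGER views_del AFTER DELETE ON views FOR EACH ROW
    BEGIN
        UPDATE pages SET view_count = view_count - 1 WHERE pageid = OLD.pageid;
    END;
    """
)

conn.executemany("INSERT INTO pages (pageid) VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO views (pageid) VALUES (?)", [(1,), (2,), (2,)])
conn.execute("DELETE FROM views WHERE viewid = 1")
conn.commit()

# The top-10 query now reads one small column instead of grouping every view.
top = conn.execute(
    "SELECT pageid, view_count FROM pages ORDER BY view_count DESC LIMIT 10"
).fetchall()
```

The cost moves to write time: every insert into views now also updates one row in pages.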