I want to start counting the numbers of times a webpage is viewed and hence need some kind of simple counter. What is the best scalable method of doing this?
Suppose I have a table Frobs where each row corresponds to a page. Some obvious options are:

1. Have an unsigned int NumViews field in the Frobs table which gets updated upon each view using UPDATE Frobs SET NumViews = NumViews + 1. Simple, but not so good at scaling as I understand it.

2. Have a separate table FrobViews where a new row is inserted for each view. To display the number of views, you then need to do a simple SELECT COUNT(*) AS NumViews FROM FrobViews WHERE FrobId = '%d' GROUP BY FrobId. This doesn't involve any updates, so it can avoid table locking in MyISAM tables; however, read performance will suffer if you want to display the number of views on each page.
How do you do it?
There's some good advice here:
http://www.mysqlperformanceblog.com/2007/07/01/implementing-efficient-counters-with-mysql/
but I'd like to hear the views of the SO community.
I'm using InnoDB at the moment, but am interested in answers for both InnoDB and MyISAM.
If scalability is more important to you than absolute accuracy of the figures then you could cache the view count in your application for a short time rather than hitting the database on every page view - eg, only update the database once every 100 views.
If your application crashes between database updates then obviously you'll lose some of your data, but if you can tolerate a certain amount of inaccuracy then this might be a useful approach.
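For example, here is a rough sketch of that idea in PHP using APCu as the in-process cache; the key name, the threshold of 100, and the $db PDO connection are assumptions for illustration:

function count_view(PDO $db, int $pageId, int $flushEvery = 100): void
{
    apcu_add("pending_views:$pageId", 0);            // create the local counter if it doesn't exist
    $pending = apcu_inc("pending_views:$pageId");    // count this hit in memory only
    if ($pending >= $flushEvery) {
        apcu_store("pending_views:$pageId", 0);      // reset the local counter
        $stmt = $db->prepare('UPDATE Frobs SET NumViews = NumViews + ? WHERE FrobId = ?');
        $stmt->execute([$pending, $pageId]);         // write the whole batch to MySQL at once
    }
}

As the answer notes, anything still sitting in the cache is lost if the server goes down before the next flush.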
Inserting into a database is not something you want to do on every page view. You are likely to run into problems keeping your slave databases up to date with all of the inserts, since replication is single-threaded in MySQL.
At my company we serve 25M page views a day and we have taken a tiered approach.
The view counter is stored in a separate table with two columns (profileId, viewCounter), both unsigned integers.
For items that are infrequently viewed we update the table on page view.
For frequently viewed items we update MySQL about 1/10 of the time. For both types we update Memcache on every hit.
$views = $memcache->increment("profile_views:$profileId");   // Memcache::increment(string $key [, int $value = 1])
if ($views < 10000) {
    $db->prepare('UPDATE page_view SET viewCounter = viewCounter + 1 WHERE profileId = ?')->execute([$profileId]);
} elseif (rand(1, 10) == 1) {   // roughly 1/10 of the time, flush the cached total to MySQL
    $db->prepare('UPDATE page_view SET viewCounter = ? WHERE profileId = ?')->execute([$views, $profileId]);
}
Doing COUNT(*) is very inefficient in InnoDB (MyISAM keeps an exact row count in its table metadata), but MyISAM's table-level locking reduces concurrency. Doing a COUNT() over 50,000 or 100,000 rows is going to take a long time. A SELECT on a primary key will be very fast.
If you require more scalability, you might want to look at Redis.
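For instance, with the phpredis extension a view counter is a single atomic command; the host, port, and key name below are placeholders:

$redis = new Redis();                          // assumes the phpredis extension and a local Redis server
$redis->connect('127.0.0.1', 6379);
$views = $redis->incr("frob_views:$frobId");   // atomic increment; returns the new count

The counts can then be written back to MySQL periodically, as in the other answers.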
I would take your second approach and aggregate the data into the table from your first solution on a regular basis. That way you get the advantages of both solutions. To be clearer:
On every hit you insert a row into a table (let's name it hit_counters). This table has only one field (the pageid). Every x seconds you run a script (via a cron job) which aggregates the data from the hit_counters table and puts it into a second table (let's name it hits), which has two fields: the pageid and the total hits.
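A rough sketch of what that cron job could run, assuming a PDO connection in $db and that hits.pageid is the primary key (the exact statements are illustrative):

$db->exec("INSERT INTO hits (pageid, total)
           SELECT pageid, COUNT(*) FROM hit_counters GROUP BY pageid
           ON DUPLICATE KEY UPDATE total = total + VALUES(total)");
$db->exec("DELETE FROM hit_counters");   // hits logged between the two statements would be lost;
                                         // swapping in an empty table with RENAME TABLE avoids that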
I'm not sure, but IMHO InnoDB does not help you very much with solution 1 if you get many hits on the same page: InnoDB locks the row while updating it, so all other updates to that row will be delayed.
Depending on what your program is written in, you could also batch the updates together by counting in your application and updating the database only every x seconds. This only works if you use a platform with persistent state between requests (like Java servlets, but not plain PHP).
What I do, and it may not apply to your scenario, is update the counter inside the stored procedure that prepares and returns the data displayed on the page. That way there is only one call to the server, which both gets the data and updates the counter.
If you are not using SPs (or if there is no database data on your page), this option may not be available to you, but if you are, it's something to consider.
I am trying to understand how a huge volume of updates to tables affects data availability for users. I have been going through various posts (fastest-way-to-update-120-million-records, Avoid locking while updating) which walk through the different mechanisms for doing large updates, such as populating a completely new table if this can be done offline, and doing batch updates if it cannot.
I am trying to understand how these large updates affect table availability to users, and what the best way is to do large updates while making sure the table stays available for reads.
Use case: updating transaction details based on the primary key (like updating stock holdings due to a stock split).
It is unclear what you need to do.
Replace the entire table -- populate new table, then swap
Change one column for all rows -- Sounds like sloppy design. Please elaborate on what you are doing.
Change one column for some rows -- ditto.
Adding a new column and initializing it -- Consider creating a parallel table, etc. This will have zero blockage but adds some complexity to your code.
The values are computed from other columns -- consider a "generated" column. (What version of MySQL are you using?)
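As a hedged illustration of the generated-column idea (requires MySQL 5.7+; the column names are invented for the example, and $db is assumed to be a PDO connection):

$db->exec("ALTER TABLE t
               ADD COLUMN adjusted_price DECIMAL(10,2)
               GENERATED ALWAYS AS (price * split_factor) STORED");
// the column is always derived from the other columns, so it never needs a mass UPDATE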
Here is a discussion of how to walk through a table using the PRIMARY KEY and have minimal impact on other queries: http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks (It is written with DELETE in mind, but the principle applies to UPDATE, too.)
Table availability
When any operation occurs, the rows involved are "locked" to prevent other queries from modifying them at the same time. (Locking also involves multi-version concurrency control, etc.) They need to stay locked until the entire "transaction" is completed. Meanwhile, any changes need to be recorded in case the server crashes or the user decides to "roll back" the changes.
So, if millions of rows are being changed, then millions of locks are being held. That takes time.
My blog recommends doing only 1000 rows at a time; this is usually a small enough number to have very little interference with other tasks, yet large enough to get the task finished in a reasonable amount of time.
Stock Split
Assuming the desired query (against a huge table) is something like
UPDATE t
SET price = 2 * price
WHERE date < '...'
AND ticker = '...'
You need an index (or possibly the PRIMARY KEY) to be (ticker, date). Most writes are date-oriented, but most reads are ticker-oriented? Given this, the following may be optimal:
PRIMARY KEY(ticker, date),
INDEX(date, ticker)
With that, the rows the UPDATE needs to modify are 'clustered' (consecutive) in the data's B-tree, so there is some degree of efficiency. If, however, that is not "good enough", then it should be pretty easy to write code something like:
date_a = SELECT MIN(date) FROM t WHERE ticker = ?
SET AUTOCOMMIT=ON
Loop
    date_z = date_a + 1 month
    UPDATE t
        SET price = 2 * price
        WHERE date >= ?      -- put date_a here
          AND date < ?       -- put date_z here
          AND ticker = '...'
    check for deadlock; if found, re-run the UPDATE
    set date_a = date_z
    exit loop when finished
End Loop
This will be reasonably fast and have little impact on other queries. However, if someone looks at that ticker over a range of days, the prices may not be consistently updated. (If this concerns you, we can discuss further.)
Case 1: I have a table A with 1 insert per second.
From my admin area I need to do some heavy reads and deletes on this table to perform some statistics and maintenance.
Does it make sense to insert the incoming data into 2 different tables, A and B, and use table B for my administration? The goal is to not overload table A.
Case 2:
Another example to fully understand the logic: I have a table (tmpA) dedicated to holding search results. Each time there is a search, the results are inserted into this table to help with pagination. At night, old results are deleted.
Currently I have 5 requests per second for this table, so approximately 500 rows * 5 = 2500 rows per second.
Does it make sense to create more tables (tmpA, tmpB, tmpC, etc.) to spread the inserts and avoid overload?
For case 1, if it makes sense to duplicate, what is the difference between "manually" inserting the incoming data into 2 (or more) different tables and using MySQL replication?
Thanks to you,
jess
This is kinda difficult to answer, as it depends on your setup hardware-wise.
An insert per second isn't that much. A properly set up server should be able to handle it.
Reads on a table are non-blocking, so gathering info for statistics (assuming you don't do the statistics calculations in the database) shouldn't affect the performance of your database.
Deletes, on the other hand, are blocking and will add to the load on a table with heavy inserts.
For Case 1, I do not understand how you would want to split the load on different tables. Generally speaking, there's a database-server load, and not specifically a table load (unless we define blocking processes as table load).
I gather from the comments that Case 1 is user signups/registration. Splitting user information over two tables is horrid from a maintenance perspective, and the coupling between the two tables that inevitably needs to happen only increases overhead (load) instead of decreasing it. Deleting data (users?) is also a major issue if the data is divided over two tables. Can you explain how you would administer your data if it is divided over two tables? I'm probably missing something.
Looking at the above, I do not recommend splitting this data between tables.
What I do recommend is:
Use InnoDB as the table type. It has finer-grained (row-level) locking than MyISAM, which locks whole tables.
Optimize your RAM/memory usage for MySQL. Proper memory settings allow for very quick reads and writes.
Optimize your indexes. The EXPLAIN statement can show which ones are used for each query.
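For that last point, a quick way to check index usage from PHP (the query and table name are just examples; $db is assumed to be a PDO connection):

foreach ($db->query('EXPLAIN SELECT * FROM tmpA WHERE user_id = 42') as $row) {
    print_r($row);   // look at the "key" and "rows" columns to see which index was chosen
}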
Case 2
I don't fully understand the use case, but it might make sense to split this data up into several tables. Depending on why you want to push the data into these temp tables, splitting might happen per user, keyword, or other significant feature.
Depending on the use case, try limiting the search results (and thus implementing pagination) with LIMIT clauses. You don't need to store results for pagination that way, or store the results at all. Can you explain why you want to store these results? 2500 rows/sec is a lot.
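A sketch of plain LIMIT/OFFSET pagination, which avoids the temp table entirely (the table and column names are invented; $db is assumed to be a PDO connection):

$perPage = 20;
$offset  = ($page - 1) * $perPage;                  // $page comes from the request
$stmt = $db->prepare('SELECT * FROM articles WHERE title LIKE ? ORDER BY id LIMIT ? OFFSET ?');
$stmt->bindValue(1, '%' . $term . '%');
$stmt->bindValue(2, $perPage, PDO::PARAM_INT);      // PARAM_INT so the LIMIT values are not quoted
$stmt->bindValue(3, $offset, PDO::PARAM_INT);
$stmt->execute();
$rows = $stmt->fetchAll();

For deep pages a keyset condition (WHERE id > last seen id) scales better, but either way nothing has to be stored between requests.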
Replication is a whole other topic, much more complicated, and achieved not by copying tables but by copying servers. I can't help you with that; I've never done it, as I never needed it. (My largest MySQL server was approx. 80 GB, 350 million rows, with inserts peaking at 224 rows per second.)
Can you paste the structure of the tables you currently use, and some sample data? That might make the cases a tad clearer.
I have 3 very large tables with clustered indexes on composite keys. No updates, only inserts. New inserts will not be within the existing index range, but the new inserts will not align with the clustered index, and these tables get a lot of inserts (hundreds to thousands per second). What I would like to do is DBREINDEX with a fill factor of 100, but then set a fill factor of 5 and have that fill factor applied ONLY to inserts. Right now a fill factor applies to the whole table only. Is there a way to have a fill factor that applies to inserts (or inserts and updates) only?
I don't care about select speed at this time; I am loading data. When the data load is complete I will DBREINDEX at 100. A fill factor of 10 versus 30 doubles the rate at which new data is inserted. This load will take a couple of days and it cannot go live until the data is loaded. The clustered indexes are aligned with the dominant query used by the end-user application.
My practice is to DBREINDEX daily, but the problem now is that the tables are getting large and a DBREINDEX at a fill factor of 10 takes a long time. I have considered indexing into "daily" tables and then inserting that data daily, sorted by the clustered index, into the production tables.
If you read this far, here is even more detail. The indexes are all composite, and I am running 6 instances of the parser on an 8-core server (a lot of testing showed that gives the best throughput). The data out of a SINGLE parser is in PK order, and I am doing the inserts 990 values at a time (SQL value limits). The 3 active tables only share data via a foreign key relationship with a single, relatively inactive 4th table. My thought at this time is to have holding tables for each parser and then have another process that polls those tables for the next complete insert and moves the data into the production table in PK order. That is going to be a lot of work; I hope someone has a better idea.
The parses start in PK order but rarely finish in PK order. Some individual parses are so large that I cannot hold all the data in memory until the end. Right now the SQL insert is slightly faster than the parse that creates the data. Within an individual parse I run the insert asynchronously and keep parsing, but I don't start the next insert until the prior insert is complete.
I agree you should have holding tables for the parser data and only insert into the main tables when you're ready. I implemented something similar in a former life (the data was quasi-hashed into 10 tables based on mod 10 of the unique ID, then rolled into the primary table later, primarily to assist load speed). If you're going to use holding tables, then I see no need to have them at anything but FF = 100. The fewer pages you have to use, the better.
Apparently, too, you should test the difference between permanent tables, #temp tables, and table-valued parameters. :-)
In an effort to add statistics and tracking for users on my site, I've been thinking about the best way to keep counters of pageviews and other very frequently occurring events. Now, my site obviously isn't the size of Facebook to warrant some of the strategies they've implemented (sharding isn't even necessary, for example), but I'd like to avoid any blatantly stupid blunders.
It seems the easiest way to keep track is to just have an integer column in the table. For example, each page has a pageview column that just gets incremented by 1 for each pageview. This seems like it might be an issue if people hit the page faster than the database can write.
If two people hit the page at the same time, for example, then the previous_pageview count would be the same prior to both updates, and each update would update it to be previous_pageview+1 rather than +2. However, assuming a database write speed of 10ms (which is really high, I believe) you'd need on the order of a hundred pageviews per second, or millions of pageviews per day.
Is it okay, then, for me to be just incrementing a column? The exact number isn't too important, so some error here and there is tolerable. Does an update statement on one column slow down if there are many columns for the same row? (My guess is no.)
I had this plan to use a separate No-SQL database to store pk_[stat]->value pairs for each stat, incrementing those rapidly, and then running a cron job to periodically update the MySQL values. This feels like overkill; somebody please reassure me that it is.
UPDATE foo SET counter = counter + 1 is atomic. It will work as expected even if two people hit at the same time.
It's also common to throw view counts into a secondary table, and then update the actual counts nightly (or at some interval).
INSERT INTO page_view (page_id) VALUES (1);
...
UPDATE page SET views = views + new_views WHERE id = 1;
This should be a little faster than X = X + 1, but requires a bit more work.
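A hedged sketch of that nightly rollup as a cron script; the statements are illustrative, not taken from the answer, and assume a PDO connection in $db plus the two tables shown above:

$db->exec("UPDATE page p
           JOIN (SELECT page_id, COUNT(*) AS new_views
                 FROM page_view GROUP BY page_id) v ON v.page_id = p.id
           SET p.views = p.views + v.new_views");
$db->exec("DELETE FROM page_view");   // views logged between these two statements are lost;
                                      // renaming page_view to a scratch table first avoids that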
This problem is pretty hard to describe and therefore difficult to search for an answer to. I hope some experts can share their opinions on it.
I have a table with around 1 million records. The table structure is similar to something like this:
items {
    uid (primary key, bigint, 15)
    updated (indexed, int, 11)
    enabled (indexed, tinyint, 1)
}
The scenario is like this: I have to select all of the records every day and do some processing. It takes around 3 seconds to handle each item.
I have written a PHP script to fetch 200 items each time, using the following:
select * from items where updated > unix_timestamp(now()) - 86400 and enabled = 1 limit 200;
I then update the "updated" field of the selected items to make sure they won't be selected again within one day. The update query is something like this:
update items set updated = unix_timestamp(now()) where uid in (1,2,3,4,...);
Then the PHP script continues to run and processes the data, which doesn't require any MySQL connection anymore.
Since I have a million records and each record takes 3 seconds to process, it's definitely impossible to do it sequentially. Therefore, I execute the PHP script every 10 seconds.
However, as time goes by and the table grows, the select gets much slower. Sometimes it takes more than 100 seconds to run!
Do you guys have any suggestion how may I solve this problem?
There are two points that I can think of that should help:
a. unix_timestamp(now()) - 86400
... don't make MySQL evaluate this expression as part of the query; compute the cutoff once before each run and pass it in as a constant (see the sketch after point b).
b. Indexes help reads but can slow down writes.
Consider disabling the indexes before updating (DISABLE KEYS) and then re-enabling them before reading (ENABLE KEYS).
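A minimal sketch of point (a), computing the cutoff once in PHP rather than inside the query (it assumes a PDO connection in $db):

$cutoff = time() - 86400;                            // one constant per run
$stmt = $db->prepare('SELECT * FROM items WHERE updated > ? AND enabled = 1 LIMIT 200');
$stmt->execute([$cutoff]);
$items = $stmt->fetchAll();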
I don't think the index on enabled is doing you any good, the cardinality is too low. Remove that and your UPDATEs should go faster.
I am not sure what you mean when you say each record takes 3 seconds, since you are handling them in batches of 200. How are you determining this, and what other processing is involved?
You could do this:
dispatcher.php: Manages the whole process.
fetches items in convenient packages from the database
calls worker.php on the same server with an HTTP post containing all UIDs fetched (I understand that worker.php would not need more than the UID to do its job)
maintains a counter of how many worker.php scripts are running. When one is started, the counter is incremented (up to a certain limit); when a worker returns, the counter is decremented. See "Asynchronous PHP calls?".
repeats until all records are fetched once. Maintain a MySQL LIMIT counter and do not work with updated.
worker.php: does the actual work
does its thing with each item posted.
writes to a helper table the ID of each item it has processed (no index on that table)
dispatcher.php: housekeeping.
once all workers have returned, updates the main table with the helper table in a single statement
error recovery
since worker.php would update the helper table after each item done, you can use the state of the helper table to recover from a crash. Saving the "work package" of each individual worker before it starts running would help to recover worker states as well.
You would have a multi-threaded processing chain this way and could even distribute the whole thing across multiple machines.
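As a rough sketch of the dispatcher's fan-out step, the following uses curl_multi; it is simplified (it fires one worker per batch in parallel rather than maintaining the running-worker counter described above), and worker.php accepting a JSON-encoded "uids" POST field is an assumption:

$mh = curl_multi_init();
$handles = [];
foreach ($batches as $uids) {                       // $batches: arrays of UIDs fetched earlier
    $ch = curl_init('http://localhost/worker.php');
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => ['uids' => json_encode($uids)],
        CURLOPT_RETURNTRANSFER => true,
    ]);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}
do {                                                // run the workers in parallel and wait
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);
foreach ($handles as $ch) {
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);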
You could try running this before the update:
ALTER TABLE items DISABLE KEYS;
and then when you're done updating,
ALTER TABLE items ENABLE KEYS;
That should recreate the index much faster than updating each record at a time will.
For a table with fewer than a couple of billion records, the primary key should be an unsigned int rather than a bigint.
One idea:
Use a HANDLER; it can improve your performance considerably:
http://dev.mysql.com/doc/refman/5.1/en/handler.html
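For illustration, a minimal HANDLER round trip from PHP; it assumes a PDO connection in $db and that the index on updated is named `updated` (HANDLER reads skip most optimizer overhead, but they give no consistent snapshot):

$cutoff = time() - 86400;
$db->exec('HANDLER items OPEN');
$rows = $db->query("HANDLER items READ `updated` >= ($cutoff) WHERE enabled = 1 LIMIT 200")->fetchAll();
$db->exec('HANDLER items CLOSE');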