How to count page views in MySQL without a performance hit

I want to count the number of visitors to a page, similar to what Stack Overflow does with the "views" of each question.
The current solution just increments a field of an InnoDB table:
UPDATE data SET readers = readers + 1, date_edited = date_edited WHERE ID = '881529' LIMIT 1
This is the most expensive query on the page since it is performing a write operation.
Is there a better solution to the problem? How do high traffic sites like stackoverflow handle this?
I am thinking of writing to a table that uses the MEMORY engine instead, and copying that content to an InnoDB table every minute or so.
e.g.:
INSERT INTO mem_table (id,views_new)
VALUES (881525,1)
ON DUPLICATE KEY UPDATE views_new = views_new+1
Then I would run a cron job every minute to update the InnoDB table:
UPDATE data d, mem_table m
SET d.readers = d.readers + m.views_new
WHERE d.ID = m.ID;
DELETE FROM mem_table;
Unfortunately, this does not work well with replication, and the application is using a MySQL Galera Cluster.
Thank you in advance for any suggestions.

You can reduce the immediate performance hit by starting a separate thread to update your counters. When you have a high number of parallel users (and therefore many parallel updates of your hit counters), it is advisable to use a queuing mechanism to prevent locking (much like your in-memory table). Your queue will see both writes and reads, so you have to take the table and data design into account.
An alternative is to keep the counter for each article in a separate file. This prevents congestion on a single table of hit counters or, if you keep the counter in the table serving the articles, a high lock wait timeout on that article table (resulting in all kinds of front-end errors). Keeping the data in separate files does not give you insight into the overall hits on your site, but for that you could just use a log graphing tool like awstats.

If you can batch 100 INSERTs/UPDATEs together in a single statement, you can run it 10 times as fast. (There is a risk of lock_wait_timeout and/or deadlock.)
What if you build a MEMORY table and lose the queued data in a power failure? I assume that is OK for this application? (If not, you have a much bigger problem.)
What are your client(s)? Can they queue up things before even touching the database?
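As a rough illustration of queuing on the client before touching the database, here is a minimal Python sketch. It assumes a hypothetical `view_counts (id, views)` table with `id` as its primary key and a driver that uses `%s` placeholders; neither comes from the question. It only builds the batched statement, and executing it is left to whatever connection the application already has.

```python
import threading
from collections import Counter


class ViewBuffer:
    """Accumulate page views in process memory and flush them as one statement."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = Counter()

    def hit(self, page_id):
        # Called once per page view; does not touch the database.
        with self._lock:
            self._pending[page_id] += 1

    def drain(self):
        # Atomically take everything queued so far and start a fresh counter.
        with self._lock:
            pending, self._pending = self._pending, Counter()
        return pending

    def build_flush(self):
        """Return (sql, params) for one multi-row upsert, or None if nothing is queued."""
        pending = self.drain()
        if not pending:
            return None
        rows = sorted(pending.items())  # fixed id order on every flush
        placeholders = ", ".join(["(%s, %s)"] * len(rows))
        # view_counts is a hypothetical table name used only for this sketch.
        sql = (
            "INSERT INTO view_counts (id, views) VALUES " + placeholders +
            " ON DUPLICATE KEY UPDATE views = views + VALUES(views)"
        )
        params = [value for row in rows for value in row]
        return sql, params
```

A background thread or timer could call `build_flush()` every few seconds and run the result. As with the MEMORY table, anything still queued is lost if the process dies.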
I like ping-ponging a pair of tables for staging data into the database. Clients write to one table; a continuously running job (not a cron job) is working with the other table. When the latter finishes with inserts/updates, it swaps the tables with a single, atomic, RENAME TABLE so that the clients are oblivious. My Staging Table blog discusses this in further detail. It explains how to avoid the replication problems you encountered.
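One pass of such a ping-pong job might issue statements along these lines. This is only a sketch, not the blog's exact procedure: the table names `stage_in`, `stage_work` and `stage_tmp` are made up, and both staging tables are assumed to share the question's `mem_table` layout (`id` as primary key plus `views_new`).

```python
def staging_cycle_statements(main="data", active="stage_in", idle="stage_work",
                             tmp="stage_tmp"):
    """Return the SQL for one pass of the background job, in execution order.

    `active` is the table clients write to; `idle` is the one the job owns.
    All staging table names here are hypothetical.
    """
    return [
        # 1. Fold the batch the job owns into the real table.
        f"UPDATE {main} d JOIN {idle} m ON d.ID = m.id "
        f"SET d.readers = d.readers + m.views_new",
        # 2. Empty the job's table so it is clean when clients start using it.
        f"TRUNCATE TABLE {idle}",
        # 3. Swap the two tables in a single atomic statement.
        f"RENAME TABLE {active} TO {tmp}, {idle} TO {active}, {tmp} TO {idle}",
    ]
```

Because the swap is a single `RENAME TABLE`, clients always find a table under the name they write to, and the rows they wrote before the swap are picked up on the job's next pass.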
Another tip. Do not put the count and date in the main table. Put them in a 'parallel table' ('vertical partitioning'). This cuts down on the bulkiness in replication and decreases the interference with other processing.
For Galera, use a pair of non-replicated tables (I suggest MyISAM with no indexes). Have the continually running job run in one place, cycling through the 3 nodes. If you had 3 jobs, there would be several ways in which they are more likely to stumble over each other.
If this won't keep up, you need to Shard your data. (That's what the big folks do, sooner or later.)

Related

Resources consumed by a simple SELECT query in MySql

There are a few large tables in one of a customer's databases (each table is ~50M rows in size and is not too wide). The intent is to read these tables in full, but infrequently. As there are no reasonable CDC indices present, the plan is to read the tables by querying them:
SELECT * from large_table;
The reads will be performed using a jdbc driver. With the following fetch configuration present, the intent is to read the data approximately one record at a time (it may require a significant amount of time) so that the client code is never overwhelmed.
PreparedStatement stmt = connection.prepareStatement(queryString, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
I was going through the execution path of a query in High Performance MySQL, however some questions seemed unanswered:
Without the temp tables being explicitly created and the query cache being made use of, "how" are the stream reads tracked on the server?
Is any temporary data created (in main memory or files on disk) whatsoever? If so, where is it created and how much?
If temporary data is not created, how are the rows to be returned tracked? Does the query engine keep track of all the page files to be read for this query on this connection? In case there are several such queries running on the server, are the earliest "Tracked" files purged in favor of queries submitted recently?
PS: I want to understand the effect of this approach on the MySQL server (not saying that there aren't better ways of reading the tables)
That simple query will not use a temp table. It will simply fetch the rows and transfer them to the client until it finishes. Nor would any possible index be useful. (If the real query is more complex, let's see it.)
The client may wait for all the rows (faster, but memory intensive) before it hands any to the user code, or it may hand them off one at a time (much slower).
I don't know the details in JDBC on specifying it.
You may want to page through the table. If so, don't use OFFSET, but use the PRIMARY KEY and "remember where you left off". More discussion: http://mysql.rjweb.org/doc.php/pagination
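The "remember where you left off" idea can be sketched as follows. Python's built-in sqlite3 module stands in for MySQL here so that the example runs on its own; the `WHERE id > ? ORDER BY id LIMIT ?` pattern is the part that carries over. The `id` and `payload` columns are assumptions, since the question does not show the table definition.

```python
import sqlite3


def iter_by_primary_key(conn, page_size=1000):
    """Yield every row of large_table in id order, one page at a time."""
    last_id = None
    while True:
        if last_id is None:
            # First page: start from the beginning of the primary key.
            rows = conn.execute(
                "SELECT id, payload FROM large_table ORDER BY id LIMIT ?",
                (page_size,),
            ).fetchall()
        else:
            # Later pages: seek straight to the row after the last one seen.
            rows = conn.execute(
                "SELECT id, payload FROM large_table WHERE id > ? ORDER BY id LIMIT ?",
                (last_id, page_size),
            ).fetchall()
        if not rows:
            return
        yield from rows
        last_id = rows[-1][0]  # remember where we left off


# Small in-memory table, used only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE large_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO large_table VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 26)],
)
```

Unlike OFFSET, each page starts with an index seek on the primary key, so later pages should cost about the same as the first.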
Your Question #3 leads to a complex answer...
Every query brings all the relevant data (and index entries) into RAM. The data/index is read in chunks ("blocks") of 16KB from the BTree structure that is persisted on disk. For a simple select like that, it will read the blocks 'sequentially' until finished.
But, be aware of "caching":
If a block is already in RAM, no I/O is needed.
If a block is not in the cache ("buffer_pool"), it will, if necessary, bump some block out and read the desired block in. This is very normal, and very common. Do not fear it.
Because of the simplicity of the query, only a few blocks ever need to be in RAM at any moment. Hence, if your buffer pool were only a few megabytes, it could still handle, say, a 1TB table. There would be a lot of I/O, and that would impact other operations.
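To make the caching point concrete, here is a toy model of a buffer pool during a full scan. It uses a plain least-recently-used policy, which is a simplification (InnoDB's real replacement policy is more elaborate), and the class and function names are invented for this sketch.

```python
from collections import OrderedDict


class TinyBufferPool:
    """Toy cache of fixed-size blocks with plain LRU eviction."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # block number -> contents, oldest first
        self.disk_reads = 0
        self.peak_resident = 0

    def get(self, block_no):
        if block_no in self.blocks:
            # Already in RAM: no I/O needed.
            self.blocks.move_to_end(block_no)
            return self.blocks[block_no]
        if len(self.blocks) >= self.capacity:
            # Bump out the least recently used block to make room.
            self.blocks.popitem(last=False)
        self.disk_reads += 1  # simulated read from disk
        self.blocks[block_no] = f"block-{block_no}"
        self.peak_resident = max(self.peak_resident, len(self.blocks))
        return self.blocks[block_no]


def full_scan(pool, total_blocks):
    """Read every block once, in order, as a simple SELECT * would."""
    for block_no in range(total_blocks):
        pool.get(block_no)
```

Scanning 1000 blocks through a 4-block pool costs 1000 simulated disk reads but never holds more than 4 blocks at once, which is the trade-off described above: little memory, a lot of I/O.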
As for "tracking", let me use the analogy of reading a long book in a single sitting. There is nothing to track, you are simply turning pages ('blocks'). You don't even need a 'bookmark' for tracking, it is next-next-next...
Another note: InnoDB uses "B+Tree", which includes a link from one block to the "next", thereby making the page turning efficient.
Another interpretation of tracking... "Transactions" and "ACID". When any query (read or write) touches a table, there is some form of lock applied to each row touched. For SELECT the lock is rather light-weight. For writes it can cause delays or even a "deadlock". The locks are unavoidable, but sometimes actions can be taken to minimize their impact.
Logically (but not actually), a "snapshot" of all rows in all tables is taken at the instant you start a transaction. This allows you to see a consistent view of everything, even if other connections are changing rows. The underlying mechanism is very lightweight on reading, but heavier for writes. Writes will make a copy of the row so that each connection sees the snapshot that it 'should' see. Also, the copy allows for ROLLBACK and recovery from a crash (eg power failure).
(Transaction "isolation" mode allows some control over the snapshot.) To get the optimal performance for your case, do nothing special.
Here's a way to conceptualize the handling of transactions: Each row has a timestamp associated with it. Each query saves the start time of the query. The query can "see" only rows that are older than that start time. A subsequent write in another connection will be creating copies of rows with a later timestamp, hence not visible to the SELECT. Hence, the onus is on writes to do extra work; reads are cheap.
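That conceptual model can be written down as a small Python toy. It mirrors the timestamp analogy only; InnoDB's actual implementation relies on transaction IDs and undo logs rather than timestamps, and every name below is invented for illustration.

```python
import itertools

# Stand-in for the clock: each call to next() yields a later "time".
_clock = itertools.count(1)


class VersionedRow:
    """A row that keeps every version ever written, oldest first."""

    def __init__(self, value):
        self.versions = [(next(_clock), value)]  # (created_at, value)

    def write(self, value):
        # A write never overwrites in place: it appends a newer copy.
        self.versions.append((next(_clock), value))

    def read(self, snapshot_at):
        # A reader sees the newest version created at or before its start time.
        visible = [v for created_at, v in self.versions if created_at <= snapshot_at]
        return visible[-1] if visible else None


def start_query():
    """Record the start time that a query will read 'as of'."""
    return next(_clock)
```

A query that started before a write keeps seeing the old value, while a query started afterwards sees the new one. The reader does no extra bookkeeping; the writer pays for the copy.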

Can I INSERT into table while UPDATING multiple different rows with MariaDB or MySQL?

I am creating a custom analytics system and currently in the database designing process. I'm planning to use MariaDB with the InnoDB engine to be able to handle big loads.
The data I'm expecting could be around 500k clicks/day. I will need to insert these rows into the database, which means that I'll have around 5.8 inserts/sec on average. However, at the same time, I want to record if someone visited a page associated with that click. (basically to record funnels)
So what I'm planning to do is to create additional columns and search for the ID of the specific row then update that column with the exact time of the visit.
My first question: is this generally a recommended approach to design the database like that? If not, how else is it worth to design the database?
My only concern is that while updating rows the Table will be locked, and can't do inserts, therefore slowing down the user experience.
My second question: is this something I should worry about, that the table gets locked while updating, and thus slowing down inserts? Does it hurt performance?
InnoDB doesn't lock the table for insert if you're performing the update. Your users won't experience any weird hanging.
It's an MVCC compliant engine, designed to handle concurrent access to underlying tables.
You can control the engine's behavior by choosing an appropriate isolation level, however the default (REPEATABLE READ) is excellent and does the job more than well.
If a table is being modified by multiple users (not users that connect to your site, but connections established to MySQL from a scripting language or some other service) and there are many inserts/updates/deletes, MySQL can throw an error saying a deadlock occurred.
MySQL reports a deadlock as an error (1213), but it is best treated as a signal to retry rather than as a failure: two or more transactions each held a lock the other needed (for example, while updating the same rows at the same time), so InnoDB rolled one of them back and let the other proceed. It's an indication you should repeat the transaction.
I'm suggesting that you take care of all possible scenarios in the language of your choice when it comes to handling MySQL that's under heavier I/O.
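A minimal retry wrapper might look like the sketch below. `DeadlockError` is a hypothetical exception standing in for whatever your driver raises for MySQL error 1213, and the attempt count and delays are arbitrary starting points.

```python
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for a driver error carrying MySQL error 1213."""


def run_with_retry(transaction, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call transaction() and repeat it if it is chosen as a deadlock victim."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DeadlockError:
            if attempt == max_attempts:
                raise  # give up and let the caller see the error
            # Back off a little longer each time, with jitter so that two
            # colliding clients do not retry in lockstep.
            sleep(base_delay * attempt * (1 + random.random()))
```

The callable should contain the whole transaction, since a deadlock rolls back everything the victim had done, not just the last statement.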
~6 inserts a second isn't a lot. Make sure you're allowing MySQL to access sufficient system resources. For InnoDB, check the value of innodb_buffer_pool_size, or search a bit to see what it is and how to use it to make your database run fast.
Good luck!
At a mere 5.8/second, there won't be much problem.
I do, however, suggest vertical partitioning for "Likes", "Upvotes", "Clicks", and similar things. These tend to have a lot of UPDATEs of random single rows, and may interfere with other activity.
That is, have a separate table with (perhaps) just 2 columns:
The id of the item being Liked/Clicked/etc.
A counter.
It is simple enough (and fast enough) to JOIN via that id when you want to display info including the counter.
As already pointed out, the row is locked, not the table.
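The two-column counter table and the JOIN can be sketched like this. Python's built-in sqlite3 module stands in for MariaDB so the example runs on its own, and the `items` and `item_clicks` tables are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Main table: holds the descriptive data and is rarely updated.
    CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT);
    -- Parallel table: just the id and the counter.
    CREATE TABLE item_clicks (item_id INTEGER PRIMARY KEY, clicks INTEGER NOT NULL);
    INSERT INTO items VALUES (1, 'first'), (2, 'second');
    INSERT INTO item_clicks VALUES (1, 0), (2, 0);
""")


def record_click(item_id):
    # The frequent single-row UPDATE touches only the small counter table.
    conn.execute(
        "UPDATE item_clicks SET clicks = clicks + 1 WHERE item_id = ?",
        (item_id,),
    )


def item_with_clicks(item_id):
    # JOIN via the id when the counter needs to be displayed.
    return conn.execute(
        "SELECT i.title, c.clicks FROM items AS i "
        "JOIN item_clicks AS c ON c.item_id = i.id WHERE i.id = ?",
        (item_id,),
    ).fetchone()
```

Keeping the counter out of the main table means the hot updates rewrite a narrow row and leave the wider rows alone.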

How to improve InnoDB's SELECT performance while INSERTing

We recently switched our tables to use InnoDB (from MyISAM) specifically so we could take advantage of the ability to make updates to our database while still allowing SELECT queries to occur (i.e. by not locking the entire table for each INSERT)
We have a cycle that runs weekly and INSERTS approximately 100 million rows using "INSERT INTO ... ON DUPLICATE KEY UPDATE ..."
We are fairly pleased with the current update performance of around 2000 insert/updates per second.
However, while this process is running, we have observed that regular queries take very long.
For example, this took about 5 minutes to execute:
SELECT itemid FROM items WHERE itemid = 950768
(When the INSERTs are not happening, the above query takes several milliseconds.)
Is there any way to force SELECT queries to take a higher priority? Otherwise, are there any parameters that I could change in the MySQL configuration that would improve the performance?
We would ideally perform these updates when traffic is low, but anything more than a couple seconds per SELECT query would seem to defeat the purpose of being able to simultaneously update and read from the database. I am looking for any suggestions.
We are using Amazon's RDS as our MySQL server.
Thanks!
I imagine you have already solved this nearly a year later :) but I thought I would chime in. According to MySQL's documentation on internal locking (as opposed to explicit, user-initiated locking):
Table updates are given higher priority than table retrievals. Therefore, when a lock is released, the lock is made available to the requests in the write lock queue and then to the requests in the read lock queue. This ensures that updates to a table are not “starved” even if there is heavy SELECT activity for the table. However, if you have many updates for a table, SELECT statements wait until there are no more updates.
So it sounds like your SELECT is getting queued up until your inserts/updates finish (or at least there's a pause.) Information on altering that priority can be found on MySQL's Table Locking Issues page.

Updating large quantities of data in a production database

I have a large quantity of data in a production database that I want to update with batches of data while the data in the table is still available for end-user use. The updates could be insertions of new rows or updates of existing rows. The specific table is approximately 50M rows, and the updates will be between 100k and 1M rows per "batch". What I would like to do is insert/replace with a low priority. In other words, I want the database to slowly do the batch import without impacting the performance of other queries that are hitting the same disk spindles concurrently. To complicate this, the data being updated is heavily indexed: 8 B-tree indexes across multiple columns facilitate various lookups, which adds quite a bit of overhead to the import.
I've thought about batching the inserts down into 1-2k record blocks, then having the external script that loads the data just pause for a couple of seconds between each insert, but that's really kind of hokey IMHO. Plus, during a 1M record batch, I really don't want to add 500-1000 2-second pauses, for 20-40 minutes of extra load time, if it's not needed. Anyone have ideas on a better way to do this?
I've dealt with a similar scenario using InnoDB and hundreds of millions of rows. Batching with a throttling mechanism is the way to go if you want to minimize risk to end users. I'd experiment with different pause times and see what works for you. With small batches you have the benefit that you can adjust accordingly. You might find that you don't need any pause if you run this all sequentially. If your end users are using more connections then they'll naturally get more resources.
If you're using MyISAM there's a LOW_PRIORITY option for UPDATE. If you're using InnoDB with replication be sure to check that it's not getting too far behind because of the extra load. Apparently it runs in a single thread and that turned out to be the bottleneck for us. Consequently we programmed our throttling mechanism to just check how far behind replication was and pause as needed.
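A throttle of that kind can be sketched in a few lines of Python. The `apply_batch` and `replication_lag` callables are hypothetical hooks: the first would run one batch of inserts/updates, and the second might report something like the replica's Seconds_Behind_Master value. The thresholds are placeholders to tune by experiment.

```python
import time


def load_in_batches(rows, apply_batch, replication_lag, batch_size=1000,
                    max_lag_seconds=5, pause_seconds=2, sleep=time.sleep):
    """Apply rows in small batches, pausing only while the replica is behind.

    Returns the number of pauses taken, which is useful when tuning.
    """
    pauses = 0
    for start in range(0, len(rows), batch_size):
        # Wait for replication to catch up before adding more load.
        while replication_lag() > max_lag_seconds:
            sleep(pause_seconds)
            pauses += 1
        apply_batch(rows[start:start + batch_size])
    return pauses
```

Because the pause happens only when the lag check fails, a quiet server gets the whole load with no artificial delay, which addresses the worry about adding fixed pauses to every batch.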
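An INSERT DELAYED might be what you need, provided the table is MyISAM (InnoDB does not support it, and the feature was deprecated in MySQL 5.6). From the linked documentation: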
Each time that delayed_insert_limit rows are written, the handler checks whether any SELECT statements are still pending. If so, it permits these to execute before continuing.
Check this link: http://dev.mysql.com/doc/refman/5.0/en/server-status-variables.html What I would do is write a script that executes your batch updates only when MySQL is showing Threads_running or Threads_connected under a certain number. Hopefully you have some sort of test server where you can determine what a good threshold might be for either of those status variables. There are plenty of other server status variables to look at in there as well. Maybe control the executions by the Innodb_data_pending_writes number? Let us know what works for you, it's an interesting question!

How do you optimize a MySQL database for writes?

I have a write intensive application running on EC2. Any thoughts on how to optimize it to be able to make several thousands concurrent writes on the MySQL DB?
Write scaling is a hard problem. Perhaps the secret to write scaling is in read scaling. That is, cache reads as much as possible, so that the writes get all the throughput.
Having said that, there are a bunch of things one can do:
1) Start with the data model. Design a data model so that you never delete from or update a table. The only operation is an insert. Use Effective Date, Effective Sequence and Effective Status to implement Insert, Update and Delete operations using just the Insert command. This concept is called the Append Only model. Check out RethinkDB.
2) Set the Concurrent Insert flag to 1. This makes sure that the tables keep inserting while reads are in progress.
3) When you have only Inserts at the tail, you may not need row-level locks. So, use MyISAM (this is not to take anything away from InnoDB, which I will come to later).
4) If all this does not do much, create a replica table using the MEMORY engine. If you have a table called MY_DATA, create an in-memory table called MY_DATA_MEM.
5) Redirect all Inserts to the MEM table. Create a View that UNIONS both tables and use that view as your Read Source.
6) Write a daemon that periodically moves MEM contents to the Main table and deletes from the Mem table. It may be ideal to implement the MOVE operation as a Delete trigger on the Mem table (I am hoping triggers are possible on Memory Engine, not entirely sure).
7) Do not do any Deletes or Updates on the MEM table (they are slow). Also pay attention to the cardinality of the keys in your table (HASH vs B-Tree: low cardinality -> Hash, high cardinality -> B-Tree).
8) If all of the above still does not work, ditch JDBC/ODBC. Move to InnoDB and use the HandlerSocket interface to do the direct inserts (Google for Yoshinori-San MySQL).
I have not used HS myself, but the benchmarks are impressive. There is even a Java HS project on Google Code.
Hope that helps..
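The Append Only model from step 1 can be sketched as follows. Python's built-in sqlite3 module stands in for MySQL so the example runs on its own, the column names are hypothetical, and Effective Date is left out to keep the sketch short (sequence and status are enough to show the idea).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE my_data (
        item_id    INTEGER NOT NULL,
        eff_seq    INTEGER NOT NULL,  -- increases with every change to the item
        eff_status TEXT    NOT NULL,  -- 'A' = active, 'D' = logically deleted
        payload    TEXT,
        PRIMARY KEY (item_id, eff_seq)
    )
""")


def append_version(item_id, status, payload=None):
    """Insert, update and delete are all expressed as one more inserted row."""
    next_seq = conn.execute(
        "SELECT COALESCE(MAX(eff_seq), 0) + 1 FROM my_data WHERE item_id = ?",
        (item_id,),
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO my_data VALUES (?, ?, ?, ?)",
        (item_id, next_seq, status, payload),
    )


def current(item_id):
    """Return the latest payload, or None if the item is absent or deleted."""
    row = conn.execute(
        "SELECT eff_status, payload FROM my_data "
        "WHERE item_id = ? ORDER BY eff_seq DESC LIMIT 1",
        (item_id,),
    ).fetchone()
    return row[1] if row and row[0] == "A" else None
```

Readers pay a little more, since they must find the newest row per item, in exchange for the table only ever being appended to. With concurrent writers, the sequence number would need to come from something safer than MAX + 1.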