Count rows or store value MySQL - mysql

I want to see how many different users have connected to a website, but I'm not sure whether I should count the rows in the users table or store a separate value that is incremented each time a user registers.
What are the benefits of using a stored value as opposed to counting rows, in terms of speed, reliability, etc.?
Thanks.

If the table is not huge (you don't have many users), you should just use COUNT(*).
If you have a huge table, you are better off using another table to store the count of users, or, if an approximate row count is sufficient, you can also use SHOW TABLE STATUS (ref).
There is some useful information here: http://dev.mysql.com/doc/refman/5.5/en/innodb-restrictions.html
InnoDB does not keep an internal count of rows in a table because concurrent transactions might “see” different numbers of rows at the same time. To process a SELECT COUNT(*) FROM t statement, InnoDB scans an index of the table, which takes some time if the index is not entirely in the buffer pool. If your table does not change often, using the MySQL query cache is a good solution. To get a fast count, you have to use a counter table you create yourself and let your application update it according to the inserts and deletes it does.[...] See Section 14.3.14.1, “InnoDB Performance Tuning Tips”.
I hope @GordonLinoff or some other SQL guru can give you more information about when a table is considered big enough.
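For reference, a minimal sketch of the counter-table approach the quoted documentation describes; the user_count table and its columns are illustrative assumptions, not something from the question:

-- Illustrative counter table (names assumed); the application keeps it in sync.
CREATE TABLE user_count (
  id  TINYINT NOT NULL PRIMARY KEY,   -- single-row table
  cnt BIGINT UNSIGNED NOT NULL
);
INSERT INTO user_count (id, cnt)
SELECT 1, COUNT(*) FROM users;        -- seed it once from the real table

-- On each registration, in the same transaction as the INSERT into users:
UPDATE user_count SET cnt = cnt + 1 WHERE id = 1;

-- Reading the count is then a single-row lookup:
SELECT cnt FROM user_count WHERE id = 1;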

Neither of them.
I would just update the timestamp for the user each time he clicks something.
Then you can fetch all these timestamps and check the time of each user's last activity.
If you just increment a value, you have the problem that you do not know whether the user has already disconnected.
With this approach you have a complete overview of a user's last activity.
That's how it is handled in most software, for example forums.
It works for both guests (normal visitors) and registered users if you log the client's IP.
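A minimal sketch of that last-activity idea, assuming a hypothetical user_activity table keyed by user id (or client IP for guests):

-- Hypothetical table; names are assumptions for illustration only.
CREATE TABLE user_activity (
  visitor_key VARCHAR(64) NOT NULL PRIMARY KEY,  -- user id, or client IP for guests
  last_seen   TIMESTAMP   NOT NULL
);

-- On every click/request:
INSERT INTO user_activity (visitor_key, last_seen)
VALUES ('user:42', NOW())
ON DUPLICATE KEY UPDATE last_seen = NOW();

-- "Currently connected" visitors, e.g. anyone active in the last 15 minutes:
SELECT COUNT(*) FROM user_activity
WHERE last_seen >= NOW() - INTERVAL 15 MINUTE;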

Related

Optimizing COUNT() on MariaDB for a Statistics Table

I've read a number of posts here and elsewhere about people wrestling to improve the performance of the MySQL/MariaDB COUNT function, but I haven't found a solution that quite fits what I am trying to do. I'm trying to produce a live updating list of read counts for a list of articles. Each time a visitor visits a page, a log table in the SQL database records the usual access log-type data (IP, browser, etc.). Of particular interest, I record the user's ID (uid) and I process the user agent tag to classify known spiders (uaType). The article itself is identified by the "paid" column. The goal is to produce a statistic that doesn't count the poster's own views of the page and doesn't include known spiders, either.
Here's the query I have:
"COUNT(*) FROM uninet_log WHERE paid='1942' AND uid != '1' AND uaType != 'Spider'"
This works nicely enough, but very slowly (approximately 1 sec.) when querying against a database with 4.2 million log entries. If I run the query several times during a particular run, it increases the runtime by about another second for each query. I know I could group by paid and then run a single query, but even then (which would require some reworking of my code, but could be done) I feel like 1 second for the query is still really slow and I'm worried about the implications when the server is under a load.
I've tried switching out COUNT(*) for COUNT(1) or COUNT(id) but that doesn't seem to make a difference.
Does anyone have a suggestion on how I might create a better, faster query that would accomplish this same goal? I've thought about having a background process regularly calculate the statistics and cache them, but I'd love to stick to live updating information if possible.
Thanks,
Tim
Add a boolean "summarized" column to your statistics table and make it part of a multicolumn index with paid.
Then have a background process that produces/updates rows containing the read count in a summary table (by article) and marks the statistics table rows as summarized. (Though the summary table could just be your article table.)
Then your live query reports the sum of the already summarized results and the as-yet-unsummarized statistics rows.
This also allows you to expire old statistics table rows without losing your read counts.
(All this assumes you already have an index on paid; if you don't, definitely add one and that will likely solve your problem for now, though in the long run likely you still want to be able to delete old statistics records.)
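A rough sketch of that scheme; apart from paid, uid, uaType and the summarized flag, every name here is an assumption, and the batch job is simplified (a production version should bound each batch, e.g. by the maximum log id it has seen, so rows inserted mid-run are not skipped):

-- Flag column plus the multicolumn index suggested above.
ALTER TABLE uninet_log
  ADD COLUMN summarized TINYINT(1) NOT NULL DEFAULT 0,
  ADD INDEX idx_paid_summarized (paid, summarized);

-- Assumed summary table (could instead be columns on your article table).
CREATE TABLE article_read_counts (
  paid INT NOT NULL PRIMARY KEY,
  read_count BIGINT UNSIGNED NOT NULL DEFAULT 0
);

-- Background job: fold unsummarized rows into the summary, then mark them.
INSERT INTO article_read_counts (paid, read_count)
SELECT paid, COUNT(*)
FROM uninet_log
WHERE summarized = 0 AND uid != '1' AND uaType != 'Spider'
GROUP BY paid
ON DUPLICATE KEY UPDATE read_count = read_count + VALUES(read_count);

UPDATE uninet_log SET summarized = 1 WHERE summarized = 0;

-- Live query: already-summarized total plus the not-yet-summarized tail.
SELECT COALESCE((SELECT read_count FROM article_read_counts WHERE paid = '1942'), 0)
     + (SELECT COUNT(*) FROM uninet_log
        WHERE paid = '1942' AND summarized = 0
          AND uid != '1' AND uaType != 'Spider') AS reads;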

How to speed up InnoDB count(*) query?

There are a number of similar questions on here, but a lot of the answers say to force the use of an index and that doesn't seem to speed anything up for me.
I want to show a "live" counter on my website showing the number of rows in a table, kind of like how some websites show the number of registered users, or some other statistic, in "real time" (i.e. updated frequently using AJAX or websockets).
My table has about 5M rows. It's growing fairly quickly and there is a high volume of inserts and deletes on it. Running
select count(*) from my_table
takes 1.367 seconds, which is unacceptable because I need my application to get the new row count about once per second.
I tried what many of the answers on here suggest and changed the query to:
select count(*) from my_table use index(my_index)
where my_index is a normal BTREE index on a BIGINT field. But the time actually increased to 1.414 seconds.
Why doesn't using an index speed up the query as many answers on here said it would?
Another option some answers suggest is to put a trigger on the table that increments a column in another table. So I could create a stats table and whenever a row is inserted or deleted in my_table have a trigger increment or decrement a column in the stats table. Is this the only other option, since using an index doesn't seem to work?
EDIT: Here's a perfect example of the type of thing I'm trying to accomplish: https://www.freelancer.com. Scroll to the bottom of the page and you'll see:
Those numbers update every second or so.
It takes time to read 5 million records and count them -- whether in an index or in the raw data form.
If a "fast-and-dirty" solution is acceptable, you can use metadata:
SELECT table_rows
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = <whatever> and TABLE_NAME = <whatever2>;
Note that this can get out-of-sync.
Another possibility is to partition the table into smaller chunks. One advantage is that if the inserts and deletes tend to be to one partition, you can just count that and use metadata for the other partitions.
A trigger may or may not help in this situation, depending on the insert/delete load. If you are doing multiple inserts per minute, then a trigger is a no-brainer -- a fine solution. If you are doing dozens or hundreds of changes per second, then the overhead of the trigger might slow down the server.
If your system is so busy that the counting is having too much impact, then probably the INSERTing/DELETEing is also having impact. One way to improve INSERT/DELETE is to do them in 'batches' instead of one at a time.
Gather the INSERTs, preferably in the app, but optionally in a 'staging' table. Then, once a second (or whatever) copy them into the real table using an INSERT..SELECT, or (if needed) INSERT..ON DUPLICATE KEY UPDATE. DELETEs can go into the same table (with a flag) or a separate table.
The COUNT(*) can be done at the end of the batch. Or it could be dead reckoned (at much lower cost) by knowing what the count was, then adjusting by what the staging table(s) will change it by.
This is a major upheaval to your app code, so don't embark on it unless you have spikes of, say, >100 INSERTs/DELETEs per second. (A steady 100 INSERTs/sec = 3 billion rows per year.)
For more details on "staging table", see http://mysql.rjweb.org/doc.php/staging_table. Note that that blog advocates flip-flopping between a pair of staging tables, so as to minimize locks on them, and to allow multiple clients to coexist.
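A bare-bones sketch of the staging idea with a dead-reckoned count; the table names (my_table_staging, row_counter) are assumptions, and the linked blog's flip-flop scheme is what you would actually use to cope with concurrent writers:

CREATE TABLE my_table_staging LIKE my_table;   -- clients insert here only

CREATE TABLE row_counter (
  id  TINYINT NOT NULL PRIMARY KEY,
  cnt BIGINT  NOT NULL
);

-- Once a second (or whatever), a single flusher job moves the batch:
START TRANSACTION;
INSERT INTO my_table SELECT * FROM my_table_staging;
UPDATE row_counter
   SET cnt = cnt + (SELECT COUNT(*) FROM my_table_staging)  -- dead-reckon the count
 WHERE id = 1;
DELETE FROM my_table_staging;
COMMIT;
-- (This simplification ignores inserts arriving during the flush; flip-flopping
--  between two staging tables, as the blog above describes, avoids losing them.)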
Have a job running in the background that does the following; then use its table for getting the count:
Loop:
INSERT INTO Counter (ct_my_table)
SELECT COUNT(*) FROM my_table;
sleep 1 second
end loop
At worst, it will be a couple of seconds out of date. Also note that INSERTs and DELETEs interfere with (read: slow down) the SELECT COUNT(*), hence the "sleep".
Have you noticed that some UIs say "About 120,000 thingies"? They are using even cruder estimations. But it is usually good enough for the users.
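The same loop can live inside MySQL itself via the event scheduler; a sketch, assuming event_scheduler is enabled and using a single-row counter table that is updated in place rather than appended to:

-- Assumed single-row counter table.
CREATE TABLE Counter (
  id          TINYINT NOT NULL PRIMARY KEY,
  ct_my_table BIGINT  NOT NULL,
  updated_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);
INSERT INTO Counter (id, ct_my_table) VALUES (1, 0);

SET GLOBAL event_scheduler = ON;   -- required once, with sufficient privileges

CREATE EVENT refresh_row_count
ON SCHEDULE EVERY 2 SECOND
DO
  UPDATE Counter
     SET ct_my_table = (SELECT COUNT(*) FROM my_table)
   WHERE id = 1;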
Take the inaccurate value from information_schema, as Gordon Linoff suggested.
Another inaccurate source of the row count is SELECT MAX(id) - MIN(id).
Create a table my_table_count where you store the row count of my_table and keep it up to date with triggers (sketched below).
In many cases you don't need an accurate value. Who cares if you show 36,400 users instead of the accurate 36,454?
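A minimal sketch of that trigger option (my_table_count and its columns are assumptions); note the caveat in the earlier answer that under a heavy write load the trigger overhead, plus this single hot counter row, can slow the server down:

CREATE TABLE my_table_count (
  id  TINYINT NOT NULL PRIMARY KEY,
  cnt BIGINT  NOT NULL
);
INSERT INTO my_table_count (id, cnt)
SELECT 1, COUNT(*) FROM my_table;   -- seed it once

DELIMITER //
CREATE TRIGGER my_table_ai AFTER INSERT ON my_table
FOR EACH ROW
  UPDATE my_table_count SET cnt = cnt + 1 WHERE id = 1;
//
CREATE TRIGGER my_table_ad AFTER DELETE ON my_table
FOR EACH ROW
  UPDATE my_table_count SET cnt = cnt - 1 WHERE id = 1;
//
DELIMITER ;

-- The "live" counter is then a primary-key lookup:
SELECT cnt FROM my_table_count WHERE id = 1;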

Finding records not updated in last k days efficiently

I have a table which contains records of the last n days. The records in this table are around 100 million. I need to find the records which have not been updated in the last k days.
My solution to this problem is
Partition the table on k1. Index the timestamp column. Now, instead of updating the timestamp (so that the index is not rebuilt), perform remove + insert. By doing this, I think the query to find the records not updated in the last k days will be fast.
Is there any other better way to optimize these operations?
For example,
Suppose we have many users and each user can use different products. A user can also start using (become the owner of) new products at any time. If a user does not use a product for n days, his ownership expires. Now we need to find all the products for a user which have not been used by him in the last k days. The number of users is of the order of 10,000 and the number of products from which they can choose is of the order of 100,000.
I modeled this problem using a table with schema (user_id, product_id, last_used). product_id is the id of the product the user is using. Whenever a user uses the product, last_used is updated. Also, a user's ownership of a product expires if it is not used for n days. I partitioned the table on user_id and indexed last_used (timestamp). Also, instead of updating I performed delete + create. I did the partitioning and indexing to optimize the query that fetches the records not updated in the last k days for a user.
Is there a better way to solve this problem?
You have said you need to "find" and, I think, "expire" the records belonging to a particular user after a certain number of days.
Look, this can be done even in a large table with good indexing without too much trouble. I promise you, partitioning the table will be a lot of trouble. You have asserted that it's too expensive in your application to carry an index on your last_used column because of updates. But, considering the initial and ongoing expense of maintaining a partitioned table, I strongly suggest you prove that assertion first. You may be wrong about the cost of maintaining indexes.
(Updating one row with a column that's indexed doesn't rebuild the index, it modifies it. The MySQL storage engine developers have optimized that use case, I promise you.)
As I am sure you know, this query will retrieve old records for a particular user.
SELECT product_id
FROM tbl
WHERE user_id = <<<chosen user>>>
AND last_used <= CURRENT_DATE() - INTERVAL <<<k>>> DAY
will yield your list of products. This will work very efficiently indeed if you have a compound covering index on (user_id, last_used, product_id). If you don't know what a compound covering index is, you really should find out using your favorite search engine. This one will random-access the particular user and then do a range scan on the last_used date. It will then return the product ids from the index.
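For example, such an index could be added like this (the index name is arbitrary):

ALTER TABLE tbl
  ADD INDEX idx_user_lastused_product (user_id, last_used, product_id);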
If you want to get rid of all old records, I suggest you write a host program that repeats this query in a loop until you find that it has processed zero rows. Run this at an off-peak time in your application. The LIMIT clause will prevent each individual query from taking too long and interfering with other uses of the table. For the sake of speed on this query, you'll need an index on last_used.
DELETE FROM tbl
WHERE last_used <= CURRENT_DATE() - INTERVAL <<<k>>> DAY
LIMIT 500
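If you prefer to keep the loop inside MySQL rather than in a host program, a stored-procedure sketch using ROW_COUNT() would look roughly like this (the procedure name is made up):

DELIMITER //
CREATE PROCEDURE purge_old_rows(IN k INT)
BEGIN
  REPEAT
    DELETE FROM tbl
    WHERE last_used <= CURRENT_DATE() - INTERVAL k DAY
    LIMIT 500;
  UNTIL ROW_COUNT() = 0 END REPEAT;  -- stop once a batch deletes nothing
END //
DELIMITER ;

CALL purge_old_rows(30);   -- e.g. expire anything not used in 30 days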
I hope this helps. It comes from someone who's made the costly mistake of trying to partition something that didn't need partitioning.
MySQL doesn't "rebuild" indexes (not completely) when you modify an indexed value. In fact, it doesn't even reorder the records. It just moves the record to the proper 16KB page.
Within a page, the records are in the order they were added. If you inserted in order, then they're in order, otherwise, they're not.
So, when they say that MySQL's clustered indexes are in physical order, it's only true down to the page level, but not within the page.
Clustered indexes still get the benefit that the page data is on the same page as the index, so no further lookup is needed if the row data is small enough to fit in the pages. Reading is faster, but restructuring is slower because you have to move the data with the index. Secondary indexes are much faster to update, but to actually retrieve the data (with the exception of covering indexes), a further lookup must be made to retrieve the actual data via the primary key that the secondary index yields.
Example
Page 1 might hold user records for people whose last name start with A through B. Page 2 might hold names C through D, etc. If Bob renames himself Chuck, his record just gets copied over from page 1 to page 2. His record will always be put at the end of page 2. The keys are kept sorted, but not the data they point to.
If the page becomes full, MySQL will split the page. In this case, assuming even distribution between C and D, page 1 will be A through B, page 2 will be C, and page 3 will be D.
When a record is deleted, the space is compacted, and if the page becomes less than half full, MySQL will merge neighboring pages and possibly free up a page in between.
All of these changes are buffered, and MySQL does the actual writes when it's not busy.
The example works the same for both clustered (primary) and secondary indexes, but remember that with a clustered index, the keys point to the actual table data, whereas with a secondary index, the keys point to a value equal to the primary key.
Summary
After a while, page splitting caused by random inserts will cause the pages to become noncontiguous on disk. The table will become "fragmented". Optimizing the table (rebuilding the table/index) fixes this.
There would be no benefit in deleting then reinserting the record. In fact, you'll just be adding transactional overhead. Let MySQL handle updating the index for you.
Now that you understand indexes a bit more, perhaps you can make a better decision of how to optimize your database.

How to deal with a rapidly growing mysql table

I have two tables in a MySQL database:
tbl_comments
tbl_votes
When a user clicks a Like or Dislike button under a comment, a new row is inserted into tbl_votes, with comment_id, user_id and vote_type. That means if 100 users click the Like or Dislike button on 100 comments per day, 10,000 rows are inserted into tbl_votes per day. So, with an increasing number of users and votes, tbl_votes will grow rapidly, and once there are, say, 100,000,000 rows in tbl_votes, that will also affect performance and slow down the SQL queries.
How can I deal with this, or is there some other solution?
This is a perfectly fine solution.
As long as you have the indexes set correctly, it's okay (an index on the primary key, and on the post id).
Take for example Stack Overflow: every post, reply and comment has its own voting system, up or down, and remembers who voted; they have 200 million+ messages and replies, each with their own votes, and it still responds quickly.
As long as the indexes are set correctly, it should perform just fine. I might suggest using a bigint for the primary key though...
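A sketch of what such a table and its indexes could look like; the question only names the columns, so the types and index names here are assumptions:

CREATE TABLE tbl_votes (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  comment_id BIGINT UNSIGNED NOT NULL,
  user_id    BIGINT UNSIGNED NOT NULL,
  vote_type  TINYINT NOT NULL,                        -- e.g. 1 = like, -1 = dislike
  UNIQUE KEY uq_comment_user (comment_id, user_id),   -- one vote per user per comment
  KEY idx_user (user_id)
) ENGINE=InnoDB;

-- Vote totals for one comment stay fast thanks to uq_comment_user:
SELECT SUM(vote_type = 1) AS likes, SUM(vote_type = -1) AS dislikes
FROM tbl_votes
WHERE comment_id = 123;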
I would not worry about application performance with 1 billion rows on a machine that can keep the indexes in memory.
Performance depends on:
How many joins those queries do
How well your indexes are set up
How much RAM is in the machine
Speed and number of processors
Type and spindle speed of hard drives
Size of the row/amount of data returned in the query
Some conclusions:
If you go for an RDBMS:
It doesn't really matter how many rows you insert into the table if it's correctly indexed for selecting the total number of likes for a comment; of course, you'll need to keep the result cached.
Another way to get fast selects is to keep some vote data aggregated, so if a user votes up a comment there would be one insert/delete in your table and an update on another table like the one below:
comment_id
rate
So you'll select rate for any comment you need, and the total number of rows in the aggregated table would be much smaller.
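For instance, a minimal version of that aggregate table, kept in sync from application code (the table name is an assumption):

CREATE TABLE tbl_comment_rate (
  comment_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  rate       INT NOT NULL DEFAULT 0
);

-- When a user likes comment 123 (run alongside the insert into tbl_votes):
INSERT INTO tbl_comment_rate (comment_id, rate)
VALUES (123, 1)
ON DUPLICATE KEY UPDATE rate = rate + 1;

-- Reading the rating is a primary-key lookup:
SELECT rate FROM tbl_comment_rate WHERE comment_id = 123;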
Another good way is to use key-value storage. Suppose your key is comment_id, and the stored value holds the raw data:
user_id
vote_type
Depending on the NoSQL storage you select, the data may be stored entirely in memory and all select/update operations will be really fast.
It is not totally true that the size of the table does not affect the SELECT query.
For big tables, I would suggest TokuDB.
In both cases the problem will arise when you want to DELETE some data.
At that point, you have two choices: clustered keys, or begin to think about different architectures (horizontal sharding could be a good way).

MySQL performance on storing and returning ids

I have an API where I need to log which ids from a table were returned in a query, and, in another query, return results sorted based on that log of ids.
For example:
The tables products and users each have a PK called id. I can create a log table with one insert/update per returned id. I'm wondering about the performance and the design of this.
Essentially, for each returned ID in the API, I would:
INSERT INTO log (product_id, user_id, counter)
VALUES (#the_product_id, #the_user_id, 1)
ON DUPLICATE KEY UPDATE counter=counter+1;
... I'd either have an id column as the PK, or a combination of product_id and user_id (alternatively, having those two as a UNIQUE index).
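A sketch of the second variant, where the (product_id, user_id) pair is the key that ON DUPLICATE KEY UPDATE relies on (the column types are assumptions):

-- ON DUPLICATE KEY UPDATE needs (product_id, user_id) to be a PRIMARY or UNIQUE key.
CREATE TABLE log (
  product_id BIGINT UNSIGNED NOT NULL,
  user_id    BIGINT UNSIGNED NOT NULL,
  counter    INT UNSIGNED    NOT NULL DEFAULT 1,
  PRIMARY KEY (product_id, user_id)
) ENGINE=InnoDB;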
So the first issue is the performance of this (20 insert/updates and the effect on my select calls in the API) - is there a better/smarter way to log these IDs? Extracting from the webserver log?
Second is the performance of the select statements to include the logged data, to allow a user to see new products every request (a simplified example, I'd specify the table fields instead of * in real life):
SELECT p.*, IFNULL(
    (SELECT log.counter
     FROM log
     WHERE log.product_id = p.id
       AND log.user_id = #the_user_id)
    , 0) AS seen_by_user
FROM products AS p
ORDER BY seen_by_user ASC
In our database, the products table has millions of rows, and the users table is growing rapidly. Am I right in my thinking to do it this way, or are there better ways? How do I optimize the process, and are there tools I can use?
Callie, I just wanted to flag a different perspective from keymone's, and it doesn't fit into a comment, hence this answer.
Performance is sensitive to the infrastructure environment: are you running on a shared hosting service (SHS), a dedicated private virtual server (PVS) or dedicated server, or even a multiserver config with separate web and database servers?
What are your transaction rates and volumetrics? How many insert/updates are you doing per minute in your two peak trading hours of the day? What are your integrity requirements vis-à-vis the staleness of the log counters?
Yes, keymone's points are appropriate if you are doing, say, 3-10 updates per second, and as you move into this domain some form of collection process to batch up inserts and allow bulk insert becomes essential. But just as important here are the choice of storage engine, the transactional vs. batch split, and the choice of infrastructure architecture itself (in-server DB instance vs. separate DB server, master/slave configurations ...).
However, if you are averaging <1/sec then INSERT ON DUPLICATE KEY UPDATE has comparable performance to the equivalent UPDATE statements and it is the better approach if doing single row insert/updates as it ensures ACID integrity of the counts.
Any form of PHP process start-up will typically take ~100 ms on your web server, so even spinning one up to do an asynchronous update is just plain crazy, as the performance hit is significantly larger than the update itself.
Your SQL statement just doesn't square with your comment that you have "millions of rows" in the products table, as it will do a full scan of the products table, executing a correlated subquery on every row. I would have used a LEFT OUTER JOIN myself, with some sort of strong constraint to filter which product items are appropriate to this result set. However it runs, it will take materially longer to execute than any count update.
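A sketch of that LEFT OUTER JOIN form; the commented WHERE line is a placeholder for whatever strong constraint limits the product set:

SELECT p.*, IFNULL(l.counter, 0) AS seen_by_user
FROM products AS p
LEFT OUTER JOIN log AS l
       ON l.product_id = p.id
      AND l.user_id = #the_user_id
-- WHERE <strong constraint filtering which products belong in this result set>
ORDER BY seen_by_user ASC;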
You will have really bad performance with such an approach.
MySQL is not exactly well suited for logging, so here are a few steps you might take to achieve good performance:
Instead of maintaining the stats table on the fly (the ON DUPLICATE KEY UPDATE bit, which will absolutely destroy your performance), have a single raw logs table where you just do inserts, and every now and then (say, daily) run a script that aggregates data from that table into the real statistics table (a minimal sketch follows at the end of this answer).
Instead of having a single statistics table, have daily stats, monthly stats, etc. Aggregate jobs would then build up data from already aggregated data - awesome for performance. It also allows you to drop (or archive) stats granularity over time - who the hell cares about daily stats in 2 years' time? Or at least about "real-time" access to those stats.
Instead of inserting into a log table, use something like syslog-ng to gather this information into log files (much less load on the MySQL server[s]) and then aggregate the data into MySQL from the raw text files (many choices here; you can even import the raw files back into MySQL if your aggregation routine really needs some SQL flexibility).
that's about it
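To make the first point concrete, here is a minimal sketch of a raw log table plus a daily aggregation query; every name here is an assumption for illustration:

-- Raw log: insert-only, no updates.
CREATE TABLE view_log_raw (
  product_id BIGINT UNSIGNED NOT NULL,
  user_id    BIGINT UNSIGNED NOT NULL,
  viewed_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  KEY idx_viewed_at (viewed_at)
) ENGINE=InnoDB;

-- Daily statistics table.
CREATE TABLE view_stats_daily (
  stat_date  DATE NOT NULL,
  product_id BIGINT UNSIGNED NOT NULL,
  user_id    BIGINT UNSIGNED NOT NULL,
  views      INT UNSIGNED NOT NULL,
  PRIMARY KEY (stat_date, product_id, user_id)
);

-- Daily job: fold yesterday's raw rows into the stats table.
INSERT INTO view_stats_daily (stat_date, product_id, user_id, views)
SELECT DATE(viewed_at), product_id, user_id, COUNT(*)
FROM view_log_raw
WHERE viewed_at >= CURDATE() - INTERVAL 1 DAY
  AND viewed_at <  CURDATE()
GROUP BY DATE(viewed_at), product_id, user_id
ON DUPLICATE KEY UPDATE views = views + VALUES(views);

-- Then purge or archive the raw rows that have been aggregated.
DELETE FROM view_log_raw WHERE viewed_at < CURDATE();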