How to deal with a rapidly growing MySQL table - mysql

I have two tables in a MySQL database:
tbl_comments
tbl_votes
When a user clicks the Like or Dislike button under a comment, a new row is inserted into tbl_votes with comment_id, user_id and vote_type. That means if 100 users click Like or Dislike on 100 comments per day, 10,000 rows are inserted into tbl_votes every day. So, with a growing number of users and votes, tbl_votes will grow rapidly. And suppose there are 100,000,000 rows in tbl_votes: won't that affect performance and slow down the SQL queries?
How can I deal with this situation, or is there a better solution?

This is a perfectly fine solution.
As long as you have the indexes set up correctly, it's okay (an index on the primary key, and one on the post id).
Take Stack Overflow, for example: every post and reply has its own voting system, up or down, and remembers who voted. They have about 200 million+ posts and replies, each with their own votes, and it still responds quickly.
As long as the indexes are set correctly, it should perform just fine. I might suggest using a BIGINT for the primary key, though...
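For illustration, a rough sketch of such a schema with the indexes mentioned above (the column types and key names are assumptions, not from the question):
CREATE TABLE tbl_votes (
    vote_id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- BIGINT primary key, as suggested
    comment_id INT UNSIGNED NOT NULL,
    user_id    INT UNSIGNED NOT NULL,
    vote_type  TINYINT NOT NULL,                         -- e.g. 1 = like, -1 = dislike
    PRIMARY KEY (vote_id),
    UNIQUE KEY uq_comment_user (comment_id, user_id)     -- one vote per user per comment; its leftmost column also serves per-comment lookups
) ENGINE=InnoDB;
-- counting the likes for one comment is then an index range scan:
SELECT COUNT(*) FROM tbl_votes WHERE comment_id = ? AND vote_type = 1;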

I would not worry about application performance with 1 billion rows on a machine that can keep the indexes in memory.
Performance depends on:
How many joins those queries do
How well your indexes are set up
How much RAM is in the machine
Speed and number of processors
Type and spindle speed of hard drives
Size of the row / amount of data returned in the query

Some conclusions:
If you go for an RDBMS:
It doesn't really matter how many rows you insert into the table, as long as it's correctly indexed for selecting the total number of likes for a comment; of course you'll need to keep the result cached.
Another way to get fast reads is to keep some vote data aggregated, so when a user votes on a comment there is one insert/delete in your votes table and an update on another table like:
comment_id
rate
So you'll select the rate for any comment you need, and the total number of rows in the aggregated table will be much smaller.
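For illustration, a rough sketch of that aggregated table (the table name and the upsert pattern are assumptions, not from the question):
CREATE TABLE comment_rate (
    comment_id INT UNSIGNED NOT NULL,
    rate       INT NOT NULL DEFAULT 0,
    PRIMARY KEY (comment_id)
) ENGINE=InnoDB;

-- on each vote; the second placeholder is +1 for a like, -1 for a dislike:
INSERT INTO comment_rate (comment_id, rate) VALUES (?, ?)
ON DUPLICATE KEY UPDATE rate = rate + VALUES(rate);

-- reading the rate is a single primary-key lookup:
SELECT rate FROM comment_rate WHERE comment_id = ?;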
Another good way is to use a key-value store. Your key would be comment_id, and the stored value the raw data:
user_id
vote_type
Depending on the NoSQL store you select, the data may be kept entirely in memory and all select/update operations will be really fast.

It is not totally true that the size of the table does not affect SELECT queries.
For big tables, I would suggest TokuDB.
In both cases the problem will arise when you want to DELETE some data.
At that point, you have two choices: clustered keys, or start thinking about a different architecture (horizontal sharding could be a good way).

Related

Optimised way to store large key value kind of data

I am working on a database that has a table user with columns user_id and user_service_id. My application needs to fetch all users whose user_service_id is a particular value. Normally I would add an index to the user_service_id column and run a query like this:
select user_id from user where user_service_id = 2;
Since the cardinality of the user_service_id column is very low, around 3-4 distinct values, and the table has around 10M rows, the query will end up scanning almost the entire table.
I was wondering what the recommendation for such use cases is. Also, would it make more sense to move the data to another NoSQL datastore, since this doesn't seem to be an efficient use case for MySQL or any SQL datastore? I tried to search for this but couldn't find any recommendations. Can someone please help or provide the necessary references?
Thanks in advance.
That query needs this index, which is both "composite" and "covering":
INDEX(user_service_id, user_id) -- in this order
But what will you do with the millions of rows that you get? It sounds like they will choke the client, whether they come back fast or slow.
See my Index Cookbook
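For illustration, a hedged sketch of adding that covering index and verifying it is used (table and column names follow the question; the index name is an assumption):
ALTER TABLE user ADD INDEX idx_service_user (user_service_id, user_id);
-- "Using index" in the Extra column confirms the query is served from the index alone:
EXPLAIN SELECT user_id FROM user WHERE user_service_id = 2;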
"very dynamic" -- Not a problem.
"cache" -- the dynamic nature defeats caching.
"cardinality" -- not important, except to point out that there will be millions of rows.
"millions of rows" -- that takes time to deliver to the client. The number of rows delivered is the biggest factor in cost.
"select entire table, then filter in client" -- That will be even slower! (See "millions of rows".)

MySQL Performance: Which of the queries will take more time?

I have two tables:
1. user table with around 10 million data
columns: token_type, cust_id(Primary)
2. pm_tmp table with 200k data
columns: id(Primary | AutoIncrement), user_id
user_id is foreign key for cust_id
1st Approach/Query:
update user set token_type='PRIME'
where cust_id in (select user_id from pm_tmp where id between 1 AND 60000);
2nd Approach/Query: Here we will run the below query individually for each of the 60,000 cust_id values:
update user set token_type='PRIME' where cust_id='1111110';
Theoretically, time will be less for the first query, as it involves fewer commits and in turn fewer index rebuilds. But I would recommend going with the second option, since it is more controlled, will appear to take less time, and you can even think about executing two separate sets in parallel.
Note: The first query will need sufficient memory provisioned for the MySQL buffers to execute quickly. The second approach, being a set of independent single-transaction queries, will need comparatively less memory and hence will appear faster when executed in limited-memory environments.
Well, you may rewrite the first query this way too:
update user u, pm_tmp p set u.token_type='PRIME' where u.cust_id = p.user_id and p.id between 1 and 60000;
Some versions of MySQL have trouble optimizing IN. I would recommend:
update user u join
pm_tmp pt
on u.cust_id = pt.user_id and pt.id between 1 AND 60000
set u.token_type = 'PRIME' ;
(Note: This assumes that cust_id is not repeated in pm_temp. If that is possible, you will want a select distinct subquery.)
Your second version would normally be considerably slower, because it requires executing thousands of queries instead of one. One consideration might be the update. Perhaps the logging and locking get more complicated as the number of updates increases. I don't actually know enough about MySQL internals to know if this would have a significant impact on performance.
IN ( SELECT ... ) is poorly optimized. (I can't provide specifics because both UPDATE and IN have been better optimized in some recent version(s) of MySQL.) Suffice it to say "avoid IN ( SELECT ... )".
Your first sentence should say "rows" instead of "columns".
Back to the rest of the question. 60K is too big of a chunk. I recommend only 1000. Aside from that, Gordon's Answer is probably the best.
But... You did not use OFFSET; Do not be tempted to use it; it will kill performance as you go farther and farther into the table.
Another thing. COMMIT after each chunk. Else you build up a huge undo log; this adds to the cost. (And is a reason why 1K is possibly faster than 60K.)
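A rough sketch of that chunking, walking pm_tmp by primary-key ranges and committing each chunk (the 1000-row chunk size is just the suggestion above; the loop is driven by the application or a stored procedure):
-- repeat with the next id range (1001-2000, 2001-3000, ...)
START TRANSACTION;
UPDATE user u
JOIN pm_tmp pt ON u.cust_id = pt.user_id AND pt.id BETWEEN 1 AND 1000
SET u.token_type = 'PRIME';
COMMIT;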
But wait! Why are you updating a huge table? That is usually a sign of bad schema design. Please explain the data flow.
Perhaps you have computed which items to flag as 'prime'? Well, you could keep that list around and do JOINs in the SELECTs to discover prime-ness when reading. This completely eliminates the UPDATE in question. Sure, the JOIN costs something, but not much.
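For illustration, a sketch of that read-time JOIN, assuming presence in pm_tmp is what marks a customer as prime (cust_id, user_id and token_type follow the question; the alias is illustrative):
SELECT u.cust_id,
       IF(pt.user_id IS NULL, u.token_type, 'PRIME') AS effective_token_type
FROM user u
LEFT JOIN pm_tmp pt ON pt.user_id = u.cust_id;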

How to speed up InnoDB count(*) query?

There are a number of similar questions on here, but a lot of the answers say to force the use of an index and that doesn't seem to speed anything up for me.
I want to show a "live" counter on my website showing the number of rows in a table, kind of like how some websites show the number of registered users, or some other statistic, in "real time" (i.e. updated frequently using AJAX or websockets).
My table has about 5M rows. It's growing fairly quickly and there is a high volume of inserts and deletes on it. Running
select count(*) from my_table
takes 1.367 seconds, which is unacceptable because I need my application to get the new row count about once per second.
I tried what many of the answers on here suggest and changed the query to:
select count(*) from my_table use index(my_index)
where my_index is a normal BTREE index on a BIGINT field. But the time actually increased to 1.414 seconds.
Why doesn't using an index speed up the query as many answers on here said it would?
Another option some answers suggest is to put a trigger on the table that increments a column in another table. So I could create a stats table and whenever a row is inserted or deleted in my_table have a trigger increment or decrement a column in the stats table. Is this the only other option, since using an index doesn't seem to work?
EDIT: Here's a perfect example of the type of thing I'm trying to accomplish: https://www.freelancer.com. Scroll to the bottom of the page and you'll see a row of site-wide statistics counters. Those numbers update every second or so.
It takes time to read 5 million records and count them -- whether in an index or in the raw data form.
If a "fast-and-dirty" solution is acceptable, you can use metadata:
SELECT table_rows
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = <whatever> and TABLE_NAME = <whatever2>;
Note that this can get out-of-sync.
Another possibility is to partition the table into smaller chunks. One advantage is that if the inserts and deletes tend to be to one partition, you can just count that and use metadata for the other partitions.
A trigger may or may not help in this situation, depending on the insert/delete load. If you are doing multiple inserts per minute, then a trigger is a no-brainer -- a fine solution. If you are doing dozens or hundreds of changes per second, then the overhead of the trigger might slow down the server.
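For illustration, a minimal sketch of the trigger approach (the stats table and trigger names are assumptions; the counter row must be seeded before the triggers start firing):
CREATE TABLE stats (row_count BIGINT NOT NULL);
INSERT INTO stats (row_count) SELECT COUNT(*) FROM my_table;

CREATE TRIGGER my_table_ai AFTER INSERT ON my_table
FOR EACH ROW UPDATE stats SET row_count = row_count + 1;

CREATE TRIGGER my_table_ad AFTER DELETE ON my_table
FOR EACH ROW UPDATE stats SET row_count = row_count - 1;

-- the live counter is then a single-row read:
SELECT row_count FROM stats;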
If your system is so busy that the counting is having too much impact, then probably the INSERTing/DELETEing is also having impact. One way to improve INSERT/DELETE is to do them in 'batches' instead of one at a time.
Gather the INSERTs, preferably in the app, but optionally in a 'staging' table. Then, once a second (or whatever) copy them into the real table using an INSERT..SELECT, or (if needed) INSERT..ON DUPLICATE KEY UPDATE. DELETEs can go into the same table (with a flag) or a separate table.
The COUNT(*) can be done at the end of the batch. Or it could be dead reckoned (at much lower cost) by knowing what the count was, then adjusting by what the staging table(s) will change it by.
This is a major upheaval to your app code, so don't embark on it unless you have spikes of, say, >100 INSERTs/DELETEs per second. (A steady 100 INSERTs/sec = 3 billion rows per year.)
For more details on "staging table", see http://mysql.rjweb.org/doc.php/staging_table. Note that that blog advocates flip-flopping between a pair of staging tables, so as to minimize locks on them and to allow multiple clients to coexist.
Have a job running in the background that does the following; then use its table for getting the count:
Loop:
    INSERT INTO Counter (ct_my_table)
        SELECT COUNT(*) FROM my_table;
    sleep 1 second
end loop
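As a hedged alternative sketch, roughly the same loop can live inside MySQL itself via the Event Scheduler (assuming event_scheduler is enabled and a Counter table like the one above exists; the event name is an assumption):
-- requires: SET GLOBAL event_scheduler = ON;
CREATE EVENT refresh_counter
    ON SCHEDULE EVERY 1 SECOND
    DO INSERT INTO Counter (ct_my_table)
       SELECT COUNT(*) FROM my_table;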
At worst, it will be a couple of seconds out of date. Also note that INSERTs and DELETEs interfere with (read: slow down) the SELECT COUNT(*), hence the "sleep".
Have you noticed that some UIs say "About 120,000 thingies"? They are using even cruder estimations. But it is usually good enough for the users.
Take the inaccurate value from information_schema, as Gordon Linoff suggested.
Another inaccurate source of the row count is SELECT MAX(id) - MIN(id).
Create a table my_table_count in which you store the row count of my_table and update it with triggers.
In many cases you don't need an accurate value. Who cares if you show 36,400 users instead of the accurate 36,454?

Count rows or store value MySQL

I want to see how many different users have connected to a website, but I'm not sure if I should count the rows in the users table or store a separate value that is incremented each time a user registers.
What are the benefits of using a stored value as opposed to counting rows, in terms of speed, reliability, etc.?
Thanks.
If the table is not huge (you don't have many users) you should use COUNT.
If you have a huge table, you had better use another table to store the count of users, or, if an approximate row count is sufficient, you can also use SHOW TABLE STATUS (ref).
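For example (the table name is illustrative; for InnoDB the Rows value is only an estimate):
SHOW TABLE STATUS LIKE 'users';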
There is some useful information here: http://dev.mysql.com/doc/refman/5.5/en/innodb-restrictions.html
InnoDB does not keep an internal count of rows in a table because concurrent transactions might “see” different numbers of rows at the same time. To process a SELECT COUNT(*) FROM t statement, InnoDB scans an index of the table, which takes some time if the index is not entirely in the buffer pool. If your table does not change often, using the MySQL query cache is a good solution. To get a fast count, you have to use a counter table you create yourself and let your application update it according to the inserts and deletes it does.[...] See Section 14.3.14.1, “InnoDB Performance Tuning Tips”.
I hope @GordonLinoff or some other SQL guru can give you more information about when a table is considered big enough.
Neither of them.
I would just update a timestamp for the user each time they click something.
Then you could fetch all these timestamps and check the time of the last activity.
If you just increment a value, you have the problem that you do not know whether the user has already disconnected.
With this approach you have a perfect overview of a user's last activity.
That's the way it is handled in most software, for example forums.
It works for both guests (normal visitors) and registered users if you log the client's IP.
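A rough sketch of that approach, assuming a last_activity column on a users table (the names and the 15-minute window are illustrative assumptions):
-- on each request or click by the user:
UPDATE users SET last_activity = NOW() WHERE user_id = ?;

-- "currently connected" users within an arbitrary 15-minute activity window:
SELECT COUNT(*) FROM users
WHERE last_activity > NOW() - INTERVAL 15 MINUTE;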

How to store specific data (e.g. polls) in a MySQL database?

Let's say I would like to store votes for polls in a MySQL database.
As far as I know, I have two options:
1. Create one table (let's say votes) with fields like poll_id, user_id, selected_option_id, vote_date and so on.
2. Create a new database for votes (let's say votes_base) and for each poll add a table to this database (a table whose name contains the id of the poll), let's say poll[id of the poll].
The problem with the first option is that the table will become big very soon. Let's say I have 1000 polls and each poll has 1000 votes - that's already a million records in the table. I don't know how much that will cost in performance.
The problem with the second option is that I'm not sure it is the correct solution from a design point of view. But I'm sure that with this option it will be (much?) faster to find all the votes for some poll.
Or maybe there is a better option?
Your first option is the better option. It is structurally more sound. Millions of rows in a table is no problem for MySQL. A new table per poll is an antipattern.
EDIT for first comment:
Even for a billion or more votes, MySQL should handle it. Indexes are the key here. What is the difference between one database with 100 copies of the same table, and one table with 100 times the rows?
Technically, the second option works as well. Sometimes it might be even better. But we frequently see this:
Instead of one table, users, with 10 columns
Make 100 tables, users_uk, users_us, ... depending on where the users are from.
Great, no? Works, yes? Well it does, until you want to select all the male users, or join the users table onto another table. You'll have a huge UNION coming, and you won't even know the tables beforehand.
One big users table, with the appropriate indexes, is better. If it gets too big for your liking (or your disk), you can start with PARTITIONING: you still have the benefit of one table, but the partitions are stored on different locations.
Now, with your polls, these kinds of queries might not happen. In that case, one big InnoDB table or thousands of small tables might both work, but the first option is a lot easier to program and has no drawbacks over the second option. Why choose the second option?
The first option is the better one, no doubt. Just be sure to define INDEXes on the fields you will use to search the data (such as poll_id, for sure) and you will not experience performance issues. MySQL is a DBMS perfectly capable of handling such an amount of rows. Do not worry.
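For illustration, a sketch of the first option with such an index (field names follow the question; the column types and the choice of primary key are assumptions):
CREATE TABLE votes (
    poll_id            INT UNSIGNED NOT NULL,
    user_id            INT UNSIGNED NOT NULL,
    selected_option_id INT UNSIGNED NOT NULL,
    vote_date          DATETIME NOT NULL,
    PRIMARY KEY (poll_id, user_id)   -- one vote per user per poll; also serves "all votes for one poll"
) ENGINE=InnoDB;

-- counting the votes for one poll is an index range scan, regardless of total table size:
SELECT selected_option_id, COUNT(*) AS votes
FROM votes
WHERE poll_id = ?
GROUP BY selected_option_id;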
The first option is better. And you can archive tables after a while if you are not going to use them often.