Optimize SQL to get row count - MySQL

I have a page on my site that keeps track of the number of people accessing it. Another part of the site displays data about the users who access this page, about 10 at a time.
The problem is that I need pagination, so I need to know how much data is in my table at any given time, and that makes the display page take 2-3 seconds to load, sometimes 7-10, because I have millions of records. How do I get this page to load faster? The count query is:
Select COUNT(*) as Count from visits

My first response is: if you are paging through records 10 at a time, why do you need the total count of more than a million rows?
Second, counting a million rows should not take very long, unless your rows are wide (lots of columns or wide columns). If that is the case, then:
select count(id) from visits;
can help, because it will explicitly use an index. Note that the first run may be slower than subsequent runs because of caching.
If you decide that you do need an exact row count, then your only real option for speeding it up using MySQL is to create triggers to maintain the count in another table. However, that will slow down inserts and deletions, which might not be a good idea.
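A minimal sketch of that trigger approach, assuming a single-row counter table (the visit_count name is just illustrative):
-- counter table kept in sync by triggers
CREATE TABLE visit_count (total BIGINT NOT NULL);
INSERT INTO visit_count VALUES (0);
CREATE TRIGGER visits_ai AFTER INSERT ON visits
FOR EACH ROW UPDATE visit_count SET total = total + 1;
CREATE TRIGGER visits_ad AFTER DELETE ON visits
FOR EACH ROW UPDATE visit_count SET total = total - 1;
-- the exact count is then a constant-time read
SELECT total FROM visit_count;
Keep in mind this serializes writes on the single counter row, which is exactly the insert/delete slowdown mentioned above.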

The best answer is to say "About 1,234,000 visits", not the exact number. Then calculate it daily (or whatever).
But if you must have the exact count, ...
If this table is "write only", then there is a solution. It involves treating it as a "Fact" table in a Data Warehouse. Then create and maintain a "Summary table" with a row for, say, each hour. Then the COUNT becomes:
SELECT SUM(hourly_count) FROM SummaryTable;
This will be much faster because there is much less to scan. However, there is a problem in that it does not include the count for the last (partial) hour. But that can be solved if you use INSERT ... ON DUPLICATE KEY UPDATE ... to increment the counter for the current hour or insert a new row with a "1".
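A sketch of what that could look like, with one summary row per hour (all names are illustrative):
CREATE TABLE SummaryTable (
    hr DATETIME NOT NULL PRIMARY KEY,   -- start of the hour
    hourly_count INT UNSIGNED NOT NULL
);
-- bump the counter for the current hour (per visit, or per batch of visits)
INSERT INTO SummaryTable (hr, hourly_count)
VALUES (DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00'), 1)
ON DUPLICATE KEY UPDATE hourly_count = hourly_count + 1;
-- fast total, including the current partial hour
SELECT SUM(hourly_count) FROM SummaryTable;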
But, before we take this too far, please inform us of how often a new 'visit' occurs.

You cannot make that query faster without upgrading the server's hardware or adding more servers to run it in parallel. In the latter case it would be better to move to a NoSQL database.
My approach would be to reduce the number of records. You could do that by keeping a temporary table where you record the access logs for the past hour/day, and after that time running a cron job that deletes the data or moves it to another table for long-term storage.

You usually do not need to know the exact number of rows for pagination.
SELECT COUNT(*) FROM
(SELECT 1 FROM visits LIMIT 10000) AS v
would tell you that there are at least 1000 pages. In most cases you do not need to know more.
You can store the total count somewhere and update it from time to time if you want a reasonable estimate. If you need the exact number, you can use a trigger to keep it current. The more up to date the info, the more expensive it is, of course.

Decide on a limit (say, the 1000 most recent visits) from a practical (business requirements) point of view. Have an auto_increment key (id) or a timestamp (createdon). Grab at most 1000 records:
select count(*) from (select id from visits order by id desc limit 1000) as recent
or grab all 1000 and count/paginate on the client side (PHP), since even if you paginate in MySQL it will still go through those records:
select * from visits order by id desc limit 1000

Related

How to speed up InnoDB count(*) query?

There are a number of similar questions on here, but a lot of the answers say to force the use of an index and that doesn't seem to speed anything up for me.
I want to show a "live" counter on my website showing the number of rows in a table, kind of like how some websites show the number of registered users, or some other statistic, in "real time" (i.e. updated frequently using AJAX or WebSockets).
My table has about 5M rows. It's growing fairly quickly and there is a high volume of inserts and deletes on it. Running
select count(*) from my_table
takes 1.367 seconds, which is unacceptable because I need my application to get the new row count about once per second.
I tried what many of the answers on here suggest and changed the query to:
select count(*) from my_table use index(my_index)
where my_index is a normal BTREE index on a BIGINT field. But the time actually increased to 1.414 seconds.
Why doesn't using an index speed up the query as many answers on here said it would?
Another option some answers suggest is to put a trigger on the table that increments a column in another table. So I could create a stats table and whenever a row is inserted or deleted in my_table have a trigger increment or decrement a column in the stats table. Is this the only other option, since using an index doesn't seem to work?
EDIT: Here's a perfect example of the type of thing I'm trying to accomplish: https://www.freelancer.com. Scroll to the bottom of the page and you'll see a row of site-wide statistics counters.
Those numbers update every second or so.
It takes time to read 5 million records and count them -- whether in an index or in the raw data form.
If a "fast-and-dirty" solution is acceptable, you can use metadata:
SELECT table_rows
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = <whatever> and TABLE_NAME = <whatever2>;
Note that this can get out-of-sync.
Another possibility is to partition the table into smaller chunks. One advantage is that if the inserts and deletes tend to be to one partition, you can just count that and use metadata for the other partitions.
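If you go the partitioning route, a rough sketch (the partition scheme and names are assumptions, and the partitioning key must be part of every unique key on the table):
-- hypothetical range partitioning by id; only the newest partition is "hot"
ALTER TABLE my_table PARTITION BY RANGE (id) (
    PARTITION p0 VALUES LESS THAN (2000000),
    PARTITION p1 VALUES LESS THAN (4000000),
    PARTITION p_hot VALUES LESS THAN MAXVALUE
);
-- count only the active partition, and use metadata for the frozen ones
SELECT COUNT(*) FROM my_table PARTITION (p_hot);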
A trigger may or may not help in this situation, depending on the insert/delete load. If you are doing multiple inserts per minute, then a trigger is a no-brainer -- a fine solution. If you are doing dozens or hundreds of changes per second, then the overhead of the trigger might slow down the server.
If your system is so busy that the counting is having too much impact, then probably the INSERTing/DELETEing is also having impact. One way to improve INSERT/DELETE is to do them in 'batches' instead of one at a time.
Gather the INSERTs, preferably in the app, but optionally in a 'staging' table. Then, once a second (or whatever) copy them into the real table using an INSERT..SELECT, or (if needed) INSERT..ON DUPLICATE KEY UPDATE. DELETEs can go into the same table (with a flag) or a separate table.
The COUNT(*) can be done at the end of the batch. Or it could be dead reckoned (at much lower cost) by knowing what the count was, then adjusting by what the staging table(s) will change it by.
This is a major upheaval to your app code, so don't embark on it unless you have spikes of, say, >100 INSERTs/DELETEs per second. (A steady 100 INSERTs/sec = 3 billion rows per year.)
For more details on the "staging table" approach, see http://mysql.rjweb.org/doc.php/staging_table (note that the blog advocates flip-flopping between a pair of staging tables to minimize locks on them and to allow multiple clients to coexist).
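A very rough sketch of that batching plus dead-reckoning idea, assuming a staging table and a one-row counter (all names here are made up; the flip-flop between two staging tables described in that blog is what avoids races with concurrent inserts):
-- run once a second (or whatever)
START TRANSACTION;
-- dead reckoning: adjust the remembered total by the size of this batch
UPDATE counter_table SET total = total + (SELECT COUNT(*) FROM staging);
-- move the batch into the real table, then empty the staging table
INSERT INTO my_table SELECT * FROM staging;
DELETE FROM staging;
COMMIT;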
Have a job running in the background that does the following; then use its table for getting the count:
Loop:
    INSERT INTO Counter (ct_my_table)
        SELECT COUNT(*) FROM my_table;
    sleep 1 second
end loop
At worst, it will be a couple of seconds out of date. Also note that INSERTs and DELETEs interfere with (read: slow down) the SELECT COUNT(*), hence the "sleep".
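If you would rather not run an external job, MySQL's event scheduler can do roughly the same thing (a sketch; it assumes the event scheduler is enabled and the same Counter table exists):
CREATE EVENT refresh_counter
ON SCHEDULE EVERY 1 SECOND
DO
    INSERT INTO Counter (ct_my_table)
    SELECT COUNT(*) FROM my_table;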
Have you noticed that some UIs say "About 120,000 thingies"? They are using even cruder estimations. But it is usually good enough for the users.
Take the inaccurate value from information_schema, as Gordon Linoff suggested.
Another inaccurate source of the row count is SELECT MAX(id) - MIN(id).
Create a table my_table_count where you store the row count of my_table and update it with triggers.
In many cases you don't need an accurate value. Who cares if you show 36,400 users instead of the accurate 36,454?

MySQL query optimization for DISTINCT count

I had a query like this
select count(distinct sessionKey) as tot from visits
But it takes too long to execute: 48,512 ms. Within a few months the amount of data in the table will double. How can I optimize this query?
Add an INDEX on your SessionKey column and it will improve performance:
ALTER TABLE visits ADD INDEX (SessionKey);
Like others suggested, adding an index is the first and easiest thing to do.
If you have tons and tons of rows in there, going through all of them may still take some time anyway.
I once had a problem like this, where someone had coded a system in which users could vote on news entries. Every vote was saved as a single row in the database, and every webpage showed a list of the "top voted" news. That meant a query had to read the entire votes table, sum the votes, and sort by that sum. With entries in the multiple-100k range, this took some serious time. Someone before me had "solved" it by caching the results. That worked nicely most of the time, but if all caches were cleared, the whole page broke for hours until the caches were rebuilt. I eventually fixed it by not saving every vote as its own row, but just a running sum for every entry.
What I want to tell you with this: you could either try caching (but the result wouldn't be "live", of course), or change something in the database, like adding a field or table that stores the count you want to read and is updated on every insert into the visits table. This creates a little more load on insert, but reading that number becomes super cheap.
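For the distinct-count case specifically, that "extra table" idea could look like this (a sketch; visit_sessions is a made-up name and the column type is a guess):
-- one row per distinct session, enforced by the primary key
CREATE TABLE visit_sessions (
    sessionKey VARCHAR(64) NOT NULL PRIMARY KEY
);
-- on every insert into visits, also record the session (a no-op if it is already known)
INSERT IGNORE INTO visit_sessions (sessionKey) VALUES ('abc123');
-- the distinct count becomes a scan of a much smaller table
SELECT COUNT(*) FROM visit_sessions;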

ORDER BY RAND() alternative [duplicate]

Possible Duplicate:
MySQL: Alternatives to ORDER BY RAND()
I currently have a query that ends with ORDER BY RAND(HOUR(NOW())) LIMIT 40 to get 40 random results. The list of results changes each hour.
This kills the query cache, which is damaging performance.
Can you suggest an alternative way of getting a random(ish) set of results that changes from time to time? It does not have to be every hour and it does not have to be totally random.
I would prefer a random result, rather than sorting on an arbitrary field in the table, but I will do that as a last resort...
(this is a list of new products that I want to shuffle around a bit every now and then).
If you have an ID column it's better to do:
-- create a variable to hold the random row number
SET @rownum := (SELECT COUNT(*) FROM table);
SET @row := (SELECT CEIL(RAND() * @rownum));
-- use the random number to select on the id column
SELECT * FROM table WHERE id = @row;
The logic of selecting the random id can also be moved to the application level.
SELECT * FROM table ORDER BY RAND() LIMIT 40
is very inefficient, because MySQL will process ALL the records in the table, performing a full table scan over all the rows and ordering them randomly.
It's going to kill the cache because you are expecting a different result set each time. There is no way to cache a random set of values. If you want to cache a group of results, cache a large random set of values, and then within sub-sections of the time you are going to use those values, do a random grab within the smaller set [outside of SQL].
I think a better way is to download the product identifiers to your middle layer, choose 40 random values when you need them (once per hour or for every request), and use them in the query: product_id IN (#id_1, #id_2, ..., #id_40).
You could have a column with random values that you update every hour.
This is going to be a significantly nasty query if it needs to sort a large data set into a random order (which really does require a sort), then discard all but the first 40 records.
A better solution would be to just pick 40 random records. There are lots of ways of doing this and it usually depends on having keys which are evenly distributed.
Another option is to pick the 40 random records in a batch job which is only run once per hour (or whatever) and then remember which ones they are.
One way to achieve it is to shuffle the objects you map the data to. If you don't map the data to objects, you could shuffle the result array from the database. I don't know if this will perform better or not, but you will at least get the benefits from the query cache as you mention.
You could also generate a random sequence from 1 to n, and index the result array (or object array) with those.
Calculate the current hour in your PHP code and pass that to your query. This will result in a static value that can be cached.
Note that you might also have a hidden bug: since you're only taking the hour, you only have 24 different seed values, which repeat every day. That means what's showing at 1 pm today will be the same as what shows at 1 pm tomorrow. You might want to change that.
Don't fight with the cache -- exploit it!
Write your query as you are (or even simpler). Then, in your code, cache the results, setting a cache expiry for 1 hour. If you are using a caching layer, like memcached, you are set. If not, you can build a fairly simple one:
[pseudocode]
global cache[24]
h = Time.hour
if (cache[h] == null) {
    cache[h] = ... run your query ...
}
return cache[h];
If you only need a new set of random data once an hour, don't hit the database - save the results to your application's caching layer (or, if it doesn't have one, just put it out into a temporary file of some sort). Query cache is handy, but if you never need to even execute a query, even better...

mysql, force limited entries in a table

I keep some temporary data in a MEMORY table. I only need the 20 most recent entries and would prefer that the data always stay on the heap. How should I accomplish this? I am sure there is nothing I can do about the MEMORY table itself, but how should I handle the entries? Should I add an auto-increment key and delete the oldest row whenever I want to push a new value in?
Could you please describe in more detail what you are trying to do? I don't see why you want to keep the most recent data in an additional table when you can just use a SELECT with descending order and a LIMIT 20. If the SELECT query is too expensive, then just cache the result using memcached or similar and clear the cache every time new data is inserted.
If the additional table is really necessary, there are several ways to prune old data from it. Either you fetch the id of the 20th most recent row (again descending order, with LIMIT 19,1) and delete everything with a smaller id (assuming you have an auto-increment key, timestamp, etc.), or you SELECT COUNT(*) and then do a DELETE in ascending order with a LIMIT of (all items - 20). This could be packed into a cron job that runs every few minutes, as sketched below.
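A sketch of the first variant, assuming the table is called entries and has an auto-increment id (the extra derived table works around MySQL's restriction on referencing the DELETE's target table in a subquery):
DELETE FROM entries
WHERE id < (
    SELECT id FROM (
        SELECT id FROM entries ORDER BY id DESC LIMIT 19, 1
    ) AS t
);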
But I would really recommend using a cache and looking at the table definition. With a decent index there shouldn't be any problems.
Appending to the 20-entry table and removing the oldest element (i.e. the one with the minimum ID?) is possible. However, note that this will fragment the table.
That's OK so long as you run OPTIMIZE every once in a while.
A different way would be to pre-allocate 20 entries and keep a separate counter of which entry is the latest. Then instead of insert/delete, you would update the item ID based on the counter, which you would then increment (mod 20 + 1) and store again.
However note that both of these models work only under a "single-threaded" model. If multiple threads are running on the table it's possible that they'll conflict.
If the counter is in program memory, shared by threads but guarded properly, that will be both thread-safe and efficient.
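A sketch of that pre-allocated variant under the single-threaded assumption above (the table layout and the application-side counter are assumptions):
-- 20 fixed slots, created and pre-filled once (insert slots 1 through 20)
CREATE TABLE recent_entries (
    slot TINYINT NOT NULL PRIMARY KEY,  -- 1..20
    payload VARCHAR(255),
    created_at DATETIME
) ENGINE=MEMORY;
-- to "push" a new value, overwrite the slot the application counter points at
UPDATE recent_entries
SET payload = 'new value', created_at = NOW()
WHERE slot = 7;  -- 7 = current counter; afterwards set counter = counter MOD 20 + 1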

Should totals be denormalized?

I am working on a website with a simple normalized database.
There is a table called Pages and a table called Views. Each time a Page is viewed, a unique record of that View is recorded in the Views table.
When displaying a Page on the site, I use a simple MySQL COUNT() to total up the number of Views for display.
Database design seems fine, except for this problem: I am at a loss for how to retrieve the top 10 most viewed pages among thousands.
Should I denormalize the Pages table by adding a Pages.views column to hold the total number of views for each page? Or is there an efficient way to query for the top 10 most viewed pages?
SELECT p.pageid, COUNT(*) AS viewcount
FROM pages p
INNER JOIN views v ON p.pageid = v.pageid
GROUP BY p.pageid
ORDER BY COUNT(*) DESC
LIMIT 10 OFFSET 0;
I can't test this, but something along those lines. I would not store the value unless I have to due to performance constraints (I just learned the term "premature optimization", and it seems to apply if you do).
It depends on the level of information you are trying to maintain. If you want to record who viewed it and when, then the separate table is fine. Otherwise, a column for Views is the way to go. Also, if you keep a separate column, you'll find that the table gets locked more often, since each page view tries to update the column for its corresponding row.
SELECT pageid, COUNT(*) AS countCol FROM Views
GROUP BY pageid
ORDER BY countCol DESC
LIMIT 10 OFFSET 0;
Database normalization is all about the most efficient / least redundant way to store data. This is good for transaction processing, but often directly conflicts with the need to efficiently get the data out again. The problem is usually addressed by having derived tables (indexes, materialized views, rollup tables...) with more accessible, pre-processed data. The (slightly dated) buzzword here is Data Warehousing.
I think you want to keep your Pages table normalized, but have an extra table with the totals. Depending on how recent those counts need to be, you can update the table when you update the original table, or you can have a background job to periodically recalculate the totals.
You also want to do this only if you really run into a performance problem, which you will not unless you have a very large number of records, or a very large number of concurrent accesses. Keep your code flexible to be able to switch between having the table and not having it.
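A sketch of that extra totals table with a periodic recalculation (page_view_totals is a made-up name):
CREATE TABLE page_view_totals (
    pageid INT NOT NULL PRIMARY KEY,
    viewcount INT NOT NULL
);
-- background job, run as often as the numbers need to be fresh
REPLACE INTO page_view_totals (pageid, viewcount)
SELECT pageid, COUNT(*) FROM views GROUP BY pageid;
-- the top 10 then reads straight off the small table
SELECT pageid, viewcount FROM page_view_totals ORDER BY viewcount DESC LIMIT 10;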
I would probably include the views column in the Pages table.
It seems like a perfectly reasonable breaking of normalization to me, especially since I can't imagine you deleting views, so you wouldn't expect the count to get out of whack. Referential integrity doesn't seem super-critical in this case.
Denormalizing would definitely work in this case. Your loss is the extra storage room used up by the extra column.
Alternatively, you could set up a scheduled job to populate this information on a nightly basis (or whenever your traffic is low, or every x period of time).
In this case you would be losing the ability to instantly know your page counts unless you run this query manually.
Denormalization can definitely be employed to increase performance.
While this is an old question, I'd like to add my answer because I find the accepted one to be misguided.
It is one thing to have a COUNT for a single selected row; it is quite another to sort by the COUNT over ALL rows.
Even if you have just 1000 pages, each counted with some join, you can easily end up reading tens of thousands, if not millions, of rows.
It can be ok if you only call this occasionally, but it is very costly otherwise.
What you can do is add a TRIGGER:
CREATE TRIGGER ins AFTER INSERT ON Views
FOR EACH ROW
    UPDATE Pages
    SET views = views + 1
    WHERE Pages.pageid = NEW.pageid;
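If rows in Views can also be deleted, a matching AFTER DELETE trigger (a sketch mirroring the one above) keeps the total honest:
CREATE TRIGGER del AFTER DELETE ON Views
FOR EACH ROW
    UPDATE Pages
    SET views = views - 1
    WHERE Pages.pageid = OLD.pageid;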