MySQL query optimization for DISTINCT count

I have a query like this:
select count(distinct sessionKey) as tot from visits
But it takes too long to execute: currently 48512 ms.
Within a few months the table will hold twice as much data as it does now. How can I optimize this query?
This is my table structure

Add an index on the sessionKey column and the query's performance will improve.
ALTER TABLE visits ADD INDEX (sessionKey);
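To check whether the index is actually used, a variant of the statement above that names the index, followed by EXPLAIN, could look like the sketch below. The index name idx_visits_sessionKey is an assumption made for illustration; use it instead of the unnamed index, not in addition to it.

```sql
-- Assumption: the index name idx_visits_sessionKey is hypothetical.
ALTER TABLE visits ADD INDEX idx_visits_sessionKey (sessionKey);

-- Inspect the plan: the "key" column should name the new index, and
-- "Extra" should mention "Using index" (possibly "Using index for group-by").
EXPLAIN SELECT COUNT(DISTINCT sessionKey) AS tot FROM visits;
```

If the plan still shows a full table scan (type ALL with no key), the index is not being used and the query will not get faster.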

As others suggested, adding an index is the first and easiest thing to do.
If the table has a very large number of rows, going through all of them may still take some time.
I once had a similar problem: someone had built a system where users could vote on news entries, and every vote was saved as a single row in the database. Every web page showed a list of the "top voted" news, which meant a query that read the complete votes table, summed the votes per entry, and sorted by that sum. With several hundred thousand entries, this took serious time. Someone before me had "solved" it by caching the results. That worked nicely most of the time, but whenever the caches were cleared, the whole page misbehaved for hours until they were rebuilt. I then fixed it by storing only a running sum for each entry instead of one row per vote.
The point is: you could either try caching (though the result wouldn't be "live", of course), or change the database, for example by adding a column or table that stores the count you want to read and updating it on every insert into the visits table. This adds a little load on insert, but reading the number becomes very cheap.
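For the distinct-session case, the stored-counter idea could be sketched roughly as follows. The tables visit_sessions and visit_stats and the VARCHAR(64) type are assumptions, since the real table structure is not shown. A plain counter bumped on every insert would count visits rather than distinct sessions, so the sketch records each key once and only increments when the key is new.

```sql
-- Hypothetical helper tables; names and column types are assumptions.
CREATE TABLE visit_sessions (
  sessionKey VARCHAR(64) NOT NULL PRIMARY KEY
);
CREATE TABLE visit_stats (
  id                TINYINT NOT NULL PRIMARY KEY,
  distinct_sessions BIGINT UNSIGNED NOT NULL DEFAULT 0
);
INSERT INTO visit_stats (id, distinct_sessions) VALUES (1, 0);

-- Run alongside every insert into visits, in the same transaction
-- ('abc123' is an example value):
START TRANSACTION;
INSERT IGNORE INTO visit_sessions (sessionKey) VALUES ('abc123');
SET @added = ROW_COUNT();  -- 1 if the key was new, 0 if it already existed
UPDATE visit_stats
   SET distinct_sessions = distinct_sessions + @added
 WHERE id = 1;
COMMIT;

-- Reading the count is now a single-row lookup.
SELECT distinct_sessions AS tot FROM visit_stats WHERE id = 1;
```

Existing rows in visits would need a one-time backfill into both helper tables before the counter can be trusted.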

Related

Best way to aggregate logging data

So I have a database that contains logs of people clicking a link. What I store is id, country, referrer-domain, clickedat. That last column is a timestamp.
At the moment there are not a lot of rows, but if this takes off it might be tens to hundreds of thousands of rows. What is the best way to query the database for things like this:
Times viewed per day for the past month
Top 20 countries that used your link
Top 20 websites referring your link
Using a COUNT(*) would eventually be way too slow. I've seen techniques where you add another query to every update, insert, delete that happens to save in a special aggregation table. But I'm not sure that would work, as I'd like to have users be able to select two specific dates for example. Or I'd have to aggregate by day.

If you add an indexed date column, so that you're not doing date/time calculations on the fly, you should be able to query it with normal aggregations. It would take a long time before that became 'way too slow' with properly written queries.
If it takes off, look into de-normalising the data as you describe, but don't optimise it prematurely!
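A minimal sketch of what that might look like, assuming the table is called clicks, that referrer_domain stands in for the "referrer-domain" column, and that clicked_on is the new indexed date column (all three names are hypothetical, as are the example dates):

```sql
-- Assumptions: table name clicks, column clicked_on, and the index name
-- are hypothetical; clickedat and country come from the question.
ALTER TABLE clicks
  ADD COLUMN clicked_on DATE NULL,
  ADD INDEX idx_clicks_clicked_on (clicked_on);

-- Backfill existing rows; new rows must set clicked_on on insert.
UPDATE clicks SET clicked_on = DATE(clickedat);

-- Times viewed per day for the past month
SELECT clicked_on, COUNT(*) AS views
  FROM clicks
 WHERE clicked_on >= CURDATE() - INTERVAL 1 MONTH
 GROUP BY clicked_on
 ORDER BY clicked_on;

-- Top 20 countries between two user-selected dates (example values)
SELECT country, COUNT(*) AS views
  FROM clicks
 WHERE clicked_on BETWEEN '2024-01-01' AND '2024-01-31'
 GROUP BY country
 ORDER BY views DESC
 LIMIT 20;
```

The top-20-referrers query has the same shape, with referrer_domain in place of country.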

Optimize SQL to get rows count

I have a page on my site that keeps track of the number of people accessing it. Another part of the site displays information about the users who access this page, about 10 at a time.
The problem is that I need to create pagination, so I need to know how many rows are in my table at any time. This makes the display page slow to load: 2-3 seconds, sometimes 7-10, because I have millions of records. How do I get this page to load faster?
Select COUNT(*) as Count from visits

My first response is . . . if you are paging records 10 at a time, why do you need the total count of more than a million?
Second, counting a million rows should not take very long, unless your rows are wide (lots of columns or wide columns). If that is the case, then:
select count(id) from t;
can help, because it will explicitly use an index. Note that the first run may be slower than subsequent runs because of caching.
If you decide that you do need an exact row count, then your only real option for speeding it up using MySQL is to create triggers to maintain the count in another table. However, that will slow down inserts and deletions, which might not be a good idea.

The best answer is to say "About 1,234,000 visits", not the exact number. Then calculate it daily (or whatever).
But if you must have the exact count, ...
If this table is "write only", then there is a solution. It involves treating it as a "Fact" table in a Data Warehouse. Then create and maintain a "Summary table" with a row for, say, each hour. Then the COUNT becomes:
SELECT SUM(hourly_count) FROM SummaryTable;
This will be much faster because there is much less to scan. However, there is a problem in that it does not include the count for the last (partial) hour. But that can be solved if you use INSERT ... ON DUPLICATE KEY UPDATE ... to increment the counter for the current hour or insert a new row with a "1".
Some more info is here.
But, before we take this too far, please inform us of how often a new 'visit' occurs.
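The summary-table approach in this answer could look roughly like the sketch below. SummaryTable and hourly_count come from the answer's query; the hour_start column and its type are assumptions.

```sql
-- Assumption: one row per hour, keyed by the start of that hour.
CREATE TABLE SummaryTable (
  hour_start   DATETIME NOT NULL PRIMARY KEY,
  hourly_count INT UNSIGNED NOT NULL
);

-- Run once per visit: creates the row for the current hour with a count
-- of 1, or increments it if that row already exists.
INSERT INTO SummaryTable (hour_start, hourly_count)
VALUES (DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00'), 1)
ON DUPLICATE KEY UPDATE hourly_count = hourly_count + 1;

-- Total, including the current partial hour.
SELECT SUM(hourly_count) FROM SummaryTable;
```

A year of traffic is about 8,760 rows in such a table, so the SUM scans thousands of rows rather than millions.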

You cannot make that query faster without changing the server's hardware or adding more servers to run it in parallel. In the second case it would be better to move to a NoSQL database.
My approach would be to reduce the number of records. You could do that by keeping a temporary table where you record the access logs for the past hour or day, and then running a cron job that deletes the data or moves it to another table for long-term storage.

You usually do not need to know the exact number of rows for pagination.
SELECT COUNT(*) FROM
(SELECT 1 FROM visits LIMIT 10000) AS v
would tell you whether there are at least 1000 pages. In most cases you do not need to know more.
You can store the total count somewhere and update it from time to time if you want a reasonable estimate. If you need the exact number, you can use a trigger to keep it current. The more up to date the info, the more expensive it is, of course.

Decide on a limit (say, the last 1000) based on practical business requirements. Have an auto_increment index (id) or a timestamp (createdon). Grab at most 1000 records:
select count(*) from (select id from visits order by id desc limit 1000) as recent
or grab all 1000 and paginate on the client side (PHP), since if you paginate in MySQL it will still go through those records:
select * from visits order by id desc limit 1000

Optimizing COUNT() on MariaDB for a Statistics Table

I've read a number of posts here and elsewhere about people wrestling to improve the performance of the MySQL/MariaDB COUNT function, but I haven't found a solution that quite fits what I am trying to do. I'm trying to produce a live-updating list of read counts for a list of articles. Each time a visitor visits a page, a log table in the SQL database records the usual access-log data (IP, browser, etc.). Of particular interest, I record the user's ID (uid), and I process the user agent tag to classify known spiders (uaType). The article itself is identified by the "paid" column. The goal is to produce a statistic that doesn't count the poster's own views of the page and doesn't include known spiders, either.
Here's the query I have:
SELECT COUNT(*) FROM uninet_log WHERE paid='1942' AND uid != '1' AND uaType != 'Spider'
This works nicely enough, but very slowly (approximately 1 sec.) when querying against a database with 4.2 million log entries. If I run the query several times during a particular run, it increases the runtime by about another second for each query. I know I could group by paid and then run a single query, but even then (which would require some reworking of my code, but could be done) I feel like 1 second for the query is still really slow and I'm worried about the implications when the server is under a load.
I've tried switching out COUNT(*) for COUNT(1) or COUNT(id) but that doesn't seem to make a difference.
Does anyone have a suggestion on how I might create a better, faster query that would accomplish this same goal? I've thought about having a background process regularly calculate the statistics and cache them, but I'd love to stick to live updating information if possible.
Thanks,
Tim

Add a boolean "summarized" column to your statistics table and make it part of a multicolumn index with paid.
Then have a background process that produces/updates rows containing the read count in a summary table (by article) and marks the statistics table rows as summarized. (Though the summary table could just be your article table.)
Then your live query reports the sum of the already summarized results and the as-yet-unsummarized statistics rows.
This also allows you to expire old statistics table rows without losing your read counts.
(All this assumes you already have an index on paid; if you don't, definitely add one and that will likely solve your problem for now, though in the long run likely you still want to be able to delete old statistics records.)
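A rough sketch of the pieces described in this answer. The summary table article_reads, its read_count column, the index name, and the INT UNSIGNED type for paid are all assumptions; uninet_log, paid, uid, and uaType come from the question.

```sql
-- Assumptions: article_reads, read_count and the index name are hypothetical.
ALTER TABLE uninet_log
  ADD COLUMN summarized TINYINT(1) NOT NULL DEFAULT 0,
  ADD INDEX idx_log_paid_summarized (paid, summarized);

CREATE TABLE article_reads (
  paid       INT UNSIGNED NOT NULL PRIMARY KEY,
  read_count INT UNSIGNED NOT NULL DEFAULT 0
);

-- Live count = stored total + rows the background job has not folded in yet.
SELECT COALESCE((SELECT read_count FROM article_reads WHERE paid = '1942'), 0)
     + (SELECT COUNT(*)
          FROM uninet_log
         WHERE paid = '1942'
           AND summarized = 0
           AND uid != '1'
           AND uaType != 'Spider') AS reads;
```

The background job has to add to read_count exactly the rows it then marks as summarized, for example by working on a fixed id range inside one transaction, so that no view is counted twice or lost.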

Statistical data like display in website from large record set

I have 4 databases whose tables hold a lot of data. My requirement is to show the count of all records in these tables when the mouse hovers over the corresponding div in the UI (it is an ASP.NET website). Note that the count may change every minute or every hour, since another application can add or delete records. The issue is that getting the count takes a long time because of the amount of data, and every mouse-over triggers a call to the corresponding database to fetch the count. Is there a better approach to implement this?
I am thinking of implementing something like the following:
http://www.worldometers.info/world-population/
But to change the figures like that every second I need to call the database, right? (To get the latest count.) Is there a better approach to showing statistics like this?
By the way, I am using MySQL.
Thanks

You need to give more details: which table engines you are using, what your count query looks like, etc.
But assuming that you are using InnoDB and running count(*) or count(primary_id_column), remember that InnoDB has clustered primary keys, which are stored with the data pages of the rows themselves rather than in separate index pages, so the count will do a full scan of the rows.
One thing you can try is to create an additional, separate, non-primary index on any unique column (such as the row's id) and make sure (using an EXPLAIN statement) that your count uses this index.
If this does not work for you, I would suggest creating a separate table (for example with columns table_name and row_count) to store the counters, and creating insert and delete triggers on the other tables (the ones whose records you need to count) to increment or decrement these values. In my experience (we monitor the number of records daily and hourly on tables with hundreds of millions of records and a heavy write load of ~150 inserts/sec), this is the best solution I have come up with so far.
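The counter-table suggestion could be sketched like this. The table name table_counters and the trigger names are assumptions, and visits stands in for any table whose rows need counting; the columns table_name and row_count follow the answer.

```sql
-- Assumption: table_counters and the trigger names are hypothetical.
CREATE TABLE table_counters (
  table_name VARCHAR(64) NOT NULL PRIMARY KEY,
  row_count  BIGINT UNSIGNED NOT NULL
);

-- Seed the counter once from the real table.
INSERT INTO table_counters (table_name, row_count)
SELECT 'visits', COUNT(*) FROM visits;

-- Keep the counter in step with inserts and deletes.
CREATE TRIGGER visits_count_ins AFTER INSERT ON visits
FOR EACH ROW
  UPDATE table_counters SET row_count = row_count + 1 WHERE table_name = 'visits';

CREATE TRIGGER visits_count_del AFTER DELETE ON visits
FOR EACH ROW
  UPDATE table_counters SET row_count = row_count - 1 WHERE table_name = 'visits';

-- Cheap read for the UI.
SELECT row_count FROM table_counters WHERE table_name = 'visits';
```

Two caveats: TRUNCATE TABLE does not fire delete triggers, so the counter must be reset by hand after one, and every write now updates the same counter row, which can become a point of contention under heavy concurrent load.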

Should totals be denormalized?

I am working on a website with a simple normalized database.
There is a table called Pages and a table called Views. Each time a Page is viewed, a unique record of that View is recorded in the Views table.
When displaying a Page on the site, I use a simple MySQL COUNT() to total up the number of Views for display.
Database design seems fine, except for this problem: I am at a loss for how to retrieve the top 10 most viewed pages among thousands.
Should I denormalize the Pages table by adding a Pages.views column to hold the total number of views for each page? Or is there an efficient way to query for the top 10 most viewed pages?

SELECT p.pageid, count(*) as viewcount FROM
pages p
inner join views v on p.pageid = v.pageid
group by p.pageid
order by count(*) desc
LIMIT 10 OFFSET 0;
I can't test this, but something along those lines. I would not store the value unless I have to due to performance constraints (I just learned the term "premature optimization", and it seems to apply if you do).

It depends on the level of information you are trying to maintain. If you want to record who viewed a page and when, then the separate table is fine. Otherwise, a column for views is the way to go. Also, if you keep a separate column, you'll find that the table will be locked more often, since each page view will try to update the column for its corresponding row.
Select pageid, Count(*) as countCol from Views
group by pageid order by countCol DESC
LIMIT 10 OFFSET 0;

Database normalization is all about the most efficient / least redundant way to store data. This is good for transaction processing, but often directly conflicts with the need to efficiently get the data out again. The problem is usually addressed by having derived tables (indexes, materialized views, rollup tables...) with more accessible, pre-processed data. The (slightly dated) buzzword here is Data Warehousing.
I think you want to keep your Pages table normalized, but have an extra table with the totals. Depending on how recent those counts need to be, you can update the table when you update the original table, or you can have a background job to periodically recalculate the totals.
You also want to do this only if you really run into a performance problem, which you will not unless you have a very large number of records, or a very large number of concurrent accesses. Keep your code flexible to be able to switch between having the table and not having it.
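A minimal sketch of the extra totals table with a periodic recalculation. The table page_view_totals, its columns, and the index name are assumptions; Views.pageid follows the queries in the other answers.

```sql
-- Assumptions: page_view_totals, view_count and the index name are hypothetical.
CREATE TABLE page_view_totals (
  pageid     INT UNSIGNED NOT NULL PRIMARY KEY,
  view_count INT UNSIGNED NOT NULL,
  INDEX idx_totals_view_count (view_count)
);

-- Background job: recompute every total in one statement.
REPLACE INTO page_view_totals (pageid, view_count)
SELECT pageid, COUNT(*) FROM Views GROUP BY pageid;

-- Top 10 most viewed pages, read from the small totals table.
SELECT pageid, view_count
  FROM page_view_totals
 ORDER BY view_count DESC
 LIMIT 10;
```

Because the totals live in their own table, dropping it and going back to the plain GROUP BY query is a small change, which keeps the flexibility the answer recommends.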

I would probably include the views column in the Pages table.
It seems like a perfectly reasonable breaking of normalization to me. Especially since I can't imagine you deleting views so you wouldn't expect the count to get out of whack. Referential integrity doesn't seem super-critical in this case.

Denormalizing would definitely work in this case. Your loss is the extra storage room used up by the extra column.
Alternatively, you could set up a scheduled job to populate this information nightly, or at whatever interval suits you, at a time when your traffic is low.
In this case you would be losing the ability to instantly know your page counts unless you run this query manually.
Denormalization can definitely be employed to increase performance.
--Kris

While this is an old question, I'd like to add my answer because I find the accepted one to be misguided.
It is one thing to have COUNT for a single selected row; it is quite another to sort by the COUNT for ALL rows.
Even if you have just 1000 rows, each counted with some join, you can easily involve reading tens of thousands if not millions of rows.
It can be ok if you only call this occasionally, but it is very costly otherwise.
What you can do is to add a TRIGGER:
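-- table1 holds the detail rows (e.g. Views); table2 holds the totals (e.g. Pages)
CREATE TRIGGER ins AFTER INSERT ON table1 FOR EACH ROW
UPDATE table2
SET count = count + 1
WHERE table2.pageid = NEW.pageid;  -- the condition that matches the inserted row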
