Best way to aggregate logging data - MySQL

So I have a database that contains logs of people clicking a link. What I store is id, country, referrer-domain, clickedat. That last column is a timestamp.
At the moment there are not a lot of rows, but if this takes off it might be tens to hundreds of thousands of rows. What is the best way to query the database for things like this:
Times viewed per day for the past month
Top 20 countries that used your link
Top 20 websites referring to your link
Using a COUNT(*) would eventually be way too slow. I've seen techniques where you add an extra query to every update, insert, and delete so that the results are saved in a special aggregation table. But I'm not sure that would work here, as I'd like users to be able to select two specific dates, for example; otherwise I'd have to aggregate by day.

If you add an indexed Date column, so you're not doing date/time calculations on the fly, you should just be able to query it using normal aggregations. It would take a long time before that would be 'way too slow' with properly written queries.
If it takes off, look into de-normalising the data as you describe, but don't optimise it prematurely!
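For illustration, a minimal sketch of that approach, assuming a table named clicks with the columns from the question (the table name, index name, and date literals are placeholders):

ALTER TABLE clicks
    ADD COLUMN clicked_date DATE,
    ADD INDEX idx_clicked_date (clicked_date);
UPDATE clicks SET clicked_date = DATE(clickedat);   -- backfill existing rows

-- Times viewed per day for the past month
SELECT clicked_date, COUNT(*) AS views
    FROM clicks
    WHERE clicked_date >= CURDATE() - INTERVAL 1 MONTH
    GROUP BY clicked_date;

-- Top 20 countries (any date range the user picks works the same way)
SELECT country, COUNT(*) AS views
    FROM clicks
    WHERE clicked_date BETWEEN '2024-01-01' AND '2024-01-31'
    GROUP BY country
    ORDER BY views DESC
    LIMIT 20;

-- Top 20 referring domains
SELECT `referrer-domain`, COUNT(*) AS views
    FROM clicks
    WHERE clicked_date BETWEEN '2024-01-01' AND '2024-01-31'
    GROUP BY `referrer-domain`
    ORDER BY views DESC
    LIMIT 20;

With the index on clicked_date, the date-range queries only touch the rows inside the selected window, so this stays usable well past hundreds of thousands of rows.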

Related

Making a GROUP BY query faster

This is my data from my table:
I have exactly one million rows, so this is just a snippet.
I would like to make this query faster:
It basically groups the values by time (ev represents year, honap represents month, and so on). The problem is that it takes a lot of time. I tried to apply indexes, as you can see here:
but it makes absolutely no difference.
Here is my index:
I have also tried adding the perc column (which represents minute) because of its cardinality, but MySQL doesn't want to use that index. Could you give me any suggestions?
Is the data realistic? If so, why run the query -- it essentially delivers exactly what was in the table.
If, on the other hand, you had several rows per minute, then the GROUP BY makes sense.
The index you have is not worth using. However, the Optimizer seemed to like it. That's a bug.
In that case, I would simplify it to this:
SELECT AVG(konyha1) AS 'avg',
LEFT(time, 16) AS 'time'
FROM onemilliondata
GROUP BY LEFT(time, 16)
A DATE, TIME, or DATETIME can be treated as that datatype or as a VARCHAR; here I'm deliberately treating it as a string.
Even in this case, no index is useful. However, this would make it a little faster:
PRIMARY KEY(time)
and the table would have only 2 columns: time, konyha1.
It is rarely beneficial to break a date and/or time into components and put them into columns.
A million points will probably choke a graphing program -- and the screen, which has a resolution of only a few thousand pixels.
Perhaps you should group by hour? And use LEFT(time, 13)? Performance would probably be slightly faster -- but only because less data is being sent to the client.
If you are collecting this data "forever", consider building and maintaining a "summary table" of the averages for each unit of time. Then the incremental effort is, say, aggregating yesterday's data each morning.
You might find MIN(konyha1) and MAX(konyha1) interesting to keep on an hourly or daily basis. Note that daily or weekly aggregates can be derived from hourly values.
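A sketch of such a summary table, keyed by the hour string that LEFT(time, 13) produces, assuming time is a DATETIME and using illustrative names:

CREATE TABLE hourly_summary (
    hr      CHAR(13) NOT NULL,   -- 'YYYY-MM-DD HH', i.e. LEFT(time, 13)
    avg_val DOUBLE NOT NULL,
    min_val DOUBLE NOT NULL,
    max_val DOUBLE NOT NULL,
    n_rows  INT NOT NULL,
    PRIMARY KEY (hr)
);

-- Run each morning to fold in yesterday's rows
INSERT INTO hourly_summary (hr, avg_val, min_val, max_val, n_rows)
    SELECT LEFT(time, 13), AVG(konyha1), MIN(konyha1), MAX(konyha1), COUNT(*)
        FROM onemilliondata
        WHERE time >= CURDATE() - INTERVAL 1 DAY
          AND time <  CURDATE()
        GROUP BY LEFT(time, 13);

-- Daily values derived from the hourly rows (average weighted by row count)
SELECT LEFT(hr, 10) AS day,
       SUM(avg_val * n_rows) / SUM(n_rows) AS avg_konyha1,
       MIN(min_val) AS min_konyha1,
       MAX(max_val) AS max_konyha1
    FROM hourly_summary
    GROUP BY LEFT(hr, 10);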

Optimize SQL to get row count

I have a page on my site that keeps track of the number of people accessing it. On another page I display data containing information about the users that access this page, showing only about 10 at a time.
The problem is that I need to create pagination, so I need to know how much data is in my table at any time, and this makes the display page take some time to load: 2-3 seconds, sometimes 7-10, because I have millions of records. I am wondering how I can get this page to load faster.
SELECT COUNT(*) AS Count FROM visits
My first response is . . . if you are paging records 10 at a time, why do you need the total count of more than a million?
Second, counting a million rows should not take very long, unless your rows are wide (lots of columns or wide columns). If that is the case, then:
select count(id) from t;
can help, because it will explicitly use an index. Note that the first run may be slower than subsequent runs because of caching.
If you decide that you do need an exact row count, then your only real option for speeding it up using MySQL is to create triggers to maintain the count in another table. However, that will slow down inserts and deletions, which might not be a good idea.
The best answer is to say "About 1,234,000 visits", not the exact number. Then calculate it daily (or whatever).
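For example, a sketch using a hypothetical one-row cache table and MySQL's event scheduler (which has to be enabled with event_scheduler = ON):

CREATE TABLE visit_count_cache (
    id           TINYINT PRIMARY KEY,
    total        BIGINT NOT NULL,
    refreshed_at DATETIME NOT NULL
);

CREATE EVENT refresh_visit_count
    ON SCHEDULE EVERY 1 DAY
    DO REPLACE INTO visit_count_cache (id, total, refreshed_at)
       SELECT 1, COUNT(*), NOW() FROM visits;

-- The page reads the cached value and rounds it for display
SELECT CONCAT('About ', FORMAT(ROUND(total, -3), 0), ' visits') AS label
    FROM visit_count_cache
    WHERE id = 1;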
But if you must have the exact count, ...
If this table is "write only", then there is a solution. It involves treating it as a "Fact" table in a Data Warehouse. Then create and maintain a "Summary table" with a row for, say, each hour. Then the COUNT becomes:
SELECT SUM(hourly_count) FROM SummaryTable;
This will be much faster because there is much less to scan. However, there is a problem in that it does not include the count for the last (partial) hour. But that can be solved if you use INSERT ... ON DUPLICATE KEY UPDATE ... to increment the counter for the current hour or insert a new row with a "1".
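A sketch of that pattern; only the SUM query comes from the answer itself, the table definition and the upsert are assumed:

CREATE TABLE SummaryTable (
    hr           DATETIME NOT NULL,     -- start of the hour
    hourly_count INT UNSIGNED NOT NULL,
    PRIMARY KEY (hr)
);

-- Executed alongside every insert into visits (or from an AFTER INSERT trigger)
INSERT INTO SummaryTable (hr, hourly_count)
    VALUES (DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00'), 1)
    ON DUPLICATE KEY UPDATE hourly_count = hourly_count + 1;

-- Total count, including the current partial hour
SELECT SUM(hourly_count) FROM SummaryTable;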
But, before we take this too far, please inform us of how often a new 'visit' occurs.
You cannot make that query any faster without upgrading the server's hardware or adding more servers to run it in parallel. In the second case it would be better to move to a NoSQL database.
My approach would be to reduce the number of records. You could do that by keeping a smaller table where you record the access logs for the past hour/day, and after that time run a cron job that deletes the data or moves it to another table for long-term storage.
You usually do not need to know the exact number of rows for pagination.
SELECT COUNT(*) FROM
(SELECT 1 FROM visits LIMIT 10000) AS v
would tell you that there are at least 1000 pages. In most cases you do not need to know more.
You can store the total count somewhere and update it from time to time if you want a reasonable estimate. If you need an exact number, you can use a trigger to keep it current. The more up to date the info, the more expensive it is, of course.
Decide on a limit (let's say, the last 1000 records) from a practical (business-requirements) point of view. Have an auto_increment index (id) or a timestamp (createdon). Grab at most 1000 records:
select count(*) from (select id from visits order by id desc limit 1000) as v
or grab all 1000 and count/paginate on the client side (PHP), since even if you paginate in MySQL it will still go through those records:
select * from visits order by id desc limit 1000

Optimizing COUNT() on MariaDB for a Statistics Table

I've read a number of posts here and elsewhere about people wrestling to improve the performance of the MySQL/MariaDB COUNT function, but I haven't found a solution that quite fits what I am trying to do. I'm trying to produce a live updating list of read counts for a list of articles. Each time a visitor visits a page, a log table in the SQL database records the usual access log-type data (IP, browser, etc.). Of particular interest, I record the user's ID (uid) and I process the user agent tag to classify known spiders (uaType). The article itself is identified by the "paid" column. The goal is to produce a statistic that doesn't count the poster's own views of the page and doesn't include known spiders, either.
Here's the query I have:
"COUNT(*) FROM uninet_log WHERE paid='1942' AND uid != '1' AND uaType != 'Spider'"
This works nicely enough, but very slowly (approximately 1 sec.) when querying against a database with 4.2 million log entries. If I run the query several times during a particular run, it increases the runtime by about another second per query. I know I could group by paid and run a single query instead (which would require some reworking of my code, but could be done), but even then I feel like 1 second for the query is still really slow, and I'm worried about the implications when the server is under load.
I've tried switching out COUNT(*) for COUNT(1) or COUNT(id) but that doesn't seem to make a difference.
Does anyone have a suggestion on how I might create a better, faster query that would accomplish this same goal? I've thought about having a background process regularly calculate the statistics and cache them, but I'd love to stick to live updating information if possible.
Add a boolean "summarized" column to your statistics table and make it part of a multicolumn index with paid.
Then have a background process that produces/updates rows containing the read count in a summary table (by article) and marks the statistics table rows as summarized. (Though the summary table could just be your article table.)
Then your live query reports the sum of the already summarized results and the as-yet-unsummarized statistics rows.
This also allows you to expire old statistics table rows without losing your read counts.
(All this assumes you already have an index on paid; if you don't, definitely add one and that will likely solve your problem for now, though in the long run likely you still want to be able to delete old statistics records.)
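A sketch of that layout with hypothetical names; it assumes uninet_log has an auto-increment id and uses a separate article_read_counts table keyed by paid (as noted, this could just as well be a column on your article table):

ALTER TABLE uninet_log
    ADD COLUMN summarized TINYINT NOT NULL DEFAULT 0,
    ADD INDEX idx_paid_summarized (paid, summarized);

CREATE TABLE article_read_counts (
    paid       INT PRIMARY KEY,
    read_count INT UNSIGNED NOT NULL DEFAULT 0
);

-- Background job: pick a cutoff so rows inserted while it runs are not skipped,
-- fold the unsummarized rows into the summary, then mark them.
SET @cutoff := (SELECT MAX(id) FROM uninet_log);

INSERT INTO article_read_counts (paid, read_count)
    SELECT paid, COUNT(*)
        FROM uninet_log
        WHERE id <= @cutoff AND summarized = 0
          AND uid != '1' AND uaType != 'Spider'
        GROUP BY paid
    ON DUPLICATE KEY UPDATE read_count = read_count + VALUES(read_count);

UPDATE uninet_log SET summarized = 1 WHERE id <= @cutoff AND summarized = 0;

-- Live query: the already-summarized total plus the not-yet-summarized tail
SELECT a.read_count
       + (SELECT COUNT(*) FROM uninet_log
              WHERE paid = '1942' AND summarized = 0
                AND uid != '1' AND uaType != 'Spider') AS reads
    FROM article_read_counts a
    WHERE a.paid = '1942';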

Displaying statistics on a website from a large record set

I have 4 databases with tables containing lots of data. My requirement is to show the count of all the records in these tables when hovering over the corresponding div in the UI (it is an ASP.NET website). Please note the count may change every minute or hour, since new records can be added to or deleted from the tables by another application. The issue is that it takes a lot of time to get the count (since there is a lot of data), and every mouse-over makes a call to the corresponding database to take the count. Is there any better approach to implement this?
I am thinking of implementing something like the site below.
http://www.worldometers.info/world-population/
But to change the figures like that every second I would need a call to the database each time, right (to get the latest count)? Is there any better approach to showing statistics like this?
By the way, I am using MySQL.
You need to give more details - what table engines you are using, what your count query looks like, etc.
But assuming that you are using InnoDB and trying to run count(*) or count(primary_id_column), remember that InnoDB uses clustered primary keys, which are stored with the data pages of the rows themselves rather than in separate index pages, so the count will do a full scan of the rows.
One thing you can try is to create an additional, separate, non-primary index on a unique column (like the row's id) and make sure (using an EXPLAIN statement) that your count uses this index.
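For example (the table and index names here are hypothetical):

ALTER TABLE access_log ADD INDEX idx_access_log_id (id);

-- The "key" column of the plan should show the secondary index being used
EXPLAIN SELECT COUNT(id) FROM access_log;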
If this does not work for you, I would suggest creating a separate table (for example with the columns table_name, row_count) to store counters in, and creating insert and delete triggers on the tables you need to count records in, to increment or decrement those values. From my experience (we monitor the number of records daily and hourly, on tables with hundreds of millions of records and a heavy write load of ~150 inserts/sec) this is the best solution I have come up with so far.
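A sketch of that counter table and its triggers, with illustrative names (one pair of triggers per table you need to count):

CREATE TABLE row_counters (
    table_name VARCHAR(64) PRIMARY KEY,
    row_count  BIGINT NOT NULL DEFAULT 0
);

-- Seed once with the current count
INSERT INTO row_counters (table_name, row_count)
    SELECT 'access_log', COUNT(*) FROM access_log;

CREATE TRIGGER access_log_ai AFTER INSERT ON access_log
    FOR EACH ROW
    UPDATE row_counters SET row_count = row_count + 1 WHERE table_name = 'access_log';

CREATE TRIGGER access_log_ad AFTER DELETE ON access_log
    FOR EACH ROW
    UPDATE row_counters SET row_count = row_count - 1 WHERE table_name = 'access_log';

-- The mouse-over handler then reads a single indexed row
SELECT row_count FROM row_counters WHERE table_name = 'access_log';

Note that every insert and delete now also touches the counter row, so a very hot table will briefly serialize on it; that is the extra write cost mentioned in the other answers.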

MySQL query optimization for DISTINCT count

I have a query like this:
select count(distinct sessionKey) as tot from visits
But it takes too much time to execute: 48,512 ms. Within a few months the data in the table will be double the current amount. How can I optimize this query?
This is my table structure
Add an INDEX on your SessionKey column and it will improve performance:
ALTER TABLE visits ADD INDEX (SessionKey)
Like others suggested, adding an index would be the first and easiest thing to do.
If you have tons and tons of rows in there, going through all of them might still take some time anyway.
I once had a problem like this, where someone had coded a system in which users could vote on news entries. Every vote was saved as a single row in the database. On every web page there was a list of the "top voted" news. This basically meant a query that read the complete votes table, summed the votes up, and sorted by that sum. With entries in the multiple-100k range, this was taking some serious time. Someone before me "solved" it by trying to "cache" the results. This worked nicely most of the time, but if you cleared all the caches, the whole page was a mess for some hours until the caches were built up again. I then fixed it by not saving every vote in its own row, but just storing a running sum for every entry.
What I want to tell you with this: you could try either caching (but the result wouldn't be "live", of course), or change something in the database, like adding a field or table where you store the count you want to read and updating it on every insert into the visits table. This would create a little more load on insert, but getting that number would be super cheap.
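Since the figure here is a DISTINCT count, a plain "+1 per insert" counter is not enough; a sessionKey should only be counted the first time it appears. One way to do that (a sketch with assumed names and column types) is a small side table that holds each sessionKey once:

CREATE TABLE distinct_sessions (
    sessionKey VARCHAR(64) PRIMARY KEY    -- type assumed; match visits.sessionKey
);

-- Seed it from the existing data
INSERT IGNORE INTO distinct_sessions (sessionKey)
    SELECT DISTINCT sessionKey FROM visits;

-- Keep it current on every insert into visits
CREATE TRIGGER visits_ai AFTER INSERT ON visits
    FOR EACH ROW
    INSERT IGNORE INTO distinct_sessions (sessionKey) VALUES (NEW.sessionKey);

-- The statistic becomes a count over a much smaller, fully indexed table
SELECT COUNT(*) AS tot FROM distinct_sessions;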