Should totals be denormalized? - mysql

I am working on a website with a simple normalized database.
There is a table called Pages and a table called Views. Each time a Page is viewed, a unique record of that View is recorded in the Views table.
When displaying a Page on the site, I use a simple MySQL COUNT() to total up the number of Views for display.
Database design seems fine, except for this problem: I am at a loss for how to retrieve the top 10 most viewed pages among thousands.
Should I denormalize the Pages table by adding a Pages.views column to hold the total number of views for each page? Or is there an efficient way to query for the top 10 most viewed pages?

SELECT p.pageid, COUNT(*) AS viewcount
FROM pages p
INNER JOIN views v ON p.pageid = v.pageid
GROUP BY p.pageid
ORDER BY viewcount DESC
LIMIT 10 OFFSET 0;
I can't test this right now, but something along those lines should work. I would not store the total unless I had to for performance reasons (I just learned the term "premature optimization", and it seems to apply if you denormalize before measuring).

It depends on the level of detail you are trying to maintain. If you want to record who viewed the page and when, then the separate table is fine. Otherwise, a column for the view count is the way to go. Note that if you keep a separate column, the table will be locked more often, since every page view will try to update the column for its corresponding row.
SELECT pageid, COUNT(*) AS countCol
FROM views
GROUP BY pageid
ORDER BY countCol DESC
LIMIT 10 OFFSET 0;

Database normalization is all about the most efficient / least redundant way to store data. This is good for transaction processing, but often directly conflicts with the need to efficiently get the data out again. The problem is usually addressed by having derived tables (indexes, materialized views, rollup tables...) with more accessible, pre-processed data. The (slightly dated) buzzword here is Data Warehousing.
I think you want to keep your Pages table normalized, but have an extra table with the totals. Depending on how recent those counts need to be, you can update the table when you update the original table, or you can have a background job to periodically recalculate the totals.
You also want to do this only if you really run into a performance problem, which you will not unless you have a very large number of records, or a very large number of concurrent accesses. Keep your code flexible to be able to switch between having the table and not having it.
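The extra-totals-table idea can be sketched as follows; the table and column names here are assumptions, not part of the original schema:

```sql
-- Hypothetical rollup table, refreshed by a background/cron job.
CREATE TABLE page_view_counts (
    pageid    INT PRIMARY KEY,
    viewcount INT NOT NULL
);

-- Periodic recalculation; REPLACE overwrites stale rows with fresh totals.
REPLACE INTO page_view_counts (pageid, viewcount)
SELECT pageid, COUNT(*)
FROM views
GROUP BY pageid;
```

The top-10 query then becomes a trivial indexed read against page_view_counts, and the normalized views table stays the source of truth.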

I would probably include the views column in the Pages table.
It seems like a perfectly reasonable breaking of normalization to me. Especially since I can't imagine you deleting views so you wouldn't expect the count to get out of whack. Referential integrity doesn't seem super-critical in this case.

Denormalizing would definitely work in this case. Your loss is the extra storage room used up by the extra column.
Alternatively, you could set up a scheduled job to populate this column on a nightly basis, or whenever your traffic is low.
In that case you would lose the ability to instantly know your page counts unless you run the aggregation query manually.
Denormalization can definitely be employed to increase performance.
--Kris

While this is an old question, I'd like to add my answer because I find the accepted one to be misguided.
It is one thing to COUNT for a single selected row; it is quite another to sort ALL rows by their COUNT.
Even if you have just 1000 rows, each counted via a join, you can easily end up reading tens of thousands if not millions of rows.
It can be OK if you only run this occasionally, but it is very costly otherwise.
What you can do is add a TRIGGER that maintains a running total as rows are inserted. In terms of the question's pages/views tables:
CREATE TRIGGER ins AFTER INSERT ON views
FOR EACH ROW
UPDATE pages
SET views = views + 1
WHERE pageid = NEW.pageid;
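If views can also be deleted, a matching trigger keeps the denormalized total honest. This is a sketch in terms of the question's pages/views tables, assuming pages has the views counter column the question proposes:

```sql
-- Decrement the cached total whenever a view row is removed.
CREATE TRIGGER del AFTER DELETE ON views
FOR EACH ROW
UPDATE pages
SET views = views - 1
WHERE pageid = OLD.pageid;
```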

Related

Will record order change between two identical queries in MySQL without ORDER BY?

The problem is that I need to do pagination. I want to use ORDER BY and LIMIT, but my colleague told me MySQL will return records in the same order, and since this job doesn't care in which order the records are shown, we don't need the ORDER BY.
So I want to ask: is what he said correct? Assume, of course, that no records are updated or inserted between the two queries.
You don't show your query here, so I'm going to assume that it's something like the following (where ID is the primary key of the table):
select *
from TABLE
where ID >= :x:
limit 100
If this is the case, then with MySQL you will probably get rows in the same order every time. This is because the only predicate in the query involves the primary key, which is a clustered index for MySQL, so is usually the most efficient way to retrieve.
However, probably may not be good enough for you, and if your actual query is any more complex than this one, probably no longer applies. Even though you may think that nothing changes between queries (ie, no rows inserted or deleted), so you'll get the same optimization plan, that is not true.
For one thing, the block cache will have changed between queries, which may cause the optimizer to choose a different query plan. Or maybe not. But I wouldn't take the word of anyone other than one of the MySQL maintainers that it won't.
Bottom line: use an order by on whatever column(s) you're using to paginate. And if you're paginating by the primary key, that might actually improve your performance.
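Concretely, that means repeating the paging column in an explicit ORDER BY. A sketch, reusing the placeholder query from above (:x: stands for a bound parameter):

```sql
-- Deterministic paging: the predicate column is also the sort column,
-- so the clustered primary-key index satisfies both at once.
SELECT *
FROM my_table
WHERE id >= :x:
ORDER BY id
LIMIT 100;
```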
The key point here is that database engines need to handle potentially large datasets and need to care (a lot!) about performance. MySQL is never going to waste any resource (CPU cycles, memory, whatever) doing an operation that doesn't serve any purpose. Sorting result sets that aren't required to be sorted is a pretty good example of this.
When issuing a given query, MySQL will try hard to return the requested data as quickly as possible. When you insert a bunch of rows and then run a simple SELECT * FROM my_table query, you'll often see the rows come back in the same order as they were inserted. That makes sense because the obvious way to store the rows is to append them as they arrive, and the obvious way to read them back is from start to end. However, this simplistic scenario won't apply everywhere, every time:
Physical storage changes. You won't just be appending new rows at the end forever. You'll eventually update values and delete rows, and at some point freed disk space will be reused.
Most real-life queries aren't as simple as SELECT * FROM my_table. The query optimizer will try to leverage indexes, which can have a different order. Or it may decide that the fastest way to gather the required information is to perform internal sorts (typical for GROUP BY queries).
You mention paging. Indeed, I can think of ways to build a paginator that doesn't require sorted results. For instance, you could assign page numbers in advance and keep them in a hash map or dictionary: items within a page may appear in arbitrary locations, but paging will be consistent. This is of course pretty suboptimal: it's hard to code and requires constant updating as data mutates. ORDER BY is basically the easiest way. What you can't do is base your paginator on the assumption that SQL data sets are ordered sets, because they aren't; neither in theory nor in practice.
As an anecdote, I once used a major framework that implemented pagination using the ORDER BY and LIMIT clauses. (I won't say the name because it isn't relevant to the question... well, dammit, it was CakePHP 2.) It worked fine when sorting by ID. But it also allowed users to sort by arbitrary columns, which were often not unique, and I once found an item shown on two different pages: the framework was naively sorting by a single non-unique column, and that row made its way into both ORDER BY type LIMIT 10 and ORDER BY type LIMIT 10, 10, because both sortings complied with the requested condition.
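A cheap fix for that kind of bug is to append a unique column as a tie-breaker, which makes the ordering total so no row can straddle two pages. A sketch (table and column names assumed):

```sql
-- 'type' is not unique; adding the primary key makes the sort deterministic.
SELECT * FROM items ORDER BY type, id LIMIT 10;
SELECT * FROM items ORDER BY type, id LIMIT 10 OFFSET 10;
```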

Efficient way to get last n records from multiple tables sorted by date (user log of events)

I need to get the last n records from m tables (let's say m=n=~10 for now) ordered by date (supporting an offset would also be nice). This should show the user his last activity. These tables will contain mostly hundreds or thousands of records for that user.
How can I do that in the most efficient way? I'm using Doctrine 2, and these tables have no relation to each other.
I thought about some solutions but am not sure what's the best approach:
Create a separate table and put records there. Whenever a relevant change happens (any change the user makes inside the system that should be shown in the log table), it is also inserted into this table. This should be pretty fast, but it will be hard to manage and I don't want to use this approach yet.
Get the last n records from every table, then sort them (outside the DB) and limit to n. This seems pretty straightforward, but with more tables there is quite a high overhead: for 10 tables, 90% of the fetched records are thrown away, and with an offset it would be even worse. Also, this means m queries.
Create a single native query that gets the id and type of the last n items via a union of all tables, like SELECT id, date, type FROM ((SELECT a.id, a.date, 'a' AS type FROM a ORDER BY a.date DESC LIMIT 10) UNION ALL (SELECT b.id, b.date, 'b' AS type FROM b ORDER BY b.date DESC LIMIT 10)) t ORDER BY date DESC LIMIT 10, then run at most m queries to fetch those entities. This should be a bit better than 2., but requires a native query.
Is there any other way how to get these records?
Thank you
Option 1 is not hard to manage; it is just one additional insert for each insert you are already doing on the "action" tables.
You could also solve this with a trigger, I'd guess, so you wouldn't even have to implement it in the application code: https://stackoverflow.com/a/4754333/3595565
For option 2: wouldn't it be "get the last n records by a specific user" from each of those tables? I don't see a lot of problems with this approach, though I also think it is the least ideal way to handle things.
Option 3 would be like the 2nd option, but the database handles the sorting, which makes this approach a lot more viable.
Conclusion: (opinion based)
You should choose between options 1 and 3. The main questions should be
is it ok to store redundant data
is it ok to have logic outside of your application and inside of your database (trigger)
Using the logging table makes things pretty straightforward, but you will duplicate data.
If you are OK with using a trigger to fill the logging table, things will be simpler still, but it has its downside: it requires additional documentation etc. so nobody wonders "where is that data coming from?"
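The trigger-fed logging table could look something like this; all table and column names here are assumptions for illustration (the question names its tables only a, b, ...):

```sql
-- Unified activity log, one row per event across all "action" tables.
CREATE TABLE activity_log (
    id     INT AUTO_INCREMENT PRIMARY KEY,
    src    VARCHAR(16) NOT NULL,   -- which source table the event came from
    src_id INT NOT NULL,           -- the row's id in that source table
    date   DATETIME NOT NULL,
    INDEX idx_date (date)
);

-- One such trigger per source table (here: table a).
CREATE TRIGGER log_a AFTER INSERT ON a
FOR EACH ROW
INSERT INTO activity_log (src, src_id, date)
VALUES ('a', NEW.id, NEW.date);
```

The "last n events" query then becomes a single indexed SELECT ... ORDER BY date DESC LIMIT n against activity_log.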

Optimizing COUNT() on MariaDB for a Statistics Table

I've read a number of posts here and elsewhere about people wrestling to improve the performance of the MySQL/MariaDB COUNT function, but I haven't found a solution that quite fits what I am trying to do. I'm trying to produce a live-updating list of read counts for a list of articles. Each time a visitor visits a page, a log table in the SQL database records the usual access-log-type data (IP, browser, etc.). Of particular interest, I record the user's ID (uid) and I process the user-agent string to classify known spiders (uaType). The article itself is identified by the "paid" column. The goal is to produce a statistic that doesn't count the poster's own views of the page and doesn't include known spiders either.
Here's the query I have:
SELECT COUNT(*) FROM uninet_log WHERE paid = '1942' AND uid != '1' AND uaType != 'Spider'
This works nicely enough, but very slowly (approximately 1 sec.) when querying against a database with 4.2 million log entries. If I run the query several times during a particular run, it increases the runtime by about another second for each query. I know I could group by paid and then run a single query, but even then (which would require some reworking of my code, but could be done) I feel like 1 second for the query is still really slow and I'm worried about the implications when the server is under a load.
I've tried switching out COUNT(*) for COUNT(1) or COUNT(id) but that doesn't seem to make a difference.
Does anyone have a suggestion on how I might create a better, faster query that would accomplish this same goal? I've thought about having a background process regularly calculate the statistics and cache them, but I'd love to stick to live updating information if possible.
Thanks,
Tim
Add a boolean "summarized" column to your statistics table and make it part of a multicolumn index with paid.
Then have a background process that produces/updates rows containing the read count in a summary table (by article) and marks the statistics table rows as summarized. (Though the summary table could just be your article table.)
Then your live query reports the sum of the already summarized results and the as-yet-unsummarized statistics rows.
This also allows you to expire old statistics table rows without losing your read counts.
(All this assumes you already have an index on paid; if you don't, definitely add one and that will likely solve your problem for now, though in the long run likely you still want to be able to delete old statistics records.)
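A sketch of the summarize-and-merge idea above; the summary table name (article_counts) and its readcount column are assumptions, and in production the background step should run in one transaction (or mark rows by an id watermark) so no rows slip in between the two statements:

```sql
-- One-time setup: flag column plus the multicolumn index described above.
ALTER TABLE uninet_log
    ADD COLUMN summarized BOOLEAN NOT NULL DEFAULT FALSE,
    ADD INDEX idx_paid_summarized (paid, summarized);

-- Background job: fold unsummarized rows into the per-article totals.
UPDATE article_counts ac
JOIN (SELECT paid, COUNT(*) AS c
      FROM uninet_log
      WHERE summarized = FALSE AND uid != '1' AND uaType != 'Spider'
      GROUP BY paid) t ON t.paid = ac.paid
SET ac.readcount = ac.readcount + t.c;
UPDATE uninet_log SET summarized = TRUE WHERE summarized = FALSE;

-- Live query: the precomputed total plus the small unsummarized tail.
SELECT ac.readcount
     + (SELECT COUNT(*) FROM uninet_log
        WHERE paid = '1942' AND summarized = FALSE
          AND uid != '1' AND uaType != 'Spider') AS reads
FROM article_counts ac
WHERE ac.paid = '1942';
```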

How to speed up InnoDB count(*) query?

There are a number of similar questions on here, but a lot of the answers say to force the use of an index and that doesn't seem to speed anything up for me.
I am wanting to show a "live" counter on my website showing the number of rows in a table. Kind of like how some websites show the number of registered users, or some other statistic, in "real time" (i.e. updated frequently using ajax or websockets).
My table has about 5M rows. It's growing fairly quickly and there is a high volume of inserts and deletes on it. Running
select count(*) from my_table
takes 1.367 seconds, which is unacceptable, because I need my application to get the new row count about once per second.
I tried what many of the answers on here suggest and changed the query to:
select count(*) from my_table use index(my_index)
where my_index is a normal BTREE index on a BIGINT field. But the time actually increased, to 1.414 seconds.
Why doesn't using an index speed up the query as many answers on here said it would?
Another option some answers suggest is to put a trigger on the table that increments a column in another table. So I could create a stats table and whenever a row is inserted or deleted in my_table have a trigger increment or decrement a column in the stats table. Is this the only other option, since using an index doesn't seem to work?
EDIT: Here's a perfect example of the type of thing I'm trying to accomplish: https://www.freelancer.com. Scroll to the bottom of the page and you'll see a strip of site statistics. Those numbers update every second or so.
It takes time to read 5 million records and count them -- whether in an index or in the raw data form.
If a "fast-and-dirty" solution is acceptable, you can use metadata:
SELECT table_rows
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = <whatever> and TABLE_NAME = <whatever2>;
Note that this can get out-of-sync.
Another possibility is to partition the table into smaller chunks. One advantage is that if the inserts and deletes tend to be to one partition, you can just count that and use metadata for the other partitions.
A trigger may or may not help in this situation, depending on the insert/delete load. If you are doing multiple inserts per minute, then a trigger is a no-brainer -- a fine solution. If you are doing dozens or hundreds of changes per second, then the overhead of the trigger might slow down the server.
If your system is so busy that the counting is having too much impact, then probably the INSERTing/DELETEing is also having impact. One way to improve INSERT/DELETE is to do them in 'batches' instead of one at a time.
Gather the INSERTs, preferably in the app, but optionally in a 'staging' table. Then, once a second (or whatever) copy them into the real table using an INSERT..SELECT, or (if needed) INSERT..ON DUPLICATE KEY UPDATE. DELETEs can go into the same table (with a flag) or a separate table.
The COUNT(*) can be done at the end of the batch. Or it could be dead reckoned (at much lower cost) by knowing what the count was, then adjusting by what the staging table(s) will change it by.
This is a major upheaval to your app code, so don't embark on it unless you have spikes of, say, >100 INSERTs/DELETEs per second. (A steady 100 INSERTs/sec is about 3 billion rows per year.)
For more details on "staging table", see http://mysql.rjweb.org/doc.php/staging_table Note that that blog advocates flip-flopping between a pair of staging tables, so as to minimize locks on them and to allow multiple clients to coexist.
Have a job running in the background that does the following; then use its table for getting the count:
Loop:
INSERT INTO Counter (ct_my_table)
SELECT COUNT(*) FROM my_table;
sleep 1 second
end loop
At worst, it will be a couple of seconds out of date. Also note that INSERTs and DELETEs interfere with (read: slow down) the SELECT COUNT(*), hence the "sleep".
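If you'd rather not run an external job, MySQL's event scheduler can host the same loop. A sketch (the Counter table layout is an assumption; here a single row is kept up to date instead of inserting a new one each pass, and enabling the scheduler needs the appropriate privilege):

```sql
-- Single-row table holding the cached count.
CREATE TABLE Counter (ct_my_table BIGINT NOT NULL);
INSERT INTO Counter VALUES (0);

SET GLOBAL event_scheduler = ON;

-- Refresh the cached count once per second, server-side.
CREATE EVENT refresh_count
ON SCHEDULE EVERY 1 SECOND
DO
  UPDATE Counter SET ct_my_table = (SELECT COUNT(*) FROM my_table);
```

The application then reads SELECT ct_my_table FROM Counter, which is effectively instant.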
Have you noticed that some UIs say "About 120,000 thingies"? They are using even cruder estimations. But it is usually good enough for the users.
Take the inaccurate value from information_schema, as Gordon Linoff suggested
Another inaccurate source of the row count is SELECT MAX(id) - MIN(id)
Create a table my_table_count where you store the row count of my_table and update it with triggers
In many cases you don't need an accurate value. Who cares if you show 36,400 users instead of the accurate 36,454?

Best practice for keeping counter stats?

In an effort to add statistics and tracking for users on my site, I've been thinking about the best way to keep counters of pageviews and other very frequently occurring events. Now, my site obviously isn't the size of Facebook to warrant some of the strategies they've implemented (sharding isn't even necessary, for example), but I'd like to avoid any blatantly stupid blunders.
It seems the easiest way to keep track is to just have an integer column in the table. For example, each page has a pageview column that just gets incremented by 1 for each pageview. This seems like it might be an issue if people hit the page faster than the database can write.
If two people hit the page at the same time, for example, then the previous_pageview count could be read as the same value by both requests, and each update would set it to previous_pageview+1 rather than +2. However, assuming a database write takes 10ms (which is really high, I believe), you'd need on the order of a hundred pageviews per second, or millions of pageviews per day, before this became a problem.
Is it okay, then, for me to be just incrementing a column? The exact number isn't too important, so some error here and there is tolerable. Does an update statement on one column slow down if there are many columns for the same row? (My guess is no.)
I had this plan to use a separate No-SQL database to store pk_[stat]->value pairs for each stat, incrementing those rapidly, and then running a cron job to periodically update the MySQL values. This feels like overkill; somebody please reassure me that it is.
UPDATE foo SET counter = counter + 1 is atomic. It will work as expected even if two people hit at the same time.
It's also common to throw view counts into a secondary table, and then update the actual counts nightly (or at some interval).
INSERT INTO page_view (page_id) VALUES (1);
...
UPDATE page SET views = views + new_views WHERE id = 1;
This should be a little faster than X = X + 1, but requires a bit more work.
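The nightly rollup step can be sketched like this; the transaction is there so the counted rows and the deleted rows are the same set (in a busier system you'd delete only up to an id watermark recorded before the UPDATE, rather than everything):

```sql
-- Fold the accumulated page_view rows into page.views, then clear them.
START TRANSACTION;

UPDATE page p
JOIN (SELECT page_id, COUNT(*) AS new_views
      FROM page_view
      GROUP BY page_id) v ON v.page_id = p.id
SET p.views = p.views + v.new_views;

DELETE FROM page_view;

COMMIT;
```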