MySQL Activity UNION Query

Right to business.
I have an activity feed which gathers all the different kinds of activity from different parts of my site, sorts them using UNION and an ORDER BY, applies a LIMIT to get the top 25, and then displays the activity.
My fellow programmer says that we will run into problems when we have more rows (currently we have 800 and it's fine).
So the question.
Will the UNION cause slow down later down the line?
If so, should we:
a) Try and put the activity into a new table and then query that.
b) Try some sort of view? (If so, could anyone explain how? I'm not too sure how!)
c) Other...
Thanks for your help.
Richard

Why not just limit each of the individual queries to 25 as well? That way you could restrict the number of rows that come back before being unioned and limited for the final list.
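As a rough sketch of that idea (the table and column names here are made up for illustration, e.g. comments and posts tables with an added_date column):
SELECT *
FROM (
    (SELECT id, 'comment' AS type, added_date FROM comments ORDER BY added_date DESC LIMIT 25)
    UNION ALL
    (SELECT id, 'post' AS type, added_date FROM posts ORDER BY added_date DESC LIMIT 25)
) AS recent      -- the derived table needs an alias in MySQL
ORDER BY added_date DESC
LIMIT 25;
Each branch contributes at most 25 rows, so the final sort only ever touches 25 rows per unioned table instead of the whole tables.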

This is a tricky one as it depends on a lot of variables, such as how busy your tables are etc.
I would imagine the union will slow you down later, as the method you are using is basically doing a union on entire tables, so these need to be read into memory before the ordering and the limiting of the number of rows is applied. Your query will get slower as the amount of data increases.
If all the data in the tables is important, then the best you can do is try to ensure you index the tables as well as you can, so at least your ordering runs fast. If some of the data in the tables gets old or stale and you aren't too interested in it, then you might have scope to read just the rows you need into a temp table. This can then be ordered etc.
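For the temp-table variant, a minimal sketch (assuming a hypothetical comments table with an added_date column and an arbitrary 30-day window):
CREATE TEMPORARY TABLE recent_activity (INDEX(added_date))
SELECT id, 'comment' AS type, added_date
FROM comments
WHERE added_date >= NOW() - INTERVAL 30 DAY;   -- only pull rows you still care about

-- repeat for each other activity table, e.g. posts:
INSERT INTO recent_activity
SELECT id, 'post' AS type, added_date
FROM posts
WHERE added_date >= NOW() - INTERVAL 30 DAY;

SELECT * FROM recent_activity ORDER BY added_date DESC LIMIT 25;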

I think the best way might be to do a combination of the two, so:
a) yes, index
b) then do the LIMIT 25 on each of the subqueries
c) do a WHERE added_date >= date on each of the queries so that we have the correct date order
Mmm, but then that presents problems as to what dates to go for. So if we go to page x, which dates do I get?
This is turning into a problem and quite a big one. The size of data we have is going to be quite big.
Thanks for your help.
Richard

Decided just to make a log and do it that way.
Thanks as always for your help.
Richard

Related

Efficient way to get last n records from multiple tables sorted by date (user log of events)

I need to get the last n records from m tables (let's say m = n = ~10 for now) ordered by date (supporting an offset would also be nice). This should show the user their last activity. These tables will contain mostly hundreds or thousands of records for that user.
How can I do that in the most efficient way? I'm using Doctrine 2 and these tables have no relation to each other.
I thought about some solutions, but I'm not sure what the best approach is:
Create a separate table and put records there. If any change happens (if the user does anything inside the system that should be shown in the log table) it will be inserted into this table. This should be pretty fast, but it will be hard to manage and I don't want to use this approach yet.
Get the last n records from every table and then sort them (outside the DB) and limit to n. This seems pretty straightforward, but with more tables there will be quite a high overhead. For 10 tables, 90% of the records will be thrown away. If an offset is used, it would be even worse. Also, this means m queries.
Create a single native query and get the id and type of the last n items doing a union of all tables, like SELECT id, date, type FROM ((SELECT a.id, a.date, 'a' AS type FROM a ORDER BY a.date DESC LIMIT 10) UNION (SELECT b.id, b.date, 'b' AS type FROM b ORDER BY b.date DESC LIMIT 10)) AS t ORDER BY date DESC LIMIT 10. Then create at most m queries getting these entities. This should be a bit better than option 2, but requires a native query.
Is there any other way how to get these records?
Thank you
Option 1 is not hard to manage; it is just an additional insert for each insert you are doing on the "action" tables.
You could also solve this by using a trigger, I'd guess, so you wouldn't even have to implement it in the application code. https://stackoverflow.com/a/4754333/3595565
Wouldn't it be "get the last n records by a specific user from each of those tables"? So I don't see a lot of problems with this approach, though I also think it is the least ideal way to handle things.
Option 3 would be like the 2nd option, but the database handles the sorting, which makes this approach a lot more viable.
Conclusion: (opinion based)
You should choose between options 1 and 3. The main questions should be
is it ok to store redundant data
is it ok to have logic outside of your application and inside of your database (trigger)
Using the logging table would make things pretty straightforward. But you will duplicate data.
If you are ok with using a trigger to fill the logging table, things will be simpler, but it has its downside, as it requires additional documentation etc. so nobody wonders "where is that data coming from?"
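To make the trigger idea concrete, here is a minimal sketch; activity_log and the orders table are made-up names standing in for whatever action tables the application actually has:
CREATE TABLE activity_log (
    id INT AUTO_INCREMENT PRIMARY KEY,
    user_id INT NOT NULL,
    source VARCHAR(20) NOT NULL,      -- which action table the entry came from
    source_id INT NOT NULL,           -- id of the row in that table
    created_at DATETIME NOT NULL,
    INDEX (user_id, created_at)
);

-- one trigger per action table keeps the log filled without application code
CREATE TRIGGER orders_to_log AFTER INSERT ON orders
FOR EACH ROW
INSERT INTO activity_log (user_id, source, source_id, created_at)
VALUES (NEW.user_id, 'orders', NEW.id, NOW());
Getting the last n actions (with an offset) then becomes a single indexed query against activity_log, at the cost of the duplicated data mentioned above.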

SQL Optimization: how to JOIN a table with itself

I'm trying to optimize a SQL query and I am not sure if further optimization is possible.
Here's my query:
SELECT someColumns
FROM (((smaller_table))) AS a
INNER JOIN (((smaller_table))) AS b
ON a.someColumnProperty = b.someColumnProperty
...the problem with this approach is that my table has half a trillion records in it. In my query, you'll notice (((smaller_table))). I wrote that as an abbreviation for a SELECT statement being run on MY_VERY_LARGE_TABLE to reduce its size.
(((smaller_table))) appears twice, and the code within is exactly the same both times. There's no reason for me to run the same sub-query twice. This table is several TB and I shouldn't scan through it twice just to get the same results.
Do you have any suggestions on how I can NOT run the exact same reduction twice? I tried replacing the INNER JOIN line with INNER JOIN a AS b but got an "unrecognized table a" warning. Is there any way to store the value of a so I can reuse it?
Thoughts:
Make sure there is an index on userid and dayid.
I would ask you to define better what it is you are trying to find out.
Examples:
What is the busiest time of the week?
Who are the top 25 people who come to the gym the most often?
Who are the top 25 people who utilize the gym the most? (This is different from the one above because maybe I have a user that comes 5 times a month but stays 5 hours per session vs a user that comes 30 times a month and stays .5 hour per session.)
Maybe doing all days in a horizontal layout (day1, day2, day3) would be better visually to try to find out what you are looking for. You could easily put this into Excel or LibreOffice and color the days that are populated to get a visual "picture" of people who come consecutively.
It might be interesting to run this for multiple months to see what the seasonality looks like.
Alas CTE is not available in MySQL. The ~equivalent is
CREATE TABLE tmp (
INDEX(someColumnProperty)
)
SELECT ...;
But...
You can't use CREATE TEMPORARY TABLE because such can't be used twice in the same query. (No, I don't know why.)
Adding the INDEX (or PK or ...) during the CREATE (or afterwards) provides the very necessary key for doing the self join.
You still need to worry about DROPping the table (or otherwise dealing with it).
The choice of ENGINE for tmp depends on a number of factors. If you are sure it will be "small" and has no TEXT/BLOB, then MEMORY may be optimal.
In a Replication topology, there are additional considerations.
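Putting that together, a sketch might look like the following; someColumnProperty comes from the question, while the column list and the WHERE reduction (userid/dayid) are stand-ins for whatever the real (((smaller_table))) subquery does:
CREATE TABLE tmp (
    INDEX(someColumnProperty)        -- the key needed for the self join
) ENGINE=MEMORY                      -- only if the result is small and has no TEXT/BLOB
SELECT someColumnProperty, userid, dayid
FROM MY_VERY_LARGE_TABLE
WHERE dayid BETWEEN 20160101 AND 20160131;   -- hypothetical reduction, scanned only once

SELECT a.userid, b.userid
FROM tmp AS a
INNER JOIN tmp AS b ON a.someColumnProperty = b.someColumnProperty;

DROP TABLE tmp;                      -- remember to drop it (or otherwise deal with it)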

I need to speed up specific mysql query on large table

Hi, I know there are a lot of topics dedicated to query optimizing strategies, but this one is so specific I couldn't find the answer anywhere on the internet.
I have a large table of products in an eshop (approx. 180k rows) and the table has 65 columns. Yeah yeah, I know it's quite a lot, but I store there information about books, DVDs, Blu-rays and games.
Still, I am not including many of the columns in the query, but the select is still quite tricky. There are many conditions that need to be considered and compared. Query below:
SELECT *
FROM products
WHERE production = 1
AND publish_on < '2012-10-23 11:10:06'
AND publish_off > '2012-10-23 11:10:06'
AND price_vat > '0.5'
AND ean <> ''
AND publisher LIKE '%Johnny Cash%'
ORDER BY bought DESC, datec DESC, quantity_storage1 DESC, quantity_storage2 DESC, quantity_storage3 DESC
LIMIT 0, 20
I have already tried to put there indexes one by one on cols in where clause and even in order by clause, then I tried to create compound index on (production, publish_on, publish_off, price_vat, ean).
The query is still slow (a couple of seconds) and it needs to be fast, since it's an eshop solution and people are leaving because they are not getting their results fast. And I am still not counting the time I need to count all the found rows so I can do paging.
I mean, the best way to make it quick is to simplify the query, but all the conditions and sorting are a must in this case.
Can anyone help with this kind of issue? Is it even possible to speed this kind of query up, or is there any other way, for example simplifying the query and leaving the rest to the PHP engine to sort the results?
Oh, I am really clueless on this. Share your wisdom, people, please...
Many thanks in advance
First of all, be sure what you want to select and get rid of the '*'. Replace
Select * from
with something more specific
Select id, name, ....
There is no JOIN or anything else in your query, so the speed-up options are quite small, I think.
Check that your mysql Server can use enough memory.
Have a look at these configs in your my.cnf:
key_buffer_size = 384M
myisam_sort_buffer_size = 64M
thread_cache_size = 8
query_cache_size = 64M
Have a look at the max allowed concurrency. MySQL recommends CPUs * 2:
thread_concurrency = 4
You should really think about splitting the table depending on the information you use and on standard normalization, if possible.
If it's a production system with no way to split the tables, then think about a caching server. But this will only help if you have a lot of recurring queries that are the same.
This is what I would do when knowing nothing about the underlying implementation or the system at all.
Edit:
Adding indexes on as many columns as you can won't necessarily speed up your system. More indexes ≠ more speed.
Thanks to all of you for the good remarks.
I probably found the solution, because I was able to reduce query time from 2.8s down to 0.3 sec.
SOLUTION:
Using SELECT * is really naive on large tables (65 cols), so I realized I only need 25 of them on the page - the others can easily be used on the product page itself.
I also reindexed my table a little bit. I created a compound index on
production, publish_on, publish_off, price_vat, ean
then I created another one specifically for search, including the cols
title, publisher, author
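In DDL form, those two compound indexes could be created roughly like this (the index names are invented; the table name products is taken from the query above):
ALTER TABLE products
    ADD INDEX idx_filter (production, publish_on, publish_off, price_vat, ean),
    ADD INDEX idx_search (title, publisher, author);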
The last thing I did was to use a query like
SELECT SQL_CALC_FOUND_ROWS ID, title, alias, url, type, preorder, subdescription,....
which allowed me to calculate the number of matched rows more quickly using
mysql_result(mysql_query("SELECT FOUND_ROWS()"), 0)
after mysql_query()... However, I cannot understand how it could be quicker, because EXPLAIN EXTENDED says the query is not using any index; it's still 0.5s quicker than calculating the number of rows in a separate query.
It seems to be working rather fine. If the ORDER BY clause wasn't there it would be evil quick, but that's something I have no influence on.
Still need to check my server settings...
Thank y'all for all your help..

MySQL: Does a query search through the whole table?

1. So if I search for the word ball inside the toys table, where I have 5,000,000 entries, does it search through all 5 million?
I think the answer is yes, because how else would it know, but please let me know.
2. If yes: if I need more information from that table, isn't it more logical to query just once and work with the results?
An example
I have this table structure for example:
id | toy_name | state
Now I should query like this
mysql_query("SELECT * FROM toys WHERE STATE = 1");
But isn't it more logical to query the whole table
mysql_query("SELECT * FROM toys"); and then do this if($query['state'] == 1)?
3. And something else: if I put an ORDER BY id LIMIT 5 in the mysql_query, will it search through the 5 million entries or just the last 5?
Thanks for the answers.
Yes, unless you have a LIMIT clause it will look through all the rows. It will do a table scan unless it can use an index.
You should use a query with a WHERE clause here, not filter the results in PHP. Your RDBMS is designed to be able to do this kind of thing efficiently. Only when you need to do complex processing of the data is it more appropriate to load a resultset into PHP and do it there.
With the LIMIT 5, the RDBMS will look through the table until it has found you your five rows, and then it will stop looking. So, all I can say for sure is, it will look at between 5 and 5 million rows!
Read this about indexes :-)
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
It makes it uber-fast :-)
A full table scan happens only if there are no matching indexes, and it is indeed a very slow operation.
Sorting is also accelerated by indexes.
And for #2 - this is slow because the transfer rate from MySQL -> PHP is slow, and MySQL is MUCH faster at doing the filtering.
For your #1 question: Depends on how you're searching for 'ball'. If there's no index on the column(s) where you're searching, then the entire table has to be read. If there is an index, then...
WHERE field LIKE 'ball%' will use an index
WHERE field LIKE '%ball%' will NOT use an index
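A quick way to see the difference yourself, using the toys table and toy_name column from the question (the index name is invented):
CREATE INDEX idx_toy_name ON toys (toy_name);

EXPLAIN SELECT * FROM toys WHERE toy_name LIKE 'ball%';   -- can use idx_toy_name
EXPLAIN SELECT * FROM toys WHERE toy_name LIKE '%ball%';  -- cannot; falls back to a full scan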
For your #2, think of it this way: Doing SELECT * FROM table and then perusing the results in your application is exactly the same as going to the local super walmart, loading the store's complete inventory into your car, driving it home, picking through every box/package, and throwing out everything except the pack of gum from the impulse buy rack by the front till that you'd wanted in the first place. The whole point of a database is to make it easy to search for data and filter by any kind of clause you could think of. By slurping everything across to your application and doing the filtering there, you've reduced that shiny database to a very expensive disk interface, and would probably be better off storing things in flat files. That's why there's WHERE clauses. "SELECT something FROM store WHERE type=pack_of_gum" gets you just the gum, and doesn't force you to truck home a few thousand bottles of shampoo and bags of kitty litter.
For your #3, yes. If you have an ORDER BY clause in a LIMIT query, the result set has to be sorted before the database can figure out what those 5 records should be. While it's not quite as bad as actually transferring the entire record set to your app and only picking out the first five records, it still involves a bit more work than just retrieving the first 5 records that match your WHERE clause.

Should totals be denormalized?

I am working on a website with a simple normalized database.
There is a table called Pages and a table called Views. Each time a Page is viewed, a unique record of that View is recorded in the Views table.
When displaying a Page on the site, I use a simple MySQL COUNT() to total up the number of Views for display.
Database design seems fine, except for this problem: I am at a loss for how to retrieve the top 10 most viewed pages among thousands.
Should I denormalize the Pages table by adding a Pages.views column to hold the total number of views for each page? Or is there an efficient way to query for the top 10 most viewed pages?
SELECT p.pageid, count(*) as viewcount FROM
pages p
inner join views v on p.pageid = v.pageid
group by p.pageid
order by count(*) desc
LIMIT 10 OFFSET 0;
I can't test this, but something along those lines. I would not store the value unless I have to due to performance constraints (I just learned the term "premature optimization", and it seems to apply if you do).
It depends on the level of information you are trying to maintain. If you want to record who viewed and when, then the separate table is fine. Otherwise, a column for views is the way to go. Also, if you keep a separate column, you'll find that the table will be locked more often, since each page view will try to update the column for its corresponding row.
Select pageid, Count(*) as countCol from Views
group by pageid order by countCol DESC
LIMIT 10 OFFSET 0;
Database normalization is all about the most efficient / least redundant way to store data. This is good for transaction processing, but often directly conflicts with the need to efficiently get the data out again. The problem is usually addressed by having derived tables (indexes, materialized views, rollup tables...) with more accessible, pre-processed data. The (slightly dated) buzzword here is Data Warehousing.
I think you want to keep your Pages table normalized, but have an extra table with the totals. Depending on how recent those counts need to be, you can update the table when you update the original table, or you can have a background job to periodically recalculate the totals.
You also want to do this only if you really run into a performance problem, which you will not unless you have a very large number of records, or a very large number of concurrent accesses. Keep your code flexible to be able to switch between having the table and not having it.
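For example, such a background job could maintain a separate totals table along these lines (page_counts is a made-up name; pages and views are the tables from the question):
CREATE TABLE page_counts (
    pageid INT PRIMARY KEY,
    viewcount INT NOT NULL,
    INDEX (viewcount)
);

-- run periodically (e.g. from cron) to recalculate the totals
REPLACE INTO page_counts (pageid, viewcount)
SELECT pageid, COUNT(*) FROM views GROUP BY pageid;

-- the top 10 is then a cheap read from the small rollup table
SELECT pageid, viewcount FROM page_counts ORDER BY viewcount DESC LIMIT 10;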
I would probably include the views column in the Pages table.
It seems like a perfectly reasonable breaking of normalization to me. Especially since I can't imagine you deleting views so you wouldn't expect the count to get out of whack. Referential integrity doesn't seem super-critical in this case.
Denormalizing would definitely work in this case. Your loss is the extra storage room used up by the extra column.
Alternatively you could set up a scheduled job to populate this information on a nightly basis, whenever your traffic is low, or every x period of time.
In this case you would be losing the ability to instantly know your page counts unless you run this query manually.
Denormalization can definitely be employed to increase performance.
--Kris
While this is an old question, I'd like to add my answer because I find the accepted one to be misguided.
It is one thing to have a COUNT for a single selected row; it is quite another to sort ALL rows by their COUNT.
Even if you have just 1000 rows, each counted with some join, you can easily end up reading tens of thousands if not millions of rows.
It can be ok if you only call this occasionally, but it is very costly otherwise.
What you can do is to add a TRIGGER:
CREATE TRIGGER ins AFTER INSERT ON table1 FOR EACH ROW
UPDATE table2
SET count = count + 1
WHERE CONDITION
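Applied to this question's schema, that trigger might look like the following sketch (assuming views has a pageid column and pages carries the denormalized views counter discussed above):
CREATE TRIGGER views_after_insert AFTER INSERT ON views
FOR EACH ROW
UPDATE pages
SET views = views + 1
WHERE pages.pageid = NEW.pageid;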