I'm working on a query for a news site that finds FeaturedContent for display on the main homepage. Content of this kind is tagged 'FeaturedContent', and its homepage ordering is stored in a featured table under the feature type 'homepage'. I currently get the desired output, but the query runs in over 3 seconds, which I need to cut down. How does one optimize a query like the one that follows?
EDIT: Materialized the view every minute as suggested, down to .4 seconds:
SELECT f.position, s.item_id, s.item_type, s.title, s.caption, s.date
FROM live.search_all s
INNER JOIN live.tags t
ON s.item_id = t.item_id AND s.item_type = t.item_type AND t.tag = 'FeaturedContent'
LEFT OUTER JOIN live.featured f
ON s.item_id = f.item_id AND s.item_type = f.item_type AND f.feature_type = 'homepage'
ORDER BY position IS NULL, position ASC, date
This returns all the homepage features in order, followed by other featured content ordered by date.
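For anyone unfamiliar with the `ORDER BY position IS NULL, position, date` trick, here is a minimal runnable sketch using SQLite in Python as a stand-in for MySQL; the table, columns, and data are illustrative, not the real schema.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content (title TEXT, position INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO content VALUES (?, ?, ?)",
    [
        ("old story", None, "2010-01-01"),
        ("second feature", 2, "2010-03-01"),
        ("first feature", 1, "2010-02-01"),
        ("new story", None, "2010-04-01"),
    ],
)
# "position IS NULL" evaluates to 0 for featured rows and 1 for the rest,
# so features sort first (by position), followed by everything else by date.
rows = conn.execute(
    "SELECT title FROM content ORDER BY position IS NULL, position, date"
).fetchall()
print([r[0] for r in rows])
# → ['first feature', 'second feature', 'old story', 'new story']
```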
The explain looks like this:
id | select_type | table | type   | possible_keys        | key       | key_len | ref                                       | rows | Extra
1  | SIMPLE      | t2    | ref    | PRIMARY,tag_index    | tag_index | 303     | const                                     | 2    | Using where; Using index; Using temporary; Using filesort
1  | SIMPLE      | t     | ref    | PRIMARY              | PRIMARY   | 4       | newswires.t2.id                           | 1974 | Using index
1  | SIMPLE      | s     | eq_ref | PRIMARY,search_index | PRIMARY   | 124     | newswires.t.item_id,newswires.t.item_type | 1    |
1  | SIMPLE      | f     | index  | NULL                 | PRIMARY   | 190     | NULL                                      | 13   | Using index
And the Profile is as follows:
Status               | Time
starting             | 0.000091
Opening tables       | 0.000756
System lock          | 0.000005
Table lock           | 0.000008
init                 | 0.000004
checking permissions | 0.000001
checking permissions | 0.000001
checking permissions | 0.000043
optimizing           | 0.000019
statistics           | 0.000127
preparing            | 0.000023
Creating tmp table   | 0.001802
executing            | 0.000001
Copying to tmp table | 0.311445
Sorting result       | 0.014819
Sending data         | 0.000227
end                  | 0.000002
removing tmp table   | 0.002010
end                  | 0.000005
query end            | 0.000001
freeing items        | 0.000296
logging slow query   | 0.000001
cleaning up          | 0.000007
I'm new to reading the EXPLAIN output, so I'm unsure if I have a better ordering available, or anything rather simple that could be done to speed these along.
The search_all table is the materialized view table which is periodically updated, while the tags and featured tables are views. These views are not optional, and cannot be worked around.
The tags view combines tags and a relational table to get back a listing of tags according to item_type and item_id, but the other views are all simple views of one table.
EDIT: With the materialized view, the biggest bottleneck seems to be the 'Copying to tmp table' step. Without ordering the output it takes 0.0025 seconds (much better!), but the final output does need to be ordered. Is there any way to improve the performance of that step, or work around it?
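One way to work around the filesort, sketched below with SQLite in Python as a stand-in for MySQL: since the feed is already refreshed periodically, the refresh job can also precompute a sort key (NULL positions pushed last) into an indexed cache table, so reads become a plain indexed scan with no temporary table. The table and column names here are illustrative assumptions, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE content (item_id INTEGER, title TEXT, position INTEGER, date TEXT);
-- Cache table rebuilt by the cron job; sort_rank precomputes "NULLs last".
CREATE TABLE homepage_feed (item_id INTEGER, title TEXT, sort_rank INTEGER, date TEXT);
CREATE INDEX feed_order ON homepage_feed (sort_rank, date);
""")
conn.executemany("INSERT INTO content VALUES (?,?,?,?)", [
    (1, "feature", 1, "2010-02-01"),
    (2, "story", None, "2010-01-01"),
])

def refresh_feed(conn):
    # Rebuild the cache; IFNULL pushes unfeatured rows after any real position.
    conn.executescript("""
    DELETE FROM homepage_feed;
    INSERT INTO homepage_feed
    SELECT item_id, title, IFNULL(position, 1000000), date FROM content;
    """)

refresh_feed(conn)
# This ORDER BY matches the (sort_rank, date) index, so no filesort is needed.
rows = conn.execute(
    "SELECT title FROM homepage_feed ORDER BY sort_rank, date"
).fetchall()
print([r[0] for r in rows])  # → ['feature', 'story']
```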
Sorry if the formatting is difficult to read, I'm new and unsure how it is regularly done.
Thanks for your help! If anything else is needed, please let me know!
EDIT: Table sizes, for reference:
Tag Relations: 197,411
Tags: 16,897
Stories: 51,801
Images: 28,383
Videos: 2,408
Featured: 13
I think optimizing your query alone won't be very useful. My first thought is that joining against a subquery, itself made of UNIONs, is by itself a double bottleneck for performance.
If you have the option to change your database structure, I would suggest merging the three tables stories, images and videos into one, since they appear to be very similar, adding a type ENUM('story', 'image', 'video') column to differentiate the records; this would remove both the subquery and the union.
Also, it looks like your views on stories and videos are not using an indexed field to filter content. Are you querying an indexed column?
It's a pretty tricky problem without knowing your full table structure and the repartition of your data!
Another option, which would not involve bringing modifications to your existing database (especially if it is already in production), would be to "cache" this information into another table, which would be periodically refreshed by a cron job.
The caching can be done at different levels, either on the full query, or on subparts of it (independent views, or the 3 unions merged into a single cache table, etc.)
The viability of this option depends on whether it is acceptable to display slightly outdated data, or not. It might be acceptable for just some parts of your data, which may imply that you will cache just a subset of the tables/views involved in the query.
So I'm facing a difficult scenario: I have a legacy app, badly written and designed, with a table, t_booking. This app has a calendar view which, for every hall and every day in the month, shows its reservation status with this query:
SELECT mr1b.id, mr1b.idreserva, mr1b.idhotel, mr1b.idhall, mr1b.idtiporeserva, mr1b.date, mr1b.ampm, mr1b.observaciones, mr1b.observaciones_bookingarea, mr1b.tipo_de_navegacion, mr1b.portal, r.estado
FROM t_booking mr1b
LEFT JOIN a_reservations r ON mr1b.idreserva = r.id
WHERE mr1b.idhotel = '$sIdHotel' AND mr1b.idhall = '$hall' AND mr1b.date = '$iAnyo-$iMes-$iDia'
AND IF (r.comidacena IS NULL OR r.comidacena = '', mr1b.ampm = 'AM', r.comidacena = 'AM' AND mr1b.ampm = 'AM')
AND (r.estado <> 'Cancelled' OR r.estado IS NULL OR r.estado = '')
LIMIT 1;
(at first there was also an ORDER BY r.estado DESC, which I took out)
This query, after proper (I think) indexing, takes 0.004 seconds each, and the overall calendar view is presented in a reasonable time. There are indexes over idhotel, idhall, and date.
Now, I have a new module, well written ;-), which stores reservations in another table, but I must present both types of reservations in the same calendar view. My first approach was to create a view joining the content of both tables, and selecting data for the calendar view from this view instead of t_booking.
The view is defined like this:
CREATE OR REPLACE VIEW
t_booking_hall_reservation
AS
SELECT id,
idreserva,
idhotel,
idhall,
idtiporeserva,
date,
ampm,
observaciones,
observaciones_bookingarea,
tipo_de_navegacion, portal
FROM t_booking
UNION ALL
SELECT HR.id,
HR.convention_id as idreserva,
H.id_hotel as idhotel,
HR.hall_id as idhall,
99 as idtiporeserva,
date,
session as ampm,
observations as observaciones,
'new module' as observaciones_bookingarea,
'events' as tipo_de_navegacion,
'new module' as portal
FROM new_hall_reservation HR
JOIN a_halls H on H.id = HR.hall_id
;
(the table new_hall_reservation has the same indexes)
I tried UNION ALL instead of UNION, as I read it is much more efficient.
Well, the former query, with t_booking replaced by t_booking_hall_reservation, takes 1.5 seconds; multiplied by each hall and each day, that makes the calendar view impossible to finish rendering.
The app is spaghetti code, so looping twice, once over t_booking and then over new_hall_reservation, and combining the results, is somewhat difficult.
Is it possible to tune the view to make this query fast enough? Another approach?
Thanks
PS: the less I modify the original query, the less I'll need to modify the legacy app, which is, at best, risky to modify
This is too long for a comment.
A view is (almost) never going to help performance. Yes, they make queries simpler. Yes, they incorporate important logic. But no, they don't help performance.
One key problem is the execution of the view -- it generally doesn't take the outer query's filters into account (although the most recent versions of MySQL are better at this).
One suggestion -- which might be a fair bit of work -- is to materialize the view as a table. When the underlying tables change, you update t_booking_hall_reservation using triggers. Then you can create indexes on the table to achieve your performance goals.
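A minimal sketch of such a trigger-maintained materialized table, using SQLite in Python as a stand-in for MySQL; the schema is simplified to a couple of columns, and the trigger bodies are assumptions, not the real app's logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_booking (id INTEGER PRIMARY KEY, idhall INTEGER, date TEXT);
-- Materialized copy of what the view would return, with its own indexes.
CREATE TABLE t_booking_hall_reservation (id INTEGER, idhall INTEGER, date TEXT);
CREATE INDEX res_idx ON t_booking_hall_reservation (idhall, date);

-- Triggers keep the materialized table in step with the base table.
CREATE TRIGGER booking_ins AFTER INSERT ON t_booking
BEGIN
    INSERT INTO t_booking_hall_reservation VALUES (NEW.id, NEW.idhall, NEW.date);
END;
CREATE TRIGGER booking_del AFTER DELETE ON t_booking
BEGIN
    DELETE FROM t_booking_hall_reservation WHERE id = OLD.id;
END;
""")
conn.execute("INSERT INTO t_booking VALUES (1, 7, '2017-05-01')")
# The indexed lookup now hits a real table, not a view expansion.
rows = conn.execute(
    "SELECT id FROM t_booking_hall_reservation WHERE idhall = 7"
).fetchall()
print(rows)  # → [(1,)]
```

The same triggers would also be needed on new_hall_reservation; only the base-table insert path is shown here.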
t_booking, unless it is a VIEW, needs
INDEX(idhotel, idhall, date)
VIEWs are syntactic sugar; they do not enhance performance; sometimes they are slower than the equivalent SELECT.
I'm trying to do what I think is a set of simple set operations on a database table: several intersections and one union. But I don't seem to be able to express that in a simple way.
I have a MySQL table called Moment, which has many millions of rows. (It happens to be a time-series table but that doesn't impact on my problem here; however, these data have a column 'source' and a column 'time', both indexed.) Queries to pull data out of this table are created dynamically (coming in from an API), and ultimately boil down to a small pile of temporary tables indicating which 'source's we care about, and maybe the 'time' ranges we care about.
Let's say we're looking for
(source in Temp1) AND (
((source in Temp2) AND (time > '2017-01-01')) OR
((source in Temp3) AND (time > '2016-11-15'))
)
Just for excitement, let's say Temp2 is empty --- that part of the API request was valid but happened to include 'no actual sources'.
If I then do
SELECT m.* from Moment as m,Temp1,Temp2,Temp3
WHERE (m.source = Temp1.source) AND (
((m.source = Temp2.source) AND (m.time > '2017-01-01')) OR
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... I get a heaping mound of nothing, because the empty Temp2 gives an empty Cartesian product before we get to the WHERE clause.
Okay, I can do
SELECT m.* from Moment as m
LEFT JOIN Temp1 on m.source=Temp1.source
LEFT JOIN Temp2 on m.source=Temp2.source
LEFT JOIN Temp3 on m.source=Temp3.source
WHERE (m.source = Temp1.source) AND (
((m.source = Temp2.source) AND (m.time > '2017-01-01')) OR
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... but this takes >70ms even on my relatively small development database.
If I manually eliminate the empty table,
SELECT m.* from Moment as m,Temp1,Temp3
WHERE (m.source = Temp1.source) AND (
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... it finishes in 10ms. That's the kind of time I'd expect.
I've also tried putting a single unmatchable row in the empty table and doing SELECT DISTINCT, and it splits the difference at ~40ms. Seems an odd solution though.
This really feels like I'm just conceptualizing the query wrong, that I'm asking the database to do more work than it needs to. What is the Right Way to ask the database this question?
Thanks!
--UPDATE--
I did some actual benchmarks on my actual database, and came up with some really unexpected results.
For the scenario above, all tables indexed on the columns being compared, with an empty table,
doing it with left joins took 3.5 minutes (!!!)
doing it without joins (just 'FROM...WHERE') and adding a null row to the empty table, took 3.5 seconds
even more striking, when there wasn't an empty table, but rather ~1000 rows in each of the temporary tables,
doing the whole thing in one query took 28 minutes (!!!!!), but,
doing each of the three AND clauses separately and then doing the final combination in the code took less than a second.
I still feel I'm expressing the query in some foolish way, since again, all I'm trying to do is one set union (OR) and a few set intersections. It really seems like the DB is making this gigantic Cartesian product when it seriously doesn't need to. All in all, as pointed out in the answer below, keeping some of the intelligence up in the code seems to be the better approach here.
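One way to keep that intelligence in the code, sketched below with SQLite in Python: probe each temp table first and assemble only the non-empty OR branches, so the planner never sees the empty table at all. The helper function, cutoff dates, and data are illustrative assumptions, not the real API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Moment (source INTEGER, time TEXT);
CREATE TABLE Temp1 (source INTEGER);
CREATE TABLE Temp2 (source INTEGER);  -- left empty on purpose
CREATE TABLE Temp3 (source INTEGER);
INSERT INTO Moment VALUES (1, '2017-02-01'), (2, '2016-12-01');
INSERT INTO Temp1 VALUES (1), (2);
INSERT INTO Temp3 VALUES (2);
""")

def branch(conn, table, cutoff):
    # Return a predicate for this temp table, or None if it is empty.
    if conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0] == 0:
        return None
    return f"(m.source IN (SELECT source FROM {table}) AND m.time > '{cutoff}')"

# Real code would also handle the case where every branch is empty.
branches = [b for b in (branch(conn, "Temp2", "2017-01-01"),
                        branch(conn, "Temp3", "2016-11-15")) if b]
sql = ("SELECT m.source FROM Moment m "
       "WHERE m.source IN (SELECT source FROM Temp1) AND ("
       + " OR ".join(branches) + ")")
print(conn.execute(sql).fetchall())  # → [(2,)]
```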
There are various ways to tackle the problem. Needless to say it depends on
how many queries are sent to the database,
the amount of data you are processing in a time interval,
how the database backend is configured to manage it.
For your use case, a little more information would be helpful. One option would be to optimize your query with CASE/COUNT(*) or CASE/LIMIT combinations to sort out empty tables; however, such conditional queries cost extra time themselves.
You could split the SQL code to downgrade the scaling of the problem from 1*N^x to y*N^z, where z should be smaller than x.
You said that an API is involved; maybe you can handle the temporary "no data" tables differently, or even not store them at all?
Another option would be to enable query caching:
https://dev.mysql.com/doc/refman/5.5/en/query-cache-configuration.html
I have a pretty huge MySQL database and am having performance issues while selecting data. Let me first explain what I am doing in my project: I have a list of files. Every file should be analyzed by a number of tools, and the results of the analysis are stored in a results table.
I have one table with files (samples). The table contains about 10 million rows. The schema looks like this:
idsample|sha256|path|...
The other table (a really small one) identifies the tool. Schema:
idtool|name
The third table is going to be the biggest one. It contains all results of the tools I am using to analyze the files (the number of rows will be the number of files TIMES the number of tools). Schema:
id|idsample|idtool|result information| ...
What I am looking for is a query which returns UNPROCESSED files for a given tool id (i.e. files for which no result exists yet).
The most efficient way I have found so far to query those entries is the following:
SELECT
s.idsample
FROM
samples AS s
WHERE
s.idsample NOT IN (
SELECT
idsample
FROM
results
WHERE
idtool = 1
)
LIMIT 100
The problem is that the query is getting slower and slower as the results table is growing.
Do you have any suggestions for improvements? A further problem is that I cannot change the structure of the tables, as this is a shared database that is also used by other projects. (I think) the only way to improve things is to find a more efficient SELECT query.
Thank you very much,
Philipp
A left join may perform better, especially if idsample is indexed in both tables; in my experience, these kinds of queries are better served by JOINs than by that kind of subquery.
SELECT s.idsample
FROM samples AS s
LEFT JOIN results AS r ON s.idsample = r.idsample AND r.idtool = 1
WHERE r.idsample IS NULL
LIMIT 100
;
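For reference, here is a runnable sketch of the anti-join above, using SQLite in Python as a stand-in (the SQL shape is the same in MySQL); the sample data is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE samples (idsample INTEGER PRIMARY KEY);
CREATE TABLE results (idsample INTEGER, idtool INTEGER);
INSERT INTO samples VALUES (1), (2), (3);
INSERT INTO results VALUES (1, 1), (3, 2);  -- only sample 1 processed by tool 1
""")
# Anti-join: rows from samples with no matching (idsample, idtool=1) result.
unprocessed = conn.execute("""
    SELECT s.idsample
    FROM samples AS s
    LEFT JOIN results AS r ON s.idsample = r.idsample AND r.idtool = 1
    WHERE r.idsample IS NULL
    ORDER BY s.idsample
    LIMIT 100
""").fetchall()
print([r[0] for r in unprocessed])  # → [2, 3]
```

Note that sample 3 is returned even though it has a result row, because that result is for a different tool.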
Another more involved possible solution would be to create a fourth table with the full "unprocessed list", and then use triggers on the other three tables to maintain it; i.e.
when a new tool is added, add all the current files to that fourth table (with the new tool).
when a new file is added, add all the current tools to that fourth table (with the new file).
when a new result is entered, remove the corresponding record from the fourth table.
My site shows collections of links on different subjects. These links are divided into two types: web and images. My database will have millions (probably more than ten million) of these records. When the page loads, I need to show the user the web and image links for the particular subject of that page. So the first question is:
Do I create two separate, smaller tables, one each for the web and image links, and then make a query against each; or do I create one huge table (with correct indexes) for both and make one query? Where will I get better performance? If the one table and one query turn out to be more efficient, then my next question is:
What would be the most efficient way to subdivide the two types for presentation? Should I use group by, or should I use php to divide my result array into the two types?
TIA!
You can get similar performance using one table for all objects, or one table each for links and websites. If you have two separate tables, a UNION of the results would return everything you need.
The main reason to divide the results is whether they are really different (from your application point of view). That is, if you are going to end up using a lot of queries like
select * from objects where type='image';
then it might make sense to have two tables.
Note that GROUP BY is not a way of separating the different results; it is a way of aggregating them.
So, for instance, you can use
select type, count(*) from objects group by type
to get
| image | 100000 |
| web | 2000000 |
but it will not return the objects separated. To get them "grouped", you can either use a query for each one, or use an ordering and then have the logic in the application to divide the results.
It's possible you'll get slightly better performance from just one table, but this decision should be primarily guided by whether the nature of data or constraints is different or not.
There is another (more important from the performance perspective) decision you'll have to make: how do you want to cluster the data (all InnoDB tables are clustered)?
If you want to have an excellent performance getting all the links of a given page, use an identifying relationship, producing a natural key in the link table(s):
The LINK table is effectively just a single B-tree, with the page PK1 at its leading edge, which physically groups together the rows that belong to the same page. The following query can be satisfied by a simple index range scan and minimal I/O:
SELECT URL
FROM LINK
WHERE PAGE_ID = <whatever>
If you used separate tables, you can just have two different queries. Many client APIs support executing two queries in a single database round-trip. If PHP doesn't, you can UNION the two queries to save one database round-trip:
SELECT *
FROM (
SELECT 1 LINK_TYPE, URL
FROM IMAGE_LINK
WHERE PAGE_ID = <whatever>
UNION ALL
SELECT 2, URL
FROM WEB_LINK
WHERE PAGE_ID = <whatever>
)
ORDER BY LINK_TYPE
The above query will give you...
LINK_TYPE URL
1 http://somesite.com/foo.jpeg
1 http://somesite.com/bar.jpeg
1 http://somesite.com/baz.jpeg
...
2 http://somesite.com/foo.html
2 http://somesite.com/bar.html
2 http://somesite.com/baz.html
...
...which will be very easy to separate at the client level.
If you didn't use separate tables, you can then separate the URLs by their extension at the client level, or introduce an additional field in the LINK PK: {PAGE_ID, LINK_TYPE, URL}, which should make the following query very efficient:
SELECT LINK_TYPE, URL
FROM LINK
WHERE PAGE_ID = <whatever>
ORDER BY LINK_TYPE
Note that the order of fields in the PK matters, so placing the LINK_TYPE at the end would prevent the DBMS from just doing the index range scan.
1 Whatever it may be; I just used the PAGE_ID as an example.
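The clustering argument above can be sketched with SQLite in Python as a stand-in: with the PK ordered {PAGE_ID, LINK_TYPE, URL}, the per-page lookup below is satisfied by a single index range scan, and the rows come back grouped by type. The URLs are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column order in the composite PK puts PAGE_ID at the leading edge,
# so all links of one page are physically adjacent in the index.
conn.execute("""CREATE TABLE LINK (
    PAGE_ID INTEGER, LINK_TYPE INTEGER, URL TEXT,
    PRIMARY KEY (PAGE_ID, LINK_TYPE, URL))""")
conn.executemany("INSERT INTO LINK VALUES (?,?,?)", [
    (1, 1, "http://somesite.com/foo.jpeg"),
    (1, 2, "http://somesite.com/foo.html"),
    (2, 1, "http://somesite.com/bar.jpeg"),
])
rows = conn.execute("""
    SELECT LINK_TYPE, URL FROM LINK
    WHERE PAGE_ID = 1 ORDER BY LINK_TYPE
""").fetchall()
print(rows)
# → [(1, 'http://somesite.com/foo.jpeg'), (2, 'http://somesite.com/foo.html')]
```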
It depends on how close the web data is to the image data. If each record is basically just the link, one table fits better, with a column to differentiate web from image rows (and possibly other types later, like css, js, ...):
Links: (id, link, type)
Adding an index on type, or on (type, link), will help grouping (by type) and matching searches by (type, link).
If, however, web and image data are different enough that you don't want to mix apples and oranges, like
Web: (wid, wlink, rating, ...)
Img: (iid, ilink, width, height, mbsize, camera, datetaken, hasexif...)
in this case, besides the link, the two tables don't have much in common. Since image links and web links are different, there is not even a "gain" in sharing a single link column for both kinds of data. Another advantage (also possible with one table, but making more sense here) is to relate both kinds of data in another table:
Relations: (wid,iid)
which allows you to maintain the relation between web sites and images, since an image may be used by several web sites, and a web site uses several images. Index on wid and on iid.
My preference goes to the two tables (with optional Relations link).
Regarding queries from PHP, using UNION you can obtain the data from two tables in one query.
Do I create two separate, smaller tables or one huge table?
Go for one table.
What would be the most efficient way to subdivide the two types for presentation?
Depends on the certain search criteria.
So I'm having serious speed problems using a left join to count ticket comments from another table. I've tried using a sub-select in the count field and had precisely the same performance.
With the count, the query takes about 1 second on maybe 30 tickets, and 5 seconds for 19000 tickets (I have both a production and a development server so times are skewed a tad). I'm trying to optimize this as four variations of the query need to be run each time per page refresh.
Without the count, I see execution time fall from 1 second to 0.03 seconds, so certainly this count is being run across all tickets and not just the ones which are selected.
Here's a trimmed down version of the query in question:
SELECT tickets.ticket_id,
ticket_severity,
ticket_short_description,
ticket_full_description,
count(*) as commentCount
FROM tickets LEFT JOIN tickets_comment ON tickets.ticket_id = tickets_comment.ticket_id
WHERE ticket_status='Open'
and ticket_owner_id='133783475'
GROUP BY
everything,
under,
the,
sun
Now, not all tickets have comments, so I can't just do a standard inner join. When I do, the speed is fairly good (1/10th the current time), but tickets without comments aren't included.
I see three fixes for this, and would like any and all advice you have.
Create a new column comment_count and use a trigger/update query on new comment
Work with the UI and grab comments on the fly (not really wanted)
Hope stackoverflow folks have a more elegant solution :þ
Ideas?
A co-worker has come to the rescue. The query was just using join improperly.
What must be done here is create a second table with a query like:
select ticket_id, count(*) from tickets_comment where (clause matches other) group by ticket_id
which gives a table with counts for each ticket id. Then join that table with the tickets table where the ticket ids match. It's not as wicked fast as maintaining a new column, but it takes at most 1/10th the time it did, so I'm pleased as punch.
The last step is converting NULLs (for tickets with no comments) into zeros.
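The co-worker's fix can be sketched like this with SQLite in Python (as a stand-in for MySQL): pre-aggregate the counts in a derived table, LEFT JOIN it, and COALESCE the missing counts to zero. The schema is trimmed to the columns relevant here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tickets (ticket_id INTEGER PRIMARY KEY, ticket_status TEXT);
CREATE TABLE tickets_comment (ticket_id INTEGER, body TEXT);
INSERT INTO tickets VALUES (1, 'Open'), (2, 'Open');
INSERT INTO tickets_comment VALUES (1, 'a'), (1, 'b');  -- ticket 2 has none
""")
# The derived table is aggregated once, so the count is not recomputed
# per output row; COALESCE turns the NULL for ticket 2 into a zero.
rows = conn.execute("""
    SELECT t.ticket_id, COALESCE(c.commentCount, 0)
    FROM tickets t
    LEFT JOIN (SELECT ticket_id, COUNT(*) AS commentCount
               FROM tickets_comment GROUP BY ticket_id) c
      ON t.ticket_id = c.ticket_id
    WHERE t.ticket_status = 'Open'
    ORDER BY t.ticket_id
""").fetchall()
print(rows)  # → [(1, 2), (2, 0)]
```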
The comment_count column (your first option) is by far the fastest solution, and you'll see it done in Rails all the time because it really is that fast.
count(*) is really only appropriate when you aren't selecting any other attributes. Try count(ticket_id) and see if that helps. I can't run EXPLAIN, so I can't test it myself, but if your analysis is correct it should help.
Try running EXPLAIN on the query to make sure the correct indexes are being used; if none are, create one.