SQL Optimization: how to JOIN a table with itself - mysql

I'm trying to optimize a SQL query and I am not sure if further optimization is possible.
Here's my query:
SELECT someColumns
FROM (((smaller_table))) AS a
INNER JOIN (((smaller_table))) AS b
ON a.someColumnProperty = b.someColumnProperty
...the problem with this approach is that my table has half a trillion records in it. In my query, you'll notice (((smaller_table))). I wrote that as an abbreviation for a SELECT statement being run on MY_VERY_LARGE_TABLE to reduce its size.
(((smaller_table))) appears twice, and the code within is exactly the same both times. There's no reason for me to run the same sub-query twice. This table is several TB and I shouldn't scan through it twice just to get the same results.
Do you have any suggestions on how I can NOT run the exact same reduction twice? I tried replacing the INNER JOIN line with INNER JOIN a AS b but got an "unrecognized table a" warning. Is there any way to store the value of a so I can reuse it?

Thoughts:
Make sure there is an index on userid and dayid.
I would ask you to define more clearly what it is you are trying to find out.
Examples:
What is the busiest time of the week?
Who are the top 25 people who come to the gym the most often?
Who are the top 25 people who utilize the gym the most? (This is different from the one above because maybe I have a user that comes 5 times a month but stays 5 hours per session, vs a user that comes 30 times a month and stays 0.5 hour per session.)
Maybe doing all days in a horizontal method (day1, day2, day3) would be better visually to try to find out what you are looking for. You could easily put this into Excel or LibreOffice and color the days that are populated to get a visual "picture" of people who come consecutively (see the sketch below).
It might be interesting to run this for multiple months to see what the seasonality looks like.
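For the horizontal layout idea, a minimal sketch using conditional aggregation, assuming the visits live in a table with the userid and dayid columns mentioned above (the table name visits is a placeholder):
SELECT userid,
       MAX(dayid = 1) AS day1,   -- 1 if the user came on day 1, else 0
       MAX(dayid = 2) AS day2,
       MAX(dayid = 3) AS day3
       -- ...continue through the last day of the month
FROM visits
GROUP BY userid;
The result can be pasted straight into Excel or LibreOffice and the 1s colored to get the consecutive-days picture.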

Alas, CTEs are not available in MySQL (prior to 8.0). The ~equivalent is
CREATE TABLE tmp (
INDEX(someColumnProperty)
)
SELECT ...;
But...
You can't use CREATE TEMPORARY TABLE because a temporary table can't be referenced twice in the same query. (No, I don't know why.)
Adding the INDEX (or PK or ...) during the CREATE (or afterwards) provides the very necessary key for doing the self join.
You still need to worry about DROPping the table (or otherwise dealing with it).
The choice of ENGINE for tmp depends on a number of factors. If you are sure it will be "small" and has no TEXT/BLOB, then MEMORY may be optimal.
In a Replication topology, there are additional considerations.
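Putting the pieces together, a rough sketch of the whole flow (the column names and the reduction itself are placeholders for whatever your real query does):
-- 1. Materialize the reduced result once, with the join key indexed.
CREATE TABLE tmp (
  INDEX(someColumnProperty)
)
SELECT someColumns, someColumnProperty
FROM MY_VERY_LARGE_TABLE
WHERE ...;   -- your reduction goes here

-- 2. Self-join the small table instead of scanning the big one twice.
SELECT a.someColumns, b.someColumns
FROM tmp AS a
INNER JOIN tmp AS b ON a.someColumnProperty = b.someColumnProperty;

-- 3. Clean up.
DROP TABLE tmp;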

Group by, Order by and Count MySQL performance

I have the following query to get the 15 most sold plates in a place:
This query is taking 12 seconds to execute over 100,000 rows. I think this execution takes too long, so I am looking for a way to optimize the query.
I ran the EXPLAIN SQL command in phpMyAdmin and I got this:
(screenshot of the EXPLAIN output)
According to this, the main problem is on the p table, which is scanning the entire table, but how can I fix this? The id of the p table is a primary key; do I need to set it also as an index? Also, is there anything else I can do to make the query run faster?
You can make a relationship between the two tables.
https://database.guide/how-to-create-a-relationship-in-mysql-workbench/
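For the relationship suggestion, a minimal sketch (the real table and column names aren't shown in the question, so the ones here are hypothetical):
-- Hypothetical: link an order_details table to the plates table.
-- In InnoDB this also creates an index on the referencing column.
ALTER TABLE order_details
  ADD CONSTRAINT fk_order_details_plate
  FOREIGN KEY (plate_id) REFERENCES plates (id);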
Besides this, you can also use a LEFT JOIN so you won't load in the whole right table.
ORDER BY is a slow operation in MySQL; if you are processing the results in code afterwards, you can sort there instead, which can be much faster than ORDER BY.
I hope I helped, and Community, feel free to edit :)
You did include the explain plan but you did not give any information about your table structure, data distribution, cardinality nor volumes. Assuming your indices are accurate and you have an even data distribution, the query is having to process over 12 million rows - not 100,000. But even then, that is relatively poor performance. But you never told us what hardware this sits on nor the background load.
A query with so many joins is always going to be slow - are they all needed?
the main problem is on the p table which is scanning the entire table
Full table scans are not automatically bad. The cost of dereferencing an index lookup as opposed to a streaming read is about 20 times more. Since the only constraint you apply to this table is its joins to other tables, there's nothing in the question you asked to suggest there is much scope for improving this.
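On the index sub-question from the post: a PRIMARY KEY already acts as an index in MySQL, so there is no need to add a second index on the same column. You can confirm which indexes exist with something like the following (the real table name isn't shown, so plates is a placeholder):
SHOW INDEX FROM plates;   -- the PRIMARY key appears here as an index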

MySQL Performance: Which of the queries will take more time?

I have two tables:
1. user table with around 10 million data
columns: token_type, cust_id(Primary)
2. pm_tmp table with 200k data
columns: id(Primary | AutoIncrement), user_id
user_id is foreign key for cust_id
1st Approach/Query:
update user set token_type='PRIME'
where cust_id in (select user_id from pm_tmp where id between 1 AND 60000);
2nd Approach/Query: Here we will run below query for different cust_id individually for 60000 records:
update user set token_type='PRIME' where cust_id='1111110';
Theoretically, time will be less for the first query as it involves a smaller number of commits and, in turn, fewer index rebuilds. But I would recommend going with the second option since it is more controlled, will appear to take less time, and you can even think about executing 2 separate sets in parallel.
Note: The first query will need sufficient memory provisioned for MySQL buffers to get it executed quickly. The second query, being a set of independent single-transaction queries, will need comparatively less memory and hence will appear faster when executed in limited-memory environments.
Well, you may rewrite the first query this way too.
update user u, pm_tmp p set u.token_type='PRIME' where u.cust_id=p.user_id and p.id < 60000;
Some versions of MySQL have trouble optimizing IN. I would recommend:
update user u join
pm_tmp pt
on u.cust_id = pt.user_id and pt.id between 1 AND 60000
set u.token_type = 'PRIME' ;
(Note: This assumes that cust_id is not repeated in pm_temp. If that is possible, you will want a select distinct subquery.)
Your second version would normally be considerably slower, because it requires executing thousands of queries instead of one. One consideration might be the update. Perhaps the logging and locking get more complicated as the number of updates increases. I don't actually know enough about MySQL internals to know if this would have a significant impact on performance.
IN ( SELECT ... ) is poorly optimized. (I can't provide specifics because both UPDATE and IN have been better optimized in some recent version(s) of MySQL.) Suffice it to say "avoid IN ( SELECT ... )".
Your first sentence should say "rows" instead of "columns".
Back to the rest of the question. 60K is too big of a chunk. I recommend only 1000. Aside from that, Gordon's Answer is probably the best.
But... You did not use OFFSET; Do not be tempted to use it; it will kill performance as you go farther and farther into the table.
Another thing. COMMIT after each chunk. Else you build up a huge undo log; this adds to the cost. (And is a reason why 1K is possibly faster than 60K.)
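To make the chunking concrete, a sketch of 1K-row chunks using the join form above, with autocommit on so each statement commits on its own (keeping the undo log small):
-- Chunk 1: pm_tmp ids 1..1000.
update user u join
       pm_tmp pt
       on u.cust_id = pt.user_id and pt.id between 1 and 1000
    set u.token_type = 'PRIME';

-- Chunk 2: ids 1001..2000, and so on, advancing the range by 1000 up to 60000.
update user u join
       pm_tmp pt
       on u.cust_id = pt.user_id and pt.id between 1001 and 2000
    set u.token_type = 'PRIME';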
But wait! Why are you updating a huge table? That is usually a sign of bad schema design. Please explain the data flow.
Perhaps you have computed which items to flag as 'prime'? Well, you could keep that list around and do JOINs in the SELECTs to discover prime-ness when reading. This completely eliminates the UPDATE in question. Sure, the JOIN costs something, but not much.

Efficiency of query using NOT IN()?

I have a query that runs on my server:
DELETE FROM pairing WHERE id NOT IN (SELECT f.id FROM info f)
It takes two different tables, pairing and info and says to DELETE all entries from pairing whenever the id of that pairing is not in info.
I've run into an issue on the server where this is beginning to take too long to execute, and I believe it has to do with the efficiency (or lack of constraints in the SELECT statement).
However, I took a look at the MySQL slow_log and the number of compared entries is actually LOWER than it should be. From my understanding, this should be O(mn) time where m is the number of entries in pairing and n is the number of entries in info. The number of entries in pairing is 26,868 and in info is 34,976.
This should add up to 939,735,168 comparisons. But the slow_log is saying there are only 543,916,401: almost half the amount.
I was wondering if someone could please explain to me how the efficiency of this specific query works. I realize the fact that it's performing quicker than I think it should is a blessing in this case, but I still need to understand where the optimization comes from so that I can further improve upon it.
I haven't used the slow query log much (at all) but isn't it possible that the difference can just be chalked up to simple... can't think of the word. Basically, 939,735,168 is the theoretical worst-case scenario where the query literally checks every single row except the one it needs to first. Realistically, with a roughly even distribution (and no use of indexing), a check of a row in pairing will on average compare against half the rows in info.
It looks like your real world performance is only 15% off (worse) than what would be expected from the "average comparisons".
Edit: Actually, "worse than expected" should be expected when you have rows in pairing that are not in info, as they will skew the number of comparisons.
...which is still not great. If you have id indexed in both tables, something like this should work a lot faster.
DELETE pairing
FROM pairing LEFT JOIN info ON pairing.id = info.id
WHERE info.id IS NULL
;
This should take advantage of an index on id to make the comparisons needed something like O(N log M).
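If you prefer to keep a subquery shape, the same anti-join can also be written with NOT EXISTS (a sketch; note that NOT IN and NOT EXISTS behave differently if info.id can be NULL):
DELETE FROM pairing
WHERE NOT EXISTS (SELECT 1 FROM info f WHERE f.id = pairing.id);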

Efficient way to get last n records from multiple tables sorted by date (user log of events)

I need to get the last n records from m tables (let's say m = n = ~10 for now) ordered by date (also, supporting offset would be nice). This should show the user his last activity. These tables will contain mostly hundreds or thousands of records for that user.
How can I do that the most efficient way? I'm using Doctrine 2 and these tables have no relation to each other.
I thought about some solutions, but I'm not sure what the best approach is:
Create a separate table and put records there. If any change happens (i.e. the user does anything inside the system that should be shown in the log table), it will be inserted into this table. This should be pretty fast, but it will be hard to manage and I don't want to use this approach yet.
Get the last n records from every table and then sort them (outside the DB) and limit to n. This seems pretty straightforward, but with more tables there will be quite a high overhead. For 10 tables, 90% of the records will be thrown away. If an offset is used, it would be even worse. Also, this means m queries.
Create a single native query and get the id and type of the last n items doing a union of all tables. Like SELECT id, date, type FROM ((SELECT a.id, a.date, 'a' AS type FROM a ORDER BY a.date DESC LIMIT 10) UNION (SELECT b.id, b.date, 'b' AS type FROM b ORDER BY b.date DESC LIMIT 10)) AS latest ORDER BY date DESC LIMIT 10. Then create at most m queries getting these entities. This should be a bit better than 2., but requires a native query.
Is there any other way how to get these records?
Thank you
The first option is not hard to manage; it is just an additional insert for each insert you are doing on the "action" tables.
You could also solve this by using a trigger I'd guess, so you wouldn't even have to implement it in the application code. https://stackoverflow.com/a/4754333/3595565
Wouldn't it be "get the last n records by a specific user from each of those tables"? I don't see a lot of problems with this approach, though I also think it is the least ideal way to handle things.
The third option would be like the 2nd one, but the database handles the sorting, which makes this approach a lot more viable.
Conclusion: (opinion based)
You should choose between options 1 and 3. The main questions should be
is it ok to store redundant data
is it ok to have logic outside of your application and inside of your database (trigger)
Using the logging table would make things pretty straightforward. But you will duplicate data.
If you are ok with using a trigger to fill the logging table, things will be simpler, but it has its downsides, as it requires additional documentation etc. so nobody wonders "where is that data coming from?"
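If the trigger route is chosen, a minimal sketch of what it could look like (the log table layout and the column names on the source table are assumptions, since the real schema isn't shown):
-- Hypothetical central log table, indexed for "last n events per user".
CREATE TABLE user_log (
    id      INT AUTO_INCREMENT PRIMARY KEY,
    user_id INT NOT NULL,
    type    VARCHAR(32) NOT NULL,  -- which source table the event came from
    ref_id  INT NOT NULL,          -- id of the row in that source table
    created DATETIME NOT NULL,
    INDEX (user_id, created)
);

-- One trigger per "action" table; shown here for the table a from option 3.
CREATE TRIGGER a_after_insert AFTER INSERT ON a
FOR EACH ROW
    INSERT INTO user_log (user_id, type, ref_id, created)
    VALUES (NEW.user_id, 'a', NEW.id, NEW.date);
The "last n events" lookup then becomes a single SELECT ... FROM user_log WHERE user_id = ? ORDER BY created DESC LIMIT n.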

MySQL FULLTEXT Search Across >1 Table

As a more general case of this question, because I think it may be of interest to more people... What's the best way to perform a fulltext search on two tables? Assume there are three tables, one for programs (with submitter_id) and one each for tags and descriptions, with object_id foreign keys referring to records in programs. We want the submitter_id of programs with certain text in their tags OR descriptions. We have to use MATCH AGAINST for reasons that I won't go into here. Don't get hung up on that aspect.
programs
id
submitter_id
tags_programs
object_id
text
descriptions_programs
object_id
text
The following works and executes in 20ms or so:
SELECT p.submitter_id
FROM programs p
WHERE p.id IN
(SELECT t.object_id
FROM tags_programs t
WHERE MATCH (t.text) AGAINST ('china')
UNION ALL
SELECT d.object_id
FROM descriptions_programs d
WHERE MATCH (d.text) AGAINST ('china'))
but I tried to rewrite this as a JOIN as follows and it runs for a very long time. I have to kill it after 60 seconds.
SELECT p.id
FROM descriptions_programs d, tags_programs t, programs p
WHERE (d.object_id=p.id AND MATCH (d.text) AGAINST ('china'))
OR (t.object_id=p.id AND MATCH (t.text) AGAINST ('china'))
Just out of curiosity I replaced the OR with AND. That also runs in a few milliseconds, but it's not what I need. What's wrong with the second query above? I can live with the UNION and subselects, but I'd like to understand.
Join after the filters (e.g. join the results), don't try to join and then filter.
The reason is that you lose use of your fulltext index.
Clarification in response to the comment: I'm using the word join generically here, not as JOIN but as a synonym for merge or combine.
I'm essentially saying you should use the first (faster) query, or something like it. The reason it's faster is that each of the subqueries is sufficiently uncluttered that the db can use that table's full text index to do the select very quickly. Joining the two (presumably much smaller) result sets (with UNION) is also fast. This means the whole thing is fast.
The slow version winds up walking through lots of data testing it to see if it's what you want, rather than quickly winnowing the data down and only searching through rows you are likely to actually want.
Just in case you don't know: MySQL has a built-in statement called EXPLAIN that can be used to see what's going on under the surface. There are a lot of articles about this, so I won't go into any detail, but for each table it provides an estimate of the number of rows it will need to process. If you look at the "rows" column in the EXPLAIN result for the second query, you'll probably see that the number of rows is quite large, and certainly a lot larger than for the first one.
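For example, prefixing either query with EXPLAIN is enough to compare the estimates; using the fast form from the question:
EXPLAIN
SELECT p.submitter_id
FROM programs p
WHERE p.id IN
    (SELECT t.object_id
     FROM tags_programs t
     WHERE MATCH (t.text) AGAINST ('china')
     UNION ALL
     SELECT d.object_id
     FROM descriptions_programs d
     WHERE MATCH (d.text) AGAINST ('china'));
-- Run the same EXPLAIN on the slow JOIN version and compare the "rows" column.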
The net is full of warnings about using subqueries in MySQL, but it turns out that many times the developer is smarter than the MySQL optimizer. Filtering results in some manner before joining can cause major performance boosts in many cases.
If you join both tables you end up having lots of records to inspect. Just as an example, if both tables have 100,000 records, fully joining them gives you 10,000,000,000 records (10 billion!).
If you change the OR to AND, then you allow the engine to filter out all records from the table descriptions_programs that don't match 'china', and only then join with tags_programs.
Anyway, that's not what you need, so I'd recommend sticking to the UNION way.
The union is the proper way to go. The join will pull in both full text indexes at once and can multiply the number of checks actually performed.