MySQL search query optimization from two many-to-many tables

I have three tables.
tbl_post for a table of posts. (post_idx, post_created, post_title, ...)
tbl_mention for a table of mentions. (mention_idx, mention_name, mention_img, ...)
tbl_post_mention for a unique many-to-many relation between the two tables. (post_idx, mention_idx)
For example,
PostA can have MentionA and MentionB.
PostB can have MentionA and MentionC.
PostC cannot have MentionC twice (the relation is unique).
tbl_post has about a million rows, tbl_mention has fewer than a hundred rows, and tbl_post_mention has a couple of million rows. All three tables are heavily loaded with foreign keys, unique indices, etc.
I am trying to make two separate search queries.
Search for post ids with all of the given mention ids (AND condition)
Search for post ids with any of the given mention ids (OR condition)
Then join with tbl_post and tbl_mention to populate with meaningful data, order the results, and return the top n. In the end, I hope to have a list of n posts with all the data required for my service to display on the front end.
Here are the respective queries in their simpler forms:
SELECT post_idx
FROM
(SELECT post_idx, count(*) as c
FROM tbl_post_mention
WHERE mention_idx in (1,95)
GROUP BY post_idx) AS A
WHERE c >= 2;
The problem with this query is that it is already inefficient before the joins and ordering. This process alone takes 0.2 seconds.
SELECT DISTINCT post_idx
FROM tbl_post_mention
WHERE mention_idx in (1,95);
This is a simple index range scan, but because of the IN statement, the query becomes expensive again once you start joining it with other tables.
I tried more complex and "clever" queries and tried indexing different sets of columns, to no avail. Are there special syntaxes that I could use in this case? Maybe a clever trick? Partitioning? Or am I missing some fundamental concept here... :(
Send help.

The query you want is this:
SELECT post_idx
FROM tbl_post_mention
WHERE mention_idx in (1,95)
GROUP BY post_idx
HAVING COUNT(*) >= 2
The HAVING clause does your post-GROUP BY filtering.
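If you then want the joined, ordered, top-n result you describe, the grouped lookup can be wrapped as a derived table. Here is a sketch, assuming you sort newest-first on post_created and want 20 rows:
SELECT p.post_idx, p.post_created, p.post_title
FROM tbl_post AS p
JOIN (
    -- posts that contain ALL of the given mentions
    SELECT post_idx
    FROM tbl_post_mention
    WHERE mention_idx IN (1, 95)
    GROUP BY post_idx
    HAVING COUNT(*) >= 2
) AS matched ON matched.post_idx = p.post_idx
ORDER BY p.post_created DESC
LIMIT 20;
This way the grouping work is done on the small (mention_idx, post_idx) index entries before any wide rows from tbl_post are touched.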
The index that will help you is this.
CREATE INDEX mentionsdex ON tbl_post_mention (mention_idx, post_idx);
It covers your query by allowing rapid lookup by mention_idx then grouping by post_idx.
Often so-called join tables with two columns -- like your tbl_post_mention -- work most efficiently when they have a pair of indexes with the columns in opposite orders.
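For example, alongside mentionsdex above, the reverse-order index would be (the name is illustrative):
CREATE INDEX postsdex ON tbl_post_mention (post_idx, mention_idx);
One index serves lookups by mention, the other serves lookups by post.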

Related

How to make a faster query when joining multiple huge tables?

I have 3 tables. All 3 tables have approximately 2 million rows. Every day, 10,000-100,000 new entries are added. It takes approximately 10 seconds to run the SQL statement below. Is there a way to make this SQL statement faster?
SELECT customers.name
FROM customers
INNER JOIN hotels ON hotels.cus_id = customers.cus_id
INNER JOIN bookings ON bookings.book_id = customers.book_id
WHERE customers.gender = 0 AND
customers.cus_id = 3
LIMIT 25 OFFSET 1;
Of course this statement works fine, but it's slow. Is there a better way to write this code?
All database servers have a form of an optimization engine that determines how best to grab the data you want. With a simple query such as the SELECT you showed, there isn't going to be any way to greatly improve performance within the SQL. As others have said, subqueries won't help, as they will get optimized into the same plan as joins.
Reduce the number of columns, add indexes, beef up the server if that's an option.
Consider caching. I'm not a MySQL expert, but I found this article interesting and worth a skim. https://www.percona.com/blog/2011/04/04/mysql-caching-methods-and-tips/
Look at the section on summary tables and consider if that would be appropriate. Does pulling every hotel, customer, and booking need to be up-to-the-minute or would inserting this into a summary table once an hour be fine?
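A minimal sketch of that idea (the table and column names beyond the question's are hypothetical):
-- rebuilt on a schedule (e.g. hourly) instead of computed per request
CREATE TABLE customer_booking_summary AS
SELECT c.cus_id, c.name, c.gender, COUNT(*) AS booking_count
FROM customers AS c
INNER JOIN bookings AS b ON b.book_id = c.book_id
GROUP BY c.cus_id, c.name, c.gender;
Queries then read from customer_booking_summary instead of running the three-way join every time.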
A subquery won't help, but a proper index can improve the performance, so be sure you have proper indexes:
create index idx1 on customers(gender, cus_id, book_id, name)
create index idx2 on hotels(cus_id)
create index idx3 on bookings(book_id)
I find it a bit hard to believe that this is related to a real problem. As written, I would expect this to return the same customer name over and over.
I would recommend the following indexes:
customers(cus_id, gender, book_id, name)
hotels(cus_id)
bookings(book_id)
It is really weird that bookings are not tied to a hotel.
First, these indexes cover the query, so the data pages don't need to be accessed. The logic is to start with the where clause and use those columns first. Then add additional columns from the on and select clauses.
Only one column is used for hotels and bookings, so those indexes are trivial.
The use of OFFSET without ORDER BY is quite suspicious. The result set is in indeterminate order anyway, so there is no reason to skip the nominally "first" value.
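If deterministic paging is actually wanted, an explicit sort fixes that, for example:
SELECT customers.name
FROM customers
INNER JOIN hotels ON hotels.cus_id = customers.cus_id
INNER JOIN bookings ON bookings.book_id = customers.book_id
WHERE customers.gender = 0 AND customers.cus_id = 3
ORDER BY customers.name  -- makes LIMIT/OFFSET well-defined
LIMIT 25 OFFSET 1;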

MySQL index key on table with multiple columns

In my script, I have a lot of SQL WHERE clauses, e.g.:
SELECT * FROM cars WHERE active=1 AND model='A3';
SELECT * FROM cars WHERE active=1 AND year=2017;
SELECT * FROM cars WHERE active=1 AND brand='BMW';
I am using different SQL clauses on same table because I need different data.
I would like to set index key on table cars, but I am not sure how to do it. Should I set separate keys for each column (active, model, year, brand) or should I set keys for groups (active,model and active,year and active,brand)?
WHERE a=1 AND y='m'
is best handled by INDEX(a,y) in either order. The optimal set of indexes is several pairs like that. However, I do not recommend having more than a few indexes. Try to limit it to queries that users actually make.
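Applied to the queries in the question, that set of pairs would be (the index names are illustrative):
CREATE INDEX idx_active_model ON cars (active, model);
CREATE INDEX idx_active_year  ON cars (active, year);
CREATE INDEX idx_active_brand ON cars (active, brand);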
INDEX(a,b,c,d):
WHERE a=1 AND b=22 -- Index is useful
WHERE a=1 AND d=44 -- Index is less useful
Only the "left column(s)" of an index are used. Hence the second case uses a, but then stops because b is not in the WHERE.
You might be tempted to also have (active, year, model). That combination works well for active AND year, and for active AND year AND model, but not for active AND model (without year).
More on creating indexes: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
Since model implies a brand, there is little use in putting both of those in the same composite index.
year is not very selective, and users might want a range of years. These make it difficult to get an effective index on year.
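For example:
SELECT * FROM cars WHERE active=1 AND year BETWEEN 2015 AND 2017;
-- INDEX(active, year) handles the range on year, but any column
-- listed after year in a composite index cannot be used for filtering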
How many rows will you have? If it is millions, we need to work harder to avoid performance problems. I'm leaning toward this, but only because the lookup would be more compact.
We use single indexing when we want to query for just one column, as in your case, and multiple-group indexing when we have multiple conditions in the same WHERE clause.
Go for single indexing.
For a more detailed explanation, refer to this article: https://www.sqlinthewild.co.za/index.php/2010/09/14/one-wide-index-or-multiple-narrow-indexes/

Join vs subquery to count nested objects

Let's say my model contains 2 tables: persons and addresses. One person can have 0, 1, or more addresses. I'm trying to execute a query that lists all persons and includes the number of addresses they each have. Here are the 2 queries that I have to achieve that:
SELECT
persons.*,
count(addresses.id) AS number_of_addresses
FROM `persons`
LEFT JOIN addresses ON persons.id = addresses.person_id
GROUP BY persons.id
and
SELECT
persons.*,
(SELECT COUNT(*)
FROM addresses
WHERE addresses.person_id = persons.id) AS number_of_addresses
FROM `persons`
And I was wondering if one is better than the other in term of performance.
The way to determine performance characteristics is to actually run the queries and see which is better.
If you have no indexes, then the first is probably better. If you have an index on addresses(person_id), then the second is probably better.
The reason is a little complicated. The basic reason is that group by (in MySQL) uses a sort. And, sorts are O(n * log(n)) in complexity. So, the time to do a sort grows faster than linearly in the amount of data (not much faster, but a bit faster). The consequence is that a bunch of small aggregations, one per person, is faster than one aggregation over all the data at once.
That is conceptual. In fact, MySQL will use the index for the correlated subquery, so it is often faster than the overall group by, which does not make use of an index.
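The practical upshot is to make sure that index exists (the name is illustrative):
CREATE INDEX idx_addresses_person_id ON addresses (person_id);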
I think the first query is optimal, and more optimization can be provided by changing the table structure. For example, define both the person_id and address_id fields (order is important) as the primary key of the addresses table to join faster.
MySQL's table storage structure is an index-organized table (clustered index), so a primary key lookup is much faster than a normal index lookup, especially in join operations.
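A sketch of that change, assuming InnoDB and that addresses.id is the existing auto-increment key:
ALTER TABLE addresses
DROP PRIMARY KEY,
ADD PRIMARY KEY (person_id, id),
ADD KEY (id);  -- InnoDB needs an AUTO_INCREMENT column to lead some index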

How to optimise join between two MySQL tables?

I have two tables: p_group.full_data, which is a large dataset I'm working on (100k rows, 200 columns) and p_group.full_data_aggregated, which I've produced to summarise a load of other tables.
Now, what I'd like to do is perform a join between full_data and full_data_aggregated to select out certain rows, averages, and so on. The query I have is as follows:
SELECT 'name', p.group_id, a.group_condition, p.event_index, AVG(p.value) FROM p_group.full_data p
JOIN p_group.full_data_aggregated as a on p.group_id = a.group_id AND p.event_index = a.event_index
WHERE (a.group_condition='open')
GROUP BY p.group_id, p.event_index
I have an index on: full_data.group_id, full_data.event_index and full_data_aggregated.group_id, full_data_aggregated.event_index, full_data_aggregated.group_condition.
Now, the problem is that this query simply won't finish: previously, I had my full_data split up into different tables (one for each group_id), and that worked fine. But now that I have joined the groups together, the query sits there running and so I can only assume I have done something stupid.
Is there anything else I can try to actually get this query to run at a decent speed? I'm sure I've messed up something with indices and the group by function, but I can't work out what. I've tried all sorts of variations of the above query. EXPLAIN indicates that it is "using where; using temporary; using filesort" but I'm not sure how to fix this.
Thanks!
I assume that your indexes are combination indexes (with group_id and event_index together). If you have separate indexes for each field, then only one index is used at a time and the database engine is going through significantly more data.
For example, if you only have a few unique group_id values, but lots of event_index values, and you have two indexes, one on group_id only and the other on event_index, then your query is going to run through a large number of rows for each group_id. If you have one index instead, with both fields in order, then the query will run much faster.
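Concretely, the combination indexes for this query would look like this (the names are illustrative):
CREATE INDEX idx_fd_grp_evt ON p_group.full_data (group_id, event_index);
CREATE INDEX idx_fda_grp_evt ON p_group.full_data_aggregated (group_id, event_index, group_condition);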

MySQL - sequence of multiple-column indexes

Say I have a query that looks like this:
SELECT * FROM table WHERE category='5' and status='1' LIMIT 5
The table has 1 million rows.
To speed things up, I create an index on (status, category), i.e. a multiple-column index.
There are 600 categories but only 2 status (1 or 0). I'm wondering if there is any difference in performance if I create index (category, status) instead of index (status, category).
Status first.
The trick is that if you only need to query by category, you can write
SELECT * from table where status in (1,0) and category = 'whatever'
and still get index support. Of course if your queries all use both columns it's the same either way, but in this case if you use only status it's much better, and only category only slightly worse if at all.
If you are looking at a lot of inserts as well, you want to minimize the number of indices, so this is your best bet rather than having multiple indices.
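In DDL terms, the single recommended index would be (backticks because the question's placeholder name is a reserved word):
CREATE INDEX idx_status_category ON `table` (status, category);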
There shouldn't be any difference. The selectivity of the index is identical whether you order it (category, status) or (status, category).
By the way, using LIMIT is often meaningless without also using ORDER BY. The order of rows returned by an SQL query is arbitrary unless you specify an order.
Re your comment: Yes, it's common to need five random rows, but arbitrary is not the same as random. It's not common to need five arbitrary rows.