mysql select - subquery, group_concat() not working / too slow - mysql

I'm having trouble working out a query. I've tried subqueries, different joins and group_concat() but they either don't work or are painfully slow. This may be a bit complicated to explain, but here's the problem:
I have a table "item" (with about 2000 products). I have a table "tag" (which contains about 2000 different product tags). And I have a table "tagassign" (which connects the tags to the items, with about 200000 records).
I'm using the tag to define characteristics of the products, for example colour, compatibility, whether the product is on special offer etc. Now if I want to be able to show the products that have a certain tag assigned to them, I use a simple query like:
select * from item, tagassign
where item.itemid = tagassign.itemid
and tagassign.tagid = "specialoffer"
The problem is, that I may want to see items that have several tags. For example I might want to see only the black cell phone cases that are compatible with the Apple iPhone and are new. So I basically want to see all records from the item table, that have tags "black" and "case" and "iphone" and "new". The only way I can get this to work is to create 4 aliases (select * from item, tagassign, tagassign as t1, tagassign as t2, tagassign as t3 etc.). In some cases I might be looking for 10 or 20 different tags, and with that many records the queries are dreadfully slow.
I know I'm missing something obvious. Any ideas?
Thanks!

SELECT *
FROM item i
WHERE (
    SELECT COUNT(*)
    FROM tagassign ta
    WHERE ta.tagid IN ('black', 'case', 'iphone', 'new')
      AND ta.itemid = i.itemid
) = 4
Substitute the actual number of the tags you are searching for instead of 4.
Create a unique index or a primary key on tagassign (itemid, tagid) (in this order) for this to work fast.
If you are searching for lots of tags (or for tags that are used rarely), this query may also be faster:
SELECT i.*
FROM (
    SELECT itemid
    FROM tagassign ta
    WHERE ta.tagid IN ('black', 'case', 'iphone', 'new')
    GROUP BY itemid
    HAVING COUNT(*) = 4
) t
JOIN item i
    ON i.itemid = t.itemid
For this query, you would need a unique index on tagassign (tagid, itemid).
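To see the GROUP BY / HAVING variant in action, here is a minimal sketch that runs it from Python against an in-memory SQLite database. SQLite stands in for MySQL only so the example is self-contained; the sample rows and the helper name items_with_all_tags are made up for illustration. It builds both the IN list and the required count from the tag list, so the same code handles 4 tags or 20.

```python
import sqlite3


def items_with_all_tags(conn, tags):
    """Return the ids of items that carry every tag in `tags` (hypothetical helper)."""
    tags = list(dict.fromkeys(tags))  # de-duplicate so COUNT(*) can equal len(tags)
    placeholders = ", ".join("?" for _ in tags)  # one "?" per tag
    sql = f"""
        SELECT itemid
        FROM tagassign
        WHERE tagid IN ({placeholders})
        GROUP BY itemid
        HAVING COUNT(*) = ?
        ORDER BY itemid
    """
    return [row[0] for row in conn.execute(sql, (*tags, len(tags)))]


conn = sqlite3.connect(":memory:")
# The key on (tagid, itemid) guarantees one row per item/tag pair,
# which is what makes COUNT(*) a valid "has all tags" test.
conn.execute(
    "CREATE TABLE tagassign ("
    " tagid TEXT NOT NULL, itemid INTEGER NOT NULL,"
    " PRIMARY KEY (tagid, itemid))"
)
conn.executemany(
    "INSERT INTO tagassign (tagid, itemid) VALUES (?, ?)",
    [
        ("black", 1), ("case", 1), ("iphone", 1), ("new", 1),  # item 1: all four tags
        ("black", 2), ("case", 2), ("iphone", 2),              # item 2: not "new"
        ("case", 3), ("new", 3),                               # item 3: two tags only
    ],
)

all_four = items_with_all_tags(conn, ["black", "case", "iphone", "new"])
case_and_new = items_with_all_tags(conn, ["case", "new"])
print(all_four, case_and_new)
```

Passing the tags as bound parameters, rather than pasting them into the SQL string, also avoids quoting problems when tag names come from user input.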

Related

Optimizing Inner Join Queries

I have this query and I want to know whether I can optimize it in some way, because it currently takes a long time to execute (around 4-5 seconds).
SELECT *
FROM `posts` ml INNER JOIN
posts_tag_one gt
ON gt.post_id = ml.id AND gt.tag_id = 15 INNER JOIN
posts_tag_two gg
ON gg.post_id = ml.id AND gg.tag_id = 5
WHERE active = '1' AND NOT ml.id = '639474'
ORDER BY ml.id DESC
LIMIT 5
The database has 600k+ posts; posts_tag_one has about 5 million records and posts_tag_two has 475k+ records.
The example I gave has only 2 joins, but in some cases I have up to 4 joins, and the other tables have around 300k-400k records.
I am using foreign keys and indexes on the posts_tag_one and posts_tag_two tables, but the query is still slow.
Any advice would help. Thanks!
By the transitive property (if a=b and b=c, then a=c), your ML.ID = GT.Post_ID = GG.Post_ID. Since you are trying to pre-qualify specific tags, I would rewrite the query to move the most selective table to the front position and use better indexes, and see whether the cardinality of the data helps. Also, MySQL has a handy keyword, STRAIGHT_JOIN, that tells the engine: query the data in the order I tell you, don't think for me. I have used it many times and have seen significant improvement.
SELECT STRAIGHT_JOIN
*
FROM
posts_tag_two gg
INNER JOIN posts_tag_one gt
ON gg.post_id = gt.post_id
AND gt.tag_id = 15
INNER JOIN posts ml
ON gt.post_id = ml.id
AND ml.active = 1
WHERE
gg.tag_id = 5
AND NOT gg.post_id = 639474
ORDER BY
gg.post_id DESC
LIMIT 5
I would ensure the following table / multi-field indexes:
table            index
Posts_Tag_One    ( tag_id, post_id )
Posts_Tag_Two    ( tag_id, post_id )
posts            ( id, active )
By starting with the Posts_Tag_Two table which you are pre-filtering for tag_id = 5, you are already cutting the list down to those pre-qualified FIRST. Not by starting with ALL posts and seeing which qualify with the tag.
Second level join is to the POSTS_TAG_ONE table on same ID, but that level filtered by its Tag_ID = 15.
Only then does it even care to get to the POSTS table for active.
Since the order is based on the ID descending, and the Posts_tag_two table "post_id" is the same value as Posts.id, the index from the posts_tag_two table should return the record already pre-sorted.
HTH, and would be interested to know final performance difference. Again, I have used STRAIGHT_JOIN many times with significant improvement in performance. I also typically do NOT do "Select *" for all tables / all columns. Get what you need.
FEEDBACK
@eshirvana, in MANY cases, yes, the optimizer does this by default. But sometimes the designer knows the makeup of the data better. Let's take the scenario of POSTS in the lead position. You have a room of boxes for posts. Each box contains, say, 10k records. You have to go through all 10k records, then move to the next box, until you get through 400k records... again, just as an example. Once you find those, the engine goes to the join on the filtered criteria for a specific tag. Those too are ordered by ID, so you have to do a one-to-one correlation. So the question is which table should sit in the primary position.
Now consider the index by tag on one of the posts_tag tables (the smaller one, posts_tag_two, by choice).
Now you have a room of boxes, but each box has only one tag within it. If you have 300 tag IDs available, you have already cut out x-amount of records, leaving just the small sample you pre-qualified.
The second posts_tag table is similarly a room of boxes, also broken down by tag, so you only have to grab the box for tag #15.
So now you have two very finite sets of records that the JOIN can match on the ID that exists in both. Only once that is done do you ever need to go to the posts table, which by ID is going to be quick and direct. And with the active status in the index, the engine never needs to go to any actual data pages until all conditions are met. Only then does it pull the records from the 3 respective tables being returned.
Sounds like posts_tags is a many-to-many mapping table? It needs two indexes: (post_id, tag_id) and (tag_id, post_id). One of those should probably be the PRIMARY KEY (having an auto_increment id is wasteful and slows things down). The other should be INDEX (not UNIQUE). More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
But, why have both posts_tag_two and posts_tag_one?
In addition to those 'composite' keys, do not also have the single-column (post_id) or (tag_id).
If tag is simply a short string, don't bother normalizing it; simply have it in the table.
For further discussion, please provide SHOW CREATE TABLE for each table. And EXPLAIN SELECT ....
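To make the mapping-table advice concrete, here is a small sketch run from Python against in-memory SQLite (used only so it runs anywhere; the DDL idea carries over to MySQL). It assumes a single posts_tags table instead of the two tables in the question, and the index name, sample rows, and helper active_posts_with_both_tags are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, active INTEGER NOT NULL);
    -- One mapping table; the composite primary key replaces an auto-increment id.
    CREATE TABLE posts_tags (
        post_id INTEGER NOT NULL,
        tag_id  INTEGER NOT NULL,
        PRIMARY KEY (post_id, tag_id)
    );
    -- Secondary, non-unique index for lookups that start from a tag.
    CREATE INDEX idx_posts_tags_tag_post ON posts_tags (tag_id, post_id);
""")
conn.executemany(
    "INSERT INTO posts (id, active) VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 1), (4, 0), (5, 1), (6, 1)],  # post 4 is inactive
)
conn.executemany(
    "INSERT INTO posts_tags (post_id, tag_id) VALUES (?, ?)",
    [(p, 5) for p in (1, 2, 3, 4, 6)] + [(p, 15) for p in (2, 3, 4, 5, 6)],
)


def active_posts_with_both_tags(conn, tag_a, tag_b, exclude_id, limit=5):
    """Newest active posts carrying both tags, skipping one post id."""
    sql = """
        SELECT p.id
        FROM posts_tags a
        JOIN posts_tags b ON b.post_id = a.post_id AND b.tag_id = ?
        JOIN posts p      ON p.id = a.post_id AND p.active = 1
        WHERE a.tag_id = ? AND a.post_id <> ?
        ORDER BY a.post_id DESC
        LIMIT ?
    """
    return [r[0] for r in conn.execute(sql, (tag_b, tag_a, exclude_id, limit))]


result = active_posts_with_both_tags(conn, 5, 15, exclude_id=3)
print(result)
```

On a real MySQL server, EXPLAIN on the equivalent query is the way to check that the (tag_id, post_id) index is actually chosen.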

How to efficiently select distinct tags with associated items in many-to-many relationship?

I'm building a system that has items and tags, with a many-to-many relationship (via an intermediate table), in MySQL. As I've scaled it up, one query has become unacceptably slow, but I'm struggling to make it more efficient.
The query in question amounts to "select all tags that have an item of type x associated with them". Here's a very slightly simplified version:
SELECT DISTINCT(t.id)
FROM tags t
INNER JOIN items_tags it ON it.tag_id = t.id
INNER JOIN items i ON it.item_id = i.id
WHERE i.type = 10
I have unique primary indexes on t.id, item.id and "it.tag_id, it.item_id". The problem I'm having is that the items_tags table is at a size (~1,400,000 rows) where the query takes too long (one thing that puzzles me here is that phpMyAdmin seems to think the query is fast - it times it as a few ms, but in practice it seems to take 6 or 7 seconds).
It feels to me as if there might be a way of joining the items_tags table to itself to reduce the size of the result set (and perhaps remove the need for that DISTINCT clause), but I can't figure out how... Alternatively, it occurs to me that there might be a better way of indexing things. Any help or suggestions would be much appreciated!
Well, for the record, here's what worked for me (though I'd still be interested if anyone has any other suggestions).
It was pointed out (in the comments above - thanks @Turophile!) that since tag id is available in the items_tags table, I could leave the tags table out. I actually did need other fields (eg. name) from the tags table (I simplified the query a little for the question), but I found that removing the tags table from the above query and joining the tags table onto its results was significantly faster (EXPLAIN showed that it allowed fewer rows to be scanned). That made the query look more like this:
SELECT
tags.id,
tags.name
FROM tags
INNER JOIN (
SELECT DISTINCT(it.tag_id) AS tag_id
FROM items_tags it
JOIN items i ON it.item_id = i.id
WHERE i.type = 10
) it ON tags.id = it.tag_id
This was about 10x faster than the previous version of the query (reduced the average time from about 27s to ~2.5s).
On top of that, adding an index to items.type improved things further (reduced the average time from ~2.5s to ~1.2s).
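For anyone who wants to check that the two query shapes return the same tags, here is a tiny sketch using Python and in-memory SQLite (a stand-in for MySQL; the sample rows and the index name idx_items_type are made up). It only shows equivalence of results on toy data, not the timing difference, which depends on the real table sizes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, type INTEGER NOT NULL);
    CREATE TABLE tags  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE items_tags (
        tag_id  INTEGER NOT NULL,
        item_id INTEGER NOT NULL,
        PRIMARY KEY (tag_id, item_id)
    );
    CREATE INDEX idx_items_type ON items (type);  -- the extra index on items.type

    INSERT INTO items VALUES (1, 10), (2, 10), (3, 20);
    INSERT INTO tags  VALUES (1, 'red'), (2, 'blue'), (3, 'green');
    INSERT INTO items_tags VALUES (1, 1), (1, 2), (2, 2), (3, 3);
""")

# Original shape: join all three tables, then DISTINCT.
original_ids = [r[0] for r in conn.execute("""
    SELECT DISTINCT t.id
    FROM tags t
    INNER JOIN items_tags it ON it.tag_id = t.id
    INNER JOIN items i ON it.item_id = i.id
    WHERE i.type = ?
    ORDER BY t.id
""", (10,))]

# Rewritten shape: find the distinct tag ids first, then join tags onto that.
rewritten = conn.execute("""
    SELECT tags.id, tags.name
    FROM tags
    INNER JOIN (
        SELECT DISTINCT it.tag_id AS tag_id
        FROM items_tags it
        JOIN items i ON it.item_id = i.id
        WHERE i.type = ?
    ) d ON tags.id = d.tag_id
    ORDER BY tags.id
""", (10,)).fetchall()

print(original_ids, rewritten)
```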

Join on 3 tables insanely slow on giant tables

I have a query which goes like this:
SELECT insanlyBigTable.description_short,
insanlyBigTable.id AS insanlyBigTable,
insanlyBigTable.type AS insanlyBigTableLol,
catalogpartner.id AS catalogpartner_id
FROM insanlyBigTable
INNER JOIN smallerTable ON smallerTable.id = insanlyBigTable.catalog_id
INNER JOIN smallerTable1 ON smallerTable1.catalog_id = smallerTable.id
AND smallerTable1.buyer_id = 'xxx'
WHERE smallerTable1.cont = 'Y' AND insanlyBigTable.type IN ('111','222','33')
GROUP BY smallerTable.id;
Now, when I run the query first time it copies the giant table into a temp table... I want to know how I can prevent that? I am considering a nested query, or even to reverse the join (not sure the effect would be to run faster), but that is well, not nice. Any other suggestions?
To figure out how to optimize your query, we first have to boil down exactly what it is selecting so that we can preserve that information while we change things around.
What your query does
So, it looks like we need the following
The GROUP BY clause limits the results to at most one row per catalog_id
smallerTable1.cont = 'Y', insanlyBigTable.type IN ('111','222','33'), and buyer_id = 'xxx' appear to be the filters on the query.
And we want data from insanlyBigTable and ... catalogpartner? I would guess that catalogpartner is smallerTable1, due to the id of smallerTable being linked to the catalog_id of the other tables.
I'm not sure why the buyer_id filter was placed in the ON clause (for an INNER JOIN it behaves the same as putting it in the WHERE clause), but unless you tell me differently, I'll assume its placement there is unimportant.
The point of the query
I am unsure about the intent of the query, based on that GROUP BY statement. You will obtain just one row per catalog_id in the insanlyBigTable, but you don't appear to care which row it is. Indeed, the fact that you can run this query at all is due to a special non-standard feature in MySQL that lets you SELECT columns that do not appear in the GROUP BY statement... however, you don't get to choose WHICH row each of those columns is taken from. This means you could have information from 4 different rows for each of your selected items.
My best guess, based on column names, is that you are trying to bring back a list of items that are in the same catalog as something that was purchased by a given buyer, but without any more than one item per catalog. In addition, you want something to connect back to the purchased item in that catalog, via the catalogpartner table's id.
So, something probably akin to amazon's "You may like these items because you purchased these other items" feature.
The new query
We want 1 row per insanlyBigTable.catalog_id, based on which catalog_id exists in smallerTable1, after filtering.
SELECT
ibt.description_short,
ibt.id AS insanlyBigTable,
ibt.type AS insanlyBigTableLol,
(
SELECT st.id FROM smallerTable1 st
WHERE st.buyer_id = 'xxx'
AND st.cont = 'Y'
AND st.catalog_id = ibt.catalog_id
LIMIT 1
) AS catalogpartner_id
FROM insanlyBigTable ibt
WHERE ibt.id IN (
SELECT (
SELECT ibt.id AS ibt_id
FROM insanlyBigTable ibt
WHERE ibt.catalog_id = sti.catalog_id
AND ibt.type IN ('111','222','33')
LIMIT 1
) AS ibt_id
FROM (
SELECT DISTINCT(catalog_id) FROM smallerTable1 st
WHERE st.buyer_id = 'xxx'
AND st.cont = 'Y'
AND EXISTS (
SELECT * FROM insanlyBigTable ibt
WHERE ibt.type IN ('111','222','33')
AND ibt.catalog_id = st.catalog_id
)
) AS sti
)
This query should generate the same result as your original query, but it breaks things down into smaller queries to avoid the use (and abuse) of the GROUP BY clause on the insanlyBigTable.
Give it a try and let me know if you run into problems.

self join with a self-referring condition

What I want to do is to get all records that have almost exact duplicates except that duplicates don't have an extra char at the beginning of 'name'
this is my sql query:
select * from tags as spaced inner join tags as not_spaced on not_spaced.name = substring(spaced.name, 2);
also I tried:
select * from tags as spaced where (select count(*) from tags as not_spaced where not_spaced.name = substring(spaced.name, 2)) > 0;
What I'm getting is... the SQL connection stops responding.
Thanks!
p.s. Sorry I haven't mentioned that the only field I need is name. All other fields are insignificant (if present).
Try something like this:
select all potentially duplicated fields except name , name
from (
select all potentially duplicated fields except name , name
from tags
union all
select all potentially duplicated fields except name , substring(name, 2) as name
from tags
) as t
group by all potentially duplicated fields including name
having count(*) > 1
If the tables are very large, make an index on name and substring(name,2) to make it faster:
select t1.* from tags t1
inner join tags t2 on t1.name = substring(t2.name, 2)
Even with an Index, your query will require every record in spaced to be checked against every record in tags.
If each table has 1,000 records, that's 1,000,000 combinations.
You may be better off creating a temporary table with just two fields, spaced.id and substring(spaced.name, 2) as shortname, then indexing the shortname field. Joining on that temporary, indexed table will be much faster.
Without knowing the DB, how the tables are indexed, etc, it's just trying different things until one gets better optimized...
Here is another query you can try:
SELECT name, count(*) c FROM (
SELECT name FROM tags
UNION ALL
SELECT substring(name, 2) AS name FROM tags
) AS t
GROUP BY name
HAVING COUNT(*) > 1
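The temporary-table idea mentioned above can be sketched as follows, using Python with in-memory SQLite in place of MySQL (so substr() is used where MySQL would use SUBSTRING()). The table name spaced_names, the index names, and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE INDEX idx_tags_name ON tags (name);
    INSERT INTO tags (id, name) VALUES
        (1, 'apple'), (2, ' apple'), (3, 'pear'), (4, 'xpear'), (5, 'plum');

    -- Precompute "name without its first character" once, then index it,
    -- so the join below compares two indexed columns.
    CREATE TEMP TABLE spaced_names AS
        SELECT id, substr(name, 2) AS shortname FROM tags;
    CREATE INDEX idx_spaced_shortname ON spaced_names (shortname);
""")

# Each result row pairs a name that has an extra leading character
# with the existing name it duplicates.
pairs = conn.execute("""
    SELECT sp.name AS spaced_name, t.name AS plain_name
    FROM spaced_names s
    JOIN tags t  ON t.name = s.shortname
    JOIN tags sp ON sp.id = s.id
    ORDER BY sp.id
""").fetchall()

print(pairs)
```

The precomputed column is the point of the exercise: a join on substring(name, 2) computed on the fly cannot use an ordinary index on name.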

MySQL JOIN tables with WHERE clause

I need to gather posts from two mysql tables that have different columns and provide a WHERE clause to each set of tables. I appreciate the help, thanks in advance.
This is what I have tried...
SELECT
blabbing.id,
blabbing.mem_id,
blabbing.the_blab,
blabbing.blab_date,
blabbing.blab_type,
blabbing.device,
blabbing.fromid,
team_blabbing.team_id
FROM
blabbing
LEFT OUTER JOIN
team_blabbing
ON team_blabbing.id = blabbing.id
WHERE
team_id IN ($team_array) ||
mem_id='$id' ||
fromid='$logOptions_id'
ORDER BY
blab_date DESC
LIMIT 20
I know that this is messy, but i'll admit, I am no mysql veteran. I'm a beginner at best... Any suggestions?
You could put the where-clauses in subqueries:
select
*
from
(select * from ... where ...) as alias1 -- this is a subquery
left outer join
(select * from ... where ...) as alias2 -- this is also a subquery
on
....
order by
....
Note that in older MySQL versions (before 5.7.7) you can't use subqueries like this in a view definition.
You could also combine the where-clauses, as in your example. Use table aliases to distinguish between columns of different tables (it's a good idea to use aliases even when you don't have to, just because it makes things easier to read). Example:
select
*
from
<table> as alias1
left outer join
<othertable> as alias2
on
....
where
alias1.id = ... and alias2.id = ... -- aliases distinguish between ids!!
order by
....
Two suggestions for you, since you are a relative newbie in SQL. Use "aliases" for your tables to help reduce SuperLongTableNameReferencesForColumns, and always qualify the column names in a query. It can make your life easier, and helps anyone AFTER you know which columns come from which table, especially if the same column name exists in different tables. It prevents ambiguity in the query. Your left join, I think, from the sample, may be ambiguous, so please confirm the join of B.ID to TB.ID. Typically a "Team_ID" would appear once in a teams table, and each blabbing entry could have the "Team_ID" that the posting was from, in addition to its OWN "ID" as the blabbing table's unique key.
SELECT
B.id,
B.mem_id,
B.the_blab,
B.blab_date,
B.blab_type,
B.device,
B.fromid,
TB.team_id
FROM
blabbing B
LEFT JOIN team_blabbing TB
ON B.ID = TB.ID
WHERE
TB.Team_ID IN ( you can't do a direct $team_array here )
OR B.mem_id = SomeParameter
OR b.FromID = AnotherParameter
ORDER BY
B.blab_date DESC
LIMIT 20
Where you were trying the $team_array, you would have to build out the full list as expected, such as
TB.Team_ID IN ( 1, 4, 18, 23, 58 )
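Also, use the SQL keyword "OR" rather than the logical "||". (MySQL accepts || as a synonym for OR by default, but it is non-standard and means string concatenation when the PIPES_AS_CONCAT SQL mode is enabled.)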
EDIT -- per your comment
This could be done in a variety of ways, such as dynamic SQL building and executing, calling multiple times, once for each ID and merging the results, or additionally, by doing a join to yet another temp table that gets cleaned out say... daily.
If you have another table such as "TeamJoins", and it has say... 3 columns: a date, a sessionid and team_id, you could daily purge anything from a day old of queries, and/or keep clearing each time a new query by the same session ID (as it appears coming from PHP). Have two indexes, one on the date (to simplify any daily purging), and second on (sessionID, team_id) for the join.
Then, loop through to do inserts into the "TeamJoins" table with the simple elements identified.
THEN, instead of a hard-coded list IN, you could change that part to
...
FROM
blabbing B
LEFT JOIN team_blabbing TB
ON B.ID = TB.ID
LEFT JOIN TeamJoins TJ
on TB.Team_ID = TJ.Team_ID
WHERE
TJ.Team_ID IS NOT NULL
OR B.mem_id ... rest of query
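Here is one way the "TeamJoins" approach could look end to end, as a rough sketch in Python with in-memory SQLite standing in for MySQL and PHP. The column name created, the helper recent_blabs, the session ids, and the sample rows are assumptions for illustration; the session filter in the join is also an addition, so that one session's team list cannot match another's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blabbing (
        id INTEGER PRIMARY KEY, mem_id INTEGER, fromid INTEGER, blab_date TEXT
    );
    CREATE TABLE team_blabbing (id INTEGER PRIMARY KEY, team_id INTEGER);
    CREATE TABLE TeamJoins (
        created   TEXT NOT NULL,      -- used for the daily purge
        sessionid TEXT NOT NULL,
        team_id   INTEGER NOT NULL,
        PRIMARY KEY (sessionid, team_id)
    );
    CREATE INDEX idx_teamjoins_created ON TeamJoins (created);

    INSERT INTO blabbing VALUES
        (1, 7, 0, '2024-01-01'), (2, 8, 0, '2024-01-02'),
        (3, 9, 0, '2024-01-03'), (4, 9, 5, '2024-01-04');
    INSERT INTO team_blabbing VALUES (2, 4), (3, 99);
    -- A row left behind by some other session; it must not affect ours.
    INSERT INTO TeamJoins VALUES (date('now'), 'other-session', 99);
""")


def recent_blabs(conn, session_id, team_ids, mem_id, from_id):
    """Load this session's team ids into TeamJoins, then query via the join."""
    with conn:  # one transaction: clear the session's old rows, insert the new ones
        conn.execute("DELETE FROM TeamJoins WHERE sessionid = ?", (session_id,))
        conn.executemany(
            "INSERT INTO TeamJoins (created, sessionid, team_id)"
            " VALUES (date('now'), ?, ?)",
            [(session_id, t) for t in set(team_ids)],
        )
    sql = """
        SELECT B.id
        FROM blabbing B
        LEFT JOIN team_blabbing TB ON B.id = TB.id
        LEFT JOIN TeamJoins TJ
               ON TB.team_id = TJ.team_id AND TJ.sessionid = ?
        WHERE TJ.team_id IS NOT NULL
           OR B.mem_id = ?
           OR B.fromid = ?
        ORDER BY B.blab_date DESC
        LIMIT 20
    """
    return [r[0] for r in conn.execute(sql, (session_id, mem_id, from_id))]


blab_ids = recent_blabs(conn, "abc", [1, 4, 18], mem_id=7, from_id=5)
print(blab_ids)
```

If the list of team ids is short, building the IN ( ... ) list with one placeholder per id is simpler; the extra table mostly pays off when the list is long or reused across queries.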
What I ended up doing is;
I added an extra column called team_id to my blabbing table (set to NULL), and another column called mem_id to my team_blabbing table.
Then I changed the insert script to also insert a value to the mem_id in team_blabbing.
After doing this I did a simple UNION ALL in the query:
SELECT
*
FROM
blabbing
WHERE
mem_id='$id' OR
fromid='$logOptions_id'
UNION ALL
SELECT
*
FROM
team_blabbing
WHERE
team_id
IN
($team_array)
ORDER BY
blab_date DESC
LIMIT 20
I am open to any thought on what I did. Try not to be too harsh though:) Thanks again for all the info.