SQL query finding best categories match - mysql

I have Items, Categories, and a many-to-many categorization between them. For a specific Item, how can I find other Items that share its categories, ordered by the number of matching categories (i.e. best match first)?
My table structure is roughly:
Item Table
ID
Name
...
Category Table
ID
Name
...
Categorization Table
ID
Item_ID
Category_ID
...
To find all Items having similar categories, for example, I use
SELECT `items`.*
FROM `items`
INNER JOIN `categorizations` c1
ON c1.`item_id` = `items`.`id`
INNER JOIN `categorizations` c2
ON c2.`item_id` = <Item_ID>
WHERE c1.`category_id` = c2.`category_id`

This should produce a table of counts of category matches between each pair of items that share at least one category.
select i1.id, i2.id, count(*) as matches
from items i1
join categorizations c1 on c1.item_id = i1.id
join categorizations c2 on c2.category_id = c1.category_id
join items i2 on c2.item_id = i2.id
where i1.id <> i2.id
group by i1.id, i2.id
order by matches desc
I suspect that it may be a bit slow, though. I don't have an instance of MySQL at the moment to try it out.
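As a rough sketch of the same idea restricted to one target item, the following uses Python's built-in sqlite3 module in place of MySQL so that it can be run without a server. The table and column names follow the question; the sample rows and the helper name best_matches are made up for illustration, and it assumes each (item_id, category_id) pair appears at most once in categorizations.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE categorizations (id INTEGER PRIMARY KEY, item_id INTEGER, category_id INTEGER);
INSERT INTO items (id, name) VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
INSERT INTO categorizations (item_id, category_id) VALUES
  (1, 10), (1, 20), (1, 30),   -- item 1: categories 10, 20, 30
  (2, 10), (2, 20),            -- item 2 shares two categories with item 1
  (3, 30),                     -- item 3 shares one
  (4, 40);                     -- item 4 shares none
""")

def best_matches(conn, item_id):
    """Return (other_item_id, shared_category_count) pairs, best match first."""
    return conn.execute(
        """
        SELECT c1.item_id, COUNT(*) AS shared
        FROM categorizations c1
        JOIN categorizations c2 ON c2.category_id = c1.category_id
        WHERE c2.item_id = ? AND c1.item_id <> ?
        GROUP BY c1.item_id
        ORDER BY shared DESC, c1.item_id
        """,
        (item_id, item_id),
    ).fetchall()

matches = best_matches(conn, 1)
print(matches)
```

Filtering on the target item inside the join keeps the grouped set small compared with counting matches for every pair of items; an index on categorizations (category_id, item_id) would probably help on a real table.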

Something like:
select item_id, count(id)
from item_category ic
where exists(
select category_id
from item_category ic2
where ic2.item_id = <Item_ID>
and ic2.category_id = ic.category_id )
and item_id <> <Item_ID>
group by item_id
order by count(id) desc

An alternative method, which I have just implemented to solve this problem, is to use bitwise operators to speed things up. In MySQL this method only works if you have 64 or fewer categories, as the bit functions operate on 64-bit integers.
1) Assign each category a unique integer value which is a power of 2.
2) For each item, sum the values of the categories it belongs to, giving a 64-bit integer that represents all of the item's categories.
3) To compare an item to another do something like:
SELECT id, BIT_COUNT(item1categories & item2categories) AS numMatchedCats
FROM tablename
HAVING numMatchedCats > 0
ORDER BY numMatchedCats DESC
The BIT_COUNT() function may be MySQL-specific, so an alternative may be required for other databases.
MySQL bit functions used are explained here:
http://dev.mysql.com/doc/refman/5.0/en/bit-functions.html
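The three steps above can be sketched outside the database as well. This is only an illustration in plain Python: the category names, item names, and helper functions are hypothetical, and bitwise OR is used in place of summing (the two agree as long as each category is counted once per item).

```python
# Step 1: each category gets a distinct power of two (hypothetical category names).
CATEGORY_BITS = {"books": 1 << 0, "music": 1 << 1, "films": 1 << 2, "games": 1 << 3}

def category_mask(categories):
    """Step 2: combine an item's categories into one integer bitmask."""
    mask = 0
    for name in categories:
        mask |= CATEGORY_BITS[name]
    return mask

def matched_categories(mask_a, mask_b):
    """Step 3: count the bits set in both masks, like BIT_COUNT(a & b)."""
    return bin(mask_a & mask_b).count("1")

item_masks = {
    "item1": category_mask(["books", "music", "films"]),
    "item2": category_mask(["music", "films", "games"]),
    "item3": category_mask(["games"]),
}

target = item_masks["item1"]
scored = [
    (name, matched_categories(target, mask))
    for name, mask in item_masks.items()
    if name != "item1"
]
# Keep only items with at least one match, best match first.
ranked = sorted((pair for pair in scored if pair[1] > 0), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

The trade-off is that the mask has to be recomputed whenever an item's categories change, in exchange for a comparison that needs no join.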

Related

Select items from categories and item should also be in another category

I have an issue that I'm unable to solve.
The problem is with an SQL query: I need to select items from a set of categories that are also in another category.
Basically it is something like this:
select * from items where category_id in (1,2,3,4,5) and category_id = 6
Of course the above statement will not work. I have tried:
select * from items where category_id in (1,2,3,4,5,6)
GROUP BY item_id HAVING COUNT(DISTINCT category_id) > 1
but this also gives me items that exist in multiple categories without being in category 6, such as items in categories 1, 2 and 3.
Any suggestions on how to solve this?
One way is to filter the resulting groups, by counting the number of constituent records that match each of your criteria and combining with suitable logical operations:
SELECT item_id
FROM items
WHERE category_id IN (1,2,3,4,5,6) -- only for performance, if indexed
GROUP BY item_id
HAVING SUM(category_id IN (1,2,3,4,5))
AND SUM(category_id = 6)
Another way would be to use a self-join, with each side of the join filtering the groups according to a different criterion:
SELECT item_id
FROM items a JOIN items b USING (item_id)
WHERE a.category_id IN (1,2,3,4,5)
AND b.category_id = 6
GROUP BY item_id
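Both approaches can be tried on a small data set. The sketch below uses Python's sqlite3 module rather than MySQL (SQLite also evaluates comparisons to 0 or 1, so summing them works the same way); the sample rows are invented, and the explicit "> 0" and the DISTINCT are small variations on the queries above rather than a claim about what the answer intended.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (item_id INTEGER, category_id INTEGER);
INSERT INTO items (item_id, category_id) VALUES
  (100, 1), (100, 6),            -- in category 1 and in category 6: wanted
  (200, 1), (200, 2), (200, 3),  -- several of 1-5 but not 6: not wanted
  (300, 6);                      -- only category 6: not wanted
""")

# Conditional aggregation: each SUM counts the rows in the group meeting one criterion.
grouped = [r[0] for r in conn.execute("""
    SELECT item_id FROM items
    WHERE category_id IN (1, 2, 3, 4, 5, 6)
    GROUP BY item_id
    HAVING SUM(category_id IN (1, 2, 3, 4, 5)) > 0
       AND SUM(category_id = 6) > 0
""")]

# Self-join: side a must hit one of 1-5, side b must hit category 6.
joined = [r[0] for r in conn.execute("""
    SELECT DISTINCT a.item_id
    FROM items a JOIN items b ON b.item_id = a.item_id
    WHERE a.category_id IN (1, 2, 3, 4, 5)
      AND b.category_id = 6
""")]

print(grouped, joined)
```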

Joining Results From Another Table

I'm dealing with a large query that maps data from one table into a CSV file, so it essentially looks like a basic select query--
SELECT * FROM item_table
--except that * is actually a hundred lines of CASE, IF, IFNULL, and other logic.
I've been told to add a "similar items" line to the select statement, which should be a string of comma-separated item numbers. The similar items are found in a category_table, which can join to item_table on two data points, column_a and column_b, with category_table.category_id having the data that identifies the similar items.
Additionally, I've been told NOT to use a subquery.
So I need to join category_table and group_concat item numbers from that table having the same category_id value (but not having the item number of whatever the current record would be).
If I can only do it with a subquery regardless of the instructions, I will accept that, but I want to do it with a join and group_concat as instructed if possible--I just can't figure it out. How can I do this?
You can make use of a MySQL "feature" called hidden columns.
I am going to assume you have an item id in the item table that uniquely identifies each row. And, if I have your logic correct, the following query does what you want:
select i.*, group_concat(c.category_id)
from item_table i left outer join
category_table c
on i.column_a = c.column_a and
i.column_b = c.column_b and
i.item_id <> c.category_id
group by i.item_id
I think this is what you're looking for, although I wasn't sure what uniquely identified your item_table so I used column_a and column_b (those may be incorrect):
SELECT
...,
GROUP_CONCAT(ct.category_id separator ',') CategoryIDs
FROM item_table i
JOIN category_table ct ON i.column_a = ct.column_a AND
i.column_b = ct.column_b
GROUP BY i.column_a, i.column_b
I've used a regular INNER JOIN, but if the category_table might not have any related records, you may need to use a LEFT JOIN instead to get your desired results.
Maybe something like this?
SELECT i.*, GROUP_CONCAT(c.category_id) AS similar_items
FROM item_table i
INNER JOIN category_table c ON (i.column_a = c.column_a AND
i.column_b = c.column_b)
GROUP BY i.column_a, i.column_b
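The answers above concatenate category_id values. If the goal is instead a list of the other item numbers in the same category, one possible join-only shape is sketched below with Python's sqlite3 module. It rests on several assumptions that the question does not confirm: that item_table has an item_number column, that each category_table row maps back to an item through column_a and column_b, and that the sample rows shown are representative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- item_number is a hypothetical column; the question does not name it.
CREATE TABLE item_table (item_number INTEGER PRIMARY KEY, column_a TEXT, column_b TEXT);
CREATE TABLE category_table (column_a TEXT, column_b TEXT, category_id INTEGER);
INSERT INTO item_table VALUES (1, 'a1', 'b1'), (2, 'a2', 'b2'), (3, 'a3', 'b3'), (4, 'a4', 'b4');
INSERT INTO category_table VALUES
  ('a1', 'b1', 7), ('a2', 'b2', 7), ('a3', 'b3', 7),  -- items 1, 2, 3 share category 7
  ('a4', 'b4', 9);                                     -- item 4 is alone in category 9
""")

rows = conn.execute("""
    SELECT i.item_number, GROUP_CONCAT(other.item_number) AS similar_items
    FROM item_table i
    LEFT JOIN category_table own                 -- the current item's category row
           ON own.column_a = i.column_a AND own.column_b = i.column_b
    LEFT JOIN category_table peer                -- other rows in the same category
           ON peer.category_id = own.category_id
          AND NOT (peer.column_a = own.column_a AND peer.column_b = own.column_b)
    LEFT JOIN item_table other                   -- map those rows back to item numbers
           ON other.column_a = peer.column_a AND other.column_b = peer.column_b
    GROUP BY i.item_number
    ORDER BY i.item_number
""").fetchall()

# The order inside the concatenated string is not guaranteed, so sort after splitting.
similar = {num: sorted(int(x) for x in ids.split(",")) if ids else [] for num, ids in rows}
print(similar)
```

The LEFT JOINs keep items that have no similar items (their list comes back as NULL). If an item can sit in several categories, GROUP_CONCAT(DISTINCT ...) would likely be needed to avoid repeated numbers.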

MySQL: Return grouped fields where the group is not empty, efficiently

In one statement I'm trying to group rows of one table by joining to another table. I want to only get grouped rows where their grouped result is not empty.
Ex. Items and Categories
SELECT Category.id
FROM Item, Category
WHERE Category.id = Item.categoryId
GROUP BY Category.id
HAVING COUNT(Item.id) > 0
The above query gives me the results that I want but this is slow, since it has to count all the rows grouped by Category.id.
What's a more efficient way?
I was trying to do a Group By LIMIT to only retrieve one row per group. But my attempts failed horribly. Any idea how I can do this?
Thanks
Try this:
SELECT item.categoryid
FROM Item
JOIN Category
ON Category.id = Item.categoryId
GROUP BY
item.categoryid
HAVING COUNT(*) > 0
This is similar to your original query, but won't do what you want.
If you want to select non-empty categories, do this:
SELECT category.id
FROM category
WHERE id IN
(
SELECT categoryId
FROM item
)
For this to work fast, create an index on item (categoryId).
What about eliminating the Category table if you don't need it?
SELECT Item.categoryId
FROM Item
GROUP BY Item.categoryId
I'm not sure you even need the HAVING clause since if there is no item in a category it won't create a group.
I think this is functionally equivalent (returns every category that has at least one item), and should be much faster.
SELECT
c.id
FROM
Category c
WHERE
EXISTS (
select 1 from Item i where i.categoryid = c.id
)
I think, and this is just my opinion, that the correct approach IS counting all the stuff. Maybe the problem is in another place.
This is what I use for counting and it works pretty fast, even with a lot of data.
SELECT categoryid, COUNT(*) FROM Item GROUP By categoryid
It will give you the number of items in each category, but it will NOT include empty categories.
Then, to retrieve the category information, do this:
SELECT category.* FROM category
INNER JOIN (SELECT categoryid, COUNT(*) AS n FROM Item GROUP By categoryid) AS item
ON category.id = item.categoryid
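To compare two of the suggestions on the same data, here is a small sketch using Python's sqlite3 module instead of MySQL. The table and column names follow the question; the index name and the sample rows are made up, and it says nothing about relative speed on a real MySQL server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Category (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Item (id INTEGER PRIMARY KEY, categoryId INTEGER);
CREATE INDEX idx_item_category ON Item (categoryId);  -- lets the lookup stop at the first hit
INSERT INTO Category VALUES (1, 'used'), (2, 'empty'), (3, 'also used');
INSERT INTO Item (categoryId) VALUES (1), (1), (3);
""")

# EXISTS only needs to find one matching item per category; nothing is counted.
exists_ids = [r[0] for r in conn.execute("""
    SELECT c.id FROM Category c
    WHERE EXISTS (SELECT 1 FROM Item i WHERE i.categoryId = c.id)
    ORDER BY c.id
""")]

# Without the Category table: every categoryId present in Item is non-empty by definition.
distinct_ids = [r[0] for r in conn.execute(
    "SELECT DISTINCT categoryId FROM Item ORDER BY categoryId"
)]

print(exists_ids, distinct_ids)
```

The two results agree only if every Item.categoryId refers to an existing Category row; with orphaned items the second query would return extra ids.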

how can I select from one table based on non matches in another in mySQL

I have two tables
ITEMS : id, name
CATEGORIES: items_id, category
I need to select all the items whose IDs are NOT in the CATEGORIES table.
I suspect it's really simple, but can't figure out the syntax.
try this:
SELECT
i.*
FROM Items i
LEFT OUTER JOIN Categories c ON i.id=c.items_id
WHERE c.items_id is NULL
Use NOT IN with a subquery on CATEGORIES.items_id. I'm not sure if that's faster than the join above, but it works:
SELECT * FROM Items
WHERE id NOT IN (SELECT items_id FROM Categories)
This selects everything from the items table and only the records from categories that match the join. Then filter out the nulls.
Select Items.Id
FROM ITEMS LEFT OUTER JOIN CATEGORIES On
Items.Id = Categories.items_id
where categories.items_id is null
How about
SELECT id
, name
FROM ITEMS
WHERE NOT EXISTS(SELECT 1 FROM CATEGORIES WHERE Categories.items_id = ITEMS.id)
This will only bring back items that have no entry in the Categories table
SELECT items.id
FROM items
WHERE NOT EXISTS( SELECT *
FROM categories
WHERE items.id = categories.items_id )
This is the same as joining to the categories table as Mike Pone and KM listed but I find this more readable.
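The three variants (LEFT JOIN with IS NULL, NOT EXISTS, NOT IN) can be checked side by side. The sketch below runs them with Python's sqlite3 module rather than MySQL, on invented sample rows. It also shows one difference worth knowing about: if items_id can be NULL, NOT IN returns no rows at all, while the other two forms are unaffected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE categories (items_id INTEGER, category TEXT);
INSERT INTO items VALUES (1, 'pen'), (2, 'cup'), (3, 'hat');
INSERT INTO categories VALUES (1, 'office'), (1, 'gift');
""")

QUERIES = {
    "left_join": """SELECT i.id FROM items i
                    LEFT JOIN categories c ON c.items_id = i.id
                    WHERE c.items_id IS NULL ORDER BY i.id""",
    "not_exists": """SELECT i.id FROM items i
                     WHERE NOT EXISTS (SELECT 1 FROM categories c WHERE c.items_id = i.id)
                     ORDER BY i.id""",
    "not_in": """SELECT id FROM items
                 WHERE id NOT IN (SELECT items_id FROM categories)
                 ORDER BY id""",
}
results = {name: [r[0] for r in conn.execute(sql)] for name, sql in QUERIES.items()}
print(results)

# Caveat: one NULL in the subquery makes every NOT IN comparison unknown.
conn.execute("INSERT INTO categories VALUES (NULL, 'orphan')")
not_in_with_null = [r[0] for r in conn.execute(QUERIES["not_in"])]
not_exists_with_null = [r[0] for r in conn.execute(QUERIES["not_exists"])]
print(not_in_with_null, not_exists_with_null)
```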

Using GROUP_CONCAT on subquery in MySQL

I have a MySQL query in which I want to include a list of ID's from another table. On the website, people are able to add certain items, and people can then add those items to their favourites. I basically want to get the list of ID's of people who have favourited that item (this is a bit simplified, but this is what it boils down to).
Basically, I do something like this:
SELECT *,
GROUP_CONCAT((SELECT userid FROM favourites WHERE itemid = items.id) SEPARATOR ',') AS idlist
FROM items
WHERE id = $someid
This way, I would be able to show who favourited an item by splitting idlist into an array later on in my PHP code. However, I am getting the following MySQL error:
1242 - Subquery returns more than 1 row
I thought that was kind of the point of using GROUP_CONCAT instead of, for example, CONCAT? Am I going about this the wrong way?
OK, thanks for the answers so far, that seems to work. However, there is a catch. An item is also considered a favourite if it was added by that user, so I would need an additional check for creator = userid. Can someone help me come up with a smart (and hopefully efficient) way to do this?
Thank you!
Edit: I just tried to do this:
SELECT [...] LEFT JOIN favourites ON (userid = itemid OR creator = userid)
And idlist is empty. Note that if I use INNER JOIN instead of LEFT JOIN I get an empty result. Even though I am sure there are rows that meet the ON requirement.
OP almost got it right. GROUP_CONCAT should be wrapping the columns in the subquery and not the complete subquery (I'm dismissing the separator because comma is the default):
SELECT i.*,
(SELECT GROUP_CONCAT(userid) FROM favourites f WHERE f.itemid = i.id) AS idlist
FROM items i
WHERE i.id = $someid
This will yield the desired result and also means that the accepted answer is partially wrong, because you can access outer scope variables in a subquery.
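A quick way to see the correlated subquery at work is the sketch below, which uses Python's sqlite3 module in place of MySQL and PHP. The table and column names follow the question; the sample rows and the helper name favourite_user_ids are made up. It passes the item id as a bound parameter instead of interpolating $someid into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE favourites (userid INTEGER, itemid INTEGER);
INSERT INTO items VALUES (1, 'lamp'), (2, 'desk');
INSERT INTO favourites VALUES (11, 1), (12, 1), (13, 1);
""")

def favourite_user_ids(conn, item_id):
    """Return the user ids that favourited one item, as a sorted list of ints."""
    row = conn.execute(
        """
        SELECT i.id,
               (SELECT GROUP_CONCAT(f.userid)
                FROM favourites f
                WHERE f.itemid = i.id) AS idlist   -- aggregate inside the subquery
        FROM items i
        WHERE i.id = ?
        """,
        (item_id,),
    ).fetchone()
    idlist = row[1]
    # GROUP_CONCAT over zero rows yields NULL, which arrives here as None.
    return sorted(int(u) for u in idlist.split(",")) if idlist else []

lamp_fans = favourite_user_ids(conn, 1)
desk_fans = favourite_user_ids(conn, 2)
print(lamp_fans, desk_fans)
```

Because the aggregate sits inside the subquery, the subquery always returns exactly one row, which avoids the "Subquery returns more than 1 row" error from the question.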
You can't access variables in the outer scope in such queries (can't use items.id there). You should rather try something like
SELECT
items.name,
items.color,
GROUP_CONCAT(favourites.userid) as idlist
FROM
items
INNER JOIN favourites ON items.id = favourites.itemid
WHERE
items.id = $someid
GROUP BY
items.name,
items.color;
Expand the list of fields as needed (name, color...).
I think you may have the "userid = itemid" wrong, shouldn't it be like this:
SELECT ITEMS.id,GROUP_CONCAT(FAVOURITES.UserId) AS IdList
FROM FAVOURITES
INNER JOIN ITEMS ON (ITEMS.Id = FAVOURITES.ItemId OR FAVOURITES.UserId = ITEMS.Creator)
WHERE ITEMS.Id = $someid
GROUP BY ITEMS.ID
The purpose of GROUP_CONCAT is correct but the subquery is unnecessary and causing the problem. Try this instead:
SELECT ITEMS.id,GROUP_CONCAT(FAVOURITES.UserId)
FROM FAVOURITES INNER JOIN ITEMS ON ITEMS.Id = FAVOURITES.ItemId
WHERE ITEMS.Id = $someid
GROUP BY ITEMS.ID
Yes, soulmerge's solution is ok. But I needed a query where I had to collect data from more child tables, for example:
main table: sessions (presentation sessions) (uid, name, ..)
1st child table: events with key session_id (uid, session_uid, date, time_start, time_end)
2nd child table: accessories_needed (laptop, projector, microphones, etc.) with key session_id (uid, session_uid, accessory_name)
3rd child table: session_presenters (presenter persons) with key session_id (uid, session_uid, presenter_name, address...)
Every session has multiple rows in the child tables (several time schedules, several accessories).
And I needed to collect them, for every session, into one row for display (some of the columns):
session_id | session_name | date | time_start | time_end | accessories | presenters
My solution (after many hours of experiments):
SELECT sessions.uid, sessions.name
,(SELECT GROUP_CONCAT( `events`.date ORDER BY `events`.date SEPARATOR '</li><li>')
FROM `events`
WHERE `events`.session_id = sessions.uid) AS date
,(SELECT GROUP_CONCAT( `events`.time_start ORDER BY `events`.date SEPARATOR '</li><li>')
FROM `events`
WHERE `events`.session_id = sessions.uid) AS time_start
,(SELECT GROUP_CONCAT( `events`.time_end ORDER BY `events`.date SEPARATOR '</li><li>')
FROM `events`
WHERE `events`.session_id = sessions.uid) AS time_end
,(SELECT GROUP_CONCAT( accessories.name ORDER BY accessories.name SEPARATOR '</li><li>')
FROM accessories
WHERE accessories.session_id = sessions.uid) AS accessories
,(SELECT GROUP_CONCAT( presenters.name ORDER BY presenters.name SEPARATOR '</li><li>')
FROM presenters
WHERE presenters.session_id = sessions.uid) AS presenters
FROM sessions
So no JOIN or GROUP BY needed.
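One reason the per-column subqueries help is that joining several child tables at once multiplies their rows before GROUP_CONCAT sees them. The sketch below illustrates this with Python's sqlite3 module instead of MySQL; the table names follow the query above, only two child tables are included, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (uid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE events (session_id INTEGER, date TEXT);
CREATE TABLE accessories (session_id INTEGER, name TEXT);
INSERT INTO sessions VALUES (1, 'Keynote');
INSERT INTO events VALUES (1, '2020-01-01'), (1, '2020-01-02');
INSERT INTO accessories VALUES (1, 'laptop'), (1, 'projector'), (1, 'microphone');
""")

# Joining both child tables yields 2 x 3 = 6 rows, so each value is repeated.
joined = conn.execute("""
    SELECT GROUP_CONCAT(e.date), GROUP_CONCAT(a.name)
    FROM sessions s
    JOIN events e ON e.session_id = s.uid
    JOIN accessories a ON a.session_id = s.uid
    GROUP BY s.uid
""").fetchone()

# One correlated subquery per child table aggregates each table on its own.
subqueried = conn.execute("""
    SELECT (SELECT GROUP_CONCAT(e.date) FROM events e WHERE e.session_id = s.uid),
           (SELECT GROUP_CONCAT(a.name) FROM accessories a WHERE a.session_id = s.uid)
    FROM sessions s
""").fetchone()

joined_dates = joined[0].split(",")
sub_dates = subqueried[0].split(",")
sub_accessories = subqueried[1].split(",")
print(len(joined_dates), sub_dates, sub_accessories)
```

With a join, GROUP_CONCAT(DISTINCT ...) would hide the repetition for values that happen to be unique, but it would also drop legitimate duplicates, so the subquery form is arguably the safer of the two here.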
Another useful trick for displaying the data nicely (when "echoing" it):
you can wrap events.date, time_start, time_end, etc. in "<UL><LI> ... </LI></UL>", so that the "</li><li>" used as separator in the query splits the results into list items.
I hope this helps someone. Cheers!