Score algorithm in multiple join - mysql

I have a list of publications stored in a publications table. Each publication has a many-to-many relation with categories and also a many-to-many relation with keywords.
Given a publication I'd like to find related ones based on a score value computed with the following algorithm:
* each category shared with another publication counts as one point
* each keyword shared with another publication counts as one point
* the score is the sum of the points computed in the previous steps
I want to retrieve with a single query the list of related publications ordered by this score.
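To make the scoring rule concrete, here is a minimal sketch in plain Python. The dictionaries and the function name related_scores are hypothetical and only illustrate the rule (shared categories plus shared keywords); they are not part of the real schema.

```python
# Hypothetical in-memory data: publication id -> set of category / keyword ids.
categories = {
    1: {10, 11, 12},
    2: {10, 11},
    3: {12},
    4: {13},
}
keywords = {
    1: {100, 101},
    2: {100},
    3: {100, 101},
    4: {102},
}


def related_scores(current_id):
    """Return [(publication_id, score)] for related publications, best first."""
    current_categories = categories.get(current_id, set())
    current_keywords = keywords.get(current_id, set())
    scores = {}
    for pub_id in categories.keys() | keywords.keys():
        if pub_id == current_id:
            continue
        # One point per shared category plus one point per shared keyword.
        shared_categories = categories.get(pub_id, set()) & current_categories
        shared_keywords = keywords.get(pub_id, set()) & current_keywords
        score = len(shared_categories) + len(shared_keywords)
        if score > 0:
            scores[pub_id] = score
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = related_scores(1)
print(ranking)
```

Publication 4 shares nothing with publication 1, so it does not appear in the ranking at all.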
Now I have these two queries, which compute the score for categories and for keywords:
SELECT c.publication_id, (COUNT(c.category_id)) AS cscore
FROM cat_pub c
WHERE c.category_id IN <list of category ids obtained from the current publication>
GROUP BY c.publication_id
ORDER BY cscore DESC
and for the keyword score
SELECT k.publication_id, (COUNT(k.keyword_id)) AS kscore
FROM key_pub k
WHERE k.keyword_id IN <list of keyword ids obtained from the current publication>
GROUP BY k.publication_id
ORDER BY kscore DESC
Finally, I need to JOIN the resulting queries with a SELECT query that retrieves the publication data (title, intro, etc.), ordering the rows by score and using a LIMIT clause to get the most relevant publications related to the selected one.
Currently I tried to use these two queries as subtables in a join:
SELECT p.*, (q1.cscore + q2.kscore) AS score
FROM publications p
INNER JOIN (<cscore query>) q1 ON p.id = q1.publication_id
INNER JOIN (<kscore query>) q2 ON p.id = q2.publication_id
ORDER BY score DESC
LIMIT 5
EXPLAIN shows me that a couple of temporary tables will be used. Could that be a performance problem? Is there a better way to implement this?
Update
In reply to Johan's comment:
Your solution is wrong. Using a LIMIT clause in the subqueries can lead to inconsistent results for any value of the limit. Suppose I have the following results for the subqueries (I'll show 11 records, but your query will fetch only the first ten):
+-------+--------+    +-------+--------+
| p.id  | cscore |    | p.id  | kscore |
+-------+--------+    +-------+--------+
| 27854 |    100 |    | 27865 |    100 |
| 27853 |    100 |    | 27864 |    100 |
| 27852 |    100 |    | 27863 |    100 |
| 27851 |    100 |    | 27862 |    100 |
| 27850 |    100 |    | 27861 |    100 |
| 27849 |    100 |    | 27860 |    100 |
| 27848 |    100 |    | 27859 |    100 |
| 27847 |    100 |    | 27858 |    100 |
| 27846 |    100 |    | 27857 |    100 |
| 27845 |    100 |    | 27856 |    100 |
| 27844 |    100 |    | 27855 |    100 |
|  1000 |     99 |    |  1000 |     99 |
+-------+--------+    +-------+--------+
If I have ten records with 100 as cscore and ten different records with 100 as kscore, the join will produce an empty set. So I get no result at all, while the publication with id 1000 should be the answer and it is left out of the result set.
Furthermore, I could consider your solution with a LEFT JOIN. In that case only records from the left table will be fetched, and each record will get a total score of 100 (because of the NULL given by the empty kscore field in the second table). Again, the result is wrong, because the highest-scored record should be p1000 with a total score of 198 (= 99 + 99).
Your solution cannot produce reliable results.
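The failure mode described in this update can be reproduced on a small scale. The sketch below uses Python's sqlite3 module rather than MySQL, invented sample rows, and a limit of 2 standing in for the limit of ten discussed above; the id lists in the IN clauses are hypothetical stand-ins for the current publication's categories and keywords. It also shows one possible alternative, which is an assumption and not something proposed in the thread: collect the matching rows from both link tables with UNION ALL and count them per publication before applying any LIMIT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cat_pub (publication_id INTEGER, category_id INTEGER);
CREATE TABLE key_pub (publication_id INTEGER, keyword_id INTEGER);
-- Publications 1 and 2 match three categories but no keywords.
INSERT INTO cat_pub VALUES (1,10),(1,11),(1,12),(2,10),(2,11),(2,12);
-- Publications 3 and 4 match three keywords but no categories.
INSERT INTO key_pub VALUES (3,100),(3,101),(3,102),(4,100),(4,101),(4,102);
-- Publication 5 matches two categories and two keywords.
INSERT INTO cat_pub VALUES (5,10),(5,11);
INSERT INTO key_pub VALUES (5,100),(5,101);
""")

CSCORE = """
SELECT publication_id, COUNT(*) AS cscore FROM cat_pub
WHERE category_id IN (10, 11, 12)
GROUP BY publication_id ORDER BY cscore DESC, publication_id
"""
KSCORE = """
SELECT publication_id, COUNT(*) AS kscore FROM key_pub
WHERE keyword_id IN (100, 101, 102)
GROUP BY publication_id ORDER BY kscore DESC, publication_id
"""

# LIMIT inside each subquery: the two top lists do not overlap.
limited = conn.execute(f"""
SELECT q1.publication_id, q1.cscore + q2.kscore AS score
FROM ({CSCORE} LIMIT 2) q1
INNER JOIN ({KSCORE} LIMIT 2) q2 ON q1.publication_id = q2.publication_id
""").fetchall()

# Alternative sketch: one row per match, counted per publication, LIMIT last.
combined = conn.execute("""
SELECT publication_id, COUNT(*) AS score
FROM (
    SELECT publication_id FROM cat_pub WHERE category_id IN (10, 11, 12)
    UNION ALL
    SELECT publication_id FROM key_pub WHERE keyword_id IN (100, 101, 102)
) AS matches
GROUP BY publication_id
ORDER BY score DESC, publication_id
LIMIT 5
""").fetchall()

print(limited)
print(combined)
```

With the limited subqueries the join is empty, while the combined count ranks publication 5 first and still keeps publications that match only categories or only keywords.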

You only want 5 results each from the subqueries.
I think it is best to select only 5 from them and use that in the query.
Rewrite q1 as:
SELECT c.publication_id, COUNT(*) AS cscore
FROM cat_pub c
WHERE c.publication_id = p.id
AND c.category_id IN <list of category ids obtained from the current publication>
GROUP BY c.publication_id
ORDER BY cscore DESC
LIMIT 10
Rewrite q2 as:
SELECT k.publication_id, COUNT(*) AS kscore
FROM key_pub k
WHERE p.id = k.publication_id
AND k.keyword_id IN <list of keyword ids obtained from the current publication>
GROUP BY k.publication_id
ORDER BY kscore DESC
LIMIT 10
Leave the join as is:
SELECT p.*, (q1.cscore + q2.kscore) AS score
FROM publications p
INNER JOIN (<cscore query>) q1 ON p.id = q1.publication_id
INNER JOIN (<kscore query>) q2 ON p.id = q2.publication_id
ORDER BY score DESC
LIMIT 5
Note that COUNT(*) is usually a faster choice, because it does not test for NULL. If you can have NULL values and don't want to include them in the count, then name the field explicitly with COUNT(field).
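The difference between the two forms of COUNT is easy to check. This is a small sketch using Python's sqlite3 module with invented rows (one of them has a NULL category_id); it only illustrates the NULL handling and says nothing about speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cat_pub (publication_id INTEGER, category_id INTEGER);
-- Hypothetical rows; the second one has no category id.
INSERT INTO cat_pub VALUES (1, 10), (1, NULL), (1, 12);
""")

# COUNT(*) counts rows; COUNT(column) skips rows where the column is NULL.
count_all, count_col = conn.execute(
    "SELECT COUNT(*), COUNT(category_id) FROM cat_pub WHERE publication_id = 1"
).fetchone()
print(count_all, count_col)
```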

Related

MySQL. Group matches in columns by a common id

I am coding a search page with multiple filters and I am wondering if this is the best approach to get the results.
Each result of the search has several attributes, here I am using two attributes to simplify the example.
The main 'items' table:
id_items
1
2
The 'languages' table:
id_languages | language_code
1 es
2 en
The 'attributes_one' table:
id_attributes_one
1
2
The 'attributes_one_translations' table:
id_attributes_one_translations | id_attributes_one | id_language_code | translation
1 | 1 | 1 | Oro
2 | 1 | 2 | Gold
3 | 2 | 1 | Plata
4 | 2 | 2 | Silver
The 'attributes_one_match' table:
id_attributes_one_match | id_attributes_one | id_items
1 | 1 | 1
2 | 2 | 1
3 | 1 | 2
The 'attributes_two' table:
id_attributes_two
1
The 'attributes_two_translations' table:
id_attributes_two_translations | id_attributes_two | id_language_code | translation
1 | 1 | 2 | 99% gold
The 'attributes_two_match' table:
id_attributes_two_match | id_attributes_two | id_items
1 | 1 | 1
The concept is that one item can have 0 or more matches in each attribute table, and each match can have 0 or more translations.
Here is the query I use when the user selects filters, to get all the items that have the attribute_one 'Gold' or 'Silver', ordered by this attribute ascending:
SELECT
i.id_items AS id,
GROUP_CONCAT(DISTINCT aot.translation ORDER BY aot.translation DESC SEPARATOR '!¡') AS attribute_one,
GROUP_CONCAT(DISTINCT att.translation ORDER BY att.translation DESC SEPARATOR '!¡') AS attribute_two
FROM
items i
LEFT JOIN
languages AS l ON l.language_code = 'en'
LEFT JOIN
attributes_one_match AS aom ON aom.id_items = i.id_items
LEFT JOIN
attributes_one_translations AS aot ON aot.id_attributes_one = aom.id_attributes_one
AND l.id_languages = aot.id_language_code
AND (MATCH (aot.translation) AGAINST ('"Gold"' IN BOOLEAN MODE)
OR MATCH (aot.translation) AGAINST ('"Silver"' IN BOOLEAN MODE))
LEFT JOIN
attributes_one AS ao ON ao.id_attributes_one = aom.id_attributes_one
LEFT JOIN
attributes_two_match AS atm ON atm.id_items = i.id_items
LEFT JOIN
attributes_two_translations AS att ON att.id_attributes_two = atm.id_attributes_two
AND l.id_languages = att.id_language_code
LEFT JOIN
attributes_two AS at ON at.id_attributes_two = atm.id_attributes_two
GROUP BY id
ORDER BY 2 ASC
The result I get is:
id | attribute_one | attribute_two
2 | Gold | null
1 | Silver!¡Gold | 99% gold
That result is what I was expecting. Now:
* The items table will have around 300k entries once the database is filled.
* There are 28 attribute tables to match against the items. Each attribute table will have around 20k entries, and each translation table will have twice as many entries as the table it translates.
* Each item will have from 0 to 20 matches in each attribute table, so I think I won't have problems using the GROUP_CONCAT function.
I am concerned about performance because the search filter page updates itself via AJAX each time the user changes one of the filters (it updates both the filters and the results). The maximum number of results per page will be 1000 items; I left the LIMIT out of the example query.
I am not an SQL expert, so I don't really know if what I am doing is the best approach. I would appreciate some feedback.
Thanks a lot!

inner join on multiple tables, count & distinct

I have 3 tables and I am trying to join them with an inner join. However, when I use count(distinct column_id), MySQL throws this error:
SQL syntax : check
for the right syntax to use near '(DISTINCT as_ticket.vehicle_id) FROM as_vehicle INNER JOIN as_ticket
My Query
SELECT
`as_vehicle`.`make`, `as_vehicle`.`model`, `as_odometer`.`value`
COUNT (DISTINCT `as_ticket`.`vehicle_id`)
FROM `as_vehicle`
INNER JOIN `as_ticket`
ON `as_vehicle`.`vehicle_id` = `as_ticket`.`vehicle_id`
INNER JOIN `as_odometer`
ON `as_odometer`.`vehicle_id` = `as_vehicle`.`vehicle_id`
WHERE `as_ticket`.`vehicle_id` = 7
ORDER BY `as_odometer`.`value`
DESC
Tbl as_vehicle
+------------+----------+---------+
| vehicle_id | make     | model   |
+------------+----------+---------+
| 1          | HYUNDAI  | SOLARIS |
| 2          | A638EA15 | ACCENT  |
+------------+----------+---------+
Tbl as_odometer;
+------------+-------+
| vehicle_id | value |
+------------+-------+
| 1 | 10500 |
| 5 | 20000 |
| 1 | 20000 |
+------------+-------+
Tbl service
+-----------+------------+
| ticket_id | vehicle_id |
+-----------+------------+
| 1 | 1 |
| 2 | 1 |
+-----------+------------+
You forgot a comma before count.
SELECT `as_vehicle`.`make`, `as_vehicle`.`model`, `as_odometer`.`value`,
count(DISTINCT `as_ticket`.`vehicle_id`) // here ---^
First, you should not have a space after the count() and you have a missing comma (as already noted). More importantly, you don't have a group by, so your query will return one row.
And, because of the where clause, the value will always be "1". You have restricted the query to just one vehicle id.
I suspect the query you want is more like:
SELECT `as_vehicle`.`make`, `as_vehicle`.`model`, `as_odometer`.`value`,
COUNT(*)
FROM `as_vehicle` INNER JOIN
`as_ticket`
ON `as_vehicle`.`vehicle_id` = `as_ticket`.`vehicle_id` INNER JOIN
`as_odometer`
ON `as_odometer`.`vehicle_id` = `as_vehicle`.`vehicle_id`
WHERE `as_ticket`.`vehicle_id` = 7
GROUP BY `as_vehicle`.`make`, `as_vehicle`.`model`, `as_odometer`.`value`
ORDER BY `as_odometer`.`value` DESC;
Also, you should learn to use table aliases; all those backquotes don't help the query.
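For illustration, here is a sketch of the grouped query written with table aliases and run against the sample rows from the question, using Python's sqlite3 module instead of MySQL. Two assumptions: the ticket table is named as_ticket as in the query (the sample calls it "service"), and the filter uses vehicle 1 because the sample data contains no vehicle 7.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE as_vehicle (vehicle_id INTEGER, make TEXT, model TEXT);
CREATE TABLE as_odometer (vehicle_id INTEGER, value INTEGER);
CREATE TABLE as_ticket (ticket_id INTEGER, vehicle_id INTEGER);
INSERT INTO as_vehicle VALUES (1, 'HYUNDAI', 'SOLARIS'), (2, 'A638EA15', 'ACCENT');
INSERT INTO as_odometer VALUES (1, 10500), (5, 20000), (1, 20000);
INSERT INTO as_ticket VALUES (1, 1), (2, 1);
""")

rows = conn.execute("""
SELECT v.make, v.model, o.value,
       COUNT(*) AS joined_rows,                           -- tickets per odometer row
       COUNT(DISTINCT t.vehicle_id) AS distinct_vehicles  -- 1 for a single vehicle
FROM as_vehicle v
INNER JOIN as_ticket t ON v.vehicle_id = t.vehicle_id
INNER JOIN as_odometer o ON o.vehicle_id = v.vehicle_id
WHERE t.vehicle_id = 1
GROUP BY v.make, v.model, o.value
ORDER BY o.value DESC
""").fetchall()

for row in rows:
    print(row)
```

Because the WHERE clause restricts the join to one vehicle, COUNT(DISTINCT t.vehicle_id) is 1 in every group, while COUNT(*) reflects the two tickets joined to each odometer reading.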

Order table by presences in a third table

I have a movie database with a table for actors and another one for movies. I created a third table to record an actor's participation in a movie, and I added a field "star" to distinguish leading actors from non-leading actors.
I want to create a list ordered by the actors' importance, that is, by the total number of "stars".
SELECT a.id, a.name, COUNT( p.star ) AS star
FROM actors a
JOIN playing p, movies m
WHERE p.id_actor = a.id
AND p.id_movie = m.id
AND p.star =1
GROUP BY p.star
ORDER BY p.star DESC;
ACTORS
+----+---------+
| id | name |
+----+---------+
| 1 | actor01 |
| 2 | actor02 |
| 3 | actor03 |
+----+---------+
MOVIES
+----+----------+
| id | title |
+----+----------+
| 1 | movie01 |
| 2 | movie02 |
| 3 | movie03 |
+----+----------+
PLAYING
+----------+----------+-------+------+
| id_movie | id_actor | char | star |
+----------+----------+-------+------+
| 1 | 1 | char1 | 0 |
| 1 | 2 | char2 | 1 |
| 2 | 3 | char3 | 1 |
+----------+----------+-------+------+
I need output like:
+----------+--------------+
| actor | protagonist |
+----------+--------------+
| actor01 | 2 times |
| actor02 | 3 times |
+----------+--------------+
You need to fix the GROUP BY clause to group by the actor, not the star column. You also need to fix the ORDER BY to order by the aggregated column, not the original column:
SELECT a.id, a.name, sum( p.star = 1) AS stars
FROM actors a join playing p
on p.id_actor = a.id join
movies m
on p.id_movie = m.id
GROUP BY a.id, a.name
ORDER BY stars DESC;
Along the way, I fixed the FROM clause so it uses proper join syntax (with an ON clause). I also changed the query so it returns every actor who has at least one row in playing, even those who have never been the star.
1. If you want to count all stars for an actor, you should group by actor, not by star. (Unless you want to count how many times an actor gets 1 star in a movie, you may not want to group by star.)
2. You may want to use ON with JOIN.
3. You may want to ORDER BY star rather than ORDER BY p.star, since you want to order by the result.
4. You may want to use SUM instead of COUNT to get the star counts. (SUM adds up the values, while COUNT counts the rows. With SUM, you can set the star value to whatever you want without changing your SQL. You can have star=2, which shows the actor is important to the movie, or star=-1, which means the actor stinks.)
You may have a look at the SQL below:
SELECT a.id, a.name, SUM( p.star ) AS sum
FROM actors a
LEFT JOIN playing p ON p.id_actor = a.id
LEFT JOIN movies m ON p.id_movie = m.id
GROUP BY a.id
ORDER BY sum DESC;
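The practical difference between the inner-join query and the LEFT JOIN query shows up for actors with no rows in playing. The sketch below uses Python's sqlite3 module with the sample rows from the question plus a hypothetical actor04 who has no roles; the char column is renamed to role to avoid clashing with a type name, the join to movies is left out because no movie column is selected, and COALESCE is an addition that turns the NULL sum into 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE actors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE playing (id_movie INTEGER, id_actor INTEGER, role TEXT, star INTEGER);
-- actor04 is a hypothetical actor with no rows in playing.
INSERT INTO actors VALUES (1,'actor01'),(2,'actor02'),(3,'actor03'),(4,'actor04');
INSERT INTO playing VALUES (1,1,'char1',0),(1,2,'char2',1),(2,3,'char3',1);
""")

# Inner join: only actors that have at least one row in playing.
inner = dict(conn.execute("""
SELECT a.name, SUM(p.star = 1) AS stars
FROM actors a JOIN playing p ON p.id_actor = a.id
GROUP BY a.id, a.name
""").fetchall())

# Left join: every actor; COALESCE replaces the NULL sum with 0.
left = dict(conn.execute("""
SELECT a.name, COALESCE(SUM(p.star), 0) AS stars
FROM actors a LEFT JOIN playing p ON p.id_actor = a.id
GROUP BY a.id, a.name
""").fetchall())

print(inner)
print(left)
```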

Show all grouped results and sort

I have a table, like that one:
| B | 1 |
| C | 2 |
| B | 2 |
| A | 2 |
| C | 3 |
| A | 2 |
I would like to fetch it, but sorted and grouped. That is, I would like it grouped by the letter, but sorted by the highest sum of the group. Also, I want to show all entries within the group:
| C | 3 |
| C | 2 |
| A | 2 |
| A | 2 |
| B | 2 |
| B | 1 |
The order is that way because C has 3 and 2. 3+2=5, which is higher than 2+2=4 for A which in turn is higher than 2+1=3 for B.
I need to show all "grouped" letters because there are other columns that are distinct all of which I need shown.
EDIT:
Thanks for the quick reply. I have the audacity, however, to inquire further.
I have this query:
SELECT * FROM `ip_log` WHERE `IP` IN
(SELECT `IP` FROM `ip_log` GROUP BY `IP` HAVING COUNT(DISTINCT `uid`) > 1)
GROUP BY `uid` ORDER BY `IP`
The letters in the upper description are ip (I need it grouped by the IP addresses) and the numbers are timestamp (I need it sorted by the sum (or just used as the sorting parameter)). Should I create a temporary table and then use the solution below?
select t.Letter, t.Value
from MyTable t
inner join (
select Letter, sum(Value) as ValueSum
from MyTable
group by Letter
) ts on t.Letter = ts.Letter
order by ts.ValueSum desc, t.Letter, t.Value desc
If your table's columns are letter and number, I would approach it as follows:
SELECT
letter,
GROUP_CONCAT(number ORDER BY number DESC),
SUM(number) AS total
FROM table
GROUP BY letter
ORDER BY total desc
What you will get, based on your example is the following:
| C | 3,2 | 5
| A | 2,2 | 4
| B | 2,1 | 3
You can then process that data to get the actual information you want/need.
If you still want the data in the format you originally requested, it is not possible without a subquery. The reason is that you can't sort by aggregated data (the SUM of the number column) that you are not calculating in the same query. So you will need a subquery to calculate it and feed it back into the original query (disclaimer: untested query):
SELECT
letter,
number
FROM table
JOIN (SELECT letter AS ltr, SUM(number) AS total FROM table GROUP BY letter) AS totals
ON table.letter = totals.ltr
ORDER BY totals.total desc, letter desc, number desc
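The join-against-totals idea can be checked on the six sample rows from the question. This sketch uses Python's sqlite3 module; the table name my_table is an assumption (table itself is a reserved word), and the letter is sorted ascending only as a tie-breaker between groups with equal totals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE my_table (letter TEXT, number INTEGER);
INSERT INTO my_table VALUES ('B',1),('C',2),('B',2),('A',2),('C',3),('A',2);
""")

rows = conn.execute("""
SELECT t.letter, t.number
FROM my_table t
JOIN (SELECT letter, SUM(number) AS total
      FROM my_table
      GROUP BY letter) AS totals
  ON t.letter = totals.letter
-- Groups with the highest sum first, then the rows inside each group.
ORDER BY totals.total DESC, t.letter, t.number DESC
""").fetchall()

for row in rows:
    print(row)
```

Every original row is kept, and the groups come out in the order C (5), A (4), B (3), as requested in the question.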

MySQL nested query

I have 2 tables, one containing products and one containing purchases. I'm trying to find the average of the last 3 purchase prices for each product. So in the example below, for the product 'beans' I would want to return the average of the last 3 purchase prices before the product time 1230854764, i.e. the average of customers B, C and D (239).
Products
+-------+------------+
| Name | time |
+-------+------------+
| beans | 1230854764 |
+-------+------------+
Purchases
+----------+------------+-------+
| Customer | time | price |
+----------+------------+-------+
| B | 1230854661 | 207 |
| C | 1230854662 | 444 |
| D | 1230854663 | 66 |
| E | 1230854764 | 88 |
| A | 1230854660 | 155 |
+----------+------------+-------+
I've come up with a nested select query which nearly gets me there i.e. it works if I hard code the time:
SELECT products.name,
  (SELECT avg(temp.price)
   FROM (select purchases.price
         from purchases
         WHERE purchases.time < 1230854764
         order by purchases.time desc
         limit 3) temp) as av_price
from products products
from products products
But if the query references products.time rather than a hard-coded time, as below, I get an error saying that the column products.time does not exist.
SELECT products.name,
  (SELECT avg(temp.price)
   FROM (select purchases.price
         from purchases
         WHERE purchases.time < products.time
         order by purchases.time desc
         limit 3) temp) as av_price
from products products
I'm not sure if I'm making a simple mistake with nested queries or if I'm going about this in totally the wrong way and should be using joins or some other construct. Either way, I'm stuck. Any help would be gratefully received.
The only problem in your query is that you haven't mentioned the products table in your inner query.
SELECT products.name,(SELECT avg(temp.price)
FROM (select purchases.price from purchases,products
WHERE purchases.time < products.time order by purchases.time desc limit 3) temp) as
av_price from products products
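Note that listing products inside the inner query joins purchases against every product instead of against the outer row, so with more than one product each row would receive the same average. As a different approach, not taken from this thread, here is a sketch using a window function, run with Python's sqlite3 module on the sample rows. It assumes window-function support (SQLite 3.25+, or MySQL 8.0+ for the equivalent SQL), and products with no earlier purchase are dropped by the inner join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (name TEXT, time INTEGER);
CREATE TABLE purchases (customer TEXT, time INTEGER, price INTEGER);
INSERT INTO products VALUES ('beans', 1230854764);
INSERT INTO purchases VALUES
  ('B', 1230854661, 207), ('C', 1230854662, 444), ('D', 1230854663, 66),
  ('E', 1230854764, 88), ('A', 1230854660, 155);
""")

rows = conn.execute("""
SELECT name, AVG(price) AS av_price
FROM (
    -- Rank each product's earlier purchases, newest first.
    SELECT pr.name, pu.price,
           ROW_NUMBER() OVER (PARTITION BY pr.name ORDER BY pu.time DESC) AS rn
    FROM products pr
    JOIN purchases pu ON pu.time < pr.time
) AS ranked
WHERE rn <= 3          -- keep only the last three purchases per product
GROUP BY name
""").fetchall()

print(rows)
```

For 'beans' the three most recent purchases before 1230854764 are those of customers D, C and B, which average to 239.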