mySQL Multi Join from 2 Statements - mysql

I've found many similar questions but have not been able to understand / apply the answers; and I don't really know what to search for...
I have 2 tables (docs and words) which have a many to many relationship. I am trying to generate a list of the top 5 most frequently used words that DO NOT appear in a specified docs.
To this end I have 2 mySQL queries, each of which takes me part way to achieving my goal:
Query #1 - returns words sorted by frequency of use, falls short because it also returns ALL words (SQLFiddle.com)
SELECT `words_idwords` as wdID, COUNT(*) as freq
FROM docs_has_words
GROUP BY `words_idwords`
ORDER BY freq DESC, wdID ASC
Query #2 - returns words that are missing from specified document, falls short because it does not sort by frequency of use (SQLFiddle.com)
SELECT wordscol as wrd, idwords as wID
FROM `words` where NOT `idwords`
IN (SELECT `words_idwords` FROM `docs_has_words` WHERE `docs_iddocs` = 1)
But what I want the output to look like is:
idwords | wordscol | freq
-------------------------
| 8 | Dog | 3 |
| 3 | Ape | 2 |
| 4 | Bear | 1 |
| 6 | Cat | 1 |
| 7 | Cheetah | 1 |
| 5 | Beaver | 0 |
Note: `Dolphin`, one of the most frequently used words, is NOT in the
list because it is already in the document iddocs = 1
Note: `Beaver`, is a "never used word" BUT is in the list because it is
in the main word list
And the question is: how can I combine these to queries, or otherwise, get my desired output?
Basic requirements:
- 3 column output
- results sorted by frequency of use, even if use is zero
Updates:
In light of some comments, the approach that I was thinking of when I came up with the 2 queries was:
Step 1 - find all the words that are in the main word list but not used in document 1
Step 2 - rank words from Step 1 according to how many documents use them
Once I had the 2 queries I thought it would be easy to combine them with a where clause, but I just can't get it working.
A hack solution could be based on adding a dummy document that contains all the words and then subtract 1 from freq (but I'm not that much of a hack!).

I see now what the problem is. I was mislead by your statement regarding the results of the 1st query (emphasis is mine):
returns words sorted by frequency of use, falls short because it also returns ALL words
This query does not return all words, it only returns all used words.
So, you need to left join the words table on docs_has_words table to get all words and eliminate the words that are associated with doc 1:
SELECT w.idwords as wdID, w.wordscol, COUNT(d.words_idwords) as freq
FROM words w
LEFT JOIN `docs_has_words` d on w.idwords=d.words_idwords
WHERE w.idwords not in (SELECT `words_idwords` FROM `docs_has_words` WHERE `docs_iddocs` = 1)
GROUP BY w.idwords
ORDER BY freq DESC, wdID ASC;
See sqlfiddle

I think #Shadow has it right in his comment, you just need to add the where clause like this: sqlFiddle
SELECT
`words_idwords` as wdID,
COUNT(*) as freq
FROM docs_has_words
WHERE NOT `words_idwords` IN (SELECT `words_idwords` FROM `docs_has_words` WHERE `docs_iddocs` = 1)
GROUP BY `words_idwords`
ORDER BY freq DESC, wdID ASC
Does this produce the output you need?

Related

Mysql-> Group after rand()

I have the following table in Mysql
Name Age Group
abel 7 A
joe 6 A
Rick 7 A
Diana 5 B
Billy 6 B
Pat 5 B
I want to randomize the rows, but they should still remain grouped by the Group column.
For exmaple i want my result to look something like this.
Name Age Group
joe 6 A
abel 7 A
Rick 7 A
Billy 6 B
Pat 5 B
Diana 5 B
What query should i use to get this result? The entire table should be randomised and then grouped by "Group" column.
What you describe in your question as GROUPing is more correctly described as sorting. This is a particular issue when talking about SQL databases where "GROUP" means something quite different and determines the scope of aggregation operations.
Indeed "group" is a reserved word in SQL, so although mysql and some other SQL databases can work around this, it is a poor choice as an attribute name.
SELECT *
FROM yourtable
ORDER BY `group`
Using random values also has a lot of semantic confusion. A truly random number would have a different value every time it is retrieved - which would make any sorting impossible (and databases do a lot of sorting which is normally invisible to the user). As long as the implementation uses a finite time algorithm such as quicksort that shouldn't be a problem - but a bubble sort would never finish, and a merge sort could get very confused.
There are also degrees of randomness. There are different algorithms for generating random numbers. For encryption it's critical than the random numbers be evenly distributed and completely unpredictable - often these will use hardware events (sometimes even dedicated hardware) but I don't expect you would need that. But do you want the ordering to be repeatable across invocations?
SELECT *
FROM yourtable
ORDER BY `group`, RAND()
...will give different results each time.
OTOH
SELECT
FROM yourtable
ORDER BY `group`, MD5(CONCAT(age, name, `group`))
...would give the results always sorted in the same order. While
SELECT
FROM yourtable
ORDER BY `group`, MD5(CONCAT(DATE(), age, name, `group`))
...will give different results on different days.
DROP TABLE my_table;
CREATE TABLE my_table
(name VARCHAR(12) NOT NULL
,age INT NOT NULL
,my_group CHAR(1) NOT NULL
);
INSERT INTO my_table VALUES
('Abel',7,'A'),
('Joe',6,'A'),
('Rick',7,'A'),
('Diana',5,'B'),
('Billy',6,'B'),
('Pat',5,'B');
SELECT * FROM my_table ORDER BY my_group,RAND();
+-------+-----+----------+
| name | age | my_group |
+-------+-----+----------+
| Joe | 6 | A |
| Abel | 7 | A |
| Rick | 7 | A |
| Pat | 5 | B |
| Diana | 5 | B |
| Billy | 6 | B |
+-------+-----+----------+
Do the random first then sort by column group.
select Name, Age, Group
from (
select *
FROM yourtable
order by RAND()
) t
order by Group
Try this:
SELECT * FROM table order by Group,rand()

Count the number of occurences of a comma separated string

I am needing a way to count comma separated values like this - any
suggestions please?
Table:
id (int) | site (varchar)
1 | 1,2,3
2 | 2,3
3 | 1,3
Desired output:
site | # of occurrences
1 | 2
2 | 2
3 | 3
Without getting into exactly what you're doing, I'll assume you have a sites table. If so, it's technically achievable with something like
SELECT sites.site_id AS site, COUNT(1) AS `# of occurrences`
FROM sites
INNER JOIN table ON FIND_IN_SET(sites.site_id, table.site)
GROUP BY sites.site_id
Performance of that will be appalling, as there is no way to use an index, and the data will be able to get inconsistent very easily.
What the comments in your question are alluding to, is to use a relational table of some description, where instead of storing a comma-separated list, you store a row for each 'occurrence'

MySQL search query ordered by match relevance

I know basic MySQL querying, but I have no idea how to achieve an accurate and relevant search query.
My table look like this:
id | kanji
-------------
1 | 一子
2 | 一人子
3 | 一私人
4 | 一時
5 | 一時逃れ
I already have this query:
SELECT * FROM `definition` WHERE `kanji` LIKE '%一%'
The problem is that I want to order the results from the learnt characters, 一 being a required character for the results of this query.
Say, a user knows those characters: 人,子,時
Then, I want the results to be ordered that way:
id | kanji
-------------
2 | 一人子
1 | 一子
4 | 一時
3 | 一私人
5 | 一時逃れ
The result which matches the most learnt characters should be first. If possible, I'd like to show results that contain only learnt characters first, then a mix of learnt and unknown characters.
How do I do that?
Per your preference, ordering by number of unmatched characters (increasing), and then number of matched character (decreasing).
SELECT *,
(kanji LIKE '%人%')
+ (kanji LIKE '%子%')
+ (kanji LIKE '%時%') score
FROM kanji
ORDER BY CHAR_LENGTH(kanji) - score, score DESC
Or, the relational way to do it is to normalize. Create the table like this:
kanji_characters
kanji_id | index | character
----------------------------
1 | 0 | 一
1 | 1 | 子
2 | 0 | 一
2 | 1 | 人
2 | 2 | 子
...
Then
SELECT kanji_id,
COUNT(*) length,
SUM(CASE WHEN character IN ('人','子','時') THEN 1 END) score
FROM kanji_characters
WHERE index <> 0
AND kanji_id IN (SELECT kanji_id FROM kanji_characters WHERE index = 0 AND character = '一')
GROUP BY kanji_id
ORDER BY length - score, score DESC
Though you didn't specify what should be done in the case of duplicate characters. The two solutions above handle that differently.
Just a thought, but a text index may help, you can get a score back like this:
SELECT match(kanji) against ('your search' in natural language mode) as rank
FROM `definition` WHERE match(`kanji`) against ('your search' in natural language mode)
order by rank, length(kanji)
The trick is to index these terms (or words?) the right way. I think the general trick is to encapsulate each word with double quotes and make a space between each. This way the tokenizer will populate the index the way you want. Of course you would need to add/remove the quotes on the way in/out respectively.
Hope this doesn't bog you down.

What can an aggregate function do in the ORDER BY clause?

Lets say I have a plant table:
id fruit
1 banana
2 apple
3 orange
I can do these
SELECT * FROM plant ORDER BY id;
SELECT * FROM plant ORDER BY fruit DESC;
which does the obvious thing.
But I was bitten by this, what does this do?
SELECT * FROM plant ORDER BY SUM(id);
SELECT * FROM plant ORDER BY COUNT(fruit);
SELECT * FROM plant ORDER BY COUNT(*);
SELECT * FROM plant ORDER BY SUM(1) DESC;
All these return just the first row (which is with id = 1).
What's happening underhood?
What are the scenarios where aggregate function will come in handy in ORDER BY?
Your results are more clear if you actually select the aggregate values instead of columns from the table:
SELECT SUM(id) FROM plant ORDER BY SUM(id)
This will return the sum of all id's. This is of course a useless example because the aggregation will always create only one row, hence no need for ordering. The reason you get a row qith columns in your query is because MySQL picks one row, not at random but not deterministic either. It just so happens that it is the first column in the table in your case, but others may get another row depending on storage engine, primary keys and so on. Aggregation only in the ORDER BY clause is thus not very useful.
What you usually want to do is grouping by a certain field and then order the result set in some way:
SELECT fruit, COUNT(*)
FROM plant
GROUP BY fruit
ORDER BY COUNT(*)
Now that's a more interesting query! This will give you one row for each fruit together with the total count for that fruit. Try adding some more apples and the ordering will actually start making sense:
Complete table:
+----+--------+
| id | fruit |
+----+--------+
| 1 | banana |
| 2 | apple |
| 3 | orange |
| 4 | apple |
| 5 | apple |
| 6 | banana |
+----+--------+
The query above:
+--------+----------+
| fruit | COUNT(*) |
+--------+----------+
| orange | 1 |
| banana | 2 |
| apple | 3 |
+--------+----------+
All these queries will all give you a syntax error on any SQL platform that complies with SQL standards.
SELECT * FROM plant ORDER BY SUM(id);
SELECT * FROM plant ORDER BY COUNT(fruit);
SELECT * FROM plant ORDER BY COUNT(*);
SELECT * FROM plant ORDER BY SUM(1) DESC;
On PostgreSQL, for example, all those queries will raise the same error.
ERROR: column "plant.id" must appear in the GROUP BY clause or be
used in an aggregate function
That means you're using a domain aggregate function without using GROUP BY. SQL Server and Oracle return similar error messages.
MySQL's GROUP BY is known to be broken in several respects, at least as far as standard behavior is concerned. But the queries you posted were a new broken behavior to me, so +1 for that.
Instead of trying to understand what it's doing under the hood, you're probably better off learning to write standard GROUP BY queries. MySQL will process standard GROUP BY statements correctly, as far as I know.
Earlier versions of MySQL docs warned you about GROUP BY and hidden columns. (I don't have a reference, but this text is cited all over the place.)
Do not use this feature if the columns you omit from the GROUP BY part
are not constant in the group. The server is free to return any value
from the group, so the results are indeterminate unless all values are
the same.
More recent versions are a little different.
You can use this feature to get better performance by avoiding
unnecessary column sorting and grouping. However, this is useful
primarily when all values in each nonaggregated column not named in
the GROUP BY are the same for each group. The server is free to choose
any value from each group, so unless they are the same, the values
chosen are indeterminate.
Personally, I don't consider indeterminate a feature in SQL.
When you use an aggregate like that, the query gets an implicit group by where the entire result is a single group.
Using an aggregate in order by is only useful if you also have a group by, so that you can have more than one row in the result.

Max occurences of a given value in a table

I have a table (pretty big one) with lots of columns, two of them being "post" and "user".
For a given "post", I want to know which "user" posted the most.
I was first thinking about getting all the entries WHERE (post='wanted_post') and then throw a PHP hack to find which "user" value I get the most, but given the large size of my table, and my poor knowledge of MySQL subtle calls, I am looking for a pure-MySQL way to get this value (the "user" id that posted the most on a given "post", basically).
Is it possible ? Or should I fall back on the hybrid SQL-PHP solution ?
Thanks,
Cystack
It sounds like this is what you want... am I missing something?
SELECT user
FROM myTable
WHERE post='wanted_post'
GROUP BY user
ORDER BY COUNT(*) DESC
LIMIT 1;
EDIT: Explanation of what this query does:
Hopefully the first three lines make sense to anyone familiar with SQL. It's the last three lines that do the fun stuff.
GROUP BY user -- This collapses rows with identical values in the user column. If this was the last line in the query, we might expect output something like this:
+-------+
| user |
+-------+
| bob |
| alice |
| joe |
ORDER BY COUNT(*) DESC -- COUNT(*) is an aggregate function, that works along with the previous GROUP BY clause. It tallies all of the rows that are "collapsed" by the GROUP BY for each user. It might be easier to understand what it's doing with a slightly modified statement, and it's potential output:
SELECT user,COUNT(*)
FROM myTable
WHERE post='wanted_post'
GROUP BY user;
+-------+-------+
| user | count |
+-------+-------+
| bob | 3 |
| alice | 1 |
| joe | 8 |
This is showing the number of posts per user.
However, it's not strictly necessary to actually output the value of an aggregate function in this case--we can just use it for the ordering, and never actually output the data. (Of course if you want to know how many posts your top-poster posted, maybe you do want to include it in your output, as well.)
The DESC keyword tells the database to sort in descending order, rather than the default of ascending order.
Naturally, the sorted output would look something like this (assuming we leave the COUNT(*) in the SELECT list):
+-------+-------+
| user | count |
+-------+-------+
| joe | 8 |
| bob | 3 |
| alice | 1 |
LIMIT 1 -- This is probably the easiest to understand, as it just limits how many rows are returned. Since we're sorting the list from most-posts to fewest-posts, and we only want the top poster, we just need the first result. If you wanted the top 3 posters, you might instead use LIMIT 3.