mysql SORT BY amount of unique word matches - mysql

I've found many questions that ask about the number of appearances, but none that ask exactly what I want to do.
A dynamically generated (prepared-statement) query will result in something like this:
SELECT * FROM products WHERE
( title LIKE ? AND title LIKE ? ) AND
( content LIKE ? OR content LIKE ? ) AND
( subtitle LIKE ? AND author LIKE ? )
ORDER BY relevance LIMIT ?,?
The number of words entered (and therefore the number of LIKE conditions) varies for title, content and author, depending on the search query.
Now I've added an ORDER BY relevance, but I want this order to be the number of unique words from the content field that match. (Note: not the number of appearances, but the number of entered strings that have at least one match in the content column.)
Example table products:
id | title | subtitle | content
------------------------------------
1 | animals | cat | swim swim swim swim swim swim swim
2 | canimal | fish | some content
3 | food | roasted | some content
4 | animal | cat | swim better better swims better something else
5 | animal | cat | dogs swim better
Example query (with prepared statements ? filled in):
SELECT * FROM products WHERE
( title LIKE '%animal%' ) AND
( content LIKE '%dog%' OR content LIKE '%swim%' OR content LIKE '%better%' ) AND
( subtitle LIKE '%cat%' )
ORDER BY relevance LIMIT 0,10
Expected results (in correct order!):
id | amount of matches
-----------------
5 | 3 (dog, swim, better)
4 | 2 (swim, better)
1 | 1 (swim)
I have an InnoDB table and a MySQL version lower than 5.6, so I can't use MATCH ... AGAINST.
I was thinking this could be solved with CASE WHEN ... THEN, but I have no idea how to build this sorting.

You can do it in several ways, for example:
ORDER BY SIGN(LOCATE('dog',content))+
SIGN(LOCATE('swim',content))+
SIGN(LOCATE('better',content)) DESC
or with CASE
ORDER BY
CASE WHEN content LIKE '%dog%'
THEN 1
ELSE 0
END
+
CASE WHEN content LIKE '%swim%'
THEN 1
ELSE 0
END
+
CASE WHEN content LIKE '%better%'
THEN 1
ELSE 0
END
DESC
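Since the question builds its query dynamically with prepared statements, the CASE sum above can be generated from the list of search words in the same way. The following is only a sketch under assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL (the CASE WHEN ... LIKE expression reads the same in both), and the function name search_by_unique_matches and the alias match_count are made up for this illustration.

```python
import sqlite3

ROWS = [
    (1, "animals", "cat", "swim swim swim swim swim swim swim"),
    (2, "canimal", "fish", "some content"),
    (3, "food", "roasted", "some content"),
    (4, "animal", "cat", "swim better better swims better something else"),
    (5, "animal", "cat", "dogs swim better"),
]


def search_by_unique_matches(conn, title, subtitle, words):
    """Return (id, match_count) rows, most distinct content words first.

    Assumes `words` is a non-empty list of search terms.
    """
    # One CASE term per search word: each word adds at most 1 to the score,
    # no matter how often it occurs in the content column.
    score = " + ".join(
        "CASE WHEN content LIKE ? THEN 1 ELSE 0 END" for _ in words
    )
    any_word = " OR ".join("content LIKE ?" for _ in words)
    sql = (
        f"SELECT id, {score} AS match_count FROM products "
        f"WHERE title LIKE ? AND ({any_word}) AND subtitle LIKE ? "
        "ORDER BY match_count DESC, id LIMIT ?, ?"
    )
    patterns = [f"%{w}%" for w in words]
    # Parameters are bound in the order the placeholders appear in the SQL.
    params = patterns + [f"%{title}%"] + patterns + [f"%{subtitle}%", 0, 10]
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER, title TEXT, subtitle TEXT, content TEXT)"
)
conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?)", ROWS)

result = search_by_unique_matches(conn, "animal", "cat", ["dog", "swim", "better"])
print(result)
```

With the example table from the question this puts id 5 first, then 4, then 1. Only the placeholders are generated from the input, so the search words themselves never end up in the SQL string.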

Try it like this:
SELECT id,CONCAT_WS('-',COUNT(LENGTH(content) - LENGTH(REPLACE(content, ' ', '')) + 1),REPLACE(content,' ',',')) AS `amount of matches` FROM products
WHERE
( title LIKE '%animal%' ) AND
( content LIKE '%dog%' OR content LIKE '%swim%' OR content LIKE '%better%' ) AND
( subtitle LIKE '%cat%' )
GROUP BY id
ORDER BY id

Related

NOT IN subquery gives 0 results

I'm not a MySQL expert, but I have to deal with the following problem.
Given the following table:
+-------+-----------+-------------+------+
| id | articleID | img | main |
+-------+-----------+-------------+------+
| 48350 | 4325 | scr426872xa | 1 |
| 48351 | 4325 | scr426872ih | 2 |
| 48352 | 4325 | scr426872jk | 2 |
| 48353 | 4326 | scr426882vs | 1 |
| 48354 | 4326 | scr426882ss | 2 |
| 48355 | 4326 | scr426882nf | 2 |
+-------+-----------+-------------+------+
Each set of images for one distinct articleID should have one image with main=1 and an unspecified number of images with main=2.
Due to processing issues, it can happen that no main=1 image is set for an article, and I need to find the articleIDs for which images with main=2 exist but none with main=1.
My thinking behind the query is easier to explain backwards. My idea was to create a result set (subquery) by querying the table for every articleID where main is 1, and then use that result to check which distinct articleIDs from a query with main=2 are not in the subquery's results, basically "subtracting" all matching articleID lines.
This should leave all main=2 lines that have no line with the same articleID where main=1:
SELECT DISTINCT articleID
FROM img_table WHERE main = 2
AND articleID
NOT IN (SELECT articleID FROM img_table WHERE main = 1 );
I get no results even though I know for a fact that there are some. There is surely something I'm doing wrong. I hope I've explained my problem in a way that makes sense to people other than me :)
Given your problem description, it looks like you're actually looking for NOT EXISTS to check for rows that don't have a matching row in the subselect. Note that you do have to add the article id to the where clause in the subselect:
SELECT DISTINCT articleID
FROM img_table t1
WHERE main = 2
AND NOT EXISTS
(SELECT articleID
FROM img_table t2
WHERE main = 1
AND t2.articleID = t1.articleID);
I think your current solution should work too, but maybe you didn't show all the data. For the data you specified, the query would indeed return 0 rows, because all articleIDs have at least one main=1 and a main=2 image.
One important thing to remember: the subquery must not return any NULL value, otherwise NOT IN won't work properly. So if articleID is nullable, make sure your subselect looks like this:
(SELECT articleID FROM img_table WHERE main = 1 and articleID IS NOT NULL)
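To see the NULL pitfall in action, here is a small sketch using Python's built-in sqlite3 module as a stand-in for MySQL (both follow the same three-valued logic for NOT IN). The rows are hypothetical and not taken from the question: article 4326 has no main = 1 image, and one main = 1 row has a NULL articleID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE img_table (id INTEGER, articleID INTEGER, img TEXT, main INTEGER)"
)
conn.executemany(
    "INSERT INTO img_table VALUES (?, ?, ?, ?)",
    [
        (1, 4325, "a", 1),
        (2, 4325, "b", 2),
        (3, 4326, "c", 2),   # hypothetical article without a main = 1 image
        (4, None, "d", 1),   # hypothetical row with a NULL articleID
    ],
)

# The subquery returns (4325, NULL), so "4326 NOT IN (...)" evaluates to
# NULL instead of TRUE and the row is filtered out.
not_in = conn.execute(
    "SELECT DISTINCT articleID FROM img_table WHERE main = 2 "
    "AND articleID NOT IN (SELECT articleID FROM img_table WHERE main = 1)"
).fetchall()

# Excluding NULLs in the subquery restores the expected behaviour.
not_in_filtered = conn.execute(
    "SELECT DISTINCT articleID FROM img_table WHERE main = 2 "
    "AND articleID NOT IN (SELECT articleID FROM img_table "
    "WHERE main = 1 AND articleID IS NOT NULL)"
).fetchall()

# NOT EXISTS only asks whether a matching row exists, so NULLs do not matter.
not_exists = conn.execute(
    "SELECT DISTINCT articleID FROM img_table t1 WHERE main = 2 "
    "AND NOT EXISTS (SELECT 1 FROM img_table t2 "
    "WHERE t2.main = 1 AND t2.articleID = t1.articleID)"
).fetchall()

print(not_in, not_in_filtered, not_exists)
```

The plain NOT IN query comes back empty, while the other two return article 4326.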
I didn't find any issue in your query. Please add some data where an articleID has only main=2 rows. In the data you posted, every articleID has both main=1 and main=2 rows, which is why you are not getting any result.

mysql count votes optimization

I'm making a file hub, nothing huge or fancy, just a place to store files that others can download. It just occurred to me that the way I originally intended to count upvotes and downvotes could be heavy on the server. The query to get the files is something along the lines of
SELECT * FROM files;
This gives me an array of files that I can loop over to get the specifics of each file. With voting included, that same foreach loop would run a further query per file to count its votes (with the file id in the WHERE clause), like so:
SELECT * FROM votes WHERE upvoted = true AND file_id = ?
I was thinking of using PDOStatement::rowCount to get my answer. Every bone in my body says this is very bad: if I'm fetching 10,000 files, I've just run 10,000 extra queries, one per file, and I haven't even looked at the downvotes yet, which I was thinking would work in a similar fashion. Any optimization advice? Here is a small representation of the structure of a few tables. The upvoted and downvoted columns are of type BOOL, or TINYINT if you will.
table: file              table: user         table: votes
+----+---------------+   +----+----------+   +---------+---------+---------+-----------+
| id | storedname    |   | id | username |   | file_id | user_id | upvoted | downvoted |
+----+---------------+   +----+----------+   +---------+---------+---------+-----------+
| 1  | 45tfvb.txt    |   | 1  | matthew  |   | 1       | 2       | 1       | 0         |
| 2  | jj7fnfddf.pdf |   | 2  | mark     |   | 2       | 1       | 1       | 1         |
| .. | ..            |   | .. | ..       |   | ..      | ..      | ..      | ..        |
There are two ways to do this. The better (faster) way is to write separate queries and combine the results into one variable in your programming language (PHP, Python, etc.):
SELECT
d.id as doc_id,
COUNT(v.document_id) as num_upvotes
FROM votes v
JOIN document d on d.id = v.document_id
WHERE v.upvoted IS TRUE
GROUP BY doc_id
That will return your list of upvoted documents with their counts. You can do the same for your downvotes.
Then, after your select from document, loop over the rows, match the vote counts to each document by ID, and build the result into a dictionary or list.
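As a rough sketch of that idea, the snippet below runs one grouped query for the votes and merges it with the file list in application code. It assumes the question's file and votes tables (not the document tables used in the query above), uses Python's built-in sqlite3 module as a stand-in for MySQL, and the sample rows are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file  (id INTEGER PRIMARY KEY, storedname TEXT);
CREATE TABLE votes (file_id INTEGER, user_id INTEGER,
                    upvoted INTEGER, downvoted INTEGER);
INSERT INTO file VALUES (1, '45tfvb.txt'), (2, 'jj7fnfddf.pdf'), (3, 'novotes.txt');
INSERT INTO votes VALUES (1, 2, 1, 0), (1, 3, 1, 0), (2, 1, 0, 1);
""")

# Query 1: the files themselves.
files = conn.execute("SELECT id, storedname FROM file ORDER BY id").fetchall()

# Query 2: one grouped pass over votes instead of one query per file.
# upvoted/downvoted are 0/1 flags, so SUM() counts the votes of each kind.
counts = {
    file_id: (ups, downs)
    for file_id, ups, downs in conn.execute(
        "SELECT file_id, SUM(upvoted), SUM(downvoted) FROM votes GROUP BY file_id"
    )
}

# Merge in application code; files without any votes default to (0, 0).
summary = [(fid, name) + counts.get(fid, (0, 0)) for fid, name in files]
print(summary)
```

This costs two queries in total regardless of how many files there are.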
The second way, which can take a lot longer at runtime if the table holds many records (it's less efficient, but easier to write), is to add subquery selects to your select statement like this:
SELECT
logical_name ,
document.id ,
file_type ,
physical_name ,
uploader_notes ,
views ,
downloads ,
user.name ,
category.name AS category_name,
(Select count(1) from votes where upvoted=true and document_id=document.id )as upvoted,
(select count(1) from votes where upvoted=false and document_id=document.id) as downvoted
FROM document
INNER JOIN category ON document.category_id = category.id
INNER JOIN user ON document.uploader_id = user.id
ORDER BY category.id
Two pieces of advice:
Avoid SELECT *, especially if you are only going to count. Replace it with something like this:
SELECT COUNT(1) AS total FROM votes WHERE upvoted = true AND file_id = ?
You may also want to create a TRIGGER that keeps a counter in the file table up to date.
I hope this is helpful.
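A minimal sketch of the trigger idea, with several assumptions: the upvotes and downvotes counter columns and the trigger name votes_after_insert are hypothetical, and the code runs on Python's built-in sqlite3 module rather than MySQL. In MySQL the CREATE TRIGGER statement additionally needs FOR EACH ROW, and vote changes or removals would need their own UPDATE and DELETE triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file (
    id         INTEGER PRIMARY KEY,
    storedname TEXT,
    upvotes    INTEGER NOT NULL DEFAULT 0,  -- hypothetical counter column
    downvotes  INTEGER NOT NULL DEFAULT 0   -- hypothetical counter column
);
CREATE TABLE votes (file_id INTEGER, user_id INTEGER,
                    upvoted INTEGER, downvoted INTEGER);

-- Keep the counters in sync whenever a vote row is inserted.
CREATE TRIGGER votes_after_insert AFTER INSERT ON votes
BEGIN
    UPDATE file
    SET upvotes   = upvotes   + NEW.upvoted,
        downvotes = downvotes + NEW.downvoted
    WHERE id = NEW.file_id;
END;

INSERT INTO file (id, storedname) VALUES (1, '45tfvb.txt'), (2, 'jj7fnfddf.pdf');
INSERT INTO votes VALUES (1, 2, 1, 0), (1, 3, 1, 0), (2, 1, 0, 1);
""")

# Reading the totals is now a plain column read, with no counting at all.
totals = conn.execute(
    "SELECT id, upvotes, downvotes FROM file ORDER BY id"
).fetchall()
print(totals)
```

The trade-off is a little extra work on every vote in exchange for cheap reads when listing files.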

MySQL search query ordered by match relevance

I know basic MySQL querying, but I have no idea how to achieve an accurate and relevant search query.
My table looks like this:
id | kanji
-------------
1 | 一子
2 | 一人子
3 | 一私人
4 | 一時
5 | 一時逃れ
I already have this query:
SELECT * FROM `definition` WHERE `kanji` LIKE '%一%'
The problem is that I want to order the results by the characters the user has learnt, with 一 being a required character for every result of this query.
Say a user knows these characters: 人, 子, 時
Then I want the results to be ordered this way:
id | kanji
-------------
2 | 一人子
1 | 一子
4 | 一時
3 | 一私人
5 | 一時逃れ
The result which matches the most learnt characters should be first. If possible, I'd like to show results that contain only learnt characters first, then a mix of learnt and unknown characters.
How do I do that?
Per your preference, this orders by the number of unmatched characters (ascending) and then by the number of matched characters (descending):
SELECT *,
(kanji LIKE '%人%')
+ (kanji LIKE '%子%')
+ (kanji LIKE '%時%') AS score
FROM `definition`
WHERE `kanji` LIKE '%一%'
ORDER BY CHAR_LENGTH(kanji) - score, score DESC
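To check this ordering against the expected result from the question, here is a small sketch on Python's built-in sqlite3 module, used as a stand-in for MySQL. The assumptions: SQLite's LENGTH() replaces CHAR_LENGTH() (both count characters for text), the score is computed in a subquery so the outer ORDER BY can refer to it, and id is added as a final tie-break, which the original query leaves unspecified.

```python
import sqlite3

KNOWN = ["人", "子", "時"]   # characters the user has learnt
REQUIRED = "一"              # character every result must contain

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE definition (id INTEGER, kanji TEXT)")
conn.executemany(
    "INSERT INTO definition VALUES (?, ?)",
    [(1, "一子"), (2, "一人子"), (3, "一私人"), (4, "一時"), (5, "一時逃れ")],
)

# One boolean LIKE term per learnt character; each contributes 0 or 1.
score = " + ".join("(kanji LIKE ?)" for _ in KNOWN)
sql = (
    "SELECT id, kanji, score FROM ("
    f"  SELECT id, kanji, {score} AS score FROM definition WHERE kanji LIKE ?"
    ") AS scored "
    # Fewest unmatched characters first, then most matched characters.
    "ORDER BY LENGTH(kanji) - score, score DESC, id"
)
params = [f"%{c}%" for c in KNOWN] + [f"%{REQUIRED}%"]
rows = conn.execute(sql, params).fetchall()

for row in rows:
    print(row)
```

On the sample data this yields ids 2, 1, 4, 3, 5, which is the order asked for. Ids 1 and 4 tie on both sort keys, so their relative order depends on the tie-break.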
Or, the relational way to do it is to normalize. Create the table like this:
kanji_characters
kanji_id | index | character
----------------------------
1 | 0 | 一
1 | 1 | 子
2 | 0 | 一
2 | 1 | 人
2 | 2 | 子
...
Then
SELECT kanji_id,
COUNT(*) AS length,
SUM(CASE WHEN `character` IN ('人','子','時') THEN 1 ELSE 0 END) AS score
FROM kanji_characters
WHERE `index` <> 0
AND kanji_id IN (SELECT kanji_id FROM kanji_characters WHERE `index` = 0 AND `character` = '一')
GROUP BY kanji_id
ORDER BY COUNT(*) - SUM(CASE WHEN `character` IN ('人','子','時') THEN 1 ELSE 0 END),
score DESC
Though you didn't specify what should be done in the case of duplicate characters. The two solutions above handle that differently.
Just a thought, but a full-text index may help; you can get a score back like this:
SELECT match(kanji) against ('your search' in natural language mode) as relevance
FROM `definition` WHERE match(`kanji`) against ('your search' in natural language mode)
order by relevance desc, length(kanji)
The trick is to index these terms (or words?) the right way. I think the general trick is to encapsulate each word with double quotes and make a space between each. This way the tokenizer will populate the index the way you want. Of course you would need to add/remove the quotes on the way in/out respectively.
Hope this doesn't bog you down.

Count occurrences of a sub string in a MySQL column

I have a table which stores information of a lot of twitter tweets including the tweet text and the screen name of the user who tweeted the tweet. The tweets contain hashtags (starting with #), I want to count the number of hashtags that a specific user has tweeted:
tweet_id | tweet_text | screen_name |
--------------------------------------------------------------------------------------------
1 | #hashtag1 #otherhashtag2 #hashtag3 some more text | tweeter_user_1 |
2 | some text #hashtag1 #hashtag4 more text | tweeter_user_2 |
3 | #hashtag5 #hashtag1 #not a hashtag some#nothashtag | tweeter_user_1 |
4 | #hashtag1 with more text | tweeter_user_3 |
5 | #otherhashtag2 #hashtag3,#hashtag4 more text | tweeter_user_1 |
If I were to count the hashtags of tweeter_user_1, the result I expect is 8; for tweeter_user_3 it should return 1. How can I do this, assuming that my table name is tweets?
I tried this: SELECT COUNT( * ) FROM tweets WHERE( LENGTH( REPLACE( tweet_text, '#%', '#') = 0 ) ) AND screen_name = 'tweeter_user_1' but it didn't work.
I would be happy if the result for tweeter_user_1 was 9 too :D
This should give you a list of screen_names and the total count of all hashtags they use.
SELECT foo.screen_name, SUM(foo.counts) FROM
(
SELECT screen_name,
LENGTH( tweet_text) - LENGTH(REPLACE(tweet_text, '#', '')) AS counts
FROM tweets
) as foo
GROUP BY foo.screen_name
But it's a nasty query if the table is huge. If you only need the count for a single user, I would filter on that user in the inner select, like this:
SELECT foo.screen_name, SUM(foo.counts) FROM
(
SELECT screen_name,
LENGTH( tweet_text) - LENGTH(REPLACE(tweet_text, '#', '')) AS counts
FROM tweets WHERE screen_name = 'tweeter_user_1'
) as foo
GROUP BY foo.screen_name
Depending on how often you need to run the query, you could be causing MySQL to spend a lot of CPU time parsing and reparsing the tweet_text column. I would strongly recommend adding a hashtag_qty column (or similar) and storing the hashtag count there when you first populate the row.
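One thing to keep in mind is that the LENGTH/REPLACE trick counts every '#' character, including the one in some#nothashtag. The sketch below runs it on the sample tweets using Python's built-in sqlite3 module as a stand-in for MySQL, and compares it with a stricter count done in application code. The regular expression and the helper name count_hashtags are assumptions made for this illustration; a helper like it could supply the value for a hashtag_qty column at insert time.

```python
import re
import sqlite3

TWEETS = [
    (1, "#hashtag1 #otherhashtag2 #hashtag3 some more text", "tweeter_user_1"),
    (2, "some text #hashtag1 #hashtag4 more text", "tweeter_user_2"),
    (3, "#hashtag5 #hashtag1 #not a hashtag some#nothashtag", "tweeter_user_1"),
    (4, "#hashtag1 with more text", "tweeter_user_3"),
    (5, "#otherhashtag2 #hashtag3,#hashtag4 more text", "tweeter_user_1"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweets (tweet_id INTEGER, tweet_text TEXT, screen_name TEXT)"
)
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?)", TWEETS)

# SQL approach: number of '#' characters per user.
sql_counts = dict(conn.execute(
    "SELECT screen_name, "
    "SUM(LENGTH(tweet_text) - LENGTH(REPLACE(tweet_text, '#', ''))) "
    "FROM tweets GROUP BY screen_name"
))

# Stricter rule (an assumption): '#' followed by word characters and not
# directly preceded by a word character, so "some#nothashtag" is skipped.
HASHTAG = re.compile(r"(?<!\w)#\w+")


def count_hashtags(text):
    return len(HASHTAG.findall(text))


strict_counts = {}
for _, text, user in TWEETS:
    strict_counts[user] = strict_counts.get(user, 0) + count_hashtags(text)

print(sql_counts)
print(strict_counts)
```

For tweeter_user_1 the SQL count is 10, while the stricter rule gives 9 (it still counts #not), which is the alternative result the question says would be acceptable.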

How to optimize this query with multiple substring and subquery

Okay, I'm working on a website right now that shows information about parts of electronic devices. These parts sometimes get a revision. The part number stays the same, but an A, B, C etc. is appended to it, so the 'higher' the letter, the newer the revision. A date is also added. So the table looks something like this:
------------------------------------------------------------
| Partcode | Description | Partdate |
------------------------------------------------------------
| 12345A | Some description 1 | 2009-11-10 |
| 12345B | Some description 2 | 2010-12-30 |
| 17896A | Some description 3 | 2009-01-12 |
| 12345C | Some description 4 | 2011-08-06 |
| 17896B | Some description 5 | 2009-07-10 |
| 12345D | Some description 6 | 2012-05-04 |
------------------------------------------------------------
What I need right now is the data from the newest revision of a part. So for this example I need:
12345D and 17896B
The query that someone built before me is something along these lines:
SELECT substring(Partcode, 1, 5) AS Part,
(
SELECT pt.Partcode
FROM Parttable pt
WHERE substring(pt.PartCode, 1, 5) = Part
ORDER BY pt.Partdate DESC
LIMIT 0,1
),
(
SELECT pt.Description
FROM Parttable pt
WHERE substring(pt.PartCode, 1, 5) = Part
ORDER BY pt.Partdate DESC
LIMIT 0,1
),
(
SELECT pt.Partdate
FROM Parttable pt
WHERE substring(pt.PartCode, 1, 5) = Part
ORDER BY pt.Partdate DESC
LIMIT 0,1
)
FROM Parttable
GROUP BY Part
As you will understand, this query is insanely slow and feels really inefficient. But I just can't get my head around how to optimize this query.
So I really hope someone can help.
Thanks in advance!
PS. I'm working on a MySQL database and before anyone asks, I can't change the database.
First: why not store the version in a separate column? That way you wouldn't need to call SUBSTRING to extract it. If you really need the code and version concatenated, I think it's good practice to do that at the end.
Then, in your place, I would first split the code and version, and simply use MAX in an aggregate query, like:
SELECT code,max(version) FROM
(SELECT substring(Partcode, 1, 5) as code,
substring(Partcode, 6, 1) as version
FROM Parttable
)
AS part
GROUP BY code;
Note: I haven't tested this query, so you may need to adjust a few parameters, such as the substring indexes.
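Since the question asks for the full row of the newest revision, the MAX aggregate can be joined back to the table. Below is a sketch of that idea on the sample data, using Python's built-in sqlite3 module as a stand-in for MySQL. It assumes a five-character part number followed by the revision letter, as in the example table, and the alias newest is made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Parttable (Partcode TEXT, Description TEXT, Partdate TEXT)"
)
conn.executemany(
    "INSERT INTO Parttable VALUES (?, ?, ?)",
    [
        ("12345A", "Some description 1", "2009-11-10"),
        ("12345B", "Some description 2", "2010-12-30"),
        ("17896A", "Some description 3", "2009-01-12"),
        ("12345C", "Some description 4", "2011-08-06"),
        ("17896B", "Some description 5", "2009-07-10"),
        ("12345D", "Some description 6", "2012-05-04"),
    ],
)

# The derived table finds the highest revision letter per part number;
# joining it back to Parttable picks up the rest of that row.
latest = conn.execute("""
    SELECT pt.Partcode, pt.Description, pt.Partdate
    FROM Parttable pt
    JOIN (
        SELECT SUBSTR(Partcode, 1, 5)      AS code,
               MAX(SUBSTR(Partcode, 6, 1)) AS version
        FROM Parttable
        GROUP BY SUBSTR(Partcode, 1, 5)
    ) AS newest
      ON SUBSTR(pt.Partcode, 1, 5) = newest.code
     AND SUBSTR(pt.Partcode, 6, 1) = newest.version
    ORDER BY pt.Partcode
""").fetchall()

for row in latest:
    print(row)
```

This scans the table once for the aggregate and once for the join, instead of running three correlated subqueries per group.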