Full-text MySQL search - return snippets

I have a MySQL table that contains chapters of books.
Table: book_chapter
--------------------------
| id | book_id | content |
--------------------------
I am currently able to search the content using full-text search like this:
SELECT * FROM book_chapter WHERE book_chapter.book_id="2" AND
MATCH (book_chapter.content) AGAINST ("unicorn hair" IN BOOLEAN MODE)
However, I would like to know if it's possible to search the content and have the results returned as 30-character snippets, just so the user can get the gist. For example, if I search for "unicorn hair", I would get a result like this:
-------------------------------------------------
| id | book_id | content |
-------------------------------------------------
| 15 | 2 | it isn't unicorn hair. You kno |
-------------------------------------------------
| 15 | 2 | chup and unicorn hair in soup |
-------------------------------------------------
| 27 | 2 | , should unicorn hair be used |
-------------------------------------------------
| 31 | 2 | eware of unicorn hair for bal |
Notice that there are two results from the same record. Is that possible as well?

An improvement to the query by Mike Bryant
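If the match is within the first 10 characters of the field, LOCATE(...) - 10 is zero or negative. SUBSTRING then returns an empty string (position 0) or counts from the end of the string (negative position).
I added an IF() to clamp the start position to 1 in that case: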
SELECT
id,
book_id,
SUBSTRING(
content,
IF(LOCATE("unicorn hair", content) > 10, LOCATE("unicorn hair", content) - 10, 1),
10 + LENGTH("unicorn hair") + 10
)
FROM
book_chapter
WHERE book_chapter.book_id="2"
AND MATCH (book_chapter.content) AGAINST ("unicorn hair" IN BOOLEAN MODE)

Try something like this to create a snippet from the first match of the search phrase plus the 10 characters before it and the 10 characters after it. (The result is not exactly 30 characters long, but this may be a better approach because it adapts to the length of the search phrase; consider a search phrase longer than 30 characters.) This doesn't address your wish to show multiple results for the same record in the result set. For that, I think you would be best served by creating a stored procedure to do the work for you.
SELECT id, book_id, SUBSTRING(content, LOCATE("unicorn hair", content) - 10, 10 + LENGTH("unicorn hair") + 10) FROM book_chapter WHERE book_chapter.book_id="2" AND
MATCH (book_chapter.content) AGAINST ("unicorn hair" IN BOOLEAN MODE)
Obviously, you would replace "unicorn hair" with your actual search phrase everywhere it appears.
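For the part of the question about several snippets from the same record, one option is to let the full-text query select the matching rows and build the snippets in application code. The following is a minimal Python sketch under that assumption. The function name `snippets` and the `context` parameter are made up for illustration; they are not part of MySQL or of the answers above.

```python
import re


def snippets(content, phrase, context=10):
    """Return one snippet per occurrence of `phrase` in `content`.

    Each snippet holds the match plus up to `context` characters on
    either side. The start is clamped to 0, which avoids the
    "match near the beginning" problem of the plain SUBSTRING query.
    """
    # re.escape makes the phrase literal; IGNORECASE is an assumption
    # that the search should be case-insensitive.
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    found = []
    for match in pattern.finditer(content):
        begin = max(match.start() - context, 0)
        # Slicing past the end of a Python string is safe.
        found.append(content[begin:match.end() + context])
    return found


if __name__ == "__main__":
    chapter = "Ketchup and unicorn hair in soup, but it isn't unicorn hair."
    for snippet in snippets(chapter, "unicorn hair"):
        print(repr(snippet))
```

Each row returned by the MATCH ... AGAINST query would be passed through this function, so a chapter with two occurrences yields two snippets.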

Related

MySQL pattern matching - finding the match

I'm working with a MySQL database that contains a substantial amount of data (about 10.000 records). The database logs machine maintenance; one of the fields contains a basic timeline (just timestamped steps) explaining all the work done. In this field I'm looking for certain strings that can indicate certain procedures (e.g. ABC123.ABC, abc111.abc, abc001.abc).
I'm looking for matches in this field with pattern matching like such
SELECT * FROM [tablename]
WHERE `work_performed` LIKE '% ______.___ %'
ORDER BY id DESC;
The LIKE pattern is very general, but I can make it more specific myself.
However, since the field that contains the string I'm looking for can be very large (up to 2364763 characters), I want to return the records matching the specified pattern, and I also want to return a field that contains just the matched expression, so I can confirm it is actually what I'm looking for and use that string further.
I have found people with the same issue but I cannot reproduce their results.
Something like this might work?:
SELECT *, SUBSTRING(`work_performed`,
patindex('%[0-9][0-9][0-9]%', `work_performed`)-1, 5) as match
FROM [tablename]
WHERE `work_performed`LIKE '% ______.___ %'
I would like to get output that looks somewhat like this:
+----+-------------------------------------------+------------+
| id | work_performed | match |
+----+-------------------------------------------+------------+
| 1 | 2017-02-26|10:59| Arrival: admin1 | ABCD12.adb |
| | 2017-02-26|10:59| diagnosed error ab-0001 | |
| | 2017-02-26|11:02| ran ABCD12.adb | |
| | 2017-02-26|11:03| system back online | |
+----+-------------------------------------------+------------+
| 2 | 2017-02-26|10:59| Arrival: admin34 | abc123.ags |
| | 2017-02-26|10:59| diagnosed error WP1234 | |
| | 2017-02-26|11:02| ran abc123.ags | |
| | 2017-02-26|11:03| system back online | |
+----+-------------------------------------------+------------+
I apologise if I didn't give enough details but I'm an intern at a major company and we have very strict rules about confidentiality.
If there is a need for any additional information I will try to.
EDIT
I have been trying to search for the string I'm looking for with regexp, but I can't get it to work as I want. Here is what I tried:
SELECT * FROM tablename
WHERE `work_performed` regexp '% ([a-z]^3)([0-9]^3).([a-z]^3) %'
ORDER BY id DESC;
A solution using the CONCAT, SUBSTR, SUBSTRING_INDEX and LOCATE functions (it assumes that the only period in the field is the one inside the file name):
SELECT
CONCAT(SUBSTRING_INDEX(SUBSTRING_INDEX(work_performed, '.', 1), ' ', - 1),
'.',
SUBSTR(SUBSTRING_INDEX(work_performed, '.', - 1), 1,
LOCATE(' ', SUBSTRING_INDEX(work_performed, '.', - 1))
)
) m
FROM
tablename
https://dev.mysql.com/doc/refman/5.7/en/string-functions.html
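If the field can contain other periods, extracting the token with a regular expression is more robust. MySQL 8.0 and later provide REGEXP_SUBSTR for this; on older versions the extraction can be done in application code. Below is a minimal Python sketch of the latter. The pattern (six alphanumerics, a period, three alphanumerics, delimited by whitespace) is an assumption derived from the LIKE '% ______.___ %' pattern in the question, and the names `PROCEDURE_RE` and `first_procedure` are hypothetical.

```python
import re

# Six alphanumerics, a literal period, three alphanumerics.
# The lookarounds require whitespace (or the start/end of the text)
# on both sides, so a token at the end of a line also matches.
PROCEDURE_RE = re.compile(r"(?<!\S)[A-Za-z0-9]{6}\.[A-Za-z0-9]{3}(?!\S)")


def first_procedure(work_performed):
    """Return the first procedure-like token in the log text, or None."""
    match = PROCEDURE_RE.search(work_performed)
    return match.group(0) if match else None


if __name__ == "__main__":
    log = (
        "2017-02-26|10:59| Arrival: admin1\n"
        "2017-02-26|10:59| diagnosed error ab-0001\n"
        "2017-02-26|11:02| ran ABCD12.adb\n"
        "2017-02-26|11:03| system back online"
    )
    print(first_procedure(log))
```

The SQL query would still do the coarse filtering with LIKE or REGEXP; this function only isolates the matched string from each returned row.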

Select rows from a table that contain any word from a long list of words in another table

I have one table with every Fortune 1000 company name:
FortuneList:
------------------------------------------------
|fid | coname |
------------------------------------------------
| 1 | 3m |
| 2 | Amazon |
| 3 | Bank of America |
| 999 | Xerox |
------------------------------------------------
I have a 2nd table with every user on my newsletter:
MyUsers:
------------------------------------------------
|uid | name | companyname |
------------------------------------------------
| 1350 | John Smith | my own Co |
| 2731 | Greg Jones | Amazon.com, Inc |
| 3899 | Mike Mars | Bank of America, Inc |
| 6493 | Alex Smith | Handyman America |
------------------------------------------------
How do I pull out every one of my newsletter subscribers who works for a Fortune 1000 company? (By scanning my entire MyUsers table for every record whose companyname contains any of the coname values from the FortuneList table.)
I would want output to pull:
------------------------------------------------
|uid | name | companyname |
------------------------------------------------
| 2731 | Greg Jones | Amazon.com, Inc |
| 3899 | Mike Mars | Bank of America, Inc |
------------------------------------------------
(See how it finds "Amazon" in the middle of "Amazon.com, Inc")
Try using this, which uses an INNER JOIN, the LIKE operator, and CONCAT:
SELECT *
FROM MyUsers
INNER JOIN FortuneList
ON MyUsers.companyname LIKE CONCAT('%', FortuneList.coname, '%')
(This wouldn't use your full-text index. I'm still trying to figure out how you could use MATCH ... AGAINST in a JOIN.)
If you were doing this in Oracle, this would yield your desired result (with the example data):
with fortunelist as(
select 1 as fid, '3m' as coname from dual union all
select 2, 'Amazon' from dual union all
select 3, 'Bank of America' from dual union all
select 999, 'Xerox' from dual
)
, myusers as(
select 1350 as usrid, 'John Smith' as name, 'my own Co' as companyname from dual union all
select 2731, 'Greg Jones', 'Amazon.com, Inc.' from dual union all
select 3899, 'Mike Mars', 'Bank of America, Inc' from dual union all
select 6493, 'Alex Smith', 'Handyman America' from dual
)
select utl_match.jaro_winkler_similarity(myusers.companyname, fortunelist.coname) as sim
, myusers.companyname
, fortunelist.coname
from fortunelist
, myusers
where utl_match.jaro_winkler_similarity(myusers.companyname, fortunelist.coname) >= 80
The reason is that the Jaro-Winkler scores for the two matches you're after are 87 and 95 (Amazon and Bank of America, respectively). You can move the 80 in the query up or down to raise or lower the matching threshold. The higher you go, the fewer matches you'll get, but the more likely they are to be correct. The lower you go, the more matches you'll get, but you risk getting back matches that aren't really matches. For instance, "Handyman America" vs. "Bank of America" scores 73/100, so if you lowered the threshold to 70, you would get a false positive with your example data. Jaro-Winkler is generally meant for people's names, not company names; however, because company names are typically also very short strings, it may still be useful for you.
I know you tagged this as MySQL, and while this function does not exist there, from what I've read people have already written custom functions for it:
http://androidaddicted.wordpress.com/2010/06/01/jaro-winkler-sql-code/
http://dannykopping.com/blog/fuzzy-text-search-mysql-jaro-winkler
You could also try string replacements, e.g. removing common reasons for a match not being found (such as an "Inc." that appears in one table but not the other).
Edit 2/10/14:
You can do this in MySQL (via phpmyadmin) following these steps:
Step 1: Open phpMyAdmin, select your database, paste the code from the URL below into a SQL window, and click Go. This creates the custom function that you'll need in Step 2. I'm not going to paste the function's code here because it's long and it's not my work. It basically lets you use the Jaro-Winkler algorithm in MySQL, the same way you would with utl_match in Oracle.
http://androidaddicted.wordpress.com/2010/06/01/jaro-winkler-sql-code/
Step 2: After that function is created, run the following SQL:
select jaro_winkler_similarity(myusers.companyname, fortunelist.coname) as similarity
, myusers.uid
, myusers.name
, myusers.companyname as user_co
, fortunelist.coname as matching_co
from fortunelist
, myusers
where jaro_winkler_similarity(myusers.companyname, fortunelist.coname) >= 80
This should yield the exact result you're looking for, but like I said you'll want to play around with the 80 in that SQL and go up or down so that you have a good balance between avoiding false positives but also finding the matches that you want to find.
I don't have a MySQL database with which to test so if you run into an issue please let me know, but this should work.
Using LOCATE (note that this cannot use an index):
select uid, name, companyname
from MyUsers JOIN FortuneList
WHERE LOCATE(coname, companyname) > 0
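To make the containment logic of the LIKE and LOCATE joins explicit, here is a minimal Python sketch of the same idea run against the example data. The function name `fortune_subscribers` and the dictionary layout are assumptions for illustration only. Like the SQL versions, this is a plain substring test, so a very short company name such as "3m" can produce false positives on real data.

```python
def fortune_subscribers(users, company_names):
    """Return the users whose companyname contains any listed company name.

    The comparison is a case-insensitive substring test, mirroring
    LOCATE(coname, companyname) > 0 under a case-insensitive collation.
    """
    needles = [name.lower() for name in company_names]
    return [
        user
        for user in users
        if any(needle in user["companyname"].lower() for needle in needles)
    ]


if __name__ == "__main__":
    fortune_list = ["3m", "Amazon", "Bank of America", "Xerox"]
    my_users = [
        {"uid": 1350, "name": "John Smith", "companyname": "my own Co"},
        {"uid": 2731, "name": "Greg Jones", "companyname": "Amazon.com, Inc"},
        {"uid": 3899, "name": "Mike Mars", "companyname": "Bank of America, Inc"},
        {"uid": 6493, "name": "Alex Smith", "companyname": "Handyman America"},
    ]
    for user in fortune_subscribers(my_users, fortune_list):
        print(user["uid"], user["name"], user["companyname"])
```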

MySQL search query ordered by match relevance

I know basic MySQL querying, but I have no idea how to achieve an accurate and relevant search query.
My table looks like this:
id | kanji
-------------
1 | 一子
2 | 一人子
3 | 一私人
4 | 一時
5 | 一時逃れ
I already have this query:
SELECT * FROM `definition` WHERE `kanji` LIKE '%一%'
The problem is that I want to order the results by the characters the user has learnt, with 一 being a required character for the results of this query.
Say a user knows these characters: 人,子,時
Then I want the results to be ordered this way:
id | kanji
-------------
2 | 一人子
1 | 一子
4 | 一時
3 | 一私人
5 | 一時逃れ
The result which matches the most learnt characters should be first. If possible, I'd like to show results that contain only learnt characters first, then a mix of learnt and unknown characters.
How do I do that?
Per your preference, this orders by the number of unmatched characters (increasing), and then by the number of matched characters (decreasing).
SELECT *,
(kanji LIKE '%人%')
+ (kanji LIKE '%子%')
+ (kanji LIKE '%時%') score
FROM `definition`
WHERE `kanji` LIKE '%一%'
ORDER BY CHAR_LENGTH(kanji) - score, score DESC
Or, the relational way to do it is to normalize. Create the table like this:
kanji_characters
kanji_id | index | character
----------------------------
1 | 0 | 一
1 | 1 | 子
2 | 0 | 一
2 | 1 | 人
2 | 2 | 子
...
Then
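-- `index` and `character` are backtick-quoted because INDEX and CHARACTER are reserved words in MySQL
SELECT kanji_id,
COUNT(*) length,
-- ELSE 0 keeps the sum from being NULL when no learnt character matches
SUM(CASE WHEN `character` IN ('人','子','時') THEN 1 ELSE 0 END) score
FROM kanji_characters
WHERE `index` <> 0
AND kanji_id IN (SELECT kanji_id FROM kanji_characters WHERE `index` = 0 AND `character` = '一')
GROUP BY kanji_id
-- the aggregates are repeated here because MySQL may reject aggregate aliases inside an ORDER BY expression
ORDER BY COUNT(*) - SUM(CASE WHEN `character` IN ('人','子','時') THEN 1 ELSE 0 END), score DESC
Note that you didn't specify what should happen with duplicate characters. The two solutions above handle that case differently.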
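The ordering rule used by the first query (fewest unknown characters first, then most learnt characters) can be checked outside the database. Below is a minimal Python sketch of that rule applied to the sample rows. The function name `order_by_learnt` is hypothetical, and like the LIKE-based query it counts each learnt character at most once per word.

```python
def order_by_learnt(words, learnt, required="一"):
    """Sort words containing `required` by how well they match `learnt`.

    Sort key: (characters that are not learnt, ascending;
               learnt characters present, descending).
    """
    def sort_key(word):
        # One point per learnt character that occurs in the word,
        # the same as summing (kanji LIKE '%x%') terms in SQL.
        score = sum(1 for ch in learnt if ch in word)
        return (len(word) - score, -score)

    candidates = [word for word in words if required in word]
    # sorted() is stable, so ties keep their original order.
    return sorted(candidates, key=sort_key)


if __name__ == "__main__":
    rows = ["一子", "一人子", "一私人", "一時", "一時逃れ"]
    print(order_by_learnt(rows, ["人", "子", "時"]))
```

With the sample data this reproduces the order requested in the question: 一人子, 一子, 一時, 一私人, 一時逃れ.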
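Just a thought, but a full-text index may help. You can get a relevance score back like this (the alias is relevance rather than rank, because RANK is a reserved word in MySQL 8.0.2 and later, and the sort is descending so the best matches come first):
SELECT MATCH(kanji) AGAINST ('your search' IN NATURAL LANGUAGE MODE) AS relevance
FROM `definition` WHERE MATCH(`kanji`) AGAINST ('your search' IN NATURAL LANGUAGE MODE)
ORDER BY relevance DESC, CHAR_LENGTH(kanji)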
The trick is to index these terms (or words?) the right way. I think the general trick is to encapsulate each word with double quotes and make a space between each. This way the tokenizer will populate the index the way you want. Of course you would need to add/remove the quotes on the way in/out respectively.
Hope this doesn't bog you down.

Selecting rows based upon a search string or any of its synonyms

I need some help please...
I have 2 tables, one contains a description field which is entered freehand by the user, the second table is made up of 2 columns, the first is a group name and the second is a list of synonyms. So, for example, I might have three rows in the synonyms table in a group called A that contains the synonyms 'Leaflet', 'Brochure', 'Hand Bill'.
What I need to do is return all rows from the first table where the ItemDescription column contains any of the synonyms of the query variable which might be 'Leaflet'.
So this should give me all of the rows that contain anywhere in the long description field the words 'Leaflet', 'Brochure' or 'Hand Bill'.
I have been able to do this only where the ItemDescription field contains nothing but the words being looked for. In reality, this is a long, wordy column that may contain 50 or 60 words, any one of which may be the search word or one of its synonyms.
All help gratefully received as always.
Thanks.
You should probably try to use LIKE or RLIKE to match the description column. In this case, you want to match a number of alternatives, so I'll just show an example.
Let us assume that we have this table containing synonyms. Note that we have added the word itself as a synonym:
+---------+-----------+
| word | synonym |
+---------+-----------+
| leaflet | leaflet |
| leaflet | brochure |
| leaflet | hand bill |
| skin | skin |
| skin | leather |
| skin | hide |
+---------+-----------+
You don't give an example table, so I invented one called items:
+---------+-------------------+-----------------------------------+
| item_id | brief | description |
+---------+-------------------+-----------------------------------+
| 1 | Diamond | This brochure is glossy and shiny |
| 2 | Halloween Special | A leaflet for the Halloween |
| 3 | Pumpkin | This is just a Halloween pumpkin |
+---------+-------------------+-----------------------------------+
Now, we assume that you want to look for all rows containing one of the synonyms of 'leaflet' in the description. The following query does the job:
SELECT * FROM items
WHERE description RLIKE (
SELECT
CONCAT('.*(', GROUP_CONCAT(synonym SEPARATOR '|'), ').*')
FROM synonyms
WHERE word = 'leaflet'
GROUP BY word
);
The inner select creates a regular expression matching any one of the synonyms, and the outer select applies this regular expression to the description column of our items table.
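One caveat with building the pattern through GROUP_CONCAT is that synonyms are inserted into the regular expression unescaped, so a synonym containing a character such as "." or "(" changes the pattern. If the pattern is built in application code instead, the synonyms can be escaped first. Here is a minimal Python sketch of that approach. The function names are hypothetical, and the word-boundary matching is an added assumption that the RLIKE query above does not make.

```python
import re


def synonym_pattern(synonyms):
    """Compile one case-insensitive regex that matches any synonym."""
    # Longer synonyms first, so "hand bill" is tried before a shorter
    # synonym that might be a prefix of it.
    ordered = sorted(synonyms, key=len, reverse=True)
    alternatives = "|".join(re.escape(s) for s in ordered)
    # \b restricts matches to whole words (an assumption, see above).
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def matching_items(items, synonyms):
    """Return the items whose description mentions any synonym."""
    pattern = synonym_pattern(synonyms)
    return [item for item in items if pattern.search(item["description"])]


if __name__ == "__main__":
    items = [
        {"item_id": 1, "description": "This brochure is glossy and shiny"},
        {"item_id": 2, "description": "A leaflet for the Halloween"},
        {"item_id": 3, "description": "This is just a Halloween pumpkin"},
    ]
    for item in matching_items(items, ["leaflet", "brochure", "hand bill"]):
        print(item["item_id"], item["description"])
```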
Thanks for the feedback. I have found an answer to my SQL needs:
SELECT *
FROM MainTable a
WHERE EXISTS
(
SELECT 1
FROM (
Select concat('%',Synonym,'%') As cond
From synonyms
Where Synonym Like '%SearchString%'
OR ListRef = ( Select ListRef
From synonyms
Where Synonym Like '%SearchString%')
) c
WHERE a.Description LIKE cond
)
OR ItemDescription Like '%SearchString%'
Without the final OR, I was only returning rows where something existed in the synonyms table for my search string. With the OR, it also returns all direct matches not found through synonyms.

MySQL full-text search gives me inaccurate results

Let's say that I have a database that looks like this (MyISAM):
+------------+-------------------+------------------+
| student_id | student_firstname | student_lastname |
+------------+-------------------+------------------+
| 30 | Patrik | Andersson |
| 79 | Patrik | Svensson |
+------------+-------------------+------------------+
And I perform this query:
SELECT s.student_firstname, s.student_lastname FROM students s
WHERE MATCH (student_firstname, student_lastname)
AGAINST
('+Patrik Svensson*' IN BOOLEAN mode)
This generates both of the above rows. Why do I not get 1 row in my result? Is it because the last three letters in the student_lastname are the same? Is there any way to make FULLTEXT more precise?
Have you tried reading the MySQL documentation?
http://dev.mysql.com/doc/refman/5.5/en/fulltext-boolean.html
And I quote:
By default (when neither + nor - is specified) the word is optional,
but the rows that contain it are rated higher.
And:
'+apple macintosh'
Find rows that contain the word “apple”, but rank rows higher if they
also contain “macintosh”.
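I have tested it, and this query gives the right result. The + in front of each word makes both words required, whereas in the original query Svensson* was optional: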
SELECT s.student_firstname, s.student_lastname FROM students s
WHERE MATCH (student_firstname, student_lastname)
AGAINST
('+Patrik +Svensson*' IN BOOLEAN mode)
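If the search string comes from user input, the + prefixes have to be added programmatically. The following is a minimal Python sketch of one way to do that; the function name `require_all_terms` and its `prefix_last` parameter are made up for illustration. It replaces the characters that boolean mode treats as operators with spaces, marks every remaining word as required, and optionally adds the trailing * wildcard to the last word, as in the query above.

```python
# Characters with special meaning in MySQL boolean full-text mode.
BOOLEAN_OPERATORS = '+-<>()~*"@'


def require_all_terms(search, prefix_last=True):
    """Build a boolean-mode string in which every word is required."""
    # Replace operator characters with spaces so user input cannot
    # change the meaning of the expression.
    cleaned = search.translate({ord(ch): " " for ch in BOOLEAN_OPERATORS})
    words = cleaned.split()
    if not words:
        return ""
    terms = ["+" + word for word in words]
    if prefix_last:
        # Trailing * makes the last word a prefix match.
        terms[-1] += "*"
    return " ".join(terms)


if __name__ == "__main__":
    print(require_all_terms("Patrik Svensson"))
```

The resulting string would then be passed to AGAINST (... IN BOOLEAN MODE) as a bound query parameter rather than concatenated into the SQL text.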