mysql fulltext MATCH,AGAINST - mysql

I am trying to do a full text search on a field to match specific parts of a string. Consider a long string holding array values like 201:::1###193:::5###193:::6###202:::6. ### separates array elements and ::: separates key=>val. My understanding of MATCH ... AGAINST is that it can match portions of a string in boolean mode, but when I do something along the lines of
`SELECT
a.settings
, MATCH(a.settings) AGAINST('201:::1') as relevance
, b.maxrelevance
, (MATCH(a.settings) AGAINST('201:::1'))/b.maxrelevance*100 as relevanceperc
FROM
users_profile a
, (SELECT MAX(MATCH(settings) AGAINST('201:::1')) as maxrelevance FROM users_profile LIMIT 1) b
WHERE
MATCH(a.settings) AGAINST('201:::1')
ORDER BY
relevance DESC;`
Table example
CREATE TABLE users_profile (
id int(11) default NULL,
profile text,
views int(11) default NULL,
friends_list text,
settings text,
points int(11) default NULL,
KEY id (id),
FULLTEXT KEY settings (settings)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
I'm getting zero results. Any ideas are welcome.

MySQL fulltext indexes are designed to store natural language words. Your sample
201:::1###193:::5###193:::6###202:::6
is made up of only numbers as the significant parts, such as 201, 1, 193... Because very short words are rarely useful, ft_min_word_len is usually set to 4 (the default), which means none of these numbers are even in the index.
Fulltext isn't the solution to this problem.
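To see why nothing gets indexed, here is a rough Python sketch of the word-length filter. It is a simplified model, not MySQL's actual parser: it assumes tokens are runs of letters, digits and underscores, and that ft_min_word_len is 4. The function name indexable_tokens is made up for illustration.

```python
import re

FT_MIN_WORD_LEN = 4  # assumption: the MyISAM default for ft_min_word_len


def indexable_tokens(text, min_len=FT_MIN_WORD_LEN):
    """Split text into word-character runs and drop tokens that are too short."""
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if len(t) >= min_len]


settings = "201:::1###193:::5###193:::6###202:::6"
# ':::' and '###' act as separators, leaving only 1 to 3 digit tokens,
# so nothing survives the length filter.
print(indexable_tokens(settings))
print(indexable_tokens("user 2016 settings"))
```

Under this model, every token in the sample string is shorter than the minimum length, which would explain the empty result set.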
If all you want is to count how many times an expression occurs in the column, just use
(length(a.settings) - length(replace(a.settings,'201:::1',''))) / length('201:::1')
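The same length-difference idea, sketched in Python so the arithmetic is easy to check. The helper name count_occurrences is hypothetical; it only mirrors what the SQL expression computes.

```python
def count_occurrences(haystack, needle):
    """Count non-overlapping occurrences via the length difference after removal."""
    if not needle:
        raise ValueError("needle must not be empty")
    shrunk = haystack.replace(needle, "")
    # Every removed occurrence shortens the string by len(needle) characters.
    return (len(haystack) - len(shrunk)) // len(needle)


settings = "201:::1###193:::5###193:::6###202:::6"
print(count_occurrences(settings, "193:::"))
print(count_occurrences(settings, "201:::1"))
```

Note that this is a plain substring match, so '201:::1' would also be counted inside '201:::12'. Including the trailing ### separator in the search string, where there is one, is one way around that.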

Related

Selecting results by not exact match

I need to figure out the best way to select records from db by a string that's not matching exactly the string in db.
The one stored in db is:
So-Fi (S. 1st St. District), 78704 (South Austin), Bouldin Creek, South Congress
And the one I have to match with is:
$myArea = 'So-Fi-S-1st-St-District-78704-South-Austin-Bouldin-Creek-South-Congress';
The $myArea is actually a value taken from db and formatted for SEO-friendly URL on a different page.
I've tried
SELECT * FROM t1 WHERE area = REPLACE('".$myArea."', '-', '')
But clearly there's no match, since I cannot tame $myArea and format it back to what it was in the db.
Is there a way to remove all punctuation and such, leaving only alphanumerics in the db value before selecting?
Doing lookups like this will guarantee you some headaches; there are too many special cases which you'll be unable to cover.
Why don't you add a "slug" field to your database, where you put the SEO-friendly string? This way you do a direct lookup on the slug without having to do a lot of string manipulation.
Example of database table:
CREATE TABLE `locations` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`slug` varchar(255) NOT NULL,
`location` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;
Then you do lookups like this:
SELECT location from locations where slug = :slug;
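The slug would be computed once, when the row is written. A minimal sketch of that step, assuming the SEO format simply collapses every run of non-alphanumeric characters into a single hyphen (the real formatting rule used on the other page is not shown in the question, and slugify is a made-up name):

```python
import re


def slugify(location):
    """Collapse each run of non-alphanumeric characters into one hyphen."""
    return re.sub(r"[^A-Za-z0-9]+", "-", location).strip("-")


location = ("So-Fi (S. 1st St. District), 78704 (South Austin), "
            "Bouldin Creek, South Congress")
print(slugify(location))
```

With that rule the stored location from the question maps to the $myArea value, so the lookup becomes an exact, indexable comparison on slug.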

Optimal search query and structure for querying large set of data

I've created a file indexer which simply inserts filenames into a specified table. Now I'm considering the best way to search for the filenames. There could be 100000+ files in the table, so performance is important.
File names vary in length: 10, 20, 50 or more characters. At least for now, my test dataset has no files with spaces in their names. Users can do a partial search; for example, looking for '1001' should return the file named 10_1001_20_30_40_50.
My current table structure:
CREATE TABLE `file` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`id_category` int(10) unsigned NOT NULL,
`filename` varchar(255) NOT NULL,
`file_ext` varchar(3) NOT NULL,
`date_added` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`,`id_category`),
KEY `idx_file_filename` (`filename`) USING BTREE,
KEY `fk_file_1_idx` (`id_category`),
FULLTEXT KEY `filename` (`filename`)
) ENGINE=MyISAM AUTO_INCREMENT=24974 DEFAULT CHARSET=utf8;
INSERT INTO `file` (`id`,`id_category`,`filename`,`file_ext`,`date_added`) VALUES (22474,14199,'095_98_1002_1003_148_98_1001_003','pdf','2016-03-19 19:02:12');
INSERT INTO `file` (`id`,`id_category`,`filename`,`file_ext`,`date_added`) VALUES (22475,14199,'095_98_1002_1003_148_98_1001_001','pdf','2016-03-19 19:02:11');
I've tried to use MATCH () AGAINST (), but it turned out it's not a good idea if you don't have spaces in the string and want to do an "if string contains" search like:
SELECT id, filename FROM `file` WHERE MATCH(filename) AGAINST ('1002*' IN BOOLEAN MODE);
This is not going to return what I need. What I'm considering is to use FULLTEXT by splitting all filenames while importing into parts of length 3 (the minimum string length a user can provide) separated by spaces, and then use queries like this:
SELECT * FROM `file` WHERE MATCH(filename) AGAINST ('100*' IN BOOLEAN MODE);
Of course I can leave filenames as they are and use LIKE operator:
SELECT * FROM `file` WHERE filename LIKE '%100%'
but there are a lot of negative opinions about using LIKE for large data sets. I'm curious whether my solution of adding spaces to file names is a good idea.
Attempting to use FULLTEXT: requires space, limits you (mostly) to full "words", gets inefficient with "short" words, misses "stop words", etc.
LIKE '%100%', though inefficient because it must test every row, is what you need.
You imply that all the relevant parts of the filenames are numbers? And that you only want to test for whole parts? That is 22_100_33 will be searched for 22, 100, and 33, but not for 2, 10, 00, etc?? If all that is the case, then LIKE will not work correctly. Example: 101_1000 will be caught by LIKE '%100%'.
So, maybe you want to build an "inverted index": for 10_1001_20_30_40_50, you would have 6 rows in a table: 10, 1001, etc., and either the rest of the columns, or some id(s) for joining to the file table.
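A small in-memory sketch of such an inverted index, assuming parts are separated by underscores and only whole parts should match. The function name and the third sample row are hypothetical; in the database this would be a separate table keyed on the part.

```python
from collections import defaultdict


def build_part_index(files):
    """Map each underscore-separated part to the set of file ids containing it."""
    index = defaultdict(set)
    for file_id, filename in files:
        for part in filename.split("_"):
            index[part].add(file_id)
    return index


files = [
    (22474, "095_98_1002_1003_148_98_1001_003"),
    (22475, "095_98_1002_1003_148_98_1001_001"),
    (1, "101_1000"),  # hypothetical row illustrating the LIKE false positive
]
index = build_part_index(files)
print(sorted(index.get("1001", set())))
# '100' is not a whole part of any filename, unlike with LIKE '%100%'.
print(sorted(index.get("100", set())))
```

A lookup is then an exact match on the part, which an ordinary B-tree index handles well.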
there are a lot of negative opinions about using LIKE for large data sets
Chances are it would be good enough for your case; I would test it first.
If you really want to speed it up, I can think of one option, but the sacrifices would be huge: memory, insertion times, maintainability, flexibility, complexity... You can build an "inverted index" of suffixes. The table would look like this:
CREATE TABLE Pref (
prefix varchar(255) NOT NULL,
fileid bigint(20) unsigned NOT NULL,
-- InnoDB clusters rows on the primary key, so they are stored ordered by prefix
PRIMARY KEY (prefix, fileid)
) ENGINE=InnoDB;
and have data like this
'095_98_1002_1003_148_98_1001_003', 22474
'95_98_1002_1003_148_98_1001_003', 22474
'5_98_1002_1003_148_98_1001_003', 22474
'_98_1002_1003_148_98_1001_003', 22474
'98_1002_1003_148_98_1001_003', 22474
...
'03', 22474
'3', 22474
It would have a clustered primary key on both columns. That way it would be ordered by the prefix, and you can change the infix search '%abcd%' into the prefix search 'abcd%'. The query would then have the form
SELECT id, filename FROM `file`
WHERE id IN (SELECT fileid FROM Pref WHERE prefix like 'abcd%')
You just have to create triggers to keep it in sync with the main table. Beware that when you delete rows from this table, you should avoid searching by fileid without a prefix specified; the key is ordered by prefix first, so the performance would be a disaster.
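For illustration, the rows a trigger or import script would have to generate can be sketched like this. Both function names are made up, and the in-memory search only mimics what the prefix LIKE does on the Pref table.

```python
def suffix_rows(file_id, filename):
    """Return one (suffix, file_id) row per starting position in the filename."""
    return [(filename[i:], file_id) for i in range(len(filename))]


def infix_search(rows, term):
    """Mimic WHERE prefix LIKE 'term%': an infix of the name is a prefix of some suffix."""
    return sorted({file_id for suffix, file_id in rows if suffix.startswith(term)})


rows = suffix_rows(22474, "095_98_1002_1003_148_98_1001_003")
print(rows[:3])
print(infix_search(rows, "1002"))
```

Each filename of length n produces n rows, which is where the memory and insertion costs come from.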

MySql FullText Search seems slow, How can I make this faster? Any Optimization required?

I have a table with 1 million records and want a faster way to fetch records for any search query. I am not very good with MySQL fulltext search.
I made the following tests:
Initially I applied MATCH AGAINST on a single column; it returned results fast.
Secondly I applied MATCH AGAINST on two columns, and it returned very slowly.
Then I combined the two columns into one and applied MATCH AGAINST on the computed column. It is very slow the first time but reasonably fast on a second attempt with the same search term.
Is there any issue with my query? How should I amend it for better optimization?
select name, meaning, m.gender, m.similar
FROM
NAMES n
INNER JOIN meta m ON m.nameid = n.id
WHERE MATCH (nameandmeaning) AGAINST ('searchterm*' IN BOOLEAN MODE)
AND meaning IS NOT NULL
ORDER BY LENGTH(m.similar) DESC
LIMIT 0 , 10;
Note: nameandmeaning is a combination of name and meaning.
My table structure is as follows:
CREATE TABLE NAMES (
id BIGINT(20) NOT NULL AUTO_INCREMENT,
name VARCHAR(100) NOT NULL,
meaning VARCHAR(2000) DEFAULT NULL,
nameandmeaning VARCHAR(2000) DEFAULT NULL,
PRIMARY KEY (id),
FULLTEXT KEY constains_name(name,meaning),
FULLTEXT KEY contains_namemeaing (nameandmeaning)
) ENGINE=MYISAM AUTO_INCREMENT=67846 DEFAULT CHARSET=latin1;
CREATE TABLE NAMES (
id BIGINT(20) NOT NULL AUTO_INCREMENT,
name VARCHAR(100) NOT NULL,
meaning VARCHAR(2000) DEFAULT NULL,
-- skip this: nameandmeaning VARCHAR(2000) DEFAULT NULL,
PRIMARY KEY (id),
FULLTEXT KEY constains_name(name, meaning)
-- skip this: FULLTEXT KEY contains_namemeaing (nameandmeaning)
) ENGINE=MYISAM AUTO_INCREMENT=67846 DEFAULT CHARSET=latin1;
Then these should work fast:
MATCH(name) AGAINST (...)
MATCH(meaning) AGAINST (...)
MATCH(name, meaning) AGAINST (...)
When you upgrade to InnoDB, you should have all 3 of these (if you are doing all 3 of those MATCHes):
FULLTEXT (name),
FULLTEXT (meaning),
FULLTEXT (name, meaning)

Select rows where column LIKE dictionary word

I have 2 tables:
Dictionary - Contains roughly 36,000 words
CREATE TABLE IF NOT EXISTS `dictionary` (
`word` varchar(255) NOT NULL,
PRIMARY KEY (`word`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Datas - Contains roughly 100,000 rows
CREATE TABLE IF NOT EXISTS `datas` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`hash` varchar(32) NOT NULL,
`data` varchar(255) NOT NULL,
`length` int(11) NOT NULL,
`time` int(11) NOT NULL,
PRIMARY KEY (`ID`),
UNIQUE KEY `hash` (`hash`),
KEY `data` (`data`),
KEY `length` (`length`),
KEY `time` (`time`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=105316 ;
I would like to somehow select all the rows from datas where the column data contains 1 or more words from dictionary.
I understand this is a big ask, it would need to match all of these rows together in every combination possible, so it needs the best optimization.
I have tried the below query, but it just hangs for ages:
SELECT `datas`.*, `dictionary`.`word`
FROM `datas`, `dictionary`
WHERE `datas`.`data` LIKE CONCAT('%', `dictionary`.`word`, '%')
AND LENGTH(`dictionary`.`word`) > 3
ORDER BY `length` ASC
LIMIT 15
I have also tried something similar to the above with a left join, and on clause that specified the like statement.
This is actually not an easy problem. What you are trying to perform is called Full Text Search, and relational databases are not the best tools for such a task. If this is some kind of core functionality, consider using a solution dedicated to this kind of operation, like Sphinx Search Server.
If this is not a "mission critical" system, you can try something else. I can see that the datas.data column isn't really long, so you can create a structure dedicated to your task and keep maintaining it during operational use. For example, create a table:
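CREATE TABLE dictionary_datas (
datas_id int(11) NOT NULL,
word varchar(255) NOT NULL,
FOREIGN KEY (datas_id) REFERENCES datas (ID),
FOREIGN KEY (word) REFERENCES dictionary (word)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;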
Now any time you insert, delete or modify rows in the datas or dictionary tables, you update dictionary_datas, recording which datas_id contains which words (basically a many-to-many relation). Of course this will degrade write performance, so if you have a high transactional load on your system, you should do it periodically instead. For example, set up a cron job which runs every night at 03:00 and updates the table. To simplify the task you can add a TO_CHECK flag to the DATAS table and update data only for those records where it is 1 (after you update dictionary_datas you switch the value to 0). Remember, by the way, to recheck the whole DATAS table after an update to the DICTIONARY table. 36,000 and 100,000 aren't big numbers in terms of data processing.
Once you have this table you can just query it like:
SELECT datas_id, count(*) AS words_num FROM dictionary_datas GROUP BY datas_id HAVING count(*) > 3;
To speed up the query (and yet slow down its update) you can create a composite index on its columns datas_id, word (in EXACTLY that order). If you decide to refresh the data periodically, you should drop the index before the refresh, then refresh the data, and finally recreate the index afterwards; this way it will be faster.
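The periodic refresh could be sketched roughly as follows. This is only an in-memory illustration of which (datas_id, word) pairs the job would write; the function name, the sample rows and the min_len filter (mirroring LENGTH(word) > 3 from the question) are assumptions, and a real job would read from and write to the database.

```python
def build_dictionary_datas(datas, dictionary, min_len=4):
    """Return (datas_id, word) pairs for every dictionary word found in a data value."""
    words = [w for w in dictionary if len(w) >= min_len]
    pairs = []
    for datas_id, data in datas:
        for word in words:
            if word in data:  # substring test, like LIKE CONCAT('%', word, '%')
                pairs.append((datas_id, word))
    return pairs


# Hypothetical sample rows, not taken from the question.
datas = [(1, "sunflower123"), (2, "xk7q9"), (3, "moonlight")]
dictionary = ["sun", "flower", "moon", "light"]
print(build_dictionary_datas(datas, dictionary))
```

This is still a nested loop over both tables, but it runs once per refresh instead of once per query.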
I'm not sure if I understood your problem, but I think this could be a solution. Also, I think people don't like regular expressions, but this works for me to select rows where the value has more than 1 word.
SELECT * FROM datas WHERE data REGEXP "([a-z] )+"
Have you tried this?
select *
from dictionary, datas
where position(word in data) > 0
;
This is very inefficient, but might be good enough for you.
For better performance, you could try placing a text search index on your text column DATA and then using the CONTAINS function instead of POSITION.

Is it possible to find the repeating patterns in the TEXT stored in MySQL?

Is it possible to find repeating patterns in the text?
My table looks like this:
CREATE TABLE `textanalysis` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`abstract` text,
UNIQUE KEY `ID` (`ID`),
FULLTEXT KEY `abstract` (`abstract`)
) ENGINE=MyISAM AUTO_INCREMENT=2 DEFAULT CHARSET=latin1;
I would like to find the words or groups of words that repeat in the text and then compute statistics on them.
Here is a trick (not very optimized).
Use "apple" as an example;
the length of "apple" is 5.
SELECT
(LENGTH(abstract)-LENGTH(REPLACE(LOWER(abstract), 'apple', '')))/5
AS occurrences
FROM
textanalysis
WHERE
MATCH (abstract) AGAINST ('+apple' IN BOOLEAN MODE);
What it does is remove every "apple" (making the abstract shorter);
comparing the result with the original length gives the number of occurrences.
I'm not so clear about your requirement, but if you want to count the occurrences of each distinct word, you can try
select count(id) as total_word, abstract from textanalysis group by abstract;
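If the statistics are needed per word or per group of words, one option is to do the counting in application code after fetching the abstracts. A minimal sketch, assuming a word is a run of letters, digits or apostrophes and that case should be ignored; ngram_counts and the sample abstracts are made up for illustration.

```python
import re
from collections import Counter


def ngram_counts(abstracts, n=1):
    """Count how often each run of n consecutive words occurs across all abstracts."""
    counts = Counter()
    for text in abstracts:
        words = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts


# Hypothetical rows standing in for textanalysis.abstract values.
abstracts = ["An apple a day", "Apple pie and apple juice"]
print(ngram_counts(abstracts).most_common(3))
print(ngram_counts(abstracts, n=2).most_common(3))
```

Setting n to 2 or 3 gives counts for groups of words rather than single words.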