Selecting results by not exact match - mysql

I need to figure out the best way to select records from db by a string that's not matching exactly the string in db.
The one stored in db is:
So-Fi (S. 1st St. District), 78704 (South Austin), Bouldin Creek, South Congress
And the one I have to match with is:
$myArea = 'So-Fi-S-1st-St-District-78704-South-Austin-Bouldin-Creek-South-Congress';
The $myArea is actually a value taken from db and formatted for SEO-friendly URL on a different page.
I've tried
SELECT * FROM t1 WHERE area = REPLACE('".$myArea."', '-', '')
But clearly there's no match, since I cannot tame $myArea and format it back to what it was in db.
Is there a way to remove all punctuation and such leaving only alphanumerics in db before selecting?

Lookups like this are guaranteed to give you headaches; there are too many special cases that you won't be able to cover.
Why don't you add a "slug" field to your database, where you store the SEO-friendly string? That way you can do a direct lookup on the slug without a lot of string manipulation.
Example of database table:
CREATE TABLE `locations` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`slug` varchar(255) NOT NULL,
`location` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;
Then you do lookups like this:
SELECT location from locations where slug = :slug;
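One way to fill the slug column is to generate the slug in application code whenever a location is saved. The following is a minimal Python sketch, assuming the slug rule is "collapse every run of non-alphanumeric characters into one hyphen". The function name slugify is hypothetical, and the question does not show how the URL string was actually produced, so treat the rule as an assumption.

```python
import re


def slugify(location: str) -> str:
    """Collapse each run of non-alphanumeric characters into a single hyphen.

    Assumed rule, not taken from the question: it happens to reproduce the
    $myArea value for the sample location.
    """
    return re.sub(r"[^A-Za-z0-9]+", "-", location).strip("-")


stored = "So-Fi (S. 1st St. District), 78704 (South Austin), Bouldin Creek, South Congress"
slug = slugify(stored)
print(slug)
```

Store the result in the slug column at insert time and look rows up with the slug = :slug query above, so the formatting never has to be reversed.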

Related

mySQL query to find multiple strings in any order within a single field?

In one column of a database, we store the parameters that we used to hit an API, for example if the API call was sample.api/call?foo=1&bar=2&foobar=3 then the field will store foo=1&bar=2&foobar=3
It'd be easy enough to make a query to check 2 or 3 of those values if it was guaranteed that they'd be in that order, but that's not guaranteed. There's a possibility that call could have been made with the parameters as bar=2&foo=1&foobar=3 or any other combination.
Is there a way to make that query without saying:
SELECT * FROM table
WHERE value LIKE "%foo=1%"
AND value LIKE "%bar=2%"
AND value LIKE "%foobar=3%"
I've also tried
SELECT * FROM table
WHERE "foo=1" IN (value)
but that didn't yield any results at all.
Edit: I should have previously mentioned that I won't necessarily be always looking for the same parameters.
But why?
The problem with doing simple LIKE statements is this:
SELECT * FROM table
WHERE value LIKE "%foo=1%"
This will match the value asdffoo=1 and also foo=13. One hacky solution is to do this:
SELECT * FROM `api`
WHERE `params` REGEXP '(^|&)foo=1(&|$)'
AND `params` ...
Be aware, this does not use indexes. If you have a large dataset, this will need to do a row scan and be extremely slow!
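To see what the delimiter-anchored pattern buys you, here is a small Python sketch that applies the same idea outside the database. The helper names has_param and has_all are hypothetical, and Python's re module stands in for MySQL's REGEXP, so escaping rules may differ between the two.

```python
import re


def has_param(stored: str, pair: str) -> bool:
    """True if `pair` (e.g. 'foo=1') is a complete parameter in `stored`."""
    # The pair must be preceded by start-of-string or '&',
    # and followed by '&' or end-of-string.
    pattern = r"(^|&)" + re.escape(pair) + r"(&|$)"
    return re.search(pattern, stored) is not None


def has_all(stored: str, pairs) -> bool:
    """True if every pair is present, in any order."""
    return all(has_param(stored, p) for p in pairs)


for candidate in ["asdffoo=1", "foo=13", "bar=2&foo=1&foobar=3"]:
    # Compare a plain substring test with the anchored test.
    print(candidate, "foo=1" in candidate, has_param(candidate, "foo=1"))
```

The plain substring test accepts all three strings, while the anchored test accepts only the last one.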
Alternatively, if you can store your info in the database differently, you can utilize the FIND_IN_SET() function, which takes the string to find first and the comma-separated list second.
-- Store in DB as foo=1,bar=2,foobar=3
SELECT * FROM `api`
WHERE FIND_IN_SET('foo=1', `params`)
AND FIND_IN_SET('bar=2', `params`)
...
The only other solution would be to involve another table, something like the following, and to follow the solution on this page:
CREATE TABLE `endpoints` (
`id` int(6) unsigned NOT NULL AUTO_INCREMENT,
`url` varchar(200) NOT NULL,
PRIMARY KEY (`id`)
) DEFAULT CHARSET=utf8;
CREATE TABLE IF NOT EXISTS `params` (
`id` int(6) unsigned NOT NULL AUTO_INCREMENT,
`endpoint` int(6) NOT NULL,
`param` varchar(200) NOT NULL,
PRIMARY KEY (`id`),
INDEX `idx_param` (`param`)
) DEFAULT CHARSET=utf8;
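With one row per parameter, a common way to ask "which endpoints have all of these parameters, in any order" is an IN filter combined with GROUP BY and HAVING COUNT. The sketch below is an assumption about how such a query could look, not the solution from the page mentioned above. It uses Python's built-in sqlite3 as a stand-in for MySQL, a simplified params table, and a hypothetical helper named find_endpoints.

```python
import sqlite3


def find_endpoints(conn, wanted):
    """Return ids of endpoints that have every 'key=value' string in `wanted`."""
    wanted = sorted(set(wanted))
    if not wanted:
        return []
    placeholders = ",".join("?" for _ in wanted)
    sql = (
        "SELECT endpoint FROM params "
        f"WHERE param IN ({placeholders}) "
        "GROUP BY endpoint "
        "HAVING COUNT(DISTINCT param) = ? "  # every wanted pair must be present
        "ORDER BY endpoint"
    )
    return [row[0] for row in conn.execute(sql, [*wanted, len(wanted)])]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE params ("
    "id INTEGER PRIMARY KEY, endpoint INTEGER NOT NULL, param TEXT NOT NULL)"
)
conn.execute("CREATE INDEX idx_param ON params (param)")
conn.executemany(
    "INSERT INTO params (endpoint, param) VALUES (?, ?)",
    [(1, "foo=1"), (1, "bar=2"), (1, "foobar=3"),
     (2, "bar=2"), (2, "foo=13"),
     (3, "foobar=3"), (3, "foo=1")],
)
print(find_endpoints(conn, ["bar=2", "foo=1"]))
```

Because each parameter is compared by equality, the idx_param index can be used and bar=2 can no longer match foobar=2.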
The last and final recommendation is to upgrade to MySQL 5.7 and utilize its JSON functionality. Insert the data as a JSON object, and search it as demonstrated in this question.
This is completely impossible to do properly.
Problem 1. bar and foobar overlap,
so if you search for bar=2, you will also match foobar=2. This is not what you want.
This can be fixed by prepending a leading & when storing the GET query string.
Problem 2. You don't know how many characters are in the value, so you must also have an end-of-string character, which is the same & character. So you need it at the beginning and at the end.
You now see the issue.
Even if you sort the parameters before storing them in the database, you still can't do LIKE "%&bar=2&%&foo=1&%&foobar=3&%", because the & that ends one match is also the & that starts the next, so adjacent parameters overlap.
Even after the corrections, you still have to use three LIKEs to match the overlapping strings.

Optimal search query and structure for querying large set of data

I've created a file indexer which simply inserts filenames into a specified table. Now I'm considering the best way to search for the filenames. There could be 100000+ files in the table, so performance is important.
File names can vary: 10, 20, 50 or more characters in length. At least for now, my test dataset has no files with spaces in their names. Users can do a partial search; for example, looking for '1001' should return the file named 10_1001_20_30_40_50.
My current table structure:
CREATE TABLE `file` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`id_category` int(10) unsigned NOT NULL,
`filename` varchar(255) NOT NULL,
`file_ext` varchar(3) NOT NULL,
`date_added` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`,`id_category`),
KEY `idx_file_filename` (`filename`) USING BTREE,
KEY `fk_file_1_idx` (`id_category`),
FULLTEXT KEY `filename` (`filename`)
) ENGINE=MyISAM AUTO_INCREMENT=24974 DEFAULT CHARSET=utf8;
INSERT INTO `file` (`id`,`id_category`,`filename`,`file_ext`,`date_added`) VALUES (22474,14199,'095_98_1002_1003_148_98_1001_003','pdf','2016-03-19 19:02:12');
INSERT INTO `file` (`id`,`id_category`,`filename`,`file_ext`,`date_added`) VALUES (22475,14199,'095_98_1002_1003_148_98_1001_001','pdf','2016-03-19 19:02:11');
I've tried to use MATCH () AGAINST (), but it turned out it's not a good idea if you don't have spaces in the string and want a "string contains" search like:
SELECT id, filename FROM `file` WHERE MATCH(filename) AGAINST ('1002*' IN BOOLEAN MODE);
This is not going to return what I need. What I'm considering is to keep FULLTEXT but, while importing, split every filename into parts of length 3 (the minimum string length a user can provide) separated by spaces, and then use queries like this:
SELECT * FROM `file` WHERE MATCH(filename) AGAINST ('100*' IN BOOLEAN MODE);
Of course I can leave filenames as they are and use LIKE operator:
SELECT * FROM `file` WHERE filename LIKE '%100%'
but there are a lot of negative opinions about using LIKE for large data sets. I'm curious whether my solution of adding spaces to file names is a good idea.
Attempting to use FULLTEXT: requires space, limits you (mostly) to full "words", gets inefficient with "short" words, misses "stop words", etc.
LIKE '%100%', though inefficient because it must test every row, is what you need.
You imply that all the relevant parts of the filenames are numbers? And that you only want to test for whole parts? That is 22_100_33 will be searched for 22, 100, and 33, but not for 2, 10, 00, etc?? If all that is the case, then LIKE will not work correctly. Example: 101_1000 will be caught by LIKE '%100%'.
So, maybe you want to build an "inverted index": for 10_1001_20_30_40_50, you would have 6 rows in a table (10, 1001, etc.), each with either the rest of the columns or some id(s) for joining to the file table.
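As a rough illustration of that inverted index, the Python sketch below splits each filename on underscores and maps every part to the files that contain it. The sample ids, the third filename, and the helper name build_inverted_index are made up for the example; in the database the same mapping would be a table of (part, file id) rows.

```python
from collections import defaultdict


def build_inverted_index(files):
    """Map each underscore-separated part of a filename to the ids of files containing it."""
    index = defaultdict(set)
    for file_id, filename in files.items():
        for part in filename.split("_"):
            index[part].add(file_id)
    return index


files = {1: "10_1001_20_30_40_50", 2: "101_1000", 3: "22_100_33"}
index = build_inverted_index(files)

# Whole-part lookup versus a LIKE '%100%' style substring test.
whole_part = sorted(index.get("100", set()))
substring = sorted(fid for fid, name in files.items() if "100" in name)
print(whole_part, substring)
```

The whole-part lookup returns only the file that really has a 100 part, whereas the substring test also picks up 1001 and 1000.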
there are a lot of negative opinions about using LIKE for large data sets
Chances are it would be good enough for your case; I would test it first.
If you really want to speed it up, I can think of one option, but the sacrifices would be huge: memory, insertion times, maintainability, flexibility, complexity... You can build an "inverted index" of suffixes. Each stored value is a suffix of the filename, which the query then searches by prefix (hence the column name). The table would look like this:
CREATE TABLE Pref (
prefix varchar(255) NOT NULL,
fileid bigint(20) unsigned NOT NULL,
-- InnoDB clusters rows by primary key, so they are stored ordered by prefix
PRIMARY KEY (prefix, fileid)
) ENGINE=InnoDB;
and have data like this
'095_98_1002_1003_148_98_1001_003', 22474
'95_98_1002_1003_148_98_1001_003', 22474
'5_98_1002_1003_148_98_1001_003', 22474
'_98_1002_1003_148_98_1001_003', 22474
'98_1002_1003_148_98_1001_003', 22474
...
'03', 22474
'3', 22474
It has a clustered primary key on both columns, so it is ordered by prefix and you can turn the infix search '%abcd%' into the prefix search 'abcd%'. The query would then have the form
SELECT id, filename FROM `file`
WHERE id IN (SELECT fileid FROM Pref WHERE prefix like 'abcd%')
You just have to add triggers to keep it in sync with the main table. Beware that when you delete rows from this table, you should avoid searching by fileid without a prefix specified: fileid is not the leading column of the key, so the performance would be a disaster.
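The suffix idea can be sketched in Python to show why a prefix search over suffixes answers an infix query. This is only an in-memory model under stated assumptions: a sorted list plays the role of the clustered key, bisect plays the role of the index range scan, and the helper names suffix_rows and infix_search are hypothetical.

```python
from bisect import bisect_left


def suffix_rows(file_id, filename):
    """One (suffix, file_id) row per starting position, like the sample data above."""
    return [(filename[i:], file_id) for i in range(len(filename))]


def infix_search(rows, term):
    """`rows` must be sorted. Returns ids of files whose name contains `term`."""
    found = set()
    # Jump to the first suffix >= term, then scan while the prefix still matches.
    i = bisect_left(rows, (term,))
    while i < len(rows) and rows[i][0].startswith(term):
        found.add(rows[i][1])
        i += 1
    return found


names = {
    22474: "095_98_1002_1003_148_98_1001_003",
    22475: "095_98_1002_1003_148_98_1001_001",
}
rows = sorted(r for fid, name in names.items() for r in suffix_rows(fid, name))
print(sorted(infix_search(rows, "1001")))
```

Note the cost visible even here: every filename contributes as many rows as it has characters, which is the memory and insertion overhead mentioned above.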

How to create index in SQL to increase performance

I have around 200,000 rows in a database table. When I execute my search query, it takes around 4-5 seconds to show the results on the next page. I want the query to be fast, with results loading in under 2 seconds. I have around 16 columns in my table.
Following is my query for creation of table
Create table xml(
PID int not null,
Percentdisc int not null,
name varchar(100) not null,
brand varchar(30) not null,
store varchar(30) not null,
price int not null,
category varchar(20) not null,
url1 varchar(300) not null,
emavail varchar(100) not null,
dtime varchar(100) not null,
stock varchar(30) not null,
description varchar(200) not null,
avail varchar(20) not null,
tags varchar(30) not null,
dprice int not null,
url2 varchar(300),
url3 varchar(300),
sid int primary key auto_increment);
Select query which I'm using
select * from feed where (name like '%Baby%' And NAME like '%Bassinet%')
I don't have much knowledge of database indexing for performance. Please advise which index to use.
Indexes aren't going to help. LIKE with a leading wildcard is a non-sargable predicate. http://en.wikipedia.org/wiki/Sargable
A wildcard operator % at the start of the match string renders any index on the column useless.
The more characters there are before the first wildcard, the faster the index lookup.
Anyway, you can add an index to the existing table:
ALTER TABLE feed ADD INDEX (NAME);
This will not use the index even after it is created on the NAME column, because the pattern has a leading % character:
select * from feed where (name like '%Baby%' And NAME like '%Bassinet%')
This can use the index because the leading % is removed (note that it changes the meaning: it only matches names that start with the pattern, and no name can start with both Baby and Bassinet, so this pair of conditions returns nothing):
select * from feed where (name like 'Baby%' And NAME like 'Bassinet%')
There's a good read here.
LIKE does not use full-text indexing. If you want full-text searching, you can use the MySQL full-text search functions; see the MySQL documentation on this.
Here's the syntax for adding INDEX in MySQL:
ALTER TABLE `feed`
ADD INDEX (`Name`);
MySQL MATCH example (the column needs a FULLTEXT index, added with ADD FULLTEXT INDEX instead of ADD INDEX):
Prefix matches: (Matches: Babylonian, Bassinets etc.)
SELECT * FROM `feed` WHERE MATCH (NAME) AGAINST ("+Baby* +Bassinet*" IN BOOLEAN MODE);
Whole-word matches:
SELECT * FROM `feed` WHERE MATCH (NAME) AGAINST ("+Baby +Bassinet" IN BOOLEAN MODE);
In your case an index is not useful. A LIKE search with a leading wildcard does not use the index. A direct comparison, e.g. columnname = 'Ajay', does search the index (if one exists). The reason is that an index is ordered by the stored values from their first character, so it cannot help locate text in the middle of a value.
You can use full-text search for this, defining it only on the columns in which you need to find data. FTS is useful and returns results faster as the amount of data grows.
For how to enable FTS (the linked article covers SQL Server), see the link:
http://blog.sqlauthority.com/2008/09/05/sql-server-creating-full-text-catalog-and-index/

Select rows where column LIKE dictionary word

I have 2 tables:
Dictionary - Contains roughly 36,000 words
CREATE TABLE IF NOT EXISTS `dictionary` (
`word` varchar(255) NOT NULL,
PRIMARY KEY (`word`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Datas - Contains roughly 100,000 rows
CREATE TABLE IF NOT EXISTS `datas` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`hash` varchar(32) NOT NULL,
`data` varchar(255) NOT NULL,
`length` int(11) NOT NULL,
`time` int(11) NOT NULL,
PRIMARY KEY (`ID`),
UNIQUE KEY `hash` (`hash`),
KEY `data` (`data`),
KEY `length` (`length`),
KEY `time` (`time`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=105316 ;
I would like to somehow select all the rows from datas where the column data contains 1 or more words from the dictionary.
I understand this is a big ask, it would need to match all of these rows together in every combination possible, so it needs the best optimization.
I have tried the below query, but it just hangs for ages:
SELECT `datas`.*, `dictionary`.`word`
FROM `datas`, `dictionary`
WHERE `datas`.`data` LIKE CONCAT('%', `dictionary`.`word`, '%')
AND LENGTH(`dictionary`.`word`) > 3
ORDER BY `length` ASC
LIMIT 15
I have also tried something similar to the above with a left join, and on clause that specified the like statement.
This is actually not an easy problem, what you are trying to perform is called Full Text Search, and relational databases are not the best tools for such a task. If this is some kind of a core functionality consider using solutions dedicated for this kind of operations, like Sphinx Search Server.
If this is not a "mission critical" system, you can try something else. I can see that the datas.data column isn't really long, so you can create a structure dedicated to your task and maintain it during operational use. For example, create a table:
CREATE TABLE `dictionary_datas` (
`datas_id` int(11) NOT NULL,
`word` varchar(255) NOT NULL,
FOREIGN KEY (`datas_id`) REFERENCES `datas` (`ID`),
FOREIGN KEY (`word`) REFERENCES `dictionary` (`word`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Now, any time you insert, delete or modify rows in the datas or dictionary tables, you update dictionary_datas, recording which datas_id contains which words (basically a many-to-many relation). Of course this will degrade your performance, so if you have a high transactional load on your system, you have to do it periodically. For example, set up a cron job that runs every night at 03:00 and updates the table. To simplify the task you can add a TO_CHECK flag to the DATAS table and update only the records where it is 1 (after you update dictionary_datas you switch the value to 0). Remember, by the way, to refresh the whole DATAS table after an update to the DICTIONARY table. 36,000 and 100,000 aren't big numbers in terms of data processing.
Once you have this table you can just query it like:
SELECT datas_id, count(*) AS words_num FROM dictionary_datas GROUP BY datas_id HAVING count(*) > 3;
To speed up the query (and yet slow down its updates) you can create a composite index on its columns datas_id, word (in EXACTLY that order). If you decide to refresh the data periodically, you should drop the index before the refresh, then refresh the data, and finally recreate the index afterwards; this will be faster.
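The periodic refresh could compute the (datas_id, word) pairs in application code and then bulk-insert them into dictionary_datas. The Python sketch below shows only the pairing step. The helper name build_links and the sample rows are hypothetical, the minimum word length of 4 mirrors the LENGTH(word) > 3 filter from the question, and lowercasing both sides is an assumption meant to imitate a case-insensitive collation.

```python
def build_links(datas, words, min_len=4):
    """Return sorted (datas_id, word) pairs where the word occurs inside the data text.

    datas: dict of id -> text, words: iterable of dictionary words.
    """
    usable = [w.lower() for w in words if len(w) >= min_len]
    return sorted(
        (data_id, word)
        for data_id, text in datas.items()
        for word in usable
        if word in text.lower()
    )


datas = {1: "sunflower seeds", 2: "xyz123", 3: "Flowerpot"}
words = ["flower", "seed", "sun", "pot"]
links = build_links(datas, words)
print(links)
```

This is still a words-times-rows comparison, but it runs once per refresh instead of once per query, which is the trade-off described above.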
I'm not sure if I understood your problem, but I think this could be a solution. Also, I think people don't like Regular Expression but this works for me to select columns where their value has more than 1 word.
SELECT * FROM datas WHERE data REGEXP "([a-z] )+"
Have you tried this?
select *
from dictionary, datas
where position(word IN data) > 0
;
This is very inefficient, but might be good enough for you. Here is a fiddle.
For better performance, you could try placing a full-text index on your text column data and then querying it with the database's full-text predicate (MATCH ... AGAINST in MySQL, CONTAINS in SQL Server and Oracle) instead of POSITION.

mysql fulltext MATCH,AGAINST

I am trying to do a full text search on a field to match specific parts of a string. Consider a long string holding array values like 201:::1###193:::5###193:::6###202:::6. ### separates array elements and ::: separates key=>val. Now my understanding of MATCH ... AGAINST is that it can match portions of a string in boolean mode, but when I do something along the lines of
SELECT
a.settings
, MATCH(a.settings) AGAINST('201:::1') as relevance
, b.maxrelevance
, (MATCH(a.settings) AGAINST('201:::1'))/b.maxrelevance*100 as relevanceperc
FROM
users_profile a
, (SELECT MAX(MATCH(settings) AGAINST('201:::1')) as maxrelevance FROM users_profile LIMIT 1) b
WHERE
MATCH(a.settings) AGAINST('201:::1')
ORDER BY
relevance DESC;
Table example
CREATE TABLE users_profile (
id int(11) default NULL,
profile text,
views int(11) default NULL,
friends_list text,
settings text,
points int(11) default NULL,
KEY id (id),
FULLTEXT KEY settings (settings)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
I'm getting zero results. Any ideas are welcome.
MySQL fulltext indexes are designed to store natural language words. Your sample
201:::1###193:::5###193:::6###202:::6
is made up of only numbers as the significant parts, such as 201, 1, 193... Because very short words are rarely useful, ft_min_word_len is usually set at 4, which means none of the numbers are even in the index.
Fulltext isn't the solution to this problem.
If all you want is to count how many times an expression occurs in the column, just use
(length(a.settings) - length(replace(a.settings,'201:::1',''))) / length('201:::1')
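The length arithmetic is easier to follow with concrete numbers, so here is the same formula as a Python sketch, followed by an exact key/value check done by splitting on the separators from the question. The helper names count_occurrences and parse_settings are hypothetical, and the second sample string is made up to show a limitation.

```python
def count_occurrences(settings: str, needle: str) -> int:
    """Same arithmetic as the SQL: removed characters divided by the needle length."""
    return (len(settings) - len(settings.replace(needle, ""))) // len(needle)


def parse_settings(settings: str):
    """Split 'key:::val###key:::val' into a list of (key, val) tuples."""
    return [tuple(item.split(":::", 1)) for item in settings.split("###") if item]


settings = "201:::1###193:::5###193:::6###202:::6"
print(count_occurrences(settings, "201:::1"))
print(parse_settings(settings))

# A substring count cannot tell 201:::1 from 201:::10, but an exact pair check can.
tricky = "201:::10###193:::5"
print(count_occurrences(tricky, "201:::1"), ("201", "1") in parse_settings(tricky))
```

If exact element matches matter, the substring-based count (in SQL or in Python) needs the same delimiter care as any other LIKE-style search.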