Lookup against MySQL TEXT type column

My table/model has a TEXT column. When I filter records on the model itself, ActiveRecord's where produces the correct SQL and returns the correct results. Here is what I mean:
MyNamespace::MyValue.where(value: 'Good Quality')
Produces this SQL :
SELECT `my_namespace_my_values`.*
FROM `my_namespace_my_values`
WHERE `my_namespace_my_values`.`value` = '\\\"Good Quality\\\"'
Take another example, where I'm joining MyNamespace::MyValue and filtering on the same value column, but from the other model (which has a relation to my_values). See this (query #2):
OtherModel.joins(:my_values).where(my_values: { value: 'Good Quality' })
This does not produce the correct query: it filters on the value column as if it were a String column rather than Text, and therefore returns incorrect results (only pasting the relevant WHERE):
WHERE `my_namespace_my_values`.`value` = 'Good Quality'
Now I can get past this by using LIKE inside my AR where, which produces the correct result but a slightly different query. This is what I mean:
OtherModel.joins(:my_values).where('my_values.value LIKE ?', '%Good Quality%')
Finally, arriving at my questions. What is this, and how is it generated for where on the model (for the text column type)?
WHERE `my_namespace_my_values`.`value` = '\\\"Good Quality\\\"'
Maybe the most important question: what is the difference in terms of performance between:
WHERE `my_namespace_my_values`.`value` = '\\\"Good Quality\\\"'
and this :
(my_namespace_my_values.value LIKE '%Good Quality%')
and, more importantly, how do I get my query with joins (query #2) to produce a where like this:
WHERE `my_namespace_my_values`.`value` = '\\\"Good Quality\\\"'

(Partial answer -- approaching from the MySQL side.)
What will/won't match
Case 1: (I don't know where the extra backslashes and quotes come from.)
WHERE `my_namespace_my_values`.`value` = '\\\"Good Quality\\\"'
\"Good Quality\" -- matches
Good Quality -- does not match
The product has Good Quality. -- does not match
Case 2: (Find Good Quality anywhere in value.)
WHERE my_namespace_my_values.value LIKE '%Good Quality%'
\"Good Quality\" -- matches
Good Quality -- matches
The product has Good Quality. -- matches
Case 3:
WHERE `my_namespace_my_values`.`value` = 'Good Quality'
\"Good Quality\" -- does not match
Good Quality -- matches
The product has Good Quality. -- does not match
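The three cases above can be reproduced with a small sketch. It uses SQLite from Python's standard library as a stand-in for MySQL, which is an assumption made only so the example is self-contained; collation and string-literal escaping differ between the two engines, so the values are passed as bound parameters. The table name my_values and the helper matching are hypothetical.

```python
import sqlite3

# The three sample values from the cases above.
rows = [
    r'\"Good Quality\"',
    'Good Quality',
    'The product has Good Quality.',
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_values (value TEXT)")
conn.executemany("INSERT INTO my_values (value) VALUES (?)", [(r,) for r in rows])

def matching(where_clause, param):
    """Return the stored values matched by the given WHERE clause."""
    sql = f"SELECT value FROM my_values WHERE {where_clause} ORDER BY rowid"
    return [row[0] for row in conn.execute(sql, (param,))]

case1 = matching("value = ?", r'\"Good Quality\"')   # exact match, with the extra characters
case2 = matching("value LIKE ?", "%Good Quality%")   # substring match anywhere
case3 = matching("value = ?", "Good Quality")        # exact match, plain text

print(case1, case2, case3, sep="\n")
```

Case 1 and Case 3 each return exactly one row, while Case 2 returns all three.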
Performance:
If value is declared TEXT and has no prefix index, all cases are slow.
If value is not indexed, all are slow.
If value is VARCHAR(255) (or smaller) and indexed, Cases 1 and 3 are faster. It can quickly find the one row, versus checking all rows.
Phrased differently:
LIKE with a leading wildcard (%) is slow.
Indexing the column is important for performance, but TEXT can only be indexed on a prefix (e.g. INDEX(value(255))), not on the full column.

What is this and how it's being generated for where on the model (for
text column type)?
That's generated behind the scenes by Active Record's (Arel) lexical engine.
See my answer below on your second question as to why.
What is the difference in terms of performance using...
The "=" operator matches by whole-string comparison,
while LIKE matches character by character.
In my projects I have tables with millions of rows, and in my experience it is much faster to use the "=" comparator or a regexp than to use LIKE in a query.
How do I get my query with joins (query #2) produce where like this...
Can you try this:
OtherModel.joins(:my_values).where(MyNamespace::MyValue.arel_table[:value].eq('\\\"Good Quality\\\"'))

I think this might be helpful. The MySQL documentation on LIKE says:
to search for \n, specify it as \\n. To search for \, specify it as
\\\\; this is because the backslashes are stripped once by the parser
and again when the pattern match is made, leaving a single backslash
to be matched against.
LIKE and = are different operators.
= is a comparison operator that operates on numbers and strings. When comparing strings, the comparison operator compares whole strings.
LIKE is a string operator that compares character by character.
mysql> SELECT 'ä' LIKE 'ae' COLLATE latin1_german2_ci;
+-----------------------------------------+
| 'ä' LIKE 'ae' COLLATE latin1_german2_ci |
+-----------------------------------------+
|                                       0 |
+-----------------------------------------+
mysql> SELECT 'ä' = 'ae' COLLATE latin1_german2_ci;
+--------------------------------------+
| 'ä' = 'ae' COLLATE latin1_german2_ci |
+--------------------------------------+
|                                    1 |
+--------------------------------------+

The '=' operator looks for an exact match, while the LIKE operator works more like pattern matching, with '%' being similar to '.*' in regular expressions.
So if you have entries with
Good Quality
More Good Quality
only LIKE will get both results.
Regarding the escaped string, I am not sure where it is generated, but it looks like some standardized escaping to make the value valid for SQL.

Related

MYSQL REGEX search many words with no order condition

I am trying to use a regex with MySQL that searches for whole words in a JSON array string, but I don't want the regex to depend on word order, because I don't know the order.
So I first wrote my regex on regex101 (https://regex101.com/r/wNVyaZ/1) and then tried to convert it for MySQL.
WHERE `Wish`.`services` REGEXP '^([^>].*[[:<:]]Hygiène[[:>:]])([^>].*[[:<:]]Radiothérapie[[:>:]]).+';
WHERE `Wish`.`services` REGEXP '^([^>].*[[:<:]]Hygiène[[:>:]])([^>].*[[:<:]]Andrologie[[:>:]]).+';
The first query returns a result, because "Hygiène" comes before "Radiothérapie", but in the second query "Andrologie" comes before "Hygiène" in the data, not after it as written in the query. The problem is that the query is generated automatically from a list of services that are chosen in no particular order, and I want to match the whole words if they exist, no matter what order they are in.
You can search for words in JSON like the following (I tested on MySQL 5.7):
select * from wish
where json_search(services, 'one', 'Hygiène') is not null
and json_search(services, 'one', 'Andrologie') is not null;
+------------------------------------------------------------+
| services |
+------------------------------------------------------------+
| ["Andrologie", "Angiologie", "Hygiène", "Radiothérapie"] |
+------------------------------------------------------------+
See https://dev.mysql.com/doc/refman/5.7/en/json-search-functions.html#function_json-search
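The two json_search conditions express an order-independent membership test: every requested service must be present somewhere in the array. A minimal client-side sketch of the same check, assuming the services column holds a JSON array of strings as in the row shown above (the helper name has_all_services is hypothetical):

```python
import json

def has_all_services(services_json, wanted):
    """True if every wanted name appears in the JSON array, in any order."""
    return set(wanted) <= set(json.loads(services_json))

# The sample row from the result above.
row = '["Andrologie", "Angiologie", "Hygiène", "Radiothérapie"]'

print(has_all_services(row, ["Hygiène", "Andrologie"]))
print(has_all_services(row, ["Radiothérapie", "Hygiène"]))
```

Because the check is a set comparison, the order in which the services were chosen does not matter, which is the property the regex could not provide.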
If you can, use the JSON search queries (you need a MySQL with JSON support).
If it's advisable, consider changing the database structure and enter the various "words" as a related table. This would allow you much more powerful (and faster) queries.
JOIN has_service AS hh ON (hh.row_id = id)
JOIN services AS ss ON (hh.service_id = ss.id
AND ss.name IN ('Hygiène', 'Angiologie', ...))
Otherwise, in this context, consider that you're not really doing a regexp search, and that you're doing a full table scan anyway (unless you have MySQL 8.0+ or PerconaDB 5.7+ (not sure) and an index on the full extent of the 'services' column), so several LIKE conditions will actually cost you less:
WHERE (services LIKE '%"Hygiène"%'
OR services LIKE '%"Angiologie"%'
...)
or
IF(services LIKE '%"Hygiène"%', 1, 0)
+IF(services LIKE '%"Angiologie"%', 1, 0)
+ ... AS score
HAVING score > 0 -- or score = 5 if you only want rows that match all five
ORDER BY score DESC;

Find accented and non-accented variations of same word

I have a database table which represent people and the records have people's names in them. Some of the names have accented characters in them. Some do not. Some are non-accented duplicates of the accented version.
I need to generate a report of all of the potential duplicates by finding names that are the same (first, middle, last) except for the accents so that someone else can go through this list and verify which are true duplicates, and which are actually different people (I'm assuming they have some other way of knowing).
For example: Jose DISTINCT-LAST-NAME and José DISTINCT-LAST-NAME should be picked up as potential duplicates because they have the same characters, but one has an accented character.
How can this type of query be written in MySQL?
This question: How to remove accents in MySQL? is not the same. It is asking about de-accenting strings in-place and the poster already has a second column of data that has been de-accented. Also, the accepted answer to that question is to set the character set and collation. I have already set the character set and collation.
I am trying to generate a report that finds strings in different records that are the same except for their accents.
I found your question very interesting.
According to this article Accents in text searches, using "like" condition with some character collation adjustments will solve your problem. I have not tested this solution, so if it helps you, please come back and tell us.
Here is a similar question: Accent insensitive search query in MySQL,
according to that, you can use something like:
where 'José' like 'Jose' collate utf8_general_ci
Well, I found something that seems to work (the real query involves a few more other fields, but the same basic idea):
select distinct p1.person_id, p1.first_name, p1.last_name, p2.last_name
from people as p1, people as p2
where binary p1.last_name <> binary p2.last_name
and p1.last_name = p2.last_name
and p1.first_name = p2.first_name
order by p1.last_name, p1.first_name, p2.last_name, p2.first_name;
The results look like this:
12345 Bob Jose José
56789 Bob José Jose
...
This makes sense as there are 2 records for Bob José and I know that in this case, it is the same person but one record is missing the accent.
The trick is to do a binary and non-binary compare on the "last_name" field as well as matching on all other fields. This way we can find everything that is "equal" and also not binary-equal. This works because with the current character-set/collation (utf8/utf8_general_ci), Jose and José are equal but are not binary-equal. You can try it out like this:
select 'Jose' = 'José', 'Jose' like 'José', binary 'Jose' = binary 'José';
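The same idea (names that are equal after folding accents but not byte-for-byte equal) can be sketched outside the database. This is not what the collation does internally; it is an approximation that decomposes characters with Unicode NFD and drops the combining marks. The function names and the third sample record are hypothetical.

```python
import unicodedata
from collections import defaultdict

def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def potential_duplicates(people):
    """Group (id, first, last) records whose names differ only by accents or case."""
    groups = defaultdict(list)
    for person_id, first, last in people:
        key = (strip_accents(first).casefold(), strip_accents(last).casefold())
        groups[key].append((person_id, first, last))
    # Keep only groups that contain at least two distinct spellings.
    return [g for g in groups.values() if len({(f, l) for _, f, l in g}) > 1]

people = [
    (12345, "Bob", "Jose"),
    (56789, "Bob", "José"),
    (11111, "Ann", "Smith"),  # hypothetical record with no duplicate
]
dupes = potential_duplicates(people)
print(dupes)
```

Only the two "Bob" records are reported, since they share a folded key but have different spellings.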
The Bane of Character Encodings
There are a wide variety of character-sets and encodings that may be used in MySQL, and when dealing with encoding it is important to learn what you can about them. In particular, take a close look at the differences between:
utf8_unicode_ci
utf8_general_ci
utf8_unicode_520_ci
utf8mb4_general_ci
Some character sets are built to include as many printable characters as possible, to support a wider range of uses, while others are built with the intent of portability and compatibility between systems. In particular, the utf8_unicode_ci collation treats most accented characters as equal to their non-accented equivalents. Alternatively, you could use utf8_ascii_ci which is even more restrictive.
Take a look at the utf8_unicode_ci collation chart, and What's the difference between utf8_general_ci and utf8_unicode_ci .
The best answer is from a similar question, "How to remove accents in MySQL?"
If you set an appropriate collation for the column then the value
within the field will compare equal to its unaccented equivalent
naturally.
mysql> SET NAMES 'utf8' COLLATE 'utf8_unicode_ci';
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT 'é' = 'e';
+------------+
| 'é' = 'e' |
+------------+
| 1 |
+------------+
1 row in set (0.05 sec)
How to apply this to your situation?
SELECT id, last_name
FROM people
WHERE last_name COLLATE utf8_unicode_ci IN
(
SELECT last_name
FROM people
GROUP BY last_name COLLATE utf8_unicode_ci
HAVING COUNT(last_name)>1
)

selecting values that do not have '_' with mysql

I was looking for a way to exclude values with a '_' in the results set from a mysql database.
Why would the following sql statement return no results?
select questionKey
from labels
where set_id = 674
and questionKey like 'Class%'
and questionKey not like '%_%' ;
That was the first SQL I tried, whereas
select questionKey
from labels
where set_id = 674
and questionKey like 'Class%'
and locate('_',questionKey) = 0 ;
returns
questionKey
ClassA
ClassB
ClassC
ClassD
ClassE
ClassF
ClassG
ClassNPS
ClassDis
which is the result I wanted. Both SQL statements appear to me to be logically equivalent though they are not.
As tadman and PM77 already pointed out, it's a special character. If you want to use the first query, try to escape it like this (note the backslash):
select questionKey
from labels
where set_id = 674
and questionKey like 'Class%'
and questionKey not like '%\_%' ;
In the LIKE context _ takes on special meaning and represents any single character. It's the only one other than % that means something here.
Your LOCATE() version is probably the best here, though it's worth noting that doing table scans like this can get cripplingly slow on large amounts of data. If underscore represents something important you might want to have a flag field you can set and index.
You could also use a regular expression to try and match records with a single condition:
REGEXP '^Class[^_]*$'
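The difference between the unescaped and escaped patterns can be checked with a small sketch. It runs against SQLite from Python's standard library rather than MySQL, which is an assumption made for self-containment: SQLite has no default escape character, so the ESCAPE clause is spelled out, and instr() stands in for MySQL's LOCATE() with the arguments swapped. The key 'Class_old' is a hypothetical row added so there is something to exclude.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE labels (questionKey TEXT)")
conn.executemany(
    "INSERT INTO labels VALUES (?)",
    [("ClassA",), ("ClassNPS",), ("Class_old",)],
)

def keys(where_clause):
    """Return the questionKey values matched by the given WHERE clause."""
    sql = f"SELECT questionKey FROM labels WHERE {where_clause} ORDER BY rowid"
    return [row[0] for row in conn.execute(sql)]

# Unescaped: '_' matches any single character, so every non-empty key is excluded.
unescaped = keys("questionKey NOT LIKE '%_%'")
# Escaped: with '\' declared as the escape character, '\_' is a literal underscore.
escaped = keys(r"questionKey NOT LIKE '%\_%' ESCAPE '\'")
# Substring search: no pattern syntax involved at all.
located = keys("instr(questionKey, '_') = 0")

print(unescaped, escaped, located, sep="\n")
```

The unescaped query returns nothing, while the escaped and instr() versions both return the keys without an underscore.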

MySQL - Characters matching

How would I get MySQL to be more strict with character matching?
A quick example of what I mean: say I have a table with a single column `name`. In this column, I have two names: 'Jorge' and 'Jorgé'. The only difference between these names is the ´ over the e. If I run the query SELECT * FROM table WHERE name = 'Jorge', it will return
+--------+
| name   |
+--------+
| Jorge  |
| Jorgé  |
+--------+
and if I run the query SELECT * FROM table WHERE name = 'Jorgé', it returns the same result table. How would I set MySQL to be more strict in that so that it would not return both names?
Thanks ahead.
Quick Edit: I'm using the UTF-8 character encoding
If you want to make sure that no similar characters (like e and é) are considered the same, you should use the utf8_bin collation on that column. I assume that you're using utf8_general_ci now, which will consider some similar characters to be the same. utf8_bin only matches on the exact same characters.
@G-Nugget is correct, but since you are looking at Spanish stuff you might also be interested in the utf8_spanish_ci or utf8_spanish2_ci. They correspond to modern and traditional Spanish. "ñ" is considered a separate letter, and in traditional the "ch" and "ll" are also treated as separate letters.
More here: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html

Difference between LIKE and = in MYSQL?

What's the difference between
SELECT foo FROM bar WHERE foobar='$foo'
AND
SELECT foo FROM bar WHERE foobar LIKE'$foo'
= in SQL does exact matching.
LIKE does wildcard matching, using '%' as the multi-character match symbol and '_' as the single-character match symbol. '\' is the default escape character.
foobar = '$foo' and foobar LIKE '$foo' will behave the same, because neither string contains a wildcard.
foobar LIKE '%foo' will match anything ending in 'foo'.
LIKE also has an ESCAPE clause so you can set an escape character. This will let you match literal '%' or '_' within the string. You can also do NOT LIKE.
The MySQL site has documentation on the LIKE operator. The syntax is
expression [NOT] LIKE pattern [ESCAPE 'escape']
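To make the wildcard and escape rules concrete, here is a rough model of LIKE matching written as a translation to a regular expression. It is only a sketch of the pattern syntax described above: it ignores collation (so it is case-sensitive, unlike LIKE under a _ci collation), and the function names like_to_regex and like are hypothetical.

```python
import re

def like_to_regex(pattern, escape="\\"):
    """Translate a SQL LIKE pattern into a regular expression."""
    parts = []
    chars = iter(pattern)
    for ch in chars:
        if ch == escape:
            # The character after the escape is taken literally; a trailing
            # escape character is treated as a literal as well.
            parts.append(re.escape(next(chars, escape)))
        elif ch == "%":
            parts.append(".*")   # any run of characters, including none
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def like(value, pattern, escape="\\"):
    """True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern, escape).fullmatch(value) is not None

print(like("xfoo", "%foo"))                # ends in 'foo'
print(like("a_b", r"a\_b"))                # escaped underscore is literal
print(like("50%", "50|%", escape="|"))     # custom escape, as with ESCAPE '|'
```

Without any wildcard in the pattern, like() reduces to a plain equality test, which mirrors the point that foobar = '$foo' and foobar LIKE '$foo' behave the same.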
LIKE can do wildcard matching:
SELECT foo FROM bar WHERE foobar LIKE "Foo%"
If you don't need pattern matching, then use = instead of LIKE. It's faster and more secure. (You are using parameterized queries, right?)
Please bear in mind as well that MySQL casts values depending on the situation: LIKE performs a string comparison, whereas = against a numeric column performs a numeric comparison. Consider this table:
      (int)   (varchar)
id    field1  field2
1     1       1
2     1       1,2
SELECT *
FROM test AS a
LEFT JOIN test AS b ON a.field1 LIKE b.field2
will produce
id    field1  field2  id    field1  field2
1     1       1       1     1       1
2     1       1,2     1     1       1
whereas
SELECT *
FROM test AS a
LEFT JOIN test AS b ON a.field1 = b.field2
will produce
id    field1  field2  id    field1  field2
1     1       1       1     1       1
1     1       1       2     1       1,2
2     1       1,2     1     1       1
2     1       1,2     2     1       1,2
According to the MYSQL Reference page, trailing spaces are significant in LIKE but not =, and you can use wildcards, % for any characters, and _ for exactly one character.
I think in term of speed = is faster than LIKE. As stated, = does an exact match and LIKE can use a wildcard if needed.
I always use = sign whenever I know the values of something. For example
select * from state where state='PA'
Then for likes I use things like:
select * from person where first_name like 'blah%' and last_name like 'blah%'
If you use Oracle Developers Tool, you can test it with Explain to determine the impact on the database.
The end result will be the same, but the query engine uses different logic to get to the answer. Generally, LIKE queries burn more cycles than "=" queries. But when no wildcard character is supplied, I'm not certain how the optimizer may treat that.
With the example in your question there is no difference.
But, like Jesse said you can do wildcard matching
SELECT foo FROM bar WHERE foobar LIKE "Foo%"
SELECT foo FROM bar WHERE foobar NOT LIKE "%Foo%"
More info:
http://dev.mysql.com/doc/refman/5.0/en/string-comparison-functions.html
A little bit of Google doesn't hurt...
A WHERE clause with an equal sign (=) works fine if we want an exact match. But there may be a requirement to return all the rows where 'foobar' contains "foo". This can be handled using the SQL LIKE clause along with the WHERE clause.
If the SQL LIKE clause is used along with % characters, then % works as a wildcard.
SELECT foo FROM bar WHERE foobar LIKE'$foo%'
Without a % character, the LIKE clause is very similar to the equal sign along with the WHERE clause.
In your example, they are semantically equal and should return the same output.
However, LIKE will give you the ability of pattern matching with wildcards.
You should also note that = might give you a performance boost on some systems, so if you are, for instance, searching for an exact number, = would be the preferred method.
Looks very much like taken out from a PHP script. The intention was to pattern-match the contents of variable $foo against the foo database field, but I bet it was supposed to be written in double quotes, so the contents of $foo would be fed into the query.
As you put it, there is NO difference.
It could potentially be slower, but I bet MySQL realises there are no wildcard characters in the search string, so it will not do LIKE pattern-matching after all; so really, no difference.
In my case I find Like being faster than =
Like fetched a number of rows in 0.203 secs the first time then 0.140 secs
= fetched the same rows in 0.156 secs constantly
Take your choice
I found an important difference between LIKE and equal sign = !
Example: I have a table with a field "ID" (type: int(20) ) and a record that contains the value "123456789"
If I do:
SELECT ID FROM example WHERE ID = '123456789-100'
Record with ID = '123456789' is found (is an incorrect result)
If I do:
SELECT ID FROM example WHERE ID LIKE '123456789-100'
No record is found (this is correct)
So, at least for INTEGER-fields it seems an important difference...
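This behaviour follows from how a string is converted when it is compared with a number: the leading numeric portion is used and the rest is discarded. The sketch below is a rough Python model of that conversion, not MySQL's actual implementation; the function names are hypothetical and the regular expression is an assumption about what counts as a numeric prefix.

```python
import re

# Optional whitespace and sign, then digits with an optional fraction and exponent.
_NUMERIC_PREFIX = re.compile(r"\s*[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")

def numeric_prefix(text):
    """Use the leading numeric portion of the string, or 0 if there is none."""
    match = _NUMERIC_PREFIX.match(text)
    return float(match.group(0)) if match else 0.0

def equals_int_column(column_value, literal):
    """Model `int_column = 'literal'`: both sides are compared as numbers."""
    return float(column_value) == numeric_prefix(literal)

def like_int_column(column_value, literal):
    """Model `int_column LIKE 'literal'` without wildcards: compared as strings."""
    return str(column_value) == literal

print(equals_int_column(123456789, "123456789-100"))
print(like_int_column(123456789, "123456789-100"))
```

Under this model the = comparison matches because '123456789-100' is reduced to 123456789, while the LIKE comparison does not, which agrees with the results reported above.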