Why is '>' a valid comparison with VARCHAR data in SQL? - mysql

So I'm pretty new to SQL (and scripting/coding in general). This is from one of the examples in the book, but the authors, unfortunately, decided not to include any questions about this query and neglected to expand on the '>' near the end of it.
Here is the query in question:
SELECT *
FROM easy_drinks
WHERE main > 'soda';
Here is a pastebin of a few queries, hopefully giving the perspective needed: http://pastebin.com/xfJQsBvU
Paste of DESC easy_drinks: http://pastebin.com/LZZPhk6Z
I'm just confused as to how the '>' near the end of the query works, since main is stored as a VARCHAR and 'soda' is definitely not an integer that could be compared with another integer. Yet, as you can see in the first pastebin, the query completes successfully. Why doesn't MySQL return an error, and what is the pattern behind the different queries using '>' and '<'?

It's probably doing a lexicographical ordering.
It's similar to the way you or I would order words in a dictionary. However, note that it may not handle numbers the way you'd expect.
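A couple of quick probes illustrate the idea (just constant comparisons, no table needed):
SELECT 'tonic' > 'soda';  -- 1: 't' sorts after 's', just as in a dictionary
SELECT '9' > '10';        -- 1: compared character by character, so '9' beats '1'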

You will often find >-style comparisons on strings, mostly used for alphabetical ordering, but of course the exact order is implementation dependent and can differ between languages/encodings, and in how upper- and lowercase letters are treated.
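For instance, case handling depends on the collation; MySQL's default collations are usually case-insensitive (an assumption about your server settings):
SELECT 'SODA' = 'soda';         -- 1 under a case-insensitive collation
SELECT BINARY 'SODA' = 'soda';  -- 0: byte-wise comparison distinguishes case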

Related

How to return substring positions in LIKE query

I retrieve data from a MySQL database using a simple SELECT FROM WHERE LIKE case-insensitive query, where I escape any % or _ in the LIKE clause, so really the user can only perform a basic text search and cannot mess with the pattern, because I then surround it myself with % in the LIKE clause.
For every row returned by this query, I have to search again using a JS script in order to find all the indexes of the substring in the original string. I dislike this method because it uses a different pattern matching than the LIKE query, so I can't guarantee that the algorithm is the same.
I found the MySQL functions POSITION and LOCATE that can achieve this, but they return only the first index if the substring was found, or 0 if it was not. Yes, you can set the first index to search from, and by repeatedly passing the previously returned index as the starting point until the returned index is 0, you can find all indexes of the substring, but that means a lot of additional queries and might end up slowing down my application a lot.
So I'm now wondering: is there a way to have the LIKE query return substring positions directly? I didn't find one, perhaps because I lack MySQL vocabulary (I'm a noob).
Simple answer: No.
Longer answer: MySQL has no syntax or mechanism to return an array of anything -- from either a SELECT or even a Stored Procedure.
Maybe answer: You could write a stored procedure that loops through one result, finding the matches and packing them into a comma-separated list; a sketch follows. But I cringe at how messy that code would be. I would quickly decide to write JS code, as you have already done.
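Something like this minimal sketch (the function name all_positions is made up): it walks the string with LOCATE, collecting every match position into a comma-separated list:
DELIMITER //
CREATE FUNCTION all_positions(haystack TEXT, needle VARCHAR(255))
RETURNS TEXT DETERMINISTIC
BEGIN
  DECLARE pos INT DEFAULT 0;
  DECLARE result TEXT DEFAULT '';
  SET pos = LOCATE(needle, haystack);
  WHILE pos > 0 DO
    SET result = IF(result = '', CAST(pos AS CHAR), CONCAT(result, ',', pos));
    SET pos = LOCATE(needle, haystack, pos + 1);
  END WHILE;
  RETURN result;  -- e.g. '3,17,42', or '' when there is no match
END //
DELIMITER ;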
Moral of the story: SQL is not a full language. It's great at storing and efficiently retrieving large sets of rows, but lousy at string manipulation or arrays (other than "rows").
Commalist
If you are actually searching a simple list of things separated by commas, then FIND_IN_SET() and SUBSTRING_INDEX() in MySQL closely match what JS can do with its split-on-comma method on strings.
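For example, FIND_IN_SET returns the 1-based position of an element in a comma list, and SUBSTRING_INDEX slices the list by position:
SELECT FIND_IN_SET('b', 'a,b,c');                                  -- 2
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX('a,b,c', ',', 2), ',', -1); -- 'b'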

Optimization: WHERE x IN (1, 2 .., 100.000) vs INNER JOIN tmp_table USING(x)?

I was at an interesting job interview recently. There I was asked about optimizing a query with a WHERE..IN clause containing a long list of scalars (thousands of values, that is). This question is NOT about subqueries in the IN clause, but about a simple list of scalars.
I answered right away that this can be optimized using an INNER JOIN with another table (possibly a temporary one), which would contain only those scalars. My answer was accepted, and there was a note from the reviewer that "no database engine currently can optimize long WHERE..IN conditions to be performant enough". I nodded.
But when I walked out, I started to have some doubts. The condition seemed too trivial and too widely used for a modern RDBMS not to be able to optimize it. So, I started some digging.
PostgreSQL:
It seems that PostgreSQL parses scalar IN() constructions into a ScalarArrayOpExpr structure, which is sorted. This structure is later used during an index scan to locate matching rows. EXPLAIN ANALYZE for such queries shows only one loop; no joins are done. So, I expect such a query to be even faster than an INNER JOIN. I tried some queries on my existing database and my tests supported that position, but I didn't care much about test purity, and Postgres was running under Vagrant, so I might be wrong.
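A quick way to see this (table and column names are assumptions): the plan folds the list into a single index condition rather than a join:
EXPLAIN ANALYZE SELECT * FROM items WHERE id IN (1, 2, 3, 1000);
-- Index Scan using items_pkey on items
--   Index Cond: (id = ANY ('{1,2,3,1000}'::integer[]))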
MSSQL Server:
MSSQL Server builds a hash structure from the list of constant expressions and then does a hash join with the source table. Even though no sorting seems to be done, performance should be comparable, I think. I didn't do any tests, since I don't have any experience with this RDBMS.
MySQL Server:
The 13th of these slides says that before 5.0 this problem indeed existed in MySQL in some cases. But other than that, I didn't find any other problems related to bad IN () treatment. I didn't find any proof of the opposite either, unfortunately. If you did, please kick me.
SQLite:
The documentation page hints at some problems, but I tend to believe the things described there are really at a conceptual level. No other information was found.
So, I'm starting to think I misunderstood my interviewer or misused Google ;) Or maybe it's because we didn't set any conditions and our talk became a little vague (we didn't specify any concrete RDBMS or other conditions; it was just abstract talk).
It looks like the days when databases rewrote IN() as a set of OR statements (which, by the way, can sometimes cause problems with NULL values in lists) are long gone. Or not?
Of course, in cases where a list of scalars is longer than allowed database protocol packet, INNER JOIN might be the only solution available.
I think in some cases query parsing time alone (if the query was not prepared) can kill performance.
Also, a database could be unable to prepare an IN(?) query, which would lead to reparsing it again and again (which may kill performance). Actually, I never tried, but I think that even in such cases query parsing and planning is not huge compared to query execution.
But other than that I do not see other problems. Well, other than the problem of just HAVING this problem: if you have queries that contain thousands of IDs inside, something is wrong with your architecture.
Do you?
Your answer is only correct if you build an index (preferably a primary key index) on the list, unless the list is really small.
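A minimal sketch of that approach (table and column names are assumptions):
CREATE TEMPORARY TABLE tmp_scalars (x INT PRIMARY KEY);
INSERT INTO tmp_scalars VALUES (1), (2), (3);  -- ...thousands more
SELECT t.*
FROM big_table AS t
INNER JOIN tmp_scalars USING (x);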
Any description of optimization is definitely database specific. However, MySQL is quite specific about how it optimizes in:
Returns 1 if expr is equal to any of the values in the IN list, else returns 0. If all values are constants, they are evaluated according to the type of expr and sorted. The search for the item then is done using a binary search. This means IN is very quick if the IN value list consists entirely of constants.
This would definitely be a case where using IN would be faster than using another table -- and probably faster than another table using a primary key index.
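To make the distinction concrete (t, id, and other_col are assumed names): an all-constant list is sorted once and binary-searched per row, while a non-constant element in the list disables that path, per the documentation quoted above:
SELECT * FROM t WHERE id IN (3, 41, 312, 10000);       -- constants: binary search
SELECT * FROM t WHERE id IN (3, 41, other_col, 10000); -- not all constants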
I think that SQL Server replaces the IN with a list of ORs. These would then be implemented as sequential comparisons. Note that sequential comparisons can be faster than a binary search, if some elements are much more common than others and those appear first in the list.
I think it is bad application design. The values used with the IN operator are most probably not hardcoded but dynamic. In such cases we should always use prepared statements, the only reliable mechanism to prevent SQL injection.
Here that means dynamically formatting the prepared statement (as the number of placeholders is dynamic too), and it also means excessive hard parsing (as many unique queries as there are distinct counts of IN values: IN (?), IN (?,?), ...).
I would either load these values into a table and use a join, as you mentioned (unless loading is too much overhead), or use an Oracle pipelined function, IN foo(params), where the params argument can be a complex structure (array) coming from memory (PL/SQL, Java, etc.).
If the number of values is large, I would consider using EXISTS (SELECT 1 FROM mytable m WHERE m.key = x.key) or EXISTS (SELECT x FROM foo(params)) instead of IN. In such cases EXISTS provides better performance than IN.
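A hedged sketch of the EXISTS rewrite against the loaded scalar table (names as in the temp-table example above):
SELECT t.*
FROM big_table AS t
WHERE EXISTS (SELECT 1 FROM tmp_scalars s WHERE s.x = t.x);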

MySQL Fulltext search but using LIKE

I've recently been doing some string searches on a table with about 50k strings in it; fairly large, I'd say, but not that big. I was doing some nested queries for a 'search within results' kind of thing, using LIKE statements to match a searched keyword.
I came across MySQL's Full-Text search, which I tried, so I added a fulltext index to my str column. I'm aware that Full-text search doesn't work on virtually created tables or even with Views, so queries with sub-selects will not fit. As I mentioned, I was doing nested queries; an example is:
SELECT s2.id, s2.str
FROM
(
SELECT s1.id, s1.str
FROM
(
SELECT id, str
FROM strings
WHERE str LIKE '%term%'
) AS s1
WHERE s1.str LIKE '%another_term%'
) AS s2
WHERE s2.str LIKE '%a_much_deeper_term%';
This is actually not applied to any code yet; I was just doing some tests. Also, searching strings like this can easily be achieved by using Sphinx (performance-wise), but let's consider Sphinx not being available, as I want to know how well this works as a pure SQL query. Running this query on a table without Full-text added takes about 2.97 secs (depending on the search term). However, running this query on a table with Full-text added to the str column finished in about 104 ms, which is fast (I think?).
My question is simple: is it valid to use LIKE, or is it good practice to use it at all, on a table with Full-text added, when normally we would use MATCH and AGAINST statements?
Thanks!
In this case you don't necessarily need subselects. You can simply use:
SELECT id, str
FROM item_strings
WHERE str LIKE '%term%'
AND str LIKE '%another_term%'
AND str LIKE '%a_much_deeper_term%'
... but this also raises a good question: the order in which you exclude the rows. I guess MySQL is smart enough to assume that the longest term will be the most restrictive, so starting with a_much_deeper_term it will eliminate most of the records, then perform additional comparisons only on a few rows. Contrary to this, if you start with term you will probably end up with many candidate records which then have to be compared against the rest of the terms.
The interesting part is that you can force the order in which the comparisons are made by using your original subselect example. This gives you the opportunity to decide which term is the most restrictive based on more than just its length, for example:
the ratio of consonants to vowels
the longest chain of consonants in the word
the most used vowel in the word
...etc. You can also apply some heuristics based on the type of textual information you are handling.
Edit:
This is just a hunch, but it might be possible to apply the LIKE to the words in the fulltext index itself, then match the rows against the index as if you had searched for full words.
I'm not sure if this is actually done, but it would be a smart thing for the MySQL people to pull off. Also note that this theory can only be used if all possible occurrences are in fact in the fulltext index. For this you need that:
Your search pattern must be at least the minimal word length. (If you are searching for example %id%, it can be part of a 3-letter word too, which is excluded by default from the FULLTEXT index.)
Your search pattern must not be a substring of any excluded word (stopword), for example: and, of, etc.
Your pattern must not contain any special characters.
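Back to the original question: a practical pattern is to combine the two, letting the FULLTEXT index cheaply narrow the candidates and LIKE do the exact substring check afterwards. A hedged sketch, assuming each prefilter term is a whole word long enough to be indexed (otherwise the MATCH prefilter would wrongly drop rows):
SELECT id, str
FROM strings
WHERE MATCH (str) AGAINST ('+term +another_term' IN BOOLEAN MODE)
  AND str LIKE '%a_much_deeper_term%';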

Disadvantages of quoting integers in a Mysql query?

I am curious about the disadvantages of quoting integers in MySQL queries.
For example
SELECT col1,col2,col3 FROM table WHERE col1='3';
VS
SELECT col1,col2,col3 FROM table WHERE col1= 3;
If there is a performance cost, what is the size of it and why does it occur? Are there any other disadvantages other than performance?
Thanks
Andrew
Edit: The reason for this question
1. Because I want to learn the difference because I am curious
2. I am experimenting with a way of passing composite keys from my database around in my PHP code as pseudo-ID-keys (PIKs). These PIKs are then used to target the record.
For example, given a primary key (AreaCode,Category,RecordDtm)
My PIK in the url would look like this:
index.php?action=hello&Id=20001,trvl,2010:10:10 17:10:45
And I would select this record like this:
$Id = $_POST['Id']; // equals 20001,trvl,2010:10:10 17:10:45
$sql = "SELECT AreaCode,Category,RecordDtm,OtherColumns.... FROM table WHERE (AreaCode,Category,RecordDtm) = ({$Id})";
$mysqli->query($sql);
......and so on.
At this point the query won't work because of the datetime (which must be quoted), and it is open to SQL injection because I haven't escaped those values. Given the fact that I won't always know how my PIKs are constructed, I would write a function that splits the Id PIK at the commas, cleans each part with real_escape_string, and puts it back together with the values quoted. For example:
$Id = "'20001','trvl','2010:10:10 17:10:45'"
Of course, in this function that breaks apart and cleans the Id I could check whether each value is a number or not. If it is a number, don't quote it; if it is anything but a number, quote it.
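For comparison, a hedged sketch of the same lookup done with MySQL's server-side prepared statements, which sidesteps the quoting question entirely (the table name records is an assumption, and the date format is normalized):
PREPARE pick_row FROM
  'SELECT AreaCode, Category, RecordDtm
   FROM records
   WHERE AreaCode = ? AND Category = ? AND RecordDtm = ?';
SET @a = '20001', @c = 'trvl', @d = '2010-10-10 17:10:45';
EXECUTE pick_row USING @a, @c, @d;
DEALLOCATE PREPARE pick_row;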
The performance cost appears whenever MySQL needs to do a type conversion from whatever you give it to the datatype of the column. So with your query
SELECT col1,col2,col3 FROM table WHERE col1='3';
If col1 is not a string type, MySQL needs to convert '3' to that type. This type of query isn't really a big deal, as the performance overhead of that conversion is negligible.
However, consider trying to do the same thing when, say, joining two tables that have several million rows each. If the columns in the ON clause are not the same datatype, then MySQL will have to convert several million rows every single time you run your query, and that is where the performance overhead comes in.
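A hedged illustration (table and column types are assumptions): if a.code is a VARCHAR and b.code is an INT, every joined row needs a conversion, and the index on a.code typically can't be used:
SELECT *
FROM a
JOIN b ON a.code = b.code;  -- mismatched types force a per-row cast
Keeping both columns the same type avoids the per-row conversion entirely.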
Strings also have a different sort order from numbers.
Compare:
SELECT 312 < 41
(yields 0, because 312 numerically comes after 41)
to:
SELECT '312' < '41'
(yields 1, because '312' lexicographically comes before '41')
Depending on the way your query is built using quotes might give wrong results or none at all.
Numbers should be used as such, so never use quotes unless you have a special reason to do so.
In my opinion, there is no performance or size cost in the case you have mentioned. Even if there is, it is negligible and won't affect your application as such.
It gives the wrong impression about the data type for the column. As an outsider, I assume the column in question is CHAR/VARCHAR & choose operations accordingly.
Otherwise MySQL, like most other databases, will implicitly convert the value to the column's data type. There's no performance issue with this that I'm aware of, but there's a risk that supplying a value that requires explicit conversion (using CAST or CONVERT) will trigger an error.

MySQL Match Fulltext

I'm trying to do a fulltext search with MySQL, to match a string. The problem is that it's returning odd results in the first position.
For example, the string 'passat 2.0 tdi' :
AND MATCH (
records_veiculos.titulo, records_veiculos.descricao
)
AGAINST (
'passat 2.0 tdi' WITH QUERY EXPANSION
)
is returning this as the first result (the others are fine) :
Volkswagen Passat Variant 1.9 TDI- ANO 2003
which is incorrect, since there's no "2.0" in this example.
What could it be?
edit: Also, since this will probably be a large database (expecting up to 500,000 records), will this search method hold up by itself, or would it be better to install another search engine like Sphinx? And if it doesn't hold up, how do I show relevant results?
edit2: For the record, despite the question being marked as answered, the problem with the MySQL delimiters persists, so if anyone has a suggestion on how to escape delimiters, it would be appreciated and worth the 500 points at stake. The solution I found to increase the result set was to replace WITH QUERY EXPANSION with IN BOOLEAN MODE, using operators to force the engine to match the words I needed, like:
AND MATCH (
records_veiculos.titulo, records_veiculos.descricao
)
AGAINST (
'+passat +2.0 +tdi' IN BOOLEAN MODE
)
It didn't solve it entirely, but at least the relevance of the results has changed significantly.
From the MySQL documentation on Fulltext search:
"The FULLTEXT parser determines where words start and end by looking for certain delimiter characters; for example, “ ” (space), “,” (comma), and “.” (period)."
This means that the period is delimiting the 2 and the 0. So it's not looking for '2.0'; it's looking for '2' and '0', and not finding them. WITH QUERY EXPANSION is probably causing relevant related words to show up, thus obviating the need for '2' and '0' to appear as individual words in the ranked results. A minimum character length may also be being enforced.
By default I believe MySQL only indexes and matches words with 4 or more characters. You could also try escaping the period? MySQL might be ignoring it or otherwise treating it as a stop character.
What match rank does it return for that? Does the match have to contain all the "words"? My understanding was that it worked like Google and only needs to match some of the words.
Having said that, be mindful of the effect of adding WITH QUERY EXPANSION: it automatically runs a second search for "related" words, which may not be what you typed, but which the fulltext engine deems probably related.
Relevant Documentation: http://dev.mysql.com/doc/refman/5.1/en/fulltext-query-expansion.html
The "." is what's matching on 2003 in your query results.
If you're going to do searches on 3-character text strings, you should set ft_min_word_len=3 in your MySQL config and restart MySQL. Otherwise, a search for "tdi" will return results with "TDI-" but not with just "TDI", because rows with "TDI-" will be indexed but "TDI" alone will not.
After making that config change, you'll have to rebuild your index on that table. (Warning: your index might be significantly larger now.)
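A sketch of the rebuild step (the index name ft_titulo_descricao is an assumption; REPAIR TABLE applies to MyISAM tables):
REPAIR TABLE records_veiculos QUICK;  -- rebuilds FULLTEXT indexes in place
-- or drop and re-create the index explicitly:
ALTER TABLE records_veiculos
  DROP INDEX ft_titulo_descricao,
  ADD FULLTEXT INDEX ft_titulo_descricao (titulo, descricao);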