MySQL REGEXP returns boolean rather than value

According to the docs, selecting a REGEXP will always produce a boolean (match or non-match), but I'm trying to get the matched result itself. Meaning, if I'm doing...
select file_id REGEXP '^\d{10}' from my_table;
What I want back is not false or true but false or the actual 10 digits that start file_id.
Am I missing something, or is this really how MySQL has implemented regexp?!
I realize that in my case, I can use SUBSTR, but now I'm just curious why they would have deviated from the prescribed norm of how regexp matching works everywhere else.

In answer to your question, "is this really how MySQL has implemented regexp?" the answer is yes. It simply returns a boolean on success or failure to match.
In answer to your question, "why they would have deviated from the prescribed norm", the answer is that boolean returns are more useful in queries, since you are more often testing for the presence of something than extracting something based on a pattern. Extracting things is more often done in procedural languages, not relational databases.
To do what you want it to do, you might want to write a stored procedure that does the necessary string manipulation.
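For completeness: the situation has since changed. MySQL 8.0 added REGEXP_SUBSTR(), which returns the matched text rather than a boolean. A minimal sketch against the question's table (note that the 8.0 regex engine accepts character classes like [0-9]; the older engine did not support \d at all):

```sql
-- MySQL 8.0+: extract the match instead of just testing for it.
-- Returns the leading 10 digits, or NULL when there is no match.
SELECT REGEXP_SUBSTR(file_id, '^[0-9]{10}') FROM my_table;
```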

Related

How to return substring positions in LIKE query

I retrieve data from a MySQL database using a simple SELECT FROM WHERE LIKE case-insensitive query, where I escape any % or _ in the LIKE clause, so really the user can only perform a basic text search and cannot mess with the pattern syntax, because I then surround the input myself with % in the LIKE clause.
For every row returned by this query, I have to search again using a JS script in order to find all the indexes of the substring in the original string. I dislike this method because it uses a different pattern-matching implementation than the LIKE query, so I can't guarantee that the two behave the same.
I found the MySQL functions POSITION and LOCATE that can achieve this, but they return only the first index if the substring was found, or 0 if it was not. Yes, you can set the index to search from, and by repeatedly passing the previously returned index as the starting index until the returned index is 0, you can find all indexes of the substring, but that means a lot of additional queries and it might end up slowing down my application a lot.
So I'm now wondering: is there a way to have the LIKE query return substring positions directly? Maybe there is and I just didn't find it because I lack MySQL vocabulary yet (I'm a noob).
Simple answer: No.
Longer answer: MySQL has no syntax or mechanism to return an array of anything -- from either a SELECT or even a Stored Procedure.
Maybe answer: You could write a Stored Procedure that loops through one result, finding the positions and packing them into a comma-separated list. But I cringe at how messy that code would be. I would quickly decide to write JS code, as you have already done.
Moral of the story: SQL is not a full language. It's great at storing and efficiently retrieving large sets of rows, but lousy at string manipulation or arrays (other than "rows").
Commalist
If you are actually searching a simple list of things separated by commas, then FIND_IN_SET() and SUBSTRING_INDEX() in MySQL closely match what JS does with its split-on-comma method on strings.
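To illustrate how closely they map (values made up):

```sql
-- Position of an item in a comma-separated list (1-based, 0 if absent):
SELECT FIND_IN_SET('b', 'a,b,c');          -- 2
-- Everything before the 2nd comma, like re-joining the first two split parts:
SELECT SUBSTRING_INDEX('a,b,c', ',', 2);   -- 'a,b'
```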

Optimization: WHERE x IN (1, 2 .., 100.000) vs INNER JOIN tmp_table USING(x)?

I've visited one interesting job interview recently. There I was asked a question about optimizing a query with a WHERE..IN clause containing long list of scalars (thousands of values, that is). This question is NOT about subqueries in the IN clause, but about simple list of scalars.
I answered right away that this can be optimized using an INNER JOIN with another table (possibly a temporary one), which would contain only those scalars. My answer was accepted, and the reviewer added a note that "no database engine currently can optimize long WHERE..IN conditions to be performant enough". I nodded.
But when I walked out, I started to have some doubts. The condition seemed rather trivial and widely used for modern RDBMS not to be able to optimize it. So, I started some digging.
PostgreSQL:
It seems that PostgreSQL parses scalar IN() constructions into a ScalarArrayOpExpr structure, which is sorted. This structure is later used during an index scan to locate matching rows. EXPLAIN ANALYZE for such queries shows only one loop; no joins are done. So I expect such a query to be even faster than an INNER JOIN. I tried some queries on my existing database and my tests supported that position, but I didn't take much care with test purity, and Postgres was running under Vagrant, so I might be wrong.
MSSQL Server:
MSSQL Server builds a hash structure from the list of constant expressions and then does a hash join with the source table. Even though no sorting seems to be done, I think that is still competitive on performance. I didn't do any tests since I don't have any experience with this RDBMS.
MySQL Server:
The 13th of these slides says that before 5.0 this problem indeed existed in MySQL in some cases. But other than that, I didn't find any other problems related to bad IN () treatment. Unfortunately, I didn't find any proof to the contrary either. If you did, please kick me.
SQLite:
The documentation page hints at some problems, but I tend to believe the issues described there are really at the conceptual level. No other information was found.
So, I'm starting to think I misunderstood my interviewer or misused Google ;) Or maybe it's because we didn't set any conditions and our talk became a little vague (we didn't specify any concrete RDBMS or other constraints; it was just abstract talk).
It looks like the days when databases rewrote IN() as a set of OR conditions (which, by the way, can sometimes cause problems with NULL values in lists) are long gone. Or not?
Of course, in cases where the list of scalars is longer than the database protocol packet allows, an INNER JOIN might be the only solution available.
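A minimal sketch of that fallback (table and column names are made up):

```sql
-- Load the scalars once into an indexed temporary table...
CREATE TEMPORARY TABLE tmp_ids (x INT PRIMARY KEY);
INSERT INTO tmp_ids (x) VALUES (1), (2), (3);  -- ...and so on

-- ...then join instead of sending a huge IN list:
SELECT t.*
FROM my_table AS t
INNER JOIN tmp_ids USING (x);
```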
I think in some cases query parsing time (if it was not prepared) alone can kill performance.
Also, databases could be unable to prepare an IN(?) query with a variable number of values, which will lead to reparsing it again and again (and that may kill performance). Actually, I never tried it, but I think that even in such cases query parsing and planning time is not huge compared to query execution.
But other than that I do not see other problems. Well, other than the problem of just HAVING this problem. If you have queries, that contain thousands of IDs inside, something is wrong with your architecture.
Do you?
Your answer is only correct if you build an index (preferably a primary key index) on the list, unless the list is really small.
Any description of optimization is definitely database specific. However, the MySQL manual is quite specific about how it optimizes IN:
Returns 1 if expr is equal to any of the values in the IN list, else returns 0. If all values are constants, they are evaluated according to the type of expr and sorted. The search for the item then is done using a binary search. This means IN is very quick if the IN value list consists entirely of constants.
This would definitely be a case where using IN would be faster than using another table -- and probably faster than another table using a primary key index.
I think that SQL Server replaces the IN with a list of ORs. These would then be implemented as sequential comparisons. Note that sequential comparisons can be faster than a binary search, if some elements are much more common than others and those appear first in the list.
I think it is bad application design. The values used with the IN operator are most probably not hardcoded but dynamic. In such a case we should always use prepared statements, the only reliable mechanism to prevent SQL injection.
In each case it will result in dynamically formatting the prepared statement (as the number of placeholders is dynamic too), and it will also result in excessive hard parsing (as many unique queries as we have distinct numbers of IN values: IN (?), IN (?,?), ...).
I would either load these values into a table and use a join, as you mentioned (unless the loading is too much overhead), or use an Oracle pipelined function, IN foo(params), where the params argument can be a complex structure (an array) coming from memory (PL/SQL, Java, etc.).
If the number of values is larger, I would consider using EXISTS (select 1 from mytable m where m.key = x.key) or EXISTS (select x from foo(params)) instead of IN. In such cases EXISTS provides better performance than IN.
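As a sketch of that last rewrite (mytable, other_table, and the id column are placeholder names):

```sql
-- IN form:
SELECT x.*
FROM mytable x
WHERE x.id IN (SELECT m.id FROM other_table m);

-- EXISTS form; the subquery can stop at the first matching row:
SELECT x.*
FROM mytable x
WHERE EXISTS (SELECT 1 FROM other_table m WHERE m.id = x.id);
```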

Is it bad to use enum('y','n') instead of a boolean field in a MySQL table?

So a few years ago I saw the DB schema of a system developed by a 3rd party and noticed they used enum('y','n') instead of a boolean (tinyint) field. I don't know why, but I loved it so much, I found it made things easier to read (totally subjective, I know), that I adopted it and have been using it ever since. I suppose I could swap it for "true" and "false" but, what can I say, I just liked it.
Now that being said, are there any setbacks to doing things this way -- aside from maybe slightly annoying a programmer who'd come in late in the game?
Yes, it is bad. You lose intuitive boolean logic with it (SELECT * FROM user WHERE NOT banned becomes SELECT * FROM user WHERE banned = 'n'), and you receive strings instead of booleans on your application side, so your boolean conditions there become cumbersome as well. Other people who work with your schema will get bitten by seeing flag-like column names and attempting to use boolean logic on them.
As explained in the manual:
If you insert an invalid value into an ENUM (that is, a string not present in the list of permitted values), the empty string is inserted instead as a special error value. This string can be distinguished from a “normal” empty string by the fact that this string has the numeric value 0. See Section 11.4.4, “Index Values for Enumeration Literals” for details about the numeric indexes for the enumeration values.
If strict SQL mode is enabled, attempts to insert invalid ENUM values result in an error.
In this respect, an ENUM results in different behaviour from a BOOLEAN type; otherwise I'm inclined to agree with @lanzz's answer that it makes integration with one's application that little bit less direct.
One factor to consider is whether the people who wrote the original schema limit it to MySQL or not. If it is only intended to run on MySQL, then adapting to MySQL makes sense. If the same schema is intended to be usable with other DBMS, then a more generic schema design that works in all the relevant DBMS may be better for the people making the design.
With that said, the enum is moderately MySQL specific, but something equivalent to enum can easily be created in other DBMS:
CREATE TABLE ...
(
...
FlagColumn CHAR(1) NOT NULL CHECK(FlagColumn IN ('y', 'n')),
...
);
The way that different DBMS handle BOOLEAN is not as uniform as you'd like it to be, SQL Standard notwithstanding (and the reason is, as ever, history; the less conformant systems had a variation on the theme of BOOLEAN before the standard did, and changing their implementation breaks the code of their existing customers).
So, I would not automatically condemn the use of enum over boolean, but it is better to use boolean for boolean flags.
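For reference, the boolean version looks like this in MySQL, where BOOLEAN is simply an alias for TINYINT(1) (table and column names made up):

```sql
CREATE TABLE user (
  id INT PRIMARY KEY,
  banned BOOLEAN NOT NULL DEFAULT FALSE   -- stored as TINYINT(1): 0 or 1
);

-- Intuitive boolean logic then works directly:
SELECT * FROM user WHERE NOT banned;
```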

Why is '>' a valid comparison with VARCHAR data in SQL?

So I'm pretty brand new to SQL (and scripting/coding in general). This is from one of the examples in the book, but they, unfortunately, decided that there wouldn't be any questions about this query and neglected to explain the '>' near the end of it.
Here is the query in question:
SELECT *
FROM easy_drinks
WHERE main > 'soda';
Here is a pastebin of a few queries, hopefully giving the perspective needed: http://pastebin.com/xfJQsBvU
Paste of DESC easy_drinks: http://pastebin.com/LZZPhk6Z
I'm just confused as to how the '>' near the end of the query works, since main is stored as a VARCHAR and 'soda' is definitely not an integer that could be compared with another integer. Yet, as you can see in the first pastebin, the query completes successfully. Why doesn't MySQL return an error, and what is the pattern behind the different queries using '>' and '<'?
It's probably doing a lexicographical ordering.
It's similar to the way you or I would order words in a dictionary. However, note that it may not handle numbers the way you'd expect.
You will often find the > relation used on strings, mostly for alphabetical ordering, but of course it is implementation-dependent and can differ, especially between languages/encodings or in the handling of upper- and lowercase.
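A couple of comparisons that make the lexicographic behaviour concrete:

```sql
SELECT 'tonic' > 'soda';  -- 1: 't' sorts after 's'
SELECT '10' > '9';        -- 0: compared as strings, '1' sorts before '9'
```

The second line is the classic surprise: numbers stored as strings do not sort numerically.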

MySQL - AND condition

Let's say I have a query like this:
SELECT bla WHERE foo LIKE '%bar%' AND boo = 'bar' AND whatvr IN ('foo', 'bar')...
I was wondering if MySQL continues to check all conditions when retrieving results.
For eg. if foo is not LIKE %bar%, will it continue to check if boo = 'bar', and so on ?
Would it be any faster if I put conditions that are less likely to be true at the end?
I'm sorry if this seems to be stupid question, I'm a complete noob when it comes to SQL :)
I don't think there are any guarantees about whether or not multiple conditions will be short-circuited, but...
In general, you should treat the query optimiser as a black box and assume -- unless you have evidence to the contrary -- that it will do its job properly. The optimiser's job is to ensure that the requested data is retrieved as efficiently as possible. If the most efficient plan involves short-circuiting then it'll do it; if it doesn't then it won't.
(Of course, query optimisers aren't perfect. If you have evidence that a query isn't being executed optimally then it's often worth re-ordering and/or re-stating the query to see if anything changes.)
What you're looking for is documentation on MySQL's short-circuit evaluation. I have, however, not been able to find anything better than other people who also could not find the documentation, but who claim to have tested it and found it to be true, i.e. that MySQL does short-circuit.
Would it be any faster if I put conditions that are less likely to be true at the end?
No, the optimizer will try and optimize (!) the order of processing. So, as for the order of tests, you should not assume anything.
I would not count on that: see Where Optimisations. That link explains that other criteria take precedence over the written order.
You can't rely on MySQL evaluating conditions from left to right (as opposed to any programming language). This is because the "WHERE clause optimizer" looks for columns that are indexed and will look for this subset first.
For query optimization see the chapter Optimizing SELECT Statements in the MySQL reference manual.
If it does short-circuit when the first condition fails (which is most likely), it would be best to put the conditions that are most likely to fail first!
Let's say we have 3 conditions, and all must be true (separated by "AND").
slow case:
1. never fail. All rows are looked through and success.
2. sometimes fail. All rows are looked through and still success.
3. often fail. All rows are looked through and this time we fail.
Result: It took a while, but can't find a match.
fast case:
1. often fail. All rows are looked through and matching fail.
2. sometimes fail. NOT looked through, because searching ended due to short-circuit.
3. never fail. NOT looked through, because searching ended due to short-circuit.
Result: Quickly figured, no match.
Correct me if I'm wrong.
I can imagine that all conditions are checked for each row looked at, which would make this matter a lot less.
If your fields are ordered in the same order, as your conditions, you could maybe measure the difference.