I am currently learning MySQL and am noticing a lot of different do's and don'ts.
Is there anywhere I can find the absolute list of best practices that you go by or learned from?
Thanks for your time.
Do use InnoDB; don't use MyISAM.
(OK, OK, unless you absolutely have to, often due to fulltext matching not being available in InnoDB. Even then you're often better off putting the canonical data in InnoDB and the fulltext index on a separate MyISAM searchbait table, which you can then process for stemming.)
Do use BINARY columns when you want rigorous string matching, otherwise you get a case-insensitive comparison by default. Do set the collation correctly for your character set (best: UTF-8) or case-insensitive comparisons will behave strangely.
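For example (a hypothetical users table; utf8mb4_bin is one of MySQL's binary collations):

```sql
-- Default collations are usually case-insensitive, so this also matches 'alice':
SELECT * FROM users WHERE name = 'Alice';

-- Forcing a binary collation gives an exact, case-sensitive comparison:
SELECT * FROM users WHERE name COLLATE utf8mb4_bin = 'Alice';
```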
Do use ANSI SQL mode if you want your code to be portable. ANSI_QUOTES allows you to use standard double-quoted "identifier" (table, column, etc.) names to avoid reserved words; MySQL's default way of saying this is backquotes but they're non-standard and won't work elsewhere. If you can't control settings like this, omit any identifier quoting and try to avoid reserved words (which is annoying, as across the different databases there are many).
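A quick sketch of the difference (the order table and date column are made up for illustration):

```sql
SET sql_mode = 'ANSI_QUOTES';

-- Standard double-quoted identifiers now work, even for reserved words:
SELECT "order"."date" FROM "order";

-- The MySQL-only spelling, which other databases will reject:
SELECT `order`.`date` FROM `order`;
```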
Do use your data access layer's MySQL string literal escaping or query parameterisation functions; don't try to create escaped literals yourself because the rules for them are a lot more complicated than you think and if you get it wrong you've got an SQL injection hole.
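On the server side, parameterisation is what PREPARE/EXECUTE does; the table and variable names below are illustrative:

```sql
PREPARE find_user FROM 'SELECT id FROM users WHERE name = ?';
SET @name = 'O''Reilly';       -- the value travels as a parameter,
EXECUTE find_user USING @name; -- so no escaping is needed in the query text
DEALLOCATE PREPARE find_user;
```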
Don't rely on MySQL's behaviour of returning a particular row when you select columns that don't have a functional dependency on the GROUP BY column(s). This is an error in other databases and can easily hide bugs that will only pop up when the internal storage in the database changes, causing a different row to be returned.
SELECT productid, MIN(cost)
FROM products
GROUP BY productcategory -- this doesn't do what you think
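A deterministic rewrite of the query above, as one possible sketch (note that ties on MIN(cost) will return more than one row per category):

```sql
SELECT p.productid, p.cost
FROM products p
JOIN (
    SELECT productcategory, MIN(cost) AS mincost
    FROM products
    GROUP BY productcategory
) m ON m.productcategory = p.productcategory
   AND p.cost = m.mincost;
```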
Well, there won't be an absolute list of dos and don'ts, as the goalposts keep moving. MySQL moved on in leaps and bounds between versions 4 and 5, and some fairly essential bug fixes seem to be around the corner (I'm thinking of the issue surrounding the use of COUNT(DISTINCT col1)).
Here are a couple of issues off the top of my head:
don't rely on views to be able to use indexes on the underlying tables
http://forums.mysql.com/read.php?100,22967,66618#msg-66618
The order of columns in indexes intended to be used by GROUP BY is important:
http://dev.mysql.com/doc/refman/5.1/en/group-by-optimization.html
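As a sketch (index and column names made up), an index whose leading column matches the GROUP BY lets MySQL read groups in index order instead of sorting:

```sql
CREATE INDEX idx_cat_cost ON products (productcategory, cost);

-- Can use the index (leading column matches the GROUP BY):
SELECT productcategory, MIN(cost) FROM products GROUP BY productcategory;

-- Cannot use it (cost is not the leading column):
SELECT cost, COUNT(*) FROM products GROUP BY cost;
```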
COUNT(DISTINCT) is slow:
http://www.delphifaq.com/faq/databases/mysql/f3095.shtml
although there might be a bug fix a-coming....
http://bugs.mysql.com/bug.php?id=17865
Here are some other questions from this site you might find useful:
Database optimization
Database design with MySQL
Finetuning tips
DON'T WRITE YOUR SQL IN ALL CAPS, EVEN THOUGH THE OFFICIAL REFERENCE DOES IT. I MEAN, OK, IT MAKES IT PRETTY OBVIOUS TO DIFFERENTIATE BETWEEN IDENTIFIERS AND KEYWORDS. NO, WAIT, THAT'S WHY WE HAVE SYNTAX HIGHLIGHTING.
Do use SQL_MODE "Traditional".
SET SQL_MODE='TRADITIONAL'
Or put it in your my.cnf (even better, because you can't forget it; but ensure it gets deployed on to ALL instances including dev, test etc).
If you don't do this, inserting invalid values into columns will succeed anyway. This is not usually a Good Thing, as it may mean that you lose data.
It's important that it's turned on in dev as well, so you'll spot those problems early.
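A minimal illustration of the difference (hypothetical table):

```sql
SET SQL_MODE = 'TRADITIONAL';

CREATE TABLE t (n TINYINT);  -- TINYINT holds -128..127
INSERT INTO t VALUES (999);  -- errors out in TRADITIONAL mode;
                             -- without it, silently stored as 127
```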
Oh, I need this list too... joking. No: the problem is that whatever works for a 1 MB database will never be good for a 1 GB database, and the same applies to a 1 GB database vs. a 1 TB database, and so on.
I've visited one interesting job interview recently. There I was asked a question about optimizing a query with a WHERE..IN clause containing a long list of scalars (thousands of values, that is). This question is NOT about subqueries in the IN clause, but about a simple list of scalars.
I answered right away, that this can be optimized using an INNER JOIN with another table (possibly temporary one), which will contain only those scalars. My answer was accepted and there was a note from the reviewer, that "no database engine currently can optimize long WHERE..IN conditions to be performant enough". I nodded.
But when I walked out, I started to have some doubts. The condition seemed rather trivial and widely used for modern RDBMS not to be able to optimize it. So, I started some digging.
PostgreSQL:
It seems that PostgreSQL parses scalar IN() constructions into a ScalarArrayOpExpr structure, whose values are sorted. This structure is later used during an index scan to locate matching rows. EXPLAIN ANALYZE for such queries shows only one loop; no joins are done. So I expect such a query to be even faster than an INNER JOIN. I tried some queries on my existing database and my tests supported that position, but I didn't control for test purity, and Postgres was running under Vagrant, so I might be wrong.
MSSQL Server:
MSSQL Server builds a hash structure from the list of constant expressions and then does a hash join with the source table. Even though no sorting seems to be done, that is a performance match, I think. I didn't do any tests since I don't have any experience with this RDBMS.
MySQL Server:
The 13th of these slides says that before 5.0 this problem did indeed occur in MySQL in some cases. But other than that, I didn't find any other problem related to bad IN() treatment. I didn't find any proof of the inverse either, unfortunately. If you did, please kick me.
SQLite:
The documentation page hints at some problems, but I tend to believe the things described there are really at a conceptual level. No other information was found.
So, I'm starting to think I misunderstood my interviewer or misused Google ;) Or maybe it's because we didn't set any conditions and our talk became a little vague (we didn't specify any concrete RDBMS or other constraints; it was just abstract talk).
It looks like the days when databases rewrote IN() as a set of OR statements (which, by the way, can sometimes cause problems with NULL values in lists) are long gone. Or not?
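For the record, the NULL problem comes from IN being three-valued; a quick sketch:

```sql
SELECT 1 IN (1, NULL);      -- 1: a definite match wins
SELECT 2 IN (1, NULL);      -- NULL: no match, and the NULL makes the answer unknown
SELECT 2 NOT IN (1, NULL);  -- NULL: so NOT IN with a NULL in the list matches nothing
```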
Of course, in cases where the list of scalars is longer than the database's maximum protocol packet allows, an INNER JOIN might be the only solution available.
I think in some cases query parsing time (if it was not prepared) alone can kill performance.
Also, databases may be unable to prepare an IN(?) query, which will lead to reparsing it again and again (and that may kill performance). Actually, I never tried it, but I think that even in such cases query parsing and planning is not huge compared to query execution.
But other than that I do not see other problems. Well, other than the problem of just HAVING this problem: if you have queries that contain thousands of IDs, something is wrong with your architecture.
Do you?
Your answer is only correct if you build an index (preferably a primary key index) on the list, unless the list is really small.
Any description of optimization is definitely database specific. However, MySQL is quite specific about how it optimizes in:
Returns 1 if expr is equal to any of the values in the IN list, else
returns 0. If all values are constants, they are evaluated according
to the type of expr and sorted. The search for the item then is done
using a binary search. This means IN is very quick if the IN value
list consists entirely of constants.
This would definitely be a case where using IN would be faster than using another table -- and probably faster than another table using a primary key index.
I think that SQL Server replaces the IN with a list of ORs. These would then be implemented as sequential comparisons. Note that sequential comparisons can be faster than a binary search, if some elements are much more common than others and those appear first in the list.
I think it is bad application design. The values used with the IN operator are most probably not hardcoded but dynamic. In such cases we should always use prepared statements, the only reliable mechanism to prevent SQL injection.
In each case this will result in dynamically formatting the prepared statement (as the number of placeholders is dynamic too), and it will also result in excessive hard parsing (as many unique queries as we have numbers of IN values: IN (?), IN (?,?), ...).
I would either load these values into a table and use a join as you mentioned (unless loading is too much overhead), or use an Oracle pipelined function, IN foo(params), where the params argument can be a complex structure (an array) coming from memory (PL/SQL, Java, etc.).
If the number of values is large, I would consider using EXISTS (SELECT 1 FROM mytable m WHERE m.key = x.key) or EXISTS (SELECT 1 FROM foo(params)) instead of IN. In such cases EXISTS provides better performance than IN.
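The load-and-join variant, sketched with made-up names; the PRIMARY KEY provides the index the join needs:

```sql
CREATE TEMPORARY TABLE wanted_ids (id INT PRIMARY KEY);
INSERT INTO wanted_ids VALUES (1), (2), (3);  -- ... thousands more

SELECT t.*
FROM mytable t
JOIN wanted_ids w ON w.id = t.id;
```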
I was recently told that it is not recommended to use the LIKE keyword in SQL. Is this true? If so, why? And if it is true, are there any alternatives to it?
The reason is primarily performance. However, on the other side of the argument, LIKE is standard SQL and should work in all databases. Because LIKE has to parse the pattern string, it is a bit less efficient than looking for a substring in a longer string (using charindex or instr or your database's favorite function). However, processors are so fast that this rarely makes a difference now, except perhaps for the largest queries.
The one caution with LIKE is in a join statement (and this is true of the alternatives as well). In general, database engines will not use an index for a LIKE in a join. So, if you can express the join clause in a more index-friendly way, then you might see a substantial increase in performance.
By the way, I'm something of an old-timer with the SQL language, and tend to be a person who avoids using it personally. However, this is not a habit that should be passed on, because there is little basis anymore for avoiding it.
Specifically in MySQL (and since this has a MySQL tag, I guess that's what you are using), when using LIKE on a column which has an index, you should be careful not to put a % in front of the string you are matching unless you have to, because a leading wildcard prevents the index from being used for an efficient lookup. Otherwise there is no problem in using LIKE. E.g.
BAD:
col_with_index LIKE '%someText'
GOOD:
col_with_index LIKE 'someText%'
There are no valid reasons not to use LIKE!
The only exception is when you can use the equality operator (=) to achieve the same result (my_column LIKE 'XYZ' with no wildcards).
If you do need LIKE, any alternative that achieves the same result will cause the same (or even worse) performance problems!
So in those cases, just think about whether LIKE is necessary, and then use it without hesitation.
Should all table names in MySQL be enclosed in backticks (`) to prevent collisions with reserved keywords? The reason I ask is because their use makes the SQL less portable as not all databases allow backticks.
So would avoiding table and column names containing keywords be a better course of action? If so what can be done to mitigate the risk of MySQL adding a new keyword in the next version that might collide with your schema.
Is there a best practice regarding this?
The most portable way (between the systems) is to use double quotes, however, it would require enabling ANSI_QUOTES which is off by default on most installations.
So while keeping arguably useful compatibility between different engines (and the incompatibility is not limited to backticks; there are zillions of other differences between MySQL and other systems), you are killing the compatibility between different setups of MySQL, which is by far more important.
Avoiding the reserved keywords is always the best solution.
This is a matter of opinion. But portable code outweighs their use. As you noted, backticks can allow you to use reserved words, which is never a good thing. That, for me, already proves they do more harm than good.
So would avoiding table and column names containing keywords be a better course of action?
In short, yes. And there isn't much you can do with respect to future keywords except avoiding obvious candidates, e.g. with, window, over, etc.
One common practice is to prefix all your table names with a few letters and an underscore. It prevents collisions if you need to house two different applications in the same database, and you'll likely never run into reserved words.
Not escaping, and instead manually avoiding collisions with reserved keywords, can be quite a tedious endeavor, as reserved names vary greatly across databases: e.g. you can use User in MySQL but not in MSSQL.
It also boils down to what the SQL queries are aimed at: are they table-creation queries? Initialization queries? "Regular" queries? This is important, as there are other factors that will make the SQL database-dependent (e.g. the use of AUTO_INCREMENT when creating a table).
It also depends if it is handwritten SQL files that you load and run directly into the database or programmatically constructed/filled ones. In the latter case I would use available facilities or write a micro driver that encapsulates database dialect specificities. We're not talking ORM here, just something that will help encapsulate this problem away.
To me the answer "try to avoid them" is a path of least resistance solution that might or might not bite you back at some point depending on the role of your queries. Maybe your question is a bit too broad?
I don't worry much about portability issues that can be handled with automation.
Standard SQL doesn't have any use for backticks. So can't backticks in DDL simply be replaced globally with SQL-standard double quotes? If so, that's just a one-liner in a make file.
If you use backticks, your code won't stop working when MySQL introduces a new reserved keyword. Imagine all the websites you created all breaking because a MySQL update introduced a new keyword that you had previously used as a table name!
Your SQL may be slightly less portable, but really... replacing a backtick with a double quote is a matter of a single search/replace in a file (unless you are also using the PHP backtick execute operator in the same file). You can't do it in reverse, replacing double quotes with backticks, as other strings may be changed too (turning them all into the PHP "execute" operator, ugh!).
Or if you want the code to be compatible with both you can do the replace inside a few functions that process/prepare the SQL:
function myExecute($sql, $params) {
    // NOT_MYSQL: a constant set elsewhere when targeting a non-MySQL engine
    if (NOT_MYSQL) {
        $sql = str_replace('`', '"', $sql); // backticks -> ANSI double quotes
    }
    return execute($sql, $params); // your driver's execute function
}
What you should NEVER do is use double quotes to enclose string values in SQL. It is allowed by MySQL, but very bad for portability: you may have to replace all your strings manually.
<?php
// Don't. Use ' for strings instead
$sql='SELECT "col1" FROM "tab" WHERE "col2"="very bad"';
// Better
$sql="SELECT `col1` FROM `tab` WHERE `col2`='not that bad'";
// May crash later if tab becomes a keyword (or col1, or col2,..)
$sql="SELECT col1 FROM tab WHERE col2='not that bad'";
// Standard. But harder to replace ""s to ``s later, and annoying \'s in code
$sql='SELECT "col1" FROM "tab" WHERE "col2"=\'not that bad\'';
// Safe. Annoying.
$sql="SELECT my_unique_col1 FROM my_unique_tab WHERE my_unique_col2='not that bad'";
?>
As you can see in the last example, you can name your tables and fields in a way that is probably unique (add some prefix to everything, in this case "my_unique_"); it is boring but mostly safe and portable.
I have an authors table in my database that lists an author's whole name, e.g. "Charles Dickinson". I would like to sort of "decatenate" at the space, so that I can get 'Charles" and "Dickinson" separately. I know there is the explode function in PHP, but is there anything similar for a straight mysql query? Thanks.
No, don't do that. Seriously. That is a performance killer. If you ever find yourself having to process a sub-column (part of a column) in some way, your DB design is flawed. It may well work okay on a home address book application or any of myriad other small databases but it will not be scalable.
Store the components of the name in separate columns. It's almost invariably a lot faster to join columns together with a simple concatenation (when you need the full name) than it is to split them apart with a character search.
If, for some reason you cannot split the field, at least put in the extra columns and use an insert/update trigger to populate them. While not 3NF, this will guarantee that the data is still consistent and will massively speed up your queries. You could also ensure that the extra columns are lower-cased (and indexed if you're searching on them) at the same time so as to not have to fiddle around with case issues.
This is related: MySQL Split String
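If you do need an ad-hoc split despite the advice above, MySQL's SUBSTRING_INDEX can pull the pieces apart (this only behaves for simple two-part names):

```sql
SELECT SUBSTRING_INDEX(name, ' ', 1)  AS first_name,
       SUBSTRING_INDEX(name, ' ', -1) AS last_name
FROM authors;
-- 'Charles Dickinson' -> 'Charles' / 'Dickinson'
```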
I would like to know whether the Sphinx engine works with delimiters (like the commas and periods in normal MySQL). My question comes from the urge not to avoid them entirely, but to escape them, or at least keep them from causing conflicts when performing MATCH operations with FULLTEXT searches, since I have problems dealing with them in MySQL by default and would prefer not to be forced to replace those delimiters with other characters to get a good set of results.
Sorry if I'm saying something stupid, but I don't have experience with Sphinx or other complementary (?) search engines.
To give you an example, if I perform a search with
"Passat 2.0 TDI"
MySQL by default would identify the period in this case as a delimiter and since the "2" and "0" are too short to be considered words by default, the results would be a bit messed up.
Is it easy to handle with Sphinx (or other search engine)? I'm open to suggestions.
This is for a large project, with probably more than 500,000 records (not trivial at all).
Cheers!
You can effectively control which characters are delimiters by specifying the charset table of a specific sphinx index.
If you exclude a character from your charset table, it effectively acts as a delimiter. If you include it in your charset table (even a space, as U+0020), it will no longer act as a delimiter and will be part of your token strings.
Each index (which uses one or more sphinx data sources) can have a different charset table for flexibility.
NB: If you want single-character words, you can lower min_word_len on each sphinx index.
This is probably the best section of the documentation to read. As Sphinx is primarily a fulltext engine, it is highly tunable in how it handles phrases and how you pass them in.
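A sketch of the relevant sphinx.conf fragment (index name illustrative; source, path, etc. omitted):

```
index products
{
    # Keep digits and letters, fold case, and treat '.' (U+002E) as a word
    # character so "2.0" survives as one token:
    charset_table = 0..9, a..z, _, A..Z->a..z, U+002E

    # Allow one-character words if you want "2" and "0" indexed on their own:
    min_word_len = 1
}
```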