Erratic behaviour of a MySQL query

I have the following query :
SELECT days.from, days.to, days.nombre, days.totalDays, days.bloque,
days.comentario, days.local, admin.eMail, admin.passcode, days.id,
admin.username
FROM days,admin
WHERE days.id='9' AND days.nombre=admin.username
The problem is that the query sometimes works but sometimes doesn't, and sometimes it works only with certain IDs. Is there any other way to formulate the query?

You are currently using implicit joins. Explicit joins are easier to read and understand, and tend to make for much more consistent queries.
You could rewrite your query using JOINs. So, instead of:
SELECT days.from, days.to, days.nombre, days.totalDays, days.bloque,
days.comentario , days.local, admin.eMail, admin.passcode,
days.id, admin.username
FROM days,admin
WHERE days.id='9'
AND days.nombre=admin.username
You can use:
SELECT days.from, days.to, days.nombre, days.totalDays, days.bloque,
days.comentario, days.local, admin.eMail, admin.passcode,
days.id, admin.username
FROM days
INNER JOIN admin ON days.nombre=admin.username
WHERE days.id='9'
You can already see how much easier it is to understand what is happening here. While this shouldn't in and of itself fix your query, it is far easier to read and thus to debug.
If you find that certain cases are not working, the best way to figure out why is to remove some restrictions and see if it then works. In this instance, make sure that the usernames that are not showing up have days.id equal to 9. Another potential issue when using a natural key is extra whitespace: check for this in the cases that fail, as the join condition days.nombre=admin.username may not be matching.
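A couple of quick diagnostic queries can narrow this down (a sketch, reusing the table and column names from the question):

-- Rows with id 9 whose nombre has no matching admin.username:
SELECT days.id, days.nombre
FROM days
LEFT JOIN admin ON days.nombre = admin.username
WHERE days.id = '9' AND admin.username IS NULL;

-- Values carrying stray leading or trailing whitespace:
SELECT id, nombre FROM days WHERE nombre <> TRIM(nombre);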
Your other option, if whitespace is in fact causing you issues, is to do away with your natural keys and implement surrogate keys. A surrogate key is a standard, unique key code, such as an int that increments over time. Rather than have days.nombre as your foreign key, you would have days.admin_id as your foreign key.
As a rule, while natural keys have many pros and the debate rages on, it is generally accepted that natural keys only work if the keys are consistent and unique.
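If you go that route, the migration might look roughly like this (a sketch; the id and admin_id columns are assumed names, and it assumes admin has no auto-increment column yet):

ALTER TABLE admin ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT, ADD UNIQUE KEY (id);
ALTER TABLE days ADD COLUMN admin_id INT UNSIGNED;
UPDATE days JOIN admin ON days.nombre = admin.username SET days.admin_id = admin.id;
ALTER TABLE days ADD CONSTRAINT fk_days_admin FOREIGN KEY (admin_id) REFERENCES admin (id);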

Just guessing, but here's something that caused a problem for me recently: check your table and column definitions to make sure the character sets are consistent. It looks like you have a mixture of English and Spanish, so perhaps some non-ASCII characters like ñ are not matching as expected.
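One quick way to check (a sketch against information_schema; substitute your schema name for 'your_db'):

SELECT table_name, column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = 'your_db'
  AND column_name IN ('nombre', 'username');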

MySQL self join performance: fact or just bad indexing?

As an example: I have a database to detect visitors (bots, etc.), and since not every visitor has the same set of 'credentials', I made a 'dynamic' table, like so: see fiddle: http://sqlfiddle.com/#!9/ca4c8/1 (simplified version).
This returns the profile ID that I use to gather info about each profile (in another DB). Depending on the profile type, I query the table with a different name clause (name='something'), e.g. hostname, ipAddr, userAgent, HumanId, etc.
I'm not an expert in SQL, but I'm familiar with indexes, constraints, primary, unique, and foreign keys, etc. From what I saw in these search results:
Mysql Self-Join Performance
How to tune self-join table in mysql like this?
Optimize MySQL self join query
JOIN Performance Issue MySQL
MySQL JOIN performance issue
Most of them have comments about bad self-join performance, but the answers tend to attribute it to missing indexes.
So the final question is: does self-joining a table make it more prone to bad performance, assuming that everything is indexed properly?
On a side note, here is more information about the table; it might be irrelevant to the question, but it gives context for my particular situation:
The flag column is used to mark records for deletion, as the user I connect with from PHP doesn't have DELETE permission on this database. Sorry, security is more important than performance.
I added the 'type' column to go with the info I get from the user agent (i.e. if anything is, or at least seems to be, a bot, we will only search for type 5000).
Column 'name' is unfortunately a varchar, indexed in the primary key (along with profile and type).
I tried to use INTs and filtering (WHERE) in the SELECT query as much as possible to reduce any eventual loss of performance (if that even matters).
I'm willing to study and tweak the design if needed, unless someone with a strong background in MySQL tells me it's really not a good thing to do.
This is a big project I have in development, so I cannot test it with millions of records for now, but I wonder whether performance will be an issue as this grows. Any input, links, references, documentation, or test procedures (maybe in comments) will be appreciated.
A self-join is no different from joining two different tables. The optimizer will pick one 'table', usually based on the WHERE clause, then do a Nested Loop Join into the other. In your case, you have implied, via LEFT, that it should work only one way. (The optimizer will ignore that if it sees no need for it.)
Your keys are fine for that Fiddle.
The real problem is "Entity-Attribute-Value", which is a messy way to lay out data in tables. Your query seems to be saying "find a (LIMIT 1) profile (entity) that has a certain pair of attributes (name = Googlebot AND addr = ...)".
It would be so much easier, and faster, to have two columns (name and addr) and a "composite" INDEX(name, addr).
I recommend doing that for the common "attributes", then putting the rest into a single column holding a JSON string. See here.
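To make the contrast concrete, the two shapes look roughly like this (a sketch; the table and column names are stand-ins for the fiddle's schema):

-- EAV: matching a pair of attributes requires a self-join:
SELECT a.profile
FROM profile_attrs AS a
JOIN profile_attrs AS b ON b.profile = a.profile
WHERE a.name = 'name' AND a.value = 'Googlebot'
  AND b.name = 'addr' AND b.value = '66.249.66.1'
LIMIT 1;

-- Flattened columns: one composite index does the work:
-- CREATE INDEX idx_name_addr ON profiles (name, addr);
SELECT profile FROM profiles
WHERE name = 'Googlebot' AND addr = '66.249.66.1'
LIMIT 1;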

Optimization: WHERE x IN (1, 2 .., 100.000) vs INNER JOIN tmp_table USING(x)?

I was at an interesting job interview recently, where I was asked about optimizing a query with a WHERE..IN clause containing a long list of scalars (thousands of values, that is). This question is NOT about subqueries in the IN clause, but about a simple list of scalars.
I answered right away that this can be optimized using an INNER JOIN with another table (possibly a temporary one) which contains only those scalars. My answer was accepted, and the reviewer noted that "no database engine currently can optimize long WHERE..IN conditions to be performant enough". I nodded.
But when I walked out, I started to have some doubts. The construct seemed too trivial and too widely used for a modern RDBMS not to be able to optimize it. So, I started digging.
PostgreSQL:
It seems that PostgreSQL parses scalar IN() constructions into a ScalarArrayOpExpr structure, which is sorted. This structure is later used during an index scan to locate matching rows. EXPLAIN ANALYZE for such queries shows only one loop; no joins are done. So I expect such a query to be even faster than an INNER JOIN. I tried some queries on my existing database and my tests supported that position, but I didn't care much about test purity, and Postgres was running under Vagrant, so I might be wrong.
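That plan shape is easy to check (a sketch against a hypothetical table t with an indexed id column):

EXPLAIN ANALYZE SELECT * FROM t WHERE id IN (1, 2, 3);
-- A plan like the following shows a single index scan and no join:
--   Index Scan using t_pkey on t
--     Index Cond: (id = ANY ('{1,2,3}'::integer[]))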
MSSQL Server:
MSSQL Server builds a hash structure from the list of constant expressions and then does a hash join with the source table. Even though no sorting seems to be done, that should be a performance match, I think. I didn't do any tests, since I don't have any experience with this RDBMS.
MySQL Server:
The 13th of these slides says that before 5.0 this problem did indeed occur in MySQL in some cases. Other than that, I didn't find any other problem related to bad IN () treatment, and I found no proof of the opposite either, unfortunately. If you did, please kick me.
SQLite:
The documentation page hints at some problems, but I tend to believe the things described there are really at the conceptual level. No other information was found.
So, I'm starting to think I misunderstood my interviewer or misused Google ;) Or maybe it's because we didn't set any conditions and our talk became a little vague (we didn't specify any concrete RDBMS or other conditions; it was just abstract talk).
It looks like the days when databases rewrote IN() as a set of OR conditions (which, by the way, can sometimes cause problems with NULL values in lists) are long gone. Or are they?
Of course, in cases where the list of scalars is longer than the allowed database protocol packet, an INNER JOIN might be the only solution available.
I think that in some cases query parsing time (if the query was not prepared) alone can kill performance.
Databases may also be unable to prepare an IN(?) query, which leads to reparsing it again and again (and that may kill performance). Actually, I never tried, but I think that even in such cases parsing and planning time is not huge compared to query execution time.
But other than that, I do not see any other problems. Well, other than the problem of just HAVING this problem: if you have queries that contain thousands of IDs, something is wrong with your architecture.
Do you?
Your answer is only correct if you build an index (preferably a primary key index) on the list, unless the list is really small.
Any description of optimization is definitely database-specific. However, MySQL is quite specific about how it optimizes IN:
Returns 1 if expr is equal to any of the values in the IN list, else returns 0. If all values are constants, they are evaluated according to the type of expr and sorted. The search for the item then is done using a binary search. This means IN is very quick if the IN value list consists entirely of constants.
This would definitely be a case where using IN would be faster than using another table -- and probably faster than another table using a primary key index.
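For concreteness, the two forms being compared look like this (a sketch, with a hypothetical table t and column x):

-- The IN-list form, with inline constants:
SELECT * FROM t WHERE x IN (1, 2, 3 /* ... thousands more ... */);

-- The temporary-table form:
CREATE TEMPORARY TABLE tmp_x (x INT PRIMARY KEY);
INSERT INTO tmp_x VALUES (1), (2), (3) /* ... */;
SELECT t.* FROM t INNER JOIN tmp_x USING (x);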
I think that SQL Server replaces the IN with a list of ORs. These would then be implemented as sequential comparisons. Note that sequential comparisons can be faster than a binary search, if some elements are much more common than others and those appear first in the list.
I think it is bad application design. The values used with the IN operator are most probably not hardcoded but dynamic. In that case we should always use prepared statements, the only reliable mechanism to prevent SQL injection.
Here that results in dynamically formatting the prepared statement (as the number of placeholders is dynamic too), and it also results in excessive hard parsing (as many unique queries as there are distinct numbers of IN values: IN (?), IN (?,?), ...).
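In MySQL terms, every distinct list length is a distinct statement text that must be hard-parsed separately (a sketch; table t and the statement names are hypothetical):

PREPARE s1 FROM 'SELECT * FROM t WHERE x IN (?)';
PREPARE s2 FROM 'SELECT * FROM t WHERE x IN (?, ?)';
PREPARE s3 FROM 'SELECT * FROM t WHERE x IN (?, ?, ?)';
-- ... one new parse and plan for every distinct list length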
I would either load these values into a table and use a join, as you mentioned (unless the loading overhead is too high), or use an Oracle pipelined function, IN foo(params), where the params argument can be a complex structure (an array) coming from memory (PL/SQL, Java, etc.).
If the number of values is large, I would consider using EXISTS (SELECT 1 FROM mytable m WHERE m.key = x.key) or EXISTS (SELECT x FROM foo(params)) instead of IN. In such cases EXISTS provides better performance than IN.

mysql "KEEP ONLY" command?

I know that there is a DELETE FROM <table> WHERE <exprs> command in MySQL that deletes tuples from the specified table when the expression is true.
However, it becomes a burden to always apply De Morgan's law to take the complement of 'keep only' expressions.
My question is, is there a KEEP ONLY type of command for mysql? I tried looking everywhere, but only came across examples of taking the complements of expressions and then using the DELETE command.
There is nothing like KEEP ONLY. However, inverting a result set is not usually as hard as De Morgan's law.
Imagine you have some query that produces the results you want to keep. Remember that tables generally have some kind of primary key, and what you want to do is select those key columns. Often the key is a surrogate key (an ID column), and it's as easy as selecting that one column:
SELECT ID FROM table WHERE X
What you can do is nest the query. (Note that MySQL will not let the subquery read directly from the table being deleted from, so it is wrapped in a derived table here to avoid error 1093.)
DELETE FROM table
WHERE ID NOT IN (SELECT ID FROM (SELECT ID FROM table WHERE X) AS keep)
It's also common to write that as a JOIN instead of a NOT IN, but as the whole point of this answer was to simplify your logical process for producing this code, I'll leave that as an exercise for the reader. I will add, however, that if your primary key has more than one column, you may have to write that JOIN code; see the sketch below.
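For the composite-key case, the JOIN form might look like this (a sketch; mytable stands in for your table, since TABLE itself is a reserved word, and k1/k2 are hypothetical key columns):

DELETE t FROM mytable AS t
LEFT JOIN (SELECT k1, k2 FROM mytable WHERE X) AS keep
  ON keep.k1 = t.k1 AND keep.k2 = t.k2
WHERE keep.k1 IS NULL;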
I just figured out another way to structure a "KEEP ONLY" type of command!
Say you want something like this:
KEEP the tuples that satisfy <massive_expression>
All you have to do is negate <massive_expression> in the DELETE command, like so:
DELETE FROM table
WHERE NOT (massive_expression);
(One caveat: rows where the expression evaluates to NULL are matched by neither the expression nor its negation, so they will be kept.)
It makes total sense, I should have seen this before!

Best primary key for storing URLs

Which is the best primary key for storing website addresses and page URLs?
To avoid using an auto-increment id (which is not really tied to the data), I designed the schema to use a SHA1 signature of the URL as the primary key.
This approach is useful in many ways: for example, I don't need to read the last insert id back from the database, so I can prepare all table updates by calculating the key up front and do the real update in a single transaction. No constraint violations.
Anyway, I have read two books that tell me I am wrong. "High Performance MySQL" says a random key is not good for the DB optimizer, and in each of Joe Celko's books he says the primary key should be some part of the data.
The question is: the natural keys for URLs are... the URLs themselves. And while a site's hostname may be short (www.something.com), there is no imposed limit on the length of a URL (see http://www.boutell.com/newfaq/misc/urllength.html).
Consider that I have to store (and work with) some millions of them.
Which is the best key, then: auto-increment ids, URLs, or hashes of URLs?
You'll want an autoincrement numeric primary key. For the times when you need to pass ids around or join against other tables (for example, optional attributes for a URL), you'll want something small and numeric.
As for what other columns and indexes you want, it depends, as always, on how you're going to use them.
A column storing a hash of each URL is an excellent idea for almost any application that uses a significant number of URLs. It makes SELECTing a URL by its full text about as fast as it's going to get. A second advantage is that if you make that column UNIQUE, you don't need to worry about making the column storing the actual URL unique, and you can use REPLACE INTO and INSERT IGNORE as simple, fast atomic write operations.
I would add that using MySQL's built-in MD5() function is just fine for this purpose. Its only disadvantage is that a dedicated attacker can force collisions, which I'm quite sure you don't care about. Using the built-in function makes, for example, some types of joins much easier. It can be a tiny bit slower to pass a full URL across the wire ("SELECT url FROM urls WHERE hash=MD5('verylongurl')" instead of "WHERE hash='32charhexstring'"), but you'll have the option to do that if you want. Unless you can come up with a concrete scenario where MD5() will let you down, feel free to use it.
The hard question is whether and how you're going to need to look up URLs in ways other than their full text: for example, will you want to find all URLs starting with "/foo" on any "bar.com" host? While "LIKE '%bar.com%/foo%'" will work in testing, it will fail miserably at scale. If your needs include things like that, you can come up with creative ways to generate non-UNIQUE indexes targeted at the type of data you need... maybe a domain_name column, for starters. You'll have to populate those columns from your application, almost certainly (triggers and stored procedures are a lot more trouble than they're worth here, especially if you're concerned about performance -- don't bother).
The good news is that relational databases are very flexible for that sort of thing. You can always add new columns and populate them later. I would suggest for starters: int unsigned auto_increment primary key, unique hash char(32), and (assuming 64K chars suffices) text url.
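A sketch of that starting point, using the built-in MD5() as discussed above (the table and column names are just for illustration):

CREATE TABLE urls (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  hash CHAR(32) NOT NULL,
  url  TEXT NOT NULL,
  UNIQUE KEY (hash)
) ENGINE=InnoDB;

-- Simple, atomic write keyed on the hash:
INSERT IGNORE INTO urls (hash, url)
VALUES (MD5('http://example.com/some/long/url'), 'http://example.com/some/long/url');

-- Fast lookup by full text:
SELECT id, url FROM urls WHERE hash = MD5('http://example.com/some/long/url');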
Presumably you're talking about an entire URL, not just a hostname, including CGI parameters and other stuff.
SHA-1 hashing the URLs makes all the keys long and makes troubleshooting fairly obscure. I had to use indexes on hashes once to obscure some confidential data while maintaining the ability to join two tables, and the performance was poor.
There are two possible approaches. One is the naive and obvious one: index the URL text itself. It will actually work well in MySQL, and it has advantages such as simplicity and the ability to use URL LIKE 'whatever%' to search efficiently.
But if you have lots of URLs concentrated in a few domains ... for example ....
http://stackoverflow.com/questions/3735390/best-primary-key-for-storing-urls
http://stackoverflow.com/questions/3735391/how-to-add-a-c-compiler-flag-to-extconf-rb
etc., you're looking at index entries which vary only in their last characters. In this case you might consider storing and indexing the URLs with their character order reversed. This may lead to a more efficiently accessed index.
(The Oracle database server happens to have a built-in way of doing this, the so-called reverse key index.)
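In MySQL, one way to sketch the reversed approach (assuming MySQL 5.7+ for generated columns; the names and sizes are illustrative):

CREATE TABLE urls_by_suffix (
  url     VARCHAR(500) NOT NULL PRIMARY KEY,
  url_rev VARCHAR(500) GENERATED ALWAYS AS (REVERSE(url)) STORED,
  KEY idx_url_rev (url_rev)
);

-- A trailing-anchored search becomes an index-friendly prefix search:
SELECT url FROM urls_by_suffix
WHERE url_rev LIKE CONCAT(REVERSE('/some/page.html'), '%');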
If I were you, I would avoid an auto-increment key unless you have to join more than two tables ON TABLE_A.URL = TABLE_B.URL, or some other join condition with that kind of meaning.
It depends on how you use the table. If you mostly select with WHERE url='<url>', then it's fine to have a one-column table. If you can use an auto-increment id to identify a URL in all places in your app, then use the auto-increment id.

How do you handle descriptive database table names and their effect on foreign key names?

I am working on a database schema, and am trying to make some decisions about table names. I like at least somewhat descriptive names, but then when I use suggested foreign key naming conventions, the result seems to get ridiculous. Consider this example:
Suppose I have table
session_subject_mark_item_info
And it has a foreign key that references
sessionSubjectID
in the
session_subjects
table.
Now when I create the foreign key name based on fk_[referencing_table]__[referenced_table]_[field_name], I end up with this madness:
fk_session_subject_mark_item_info__session_subjects_sessionSubjectID
Would this type of a foreign key name cause me problems down the road, or is it quite common to see this?
Also, how do the more experienced database designers out there handle the conflict between descriptive naming for readability vs. the long names that result?
I am using MySQL and MySQL Workbench if that makes any difference.
UPDATE
I received the answers I needed below, but I wanted to mention that after some testing, I discovered that MySQL does have a limit on how long an FK name can be (identifiers are limited to 64 characters). So using the naming convention I mentioned together with descriptive table names meant that in two instances in my db I had to shorten the names to avoid the MySQL 1059 error:
http://dev.mysql.com/doc/refman/5.1/en/error-messages-server.html#error_er_too_long_ident
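For illustration, the convention above yields a 68-character name, over the 64-character limit, so a shortened constraint name is needed (a sketch; the referencing column name is an assumption):

ALTER TABLE session_subject_mark_item_info
  ADD CONSTRAINT fk_ssmii__session_subjects
  FOREIGN KEY (sessionSubjectID)
  REFERENCES session_subjects (sessionSubjectID);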
Why do you care what the FK names are? You never see them in code or use them. We also name our tables quite descriptively and commonly have names like this, using SQL Server. It doesn't matter to us, because we never see them. They are just there to enforce data integrity.
FK names are important for maintenance. Generally I reference only the FK and the two table names, not the fields, in the name. If you have named your fields correctly, it will be obvious what the fields are.
Although it probably makes no difference, I will say that I've had table names both ways. In my opinion, using long descriptive table names is overkill; when working in code or even at the command line, these long table names become burdensome and tedious. I mean seriously, who in their right mind would have a nearly 30-character table name, e.g. stationchangelogmasterreport? Now imagine tens or even hundreds of these in a database system. From a developer's point of view, this is just dumb! My recommendation: put some thought into it, use abbreviations when you can, and keep it short and to the point. For example, the above table name could be shortened to stnchangelog, and if someone absolutely NEEDS a huge description explaining every meaning and use case for the table, put that description in the table metadata, i.e. the comments on the table. This keeps us developers from going crazy (and hating you for it), and still offers the "meaning" of the table if needed.