What, exactly, does allowMultiQueries do? - mysql

Adding allowMultiQueries=true to the JDBC connection string makes the MySQL driver accept Statements that contain multiple queries separated by semicolons.
But what exactly does this do? Is there any benefit to this?
Perhaps it reduces the delay due to round trips? Something like
LOCK
UPDATE ...
UNLOCK
which, if done in one statement, holds the lock for less time.
When, if ever, would I want to combine queries in a single Statement, rather than in separate ones?
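To make my example concrete, here is roughly what a single multi-statement batch could look like with allowMultiQueries=true (the table name is made up); the whole text below would be passed to one Statement.execute() call:
LOCK TABLES accounts WRITE;
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
UNLOCK TABLES;
-- With allowMultiQueries=true the driver sends all three statements in one round trip.
-- Without it, anything after the first semicolon is rejected as a syntax error.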

For running safe scripts of your own creation that would otherwise need to be run line by line: for instance, a script from mysqldump, or one that you would have run anyway, safely and trusted. This was pointed out to me once by someone when I asked "why would you want to do that?" His answer: his own stockpile of scripts, each of which takes no user input, so there is no opening for tomfoolery or SQL injection. The size of these routines is limited by max_allowed_packet, and the strategy is, of course, to read the file into a buffer and use that buffer as the query in a Multi.
For running a few statements in concert where one relies on the transient state left by another. Transient meaning that if the follow-up statement were issued as a separate call rather than in the Multi, the information it needs would no longer be available. A common example, wise or not, is the duo of SQL_CALC_FOUND_ROWS and FOUND_ROWS(), which was popularly debunked in the Percona article To SQL_CALC_FOUND_ROWS or not to SQL_CALC_FOUND_ROWS?. There is an argument to be made that a single call which not only returns the resultset but also leaves the total count available to be grabbed right after is the wiser route for accurate pagination, since a separate call for COUNT(*) and another for the data can produce a discrepancy in multi-user concurrent systems, which most of ours are. That argument addresses accuracy, not performance, which is what the Percona article is about. Another use-case is priming User-Defined Variables and then using them in subsequent queries, though many of these can be folded into a single query and initialized with a cross join.
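A minimal sketch of both use-cases sent as one multi-statement call (table and column names are made up):
SELECT SQL_CALC_FOUND_ROWS id, title
FROM articles
WHERE status = 'published'
ORDER BY created_at DESC
LIMIT 20 OFFSET 40;
SELECT FOUND_ROWS();   -- total matching rows, ignoring the LIMIT, from that same call

-- Likewise, priming a user-defined variable for the statement that follows it:
SET @row_num := 0;
SELECT (@row_num := @row_num + 1) AS row_num, id FROM articles ORDER BY created_at DESC;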

When, if ever, would I want to combine queries in a single Statement, rather than in separate ones?
There are two great use cases for this feature:
If you are lazy and like to blindly run queries without checking for success or row counts or auto_increment value assignment, or
If you like the idea of increasing the odds of SQL injection vulnerabilities: username = '' AND 0 = 1; ← the injected semicolon right there. With this mode inactive, anything after the injected semicolon is an error, as it should be. With this mode active, a whole world of "oops" can open right up.
What I am saying is... You're right. Don't use it.
Yes, it reduces the impact of round-trip time to the database, pipelining queries... which can be significant with a distant database... but at the cost of increased risk that isn't worth it.
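To illustrate the risk (hypothetical table name and input, purely for demonstration): suppose the application naively concatenates user input after username =.
-- Intended query, with the user-supplied value spliced in:
SELECT * FROM users WHERE username = 'bob';
-- If the attacker submits:  '; DROP TABLE users; --
-- the concatenated text becomes two statements:
SELECT * FROM users WHERE username = ''; DROP TABLE users; -- ';
-- With multi-queries off, everything after the first semicolon is a syntax error.
-- With it on, the second statement runs too.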

Related

Keeping mysql query results and variables set from the queries for use across pages and sessions

I have never used apc_store() before, and I'm also not sure about whether to free query results or not. So I have these questions...
In a MySQL Query Cache article here, it says "The MySQL query cache is a global one shared among the sessions. It caches the select query along with the result set, which enables the identical selects to execute faster as the data fetches from the in memory."
Does using free_result() after a select query negate the caching spoken of above?
Also, if I want to set variables and arrays obtained from the select query for use across pages, should I save the variables in memory via apc_store() for example? (I know that can save arrays too.) And if I do that, does it matter if I free the result of the query? Right now, I am setting these variables and arrays in an included file on most pages, since they are used often. This doesn't seem very efficient, which is why I'm looking for an alternative.
Thanks for any help/advice on the most efficient way to do the above.
MySQL's "Query cache" is internal to MySQL. You still have to perform the SELECT; the result may come back faster if the QC is enabled and usable in the situation.
I don't think the QC is what you are looking for.
The QC is going away in newer versions. Do not plan to use it.
In PHP, consider $_SESSION. I don't know whether it is better than apc_store for your use.
Note also, anything that is directly available in PHP constrains you to a single webserver. (This is fine for small to medium apps, but is not viable for very active apps.)
For scaling, consider storing a small key in a cookie, then looking up that key in a table in the database. This provides for storing arbitrary amounts of data in the database with only a few milliseconds of overhead. The "key" might be something as simple as a "user id" or "session number" or "cart number", etc.
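A rough sketch of that idea, with made-up table and column names:
CREATE TABLE session_store (
    session_key  CHAR(36) PRIMARY KEY,     -- the small key stored in the cookie
    user_id      INT UNSIGNED,
    payload      JSON,                     -- arbitrary data (JSON on MySQL 5.7+, or TEXT)
    updated_at   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

-- On each page load, look the key up; this works the same no matter which webserver handles the request.
SELECT payload FROM session_store WHERE session_key = 'e4d3...';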

Table name changing to avoid SQL injection attack

I understand the basic process of SQL injection attack. My question is related to SQL injection prevention. I was told that one way to prevent such an attack is by frequently changing the table name! Is that possible?
If so, can someone provide me a link to read about it more because I couldn't find an explanation about it on the web.
No. That makes no sense. You'd either have to change every line of code that references the table or you'd have to leave in place something like a view with the old table name that acts exactly like the old table. No reasonable person would do that. Plus, it's not like there are a ton of reasonable names for tables, so you'd be doing crazy things like saying table A stores customer data, AA stores employer data, and AAA is the intersection between customers and employers.
SQL injection is almost comically simple to prevent. Use prepared statements with bind variables. Don't dynamically build SQL statements. Done. Of course, in reality, making sure that the new developer doesn't violate this dictum either because they don't know any better or because they can hack something out in a bit less time if they just do a bit of string concatenation makes it a bit more complex. But the basic approach is very simple.
Pffft. What? Frequently changing a table name?
That's bogus advice, as far as "preventing SQL Injection".
The only prevention for SQL Injection vulnerabilities is to write code that isn't vulnerable. And in the vast majority of cases, that is very easy to do.
Changing table names doesn't do anything to close a SQL Injection vulnerability. It might make a successful attack vector less repeatable, requiring an attacker to make some adjustments. But it does nothing to prevent SQL Injection.
As a starting point for research on SQL Injection, I recommend OWASP (Open Web Application Security Project)
Start here: https://www.owasp.org/index.php/SQL_Injection
If you run across "changing a table name" as a mitigation, let me know. I've never run across that as a prevention or mitigation for SQL Injection vulnerability.
Here's things you can do to prevent SQL injection:
Use an ORM that encapsulates your SQL calls and provides a friendly layer to your database records. Most of these are very good at writing high quality queries and protecting you from injection bugs simply because of how you use them.
Use prepared statements with placeholder values whenever possible. Write queries like this:
INSERT INTO table_name (name, age) VALUES (:name, :age)
Be very careful to properly escape any and all values that are inserted into SQL through any other method. This is always a risky thing to do, so any code you write like this should make the escaping blindingly obvious, so that a quick code review can verify it's working properly. Never hide escaping behind abstractions or methods with cute names like scrub or clean. Those methods might be subtly broken and you'd never notice.
Be absolutely certain any table name parameters, if dynamic, are tested against a white list of known-good values. For example, if you can create records of more than one type, or put data into more than one table, ensure that the parameter supplied is valid.
Trust nothing supplied by the user. Presume every single bit of data is tainted and hostile unless you've taken the trouble to clean it up. This goes doubly for anything that's in your database if you got your database from some other source, like inheriting a historical project. Paranoia is not unfounded, it's expected.
Write your code such that deleting a line does not introduce a security problem. That means never doing this:
$value = $db->escaped($value);
$db->query("INSERT INTO table (value) VALUES ('$value')");
You're one line away from failure here. If you must do this, write it like so:
$value_escaped = $db->escaped($value);
$db->query("INSERT INTO table (value) VALUES ('$value_escaped')");
That way deleting the line that does the escaping does not immediately cause an injection bug. The default here is to fail safely.
Make every effort to block direct access to your database server by aggressively firewalling it and restricting access to those that actually need access. In practice this means blocking port 3306 and using SSH for any external connections. If you can, eliminate SSH and use a secured VPN to connect to it.
Never generate errors which spew out stack traces that often contain information highly useful to attackers. For example, an error that includes a table name, a script path, or a server identifier is providing way too much information. Have these for development, and ensure these messages are suppressed on production servers.
Randomly changing table names is utterly pointless and will make your code a total nightmare. It will be very hard to keep all your code in sync with whatever random name the table is assuming at any particular moment. It will also make backing up and restoring your data almost impossible without some kind of decoder utility.
Anyone who recommends doing this is proposing a pointless and naïve solution to an already solved problem.
Suggesting that randomly changing the table names fixes anything demonstrates a profound lack of understanding of the form SQL injection bugs take. Knowing the table name is a nice thing for an attacker to have, and it makes their life easier, but many attacks need no knowledge of it. A common attack is to force a login as an administrator by injecting additional clauses into the WHERE condition; the table name is irrelevant.

How to optimize and Fast run SQL query

I have the following SQL query that is taking too much time to fetch data.
Customer.joins("LEFT OUTER JOIN renewals ON customers.id = renewals.customer_id").where("renewals.customer_id IS NULL && customers.status_id = 4").order("created_at DESC").select('first_name, last_name, customer_state, customers.created_at, customers.customer_state, customers.id, customers.status_id')
The above query takes 230976.6 ms to execute.
I added indexing on firstname, lastname, customer_state and status_id.
How can I execute the query in less than 3 seconds?
Try this...
Everyone wants faster database queries, and both SQL developers and DBAs can turn to many time-tested methods to achieve that goal. Unfortunately, no single method is foolproof or ironclad. But even if there is no right answer to tuning every query, there are plenty of proven do's and don'ts to help light the way. While some are RDBMS-specific, most of these tips apply to any relational database.
Do use temp tables to improve cursor performance
I hope we all know by now that it’s best to stay away from cursors if at all possible. Cursors not only suffer from speed problems, which in itself can be an issue with many operations, but they can also cause your operation to block other operations for a lot longer than is necessary. This greatly decreases concurrency in your system.
However, you can’t always avoid using cursors, and when those times arise, you may be able to get away from cursor-induced performance issues by doing the cursor operations against a temp table instead. Take, for example, a cursor that goes through a table and updates a couple of columns based on some comparison results. Instead of doing the comparison against the live table, you may be able to put that data into a temp table and do the comparison against that instead. Then you have a single UPDATE statement against the live table that’s much smaller and holds locks only for a short time.
Sniping your data modifications like this can greatly increase concurrency. I’ll finish by saying you almost never need to use a cursor. There’s almost always a set-based solution; you need to learn to see it.
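A rough T-SQL sketch of that pattern, with made-up table and column names:
-- Stage the comparison work in a temp table instead of walking the live table with a cursor.
SELECT OrderID,
       CASE WHEN Total > 1000 THEN 'VIP' ELSE 'STANDARD' END AS NewStatus
INTO #OrderUpdates
FROM dbo.Orders
WHERE StatusDate < '20150101';

-- One short, set-based UPDATE against the live table; locks are held only briefly.
UPDATE o
SET o.Status = u.NewStatus
FROM dbo.Orders AS o
JOIN #OrderUpdates AS u ON u.OrderID = o.OrderID;

DROP TABLE #OrderUpdates;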
Don’t nest views
Views can be convenient, but you need to be careful when using them. While views can help to obscure large queries from users and to standardize data access, you can easily find yourself in a situation where you have views that call views that call views that call views. This is called nesting views, and it can cause severe performance issues, particularly in two ways. First, you will very likely have much more data coming back than you need. Second, the query optimizer will give up and return a bad query plan.
I once had a client that loved nesting views. The client had one view it used for almost everything because it had two important joins. The problem was that the view returned a column with 2MB documents in it. Some of the documents were even larger. The client was pushing at least an extra 2MB across the network for every single row in almost every single query it ran. Naturally, query performance was abysmal.
And none of the queries actually used that column! Of course, the column was buried seven views deep, so even finding it was difficult. When I removed the document column from the view, the time for the biggest query went from 2.5 hours to 10 minutes. When I finally unraveled the nested views, which had several unnecessary joins and columns, and wrote a plain query, the time for that same query dropped to subseconds.
Do use table-valued functions
This is one of my favorite tricks of all time because it is truly one of those hidden secrets that only the experts know. When you use a scalar function in the SELECT list of a query, the function gets called for every single row in the result set. This can reduce the performance of large queries by a significant amount. However, you can greatly improve the performance by converting the scalar function to a table-valued function and using a CROSS APPLY in the query. This is a wonderful trick that can yield great improvements.
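A simplified sketch with made-up names: a scalar function such as dbo.fn_TotalSpend(CustomerID) would be invoked once per row, while the inline table-valued version below can be expanded by the optimizer into the main plan.
CREATE FUNCTION dbo.tvf_TotalSpend (@CustomerID INT)
RETURNS TABLE
AS
RETURN
(
    SELECT SUM(Amount) AS TotalSpend
    FROM dbo.Orders
    WHERE CustomerID = @CustomerID
);
GO

SELECT c.CustomerID, t.TotalSpend
FROM dbo.Customers AS c
CROSS APPLY dbo.tvf_TotalSpend(c.CustomerID) AS t;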
Want to know more about the APPLY operator? You'll find a full discussion in an excellent course on Microsoft Virtual Academy by Itzik Ben-Gan.
Do use partitioning to avoid large data moves
Not everyone will be able to take advantage of this tip, which relies on partitioning in SQL Server Enterprise, but for those of you who can, it’s a great trick. Most people don’t realize that all tables in SQL Server are partitioned. You can separate a table into multiple partitions if you like, but even simple tables are partitioned from the time they’re created; however, they’re created as single partitions. If you're running SQL Server Enterprise, you already have the advantages of partitioned tables at your disposal.
This means you can use partitioning features like SWITCH to archive large amounts of data from a warehousing load. Let’s look at a real example from a client I had last year. The client had the requirement to copy the data from the current day’s table into an archive table; in case the load failed, the company could quickly recover with the current day’s table. For various reasons, it couldn’t rename the tables back and forth every time, so the company inserted the data into an archive table every day before the load, then deleted the current day’s data from the live table.
This process worked fine in the beginning, but a year later, it was taking 1.5 hours to copy each table -- and several tables had to be copied every day. The problem was only going to get worse. The solution was to scrap the INSERT and DELETE process and use the SWITCH command. The SWITCH command allowed the company to avoid all of the writes because it assigned the pages to the archive table. It’s only a metadata change. The SWITCH took on average between two and three seconds to run. If the current load ever fails, you SWITCH the data back into the original table.
This is a case where understanding that all tables are partitioned slashed hours from a data load.
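A bare-bones sketch of the SWITCH step (made-up table names; the partition setup, matching schemas, and filegroup requirements are omitted here):
-- Metadata-only move of today's data into the archive table.
ALTER TABLE dbo.DailyLoad SWITCH TO dbo.DailyLoadArchive;

-- If the load fails, switch it straight back:
-- ALTER TABLE dbo.DailyLoadArchive SWITCH TO dbo.DailyLoad;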
If you must use ORMs, use stored procedures
This is one of my regular diatribes. In short, don’t use ORMs (object-relational mappers). ORMs produce some of the worst code on the planet, and they’re responsible for almost every performance issue I get involved in. ORM code generators can’t possibly write SQL as well as a person who knows what they're doing. However, if you use an ORM, write your own stored procedures and have the ORM call the stored procedure instead of writing its own queries. Look, I know all the arguments, and I know that developers and managers love ORMs because they speed you to market. But the cost is incredibly high when you see what the queries do to your database.
Stored procedures have a number of advantages. For starters, you’re pushing much less data across the network. If you have a long query, then it could take three or four round trips across the network to get the entire query to the database server. That's not including the time it takes the server to put the query back together and run it, or considering that the query may run several -- or several hundred -- times a second.
Using a stored procedure will greatly reduce that traffic because the stored procedure call will always be much shorter. Also, stored procedures are easier to trace in Profiler or any other tool. A stored procedure is an actual object in your database. That means it's much easier to get performance statistics on a stored procedure than on an ad-hoc query and, in turn, find performance issues and draw out anomalies.
In addition, stored procedures parameterize more consistently. This means you’re more likely to reuse your execution plans and even deal with caching issues, which can be difficult to pin down with ad-hoc queries. Stored procedures also make it much easier to deal with edge cases and even add auditing or change-locking behavior. A stored procedure can handle many tasks that trouble ad-hoc queries. My wife unraveled a two-page query from Entity Framework a couple of years ago. It took 25 minutes to run. When she boiled it down to its essence, she rewrote that huge query as SELECT COUNT(*) from T1. No kidding.
OK, I kept it as short as I could. Those are the high-level points. I know many .Net coders think that business logic doesn’t belong in the database, but what can I say other than you’re outright wrong. By putting the business logic on the front end of the application, you have to bring all of the data across the wire merely to compare it. That’s not good performance. I had a client earlier this year that kept all of the logic out of the database and did everything on the front end. The company was shipping hundreds of thousands of rows of data to the front end, so it could apply the business logic and present the data it needed. It took 40 minutes to do that. I put a stored procedure on the back end and had it call from the front end; the page loaded in three seconds.
Of course, the truth is that sometimes the logic belongs on the front end and sometimes it belongs in the database. But ORMs always get me ranting.
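To make the suggestion concrete, a minimal sketch (made-up names): the procedure owns the query, and the ORM simply calls it.
CREATE PROCEDURE dbo.GetOpenOrdersForCustomer
    @CustomerID INT
AS
BEGIN
    SET NOCOUNT ON;
    SELECT OrderID, OrderDate, Total
    FROM dbo.Orders
    WHERE CustomerID = @CustomerID
      AND Status = 'OPEN';
END;
GO

-- Application side (through the ORM or plain ADO.NET):
-- EXEC dbo.GetOpenOrdersForCustomer @CustomerID = 42;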
Don’t do large ops on many tables in the same batch
This one seems obvious, but apparently it's not. I’ll use another live example because it will drive home the point much better. I had a system that suffered tons of blocking. Dozens of operations were at a standstill. As it turned out, a delete routine that ran several times a day was deleting data out of 14 tables in an explicit transaction. Handling all 14 tables in one transaction meant that the locks were held on every single table until all of the deletes were finished. The solution was to break up each table's deletes into separate transactions so that each delete transaction held locks on only one table. This freed up the other tables and reduced the blocking and allowed other operations to continue working. You always want to split up large transactions like this into separate smaller ones to prevent blocking.
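A sketch of the fix, with made-up names: one short transaction per table instead of one giant transaction across all fourteen.
DECLARE @cutoff DATE = '20150101';

BEGIN TRANSACTION;
DELETE FROM dbo.OrderHistory WHERE LoadDate < @cutoff;
COMMIT;

BEGIN TRANSACTION;
DELETE FROM dbo.OrderAudit WHERE LoadDate < @cutoff;
COMMIT;
-- ...repeat per table, so locks on each table are released as soon as its delete finishes.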
Don't use triggers
This one is largely the same as the previous one, but it bears mentioning. Don’t use triggers unless it’s unavoidable -- and it’s almost always avoidable.
The problem with triggers: Whatever it is you want them to do will be done in the same transaction as the original operation. If you write a trigger to insert data into another table when you update a row in the Orders table, the lock will be held on both tables until the trigger is done. If you need to insert data into another table after the update, then put the update and the insert into a stored procedure and do them in separate transactions. If you need to roll back, you can do so easily without having to hold locks on both tables. As always, keep transactions as short as possible and don’t hold locks on more than one resource at a time if you can help it.
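A sketch of that alternative (made-up names): the update and the logging insert live in one procedure but run in separate, short transactions, so a lock is never held on both tables at once.
CREATE PROCEDURE dbo.UpdateOrderStatus
    @OrderID INT,
    @NewStatus VARCHAR(20)
AS
BEGIN
    BEGIN TRANSACTION;
    UPDATE dbo.Orders SET Status = @NewStatus WHERE OrderID = @OrderID;
    COMMIT;

    BEGIN TRANSACTION;
    INSERT INTO dbo.OrderStatusLog (OrderID, Status, ChangedAt)
    VALUES (@OrderID, @NewStatus, GETDATE());
    COMMIT;
END;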
Don’t cluster on GUID
After all these years, I can't believe we’re still fighting this issue. But I still run into clustered GUIDs at least twice a year.
A GUID (globally unique identifier) is a 16-byte randomly generated number. Ordering your table’s data on this column will cause your table to fragment much faster than using a steadily increasing value like DATE or IDENTITY. I did a benchmark a few years ago where I inserted a bunch of data into one table with a clustered GUID and into another table with an IDENTITY column. The GUID table fragmented so severely that the performance degraded by several thousand percent in a mere 15 minutes. The IDENTITY table lost only a few percent off performance after five hours. This applies to more than GUIDs -- it goes toward any volatile column.
Don’t count all rows if you only need to see if data exists
It's a common situation. You need to see if data exists in a table or for a customer, and based on the results of that check, you’re going to perform some action. I can't tell you how often I've seen someone do a SELECT COUNT(*) FROM dbo.T1 to check for the existence of that data:
SET @CT = (SELECT COUNT(*) FROM dbo.T1);
IF @CT > 0
BEGIN
    -- <do something>
END
It’s completely unnecessary. If you want to check for existence, then do this:
IF EXISTS (SELECT 1 FROM dbo.T1)
BEGIN
    -- <do something>
END
Don’t count everything in the table. Just get back the first row you find. SQL Server is smart enough to use EXISTS properly, and the second block of code returns superfast. The larger the table, the bigger difference this will make. Do the smart thing now before your data gets too big. It’s never too early to tune your database.
In fact, I just ran this example on one of my production databases against a table with 270 million rows. The first query took 15 seconds and included 456,197 logical reads, while the second one returned in less than one second and included only five logical reads. However, if you really do need a row count on the table, and it's really big, another technique is to pull it from the system table. SELECT rows FROM sysindexes will get you the row counts for all of the indexes. And because the clustered index represents the data itself, you can get the table rows by adding WHERE indid = 1. Then simply include the table name and you're golden. So the final query is SELECT rows FROM sysindexes WHERE object_name(id) = 'T1' AND indid = 1. In my 270-million-row table, this returned sub-second and had only six logical reads. Now that's performance.
Don’t do negative searches
Take the simple query SELECT * FROM Customers WHERE RegionID <> 3. You can’t use an index with this query because it’s a negative search that has to be compared row by row with a table scan. If you need to do something like this, you may find it performs much better if you rewrite the query to use the index. This query can easily be rewritten like this:
SELECT * FROM Customers WHERE RegionID < 3 UNION ALL SELECT * FROM Customers WHERE RegionID > 3
This query will use an index, so if your data set is large it could greatly outperform the table scan version. Of course, nothing is ever that easy, right? It could also perform worse, so test this before you implement it. There are too many factors involved for me to tell you that it will work 100 percent of the time. Finally, I realize this query breaks the “no double dipping” tip from the last article, but that goes to show there are no hard and fast rules. Though we're double dipping here, we're doing it to avoid a costly table scan.
Ref:http://www.infoworld.com/article/2604472/database/10-more-dos-and-donts-for-faster-sql-queries.html
http://www.infoworld.com/article/2628420/database/database-7-performance-tips-for-faster-sql-queries.html

Alternatives to LINQ To SQL on high loaded pages

To begin with, I LOVE LINQ TO SQL. It's so much easier to use than direct querying.
But, there's one great problem: it doesn't work well on high loaded requests. I have some actions in my ASP.NET MVC project, that are called hundreds times every minute.
I used to have LINQ to SQL there, but since the amount of requests is gigantic, LINQ TO SQL almost always returned "Row not found or changed" or "X of X updates failed". And it's understandable. For instance, I have to increase some value by one with every request.
var stat = DB.Stats.First();
stat.Visits++;
// ....
DB.SubmitChanges();
But while ASP.NET was working on those //... instructions, the stats.Visits value stored in the table got changed.
I found a solution, I created a stored procedure
UPDATE Stats SET Visits=Visits+1
It works well.
Unfortunately now I'm getting more and more moments like that. And it sucks to create stored procedures for all cases.
So my question is, how to solve this problem? Are there any alternatives that can work here?
I hear that Stackoverflow works with LINQ to SQL. And it's more loaded than my site.
This isn't exactly a problem with Linq to SQL, per se, it's an expected result with optimistic concurrency, which Linq to SQL uses by default.
Optimistic concurrency means that when you update a record, you check the current version in the database against the copy that was originally retrieved before making any offline updates; if they don't match, report a concurrency violation ("row not found or changed").
There's a more detailed explanation of this here. There's also a fairly sizable guide on handling concurrency errors. Typically the solution involves simply catching ChangeConflictException and picking a resolution, such as:
try
{
    // Make changes
    db.SubmitChanges();
}
catch (ChangeConflictException)
{
    foreach (var conflict in db.ChangeConflicts)
    {
        conflict.Resolve(RefreshMode.KeepCurrentValues);
    }
}
The above version will overwrite whatever is in the database with the current values, regardless of what other changes were made. For other possibilities, see the RefreshMode enumeration.
Your other option is to disable optimistic concurrency entirely for fields that you expect might be updated. You do this by setting the UpdateCheck option to UpdateCheck.Never. This has to be done at the field level; you can't do it at the entity level or globally at the context level.
Maybe I should also mention that you haven't picked a very good design for the specific problem you're trying to solve. Incrementing a "counter" by repeatedly updating a single column of a single row is not a very good/appropriate use of a relational database. What you should be doing is actually maintaining a history table - such as Visits - and if you really need to denormalize the count, implement that with a trigger in the database itself. Trying to implement a site counter at the application level without any data to back it up is just asking for trouble.
Use your application to put actual data in your database, and let the database handle aggregates - that's one of the things databases are good at.
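A rough sketch of that history-table approach (made-up names, SQL Server syntax since this is a LINQ to SQL project; @PageId is supplied by the application):
CREATE TABLE dbo.Visits (
    VisitId   INT IDENTITY PRIMARY KEY,
    PageId    INT NOT NULL,
    VisitedAt DATETIME NOT NULL DEFAULT GETDATE()
);

-- Recording a visit is a plain INSERT, so there is no read-modify-write race to lose:
INSERT INTO dbo.Visits (PageId) VALUES (@PageId);

-- Let the database produce the count when you actually need it:
SELECT COUNT(*) FROM dbo.Visits WHERE PageId = @PageId;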
Use a producer/consumer or message queue model for updates that don't absolutely have to happen immediately, particularly status updates. Instead of trying to update the database immediately keep a queue of updates that the asp.net threads can push to and then have a writer process/thread that writes the queue to the database. Since only one thread is writing, there will be much less contention on the relevant tables/roles.
For reads, use caching. For high volume sites even caching data for a few seconds can make a difference.
Firstly, you could call DB.SubmitChanges() right after stats.Visits++, and that would greatly reduce the problem.
However, that still is not going to save you from the concurrency violation (that is, simultaneously modifying a piece of data by two concurrent processes). To fight that, you may use the standard mechanism of transactions. With LINQ-to-SQL, you use transactions by instantiating a TransactionScope class, thusly:
using( TransactionScope t = new TransactionScope() )
{
    var stats = DB.Stats.First();
    stats.Visits++;
    DB.SubmitChanges();
}
Update: as Aaronaught correctly pointed out, TransactionScope is not going to help here, actually. Sorry. But read on.
Be careful, though, not to make the body of a transaction too long, as it will block other concurrent processes, and thus, significantly reduce your overall performance.
And that brings me to the next point: your very design is probably flawed.
The core principle in dealing with highly shared data is to design your application in such way that the operations on that data are quick, simple, and semantically clear, and they must be performed one after another, not simultaneously.
The one operation that you're describing - counting visits - is pretty clear and simple, so it should be no problem, once you add the transaction. I must add, however, that while this will be clear, type-safe and otherwise "good", the solution with a stored procedure is actually the much preferred one. This is actually exactly the way database applications were being designed in ye olden days. Think about it: why would you need to fetch the counter all the way from the database to your application (potentially over the network!) if there is no business logic involved in processing it? The database server may increment it just as well, without even sending anything back to the application.
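In other words, something like this (made-up procedure name) instead of fetching the row into the application:
CREATE PROCEDURE dbo.IncrementVisits
AS
BEGIN
    SET NOCOUNT ON;
    -- The read-modify-write happens atomically inside the server;
    -- nothing is round-tripped to the application.
    UPDATE dbo.Stats SET Visits = Visits + 1;
END;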
Now, as for the other operations that are hidden behind // ..., it seems (by your description) that they're somewhat heavy and long. I can't tell for sure, because I don't see what's there, but if that's the case, you probably want to separate them into smaller and quicker ones, or otherwise rethink your design. I really can't tell anything else with this little information.

How do I use EXPLAIN to *predict* performance of a MySQL query?

I'm helping maintain a program that's essentially a friendly read-only front-end for a big and complicated MySQL database -- the program builds ad-hoc SELECT queries from users' input, sends the queries to the DB, gets the results, post-processes them, and displays them nicely back to the user.
I'd like to add some form of reasonable/heuristic prediction for the constructed query's expected performance -- sometimes users inadvertently make queries that are inevitably going to take a very long time (because they'll return huge result sets, or because they're "going against the grain" of the way the DB is indexed) and I'd like to be able to display to the user some "somewhat reliable" information/guess about how long the query is going to take. It doesn't have to be perfect, as long as it doesn't get so badly and frequently out of whack with reality as to cause a "cry wolf" effect where users learn to disregard it;-) Based on this info, a user might decide to go get a coffee (if the estimate is 5-10 minutes), go for lunch (if it's 30-60 minutes), kill the query and try something else instead (maybe tighter limits on the info they're requesting), etc, etc.
I'm not very familiar with MySQL's EXPLAIN statement -- I see a lot of information around on how to use it to optimize a query or a DB's schema, indexing, etc, but not much on how to use it for my more limited purpose -- simply make a prediction, taking the DB as a given (of course if the predictions are reliable enough I may eventually switch to using them also to choose between alternate forms a query could take, but, that's for the future: for now, I'd be plenty happy just to show the performance guesstimates to the users for the above-mentioned purposes).
Any pointers...?
EXPLAIN won't give you any indication of how long a query will take.
At best you could use it to guess which of two queries might be faster, but unless one of them is obviously badly written then even that is going to be very hard.
You should also be aware that if you're using sub-queries, even running EXPLAIN can be slow (almost as slow as the query itself in some cases).
As far as I'm aware, MySQL doesn't provide any way to estimate the time a query will take to run. Could you log the time each query takes to run, then build an estimate based on the history of past similar queries?
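A small sketch of that logging idea (made-up table and column names): record each query's shape and elapsed time, then estimate new queries from the history of similar ones.
CREATE TABLE query_timings (
    id           BIGINT AUTO_INCREMENT PRIMARY KEY,
    query_shape  VARCHAR(64) NOT NULL,   -- e.g. a hash of the normalized SQL text
    elapsed_ms   INT NOT NULL,
    logged_at    TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    KEY idx_shape (query_shape)
);

-- Estimate for a new query with the same shape:
SELECT AVG(elapsed_ms) AS avg_ms, MAX(elapsed_ms) AS worst_ms
FROM query_timings
WHERE query_shape = '6f1d...';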
I think if you want to have a chance of building something reasonably reliable out of this, what you should do is build a statistical model out of table sizes and broken-down EXPLAIN result components correlated with query processing times. Trying to build a query execution time predictor based on thinking about the contents of an EXPLAIN is just going to spend way too long giving embarrassingly poor results before it gets refined to vague usefulness.
MySQL EXPLAIN has a column called key. If there is something in this column, it is a very good sign: it means that the query will use an index.
Queries that use indexes are generally safe to run, since they were likely thought out by the database designer when (s)he designed the database.
However
There is another field called Extra. This field sometimes contains the text Using filesort.
This is very, very bad. It means MySQL cannot get the rows in the required order from an index, so it has to make an extra sorting pass over the result; if that intermediate result is larger than the sort buffer, the sort spills to disk.
Conclusion
Instead of trying to predict the time a query takes, simply look at these two indicators. If a query shows Using filesort, deny the user. And depending on how strict you want to be, if the query is not using any keys, you should also deny it.
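A small sketch of those two checks (the query itself is made up):
EXPLAIN SELECT first_name, last_name
FROM customers
WHERE status_id = 4
ORDER BY created_at DESC;

-- In the result, inspect two columns:
--   key   : non-NULL means an index will be used (generally safe to run)
--   Extra : "Using filesort" means an extra sort pass (deny, or at least warn)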
Read more about the resultset of the MySQL EXPLAIN statement