Efficient to do math in a MySql "order by" clause? - mysql

I've been told on several occasions that it is quite efficient to SELECT using math and that it is NOT very efficient to use math in a WHERE clause. Are these sentiments correct? And how does this apply to ORDER BY clauses?
Thanks!!
Example:
SELECT a.* FROM a ORDER BY (a.field_1*a.field_2)

Your query will have to sort the entire table using temporary files on disk if the result is larger than the sort_buffer_size.
You probably want to add a column to your table that holds the value of field_1 * field_2. This of course slightly denormalizes your data, but it means you can create an index on the field.
If you have an index on the new field, then MySQL can read the data pre-sorted using the index, because MySQL indexes are B-tree structures and B-trees are stored in sorted order. This avoids extra disk I/O and CPU activity for the sort, and you scan the table only once.
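On MySQL 5.7 and later you can get this without maintaining the column by hand, using a stored generated column. A minimal sketch, assuming field_1 and field_2 are numeric columns of table a (the names field_3 and idx_a_field_3 are made up for the example):

ALTER TABLE a
    ADD COLUMN field_3 BIGINT GENERATED ALWAYS AS (field_1 * field_2) STORED,
    ADD INDEX idx_a_field_3 (field_3);

-- The sort key is now precomputed and indexed:
SELECT a.* FROM a ORDER BY a.field_3;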

It's a good idea, though I don't think a mathematical function in an ORDER BY clause often makes sense.
You can do it via an alias:
select *, (intId * intId) as xalias from m_xxx_list order by xalias;
OR
select * from m_xxx_list order by (intId + intId);
If you are using one of MySQL's mathematical aggregate functions, test it first.

For MySQL to sort results by a computed value, it needs to calculate that value on the fly for each row, after it has filtered out rows based on the WHERE clause. If the result set is quite large, then MySQL will need to compute the value for all of those rows.
For a small result set, this should be fine. However, the larger your result set is (prior to the application of the LIMIT), the more calculations the server has to do simply to work out the value by which to order the rows. If the calculation is deterministic, you should cache it in a column on the table and then index it. If it must be computed on the fly, you'll need to ensure your CPU is up to the task.
In the case provided, I would recommend creating a column, a.field_3, and storing the result of (a.field_1*a.field_2) in it. Whenever the values of a.field_1 or a.field_2 change, you'll need to recalculate the result.
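On versions without generated columns, a pair of triggers is one way to keep such a column in sync. A sketch, reusing the hypothetical a.field_3 from above:

-- Recompute the product whenever a row is inserted or updated:
CREATE TRIGGER a_bi BEFORE INSERT ON a
FOR EACH ROW SET NEW.field_3 = NEW.field_1 * NEW.field_2;

CREATE TRIGGER a_bu BEFORE UPDATE ON a
FOR EACH ROW SET NEW.field_3 = NEW.field_1 * NEW.field_2;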

Related

The process order of SQL order by, group by, distinct and aggregation function?

Query like:
SELECT DISTINCT max(age), area FROM T_USER GROUP BY area ORDER BY area;
So, what is the processing order of ORDER BY, GROUP BY, DISTINCT and aggregate functions?
Maybe a different order would produce the same result but with different performance. I want to merge multiple results; I have the SQL and have parsed it, so I want to know the order that standard SQL uses.
This is bigger than just GROUP BY/aggregation/ORDER BY. You want to have a sense of how a query engine creates a result set. At a high level, that means creating an execution plan, retrieving data from the table into the query's working set, manipulating the data to match the requested result set, and then returning the result set back to the caller. For very simple queries, or queries that are well matched to the table design (or table schemas that are well designed for the queries you'll need to run), this can mean streaming data from a table or index directly back to the caller. More often, it means thinking at a more detailed level, where you roughly follow these steps:
1. Look at the query to determine which tables will be needed.
2. Look at joins and subqueries to determine which of those tables depend on other tables.
3. Look at the conditions on the joins and in the WHERE clause, in conjunction with indexes, to determine how much data from each table will be needed and how much work it will take to extract the portions of each table that you need (that is, how well the query matches up with your indexes or the table as stored on disk).
4. Based on the information collected in steps 1 through 3, figure out the most efficient way to retrieve the data needed for the select list, regardless of the order in which tables are included in the query and regardless of any ORDER BY clause. For this step, "most efficient" is defined as the method that keeps the working set as small as possible for as long as possible.
5. Begin to iterate over the records indicated by step 4. If there is a GROUP BY clause, each record has to be checked against the existing discovered groups before the engine can determine whether a new row should be generated in the working set. Often, the most efficient way to do this is for the query engine to conduct an effective ORDER BY step here, such that all the potential rows for the results are materialized into the working set, which is then ordered by the columns in the GROUP BY clause and condensed so that duplicate rows are removed. Aggregate function results for each group are updated as the records for that group are discovered.
6. Once all of the indicated records are materialized, such that the results of any aggregate functions are known, HAVING clauses can be evaluated.
7. Now, finally, the ORDER BY can be factored in as well.
8. The records remaining in the working set are returned to the caller.
And as complicated as that was, it's only the beginning. It doesn't begin to account for windowing functions, common table expressions, cross apply, pivot, and on and on. It is, however, hopefully enough to give you a sense of the kind of work the database engine needs to do.
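As a rough illustration of the logical (not physical) order, here is the question's query annotated clause by clause; the WHERE and HAVING lines are hypothetical additions for the example:

SELECT DISTINCT MAX(age), area   -- 5. select list evaluated; 6. DISTINCT applied
FROM T_USER                      -- 1. source rows retrieved
WHERE age IS NOT NULL            -- 2. row-level filter (hypothetical)
GROUP BY area                    -- 3. rows grouped, aggregates accumulated
HAVING MAX(age) > 0              -- 4. group-level filter on aggregates (hypothetical)
ORDER BY area;                   -- 7. final ordering of the result set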

store uncertain order of list item in mysql

This is the front-end: http://jsfiddle.net/JRwM7/2/
I thought of just inserting normally (using order by) and then doing the sorting in the front-end, but I was curious and wanted to know whether there is any kind of query that keeps the data in my db table sorted.
A table is a set (of rows), and sets are inherently unordered. Logically, the only way to guarantee an order of rows returned from some query is to use ORDER BY clause.
On the physical level, if ORDER BY clause happens to match an index, the DBMS might be able to establish the order by just traversing the index's B-Tree instead of performing a separate sort, which may benefit performance.
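A minimal sketch of that case, with a hypothetical list_item table whose sort column is indexed:

CREATE TABLE list_item (
    id INT PRIMARY KEY,
    sort_order INT,
    label VARCHAR(100),
    KEY idx_sort (sort_order)
);

-- The ORDER BY matches idx_sort, so MySQL can walk the index instead of
-- sorting; EXPLAIN should not report "Using filesort" for this query:
SELECT id, sort_order FROM list_item ORDER BY sort_order;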
inserting normally (using order by)
I'm not sure what you mean by that; ORDER BY is not valid syntax for INSERT INTO.

Retrieve min and max values from different tables with the same structure

I have some log tables with the same structure. Each table is related to a site and contains billions of entries. The reason for this split is to keep queries quick and efficient, because 99.99% of the queries are related to a single site.
But now I would like to retrieve the min and max values of a column across these tables.
I can't manage to write the SQL request. Should I use UNION?
I am just looking for the concept of the request, not the final SQL statement.
You could use a UNION, yes. Something like this should do:
SELECT MAX(PartialMax) AS TotalMax
FROM
(
    SELECT MAX(YourColumn) AS PartialMax FROM FirstTable
    UNION ALL
    SELECT MAX(YourColumn) AS PartialMax FROM SecondTable
) AS X;
If you have an index over the column you want to find a MAX inside, you should have very good performance as the query should seek to the end of the index on that column to find the maximum value very rapidly. Without an index on that column, the query has to scan the whole table to find the maximum value since nothing inherently orders it.
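A sketch of such a per-table index, using the hypothetical names from the query above:

CREATE INDEX ix_first_yourcolumn ON FirstTable (YourColumn);

-- With the index in place, each partial MAX is read straight from the end
-- of the index; EXPLAIN typically shows "Select tables optimized away":
EXPLAIN SELECT MAX(YourColumn) FROM FirstTable;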
Added some details to address a concern about "enormous queries".
I'm not sure what you mean by "enormous". You could create a VIEW that does the UNIONs for you; then, you use the view and it will make the query very small:
SELECT MAX(YourColumn) FROM YourView;
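For reference, a sketch of what YourView could look like, reusing the hypothetical names from the UNION example above:

CREATE VIEW YourView AS
    SELECT MAX(YourColumn) AS YourColumn FROM FirstTable
    UNION ALL
    SELECT MAX(YourColumn) AS YourColumn FROM SecondTable;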
but that just optimizes for the size of your query's text. Why do you believe it is important to optimize for that? The VIEW can be helpful for maintenance -- if you add or remove a partition, just fix the view appropriately. But a long query text shouldn't really be a problem.
Or by "enormous", are you worried about the amount of I/O the query will do? Nothing can help that much, aside from making sure each table has an index on YourColumn so that maximum value on each partition can be found very quickly.

MySql queries: really never use SELECT *?

I'm a self-taught developer and I've always been told not to use SELECT *, but most of my queries require knowing all the values of a certain row...
What should I use then? Should I list ALL of the columns every time, like SELECT elem1, elem2, elem3, ..., elem15 FROM ...?
Thanks
If you really need all the columns and you're fetching the results by name, I would go ahead and use SELECT *. If you're fetching the row results by index, then it makes sense to specify the column names or else they might not be in the order you expect (especially if the table schema changes).
SELECT * FROM ... is not the best way to go unless you really need all columns. For example, if a table has 10 columns and you only need 2-3 of them, and those columns are indexed, then with SELECT * the query will run slower, because the server must fetch all rows from the data files. If you instead select only the 2-3 columns you actually need, the server can run the query much faster when the rows can be fetched from a covering index. A covering index is one that can be used to return results without reading the data file.
So, use SELECT * only when you actually need all columns.
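A short sketch of the difference, using a hypothetical orders table:

-- Composite index that covers the two columns the query needs:
CREATE INDEX idx_cust_status ON orders (customer_id, status);

-- Can be answered from the index alone; EXPLAIN shows "Using index":
SELECT customer_id, status FROM orders WHERE customer_id = 42;

-- Must also read the full rows from the data file:
SELECT * FROM orders WHERE customer_id = 42;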
If you absolutely have to use *, try to limit it to a specific table; e.g.:
SELECT t.*
FROM mytable t
List only the columns that you need, ideally with a table alias:
SELECT t.elem1,
t.elem2
FROM YOUR_TABLE t
The presence of a table alias helps demonstrate what is a column (and where it's from) vs a derived column.
If you are positive that you will always need all the columns, then SELECT * should be OK. But my reasoning for avoiding it is: say another developer adds a column to the table which isn't required by your query; then there is extra overhead. This can get worse as more columns get added.
The only real performance hit you take from using select * is in the bandwidth required to send back extra columns in your result set, if they're not necessary. Other than that, there's nothing inherently "bad" about using select *.
You might SELECT * from a subquery.
Yes, SELECT * is bad. You do not state what language you will be using to process the returned data. Suppose you receive these records back as an array (not a hash map). In that case, what is in Row[12]? Maybe it was ZipCode when you wrote the app, but guess what happens when someone inserts a field SuiteNumber before ZipCode.
Or suppose the next coder appends a huge blob field to each record. Ouch!
Or a little more subtle: let's assume you're doing a join or subselect and you have no TEXT or BLOB type fields. MySQL will create any temporary tables it needs in memory. But as soon as you include a TEXT field (even TINYTEXT), MySQL must create the temporary table on disk and invoke a sort-merge. This won't break the program, but it can kill performance.
SELECT * sacrifices maintainability in order to save a little typing.

How do I optimize MySQL's queries with constants?

NOTE: the original question is moot but scan to the bottom for something relevant.
I have a query I want to optimize that looks something like this:
select cols from tbl where col = "some run time value" limit 1;
I want to know what keys are being used but whatever I pass to explain, it is able to optimize the where clause to nothing ("Impossible WHERE noticed...") because I fed it a constant.
Is there a way to tell mysql to not do constant optimizations in explain?
Am I missing something?
Is there a better way to get the info I need?
Edit: EXPLAIN seems to be giving me the query plan that would result from constant values. As the query is part of a stored procedure (and, IIRC, query plans in sprocs are generated before they are called), this does me no good, because the values are not constant. What I want is to find out what query plan the optimizer will generate when it doesn't know what the actual value will be.
Am I missing something?
Edit2: Asking around elsewhere, it seems that MySQL always regenerates query plans unless you go out of your way to make it re-use them, even in stored procedures. From this it would seem that my question is moot.
However, that doesn't make what I really wanted to know moot: how do you optimize a query that contains values that are constant within any specific query, but where I, the programmer, don't know in advance what value will be used? For example, say my client-side code is generating a query with a number in its WHERE clause. Sometimes the number will result in an impossible WHERE clause, other times it won't. How can I use EXPLAIN to examine how well optimized the query is?
The best approach I can see right off the bat would be to run EXPLAIN for the full matrix of exists/doesn't-exist cases. That isn't really a very good solution, as it would be both tedious and error prone to do by hand.
You are getting "Impossible WHERE noticed" because the value you specified is not in the column, not just because it is a constant. You could either 1) use a value that exists in the column or 2) just say col = col:
explain select cols from tbl where col = col;
For example, say my client-side code is generating a query with a number in its WHERE clause.
Sometimes the number will result in an impossible WHERE clause, other times it won't.
How can I use EXPLAIN to examine how well optimized the query is?
MySQL builds different query plans for different values of bound parameters.
In this article you can read about when the MySQL optimizer does what:
Action                                        When
--------------------------------------------  -------------
Query parse                                   PREPARE
Negation elimination                          PREPARE
Subquery re-writes                            PREPARE
Nested JOIN simplification                    First EXECUTE
OUTER->INNER JOIN conversions                 First EXECUTE
Partition pruning                             Every EXECUTE
COUNT/MIN/MAX elimination                     Every EXECUTE
Constant subexpression removal                Every EXECUTE
Equality propagation                          Every EXECUTE
Constant table detection                      Every EXECUTE
ref access analysis                           Every EXECUTE
range/index_merge analysis and optimization   Every EXECUTE
Join optimization                             Every EXECUTE
There is one more thing missing from this list.
MySQL can rebuild a query plan on every JOIN iteration: a technique called range checking for each record.
If you have a composite index on a table:
CREATE INDEX ix_table2_col1_col2 ON table2 (col1, col2)
and a query like this:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col1 = t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound
, MySQL will NOT use an index RANGE access from (t1.value1, t1.value2_lowerbound) to (t1.value1, t1.value2_upperbound). Instead, it will use an index REF access on t1.value1 and just filter out the wrong values.
But if you rewrite the query like this:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col1 <= t1.value1
AND t2.col1 >= t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound
, then MySQL will recheck index RANGE access for each record from table1, and decide whether to use RANGE access on the fly.
You can read about it in these articles in my blog:
Selecting timestamps for a time zone - how to use coarse filtering to filter out timestamps without a timezone
Emulating SKIP SCAN - how to emulate SKIP SCAN access method in MySQL
Analytic functions: optimizing LAG, LEAD, FIRST_VALUE, LAST_VALUE - how to emulate Oracle's analytic functions in MySQL
Advanced row sampling - how to select N records from each group in MySQL
All these things employ RANGE CHECKING FOR EACH RECORD
Returning to your question: there is no way to tell which plan MySQL will use for any given constant, since there is no plan before the constant is given.
Unfortunately, there is no way to force MySQL to use one query plan for every value of a bound parameter.
You can control the JOIN order and the indexes being chosen by using the STRAIGHT_JOIN and FORCE INDEX clauses, but they will not force a certain access path on an index or forbid the IMPOSSIBLE WHERE.
On the other hand, for all JOINs MySQL employs only nested loops. That means that if you build the right JOIN order or choose the right indexes, MySQL will probably benefit from all IMPOSSIBLE WHEREs.
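A sketch of those hints, reusing the hypothetical tables and index from the examples above:

-- STRAIGHT_JOIN forces table1 to be read before table2;
-- FORCE INDEX nudges the optimizer toward the composite index on table2:
SELECT STRAIGHT_JOIN t1.*, t2.*
FROM table1 t1
JOIN table2 t2 FORCE INDEX (ix_table2_col1_col2)
    ON t2.col1 = t1.value1;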
How do you optimize a query with values that are constant only to the query, but where I, the programmer, don't know in advance what value will be used?
By using indexes on the specific columns (or even on a combination of columns, if you always query the given columns together). If you have indexes, the query planner will potentially use them.
Regarding "impossible" values: the query planner can conclude that a given value is not in the table from several sources:
if there is an index on the particular column, it can observe that the particular value is larger or smaller than any value in the index (min/max values take constant time to extract from indexes)
if you are passing in the wrong type (for example, asking for a numeric column to be equal to a text value)
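For instance, a sketch of the first case, assuming col has a unique index and the constant -1 does not exist in it:

-- The lookup is resolved at plan time against the unique index, and EXPLAIN
-- reports "Impossible WHERE noticed after reading const tables":
EXPLAIN SELECT cols FROM tbl WHERE col = -1;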
PS. In general, creating a query plan is not expensive, and it is better to re-create plans than to re-use them, since the conditions might have changed since the query plan was generated and a better query plan might exist.