I've got a rather simple query to a linked DB2 table.
SELECT GC_TBSELC.*
FROM GC_TBSELC
WHERE SELC_EFF_DATE > #1/1/2017#;
Works fine, returns results. However, when I add the "DISTINCT" keyword, I get an error:
ODBC -- CALL FAILED
[[IBM][CLI Driver][DB2] SQL0904N Unsuccessful execution caused by an
unavailable resource. Reason code: "00C90305", type of resource:
"00000100", and resource name: "DSNDB07". SQLSTATE=57011
Any idea on why the "DISTINCT" keyword would cause this, and if there's a way around it to get distinct records from the table?
SQL0904N with Reason code: 00C90305 indicates the following:
The limit on the space usage of the work file storage by an agent was
exceeded. The space usage limit is determined by the zparm keyword
MAXTEMPS.
By adding the DISTINCT clause on a SELECT * (all columns), you likely exceeded the work space available.
Let me ask a better question: Why would you want to DISTINCT all columns from a Table? Is this really the result set you are looking for? Would it be more appropriate to DISTINCT a subset of the columns in this table?
The query without the DISTINCT did not require duplicate removal - rows could just be streamed back to the caller.
The DISTINCT tells Db2 to remove duplicates before passing back the rows. In this case, Db2 likely materialized the rows into sort work and sorted them to remove duplicates, and during that process the sort work limits were exceeded.
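If a narrower result is really what you're after, something along these lines may keep the sort small enough to stay within the work file limits (the column list here is hypothetical - substitute the handful of columns you actually care about - and the date literal assumes the query is sent to DB2 directly rather than through Access):
-- Hypothetical column subset; pick the columns that define "distinct" for your purposes.
SELECT DISTINCT SELC_CODE, SELC_EFF_DATE
FROM GC_TBSELC
WHERE SELC_EFF_DATE > '2017-01-01'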
I have built a query editor where a user can enter a query. However, it needs to limit the user's entry to 1000 results; otherwise the user could enter something like:
SELECT * FROM mybigtable
which could try to download 1 billion results.
What would be the best way to enforce a limit? The first approach I thought of was to do:
SELECT * FROM (
user-query
) x LIMIT 1000
However, this would execute the entire query (and could take forever) before doing the actual limit. What would be the best way to enforce a strict limit on the user's sql input?
This is too long for a comment.
I don't think there is a generic solution for this.
Wrapping the user query in a SELECT * FROM ... LIMIT 1000 statement is attractive, but:
there are edge cases where it can produce invalid SQL, for example if the user query contains a CTE (the WITH clause must be placed at the very beginning of the query)
while it will happily limit the number of rows returned to the user, it will not prevent the database from scanning the entire result set
The typical solution for the second problem is to filter rows on an auto-incremented integer column (usually the primary key of your table), but that is even harder to make generic; a rough sketch of both approaches is shown below.
In short: manipulating SQL from the outside is tricky. If you want a complete solution for your use case, get yourself a real query builder (or at least an SQL parser).
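To make the wrapper and keyset approaches above concrete, here is a rough sketch (assuming a MySQL-style LIMIT and an auto-incremented primary key called id; both are illustrative, not from the original question):
-- Wrapper approach: caps what the caller receives, but the inner query
-- may still be evaluated in full by the database.
SELECT * FROM (
    SELECT * FROM mybigtable   -- the user's query goes here
) AS limited
LIMIT 1000

-- Keyset approach: stops the scan itself, but assumes a query you control
-- and an auto-incremented primary key (here called id).
SELECT * FROM mybigtable
WHERE id > 0        -- last id seen on the previous page; 0 for the first page
ORDER BY id
LIMIT 1000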
Is it possible to set a limit on the number of rows that a table can hold in MySQL?
I don't want to use any Java code. I want to do this using pure MySQL scripts.
I wouldn't recommend trying to limit the number of rows in a SQL table, unless you had a very good reason to do so. It seems you would be better off using a query like:
select entityID, entityName from TableName limit 1000
rather than physically limiting the rows of the table.
However, if you really want to limit it to 1000 rows:
delete from TableName where entityID not in
    (select entityID from (select entityID from TableName order by entityID limit 1000) as keep_rows)
(MySQL has no TOP, and it does not accept LIMIT directly inside an IN subquery, hence the derived table.)
MySQL supports a MAX_ROWS option when creating (and altering) a table. http://dev.mysql.com/doc/refman/5.0/en/create-table.html
Edit: Sadly, it turns out this is only a hint for optimization:
"The maximum number of rows you plan to store in the table. This is not a hard limit, but rather a hint to the storage engine that the table must be able to store at least this many rows."
Your question implied that scripts are OK; would it be so bad to make one as simple as a cron job that regularly drops table rows above a given ID? It's not nearly as elegant as having MySQL throw errors when something tries to add one row too many, but it would do the job, and you could also have your application check whether its ID is too high and throw a warning to the user/relevant party.
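A sketch of what that job could run, using MySQL's event scheduler in place of cron purely as an assumption that keeping it inside the database is acceptable (the 1000-ID cutoff and the names come from the example above):
-- Requires the event scheduler to be enabled (SET GLOBAL event_scheduler = ON).
CREATE EVENT trim_TableName
ON SCHEDULE EVERY 1 HOUR
DO
    DELETE FROM TableName WHERE entityID > 1000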
I had only rarely used SQL until recently, when I began using it daily. I notice that if no ORDER BY clause is used:
When selecting part of a table, the rows returned appear to be in the same order as they appear when I select the whole table.
The order of rows returned by selecting from a join seems to be determined by the leftmost member of the join.
Is this behaviour a standard thing one can count on in the most common databases (MySQL, Oracle, PostgreSQL, SQLite, SQL Server)? (I don't really even know whether one can truly count on it in SQLite.) And how strictly is it honored, if so (e.g., if one uses GROUP BY, would the individual groups each have that ordering)?
If no ORDER BY clause is included in the query, the returned order of rows is undefined.
Whilst some RDBMSes will return rows in specific orders in some situations even when an ORDER BY clause is omitted, such behaviour should never be relied upon.
Section 20.2 <direct select statement: multiple rows>, subsection "General Rules" of
the SQL-92 specification:
4) If an <order by clause> is not specified, then the ordering of
the rows of Q is implementation-dependent.
If you want order, include an ORDER BY. If you don't include an ORDER BY, you're telling SQL Server:
I don't care what order you return the rows, just return the rows
Since you don't care, SQL Server is going to return the rows in whatever manner it deems most efficient right now (or according to the last time the plan for this specific query was cached). Therefore you should not rely on the behavior you observe. It can change from one run of the query to the next, with data changes, statistics changes, index changes, service packs, cumulative updates, upgrades, etc. etc. etc.
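For example (table and column names here are illustrative only), including a unique column as a tie-breaker makes the requested order fully deterministic:
-- OrderDate alone may have ties; the unique OrderID settles them.
SELECT OrderID, OrderDate, CustomerName
FROM Orders
ORDER BY OrderDate DESC, OrderID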
For PostgreSQL, if you omit the ORDER BY clause you could run the exact same query 100 times while the database is not being modified, and get one run in the middle in a different order than the others. In fact, each run could be in a different order.
One reason this could happen is that if the plan chosen involves a sequential scan of a table's heap, and there is already a seqscan of that table's heap in progress, your query will start its scan at whatever point the other scan has already reached, to reduce the need for disk access.
As other answers have pointed out, if you want the data in a certain order, specify that order. PostgreSQL will take the requested order into consideration in choosing a plan, and may use an index that provides data in that order, if that works out to be cheaper than getting the rows some other way and then sorting them.
GROUP BY provides no guarantee of order; PostgreSQL might sort the data to do the grouping, or it might use a hash table and return the rows in order of the number generated by the hashing algorithm (i.e., pretty random). And that might change from one run to the next.
It never ceased to amaze me when I was a DBA that this feature of SQL was so often thought of as quirky. Consider a simple program that runs against a text file and produces some output. If the program never changes, and the data never changes, you'd expect the output to never change.
As for this:
If no ORDER BY clause is included in the query, the returned order of rows is undefined.
Not strictly true - on every RDBMS I've ever worked on (Oracle, Informix, SQL Server, and DB2, to name a few), a DISTINCT clause also has the same effect as an ORDER BY, since finding unique values involves a sort by definition.
EDIT (6/2/14):
Create a simple table:
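(A minimal sketch of the kind of setup meant here - the table and column names are made up.)
CREATE TABLE t_example (
    id  INT PRIMARY KEY,
    val VARCHAR(20)
)

-- The comparison in question: duplicate removal versus an explicit sort.
SELECT DISTINCT val FROM t_example

SELECT val FROM t_example ORDER BY val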
For DISTINCT and ORDER BY, both the plan and the cost are the same, since it is ostensibly the same operation being performed.
And, not surprisingly, the effect is thus the same.
I am given a bunch of IDs (from an external source) that I need to cross-reference with the ones in our database, filtering out those that fall within a certain date range, are "enabled", and meet some other parameters. I can easily do this with:
SELECT * FROM `table` WHERE `id` IN (csv_list_of_external_ids)
AND (my other cross-reference parameters);
By doing this, of all those incoming IDs, I will get back the ones that I want. But obviously this is not a very efficient method when the external IDs number in the thousands, and I'm not sure MySQL will even support such a huge query.
Given that nothing can be cached (since both the user data and the external IDs change on pretty much every query), and that these queries happen at least every 10 seconds, what other SQL alternatives are there?
I believe the only limit is the length of the actual query, which is controlled by the "max_allowed_packet" parameter in your my.cnf file.
If you express it as a subquery:
SELECT * FROM table
WHERE id IN (SELECT ID FROM SOME_OTHER_TABLE)
AND ...
there is no limit.
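Since the IDs in this case come from outside the database, one way to give the subquery something to read is a temporary table loaded per request - a sketch only (the temp-table name, the enabled column, and whether this actually beats one long IN list at your volumes are all assumptions):
-- Load the incoming IDs once, then join instead of building a giant IN list.
CREATE TEMPORARY TABLE external_ids (id INT NOT NULL PRIMARY KEY)

INSERT INTO external_ids (id) VALUES (101), (102), (103)   -- batched inserts of the incoming IDs

SELECT t.*
FROM `table` AS t
JOIN external_ids AS x ON x.id = t.id
WHERE t.enabled = 1    -- plus the other cross-reference parameters

DROP TEMPORARY TABLE external_ids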
So I have a table that has a little over 5 million rows. When I use SQL_CALC_FOUND_ROWS, the query just hangs forever. When I take it out, the query executes within a second with LIMIT ,25. My question is: for pagination purposes, is there an alternative way to get the total number of rows?
SQL_CALC_FOUND_ROWS forces MySQL to scan for ALL matching rows, even if they'd never get fetched. Internally it amounts to the same query being executed without the LIMIT clause.
If the filtering you're doing via WHERE isn't too crazy, you could calculate and cache various types of filters to save the full-scan load imposed by SQL_CALC_FOUND_ROWS. Basically, run a SELECT COUNT(*) FROM ... WHERE ... for the most likely WHERE clauses.
Otherwise, you could go Google-style and just spit out some page numbers that occasionally have no relation whatsoever with reality (You know, you see "Goooooooooooogle", get to page 3, and suddenly run out of results).
Detailed talk about implementing Google-style pagination using MySQL
You should choose between COUNT(*) and SQL_CALC_FOUND_ROWS depending on the situation. If your query's search criteria use only columns covered by an index, use COUNT(*). In that case MySQL will "read" from the indexes only, without touching the actual data in the table, while the SQL_CALC_FOUND_ROWS method will load rows from disk, which can be expensive and time-consuming on massive tables.
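A rough sketch of the two-query pattern this describes (table, column, and status values are placeholders; the WHERE clause must match between the two statements):
-- Page of results.
SELECT id, title
FROM articles
WHERE status = 'published'
ORDER BY id
LIMIT 0, 25

-- Separate count for the pager; with an index on (status) this can often be
-- answered from the index alone, without touching the table rows.
SELECT COUNT(*)
FROM articles
WHERE status = 'published'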
More information on this topic is in this article at #mysqlperformanceblog.