Difference between offset vs limit [duplicate] - mysql

I have this really big table with some millions of records every day and in the end of every day I am extracting all the records of the previous day. I am doing this like:
String SQL = "select col1, col2, coln from mytable where timecol = yesterday";
Statement.executeQuery(SQL);
The problem is that this program takes like 2GB of memory because it takes all the results in memory then it processes it.
I tried setting the Statement.setFetchSize(10) but it takes exactly the same memory from OS it does not make any difference. I am using Microsoft SQL Server 2005 JDBC Driver for this.
Is there any way to read the results in small chunks like the Oracle database driver does when the query is executed to show only a few rows and as you scroll down more results are shown?

In JDBC, the setFetchSize(int) method is very important to performance and memory-management within the JVM as it controls the number of network calls from the JVM to the database and correspondingly the amount of RAM used for ResultSet processing.
Inherently if setFetchSize(10) is being called and the driver is ignoring it, there are probably only two options:
Try a different JDBC driver that will honor the fetch-size hint.
Look at driver-specific properties on the Connection (URL and/or property map when creating the Connection instance).
The RESULT-SET is the number of rows marshalled on the DB in response to the query.
The ROW-SET is the chunk of rows that are fetched out of the RESULT-SET per call from the JVM to the DB.
The number of these calls and resulting RAM required for processing is dependent on the fetch-size setting.
So if the RESULT-SET has 100 rows and the fetch-size is 10,
there will be 10 network calls to retrieve all of the data, using roughly 10*{row-content-size} RAM at any given time.
The default fetch-size is 10, which is rather small.
In the case posted, it would appear the driver is ignoring the fetch-size setting, retrieving all data in one call (large RAM requirement, optimum minimal network calls).
What happens underneath ResultSet.next() is that it doesn't actually fetch one row at a time from the RESULT-SET. It fetches that from the (local) ROW-SET and fetches the next ROW-SET (invisibly) from the server as it becomes exhausted on the local client.
All of this depends on the driver as the setting is just a 'hint' but in practice I have found this is how it works for many drivers and databases (verified in many versions of Oracle, DB2 and MySQL).

The fetchSize parameter is a hint to the JDBC driver as to many rows to fetch in one go from the database. But the driver is free to ignore this and do what it sees fit. Some drivers, like the Oracle one, fetch rows in chunks, so you can read very large result sets without needing lots of memory. Other drivers just read in the whole result set in one go, and I'm guessing that's what your driver is doing.
You can try upgrading your driver to the SQL Server 2008 version (which might be better), or the open-source jTDS driver.

You need to ensure that auto-commit on the Connection is turned off, or setFetchSize will have no effect.
dbConnection.setAutoCommit(false);
Edit: Remembered that when I used this fix it was Postgres-specific, but hopefully it will still work for SQL Server.

Statement interface Doc
SUMMARY: void setFetchSize(int rows)
Gives the JDBC driver a hint as to the
number of rows that should be fetched
from the database when more rows are
needed.
Read this ebook J2EE and beyond By Art Taylor

Sounds like mssql jdbc is buffering the entire resultset for you. You can add a connect string parameter saying selectMode=cursor or responseBuffering=adaptive. If you are on version 2.0+ of the 2005 mssql jdbc driver then response buffering should default to adaptive.
http://msdn.microsoft.com/en-us/library/bb879937.aspx

It sounds to me that you really want to limit the rows being returned in your query and page through the results. If so, you can do something like:
select * from (select rownum myrow, a.* from TEST1 a )
where myrow between 5 and 10 ;
You just have to determine your boundaries.

Try this:
String SQL = "select col1, col2, coln from mytable where timecol = yesterday";
connection.setAutoCommit(false);
PreparedStatement stmt = connection.prepareStatement(SQL, SQLServerResultSet.TYPE_SS_SERVER_CURSOR_FORWARD_ONLY, SQLServerResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(2000);
stmt.set....
stmt.execute();
ResultSet rset = stmt.getResultSet();
while (rset.next()) {
// ......

I had the exact same problem in a project. The issue is that even though the fetch size might be small enough, the JDBCTemplate reads all the result of your query and maps it out in a huge list which might blow your memory. I ended up extending NamedParameterJdbcTemplate to create a function which returns a Stream of Object. That Stream is based on the ResultSet normally returned by JDBC but will pull data from the ResultSet only as the Stream requires it. This will work if you don't keep a reference of all the Object this Stream spits. I did inspire myself a lot on the implementation of org.springframework.jdbc.core.JdbcTemplate#execute(org.springframework.jdbc.core.ConnectionCallback). The only real difference has to do with what to do with the ResultSet. I ended up writing this function to wrap up the ResultSet:
private <T> Stream<T> wrapIntoStream(ResultSet rs, RowMapper<T> mapper) {
CustomSpliterator<T> spliterator = new CustomSpliterator<T>(rs, mapper, Long.MAX_VALUE, NON-NULL | IMMUTABLE | ORDERED);
Stream<T> stream = StreamSupport.stream(spliterator, false);
return stream;
}
private static class CustomSpliterator<T> extends Spliterators.AbstractSpliterator<T> {
// won't put code for constructor or properties here
// the idea is to pull for the ResultSet and set into the Stream
#Override
public boolean tryAdvance(Consumer<? super T> action) {
try {
// you can add some logic to close the stream/Resultset automatically
if(rs.next()) {
T mapped = mapper.mapRow(rs, rowNumber++);
action.accept(mapped);
return true;
} else {
return false;
}
} catch (SQLException) {
// do something with this Exception
}
}
}
you can add some logic to make that Stream "auto closable", otherwise don't forget to close it when you are done.

Related

Postgres vs MySQL: Commands out of sync;

MySQL scenario:
When I execute "SELECT" queries in MySQL using multiple threads I get the following message: "Commands out of sync; you can't run this command now", I found that this is due to the limitation of having to wait "consume" the results to make another query. C ++ example:
void DataProcAsyncWorker::Execute()
{
std::thread (&DataProcAsyncWorker::Run, this).join();
}
void DataProcAsyncWorker :: Run () {
sql::PreparedStatement * prep_stmt = c->con->prepareStatement(query);
...
}
Important:
I can't help using multiple threads per query (SELECT, INSERT, ETC) because the module I'm building that is being integrated with NodeJS "locks" the thread until the result is already obtained, for this reason I need to run in the background (new thread) and resolve the "promise" containing the result obtained from MySQL
Important:
I am saving several "connections" [example: 10], and with each SQL call the function chooses a connection. This is: 1. A connection pool that contains 10 established connections, Ex:
for (int i = 0; i <10; i ++) {
Com * c = new Com;
c->id = i;
c->con = openConnection ();
c->con->setSchema("gateway");
conns.push_back(c);
}
The problem occurs when executing> = 100 SELECT queries per second, I believe that even with the connection balance 100 connections per second is a high number and the connection "ex: conns.at(10)" is in process and was not consumed
My question:
Does PostgreSQL have this limitation as well? Or in PostgreSQL there is also such a limitation?
Note:
In PHP Docs about MySQL, the mysqli_free_result command is required after using mysqli_query, if not, I will get a "Commands out of sync" error, in contrast to the PostgreSQL documentation, the pg_free_result command is completely optional after using pg_query.
That said, someone using PostgreSQL has already faced problems related to "commands are out of sync", maybe there is another name for this error?
Or is PostgreSQL able to deal with this problem automatically for this reason the free_result is being called invisibly by the server without causing me this error?
You need to finish using one prepared statement (or cursor or similar construct) before starting another.
"Commands out of sync" is often cured by adding the closing statement.
"Question:
Does PostgreSQL have this limitation as well? Or in PostgreSQL there is also such a limitation?"
No, the PostgreSQL does not have this limitation.

AWS RDS MySQL retrieving data for certain table times out

Strange problem with my database that I host using AWS RDS. For a certain table, I sometimes suddenly get timeouts for almost all queries. Interestingly, for the other tables, there are almost no time outs (after 150.000 ms which is the max I have set for the lambda, after that it terminates) while they contain similar data.
This is the Lambda (the function that gets the data from the database) log:
15:38:10 Connecting db: jdbc:mysql://database.rds.amazonaws.com:3306/database_name Connected
15:38:10 Connection retrieved for matches_table matches, proceeding to statement
15:38:10 Statement created, proceeding to executing SQL
15:40:35 END RequestId: 410f7edf-0f48-45df-b509-a9b822fa5c1c
15:40:35 REPORT RequestId: 410f7edf-0f48-45df-b509-a9b822fa5c1c Duration: 150083.43 ms Billed Duration: 150000 ms Memory Size: 1024 MB Max Memory Used: 115 MB
15:40:35 2019-06-04T15:40:35.514Z 410f7edf-0f48-45df-b509-a9b822fa5c1c Task timed out after 150.08 seconds
And this is Java code that I use:
LinkedList<Object> matches = new LinkedList<Object>();
try {
String sql = db_conn.getRetrieveAllMatchesSqlSpecificColumn(userid, websiteid, profileid, matches_table, "matchid");
Connection conn = db_conn.getConnection();
System.out.println("Connection retrieved for matches_table " +matches_table+", proceeding to statement");
Statement st = conn.createStatement();
System.out.println("Statement created, proceeding to executing SQL");
// execute the query, and get a java resultset
ResultSet rs = st.executeQuery(sql);
System.out.println("SQL executed, now iterating to resultset");
// iterate through the java resultset
st.close();
} catch (SQLException ex) {
Logger.getLogger(AncestryDnaSQliteJDBC.class.getName()).log(Level.SEVERE, null, ex);
}
return matches;
A couple of months ago I did a big database resources upgrade and some removal of unwanted data and that more or less fixed it. But if I look at the current stats, it looks ok. Plenty of RAM (1GB) available, no swap used, enough cpu credits.
So I am not sure if this is a MySQL problem or a database problem linked to RDW AWS. Any suggestions?
Alright, it turned out to be an AWS specific thing. Turns out, there is some kind of IO credit system linked to the database. Interestingly, the chart that describes the number of credits left is not available in the default monitoring view of AWS RDS. You have to dive into CloudWatch and find it quite hidden. By increasing the allocated storage for this database, you earn more credits and by doing so I fixed the problem.

Does Statement.RETURN_GENERATED_KEYS generate any extra round trip to fetch the newly created identifier?

JDBC allows us to fetch the value of a primary key that is automatically generated by the database (e.g. IDENTITY, AUTO_INCREMENT) using the following syntax:
PreparedStatement ps= connection.prepareStatement(
"INSERT INTO post (title) VALUES (?)",
Statement.RETURN_GENERATED_KEYS
);
while (resultSet.next()) {
LOGGER.info("Generated identifier: {}", resultSet.getLong(1));
}
I'm interested if the Oracle, SQL Server, postgresQL, or MySQL driver uses a separate round trip to fetch the identifier, or there is a single round trip which executes the insert and fetches the ResultSet automatically.
It depends on the database and driver.
Although you didn't ask for it, I will answer for Firebird ;). In Firebird/Jaybird the retrieval itself doesn't require extra roundtrips, but using Statement.RETURN_GENERATED_KEYS or the integer array version will require three extra roundtrips (prepare, execute, fetch) to determine the columns to request (I still need to build a form of caching for it). Using the version with a String array will not require extra roundtrips (I would love to have RETURNING * like in PostgreSQL...).
In PostgreSQL with PgJDBC there is no extra round-trip to fetch generated keys.
It sends a Parse/Describe/Bind/Execute message series followed by a Sync, then reads the results including the returned result-set. There's only one client/server round-trip required because the protocol pipelines requests.
However sometimes batches that can otherwise be streamed to the server may be broken up into smaller chunks or run one by on if generated keys are requested. To avoid this, use the String[] array form where you name the columns you want returned and name only columns of fixed-width data types like integer. This only matters for batches, and it's a due to a design problem in PgJDBC.
(I posted a patch to add batch pipelining support in libpq that doesn't have that limitation, it'll do one client/server round trip for arbitrary sized batches with arbitrary-sized results, including returning keys.)
MySQL receives the generated key(s) automatically in the OK packet of the protocol in response to executing a statement. There is no communication overhead when requesting generated keys.
In my opinion even for such a trivial thing a single approach working in all database systems will fail.
The only pragmatic solution is (in analogy to Hibernate) to find the best working solution for each target RDBMS (and
call it a dialect of your one for all solution:)
Here the information for Oracle
I'm using a sequence to generate key, same behavior is observed for IDENTITY column.
create table auto_pk
(id number,
pad varchar2(100));
This works and use only one roundtrip
def stmt = con.prepareStatement("insert into auto_pk values(auto_pk_seq.nextval, 'XXX')",
Statement.RETURN_GENERATED_KEYS)
def rowCount = stmt.executeUpdate()
def generatedKeys = stmt.getGeneratedKeys()
if (null != generatedKeys && generatedKeys.next()) {
def id = generatedKeys.getString(1);
But unfortunately you get ROWID as a result - not the generated key
How is it implemented internally? You can see it if you activate a 10046 trace (BTW this is also the best way to see
how many roundtrips were performed)
PARSING IN CURSOR
insert into auto_pk values(auto_pk_seq.nextval, 'XXX')
RETURNING ROWID INTO :1
END OF STMT
So you see the JDBC Standard 3.0 is implemented, but you don't get a requested result. Under the cover is used the
RETURNING clause.
The right approach to get the generated key in Oracle is therefore:
def stmt = con.prepareStatement("insert into auto_pk values(auto_pk_seq.nextval, 'XXX') returning id into ?")
stmt.registerReturnParameter(1, Types.INTEGER);
def rowCount = stmt.executeUpdate()
def generatedKeys = stmt.getReturnResultSet()
if (null != generatedKeys && generatedKeys.next()) {
def id = generatedKeys.getLong(1);
}
Note:
Oracle Release 12.1.0.2.0
To activate the 10046 trace use
con.createStatement().execute "alter session set events '10046 trace name context forever, level 12'"
con.createStatement().execute "ALTER SESSION SET tracefile_identifier = my_identifier"
Depending on frameworks or libraries to do things that are perfectly possible in plain SQL is bad design IMHO, especially when working against a defined DBMS. (The Statement.RETURN_GENERATED_KEYS is relatively innocuous, although it apparently does raise a question for you, but where frameworks are built on separate entities and doing all sorts of joins and filters in code or have custom-built transaction isolation logic things get inefficient and messy very quickly.)
Why not simply:
PreparedStatement ps= connection.prepareStatement(
"INSERT INTO post (title) VALUES (?) RETURNING id");
Single trip, defined result.

Calling MySQL stored procedure in ROR 4

There are few example out there but non of them are very clarified (or on old version).
I want to call MySQL procedure and check the return status (in rails 4.2). The most common method I saw is to call result = ActiveRecord::Base.connection.execute("call example_proc()"), but in some places people wrote there is prepared method result = ActiveRecord::Base.connection.execute_procedure("Stored Procedure Name", arg1, arg2) (however it didn't compiled).
So what is the correct way to call and get the status for MySQL procedure?
Edit:
And how to send parameters safly, where the first parameter is integer, second string and third boolean?
Rails 4 ActiveRecord::Base doesn't support execute_procedure method, though result = ActiveRecord::Base.connection still works. ie
result = ActiveRecord::Base.connection.execute("call example_proc('#{arg1}','#{arg2}')")
You can try Vishnu approach below
or
You can also try
ActiveRecord::Base.connections.exec_query("call example_proc('#{arg1}','#{arg2}')")
here is the document
In general, you should be able to call stored procedures in a regular where or select method for a given model:
YourModel.where("YOUR_PROC(?, ?)", var1, var2)
As for your comment "Bottom line I want the most correct approach with procedure validation afterwards (for warnings and errors)", I guess it always depends on what you actually want to implement and how readable you want your code to be.
For example, if you want to return rows of YourModel attributes, then it probably would be better if you use the above statement with where method. On the other hand, if you write some sql adapter then you might want to go down to the ActiveRecord::Base.connection.execute level.
BTW, there is something about stored proc performance that should be mentioned here. In several databases, database does stored proc optimization on the first run of the stored proc. However, the parameters that you pass to that first run might not be those that will be running on it more frequently later on. As a result, your stored-proc might be auto-optimized in a "none-optimal" way for your case. It may or may not happen this way, but it is something that you should consider while using stored procs with dynamic params.
I believe you have tried many other solutions and got some or other errors mostly "out of sync" or "closed connection" errors. These errors occur every SECOND time you try to execute the queries. We need to workaround like the connection is new every time to overcome this. Here is my solution that didn't throw any errors.
#checkout a connection for Model
conn = ModelName.connection_pool.checkout
#use the new connection to execute the query
#records = conn.execute("call proc_name('params')")
#checkout the connection
ModelName.connection_pool.checkin(conn)
The other approaches failed for me, possibly because ActiveRecord connections are automatically handled to checkout and checking for each thread. When our method tries to checkout a connection just to execute the SP, it might conflict since there will be an active connection just when the method started.
So the idea is to manually #checkout a connection for the model instead of for thread/function from the pool and #checkin once the work is done. This worked great for me.

Zend Framework and Mysql - very slow

I am creating a web site using php, mysql and zend framework.
When I try to run any sql query, page generation jumps to around 0.5 seconds. That's too high. If i turn of sql, page generation is 0.001.
The amount of queries I run, doesn't really affect the page generation time (1-10 queries tested). Stays at 0.5 seconds
I can't figure out, what I am doing wrong.
I connect to sql in bootstrap:
protected function _initDatabase ()
{
try
{
$config = new Zend_Config_Ini( APPLICATION_PATH . '/configs/application.ini', APPLICATION_ENV );
$db = Zend_Db::factory( $config -> database);
Zend_DB_Table_Abstract::setDefaultAdapter( $db );
}
catch ( Zend_Db_Exception $e )
{
}
}
Then I have a simple model
class StandardAccessory extends Zend_DB_Table_Abstract
{
/**
* The default table name
*/
protected $_name = 'standard_accessory';
protected $_primary = 'model';
protected $_sequence = false;
}
And finally, inside my index controller, I just run the find method.
require_once APPLICATION_PATH . '/models/StandardAccessory.php';
$sa = new StandardAccessory( );
$stndacc = $sa->find( 'abc' );
All this takes ~0.5 seconds, which is way too long. Any suggestions?
Thanks!
Tips:
Cache the table metadata. By default, Zend_Db_Table tries to discover metadata about the table each time your table object is instantiated. Use a cache to reduce the number of times it has to do this. Or else hard-code it in your Table class (note: db tables are not models).
Use EXPLAIN to analyze MySQL's optimization plan. Is it using an index effectively?
mysql> EXPLAIN SELECT * FROM standard_accessory WHERE model = 'abc';
Use BENCHMARK() to measure the speed of the query, not using PHP. The subquery must return a single column, so be sure to return a non-indexed column so the query has to touch the data instead of just returning an index entry.
mysql> SELECT BENCHMARK(1000,
(SELECT nonindexed_column FROM standard_accessory WHERE model = 'abc'));
Note that Zend_Db_Adapter lazy-loads its db connection when you make the first query. So if there's any slowness in connecting to the MySQL server, it'll happen as you instantiate the Table object (when it queries metadata). Any reason this could take a long time? DNS lookups, perhaps?
The easiest way to debug this, is to profile your sql queries. you can use Firephp (plugin for firebug) see http://framework.zend.com/manual/en/zend.db.profiler.html#zend.db.profiler.profilers.firebug
another way to speed up things a little is to cache the metadata of your tables.
see: http://framework.zend.com/manual/en/zend.db.table.html#zend.db.table.metadata.caching
Along with the above suggestions I did a very unscientific test and found that the PDO adapter was faster for me in my application (I know mysqli is supposed to be faster but maybe it's the ZF abstraction). I show the results here (the times shown are only good for comparison)