Data connection - Parallel JDBC extracts failing with OutOfMemoryError - palantir-foundry

I'm trying to run a few JDBC extracts in parallel, but they fail with java.lang.OutOfMemoryError: Java heap space.
How does Data Connection memory usage work, and how do I resolve this problem?

The Data Connection Agent's memory usage here actually depends mostly on the value of the fetchSize parameter. Per the Oracle JDBC driver docs, fetchSize:
Gives the JDBC driver a hint as to the number of rows that should be fetched from the database when more rows are needed for this ResultSet object.
So, the agent's memory usage should roughly be:
number of JDBC extracts running in parallel x fetchSize x size of each row
Unfortunately, the default value of fetchSize varies vastly among different JDBC drivers. For instance, certain versions of the Hive JDBC driver have it set to 50, while other, newer versions have a default value of 1000. Oracle JDBC drivers have a default of 10. Postgres by default will try to get the entire ResultSet at once.
Hence, Data Connection allows you to configure the fetchSize value. This is configurable both per-source and per-extract.
OOM errors aside, tuning fetchSize can improve performance significantly in general. There isn't a one-size-fits-all solution, though, and you'll have to experiment to figure out the best parameter value for your extracts. It usually lies somewhere in the 500–5000 range.
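To make the arithmetic concrete, here is a back-of-the-envelope sketch in Python (every number in it is invented purely for illustration; substitute your own extract count, fetchSize, and row sizes):

# Rough, illustrative estimate of agent heap usage for parallel JDBC extracts.
# All values below are made up for the example.
parallel_extracts = 4          # JDBC extracts running at the same time
fetch_size = 5000              # rows buffered per fetch, per extract
avg_row_bytes = 2 * 1024       # assume ~2 KiB per row

estimated_heap_bytes = parallel_extracts * fetch_size * avg_row_bytes
print(f"~{estimated_heap_bytes / 1024 ** 2:.0f} MiB of heap for row buffers")
# ~39 MiB here; with wide rows (BLOBs, long text) or a larger fetchSize,
# the same arithmetic can easily exceed the agent's available heap.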

Related

Kernel Memory parameters for MySQL

I want to check whether the kernel memory parameters shmmax and shmall are required for a MySQL database, and what the pros and cons are of setting them.
In my case MySQL works fine without setting them, whereas they are mandatory for an Oracle database.
Thanks

Best way to optimize database performance in mysql (mariadb) with buffer size and persistence connection configurations

I have:
a heavily loaded CRUD application in PHP 7.3 which uses the CodeIgniter framework.
only 2 users accessing the application.
The DB is MariaDB 10.2 and has 10 tables. Columns generally store INT values and the default engine is InnoDB, but one table also stores a "mediumtext" column.
The application is driven by cron jobs (10 different jobs, each running every minute).
A job performs on average 100-200 CRUD operations against the DB (in total roughly 1k-2k CRUD operations per minute across the 10 tables).
Tested:
Persistent connections in MySQL
I ran into a "maximum connections exceeded" issue and noticed that CodeIgniter does not close connections when pconnect is set to true in database.php; in other words, it allows persistent connections if you set it to true. The solution I found was to set it to false so that all connections are closed automatically.
I changed my configuration to disallow persistent connections.
After disabling persistent connections, my app ran properly, but about an hour later it crashed again with the errors shown below, which I fixed by setting max_allow_package to its maximum value in my.cnf for MariaDB.
Warning --> Error while sending QUERY packet. PID=2434
Query error: MySQL server has gone away
I realized the DB needs tuning. The database size is 1GB+ and I have a lot of CRUD jobs scheduled every minute, so I changed the buffer size to 1GB and the InnoDB buffer pool size to 25% of it. I used MySQL Tuner to figure out those variables.
Despite that, I am still getting query packet errors:
Packets out of order. Expected 0 received 1. Packet size=23
My server has 8GB RAM (25% used) and 4 cores x 2GHz (10% used).
I can't decide which configuration is the best option for now. I can't increase the RAM; it is only 25% used at the moment because the key buffer size is 1GB, but bursts of jobs could use all of it.
Can I:
fix the DB errors, and
increase the average number of completed CRUD operations per minute?
8GB RAM --> innodb_buffer_pool_size = 5G.
200 qpm --> no problem. (200 qps might be a challenge).
10 tables; 2 users --> not an issue.
persistent connections --> frill; not required.
key_buffer_size = 1G? --> Why? You should not be using MyISAM. Change to 30M.
max_allow_package --> What's that? Perhaps a typo for max_allowed_packet? Don't set that to more than 1% of RAM.
Packets out of order --> sounds like a network glitch, not a database error.
MEDIUMINT --> one byte smaller than INT, so it is a small benefit when applicable.
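If it helps to compare your current settings against the suggestions above in one place, here is a minimal sketch that prints them (it assumes the PyMySQL package and placeholder credentials; adapt it to whichever client you actually use):

import pymysql

# Placeholder connection details -- replace with your own.
conn = pymysql.connect(host="localhost", user="app", password="secret", database="mydb")
try:
    with conn.cursor() as cur:
        # Variables discussed above; the suggested values from the answer are
        # innodb_buffer_pool_size ~ 5G, key_buffer_size ~ 30M, and
        # max_allowed_packet no more than ~1% of RAM.
        for name in ("innodb_buffer_pool_size", "key_buffer_size", "max_allowed_packet"):
            cur.execute("SHOW VARIABLES LIKE %s", (name,))
            variable, value = cur.fetchone()
            print(f"{variable} = {int(value) / 1024 ** 2:.0f} MiB")
finally:
    conn.close()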

Limit number of connection to MySQL database using JDBC driver in spark

I am currently importing data from a MySQL database into Spark with the JDBC driver, using the following command in pyspark:
dataframe_mysql = (
    sqlctx.read
    .format("jdbc")
    .option("url", "jdbc:mysql://<IP-ADDRESS>:3306/<DATABASE>")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "<TABLE>")
    .option("user", "<USER>")
    .option("password", "<PASSWORD>")
    .load()
)
When I run the spark job, I get the following error message:
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException (Too many connections).
It seems that since several nodes are attempting to connect concurrently to the database, I am exceeding MySQL's connection limit (151) and this is causing my job to run slower.
How can I limit the number of connections that the JDBC driver uses in pyspark? Any help would be great!
Try the numPartitions param. According to the documentation it is the maximum number of partitions that can be used for parallelism in table reading and writing, and it also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, Spark decreases it to this limit by calling coalesce(numPartitions) before writing.
I guess you should reduce the default partition size, or reduce the number of executors.
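As a sketch of what that looks like in practice (building on the reader from the question; numPartitions, partitionColumn, lowerBound and upperBound are the standard Spark JDBC options, and the id column is only an assumed example):

# Cap concurrent JDBC connections at 10 by limiting the number of read partitions.
# partitionColumn/lowerBound/upperBound tell Spark how to split the table; "id"
# is assumed to be a numeric, roughly evenly distributed column in <TABLE>.
dataframe_mysql = (
    sqlctx.read
    .format("jdbc")
    .option("url", "jdbc:mysql://<IP-ADDRESS>:3306/<DATABASE>")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "<TABLE>")
    .option("user", "<USER>")
    .option("password", "<PASSWORD>")
    .option("numPartitions", 10)
    .option("partitionColumn", "id")
    .option("lowerBound", 1)
    .option("upperBound", 1000000)
    .load()
)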

aws rds, lost connection to MySQL server during query, when importing large file

I am trying to import a 1.4G MySQL dump file into AWS RDS. I tried the 2 CPU / 4G memory option and still got the error: Lost connection to MySQL server during query. My question is: how do I import a large MySQL file into RDS?
MySQL Server and the MySQL client both have a parameter max_allowed_packet.
This is designed as a safety check to prevent the useless and disruptive allocation of massive amounts of memory that could occur if data corruption caused the receiving end of the connection to believe a packet¹ to be extremely large.
When transmitting queries and result sets, neither client nor server is allowed to send any single "thing" (usually a query or the value of a column) that is larger than max_allowed_packet -- the sending side will throw an error and refuse to send it if you try, and the receiving side will throw an error and then close the connection on you (so the client may or may not actually report the error thrown -- it may simply report that the connection was lost).
Unfortunately, the client setting and server setting for this same parameter are two independent settings, and they are uncoordinated. There is technically no requirement that they be the same, but discrepant values only work as long as neither of them ever exceeds the limit imposed by the other.
Worse, their defaults are actually different. In recent releases, the server defaults to 4 MiB, while the client defaults to 16 MiB.
Finding the server's value (SELECT @@MAX_ALLOWED_PACKET) and then setting the client to match the server (mysql --max-allowed-packet=max_size_in_bytes) will "fix" the mysterious Lost connection to MySQL server during query error message by causing the client to Do The Right Thing™ and not attempt to send a packet that the server won't accept. But you still get an error -- just a more informative one.
So, we need to reconfigure both sides to something more appropriate... but how do we know the right value?
You have to know your data. What's the largest possible value in any column? If that's a stretch (and in many cases, it is), you can simply start with a reasonably large value based on the longest line in a dump file.
Use this one-liner to find that:
$ perl -ne '$max = length($_) > $max ? length($_) : $max; END { print "$max\n" }' dumpfile.sql
The output will be the length, in bytes, of the longest line in your file.
You might want to round it up to the next power of two, or at least the next increment of 1024 (1024 is the granularity accepted by the server -- values are rounded) or whatever you're comfortable with, but this result should give you a value that should allow you to load your dump file without issue.
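If you'd rather do the measurement and the rounding in one step, a rough Python equivalent of the one-liner (the file name is just a placeholder) is:

# Find the longest line in the dump (in bytes) and round it up to the next
# multiple of 1024, which is the granularity the server accepts.
max_len = 0
with open("dumpfile.sql", "rb") as f:   # read as bytes: packet sizes are in bytes
    for line in f:
        max_len = max(max_len, len(line))

suggested = ((max_len + 1023) // 1024) * 1024
print(f"longest line: {max_len} bytes; suggested max_allowed_packet: {suggested}")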
Now that we've established a new value that should work, change max_allowed_packet on the server to the new value you've just discovered. In RDS, this is done in the parameter group. Be sure the value has been applied to your server (SELECT @@GLOBAL.MAX_ALLOWED_PACKET;).
Then, you'll need to pass the same value to your client program, e.g. mysql --max-allowed-packet=33554432, if this value is larger than the default client value. You can find the default client value with this:
$ mysql --help --verbose | grep '^max.allowed.packet'
max-allowed-packet 16777216
The client also allows you to specify the value in SI units, like --max-allowed-packet=32M for 32 MiB (33554432 bytes).
This parameter -- and the fact that there are two of them, one for the client and one for the server -- causes a lot of confusion and has led to the spread of some bad information: You'll find people on the Internet telling you to set it to ridiculous values like 1G (1073741824, which is the maximum value possible) but this is not a really good strategy since, as mentioned above, this is a protective mechanism. If a packet should happen to get corrupted on the network in just the wrong way, the server could conclude that it actually needs to allocate a substantial amount of memory just so that this packet can successfully be loaded into a buffer -- and this could lead to system impairment or a denial of service by starving the system for available memory.
The actual amount of memory the server normally allocates for reading packets from the wire is net_buffer_length. The size indicated in the packet isn't actually allocated unless it's larger than net_buffer_length.
¹ a packet refers to a layer 7 packet in the MySQL Client/Server Protocol sense. Not to be confused with an IP packet or datagram.
Your connection may time out if you are importing from your local computer or laptop, or from a machine that is not in the same region as the RDS instance.
Try to import from an EC2 instance that has access to this RDS. You will need to upload the file to S3, ssh into the EC2 instance, and run the import into RDS.

What are the max number of allowable parameters per database provider type?

There is a limit of 2,100 parameters which can be passed to a SQL Server query, e.g. via ADO.NET, but what are the documented limits for other common databases used by .NET developers? In particular I'm interested in:
Oracle 10g/11g
MySql
PostgreSql
Sqlite
Does anyone know?
Oracle: 64,000. Source
MySQL:
By default, there is no limit. The MySQL "text protocol" requires that the .NET client library substitute all parameters before sending the command text to the server; there is no server-side limit that can be enforced, and the client has no limit (other than available memory).
If using "prepared statements" by calling MySqlCommand.Prepare() (and specifying IgnorePrepare=false in the connection string), then there is a limit of 65,535 parameters (because num_params has to fit in two bytes).
PostgreSQL: EDIT: 34464 for a query and 100 for a function, as per Magnus Hagander's answer (answer copied here to provide a single point of reference).
SQLite: 999 (SQLITE_MAX_VARIABLE_NUMBER, which defaults to 999 but can be lowered at runtime). For functions the default is 100 parameters. See section 9 of the Run-time limits documentation.
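Whatever the exact limit for your database, the usual client-side workaround is the same: split a large parameter list into chunks that stay under it. A minimal sketch using Python's built-in sqlite3 module, with the 999 default mentioned above as the cap:

import sqlite3

MAX_PARAMS = 999  # SQLITE_MAX_VARIABLE_NUMBER default, per the answer above

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO items (id) VALUES (?)", [(i,) for i in range(5000)])

wanted = list(range(2500))          # more ids than the parameter limit allows
found = []
for start in range(0, len(wanted), MAX_PARAMS):
    chunk = wanted[start:start + MAX_PARAMS]
    placeholders = ",".join("?" * len(chunk))
    rows = conn.execute(f"SELECT id FROM items WHERE id IN ({placeholders})", chunk)
    found.extend(r[0] for r in rows)

print(len(found))  # 2500, collected across several chunked IN (...) queries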
In jOOQ, we've worked around these limitations by inlining bind values once we reach the relevant number per vendor. The numbers are documented here. Not all numbers are necessarily the correct ones according to vendor documentation, we've discovered them empirically by trial and error through JDBC. They are (without tying them to a specific version):
Ingres : 1024
Microsoft Access : 768
Oracle : 32767
PostgreSQL : 32767
SQLite : 999
SQL Server : 2100 (depending on the version)
Sybase ASE : 2000
Other databases do not seem to have any limitations - at least we've not discovered them yet (haven't been looking far beyond 100000, though).
The correct answer for PostgreSQL appears to be 34464, when talking about bound parameters to a query. The response 100 is still correct for number of parameters to a function.
The PostgreSQL wire protocol uses 16-bit integers for count of parameters in the bind message (https://www.postgresql.org/docs/current/protocol-message-formats.html).
Thus the PostgreSQL protocol doesn't allow more than 65535 parameters for a single statement. It is, however, OK to send a single ADO.NET command with two statements, each of which has 65535 parameters.
In my view, the MySQL question actually has two answers. The prepared statement protocol uses a signed two-byte short to describe the number of parameters that will be retrieved from the server. The client first calls COM_STMT_PREPARE, for which it receives a COM_STMT_PREPARE response if successful.
The documentation for the response states:
If num_params > 0 more packets will follow:
Parameter Definition Block
num_params * Protocol::ColumnDefinition
EOF_Packet
Given that num_params can only be a maximum of 2^16 (signed short), it would follow that this is the limit on parameters. As my company has a custom MySQL driver, we chose to follow this rule when implementing it, and an exception is thrown if the limit is exceeded.
However, COM_STMT_PREPARE does not actually return an error if you send more than this number of parameters. The value of num_params is actually just 2^16 and more parameters will follow afterwards. I'm not sure if this is a bug but the protocol documentation does not describe this behaviour.
So long as you have a way on your client-side to know the number of parameters (client_num_params if you will), you could implement your MySQL client in such a way that it expects to see client_num_params x Protocol::ColumnDefinition. You could also watch for EOF_Packet but that's only actually sent if CLIENT_DEPRECATE_EOF is not enabled.
It's also interesting to note that there's a reserved byte after num_params, indicating that the protocol designers probably wanted the option to make this a 24-bit number, allowing about 8.3 million parameters. This would also require an extra client capability flag.
To summarise:
The client/server protocol documentation seems to indicate that the maximum number of parameters could be 32768
The server doesn't seem to care if you send more but this doesn't appear to be documented and it might not be supported in future releases. I very much doubt this would happen though as this would break multiple drivers including Oracle's own ADO.NET Connector.
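As a small aside, the two-byte num_params field is easy to poke at directly. This sketch with Python's struct module only illustrates the field encoding (it is not a protocol implementation), showing why 65535 is the largest count the field itself can carry and where a 32767-ish ceiling comes from if the field is read as signed:

import struct

# num_params is serialized as a two-byte little-endian integer in the
# COM_STMT_PREPARE response, so it cannot represent more than 65535.
print(struct.pack("<H", 65535))   # b'\xff\xff' -- the largest encodable count
print(struct.pack("<h", 32767))   # b'\xff\x7f' -- the largest value if read as signed
try:
    struct.pack("<H", 65536)      # one more than the field can hold
except struct.error as e:
    print("cannot encode 65536 in two bytes:", e)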