Airflow + SQLAlchemy: short-lived connections to the metadata DB

I deployed the latest Airflow on a CentOS 7.5 VM, pointed sql_alchemy_conn and result_backend at databases on a PostgreSQL instance, and set the executor to CeleryExecutor. Without enabling any DAG at all, and even with no Airflow scheduler started, I see roughly one connection established every 5 seconds and then disposed, just to run a SELECT 1 and a SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1.
The number of short-lived connections increases drastically once the scheduler is started and DAGs are turned on. Does anyone know the reason for this? Is this a heartbeat check or a task status check? With sql_alchemy_pool_enabled = True in airflow.cfg, shouldn't these connections be longer lived? Is there a log I can look at to pinpoint the source of these sub-second connections?
Config values used, for reference:
executor = CeleryExecutor
sql_alchemy_conn = postgres://..../db1
sql_alchemy_pool_enabled = True
sql_alchemy_pool_size = 5
sql_alchemy_max_overflow = 0
parallelism = 32
dag_concurrency = 16
max_active_runs_per_dag = 16
worker_concurrency = 16
broker_url = redis://...
result_backend = db+postgresql+psycopg2://.../db2
job_heartbeat_sec = 5
scheduler_heartbeat_sec = 5

Set AIRFLOW__CORE__SQL_ALCHEMY_POOL_PRE_PING to False.
Those queries are SQLAlchemy's pessimistic disconnect handling ("pre ping"): it checks the connection at the start of each connection pool checkout, typically with a simple statement like SELECT 1.
More information here: https://docs.sqlalchemy.org/en/13/core/pooling.html#disconnect-handling-pessimistic
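For illustration, here is a minimal sketch of the same pre-ping behaviour in plain SQLAlchemy (the connection URL is a placeholder; the pool sizes mirror the config above):

import sqlalchemy
from sqlalchemy import create_engine, text

# Placeholder URL; Airflow builds its engine from sql_alchemy_conn.
engine = create_engine(
    "postgresql+psycopg2://user:pass@host/db1",
    pool_pre_ping=True,  # emits a cheap ping such as SELECT 1 on every checkout
    pool_size=5,
    max_overflow=0,
)

with engine.connect() as conn:
    # The pre-ping already ran before this statement reached the server.
    print(conn.execute(text("SELECT 1")).scalar())

With pool_pre_ping=False the extra checkout-time queries disappear, at the cost of occasionally handing a stale connection to the application.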

Related

Tomcat 8.5 Connection Pool not reconnecting after DB failover

I have an application using the Tomcat 8.5 connection pool, Java 8, and a Multi-AZ AWS RDS MySQL database. In recent years we had a couple of database issues that led to failover. Whenever a failover occurred, the pool was always able to detect that the connection was closed (No operations allowed after connection closed) and reconnect correctly a minute later, once the backup node was up.
Some days ago we had a failover that didn't follow this rule. Because of a hardware database issue, the database was unavailable and a failover took place. Then, when the backup node came up a couple of minutes later, we could connect to the database correctly from our desktop MySQL client.
Yet even several minutes after the failover took place and connectivity to the database was restored, the application logs showed hundreds of exceptions like:
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: No operations allowed after connection closed
...
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
...
The last packet successfully received from the server was 20,017 milliseconds ago. The last packet sent successfully to the server was 20,016 milliseconds ago
...
Caused by: java.net.SocketTimeoutException: Read timed out
...
The application couldn't reconnect until we restarted the Tomcat servers.
Our pool is configured this way:
initialSize = 5
maxActive = 16
minIdle = 5
maxIdle = 8
maxWait = 10000
maxAge = 600000
timeBetweenEvictionRunsMillis = 5000
minEvictableIdleTimeMillis = 60000
validationQuery = "SELECT 1"
validationQueryTimeout = 3
validationInterval = 15000
testOnBorrow = true
testWhileIdle = true
testOnReturn = false
jdbcInterceptors = "ConnectionState;StatementCache(max=200)"
defaultTransactionIsolation = java.sql.Connection.TRANSACTION_READ_COMMITTED
And the JDBC connection URL has these parameters:
autoreconnect=true&socketTimeout=20000
As I understand it, the validationQuery should have failed and the connection been discarded, so a new, valid connection should have been created. Also, according to maxAge, after 10 minutes all connections should have been discarded and new ones created.
The pool couldn't recover even after 20 minutes. As said, we had to restart the Tomcat servers.
Is there any explanation why the pool has always recovered correctly from a failover, but in this case, it couldn't?
Try adding ENABLE=BROKEN to your connection string.
For example (this is Oracle thin-driver syntax):
jdbc:oracle:thin:@(DESCRIPTION=(ENABLE=BROKEN)(ADDRESS=(PROTOCOL=tcp)(PORT=)(HOST=))(CONNECT_DATA=(SID=)))
I ended up adding an AWS RDS Proxy, which resolves this issue.
I provoked DB failovers for an hour and everything worked fine, with outages of less than 20 seconds, all without modifying my application code; I only pointed it at the new proxy endpoint.

Inserting a large number of data rows into a Cloud SQL table via PyMySQL from a .csv

I am new to Cloud SQL and I am trying to insert records with 40+ columns and over 1.5 million rows. However, I am unable to do this in Google Cloud SQL. I have taken quite a number of measures, listed below, to resolve this issue, but the main error that I get is:
ERROR:
textPayload: "2019-04-12T06:10:47.348295Z 8554 [Note] Aborted connection 8554 to db: 'xxxxx_xxx' user: 'root' host: 'x.x.x.x' (Got an error reading communication packets)"
Summary:
I am using Python and PyMySQL to insert 1.5 million rows of data into a table of 35 columns.
The instance, DB, and table have already been created in Cloud SQL.
System configuration: vCPUs - 4, Memory - 15 GB, SSD storage - 10 GB.
I can load this data completely fine on my local system.
In Google Cloud SQL, the deployment time is extremely long and the deployment is successful.
But when I check my table, it is empty.
The MySQL error logs on the instance show the above.
I have tried the following actions:
Used an API URL / .txt / .json file for the upload; I am now using .csv.
Thinking it was a system issue, I upgraded the system from 8 GB to 15 GB of memory.
Thinking that default SQL configs were causing limitations, I added the following:
sql_mode : MAXDB,NO_AUTO_CREATE_USER
max_allowed_packet: 1073741824
net_read_timeout: 4294967295
wait_timeout: 31536000
Inserted a smaller number of rows; the maximum number of rows I was able to insert was 100.
import csv
import pymysql

def adddata():
    # Connect over the Cloud SQL unix socket.
    conn = pymysql.connect(unix_socket='/cloudsql/' + 'karto-235001:asia-east1:karto',
                           user='xxx', password='xxx', db='xxx')
    try:
        cur = conn.cursor()
        insert_ = "INSERT INTO data_table(a, b, c) VALUES (%s, %s, %s)"
        with open('info.csv', newline='') as myFile:
            reader = csv.reader(myFile)
            for item in reader:
                cur.execute(insert_, (item[3], item[4], item[5]))
        conn.commit()
        cur.close()
    finally:
        conn.close()
I have checked online and implemented the solutions recommended by Cloud SQL and other Stack Overflow users. Can anyone identify what I am doing wrong, or whether there are issues with my code or configuration? Thank you very much.
I see that you want to upload information contained in a CSV file using Python. Have you tried importing directly into the database? You can follow the steps in link [1].
In the meantime I'll try to replicate your case. You may also want to check that your installation and configuration are correct.
Verify your Cloud SQL instance and connection [2], and your Python installation [3].
[1] https://cloud.google.com/sql/docs/mysql/import-export/importing#csv
[2] https://cloud.google.com/sql/docs/mysql/connect-compute-engine
[3] https://cloud.google.com/python/setup
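If you do stay with PyMySQL, one common mitigation for row-at-a-time loads is batching the inserts with executemany and committing per chunk, which avoids one network round trip and one commit per row. A sketch (table and column names are taken from the question; the chunk size is illustrative):

import csv
import pymysql

CHUNK = 1000  # rows per batch; keep each batch comfortably under max_allowed_packet

def adddata_batched():
    conn = pymysql.connect(unix_socket='/cloudsql/' + 'karto-235001:asia-east1:karto',
                           user='xxx', password='xxx', db='xxx')
    try:
        cur = conn.cursor()
        insert_ = "INSERT INTO data_table(a, b, c) VALUES (%s, %s, %s)"
        batch = []
        with open('info.csv', newline='') as f:
            for item in csv.reader(f):
                batch.append((item[3], item[4], item[5]))
                if len(batch) >= CHUNK:
                    cur.executemany(insert_, batch)  # one multi-row INSERT per chunk
                    conn.commit()
                    batch = []
        if batch:  # flush the final partial chunk
            cur.executemany(insert_, batch)
            conn.commit()
        cur.close()
    finally:
        conn.close()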

How does MariaDB handle maintaining DB connections?

I am using the Peewee ORM to update and modify database tables from my Python project.
I have set the max connection limit to 15 using:
set global max_connections = 15
To get the total connection count, I run:
> select count(*) from information_schema.processlist;
> 12
Now, with the connection limit at 15, if I run my code to do some work on the DB by opening a connection, the number of connections goes up by 2:
> select count(*) from information_schema.processlist;
> 14
Even after I am done with the task and have closed the Python terminal, the process list count stays at 14; it seems like old connections are kept open or reused. If I run the same table-update command from a different terminal, it adds 2 more connections and then fails with a too-many-connections error, while the first terminal still works.
I can post the Peewee code if required.
If you are using the regular MySQLDatabase class, then calling .close() on the connection will close it.
If, on the other hand, you are using PooledMySQLDatabase, .close() will recycle the connection back into the pool of available connections. You can manage the connections in the pool using the APIs described here: http://docs.peewee-orm.com/en/latest/peewee/playhouse.html#PooledDatabase
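A minimal sketch of the pooled variant (the database name and credentials are placeholders):

from playhouse.pool import PooledMySQLDatabase

db = PooledMySQLDatabase(
    'my_db',            # placeholder database name
    max_connections=8,
    stale_timeout=300,  # idle connections older than this are recycled on checkout
    user='user', password='pass', host='localhost')

db.connect()
# ... run queries ...
db.close()  # returns the connection to the pool; it stays visible in the processlist

# To really close the current connection instead of recycling it:
# db.manual_close()

This recycling is why the processlist count stays high after .close(): the sockets are deliberately kept open for reuse.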

MySQL 5.7.10 issues with Ruby on OSX 10.11.3

I successfully installed MySQL 5.7.10 and the MySQL gem for Ruby on my OS X 10.11.3 system. I am now trying to run the following code:
require 'mysql'
require 'cgi'

class MysqlSaver
  def saveWordStats(globalWordStats, time)
    con = Mysql.new 'localhost', 'x', 'x', 'x'
    i = 0
    for word in globalWordStats.keys[0..10000]
      print "#{i}\r"
      i += 1
      stat = globalWordStats[word]
      time = time
      escaped_word = Mysql.escape_string(word)
      begin
        escaped_word = escaped_word.gsub("\\","")
        escaped_word = escaped_word.gsub("/","")
        escaped_word = escaped_word.gsub("-","")
        escaped_word = "#{escaped_word}_word"
        con.query("CREATE TABLE IF NOT EXISTS #{escaped_word}(percent DOUBLE, time INT)")
        con.query("INSERT INTO #{escaped_word}(percent,time) VALUES('#{stat}','#{time}')")
      rescue
        puts "#{$!}"
      end
    end
    con.close
    puts "DONE"
  end
end
This code works without any errors. I'm able to create tables and store values in my MySQL database. However, if I create/store >= ≈10,000 values in my database with this code, I am no longer able to connect to my MySQL server after the script finishes running:
mySQL.rb:5:in `new': Lost connection to MySQL server at 'reading initial communication packet', system error: 102 (Mysql::Error)
from /Users/david/Desktop/Birta2/mySQL.rb:5:in `saveWordStats'
from run.rb:84:in `<main>'
Also, a restart of the MySQL server doesn't help (only a restart of my entire Mac helps!).
After the error occurs I can find this strange line in the MySQL log file:
2016-02-11T18:20:51.177054Z 0 [Warning] File Descriptor 1098 exceedeed FD_SETSIZE=1024
Is there any way to fix this error?
FD_SETSIZE is the maximum number of file descriptors mysqld can have open at once. If you're using InnoDB, mysqld keeps one file open per table in the active database, so the limit is easy to exceed if you have a large number of tables (as your script creates, one per word) or a large number of connections. You can change some settings in my.cnf to address this.
table_open_cache is the number of tables MySQL will try to keep open at once:
table_open_cache = 1000
max_connections is the maximum number of simultaneous client connections to allow:
max_connections = 25
As a rule of thumb, keep table_open_cache * max_connections well below FD_SETSIZE; and if your database has N tables, a table_open_cache much larger than N buys you nothing.
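For example, a minimal my.cnf sketch with the two settings above (the numbers are illustrative and should be tuned to your workload):

[mysqld]
# Cap how many table descriptors MySQL keeps open at once.
table_open_cache = 1000
# Cap simultaneous client connections.
max_connections = 25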

Correct way to keep pooled connections alive (or time them out and get fresh ones) during longer inactivity for MySQL, Grails 2 app

I have a Grails app that has flurries of high activity but also periods of inactivity that can last from several hours to overnight. I notice that the first users in the morning get the following type of exception, and I believe this is because the connections in the pool go stale and the MySQL database closes them.
I've found conflicting information when Googling about whether using the Connector/J connection property autoReconnect=true is a good idea (and whether the client will still get an exception even if the connection is then restored), or whether to set other properties that will periodically evict or refresh idle connections, test on borrow, etc. Grails uses DBCP underneath. I currently have the simple config below, and am looking for an answer on how best to ensure that any connection grabbed from the pool after a long inactive period is valid and not closed.
dataSource {
    pooled = true
    dbCreate = "update"
    url = "jdbc:mysql://my.ip.address:3306/databasename"
    driverClassName = "com.mysql.jdbc.Driver"
    dialect = org.hibernate.dialect.MySQL5InnoDBDialect
    username = "****"
    password = "****"
    properties {
        //what should I add here?
    }
}
Exception
2012-06-20 08:40:55,150 [http-bio-8443-exec-1] ERROR transaction.JDBCTransaction - JDBC begin failed
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: The last packet successfully received from the server was 64,129,968 milliseconds ago. The last packet sent successfully to the server was 64,129,968 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1116)
at com.mysql.jdbc.MysqlIO.send(MysqlIO.java:3851)
...... Lots more .......
Caused by: java.sql.SQLException: Already closed.
at org.apache.commons.dbcp.PoolableConnection.close(PoolableConnection.java:114)
The easiest fix is to configure the connection pool with a validation query to run on a connection before it is handed to the application:
validationQuery="select 1 as dbcp_connection_test"
testOnBorrow=true
This same "connection validation" query can be run on other events. I'm not sure of the defaults for these:
testOnReturn=true
testWhileIdle=true
There are also configuration settings that limit the "age" of idle connections in the pool, which can be useful if idle connections are being closed at the server end.
minEvictableIdleTimeMillis
timeBetweenEvictionRunsMillis
http://commons.apache.org/dbcp/configuration.html
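Put together inside the Grails dataSource properties block from the question, this might look like the following sketch (the property names come from the settings above; the numeric values are illustrative, not tuned):

properties {
    // Validate a connection before handing it to the application.
    validationQuery = "select 1 as dbcp_connection_test"
    testOnBorrow = true
    testWhileIdle = true
    // Periodically evict connections idle longer than MySQL's wait_timeout.
    timeBetweenEvictionRunsMillis = 60000
    minEvictableIdleTimeMillis = 300000
}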
I don't know if it is the best way to handle database connections, but I had the same problems as you describe. I tried a lot of things and ended up using the c3p0 connection pool.
Using c3p0 you can force your app to refresh the database connection after a certain time.
Place the c3p0 jar into your lib folder and add your configuration to conf/spring/resources.groovy.
My resources.groovy looks like this:
import com.mchange.v2.c3p0.ComboPooledDataSource
import org.codehaus.groovy.grails.commons.ConfigurationHolder as CH

beans = {
    /**
     * c3p0 pooled data source that forces renewal of DB connections of a certain age
     * to prevent stale/closed DB connections, and evicts excess idle connections.
     * Still uses the JDBC configuration settings from DataSource.groovy
     * to keep easy environment-specific setup available.
     */
    dataSource(ComboPooledDataSource) { bean ->
        bean.destroyMethod = 'close'
        //use grails' datasource configuration for connection user, password, driver and JDBC url
        user = CH.config.dataSource.username
        password = CH.config.dataSource.password
        driverClass = CH.config.dataSource.driverClassName
        jdbcUrl = CH.config.dataSource.url
        //force connections to renew after 4 hours
        maxConnectionAge = 4 * 60 * 60
        //get rid of excess idle connections after 30 minutes
        maxIdleTimeExcessConnections = 30 * 60
    }
}
}