Kafka Connect - Failed to flush, timed out / Failed to commit offsets - mysql

I am getting the following error:
"ERROR WorkerSourceTask(id=test-mysql-dbc-source-0) Failed to flush, timed out while waiting for producer to flush outstanding N messages.
ERROR Failed to commit offsets. (org.apache.kafka.connect.runtime.WorkerSourceTask:XXX)"
The Setup:
I have an EC2 instance in AWS (t2.medium, 2 cores, 4 GB RAM) which serves as the Kafka server. Another EC2 instance hosts a sandbox MySQL database and Kafka Connect with the Confluent JDBC Source Connector. A Python script inserts a couple of rows into a table at random intervals to simulate some activity. On my laptop I opened a Kafka console consumer to read the messages from the Kafka server.
Ports 22, 3306, 9092, 2888 are open to all IP addresses on both EC2 instances.
Below are config files I used for Kafka Connect Source
Config files:
connect-standalone.properties
bootstrap.servers=FIRST_EC2_PUBLIC_IP:9092
key.converter.schemas.enable=true
value.converter.schemas.enable=true
offset.storage.file.filename=/tmp/connect.offsets
acks=1
compression.type=lz4
plugin.path=/usr/share/java
jdbc_source.properties
name=test-mysql-jdbc-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:mysql://localhost:3306/DB_NAME
connection.user=root
connection.password=DB_PASSWORD
table.whitelist=TEST_TABLE
mode=timestamp+incrementing
incrementing.column.name=ID_RECORD
timestamp.column.name=TIMESTMP
topic.prefix=mysql-source-
acks=1
compression.type=lz4
I have tried changing various settings and options. Mostly I experimented with the offset.flush.timeout.ms and buffer.memory options, as advised in this link.
The Behavior of Producer:
After starting the producer on my second EC2 instance, I can see the messages in the Kafka console consumer on my laptop, so it works, and new records keep appearing for some time. However, very often when a new row is created in the table, the producer does not push it to the Kafka server (the first EC2 instance) and instead throws the error above for about 5 to 20 minutes. The number of outstanding messages does not grow large; most of the time it is 2-6 messages. After throwing the error for 5-20 minutes, it finally sends the data, the console consumer consumes it, and everything works fine for some time until the error appears again.
If I manually stop the producer and start it again, the outstanding messages are flushed instantly and can be seen in the console consumer on my laptop.
Could you please point out where the problem might be and what could cause this behavior?

Related

AWS Time Out Problems with Elastic Beanstalk App with DB Access

When my Elastic Beanstalk environment (m5a.large Windows Server with a deployed .NET Core Web API) comes under heavy load, the status on the Health page for my EC2 instances turns red, and my requests and the health check time out. This happens around 1-3 minutes after a server starts receiving at least 10-20 requests per second.
I have to launch a lot of servers so that each one receives only 1-5 requests per second and does not turn red.
In my logs I saw the following Errors:
Exception=MySql.Data.MySqlClient.MySqlException (0x80004005): Unable to connect to any of the specified MySQL hosts.
---> MySql.Data.MySqlClient.MySqlException (0x80004005): Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
These errors led me to the topic of connection pooling, so I switched
using MySql.Data.MySqlClient;
to
using MySqlConnector;
Now these errors no longer come up, but the problem remains.
The monitoring features of EB and RDS do not show any obvious problems. Running queries against the database in MySQL Workbench is as fast as usual.
At the moment, my database calls from the server are synchronous and do not use the async features of MySqlConnector.
Can the m5a.large really not process more than 5 requests per second?
Kind Regards

AWS RDS automatically stopping soon after it is started

I have created an RDS instance on AWS which initially shows the status 'available', but when I use my SQL client to connect to it I receive the error:
Status : Failure -Test failed: IO Error: Connection reset by peer, Authentication lapse 0 ms
Then, when I check the status of the RDS instance online (AWS dashboard), it says 'stopping'.
When I try to start the RDS instance again, its status goes from 'starting' to 'stopping' after a couple of minutes and then eventually to 'stopped'. I can't find anything online about an RDS instance stopping automatically, and I am somewhat of a novice with AWS.
Based on the comments.
The solution was found by checking the CloudTrail event history. The search showed that StopDBInstance was issued by the HIPComplianceWorker user.
This probably means that there is an automation that checks newly launched DB instances and verifies that they comply with your company's policies. Your instance could be violating such a policy, in which case it was stopped automatically.
You would have to contact your admins to check with them what kind of RDS you can use.

SSIS Packages running on SQL Server Agent randomly cannot connect to Snowflake

For the past week, multiple SSIS packages running on SQL Server Agent that load data into Snowflake have started returning the following message randomly.
"Failed to acquire connection "snowflake". Connection may not be configured correctly or you may not have the right permissions on this connection."
We are seeing this message across multiple jobs, each of which loads multiple tables. It is not happening on every call to Snowflake within the projects, just on one or two tasks in jobs that have hundreds.
We are using the 2.20.2 drivers from Snowflake.
We ran the jobs while Wireshark was capturing network traffic and handed the captures to the network team. They didn't have much luck because the ACK messages were not being shown.
We also ran Process Monitor while the jobs ran and did not find anything that pointed to any issues.
We also dug through the logs from the Snowflake driver and found the calls right before and right after, but no messages for the task that failed. Since those logs bounce around between the files they write to, it's a bit hard to track sequential actions when multiple tasks in a job are running together.
We also installed SnowCD and ran it and it returned a full success message.
The user that runs the jobs on SQL Server Agent is an admin on the server and has sysadmin rights on the SQL Server instance.
The warehouse the drivers connect to is size Large with a maximum of 3 clusters (it was at 1 when the issue started, but we raised it to 3 to see if that helped).
Jobs are running on Windows Server 2016 Datacenter in Azure.
The SQL Server instance is SQL Server 2016 13.0.4604.0.
We cannot figure out why we are suddenly and randomly losing the connection to Snowflake.
Some ideas to help get these packages working:
Add a retry to the tasks that are failing. The task would move onto the next step only upon success:
https://www.mssqltips.com/sqlservertip/5625/how-to-retry-sql-server-integration-services-ssis-control-flow-tasks/
You can also combine the truncate and insert into one step using the INSERT OVERWRITE INTO command, which lets your package run quicker and leaves one less task that can fail:
https://docs.snowflake.net/manuals/sql-reference/sql/insert.html#insert-using-overwrite
Once the SSIS packages are consistently completing, you can analyze the logs at the point of failure to see if there is any pattern to help you identify the root cause.
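The retry idea above is configured inside SSIS itself (see the linked article), not written as code. Purely to illustrate the control flow it amounts to, here is a minimal Java sketch: the task is attempted up to a fixed number of times, with a pause between attempts, and execution only moves on once it succeeds. The class name RetryingTask, the method runWithRetry, and the attempt and delay values are hypothetical and are not part of SSIS or the Snowflake driver.

```java
import java.util.concurrent.Callable;

final class RetryingTask {

    /**
     * Runs the task until it succeeds or maxAttempts is reached,
     * sleeping delayMs between attempts. Rethrows the last failure.
     */
    public static <T> T runWithRetry(Callable<T> task, int maxAttempts, long delayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();       // success: the caller moves on to the next step
            } catch (Exception e) {
                last = e;                 // keep the failure so it can be rethrown or logged
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMs); // pause before the next attempt
                }
            }
        }
        throw last;                       // every attempt failed
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Hypothetical task that fails twice before succeeding.
        String result = runWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("Failed to acquire connection");
            }
            return "loaded";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Keeping the last exception means that a failure which survives every attempt is still reported, so the logs at the point of failure remain available for the root-cause analysis mentioned above.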

Google cloud_sql_proxy unable to connect to instance, stream error, protocol_error

I've been successfully using the Google cloud_sql_proxy on multiple Compute Engine instances for some time, until today, one instance at a time, the proxy started to show the following error pattern:
2017/05/30 13:28:07 New connection for "project-id-1234:us-central1:sql_instance"
2017/05/30 13:28:07 couldn't connect to "project-id-1234:us-central1:sql_instance": Post https://www.googleapis.com/sql/v1beta4/projects/project-id-1234/instances/sql_instance/createEphemeral?alt=json: stream error: stream ID 1; PROTOCOL_ERROR
2017/05/30 13:28:41 New connection for "project-id-1234:us-central1:sql_instance"
2017/05/30 13:28:41 Thottling refreshCfg(project-id-1234:us-central1:sql_instance): it was only called 33.490705951s ago
2017/05/30 13:28:41 couldn't connect to "project-id-1234:us-central1:sql_instance": Post https://www.googleapis.com/sql/v1beta4/projects/project-id-1234/instances/sql_instance/createEphemeral?alt=json: stream error: stream ID 1; PROTOCOL_ERROR
When trying to connect directly to MySQL (while using the proxy) I get error 2013 (HY000):
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 0 "Internal error/check (Not system error)"
What I've tried
Restarting the cloud_sql_proxy yielded a temporary fix until finally both my Compute Engine instances are unable to connect to my Cloud SQL instance and the proxies show only this result.
Restarting the Cloud SQL instance and both Compute Engine instances.
Eliminating the proxy: I added the appropriate networks to my SQL instance's Authorized Networks, and updated all applications to use the public IP. This restored functionality to my production apps, but now I'm using a public connection instead of local/proxy.
Some research
I came across a similar issue relating to Google Cloud SQL that yielded the same MySQL error above, but it appears to have only affected connecting to Cloud SQL from external, non GCE/GKE networks.
A few others have reported the same issue also started for them this morning on the Google Cloud SQL Discuss group.
My team started seeing the same issue appear today, with GKE managed servers. Same as you saw: restarts of servers and DB did nothing.
We tried doing an update of the version of Google Cloud Proxy we were using from v1.05 to v1.09 and the problem went away (for now).
I know that's not much of an explanation but give it a try to see if that helps you.

Configure GlassFish JDBC connection pool to handle Amazon RDS Multi-AZ failover

I have a Java EE application running in GlassFish on EC2, with a MySQL database on Amazon RDS.
I am trying to configure the JDBC connection pool in order to minimize downtime in case of a database failover.
My current configuration isn't working correctly during a Multi-AZ failover, as the standby database instance appears to be available in a couple of minutes (according to the AWS console) while my GlassFish instance remains stuck for a long time (about 15 minutes) before resuming work.
The connection pool is configured like this:
asadmin create-jdbc-connection-pool --restype javax.sql.ConnectionPoolDataSource \
--datasourceclassname com.mysql.jdbc.jdbc2.optional.MysqlConnectionPoolDataSource \
--isconnectvalidatereq=true --validateatmostonceperiod=60 --validationmethod=auto-commit \
--property user=$DBUSER:password=$DBPASS:databaseName=$DBNAME:serverName=$DBHOST:port=$DBPORT \
MyPool
If I use a Single-AZ db.m1.small instance and reboot the database from the console, GlassFish will invalidate the broken connections, throw some exceptions and then reconnect as soon as the database is available. In this setup I get less than 1 minute of downtime.
If I use a Multi-AZ db.m1.small instance and reboot with failover from the AWS console, I see no exception at all. The server halts completely, with all incoming requests timing out. After 15 minutes I finally get this:
Communication failure detected when attempting to perform read query outside of a transaction. Attempting to retry query. Error was: Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.3.2.v20111125-r10461): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 940,715 milliseconds ago. The last packet sent successfully to the server was 935,598 milliseconds ago.
It appears as if each HTTP thread gets blocked on an invalid connection without getting an exception and so there's no chance to perform connection validation.
Downtime in the Multi-AZ case is always between 15 and 16 minutes, so it looks like a timeout of some sort, but I was unable to change it.
Things I have tried without success:
connection leak timeout/reclaim
statement leak timeout/reclaim
statement timeout
using a different validation method
using MysqlDataSource instead of MysqlConnectionPoolDataSource
How can I set a timeout on stuck queries so that connections in the pool are reused, validated and replaced?
Or how can I let GlassFish detect a database failover?
As I commented before, this happens because the sockets that are open and connected to the database don't realize the connection has been lost, so they stay connected until the OS socket timeout is triggered, which, from what I have read, is usually about 30 minutes.
To solve the issue, you need to override the socket timeout in your JDBC connection string or in the JNDI connection configuration/properties by setting the socketTimeout parameter to a smaller value.
Keep in mind that any connection that runs longer than the defined value will be killed, even if it is in use (I haven't been able to confirm this; it is what I read).
The other two parameters I mention in my comment are connectTimeout and autoReconnect.
Here's my JDBC Connection String:
jdbc:(...)&connectTimeout=15000&socketTimeout=60000&autoReconnect=true
I also disabled Java's DNS cache by doing
java.security.Security.setProperty("networkaddress.cache.ttl" , "0");
java.security.Security.setProperty("networkaddress.cache.negative.ttl" , "0");
I do this because Java doesn't honor the DNS TTLs, and when the failover takes place, the DNS name stays the same but the IP address changes.
Since you are using an application server, the parameters to disable the DNS cache must be passed to the JVM when starting GlassFish with -Dnet and not set in the application itself.
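Putting the pieces of this answer together, here is a minimal sketch of how the connection string and the DNS cache settings could be assembled in plain Java. The class name FailoverJdbcConfig, its method names, and the host and database names are hypothetical placeholders. The parameter names and timeout values are the ones quoted above. This only builds the URL and sets the security properties; it does not open a connection.

```java
import java.security.Security;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class FailoverJdbcConfig {

    /** Builds a MySQL JDBC URL with explicit timeouts (both in milliseconds). */
    public static String buildUrl(String host, int port, String database,
                                  int connectTimeoutMs, int socketTimeoutMs) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("connectTimeout", String.valueOf(connectTimeoutMs));
        params.put("socketTimeout", String.valueOf(socketTimeoutMs));
        params.put("autoReconnect", "true");
        String query = params.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("&"));
        return "jdbc:mysql://" + host + ":" + port + "/" + database + "?" + query;
    }

    /** Sets the JVM-wide DNS cache TTLs to 0 so a changed IP is looked up again. */
    public static void disableDnsCache() {
        Security.setProperty("networkaddress.cache.ttl", "0");
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
    }

    public static void main(String[] args) {
        disableDnsCache();
        // Hypothetical host and database name; the timeouts match the answer above.
        System.out.println(buildUrl("mydb.example.com", 3306, "mydb", 15000, 60000));
    }
}
```

With a socketTimeout of 60000, a read blocked on a dead connection should fail after about a minute instead of waiting for the OS-level timeout, which gives the pool's connection validation a chance to run.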