I am currently importing data from a MySQL database into Spark using the JDBC driver, with the following command in pyspark:
dataframe_mysql = (sqlctx
    .read
    .format("jdbc")
    .option("url", "jdbc:mysql://<IP-ADDRESS>:3306/<DATABASE>")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "<TABLE>")
    .option("user", "<USER>")
    .option("password", "<PASSWORD>")
    .load())
When I run the spark job, I get the following error message:
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException (Too many connections).
It seems that since several nodes are attempting to connect concurrently to the database, I am exceeding MySQL's connection limit (151) and this is causing my job to run slower.
How can I limit the number of connections that the JDBC driver uses in pyspark? Any help would be great!
Try using the numPartitions parameter. According to the documentation, it is the maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, Spark decreases it to this limit by calling coalesce(numPartitions) before writing.
I guess you should reduce the default number of partitions, or reduce the number of executors.
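As a rough sketch (not from the original question) of how numPartitions caps the connection count on a read: in PySpark it must be paired with a partition column and bounds, and everything in angle brackets below is a placeholder.
# Hypothetical sketch: at most 10 concurrent JDBC connections for this read.
# <IP-ADDRESS>, <DATABASE>, <TABLE>, <NUMERIC-ID-COLUMN> and the bounds are placeholders.
dataframe_mysql = (sqlctx
    .read
    .format("jdbc")
    .option("url", "jdbc:mysql://<IP-ADDRESS>:3306/<DATABASE>")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "<TABLE>")
    .option("user", "<USER>")
    .option("password", "<PASSWORD>")
    .option("partitionColumn", "<NUMERIC-ID-COLUMN>")  # column used to split the read
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "10")  # upper bound on concurrent JDBC connections
    .load())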
Related
I'm trying to run a few JDBC extracts in parallel, but this fails on: java.lang.OutOfMemoryError: Java heap space.
How does Data Connection memory usage work, and how do I resolve this problem?
The Data Connection Agent's memory usage here actually depends mostly on the value of the fetchSize parameter. Per the Oracle JDBC driver docs, fetchSize:
Gives the JDBC driver a hint as to the number of rows that should be fetched from the database when more rows are needed for this ResultSet object.
So, the agent's memory usage should roughly be:
number of JDBC extracts running in parallel x fetchSize x size of each row
Unfortunately, the default value of fetchSize varies vastly among different JDBC drivers. For instance, certain versions of the Hive JDBC driver have it set to 50, while other, newer versions have a default value of 1000. Oracle JDBC drivers have a default of 10. Postgres by default will try to get the entire ResultSet at once.
Hence, Data Connection allows you to configure the fetchSize value. This is configurable both per-source and per-extract.
OOM errors aside, tuning fetchSize can improve performance significantly in general. There isn't a one-size-fits-all solution, though, and you'll have to experiment to figure out the best parameter value for your extracts. It usually lies somewhere in the 500–5000 range.
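This is not the Data Connection configuration itself, but as a point of comparison, Spark's JDBC reader exposes the same knob through its fetchsize option; the sketch below uses placeholder connection details.
# Illustrative sketch only; <HOST>, <SERVICE>, <SCHEMA>.<TABLE>, <USER>, <PASSWORD> are placeholders.
df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//<HOST>:1521/<SERVICE>")
    .option("dbtable", "<SCHEMA>.<TABLE>")
    .option("user", "<USER>")
    .option("password", "<PASSWORD>")
    .option("fetchsize", "2000")  # rows fetched per round trip; tune per extract
    .load())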
I have a Spark job which is loading data from CSV files into a MySQL database.
Everything works fine, but recently I noticed that Spark opens many connections during the insert stage (300+ connections). It feels like it opens a new connection for each insert statement, keeps it open, and at some point commits and closes the connection. Is there a way to commit after each insert, or to process records in 10K batches and do one commit?
The goal is to not open a connection for each insert. That is fine when processing 1K records, but with billions of records it takes a lot of resources.
If you have any operation on the DataFrame that causes a shuffle, Spark by default creates 200 partitions, which results in 200 connections to the database.
spark.sql.shuffle.partitions -- Configures the number of partitions to use when shuffling data for joins or aggregations. -- default: 200
Check the number of partitions of the dataframe using:
df.rdd.getNumPartitions
Repartition the DataFrame on your frequently used column:
df.repartition(NUMBER_OF_PARTITIONS, col("Frequent_used_column"))
You can also set the 'batchsize' parameter to control the number of rows to insert per round trip. This helps performance with JDBC drivers. It defaults to 1000.
jdbcDF.write
.format("jdbc")
.option("url", "jdbc:postgresql:dbserver")
.option("dbtable", "schema.tablename")
.option("user", "username")
.option("password", "password")
.option("batchsize", 5000)
.save()
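For completeness, here is a hedged PySpark sketch (placeholder URL and credentials, not the original job) that combines the two ideas: repartitioning before the write bounds the number of concurrent connections, and batchsize groups rows per round trip.
# Sketch only: 10 partitions -> at most 10 concurrent write connections;
# batchsize groups 10000 rows per round trip. All connection details are placeholders.
(df.repartition(10)
    .write
    .format("jdbc")
    .option("url", "jdbc:mysql://<HOST>:3306/<DATABASE>")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "<TABLE>")
    .option("user", "<USER>")
    .option("password", "<PASSWORD>")
    .option("batchsize", "10000")
    .mode("append")
    .save())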
I'd like to point out in advance that several related questions, like the following, DO NOT address my problem:
Spark query running very slow
Converting mysql table to dataset is very slow...
Spark Will Not Load Large MySql Table
Spark MySQL Error while Reading from Database
This one comes close but the stack-trace is different and it is unresolved anyway. So rest assured that I'm posting this question after several days of (failed) solution-hunting.
I'm trying to write a job that moves data (once a day) from MySQL tables to Hive tables stored as Parquet / ORC files on Amazon S3. Some of the tables are quite big: ~ 300M records with 200 GB+ size (as reported by phpMyAdmin).
Currently we are using sqoop for this but we want to move to Spark for the following reasons:
To leverage its capabilities with the DataFrame API (in the future, we would be performing transformations while moving data)
We already have a sizeable framework written in Scala for Spark jobs used elsewhere in the organization
I've been able to achieve this on small MySQL tables without any issue. But the Spark job (that reads data from MySQL into DataFrame) fails if I try to fetch more than ~1.5-2M records at a time. I've shown the relevant portions of stack-trace below, you can find the complete stack-trace here.
...
javax.servlet.ServletException: java.util.NoSuchElementException: None.get
at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489)
at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
...
Caused by: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
...
org.apache.spark.status.api.v1.OneStageResource.taskSummary(OneStageResource.scala:62)
at sun.reflect.GeneratedMethodAccessor188.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
...
[Stage 27:> (0 + 30) / 32]18/03/01 01:29:09 WARN TaskSetManager: Lost task 3.0 in stage 27.0 (TID 92, ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal, executor 6): java.sql.SQLException: Incorrect key file for table '/rdsdbdata/tmp/#sql_14ae_5.MYI'; try to repair it
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:964)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3973)
...
** This stack-trace was obtained upon failure of moving a 148 GB table containing 186M records
As apparent from the (full) stack-trace, the Spark read job starts sulking with spurious None.get warnings, followed by SQLException: Incorrect key file for table... (which is related to MySQL's tmp table becoming full)
Now clearly this can't be a MySQL problem because in that case sqoop should fail as well. As far as Spark is concerned, I'm parallelizing the read operation by setting numPartitions = 32 (we use parallelism of 40 with sqoop).
From my limited knowledge of Spark and Big Data, 148 GB shouldn't be overwhelming for Spark by any measure. Moreover, since MySQL, Spark (EMR) and S3 all reside in the same region (AWS AP-SouthEast), latency shouldn't be the bottleneck.
My questions are:
Is Spark a suitable tool for this?
Could Spark's JDBC driver be blamed for this issue?
If the answer to the above question is
Yes: How can I overcome it? (alternate driver, or some other workaround)?
No: What could be the possible cause?
Framework Configurations:
Hadoop distribution: Amazon 2.8.3
Spark 2.2.1
Hive 2.3.2
Scala 2.11.11
EMR Configurations:
EMR 5.12.0
1 Master: r3.xlarge [8 vCore, 30.5 GiB memory, 80 GB SSD storage, EBS storage: 32 GiB]
1 Task: r3.xlarge [8 vCore, 30.5 GiB memory, 80 GB SSD storage, EBS storage: none]
1 Core: r3.xlarge [8 vCore, 30.5 GiB memory, 80 GB SSD storage, EBS storage: 32 GiB]
** These are the configurations of development cluster; production cluster would be better equipped
The Spark JDBC API seems to load all the data from the MySQL table into memory without partitioning it. So when you try to load a big table, what you should do is use the Spark API to clone the data to HDFS first (JSON should be used to keep the schema structure), like this:
spark.read.jdbc(jdbcUrl, tableName, prop)
.write()
.json("/fileName.json");
Then you can work with the data on HDFS normally instead:
spark.read().json("/fileName.json")
.createOrReplaceTempView(tableName);
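A PySpark rendition of the same staged approach, as a sketch: the partitioning options are an assumption added so the JDBC read is split across tasks rather than fetched as a single partition, and the column name, bounds, and credentials are placeholders.
# Sketch only; <NUMERIC-ID-COLUMN>, the bounds, and the credentials are placeholders.
df = (spark.read
    .format("jdbc")
    .option("url", jdbcUrl)
    .option("dbtable", tableName)
    .option("user", "<USER>")
    .option("password", "<PASSWORD>")
    .option("partitionColumn", "<NUMERIC-ID-COLUMN>")
    .option("lowerBound", "1")
    .option("upperBound", "186000000")
    .option("numPartitions", "32")
    .load())

# Stage the data to HDFS as JSON, then work from the staged copy.
df.write.mode("overwrite").json("/fileName.json")
spark.read.json("/fileName.json").createOrReplaceTempView(tableName)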
There is an option named "unbuffered" for the mysql client, and the MySQL manual has only a very simple line about it: "Flush the buffer after each query."
My question is: what is it actually used for?
I tried to read the MySQL source code, and it may mean "flush the mysql client's log/output buffer after each query", but I'm not sure.
Thanks.
The default behavior for the db is to buffer the query result before outputting any info. If you run an unbuffered query, you are asking MySQL to start the output as soon as possible. Theoretically this stores only one row in memory at a time, so you can stream huge tables without running out of memory.
The downside is that you cannot run two unbuffered queries at the same time. Whereas buffered queries will just enqueue the second one, unbuffered statements will throw an error.
Another downside is that you don't know how many rows are left until you finish iterating through the result.
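The answer describes unbuffered (streaming) result sets in general rather than the mysql command-line flag itself; as a hedged illustration of the same idea in Python, pymysql's SSCursor streams rows from the server as you iterate instead of buffering the whole result set (connection details and table name are placeholders).
# Streaming/unbuffered result set via pymysql's SSCursor; placeholders throughout.
import pymysql
import pymysql.cursors

conn = pymysql.connect(host="<HOST>", user="<USER>", password="<PASSWORD>",
                       database="<DATABASE>",
                       cursorclass=pymysql.cursors.SSCursor)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM <BIG_TABLE>")
        for row in cur:   # rows arrive one at a time; memory stays roughly constant
            pass          # process each row here
finally:
    conn.close()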
I am accessing a large indexed text dataset using sphinxse via MySQL. The size of resultset is on the order of gigabytes. However, I have noticed that MySQL stops the query with following error whenever the dataset is larger than 16MB:
1430 (HY000): There was a problem processing the query on the foreign data source. Data source error: bad searchd response length (length=16777523)
length shows the length of the resultset that offended MySQL. I have tried the same query with Sphinx's standalone search program and it works fine. I have tried all possible variables in both MySQL and Sphinx, but nothing has helped.
I am using Sphinx 0.9.9 rc-2 and MySQL 5.1.46.
Thanks
I finally solved the problem. It turns out that the Sphinx plugin for MySQL (SphinxSE) hard-codes a 16 MB limit on the resultset in its source code (bad bad bad source-code). I changed SPHINXSE_MAX_ALLOC to 1*1024*1024*1024 in the file ha_sphinx.cc, and everything works fine now.
you probably need to increase max_allowed_packet from its default value of 16M:
From MySQL's documentation:
Both the client and the server have their own max_allowed_packet variable, so if you want to handle big packets, you must increase this variable both in the client and in the server.
If you are using the mysql client program, its default max_allowed_packet variable is 16MB. To set a larger value, start mysql like this:
shell> mysql --max_allowed_packet=32M
That sets the packet size to 32MB.
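The quoted documentation also says the server-side variable must be raised; as an illustration (not part of the original answer), the server limit can be increased at runtime with sufficient privileges, or persistently under [mysqld] in my.cnf:
mysql> SET GLOBAL max_allowed_packet = 32*1024*1024;
New client sessions will then pick up the larger server-side limit.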