I have a Spark job which is loading data from CSV files into a MySQL database.
Everything works fine, but recently I noticed that Spark opens many connections during the insert stage (300+ connections). It looks like it opens a new connection for each insert statement, keeps it open, and at some point commits and closes the connection. Is there a way to commit after each insert, or to process the data in 10K batches and do one commit?
The goal is to avoid opening a connection for each insert. That is fine when processing 1K records, but with billions of records it consumes a lot of resources.
If any operation on the dataframe causes a shuffle, Spark by default creates 200 partitions, which results in 200 connections to the database.
spark.sql.shuffle.partitions: configures the number of partitions to use when shuffling data for joins or aggregations (default: 200).
Check the number of partitions of the dataframe using:
df.rdd.getNumPartitions
Repartition the dataframe on your most frequently used column:
df.repartition(NUMBER_OF_PARTITIONS, col("frequently_used_column"))
You can also set the 'batchsize' option to control the number of rows inserted per JDBC round trip. This helps performance with JDBC drivers. It defaults to 1000.
jdbcDF.write
.format("jdbc")
.option("url", "jdbc:postgresql:dbserver")
.option("dbtable", "schema.tablename")
.option("user", "username")
.option("password", "password")
.option("batchsize", 5000)
.save()
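For a MySQL target like the one in the question, a minimal PySpark sketch that combines the two levers above (fewer partitions, hence fewer connections, plus a larger batchsize) could look like the following; the input path, connection URL, credentials, and the choice of 10 partitions are placeholder assumptions:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("csv_to_mysql").getOrCreate()
# Read the input CSVs (path is a placeholder).
df = spark.read.csv("/path/to/input/*.csv", header=True, inferSchema=True)
# coalesce() caps the number of partitions, and therefore JDBC connections;
# batchsize controls how many rows go into each INSERT round trip.
(df.coalesce(10)
   .write
   .format("jdbc")
   .option("url", "jdbc:mysql://dbserver:3306/mydb?rewriteBatchedStatements=true")
   .option("driver", "com.mysql.jdbc.Driver")
   .option("dbtable", "schema.tablename")
   .option("user", "username")
   .option("password", "password")
   .option("batchsize", 10000)
   .mode("append")
   .save())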
I want to write to DB using pyspark as below
df.write.format("jdbc").option("url", jdbc_url).option("driver", "com.mysql.jdbc.Driver")\
.option("dbtable", table).option("user", user).option("password", password)\
.mode('append').save()
I am not sure how this works exactly, i.e. whether it is atomic in nature. Sometimes, even when there is a failure, I see some records inserted, and sometimes in case of errors none of the records is inserted. What is the default behaviour of this write?
Also, is there any option to write while ignoring DML errors such as column length issues or date format issues?
Spark executes the insert queries in parallel, in independent transactions; that is why you sometimes find some data after an error and sometimes none. It is also possible that the insertion order is not the same every time. Spark builds the queries based on the value of spark.sql.shuffle.partitions (default 200), or on whatever you passed to df.repartition(val). Based on that configuration and the available partitions, it executes the insert queries in parallel, with an independent transaction per partition. If any error is encountered, it stops the execution.
To answer your other question: you are executing queries on MySQL, so Spark has no control over them. So no, there is no way to ignore such DML errors. Spark executes the query just like any other external client would. It is the developer's responsibility to ensure the data has the expected length and format. You can always convert the data to the expected format before writing it to MySQL.
For more info, refer to this.
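As a rough illustration of that last suggestion, here is a minimal PySpark sketch that trims strings and normalizes dates before the JDBC write; the column names, the 255-character limit, and the date pattern are assumptions for the example, not details from the question:
from pyspark.sql import functions as F
# Hypothetical columns: trim 'name' to the VARCHAR length defined in MySQL
# and parse 'event_date' into a proper DATE before writing.
clean_df = (df
    .withColumn("name", F.substring(F.col("name"), 1, 255))
    .withColumn("event_date", F.to_date(F.col("event_date"), "yyyy-MM-dd")))
(clean_df.write.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", table)
    .option("user", user)
    .option("password", password)
    .mode("append")
    .save())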
I have large table in CloudSQL that needs to be updated every hour, and I'm considering Airflow as a potential solution. What is the best way to update a large amount of data in a CloudSQL database from Airflow?
The constraints are:
The table still needs to be readable while the job is running
The table needs to be writable in case one of the jobs runs over time and two jobs end up running at the same time
Some of the ideas I have:
Load the data that needs updating into a pandas DataFrame and run pd.to_sql
Load data into a csv in Cloud Storage and execute LOAD DATA LOCAL INFILE
Load data into memory, break it into chunks, and run a multi-threaded process in which each thread updates the table chunk by chunk, using a shared connection pool to prevent exhausting connection limits
My recent Airflow-related ETL project could be a reference for you.
Input DB: large (billion-row level Oracle)
Interim store: medium (tens-of-millions level HDF5 file)
Output DB: medium (tens-of-millions level MySQL)
As far as I have seen, writing to the DB is the main bottleneck for such an ETL process. So, as you can see:
For the interim stage, I use HDF5 as the interim store for data transformation. The pandas to_hdf function provides seconds-level performance for large data; in my case, writing 20 million rows to HDF5 took less than 3 minutes.
Below is the performance benchmarking for pandas IO; HDF5 is among the top three fastest and most popular formats. https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-perf
For the output stage, I use to_sql with the chunksize parameter. To speed up to_sql, you have to manually map the column types to the database column types and lengths, especially for string/varchar columns. Without manual mapping, to_sql falls back to a BLOB/TEXT type or VARCHAR(1000), and the default mode is about 10 times slower than the manually mapped mode.
In total, writing 20 million rows to the DB via to_sql (chunksize mode) takes about 20 minutes.
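A minimal sketch of that output stage with an explicit type mapping, assuming SQLAlchemy and PyMySQL are available; the table name, column names, and VARCHAR lengths are illustrative:
import pandas as pd
from sqlalchemy import create_engine, types
engine = create_engine("mysql+pymysql://user:password@host:3306/mydb")
df = pd.read_hdf("interim.h5", key="transformed")  # read back from the interim HDF5 store
# Explicit column-type mapping avoids the slow default TEXT/VARCHAR mapping.
dtype_map = {
    "id": types.BigInteger(),
    "name": types.VARCHAR(64),
    "amount": types.Numeric(12, 2),
    "created_at": types.DateTime(),
}
df.to_sql("output_table", engine, if_exists="append",
          index=False, chunksize=10000, dtype=dtype_map)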
If you like the answer, please vote it up.
One idea for your reference, based on PostgreSQL partitioned tables, though it requires some DDL to define the partitioned table.
Currently, your main constraints are:
the table need still be readable while the job is running
This means no locking is allowed.
the table need to be writable in case one of the job runs overtime and 2 jobs end up running at the same time
It should cope with multiple writers at the same time.
I would add one thing you may want to consider as well:
reasonable read performance while writing.
Performance and user experience are key.
A partitioned table can meet all of these requirements, and it is transparent to the client application.
At present you are doing ETL; you will soon face performance issues as the table grows quickly. A partitioned table is the only solution.
The main steps are (a minimal sketch follows the list):
Create a partitioned table with a partition list.
Normal reads and writes on the table run as usual.
ETL process (can run in parallel):
- Transform the data and load it into a new table (very slow, minutes to hours, but with no impact on the main table).
- Attach the new table to the main table's partition list (super fast, microsecond level, making the data visible in the main table).
Normal reads and writes on the main table continue as usual, now including the new data.
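A minimal sketch of that flow, assuming PostgreSQL 10+ and SQLAlchemy; the readings table, the batch_id partition key, and the connection string are illustrative assumptions:
from sqlalchemy import create_engine, text
engine = create_engine("postgresql+psycopg2://user:password@host:5432/mydb")
with engine.begin() as conn:
    # One-time setup: a list-partitioned main table (illustrative schema).
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS readings (
            id BIGINT,
            batch_id INT NOT NULL,
            payload TEXT
        ) PARTITION BY LIST (batch_id)
    """))

def load_new_batch(batch_id, rows):
    staging = f"readings_batch_{batch_id}"
    with engine.begin() as conn:
        # Slow part: build and fill the new table; no impact on the main table.
        conn.execute(text(f"CREATE TABLE {staging} (LIKE readings INCLUDING ALL)"))
        conn.execute(
            text(f"INSERT INTO {staging} (id, batch_id, payload) "
                 "VALUES (:id, :batch_id, :payload)"),
            rows,  # list of dicts
        )
        # Fast part: attach it as a partition; readers see the new data immediately.
        conn.execute(text(
            f"ALTER TABLE readings ATTACH PARTITION {staging} "
            f"FOR VALUES IN ({batch_id})"
        ))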
If you like the answer, please vote it up.
Best Regards,
WY
A crucial step to consider while setting up your workflow is to always use good connection management practices to minimize your application's footprint and reduce the likelihood of exceeding Cloud SQL connection limits. Database connections consume resources both on the server and in the connecting application.
Cloud Composer has no limitations when it comes to your ability to interface with CloudSQL. Therefore, either of the first 2 options is good.
A Python dependency is installable if it has no external dependencies and does not conflict with Composer’s dependencies. In addition, 14262433 explicitly explains the process of setting up a "Large data" workflow using Pandas.
LOAD DATA LOCAL INFILE requires you to use --local-infile for the mysql client. To import data into Cloud SQL, make sure to follow the best practices.
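As one hedged illustration of the connection-management point above, a SQLAlchemy engine with an explicitly capped pool keeps the number of Cloud SQL connections per worker bounded; the connection string and pool settings are assumptions:
from sqlalchemy import create_engine, text
# Cap the pool so each Airflow worker holds at most a handful of connections.
engine = create_engine(
    "mysql+pymysql://user:password@CLOUD_SQL_IP:3306/mydb",
    pool_size=5,          # steady-state connections kept open
    max_overflow=2,       # temporary extra connections under load
    pool_recycle=1800,    # recycle connections before server-side idle timeouts
    pool_pre_ping=True,   # transparently replace dead connections
)
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))  # connection is returned to the pool on exit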
I have a Spark process which does some calculations and then inserts the results into a MySQL table. All the calculation is done in 40-50 minutes, but the write into the table takes 2-3 hours (depending on DB usage). I tried setting batchsize:
val db_url_2 = "jdbc:mysql://name.amazonaws.com:port/db_name?rewriteBatchedStatements=true"
df_trsnss.write.format("jdbc").option("url", db_url_2).option("dbtable", output_table_name).option("user", db_user).option("password", db_pwd).option("truncate","true").option("batchsize", 5000).mode("overwrite").save()
but it is still taking forever to load. I can't afford to spend 2-4 hours a day just to calculate and write the data into the table.
Is there any way to speed up this process?
I'm starting to think about writing to CSV and then loading it into the DB from the CSV files, so I can at least reduce the EMR time.
Try something like this - right from the Databricks Guide, in fact:
JDBC writes
Spark’s partitions dictate the number of connections used to push data through the JDBC API. You can control the parallelism by calling coalesce() or repartition() depending on the existing number of partitions. Call coalesce when reducing the number of partitions, and repartition when increasing the number of partitions.
Try and see how this compares to your write approach and let us know.
import org.apache.spark.sql.SaveMode
val df = spark.table("diamonds")
println(df.rdd.partitions.length)
// Given the number of partitions above, you can reduce the partition value by calling coalesce() or increase it by calling repartition() to manage the number of connections.
df.repartition(10).write.mode(SaveMode.Append).jdbc(jdbcUrl, "diamonds", connectionProperties)
I am using 2 separate processes via multiprocessing in my application. Both have access to a MySQL database via sqlalchemy core (not the ORM). One process reads data from various sources and writes them to the database. The other process just reads the data from the database.
I have a query which gets the latest record from a table and displays the id. However, it always displays the first id that was created when I started the program rather than the latest inserted id (new rows are created every few seconds).
If I use a separate MySQL tool and run the query manually I get correct results, but SQLAlchemy is always giving me stale results.
Since you can see the changes your writer process is making with another MySQL tool that means your writer process is indeed committing the data (at least, if you are using InnoDB it does).
InnoDB shows you the state of the database as of when you started your transaction. Whatever other tools you are using probably have an autocommit feature turned on where a new transaction is implicitly started following each query.
To see the changes in SQLAlchemy do as zzzeek suggests and change your monitoring/reader process to begin a new transaction.
One technique I've used to do this myself is to add autocommit=True to the execution_options of my queries, e.g.:
result = conn.execute( select( [table] ).where( table.c.id == 123 ).execution_options( autocommit=True ) )
Assuming you're using InnoDB, the data on your connection will appear "stale" for as long as you keep the current transaction running, or until you commit the other transaction. In order for one process to see the data from the other process, two things need to happen: 1. the transaction that created the new data needs to be committed, and 2. the current transaction, assuming it has already read some of that data, needs to be rolled back or committed and started again. See The InnoDB Transaction Model and Locking.
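A minimal sketch of the reader side under that model, written against SQLAlchemy 1.4+ Core (the snippet above uses the older 1.x style); the events table and its id column are illustrative. The point is simply that each poll runs in a fresh connection and transaction instead of inside one long-lived snapshot:
import time
from sqlalchemy import create_engine, select, MetaData, Table
engine = create_engine("mysql+pymysql://user:password@localhost:3306/mydb")
metadata = MetaData()
events = Table("events", metadata, autoload_with=engine)  # reflect the existing table
while True:
    # A new connection/transaction per poll, so each read sees a fresh
    # InnoDB snapshot that includes the writer's committed rows.
    with engine.connect() as conn:
        row = conn.execute(
            select(events.c.id).order_by(events.c.id.desc()).limit(1)
        ).first()
    print("latest id:", row.id if row else None)
    time.sleep(5)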
I am consuming a high-rate data stream and doing the following steps to store the data in a MySQL database. For each newly arriving item:
(1) Parse incoming item.
(2) Execute several "INSERT ... ON DUPLICATE KEY UPDATE"
I have used INSERT ... ON DUPLICATE KEY UPDATE to eliminate one additional round-trip to the database.
While trying to improve the overall performance, I have considered doing bulk updates in the following way:
(1) Parse incoming item.
(2) Generate SQL statement with "INSERT ... ON DUPLICATE KEY UPDATE" and append to a file.
Periodically flush the SQL statements in the file to the database.
Two questions:
(1) will this have a positive impact in the database load?
(2) how should I flush the statements to the database so that indices are only reconstructed after the complete flush? (using transactions?)
UPDATE: I am using Perl DBI + MySQL MyISAM.
Thanks in advance for any comments.
If your data does not need to go into the database immediately you can cache your insert data somewhere, then issue one larger insert statement, e.g.
insert into table_name (x, y, z) values (x1, y1, z1), (x2, y2, z2), ... (xN, yN, zN) on duplicate key update ...;
To be clear, I would maintain a list of pending inserts, in this case a list of (x, y, z) triplets. Then once your list exceeds some threshold (N) you generate the insert statement and issue it.
I have no accurate timing figures for you, but this increased performance roughly 10 times when compared to inserting each row individually.
I also haven't played with the value of N, but I found 1000 to work nicely. I expect the optimal value is affected by hardware and database settings.
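The question mentions Perl DBI, but as a language-neutral sketch of this buffering approach, here is a minimal Python version using PyMySQL; stream, parse(), table_name, and the column names are placeholders for the asker's own pieces:
import pymysql
conn = pymysql.connect(host="localhost", user="user", password="password",
                       database="mydb", autocommit=True)
FLUSH_THRESHOLD = 1000   # roughly the N that worked well above
pending = []             # buffered (x, y, z) triplets
def flush(rows):
    if not rows:
        return
    # One multi-row INSERT ... ON DUPLICATE KEY UPDATE per flush.
    placeholders = ", ".join(["(%s, %s, %s)"] * len(rows))
    sql = (f"INSERT INTO table_name (x, y, z) VALUES {placeholders} "
           "ON DUPLICATE KEY UPDATE y = VALUES(y), z = VALUES(z)")
    flat = [v for row in rows for v in row]
    with conn.cursor() as cur:
        cur.execute(sql, flat)
    rows.clear()
for item in stream:              # 'stream' is the incoming data source (assumed)
    pending.append(parse(item))  # parse() is the asker's step (1), assumed here
    if len(pending) >= FLUSH_THRESHOLD:
        flush(pending)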
Hope this helps (I am also using MyIsam).
You don't say what kind of database access environment (PERL DBI? JDBC? ODBC?) you're running in, or what kind of table storage engine (MyISAM? InnoDB?) you're using.
First of all, you're right to pick INSERT ... ON DUPLICATE KEY UPDATE. Good move, unless you can guarantee unique keys.
Secondly, if your database access environment allows it, you should use prepared statements. You definitely won't get good performance if you write a bunch of statements into a file, and then make a database client read the file once again. Do the INSERT operations directly from the software package that consumes the incoming data stream.
Thirdly, pick the right kind of table storage engine. MyISAM inserts are going to be faster than InnoDB, so if you're logging data and retrieving it later that will be a win. But InnoDB has better transactional integrity. If you're really handling tonnage of data, and you don't need to read it very often, consider the ARCHIVE storage engine.
Finally, consider doing a START TRANSACTION at the beginning of a batch of INSERT ... commands, then doing a COMMIT and another START TRANSACTION after a fixed number of rows, like 100 or so. If you're using InnoDB, this will speed things up a lot. If you're using MyISAM or ARCHIVE, it won't matter.
Your big wins will come from the prepared statement stuff and the best choice of storage engine.
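Combining the prepared-statement and commit-batching advice above, here is a hedged Python sketch using mysql-connector-python (the question uses Perl DBI, where the same pattern applies); stream, parse(), and the table/column names are placeholders:
import mysql.connector
cnx = mysql.connector.connect(host="localhost", user="user",
                              password="password", database="mydb")
cur = cnx.cursor(prepared=True)   # server-side prepared statement, reused for every row
SQL = ("INSERT INTO table_name (x, y, z) VALUES (%s, %s, %s) "
       "ON DUPLICATE KEY UPDATE y = VALUES(y), z = VALUES(z)")
BATCH = 100   # commit every ~100 rows, as suggested above
count = 0
# autocommit is off by default, so each batch runs inside an implicit transaction.
for item in stream:                 # 'stream' and 'parse' stand in for the asker's code
    cur.execute(SQL, parse(item))   # parse() is assumed to return an (x, y, z) tuple
    count += 1
    if count % BATCH == 0:
        cnx.commit()                # matters for InnoDB; effectively a no-op for MyISAM
cnx.commit()                        # flush the final partial batch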