Spark: Reading big MySQL table into DataFrame fails

I'd like to say up front that several related questions, like the following, DO NOT address my problem:
Spark query running very slow
Converting mysql table to dataset is very slow...
Spark Will Not Load Large MySql Table
Spark MySQL Error while Reading from Database
This one comes close, but the stack trace is different and it is unresolved anyway. So rest assured that I'm posting this question after several days of (failed) solution-hunting.
I'm trying to write a job that moves data (once a day) from MySQL tables to Hive tables stored as Parquet / ORC files on Amazon S3. Some of the tables are quite big: ~ 300M records with 200 GB+ size (as reported by phpMyAdmin).
Currently we are using sqoop for this but we want to move to Spark for the following reasons:
To leverage its capabilities with the DataFrame API (in the future, we will be performing transformations while moving data)
We already have a sizeable framework written in Scala for Spark jobs used elsewhere in the organization
I've been able to achieve this on small MySQL tables without any issue. But the Spark job (that reads data from MySQL into a DataFrame) fails if I try to fetch more than ~1.5-2M records at a time. I've shown the relevant portions of the stack trace below; you can find the complete stack-trace here.
...
javax.servlet.ServletException: java.util.NoSuchElementException: None.get
at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489)
at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
...
Caused by: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
...
org.apache.spark.status.api.v1.OneStageResource.taskSummary(OneStageResource.scala:62)
at sun.reflect.GeneratedMethodAccessor188.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
...
[Stage 27:> (0 + 30) / 32]18/03/01 01:29:09 WARN TaskSetManager: Lost task 3.0 in stage 27.0 (TID 92, ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal, executor 6): java.sql.SQLException: Incorrect key file for table '/rdsdbdata/tmp/#sql_14ae_5.MYI'; try to repair it
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:964)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3973)
...
** This stack trace was obtained upon the failure of moving a 148 GB table containing 186M records
As is apparent from the (full) stack trace, the Spark read job starts sulking with false warnings of the None.get error, followed by SQLException: Incorrect key file for table... (which is related to MySQL's tmp table becoming full).
Now clearly this can't be a MySQL problem, because in that case Sqoop would fail as well. As far as Spark is concerned, I'm parallelizing the read operation by setting numPartitions = 32 (we use a parallelism of 40 with Sqoop).
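For reference, a partitioned JDBC read of this kind looks roughly like the sketch below (shown in PySpark for brevity, although our framework is Scala; the URL, table name, partition column, and bounds are placeholders, not the actual job's values):
# Hedged sketch of a partitioned JDBC read; assumes an existing SparkSession
# named `spark`. Connection details, partition column, and bounds are placeholders.
df = (spark.read
      .format("jdbc")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("url", "jdbc:mysql://mysql-host:3306/mydb")   # placeholder host/schema
      .option("dbtable", "big_table")                       # placeholder table
      .option("user", "user")
      .option("password", "password")
      .option("partitionColumn", "id")                      # assumed numeric primary key
      .option("lowerBound", 1)
      .option("upperBound", 186000000)                      # roughly the row count
      .option("numPartitions", 32)
      .load())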
From my limited knowledge of Spark and Big Data, 148 GB shouldn't be overwhelming for Spark by any measure. Moreover, since MySQL, Spark (EMR) and S3 all reside in the same region (AWS ap-southeast-1), latency shouldn't be the bottleneck.
My questions are:
Is Spark a suitable tool for this?
Could Spark's JDBC driver be to blame for this issue?
If the answer to the above question is:
Yes: How can I overcome it (an alternate driver, or some other workaround)?
No: What could be the possible cause?
Framework Configurations:
Hadoop distribution: Amazon 2.8.3
Spark 2.2.1
Hive 2.3.2
Scala 2.11.11
EMR Configurations:
EMR 5.12.0
1 Master: r3.xlarge [8 vCores, 30.5 GiB memory, 80 GB SSD storage, EBS storage: 32 GiB]
1 Task: r3.xlarge [8 vCores, 30.5 GiB memory, 80 GB SSD storage, EBS storage: none]
1 Core: r3.xlarge [8 vCores, 30.5 GiB memory, 80 GB SSD storage, EBS storage: 32 GiB]
** These are the configurations of the development cluster; the production cluster would be better equipped

The Spark JDBC API seems to load all the data from the MySQL table into memory without partitioning it. So when you try to load a big table, what you should do is use the Spark API to clone the data to HDFS first (JSON should be used to keep the schema structure), like this:
spark.read.jdbc(jdbcUrl, tableName, prop)
  .write
  .json("/fileName.json")
Then you can work on the HDFS copy as usual:
spark.read.json("/fileName.json")
  .createOrReplaceTempView(tableName)
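Since the stated goal is Parquet/ORC on S3, the staged JSON copy can then be rewritten in columnar form. A minimal sketch (PySpark syntax for brevity; the bucket and paths are placeholders):
# Read the staged JSON back and persist it as Parquet on S3.
# Assumes an existing SparkSession named `spark`; paths are placeholders.
staged = spark.read.json("hdfs:///fileName.json")
staged.write.mode("overwrite").parquet("s3://my-bucket/warehouse/my_table/")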

Related

Update large amount of data in SQL database via Airflow

I have a large table in Cloud SQL that needs to be updated every hour, and I'm considering Airflow as a potential solution. What is the best way to update a large amount of data in a Cloud SQL database from Airflow?
The constraints are:
The table needs to remain readable while the job is running
The table needs to be writable in case one of the jobs runs over time and two jobs end up running at the same time
Some of the ideas I have:
Load the data that needs updating into a pandas DataFrame and run pd.to_sql
Load the data into a CSV in Cloud Storage and execute LOAD DATA LOCAL INFILE
Load the data in memory, break it into chunks, and run a multi-threaded process where each thread updates the table chunk by chunk, using a shared connection pool to prevent exhausting connection limits
My recent Airflow-related ETL project could be a reference for you.
Input DB: large DB (billion-row-level Oracle)
Interim DB: medium DB (tens-of-millions-level HDF5 files)
Output DB: medium DB (tens-of-millions-level MySQL)
From what I have encountered, writing to the DB is the main bottleneck in such an ETL process, so:
For the interim stage, I use HDF5 as the interim store for data transformation. The pandas to_hdf function provides seconds-level performance on large data; in my case, writing 20 million rows to HDF5 takes less than 3 minutes.
Below are the performance benchmarks for pandas IO; HDF5 is among the top three fastest and most popular formats: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-perf
For the output stage, I use to_sql with the chunksize parameter. To speed up to_sql, you have to manually map the column types to the database column types and lengths, especially for string/varchar columns; otherwise to_sql maps them to a blob-like type or varchar(1000), and that default mode is about 10 times slower than manual mapping.
In total, writing 20 million rows to the DB via to_sql (chunked mode) takes about 20 minutes.
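A minimal pandas sketch of the two stages described above (the input file, table name, columns, and connection string are illustrative assumptions, not the actual project code):
import pandas as pd
from sqlalchemy import create_engine, types

# Interim stage: stage the transformed data as HDF5 (fast local writes;
# requires the PyTables package).
df = pd.read_csv("extract.csv")                  # placeholder input
df.to_hdf("interim.h5", key="stage", mode="w")

# Output stage: write to MySQL in chunks, with explicit column types
# instead of letting to_sql guess (TEXT / varchar(1000)), which is slower.
engine = create_engine("mysql+pymysql://user:pass@host/db")   # placeholder DSN
df = pd.read_hdf("interim.h5", key="stage")
df.to_sql(
    "target_table",                              # placeholder table
    engine,
    if_exists="append",
    index=False,
    chunksize=10_000,
    dtype={"name": types.VARCHAR(64), "note": types.VARCHAR(255)},  # assumed columns
)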
One clue for your reference, based on PostgreSQL partitioned tables, though it needs some DDL to define the partitioned table.
Currently, your main constraints are:
the table needs to remain readable while the job is running
This means no locking is allowed.
the table needs to be writable in case one of the jobs runs over time and two jobs end up running at the same time
It should cope with multiple writers at the same time.
I'll add one thing you may want to consider as well:
reasonable read performance while writing.
** Performance and user experience are key
A partitioned table can meet all of these requirements, and it is transparent to the client application.
At present you are doing ETL; you will soon face performance issues as the table size grows quickly, and a partitioned table is the only solution.
The main steps are:
Create a partitioned table with a partition list.
Normal reads and writes to the table run as usual.
ETL process (can run in parallel), as sketched after these steps:
- ETL the data and load it into a new standalone table (very slow, minutes to hours, but no impact on the main table).
- Attach the new table to the main table's partition list (super fast, microsecond-level, making it visible in the main table).
Normal reads and writes on the main table continue as usual, now including the new data.
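A rough sketch of these steps, assuming PostgreSQL 10+ list partitioning driven from Python with psycopg2 (table names, the partition key, and the connection string are illustrative):
import psycopg2

# Placeholder DSN; the real connection depends on the environment.
conn = psycopg2.connect("dbname=mydb user=etl password=secret host=localhost")
conn.autocommit = True
cur = conn.cursor()

# One-time setup: a list-partitioned main table (PostgreSQL 10+ syntax).
cur.execute("""
    CREATE TABLE IF NOT EXISTS main_table (
        id        bigint,
        batch_day date,
        payload   text
    ) PARTITION BY LIST (batch_day);
""")

# ETL step: load the new batch into a standalone staging table
# (slow, but the main table is untouched while this runs).
cur.execute("CREATE TABLE staging_2020_01_01 (LIKE main_table);")
# ... bulk-load staging_2020_01_01 here ...

# Attach step: make the batch visible in the main table (near-instant).
cur.execute("""
    ALTER TABLE main_table
        ATTACH PARTITION staging_2020_01_01
        FOR VALUES IN ('2020-01-01');
""")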
Best Regards,
WY
A crucial step to consider while setting up your workflow is to always use good connection management practices, to minimize your application's footprint and reduce the likelihood of exceeding Cloud SQL connection limits. Database connections consume resources on the server and in the connecting application.
Cloud Composer has no limitations when it comes to your ability to interface with Cloud SQL, so either of the first two options is good.
A Python dependency is installable if it has no external dependencies and does not conflict with Composer's dependencies. In addition, question 14262433 explicitly explains the process of setting up a "large data" workflow using pandas.
LOAD DATA LOCAL INFILE requires you to use --local-infile with the mysql client; see the sketch below for the Python equivalent. To import data into Cloud SQL, make sure to follow the best practices.
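If the load is driven from Python (e.g. an Airflow task) rather than the mysql client, the equivalent of --local-infile is the connector's local-infile switch. A minimal sketch with mysql-connector-python (connection details, file path, and table are placeholders; the server must also allow LOAD DATA LOCAL INFILE):
import mysql.connector

# Placeholder connection details; allow_local_infile mirrors --local-infile.
conn = mysql.connector.connect(
    host="127.0.0.1",
    user="etl",
    password="secret",
    database="mydb",
    allow_local_infile=True,
)
cur = conn.cursor()
cur.execute("""
    LOAD DATA LOCAL INFILE '/tmp/batch.csv'
    INTO TABLE target_table
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
""")
conn.commit()
conn.close()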

how to speedup large data import in neo4j

I'm using the neo4j-import command-line tool to load large CSV files into Neo4j. I've tested the command with a subset of the data and it works well. The CSV data is about 200 GB, containing ~10M nodes and ~B relationships. Currently I'm using the default Neo4j configuration; it takes hours to create the nodes, and it got stuck at [*SORT:20.89 GB-------------------------------------------------------------------------------] 0. I'm worried that creating the relationships will take even longer, so I would like to know possible ways to speed up the data import.
It's a 16 GB machine, and the neo4j-import output shows the following:
free machine memory: 166.94 MB
Max heap memory: 3.48 GB
Should I change the Neo4j configuration to increase memory? Will it help?
I'm setting neo4j-import --processors=8. However, the CPU usage of the Java process is only about 1%. Does that look right?
Can someone give me a ballpark figure for the loading time, given the size of my dataset? It's an 8-core, 16 GB memory standalone machine.
Anything else I should look at to speed up the data import?
Update:
The machine does not have an SSD disk.
I ran the top command, and it shows that 85% of the RAM is being used by the Java process, which I think belongs to the neo4j-import command.
The import command is: neo4j-import --into /var/lib/neo4j/data/graph.db/ --nodes:Post Posts_Header.csv,posts.csv --nodes:User User_Header.csv,likes.csv --relationships:LIKES Likes_Header.csv,likes.csv --skip-duplicate-nodes true --bad-tolerance 100000000 --processors 8
Posts_Header: Post_ID:ID(Post),Message:string,Created_Time:string,Num_Of_Shares:int,e:IGNORE,f:IGNORE
User_Header: a:IGNORE,User_Name:string,User_ID:ID(User)
Likes_Header: :END_ID(Post),b:IGNORE,:START_ID(User)
I ran the sample data import and it's pretty fast, like several seconds. Since I use the default neo4j heap setting and default Java memory setting, will it help if I configure these numbers?
Some questions:
What kind of disk do you have? (SSD is preferable.)
It also seems all your RAM is already used up; check with top or ps which other processes are using memory and kill them.
Can you share the full neo4j-import command?
What does a sample of your CSV and the header line look like?
It seems that you have a lot of properties. Are they all properly quoted? Do you really need all of them in the graph?
Try with a sample first, like head -100000 file.csv > file100k.csv
Usually it can import 1M records/s with a fast disk.
That includes nodes, property and relationship-records.

Mysql MyISAM table crashes with lots of inserts and selects, what solution should I pick?

I have the following situation:
A MySQL MyISAM database on an Amazon EC2 instance, with PHP on an Apache web server. We need to store incoming packages as JSON in MySQL. For this I use a staging database where a cron job each minute moves old data to another table (named stage2) with a WHERE DateTime > 'date - 2min' query.
The stage1 table holds only current information; it contains about 35k rows normally and up to 100k when it's busy. We can reach 50k new rows a minute, which comes to about 10k insert queries. The inserts look like this:
INSERT DELAYED IGNORE INTO stage1 VALUES ( ... ), (....), (....), (....), (...)
Then we have 4 scripts running roughly every 10 seconds, doing the following (a rough sketch of this loop follows the list):
grab the max RowID from stage1 (the primary key)
export the data from the previous max RowID up to that RowID
a) 2 scripts are in bash, using the mysql command-line export method
b) 1 script is in node.js, using the export method with INTO OUTFILE
c) 1 script is in PHP, using a plain mysql SELECT statement and looping through each row
send the data to an external client
write the last send time and last RowID to a MySQL table so it knows where to resume next time.
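A rough Python sketch of the incremental-export loop described above (table, column, and connection details are illustrative assumptions, not the actual bash/node.js/PHP scripts):
import json
import mysql.connector

# Placeholder connection details.
conn = mysql.connector.connect(host="127.0.0.1", user="etl",
                               password="secret", database="staging")
cur = conn.cursor()

# 1. Read where we left off last time (assumes a one-row state table).
cur.execute("SELECT last_rowid FROM export_state WHERE script = %s", ("json_export",))
last_rowid = cur.fetchone()[0]

# 2. Grab the current max RowID from stage1.
cur.execute("SELECT MAX(RowID) FROM stage1")
max_rowid = cur.fetchone()[0] or last_rowid

# 3. Export everything between the previous max and the current max.
cur.execute("SELECT RowID, payload FROM stage1 WHERE RowID > %s AND RowID <= %s",
            (last_rowid, max_rowid))
batch = [{"rowid": row_id, "payload": payload} for row_id, payload in cur.fetchall()]

# 4. Send to the external client (placeholder: write JSON to disk).
with open("/tmp/export.json", "w") as f:
    json.dump(batch, f)

# 5. Record the new high-water mark for the next run.
cur.execute("UPDATE export_state SET last_rowid = %s, last_sent = NOW() WHERE script = %s",
            (max_rowid, "json_export"))
conn.commit()
conn.close()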
Then we have one cron job each minute moving old data from stage1 to stage2.
Everything worked well for a long time, but now we are gaining users and during rush hours the stage1 table crashes now and then. We can easily repair it, but that's not the right approach because we are down for some time. Memory and CPU are OK during rush hours, but when stage1 crashes everything crashes.
Also worth saying: I don't care if I lose some rows because of a failure, so I don't need any special backup plan in case something goes wrong.
What I have done so far:
Added DELAYED and IGNORE to the insert statements.
Tried switching to InnoDB, but this was even worse, mainly (I think) because of the large amount of memory it needed. My EC2 instance is currently a t2.medium, which has 4 GB memory and 2 vCPUs with burst capacity. Following https://dba.stackexchange.com/questions/27328/how-large-should-be-mysql-innodb-buffer-pool-size and running this query:
SELECT CEILING(Total_InnoDB_Bytes*1.6/POWER(1024,3)) RIBPS FROM
(SELECT SUM(data_length+index_length) Total_InnoDB_Bytes
FROM information_schema.tables WHERE engine='InnoDB') A;
it returned 11 GB. I tried 3 GB, which is the max for my instance (80%), and since it was even more unstable I switched every table back to MyISAM yesterday.
Recreated the stage1 table structure.
What are my limitations?
I cannot merge the 4 scripts into one export, because the output to each client is different; for example, some use JSON, others XML.
Options I'm considering
An m3.xlarge instance with 15 GB memory is 5 times more expensive, but if it is needed I'm willing to pay for it. Then switch to InnoDB again and see if it's stable?
Can I move only stage1 to InnoDB and run it with a 3 GB buffer pool size, so the rest stays MyISAM?
Try doing it with a NoSQL database or an in-memory database. Would that work?
Queue the packages in memory, have the 4 scripts read the data from memory, and save everything to MySQL later when done. Is there some kind of tool for this?
Move stage1 to an RDS instance with InnoDB.
I'd love to get some advice and help on this! Perhaps I'm missing the easy answer, or which options should I not consider?
Thanks,
Sjoerd Perfors
Today we fixed these issues with the following setup:
An AWS load balancer in front of a t2.small "Worker" instance, where Apache and PHP handle the requests and send them to an EC2 instance running MySQL, called the "Main".
When the CPU of the t2.small instance goes above 50%, new instances are automatically launched and attached to the load balancer.
The "Main" EC2 instance has MySQL running with InnoDB.
Everything updated to Apache 2.4 and PHP 5.5 with performance updates.
Fixed one script so that it runs a lot faster.
InnoDB now has 6 GB.
Things we tried that didn't work:
- Setting up DynamoDB, but sending to that DB took almost 5 seconds.
Things we are considering:
- Removing the stage2 table and doing backups directly from stage1. It seems that having this number of rows isn't bad for performance.

Abnormally high MySQL writes and larger than normal Binary Log files. How can I determine what caused this?

We have a MySQL master database which replicates to a MySQL slave. We were experiencing issues where MySQL was showing a high number of writes (but not an increased number of queries being run) for a short period of time (a few hours). We are trying to investigate the cause.
Normally our binary logs are 1 GB in file size, but during the period that we were experiencing these issues, the log files jumped to 8.5 GB.
When I run mysqlbinlog --short-form BINARYLOG.0000 on one of the 8.5 GB binary logs, it only returns 196 KB of queries and data. When I run mysqlbinlog --short-form on a normal binary log (1 GB), it returns around 8,500 KB worth of queries and database activity. That doesn't make sense, because the big file has about 7 GB more data yet returns less than a 1 GB binary log file.
I see lots of these statements with very sequential timestamps, but I'm not sure if that's related to the problem, because they appear both in the normal period and in the period when we experienced these issues.
SET TIMESTAMP=1391452372/*!*/;COMMIT/*!*/;
SET TIMESTAMP=1391452372/*!*/;BEGIN/*!*/;COMMIT/*!*/;
SET TIMESTAMP=1391452372/*!*/;BEGIN/*!*/;COMMIT/*!*/;
SET TIMESTAMP=1391452372/*!*/;BEGIN/*!*/;COMMIT/*!*/;
How can I determine what caused those binary logs to balloon in size, which also caused writes so high that they took the server offline at points, almost like a DDoS attack would?
How could mysqlbinlog return so much less data, even though the binary log file itself had about 7 GB more? What can I do to identify the difference between a normal period, where the binary logs are 1 GB, and the period where we had issues with the 8.5 GB binary log? Thank you for any help you can provide.
Bill
I would guess that your log contains some form of LOAD DATA [LOCAL] INFILE commands and the data files associated with them. These commands do not generate much SQL output as their data is written to a temporary file by mysqlbinlog during processing. Can you check if the output contains any such LOAD DATA commands?

Cassandra giving time out exception after some inserts

I am using Cassandra version 1.0.6... I have around ~1 million JSON objects of 5 KB each to be inserted into Cassandra. As the inserts go on, the memory consumption of Cassandra also goes up until it stabilizes at a certain point. After some inserts (around 2-3 lakh, i.e. 200k-300k), the Ruby client gives me a "`recv_batch_mutate': CassandraThrift::TimedOutException" exception.
I have also tried inserting 1 KB JSON objects more than a million times; this doesn't give any exception. In this experiment I also plotted a graph of the time taken per batch of 50,000 inserts versus the batch number. I found that there is a sharp rise in insert time after some iterations, and then it suddenly falls back down. This could be due to garbage collection by the JVM. But the same doesn't happen while inserting the 5 KB objects a million times.
What might be the problem? Some of the configuration options I am using:
System:
8 GB RAM with a 4-core CPU
Cassandra configuration:
concurrent_writes: 64
memtable_flush_writers: 4
memtable_flush_queue_size: 8
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 30
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: true
Do I need to make any configuration changes? Is this related to JVM heap space or to garbage collection?
You can increase the RPC timeout to a larger value in the Cassandra config file; look for rpc_timeout_in_ms. But you should really look into the connection handling in your Ruby client.
# Time to wait for a reply from other nodes before failing the command
rpc_timeout_in_ms: 10000