Using DBOutputFormat to write data to MySQL causes IOException

I am learning MapReduce and using it to write data to a MySQL database. There are two ways to do this: DBOutputFormat and Sqoop. I tried the first one (refer to here), but ran into a problem. The error is as follows:
...
16/05/25 09:36:53 INFO mapred.LocalJobRunner: 3 / 3 copied.
16/05/25 09:36:53 INFO mapred.LocalJobRunner: reduce task executor complete.
16/05/25 09:36:53 WARN output.FileOutputCommitter: Output Path is null in cleanupJob()
16/05/25 09:36:53 WARN mapred.LocalJobRunner: job_local1404930626_0001
java.lang.Exception: java.io.IOException
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.io.IOException
at org.apache.hadoop.mapreduce.lib.db.DBOutputFormat.getRecordWriter(DBOutputFormat.java:185)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:540)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:614)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/05/25 09:36:54 INFO mapreduce.Job: Job job_local1404930626_0001 failed with state FAILED due to: NA
16/05/25 09:36:54 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=32583
FILE: Number of bytes written=796446
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=402
HDFS: Number of bytes written=0
HDFS: Number of read operations=18
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
...
When I connect and insert data manually through JDBC, it succeeds. I also notice that the map/reduce task executors are reported as complete, yet the job still hits the IOException, so I guess the problem is database-related.
My code is here. I would appreciate it if someone could help me figure out what the problem is.
Thanks in advance!

Related

Error timeout when I want to read file from hadoop with pyspark

I want to read a CSV file from Hadoop with PySpark, using the following code:
dfcsv = spark.read.csv("hdfs://my_hadoop_cluster_ip:9000/user/root/input/test.csv")
dfcsv.printSchema()
My Hadoop cluster runs in a Docker container on my local machine and is linked to two other slave containers that act as workers.
According to my Hadoop cluster UI, the path is correct.
But when I submit my script with this command:
spark-submit --master spark://my_cluster_spark_ip:7077 test.py
My script gets stuck on the read, and after a few minutes I get the following error:
22/02/09 15:42:29 WARN TaskSetManager: Lost task 0.1 in stage 4.0 (TID 4) (my_slave_spark_ip executor 1): org.apache.hadoop.net.ConnectTimeoutException: Call From spark-slave1/my_slave_spark_ip to my_hadoop_cluster_ip:9000 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=my_hadoop_cluster_ip/my_hadoop_cluster_ip:9000]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
...
For information, my CSV file is very small: just 3 lines and 64 KB.
Do you have any solution to fix this issue?
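The ConnectTimeoutException above is raised on the Spark worker (spark-slave1), so one way to narrow the problem down is to check, from inside that worker container, whether the NameNode port accepts TCP connections at all. The following is only a minimal sketch using the Python standard library; the function name can_connect is hypothetical, and the host and port in the usage comment are the placeholders from the question, not real values.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        # create_connection resolves the host and attempts a plain TCP connect.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name-resolution failures.
        return False


# Example usage (placeholders from the question, to be run on the worker):
#   can_connect("my_hadoop_cluster_ip", 9000)
```

If this returns False on the worker but True on the machine that submits the job, the problem is more likely Docker networking between the containers than the CSV file or the Spark code.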

Py4JError while converting csv file to parquet using jupyter-notebook

I want to convert a CSV file to Parquet using Jupyter Notebook with Python 3. However, I get the following error:
Py4JJavaError Traceback (most recent call last)
Py4JJavaError: An error occurred while calling o40.parquet.
: org.apache.spark.SparkException: Job aborted.
at ...
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
at ...
Caused by: org.apache.spark.SparkException: Task failed while writing rows.
Caused by: java.net.SocketException: Connection reset by peer: socket write error
How can I resolve it, please?
Make sure you have the Hadoop binaries available and that HADOOP_HOME is set.
If not, download them from here.
Then set HADOOP_HOME (the snippet below also sets JAVA_HOME):
import os
os.environ['HADOOP_HOME']=r"C:\hadoop-2.7.1"
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_212"
Then save the file
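Before retrying, it may help to confirm that HADOOP_HOME actually points at an unpacked Hadoop directory. The check below is a rough sketch, not part of the original answer: the function name hadoop_home_problems is hypothetical, and the assumption that a usable installation has a bin subdirectory is mine.

```python
import os
from pathlib import Path


def hadoop_home_problems(env=os.environ):
    """Return a list of problems found with HADOOP_HOME; empty if none were found."""
    home = env.get("HADOOP_HOME")
    if not home:
        return ["HADOOP_HOME is not set"]
    problems = []
    if not Path(home).is_dir():
        problems.append(f"HADOOP_HOME does not exist: {home}")
    elif not (Path(home) / "bin").is_dir():
        # Assumption: the Hadoop binaries live under HADOOP_HOME/bin.
        problems.append(f"no bin directory under {home}")
    return problems


# Example usage: print(hadoop_home_problems())
```

An empty list only means the directory layout looks plausible; it does not prove that the binaries match the installed Spark or Java version.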

JMeter throwing 'File never reserved' error

While running JMeter for data-driven testing, JMeter throws the error below, and a big chunk of the data was not pulled in. The CSV file has about 1500 email IDs, and I would like to ramp the test up to a million. However, the test fails on the very first attempt.
Any thoughts, please? I appreciate your help.
2019-03-19 23:55:08,169 ERROR o.a.j.c.CSVDataSet: java.io.IOException: File never reserved: C:\Users\sp\Desktop\STM\STM-JMeterTests\emailAdd.csv
2019-03-19 23:55:08,572 INFO o.a.j.t.JMeterThread: Thread is done: emailAdd 1-5
2019-03-19 23:55:08,572 INFO o.a.j.t.JMeterThread: Thread finished: emailAdd 1-5
2019-03-19 23:55:08,594 ERROR o.a.j.c.CSVDataSet: java.io.IOException: File never reserved: C:\Users\sp\Desktop\STM\STM-JMeterTests\emailAdd.csv
2019-03-19 23:55:08,595 ERROR o.a.j.c.CSVDataSet: java.io.IOException: File never reserved: C:\Users\sp\Desktop\STM\STM-JMeterTests\emailAdd.csv
2019-03-19 23:55:08,607 INFO o.a.j.t.JMeterThread: Thread is done: emailAdd 1-4

org.apache.thrift.transport.TTransportException error while Reading large JSON file in zeppelin scala

I am trying to read a large JSON file (1.5 GB) using Zeppelin and Scala.
Zeppelin is running on Spark in local mode, installed on Ubuntu on a VM with 10 GB of RAM. I have allotted 8 GB to spark.executor.memory.
My code is as below:
val inputFileWeather="/home/shashi/incubator-zeppelin-master/data/ai/weather.json"
val temp=sqlContext.read.json(inputFileWeather)
I am getting the following error:
org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:241)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:225)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:229)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:229)
at org.apache.zeppelin.scheduler.Job.run(Job.java:171)
at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:328)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
The error you got is due to a problem running the Spark interpreter: Zeppelin could not connect to the interpreter process.
Check the logs located in /PATH/TO/ZEPPELIN/logs/*.out to find out exactly what is happening. You may see an OOM (out-of-memory error) in the interpreter logs.
I think 8 GB of executor memory on a VM with 10 GB is unreasonable (and how many executors are you starting?). You have to consider the driver memory as well.
Increase the driver memory in the pyspark interpreter, i.e. spark.driver.memory. By default it is 1G.
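To make the memory argument concrete, here is a back-of-the-envelope sketch of how much RAM is left on the VM once the configured JVM heaps are subtracted. The function name memory_headroom_gb is hypothetical, and the calculation deliberately ignores JVM overhead, the operating system, and Zeppelin itself, so the real headroom is smaller than what it reports.

```python
def memory_headroom_gb(total_ram_gb, executor_gb, num_executors=1, driver_gb=1.0):
    """RAM left after the configured executor and driver heaps (heap sizes only)."""
    return total_ram_gb - (executor_gb * num_executors + driver_gb)


# Figures from the question: 10 GB VM, 8 GB executor memory, default 1 GB driver.
print(memory_headroom_gb(10, 8))  # 1.0
```

With the figures from the question, only about 1 GB remains for everything else on the VM, which is consistent with the suspicion that the interpreter process ran out of memory.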

While Cassandra compaction Fatal exception in thread CompactionExecutor

I have a Cassandra cluster of 12 nodes on EC2 running cassandra-0.8.2.
During compaction I got the following exception, which caused the seed node to go down.
Below is the exception stack trace.
ERROR [CompactionExecutor:31] 2011-12-16 08:06:02,308 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[CompactionExecutor:31,1,main]
java.io.IOError: java.io.EOFException: EOF after 430959023 bytes out of 778986868
at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:149)
at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:90)
at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:74)
at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:179)
at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:144)
at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:136)
at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:39)
at org.apache.commons.collections.iterators.CollatingIterator.set(CollatingIterator.java:284)
at org.apache.commons.collections.iterators.CollatingIterator.least(CollatingIterator.java:326)
at org.apache.commons.collections.iterators.CollatingIterator.next(CollatingIterator.java:230)
at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:69)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
at org.apache.cassandra.db.compaction.CompactionManager.doCompactionWithoutSizeEstimation(CompactionManager.java:569)
at org.apache.cassandra.db.compaction.CompactionManager.doCompaction(CompactionManager.java:506)
at org.apache.cassandra.db.compaction.CompactionManager$1.call(CompactionManager.java:141)
at org.apache.cassandra.db.compaction.CompactionManager$1.call(CompactionManager.java:107)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException: EOF after 430959023 bytes out of 778986868
at org.apache.cassandra.io.util.FileUtils.skipBytesFully(FileUtils.java:229)
at org.apache.cassandra.io.sstable.IndexHelper.skipIndex(IndexHelper.java:63)
at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:141)
... 23 more
It says it is caused by java.io.EOFException.
Is it because of corrupt sstables?
If so, how do I remove or repair those sstables?
It looks like this is indeed caused by corrupt sstables (which may indicate a hardware problem). My recommendations:
Upgrade to the latest stable 0.8.x version of Cassandra. This will be a drop-in replacement for 0.8.2.
Run "nodetool scrub" on the machine having problems.
Review http://www.datastax.com/docs/1.0/install/cluster_init. I recommend two seed nodes per data center, but remember that a seed node is only consulted when restarting nodes, so it's not a big deal to have one down during normal operation.