org.apache.thrift.transport.TTransportException error while reading a large JSON file in Zeppelin Scala

I am trying to read a large JSON file (1.5 GB) using Zeppelin and Scala.
Zeppelin is running on Spark in local mode, installed on Ubuntu on a VM with 10 GB RAM. I have allotted 8 GB to spark.executor.memory.
My code is as follows:
val inputFileWeather="/home/shashi/incubator-zeppelin-master/data/ai/weather.json"
val temp=sqlContext.read.json(inputFileWeather)
I am getting the following error:
org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:241)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:225)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:229)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:229)
at org.apache.zeppelin.scheduler.Job.run(Job.java:171)
at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:328)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

The error you got is due to a problem running the Spark interpreter, so Zeppelin could not connect to the interpreter process.
You have to check the logs located in /PATH/TO/ZEPPELIN/logs/*.out to know exactly what is happening. Perhaps in the interpreter logs you will see an OOM.
I think that 8 GB for executor memory on a VM with 10 GB is unreasonable (and how many executors are you starting?). You have to consider the driver memory as well.
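For example (a sketch; the exact log file names vary with your Zeppelin version and user, so adjust the globs):
tail -n 100 /PATH/TO/ZEPPELIN/logs/zeppelin-interpreter-spark-*.log
grep -i "OutOfMemoryError" /PATH/TO/ZEPPELIN/logs/*.out
An OutOfMemoryError in those files would confirm that the interpreter JVM died, which Zeppelin then surfaces as the TTransportException above.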

Increase the driver memory in the Spark interpreter settings, i.e. spark.driver.memory. By default it is 1g.
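For example, in the Zeppelin UI (Interpreter menu → spark → edit), set the standard Spark properties; the values below are illustrative for a 10 GB VM and deliberately leave room for the OS:
spark.driver.memory    4g
spark.executor.memory  4g
Outside Zeppelin, the same properties can go in conf/spark-defaults.conf or be passed with --conf to spark-submit.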

Related

Read Parquet file error with Spark on Azure Synapse workspace

I am running a PySpark job using an Azure Synapse workspace. My Spark job is failing with the following error. Can someone help me debug it?
The error occurs in a Spark application run by a pipeline on Azure Synapse.
Stacktrace: An error occurred while calling o1394.execute.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 94.0 failed 4 times, most recent failure: Lost task 0.3 in stage 94.0 (TID 2313) (vm-1d164027 executor 3): java.io.EOFException
at org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:85)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:520)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:476)
at
The error message indicates that the Spark job is failing because it encounters an EOFException while reading Parquet files. This suggests that something is wrong with the Parquet files themselves: they are either incomplete or corrupt.
To debug this, inspect the Parquet files to see if anything is wrong with them. One way to do this is with the "parquet-tools" command-line tool, which can examine the contents of Parquet files and help identify issues such as missing or corrupted data.
If you are unable to identify the cause of the issue using parquet-tools, it could be a library implementation issue.
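For instance (the file name here is hypothetical; point the tool at one of the Parquet part files your job actually reads):
parquet-tools meta part-00000.snappy.parquet
parquet-tools head -n 5 part-00000.snappy.parquet
The meta subcommand reads the file footer, which is exactly what readFooter in the stack trace above is failing on, so a truncated file will typically fail here in the same way.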

Out of memory error on initializing Couchbase Java client

I am facing an out-of-memory error when initializing the Couchbase Java client. The issue happens when running test cases in a Gradle build: it does not occur when running individual test cases, only when running all test cases in the build. The error occurs on macOS but not on the Linux build machine.
Environment:
JVM = 16 (OpenJDK)
OS = macOS Monterey
Task = Gradle build
JVM memory settings = "-Xmx8000m" "-Xms512m" "-XX:MaxDirectMemorySize=2000m"
Stack trace:
Caused by: java.lang.OutOfMemoryError: Cannot reserve 16384 bytes of direct buffer memory (allocated: 536861104, limit: 536870912)
at java.base/java.nio.Bits.reserveMemory(Bits.java:178)
at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:121)
at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:330)
at com.couchbase.client.core.deps.io.netty.channel.unix.Buffer.allocateDirectWithNativeOrder(Buffer.java:40)
at com.couchbase.client.core.deps.io.netty.channel.unix.IovArray.<init>(IovArray.java:72)
at com.couchbase.client.core.deps.io.netty.channel.kqueue.KQueueEventLoop.<init>(KQueueEventLoop.java:62)
at com.couchbase.client.core.deps.io.netty.channel.kqueue.KQueueEventLoopGroup.newChild(KQueueEventLoopGroup.java:151)
at com.couchbase.client.core.deps.io.netty.channel.kqueue.KQueueEventLoopGroup.newChild(KQueueEventLoopGroup.java:32)
at com.couchbase.client.core.deps.io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:84)
at com.couchbase.client.core.deps.io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:60)
at com.couchbase.client.core.deps.io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:49)
at com.couchbase.client.core.deps.io.netty.channel.MultithreadEventLoopGroup.<init>(MultithreadEventLoopGroup.java:59)
at com.couchbase.client.core.deps.io.netty.channel.kqueue.KQueueEventLoopGroup.<init>(KQueueEventLoopGroup.java:110)
at com.couchbase.client.core.deps.io.netty.channel.kqueue.KQueueEventLoopGroup.<init>(KQueueEventLoopGroup.java:97)
at com.couchbase.client.core.deps.io.netty.channel.kqueue.KQueueEventLoopGroup.<init>(KQueueEventLoopGroup.java:73)
at com.couchbase.client.core.env.IoEnvironment.createEventLoopGroup(IoEnvironment.java:476)
at com.couchbase.client.core.env.IoEnvironment.<init>(IoEnvironment.java:285)
at com.couchbase.client.core.env.IoEnvironment.<init>(IoEnvironment.java:66)
at com.couchbase.client.core.env.IoEnvironment$Builder.build(IoEnvironment.java:674)
at com.couchbase.client.core.env.CoreEnvironment.<init>(CoreEnvironment.java:153)
at com.couchbase.client.java.env.ClusterEnvironment.<init>(ClusterEnvironment.java:53)
at com.couchbase.client.java.env.ClusterEnvironment.<init>(ClusterEnvironment.java:46)
at com.couchbase.client.java.env.ClusterEnvironment$Builder.build(ClusterEnvironment.java:213)
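A note on the numbers in the error (an observation, not a confirmed diagnosis): the reported limit of 536870912 bytes is exactly 512 MiB. When -XX:MaxDirectMemorySize is not set, the JVM caps direct buffers at roughly the max heap size, and 512 MB happens to be Gradle's default maximum heap for forked test workers. Together this suggests the memory flags above are not reaching the JVM that actually runs the tests. A minimal sketch of passing them to the test task explicitly, assuming a Gradle Kotlin DSL build script:
// build.gradle.kts -- illustrative only; adjust values to your build
tasks.test {
    maxHeapSize = "8000m"                    // Gradle test workers otherwise default to 512m
    jvmArgs("-XX:MaxDirectMemorySize=2000m") // direct-buffer limit used by the Couchbase/Netty IO layer
}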

Timeout error when reading a file from Hadoop with PySpark

I want to read a CSV file from Hadoop with PySpark using the following code:
dfcsv = spark.read.csv("hdfs://my_hadoop_cluster_ip:9000/user/root/input/test.csv")
dfcsv.printSchema()
My Hadoop cluster runs in a Docker container on my local machine, linked with two other slave containers for the workers.
The Hadoop cluster UI confirms that the path is correct.
But when I submit my script with this command:
spark-submit --master spark://my_cluster_spark_ip:7077 test.py
My script gets stuck on the read, and after a few minutes I get the following error:
22/02/09 15:42:29 WARN TaskSetManager: Lost task 0.1 in stage 4.0 (TID 4) (my_slave_spark_ip executor 1): org.apache.hadoop.net.ConnectTimeoutException: Call From spark-slave1/my_slave_spark_ip to my_hadoop_cluster_ip:9000 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=my_hadoop_cluster_ip/my_hadoop_cluster_ip:9000]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
...
For reference, my CSV file is very small: just 3 lines and 64 KB.
Do you have any solution to fix this issue?
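One thing the trace itself shows (an observation, not a confirmed fix): the connection attempt that times out originates on the worker, spark-slave1, so my_hadoop_cluster_ip:9000 must be reachable from inside the worker containers, not only from the machine running spark-submit. A quick way to test that, assuming spark-slave1 is the container name and nc is available inside it:
docker exec -it spark-slave1 nc -vz my_hadoop_cluster_ip 9000
If that fails, the fix is usually Docker networking (putting the Spark and Hadoop containers on a shared network) rather than anything in the PySpark code.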

My account on Cosmos global instance seems to be running out of space - maybe need to increase quota

Trying to run a simple HDFS command failed with:
[ms#cosmosmaster-gi ~]$ hadoop fs -ls /user/ms/def_serv/def_servpath
Java HotSpot(TM) 64-Bit Server VM warning: Insufficient space for shared memory file:
/tmp/hsperfdata_ms/21066
Try using the -Djava.io.tmpdir= option to select an alternate temp location.
Exception in thread "main" java.lang.NoClassDefFoundError: ___/tmp/hsperfdata_ms/21078
Caused by: java.lang.ClassNotFoundException: ___.tmp.hsperfdata_ms.21078
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: ___/tmp/hsperfdata_ms/21078. Program will exit.
Any idea how to fix that or increase quota?
Thanks!
ms
Your quota has not been exceeded (see command below), but this was a problem with the cluster. It should be fixed now.
$ hadoop fs -dus /user/ms
hdfs://cosmosmaster-gi/user/ms 90731

Cassandra ColumnFamilyInputFormat throwing IncompatibleClassChangeError on Hadoop 2.2

When I try to run a simple MapReduce program talking to Cassandra, I get the following error. I am using Hadoop 2.2 and Cassandra 2.0.2. Can someone who has solved this problem please respond with the solution?
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.getSplits(AbstractColumnFamilyInputFormat.java:116)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:491)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:508)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:392)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1265)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1286)
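For context (a general note about this error, not from the original thread): "Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected" is the classic symptom of code compiled against Hadoop 1.x, where JobContext was a class, running on Hadoop 2.x, where it became an interface. The Cassandra Hadoop integration jar in use here was presumably built against Hadoop 1.x, so a build of the Cassandra Hadoop classes compiled for Hadoop 2 would be needed.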