Why is Spark unable to read a large JSON text file despite sufficient memory for both execution and cache? - json

I have configured a standalone cluster (one node with 32 GB and 32 cores) with 2 workers of 16 cores and 10 GB memory each. The JSON file is only 6 GB. I have tried tweaking different configurations, including increasing young-generation memory. Is there anything else I am missing?
PS: This doesn't work even in spark-shell.
Configuration inside the env file:
SPARK_EXECUTOR_MEMORY=8g
SPARK_WORKER_CORES=16
SPARK_WORKER_INSTANCES=2
SPARK_WORKER_MEMORY=10g
I have tried the following config params while submitting the job:
./spark-submit --driver-memory 4g \
--supervise --verbose \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.memory.fraction=0.2 \
--conf spark.memory.storageFraction=0.5 \
--conf spark.network.timeout=10000000 \
--conf spark.executor.heartbeatInterval=10000000
import org.apache.spark.storage.StorageLevel

val dataFrame = spark.read.option("multiline", "true")
  .schema(schema)
  .json(jsonFilePath)
  .persist(StorageLevel.MEMORY_ONLY_SER_2)

dataFrame.createOrReplaceTempView("node")

val df = spark.sqlContext.sql("select col1, col2.x, col3 from node")
  .persist(StorageLevel.MEMORY_ONLY_SER_2)

val newrdd = df.repartition(64)
  .rdd.mapPartitions(partition => {
    val newPartition = partition.map(x => {
      // somefunction()
    }).toList
    newPartition.iterator
  }).persist(StorageLevel.MEMORY_ONLY_SER_2)
Error:
Lost task 0.0 in stage 0.0 (TID 0, 131.123.39.101, executor 0): java.lang.OutOfMemoryError: GC overhead limit exceeded
Thanks in advance

Related

Will dataframe.write.save() fail if --driver-memory is less than the data to be written in PySpark?

We have a dataframe with 5 columns and nearly 1.5 Billion records.
We want to write that as a single line (single record) json.
We are facing two issues:
df.write.format('json') writes the output as a single line (single record), but there is a second line that is empty and we want to avoid it.
df.write.format('json').save('somedir_in_HDFS') is giving an error.
We want to save the data as a single file so that the downstream application can read it.
Here is the sample code.
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.types import *

schema = StructType([
    StructField("author", StringType(), False),
    StructField("title", StringType(), False),
    StructField("pages", IntegerType(), False),
    StructField("email", StringType(), False)
])

# likewise we have 1.5 billion records
data = [
    ["author1", "title1", 1, "author1#gmail.com"],
    ["author2", "title2", 2, "author2#gmail.com"],
    ["author3", "title3", 3, "author3#gmail.com"],
    ["author4", "title4", 4, "author4#gmail.com"]
]

if __name__ == "__main__":
    spark = SparkSession.builder.appName("Test").enableHiveSupport().getOrCreate()
    df = spark.createDataFrame(data, schema)
    dfprocessed = df  # here we are doing lots of joins with other tables
    dfprocessed = dfprocessed.agg(
        f.collect_list(f.struct(f.col('author'), f.col('title'), f.col('pages'), f.col('email'))).alias("list-item"))
    dfprocessed = dfprocessed.withColumn("version", f.lit("1.0").cast(IntegerType()))
    dfprocessed.printSchema()
    dfprocessed.write.format("json").mode("overwrite").option("escape", "").save('./TestJson')
    # the above write adds one extra empty line
Submit command:
spark-submit --conf spark.port.maxRetries=100 --master yarn --deploy-mode cluster \
--executor-memory 16g --executor-cores 4 --driver-memory 16g \
--conf spark.driver.maxResultSize=10g \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=30 \
--conf spark.dynamicAllocation.minExecutors=15 \
code.py
Error:
Error occurred in save_daily_report: An error occurred while calling o149.save.
: org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  ...
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 211.0 failed 4 times, most recent failure: Lost task 26.3 in stage 211.0
ExecutorLostFailure (executor xxx exited caused by one of the running tasks) Reason: Container marked as failed
Container exited with a non-zero exit code 143
Killed by external signal
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890)
Why would the above code fail? How do we resolve the error?

Automation of a Python file in bash [duplicate]

In Python, how can we find out the command line arguments that were provided for a script, and process them?
For some more specific examples, see Implementing a "[command] [action] [parameter]" style command-line interfaces? and How do I format positional argument help using Python's optparse?.
import sys
print("\n".join(sys.argv))
sys.argv is a list that contains all the arguments passed to the script on the command line. sys.argv[0] is the script name.
Basically,
import sys
print(sys.argv[1:])
The canonical solution in the standard library is argparse (docs):
Here is an example:
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument("-f", "--file", dest="filename",
                    help="write report to FILE", metavar="FILE")
parser.add_argument("-q", "--quiet",
                    action="store_false", dest="verbose", default=True,
                    help="don't print status messages to stdout")
args = parser.parse_args()
argparse supports (among other things):
Multiple options in any order.
Short and long options.
Default values.
Generation of a usage help message.
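As a quick illustration of what the parser above produces, here is a minimal sketch; the lists passed to parse_args() stand in for a real command line:

from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument("-f", "--file", dest="filename",
                    help="write report to FILE", metavar="FILE")
parser.add_argument("-q", "--quiet",
                    action="store_false", dest="verbose", default=True,
                    help="don't print status messages to stdout")

# parse_args() accepts an explicit argument list, which is handy for testing
args = parser.parse_args(["-f", "report.txt"])
print(args.filename, args.verbose)   # report.txt True

args = parser.parse_args(["--quiet", "--file", "report.txt"])
print(args.filename, args.verbose)   # report.txt False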
Just going around evangelizing for argparse, which is better for these reasons, essentially (copied from the link):
The argparse module can handle positional and optional arguments, while optparse can handle only optional arguments.

argparse isn't dogmatic about what your command line interface should look like: options like -file or /file are supported, as are required options. optparse refuses to support these features, preferring purity over practicality.

argparse produces more informative usage messages, including command-line usage determined from your arguments, and help messages for both positional and optional arguments. The optparse module requires you to write your own usage string, and has no way to display help for positional arguments.

argparse supports actions that consume a variable number of command-line args, while optparse requires that the exact number of arguments (e.g. 1, 2, or 3) be known in advance.

argparse supports parsers that dispatch to sub-commands, while optparse requires setting allow_interspersed_args and doing the parser dispatch manually.

And my personal favorite:

argparse allows the type and action parameters to add_argument() to be specified with simple callables, while optparse requires hacking class attributes like STORE_ACTIONS or CHECK_METHODS to get proper argument checking.
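To make the sub-command point concrete, here is a minimal sketch of argparse sub-command dispatch (the command names "fetch" and "push" and their options are made up purely for illustration):

import argparse

parser = argparse.ArgumentParser(prog="tool")
subparsers = parser.add_subparsers(dest="command", required=True)

# each sub-command gets its own parser with its own arguments
fetch = subparsers.add_parser("fetch", help="fetch a resource")
fetch.add_argument("url")

push = subparsers.add_parser("push", help="push a resource")
push.add_argument("path")
push.add_argument("--force", action="store_true")

# an explicit list stands in for sys.argv[1:]
args = parser.parse_args(["push", "data.txt", "--force"])
print(args.command, args.path, args.force)   # push data.txt True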
There is also the argparse stdlib module (an "improvement" on stdlib's optparse module). Example from the introduction to argparse:
# script.py
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        'integers', metavar='int', type=int, choices=range(10),
        nargs='+', help='an integer in the range 0..9')
    parser.add_argument(
        '--sum', dest='accumulate', action='store_const', const=sum,
        default=max, help='sum the integers (default: find the max)')
    args = parser.parse_args()
    print(args.accumulate(args.integers))
Usage:
$ script.py 1 2 3 4
4
$ script.py --sum 1 2 3 4
10
If you need something fast and not very flexible:
main.py:
import sys
first_name = sys.argv[1]
last_name = sys.argv[2]
print("Hello " + first_name + " " + last_name)
Then run python main.py James Smith
to produce the following output:
Hello James Smith
The docopt library is really slick. It builds an argument dict from the usage string for your app.
Eg from the docopt readme:
"""Naval Fate.
Usage:
  naval_fate.py ship new <name>...
  naval_fate.py ship <name> move <x> <y> [--speed=<kn>]
  naval_fate.py ship shoot <x> <y>
  naval_fate.py mine (set|remove) <x> <y> [--moored | --drifting]
  naval_fate.py (-h | --help)
  naval_fate.py --version

Options:
  -h --help     Show this screen.
  --version     Show version.
  --speed=<kn>  Speed in knots [default: 10].
  --moored      Moored (anchored) mine.
  --drifting    Drifting mine.
"""
from docopt import docopt
if __name__ == '__main__':
    arguments = docopt(__doc__, version='Naval Fate 2.0')
    print(arguments)
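For instance, invoking the script above with one of the declared patterns (the ship name here is just an example) makes docopt build the argument dict from the usage string:

$ python naval_fate.py ship Guardian move 10 50 --speed=20

and the printed arguments dict would then contain entries along the lines of '<name>': ['Guardian'], '<x>': '10', '<y>': '50', '--speed': '20', and 'move': True.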
One way to do it is using sys.argv. This will print the script name as the first argument and all the other parameters that you pass to it.
import sys

for arg in sys.argv:
    print(arg)

# set default args to -h if no args are given:
if len(sys.argv) == 1:
    sys.argv[1:] = ["-h"]
I use optparse myself, but really like the direction Simon Willison is taking with his recently introduced optfunc library. It works by:
"introspecting a function
definition (including its arguments
and their default values) and using
that to construct a command line
argument parser."
So, for example, this function definition:
def geocode(s, api_key='', geocoder='google', list_geocoders=False):
is turned into this optparse help text:
Options:
  -h, --help            show this help message and exit
  -l, --list-geocoders
  -a API_KEY, --api-key=API_KEY
  -g GEOCODER, --geocoder=GEOCODER
I like getopt from the stdlib, e.g.:
import getopt
import sys

try:
    opts, args = getopt.getopt(sys.argv[1:], 'h', ['help'])
except getopt.GetoptError as err:
    usage(err)

for opt, arg in opts:
    if opt in ('-h', '--help'):
        usage()

if len(args) != 1:
    usage("specify thing...")
Lately I have been wrapping something similar to this to make things less verbose (e.g. making "-h" implicit).
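The snippet above assumes a small usage() helper that the answer does not show; here is a minimal sketch of what it might look like (the message text is made up):

import sys

def usage(err=None):
    # print an optional error, then how to invoke the script, and exit
    if err:
        print(err, file=sys.stderr)
    print("usage: script.py [-h | --help] <thing>", file=sys.stderr)
    sys.exit(2)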
As the optparse docs now state: "The optparse module is deprecated and will not be developed further; development will continue with the argparse module."
Pocoo's click is more intuitive, requires less boilerplate, and is at least as powerful as argparse.
The only weakness I've encountered so far is that you can't do much customization to help pages, but that usually isn't a requirement and docopt seems like the clear choice when it is.
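For comparison, here is a minimal click sketch, roughly following the example in click's documentation (the command name and options are illustrative only):

import click

@click.command()
@click.option("--count", default=1, help="Number of greetings.")
@click.option("--name", prompt="Your name", help="The person to greet.")
def hello(count, name):
    """Simple program that greets NAME for a total of COUNT times."""
    for _ in range(count):
        click.echo(f"Hello, {name}!")

if __name__ == "__main__":
    hello()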
import argparse

parser = argparse.ArgumentParser(description='Process some integers.')
parser.add_argument('integers', metavar='N', type=int, nargs='+',
                    help='an integer for the accumulator')
parser.add_argument('--sum', dest='accumulate', action='store_const',
                    const=sum, default=max,
                    help='sum the integers (default: find the max)')
args = parser.parse_args()
print(args.accumulate(args.integers))
Assuming the Python code above is saved into a file called prog.py
$ python prog.py -h
Ref-link: https://docs.python.org/3.3/library/argparse.html
You may be interested in a little Python module I wrote to make handling of command line arguments even easier (open source and free to use) - Commando
Yet another option is argh. It builds on argparse, and lets you write things like:
import argh

# declaring:
def echo(text):
    "Returns given word as is."
    return text

def greet(name, greeting='Hello'):
    "Greets the user with given name. The greeting is customizable."
    return greeting + ', ' + name

# assembling:
parser = argh.ArghParser()
parser.add_commands([echo, greet])

# dispatching:
if __name__ == '__main__':
    parser.dispatch()
It will automatically generate help and so on, and you can use decorators to provide extra guidance on how the arg-parsing should work.
I recommend looking at docopt as a simple alternative to these others.
docopt is a new project that works by parsing your --help usage message rather than requiring you to implement everything yourself. You just have to put your usage message in the POSIX format.
Also, with Python 3 you might find it convenient to use Extended Iterable Unpacking to handle optional positional arguments without additional dependencies:
import sys

try:
    _, arg1, arg2, arg3, *_ = sys.argv + [None] * 2
except ValueError:
    print("Not enough arguments", file=sys.stderr)  # an unhandled exception traceback is meaningful enough, too
    exit(-1)
The above argv unpack makes arg2 and arg3 "optional": if they are not specified in argv, they will be None, while if the first is not specified, ValueError will be thrown:
Traceback (most recent call last):
File "test.py", line 3, in <module>
_, arg1, arg2, arg3, *_ = sys.argv + [None] * 2
ValueError: not enough values to unpack (expected at least 4, got 3)
My solution is entrypoint2. Example:
from entrypoint2 import entrypoint

@entrypoint
def add(file, quiet=True):
    ''' This function writes report.
    :param file: write report to FILE
    :param quiet: don't print status messages to stdout
    '''
    print(file, quiet)
help text:
usage: report.py [-h] [-q] [--debug] file

This function writes report.

positional arguments:
  file         write report to FILE

optional arguments:
  -h, --help   show this help message and exit
  -q, --quiet  don't print status messages to stdout
  --debug      set logging level to DEBUG
import sys

# Command line arguments are stored in sys.argv
# print(sys.argv[1:])
# I used the slice [1:] to print all the elements except the first,
# because the first element of sys.argv is the program name.
# So the first argument is sys.argv[1], the second is sys.argv[2], etc.

print("File name: " + sys.argv[0])
print("Arguments:")
for i in sys.argv[1:]:
    print(i)
Let's name this file command_line.py and let's run it:
C:\Users\simone> python command_line.py arg1 arg2 arg3 ecc
File name: command_line.py
Arguments:
arg1
arg2
arg3
ecc
Now let's write a simple program, sum.py:
import sys

try:
    print(sum(map(float, sys.argv[1:])))
except ValueError:
    print("An error has occurred")
Result:
C:\Users\simone> python sum.py 10 4 6 3
23
This handles simple switches, value switches with optional alternative flags.
import sys

# [IN] argv   - array of args
# [IN] switch - switch to seek
# [IN] val    - expecting a value
# [IN] alt    - switch alternative
# returns the value, or True if no value is expected
def parse_cmd(argv, switch, val=None, alt=None):
    for idx, x in enumerate(argv):
        if x == switch or x == alt:
            if val:
                if len(argv) > (idx + 1):
                    if not argv[idx + 1].startswith('-'):
                        return argv[idx + 1]
            else:
                return True

# expecting a value for -i
i = parse_cmd(sys.argv[1:], "-i", True, "--input")
# no value needed for -p
p = parse_cmd(sys.argv[1:], "-p")
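For example (a minimal sketch; the argument list below stands in for a real command line):

# simulating: script.py -i data.txt -p
argv = ["-i", "data.txt", "-p"]
print(parse_cmd(argv, "-i", True, "--input"))   # data.txt
print(parse_cmd(argv, "-p"))                    # True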
Several of our biotechnology clients have posed these two questions recently:
How can we execute a Python script as a command?
How can we pass input values to a Python script when it is executed as a command?
I have included a Python script below which I believe answers both questions. Let's assume the following Python script is saved in the file test.py:
#
#----------------------------------------------------------------------
#
# file name: test.py
#
# input values: data - location of data to be processed
# date - date data were delivered for processing
# study - name of the study where data originated
# logs - location where log files should be written
#
# macOS usage:
#
# python3 test.py "/Users/lawrence/data" "20220518" "XYZ123" "/Users/lawrence/logs"
#
# Windows usage:
#
# python test.py "D:\data" "20220518" "XYZ123" "D:\logs"
#
#----------------------------------------------------------------------
#
# import needed modules...
#
import sys
import datetime

def main(argv):
    #
    # print message that process is starting...
    #
    print("test process starting at", datetime.datetime.now().strftime("%Y%m%d %H:%M"))
    #
    # set local values from input values...
    #
    data = sys.argv[1]
    date = sys.argv[2]
    study = sys.argv[3]
    logs = sys.argv[4]
    #
    # print input arguments...
    #
    print("data value is", data)
    print("date value is", date)
    print("study value is", study)
    print("logs value is", logs)
    #
    # print message that process is ending...
    #
    print("test process ending at", datetime.datetime.now().strftime("%Y%m%d %H:%M"))

#
# call main() to begin processing...
#
if __name__ == '__main__':
    main(sys.argv)
The script can be executed on a macOS computer in a Terminal shell as shown below and the results will be printed to standard output (be sure the current directory includes the test.py file):
$ python3 test.py "/Users/lawrence/data" "20220518" "XYZ123" "/Users/lawrence/logs"
test process starting at 20220518 16:51
data value is /Users/lawrence/data
date value is 20220518
study value is XYZ123
logs value is /Users/lawrence/logs
test process ending at 20220518 16:51
The script can also be executed on a Windows computer in a Command Prompt as shown below and the results will be printed to standard output (be sure the current directory includes the test.py file):
D:\scripts>python test.py "D:\data" "20220518" "XYZ123" "D:\logs"
test process starting at 20220518 17:20
data value is D:\data
date value is 20220518
study value is XYZ123
logs value is D:\logs
test process ending at 20220518 17:20
This script answers both questions posed above and is a good starting point for developing scripts that will be executed as commands with input values.
Reason for the new answer:
Existing answers specify multiple options.
The standard option is to use argparse; a few answers provide examples from the documentation, and one answer points out its advantages. But all of them fail to explain the answer adequately and clearly for the OP's actual question, at least for newbies.
An example of argparse:
import argparse

def load_config(conf_file):
    pass

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Specifies one argument from the command line
    # You can have any number of arguments like this
    parser.add_argument("conf_file", help="configuration file for the application")
    args = parser.parse_args()
    config = load_config(args.conf_file)
The above program expects a config file as an argument. If you provide it, it will execute happily. If not, it will print the following:
usage: test.py [-h] conf_file
test.py: error: the following arguments are required: conf_file
You can also specify whether an argument is optional.
You can specify the expected type for the argument using the type key:
parser.add_argument("age", type=int, help="age of the person")
You can specify a default value for an argument using the default key.
This document will help you to understand it to an extent.
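Putting the last few points together, here is a minimal sketch (the argument names are illustrative only; the explicit list passed to parse_args() stands in for a real command line):

import argparse

parser = argparse.ArgumentParser()
# optional flag with an expected type and a default value
parser.add_argument("--age", type=int, default=18, help="age of the person")
# required positional argument
parser.add_argument("name", help="name of the person")

args = parser.parse_args(["Alice", "--age", "30"])
print(args.name, args.age)   # Alice 30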

Databricks Write Json file is too slow

I have a simple Scala snippet to read/write JSON files totalling 10 GB (with a dir mounted from a storage account); it took 1.7 hours, with almost all of the time spent in the line that writes the JSON file.
Cluster setup:
Azure Databricks DBR 7.3 LTS, Spark 3.0.1, Scala 2.12
11 workers + one driver of type Standard_E4as_v4 (each has 32.0 GB memory, 4 cores)
Why is writing so slow?
Isn't writing parallelized across partitions/workers, like reading is?
How can I speed up the writing, or the whole process?
Code for mount dir:
val containerName = "containerName"
val storageAccountName = "storageAccountName"
val sas = "sastoken"
val config = "fs.azure.sas." + containerName + "." + storageAccountName + ".blob.core.windows.net"

dbutils.fs.mount(
  source = "wasbs://containerName@storageAccountName.blob.core.windows.net/",
  mountPoint = "/mnt/myfile",
  extraConfigs = Map(config -> sas))
Code for read/write json files:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._
val jsonDF = spark.read.json("/mnt/myfile")
jsonDF.write.json("/mnt/myfile/spark_output_test")
Solved by using, I would say, better Spark APIs:
read  --> spark.read.format("json").load("path")
write --> res.write.format("parquet").save("path")
and by writing in Parquet format, as it is compressed and very well optimized for Spark.
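For illustration, the same read-then-write-as-Parquet flow in PySpark (a minimal sketch assuming the same /mnt/myfile mount as above; adjust the paths to your layout):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# read the JSON input in parallel across partitions
json_df = spark.read.format("json").load("/mnt/myfile")

# write compressed, Spark-optimized Parquet instead of JSON
json_df.write.format("parquet").mode("overwrite").save("/mnt/myfile/spark_output_test_parquet")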

Convert a pb file to a tflite file for running on the Coral Dev Board (Segmentation fault (core dumped))

How do I convert a pb file into a tflite file using Python 3 or in the terminal?
I don't know any of the details of the model. Here is the link to the pb file.
(Edited)
I have converted the pb file to a tflite file with the following code:
import tensorflow.compat.v1 as tf
import numpy as np

graph_def_file = "./models/20170512-110547.pb"

def representative_dataset_gen():
    for _ in range(num_calibration_steps):
        # Get sample input data as a numpy array in a method of your choosing.
        yield [input]

converter = tf.lite.TFLiteConverter.from_frozen_graph(graph_def_file,
                                                      input_arrays=["input", "phase_train"],
                                                      output_arrays=["embeddings"],
                                                      input_shapes={"input": [1,160,160,3], "phase_train": False})
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_model = converter.convert()
print("converting")
open("./models/converted_model.tflite", "wb").write(tflite_model)
print("Done")
Error: getting Segmentation fault (core dumped)
2020-01-20 11:42:18.153263: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-01-20 11:42:18.153363: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-01-20 11:42:18.153385: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-01-20 11:42:18.905028: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-01-20 11:42:18.906845: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-01-20 11:42:18.906874: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (kalgudi-GA-78LMT-USB3-6-0): /proc/driver/nvidia/version does not exist
2020-01-20 11:42:18.934144: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3616020000 Hz
2020-01-20 11:42:18.934849: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x39aa0f0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-01-20 11:42:18.934910: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Segmentation fault (core dumped)
Without any details of the model, you cannot convert it to a .tflite model. I suggest that you go through this document on Post-Training Quantization again, as there are too many details to reproduce here.
Here is an example of post-training quantization of a frozen graph. The model is taken from here (you can see that it's a full tarball of information about the model).
import sys, os, glob
import tensorflow as tf
import pathlib
import numpy as np

if len(sys.argv) != 2:
    print('Usage: <' + sys.argv[0] + '> <frozen_graph_file>')
    exit()

tf.compat.v1.enable_eager_execution()
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.DEBUG)

def fake_representative_data_gen():
    for _ in range(100):
        fake_image = np.random.random((1,192,192,3)).astype(np.float32)
        yield [fake_image]

frozen_graph = sys.argv[1]
input_array = ['input']
output_array = ['MobilenetV1/Predictions/Reshape_1']

converter = tf.compat.v1.lite.TFLiteConverter.from_frozen_graph(frozen_graph, input_array, output_array)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = fake_representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()

quant_dir = pathlib.Path(os.getcwd(), 'output')
quant_dir.mkdir(exist_ok=True, parents=True)
tflite_model_file = quant_dir/'mobilenet_v1_0.25_192_quant.tflite'
tflite_model_file.write_bytes(tflite_model)
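For reference, the script above takes the path to the frozen graph as its single argument; a run might look like this (the script and graph file names are hypothetical):

$ python3 post_training_quantize.py ./mobilenet_v1_0.25_192_frozen.pb

which writes output/mobilenet_v1_0.25_192_quant.tflite into the current working directory.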

How to read csv files from the local driver node using Spark?

I had to unzip files from Amazon S3 onto my driver node (Spark cluster), and I need to load all these CSV files as a Spark DataFrame, but I ran into the following problem when I tried to load the data from the driver node:
PySpark:
df = self.spark.read.format("csv").option("header", True).load("file:/databricks/driver/*.csv")
'Path does not exist: file:/folder/*.csv'
I tried to move all these files to dbfs using dbutils.fs.mv(), but I'm running a Python file and I can't use dbutils(). I think I need to broadcast the files but I don't know how, because I tried self.sc.textFile("file:/databricks/driver/*.csv").collect() and self.sc.addFile("file:/databricks/driver/*.csv"), and the process is not able to find the files.
UPDATE
When I ran this code:
import os

BaseLogs("INFO", os.getcwd())
folders = []
for r, d, f in os.walk(os.getcwd()):
    for folder in d:
        folders.append(os.path.join(r, folder))

for f in folders:
    BaseLogs("INFO", f)

BaseLogs("INFO", os.listdir("/databricks/driver/zipFiles/s3Sensor/2017/Tracking_Bounces_20190906.csv.zip"))
BaseLogs("INFO", os.listdir("/databricks/driver/zipFiles/s3Sensor/2017/Tracking_Opens_20190907.zip"))
I obtained:
Then I tried to do:
try:
    df = self.spark.read.format("csv").option("header", True).option("inferSchema", "true").load("file:///databricks/driver/zipFiles/s3Sensor/2017/Tracking_Bounces_20190906.csv.zip/Bounces.csv")
except Exception as e:
    BaseLogs("INFO", e)
    BaseLogs("INFO", "Reading {0} as Spark Dataframe".format("file://" + file + ".csv"))
    df = self.spark.read.format("csv").option("header", True).option("inferSchema", "true").load("file://" + file + ".csv")
I obtained the following error:
2019-10-24T15:16:25.321+0000: [GC (Allocation Failure) [PSYoungGen: 470370K->14308K(630272K)] 479896K->30452K(886784K), 0.0209171 secs] [Times: user=0.04 sys=0.01, real=0.02 secs]
2019-10-24T15:16:25.977+0000: [GC (Metadata GC Threshold) [PSYoungGen: 211288K->20462K(636416K)] 227432K->64316K(892928K), 0.0285984 secs] [Times: user=0.04 sys=0.02, real=0.02 secs]
2019-10-24T15:16:26.006+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 20462K->0K(636416K)] [ParOldGen: 43854K->55206K(377344K)] 64316K->55206K(1013760K), [Metaspace: 58323K->58323K(1099776K)], 0.1093583 secs] [Times: user=0.31 sys=0.02, real=0.12 secs]
2019-10-24T15:16:28.333+0000: [GC (Allocation Failure) [PSYoungGen: 612077K->23597K(990720K)] 667283K->78811K(1368064K), 0.0209207 secs] [Times: user=0.02 sys=0.01, real=0.02 secs]
INFO: An error occurred while calling o195.load. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.31.252.216, executor 0): java.io.FileNotFoundException: File file:/databricks/driver/zipFiles/s3Sensor/2017/Tracking_Bounces_20190906.csv.zip/Bounces.csv does not exist. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:248)
  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
You could try to read your data into a pandas dataframe:
import pandas as pd
pdf = pd.read_csv("file:/databricks/driver/xyz.csv")
and convert it to a spark dataframe:
df = spark.createDataFrame(pdf)
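Since the question reads many CSV files (file:/databricks/driver/*.csv) and pandas.read_csv takes one file at a time, here is a minimal sketch of applying this answer to the whole directory (the driver-local path is taken from the question; compatible columns across files are assumed):

import glob
import pandas as pd

# read every driver-local CSV into one pandas dataframe
files = glob.glob("/databricks/driver/*.csv")
pdf = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# convert to a Spark dataframe
df = spark.createDataFrame(pdf)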
Try this
scala> val test = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("file:///path/to/csv/testcsv.csv")