Exception handling when bulk copying Avro files to Amazon Redshift

I am trying to load Avro files from S3 into a table in Redshift. One of the Avro files doesn't have a correct format. The problem is that when the COPY command tries to load that file, it throws an exception and doesn't run the copy for the correct files. How can I skip the wrongly formatted file and copy the correct files? Here is my code for loading the files:
COPY tmp.table
FROM 's3://{BUCKET}/{PREFIX}'
IAM_ROLE '{ROLE}'
FORMAT AVRO 's3://{BUCKET}/{AVRO_PATH}'
The error that I am getting is:
code: 8001
context: Cannot init avro reader from s3 file Incorrect Avro container file magic number
query: 19308992
location: avropath_request.cpp:438
process: query0_125_19308992 [pid=23925]

You can preprocess the s3://{BUCKET}/{PREFIX} files and create a manifest file with only the Avro files that have the right format/schema. Redshift can't do this for you and will try to process all files on the s3://{BUCKET}/{PREFIX} path.
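Below is a minimal sketch of that preprocessing step, assuming boto3 is available and reusing the {BUCKET}/{PREFIX}/{ROLE}/{AVRO_PATH} placeholders from the question; the manifest key name and the magic-number check are illustrative, not the only way to validate the files:

import json
import boto3  # assumed available: AWS SDK for Python

BUCKET = "{BUCKET}"                              # placeholder from the question
PREFIX = "{PREFIX}"                              # placeholder from the question
MANIFEST_KEY = "manifests/valid_avro.manifest"   # hypothetical output key

AVRO_MAGIC = b"Obj\x01"  # Avro object container files start with these four bytes

s3 = boto3.client("s3")
entries = []

# Keep only the objects under the prefix whose first four bytes look like an Avro container file.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if obj.get("Size", 0) < 4:
            continue  # skip folder markers / objects too small to be Avro
        header = s3.get_object(Bucket=BUCKET, Key=obj["Key"], Range="bytes=0-3")["Body"].read()
        if header == AVRO_MAGIC:
            entries.append({"url": f"s3://{BUCKET}/{obj['Key']}", "mandatory": True})

# Write the manifest that COPY will consume.
s3.put_object(
    Bucket=BUCKET,
    Key=MANIFEST_KEY,
    Body=json.dumps({"entries": entries}).encode("utf-8"),
)

Then point COPY at the manifest instead of the prefix and add the MANIFEST keyword, for example:
COPY tmp.table
FROM 's3://{BUCKET}/manifests/valid_avro.manifest'
IAM_ROLE '{ROLE}'
FORMAT AVRO 's3://{BUCKET}/{AVRO_PATH}'
MANIFEST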

Related

Databricks csv cannot find local file

In a program I have a CSV extracted from Excel, and I need to upload the CSV to HDFS and save it in Parquet format. It doesn't matter which Python version or Spark version; no Scala, please.
Almost all the discussions I came across are about the Databricks CSV package; however, it seems it cannot find the file. Here is the code and error:
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema","true").option("delimiter",",").load("file:///home/rxie/csv_out/wamp.csv")
Error:
java.io.FileNotFoundException: File file:/home/rxie/csv_out/wamp.csv
does not exist
The file path:
ls -la /home/rxie/csv_out/wamp.csv
-rw-r--r-- 1 rxie linuxusers 2896878 Nov 12 14:59 /home/rxie/csv_out/wamp.csv
Thank you.
I found the issue now!
The reason it errors out with file not found is actually correct: I was using a Spark context with setMaster("yarn-cluster"), which means all worker nodes will look for the CSV file, and of course all worker nodes (except the one starting the program, where the CSV resides) do not have this file and hence error out. What I really should do is use setMaster("local").
FIX:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Run in local mode so the file on this machine is visible to Spark.
conf = SparkConf().setAppName('test').setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
csv = "file:///home/rxie/csv_out/wamp.csv"
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter", ",").load(csv)
Yes, you are right, the file should be present at all worker nodes.
Well, you can still read a local file in yarn cluster mode; you just need to add your file using addFile.
spark.sparkContext.addFile("file:///your local file path ")
Spark will copy the file to each node where an executor will be created, so it can process your file in cluster mode as well.
I am using Spark 2.3, so you can adjust your Spark context accordingly, but the addFile method remains the same.
Try this with your YARN (cluster mode) setup and let me know if it works for you.
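A minimal sketch of this addFile approach, assuming Spark 2.x with a SparkSession named spark, the wamp.csv path from the question above, and that the com.databricks.spark.csv package is on the classpath; SparkFiles.get resolves the node-local copy of a file registered with addFile:

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("addfile-example").getOrCreate()

# Distribute the driver-local file to every node that runs an executor.
spark.sparkContext.addFile("file:///home/rxie/csv_out/wamp.csv")

# SparkFiles.get returns the local path of the distributed copy on whichever node calls it.
df = (spark.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("file://" + SparkFiles.get("wamp.csv")))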

Read file from Cloudera CDSW Project with PySpark

I have a file sitting in my Cloudera project under "/home/cdsw/npi.json". I've tried using the following commands to read it with PySpark from my "local" CDSW project, but can't get at it with any of them. They all throw a "Path does not exist: " error.
npi = sc.read.format("json").load("file:///home/cdsw/npi.json")
npi = sc.read.format("json").load("file:/home/cdsw/npi.json")
npi = sc.read.format("json").load("home/cdsw/npi.json")
As per this documentation, Accessing Data from HDFS
From the terminal, copy the file from the local file system to HDFS, using either -put or -copyFromLocal.
hdfs dfs -put /home/cdsw/npi.json /destination
where /destination is a path in HDFS.
Then, read the file in PySpark.
npi = sc.read.format("json").load("/destination/npi.json")
For more information:
put
put [-f] [-p] [-l] <localsrc> ... <destination>
Copy files from the local file system into fs. Copying fails if the file already
exists, unless the -f flag is given.
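Putting the two steps together, here is a minimal sketch, assuming a SparkSession named spark and the /destination path from the answer above; the subprocess call is just one way to run the hdfs CLI step from Python:

import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-npi").getOrCreate()

# Step 1: copy the project-local file into HDFS (-f overwrites an existing copy).
subprocess.run(["hdfs", "dfs", "-put", "-f", "/home/cdsw/npi.json", "/destination/"], check=True)

# Step 2: read it with Spark; the path now refers to HDFS, not the local file system.
npi = spark.read.format("json").load("/destination/npi.json")
npi.show(5)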

Neo.ClientError.Statement.ExternalResourceFailed on ubuntu

I'm trying to import a CSV file into a Neo4j DB using this script:
LOAD CSV FROM "file:///dataframe6.txt" AS line
RETURN count(*)
But I get the following error:
Neo.ClientError.Statement.ExternalResourceFailed
Couldn't load the external resource at: file:/home/gaurav/sharing/dataframe6.txt
P.S.: I'm using an Ubuntu machine.
I added these lines to the Neo4j configuration: dbms.directories.import=/home/gaurav/sharing/
and dbms.security.allow_csv_import_from_file_urls=true
I also had to change the folder permissions for the neo4j user; it works now.

Compress text file (CSV data) using LZ4 and read it in Spark

I was using the Linux command-line lz4 tool to compress the CSV file.
Example:
lz4 input.csv
which produces input.csv.lz4 as output.
But when I try to read the lz4 file in the Spark shell using the following commands, it always returns an empty result.
val output = sparkSession.read.format("com.databricks.spark.csv").option("delimiter", "\t").load("s3:///input.csv.lz4")
output.count
res: Long = 0
I found somewhere that the lz4 command-line tool might be incompatible with Spark:
https://forums.databricks.com/questions/7957/how-can-i-read-in-lz4-compressed-json-files.html
Has anyone got Spark to read lz4 files? If yes, how was the lz4 file created?

Pig-Loading Json file which is present on Azure Storage

I am trying to load a JSON file (Email_Master.json) that is present in an Azure storage container using a Pig script. The JSON file was generated by a Pig script and stored in the Azure container. Below is an image of how the file looks in the container.
I am facing the following error while loading the file with a Pig script through PowerShell:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1131: Could not find schema file
The command used is
a = LOAD '$Azure_Path/Email_Master.json' USING JsonLoader();
How can I resolve this issue?
The issue is with the default container that is specified while provisioning the HDInsight cluster. The schema and header files are stored in the default container.