Apache NiFi issue with saving data from JSON to ORC

I am using NiFi with ConvertJSONToAvro -> ConvertAvroToORC -> PutHDFS, but I am facing the following issues.
1) A single ORC file is being saved on HDFS. I am not using any compression.
2) When I try to access these files, I get errors such as buffer memory errors.
Thanks in advance for your help.

You should be merging together many Avro records before ConvertAvroToORC.
You could do this by using MergeContent with the mode set to Avro right before ConvertAvroToORC.
You could also do this by merging your JSON together using MergeContent, and then sending the merged JSON to ConvertJsonToAvro.
Using PutHDFS to append to ORC files that are already in HDFS will not work. The HDFS processor knows nothing about the format of the data; it simply appends raw bytes to the file, which will likely produce an invalid ORC file.
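For reference, a minimal sketch of the MergeContent settings for this approach (property names as they appear on the processor; the entry counts below are placeholder values you would tune for your data volume and latency requirements):
Merge Strategy: Bin-Packing Algorithm
Merge Format: Avro
Minimum Number of Entries: 10000
Maximum Number of Entries: 100000
Max Bin Age: 5 min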

Related

How do we name the files that are streamed via Firehose?

I'm building an architecture using boto3, and I want to dump the data in JSON format from an API to S3. What blocks my way right now is, first, that Firehose does NOT support JSON output as such; my current workaround is to not compress the records, but the result is still different from a plain JSON file. I would still like a better option to make the files more compatible.
And second, the file names can't be customized. All the data I collect will eventually be queried through Athena, so can boto3 do the naming?
Answering a couple of the questions you have. First, if you stream JSON into Firehose, it will write JSON to S3. JSON is the data structure; compression is just the file encoding, and compressing JSON doesn't make it something else. You'll just need to decompress it before consuming it.
RE: file naming, you shouldn't care about that. Let the system name the files whatever it wants. If you define the Athena table with that S3 location, you'll be able to query it, and when new files are added you'll be able to query them immediately.
Here is an AWS tutorial that walks you through this process: JSON stream to S3 with Athena query.
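For the producer side, a minimal sketch with boto3 (the stream name and payload below are placeholders): send each record as newline-delimited JSON, and Firehose will land it in S3 as-is, or gzipped if you enable compression on the delivery stream.
import json
import boto3

firehose = boto3.client("firehose")

record = {"id": 1, "value": "example"}  # placeholder payload
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",  # placeholder stream name
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
Writing one JSON object per line keeps the resulting S3 objects directly queryable by Athena's JSON SerDe.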

Moving a JSON file from Databricks to blob storage

I have created a mount in Databricks which connects to my blob storage, and I am able to read files from the blob into Databricks using a notebook.
I then converted a .txt file to JSON format using PySpark, and now I would like to load it back to blob storage. Does anyone know how I would do that?
Here are a few things I have tried:
my_json.write.option("header", "true").json("mnt/my_mount/file_name.json")
write.json(my_json, mnt/my_mount)
Neither works. I can load a CSV file from Databricks to blob storage using:
my_data_frame.write.option("header", "true").csv("mnt/my_mount_name/file name.csv")
This works fine, but I can't find a solution for moving a JSON file.
Any ideas?
Disclaimer: I am new to PySpark, but this is what I did after referencing the docs for pyspark.sql.DataFrameWriter.json:
# JSON (writes a directory of part files)
my_dataframe.write.json("/mnt/my_mount/my_json_file_name.json")
# For a single JSON part file (still inside a directory)
my_dataframe.repartition(1).write.json("/mnt/my_mount/my_json_file_name.json")
# Parquet, overwriting and partitioning by a column
my_dataframe.write.mode("overwrite").partitionBy("myCol").parquet("/mnt/my_mount/my_parquet_file_name.parquet")
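If you need one named file rather than a directory of part files, one approach in a Databricks notebook (where dbutils is available; the paths below are placeholders) is to write a single partition to a temporary directory and then move the lone part file:
# Write a single part file to a temporary directory (placeholder paths)
out_dir = "/mnt/my_mount/tmp_json_out"
my_dataframe.repartition(1).write.mode("overwrite").json(out_dir)

# Locate the one part file and move it to a clean name
part_file = [f.path for f in dbutils.fs.ls(out_dir) if f.name.startswith("part-")][0]
dbutils.fs.mv(part_file, "/mnt/my_mount/my_json_file_name.json")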

Can we merge a .CSV file and a .RAR file in Hive (Hadoop tools)?

Can you suggest how we can merge different types of files?
Merging different types of files directly cannot be accomplished; each file type has its own way of compressing and storing data.
A RAR file, on the other hand, is not usually used in Hadoop. Other formats like Parquet, ORC, and JSON can be merged by first converting the files to the same type.
For example, if the requirement is to merge Parquet and JSON files, the Parquet files can be converted into JSON using tools like parquet-tools.jar, and the files can then be merged by loading them into tables with the appropriate schema.
Hope this helps!
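One alternative to the parquet-tools route is to do the conversion and merge with Spark instead; a rough PySpark sketch (the paths are placeholders, and it assumes both datasets have compatible schemas):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read each source in its native format (placeholder paths)
df_parquet = spark.read.parquet("/data/parquet_files/")
df_json = spark.read.json("/data/json_files/")

# Union by column name, assuming the schemas line up
merged = df_parquet.unionByName(df_json)

# Write the merged data back out in one common format (ORC here)
merged.write.mode("overwrite").orc("/data/merged_orc/")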

Read Array Of Jsons From File to Spark Dataframe

I have a gzipped JSON file that contains an array of JSON objects, something like this:
[{"Product":{"id":1,"image":"/img.jpg"},"Color":"black"},{"Product":{"id":2,"image":"/img1.jpg"},"Color":"green"}.....]
I know this is not the ideal data format to read into Scala; however, there is no alternative but to process the feed in this manner.
I have tried :
spark.read.json("file-path")
which seems to take a long time (it processes quickly for data in MBs, but takes far too long for GBs of data), probably because Spark is not able to split the file and distribute it across the other executors.
I wanted to see if there is any way to preprocess this data and load it into the Spark context as a DataFrame.
The functionality I want seems similar to: Create pandas dataframe from json objects. But I wanted to see if there is a Scala alternative that could do something similar and convert the data to a Spark RDD / DataFrame.
You can read the gzip file using spark.read().text("gzip-file-path"). Since Spark's APIs are built on top of the HDFS API, Spark can read the gzip file and decompress it to read the contents.
https://github.com/mesos/spark/blob/baa30fcd99aec83b1b704d7918be6bb78b45fbb5/core/src/main/scala/spark/SparkContext.scala#L239
However, gzip is non-splittable, so Spark creates an RDD with a single partition. Hence, reading large gzip files with Spark does not make much sense.
You may decompress the gzip file and read the decompressed files to get the most out of the distributed processing architecture.
It appeared to be a problem with the data format being given to Spark for processing. I had to pre-process the data into a Spark-friendly format and run the Spark jobs over that. This is the preprocessing I ended up doing: https://github.com/dipayan90/bigjsonprocessor/blob/master/src/main/java/com/kajjoy/bigjsonprocessor/Application.java
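As a small-scale illustration of that kind of preprocessing (this version loads the whole array into memory, so it is only a sketch; the linked project streams the file instead, and the file names here are placeholders), the gzipped array can be rewritten as newline-delimited JSON, which Spark can split across executors:
import gzip
import json

# Decompress and parse the single top-level JSON array (placeholder file names)
with gzip.open("input.json.gz", "rt") as f:
    records = json.load(f)

# Rewrite as newline-delimited JSON: one object per line
with open("output.jsonl", "w") as out:
    for record in records:
        out.write(json.dumps(record) + "\n")

# Spark can now parallelize the parse: spark.read.json("output.jsonl")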

avoid splitting json output by pyspark (v. 2.1)

Using Spark v2.1 and Python, I load JSON files with
sqlContext.read.json("path/data.json")
I have a problem with the output JSON. Using the command below,
df.write.json("path/test.json")
the data is saved in a folder called test.json (not a file), which includes two files: one empty and the other with a strange name:
part-r-00000-f9ec958d-ceb2-4aee-bcb1-fa42a95b714f
Is there any way to get a clean, single JSON output file?
Thanks
Yes, Spark writes the output in multiple files when you try to save, because the computation is distributed. The output is written as multiple part files like part-r-00000-f9ec958d-ceb2-4aee-bcb1-fa42a95b714f, and the number of files created is equal to the number of partitions.
If your data is small and fits in memory, then you can save your output as a single file. But if your data is large, saving to a single file is not the suggested way.
Actually, test.json is a directory and not a JSON file. It contains multiple part files inside it. This does not create any problem for you; you can easily read it back later.
If you still want your output in a single file, then you need to repartition to 1, which brings all your data to a single node and saves it there. This may cause issues if you have a lot of data.
df.repartition(1).write.json("path/test.json")
Or, avoiding a full shuffle, coalesce to a single partition:
df.coalesce(1).write.json("path/test.json")
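Either way, the test.json directory can be read back as a single dataset with spark.read.json("path/test.json"). If you truly need one file with a clean name and the output lives on the local filesystem (an assumption; for HDFS or S3 you would use the corresponding filesystem API instead), you can promote the single part file after writing with one partition:
import glob
import shutil

# Assumes a local filesystem path; there is exactly one part file after repartition(1)/coalesce(1)
part_file = glob.glob("path/test.json/part-*")[0]
shutil.copy(part_file, "path/test_single.json")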