While using Caffe, I convert my images to an LMDB file as the input dataset. I got two files after the conversion. One is data.mdb, which I figured is where the data is stored. The other one, lock.mdb, I wonder what is in this file and what it is used for?
I am trying to save a DataFrame as a document, but it returns the error below:
java.lang.ClassNotFoundException: Failed to find data source: docx. Please find packages at http://spark.apache.org/third-party-projects.html
My code is below:
#f_data is my dataframe with data
f_data.write.format("docx").save("dbfs:/FileStore/test/test.csv")
display(f_data)
Note that I can save files in CSV, text, and JSON format, but is there any way to save a DOCX file using PySpark?
My question: is there support for saving data in DOC/DOCX format?
If not, is there any way to store the file, for example by writing a file stream object into a particular folder/S3 bucket?
In short: no, Spark does not support the DOCX format out of the box. You can still collect the data onto the driver node (e.g. into a pandas DataFrame) and work from there.
Long answer:
A document format like DOCX is meant for presenting information in small tables with style metadata. Spark focuses on processing large amounts of data at scale, and it does not support the DOCX format out of the box.
If you want to write DOCX files programmatically, you can:
Collect the data into a pandas DataFrame: pd_f_data = f_data.toPandas()
Use a Python package to create the DOCX document and save it into a stream. See this question: Writing a Python Pandas DataFrame to Word document
Upload the stream to an S3 blob, for example with boto: Can you upload to S3 using a stream rather than a local file?
Note: if your data has more than a hundred rows, ask the recipients how they are going to use it. Use DOCX for reporting, not as a file transfer format.
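Putting the three steps together, here is a minimal sketch. It assumes the python-docx and boto3 packages are installed on the driver, and the bucket and key names are placeholders:

import io
import boto3                  # AWS SDK for Python (assumed available)
from docx import Document     # python-docx package (assumed available)

# 1) Collect the Spark DataFrame onto the driver as a pandas DataFrame
pd_f_data = f_data.toPandas()

# 2) Build the DOCX in memory: a table with a header row plus one row per record
doc = Document()
table = doc.add_table(rows=1, cols=len(pd_f_data.columns))
for cell, col in zip(table.rows[0].cells, pd_f_data.columns):
    cell.text = str(col)
for _, row in pd_f_data.iterrows():
    for cell, value in zip(table.add_row().cells, row):
        cell.text = str(value)

stream = io.BytesIO()
doc.save(stream)
stream.seek(0)

# 3) Upload the in-memory stream to S3 (placeholder bucket and key)
boto3.client("s3").upload_fileobj(stream, "my-bucket", "reports/f_data.docx")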
Generally, the data arrives one XML (ACORD) document at a time. There is a process which converts each XML into multiple CSV files (around 200+). Is it possible to merge all 200 files into one Parquet file, or does Parquet only support one CSV structure per file?
Can you suggest how we can merge different types of files?
Merging different types of files cannot be accomplished directly. Each file type has its own way of compressing and storing data.
A RAR file, on the other hand, is not commonly used in Hadoop. If the files are in other formats like Parquet, ORC, or JSON, they can be merged by converting them to the same type.
For example, if the requirement is to merge Parquet and JSON files, the Parquet files can be converted to JSON using tools like parquet-tools.jar, and the files can then be merged by loading them into a table with an appropriate schema.
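As a hedged illustration of the "convert to a common structure, then merge" idea in PySpark (spark is a SparkSession, the paths are placeholders, and the two sources are assumed to share compatible columns):

# Read the two sources; paths are placeholders
df_parquet = spark.read.parquet("hdfs:///data/source_parquet/")
df_json = spark.read.json("hdfs:///data/source_json/")

# Align the JSON columns to the Parquet schema, then merge the rows
aligned_json = df_json.select(df_parquet.columns)
merged = df_parquet.union(aligned_json)

# Write the merged data out as a single Parquet file
merged.coalesce(1).write.mode("overwrite").parquet("hdfs:///data/merged/")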
Hope this helps!
I am using NiFi with the flow ConvertJSONToAvro -> ConvertAvroToORC -> PutHDFS, but I am facing the following issues:
1) Single ORC files are being saved on HDFS. I am not using any compression.
2) When I try to access these files, I get errors such as buffer memory errors.
Thanks in advance for any help.
You should be merging together many Avro records before ConvertAvroToORC.
You could do this by using MergeContent with the Merge Format set to Avro, right before ConvertAvroToORC.
You could also do this by merging your JSON together using MergeContent, and then sending the merged JSON to ConvertJsonToAvro.
Using PutHDFS to append to ORC files that are already in HDFS will not work. The HDFS processor does not know anything about the format of the data; it just writes additional raw bytes onto the file and will likely produce an invalid ORC file.
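As a rough sketch of the MergeContent configuration before ConvertAvroToORC (the property names come from the standard MergeContent processor; the threshold values are placeholders you would tune for your record sizes):

Merge Strategy            = Bin-Packing Algorithm
Merge Format              = Avro
Minimum Number of Entries = 10000     (placeholder: records per output ORC file)
Maximum Number of Entries = 100000    (placeholder)
Max Bin Age               = 5 min     (flush partially filled bins after this long)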
Using Spark v2.1 and Python, I load JSON files with
sqlContext.read.json("path/data.json")
I have a problem with the JSON output. Using the command below,
df.write.json("path/test.json")
the data is saved in a folder called test.json (not a file), which contains two files: one empty and the other with a strange name:
part-r-00000-f9ec958d-ceb2-4aee-bcb1-fa42a95b714f
Is there any way to get a clean, single JSON output file?
thanks
Yes, Spark writes the output in multiple files when you save. Since the computation is distributed, the output is written as multiple part files like part-r-00000-f9ec958d-ceb2-4aee-bcb1-fa42a95b714f. The number of files created is equal to the number of partitions.
If your data is small and fits in memory, then you can save your output to a single file. But if your data is large, saving to a single file is not the suggested way.
Actually, test.json is a directory, not a JSON file. It contains multiple part files inside it. This does not create any problem for you; you can easily read it back later.
If you still want your output in a single file, then you need to repartition to 1 partition, which brings all your data to a single node before saving. This may cause issues if you have large data.
df.repartition(1).write.json("path/test.json")
Or, to avoid the full shuffle that repartition(1) triggers:
df.coalesce(1).write.json("path/test.json")
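For completeness, reading the output directory back later works the same way whether it contains one part file or many; Spark picks up every part file inside it:

# read every part file inside the test.json directory back into a DataFrame
df_back = sqlContext.read.json("path/test.json")
df_back.show()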