avoid splitting json output by pyspark (v. 2.1) - json

Using Spark v2.1 and Python, I load JSON files with
sqlContext.read.json("path/data.json")
I have a problem with the JSON output. Using the command below
df.write.json("path/test.json")
the data is saved in a folder called test.json (not a file), which contains two files: one empty and the other with a strange name:
part-r-00000-f9ec958d-ceb2-4aee-bcb1-fa42a95b714f
Is there any way to get a clean, single JSON output file?
Thanks

Yes, Spark writes the output as multiple files when you save. Because the computation is distributed, the output is written as multiple part files such as part-r-00000-f9ec958d-ceb2-4aee-bcb1-fa42a95b714f. The number of files created equals the number of partitions.
If your data is small and fits in memory, you can save the output as a single file. If your data is large, writing a single file is not recommended.
Note that test.json is actually a directory, not a JSON file, and it contains the part files. This does not cause any problem: you can read the directory back later just as easily.
If you still want the output in a single file, repartition to 1, which brings all your data to a single node before saving. This may cause issues if you have a lot of data.
df.repartition(1).write.json("path/test.json")
Or
df.coalesce(1).write.json("path/test.json")
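Even with a single partition, Spark still writes a directory that holds a part file. A minimal sketch of one way to end up with a plain file, assuming the output directory sits on a local filesystem (not HDFS or object storage), is to concatenate the part files after the write. The function name merge_part_files and the paths are hypothetical and not part of Spark.

```python
import glob
import os
import shutil


def merge_part_files(output_dir, target_file):
    """Concatenate the part files in output_dir into one JSON Lines file."""
    # Data files are named part-*; _SUCCESS and hidden .crc files do not match.
    part_paths = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    with open(target_file, "wb") as target:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                # Each part file already holds one JSON object per line.
                shutil.copyfileobj(part, target)
    return len(part_paths)
```

For example, after df.write.json("path/test.json"), calling merge_part_files("path/test.json", "path/single.json") would leave one JSON Lines file next to the directory.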

Related

Best Approach to read large number of JSON Files (1 JSON per file) in Databricks

Hope everyone is doing well.
I am trying to read a large number of JSON files, around 150,000 files in one folder, using Azure Databricks. Each file contains a single JSON document, i.e. 1 record per file. Currently the read takes over an hour just to read all the files, despite a huge cluster. The files are read using a path pattern, as shown below.
import org.apache.spark.sql.functions.input_file_name
val schema_variable = <schema>
val file_path = "src_folder/year/month/day/hour/*/*.json"
// e.g. src_folder/2022/09/01/10/*/*.json
val df = spark.read
  .schema(schema_variable)
  .json(file_path)
  // record which file each row came from
  .withColumn("file_name", input_file_name())
Is there any approach or option we can try to make the reads faster?
We have already considered copying the file contents into a single file and then reading that, but then we lose the lineage of the file content, i.e. which record came from which file.
I have also gone through various links in SO, but most of them seem to be around a single/multiple files of huge size say 10GB to 50GB.
Environment - Azure Databricks 10.4 Runtime.
Thank you for all the help.
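On the lineage concern, a minimal sketch of merging the small files while keeping the source path inside each record could look like the following. It assumes the files are reachable through a local or mounted path, that each file holds exactly one JSON object, and that the field name file_name is not already used in the data. The function name merge_with_lineage is hypothetical, and the loop is single-threaded, so it only illustrates the idea rather than a tuned solution.

```python
import glob
import json


def merge_with_lineage(pattern, target_file, name_field="file_name"):
    """Write one JSON object per line, tagging each with its source path."""
    count = 0
    with open(target_file, "w", encoding="utf-8") as target:
        for path in sorted(glob.glob(pattern)):
            with open(path, encoding="utf-8") as source:
                record = json.load(source)  # assumes one JSON object per file
            record[name_field] = path  # remember which file the record came from
            target.write(json.dumps(record) + "\n")
            count += 1
    return count
```

The merged file can then be read as JSON Lines, with the file_name field playing the role that input_file_name() plays in the original read.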

ADF Merge-Copying JSON files in Copy Data Activity creates error for Mapping Data Flow

I am trying to do some optimization in ADF. In my setup, a third-party tool copies one JSON file per object to a Blob storage container. These files feed a Mapping Data Flow (MDF). The individual files written by the third-party tool work great. If I copy these files to a different Blob folder using an Azure Copy Data activity, the MDF can no longer parse them and gives an error: "JSON parsing error, unsupported encoding or multiline." I started with a Merge Files copy, but the outcome is the same regardless of the copy behavior I choose.
2ND EDIT: After another day's work, I have found that the Copy Activity with Merge Files from JSON to JSON definitely adds an EOL character to each JSON object as it is written to the merged file. I have also found that the MDF definitely fails with those EOL characters in the merged file. If I remove all EOL characters from the merged file, the same MDF works. To me, this is a bug: the copy activity adds a character that breaks the MDF. There also seems to be a second issue in some of my data, which does not fail as an individual file but does break the MDF once the files are concatenated. Still, I have tested the basic behavior on 1-5000 files and been able to repeat the fail/success tests.
I took the original file and the copied file and ran them through all sorts of tests. This is what I eventually found when I opened them in Notepad++:
Copied file:
{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name",}]}}\r\n
Original file:
{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name",}]}}\n
If I change the copied file from ending with \r\n to \n, the MDF can read the file again. What is going on here? And how do I change the file write behavior or the MDF settings so that I can concatenate or copy files without the CRLF?
EDIT: NEW INFORMATION -- It seems on further review like maybe the minification/whitespace removal is the culprit. If I download the file created by the ADF copy and format it using a JSON formatter, it works. Maybe the CRLF -> LF masked something else. I'm not sure what to do at this point, but it's super frustrating.
Other possibly relevant information:
Both the source and sink JSON datasets are set to use UTF-8 (not default(UTF-8), although I tried that). Would a different encoding fix this?
I have tried remapping schemas, creating new data sets, creating new Mapping Data Flows, still get the same error.
EDITED for clarity based on comments:
In the case of a single JSON element in a file, I can get this to work -- data preview returns same success or failure as pipeline when run
In the case of multiple documents merged by ADF I get the below instead.
Repro: Create any valid JSON as a single file, put it in blob storage, use it as a source in a mapping data flow, to do any sink operation. Create a second file with same schema, get them both to run in same flow using wildcard paths. Use a Copy Activity with Merge Files as the Sink Copy Activity and Array of Objects as the File pattern. Try to make your MDF use this new file. If it fails, download the file created by ADF, run it through a formatter (I have used both VS Code -> "Format Document" from standard VS Code JSON extension, and VS 2019 "Unminify" command) and reupload... It should work now.
I don't know if you have already solved the problem, but I came across the exact same problem 3 days ago and, after several tries, found a solution:
In the Copy Data activity, under the sink settings, use "set of objects" (instead of "array of objects") as the File Pattern, so that the merged JSON file holds the content of each original small JSON file on its own line.
In the MDF, after setting up the wildcard paths with the *.json pattern, select "Document per line" as the Document form under JSON Settings.
After that you should be good to go; at least it solved my problem. The CRLF that is written automatically with the "array of objects" setting in the Copy Data activity should be just a default, and MSFT should provide an option in the settings to omit it in the future.
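To make the difference between the two file patterns concrete, here is a small sketch in plain Python (not ADF). The sample content is made up, and the function names are hypothetical; it only illustrates the two shapes of merged file as I understand them: one JSON array holding every object, versus one standalone JSON document per line.

```python
import json


def parse_array_of_objects(text):
    """Parse a merged file that is a single JSON array of objects."""
    return json.loads(text)


def parse_document_per_line(text):
    """Parse a merged file that holds one JSON document per line."""
    # splitlines() handles both LF and CRLF line endings.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


array_text = '[{"ID": "1"},\r\n{"ID": "2"}]'
lines_text = '{"ID": "1"}\r\n{"ID": "2"}\r\n'
```

Both inputs decode to the same two objects here, but a reader configured for one shape will not accept the other, which is why the sink File Pattern and the MDF Document form have to agree.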
According to my test:
1. The Copy Data activity does not change Unix (LF) line endings to Windows (CRLF).
2. The MDF can parse both Unix (LF) files and Windows (CRLF) files.
Maybe something else is wrong.
By the way, I see there is a comma after "name":"Customer Name" in your original file. I deleted it before my test.
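A quick way to check both points outside ADF is to run the sample through a strict JSON parser. The sketch below uses Python's json module, which is an assumption about strictness only: the parser inside the MDF may be more lenient or behave differently. The helper name is_valid_json is hypothetical.

```python
import json


def is_valid_json(text):
    """Return True if text parses as strict JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# The sample from the question, with and without the trailing comma.
with_comma = '{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name",}]}}'
without_comma = '{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name"}]}}'
```

With this parser, the trailing comma is rejected whatever the line ending is, while the version without the comma is accepted with either \n or \r\n appended.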

Read a flat file in Pentaho Spoon and then export its metadata into a CSV

I am wondering if it is possible to extract the metadata of a flat file in a CSV using Pentaho Spoon. What I mean by that is for example get a CSV file input step, choose the file you want to read and then somehow get access to the metadata of that file and export it into a CSV.
I found in the documentation a step called Metadata Structured that was introduced in 3.1.0, but I can't find it in the latest version of Spoon; maybe it has been removed by now.
Update: I found the "Metadata structure of stream" step, which almost does what I need. Right now my transformation looks like this: csv file input -> metadata structure of stream -> text file output. The problem is that it doesn't extract all the metadata: it doesn't extract Format, Decimal and Group. It also gives me an Origin column that I don't really need and have to get rid of.
Update2: I keep trying to get at the missing columns, but the Metadata structure of stream step only outputs the columns "Position,Fieldname,Comments,Type,Length,Precision,Origin", so I cannot access, for example, the Format column, even though it is an input to the step :( I can't find a workaround for this.

Spark read.json with file names

I need to read a bunch of JSON files from an HDFS directory. After I'm done processing, Spark needs to place the files in a different directory. In the meantime, there may be more files added, so I need a list of files that were read (and processed) by Spark, as I do not want to remove the ones that were not yet processed.
The function read.json converts the files immediately into DataFrames, which is cool but it does not give me the file names like wholeTextFiles. Is there a way to read JSON data while also getting the file names? Is there a conversion from RDD (with JSON data) to DataFrame?
From version 1.6 on, you can use input_file_name() to get the name of the file in which a row is located. The names of all the files that were read can then be obtained with a distinct on that column.
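A minimal PySpark sketch of that idea follows. It assumes a SparkSession named spark (Spark 2.x; on 1.6 the equivalent entry point would be sqlContext.read), and the function name and the column name file_name are hypothetical choices for illustration.

```python
def read_json_with_file_names(spark, path):
    """Return the DataFrame with a file_name column and the distinct file names."""
    # Imported here so the sketch can be loaded where PySpark is not installed.
    from pyspark.sql.functions import input_file_name

    # Tag every row with the path of the file it was read from.
    df = spark.read.json(path).withColumn("file_name", input_file_name())
    # Collect the distinct source files, i.e. the files this read actually used.
    file_names = [row.file_name for row in df.select("file_name").distinct().collect()]
    return df, file_names
```

The returned list can then drive the move step, so that files added to the directory after the read are left alone.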

tiffcp.exe merging a results file with a results file in a loop

I am building a web app that takes several tiff image files and merges them together into one single tiff image file using GNUWin32 tiffcp.exe from command line.
The way I was doing it was to loop through the file list and build a string of file names to merge into one single variable.
strfileList = "c:folder\folder\folder\aased98-def-wsdeff-434fsdsd-dvv.tif c:folder\folder\folder\aased98-def-wsdeff-434fsdsd-axs.tif c:folder\folder\folder\aased98-def-wsdeff-434fsdsd-dxzs.tif"
Then I would just write to the command line:
tiffcp.exe strFileList results.tif
The file names are guids and so the paths are fairly long and I do not have any control to shorten them. So if I have a bunch of these documents (over 20 files or so), the length of the string variable exceeds the limits for windows command line and the merge fails.
Since this process is just merging files, my next thought was instead of writing the file names to a string, just do the merge one file at a time. So the first time the loop runs the following type of code:
tiffcp.exe file1.tif results.tif
The result is a perfect 476k tif file. But the next iteration of the loop needs to merge the second file plus the contents of the first "results" tif file. So I do this:
tiffcp.exe results.tif file2.tiff results.tif
The results each time are a blank 1K tiff file?
All the examples I can find of tiffcp.exe say file1.tif file2.tif results.tif, none use the results file to write back to itself?
Any suggestions on how to do this?
Try the -a switch of tiffcp.exe.
I'm doing something similar in Python, and inside my file processing loop I'm issuing the command:
tiffcp.exe -a temp.tif output.tif
It works fine.
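A minimal sketch of such a loop, assuming tiffcp.exe is on the PATH and that the output file does not already hold pages from an earlier run. The function names are hypothetical. Passing one input file per call keeps each command line short, which sidesteps the length limit from the question.

```python
import subprocess


def build_append_commands(input_files, output_file, tiffcp="tiffcp.exe"):
    """Return one tiffcp command per input file, each appending to output_file."""
    return [[tiffcp, "-a", path, output_file] for path in input_files]


def merge_tiffs(input_files, output_file, run=subprocess.run):
    """Append every input file to output_file, one tiffcp call at a time."""
    for command in build_append_commands(input_files, output_file):
        # check=True raises CalledProcessError if tiffcp exits with an error.
        run(command, check=True)
```

The run parameter exists only so the loop can be exercised without the real executable; in normal use merge_tiffs(file_list, "results.tif") is enough.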
For an ASP.NET project you may want to try LibTiff.Net (free, open source, BSD license). That port of libtiff library contains tiffcp utility with source code. You may try to use it in your code.
Disclaimer: I am one of the maintainers of the library.
I believe your problem is caused by the use of results.tif as both input and output. If you increment the file name (i.e. results1.tif to results2.tif etc.), I believe it should work.
This is a rather inefficient approach (tiff1 is copied 9 times if you have 10 files). Since you refer to libtiff, you may want to look at the source of libtiff's tiffcp and check whether it is worthwhile to embed it.