Using the split function to create multiple files from a *.jsonl.gz file in UNIX

I have a huge gzipped file (.gz extension, ~5 GB) and I am using the split function in UNIX on it. It creates files in binary format, whereas the original file had a JSON document on each line (JSONL).
When I try to read these files programmatically I get an "unexpected symbol" error, since split has not generated the files in a proper format. Can someone assist, please?
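A common fix (a sketch, assuming GNU coreutils and that splitting on line boundaries is acceptable; the file name and line count are placeholders) is to decompress first and split the decompressed stream, since splitting the .gz bytes directly produces binary fragments that are neither valid gzip nor valid JSON:
# Split the decompressed stream on line boundaries so every chunk is valid JSONL
zcat huge.jsonl.gz | split -l 1000000 --additional-suffix=.jsonl - chunk_
# Optionally recompress each chunk as it is written (GNU split >= 8.16)
zcat huge.jsonl.gz | split -l 1000000 --filter='gzip > $FILE.jsonl.gz' - chunk_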

Related

Apache Camel - extract TGZ file which contains multiple CSVs

I'm new to Apache Camel and learning its basics. I'm using the YAML DSL.
I have a TGZ file which includes 2 small CSV files.
I am trying to decompress the file using gzipDeflater, but when I print the body after the extraction, it includes some data about the CSV (filename, my username, some numbers) that is preventing me from parsing the CSV by its known columns alone.
Since the extracted file includes lines that were not in the original CSV, whenever one of those lines is processed I get an exception.
Is there a way for me to "ignore" those lines, or perhaps another feature of Apache Camel that will let me access only the content of those CSVs?
Thanks!
You probably have a gzipped tar file, which is a slightly different thing from just a deflate-compressed file: the tar archive inside carries its own per-entry headers (file name, owner, sizes), which is the extra data you are seeing around the CSV content.
Try this (convert to YAML if you'd like):
from("file:filedir")
.unmarshal().gzip()
.split(new TarSplitter())
// process/unmarshal CSV
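TarSplitter ships in the camel-tarfile module, so you will likely need that dependency as well (a sketch; the version property is a placeholder for your Camel version):
<dependency>
    <groupId>org.apache.camel</groupId>
    <artifactId>camel-tarfile</artifactId>
    <version>${camel.version}</version>
</dependency>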

Avoid splitting JSON output in PySpark (v. 2.1)

Using Spark v2.1 and Python, I load JSON files with
sqlContext.read.json("path/data.json")
I have a problem with the output JSON. Using the command below,
df.write.json("path/test.json")
the data is saved in a folder called test.json (not a file) which includes two files: one empty and the other with a strange name:
part-r-00000-f9ec958d-ceb2-4aee-bcb1-fa42a95b714f
Is there any way to get a clean, single JSON output file?
Thanks
Yes, Spark writes the output to multiple files when you save. Since the computation is distributed, the output is written as multiple part files like part-r-00000-f9ec958d-ceb2-4aee-bcb1-fa42a95b714f. The number of files created is equal to the number of partitions.
If your data is small and fits in memory, you can save your output to a single file. But if your data is large, saving to a single file is not recommended.
test.json is actually a directory, not a JSON file. It contains multiple part files inside it. This does not create any problem for you; you can easily read it back later.
If you still want your output in a single file, then you need to repartition to 1, which brings all your data to a single node before saving. This may cause issues if you have large data.
df.repartition(1).write.json("path/test.json")
Or, to avoid the full shuffle that repartition triggers:
df.coalesce(1).write.json("path/test.json")
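Reading the result back later is transparent; Spark picks up all the part files in the directory itself (using the same placeholder path as above):
df2 = sqlContext.read.json("path/test.json")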

I get a mysterious "Neo.ClientError.Statement.InvalidSyntax" error when loading a CSV in Neo4j

For a course on Excel I was trying to load a CSV into Neo4j (my first time using this application) when I got stuck at the very first step of replicating an example shown in said course: loading.
The command used in the example was this:
LOAD CSV WITH HEADERS FROM "file:/path/to/file/file.csv"
as row
CREATE (m:movie {name:row.movie})
But it gave syntax errors. I found out I could correct it by using double backslashes and adding "file://":
LOAD CSV WITH HEADERS FROM "file://C:\\path\\to\\file\\file.csv"
as row
CREATE (m:movie {name:row.movie})
Neo4j accepts this syntax, processes it for a few moments, and then returns YET ANOTHER error:
Neo.TransientError.Statement.ExternalResourceFailure
I tried the same commands (the original and my own) in the online Neo4j console, but no luck. I can reach the file using that path without a problem; it really is there. The CSV file consists of just 5 strings of regular letters, that's all. No fancy formatting or characters.
What's going on?
Not that mysterious: Neo4j's LOAD CSV clause looks for the specified CSV file in the import directory configured for that database in its server configuration file (i.e. dbms.directories.import=import in your neo4j.conf file).
You should create the import directory in...
"C:\Users\[User Name]\Documents\Neo4j\default.graphdb\"
If you place your CSV file in there, you can specify any sub-directory, or just the "file.csv" you want to import, with LOAD CSV as below.
LOAD CSV WITH HEADERS FROM "file:///file.csv"
AS row
RETURN row
LIMIT 5
Try using:
"file:///C:/path/to/file/file.csv"
Since your file is on your local computer, the third / following the file scheme is not preceded by a host name or address -- but it still needs to be there. Also, file URI path separators should be forward slashes (even on Windows machines).
See the File URI scheme Wikipedia page if you need more information.
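Putting both answers together, the statement from the question with a corrected file URI would look like this (using the placeholder path from the question):
LOAD CSV WITH HEADERS FROM "file:///C:/path/to/file/file.csv"
AS row
CREATE (m:movie {name: row.movie})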

SFTP of a CSV file using JSch to a mainframe turns the file into a single line on the MF end

I am transferring a CSV file via SFTP using JSch to a mainframe. The file has multiple rows. However, after the transfer, on the mainframe the file contains all the rows on a single line. Sample code snippet below:
File f1 = new File(FILETOTRANSFER1);
channelSftp.put(new FileInputStream(f1), f1.getName());
The JSch library always uses "binary" mode transfer. It never converts the file in any way.
So either:
The file gets (wrongly) converted by the SFTP server on the mainframe.
Or (more likely) the file simply is not in the format the mainframe requires. Either you need to do the conversion yourself before the upload, or convert the file on the server after the upload.
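For example, if the mainframe side expects CRLF-terminated records (an assumption; the required record format depends on the target system, and an EBCDIC conversion may be needed instead), a minimal pre-upload conversion could look like the hypothetical helper below, which is not part of JSch:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

// Hypothetical helper: normalize line endings to LF, then rewrite each LF as CRLF.
// Whether CRLF is what the target system needs is an assumption to verify.
static Path convertLineEndings(Path source) throws IOException {
    String content = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    String converted = content.replace("\r\n", "\n").replace("\n", "\r\n");
    Path target = Paths.get(source + ".crlf");
    Files.write(target, converted.getBytes(StandardCharsets.UTF_8));
    return target;
}
You would then pass the converted file instead of f1 to the channelSftp.put(...) call.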

Creating a CSV file with the Report Generation Toolkit in LabVIEW

I want to create .csv files with the Report Generation Toolkit in LabVIEW.
They must be actual .csv files which can be opened with Notepad or something similar.
Creating a .csv is not that hard; it's just a matter of adding the extension to the file name that's going to be created.
If I create a .csv file this way it opens nicely in Excel, just the way it should, but if I open it in Notepad it shows all kinds of characters and doesn't even come close to the data I wrote to the file.
I create the files with the LabVIEW code below:
Link to image (can't post image yet because I've got too few points)
I know .csv files can be created with the Write to Spreadsheet VI, but I would like to use the Report Generation Toolkit because it makes it easy to add columns and rows to the file, and that is something I really need.
You can use the Robust CSV package on the lavag.org forum to read and write 2D arrays to CSV files.
http://lavag.org/files/file/239-robust-csv/
Calling a file "csv" does not make it a CSV file. I've never used the toolkit to generate an Excel file, but I'm assuming it creates an XLS or XLSX file regardless of what extension you give it, which is why you're seeing gibberish (probably XLS, since it's been around for a while and I believe XLSX is XML, not binary).
I'm not sure what your problem is with the Write to Spreadsheet VI. It has an append input, so I assume you can use that to at least add rows directly to a file, although I can't say I've ever tried it. I would prefer handling all the data in memory explicitly, where you can easily use the array functions to add rows or columns to the array and then overwrite the entire file.