Can you suggest how we can merge different types of files?
Merging different types of files directly cannot be accomplished, because each file type has its own way of compressing and storing data.
RAR files, on the other hand, are not usually used in Hadoop. Other formats such as Parquet, ORC, and JSON can be merged by first converting the files to the same type.
For example, if the requirement is to merge Parquet and JSON files, the Parquet files can be converted into JSON using tools like parquet-tools.jar, and the resulting files can then be merged by loading them into a table with an appropriate schema.
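As a rough illustration, here is one way to do the same convert-then-merge step with Spark instead of parquet-tools.jar (a minimal sketch only; the paths are placeholders and it assumes both sources share a compatible schema):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MergeParquetAndJson {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("merge-parquet-json").getOrCreate();

        // Hypothetical input paths; both sources must describe the same kind of records.
        Dataset<Row> fromParquet = spark.read().parquet("hdfs:///data/in/parquet");
        Dataset<Row> fromJson = spark.read().json("hdfs:///data/in/json");

        // Once both sides are DataFrames (i.e. "the same type"), they can be merged
        // and written out as a single format. unionByName needs Spark 2.3+; on older
        // versions, align the columns explicitly and use union instead.
        Dataset<Row> merged = fromParquet.unionByName(fromJson);
        merged.write().mode("overwrite").parquet("hdfs:///data/out/merged");

        spark.stop();
    }
}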
Hope this helps!
I'm new to Apache Camel and learning its basics. I'm using the YAML DSL.
I have a TGZ file which includes 2 small CSV files.
I am trying to decompress the file using gzipDeflater, but when I print the body after the extraction, it includes some data about the CSV (filename, my username, some numbers), which prevents me from parsing the CSV by its known columns.
Since the extracted file includes lines that were not in the original CSV, I get an exception whenever one of those lines is processed.
Is there a way for me to "ignore" those lines, or perhaps another Apache Camel feature that will let me access only the content of those CSVs?
Thanks!
You probably have a gzipped tar file, which is a slightly different thing from a plain deflate-compressed file; the extra data you see (filename, owner, some numbers) is the tar entry header rather than part of the CSV.
Try this (convert to YAML if you'd like):
from("file:filedir")
.unmarshal().gzip()
.split(new TarSplitter())
// process/unmarshal CSV
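For completeness, a fuller version of that route might look roughly like this (a sketch only; it assumes camel-tarfile and camel-csv are on the classpath, the endpoints are placeholders, and the gzip data format method name can differ between Camel versions):

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.dataformat.tarfile.TarSplitter;

public class TgzCsvRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("file:filedir")                       // pick up the .tgz files
            .unmarshal().gzip()                    // strip the gzip layer, leaving a tar archive
            .split(new TarSplitter()).streaming()  // one exchange per tar entry (each CSV)
                .unmarshal().csv()                 // parse the CSV content of that entry
                .to("log:csv");                    // placeholder: replace with your real processing
    }
}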
Generally, data follows one XML (ACORD) document at a time. There is a process that converts the XML into multiple CSV files (around 200+). Is it possible to merge all 200 files into one Parquet file, or does Parquet only support one CSV structure per file?
I am using NiFi with JsonToAvro -> AvroToORC -> PutHDFS, but I am facing the following issues.
1) A single ORC file is being saved on HDFS. I am not using any compression.
2) When I try to access these files, I get errors such as buffer-memory errors.
Thanks for help in advance.
You should be merging together many Avro records before ConvertAvroToORC.
You could do this by using MergeContent with the mode set to Avro right before ConvertAvroToORC.
You could also do this by merging your JSON together using MergeContent, and then sending the merged JSON to ConvertJsonToAvro.
Using PutHDFS to append to ORC files that are already in HDFS will not work. The HDFS processor does not know anything about the format of the data; it just writes additional raw bytes onto the file and will likely create an invalid ORC file.
A CATProduct file together with multiple CATPart files forms an assembly relationship, and uploading a file in CATProduct format on its own fails to convert. Is there any way to solve this problem, for example by uploading the CATProduct file together with its CATPart files, or by specifying the subordinate relationship between them?
The way it usually works is that you either upload all the files in a zip and, when translating, specify that it is a composite file and provide the rootFilename as well (http://adndevblog.typepad.com/cloud_and_mobile/2016/07/translate-referenced-files-by-derivative-api.html), or, if you are uploading to A360/Fusion Team/BIM 360 Docs, you can upload each file separately and then define the relationships between them (http://adndevblog.typepad.com/cloud_and_mobile/2016/12/setting-up-references-between-files.html).
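For the zip route, the translation is the usual Model Derivative job request with compressedUrn and rootFilename set on the input. A rough sketch (the URN, token, and root file name below are placeholders):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TranslateCompositeFile {
    public static void main(String[] args) throws Exception {
        String urn = "<base64-encoded-urn-of-the-uploaded-zip>";  // placeholder
        String token = "<access-token>";                          // placeholder

        // Mark the upload as a composite (zip) archive and name the root assembly inside it.
        String body = "{"
            + "\"input\": {"
            + "\"urn\": \"" + urn + "\","
            + "\"compressedUrn\": true,"
            + "\"rootFilename\": \"assembly.CATProduct\""         // placeholder root file
            + "},"
            + "\"output\": {\"formats\": [{\"type\": \"svf\", \"views\": [\"3d\"]}]}"
            + "}";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://developer.api.autodesk.com/modelderivative/v2/designdata/job"))
            .header("Authorization", "Bearer " + token)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}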
I need to read a bunch of JSON files from an HDFS directory. After I'm done processing, Spark needs to place the files in a different directory. In the meantime, there may be more files added, so I need a list of files that were read (and processed) by Spark, as I do not want to remove the ones that were not yet processed.
The function read.json converts the files immediately into DataFrames, which is cool, but it does not give me the file names the way wholeTextFiles does. Is there a way to read JSON data while also getting the file names? Is there a conversion from an RDD (with JSON data) to a DataFrame?
From version 1.6 on, you can use input_file_name() to get the name of the file in which a row is located. The names of all the files that were read can then be obtained with a distinct on that column.
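A minimal sketch in Java (the HDFS path and the added column name are placeholders): read the JSON directory into a DataFrame, tag each row with input_file_name(), and take a distinct over that column to know exactly which files this run consumed.

import static org.apache.spark.sql.functions.input_file_name;

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JsonWithFileNames {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("json-with-filenames").getOrCreate();

        // Read the whole directory; each row remembers the file it came from.
        Dataset<Row> df = spark.read().json("hdfs:///data/incoming")
            .withColumn("source_file", input_file_name());

        // ... process df as usual ...

        // Only these files were read by this run, so only these may be moved or removed.
        List<Row> processedFiles = df.select("source_file").distinct().collectAsList();
        processedFiles.forEach(r -> System.out.println(r.getString(0)));

        spark.stop();
    }
}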