NiFi: merge CSV files using MergeRecord

I have a stream of JSON records that I successfully convert into CSV records following this instruction. Now I want to merge these CSV records into one CSV file. Below is that flow:
At step 5 I end up with around 9K CSV records. How do I merge them into one CSV file using the MergeRecord processor?
My CSV header:
field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11
Some of these fields may be null, and they vary between records.

After this, use UpdateAttribute to set a filename, and then use PutFile to store the file in a specific location.
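A minimal sketch of the relevant settings, assuming the default CSVReader/CSVRecordSetWriter controller services; the bin thresholds, filename expression, and output directory are assumptions to tune for your flow:

MergeRecord
    Record Reader               CSVReader
    Record Writer               CSVRecordSetWriter
    Merge Strategy              Bin-Packing Algorithm
    Minimum Number of Records   10000     (higher than the ~9K records expected per batch)
    Maximum Number of Records   100000
    Max Bin Age                 5 min     (flushes the bin even if fewer records arrive)
UpdateAttribute
    filename                    merged_${now():format('yyyyMMddHHmmss')}.csv
PutFile
    Directory                   /path/to/output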

I had a similar problem and solved it by using the RouteOnAttribute processor. Hope this helps someone.
Below is how I configured the processor using ${merge.count:equals(1)}.
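For context, merge.count is the attribute MergeRecord writes with the number of FlowFiles that went into a bundle, so ${merge.count:equals(1)} matches bundles that were not really merged. One way to sketch that configuration (the property name and the downstream routing are assumptions, not from the answer):

RouteOnAttribute
    Routing Strategy            Route to Property name
    unmerged                    ${merge.count:equals(1)}

The "unmerged" relationship can then be looped back to MergeRecord for another pass, while "unmatched" continues on to UpdateAttribute/PutFile.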

Related

Extracting from CSV file knowing row and column number on command line

I have a CSV file and I want to extract the element in the first row and 3rd column. How might I go about doing this?
I would load the CSV into a matrix and then take the relevant row/column; of course, you could ignore the non-relevant elements while loading the CSV. How to do this has already been answered, e.g.
How can I read and parse CSV files in C++?
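The linked thread covers the C++ side; for illustration, here is a minimal sketch of the same idea in Python (the file name and the indices are assumptions):

import csv

# Load the CSV as a matrix (list of rows)
with open("data.csv", newline="") as f:
    rows = list(csv.reader(f))

# First row, 3rd column (0-based indices 0 and 2)
print(rows[0][2])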

Apache Nifi : How to create parquet file from CSV file with schema saved in "avro.schema" attribute

I am trying to create a parquet file from a CSV file using Apache Nifi.
I am able to convert the CSV to a parquet file, but the problem is that the schema of the parquet file contains a struct type, which I need to overcome and convert into a string type.
I am using Apache Nifi 1.14.0 on Windows Server 2016.
This is what I've tried so far to convert CSV to parquet...
I have used the below 3 controller services:
CSVReader
CSVRecordSetWriter
ParquetRecordSetWriter
And these are the processors/flow:
GetFile
ConvertRecord (CSVReader to CSVRecordSetWriter; this automatically generates the "avro.schema" attribute, which I update in the next step)
UpdateAttribute (updating the "avro.schema" attribute: wherever two data types were inferred, I replace them with '["null","string"]'; see the schema fragment after this list)
ConvertRecord (CSVReader to ParquetRecordSetWriter)
UpdateAttribute (for appending '.parquet' to the filename)
PutFile
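For illustration, this is the kind of edit the UpdateAttribute step above performs on a single field of the "avro.schema" attribute. The field name and the inferred union are hypothetical; only the ["null","string"] replacement comes from the question:

Before (two data types inferred for the column):
    { "name" : "some_column", "type" : [ "string", "long" ] }
After:
    { "name" : "some_column", "type" : [ "null", "string" ] }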
I also want to know how to view a .parquet file on Windows. Currently I am reading the parquet file via PySpark and checking the schema. :|
This is how the parquet file schema looks after conversion. I want string instead of struct as the output.
Please note: there are lots of CSVs with many columns/fields. I don't want to create the schema manually.
OR
Any other way to achieve this would be very helpful.
Thanks!
After playing around with some more options of "ParquetRecordSetWriter", I was able to create a parquet file with the schema that I had captured in the "avro.schema" attribute.
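On the side question of inspecting a .parquet file on Windows without Spark, a minimal sketch using pyarrow (pyarrow and the file name are assumptions; the thread itself only mentions PySpark):

import pyarrow.parquet as pq

# Print the column names and types without loading the data
print(pq.read_schema("output.parquet"))

# Or load the table and preview a few rows
print(pq.read_table("output.parquet").slice(0, 5))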

Spark read multiple CSV files, one partition for each file

Suppose I have multiple CSV files in the same directory, and these files all share the same schema:
/tmp/data/myfile1.csv, /tmp/data/myfile2.csv, /tmp/data/myfile3.csv, /tmp/data/myfile4.csv
I would like to read these files into a Spark DataFrame or RDD, and I would like each file to be a partition of the DataFrame. How can I do this?
You have two options I can think of:
1) Use the Input File name
Instead of trying to control the partitioning directly, add the name of the input file to your DataFrame and use that for any grouping/aggregation operations you need to do. This is probably your best option, as it is more aligned with the parallel-processing intent of Spark, where you tell it what to do and let it figure out how. You do this with code like this:
SQL:
SELECT input_file_name() as fname FROM dataframe
Or Python:
from pyspark.sql.functions import input_file_name
newDf = df.withColumn("filename", input_file_name())
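For example, to count how many rows came from each file (the input path and header option are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

# Read every CSV in the directory and tag each row with its source file
df = spark.read.csv("/tmp/data/*.csv", header=True)
newDf = df.withColumn("filename", input_file_name())

# Any grouping/aggregation can now key on the file name
newDf.groupBy("filename").count().show(truncate=False)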
2) Gzip your CSV files
Gzip is not a splittable compression format. This means that when loading gzipped files, each file will be its own partition.
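A quick way to check, assuming the files have been compressed to .csv.gz:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each gzipped CSV is read as a single, non-splittable partition
gzDf = spark.read.csv("/tmp/data/*.csv.gz", header=True)
print(gzDf.rdd.getNumPartitions())   # expect one partition per input file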

Cassandra RPC Timeout on import from CSV

I am trying to import a CSV into a column family in Cassandra using the following syntax:
copy data (id, time, vol, speed, occupancy, status, flags) from 'C:\Users\Foo\Documents\reallybig.csv' with header = true;
The CSV file is about 700 MB, and for some reason when I run this command in cqlsh I get the following error:
"Request did not complete within rpc_timeout."
What is going wrong? There are no errors in the CSV, and it seems to me that Cassandra should be able to suck this CSV in without a problem.
The Cassandra installation folder has a .yaml file with a setting for the RPC timeout value, "rpc_timeout_in_ms"; you can raise the value and restart Cassandra.
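For example, in conf/cassandra.yaml (the value is an assumption; newer Cassandra versions replace this setting with per-operation timeouts such as read_request_timeout_in_ms):

rpc_timeout_in_ms: 60000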
But another way is to cut your big CSV into multiple files and import them one by one.
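If you go the splitting route, a minimal Python sketch (file names and chunk size are assumptions); each chunk keeps the header so COPY ... WITH HEADER = TRUE still works:

import csv

CHUNK_ROWS = 1_000_000   # rows per output file (assumption)

with open("reallybig.csv", newline="") as src:
    reader = csv.reader(src)
    header = next(reader)          # keep the header for every chunk
    chunk, part = [], 0
    for row in reader:
        chunk.append(row)
        if len(chunk) == CHUNK_ROWS:
            with open("reallybig_part%d.csv" % part, "w", newline="") as out:
                writer = csv.writer(out)
                writer.writerow(header)
                writer.writerows(chunk)
            chunk, part = [], part + 1
    if chunk:                      # write the remainder
        with open("reallybig_part%d.csv" % part, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(chunk)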
This actually ended up being my own misinterpretation of COPY FROM, as the CSV was about 17 million rows. In this case the best option was to use the bulk loader example and run sstableloader. However, the answer above would certainly work if I wanted to break the CSV into 17 different CSVs, which is an option.

How to rename my hadoop result into a file with ".csv" extension

Actually my intention is to rename the output of a Hadoop job to .csv files, because I need to visualize this CSV data in RapidMiner.
In "How can i output hadoop result in csv format" it is said that for this purpose I need to follow these three steps:
1. Submit the MapReduce Job
2. Which will extract the output from HDFS using shell commands
3. Merge them together, rename as ".csv" and place in a directory where the visualization tool can access the final file
If so, how can I achieve this?
UPDATE
myjob.sh:
bin/hadoop jar /var/root/ALA/ala_jar/clsperformance.jar ala.clsperf.ClsPerf /user/root/ala_xmlrpt/Amrita\ Vidyalayam\,\ Karwar_Class\ 1\ B_ENG.xml /user/root/ala_xmlrpt-outputshell4
bin/hadoop fs -get /user/root/ala_xmlrpt-outputshell4/part-r-00000 /Users/jobsubmit
cat /Users/jobsubmit/part-r-00000 /Users/jobsubmit/output.csv
This showed
"The CSV file was empty and couldn't be imported."
when I tried to open output.csv.
Solution:
cat /Users/jobsubmit/part-r-00000 > /Users/jobsubmit/output.csv
First you need to retrieve the MapReduce result from HDFS:
hadoop dfs -copyToLocal path_to_result/part-r-* local_path
Then cat them into a single file
cat local_path/part-r-* > result.csv
Then it depends on your MapReduce result format: if it's already CSV, you are done. If not, you probably have to use another tool like sed or awk to transform it into CSV format.
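For that last step, a minimal Python sketch that concatenates the local part files and rewrites each tab-separated key/value line as CSV (assuming the default TextOutputFormat, which separates key and value with a tab; paths follow the commands above):

import csv
import glob

with open("result.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for part in sorted(glob.glob("local_path/part-r-*")):
        with open(part) as f:
            for line in f:
                # "key<TAB>value" per line -> "key,value" with proper CSV quoting
                writer.writerow(line.rstrip("\n").split("\t"))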