How can MapReduce parse a CSV file with 80 columns where each row of the original Excel sheet spans two to three lines in the CSV? TextInputFormat doesn't work in this case. Would KeyValueTextInputFormat work here?
You can write your own InputFormat and RecordReader that reads multiple lines and passes them to your Mapper as a single record.
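The actual InputFormat and RecordReader would be written in Java against the Hadoop API, but the grouping rule the reader needs is simple. Here is a rough sketch of that logic in Python, assuming the extra lines exist only because quoted fields contain embedded newlines (that assumption, and the sample data, are mine, not from the question):

def logical_records(lines):
    """Group physical CSV lines into logical records.

    Assumes a row spills onto extra lines only because a quoted field
    contains embedded newlines, so a record is complete once the number
    of double quotes seen so far is even.
    """
    buffer = []
    quote_count = 0
    for line in lines:
        buffer.append(line.rstrip("\n"))
        quote_count += line.count('"')
        if quote_count % 2 == 0:        # all quotes balanced -> record complete
            yield "\n".join(buffer)
            buffer = []
            quote_count = 0
    if buffer:                          # trailing partial record, if any
        yield "\n".join(buffer)

# Example: two physical lines forming one logical record
sample = ['1,"multi\n', 'line note",foo\n']
print(list(logical_records(sample)))    # ['1,"multi\nline note",foo']

A custom RecordReader would apply the same rule while reading lines from its split, emitting each completed logical record as one value to the Mapper.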
I am creating a pipeline in Google Data Fusion that should read records from a source database and write them to a target CSV file in Cloud Storage.
The problem is that the separator character in the resulting file is a comma (","), and some string fields contain phrases with commas, so when I try to load the resulting file into Wrangler as CSV I get an error: the number of fields in the CSV does not match the number of fields in the schema (because of the fields that contain commas).
How can I escape these special characters in the pipeline?
Thanks and regards
Try writing the data as TSV instead of CSV (set the format of the sink plugin to tsv). Then load the data as tsv in Wrangler.
I am trying to store my PySpark output as CSV, but when I save it as CSV the output does not look the same. I have the output in this form:
When I try to convert this to CSV, the Concat tasks column does not show up properly because of the size of the data. Given my requirements, I have to store the data in CSV format. Is there a way around this? (P.S. I also see columns with nonsensical values, even though the PySpark output shows the correct values.)
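Without seeing the data it is hard to say exactly what breaks, but one thing to check is how the DataFrame is written out. A minimal sketch, with explicit quoting and escaping so large or comma-containing columns survive the round trip (the DataFrame name and output path are placeholders for your own):

# Minimal sketch: write a DataFrame to CSV with explicit quoting/escaping.
# `df` and the output path are placeholders for your own DataFrame and location.
(df.coalesce(1)                        # single output file, if the data fits on one worker
   .write
   .option("header", True)             # keep column names
   .option("quoteAll", True)           # quote every field, so embedded commas are safe
   .option("escape", '"')              # escape embedded quotes the way Excel expects
   .mode("overwrite")
   .csv("/tmp/output_csv"))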
I have a CSV file that contains id, name, and salary as fields. The data in my CSV file looks like this:
id,name,salary
1,Jhon,2345
2,Alex,3456
I want to update the CSV so that each id is replaced with a new id (id * 4):
id,name,salary
4,Jhon,2345
8,Alex,3456
The format of the file in the destination should also be CSV. Can anyone tell me the flow (which processors do I need)? I'm very new to NiFi. A big thanks in advance.
Use the UpdateRecord processor with the settings below:
Record Reader: CSVReader
Record Writer: CSVRecordSetWriter
Replacement Value Strategy: Literal Value
Add a dynamic property named /id with the value ${field.value:multiply(4)}
This gives the desired result: CSV in, CSV out.
What's the syntax for writing an array for Solr in a CSV file? I need to update a multivalued field, but when I upload the file all the data ends up in the array as a single element, like this:
multiField:["data1,data2,data3"]
instead of this
multiField:["data1", "data2" , "data3"]
How can I write this in the CSV file so the values are split by default?
You can use the split parameter to split a single field into multiple values:
&f.multiField.split=,
... should do what you want.
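For example, assuming a local Solr core named mycore and a file docs.csv (both placeholders; the split parameter is the one from the answer above), the upload could look like this:

import requests

# Hypothetical core name, file name, and field name; adjust to your setup.
solr_update_url = "http://localhost:8983/solr/mycore/update"
params = {
    "commit": "true",
    "f.multiField.split": ",",   # split this field's value on commas, as suggested above
}

with open("docs.csv", "rb") as csv_file:
    response = requests.post(
        solr_update_url,
        params=params,
        data=csv_file,
        headers={"Content-Type": "text/csv"},
    )
response.raise_for_status()
print(response.text)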
I have a CSV file that I can open in Excel 2012 and it comes in perfectly. When I try to set up the metadata for this CSV file in Talend, the fields (columns) are not split the same way Excel splits them. I suspect I am not setting the metadata properly.
The specific issue is that I have a column with string data in it which may contain commas within the string. For example suppose I have a CSV file with three columns: ID, Name and Age which looks like this:
ID,Name,Age
1,Ralph,34
2,Sue,14
3,"Smith, John", 42
When Excel reads this CSV file, it treats the second element of the third row ("Smith, John") as a single token and places it into a cell by itself.
Talend tries to break this same token into two, since there is a comma within the token. Apparently Excel ignores all delimiters within a quoted string, while Talend by default does not.
My question is: how do I get Talend to behave the same way as Excel?
If you use the tFileInputDelimited component to read this CSV file, set the field separator to "," and, under the component's CSV Options properties, enable the Text Enclosure option and set it to a double quote ("). If you use metadata instead, there is also an option there to define the string/text enclosure; setting it to " resolves the problem.
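The behaviour you are after is standard CSV quoting: the enclosure character tells the parser to ignore delimiters inside it. Just to illustrate the rule (this is Python's csv module, not Talend, and the sample data is the one from the question), setting the quote character to the double quote keeps "Smith, John" in a single column:

import csv
import io

data = 'ID,Name,Age\n1,Ralph,34\n2,Sue,14\n3,"Smith, John",42\n'

# quotechar='"' tells the parser to ignore commas inside quoted fields,
# which is what Talend's Text Enclosure option configures.
reader = csv.reader(io.StringIO(data), delimiter=",", quotechar='"')
for row in reader:
    print(row)
# ['ID', 'Name', 'Age']
# ['1', 'Ralph', '34']
# ['2', 'Sue', '14']
# ['3', 'Smith, John', '42']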