How can Hadoop MapReduce get data input from a CSV file? - csv

I want to implement a Hadoop MapReduce job and use a CSV file as its input. So I want to ask: is there any method that Hadoop provides to get the values out of a CSV file, or do we just do it with Java's String.split() function?
Thanks all.

By default Hadoop uses TextInputFormat, a text input reader that feeds the mapper line by line from the input file. The key passed to the mapper is the byte offset of the line within the file, not a line count. Be careful with CSV files, though, as a single column/field can contain a line break. You might want to look for a CSV input reader like this one:
https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
But either way, you have to split each line in your own code.
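
For the common case where no field contains an embedded comma or line break, a plain mapper that splits each line is enough. A minimal sketch (the class name, field indices, and output key/value choice are illustrative, not from the question):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// TextInputFormat hands the mapper one line per call; the key is the
// byte offset of the line in the file, the value is the line itself.
public class CsvMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Naive split: only safe if no field contains a comma or newline.
        String[] fields = value.toString().split(",");
        if (fields.length >= 2) {
            context.write(new Text(fields[0]), new Text(fields[1]));
        }
    }
}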

Related

wso2 convert json/xml to csv and write to a csv file

I'm trying to create tab-delimited CSV data from JSON/XML data. I can do this using the PayloadFactory mediator in an Iterate loop, but the data gets appended to the same line in the file on every iteration, creating one long line of data. I want it to append to the next line instead, but I've been unable to find a way. Any suggestions? Thanks.
(I do not want a solution that uses a CSV connector or module.)
Edit: I solved it; you just need to use an XSLT and the "&#10;" line-break character entity.

Extracting from CSV file knowing row and column number on command line

I have a CSV file and I want to extract the element in the first row and third column. How might I go about doing this?
I would load the CSV into a matrix and then take the relevant row/column; of course, you could also skip the non-relevant elements while loading the CSV. How to do that has already been answered, e.g.
How can I read and parse CSV files in C++?
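
As a rough sketch of the load-and-index approach (in Java here rather than C++, with a made-up file name and a naive split that ignores quoted commas):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Load the whole CSV into memory, then index row 0, column 2
// (the first row, third column).
public class CsvCell {
    public static void main(String[] args) throws IOException {
        List<String> rows = Files.readAllLines(Paths.get("input.csv"));
        String[] firstRow = rows.get(0).split(",");
        System.out.println(firstRow[2]);
    }
}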

Pyspark: how to read CSV file with additional lines

So I have the following CSV file. It has some additional strings between the valid rows. Excel seems to do a good job of reading those (it just ignores the additional ones).
However, the story with spark is a bit different.
I have set it up as spark.read.csv(path, header=True, multiLine=True, sep='|')
Is there some simple way to handle it?
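
One thing worth trying (an assumption about the data: the stray lines must not match the expected column layout) is Spark's DROPMALFORMED read mode, which discards records that don't fit the schema. Shown here with the Spark Java API rather than PySpark; the option names are the same in both:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadCsvSkippingJunk {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("csv-read").getOrCreate();
        // DROPMALFORMED silently drops records that do not match the
        // schema, e.g. stray text lines with the wrong number of fields.
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("sep", "|")
                .option("multiLine", "true")
                .option("mode", "DROPMALFORMED")
                .csv("path/to/file.csv");
        df.show();
    }
}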

Mahout CSV to SEQ for text vectorization

I have a large CSV file where each line consists of (id, description) in text format. I want to convert each line to a vector using "seq2sparse" and then later run "rowsimilarity" to generate a textual-similarity result.
The problem is that I need to convert the CSV file to a SequenceFile somehow for "seq2sparse" to work with it, and the existing "seqdirectory" method takes a directory of text files rather than a CSV file. Any way to accomplish this?
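
One workaround (a sketch, not a Mahout-provided tool; the file names and the assumption that the id is everything before the first comma are mine) is to write the SequenceFile<Text, Text> layout that seq2sparse consumes directly, using Hadoop's SequenceFile.Writer:

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Turn "id,description" CSV lines into a SequenceFile<Text, Text>,
// the key/value layout that seq2sparse expects as input.
public class CsvToSeq {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (BufferedReader in = new BufferedReader(new FileReader("docs.csv"));
             SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                     SequenceFile.Writer.file(new Path("docs.seq")),
                     SequenceFile.Writer.keyClass(Text.class),
                     SequenceFile.Writer.valueClass(Text.class))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Split on the first comma only, so commas inside the
                // description survive.
                String[] parts = line.split(",", 2);
                if (parts.length == 2) {
                    writer.append(new Text(parts[0]), new Text(parts[1]));
                }
            }
        }
    }
}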

Changing The Delimiter to CTRL+A in Python CSV Module

I'm trying to write a CSV file with CTRL+A (U+0001) as the delimiter. I'm eventually going to have to write the file to Hadoop, and I'm unable to use a standard delimiter.
Currently I'm trying this:
import csv

f = open("output.csv", "w", newline="")  # newline="" avoids extra blank rows on Windows
writer = csv.writer(f, delimiter="\u0001")
for item in aList:
    writer.writerow(item)  # each item should be a list/tuple of fields
f.close()
However, the output file doesn't appear to open correctly in Excel: some rows are condensed into one block, while others have one field in the first cell and then the rest condensed into the second block, and so on.
Is the error in how I'm setting up the writer object, or am I just not familiar with separating files this way?
Note that Excel has no way of knowing the file uses \u0001 as a separator, so it won't split fields on it, which is consistent with the condensed blocks you're seeing. You can try using the non-printing "group separator" character instead, which can be written in Python code as '\035'.
See http://www.asciitable.com/index/asciifull.gif for some other non-printing characters if you need more.
It may be helpful to include more context about why you want to use a nonstandard delimiter, and whether Excel actually needs to parse the file or is just a quick check of whether the file might be parsed properly by the target system, Hadoop.