How to read a CSV file in MapReduce? - csv

I have a large, comma-delimited CSV file, about 6 GB in size. Below is the mapper function:
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String[] tokens = value.toString().split(",");
    String crimeType = tokens[5].trim(); // column 5 of the CSV holds the crime type, used as the key
    // int year = Integer.parseInt(tokens[17].trim()); // the year when the crime happened
    int year = 2010;
    CrimeTypeKey crimeTypeYearKey = new CrimeTypeKey(crimeType, year);
    context.write(crimeTypeYearKey, ONE);
}
As you can see, I use .split(",") to break each row into columns. I am wondering how to use OpenCSV in this case instead. Please give me an example, thanks a lot.
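A minimal sketch of the same mapper using OpenCSV's CSVParser in place of String.split (assuming the opencsv jar is shipped with the job, e.g. via -libjars); unlike split(","), parseLine handles quoted fields that themselves contain commas:

import com.opencsv.CSVParser;

// One parser per mapper instance; the defaults use ',' as separator and '"' as quote.
private final CSVParser parser = new CSVParser();

@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String[] tokens = parser.parseLine(value.toString()); // quote-aware splitting
    String crimeType = tokens[5].trim();
    int year = 2010; // placeholder, as in the original code
    context.write(new CrimeTypeKey(crimeType, year), ONE);
}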

Related

Jackson: reading non-existent and null values in as "" and marshalling "" out to non-existent values?

I read through the post Jackson: deserializing null Strings as empty Strings, which has this nice trick:
mapper.configOverride(String.class)
      .setSetterInfo(JsonSetter.Value.forValueNulls(Nulls.AS_EMPTY));
Then, on the flip side, I read through the post Jackson serialization: ignore empty values (or null), which has this nice trick:
objectMapper.setSerializationInclusion(JsonInclude.Include.NON_NULL);
This is very, very close, except I really don't want incoming data to be null in any case. With the above settings, the following code prints four situations, but I want to fix the null piece so that any JSON we unmarshal into Java results in empty strings, never nulls:
import com.fasterxml.jackson.annotation.JsonInclude;
import com.fasterxml.jackson.annotation.JsonSetter;
import com.fasterxml.jackson.annotation.Nulls;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MapperTest {
    private static final Logger log = LoggerFactory.getLogger(MapperTest.class);
    private ObjectMapper mapper = new ObjectMapper();

    public MapperTest() {
        mapper.setSerializationInclusion(JsonInclude.Include.NON_NULL);
        mapper.configOverride(String.class)
              .setSetterInfo(JsonSetter.Value.forValueNulls(Nulls.AS_EMPTY));
    }

    public static void main(String[] args) throws JsonProcessingException {
        new MapperTest().start();
    }

    private void start() throws JsonProcessingException {
        // write out java color=null resulting in NO field...
        String val = mapper.writeValueAsString(new Something());
        log.info("val=" + val);
        Something something = mapper.readValue(val, Something.class);
        log.info("value='" + something.getColor() + "'");

        // write out java color="" resulting in NO field...
        Something s = new Something();
        s.setColor("");
        String val2 = mapper.writeValueAsString(new Something()); // note: serializes a fresh Something(), not s
        log.info("val=" + val2);

        String temp = "{\"color\":null,\"something\":0}";
        Something something2 = mapper.readValue(temp, Something.class);
        log.info("value2='" + something2.getColor() + "'");
    }
}
The output is then
INFO: val={"something":0}
INFO: value='null'
INFO: val={"something":0}
INFO: value2=''
NOTE: value='null' is NOT what I want; I want that to be an empty string as well. Notice that if a customer sends color:null, it does result in an empty string; non-existence should result in the same thing for us, "".
For us, that would be a huge win in terms of fewer mistakes in this area.
thanks,
Dean
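One common way to cover the missing-field case (a sketch, not taken from the posts above): since Jackson never calls a setter for a property that is absent from the JSON, defaulting the field to "" in the POJO handles non-existence, while the Nulls.AS_EMPTY override handles explicit nulls. Assuming a Something class shaped like the one used in the test:

public class Something {
    // Defaulting to "" means a property absent from the JSON stays "",
    // while Nulls.AS_EMPTY turns an explicit null into "" during deserialization.
    private String color = "";
    private int something;

    public String getColor() { return color; }
    public void setColor(String color) { this.color = color; }
    public int getSomething() { return something; }
    public void setSomething(int something) { this.something = something; }
}

For the serialization half, JsonInclude.Include.NON_EMPTY (rather than NON_NULL) also omits empty strings on the way out.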

Spring Batch: read from an unstructured CSV

I would like to read from an unstructured CSV file, meaning it will have different column types every time. Please help.
Yes, I finally found the solution myself and would like to share it with you. You can write a LineMapper and map an unstructured header (dynamic columns) onto each line with the following code. Please note that I read the header while scheduling the job and pass it in as a JobParameter.
@Bean
@StepScope
public FlatFileItemReader<Customer> csvReader(@Value("#{jobParameters[filepath]}") String filepath,
        @Value("#{jobParameters[header]}") String header,
        @Value("#{jobParameters[campaignId]}") String campaignId,
        @Value("#{jobParameters[_id]}") String _id) {
    FlatFileItemReader<Customer> flatFileItemReader = new FlatFileItemReader<>();
    flatFileItemReader.setResource(new FileSystemResource(filepath));
    flatFileItemReader.setName("customer-csv-file-reader");
    flatFileItemReader.setLinesToSkip(1);
    flatFileItemReader.setLineMapper(lineMapper(header, campaignId, _id));
    return flatFileItemReader;
}

@Bean
@StepScope
public LineMapper<Customer> lineMapper(@Value("#{jobParameters[header]}") String header,
        @Value("#{jobParameters[campaignId]}") String campaignId,
        @Value("#{jobParameters[_id]}") String _id) {
    return new LineMapper<Customer>() {
        public String[] headers = header.split(",");

        @Override
        public Customer mapLine(String line, int linenumber) throws Exception {
            Customer item = new Customer();
            String[] p = line.split(",");
            // Zip the dynamic header names with this line's values into a map.
            Map<String, String> properties = IntStream.range(0, headers.length).boxed()
                    .collect(Collectors.toMap(i -> headers[i], i -> p[i]));
            item.setCampaignId(new ObjectId(campaignId));
            item.setInviteId(new ObjectId(_id));
            item.setProperties(properties);
            return item;
        }
    };
}
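For completeness, a sketch of how the header job parameter could be populated at launch time (the job and variable names here are illustrative, not from the answer): read the file's first line and pass it along with the path:

// Hypothetical launch snippet: grab the CSV's header row and hand it to the job.
String headerLine;
try (Stream<String> lines = Files.lines(Paths.get(filepath))) {
    headerLine = lines.findFirst().orElse("");
}
JobParameters params = new JobParametersBuilder()
        .addString("filepath", filepath)
        .addString("header", headerLine)
        .addString("campaignId", campaignId)
        .addString("_id", id)
        .toJobParameters();
jobLauncher.run(csvImportJob, params);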

Apache Camel CSV with Header

I have written a simple test app that reads records from a DB and puts the result in a CSV file. So far it works fine, but the column names, i.e. headers, are not put in the CSV file, although according to the doc they should be. I have also tried it with and without streaming and split, but the situation is the same.
In the Camel unit tests the headers are put there explicitly, in line 182: https://github.com/apache/camel/blob/master/components/camel-csv/src/test/java/org/apache/camel/dataformat/csv/CsvDataFormatTest.java
How could this very simple problem be solved without having to iterate over the headers myself? I also experimented with different settings, but the result is the same: the delimiters I set are applied, but the headers are not. Thanks in advance for any responses.
I used Camel 2.16.1 like this:
final CsvDataFormat csvDataFormat = new CsvDataFormat();
csvDataFormat.setHeaderDisabled(false);
[...]
from("direct:TEST").routeId("TEST")
    .setBody(constant("SELECT * FROM MYTABLE"))
    .to("jdbc:myDataSource?readSize=100") // max 100 records
    // .split(simple("${body}")) // split the list
    // .streaming() // not to keep all messages in memory
    .marshal(csvDataFormat)
    .to("file:extract?fileName=TEST.csv");
[...]
EDIT 1
I have also tried to add the headers from the exchange's in-message. They are available there as a HashSet under the name "CamelJdbcColumnNames". I added them to the csvDataFormat like this:
final CsvDataFormat csvDataFormat = new CsvDataFormat();
csvDataFormat.setHeaderDisabled(false);
[...]
from("direct:TEST").routeId("TEST")
    .setBody(constant("SELECT * FROM MYTABLE"))
    .to("jdbc:myDataSource?readSize=100") // max 100 records
    .process(new Processor() {
        public void process(Exchange exchange) throws Exception {
            headerNames = (HashSet) exchange.getIn().getHeader("CamelJdbcColumnNames");
            System.out.println("#### Process headernames = " + new ArrayList<String>(headerNames).toString());
            csvDataFormat.setHeader(new ArrayList<String>(headerNames));
        }
    })
    .marshal(csvDataFormat) //.tracing()
    .to("file:extract?fileName=TEST.csv");
The println() prints the column names, but the generated CSV file does not contain them.
EDIT 2
I added the header names to the body as proposed in comment 1, like this:
.process(new Processor() {
    public void process(Exchange exchange) throws Exception {
        Set<String> headerNames = (HashSet) exchange.getIn().getHeader("CamelJdbcColumnNames");
        Map<String, String> nameMap = new LinkedHashMap<String, String>();
        for (String name : headerNames) {
            nameMap.put(name, name);
        }
        List<Map> listWithHeaders = new ArrayList<Map>();
        listWithHeaders.add(nameMap);
        List<Map> records = exchange.getIn().getBody(List.class);
        listWithHeaders.addAll(records);
        exchange.getIn().setBody(listWithHeaders, List.class);
        System.out.println("#### Process headernames = " + new ArrayList<String>(headerNames).toString());
        csvDataFormat.setHeader(new ArrayList<String>(headerNames));
    }
})
The proposal solved the problem, and thank you for that, but it means that CsvDataFormat is not really convenient here. The exchange body after the JDBC query is an ArrayList of HashMaps, each holding one record of the table, with the column name as key and the cell as value. So setting the header-output flag on CsvDataFormat should be more than enough to get the headers generated. Do you know a simpler solution, or did I miss something in the configuration?
You take the data from a database with JDBC, so you need to add the headers yourself as the first row of the message body. The result set from the JDBC component is just the data, not including headers.
I have done it by overriding BindyCsvDataFormat and BindyCsvFactory:
public class BindySplittedCsvDataFormat extends BindyCsvDataFormat {
    private boolean marshallingFirstLot = false;

    public BindySplittedCsvDataFormat() {
        super();
    }

    public BindySplittedCsvDataFormat(Class<?> type) {
        super(type);
    }

    @Override
    public void marshal(Exchange exchange, Object body, OutputStream outputStream) throws Exception {
        // Only the first chunk of a split carries the header row.
        marshallingFirstLot = Integer.valueOf(0).equals(exchange.getProperty("CamelSplitIndex"));
        super.marshal(exchange, body, outputStream);
    }

    @Override
    protected BindyAbstractFactory createModelFactory(FormatFactory formatFactory) throws Exception {
        BindySplittedCsvFactory bindyCsvFactory = new BindySplittedCsvFactory(getClassType(), this);
        bindyCsvFactory.setFormatFactory(formatFactory);
        return bindyCsvFactory;
    }

    protected boolean isMarshallingFirstLot() {
        return marshallingFirstLot;
    }
}

public class BindySplittedCsvFactory extends BindyCsvFactory {
    private BindySplittedCsvDataFormat bindySplittedCsvDataFormat;

    public BindySplittedCsvFactory(Class<?> type, BindySplittedCsvDataFormat bindySplittedCsvDataFormat) throws Exception {
        super(type);
        this.bindySplittedCsvDataFormat = bindySplittedCsvDataFormat;
    }

    @Override
    public boolean getGenerateHeaderColumnNames() {
        return super.getGenerateHeaderColumnNames() && bindySplittedCsvDataFormat.isMarshallingFirstLot();
    }
}
My solution, using Spring XML (though I would still like a built-in option for extracting the header on top):
<multicast stopOnException="true">
    <pipeline>
        <log message="saving table ${headers.tablename} header to ${headers.CamelFileName}..."/>
        <setBody>
            <groovy>request.headers.get('CamelJdbcColumnNames').join(";") + "\n"</groovy>
        </setBody>
        <to uri="file:output"/>
    </pipeline>
    <pipeline>
        <log message="saving table ${headers.tablename} rows to ${headers.CamelFileName}..."/>
        <marshal>
            <csv delimiter=";" headerDisabled="false" useMaps="true"/>
        </marshal>
        <to uri="file:output?fileExist=Append"/>
    </pipeline>
</multicast>
http://www.redaelli.org/matteo-blog/2019/05/24/exporting-database-tables-to-csv-files-with-apache-camel/

How to reverse the extracted entries after modification

I am working with a CSV file holding a very large dataset. While reading the file, I extract the ';'-separated numeric BALANCE value (the 4th field) from each row in a while loop, and after a mathematical calculation (here an increment) I build a list of Double.
Now I want to store this list of Double back in its original position (the 4th field) in reverse order, from end to beginning. Example:
public static void main(String[] args) throws IOException {
    String filename = "abc.csv";
    List<Double> list = new ArrayList<Double>();
    File file = new File(filename);
    Scanner inputStream = new Scanner(file);
    inputStream.next(); // skip the header
    while (inputStream.hasNext()) {
        String data = inputStream.next();
        String[] values = data.split(";");
        double BALANCE = Double.parseDouble(values[1]);
        BALANCE = BALANCE + 1;
        list.add(BALANCE); // collect the incremented balances
        // attempted reversal: walks the list backwards but keeps
        // overwriting the same values[1] slot of the current row
        ListIterator<Double> li = list.listIterator(list.size());
        while (li.hasPrevious()) {
            values[1] = String.valueOf(li.previous());
        }
    }
    inputStream.close(); // moved out of the loop; closing inside it breaks the iteration
}
You can use Collections.reverse, for example: Collections.reverse(list);
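A minimal sketch of that suggestion applied to the code above (the field index values[1] and the file name follow the question's code; adjust if the balance really is the 4th field): collect the incremented balances in one pass, reverse the list once after the loop, then write the reversed values back into the rows:

List<String[]> rows = new ArrayList<>();
List<Double> balances = new ArrayList<>();
try (Scanner in = new Scanner(new File("abc.csv"))) {
    in.nextLine(); // skip the header line
    while (in.hasNextLine()) {
        String[] values = in.nextLine().split(";");
        balances.add(Double.parseDouble(values[1]) + 1); // the calculation
        rows.add(values);
    }
}
Collections.reverse(balances); // last balance ends up in the first row, and so on
for (int i = 0; i < rows.size(); i++) {
    rows.get(i)[1] = String.valueOf(balances.get(i));
}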

compare the data in CSV files using mapreduce program

I am trying to compare the columns of two CSV files using a MapReduce program.
The input CSV data files (the input to the map program) contain automatically generated data with around 100 columns and thousands of rows, in the format below...
Note: CSV file columns are separated by ";"
Input File1 Data
Column1;Column2;Column3;Column4;-----------
Sigma48_12mar09.9010.9010.3;K.TAFQEALDAAGDKLVVVDFSATWC[160.14]GPC[160.14]K.M;P08263.3;1.062
Sigma48_12mar09.9063.9063.3;K.KDPEGLFLQDNIVAEFSVDETGQMSATAK.G;P08263.3;1.062
Input File2 Data
Column1;Column2;Column3;Column4;-----------
Sigma48_12mar09.9188.9188.2;R.YKLSLEFPSGYPYNAPTVK.F;P08263.3;1.062
Sigma48_12mar09.9314.9314.2;R.YKLSLEFPSGYPYNAPTVK.FP08263.3;1.062
Sigma48_12mar09.9010.9010.3;K.TAFQEALDAAGDKLVVVDFSATWC[160.14]GPC[160.14]K.M;P08263.3;1.062
My requirement:
Read all the rows in Input File1 Data.csv, take column1, then read all the rows in Input File2 Data.csv and compare the column1 of the first file with the column1 of the second file.
When a match is found, compare all the other columns of that particular row across the two files, write the matched data to HDFS, and return the % matched between the two input files.
My code is as follows...
/* First Mapper */
public static class InputMapper1 extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split(";");
        String name = words[1];
        String other = words[2];
        context.write(new Text(name), new Text(line));
    }
}
/* Second Mapper */
public static class InputMapper2 extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split(";");
        String name = words[1];
        String other = words[2];
        System.out.println(key);
        context.write(new Text(name), new Text(line));
    }
}
/* Reducer for both of the mappers */
/* incomplete; the two CSV files still have to be compared here */
public static class CounterReducer extends Reducer<Text, Text, Text, Text> {
    String line = null;

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            line = value.toString();
        }
        context.write(key, new Text(line));
    }
}
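The reducer above still needs the actual comparison. One common way to finish it is a reduce-side join: have each mapper tag its output value with the source file, then split the values by tag in the reducer and compare. A sketch under that assumption (the FILE1/FILE2 tags and the counter names are illustrative, not from the question):

public static class CompareReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> file1Rows = new ArrayList<>();
        List<String> file2Rows = new ArrayList<>();
        for (Text value : values) {
            String v = value.toString();
            if (v.startsWith("FILE1|")) {
                file1Rows.add(v.substring("FILE1|".length()));
            } else if (v.startsWith("FILE2|")) {
                file2Rows.add(v.substring("FILE2|".length()));
            }
        }
        // The key occurred in both inputs: compare the full rows pairwise.
        for (String r1 : file1Rows) {
            for (String r2 : file2Rows) {
                if (r1.equals(r2)) {
                    context.write(key, new Text(r1)); // matched row goes to HDFS
                    context.getCounter("compare", "matched").increment(1);
                } else {
                    context.getCounter("compare", "unmatched").increment(1);
                }
            }
        }
    }
}

The driver can then read the two counters after the job completes and compute the matched percentage as matched / (matched + unmatched).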