Compare the data in CSV files using a MapReduce program - csv

I am trying to compare the columns of two CSV files using a MapReduce program.
The input CSV data files (the input to the map program) contain automatically generated data with around 100 columns and thousands of rows, in the format below.
Note: the CSV file columns are separated by ";"
Input File1 Data
Column1;Column2;Column3;Column4;-----------
Sigma48_12mar09.9010.9010.3;K.TAFQEALDAAGDKLVVVDFSATWC[160.14]GPC[160.14]K.M;P08263.3;1.062
Sigma48_12mar09.9063.9063.3;K.KDPEGLFLQDNIVAEFSVDETGQMSATAK.G;P08263.3;1.062
Input File2 Data
Column1;Column2;Column3;Column4;-----------
Sigma48_12mar09.9188.9188.2;R.YKLSLEFPSGYPYNAPTVK.F;P08263.3;1.062
Sigma48_12mar09.9314.9314.2;R.YKLSLEFPSGYPYNAPTVK.FP08263.3;1.062
Sigma48_12mar09.9010.9010.3;K.TAFQEALDAAGDKLVVVDFSATWC[160.14]GPC[160.14]K.M;P08263.3;1.062
My requirement:
Read all the rows in Input File1 Data.csv and take column1, then read all the rows in Input File2 Data.csv and compare column1 of the first file with column1 of the second file.
When a match is found, compare all the other columns of that particular row in the two files, write the matched data to HDFS, and return the % matched between the two input files.
My code is as follows...
/* First Mapper */
public static class InputMapper1 extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split(";");
        String name = words[1];
        String other = words[2];
        context.write(new Text(name), new Text(line));
    }
}
/* Second Mapper */
public static class InputMapper2 extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split(";");
        String name = words[1];
        String other = words[2];
        System.out.println(key);
        context.write(new Text(name), new Text(line));
    }
}
/* Reducer for both of the mappers */
/* incomplete -- the two CSV files have to be compared here */
public static class CounterReducer extends Reducer<Text, Text, Text, Text> {
    String line = null;

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            line = value.toString();
        }
        context.write(key, new Text(line));
    }
}
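The reducer is where the actual comparison has to happen. One common pattern (a reduce-side join) is to have each mapper tag its output value with the source file, key both files on the join column, and then compare the remaining columns per key in the reducer. The sketch below is illustrative only: it assumes the mappers are changed to emit values prefixed with "F1|" and "F2|", writes matched rows to the job output on HDFS, and counts matches with a Hadoop counter so the driver can compute the match percentage after the job finishes.
/* Sketch of a join-style reducer, assuming the mappers prefix their values with "F1|" / "F2|" */
/* needs java.util.ArrayList and java.util.List in addition to the Hadoop imports used above */
public static class CompareReducer extends Reducer<Text, Text, Text, Text> {

    public enum MatchCounter { TOTAL_KEYS, MATCHED_ROWS } // illustrative counter names

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> file1Rows = new ArrayList<String>();
        List<String> file2Rows = new ArrayList<String>();
        for (Text value : values) {
            String v = value.toString();
            if (v.startsWith("F1|")) {
                file1Rows.add(v.substring(3));
            } else if (v.startsWith("F2|")) {
                file2Rows.add(v.substring(3));
            }
        }
        context.getCounter(MatchCounter.TOTAL_KEYS).increment(1);
        // compare every row of file1 with every row of file2 that shares this key
        for (String r1 : file1Rows) {
            for (String r2 : file2Rows) {
                if (r1.equals(r2)) { // all remaining columns identical
                    context.getCounter(MatchCounter.MATCHED_ROWS).increment(1);
                    context.write(key, new Text(r1)); // matched row goes to the HDFS output
                }
            }
        }
    }
}
After the job completes, the driver can read MATCHED_ROWS and TOTAL_KEYS from job.getCounters() and compute the percentage from those two numbers.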

Related

Can OpenCSV ignore trailing commas on records?

A CSV with trailing commas like this:
name, phone
joe, 123-456-7890,
bob, 333-555-6666,
processed like this:
CSVReaderHeaderAware r = new CSVReaderHeaderAware(reader);
Map<String, String> values = r.readMap();
will throw this exception:
java.io.IOException: Error on record number 2: The number of data elements is not the same as the number of header elements
For now I'm stripping commas from input files using sed:
find . -type f -exec sed -i 's/,*\r*$//' {} \;
Is there some easy way to tell OpenCSV to ignore trailing commas?
The OpenCSV maintainers commented here. As of OpenCSV v5.1 there is no simple way to accomplish this, and pre-processing the file using sed etc. is best for now.
According to the link provided in @Andrew's answer, it is malformed CSV input.
But as the maintainer himself suggests (here):
If you know you will always have single-line records, you could
derive a class from CSVReader, override getNextLine() to call
super.getNextLine(), then cut off the trailing comma, and of course,
pass your new reader into opencsv to use in parsing.
In other words, create your own CustomCSVReader and remove the last comma.
Here's an example:
import com.opencsv.CSVReader;

import java.io.IOException;
import java.io.Reader;

public class CustomCSVReader extends CSVReader {

    public CustomCSVReader(Reader reader) {
        super(reader);
    }

    @Override
    protected String getNextLine() throws IOException {
        String line = super.getNextLine();
        if (line == null) {
            return null;
        }
        if (line.endsWith(",")) {
            return line.substring(0, line.length() - 1);
        }
        return line;
    }
}
The Model Converter using CustomCSVReader
import com.opencsv.bean.CsvToBeanBuilder;

import java.io.StringReader;
import java.util.List;

public class CustomCSVParser {

    public List<User> convert(String data) {
        return new CsvToBeanBuilder<User>(new CustomCSVReader(new StringReader(data)))
                .withType(User.class)
                .build()
                .parse();
    }
}
The Model class
import com.opencsv.bean.CsvBindByName;

public class User {

    @CsvBindByName(column = "name")
    private String userName;

    @CsvBindByName(column = "phone")
    private String phoneNumber;

    // Constructor, Getters and Setters omitted
}
Test Class
class CustomCSVParserTest {

    private CustomCSVParser instance;

    @BeforeEach
    void setUp() {
        instance = new CustomCSVParser();
    }

    @Test
    void csvInput_withCommaInLastLine_mustBeParsed() {
        String data = "name, phone\n"
                + "joe, 123-456-7890,\n"
                + "bob, 333-555-6666,";
        List<User> result = instance.convert(data);
        List<User> expectedResult = Arrays.asList(
                new User("joe", "123-456-7890"),
                new User("bob", "333-555-6666"));
        Assertions.assertArrayEquals(expectedResult.toArray(), result.toArray());
    }
}
That's it.

SpringBatch Read from unstructured csv

I would like to read from an unstructured CSV file, meaning it will have different column types every time. Please help.
Yes, I finally found the solution myself and would like to share it. You can write a LineMapper and map the unstructured header (dynamic columns) to each line with the following code. Please note that I read the header while scheduling the job and pass it as a JobParameter.
@Bean
@StepScope
public FlatFileItemReader<Customer> csvReader(@Value("#{jobParameters[filepath]}") String filepath,
        @Value("#{jobParameters[header]}") String header,
        @Value("#{jobParameters[campaignId]}") String campaignId,
        @Value("#{jobParameters[_id]}") String _id) {
    FlatFileItemReader<Customer> flatFileItemReader = new FlatFileItemReader<>();
    flatFileItemReader.setResource(new FileSystemResource(filepath));
    flatFileItemReader.setName("customer-csv-file-reader");
    flatFileItemReader.setLinesToSkip(1);
    flatFileItemReader.setLineMapper(lineMapper(header, campaignId, _id));
    return flatFileItemReader;
}
@Bean
@StepScope
public LineMapper<Customer> lineMapper(@Value("#{jobParameters[header]}") String header,
        @Value("#{jobParameters[campaignId]}") String campaignId,
        @Value("#{jobParameters[_id]}") String _id) {
    return new LineMapper<Customer>() {
        public String[] headers = header.split(",");

        @Override
        public Customer mapLine(String line, int linenumber) throws Exception {
            Customer item = new Customer();
            String[] p = line.split(",");
            Map<String, String> properties = IntStream.range(0, headers.length).boxed()
                    .collect(Collectors.toMap(i -> headers[i], i -> p[i]));
            item.setCampaignId(new ObjectId(campaignId));
            item.setInviteId(new ObjectId(_id));
            item.setProperties(properties);
            return item;
        }
    };
}
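For reference, the Customer item used above could look roughly like the minimal sketch below. The field set is illustrative only (the real class in the project surely has more to it); the ObjectId type comes from the MongoDB Java driver.
// Minimal sketch of the Customer item assumed by the lineMapper above (field names are illustrative)
import org.bson.types.ObjectId;

import java.util.Map;

public class Customer {

    private ObjectId campaignId;
    private ObjectId inviteId;
    private Map<String, String> properties; // dynamic column name -> value

    public void setCampaignId(ObjectId campaignId) { this.campaignId = campaignId; }
    public void setInviteId(ObjectId inviteId) { this.inviteId = inviteId; }
    public void setProperties(Map<String, String> properties) { this.properties = properties; }

    // getters omitted
}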

Apache Camel CSV with Header

I have written a simple test app that reads records from a DB and puts the result into a CSV file. So far it works fine, but the column names, i.e. the headers, are not written to the CSV file. According to the docs they should be. I have also tried it with and without streaming and split, but the situation is the same.
In the Camel unit tests, at line 182, the headers are set explicitly: https://github.com/apache/camel/blob/master/components/camel-csv/src/test/java/org/apache/camel/dataformat/csv/CsvDataFormatTest.java
How can this very simple problem be solved without iterating over the headers myself? I have experimented with different settings, but the result is always the same: the delimiters I set are honored, but the headers are not. Thanks in advance for any responses.
I used Camel 2.16.1 like this:
final CsvDataFormat csvDataFormat = new CsvDataFormat();
csvDataFormat.setHeaderDisabled(false);
[...]
from("direct:TEST").routeId("TEST")
    .setBody(constant("SELECT * FROM MYTABLE"))
    .to("jdbc:myDataSource?readSize=100") // max 100 records
    // .split(simple("${body}")) // split the list
    // .streaming() // not to keep all messages in memory
    .marshal(csvDataFormat)
    .to("file:extract?fileName=TEST.csv");
[...]
EDIT 1
I have also tried to add the headers from exchange.getIn(). They are available there under the name "CamelJdbcColumnNames" in a HashSet. I added them to the csvDataFormat like this:
final CsvDataFormat csvDataFormat = new CsvDataFormat();
csvDataFormat.setHeaderDisabled(false);
[...]
from("direct:TEST").routeId("TEST")
    .setBody(constant("SELECT * FROM MYTABLE"))
    .to("jdbc:myDataSource?readSize=100") // max 100 records
    .process(new Processor() {
        public void process(Exchange exchange) throws Exception {
            headerNames = (HashSet) exchange.getIn().getHeader("CamelJdbcColumnNames");
            System.out.println("#### Process headernames = " + new ArrayList<String>(headerNames).toString());
            csvDataFormat.setHeader(new ArrayList<String>(headerNames));
        }
    })
    .marshal(csvDataFormat) //.tracing()
    .to("file:extract?fileName=TEST.csv");
The println() prints the column names, but the generated CSV file does not contain them.
EDIT 2
I added the header names to the body as proposed in comment 1 like this:
.process(new Processor() {
    public void process(Exchange exchange) throws Exception {
        Set<String> headerNames = (HashSet) exchange.getIn().getHeader("CamelJdbcColumnNames");
        Map<String, String> nameMap = new LinkedHashMap<String, String>();
        for (String name : headerNames) {
            nameMap.put(name, name);
        }
        List<Map> listWithHeaders = new ArrayList<Map>();
        listWithHeaders.add(nameMap);
        List<Map> records = exchange.getIn().getBody(List.class);
        listWithHeaders.addAll(records);
        exchange.getIn().setBody(listWithHeaders, List.class);
        System.out.println("#### Process headernames = " + new ArrayList<String>(headerNames).toString());
        csvDataFormat.setHeader(new ArrayList<String>(headerNames));
    }
})
The proposal solved the problem, thank you for that, but it means that CsvDataFormat is not really usable on its own. After the JDBC query the exchange body contains an ArrayList of HashMaps, each holding one record of the table; the key of each HashMap entry is the column name and the value is the column value. So setting the header-output option on CsvDataFormat should be more than enough to get the headers generated. Do you know a simpler solution, or did I miss something in the configuration?
You take the data from a database with JDBC, so you need to add the headers to the message body yourself first, so that they form the first row. The result set from the JDBC component is just the data; it does not include the headers.
I have done it by overriding BindyCsvDataFormat and BindyCsvFactory:
public class BindySplittedCsvDataFormat extends BindyCsvDataFormat {

    private boolean marshallingFirstLot = false;

    public BindySplittedCsvDataFormat() {
        super();
    }

    public BindySplittedCsvDataFormat(Class<?> type) {
        super(type);
    }

    @Override
    public void marshal(Exchange exchange, Object body, OutputStream outputStream) throws Exception {
        marshallingFirstLot = Integer.valueOf(0).equals(exchange.getProperty("CamelSplitIndex"));
        super.marshal(exchange, body, outputStream);
    }

    @Override
    protected BindyAbstractFactory createModelFactory(FormatFactory formatFactory) throws Exception {
        BindySplittedCsvFactory bindyCsvFactory = new BindySplittedCsvFactory(getClassType(), this);
        bindyCsvFactory.setFormatFactory(formatFactory);
        return bindyCsvFactory;
    }

    protected boolean isMarshallingFirstLot() {
        return marshallingFirstLot;
    }
}
public class BindySplittedCsvFactory extends BindyCsvFactory {

    private BindySplittedCsvDataFormat bindySplittedCsvDataFormat;

    public BindySplittedCsvFactory(Class<?> type, BindySplittedCsvDataFormat bindySplittedCsvDataFormat) throws Exception {
        super(type);
        this.bindySplittedCsvDataFormat = bindySplittedCsvDataFormat;
    }

    @Override
    public boolean getGenerateHeaderColumnNames() {
        return super.getGenerateHeaderColumnNames() && bindySplittedCsvDataFormat.isMarshallingFirstLot();
    }
}
My solution uses Spring XML (though I'd still like a built-in option for also writing the header on top):
Using Spring XML:
<multicast stopOnException="true">
    <pipeline>
        <log message="saving table ${headers.tablename} header to ${headers.CamelFileName}..."/>
        <setBody>
            <groovy>request.headers.get('CamelJdbcColumnNames').join(";") + "\n"</groovy>
        </setBody>
        <to uri="file:output"/>
    </pipeline>
    <pipeline>
        <log message="saving table ${headers.tablename} rows to ${headers.CamelFileName}..."/>
        <marshal>
            <csv delimiter=";" headerDisabled="false" useMaps="true"/>
        </marshal>
        <to uri="file:output?fileExist=Append"/>
    </pipeline>
</multicast>
http://www.redaelli.org/matteo-blog/2019/05/24/exporting-database-tables-to-csv-files-with-apache-camel/

how to reverse the extracted entry after modification

I am working with a CSV file that holds a very large dataset. While reading the file I extract the numeric value in the 4th place (BALANCE) of each ';'-separated row through a while-loop iteration, and build a list of Double after some mathematical calculation (here, incrementing).
Now I want to store this list of Double in reverse order (from end to beginning) back in its original position (here, the 4th place). Example:
public static void main(String[] args) throws IOException {
    String filename = "abc.csv";
    List<Double> list = new ArrayList<Double>();
    File file = new File(filename);
    Scanner inputStream = new Scanner(file);
    inputStream.next(); // skip the header row
    while (inputStream.hasNext()) {
        String data = inputStream.next();
        String[] values = data.split(";");
        double BALANCE = Double.parseDouble(values[1]);
        BALANCE = BALANCE + 1;
        list.add(BALANCE);

        // how do I write the list back into the rows in reverse order?
        ListIterator<Double> li = list.listIterator(list.size());
        while (li.hasPrevious()) {
            values[1] = String.valueOf(li.previous());
        }
    }
    inputStream.close();
}
You can use Collections.reverse. Example: Collections.reverse(list);
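For example, a minimal standalone sketch (the variable names are illustrative; your list would be the one built while reading the file):
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ReverseExample {
    public static void main(String[] args) {
        List<Double> balances = new ArrayList<Double>(Arrays.asList(1.5, 2.5, 3.5));
        Collections.reverse(balances);   // in-place reversal
        System.out.println(balances);    // prints [3.5, 2.5, 1.5]
        // the reversed values can then be written back into the 4th field
        // (values[3]) of each row before the rows are written out again
    }
}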

how to read CSV file in Map/Reduce?

I have a large, comma-delimited CSV file that is around 6 GB in size. Below is the mapper function:
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String[] tokens = value.toString().split(",");
    String crimeType = tokens[5].trim(); // column #5 is the crime type in the CSV file, serving as the key
    // int year = Integer.parseInt(tokens[17].trim()); // the year when the crime happened
    int year = 2010;
    CrimeTypeKey crimeTypeYearKey = new CrimeTypeKey(crimeType, year);
    context.write(crimeTypeYearKey, ONE);
}
As you can see, I use ".split" to break down each row (or column?). I am wondering how you could use OpenCSV in this case. Please give me an example, thanks a lot.
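One possible approach (a sketch only, not a tested answer) is to build an OpenCSV CSVParser inside the mapper and call parseLine() instead of String.split(); unlike split(","), it handles quoted fields that contain embedded commas, though it still expects one record per line. CrimeTypeKey and ONE are taken from the snippet above (ONE is assumed to be an IntWritable); the class name CrimeMapper is made up for the example.
/* Sketch: using OpenCSV's CSVParser inside the mapper instead of String.split() */
/* requires the opencsv jar on the job classpath and import com.opencsv.CSVParser */
public static class CrimeMapper extends Mapper<LongWritable, Text, CrimeTypeKey, IntWritable> {

    private final CSVParser parser = new CSVParser(); // default separator is ','

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = parser.parseLine(value.toString()); // quote-aware split of one CSV line
        String crimeType = tokens[5].trim();
        int year = 2010;
        context.write(new CrimeTypeKey(crimeType, year), ONE);
    }
}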