Flink Data Stream CSV Writer not writing data to CSV file

I am new to Apache Flink and trying to learn data streams. I am reading student data, which has three columns (Name, Subject and Marks), from a CSV file. I have applied a filter on marks, selecting only those records where marks > 40.
I am trying to write this data to a CSV file, but the program runs successfully and the CSV file remains empty; no data gets written to it.
I tried different syntax for writing the CSV file, but none of it worked for me. I am running this locally through Eclipse. Writing to a text file works fine.
DataStream<String> text = env.readFile(format, params.get("input"),
        FileProcessingMode.PROCESS_CONTINUOUSLY, 100);

DataStream<String> filtered = text.filter(new FilterFunction<String>() {
    public boolean filter(String value) {
        String[] tokens = value.split(",");
        return Integer.parseInt(tokens[2]) >= 40;
    }
});

filtered.writeAsText("testFilter", WriteMode.OVERWRITE);

DataStream<Tuple2<String, Integer>> tokenized = filtered
        .map(new MapFunction<String, Tuple2<String, Integer>>() {
            public Tuple2<String, Integer> map(String value) throws Exception {
                return new Tuple2("Test", Integer.valueOf(1));
            }
        });

tokenized.print();
tokenized.writeAsCsv("file:///home/Test/Desktop/output.csv",
        WriteMode.OVERWRITE, "/n", ",");

        try {
            env.execute();
        } catch (Exception e1) {
            e1.printStackTrace();
        }
    }
}
Below is my input CSV format:
Name1,Subj1,30
Name1,Subj2,40
Name1,Subj3,40
Name1,Subj4,40
tokenized.print() prints all the records correctly.

I did a little experimenting, and found that this job works just fine:
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class WriteCSV {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        env.fromElements(new Tuple2<>("abc", 1), new Tuple2<>("def", 2))
                .writeAsCsv("file:///tmp/test.csv", FileSystem.WriteMode.OVERWRITE, "\n", ",");

        env.execute();
    }
}
If I don't set the parallelism to 1, then the results are different. In that case, test.csv is a directory containing four files, each written by one of the four parallel subtasks.
I'm not sure what's wrong in your case, but maybe you can work backwards from this example (assuming it works for you).
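If you would rather not force the whole job down to parallelism 1, a variant (a minimal sketch against the same DataStream API used above, reusing the path from the question) is to set the parallelism of just the CSV sink; note the row delimiter should be the newline escape "\n", not "/n" as in the question:

// writeAsCsv returns a DataStreamSink, so the sink alone can be forced to a single subtask;
// the result is then one file rather than a directory of part files.
tokenized
        .writeAsCsv("file:///home/Test/Desktop/output.csv",
                WriteMode.OVERWRITE, "\n", ",")
        .setParallelism(1);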

You should remove tokenized.print(); before tokenized.writeAsCsv();
The print() will consume the data.

Related

Arduino Accelerometer Output Processing to CSV File

I am trying to convert ADXL335 accelerometer data from an Arduino to a CSV file. The Arduino code works perfectly when I look at it using the serial monitor. The Processing code produces output in the console but does not write anything into the CSV file, and I'm not sure why it won't print.
When I uncomment the print in the second if statement, the values are stored in the CSV file, but it only works for 15 value inputs and then repeats those same values until the code is stopped. When we nest the if statements, we again get nothing in the CSV file. I think it is something with the first if statement, but I am not sure how to continue troubleshooting.
I'm wondering how I can get a continuous output of the accelerometer readings. Thanks in advance.
Here is my Arduino code:
void setup() {
  pinMode(14, INPUT); // define mode of pin
  pinMode(15, INPUT);
  pinMode(16, INPUT);
  Serial.begin(9600); // serial communication begin
  delay(10);
}

void loop()
{
  //Serial.print(",");
  Serial.print("X=");
  Serial.print(analogRead(14));
  Serial.print(",");
  Serial.print("Y=");
  Serial.print(analogRead(15));
  Serial.print(",");
  Serial.print("Z=");
  Serial.print(analogRead(16));
  Serial.println(",");
  delay(100);
}
Here is my Processing code:
void setup() {
  output = createWriter( "data.csv" );
  mySerial = new Serial( this, Serial.list()[1], 9600 );
}

void draw() {
  if (mySerial.available() > 0 ) {
    value = mySerial.readString();
    System.out.print(value);
    output.println(value);
  }
  if ( value != null ) {
    //output.println(value);
    output.println();
  }
}

void keyPressed() {
  output.flush(); // Writes the remaining data to the file
  output.close(); // Finishes the file
  exit(); // Stops the program
}
First off, you might want to change your Arduino code to output valid CSV lines.
I'd suggest losing the CSV header for now, or appending it from Processing.
Try this for now:
void loop()
{
  Serial.print(analogRead(14));
  Serial.print(",");
  Serial.print(analogRead(15));
  Serial.print(",");
  Serial.print(analogRead(16));
  Serial.println(",");
  delay(100);
}
This should output something like ####,####,####, which is a valid CSV line.
On the Processing side I would also advise buffering until the newline (\n) character first, which you can easily do with bufferUntil() and serialEvent():
import processing.serial.*;

// serial port
Serial mySerial;
// single line containing acc. values as a CSV row string
String values;
// the output
PrintWriter output;

void setup() {
  output = createWriter( "data.csv" );
  try {
    mySerial = new Serial( this, Serial.list()[1], 9600 );
    mySerial.bufferUntil('\n');
  } catch (Exception e) {
    println("Error opening serial port: double check USB cable, if Serial Monitor is open, etc.");
    e.printStackTrace();
  }
}

void draw() {
  background(0);
  if (values != null) {
    text("most recent values:\n" + values, 10, 15);
  }
}

// gets called when new data is in, and since we're buffering until \n it's one CSV line at a time
void serialEvent(Serial p) {
  values = p.readString();
  if (values != null && values.length() > 0) {
    println(values);
    // if values.trim() isn't called, the \n should still be there, so print() will suffice in terms of adding a new line
    output.print(values);
  } else {
    println("received invalid serial data");
  }
}

void keyPressed() {
  output.flush(); // Writes the remaining data to the file
  output.close(); // Finishes the file
  exit(); // Stops the program
}
Also note the error checking on the serial connection and on the serial data read
(it's a good habit to check for things that could go wrong).
Optionally you can add a header to your CSV file in setup():
output = createWriter( "data.csv" );
output.println("X,Y,Z,");
In terms of writing a CSV file there are many ways to do it, and Processing has a Table class which allows you to read/parse and write CSV data. At the moment your PrintWriter approach is pretty straightforward: use that.
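For reference, a minimal standalone sketch of that Table approach (not meant to be pasted into the code above; it assumes you parse the three integer values out of each serial line yourself, and saveTable writes the header plus all buffered rows in one go):

Table table = new Table();

void setupTable() {
  // one column per axis; the names become the CSV header
  table.addColumn("X", Table.INT);
  table.addColumn("Y", Table.INT);
  table.addColumn("Z", Table.INT);
}

// call this once per parsed serial line
void addReading(int x, int y, int z) {
  TableRow row = table.addRow();
  row.setInt("X", x);
  row.setInt("Y", y);
  row.setInt("Z", z);
}

void keyPressed() {
  saveTable(table, "data.csv"); // writes header and rows into the sketch folder
  exit();
}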

USACO Code Submission Problem - Output File Missing

I'm practicing some USACO past released problems but whenever I submit my code for grading I receive the error:
Your output file (FILENAME.out):
[File missing!]
I tested every problem using this simple code, but still receive the same error:
import java.util.*;
import java.io.*;

public class Test
{
    public static void main (String [] args) throws IOException
    {
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(FILENAME)));
        out.println("Hello world.");
        out.close();
        System.exit(0);
    }
}
Why would this code not create an output file?
The USACO grading system already has the output file in the same directory as your Java solution, so all you need to do is write to it.
In your line
PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(FILENAME)));
you should change this to
PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("FILENAME.out")));
(where FILENAME is the problem's name), since this is the name of the file the grader expects. This does not create a new file, but just writes to the one that already exists on the USACO grading system.
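For reference, a minimal sketch of the usual USACO file I/O pattern, using a hypothetical problem name test (so the grader supplies test.in and expects test.out):

import java.io.*;

public class Test {
    public static void main(String[] args) throws IOException {
        // "test" is a hypothetical problem name; substitute the real one from the problem statement
        BufferedReader in = new BufferedReader(new FileReader("test.in"));
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("test.out")));

        String firstLine = in.readLine();           // read whatever input the grader supplies
        out.println("Hello world: " + firstLine);   // write the answer to the .out file

        in.close();
        out.close();  // without closing (flushing), nothing reaches test.out
    }
}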

Univocity bean processor showing inconsistent behaviour in distributed system

I am using the univocity BeanProcessor for file parsing. I was able to use it successfully on my local box, but on deploying the same code to an environment with multiple hosts, the parser shows inconsistent behavior: it does not fail processing for some invalid files, and it sometimes fails processing for valid files.
I would like to know whether the BeanProcessor implementation is suitable for a multi-threaded, distributed environment.
Sample code:
private void validateFile(@Nonnull final File inputFile) throws NonRetriableException {
    try {
        final BeanProcessor<TargetingInputBean> rowProcessor = new BeanProcessor<TargetingInputBean>(
                TargetingInputBean.class) {
            @Override
            public void beanProcessed(@Nonnull final TargetingInputBean targetingInputBean,
                    @Nonnull final ParsingContext context) {
                final String customerId = targetingInputBean.getCustomerId();
                final String segmentId = targetingInputBean.getSegmentId();
                log.debug("Validating customerId {} segmentId {} for {} file", customerId, segmentId, inputFile
                        .getAbsolutePath());

                if (StringUtils.isBlank(customerId) || StringUtils.isBlank(segmentId)) {
                    throw new DataProcessingException("customerId or segmentId is blank");
                }

                try {
                    someValidation(customerId);
                } catch (IllegalArgumentException ex) {
                    throw new DataProcessingException(
                            String.format("customerId %s is not in required format. Exception"
                                    + " message %s", customerId, ex.getMessage()),
                            ex);
                }
            }
        };
        rowProcessor.setStrictHeaderValidationEnabled(true);

        final CsvParser parser = new CsvParser(getCSVParserSettings(rowProcessor));
        parser.parse(inputFile);
    } catch (TextParsingException ex) {
        throw new NonRetriableException(
                String.format("Exception=%s occurred while getting & parsing targeting file "
                        + "contents, error=%s", ex.getClass(), ex.getMessage()),
                ex);
    }
}
private CsvParserSettings getCSVParserSettings(@Nonnull final BeanProcessor<TargetingInputBean> rowProcessor) {
    final CsvParserSettings parserSettings = new CsvParserSettings();
    parserSettings.setProcessor(rowProcessor);
    parserSettings.setHeaderExtractionEnabled(true);
    parserSettings.getFormat().setDelimiter(AIRCubeTargetingFileConstants.FILE_SEPARATOR);
    return parserSettings;
}
TargetingInputBean:
public class TargetingInputBean {

    @Parsed(field = "CustomerId")
    private String customerId;

    @Parsed(field = "SegmentId")
    private String segmentId;
}
Are you using the latest version?
I just realized you are probably affected by a bug introduced in version 2.5.0 that was fixed in version 2.5.6 if I'm not mistaken. This plagued me for a while as it was an internal concurrency issue that was hard to track down. Basically when you pass a File without an explicit encoding it will try to find a UTF BOM marker in the input (effectively consuming the first character) to determine the encoding automatically. This happened only for InputStreams and Files.
Anyway, this has been fixed so simply updating to the latest version should get rid of the problem for you (please let me know if you are not using version 2.5.something)
If you want to remain with the current version you have there, the error will be gone if you call
parser.parse(inputFile, Charset.defaultCharset());
This will prevent the parser from trying to discover whether there's a BOM marker in your file, therefore avoiding that pesky bug.
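Applied to the validateFile method above, the workaround is just that change to the parse call (Charset here is java.nio.charset.Charset; using the platform default is an assumption, pass the file's actual encoding if you know it):

// Passing an explicit Charset skips the BOM auto-detection that triggers the bug
// in the affected 2.5.x versions.
final CsvParser parser = new CsvParser(getCSVParserSettings(rowProcessor));
parser.parse(inputFile, Charset.defaultCharset());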
Hope this helps

JSON to CSV conversion on HDFS

I am trying to convert a JSON file into CSV.
I have Java code which does this perfectly on a UNIX file system and on the local file system.
I have written the main class below to perform this conversion on HDFS.
public class ClassMain {

    public static void main(String[] args) throws IOException {
        String uri = args[1];
        String uri1 = args[2];
        Configuration conf = new Configuration();

        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        FSDataInputStream in = null;
        FSDataOutputStream out = fs.create(new Path(uri1));
        try {
            in = fs.open(new Path(uri));
            JsonToCSV toCSV = new JsonToCSV(uri);
            toCSV.json2Sheet().write2csv(uri1);
            IOUtils.copyBytes(in, out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}
json2Sheet and write2csv are the methods which perform the conversion and the write operation.
I am running this jar using the command below:
hadoop jar json-csv-hdfs.jar com.nishant.ClassMain /nishant/large.json /nishant/output
The problem is that it does not write anything to /nishant/output; it creates a zero-sized /nishant/output file.
Maybe the usage of copyBytes is not a good idea here.
How can I achieve this on HDFS, given that it works fine on a UNIX FS and the local FS?
Here I am trying to convert a JSON file to CSV, not trying to map JSON objects to their values.
FileSystem needs only one configuration key to successfully connect to HDFS.
conf.set(key, "hdfs://host:port"); // where key="fs.default.name"|"fs.defaultFS"
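Putting that together with the main class above, a minimal sketch of writing the converted CSV straight to HDFS could look like the following; the namenode host/port and the csv string are placeholders, and the CSV content is assumed to come from the asker's own JsonToCSV step:

import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteCsvToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // the one key the answer refers to; host and port are placeholders
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);

        // stands in for the output of the JSON-to-CSV conversion step
        String csv = "col1,col2\nval1,val2\n";

        // write the CSV bytes to the HDFS output path
        try (FSDataOutputStream out = fs.create(new Path("/nishant/output"))) {
            out.write(csv.getBytes(StandardCharsets.UTF_8));
        }
    }
}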

How to read a CSV file from HDFS?

I have my data in a CSV file. I want to read that CSV file from HDFS.
Can anyone help me with the code?
I'm new to Hadoop. Thanks in advance.
The classes required for this are FileSystem, FSDataInputStream and Path. The client should be something like this:
public static void main(String[] args) throws IOException {
    // TODO Auto-generated method stub
    Configuration conf = new Configuration();
    conf.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/core-site.xml"));
    conf.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/hdfs-site.xml"));

    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream inputStream = fs.open(new Path("/path/to/input/file"));
    System.out.println(inputStream.readChar());
}
FSDataInputStream has several read methods. Choose the one which suits your needs.
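For a CSV file you usually want whole lines rather than single characters, so one option (a sketch reusing the fs object from the client above, with the usual java.io, java.nio.charset and java.util imports) is to wrap the stream in a BufferedReader:

// Wrap the HDFS stream in a BufferedReader and read the CSV line by line.
FSDataInputStream inputStream = fs.open(new Path("/path/to/input/file"));
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(inputStream, StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] fields = line.split(",");  // naive split; fine for simple, unquoted CSV
        System.out.println(Arrays.toString(fields));
    }
}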
If it is MR, it's even easier:
public static class YourMapper extends
        Mapper<LongWritable, Text, Your_Wish, Your_Wish> {

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        //Framework does the reading for you...
        String line = value.toString(); //line contains one line of your csv file.

        //do your processing here
        ....................
        ....................

        context.write(Your_Wish, Your_Wish);
    }
}
If you want to use MapReduce, you can use TextInputFormat to read line by line and parse each line in the mapper's map function.
The other option is to develop (or find an already developed) CSV input format for reading data from the file.
There is an old tutorial here: http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html but the logic is the same in new versions.
If you are using a single process to read data from the file, it is the same as reading a file from any other file system. There is a nice example here: https://sites.google.com/site/hadoopandhive/home/hadoop-how-to-read-a-file-from-hdfs
HTH