Parsing CSV in Groovy with exception tolerance

I've been trying to parse a CSV file in Groovy, currently using the library org.apache.commons.csv 2.4. The requirement I have is that there are invalid data values in CSV cells, such as invalid characters, and instead of throwing an exception on the first invalid row/cell, I want to collect these exceptions and keep iterating through the CSV file until the end, so that I end up with a full list of the invalid data this CSV file has.
With that purpose, I've tried multiple ways of using this Apache library, but unfortunately, as long as it uses CSVParser.getNextRecord() for iteration, the iterator just aborts on the first error.
Put in code, it is something like this:
def records = new CSVParser(reader, CSVFormat.EXCEL.withHeader().withIgnoreSurroundingSpaces())
// The iterator() inside CSVParser always uses getNextRecord() for its next()
// implementation, and next() may throw an exception on an invalid character.
records.each { record ->
    try {
        // process record
    } catch (e) {
        // I want to collect errors here, but the exception is thrown from
        // .each itself (the iterator), which makes this try/catch in vain.
    }
}
So, is there anything else I should dig into in this library? Or could anybody point me to another, more viable solution? Many thanks to all!
Update:
Sample CSV
"Company code for WBS element","WBS Element","PS: Short description (1st text line)","Responsible Cost Center for WBS Element","OBJNR","WBS Status"
"1001","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"
"1001","RE-01768-011","Opex - To present a paper on "Career con","0000016400","PR00031497","X"
The second data row has an invalid character (a stray ") that makes the parser throw an exception.

The problem you have is that one of the characters in one cell is the quote character used by the parser, according to the format selected: CSVFormat.EXCEL.
The quote character is
the character used to encapsulate values containing special characters
so in your example the quote is misused and the parser complains about it.
You can workaround that using a different CSVFormat. For example, one without quote character:
@Grapes(
    @Grab(group='org.apache.commons', module='commons-csv', version='1.2')
)
import java.nio.charset.*
import org.apache.commons.csv.*
def text = '''"Company code for WBS element","WBS Element","PS: Short description (1st text line)","Responsible Cost Center for WBS Element","OBJNR","WBS Status"
"1001","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"
"1002","RE-01768-011","Opex - To present a paper on "Career con","0000016400","PR00031497","X"
"1003","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"'''
def parsed = CSVParser.parse(text, CSVFormat.EXCEL.withHeader().withIgnoreSurroundingSpaces().withQuote(null))
parsed.getRecords().each {
    println it.toMap().values()
}
And the above yields:
[]
["0000016400", "1001", "RE-01768-011", "Opex - To present a paper on Career con", "X", "PR00031497"]
["0000016400", "1002", "RE-01768-011", "Opex - To present a paper on "Career con", "X", "PR00031497"]
["0000016400", "1003", "RE-01768-011", "Opex - To present a paper on Career con", "X", "PR00031497"]
Of course, with the above workaround, you have the quotes (") included in each field.
You can replace them all if you want:
parsed.getRecords().each {
    println it.toMap().values().collect { it.replace('"', '') }
}

The problem is that if the CSV file has invalid data, meaning data that breaks the rules of the CSV format, then the parser cannot... parse. That's why it cannot reliably parse anything beyond the first error encountered.
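If dropping the quote character altogether is too blunt, another way to get the "collect every error" behavior is to feed the parser one physical line at a time, so that a bad row cannot abort the rest of the iteration. Below is a minimal sketch of that idea (in Java, since Commons CSV is a Java library and the Groovy equivalent is nearly identical); it assumes no quoted field contains an embedded newline, and data.csv is a placeholder file name:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class TolerantCsvRead {
    public static void main(String[] args) throws Exception {
        List<String> errors = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader("data.csv"))) {
            String header = reader.readLine(); // keep the header row aside
            String line;
            int lineNo = 1;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                try (CSVParser parser = CSVParser.parse(line, CSVFormat.EXCEL)) {
                    for (CSVRecord record : parser) {
                        // process the record...
                    }
                } catch (Exception e) {
                    // Remember the failure and keep going with the next line.
                    errors.add("line " + lineNo + ": " + e.getMessage());
                }
            }
        }
        errors.forEach(System.out::println); // the full list of invalid rows
    }
}
The trade-off is that line-at-a-time parsing breaks multi-line quoted fields; for sample data like the above, which has none, that is acceptable.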

Related

Escape string in a JSON object

We use FreeMarker to transform one JSON to another. The input JSON is something like this:
{"k1": "a", "k2":"line1. \n line2"}
After applying the FreeMarker template, the JSON is converted to:
{ \n\n "p1": "a", \n\n "p2": "line1. \n line2"}
Here is the logic we use to do the transformation:
final Map<String, Object> input = JsonConverter.convertFromJson(inputJson, Map.class); // inputJson: the raw JSON string
final Template template = freeMarkerConfiguration.getTemplate("Template1.ftl");
final Writer out = new StringWriter();
template.process(input, out);
out.flush();
final String newlineFilteredResult = new JSONObject(out.toString()).toString();
The conversion to a JSON object fails due to a newline character inside the string for key k2, and gives the following exception:
Caused by: org.json.JSONException: Unterminated string at ...
I tried using the following but nothing works:
1. JSONObject.quote
2. JSONValue.escape
3. out.toString().replaceAll("[\n\r]+", "\\n");
I get the following exception due to the newline characters at the beginning as well:
Caused by: org.json.JSONException: Missing value at 1 [character 2 line 1]
Could someone please point me in the correct direction.
Edit
After further clarification from the OP: he had "${key}": "${value}" in his FreeMarker template, and ${value} could contain line breaks. The solution in this case is to use ${value?json_string}.
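For illustration, here is a minimal, self-contained sketch of that fix (the template is inlined, the exact Configuration setup is an assumption, and the key names come from the question):
import freemarker.template.Configuration;
import freemarker.template.Template;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;

public class JsonStringEscapeDemo {
    public static void main(String[] args) throws Exception {
        Configuration cfg = new Configuration(Configuration.VERSION_2_3_31);
        // ?json_string escapes line breaks, quotes, backslashes, etc.
        String ftl = "{\"p1\": \"${k1}\", \"p2\": \"${k2?json_string}\"}";
        Template template = new Template("t", new StringReader(ftl), cfg);

        Map<String, Object> model = new HashMap<>();
        model.put("k1", "a");
        model.put("k2", "line1. \n line2"); // contains a real newline

        StringWriter out = new StringWriter();
        template.process(model, out);
        // Prints {"p1": "a", "p2": "line1. \n line2"} where \n is now the
        // two-character JSON escape, so the output parses cleanly.
        System.out.println(out);
    }
}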
Starting from FreeMarker 2.3.32, you can write "${key}": ${value?c} instead of "${key}": "${value}", because if the left side of ?c is a string, it now quotes and escapes the string instead of failing. Thus you don't even have to know whether the left side is a number/boolean, which must not be quoted (and ?c won't quote them), or a string, which must be quoted, as it's automatic.
Also, if the left-side value is known to be missing/null sometimes, then ?cn will handle that case by printing a null literal.
Also, check out the c_format setting for best results; but by default the string formatting is JSON-compatible, so using ?c will be an improvement even without setting that.
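A quick sketch of that ?c behavior (assuming FreeMarker 2.3.32+ and its Configuration.VERSION_2_3_32 constant; same caveats as the sketch above):
import freemarker.template.Configuration;
import freemarker.template.Template;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;

public class CBuiltInDemo {
    public static void main(String[] args) throws Exception {
        Configuration cfg = new Configuration(Configuration.VERSION_2_3_32);
        Template t = new Template("t",
                new StringReader("{\"s\": ${s?c}, \"n\": ${n?c}}"), cfg);

        Map<String, Object> model = new HashMap<>();
        model.put("s", "line1\nline2"); // string: ?c quotes and escapes it
        model.put("n", 42);             // number: ?c leaves it unquoted

        StringWriter out = new StringWriter();
        t.process(model, out);
        System.out.println(out); // {"s": "line1\nline2", "n": 42}
    }
}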

JSON.parse and JSON.stringify are not idempotent and that is bad

This question is multi-part:
(1a) JSON is fundamental to JavaScript, so why is there no JSON type? A JSON type would be a string that is formatted as JSON. It would be marked as parsed/stringified until the data was altered. As soon as the data was altered it would not be marked as JSON and would need to be re-parsed/re-stringified.
(1b) In some software systems, isn't it possible to (accidentally) attempt to send a plain JS object over the network instead of a serialized JS object? Why not make an attempt to avoid that?
(1c) Why can't we call JSON.parse on a straight up JavaScript object without stringifying it first?
var json = { // JS object in proper JSON format
  "baz": {
    "1": 1,
    "2": true,
    "3": {}
  }
};
var json0 = JSON.parse(json); // will throw a parse error... bad... it should not throw an error if the json var is actually proper JSON.
So we have no choice but to do this:
var json0= JSON.parse(JSON.stringify(json));
However, there are some inconsistencies, for example:
JSON.parse(true); //works
JSON.parse(null); //works
JSON.parse({}); //throws error
(2) If we keep calling JSON.parse on the same object, eventually it will throw an error. For example:
var json = { // same object as above
  "baz": {
    "1": 1,
    "2": true,
    "3": {}
  }
};
var json1 = JSON.parse(JSON.stringify(json));
var json2 = JSON.parse(json1); // throws an error... why?
(3) Why does JSON.stringify keep adding more and more slashes to its input? Not only is the result hard to read when debugging, but it actually puts you in a dangerous state: one JSON.parse call won't give you back a plain JS object; you have to call JSON.parse several times to get the plain JS object back. This is bad, and it means it is quite dangerous to call JSON.stringify more than once on a given JS object.
var json = {
  "baz": {
    "1": 1,
    "2": true,
    "3": {}
  }
};
var json2 = JSON.stringify(json);
console.log(json2);
var json3 = JSON.stringify(json2);
console.log(json3);
var json4 = JSON.stringify(json3);
console.log(json4);
var json5 = JSON.stringify(json4);
console.log(json5);
(4) What is the name for a function that we should be able to call over and over without changing the result (IMO how JSON.parse and JSON.stringify should behave)? The best term for this seems to be "idempotent" as you can see in the comments.
(5) Considering JSON is a serialization format that can be used for networked objects, it seems totally insane that you can't call JSON.parse or JSON.stringify twice or even once in some cases without incurring some problems. Why is this the case?
If you are someone who is inventing the next serialization format for Java, JavaScript or whatever language, please consider this problem.
IMO there should be two states for a given object: a serialized state and a deserialized state. In software languages with stronger type systems, this isn't usually a problem. But with JSON in JavaScript, if we call JSON.parse twice on the same data, we run into fatal exceptions. Likewise, if we call JSON.stringify twice on the same object, we can get into an unrecoverable state. Like I said, there should be two states and two states only: plain JS object and serialized JS object.
1) JSON.parse expects a string, and you are feeding it a JavaScript object.
2) Similar issue to the first one, but reversed: you feed an object (the output of the first parse) to a function that needs a string.
3) JSON.stringify accepts strings as well as objects; when you feed it a string (the output of a previous stringify), it escapes the quotes and slashes inside it so that the result is still a single valid JSON string. That is where the extra slashes come from (see the sketch below).
4) You can write your own function for this.
5) Because you are trying to do a conversion that is illegal. This is related to the first and second questions. As long as the correct object types are fed, you can call them as many times as you want. The only problem is the extra slashes, but that is in fact the standard.
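The boxing/escaping in point 3 is not a JavaScript quirk; any JSON library behaves the same way. As an illustration, here is a Java analogue using Jackson (not from the original thread):
import com.fasterxml.jackson.databind.ObjectMapper;

public class RoundTrip {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // One "box": a Java string serialized into a JSON string literal.
        String boxed = mapper.writeValueAsString("5");
        System.out.println(boxed); // "5"

        // Unboxing once returns the original value with its type preserved.
        Object back = mapper.readValue(boxed, Object.class);
        System.out.println(back.getClass().getSimpleName()); // String

        // Serializing the already-serialized string adds a second box: the
        // inner quotes get escaped, which is where the slashes come from.
        String doubleBoxed = mapper.writeValueAsString(boxed);
        System.out.println(doubleBoxed); // "\"5\""
    }
}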
We'll start with this nightmare of your creation: string input and integer output.
IJSON.parse(IJSON.stringify("5")); //=> 5
The built-in JSON functions would not fail us this way: string input and string output.
JSON.parse(JSON.stringify("5")); //=> "5"
JSON must preserve your original data types
Think of JSON.stringify as a function that wraps your data up in a box, and JSON.parse as the function that takes it out of a box.
Consider the following:
var a = JSON.stringify;
var b = JSON.parse;
var data = "whatever";
b(a(data)) === data; // true
b(b(a(a(data)))) === data; // true
b(b(b(a(a(a(data)))))) === data; // true
That is, if we put the data in 3 boxes, we have to take it out of 3 boxes. Right?
If I put my data in 2 boxes and take it out of 1, I'm not holding my data yet, I'm holding a box that contains my data. Right?
b(a(a(data))) === data; // false
Seems sane to me...
1. JSON.parse unboxes your data. If it is not boxed, it cannot unbox it. JSON.parse expects a string input, and you're giving it a JavaScript object literal.
2. The first valid call to JSON.parse would return an object. Calling JSON.parse again on that object output would result in the same failure as #1.
3. Repeated calls to JSON.stringify will "box" our data multiple times, so of course you then have to use repeated calls to JSON.parse to get your data out of each "box".
Idempotence
No, this is perfectly sane. You can't triple-stamp a double-stamp.
You'd never make a mistake like this, would you?
var json = IJSON.stringify("hi");
IJSON.parse(json);
//=> "hi"
OK, that's idempotent, but what about
var json = IJSON.stringify("5");
IJSON.parse(json);
//=> 5
UH OH! We gave it a string each time, but the second example returns an integer. The input data type has been lost!
Would the JSON functions have failed us here?
var json = JSON.stringify("hi");
JSON.parse(json);
//=> "hi"
All good. And what about the "5"?
var json = JSON.stringify("5");
JSON.parse(json);
//=> "5"
Yay, the types have been preserved! JSON works, IJSON does not.
Maybe a more real-life example:
OK, so you have a busy app with a lot of developers working on it. It makes
reckless assumptions about the types of your underlying data. Let's say it's a chat app that makes several transformations on messages as they move from point to point.
Along the way you'll have:
IJSON.stringify
data moves across a network
IJSON.parse
Another IJSON.parse because who cares? It's idempotent, right?
String.prototype.toUpperCase — because this is a formatting choice
Let's see the messages
bob: 'hi'
// 1) '"hi"', 2) <network>, 3) "hi", 4) "hi", 5) "HI"
Bob's message looks fine. Let's see Alice's.
alice: '5'
// 1) '5'
// 2) <network>
// 3) 5
// 4) 5
// 5) Uncaught TypeError: message.toUpperCase is not a function
Oh no! The server just crashed. You'll notice it's not even the repeated calling of IJSON.parse that failed here. It would've failed even if you called it once.
Seems like you were doomed from the start... Damned reckless devs and their careless data handling!
It would fail if Alice used any input that happened to also be valid JSON
alice: '{"lol":"pwnd"}'
// 1) '{"lol":"pwnd"}'
// 2) <network>
// 3) {lol:"pwnd"}
// 4) {lol:"pwnd"}
// 5) Uncaught TypeError: message.toUpperCase is not a function
OK, unfair example maybe, right? You're thinking, "I'm not that reckless, I
wouldn't call IJSON.stringify or IJSON.parse on user input like that!"
It doesn't matter. You've fundamentally broken JSON because the original
types can no longer be extracted.
If I box up a string using IJSON, and then unbox it, who knows what I will get back? Certainly not you, and certainly not the developer using your reckless function.
"Will I get a string type back?"
"Will I get an integer?"
"Maybe I'll get an object?"
"Maybe I will get cake. I hope it's cake"
It's impossible to tell!
You're in a whole new world of pain because you've been careless with your data types from the start. Your types are important so start handling them with care.
JSON.stringify expects an object type and JSON.parse expects a string type.
Now do you see the light?
I'll try to give you one reason why JSON.parse cannot be called multiple times on the same data without us having a problem.
You might not know it, but a JSON document does not have to be an object.
This is a valid JSON document:
"some text"
Let's store the representation of this document inside a JavaScript variable:
var JSONDocumentAsString = '"some text"';
and work on it:
var JSONdocument = JSON.parse(JSONDocumentAsString);
JSONdocument === 'some text';
This will cause an error, because this string is not the representation of a JSON document:
JSON.parse(JSONdocument);
// SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data
In this case, how could JSON.parse have guessed that JSONdocument (being a string) was a JSON document and that it should have returned it untouched?

Getting line number of json file at which the json validation failed

I am using json-schema-validator for validating my JSON.
I want to show the line number in the JSON data file at which a validation failure occurs, so that I can present the failure messages in a user-friendly manner.
I get the pointer to the JSON node where the validation failure might have occurred as follows:
JsonNode jsondatanode = JsonLoader.fromFile(new File("jsondata.json"));
JsonNode jsonschemanode = JsonLoader.fromFile(new File("jsonschema.json"));
final JsonSchemaFactory factory = JsonSchemaFactory.byDefault();
final JsonSchema datastoreschema = factory.getJsonSchema(jsonschemanode);
ProcessingReport report = datastoreschema.validate(jsondatanode);
However, the pointer is inconvenient for locating the JSON object/attribute when the JSON file contains many nodes of the type the pointer refers to.
I got following validation failure message:
--- BEGIN MESSAGES ---
error: instance value (12) not found in enum (possible values:["true","false","y","n","yes","no",0,1])
level: "error"
schema: {"loadingURI":"#","pointer":"/properties/configuration/items/properties/skipHeader"}
instance: {"pointer":"/configuration/0/skipHeader"}
domain: "validation"
keyword: "enum"
value: 12
enum: ["true","false","y","n","yes","no",0,1]
--- END MESSAGES ---
I want to show a custom message for validation failures, with the line number in the JSON data file that caused the schema validation failure. I know I can access the individual details of the validation report, as shown in the code below.
I want to show the custom message as follows:
List<ProcessingMessage> messages = Lists.newArrayList((AbstractProcessingReport) report);
JsonNode reportJson = messages.get(0).asJson();
if (reportJson.get("keyword").toString().equals("enum")) {
    System.out.println("Value " + reportJson.get("value").toString()
            + " is invalid in " + filepath + " at line " + linenumber);
} else if (/* other keywords */) {
    // ...
}
// ...
What I don't understand is how I can get that linenumber variable in the above code.
Edit
Now I realize that
instance: {"pointer":"/configuration/0/skipHeader"}
shows which occurrence of skipHeader is in trouble; in this case it's the 0th instance of skipHeader inside configuration. However, I still think it's better to get the line number that ran into the problem.
(library author here)
While it can be done (I have somewhere an implementation of JsonParser which does just that), the problem is that the line/column information will most of the time be irrelevant.
To save bandwidth, JSON sent over the wire is most of the time on a single line, so the problem remains that you would get, say, "line 1, column 202" without being any the wiser.
I'll probably do this anyway for the next major version, but for 2.2.x it is too late...
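Until then, if your data files are pretty-printed like jsondata.json above, you can build the line numbers yourself: index each JSON Pointer's starting line with Jackson's streaming parser, then look up the pointer from the validation report. A sketch, assuming Jackson 2.9+ (for pathAsPointer()); the file and pointer names are taken from the question:
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class PointerLineIndex {
    // Maps each JSON Pointer in the file to the line its token starts on.
    public static Map<String, Integer> index(File jsonFile) throws Exception {
        Map<String, Integer> lines = new HashMap<>();
        try (JsonParser p = new JsonFactory().createParser(jsonFile)) {
            while (p.nextToken() != null) {
                // pathAsPointer() reflects the parser's current position,
                // e.g. /configuration/0/skipHeader
                String pointer = p.getParsingContext().pathAsPointer().toString();
                lines.putIfAbsent(pointer, p.getTokenLocation().getLineNr());
            }
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        Map<String, Integer> idx = index(new File("jsondata.json"));
        // Look up the pointer from the report's "instance" field:
        System.out.println(idx.get("/configuration/0/skipHeader"));
    }
}
As the answer above warns, this only helps when the file actually spans multiple lines.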

jackson jsonparser restart parsing in broken JSON

I am using Jackson to process JSON that comes in chunks in Hadoop. That means they are big files that are cut up in blocks (in my problem it's 128M, but it doesn't really matter).
For efficiency reasons, I need it to be streaming (not possible to build the whole tree in memory).
I am using a mixture of JsonParser and ObjectMapper to read from my input.
At the moment, I am using a custom InputFormat that is not splittable, so I can read my whole JSON.
The structure of the (valid) JSON is something like:
[ { "Rep":
{
"date":"2013-07-26 00:00:00",
"TBook":
[
{
"TBookC":"ABCD",
"Records":
[
{"TSSName":"AAA",
...
},
{"TSSName":"AAB",
...
},
{"TSSName":"ZZZ",
...
}
] } ] } } ]
The records I want to read in my RecordReader are the elements inside the "Records" element. The "..." means that there is more info there, which makes up my record.
If I have only one split, there is no problem at all.
I use a JsonParser for fine-grained work (reading the headers and moving to the "Records" token), and then I use ObjectMapper together with JsonParser to read records as objects. For the details:
MappingJsonFactory factory = new MappingJsonFactory();
factory.configure(JsonParser.Feature.AUTO_CLOSE_SOURCE, false);
mapper = new ObjectMapper(factory);
mapper.configure(Feature.FAIL_ON_UNKNOWN_PROPERTIES, false);
mapper.configure(SerializationConfig.Feature.FAIL_ON_EMPTY_BEANS, false);
parser = factory.createJsonParser(iStream);
mapper.readValue(parser, JsonNode.class);
Now, let's imagine I have a file with two inputsplits (i.e. there are a lot of elements in "Records").
The valid JSON starts on the first split, and I read and keep the headers (which I need for each record, in this case the "date" field).
The split would cut anywhere in the Records array. So let's assume I get a second split like this:
...
},
{"TSSName":"ZZZ",
...
},
{"TSSName":"ZZZ2",
...
}
] } ] } } ]
I can check, before I start parsing, and move the InputStream (FSDataInputStream) to the beginning ("{") of the record with the next "TSSName" in it (and this is done OK). It's fine to discard the "garbage" at the beginning. So we get this:
{"TSSName":"ZZZ",
...
},
{"TSSName":"ZZZ2",
...
},
...
] } ] } } ]
Then I hand it to the JsonParser/ObjectMapper pair seen above.
The first object, "ZZZ", is read OK.
But for the next one, "ZZZ2", it breaks: the JsonParser complains about malformed JSON. It encounters a "," that is not in an array, so it fails. And then I cannot keep on reading my records.
How could this problem be solved, so that I can still read my records from the second (and nth) splits? How could I make the parser ignore these errors on the commas, or let the parser know in advance that it is reading the contents of an array?
It seems it's OK to just catch the exception: the parser goes on, and it's able to keep on reading objects via the ObjectMapper.
I don't really like it - I would like an option where the parser would not throw exceptions on nonstandard or even bad JSON. So I don't know if this fully answers the question, but I hope it helps.
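Put as code, the catch-and-continue loop could look like the sketch below, using the question's Jackson 1.x classes. The stream is assumed to already be positioned inside the Records array, and how well the parser recovers after an exception depends on where it broke, which matches the caveat above:
import java.io.InputStream;
import org.codehaus.jackson.JsonNode;
import org.codehaus.jackson.JsonParseException;
import org.codehaus.jackson.JsonParser;
import org.codehaus.jackson.JsonToken;
import org.codehaus.jackson.map.MappingJsonFactory;
import org.codehaus.jackson.map.ObjectMapper;

public class TolerantRecordReader {
    // Reads every object it can find, swallowing parse errors in between.
    public static void readAll(InputStream iStream) throws Exception {
        MappingJsonFactory factory = new MappingJsonFactory();
        ObjectMapper mapper = new ObjectMapper(factory);
        JsonParser parser = factory.createJsonParser(iStream);
        while (true) {
            try {
                JsonToken token = parser.nextToken();
                if (token == null) {
                    break; // end of input
                }
                if (token == JsonToken.START_OBJECT) {
                    JsonNode record = mapper.readValue(parser, JsonNode.class);
                    // process the record...
                }
            } catch (JsonParseException e) {
                // Malformed junk between records (stray commas, cut-off
                // closing brackets): skip it and try the next token.
            }
        }
    }
}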

JSON parsing: Unexpected Token error

I'm trying to parse a string to JSON and I'm getting an unexpected token error.
I am checking validity using http://json.parser.online.fr/, which comes up with no parse errors, but it still says the eval fails due to an unexpected token. If you paste the JSON from below into that website, you can see that it finds an error but doesn't specify which token is causing it.
Here is what I'm trying to parse:
{
"Polish": {
"Rent": [
{
"english": "a",
"audioUrl": "b",
"alternate": "c"
},
{
"english": "d",
"audioUrl": "e",
"alternate": "f"
}
]
}
}
Am I missing something obvious?
EDIT
There is an unprintable character between the : and [ after the "Rent" key.
I am doing some replace() calls on the string prior to the parse attempt, which are likely creating the problem.
Prior to the parse, that particular line is
"Rent":"[
I want to remove the double quote between the : and [ symbols.
So I am using:
var reg = new RegExp('":"', 'g');
var newStr = originalStr.replace(reg, '":');
I don't know why the above is causing the unprintable character though.
EDIT2
I did a quick check: removing the above replace() call, pasting the string into the validator, and manually removing the double quotes I was using replace() on, the unreadable characters are still there. So the error is present in the original string. So, more code :|
The string is being returned from an ajax call to a PHP script residing on a server. The PHP script reads a directory on the server and populates a nested associative array to produce the string, which is sent back to the JS side, which edits and parses it (shown above).
Within the directories are JSON files, which I'm inserting the contents of into this nested array structure to complete the JSON hierarchy.
The unreadable characters were
ef bb bf
which I googled and found to be the byte order mark (BOM) of the string representing the file contents.
So here is the PHP code which reads the directories and JSON files, creating a nested array structure to be json_encode()d and sent back to the JS:
if ($langHandle = opendir($langDir)) {
    while (false !== ($langEntry = readdir($langHandle))) {
        $currentLangDir = $langDir . "/" . $langEntry;
        if (is_dir($currentLangDir) && $langEntry != '.' && $langEntry != '..') {
            $currentLang = array();
            if ($currentLangHandle = opendir($currentLangDir)) {
                while (false !== ($catEntry = readdir($currentLangHandle))) {
                    $currentCatFile = $currentLangDir . "/" . $catEntry;
                    if (is_file($currentCatFile) && $catEntry != '.' && $catEntry != '..') {
                        $currentCat = file_get_contents($currentCatFile);
                        $currentLang[removeFileExtension($catEntry)] = $currentCat;
                    }
                }
            }
            $langArray[$langEntry] = $currentLang;
        }
    }
}
What can I do to fix these unwanted characters? A quick search on removing the BOM chars suggests it is a bad thing to do.
You probably have a non-printable character that is not showing up in what you pasted into your question. I copied and pasted your text into the online parser at the link you provided, and it parses cleanly.
Try copying and pasting your original text into this online hex dump website, and compare to what you get when you copy and paste from your SO question above... if they differ then you'll have a clue as to where the bogus character is.
Here's a screenshot of the output I got, which parses cleanly.
Bro, I was having a similar problem. Check your file encoding: (UTF-8) and (UTF-8 WITHOUT BOM) can make a difference.
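Since the ef bb bf bytes come from the JSON files read by file_get_contents(), the fix is to strip the UTF-8 BOM from each file's contents before embedding them (or to re-save the files as UTF-8 without BOM). The idea is a one-liner in any language; here it is sketched in Java (the PHP equivalent would strip a leading "\xEF\xBB\xBF"):
public class BomStripper {
    private static final String BOM = "\uFEFF"; // ef bb bf decoded as UTF-8

    // Removes a leading byte order mark, if present, before JSON parsing.
    public static String stripBom(String s) {
        return s.startsWith(BOM) ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        String withBom = "\uFEFF{\"Rent\": []}";
        System.out.println(stripBom(withBom)); // {"Rent": []}
    }
}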