Parsing a large CSV file, dealing with commas and quotes


I need to load in a large CSV file (>1MB) and parse it.
Generally this is quite easy to do by splitting first on linebreaks and then commas.
The problem is that some entries are strings that contain their own commas. When the spreadsheet is converted to CSV, the fields containing commas are wrapped in quotes.
I've written a parser that first escapes all the commas in these strings, then splits it on linebreaks and then commas, and then unescapes the values again.
This is quite a slow process for such a long string, as I need to iterate through the whole string.
Does anyone know a faster or more optimised method of dealing with this?

Have you had a look at csvlib yet? It is a parser library for ActionScript 3. It claims to be designed to properly handle quoted strings.
Hopefully, you are already enclosing your strings in quotes, especially the ones containing the commas. CSV parsers cannot distinguish a comma that is part of a string from a comma that separates two strings, unless the strings have quotes around them.
Good
"This string, has a comma", "This string doesn't"
Bad
This string, has a comma, this string doesn't
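To make the difference concrete, here is a small sketch in Python (not ActionScript 3; the standard csv module is used purely for illustration, and split_line is a hypothetical helper name). It assumes the space after the separating comma should be ignored, as in the "Good" line above.

```python
import csv

def split_line(line):
    """Split one CSV line into fields using Python's csv module."""
    # skipinitialspace=True ignores the blank that follows each delimiter,
    # so a quote right after ", " still opens a quoted field.
    return next(csv.reader([line], skipinitialspace=True))

good = '"This string, has a comma", "This string doesn\'t"'
bad = "This string, has a comma, this string doesn't"

print(split_line(good))  # two fields, the embedded comma is preserved
print(split_line(bad))   # three fields, the first value is cut in two
```

The quoted line comes back as two fields, while the unquoted line is split into three, which is exactly the ambiguity described above.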

Processing the file in a single pass will reduce the time. This can be achieved by using a simple state machine to handle the complexity of commas embedded in the values.
Regards
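A minimal sketch of such a single-pass state machine, written in Python for illustration (the same loop ports directly to ActionScript 3). The function name parse_csv is hypothetical, and the sketch assumes the usual CSV conventions: fields may be wrapped in double quotes, and a doubled quote inside a quoted field stands for one literal quote.

```python
def parse_csv(text):
    """Parse CSV text in a single pass; returns a list of rows (lists of strings)."""
    rows, row, field = [], [], []
    in_quotes = False              # the only state: inside a quoted field or not
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < n and text[i + 1] == '"':
                    field.append('"')      # doubled quote -> one literal quote
                    i += 1
                else:
                    in_quotes = False      # closing quote
            else:
                field.append(ch)           # commas and line breaks are literal here
        elif ch == '"':
            in_quotes = True               # opening quote
        elif ch == ',':
            row.append(''.join(field))     # end of field
            field = []
        elif ch in '\r\n':
            if ch == '\r' and i + 1 < n and text[i + 1] == '\n':
                i += 1                     # treat CRLF as one line break
            row.append(''.join(field))     # end of field and end of row
            field = []
            rows.append(row)
            row = []
        else:
            field.append(ch)
        i += 1
    if field or row:                       # last line without a trailing newline
        row.append(''.join(field))
        rows.append(row)
    return rows

print(parse_csv('a,"b,c",d\n"x ""y""",z\n'))
# [['a', 'b,c', 'd'], ['x "y"', 'z']]
```

Each character is looked at exactly once, so there is no separate escape and unescape pass over the whole string.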

Add a reference to the Microsoft.VisualBasic assembly (yes, it says VisualBasic, but it works in C# just as well; remember that in the end it is all just IL).
Use the Microsoft.VisualBasic.FileIO.TextFieldParser class to parse the CSV file.
Here is the sample code:
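    Imports Microsoft.VisualBasic.FileIO

    ' Using disposes the parser even if an exception is thrown
    Using parser As New TextFieldParser("C:\mar0112.csv")
        parser.TextFieldType = FieldType.Delimited
        parser.SetDelimiters(",")
        ' Quoted fields may contain commas (this is the default, stated for clarity)
        parser.HasFieldsEnclosedInQuotes = True
        While Not parser.EndOfData
            ' Processing row
            Dim fields() As String = parser.ReadFields()
            For Each field As String In fields
                ' TODO: Process field
            Next
        End While
    End Using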

Related

Regex for replacing unnecessary quotation marks within a JSON object containing an array

I am currently trying to format a JSON object using LabVIEW and have run into an issue where it adds extra quotation marks that invalidate my JSON. I have not found a way around this, so I thought formatting the string manually would be enough.
Here is the JSON object that I have:
{
  "contentType":"application/json",
  "content":{
    "msgType":2,
    "objects":"["cat","dog","bird"]",
    "count":3
  }
}
Here is the JSON object I want with the quotation marks removed.
{
  "contentType":"application/json",
  "content":{
    "msgType":2,
    "objects":["cat","dog","bird"],
    "count":3
  }
}
I am still not an expert with regex. Using a regex tester, I was only able to grab the "objects" and "count" fields, and I feel I would still have to use substrings to remove the quotation marks.
Example I am using (would use a "count" to find the start of the next field and work backwards from there)
"([objects]*)"
Additionally, all the other Regex I have been looking at removes all instances of quotation marks whereas I only need a specific area trimmed. Thus, I feel that a specific regex replace would be a much more elegant solution.
If there is a better way to go about this I am happy to hear any suggestions!

Your question suggests that the built-in LabVIEW JSON tools are insufficient for your use case.
The built-in library converts LabVIEW clusters to JSON in a one-shot approach. Bundle all your data into a cluster and then convert it to JSON.
When it comes to parsing JSON, you use the path input terminal and the default type terminals to control what data is parsed from a JSON string.
If you need to handle JSON in a manner similar to say JavaScript, I would recommend something like the JSONText Toolkit which is free to use (and distribute) under the BSD licence. This allows more complex and iterative building of JSON strings from LabVIEW types and has text-path style element access along with many more features.
The Output controls from both my examples are identical - although JSONText provides a handy Pretty Print vi.

After using a regex from one of the comments, I ended up with this regex which allowed me to match the array itself.
(\[(?:"[^"]*"|[^"])+\])
I was able to split the JSON string into 'before match', 'match' and 'after match', remove the quotation mark from the end of 'before match' and from the start of 'after match', and concatenate the three strings again to form the new output.
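The same idea can be sketched as a single regex replace. This is a Python illustration rather than LabVIEW, and unquote_arrays is a hypothetical helper name. It wraps the array pattern above in the two stray quotation marks and keeps only the captured array, under the assumption that the array's own strings contain no escaped quotes.

```python
import json
import re

# The array pattern from above, surrounded by the two unwanted quotation marks.
QUOTED_ARRAY = re.compile(r'"(\[(?:"[^"]*"|[^"])+\])"')

def unquote_arrays(text):
    """Drop the quotation marks that wrap a JSON array, keeping the array itself."""
    return QUOTED_ARRAY.sub(r'\1', text)

broken = ('{"contentType":"application/json","content":{"msgType":2,'
          '"objects":"["cat","dog","bird"]","count":3}}')
fixed = unquote_arrays(broken)

# json.loads raises an error if the result is still not valid JSON.
data = json.loads(fixed)
print(data["content"]["objects"])
```

Parsing the result with a real JSON parser afterwards is a cheap check that the replace did not damage anything else in the string.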

Deal with long numbers in scientific notation in json string - Freemarker

I have a json string which contains a long number but in scientific notation (like 1.559101974041E12 instead of 1559101974041). Due to this, I am not able to parse it using ?eval as this value must be in double quotes in order to get parsed.
One solution I thought of is to put double quotes around these numbers using a regex so that the string can be evaluated, and then use some FreeMarker method to convert the value into a long. But this solution is very risky and could alter other values as well.

I'm not sure how your template looks, but if you have variable s that contains the string "1.559101974041E12" (the quotation marks aren't part of the string value itself), then you can parse it like s?number. s?eval doesn't work because scientific notation is not part of the FreeMarker syntax (but ?number can parse more formats).
If you re-print the number in the template, note that depending on locale and configuration settings, it might look like 1,559,101,974,041. You can prevent that with ?c (for example ${s?number?c}), in which case it will always look like 1559101974041.
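For comparison only, the same two steps (parse the scientific notation, then print without grouping separators) look like this in Python. This is not FreeMarker code, and parse_long is a hypothetical helper name. Decimal is used instead of float so that no digits are lost to binary rounding.

```python
from decimal import Decimal

def parse_long(text):
    """Convert a numeric string, possibly in scientific notation, to an int."""
    value = Decimal(text)
    if value != value.to_integral_value():
        raise ValueError(f"not a whole number: {text!r}")
    return int(value)

n = parse_long("1.559101974041E12")
print(n)          # plain digits, comparable to ?c
print(f"{n:,}")   # grouped digits, comparable to locale-formatted output
```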

What's the easiest way to convert a 'pretty printed' JSON string to a compact format representation?

This may be an odd question as it's specific to the JSON strings themselves, not the objects they represent. Given a 'pretty printed' JSON string (representing any JSON-encodable model), how would one reformat it to the 'compact' format?
My first thought was to not consider it JSON, but rather just a string, then use RegEx to remove duplicate spaces, remove newlines, etc., but that's not context aware so it risks affecting the keys and values portions of the JSON if you don't properly test that you're inside quotes.
My next thought was to try and construct an object from the JSON, but without a type to convert to, I'm not sure how to do that without manually parsing the values as 'ANY', then testing if they're an array, and recurse into it if they are, repeating the process. Then once I have the final object, serialize the result in compact form. However, that seems like a lot of overkill.
Is there an easier way to accomplish this? We're using Swift 4 if it helps.

UPDATE:
As pointed out by @Mark A. Donohoe, this removes ALL whitespace, including spaces inside string values, so even though it looks cool, it's a dumb answer. Don't fall for it.
I needed the same thing and ended up creating a String extension:
extension String {
    // Caution: removes ALL whitespace, including spaces inside string values
    func toCompactJSON() -> String {
        return self.filter { !$0.isWhitespace && !$0.isNewline }
    }
}
In my case it was for testing purposes, and it turned out to be useless anyway, because the order of the entries in the JSON I was comparing against is not the same as in the output generated by JSONEncoder.
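A context-aware alternative is to let a JSON parser do the work: parse the pretty-printed text and serialize it again without any optional whitespace. Here is a minimal sketch in Python rather than Swift (compact_json is a hypothetical name). It assumes that number formatting does not need to be preserved byte for byte, since a value such as 1e2 may be written back differently.

```python
import json

def compact_json(pretty):
    """Re-serialize JSON text without insignificant whitespace."""
    # separators=(',', ':') drops the spaces json.dumps adds by default;
    # ensure_ascii=False keeps non-ASCII characters as they are.
    return json.dumps(json.loads(pretty), separators=(',', ':'), ensure_ascii=False)

pretty = """{
    "name": "two  spaces stay",
    "tags": [ "a", "b" ],
    "nested": { "ok": true }
}"""
print(compact_json(pretty))
# {"name":"two  spaces stay","tags":["a","b"],"nested":{"ok":true}}
```

Unlike the whitespace filter, this leaves spaces inside keys and values untouched, and no model type is needed because the parser builds generic dictionaries and lists.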

Raw string field value in JSON file

In my JSON file, one of the fields has to carry the content of another file (a string).
The string has CRLFs, single/double quotes, tabs.
Is there a way to consider my whole string as a raw string so I don't have to escape special characters?
Is there an equivalent in JSON to the string raw delimiter in C++?
In C++, I would just put the whole file content inside : R"( ... )"

Put simply, no, there isn't. Depending on which parser you use, it may have a feature that allows this, and there may be a variant of JSON that allows it (examples of variants include JSONP and JSON-C, though I'm not aware of one that specifically offers the features you are looking for). The JSON standard that is ubiquitous on the web, however, does not support multiline strings or unescaped special characters.

A workaround for the lack of raw string support in JSON is to Base64 encode your string before adding it to your JSON.
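A minimal sketch of that workaround in Python. The field name "payload" and both function names are assumptions made for this illustration.

```python
import base64
import json

def embed_file_content(raw_text):
    """Return a JSON document that carries raw_text as a Base64 string."""
    encoded = base64.b64encode(raw_text.encode("utf-8")).decode("ascii")
    return json.dumps({"payload": encoded})

def extract_file_content(document):
    """Recover the original text from a document built by embed_file_content."""
    encoded = json.loads(document)["payload"]
    return base64.b64decode(encoded).decode("utf-8")

raw = 'first line\r\n\t"double" and \'single\' quotes\r\n'
doc = embed_file_content(raw)
print(doc)
print(extract_file_content(doc) == raw)
```

The trade-off is that the field is no longer human-readable and grows by about a third. If the JSON is written by a program anyway, a regular JSON serializer escapes CRLFs, quotes and tabs automatically, which may make the encoding step unnecessary.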

Ocaml CSV to Float List

I'm looking for the easiest way to turn a CSV file (of floats) into a float list. I'm not well acquainted with reading files in general in Ocaml, so I'm not sure what this sort of function entails.
Any help or direction is appreciated :)
EDIT: I'd prefer not to use a third party CSV library unless I absolutely have to.

https://forge.ocamlcore.org/projects/csv/

If you don't want to include a third-party library, and your CSV files are simply formatted with no quotes or embedded commas, you can parse them easily with standard library functions. Use input_line on the file's input channel (or read_line if the data arrives on standard input) in a loop or in a recursive function to read each line in turn. To split each line, call Str.split_delim (link your program with str.cma or str.cmxa). Call float_of_string to parse each column into a float.
let comma = Str.regexp ","
let parse_line line = List.map float_of_string (Str.split_delim comma line)
Note that this will break if your fields contain quotes. It would be easy to strip quotes at the beginning and at the end of each element of the list returned by split_delim. However, if there are embedded commas, you need a proper CSV parser. You may have embedded commas if your data was produced by a localized program in a French locale — French uses commas as the decimal separator (e.g. English 3.14159, French 3,14159). Writing floating point data with commas instead of dots isn't a good idea, but it's something you might encounter (some spreadsheet CSV exports, for example). If your data comes out of a Fortran program, you should be fine.
