OCaml CSV to float list

I'm looking for the easiest way to turn a CSV file (of floats) into a float list. I'm not well acquainted with reading files in OCaml in general, so I'm not sure what this sort of function entails.
Any help or direction is appreciated :)
EDIT: I'd prefer not to use a third-party CSV library unless I absolutely have to.

If you do end up wanting a library, there is an existing OCaml CSV project: https://forge.ocamlcore.org/projects/csv/

If you don't want to include a third-party library, and your CSV files are simply formatted with no quotes or embedded commas, you can parse them easily with standard library functions. Use read_line in a loop or in a recursive function to read each line in turn. To split each line, call Str.split_delim (link your program with str.cma or str.cmxa). Call float_of_string to parse each column into a float.
let comma = Str.regexp ","
let parse_line line = List.map float_of_string (Str.split_delim comma line)
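For completeness, here is a rough sketch of the read loop described above (load_floats is a made-up name; it assumes one row of comma-separated floats per line and no header):
(* Read every line of the file and parse each one into a float list,
   yielding one float list per row. *)
let load_floats filename =
  let ic = open_in filename in
  let rec loop acc =
    match input_line ic with
    | line -> loop (parse_line line :: acc)
    | exception End_of_file -> close_in ic; List.rev acc
  in
  loop []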
Note that this will break if your fields contain quotes. It would be easy to strip quotes at the beginning and at the end of each element of the list returned by split_delim. However, if there are embedded commas, you need a proper CSV parser. You may have embedded commas if your data was produced by a localized program in a French locale, since French uses a comma as the decimal separator (English 3.14159 is French 3,14159). Writing floating point data with commas instead of dots isn't a good idea, but it's something you might encounter (some spreadsheet CSV exports, for example). If your data comes out of a Fortran program, you should be fine.
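If you do want the quote stripping mentioned above, a minimal helper might look like this (strip_quotes is a made-up name; it removes at most one pair of surrounding quotes):
(* Drop a single leading and trailing double quote, if both are present. *)
let strip_quotes s =
  let n = String.length s in
  if n >= 2 && s.[0] = '"' && s.[n - 1] = '"'
  then String.sub s 1 (n - 2)
  else s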

Related

Regex for replacing unnecessary quotation marks within a JSON object containing an array

I am currently trying to format a JSON object using LabVIEW and have run into an issue where it adds additional quotation marks, invalidating my JSON formatting. I have not found a way around this, so I thought formatting the string manually would be enough.
Here is the JSON object that I have:
{
  "contentType":"application/json",
  "content":{
    "msgType":2,
    "objects":"["cat","dog","bird"]",
    "count":3
  }
}
Here is the JSON object I want with the quotation marks removed.
{
  "contentType":"application/json",
  "content":{
    "msgType":2,
    "objects":["cat","dog","bird"],
    "count":3
  }
}
I am still not an expert with regex; using a regex tester I was only able to grab the "objects" and "count" fields, and I feel I would still have to use substrings to remove the quotation marks.
Example I am using (would use a "count" to find the start of the next field and work backwards from there)
"([objects]*)"
Additionally, all the other regexes I have looked at remove every instance of quotation marks, whereas I only need a specific area trimmed. A targeted regex replace therefore seems like a much more elegant solution.
If there is a better way to go about this I am happy to hear any suggestions!
Your question suggests that the built-in LabVIEW JSON tools are insufficient for your use case.
The built-in library converts LabVIEW clusters to JSON in a one-shot approach. Bundle all your data into a cluster and then convert it to JSON.
When it comes to parsing JSON, you use the path input terminal and the default type terminals to control what data is parsed from a JSON string.
If you need to handle JSON in a manner similar to, say, JavaScript, I would recommend something like the JSONText Toolkit, which is free to use (and distribute) under the BSD licence. It allows more complex, iterative building of JSON strings from LabVIEW types and has text-path style element access, along with many more features.
The Output controls from both my examples are identical - although JSONText provides a handy Pretty Print vi.
After using a regex from one of the comments, I ended up with this regex which allowed me to match the array itself.
(\[(?:"[^"]*"|[^"])+\])
I was able to split the JSON string into before-match, match, and after-match sections, remove the quotation mark from the end of 'before match' and from the start of 'after match', and concatenate the strings again to form the new output.
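For what it's worth, the same trick translates outside LabVIEW. Here is a rough sketch in OCaml using the standard Str library, where a single global replace stands in for the split-and-reconcatenate step (fix_objects is a made-up name):
(* Match quote + bracketed array + quote and keep only the array,
   deleting the stray wrapping quotes. In Str syntax, groups are
   written \( \) and alternation is \|. *)
let fix_objects json =
  let re = Str.regexp "\"\\(\\[\\(\"[^\"]*\"\\|[^\"]\\)+\\]\\)\"" in
  Str.global_replace re "\\1" json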

Spark - load numbers from a CSV file with non-US number format

I have a CSV file which I want to convert to Parquet for further processing. Using
sqlContext.read()
.format("com.databricks.spark.csv")
.schema(schema)
.option("delimiter",";")
.(other options...)
.load(...)
.write()
.parquet(...)
works fine when my schema contains only Strings. However, some of the fields are numbers that I'd like to be able to store as numbers.
The problem is that the file is not an actual "csv" but a semicolon-delimited file, and the numbers are formatted in German notation, i.e. a comma is used as the decimal separator.
For example, what in US would be 123.01 in this file would be stored as 123,01
Is there a way to force reading the numbers in a different Locale, or some other workaround that would let me convert this file without first converting the CSV to a different format? I looked in the Spark code, and one nasty thing that seems to be causing the issue is in CSVInferSchema.scala line 268 (Spark 2.1.0): the parser enforces US formatting rather than, say, relying on the Locale set for the JVM, or allowing this to be configured somehow.
I thought of using a UDT but got nowhere with that: I can't work out how to get it to let me handle the parsing myself (I couldn't really find a good example of using UDTs...)
Any suggestions on a way of achieving this directly, i.e. on parsing step, or will I be forced to do intermediate conversion and only then convert it into parquet?
For anybody else who might be looking for answer - the workaround I went with (in Java) for now is:
JavaRDD<Row> convertedRDD = sqlContext.read()
.format("com.databricks.spark.csv")
.schema(stringOnlySchema)
.option("delimiter",";")
.(other options...)
.load(...)
.javaRDD()
.map ( this::conversionFunction );
sqlContext.createDataFrame(convertedRDD, schemaWithNumbers).write().parquet(...);
The conversion function takes a Row and needs to return a new Row with fields converted to numerical values as appropriate (or, in fact, this could perform any conversion). Rows in Java can be created by RowFactory.create(newFields).
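As an aside, the numeric step such a conversion function performs is easy to sketch outside Spark; for instance in OCaml (parse_german_float is a made-up name, and it assumes no thousands separators such as "1.234,56"):
(* Normalise a German-formatted number by swapping the decimal comma
   for a dot, then parse it as a float. *)
let parse_german_float s =
  float_of_string (String.map (fun c -> if c = ',' then '.' else c) s)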
I'd be happy to hear other suggestions on how to approach this, but for now this works. :)

Importing CSV values into a webpage and including commas using split()

EDIT: I know similar questions like this have been asked on SO but nothing compares to how simple I need this to be :)
I have a classic ASP web page that reads a CSV file and writes it out as HTML. However, the content in the CSV file contains some paragraphs with properly formed sentences.
In short, I need to display those paragraphs with their punctuation, which includes commas, intact.
This is a snippet of what my parsing looks like:
sRows = oInStream.readLine
arrRows = Split(sRows,",")
If arrRows(0) = aspFileName And arrRows(1) = "minCamSys" Then
minCamSys1= arrRows(2)
minCamSys2= arrRows(3)
minCamSys3= arrRows(4)
How can I alter my Split() so that I can display commas without breaking the CSV format?
I would prefer to use double quotes around the data that contains a comma (as is usually the CSV standard when importing to Excel). For example:
Peter,Jeff,"Jim was from Ontario, Canada",Scott
I would like to avoid the use of a library as this is a simple in-house application.
Thank you!
Well folks the answer was right in front of my face. Kind of silly really but for this application, it will suffice.
I swapped out the comma delimiter for a pipe (|). So the new code looks like this:
sRows = oInStream.readLine
arrRows = Split(sRows,"|")
This may not be a great solution but for this simple application it is all that is necessary.

Delimiter for multiple json strings

I'd like to save multiple json strings to a file and separate them by a delimiter, such that it will be easy to read this list in, split on the delimiter and work with each json doc separately.
Serializing using a json array is not an option due to external reasons.
I would like to use a delimiter that is illegal in JSON (e.g. delimiting using a comma would be a bad idea since there are commas within the json strings).
Are there any characters that are not considered legal in JSON serialized strings?
I know it's not exactly what you needed, but you can use this SO answer to write the JSON string to a CSV, then read it on the other side using a good streaming CSV reader such as this one.
NDJSON
Have a look at NDJSON (Newline delimited JSON).
http://ndjson.org/
It seems to me to be exactly how you should do things, though it's not exactly what you asked for. (If you can't flatten your JSON objects into single lines, then it's not for you!) You asked for a delimiter that is not allowed in JSON. A newline is allowed in JSON, but JSON never needs to contain literal newlines, so every document can be written on a single line.
The format is used for log files amongst other things. I discovered it when looking at the Lichess API documentation.
You can start listening in to a broadcast stream of NDJSON data part way through, wait for the next newline character and then start processing objects as and when they arrive.
If you go for NDJSON, you are at least following a standard and I think you'd be hard pressed to find an alternative standard to follow.
Example NDJSON
{"some":"thing"}
{"foo":17,"bar":false,"quux":true}
{"may":{"include":"nested","objects":["and","arrays"]}}
An old question, but hopefully this answer will be useful.
Most JSON readers choke on the ASCII control character U+001E (record separator, also known as information separator two): they report an "unexpected token", because unescaped control characters are not legal inside JSON strings. That makes it usable as a delimiter, provided any occurrence inside the payload is escaped first.

Parsing a large CSV file, dealing with commas and quotes

I need to load in a large CSV file (>1MB) and parse it.
Generally this is quite easy to do by splitting first on line breaks and then on commas.
The problem is that some entries contain strings that include their own commas. When the spreadsheet is converted to CSV, the fields containing commas are wrapped in quotes.
I've written a parser that first escapes all the commas inside these quoted strings, then splits on line breaks and then on commas, and finally unescapes the values again.
This is quite a slow process for such a long file, as I need to iterate through the whole string.
Does anyone know a faster or more optimised method of dealing with this?
Have you had a look at csvlib yet? It is a parser library for ActionScript 3. It claims to be designed to properly handle quoted strings.
Hopefully, you are already enclosing your strings in quotes, especially the ones containing the commas. CSV parsers cannot distinguish a comma that is part of a string from a comma that separates two strings, unless the strings have quotes around them.
Good
"This string, has a comma", "This string doesn't"
Bad
This string, has a comma, this string doesn't
Processing the file in a single pass will reduce the time. This can be achieved with a simple state machine that tracks whether the current character is inside a quoted value, as sketched below.
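A minimal sketch of such a single-pass splitter in OCaml (split_csv_line is a made-up name; it assumes quotes only wrap whole fields, with no escaped quotes or embedded newlines):
(* Walk the line once, toggling an "inside quotes" flag so that commas
   inside quoted fields do not split the field. Quote characters
   themselves are dropped from the output. *)
let split_csv_line line =
  let buf = Buffer.create 16 in
  let fields = ref [] in
  let in_quotes = ref false in
  String.iter
    (fun c ->
      match c with
      | '"' -> in_quotes := not !in_quotes
      | ',' when not !in_quotes ->
          fields := Buffer.contents buf :: !fields;
          Buffer.clear buf
      | _ -> Buffer.add_char buf c)
    line;
  fields := Buffer.contents buf :: !fields;
  List.rev !fields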
1. Add a reference to Microsoft.VisualBasic (yes, it says VisualBasic, but it works from C# just as well; remember that at the end it is all just IL).
2. Use the Microsoft.VisualBasic.FileIO.TextFieldParser class to parse the CSV file.
Here is the sample code:
' Requires a project reference to Microsoft.VisualBasic and
' Imports Microsoft.VisualBasic.FileIO at the top of the file.
Dim parser As TextFieldParser = New TextFieldParser("C:\mar0112.csv")
parser.TextFieldType = FieldType.Delimited
parser.SetDelimiters(",")
parser.HasFieldsEnclosedInQuotes = True ' commas inside quoted fields stay intact
While Not parser.EndOfData
    ' Process one row
    Dim fields() As String = parser.ReadFields()
    For Each field As String In fields
        ' TODO: process each field
    Next
End While
parser.Close()