How to get slice of string in OCaml? - json

So I'm writing a JSON parser in OCaml, and I need to get a slice of a string. More specifically, I need to get the first n characters of a string so I can pattern-match with them.
Here's an example string:
"null, \"field2\": 25}"
So, how could I use just a couple lines of OCaml code to get just the first 4 characters (the null)?
P.S. I've already thought about using something like input.[0..4], but I'm not entirely sure how that works. I'm reasonably new to OCaml and the ML family.

Using the built-in String.sub function should do the job:
let example_string = "null, \"field2\": 25}"
(*val example_string : string = "null, \"field2\": 25}" *)
let first_4 = String.sub example_string 0 4
(*val first_4 : string = "null" *)
I suggest looking at the official documentation:
https://caml.inria.fr/pub/docs/manual-ocaml/libref/String.html
And if you are not doing this to teach yourself, I would strongly suggest using one of the available libraries for this purpose, such as yojson (https://ocaml-community.github.io/yojson/yojson/Yojson/index.html).

Related

Borrowed value does not live long enough while writing an HTML parser

I am very new to Rust and am trying to build an HTML parser.
I first tried to parse the string and put the tags in a HashMap<&str, i32>.
Then I figured out that I have to take care of letter case,
so I added tag.to_lowercase(), which creates a String. From there my brain started to panic.
Below is my code snippet.
fn html_parser<'a>(html:&'a str, mut tags:HashMap<&'a str, i32>) -> HashMap<&'a str, i32>{
let re = Regex::new("<[:alpha:]+?[\\d]*[:space:]*>+").unwrap();
let mut count;
for caps in re.captures_iter(html) {
if !caps.at(0).is_none(){
let tag = &*(caps.at(0).unwrap().trim_matches('<').trim_matches('>').to_lowercase());
count = 1;
if tags.contains_key(tag){
count = *tags.get_mut(tag).unwrap() + 1;
}
tags.insert(tag,count);
}
}
tags
}
which throws this error,
src\main.rs:58:27: 58:97 error: borrowed value does not live long enough
src\main.rs:58 let tag:&'a str = &*(caps.at(0).unwrap().trim_matches('<').trim_matches('>').to_lowercase());
^~~~~~~~~~~~~~~~~~~
src\main.rs:49:90: 80:2 note: reference must be valid for the lifetime 'a as defined on the block at 49:89...
src\main.rs:49 fn html_parser<'a>(html:&'a str, mut tags:HashMap<&'a str, i32>)-> HashMap<&'a str, i32>{
src\main.rs:58:99: 68:6 note: ...but borrowed value is only valid for the block suffix following statement 0 at 58:98
src\main.rs:58 let tag:&'a str = &*(caps.at(0).unwrap().trim_matches('<').trim_matches('>').to_lowercase());
src\main.rs:63
...
error: aborting due to previous error
I read about lifetimes in Rust but still cannot understand this situation.
If anyone has a good HTML tag regex, please recommend it so that I can use it.
To understand your problem it is useful to look at the function signature:
fn html_parser<'a>(html: &'a str, mut tags: HashMap<&'a str, i32>) -> HashMap<&'a str, i32>
From this signature we can see, roughly, that both accepted and returned hash maps may only be keyed by subslices of html. However, in your code you are attempting to insert a string slice completely unrelated (in lifetime sense) to html:
let tag = &*(caps.at(0).unwrap().trim_matches('<').trim_matches('>').to_lowercase());
The first problem here (your particular error is about exactly this problem) is that you're attempting to take a slice out of a temporary String returned by to_lowercase(). This temporary string is only alive during this statement, so when the statement ends, the string is deallocated, and its references would become dangling if this was not prohibited by the compiler. So, the correct way to write this assignment is as follows:
let tag = caps.at(0).unwrap().trim_matches('<').trim_matches('>').to_lowercase();
let tag = &*tag;
(or you can just keep the first tag binding, which is the owned String, and convert it to a slice where it is used)
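A minimal, standalone sketch of that two-step binding, using only the standard library. The names normalize_tag and tag_len are made up for illustration and are not part of the question's code; the point is only that the owned String is bound to a variable before a slice is taken from it.

```rust
/// Hypothetical helper: strips the angle brackets and lowercases the name.
/// Returning an owned `String` means nothing borrows from a temporary.
fn normalize_tag(raw: &str) -> String {
    raw.trim_matches('<').trim_matches('>').to_lowercase()
}

/// Hypothetical helper: binds the owned `String` first, then borrows it.
/// The slice `tag` is valid for as long as `owned` is in scope.
fn tag_len(raw: &str) -> usize {
    let owned = normalize_tag(raw); // lives until the end of this function
    let tag: &str = &owned;         // borrow of a named value, not a temporary
    tag.len()
}
```

Note that tag still cannot outlive owned, which is why it cannot be stored in a map that must live for 'a.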
However, your code is not going to work even after this change. The to_lowercase() method allocates a new String which is unrelated to html in terms of lifetime. Therefore, any slice you take out of it will necessarily have a lifetime shorter than 'a. Hence it is not possible to insert such a slice as a key into the map, because the data it points to may not be valid after this function returns (and in this particular case, it will be invalid).
It is hard to tell what is the best way to fix this problem because it may depend on the overall architecture of your program, but the simplest way would be to create a new HashMap<String, i32> inside the function:
fn html_parser(html:&str, tags: HashMap<&str, i32>) -> HashMap<String, i32>{
let mut result: HashMap<String, i32> = tags.iter().map(|(k, v)| (k.to_string(), *v)).collect(); // k is &&str; to_string() yields an owned String
let re = Regex::new("<[:alpha:]+?[\\d]*[:space:]*>+").unwrap();
for caps in re.captures_iter(html) {
if let Some(cap) = caps.at(0) {
let tag = cap
.trim_matches('<')
.trim_matches('>')
.to_lowercase();
let count = result.get(&tag).cloned().unwrap_or(0) + 1; // get() returns Option<&i32>, so copy the value out first
result.insert(tag, count);
}
}
result
}
I've also changed the code for it to be more idiomatic (if let instead of if something.is_none(), unwrap_or() instead of mutable local variables, etc.). This is a more or less direct translation of your original code.
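As a further sketch, the counting step can also be written with the standard library's HashMap entry API, which avoids the separate lookup and insert. The function name count_tags is hypothetical, and the raw_tags slice stands in for the regex matches so that the example needs no external crate.

```rust
use std::collections::HashMap;

/// Hypothetical helper: counts tag names case-insensitively.
/// `raw_tags` stands in for the regex captures used in the answer.
fn count_tags(raw_tags: &[&str]) -> HashMap<String, i32> {
    let mut counts = HashMap::new();
    for raw in raw_tags {
        // Owned, lowercased key: its lifetime is independent of the input.
        let tag = raw.trim_matches('<').trim_matches('>').to_lowercase();
        // entry() inserts 0 for an unseen key, then the count is bumped in place.
        *counts.entry(tag).or_insert(0) += 1;
    }
    counts
}
```

Because the keys are owned Strings, the returned map does not borrow from the input at all, which is the same design choice as in the answer's HashMap<String, i32>.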
As for parsing HTML with regexes, I just cannot resist providing a link to this answer. Seriously consider using a proper HTML parser instead of relying on regexes.

How to convert to String?

The commands below:
theta = zeros(2, 1);
printf(theta)
give this error: error: printf: format TEMPLATE must be a string
Is there a function to convert theta to a string, or to print the value of theta?
Reading the Octave docs (http://www.network-theory.co.uk/docs/octave3/octave_140.html), this does seem possible?
If you are trying to print to the stdout stream, you can use printf without converting to a string first, as it will do the conversion for you. It works like a string-formatting function in most languages: the first argument is a format string, followed by the values you want to format and insert into that string. For your simple case:
printf('%f', theta)
If you are just trying to print to the console, however, I would rather suggest using sprintf or display. Matlab doesn't have a printf command, and I would always advocate keeping your Octave code directly portable to Matlab when possible.
Use the mat2str function.
For my case: printf(mat2str(theta, 2))
Source: https://www.gnu.org/software/octave/doc/interpreter/Converting-Numerical-Data-to-Strings.html
Use num2str(), e.g.:
str_theta = num2str(theta)
Octave documentation on converting numbers to strings

f# split html by tags

I would like to parse an HTML document and print each of the paragraphs to a log file as an individual entry. So far I have:
let parseTextFile (path) =
let fileText = File.ReadAllText(path)
fileText.Split('<p>') |> Seq.iter (fun m -> logEmail(m))
Unfortunately, string.Split does not do what I want here; it seems to split a string only on a single-character delimiter. How can I split the file on something longer than a single character? It would also be nice to match more than just <p>, because with that alone I will have a </p> at the end of each paragraph. With a regex or some sort of more complex matcher, I could pick out exactly what lies between the <p> tags.
Try using a dedicated library for parsing HTML, for example HtmlAgilityPack.
As wmeyer said, you need to use a different overload of the .Split() method on strings. In fact, the code you posted won't even compile because '<p>' is not a string literal -- you need to use "<p>" instead (single quotes are for character literals).
Here's how to use the correct overload of .Split():
open System.IO
let parseTextFile path =
let fileText = File.ReadAllText path
fileText.Split ([| "<p>"; |], System.StringSplitOptions.RemoveEmptyEntries)
|> Seq.iter logEmail
For a quick test in F# Interactive:
> "First paragraph<p>Second paragraph.<p><p>Third paragraph.<p>"
.Split ([| "<p>"; |], System.StringSplitOptions.RemoveEmptyEntries);;
val it : string [] =
[|"First paragraph"; "Second paragraph."; "Third paragraph."|]
Finally, as @ntr said -- you're much, much better off using a library like the HTML Agility Pack for parsing HTML. Their parsers are very robust and will save you a lot of trouble.

String Parsing using tcl

I am trying to parse a string to return the text between two sets. For example, my string is: "faultstring>Item not valid: The specified Standard SIP1 Profile was not found faultstring>"
I want to write a function that will return the string: Item not valid: The specified Standard SIP1 Profile was not found
I am new to tcl and your help is very much appreciated.
Please let me know.
Thanks.
Assuming there is no faultstring> inside the interesting string, and there might be some uninteresting garbage before and after specified fragment:
set testString "faultstring>Item not valid: The specified Standard SIP1 Profile was not found faultstring>"
if {[regexp {faultstring>(.*)faultstring>} $testString _ extracted]} {
puts "Got it: $extracted"
}
The answer may vary for other assumptions.

How to JSON serialize math vector type in F#?

I'm trying to serialize the "vector" (Microsoft.FSharp.Math) type, and I get this error:
Exception Details: System.Runtime.Serialization.SerializationException: Type 'Microsoft.FSharp.Math.Instances+FloatNumerics@115' with data contract name 'Instances.FloatNumerics_x0040_115:http://schemas.datacontract.org/2004/07/Microsoft.FSharp.Math' is not expected. Add any types not known statically to the list of known types - for example, by using the KnownTypeAttribute attribute or by adding them to the list of known types passed to DataContractSerializer.
I have tried to put KnownType attribute and some other stuff, but nothing helps!
Does anyone know the answer?
This is the code I use:
// [< KnownType( typeof<vector> ) >]
type MyType = vector
let public writeTest =
let aaa = vector [1.1;2.2]
let serializer = new DataContractJsonSerializer( typeof<MyType> )
let writer = new StreamWriter( @"c:\test.txt" )
serializer.WriteObject(writer.BaseStream, aaa)
writer.Close()
I think that the F# vector type doesn't provide all the necessary support for JSON serialization (and it is quite a complex type internally). Perhaps the best thing to do in your case would be to convert it to an array and serialize the array (which will also definitely generate shorter and more efficient JSON).
The conversion is quite straightforward:
let aaa = vector [1.1;2.2]
let arr = Array.ofSeq aaa // Convert vector to array
// BTW: The 'use' keyword behaves like 'using' in C#
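use writer = new StreamWriter( @"c:\test.txt" )
// The serializer needs the type being written; here that is the float array
let serializer = new DataContractJsonSerializer( typeof<float[]> )
serializer.WriteObject(writer.BaseStream, arr) // Serialize the array, not the vector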
To convert the array back to vector, you can use Vector.ofSeq (which is a counterpart to Array.ofSeq used in the example above).
You can also use:
let v = vector [1.;3.];
let vArr = v.InternalValues
to get the internal array of the vector. In this way, you don't need to create a temporary array.
The RowVector and Matrix types also have this property for getting their internal arrays.