Iterating over a string in Vimscript, or parsing a JSON file

So I'm creating a vim script that needs to load and parse a JSON file into a local object graph. I searched and couldn't find any native way to process a JSON file, and I don't want to add any dependencies to the script. So I wrote my own function to parse the JSON string (read from the file), but it's really slow. At the moment, I iterate through each character in the file like so:
let len = strlen(jsonString)
let i = 0
while i < len
  let c = strpart(jsonString, i, 1)
  let i += 1
  " A lot of code to process the file...
  " Note: I've tried shortcutting the process by searching for the closing
  " double-quote when I come across the opening one (also taking the escape
  " character '\' into account). It doesn't help.
endwhile
I've also tried this method:
for c in split(jsonString, '\zs')
  " Do a lot of parsing ...
endfor
For reference, a file with ~29,000 characters takes about 4 seconds to process, which is unacceptable.
Is there a better way to iterate over a string in vim script?
Or better yet, have I missed a native function to parse JSON?
Update:
I asked for no dependencies because I:
Didn't want to deal with them.
Genuinely wanted some ideas for the best way to do this without relying on someone else's work.
Sometimes I just like to do things manually even though the problem has already been solved.
I'm not against plugins or dependencies at all; it's just that I'm curious. Hence the question.
I ended up creating my own function to parse the JSON file. I was creating a script that could parse the package.json file associated with node.js modules. Because of this, I could rely on a fairly consistent format and quit processing whenever I'd retrieved the information I needed. This usually cut out large chunks of the file, since most developers put the largest chunk of the file, their "readme" section, at the end. Because the package.json file is strictly defined, I left the process somewhat fragile: it assumes a root dictionary { } and actively looks for certain entries. You can find the script here: https://github.com/ahayman/vim-nodejs-complete/blob/master/after/ftplugin/javascript.vim#L33.
Of course, this doesn't answer my own question. It's only the solution to my unique problem. I'll wait a few days for new answers and pick the best one before the bounty ends (already set an alarm on my phone).

The simplest solution with the fewest dependencies is just to use the built-in json_decode() function.
let dict = json_decode(jsonString)
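For example, a minimal sketch that reads and decodes a file (assuming Vim 8+, where json_decode() is available; join() reassembles the lines returned by readfile()):
let jsonString = join(readfile('package.json'), "\n")
let dict = json_decode(jsonString)
echo get(dict, 'name', 'no name entry')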

Even though Vim's origins date back a long way, its internal string()/eval() representation happens to be so close to JSON that it is likely to work unless you need special characters.
You can look up the implementation here, which even supports true/false/null if you want:
https://github.com/MarcWeber/vim-addon-json-encoding
Better to use that library (vim-addon-manager makes installing dependencies easy).
Whether this is good enough depends on your data.
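A minimal sketch of that idea (no escaping or validation, so only use it on data you trust; it shares the caveats discussed in the next answer):
let jsonString = '{"name": "demo", "version": "0.1.0"}'
let dict = eval(jsonString)
echo dict.name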
Now Benjamin Klein posted your question to vim_use, which is why I'm replying.
The best and fastest replies happen if you subscribe to the Vim mailing list: go to vim.sf.net and follow the community link.
You cannot expect the Vim community to scrape Stack Overflow.
I've added the keywords "json" and "parsing" to that little library so that it can be found more easily.
If this solution does not work for you, you can try the many :h if_* bindings, or write an external script which extracts the information you're looking for or turns the JSON into Vim's dictionary representation (readable by eval()), escaping the special characters you care about correctly.

If you seek a completely correct solution, omitting dependencies is one of the worst things you can do. The eval() variant mentioned by @MarcWeber is one of the fastest, but it has its disadvantages:
Using the solution for securing eval() that I mentioned in a comment makes it no longer the fastest. In fact, it makes eval() slower by more than an order of magnitude (0.02s vs 0.53s in my test).
It does not respect surrogate pairs.
It cannot be used to verify that you have correct JSON: it accepts some strings (e.g. "\<C-o>") that are not JSON strings, and it allows trailing commas.
It fails to give normal error messages. It fails badly if you use the vam#VerifyIsJSON function I mentioned in point 1.
It fails to load floating-point values like 1e10 (Vim requires numbers to look like 1.0e10, but JSON allows numbers like 1e10: note the "and/or" in the first paragraph of the grammar).
All of the above statements (except for the first) also apply to vim-addon-json-encoding mentioned by @MarcWeber, because it uses eval(). There are some other possibilities:
Fastest and the most correct is using python: pyeval('json.loads(vim.eval("varname"))'). It is not faster than eval(), but it is the fastest among the other possibilities (0.04s in my test: approximately two times slower than eval()).
Note that I use pyeval() here. If you want a solution for a Vim version that lacks this functionality, it will no longer be one of the fastest.
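As an illustration, a wrapper might look like this (a sketch assuming Vim compiled with +python; the function name ParseJson is made up):
function! ParseJson(jsonString)
  python import json, vim
  return pyeval('json.loads(vim.eval("a:jsonString"))')
endfunction

echo ParseJson('{"a": 1, "b": [1, 2]}')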
Use my json.vim plugin. It has the advantage of slightly better error reporting compared to a failed vam#VerifyIsJSON (though slightly worse than eval()'s), and it correctly loads floating-point numbers. It can be used for verification of strings (it does not accept "\<C-a>"), but it loads lists with trailing commas just fine. It does not support surrogate pairs. It is also very slow: in the test I used (a 279702-character-long string), it takes 11.59s to load. json.vim tries to use python if possible, though.
For the best error reporting you can take yaml.vim and purge the YAML support out of it, leaving only JSON (I once did the same thing for pyyaml, though in python: see the markedjson library used in powerline: it is pyyaml minus the YAML stuff, plus classes with marks). But this variant is even slower than json.vim and should only be used if the main thing you need is error reporting: 207 seconds to load the same 279702-character-long string.
Note that the only variant mentioned that satisfies both requirements, "no dependencies" and "no python", is eval(). If you are not fine with its disadvantages, you have to throw away one or both of these requirements, or copy-paste code. And if you take speed into account, only two candidates are left: eval() and python. If you want to parse JSON fast you really must use C, and only these solutions spend most of their time in functions written in C.
Most other interpreters (ruby/perl/TCL) do not have a pyeval() equivalent, so they will be slower even if their JSON implementation is written in C. Some others (lua/racket (mzscheme)) do have a pyeval() equivalent, but e.g. luaeval('{}') is zero, meaning that you would have to add an additional step explicitly and recursively converting objects into Vim dictionaries and lists (e.g. luaeval('vim.dict({})')), which will impact performance. I cannot say anything about mzeval(), but I have never heard of anybody actually using racket (mzscheme) with Vim.


Why is my %h is List = 1,2; a valid assignment?

While finalizing my upcoming Raku Advent Calendar post on sigils, I decided to double-check my understanding of the type constraints that sigils create. The docs describe sigil type constraints with a table along these lines:
Sigil    Type constraint
$        Mu (no type constraint)
@        Positional
%        Associative
&        Callable
Based on this table (and my general understanding of how sigils and containers work), I strongly expected this code
my %percent-sigil is List = 1,2;
my @at-sigil is Map = :k<v>;
to throw an error.
Specifically, I expected that is List would attempt to bind the %-sigiled variable to a List, and that this would throw an X::TypeCheck::Binding error – the same error that my %h := 1,2 throws.
But it didn't error. The first line created a List that seemed perfectly ordinary in every way, other than the sigil on its variable. And the second created a seemingly normal Map. Neither of them secretly had Scalar intermediaries, at least as far as I could tell with VAR and similar introspection.
I took a very quick look at the World.nqp source code, and it seems at least plausible that discarding the % type constraint with is List is intended behavior.
So, is this behavior correct/intended? If so, why? And how does that fit in with the type constraints and other guarantees that sigils typically provide?
(I have to admit, seeing an %-sigiled variable that doesn't support Associative indexing kind of shocked me…)
I think this is a grey area, somewhere between DIHWIDT (Doctor, It Hurts When I Do This) and an oversight in implementation.
Thing is, you can create your own class and use that in the is trait. Basically, that overrides the type with which the object will be created, replacing the default Hash (for the % sigil) or Array (for the @ sigil). As long as you provide the interface methods, it (currently) works. For example:
class Foo {
    method AT-KEY($) { 42 }
}
my %h is Foo;
say %h<a>; # 42
However, if you want to pass such an object as an argument to a sub with a % sigil in the signature, it will fail because the class does not do the Associative role:
sub bar(%) { 666 }
say bar(%h);
===SORRY!=== Error while compiling -e
Calling bar(Foo) will never work with declared signature (%)
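Presumably, declaring that the class does the Associative role would satisfy that check; a minimal sketch (untested against any particular Rakudo version):
class Foo does Associative {
    method AT-KEY($) { 42 }
}
my %h is Foo;
sub bar(%) { 666 }
say bar(%h); # 666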
I'm not sure why the test for Associative (for the % sigil) and Positional (for the @ sigil) is not enforced at compile time with the is trait. I would assume it was an oversight, maybe something to be fixed in 6.e.
Quoting the Parameters and arguments section of the S06 specification/speculation document about the related issue of binding arguments to routine parameters:
Array and hash parameters are simply bound "as is". (Conjectural: future versions ... may do static analysis and forbid assignments to array and hash parameters that can be caught by it. This will, however, only happen with the appropriate use declaration to opt in to that language version.)
Sure enough, the Rakudo compiler implemented some rudimentary static analysis (in its AOT compilation optimization pass) that normally (but see footnote 3 in this SO answer) insists on binding @ routine parameters to values that do the Positional role and % ones to Associatives.
I think this has been the case since the first official Raku-supporting release of Rakudo, in 2016, but regardless, I'm pretty sure the "appropriate use declaration" is any language version declaration, including none. If your/our druthers are static typing for the win for @ and % sigils, and I think they are, then that's presumably very appropriate!
Another source is the IRC logs. A quick search got me nothing.
Hmm. Let's check the blame for the above verbiage so I can find when it was last updated and maybe spot contemporaneous IRC discussion. Oooh.
That is an extraordinary read.
"oversight" isn't the right word.
I don't have time tonight to search the IRC logs to see what led up to that commit, but I daresay it's interesting. The previous text was talking about a PL design I really liked the sound of in terms of immutability, such that code could become increasingly immutable by simply swapping out one kind of scalar container for another. Very nice! But reality is important, and Jonathan switched the verbiage to the implementation reality. The switch toward static typing certainty is welcome, but has it seriously harmed the performance and immutability options? I don't know. Time for me to go to sleep and head off for seasonal family visits. Happy holidays...

Having Multiple Commands for Calling a Specific Programming Language: To Provide a Delimiter-less Option or Not?

After re-reading the off-/on-topic lists, I'm still not certain whether this question is best posted to this site, so apologies in advance if it is not.
Overview:
I am working on a project that mixes several programming languages and we are trying to determine important considerations for the command used to call one in particular.
For definiteness, I will list the specific languages; however, I think the principles ought to be general, so familiarity with these specific languages is not really essential.
Specific Context
Specifically, we are using Maxima, KaTeX, Markdown, and HTML. While building the prototype, we have used the following (I believe standard) conventions:
KaTeX delimited by $ $ or $$ $$;
HTML delimited by < > </ > pairs;
Markdown works anywhere in the body, except within KaTeX or Maxima environments;
The only non-standard convention we used during this design phase was to call on Maxima using \comp{<Maxima commands>}. This command works within all the other environments (which is desired).
Now that we are ready to start using the platform, it has become apparent that this temporary command for calling Maxima is cumbersome for our users. The vast majority of use cases involve simply calling a single variable or function, e.g.
As such, we have $\eval{function-name()}(\eval{variable-name})$
as opposed to actually using Maxima for computation, e.g.
Here, it is clear that $\eval{a} + \eval{b} = \eval{a+b}$
(where \eval{a+b} would return the actual sum, as calculated by Maxima).
As such, our users would prefer a delimiter-less command option for invoking a single variable or function, e.g. \#<variable-name-in-Maxima> and \#<function-name>(<argument>) (where # is some reserved character not used in the other languages), while also having a delimited alternative for the (much less frequent) cases where they actually want to use Maxima for computation; perhaps something like \#{a+b}.
However, we have a general sense that this is not a best practice, even though we can't foresee any specific issue.
"Research" / Comparisons:
Indeed, there is precedent for delimiter-less expressions for single arguments, like x^2 (on any calculator) or Knuth's a \over b in TeX (which persists in LaTeX, with \frac12 being parsed as \frac{1}{2}).
IIRC, Knuth's point was that this delimiter-less notation was more semantic (and so, in his view, preferable), and because delimiters can be added, ambiguity can be avoided whenever the need arises: e.g. x^{22}, {a+b}\over{c+d} and \frac{12}{3}.
The Question, Proper:
Can anyone point to or explain actual shortcomings / risks associated with a dual solution like:
\#<var>, \#<function>(<arg>) and,
\#[<extended expression>],
(where # is a reserved (& escapable) character), for calling one language amongst others, as opposed to only using a delimited command?
Any alternative suggestions for how to achieve the ease-of-use and more semantic code enabled by the above solution, while keeping the code unambiguous would be very much welcome and appreciated.

Univocity parser - false delimiter autodetection when too little information given

I set the parser to detect the delimiters automatically:
CsvParserSettings settings = new CsvParserSettings();
settings.detectFormatAutomatically();
I have only a single record: 47W2E2qxPs, http://usda.gov/mattis.html
What I got:
code: 47W2E2qxPshttp url: //usda.gov/mattis.html
I expected the delimiter to be , and not :,
so my expected result would be 47W2E2qxPs and http://usda.gov/mattis.html.
Could I fix this in an elegant way?
Author of the library here. The detection process is a heuristic that uses statistics collected from multiple rows of part of your input, so it depends a lot on the size of the input.
Its purpose is to handle situations where you can't easily determine the CSV format, such as when users upload random files to you. Don't use the detection process if you already know what the correct delimiter is.
In your case, one row of data is absolutely not enough to reliably detect the delimiter, especially if there are multiple candidate symbols present. There is little you can do about it except test what the detected delimiter was before continuing:
parser.beginParsing(new File("/path/to/your.csv"));
CsvFormat format = parser.getDetectedFormat();
//check if the format is sane.
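A sketch of that sanity check with a fallback re-parse (assuming univocity-parsers 2.x; the expectation of ',' is specific to this example):
import java.io.File;
import com.univocity.parsers.csv.CsvFormat;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

CsvParserSettings settings = new CsvParserSettings();
settings.detectFormatAutomatically();
CsvParser parser = new CsvParser(settings);

parser.beginParsing(new File("/path/to/your.csv"));
CsvFormat format = parser.getDetectedFormat();
if (format.getDelimiter() != ',') {
    // detection went wrong on this input; re-parse with an explicit delimiter
    parser.stopParsing();
    settings = new CsvParserSettings();
    settings.getFormat().setDelimiter(',');
    parser = new CsvParser(settings);
    parser.beginParsing(new File("/path/to/your.csv"));
}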
The next version (2.6.0) will include more options to assist the heuristic, such as providing a set of allowed characters to be used as delimiters, which will probably help in your case.

TCL man page: it is better to place comment section way ahead

I know I am really picky here, but I'd like to throw this out in case I am off in my interpretation of the Tcl man page. Actually, I wish I were wrong here, as you'll see from the story below.
So for every new Tcl developer, we recommend reading the famous "11 rules" (now 12 rules).
Yesterday I was asked this question: why does the following script fail?
# puts "hello
world!"
Of course it fails, I said: the first line is taken as a comment, which leaves world!" as a command.
But, the newbie said, the man page indicates that the script is parsed in a certain order:
As rule #2 (Evaluation) states, the command is parsed into words first.
As rule #4 (Double quotes) states, a newline is taken as-is when parsing double quotes. This makes hello and world! into one word, with a newline in between.
Rule #10 (Comments) does state that everything up to the next newline is ignored, but after the above processing, that newline should be the second newline, the one after world!.
I see he had a point.
It would make more sense to move the comments section way ahead in the man page, maybe to the second position. With this order change, it would indicate that comment recognition precedes the word-tokenizing process.
What do you think?
Again, I have no intention of asking for a change to the man page; I just want to make sure I'm not missing anything in interpreting the bible.
[UPDATE]
To the people suggesting this question be closed as non-technical: it is the same as if my colleague came here asking why that script fails even though his reading of the Tcl man page indicates it is a good script.
Again, I am not asking to change the man page.
Let me re-phrase my question: when you are asked this same question, what flaw do you see in his reasoning?
[UPDATE2]
Thanks Donal. I think this is what I learnt: the Tcl parser goes one character at a time; there is no look-ahead.
This is another example:
puts [#haha]
Such a script fails in tclsh for the same reason: the Tcl parser does not break the script down first and parse only the string embedded inside the matching brackets; instead, it recognizes # as the start of a comment (since it appears where a command would start) and ignores everything after it, including the closing bracket.
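For comparison, braces suppress that behavior, because # is only special where a command is expected to start:
puts {#haha}   ;# prints #haha: the braced word is never command-parsed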
The rules in the Tcl(n) manual page describe pretty precisely the parser that Tcl uses. Requests to change it substantively are usually denied as they tend to have far-reaching consequences and interact with each other trickily. Verifying that a reordering of the rules is not substantive is a non-trivial task, as they correspond to quite a bit of code (our parser and a chunk of our bytecode compiler).
Adding non-normative sections (e.g., EXAMPLES) is easier.
Update based on your updated question
The problem with the reasoning is that the rules are a whole, not really a layered set of parts. They do interact with each other. (The one that usually trips people up is the interaction between the brace rule and the comment rule when inside a braced string such as a procedure body.) Comments really are true comments, and extend up to the end of the line (allowing for backslash-newline sequences) but not beyond, but they only start at places where commands start, not at other places with a # character, and that's the genuinely tricky bit.
Unfortunately, the way that the Tcl parser works is a bit different to the way that programmers think, but most of the time it's pretty good at pretending to work in a “reasonable fashion”. The tricky edge cases don't actually come up too often other than when dealing with the brace-comment interaction mentioned above. The other cases which I hit tend to be either with a switch (resolvable by just putting the comment in the arm) or with long literal lists of things where I want to comment some sections of the list; in that latter case, I actually post-process the string before using it as a list.
set exampleList {
    a b c
    d e f
    # Not really a comment but I want to use it like one!
    g h i
    j k l
}
# Convert “comment” lines to empty lines
regsub -all -line "^\\s*#.*$" $exampleList "" exampleList
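(For the switch case mentioned above, the fix is simply to keep the comment inside an arm's body; a comment placed between the patterns would itself be parsed as a pattern. A minimal illustration:)
set x a
switch -- $x {
    a {
        # comment about the "a" case goes here, inside the arm
        puts "got a"
    }
    default {
        puts "something else"
    }
}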
The general advantage of Tcl's rules is that it is actually pretty easy to embed other languages within Tcl, precisely because Tcl only treats # (and other characters) as special in well-defined contexts. As long as you can have the embedded language be one that uses balanced braces — and that's almost all of them in practice — then embedding it is utterly trivial. The other cases have to use backslashes and/or double quotes and are pretty ugly, but are also a minuscule fraction of all the embedding cases.
Your colleague's problem is that he's looking at the whole script in one go, whereas the Tcl parser handles one character at a time and doesn't do meaningful amounts of lookahead. It's just some dumb code.

Performance comparison: one argument or a list of arguments?

I am defining a new Tcl command whose implementation is in C++. The command queries a data stream, and its syntax is something like this:
mycmd <arg1> <arg2> ...
The idea is that this command takes a list of arguments and returns a list holding the corresponding data for each argument.
My colleague commented that it is best just to use a single argument and, when multiple values are needed, call the command multiple times.
There are some other points of discussion, but the one thing we cannot agree on is performance.
I think my version, taking a list of arguments, should be quicker: when we want multiple values, there is only a one-time cost of going through the Tcl interpreter.
His reasoning is new to me:
the function implementation is cached;
accessing a Tcl function is quicker than accessing Tcl data.
Is this reasoning sound?
If you use Tcl_EvalObjv to invoke the command, you won't go through the Tcl interpreter. The cost will be one hash-table lookup (or less, if you reuse the Tcl_Obj* containing the command name) and then you'll be in the implementation of the command. Otherwise, constructing a list Tcl_Obj* (e.g., with Tcl_NewListObj) and then calling Tcl_EvalObj is nearly as cheap, as that's a special case because the list construction code is guaranteed to produce lists that are also substitution-free commands.
Building a normal string and passing that through Tcl_Eval (or Tcl_EvalObj) is significantly slower, as that has to be parsed. (OTOH, passing the same Tcl_Obj* through Tcl_EvalObj multiple times in a row will be faster as it will be compiled internally to bytecode.)
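For concreteness, a minimal sketch of the Tcl_EvalObjv route (the wrapper name CallMycmd is made up; error handling is abbreviated):
#include <tcl.h>

/* Invoke "mycmd arg1 arg2" directly: no parsing, no substitution. */
int CallMycmd(Tcl_Interp *interp, const char *arg1, const char *arg2) {
    Tcl_Obj *objv[3];
    int i, code;
    objv[0] = Tcl_NewStringObj("mycmd", -1);
    objv[1] = Tcl_NewStringObj(arg1, -1);
    objv[2] = Tcl_NewStringObj(arg2, -1);
    for (i = 0; i < 3; i++) Tcl_IncrRefCount(objv[i]);
    code = Tcl_EvalObjv(interp, 3, objv, 0);
    for (i = 0; i < 3; i++) Tcl_DecrRefCount(objv[i]);
    return code;
}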
Accessing into values (i.e., into Tcl_Obj* references) is pretty fast, provided the internal representation of those values matches the type that the access function requires. If there's a mismatch, an internal type conversion function may be called and they're often relatively expensive. To understand internal representations, here's a few for you to think about:
string — array of unicode characters
integer — a C long (except when you spill over into arbitrary precision work)
list — array of Tcl_Obj* references
dict — hash table that maps Tcl_Obj* to Tcl_Obj*
script — bytecoded version
command — pointer to the implementation function
OK, those aren't the exact types (there's often other bookkeeping data too) but they're what you should think of as the model.
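A small Tcl-level illustration of those conversion costs (commonly called "shimmering"):
set x 12345      ;# starts out with a string representation
incr x           ;# converts the internal representation to integer
llength $x       ;# converts it again, this time to a list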
As to “which is fastest”, the only sane way to answer the question is to try it and see which is fastest for real: the answer will depend on too many factors for anyone without the actual code to predict it. If you're calling from Tcl, the time command is perfect for this sort of performance analysis work (it is what it is designed for). If you're calling from C or C++, use that language's performance measurement idioms (which I don't know, but would search Stack Overflow for).
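For instance, a sketch of how that comparison might look from Tcl, assuming mycmd is already registered (10000 iterations each):
puts [time { mycmd a b c } 10000]
puts [time { mycmd a; mycmd b; mycmd c } 10000]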
Myself? I advise writing the API to be as clear and clean as possible. Describe the actual operations, and don't distort everything to try to squeeze an extra 0.01% of performance out.