Sgml returns some warnings - html

I use Prolog's sgml library to extract information from a web page. I load the whole page with this call:
load_structure('file.html', List, [dialect(sgml), shorttag(false), max_errors(-1)])
The system loads the page, but I get some warnings, for instance:
WARNING:SGML2PL(sgml): inserted omitted end-tag for "img"
WARNING:SGML2PL(sgml): inserted omitted end-tag for "br"
WARNING:SGML2PL(sgml): entity "amp" does not exist
How can I eliminate these warnings?

I use this syntax
get_html_file(FileOrStream, P) :-
    dtd(html, DTD),
    load_structure(FileOrStream, [P],
                   [ dtd(DTD),
                     dialect(sgml),
                     shorttag(false),
                     syntax_errors(quiet),
                     max_errors(-1)
                   ]).
The option syntax_errors(quiet) should do it.
I recall having a hard time parsing old pages with errors.
Error handling can be complicated; a more tolerant tool such as TagSoup could help in getting the work done...


Regex getting the tags from an <a href= ...> </a> and the likes

I've tried the answers I found on Stack Overflow, but none of them work on https://regexr.com
I have an .opml file with a large number of podcasts and descriptions, in the following format:
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
Which regex can I use to get just the title and the link?
Software Engineering Daily
http://softwareengineeringdaily.com/feed/podcast/
Brief
There are many ways to go about this. The best way is likely using an XML parser. I would definitely read this post that discusses use of regex, especially with XML.
As you can see there are many answers to your question. It also depends on which language you are using since regex engines differ. Some accept backreferences, whilst others do not. I'll post multiple methods below that work in different circumstances/for different regex flavours. You can probably piece together from the multiple regex methods below which parts work best for you.
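As a rough illustration of the XML parser route, here is a minimal sketch in Python using the standard xml.etree.ElementTree module. The choice of Python, the <opml>/<body> wrapper around the sample line, and the helper name extract_feeds are assumptions made for the example; the question does not say which language is in use.

```python
import xml.etree.ElementTree as ET

# Sample OPML fragment. The <opml>/<body> wrapper is assumed for illustration;
# only the <outline> line comes from the question.
OPML = """<opml version="1.0">
  <body>
    <outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
  </body>
</opml>"""


def extract_feeds(opml_text):
    """Return (title, feed URL) pairs for every <outline> that has an xmlUrl."""
    root = ET.fromstring(opml_text)
    feeds = []
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")
        if url is not None:
            feeds.append((outline.get("text"), url))
    return feeds


for title, url in extract_feeds(OPML):
    print(title)
    print(url)
```

The parser deals with quoting, whitespace around = and attribute order, which is what the regex methods below have to handle one case at a time.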
Code
Method 1
This method works in almost any regex flavour (at least the normal ones).
This method only checks against the attribute value opening and closing marks of " and doesn't include the possibility for whitespace before or after the = symbol. This is the simplest solution to get the values you want.
See regex in use here
\b(text|xmlUrl)="[^"]*"
Similarly, the following methods add more value to the above expression
\b(text|xmlUrl)\s*=\s*"[^"]*" Allows whitespace around =
\b(text|xmlUrl)=(?:"[^"]*"|'[^']*') Allows for ' to be used as attribute value delimiter
As another alternative (following the comments below my answer), if you wanted to grab every attribute except specific ones, you can use the following. Note that I use \w, which should cover most attributes, but you can just replace this with whatever valid characters you want. \S can be used to specify any non-whitespace characters or a set such as [\w-] may be used to specify any word or hyphen character. The negation of the specific attributes occurs with (?!text|xmlUrl), which says don't match those characters. Also, note that the word boundary \b at the beginning ensures that we're matching the full attribute name of text and not the possibility of other attributes with the same termination such as subtext.
\b((?!text|xmlUrl)\w+)="[^"]*"
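To see what the negated version picks up, here is a small sketch using Python's re module (Python is my assumption; the lookahead syntax is the same in most flavours). The names OTHER_ATTRS, STRICT and other_attribute_names are made up for the example.

```python
import re

LINE = ('<outline text="Software Engineering Daily" type="rss" '
        'xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" '
        'htmlUrl="http://softwareengineeringdaily.com" />')

# The expression from above: any attribute whose name does not start with text or xmlUrl.
OTHER_ATTRS = re.compile(r'\b((?!text|xmlUrl)\w+)="[^"]*"')

# Variant that only rejects the exact names, by requiring a word boundary
# right after them inside the lookahead.
STRICT = re.compile(r'\b((?!(?:text|xmlUrl)\b)\w+)="[^"]*"')


def other_attribute_names(line, pattern=OTHER_ATTRS):
    """Return the attribute names matched by the given pattern."""
    return pattern.findall(line)


print(other_attribute_names(LINE))
```

One thing worth knowing: the lookahead as written also rejects names that merely start with text, such as a hypothetical textual attribute. If that matters, the STRICT variant keeps those names and excludes only text and xmlUrl themselves.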
Method 2
This method only works with regex flavours that allow backreferences. According to this article about numbered backreferences from Regular-Expressions.info, JGsoft applications, Delphi, Perl, Python, Ruby, PHP, R, Boost, and Tcl support single-digit backreferences, and double-digit backreferences are supported by JGsoft applications, Delphi, Python, and Boost.
This method uses a backreference to ensure the same closing mark is used at the start and end of the attribute's value and also includes the possibility of whitespace surrounding the = symbol. This doesn't allow the possibility for attributes with no delimiter specified (using xmlUrl=http://softwareengineeringdaily.com/feed/podcast/ may also be valid).
See regex in use here
\b(text|xmlUrl)\s*=\s*(["'])(.*?)\2
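For anyone who wants to try the backreference version outside an online tester, this is a minimal sketch with Python's re module (Python being one of the flavours listed above). The function name title_and_link and the second sample line are invented for illustration.

```python
import re

# Group 1: attribute name, group 2: opening quote, group 3: value.
# \2 forces the closing quote to be the same character as the opening one.
ATTR = re.compile(r'''\b(text|xmlUrl)\s*=\s*(["'])(.*?)\2''')


def title_and_link(line):
    """Map each matched attribute name to its value."""
    return {m.group(1): m.group(3) for m in ATTR.finditer(line)}


LINE = ('<outline text="Software Engineering Daily" type="rss" '
        'xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" '
        'htmlUrl="http://softwareengineeringdaily.com" />')

print(title_and_link(LINE))
```

Because the closing mark has to equal the opening one, a value such as 'A "quoted" show' keeps its inner double quotes.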
Method 3
This method is the same as Method 2 but also allows attributes with no delimiters (note that delimiters are now considered to be space characters, thus, it will only match until the next space).
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(.*?)\2|(\S*))
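With this expression the value ends up in group 3 when it is quoted and in group 4 when it is not. As far as I know, Python's standard re module supports neither branch reset groups nor duplicate group names, so a sketch in Python has to pick whichever group participated. The helper name attribute_values and the unquoted sample URL are assumptions for the example.

```python
import re

# Group 3 holds a quoted value, group 4 an unquoted one (up to the next whitespace).
ATTR = re.compile(r'''\b(text|xmlUrl)\s*=\s*(?:(["'])(.*?)\2|(\S*))''')


def attribute_values(line):
    """Map attribute names to values, whichever of the two groups matched."""
    result = {}
    for m in ATTR.finditer(line):
        quoted, bare = m.group(3), m.group(4)
        result[m.group(1)] = quoted if quoted is not None else bare
    return result


print(attribute_values('<outline text="Software Engineering Daily" xmlUrl=http://example.com/feed />'))
```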
Method 4
While Method 3 works, some people might complain that the attribute value might end up in either of two groups. This can be fixed by either of the following methods.
Method 4.A
Branch reset groups are only possible in a few languages, notably JGsoft V2, PCRE 7.2+, PHP, Delphi, R (with PCRE enabled), Boost 1.42+ according to Regular-Expressions.info
This also shows the method you would use if backreferences aren't possible and you wanted to match multiple delimiters ("([^"]*)"|'([^']*)')
See regex in use here
\b(text|xmlUrl)\s*=\s*(?|"([^"]*)"|'([^']*)'|(\S*))
Method 4.B
Duplicate subpatterns are not often supported. See this Regular-Expressions.info article for more information
This method uses the J regex flag, which allows duplicate subpattern names ((?<v>) is in there twice)
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(?<v>.*?)\2|(?<v>\S*))
Results
Input
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
Output
Each line below represents a different group. New matches are separated by two lines.
text
Software Engineering Daily
xmlUrl
http://softwareengineeringdaily.com/feed/podcast/
Explanation
I'll explain the different parts of the regexes used in the Code section so that you understand the usage of each of these parts. This is more of a reference to the methods above.
"[^"]*" This is the fastest method I know of to grab anything between two " symbols. Note that it does not account for backslash-escaped quotes; it will match any non-" characters between two ". Whilst "(.*?)" can also be used, it's slightly slower
(["'])(.*?)\2 is basically shorthand for "(.*?)"|'(.*?)'. You can use any of the following methods to get the same result:
(?:"(.*?)"|'(.*?)')
(?:"([^"]*)"|'([^']*)') <-- slightly faster than line above
(?|) This is a branch reset group. When you place groups inside it like (?|(x)|(y)) it returns the same group index for both matches. This means that if x is captured, it'll get group index of 1, and if y is captured, it'll also get a group index of 1.
For simple HTML strings you might get along with
Url=(['"])(.+?)\1
Here, take group $2, see a demo on regex101.com.
Obligatory: consider using a parser instead (see here).

Matching nested constructs in TextMate / Sublime Text / Atom language grammars

While writing a grammar for Github for syntax highlighting programs written in the Racket language, I have stumbled upon a problem.
In Racket #| starts a multiline comment and |# ends it.
The problem is that multiline comments can be nested:
#| a comment #| still a comment |# even
more comment |#
Here is my non-working attempt:
repository:
  multilinecomment:
    begin: \#\|
    end: \|\#
    name: comment
    contentName: comment
    patterns:
    - include: "#multilinecomment"
      name: comment
    - match: ([^\|]|\|(?=[^#]))*
      name: comment
The intent of the match patterns is:
"#multilinecomment"
A multiline comment can contain another multiline comment.
([^\|]|\|(?=[^#]))*
The meaning of the subexpressions:
[^\|] any character other than `|`
\|(?=[^#]) an `|` followed by a non-`#`
The entire expression thus matches a string not containing |#
Update:
Got an answer from Allan Odgaard on the TextMate mailing list:
http://textmate.1073791.n5.nabble.com/TextMate-grammars-and-nested-multiline-comments-td28743.html
So I've tested a bunch of languages in Sublime that have multiline comments (C/C++, Java, HTML, PHP, JavaScript), and none of the language syntaxes support multiline comments embedded in multiline comments - the syntax highlighting for the comment scope ends with the first "comment close" marker, not with symmetric markers. Now, this isn't to say that it's impossible, because the BracketHighlighter plugin works great for matching symmetric tags, brackets, and other markers. However, it's written in Python, and uses custom logic for its matching algorithms, something that may not be available in the Oniguruma engine that powers Sublime's syntax highlighter, and apparently Github's as well.
Basically, from your description of the problem, you need a code parser to ensure that nested comments are legal, something you can't do with just a syntax highlighting definition. If you're writing this just for Sublime, a custom plugin could take care of that, but I don't know enough about Github's Linguist syntax highlighting system to say if you're allowed to do that. I'm not a regex master yet, but it seems to me that it would be rather difficult to achieve this purely by regex, as you'd need to somehow keep track of an arbitrary number of internal symmetric "open" and "close" markers before finding (and identifying!) the final one.
Sorry I couldn't provide a definitive answer other than I'm not sure this is possible, but that's the best I can come up with without knowing more about Sublime's and Github's internals, something that (at least in Sublime's case) won't happen unless it's open-sourced. Good luck!
Old post, and I don't have the reputation for a comment, but it is emphatically NOT possible to detect arbitrarily nested comments using purely regular expressions. Intuitively, this is because all regular expressions can be transformed into a finite state machine, and keeping track of nesting depth requires a (theoretically) infinite amount of state (the number of states needs to be equal to at least the different possible nesting depths, which here is infinite).
In practice this number grows very slowly, so if you don't want to go to too much trouble you could probably write something that allows nesting up to a reasonable depth. Otherwise you'll probably need a separate phase that parses through and finds the comments to tell the syntax highlighter to ignore them.
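To make the "separate phase" idea concrete, here is a minimal sketch in Python of a scanner that keeps a depth counter, which is exactly the piece of state a plain regular expression lacks. The function name comment_spans is invented, and the sketch deliberately ignores string literals and other Racket syntax in which #| could appear without starting a comment.

```python
def comment_spans(source):
    """Return (start, end) index pairs of outermost #| ... |# comments.

    Nested comments are folded into the span of the comment that contains them.
    """
    spans = []
    depth = 0       # current nesting depth
    start = None    # index where the outermost open comment began
    i = 0
    while i < len(source):
        pair = source[i:i + 2]
        if pair == "#|":
            if depth == 0:
                start = i
            depth += 1
            i += 2
        elif pair == "|#" and depth > 0:
            depth -= 1
            i += 2
            if depth == 0:
                spans.append((start, i))
        else:
            i += 1
    return spans


text = "(a) #| a comment #| still a comment |# even\nmore comment |# (b)"
for begin, end in comment_spans(text):
    print(repr(text[begin:end]))
```

The counter can grow without bound, so there is no fixed nesting limit of the kind a finite state machine would impose.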
You had the correct idea, but it looks like your second pattern also matches the "begin nested comment" sequence #|, which never gives your recursive #multilinecomment pattern a chance to kick in.
All you have to do is replace your second pattern with something similar to
(#(?=[^|])|\|(?=[^#])|[^|#])+
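The difference between the two patterns can be checked quickly. The sketch below uses Python's re module purely as a stand-in; the grammar itself runs on Oniguruma, but as far as I can tell these particular constructs behave the same way in both. The helper name consumed is made up.

```python
import re

# Pattern from the question: stops only before the closing |#.
ORIGINAL = re.compile(r'([^\|]|\|(?=[^#]))*')
# Suggested replacement: also stops before an opening #|.
SUGGESTED = re.compile(r'(#(?=[^|])|\|(?=[^#])|[^|#])+')

TEXT = "a comment #| still a comment |# even"


def consumed(pattern, text):
    """Return the prefix of text that the pattern swallows from position 0."""
    m = pattern.match(text)
    return m.group(0) if m else ""


print(repr(consumed(ORIGINAL, TEXT)))
print(repr(consumed(SUGGESTED, TEXT)))
```

The original pattern runs straight through the inner #|, while the suggested one stops in front of it, which leaves the begin rule of the included #multilinecomment free to match there.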
Take the last match out. You do not need it. It's redundant with what TextMate does naturally, which is to match all additional text into the comment scope until the end marker comes along or the entire pattern recurses into itself.

Iterating over a string in Vimscript or Parse a JSON file

So I'm creating a vim script that needs to load and parse a JSON file into a local object graph. I searched and I couldn't find any native way to process a JSON file, and I don't want to add any dependencies to the script. So I wrote my own function to parse the JSON string (gotten from the file), but it's really slow. At the moment, I iterate through each character in the file like so:
let len = strlen(jsonString) - 1
let i = 0
while i < len
  let c = strpart(jsonString, i, 1)
  let i += 1
  " A lot of code to process file....
  " Note: I've tried short cutting the process by searching for enclosing double-quotes when I come across the initial double quotes (also taking into account escaping '\' character). It doesn't help
endwhile
I've also tried this method:
for c in split(jsonString, '\zs')
  " Do a lot of parsing ....
endfor
For reference, a file with ~29,000 characters takes about 4 seconds to process, which is unacceptable.
Is there a better way to iterate over a string in vim script?
Or better yet, have I missed a native function to parse JSON?
Update:
I asked for no dependencies because I:
Didn't want to deal with them
Genuinely wanted some ideas for best way to do this without someone else's work.
Sometimes I just like to do things manually even though the problem has already been solved.
I'm not against plugins or dependencies at all, it's just that I'm curious. Thus the question.
I ended up creating my own function to parse the JSON file. I was creating a script that could parse the package.json file associated with node.js modules. Because of this, I could rely on a fairly consistent format and quit the processing whenever I'd retrieved the information I needed. This usually cut out large chunks of the file since most developers put the largest chunk of the file, their "readme" section, at the end. Because the package.json file is strictly defined, I left the process somewhat fragile. It assumed a root dictionary { } and actively looks for certain entries. You can find the script here: https://github.com/ahayman/vim-nodejs-complete/blob/master/after/ftplugin/javascript.vim#L33.
Of course, this doesn't answer my own question. It's only the solution to my unique problem. I'll wait a few days for new answers and pick the best one before the bounty ends (already set an alarm on my phone).
The simplest solution with the least dependencies is just using the json_decode vim function.
let dict = json_decode(jsonString)
Even though Vim's origins go back a long way, its internal string()/eval() representation happens to be so close to JSON that it's likely to work unless you need special characters.
You can lookup the implementation here which even supports true/false/null if you want:
https://github.com/MarcWeber/vim-addon-json-encoding
It is better to use that library (vim-addon-manager makes it easy to install dependencies).
Whether this is good enough depends on your data.
Benjamin Klein posted your question to vim_use, which is why I'm replying.
You get the best and fastest replies if you subscribe to the Vim mailing list.
Go to vim.sf.net and follow the community link.
You cannot expect the Vim community to scrape Stack Overflow.
I've added the keywords "json" and "parsing" to that little piece of code so that it can be found more easily.
If this solution does not work for you, you can try one of the many :h if_* bindings, or write an external script that either extracts the information you're looking for or turns the JSON into Vim's dictionary representation, which eval() can read, provided you correctly escape the special characters you care about.
If you are after a completely correct solution, omitting dependencies is one of the worst things you can do. The eval() variant mentioned by @MarcWeber is one of the fastest, but it has its disadvantages:
1. With the solution for securing eval() that I mentioned in a comment, it is no longer the fastest. In fact, it makes eval() slower by more than an order of magnitude (0.02s vs 0.53s in my test).
2. It does not respect surrogate pairs.
3. It cannot be used to verify that you have correct JSON: it accepts some strings (e.g. "\<C-o>") that are not JSON strings, and it allows trailing commas.
4. It fails to give normal error messages. It fails badly if you use the vam#VerifyIsJSON I mentioned in point 1.
5. It fails to load floating-point values like 1e10 (Vim requires numbers to look like 1.0e10, but numbers like 1e10 are allowed: note “and/or” in the first paragraph).
All of the above statements (except for the first) also apply to vim-addon-json-encoding mentioned by @MarcWeber, because it uses eval(). There are some other possibilities:
Fastest and the most correct is using Python: pyeval('json.loads(vim.eval("varname"))'). It is not faster than eval(), but it is the fastest among the other possibilities (0.04 in my test: approximately two times slower than eval()).
Note that I use pyeval() here. If you want a solution for a Vim version that lacks this function, it will no longer be one of the fastest.
Use my json.vim plugin. Its advantages are error reporting that is slightly better than a failed vam#VerifyIsJSON (though slightly worse than eval()) and correct loading of floating-point numbers. It can be used for verification of strings (it does not accept "\<C-a>"), but it loads lists with a trailing comma just fine. It does not support surrogate pairs. It is also very slow: in the test I used (it uses 279702-character-long strings) it takes 11.59s to load. Json.vim tries to use Python if possible, though.
For the best error reporting you can take yaml.vim and purge YAML support out of it, leaving only JSON (I once did the same thing for pyyaml, though in Python: see the markedjson library used in powerline: it is pyyaml minus the YAML stuff plus classes with marks). But this variant is even slower than json.vim and should only be used if the main thing you need is error reporting: 207 seconds for loading the same 279702-character-long string.
Note that the only variant mentioned that satisfies both requirements “no dependencies” and “no python” is eval(). If you are not fine with its disadvantages you have to throw away one or both of these requirements. Or copy-paste code. Though if you take speed into account only two candidates are left: eval() and python: if you want to parse json fast you really must use C and only these solutions spend most time in functions written in C.
Most other interpreters (ruby/perl/TCL) do not have pyeval() equivalent so they will be slower even if their JSON implementation is written in C. Some other (lua/racket (mzscheme)) have pyeval() equivalent, but e.g. luaeval('{}') is zero meaning that you will have to add additional step explicitly and recursively converting objects into vim dictionaries and lists (e.g. luaeval('vim.dict({})')) which will impact performance. Cannot say anything about mzeval(), but I have never heard about anybody actually using racket (mzscheme) with vim.
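For reference, this is roughly what the Python side of the pyeval() route does with the problem cases listed above. It is a plain-Python sketch run outside Vim, so the vim module does not appear; the helper name load_json and its error format are made up for the example.

```python
import json


def load_json(text):
    """Parse JSON text; return (value, None) on success or (None, message) on failure."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        # lineno/colno/msg are attributes of json.JSONDecodeError.
        return None, "line %d, column %d: %s" % (exc.lineno, exc.colno, exc.msg)


# A number written without a fractional part before the exponent.
print(load_json('{"name": "demo", "big": 1e10}'))
# A trailing comma is rejected with a positioned error message.
print(load_json('[1, 2,]'))
# A surrogate pair written as two \u escapes becomes one character.
print(len(load_json('"\\ud83d\\ude00"')[0]))
```

So the strictness, floating-point and surrogate-pair points are handled by the json module itself; what remains on the Vim side is converting the result back into Vim values, which is what pyeval() does.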

Types of Errors during Compilation and at Runtime

I have this question in a homework assignment for my Computer Languages class. I'm trying to figure out what each one means, but I'm getting stuck.
Errors in a computer program can be classified according to when they are detected and, if they are detected at compile time, what part of the compiler detects them. Using your favorite programming language, give an example of:
(a) A lexical error, detected by the scanner.
(b) A syntax error, detected by the parser.
(c) A static semantic error, detected (at compile-time) by semantic analysis.
(d) A dynamic semantic error, detected (at run-time) by code generated by the compiler.
For (a), I think this would be correct: int char foo;
For (b), int foo (no semicolon)
For (c) and (d), I'm not sure what is being asked.
Thanks for the help.
I think it's important to understand what a scanner is, what a parser is and how they are involved in the compilation process.
(I'll try my best at a high-level explanation)
The scanner takes a sequence of characters (a source file) and converts it to a sequence of tokens. e.g., sees the text if 234 ) and converts to the tokens, IF INTEGER RPAREN (there's more to it but should be enough for the example).
Another way to think of how the scanner works is that it takes the text and makes sure you use the correct keywords and don't make them up. It has to be able to convert the entire source file to the associated language's recognized tokens, and this varies from language to language. In other words, "Does every piece of text correspond to a construct the language understands?" Or, better put with an example, "Do all these words found in a book belong to the English language?"
The parser takes a sequence of tokens (usually from the scanner) and (among other things) sees if it is well formed. e.g., a C variable declaration is in the form Type Identifier SEMICOLON.
The parser checks "Does this sequence of tokens in this order make sense to me?" And similarly the analogy, "Does this sequence of English words (with punctuation) form complete sentences?"
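To make the split between the two stages a bit more tangible, here is a toy sketch in Python. The token set, the single declaration rule and all names (TOKEN_SPEC, scan, parse_declaration, LexicalError, ParseError) are invented for illustration and are far smaller than anything a real C compiler uses.

```python
import re

# Toy token definitions; order matters (keywords before identifiers).
TOKEN_SPEC = [
    ("IF", r"if\b"),
    ("TYPE", r"(?:int|char)\b"),
    ("INTEGER", r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


class LexicalError(Exception):
    """Raised by the scanner when no token matches."""


class ParseError(Exception):
    """Raised by the parser when the token order is not well formed."""


def scan(text):
    """Turn characters into a list of token names, or raise LexicalError."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise LexicalError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "SKIP":
            tokens.append(m.lastgroup)
        pos = m.end()
    return tokens


def parse_declaration(tokens):
    """Accept exactly: TYPE IDENTIFIER SEMICOLON."""
    if tokens != ["TYPE", "IDENTIFIER", "SEMICOLON"]:
        raise ParseError(f"expected TYPE IDENTIFIER SEMICOLON, got {' '.join(tokens)}")
    return True


print(scan("if 234 )"))
print(parse_declaration(scan("int foo;")))
```

In this toy setup, int f@o; fails in the scanner because @ belongs to no token, whereas int foo (no semicolon) and int char foo; both scan cleanly and are only rejected by the parser.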
(c) asks for errors that can be found when compiling the program. (d) asks for errors that you see when running the program after it compiled successfully. You should be able to distinguish these two by now, hopefully.
I hope this helps you get a better understanding and make answering these easier.
I'll give it a shot. Here's what I think:
a. int foo+; (foo+ is an invalid identifier because + is not a valid char in identifiers)
b. foo int; (Syntax error is any error where the syntax is invalid - either due to misplacement of words, bad spelling, missing semicolons etc.)
c. Static semantic errors are logical errors, e.g. passing a float as the index of an array: arr[1.5] should be a static semantic error.
d. I think exceptions like NullReferenceException might be an example of a dynamic semantic error. I'm not completely sure, but covariant returns that raise an exception (at compile time in some languages) might also come into this category. Also, passing the wrong type of object to another object (like passing a Cat to a Person object at runtime) might qualify. The simplest example would be trying to access an index that is out of bounds of the array.
Hope this helps.
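If the favorite language happens to be Python, the four categories can be poked at with the built-in compile() function. This is only a sketch: Python reports the first three all as SyntaxError, so the exception type does not say which phase caught the problem, and the classification in the comments is my own reading. The helper names compiles and dynamic_semantic are made up.

```python
def compiles(source):
    """Return None if the source compiles, otherwise the SyntaxError message."""
    try:
        compile(source, "<example>", "exec")
        return None
    except SyntaxError as exc:
        return exc.msg


# (a) lexical: '?' is not a valid Python token.
lexical = compiles("x = 1 ? 2")
# (b) syntax: the tokens are fine, but the expression is never closed.
syntactic = compiles("x = (1 + ")
# (c) static semantic: well-formed syntax, rejected because x is assigned
#     before the global declaration in the same function.
static_semantic = compiles("def f():\n    x = 1\n    global x\n")


def dynamic_semantic():
    """(d) compiles fine, fails only when executed: index out of bounds."""
    items = [1, 2, 3]
    try:
        return items[5]
    except IndexError as exc:
        return str(exc)


print(lexical, syntactic, static_semantic, dynamic_semantic(), sep="\n")
```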

Where to translate message strings - in the view or in the model?

We have a multilingual (PHP) application and use gettext for i18n. There are a few classes in the backend/model that return messages or message formats for printf().
We use xgettext to extract the strings that we want to translate.
We apply the gettext function T_() in the frontend/view - this seems to be where it belongs. So far we kept the backend clean from T_() calls, this way we can also unit-test messages.
So in the frontend we have something like
echo T_($mymodel->getMessage());
or
printf(T_($mymodel->getMessageFormat()), $mymodel->getValue());
This makes it impossible to apply xgettext to extract the strings, unless we put some dummy T_("my message %s to translate") call in the MyModel class.
So this leads to the more general question:
Do you apply translation in the backend classes? If not, where do you apply translation, and how do you keep track of the strings which you have to translate?
(I am aware of Question: poedit workaround for dynamic gettext.)
My backend classes usually output English strings with parameters left out. Example
["Good job %s you have %d points", "Paul", 10]
Then the key for the translation is the English string (since I don't really like message codes).
For me, translation is entirely a view issue, except where there are clearly defined business reasons, like having to store displayed messages as shown. The latter could happen, for example, if you want to store a sent invoice as delivered to the client.
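The idea can be sketched in a few lines. The example is in Python rather than PHP, and everything in it is hypothetical: the N_ marker function, the Score class, and the in-memory GERMAN dictionary that stands in for a real gettext catalog. The one real tool assumed is xgettext, whose --keyword option can be told to treat N_ as a marker for extraction.

```python
def N_(message):
    """No-op marker: returns the string unchanged so that an extractor can find it."""
    return message


class Score:
    """Hypothetical model class; it never translates anything itself."""

    def __init__(self, name, points):
        self.name = name
        self.points = points

    def message(self):
        # Untranslated English format plus its parameters, as in the example above.
        return (N_("Good job %s you have %d points"), self.name, self.points)


# Hypothetical in-memory catalog standing in for a compiled gettext catalog.
GERMAN = {"Good job %s you have %d points": "Gut gemacht %s, du hast %d Punkte"}


def render(message, catalog):
    """View side: look up the English key, then fill in the parameters."""
    fmt, *args = message
    return catalog.get(fmt, fmt) % tuple(args)


print(render(Score("Paul", 10).message(), GERMAN))
print(render(Score("Paul", 10).message(), {}))
```

The model stays testable against plain English strings, the marker gives the extractor something to find, and the lookup happens only in the view.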