Issues with parsing HTML with ragel - html

In my project I need to extract links from an HTML document.
For this purpose I've prepared a Ragel HTML grammar, primarily based on this work:
https://github.com/brianpane/jitify-core/blob/master/src/core/jitify_html_lexer.rl
(mentioned here: http://ragel-users.complang.narkive.com/qhjr33zj/ragel-grammars-for-html-css-and-javascript )
Almost everything works well (thanks for the great tool!), except for one issue I haven't been able to overcome so far:
If I give it this text as input:
bbbb <a href="first_link.aspx"> cccc<a href="/second_link.aspx">
my parser correctly extracts the first link, but not the second one.
The difference between them is that there is a space between 'bbbb' and '<a', but no space between 'cccc' and '<a'.
In general, if any text other than whitespace immediately precedes the '<a' tag, the parser treats the tag as part of the content and does not recognize the tag opening.
An intentionally simplified sample, with a grammar meant to be built as a C program (ngx_url_html_portion.rl), is in this repo: https://github.com/amdei/ragel_html_sample
There is also an input file, input-nbsp.html, which is expected to contain the input for the application.
To play with it, generate the .c file from the grammar:
ragel ngx_url_html_portion.rl
then compile the resulting .c file and run the program.
The input file should be in the same directory.
I will be sincerely grateful for any clue.

The issue with the defined FSM is that 'content' includes every character up to the next whitespace, so a '<' that directly follows text is consumed as content. You should exclude the HTML tag opening character '<' from the rule. Here is the diff for illustration:
$ git diff
diff --git a/ngx_url_html_portion.rl b/ngx_url_html_portion.rl
index ccef0ca..1f8dcf0 100644
--- a/ngx_url_html_portion.rl
+++ b/ngx_url_html_portion.rl
@@ -145,7 +145,7 @@ void copy2hrefbuf(par_t* par, u_char* p){
);
content = (
- any - (space )
+ any - (space ) - '<'
)+;
html_space = (
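
The effect of the one-character change can be reproduced outside Ragel. The following is only a rough Python sketch, not the grammar from the repo: it assumes a scanner with three token kinds (tag, content, whitespace) that always takes the longest match, and the names `extract_links`, `TAG` and `SPACE` are made up for illustration.

```python
import re

# Hypothetical stand-ins for the grammar's rules (not the real Ragel definitions).
TAG = re.compile(r'<a\s+href="([^"]*)"[^>]*>')
SPACE = re.compile(r'\s+')

SAMPLE = 'bbbb <a href="first_link.aspx"> cccc<a href="/second_link.aspx">'


def extract_links(text, content_pattern):
    """Scan text token by token, longest match first, and collect href values."""
    content = re.compile(content_pattern)
    links = []
    pos = 0
    while pos < len(text):
        candidates = (TAG.match(text, pos), content.match(text, pos), SPACE.match(text, pos))
        matches = [m for m in candidates if m]
        if not matches:
            pos += 1  # skip a character no rule accepts, e.g. a stray '<'
            continue
        best = max(matches, key=lambda m: m.end())  # longest match wins
        if best.re is TAG:
            links.append(best.group(1))
        pos = best.end()
    return links


# content = any - space: 'cccc<a' is swallowed as one content token.
print(extract_links(SAMPLE, r'[^\s]+'))
# content = any - space - '<': content stops before '<', so the tag is seen.
print(extract_links(SAMPLE, r'[^\s<]+'))
```

With the first content definition only first_link.aspx is reported; with '<' excluded, both links are found.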

Related

How to use regex (regular expressions) in Notepad++ to remove all HTML and JSON code that does not contain a specific string?

Using regular expressions (in Notepad++), I want to find all JSON sections that contain the string foo. Note that the JSON just happens to be embedded within a limited set of HTML source code which is loaded into Notepad++.
I've written the following regex to accomplish this task:
({[^}]*foo[^}]*})
This works as expected for all possible input.
I want to improve my workflow, so instead of just finding all such JSON sections, I want to write a regex to remove all the HTML & JSON that does not match this expression. The result will be only JSON sections that contain foo.
I tried using the Notepad++ regex Replace functionality with this find expression:
(?:({[^}]*?foo[^}]*?})|.)+
and this replace expression:
$1\n\n$2\n\n$3\n\n$4\n\n$5\n\n$6\n\n$7\n\n$8\n\n$9\n\n
This successfully works for the last occurrence of foo within the JSON, but does not find the rest of the occurrences.
How can I improve my code to find all the occurrences?
Here is a simplified minimal example of input and desired output. I hope I haven't simplified it too much for it to be useful:
Simplified input:
<!DOCTYPE html>
<html>
<div dat="{example foo1}"> </div>
<div dat="{example bar}"> </div>
<div dat="{example foo2}"> </div>
</html>
Desired output:
{example foo1}
{example foo2}
You can use
{[^}]*foo[^}]*}|((?s:.))
Replace with (?1:$0\n). Details:
{[^}]*foo[^}]*} - {, zero or more chars other than }, foo, zero or more chars other than } and then a }
| - or
((?s:.)) - Capturing group 1: any one char ((?s:...) is an inline modifier group where . matches all chars including line break chars, same as if you enabled . matches newline option).
The (?1:$0\n) replacement pattern replaces with an empty string if Group 1 was matched, else the replacement is the match text + a newline.
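
For anyone who wants to try the same idea outside Notepad++, here is a minimal Python sketch. It assumes Python 3.6+ (for the scoped `(?s:.)` flag) and replaces the conditional replacement string with a callback; `keep_foo_sections` is a made-up name.

```python
import re

# Same alternation as above: a {...foo...} section, or any single character
# (captured in group 1) that should be thrown away.
PATTERN = re.compile(r'\{[^}]*foo[^}]*\}|((?s:.))')


def keep_foo_sections(text):
    """Drop every stray character and keep each foo section on its own line."""
    def replace(match):
        if match.group(1) is not None:
            return ''                   # group 1 matched: a character to remove
        return match.group(0) + '\n'    # a foo section: keep it, add a newline
    return PATTERN.sub(replace, text)


SAMPLE = '''<!DOCTYPE html>
<html>
<div dat="{example foo1}"> </div>
<div dat="{example bar}"> </div>
<div dat="{example foo2}"> </div>
</html>
'''

print(keep_foo_sections(SAMPLE), end='')
```

The callback plays the role of `(?1:$0\n)`: an empty string when group 1 took part in the match, the whole match plus a newline otherwise.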
Updates
The comment section was full, so I am suggesting the code here instead.
Let me know if this is close to your intended result:
Find: ({.+?[\n]*foo[ \d]*})|.*?
Replace all: $1
Also added Toto's example

Dita-OT: markdown to HTML escape string / character (specifically brackets)

Something's been puzzling me for the better half of a workday now: what is actually going on during the Markdown to HTML conversion in DITA-OT when I try to keep brackets intact?
Specifically, this is my original markdown:
1. Value[:, :]
This should be written as-is in HTML. However, this is the HTML element produced by DITA-OT:
<li class="li">
<p class="p">
Value
<span class="xref"></span>
</p>
</li>
Expected output:
<li class="li">
<p class="p">
Value[:, :]
</p>
</li>
This suggests the brackets are interpreted as an external reference (?)
I run the Markdown to HTML conversion with the dita CLI, version 3.1.2 (Windows 10), using the following command:
dita --input=root.ditamap --output=./output --format=html5
The root.ditamap simply contains a single topic that is my markdown file.
I tried the following first:
1) Using \ to escape the brackets results in:
1. Value\[:, :\]
2) Using HTML entities in place of the brackets ([ and ]) results in: 1. Value:, :
3) Using UTF codes in place of the brackets ([ and ]) results in:
1. Value:, :
Then I tried adding more brackets, and it worked!
4) Markdown that worked: 1. Value[[]:, :[]] produced the expected output 1. Value[:, :]
My question(s):
1) Which of the three pieces is responsible for this behaviour: Markdown, DITA-OT or HTML? (By this behaviour I mean the interpretation of brackets in a way that made them disappear during the original conversion.)
2) Is there a "better"/"universal" way to escape strings in markdown -> html by dita? (By better way I mean something that will leave the original markdown's string meaning the same, and by universal I mean something that can be applied to all strings not only brackets)
At the very least I hope my findings will be useful to someone, even though I realize my use-case is very specific. :)

Regular expression remove some links

I need a regular expression to strip the HTML tags from some links.
Example:
link
fasafiso
should be converted to
link
fasafiso
Depending on your programming language, you could come up with something like:
~<a href="sample\.com" [^>]*>(.*?)</a>~
# delimiter ~
# look for <a, everything that is not > and >
# capture everything lazily in a group
# look for a closing tag
# delimiter ~
In your example, group 1 would hold fasafiso and could be inserted into the replacement via $1.
See a demo for this approach on regex101.com.
Hint:
This is just a quick-and-dirty solution (e.g. for text editors). If this is getting more complicated, consider using a parser instead.
I'll assume you want to replace all links whose target is sample.com with their content:
match <a[^>]*href="sample\.com"[^>]*>([^<]*)</a>
replace by \1
For example with sed (-E enables the unescaped capture group, and ~ is used as the delimiter because the pattern contains a /):
sed -E 's~<a[^>]*href="sample\.com"[^>]*>([^<]*)</a>~\1~g'
Please also keep in mind that if your requirements are complex enough you should instead be using an HTML parser.
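
The same substitution can be sketched in Python. This is only an illustration under assumptions: the markup of the original example is not shown above, so the HTML string below is invented, and `unwrap_links` is a hypothetical helper name.

```python
import re

# Matches an <a> tag whose href is exactly "sample.com" and captures its text.
LINK = re.compile(r'<a\b[^>]*\bhref="sample\.com"[^>]*>([^<]*)</a>')


def unwrap_links(html):
    """Replace every link that points at sample.com with its inner text."""
    return LINK.sub(r'\1', html)


# Invented input: one link to keep, one link to strip.
html = '<a href="other.com">link</a> <a href="sample.com" class="x">fasafiso</a>'
print(unwrap_links(html))
```

Because `[^>]*` cannot cross a `>`, the search for `href="sample\.com"` stays inside a single opening tag, so links to other targets are left alone. Nested markup inside the link text is not handled, which is one of the cases where a real HTML parser is the safer choice.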

Create an NCX file with Notepad++ and Regular expression

I have an HTML Table of Contents page containing a list of book chapters with hyperlinks:
Multimedia Implementation<br/>
Table of Contents<br/>
About the Author<br/>
About the Technical Reviewers<br/>
Acknowledgments<br/>
Part I: Introduction and Overview<br/>
Chapter 1. Technical Overview<br/>
...
I want to create an NCX file for a Kindle book, which must contain details as follows:
<navPoint id="n1" playOrder="1">
<navLabel>
<text>Multimedia Implementation</text>
</navLabel>
<content src="final/main.html"/>
</navPoint>
<navPoint id="n2" playOrder="2">
<navLabel>
<text>Table of Contents</text>
</navLabel>
<content src="final/toc.html"/>
</navPoint>
<navPoint id="n3" playOrder="3">
<navLabel>
<text>About the Author</text>
</navLabel>
<content src="final/pref01.html"/>
</navPoint>
...
I'm using Notepad++: is it possible to automate this process with regular expressions?
You cannot do everything with a regex alone, but you can split the problem into two parts:
generate strings like <navPoint id="n1" playOrder="1"> using program logic (an incrementing variable)
do the rest with a regex
Use the following regex to match:
<a\shref="([^"]*)">([^<]*)<\/a><br\/>
And replace with:
(generated string)\n<navLabel>\n<text>\2</text>\n</navLabel>\n<content src="\1"/>\n</navPoint>
Yes, it is possible to replace the links with <navPoint> tags. The only thing I found no solution for is the incremental numbering of the <navPoint> attributes id and playOrder...
The following regex will do most of the work:
/^<a[^>]*href="([^"]+)"[^>]*>([^<]+).*$/gm
substitute with:
<navPoint id="n" playOrder="">\n<navLabel><text>$2</text></navLabel>\n<content src="$1" />\n</navPoint>\n
Regex details
/^<a .. only parse lines that start with an `<a` tag
[^>]*href=" .. find the first occurrence of `href="` inside the opening tag
([^"]+) .. capture the text and stop when a " is found
"[^>]*> .. find the end of the <a> tag
([^<]+) .. capture the text and stop when a < is found (i.e. the </a> tag)
.*$/ .. continue to end of the line
gm .. search the whole string and parse each line individually
A more detailed (but also more confusing) explanation is here:
https://regex101.com/r/gA0yJ2/1
This link also demonstrates how the regex is working. You can test changes there if you like
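
Since the incrementing id and playOrder values are the part a plain search-and-replace cannot produce, a small script can cover both steps at once. The following Python sketch is one possible way to do it, not a tested Kindle workflow: it assumes each TOC entry looks like `<a href="...">Title</a><br/>`, and the names `toc_to_navpoints`, `LINK` and `NAV_POINT` are made up.

```python
import re

# Assumed shape of one TOC entry: <a href="final/main.html">Title</a><br/>
LINK = re.compile(r'<a[^>]*href="([^"]+)"[^>]*>([^<]+)</a>')

NAV_POINT = (
    '<navPoint id="n{n}" playOrder="{n}">\n'
    '<navLabel>\n'
    '<text>{label}</text>\n'
    '</navLabel>\n'
    '<content src="{src}"/>\n'
    '</navPoint>'
)


def toc_to_navpoints(toc_html):
    """Turn every link in the TOC into a numbered <navPoint> element."""
    points = []
    # enumerate supplies the counter that the regex replacement cannot.
    for n, match in enumerate(LINK.finditer(toc_html), start=1):
        src, label = match.groups()
        points.append(NAV_POINT.format(n=n, label=label.strip(), src=src))
    return '\n'.join(points)


toc = (
    '<a href="final/main.html">Multimedia Implementation</a><br/>\n'
    '<a href="final/toc.html">Table of Contents</a><br/>\n'
)
print(toc_to_navpoints(toc))
```

The link text is copied through unchanged, so any entities already present in the HTML stay as they are; titles containing nested tags would need a real HTML parser.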

Emmet - Wrap with Abbreviation - Token that represents the wrapped text i.e. {original text}

I'm attempting to convert a list of URLs into HTML links as lazily as possible:
www.annaandsally.com.au
www.babylush.com.au
www.babysgotstyle.com.au
... etc
Using Wrap with Abbreviation, I'd like to do something like: a[href="http://${1}/"]*
The expanded abbreviation would result in:
www.annaandsally.com.au
www.babylush.com.au
www.babysgotstyle.com.au
... etc
The missing piece of the puzzle is an abbreviation token that represents the text being wrapped.
Any idea if this can be done?
If they are already on their own lines (which in the question, they look like they are), a simple Find and Replace with RegEx turned on will work. The Params are as follows:
Find What:
(.+)
Replace With:
$1
Sergey from Emmet was kind enough to point me in the right direction. The $# token contains the original content:
a[href="http://$#/"]*>{$#}
By specifying $# as the href attribute, the original content is no longer 'wrapped' and must be reinserted via {$#}.
http://docs.emmet.io/actions/wrap-with-abbreviation/#controlling-output-position
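
If the list has to be converted outside the editor, the same result can be approximated with a few lines of Python. This is a sketch under assumptions, not something Emmet provides: it assumes one bare host name per line, mirrors the http:// prefix and trailing slash from the abbreviation, and `wrap_urls` is a made-up name.

```python
def wrap_urls(lines):
    """Wrap each non-empty line in an anchor, using the line as href and as text."""
    links = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # ignore blank lines
        # The URL appears twice, like $# in the attribute and {$#} as the text.
        links.append(f'<a href="http://{url}/">{url}</a>')
    return links


urls = [
    'www.annaandsally.com.au',
    'www.babylush.com.au',
    'www.babysgotstyle.com.au',
]
print('\n'.join(wrap_urls(urls)))
```

No escaping is applied, which is fine for plain host names like the ones above but would need `html.escape` for arbitrary text.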