Mako template filter ordering - mako

When outputting some UGC ($user.text) in a mako template, I'd like to sanitise the content using the mako filter 'h' and then add some <br> tags in place of newlines so there is a bit of formatting.
However, it appears that mako ignores the order that I apply the 'h' filter and now my <br> tags are being escaped and not rendered.
This is my br-adding filter:
<%
def nl2br(str):
    return str.replace("\n", "<br/>")
%>
This is my test string:
hello,
My name is
James
The following mako tags with filters:
${user.text | n,h,nl2br}
${user.text | n,nl2br,h}
... generate the same html with <br> tags escaped:
hello,
&lt;br/&gt;
&lt;br/&gt;My name is
&lt;br/&gt;
&lt;br/&gt;James
The only way I've been able to find to allow the <br> tags to come through without escaping is to remove the 'h' filter altogether as follows:
${user.text | n,nl2br}
But this defeats the object of sanitising the user.text field.
How can I get the 'h' filter to fire and then add <br> tags?
Am I missing something with buffers?

The behaviour you're seeing is due to the fact that Markup.replace assumes that the replacement string is unsafe:
>>> from markupsafe import Markup, escape
>>> e = escape(">x\ny")
>>> e
Markup(u'&gt;x\ny')
>>> e.replace("\n", "<br />")
Markup(u'&gt;x&lt;br /&gt;y')
The solution is to tell markupsafe that the <br /> is trusted:
>>> e.replace("\n", Markup("<br />"))
Markup(u'&gt;x<br />y')
So your nl2br filter should be:
from markupsafe import Markup
def nl2br(s):
    return s.replace("\n", Markup("<br />"))
And then ${user.text|h,nl2br} should work as expected.

I wasn't able to work out what's going on with the ordering above, so if you have the answer please post it here. Instead I used a workaround...
Since the h filter was firing too late in the filter order, I created my own version of it and forced it to return the escaped HTML as a plain string by calling __str__.
def early_html_escape(string):
    """Run markupsafe escaping and force the result to string."""
    import markupsafe
    return markupsafe.escape(string).__str__()
This then allowed me to pass the user's text first through HTML escaping and then the nl2br filter, without the <br> tags being converted into HTML entities.
${user.text | early_html_escape,nl2br }
Hope that helps someone.
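For anyone who wants to check the ordering outside of a template, here is a minimal standard-library sketch. It assumes html.escape as a stand-in for the 'h' filter (it is not what Mako calls internally), and the name escape_then_nl2br is made up for illustration. Because the escaped result is a plain str, the later replace does not re-escape the <br/> tags.

```python
import html


def escape_then_nl2br(text: str) -> str:
    """Escape user text first, then turn newlines into <br/> tags."""
    # Step 1: neutralise any markup the user typed.
    escaped = html.escape(text)
    # Step 2: the result is an ordinary str, so the inserted tags stay raw.
    return escaped.replace("\n", "<br/>")


if __name__ == "__main__":
    sample = "hello,\n<b>My name</b> is\nJames"
    print(escape_then_nl2br(sample))
```

The user's own <b> tag comes out as entities while the line breaks become real <br/> tags, which is the behaviour the question was after.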

Related

Using beautifulsoup to separate strings separated by `<br>`

I want to get some data from a website that uses <br>. In the html parsed using beautifulsoup4 sometimes I have the following pattern:
"<p class=some_class>text_1. text_2 (text_3<span class=GramE>)</span>
<br>
text_4,<span style='mso-fareast-font-family:"Arial Unicode MS"'>
</span>text_5.</p>"
But if the website was written in a nicer way, it would have looked like:
"<p class=some_class>text_1. text_2(text_3<span class=GramE>)</span
</p> <p class=some_class>
text_4,<span style='mso-fareast-font-family:"Arial Unicode MS"'>
</span>text_5.</p>
To extract the strings I want, I would have extracted all text within each <p>.
However, now the strings I want to separate are separated by <br>.
My question is the following: how can I use <br> to disentangle the parts of the string I am interested in? I mean, I want something like [text_1.+text_2+text_3, text_4+text_5.].
I'm explicitly asking about the use of <br> since it is the only element I have found that separates the strings I'm interested in. Moreover, in some other parts of the website, I have <br/> separating the strings I'm interested in, instead of <br>.
I cannot solve this by using the replace() function since my object is a Tag from bs4. Also, using find("br") from bs4 gives me "<br/>" and not the text I want, so the answers in this question are not exactly what I need. I think that one way would be to convert the bs4 tag to an HTML string, replace the "<br/>" using replace(), and finally convert it back to a bs4 element. However, I do not know how to do this, and I also want to know if there is an easier and shorter way.
This is a solution I found, but it's long and inefficient since it does not use any bs4 feature for the split itself. It works, though.
html_doc = """
"<p class=some_class>text_1. text_2 (text_3<span class=GramE>)</span>
<br>
text_4,<span style='mso-fareast-font-family:"Arial Unicode MS"'>
</span>text_5.</p>"
"""
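from bs4 import BeautifulSoup
def replace_br(soup_object):
    # Serialise to an HTML string, turn each <br> into a paragraph boundary, then re-parse.
    html1 = str(soup_object)
    html1 = html1.replace("<br>", "</p> <p>")
    soup_html1 = BeautifulSoup(html1, 'html.parser')
    return soup_html1.find_all("p")
replace_br(html_doc)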
[<p class="some_class">text_1. text_2 (text_3<span class="GramE">)</span>
</p>, <p>
text_4,<span style='mso-fareast-font-family:"Arial Unicode MS"'>
</span>text_5.</p>]
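One possible alternative, sketched with only the standard library rather than bs4: walk the markup with html.parser and start a new text segment whenever a <br> (or <br/>) is seen. The class name BrSplitter and the helper split_on_br are hypothetical, and the whitespace normalisation at the end is an assumption about what the desired output looks like.

```python
from html.parser import HTMLParser


class BrSplitter(HTMLParser):
    """Collect text segments, starting a new segment at every <br>."""

    def __init__(self):
        super().__init__()
        self.segments = [[]]

    def handle_starttag(self, tag, attrs):
        # <br/> also lands here: the default handle_startendtag calls this.
        if tag == "br":
            self.segments.append([])

    def handle_data(self, data):
        self.segments[-1].append(data)


def split_on_br(html_text):
    parser = BrSplitter()
    parser.feed(html_text)
    parser.close()
    # Join each segment and collapse runs of whitespace and newlines.
    return [" ".join("".join(parts).split()) for parts in parser.segments]


if __name__ == "__main__":
    print(split_on_br("<p>text_1. text_2 (text_3<span>)</span>\n<br>\ntext_4,<span>\n</span>text_5.</p>"))
```

This treats <br> and <br/> the same way, so no string replacement or re-parsing is needed.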

Ruby: Including raw HTML in a Nokogiri HTML builder

I'm writing code to convert a fixed XML schema to HTML. I'm trying to use Nokogiri, and it works for most tags, e.g.:
# doc is the Nokogiri html builder, text_inline is a TextInlineContent node
def consume_inline_content?(doc, text_inline)
  text = text_inline.text
  case text_inline.name
  when 'text'
    doc.text text
  when 'emphasized'
    doc.em {
      doc.text text
    }
  # ... and so on ...
  end
end
The problem is, this schema also includes a rawHTML text node. Here is some of my input:
<rawHTML><![CDATA[<h2>]]></rawHTML>
Stuff
<rawHTML><![CDATA[</h2>]]></rawHTML>
which should ideally be rendered as <h2>Stuff</h2>. But when I try the "obvious" thing:
...
when 'rawHTML'
doc << text
...
Nokogiri produces <h2></h2>Stuff. It seems to be "fixing" the unbalanced open tag before I have a chance to insert its contents or closing tag.
I recognize that I'm asking about a feature that could produce malformed html, and maybe the builder doesn't want to allow that. Is there a right way to handle this situation?

Extracting plain text from html files in R [duplicate]

I'm trying to read web page source into R and process it as strings. I'm trying to take the paragraphs out and remove the html tags from the paragraph text. I'm running into the following problem:
I tried implementing a function to remove the html tags:
cleanFun=function(fullStr)
{
  #find location of tags and citations
  tagLoc=cbind(str_locate_all(fullStr,"<")[[1]][,2],str_locate_all(fullStr,">")[[1]][,1]);
  #create storage for tag strings
  tagStrings=list()
  #extract and store tag strings
  for(i in 1:dim(tagLoc)[1])
  {
    tagStrings[i]=substr(fullStr,tagLoc[i,1],tagLoc[i,2]);
  }
  #remove tag strings from paragraph
  newStr=fullStr
  for(i in 1:length(tagStrings))
  {
    newStr=str_replace_all(newStr,tagStrings[[i]][1],"")
  }
  return(newStr)
};
This works for some tags but not all tags, an example where this fails is following string:
test="junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
The goal would be to obtain:
cleanFun(test)="junk junk junk junk"
However, this doesn't seem to work. I thought it might be something to do with string length or escape characters, but I couldn't find a solution involving those.
This can be achieved simply through regular expressions and the grep family:
cleanFun <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}
This will also work with multiple html tags in the same string!
This finds any instances of the pattern <.*?> in the htmlString and replaces it with the empty string "". The ? in .*? makes it non greedy, so if you have multiple tags (e.g., <a> junk </a>) it will match <a> and </a> instead of the whole string.
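To make the greedy versus non-greedy difference concrete, here is a small sketch in Python (chosen only for illustration; the pattern is the same one passed to gsub above). The function names strip_tags_lazy and strip_tags_greedy are hypothetical.

```python
import re


def strip_tags_lazy(html_string):
    # <.*?> stops at the first '>' it can, so each tag is removed separately.
    return re.sub(r"<.*?>", "", html_string)


def strip_tags_greedy(html_string):
    # <.*> runs to the last '>' on the line, swallowing the text between tags.
    return re.sub(r"<.*>", "", html_string)


if __name__ == "__main__":
    sample = "<a> junk </a>"
    print(repr(strip_tags_lazy(sample)))
    print(repr(strip_tags_greedy(sample)))
```

With the lazy pattern the text between the two tags survives; with the greedy one the whole span from the first < to the last > is removed.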
You can also do this with two functions in the rvest package:
library(rvest)
strip_html <- function(s) {
  html_text(read_html(s))
}
Example output:
> strip_html("junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk")
[1] "junk junk junk junk"
Note that you should not use regexes to parse HTML.
Another approach, using tm.plugin.webmining, which uses XML internally.
> library(tm.plugin.webmining)
> extractHTMLStrip("junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk")
[1] "junk junk junk junk"
An approach using the qdap package:
library(qdap)
bracketX(test, "angle")
## > bracketX(test, "angle")
## [1] "junk junk junk junk"
It is best not to parse html using regular expressions; see "RegEx match open tags except XHTML self-contained tags".
Use a package like XML. Read the html code in, parse it using, for example, htmlParse, and use XPath expressions to find the quantities relevant to you.
UPDATE:
To answer the OP's question
require(XML)
xData <- htmlParse('yourfile.html')
xpathSApply(xData, 'appropriate xpath', xmlValue)
It may be easier with sub or gsub ?
> test <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
> gsub(pattern = "<.*>", replacement = "", x = test)
[1] "junk junk junk junk"
First, your subject line is misleading; there are no backslashes in the string you posted. You've fallen victim to one of the classic blunders: not as bad as getting involved in a land war in Asia, but notable all the same. You're mistaking R's use of \ to denote escaped characters for literal backslashes. In this case, \" means the double quote mark, not the two literal characters \ and ". You can use cat to see what the string would actually look like if escaped characters were treated literally.
Second, you're using regular expressions to parse HTML. (They don't appear in your code, but they are used under the hood in str_locate_all and str_replace_all.) This is another of the classic blunders; see here for more exposition.
Third, you should have mentioned in your post that you're using the stringr package, but this is only a minor blunder by comparison.
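The escaped-quote point is not specific to R. As a rough illustration in Python (the variable name s is arbitrary), a \" inside a double-quoted literal is a single quote character, and printing the string shows what it really contains:

```python
# The literal below is written with \" but the string holds no backslashes.
s = "title=\"abstraction\""

# print shows the actual characters, much like cat does in R.
print(s)

# repr shows a source-code form of the same string.
print(repr(s))

contains_backslash = "\\" in s
print(contains_backslash)
```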

Yahoo Pipes and Regex with an html formatting issue

I am struggling to see how to use the regex to add a non-printable carriage return character into an html string.
It's a WordPress thing: to auto-embed a video I need to put the URL on its own line in the html.
First I use a regex:
In item.vid_src replace ($) with \\r$1
s is checked.
After which I am using a loop with a string builder in it - I am prefixing vid_src to the start of description thus:
item.vid_src
<br><br>
item.description
assign results to item.description
Before I include the Regex module in the pipe I get this:
http://www.youtube.com/watch?v=THA_5cqAfCQ<br><br><p><h1 class="MsoNormal">Cheetahs on
the edge</h1>
But I need this:
http://www.youtube.com/watch?v=THA_5cqAfCQ
<br><br><p><h1 class="MsoNormal">Cheetahs on the edge</h1>
Adding the regex module I get this:
http://www.youtube.com/watch?v=THA_5cqAfCQ\r<br><br><p><h1 class="MsoNormal">
Cheetahs on the edge</h1>
Clearly it's inserting exactly what I asked for, but it is not what I was expecting: I need the html to be formatted with a real newline. Does anybody have any insight into how to tackle the problem?

Regex select all text between tags

What is the best way to select all the text between 2 tags - ex: the text between all the '<pre>' tags on the page.
You can use "<pre>(.*?)</pre>", (replacing pre with whatever text you want) and extract the first group (for more specific instructions specify a language) but this assumes the simplistic notion that you have very simple and valid HTML.
As other commenters have suggested, if you're doing something complex, use a HTML parser.
A tag can be closed on a later line, which is why \n needs to be matched as well.
<PRE>(.|\n)*?<\/PRE>
To exclude the delimiting tags:
(?<=<pre>)(.*?)(?=</pre>)
(?<=<pre>) looks for text after <pre>
(?=</pre>) looks for text before </pre>
The result will be the text inside the pre tag.
This is what I would use.
(?<=(<pre>))(\w|\d|\n|[().,\-:;##$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| )+?(?=(</pre>))
Basically what it does is:
(?<=(<pre>)) The selection has to be preceded by the <pre> tag
(\w|\d|\n|[().,\-:;##$%^&*\[\]"'+–/\/®°⁰!?{}|~]| ) This is just a regular expression I want to apply. In this case, it selects letter or digit or newline character or some special characters listed in the example in the square brackets. The pipe character | simply means "OR".
+? Plus character states to select one or more of the above - order does not matter. Question mark changes the default behavior from 'greedy' to 'ungreedy'.
(?=(</pre>)) The selection has to be followed by the </pre> tag
Depending on your use case you might need to add some modifiers like (i or m)
i - case-insensitive
m - multi-line search
Here I performed this search in Sublime Text so I did not have to use modifiers in my regex.
Javascript does not support lookbehind
The above example should work fine with languages such as PHP, Perl, Java ...
Javascript however does not support lookbehind so we have to forget about using `(?<=(<pre>))` and look for some kind of workaround. Perhaps simply strip the leading <pre> from each result, like in here
https://stackoverflow.com/questions/11592033/regex-match-text-between-tags
Also look at the JAVASCRIPT REGEX DOCUMENTATION for non-capturing parentheses
use the below pattern to get content between element. Replace [tag] with the actual element you wish to extract the content from.
<[tag]>(.+?)</[tag]>
Sometime tags will have attributes, like anchor tag having href, then use the below pattern.
<[tag][^>]*>(.+?)</[tag]>
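As a rough Python sketch of the attribute-tolerant pattern above: the helper name inner_text is hypothetical, and the \b after the tag name, the DOTALL flag and the IGNORECASE flag are my additions rather than part of the pattern as posted. It shares the usual limitation that nested tags of the same name are not handled.

```python
import re


def inner_text(tag, html_text):
    """Return the content of every <tag ...>...</tag> pair found."""
    name = re.escape(tag)
    # \b stops <a from also matching <abc; [^>]* skips any attributes.
    pattern = re.compile(
        rf"<{name}\b[^>]*>(.+?)</{name}>",
        re.DOTALL | re.IGNORECASE,
    )
    return pattern.findall(html_text)


if __name__ == "__main__":
    print(inner_text("a", '<a href="/x">one</a> and <a>two</a>'))
```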
This answer supposes support for look around! This allowed me to identify all the text between pairs of opening and closing tags. That is all the text between the '>' and the '<'. It works because look around doesn't consume the characters it matches.
(?<=>)([\w\s]+)(?=<\/)
I tested it in https://regex101.com/ using this HTML fragment.
<table>
<tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>
<tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>
</table>
It's a game of three parts: the look behind, the content, and the look ahead.
(?<=>) # look behind (but don't consume/capture) for a '>'
([\w\s]+) # capture/consume any combination of alpha/numeric/whitespace
(?=<\/) # look ahead (but don't consume/capture) for a '</'
I hope that serves as a starter for 10. Good luck.
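The same look-around pattern can be tried in Python, whose re module supports fixed-width lookbehind. This is only a sketch: the helper name text_between_tags is made up, and the filtering of whitespace-only matches is my addition, since [\w\s]+ also matches a bare newline that sits between a '>' and a '</'.

```python
import re

# Lookbehind for '>', capture word/space characters, lookahead for '</'.
CELL_TEXT = re.compile(r"(?<=>)([\w\s]+)(?=</)")


def text_between_tags(fragment):
    # Drop matches that are only whitespace (e.g. the newline before </table>).
    return [m.strip() for m in CELL_TEXT.findall(fragment) if m.strip()]


if __name__ == "__main__":
    table = (
        "<table>\n"
        "<tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>\n"
        "<tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>\n"
        "</table>"
    )
    print(text_between_tags(table))
```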
This seems to be the simplest regular expression of all that I found
(?:<TAG>)([\s\S]*)(?:<\/TAG>)
Exclude opening tag (?:<TAG>) from the matches
Include any whitespace or non-whitespace characters ([\s\S]*) in the matches
Exclude closing tag (?:<\/TAG>) from the matches
You shouldn't be trying to parse html with regexes; see this question and how it turned out.
In the simplest terms, html is not a regular language so you can't fully parse it with regular expressions.
Having said that, you can parse subsets of html when there are no similar tags nested. So as long as anything between <tag> and </tag> is not that tag itself, this will work:
preg_match("/<([\w]+)[^>]*>(.*?)<\/\1>/", $subject, $matches);
$matches = array ( [0] => full matched string [1] => tag name [2] => tag content )
A better idea is to use a parser, like the native DOMDocument, to load your html, then select your tag and get the inner html which might look something like this:
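$obj = new DOMDocument();
$obj->loadHTML($html); // loadHTML() parses a string; load() expects a file path
$el = $obj->getElementsByTagName('el')->item(0); // first matching element
$value = $el->nodeValue; // nodeValue is a property, not a method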
And since this is a proper parser it will be able to handle nesting tags etc.
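To illustrate the parser route in another language, here is a minimal Python sketch using the standard html.parser module. The class name TagTextCollector and the helper text_in_tag are hypothetical; a depth counter is used so that a tag nested inside another tag of the same name does not end the capture early, which is the case the regex above cannot handle.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside each outermost occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many matching tags are currently open
        self.results = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Outermost tag closed: store what was gathered.
                self.results.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        if self.depth:
            self._buffer.append(data)


def text_in_tag(tag, html_text):
    collector = TagTextCollector(tag)
    collector.feed(html_text)
    collector.close()
    return collector.results


if __name__ == "__main__":
    print(text_in_tag("div", "<div>a<div>b</div>c</div><p>x</p><div>d</div>"))
```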
Try this....
(?<=\<any_tag\>)(\s*.*\s*)(?=\<\/any_tag\>)
var str = "Lorem ipsum <pre>text 1</pre> Lorem ipsum <pre>text 2</pre>";
str.replace(/<pre>(.*?)<\/pre>/g, function(match, g1) { console.log(g1); });
I am adding this because the accepted answer has no JavaScript code.
preg_match_all('/<pre>([^>]*?)<\/pre>/', $content, $matches) - this regex will select everything between the tags, even when the content spans several lines.
In Python, setting the DOTALL flag will capture everything, including newlines.
If the DOTALL flag has been specified, this matches any character including a newline. docs.python.org
#example.py using Python 3.7.4
import re
str="""Everything is awesome! <pre>Hello,
World!
</pre>
"""
# Normally (.*) will not capture newlines, but here re.DOTALL is set
pattern = re.compile(r"<pre>(.*)</pre>",re.DOTALL)
matches = pattern.search(str)
print(matches.group(1))
python example.py
Hello,
World!
Capturing text between all opening and closing tags in a document
To capture text between all opening and closing tags in a document, finditer is useful. In the example below, three opening and closing <pre> tags are present in the string.
#example2.py using Python 3.7.4
import re
# str contains three <pre>...</pre> tags
str = """In two different ex-
periments, the authors had subjects chat and solve the <pre>Desert Survival Problem</pre> with a
humorous or non-humorous computer. In both experiments the computer made pre-
programmed comments, but in study 1 subjects were led to believe they were interact-
ing with another person. In the <pre>humor conditions</pre> subjects received a number of funny
comments, for instance: “The mirror is probably too small to be used as a signaling
device to alert rescue teams to your location. Rank it lower. (On the other hand, it
offers <pre>endless opportunity for self-reflection</pre>)”."""
# Normally (.*) will not capture newlines, but here re.DOTALL is set
# The question mark in (.*?) indicates non greedy matching.
pattern = re.compile(r"<pre>(.*?)</pre>",re.DOTALL)
matches = pattern.finditer(str)
for i,match in enumerate(matches):
    print(f"tag {i}: ",match.group(1))
python example2.py
tag 0: Desert Survival Problem
tag 1: humor conditions
tag 2: endless opportunity for self-reflection
(?<=>)[^<]+
for Notepad++
>([^<]+)
for AutoIt (option Return array of global matches).
or
(?=>([^<]+))
https://regex101.com/r/VtmEmY/1
To select all text between pre tag I prefer
preg_match('#<pre>([\w\W\s]*)</pre>#',$str,$matches);
$matches[0] will have results including <pre> tag
$matches[1] will have all the content inside <pre>.
DomDocument cannot work in situations where the requirement is to get text with tag details within the searched tag as it strips all tags, nodeValue & textContent will only return text without tags & attributes.
test.match(/<pre>(.*?)<\/pre>/g)?.map((a) => a.replace(/<pre>|<\/pre>/g, ""))
This should be a preferred solution, especially if you have multiple pre tags in the context.
You can use Pattern pattern = Pattern.compile( "[^<'tagname'/>]" );
const content = '<p class="title responsive">ABC</p>';
const blog = {content};
const re = /<([^> ]+)([^>]*)>([^<]+)(<\/\1>)/;
const matches = content.match(re);
console.log(matches[3]);
matches[3] is the content text and this is adapted to any tag name with classes. (not support nested structures)
How about:
<PRE>(\X*?)<\/PRE>
More complex than PyKing's answer but matches any type of tag (except self-closing) and considers cases where the tag has HTML-like string attributes.
/<TAG_NAME(?:STRING|NOT_CLOSING_TAG_NOT_QUOTE)+>INNER_HTML<\/\1 *>/g
Raw: /<([^\s</>]+)(?:("(?:[^"\\]|\\.)*")|[^>"])+>(.*?)<\/\1 *>/g
group #1 = tag name
group #2 = string attr
group #3 = inner html
JavaScript code testing it:
let TAG_NAME = '([^\\s</>]+)'; // backslash doubled so the regex receives \s, not a plain s
let NOT_CLOSING_TAG_NOT_QUOTE = '[^>"]';
let STRING = '("(?:[^"\\\\]|\\\\.)*")';
let NON_SELF_CLOSING_HTML_TAG =
// \1 is a back reference to TAG_NAME
`<${TAG_NAME}(?:${STRING}|${NOT_CLOSING_TAG_NOT_QUOTE})+>(.*?)</\\1 *>`;
let tagRegex = new RegExp(NON_SELF_CLOSING_HTML_TAG, 'g');
let myStr = `Aenean <abc href="/life<><>\\"<?/abc></abc>"><a>life</a></abc> sed consectetur.
<a href="/work">Work Inner HTML</a> quis risus eget <a href="/about">about inner html</a> leo.
interacted with any of the <<<ve text="<></ve>>">abc</ve>`;
let matches = myStr.match(tagRegex);
// Removing 'g' flag to match each tag part in the for loop
tagRegex = new RegExp(NON_SELF_CLOSING_HTML_TAG);
for (let i = 0; i < matches.length; i++) {
let tagParts = matches[i].match(tagRegex);
console.log(`Tag #${i} = [${tagParts[0]}]`);
console.log(`Tag #${i} name: [${tagParts[1]}]`);
console.log(`Tag #${i} string attr: [${tagParts[2]}]`);
console.log(`Tag #${i} inner html: [${tagParts[3]}]`);
console.log('');
}
Output:
Tag #0 = [<abc href="/life<><>\"<?/abc></abc>"><a>life</a></abc>]
Tag #0 name: [abc]
Tag #0 string attr: ["/life<><>\"<?/abc></abc>"]
Tag #0 inner html: [<a>life</a>]
Tag #1 = [<a href="/work">Work Inner HTML</a>]
Tag #1 name: [a]
Tag #1 string attr: ["/work"]
Tag #1 inner html: [Work Inner HTML]
Tag #2 = [<a href="/about">about inner html</a>]
Tag #2 name: [a]
Tag #2 string attr: ["/about"]
Tag #2 inner html: [about inner html]
Tag #3 = [<ve text="<></ve>>">abc</ve>]
Tag #3 name: [ve]
Tag #3 string attr: ["<></ve>>"]
Tag #3 inner html: [abc]
This doesn't work if:
The tag has any descendant tag of the same type
The tag starts on one line and ends on another. (In my case I remove line breaks from the HTML.)
If you change (.*?)<\/\1 *> to ([\s\S]*?)<\/\1 *> it should match the tag's inner html even if everything is not in the same line. For some reason it didn't work for me on Chrome and Node but worked here with the JavaScript's Regex Engine:
https://www.regextester.com
Regex: <([^\s</>]+)(?:("(?:[^"\\]|\\.)*")|[^>"])+>([\s\S]*?)<\/\1 *>
Test String:
Aenean lacinia <abc href="/life<><><?/a></a>">
<a>life</a></abc> sed consectetur.
Work quis risus eget urna mollis ornare about leo.
interacted with any of the <<<ve text="<></ve>>">abc</ve>
For multiple lines:
<htmltag>(.+)((\s)+(.+))+</htmltag>
I use this solution:
preg_match_all( '/<((?!<)(.|\n))*?\>/si', $content, $new);
var_dump($new);
In Javascript (among others), this is simple. It covers attributes and multiple lines:
/<pre[^>]*>([\s\S]*?)<\/pre>/
<pre>([\r\n\s]*(?!<\w+.*[\/]*>).*[\r\n\s]*|\s*[\r\n\s]*)<code\s+(?:class="(\w+|\w+\s*.+)")>(((?!<\/code>)[\s\S])*)<\/code>[\r\n\s]*((?!<\w+.*[\/]*>).*|\s*)[\r\n\s]*<\/pre>