Jinja2 Replace Text - jinja2

I am looking to use Jinja to clean up a text block (it is a low-code product, so I don't have many options).
Is there a way to use a wildcard to replace characters between two other characters? For example I would like to replace <p *> with <p> and <span *> with <span>.
There are too many different styles to try to do this individually.
I tried this, but it was not happy with my suggestion:
{{a.additional_comments | regex_replace('^<p.*>(.*)$', '<p>\\1') | regex_replace('^<span.*>(.*)$', '<span>\\1')}}
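One likely reason the attempt above fails is that `^` anchors the match to the start of the text and `.*` is greedy, so it can run past the end of the tag. A negated character class such as `[^>]*` stays inside a single tag. The following is a minimal Python sketch of that idea, on the unconfirmed assumption that the product's regex_replace filter follows Python `re.sub` semantics. The function name `strip_attrs` is hypothetical.

```python
import re

def strip_attrs(html, tags=("p", "span")):
    """Drop all attributes from the opening tags listed in `tags`."""
    # \b stops <p ...> from also matching <pre ...>;
    # [^>]* consumes everything up to the end of this one tag only.
    names = "|".join(re.escape(t) for t in tags)
    pattern = re.compile(r"<(%s)\b[^>]*>" % names, re.IGNORECASE)
    return pattern.sub(r"<\1>", html)

sample = '<p style="color:red">Hi <span class="x" id=y>there</span></p>'
cleaned = strip_attrs(sample)
```

If the filter does behave like `re.sub`, the equivalent template expression would presumably be two chained calls using `'<p\\b[^>]*>'` and `'<span\\b[^>]*>'` as patterns, with `'<p>'` and `'<span>'` as replacements. An attribute value that itself contains `>` would break this approach.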

Related

How to match text and skip HTML tags using a regular expression?

I have a bunch of records in a QuickBase table that contain a rich text field. In other words, they each contain some paragraphs of text intermingled with HTML tags like <p>, <strong>, etc.
I need to migrate the records to a new table where the corresponding field is a plain text field. For this, I would like to strip out all HTML tags and leave only the text in the field values.
For example, from the below input, I would expect to extract just a small example link to a webpage:
<p>just a small <a href="#">
example</a> link</p><p>to a webpage</p>
As I am trying to get this done quickly and without coding or using an external tool, I am constrained to using Quickbase Pipelines' Text channel tool. The way it works is that I define a regex pattern and it outputs only the bits that match the pattern.
So far I've been able to come up with this regular expression (Python-flavored as QB's backend is written in Python) that correctly does the exact opposite of what I need. I.e. it matches only the HTML tags:
/(<[^>]*>)/
In a sense, I need the negative image of this expression but have not been able to build it myself.
Your help in "negating" the above expression is most appreciated.
Assuming there are no < or > elsewhere or entity-encoded, an idea using a lookbehind.
(?:(?<=>)|^)[^<]+
See this demo at regex101
(?:(?<=>)|^) is an alternation between either ^ start of the string or looking behind for any >. From there [^<]+ matches one or more characters that are not < (negated character class).
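As a rough check of the pattern above, here is a small Python sketch that applies it to the sample input from the question. The helper name `text_runs` and the whitespace clean-up step are illustrative additions, not part of the original answer.

```python
import re

# The pattern from the answer: start at the string start or right after
# a '>', then take every character up to the next '<'.
TEXT_RUN = re.compile(r"(?:(?<=>)|^)[^<]+")

def text_runs(html):
    """Return the text fragments found between tags."""
    return TEXT_RUN.findall(html)

sample = '<p>just a small <a href="#">\nexample</a> link</p><p>to a webpage</p>'
runs = text_runs(sample)

# Join the fragments and collapse runs of whitespace (extra step, assumed).
plain = " ".join(" ".join(runs).split())
```

The fragments come back with their original spacing and line breaks, so some whitespace normalization is still needed afterwards to get a single clean line.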

Remove HTML tags in specific tags in MySQL

I'd like to make a SQL script to remove, for example, all <strong> and </strong> tags which are inside a title <hX></hX> tag.
I want to replace all occurrences like <h4><strong>Some text</strong></h4> with <h4>Some text</h4>,
but only if in a H tag and without losing content of course.
I tried many things like the REGEXP_REPLACE and REGEXP_SUBSTR but I'm stuck with something like REGEXP_REPLACE(myfield, "<h\\d>.*<strong>.*<\/strong>.*<\/h\\d>", "") which replaces all match.
I use PHP to strip info out: preg_replace('#[^A-Za-z0-9]#i', '', $_POST['username']); // filter out everything but letters and numbers. It can be modified for specific phrases and characters. I know it isn't SQL, but it is something. Also, in JavaScript you can read an element's textContent property (innerHTML would return nested markup as well) to pull only the text out from within tags: >Text<
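For comparison, the requested transformation (removing <strong> only inside heading tags) can be sketched outside SQL in two steps: match each heading, then strip the tags from its content. This is a Python illustration, not a MySQL solution. The function name is hypothetical, and it assumes headings carry no attributes and are not nested.

```python
import re

# Match <h1>..</h1> through <h6>..</h6>; \1 forces the same closing tag.
HEADING = re.compile(r"<(h[1-6])>(.*?)</\1>", re.IGNORECASE | re.DOTALL)
STRONG = re.compile(r"</?strong>", re.IGNORECASE)

def strip_strong_in_headings(html):
    """Remove <strong> and </strong>, but only inside heading tags."""
    def clean(match):
        tag, inner = match.group(1), match.group(2)
        return "<%s>%s</%s>" % (tag, STRONG.sub("", inner), tag)
    return HEADING.sub(clean, html)

sample = "<h4><strong>Some text</strong></h4><p><strong>keep</strong></p>"
result = strip_strong_in_headings(sample)
```

The callback keeps the heading content and only drops the inner tags, which a single REGEXP_REPLACE with an empty replacement string cannot do.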

Find and replace all occurrences of a string inside an HTML tag in one pass

I need a regular expression to search for and replace multiple occurrences of a text string within a delimited section of text.
Let's say there is HTML code with one or more spans that have a certain class. Each span may have none, one or multiple occurrences of the string {abc} inside, e.g.
<p>lorem ipsum dolor <span class="xyz">sid amet{abc}et pluribus {abc} unum{abc} diex
et mon droit</span> you'll never walk alone</p>
Thus I need a regex pair to replace all occurrences of {abc} within <span class="xyz"> with {def} in a single pass.
This is for use in a text editor such as Notepad++ and the like, and needs to be a PCRE/UNIX-style regular expression.
What I have is,
find: (<span class="xyz">)([^<]*)\{abc\}([^<]*<)
replace: \1\2{def}\3
This does work for one occurrence within a span, but when there are more occurrences I have to run the replacement multiple times, in a loop, while I need it to be one-pass.
I wonder how I can achieve that. I suppose this is a pretty common case, but I could not find anything similar that addresses the need for a single pass (no loops, no code), and I'd like to get an idea of how this could be done in principle.
This seems to work in Notepad++
Find what : (?:<span class="xyz">|\G)[^<]*?\K\{abc\}(?=[^<]*<\/span>)
Replace with : {def}
Search mode : Regular expression
Note that because of the [^<]* there is an assumption that there are no other tags within the span tag.
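Where code is an option, the same effect can be had without \G and \K, which Python's standard `re` module does not support. The sketch below swaps in a different technique: match each whole span once and do a plain string replacement inside a callback. The function name is hypothetical, and it shares the assumption that the span contains no other tags.

```python
import re

# One match per span: opening tag, tag-free content, closing tag.
SPAN = re.compile(r'(<span class="xyz">)([^<]*)(</span>)')

def replace_in_span(html, old="{abc}", new="{def}"):
    """Replace every `old` inside <span class="xyz"> in a single pass."""
    def fix(match):
        opening, inner, closing = match.groups()
        return opening + inner.replace(old, new) + closing
    return SPAN.sub(fix, html)

sample = ('<p>lorem {abc} <span class="xyz">sid amet{abc}et pluribus {abc} '
          'unum{abc} diex\net mon droit</span> you\'ll never walk alone</p>')
result = replace_in_span(sample)
```

Occurrences outside the span, such as the first {abc} in the sample, are left untouched.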

Compare two HTML documents ignoring multiple and trailing whitespaces

Is there a tool that compares an HTML document like:
<p b="1" a="0 "> a b
c </p>
(as a C string: "<p> a b\nc </p>") equal to:
<p a="0 " b="1">a b c</p>
Note how:
multiple whitespace characters in text were converted to a single space
newlines were converted to spaces
leading and trailing whitespace in text was stripped
attributes were put in a standard order
attribute values were unchanged, including trailing whitespace
Why I want that
I am working on the Markdown Test Suite that aims to measure markdown engine compliance and portability.
We have markdown input, expected HTML output, and want to determine if the generated HTML output is equal to the expected one.
The problem is that Markdown is underspecified, so we cannot compare directly the two HTML strings.
The actual test code is here, just modify run-tests.py#dom_normalize if you want to try out your solution.
Things I tried
BeautifulSoup. Orders the attributes, but does not deal well with whitespace?
A function formatter regex modification might work, but I don't see a way to differentiate between the inside of nodes and attributes.
A Python only solution like this would be ideal.
Looking for a JavaScript function similar to isEqualNode() (which does not work because it ignores nodeValue) plus some headless JS engine. Couldn't find one.
If there is nothing better, I'll just have to write my own output formatter front-end to some HTML parser.
I ended up cooking up a custom HTML renderer that normalizes things based on Python's stdlib HTMLParser.
You can see it at: https://github.com/karlcow/markdown-testsuite/blob/749ed0b812ffcb8b6cc56f93ff94c6fdfb6bd4a2/run-tests.py#L20
Usage and docstring tests at: https://github.com/karlcow/markdown-testsuite/blob/749ed0b812ffcb8b6cc56f93ff94c6fdfb6bd4a2/run-tests.py#L74
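To give a sense of what such a normalizer can look like, here is a minimal sketch built on the stdlib HTMLParser. It is not the code from the linked run-tests.py. The class and function names are hypothetical, and it simplifies by trimming each text run on its own, which would drop spaces next to inline elements.

```python
import re
from html.parser import HTMLParser

class Normalizer(HTMLParser):
    """Re-emit HTML with sorted attributes and collapsed text whitespace."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Sort by attribute name; values are left exactly as parsed.
        parts = []
        for name, value in sorted(attrs, key=lambda pair: pair[0]):
            parts.append(" %s" % name if value is None
                         else ' %s="%s"' % (name, value))
        self.out.append("<%s%s>" % (tag, "".join(parts)))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Collapse whitespace runs (including newlines), then trim.
        text = re.sub(r"\s+", " ", data).strip()
        if text:
            self.out.append(text)

def normalize(html):
    parser = Normalizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

left = normalize('<p b="1" a="0 "> a  b\nc </p>')
right = normalize('<p a="0 " b="1">a b c</p>')
```

Two documents are then considered equal when their normalized strings are equal, as with `left` and `right` here.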

Regular expression to delete HTML strings

I am trying to delete part of a string that does not match my pattern. For example, in
<SYNC Start=364><P Class=KRCC>
<Font Color=lightpink>abcd
I would like to delete
<P Class=KRCC><Font Color=lightpink>
How do I do that?
Your question does not indicate that you need (or should use) regular expressions. If you want to remove a fixed string, do traditional search and replace.
Just match your pattern and write the matches to a file or update the database table. That way, you are deleting the rest.
If the HTML you are parsing is valid and always follows a known standard format, you can use non-greedy patterns to remove most of what you don't want.
These samples will have to be modified based on the tool/framework you're using to handle regular expressions. I am not escaping special characters for brevity.
To match any paragraph tags:
<p.*?>(.*?)</p>
You would replace these matches with $1 (or whatever your syntax requires to access groups).
It's important to use non-greedy (?) patterns to avoid accidentally matching two unrelated start/end tags. For example:
<p.*>(.*)</p>
Would behave very differently. In the case of the following example HTML, it would not correctly match two paragraphs:
<p>Lorem ipsum.</p><p>Lorem ipsum.</p>
Instead, it would match "<p>Lorem ipsum.</p><p>" as the first portion, which would result in losing content.
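The difference is easy to reproduce. The sketch below runs both patterns through Python's `re.sub`, where the group reference is written `\1` rather than `$1`. The variable names are illustrative.

```python
import re

html = "<p>Lorem ipsum.</p><p>Lorem ipsum.</p>"

# Non-greedy: each <p>...</p> pair is matched on its own.
lazy = re.sub(r"<p.*?>(.*?)</p>", r"\1", html)

# Greedy: the first .* runs up to the '>' of the second <p>, so the whole
# string is one match and only the second paragraph's text is captured.
greedy = re.sub(r"<p.*>(.*)</p>", r"\1", html)
```

With the greedy pattern the first paragraph's text ends up inside the part matched by `<p.*>` and is discarded by the replacement.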
If you need to match paragraphs with specific classes, you could use something like this:
<p.*?class="delete".*?>(.*?)</p>
Where things get sticky is when you start working with non-standardized HTML. For example, this is all valid HTML, but the pattern to clean it up would be ugly:
<p>no class</p>
<p class=delete>no quotes</p>
<p class="delete">double quotes</p>
<p class='delete'>single quotes</p>
<p>space in closing tag</p >
<p>no closing tag
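As an illustration of how quickly the pattern grows, here is a sketch that tolerates the three quoting styles and a space in the closing tag. The name `DELETE_P` is hypothetical, and the pattern is deliberately incomplete: it does not handle a missing closing tag, spaces around `=`, or a class attribute with several values.

```python
import re

DELETE_P = re.compile(
    # (["']?) captures an optional quote; \1 requires the same one to close.
    r"""<p\b[^>]*\bclass=(["']?)delete\1[^>]*>(.*?)</p\s*>""",
    re.IGNORECASE | re.DOTALL,
)

sample = "\n".join([
    "<p>no class</p>",
    "<p class=delete>no quotes</p>",
    '<p class="delete">double quotes</p>',
    "<p class='delete'>single quotes</p>",
    "<p>space in closing tag</p >",
    "<p>no closing tag",
])

# Collect the inner text of every paragraph the pattern recognizes.
matched = [m.group(2) for m in DELETE_P.finditer(sample)]
```

Each further variation adds another special case, which is the usual argument for switching to an HTML parser once the input is not standardized.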