Selectively escape strings in html using python/django - html

I'm working with email content that has been formatted in html. What I have found is that email addresses are often formatted similarly to html tags. Is there a way to selectively escape strings in html code, but render others as-is?
For example, email addresses are often formatted as "Guy, Some <someguy#gmail.com>". How can I escape this, which Python sees as an HTML tag, but leave <br></br><p></p>, etc. intact and render them?
edited
I'm dealing with raw emails that have been preserved as text. In the following example, I want all of the html tags to render normally. However, it also tries to render email addresses that are stored in the following format someguy#gmail.com and then it gives me an error. So my challenge has been to find all of the email addresses, but leave the html tags alone.
<p>From: Guy, Some <someguy#gmail.com></p><br>
<br>
<p>Sent: Friday, January 21, 2022 2:16 PM</p><br>
<br>
<p>To: Another Guy <anotherguy#gmail.com></p>
<br>
<p>Subject: Really Important Subject</p>
<br>
<p> <br>Good morning,
<br>This is sample text<br> </p>
<br>
<p>Thanks for all your help!!!
<br>
<p> </p>

You can use the HTML entities &lt; and &gt; to display < and > inside an HTML document. If you are passing the email string from Django and it already contains those entities, mark it with the safe filter so it is rendered as-is, like this:
"Guy, Some {{email|safe}}"
EDIT
Before rendering your HTML, you can find every email address wrapped in angle brackets (<email>) and strip the brackets, for example:
import re
data = '''
<p>From: Guy, Some <someguy#gmail.com></p><br>
<br>
<p>Sent: Friday, January 21, 2022 2:16 PM</p><br>
<br>
<p>To: Another Guy <anotherguy#gmail.com></p>
<br>
<p>Subject: Really Important Subject</p>
<br>
<p> <br>Good morning,
<br>This is sample text<br> </p>
<br>
<p>Thanks for all your help!!!
<br>
<p> </p>
'''
# Match an address wrapped in angle brackets and capture the address itself.
# '#' stands where '@' normally would in this data.
email_in_brackets = re.compile(r'<([A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]+)>')
# Replace each '<address>' with the bare address so it is no longer parsed as a tag.
data = email_in_brackets.sub(r'\1', data)
print(data)
The code above gives this output:
<p>From: Guy, Some someguy#gmail.com</p><br>
<br>
<p>Sent: Friday, January 21, 2022 2:16 PM</p><br>
<br>
<p>To: Another Guy anotherguy#gmail.com</p>
<br>
<p>Subject: Really Important Subject</p>
<br>
<p> <br>Good morning,
<br>This is sample text<br> </p>
<br>
<p>Thanks for all your help!!!
<br>
<p> </p>
I also suggest looking at this post.
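If the goal is to keep the angle brackets visible rather than strip them, one option is to escape only the bracketed addresses and leave every real tag untouched. Below is a minimal sketch using the standard library's html.escape. The pattern and the helper name escape_bracketed_emails are assumptions made for illustration, and the pattern accepts either '#' or '@' as the separator because the sample data uses '#'.

```python
import html
import re

# Assumption: an address looks like '<local#domain.tld>' or '<local@domain.tld>'.
EMAIL_IN_BRACKETS = re.compile(r'<([\w.%+-]+[#@][\w.-]+\.[A-Za-z]{2,})>')


def escape_bracketed_emails(text):
    """Escape only the angle brackets that wrap an email address."""
    # html.escape turns '<' and '>' into '&lt;' and '&gt;'; real tags never match
    # the pattern because they contain no '#' or '@' separator.
    return EMAIL_IN_BRACKETS.sub(lambda m: html.escape(m.group(0)), text)


sample = '<p>From: Guy, Some <someguy#gmail.com></p><br>'
escaped = escape_bracketed_emails(sample)
print(escaped)
```

In a Django template the escaped string could then be rendered with the safe filter, since the only characters that needed escaping have already been handled. That is only appropriate if the rest of the email body is trusted.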

Related

TinyMCE - formats - includes br in inline OR limit block to selection

I'm using TinyMCE and I would like a sort of inline-block behaviour for a custom format.
In other words, I want the selected portion of text to be wrapped in the selected format, regardless of the presence of line breaks.
So here is an example of HTML, with [[ ]] delimiting the user selection:
<div>
this is [[a text
<br>
with new]] lines
<br>
how amazing !
</div>
If I declare a format like { inline: "span" }, it results in
<div>
this is <span>a text</span>
<br>
<span>with new</span> lines
<br>
how amazing !
</div>
If I declare the format like { block: "span" }, I end up with
<span>
this is a text
<br>
with new lines
<br>
how amazing !
</span>
But, what I want is
<div>
this is <span>a text
<br>
with new</span> lines
<br>
how amazing !
</div>
I tried various format parameters without success. I also tried the global format_empty_lines parameter (which, on paper, seemed to be the solution), but that didn't work either.

Xpath select between elements under condition (containing text)

I have a page like this (a speech or a dialogue page organised like this, so speaker name in bold and then paragraphs of his speech):
<body>
<p>
<b>
speaker abc:
</b>
some wanted text here
</p>
<p>
some other text wanted, maybe containing speaker abc
</p>
<p>
some other text wanted, maybe containing speaker cde
</p>
<p>
some other text wanted
</p>
<p>
<b>
speaker cde (can be random):
</b>
</p>
<p>
some other text UNwanted, maybe containing speaker abc
</p>
<p>
some other text UNwanted, maybe containing speaker cde
</p>
<p>
some other text UNwanted
</p>
<p>
<b>
speaker abc:
</b>
</p>
<p>
some other text wanted
</p>
<p>
<b>
speaker fgh:
</b>
</p>
<p>
some other text UNwanted
</p>
</body>
I would like to select (using xpath) all text elements marked as wanted text in example (all phrases spoken by one particular speaker, say abc).
I am not very fluent with xpath and html, I suspect there should be some usage of axis but struggle to figure out how.
This is very difficult to do using XPath 1.0 alone.
In XSLT 2.0+, use positional grouping:
<xsl:for-each-group select="p" group-starting-with="p[b]">...</xsl:for-each-group>
and then select the groups you are interested in.
If you have to do it using XPath 1.0, consider pre-processing the input using XSLT to split the text into speeches, using xsl:for-each-group as suggested.
The following XPath will do this:
"//*[preceding-sibling::p[contains(.,'speaker abc')] and following-sibling::p[contains(.,'speaker cde')]]"
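This limits the wanted p nodes to those that have a preceding-sibling p node containing the wanted speaker's name and a following-sibling p node containing the name of the next, unwanted speaker. Note that contains(., ...) tests the full text of a paragraph, so a paragraph that merely mentions a speaker name also satisfies the condition.
The output is: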
some other text wanted, maybe containing abc
some other text wanted, maybe containing cde
some other text wanted
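If the selection does not have to be a single XPath expression, the grouping idea can also be done procedurally: walk the paragraphs in order and remember which speaker the last bold heading named. Here is a rough sketch with Python's standard xml.etree.ElementTree. It assumes the page is well-formed XML and that a speaker heading is always a b element at the start of a p. The function name speech_of and the shortened sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the page from the question (assumed to be well-formed).
doc = """<body>
<p><b>speaker abc:</b> some wanted text here</p>
<p>some other text wanted, maybe containing speaker cde</p>
<p><b>speaker cde (can be random):</b></p>
<p>some other text UNwanted, maybe containing speaker abc</p>
<p><b>speaker abc:</b></p>
<p>some other text wanted</p>
<p><b>speaker fgh:</b></p>
<p>some other text UNwanted</p>
</body>"""


def speech_of(root, speaker):
    """Collect the text of every paragraph spoken by `speaker`."""
    wanted = []
    current = None  # speaker named by the most recent <b> heading
    for p in root.iter('p'):
        heading = p.find('b')
        if heading is not None:
            # A bold heading starts a new speech; text after it belongs to that speaker.
            current = (heading.text or '').strip()
            text = (heading.tail or '').strip()
        else:
            text = (p.text or '').strip()
        if text and current is not None and current.startswith(speaker):
            wanted.append(text)
    return wanted


root = ET.fromstring(doc)
result = speech_of(root, 'speaker abc')
print(result)
```

Because only the b heading changes the current speaker, a paragraph that merely mentions another speaker's name in its text does not end the speech.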

Change replace and exclude multiline html code

Looking for a bit of help.
After completing hundreds of pages, I came across an HTML error.
I simply need to move the closing </h1> tag so that it comes before the <p> tag.
Is it possible, with Notepad++ or another tool, to find these 6 lines and move the </h1> tag from line 6 to the end of the second line?
I'd appreciate the help, if it's possible, as it will save a lot of work compared with replacing it individually on all my pages.
The code is tab-delimited.
Note: The title and paragraph are never the same.
Find:
<h1 id="h-title">
Estatus rogue
<p>ext 098 float
<br>
</p>
</h1>
Replace with:
<h1 id="h-title">
Estatus rogue</h1>
<p>ext 098 float
<br>
</p>
Thank you in advance for your help.
Try downloading Visual Studio Code and doing a find and replace across multiple files.
Here's a link to someone else who has done this.
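Another way to do the same multi-line replacement is a short script. The sketch below uses Python's re module. The pattern is an assumption based on the six-line sample in the question (an h1 with id="h-title", a title containing no tags, then one p block before the misplaced closing tag), and fix_heading is a made-up helper name.

```python
import re

# Assumed shape: <h1 id="h-title">, a title without tags, one <p>...</p>, then </h1>.
MISPLACED_H1 = re.compile(
    r'(<h1 id="h-title">\s*[^<]*?)(\s*<p>.*?</p>)\s*</h1>',
    re.DOTALL,
)


def fix_heading(page):
    """Move the closing </h1> so it sits right after the title text."""
    return MISPLACED_H1.sub(r'\1</h1>\2', page)


broken = (
    '<h1 id="h-title">\n'
    'Estatus rogue\n'
    '<p>ext 098 float\n'
    '<br>\n'
    '</p>\n'
    '</h1>\n'
)
fixed = fix_heading(broken)
print(fixed)
```

Pages that are already correct do not match the pattern, so running the function twice changes nothing. To process many pages, the same function could be applied to the text of each file in a folder. Testing on a copy first would be sensible.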

Loop based on tag in HTML document

I'm trying to extract certain details from articles that are combined in a single HTML file. The file will contain thousands of articles, so I'm trying to automate the extraction using BeautifulSoup. I can get it to extract from the first article, but I can't get it to move on to the next article automatically. This is what the HTML looks like:
<DOCFULL> -->
<br/>
<div class="c0">
<p class="c1">
<span class="c2">
2 of 4 DOCUMENTS
</span>
</p>
</div>
<br/>
<div class="c0">
<br/>
<p class="c1">
<span class="c2">
The New York Times
<br/>
</span>
...
</DOCFULL>
...
<DOCFULL> -->
<br/>
<div class="c0">
<p class="c1">
<span class="c2">
1 of 4 DOCUMENTS
So, once the following commands have run for one article, I need them to be applied again to the next article, which again starts with -->. But I just cannot get it to work the way I need. For example, to extract 'The New York Times' from the partial HTML above I use the line below, and the same should automatically be done for the 2nd, 3rd, 4th, etc. article.
journal = soup.find_all('span', class_='c2')[1].getText()
If anyone can point me in the direction I should start thinking it would be really appreciated!
EDIT:
Just to put more into perspective what I am trying to achieve. I can get the latter parts to work, but do not get it to check each article after the former.
For Each Article:
* Determine Newspaper
* If newspaper = x
.
.
.
* Else
Continue
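The "for each article" loop outlined above can be sketched by first splitting the file into one chunk per DOCFULL block and then running the same extraction on every chunk. The example below deliberately uses only the standard re module instead of BeautifulSoup so that it runs without extra packages. With BeautifulSoup the same idea would presumably be to loop over the DOCFULL elements and call find_all('span', class_='c2') on each one rather than on the whole soup. The second article name in the sample is invented, and newspapers is a hypothetical helper name.

```python
import re

# Shortened sample; 'Example Gazette' is invented purely for illustration.
raw = """<DOCFULL> -->
<div class="c0"><p class="c1"><span class="c2">2 of 4 DOCUMENTS</span></p></div>
<div class="c0"><p class="c1"><span class="c2">The New York Times<br/></span></p></div>
</DOCFULL>
<DOCFULL> -->
<div class="c0"><p class="c1"><span class="c2">3 of 4 DOCUMENTS</span></p></div>
<div class="c0"><p class="c1"><span class="c2">Example Gazette<br/></span></p></div>
</DOCFULL>"""

ARTICLE = re.compile(r'<DOCFULL>(.*?)</DOCFULL>', re.DOTALL)
SPAN_C2 = re.compile(r'<span class="c2">(.*?)</span>', re.DOTALL)
ANY_TAG = re.compile(r'<[^>]+>')


def newspapers(raw_html):
    """Return the second c2 span (the newspaper name) of every article."""
    names = []
    for article in ARTICLE.findall(raw_html):
        spans = SPAN_C2.findall(article)
        if len(spans) < 2:
            continue  # skip articles that lack the expected structure
        names.append(ANY_TAG.sub('', spans[1]).strip())
    return names


result = newspapers(raw)
print(result)
```

The check "if newspaper = x" then becomes an ordinary if statement inside the loop, applied to one article at a time.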

XPATH, grabbing content of next divs bases on textual value

I need to grab the contact details of people, which appear in much the same bit of markup after the words:
"For further information, please contact:"
So I want the href (http://www.wongpartnership.com/index.php/wongpartnership/partner/chan-hock-keng)
and the name (Hock Keng).
I tried variations of //strong[contains(., 'For further')]/following-sibling::p
but they are not working.
Code:
<p><span style="font-size:14px;"><span data-mce-style="font-size: small;"><strong>For further information, please contact:</strong></span></span></p>
<p> </p>
<p><span style="font-size:14px;"><span data-mce-style="font-size: small;"><strong><a data-mce-href="http://www.wongpartnership.com/index.php/wongpartnership/partner/chan-hock-keng" href="http://www.wongpartnership.com/index.php/wongpartnership/partner/chan-hock-keng" target="_blank">Hock Keng</a></strong><strong><a data-mce-href="http://www.wongpartnership.com/index.php/wongpartnership/partner/chan-hock-keng" href="http://www.wongpartnership.com/index.php/wongpartnership/partner/chan-hock-keng" target="_blank"> </a></strong><strong><a data-mce-href="http://www.wongpartnership.com/index.php/wongpartnership/partner/chan-hock-keng" href="http://www.wongpartnership.com/index.php/wongpartnership/partner/chan-hock-keng" target="_blank">Chan</a></strong><strong>, Partner, WongPartnership</strong></span></span></p>
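One likely reason the attempted expression returns nothing is that the strong element sits inside span elements inside a p, so it has no p siblings. The following p elements are siblings of the enclosing p instead. An XPath along the lines of //p[.//strong[contains(., 'For further')]]/following-sibling::p[.//a][1]//a would start from that enclosing p. The same logic is sketched below in Python with the standard xml.etree.ElementTree, which does not support the following-sibling axis, so the paragraphs are walked manually. The snippet is a shortened, well-formed version of the markup above, and contact_after_marker is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Shortened version of the markup, wrapped in a div so it has a single root.
snippet = """<div>
<p><span><span><strong>For further information, please contact:</strong></span></span></p>
<p> </p>
<p><span><span><strong><a href="http://www.wongpartnership.com/index.php/wongpartnership/partner/chan-hock-keng" target="_blank">Hock Keng</a></strong><strong>, Partner, WongPartnership</strong></span></span></p>
</div>"""


def contact_after_marker(root, marker='For further information'):
    """Return (href, link text) of the first link after the marker paragraph."""
    paragraphs = list(root.iter('p'))
    for index, p in enumerate(paragraphs):
        if marker in ''.join(p.itertext()):
            # Look at the paragraphs that follow the marker, skipping empty ones.
            for later in paragraphs[index + 1:]:
                link = later.find('.//a')
                if link is not None:
                    return link.get('href'), (link.text or '').strip()
    return None


root = ET.fromstring(snippet)
contact = contact_after_marker(root)
print(contact)
```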