Using beautifulsoup to separate strings separated by `<br>` - html

I want to get some data from a website that uses <br>. In the HTML parsed with beautifulsoup4, I sometimes have the following pattern:
"<p class=some_class>text_1. text_2 (text_3<span class=GramE>)</span>
<br>
text_4,<span style='mso-fareast-font-family:"Arial Unicode MS"'>
</span>text_5.</p>"
But if the website had been written in a nicer way, it would have looked like:
"<p class=some_class>text_1. text_2(text_3<span class=GramE>)</span
</p> <p class=some_class>
text_4,<span style='mso-fareast-font-family:"Arial Unicode MS"'>
</span>text_5.</p>
To extract the strings I want, I would have extracted all text within each <p>.
However, now the strings I want to separate are separated by <br>.
My question is the following: how can I use <br> to disentangle the parts of the string I am interested in? I mean, I want something like [text_1.+text_2+text_3, text_4+text_5.].
I'm explicitly asking about the use of <br> since it is the only element I have found that separates the strings I'm interested in. Moreover, in some other parts of the website, the strings I'm interested in are separated by <br/> instead of <br>.
I cannot solve this with the replace() function, since my object is a Tag from bs4. Also, using find("br") from bs4 gives me "<br/>" and not the text I want, so the answers to this question are not exactly what I need. I think one way would be to convert the bs4 Tag to an HTML string, change the "<br/>" with the replace() function, and then turn it back into a bs4 element. However, I do not know how to make that conversion, and I also want to know if there is an easier and shorter way to do this.

This is a solution I found, but it is long and inefficient since it does not use any feature of bs4. It works, though.
from bs4 import BeautifulSoup

html_doc = """
<p class=some_class>text_1. text_2 (text_3<span class=GramE>)</span>
<br>
text_4,<span style='mso-fareast-font-family:"Arial Unicode MS"'>
</span>text_5.</p>
"""

def replace_br(soup_object):
    html1 = str(soup_object)
    html1 = html1.replace("<br>", "</p> <p>")
    soup_html1 = BeautifulSoup(html1, 'html.parser')
    return soup_html1.find_all("p")

replace_br(html_doc)
[<p class="some_class">text_1. text_2 (text_3<span class="GramE">)</span>
</p>, <p>
text_4,<span style='mso-fareast-font-family:"Arial Unicode MS"'>
</span>text_5.</p>]
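A shorter route that stays inside bs4 might be to replace each <br> (or <br/>) tag with a marker string via replace_with() and then split the text of each <p> on that marker. A minimal sketch, reusing html_doc from above and an arbitrary "|||" delimiter (not tested against the real site):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
parts = []
for p in soup.find_all("p"):
    # Turn every <br>/<br/> into a plain-text marker we can split on later.
    for br in p.find_all("br"):
        br.replace_with("|||")
    parts.extend(segment.strip() for segment in p.get_text().split("|||"))

print(parts)  # roughly ['text_1. text_2 (text_3)', 'text_4,\ntext_5.']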

Related

How can I extract the url from a link following an italicized element with xpath?

I'm trying to extract a link from a lot of pages with xpath and I'm not sure what I'm doing wrong here. The pages are poorly formatted with italicizing which is what I think is throwing it off.
This is an example of the way the html is formatted:
<p>
<i>This content is constant</i>
<a href="example.com/exampe123">
<i>This text changes</i>
</a>
<i> </i>
</p>
In this example, the word "text" doesn't change but the rest of the words do.
I tried using the following xpath but it didn't work:
//p/a[contains(text(), 'text')]/@href
You might use one of the XPath expressions below:
//p/a[i[contains(text(), 'text')]]/@href
//p/a[contains(., 'text')]/@href
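As a side note on why the original attempt fails while these work: contains(text(), 'text') only inspects the direct text children of <a>, which here hold nothing useful because the text sits in a nested <i>, whereas contains(., 'text') tests the string value of the whole element. A quick sketch with Python's lxml, using the markup from the question:

from lxml import html

doc = html.fromstring(
    "<p><i>This content is constant</i>"
    '<a href="example.com/exampe123"><i>This text changes</i></a>'
    "<i> </i></p>"
)

# Matching on the element's string value picks up text inside the nested <i>.
print(doc.xpath("//p/a[contains(., 'text')]/@href"))      # ['example.com/exampe123']
print(doc.xpath("//p/a[contains(text(), 'text')]/@href"))  # [] -- <a> has no matching direct text nodes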
If the <i> elements are causing issues, or if they are malformed, how about just textually removing ALL <i> and </i> strings before creating your XPath object?
var cleanString = dirtyString.Replace("<i>","").Replace("</i>","");
And then create your XPath object from that "clean" string. Chances are, you don't need to know where <i> segments are in your app.

Compare two HTML documents ignoring multiple and trailing whitespaces

Is there a tool that considers an HTML document like:
<p b="1" a="0 "> a b
c </p>
(as a C string: "<p> a b\nc </p>") equal to:
<p a="0 " b="1">a b c</p>
Note how:
text multiple whitespaces were converted to a single whitespace
newlines were converted to whitespaces
text trailing and heading whitespaces were stripped
attributes were put on a standard order
attribute values were unchanged, including trailing whitespaces
Why I want that
I am working on the Markdown Test Suite that aims to measure markdown engine compliance and portability.
We have markdown input, expected HTML output, and want to determine if the generated HTML output is equal to the expected one.
The problem is that Markdown is underspecified, so we cannot compare directly the two HTML strings.
The actual test code is here, just modify run-tests.py#dom_normalize if you want to try out your solution.
Things I tried
beautifulsoup. Orders the attributes, but does not deal well with whitespaces?
A function formatter plus a regex modification might work, but I don't see a way to differentiate between the inside of nodes and attributes.
A Python-only solution like this would be ideal.
looking for a Javascript function similar to isEqualNode() (does not work because it ignores nodeValue) + some headless JS engine. Couldn't find one.
If there is nothing better, I'll just have to write my own output formatter front-end to some HTML parser.
I ended up cooking up a custom HTML renderer that normalizes things based on Python's stdlib HTMLParser.
You can see it at: https://github.com/karlcow/markdown-testsuite/blob/749ed0b812ffcb8b6cc56f93ff94c6fdfb6bd4a2/run-tests.py#L20
Usage and docstring tests at: https://github.com/karlcow/markdown-testsuite/blob/749ed0b812ffcb8b6cc56f93ff94c6fdfb6bd4a2/run-tests.py#L74
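If you only need the gist of that approach, here is a stripped-down sketch of the same idea built on html.parser from the standard library: it re-renders the markup with attributes sorted and with runs of whitespace in text collapsed, while leaving attribute values alone. It skips many details the real run-tests.py handles (void elements, entities, comments), so treat it as an illustration only.

from html.parser import HTMLParser

class Normalizer(HTMLParser):
    """Re-render HTML with sorted attributes and collapsed text whitespace."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Canonical attribute order; values are kept exactly as parsed.
        self.out.append("<%s%s>" % (tag, "".join(' %s="%s"' % kv for kv in sorted(attrs))))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Newlines and runs of spaces collapse to one space; edges are stripped.
        text = " ".join(data.split())
        if text:
            self.out.append(text)

def normalize(markup):
    parser = Normalizer()
    parser.feed(markup)
    return "".join(parser.out)

print(normalize('<p b="1" a="0 "> a b\nc </p>') == normalize('<p a="0 " b="1">a b c</p>'))  # True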

regex: selecting everything but img tag

I'm trying to select some text using regular expressions leaving all img tags intact.
I've found the following code that selects all img tags:
/<img[^>]+>/g
but actually having a text like:
This is an untagged text.
<p>this is my paragraph text</p>
<img src="http://example.com/image.png" alt=""/>
this is a link
using the code above will select the img tag only
/<img[^>]+>/g #--> using this code will result in:
<img src="http://example.com/image.png" alt=""/>
but I would like to use some regex that selects everything but the image, like:
/magical regex/g # --> results in:
This is an untagged text.
<p>this is my paragraph text</p>
this is a link
I've also found this code:
/<(?!img)[^>]+>/g
which selects all tags except the img one, but in some cases I will have untagged text or text between tags, so this won't work for my case. :(
is there any way to do it?
Sorry, but I'm really new to regular expressions, so I've been struggling for a few days trying to make it work, but I can't.
Thanks in advance
UPDATE:
OK, so for those thinking I would like to parse it: sorry, I don't want that, I just want to select text.
Another thing: I'm not using any language in particular. I'm using Yahoo Pipes, which only provides regex and some string tools to accomplish the job, but it doesn't involve any programming code.
For a better understanding, here is the way the regex module works in Yahoo Pipes:
http://pipes.yahoo.com/pipes/docs?doc=operators#Regex
UPDATE 2
Fortunately I'm able to strip the text near the img tag, but on a step-by-step basis as @Blixt recommended, like:
<(?!img)[^>]+> , replace with "" #-> strips out every tag that is not img
(?s)^[^<]*(.*), replace with $1 #-> removes all the text before the img tag
(?s)^([^>]+>).*, replace with $1 #-> removed all the text after the img tag
The problem with this is that it will only catch the first img tag, and then I would have to do it manually and catch the others by hard-coding it, so I'm still not sure if this is the best solution.
The regexp you have to find the image tags can be used with a replace to get what you want.
Assuming you are using PHP:
$htmlWithoutIMG = preg_replace('/<img[^>]+>/', '', $html);
If you are using Javascript:
var htmlWithoutIMG = html.replace(/<img[^>]+>/g, '');
This takes your text, finds the <img> tags and replaces them with nothing, i.e. it deletes them from the text, leaving what you want. I can't recall if the < and > need escaping.
Regular expression matches have a single start and length. This means the result you want is impossible in a single match (since you want the result to end at one point, then continue later).
The closest you can get is to use a regular expression that matches everything from start of string up to start of <img> tag, everything between <img> tags and everything from end of <img> tag to end of string. Then you could get all matches from that regular expression (in your example, there would be two matches).
The above answer is assuming you can't modify the result. If you can modify the result, simply replace the <img> tags with the empty string to get your result.
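As a quick illustration of the replace-with-empty-string approach outside PHP and Javascript, here is a Python sketch using the sample text from the question:

import re

text = """This is an untagged text.
<p>this is my paragraph text</p>
<img src="http://example.com/image.png" alt=""/>
this is a link"""

# Deleting every <img ...> match leaves all of the surrounding text intact.
print(re.sub(r'<img[^>]+>', '', text))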

Regex: Extracting readable (non-code) text and URLs from HTML documents

I am creating an application that will take a URL as input, retrieve the page's html content off the web, and extract everything that isn't contained in a tag. In other words, the textual content of the page, as seen by the visitor to that page. That includes 'masking' out everything encapsulated in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not enveloped within a tag (but is best left alone).
I have constructed this regex:
(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)
It correctly selects all the content that I want to ignore, and only leaves the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?
So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
This is supposed to work generically, without knowing any specific tags in the html. It's supposed to extract all text. Additionally, I need to preserve the original html so the page retains all its links and scripts - i only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes or script variables etc (so I can't just do a "replace with nothing" on all the matches I get, because even though I am then left with what I need, it's a hassle to reinsert that back into the correct places of the fully functional document).
I want to know if this is at all possible using regex (and I know about HTML Agility Pack and XPath, but I don't feel like using them).
Any suggestions?
Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).
what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.
That's what one would normally do. Or even simpler, replace every match of the markup pattern with an empty string, and what you've got left is the stuff you're looking for.
It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.
Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
I would like to once again suggest the benefits of HTML Agility Pack.
ETA: since you seem to want it, here are some examples of markup that look like they'll trip up your expression.
<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
- very common URL char missing in group
- more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: url('link');"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)
and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
Regex is not reliable for retrieving the textual contents of HTML documents. Regex cannot handle nested tags. Even supposing a document doesn't contain any nested tags, regex still requires that every tag be properly closed.
If you are using PHP, for simplicity, I strongly recommend you use DOM (Document Object Model) to parse/extract HTML documents. A DOM library usually exists in every programming language.
If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.
Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.
However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.
You cannot parse HTML with regular expressions.
Parsing HTML with regular expressions leads to sadness.
I know you're just doing it for fun, but there are so many packages out there that actually do the parsing the right way, AND do it reliably, AND have been tested.
Don't go reinventing the wheel, and doing it a way that is all but guaranteed to frustrate you down the road.
OK, so here's how I'm doing it:
Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):
(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)
Then in VB.Net:
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)
The actual replacing of text happens here:
Private Function MatchEvalFunction(ByVal match As Match) As String
    Dim plainText As String = match.Groups("text").Value
    If plainText IsNot Nothing AndAlso plainText <> "" Then
        MatchEvalFunction = match.Value.Replace(plainText, plainText.Replace("Original word", "Replacement word"))
    Else
        MatchEvalFunction = match.Value
    End If
End Function
Voila. newHtml now contains an exact copy of the original, except every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all html and script code is preserved untouched. Of course, one could / would put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including function declaration and loading of the html code etc. I'd be very interested in seeing a parallel solution, done in DOM etc for comparison (yes, I know this approach can be thrown off balance by certain nested-tag quirks - in SCRIPT rewriting - but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well).
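For comparison, roughly the same idea in Python would use re.sub with a callback in place of the MatchEvaluator; the pattern is the one quoted above, and the sample HTML below is made up:

import re

pattern = re.compile(
    r"(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)",
    re.IGNORECASE,
)

def evaluator(match):
    # Only rewrite matches that landed in the "text" group; markup passes through untouched.
    text = match.group("text")
    if text:
        return text.replace("Original word", "Replacement word")
    return match.group(0)

sample = '<p>Original word here <b>Original word there</b></p><script>var x = "Original word";</script>'
print(pattern.sub(evaluator, sample))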
For Your Information,
Instead of regex, with jQuery it is possible to extract just the text from HTML markup. For that you can use the following pattern, where markup holds the HTML string:
$("<div/>").html(markup).text()
You can refer to this JSFiddle.

display mysql newline in HTML

Certain fields in our mysql db appear to contain newline characters so that if I SELECT on them something like the following will be returned for a single SQL call:
Life to be sure is nothing much to lose
But young men think it is and we were young
If I want to preserve the line breaks when displaying this field on a webpage, is the standard solution to write a script to replace '\n\r' with a br HTML tag or is there a better way?
Thanks!
Assuming PHP here...
nl2br() adds in <br /> for every \n. Don't forget to escape the content first, to prevent XSS attacks. See below:
<?php echo nl2br(htmlspecialchars($content)); ?>
HTML is a markup language. Regardless of how many linebreaks you put in the source code, you won't see any of them back in the presentation (of course assuming you aren't using <pre> or white-space: pre). HTML uses the <br> element to represent a linebreak. So you basically indeed need to convert the real but invisible linebreaks, denoted by the characters 0xA (newline, linefeed, LF, \n) and/or 0xD (carriage return, CR, \r), into an HTML <br> element.
In most programming languages you can just do this by a string replace of "\n" by "<br>".
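As a sketch of the same string-replace idea outside PHP, a small Python helper (hypothetical, escaping first and then converting the usual newline variants) could look like:

import html

def nl2br(text):
    # Escape HTML-special characters first, then turn each newline into a <br>.
    escaped = html.escape(text)
    normalized = escaped.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", "<br>\n")

print(nl2br("Life to be sure is nothing much to lose\nBut young men think it is and we were young"))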
You can wrap it in <pre> .. </pre>.