I need to pull all paragraphs from any website - html

I need to take any random website and pull all chunks of text from the website.
I am calling this "paragraph disambiguation" (see "sentence disambiguation" in Wikipedia).
I don't care if these chunks themselves contain other HTML tags, as I can strip those out after I extract the paragraph text.
I also need to distinguish between the paragraphs, as in: this is paragraph 1, this is paragraph 2, and so on.
I am aware that most paragraphs would typically be contained in a <p> tag. But this is not always the case. Text can also be contained in the following:
<div>
<span>
<td>
<li>
Are there any other HTML elements that might contain a block of text?
Is there any other methodology of extracting text blocks from a random webpage, like looking for "white words" and then finding their boundaries?
Thanks in advance
Jeff

Nearly all HTML elements may contain text:
p
dt
dd
td
th
And many more that I can't recall at the moment. Take a look at the Complete list of HTML tags and see which are suitable to contain text, and which are not.

Use Python's Beautiful Soup and call .get_text() on the body element. This will give you all the text in the page.
From Documentation on get_text():
>>> markup = '\nI linked to <i>example.com</i>\n'
>>> soup = BeautifulSoup(markup)
>>> soup.get_text()
u'\nI linked to example.com\n'
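get_text() returns one undivided string, while the question also asks for numbered paragraphs. A minimal sketch of one way to get separate blocks, using only the standard library's html.parser: text is collected until a block-level tag opens or closes, and each non-empty run becomes one paragraph. The BLOCK_TAGS set, the class name, and the extract_blocks helper are assumptions made for illustration, not an exhaustive or authoritative list.

```python
from html.parser import HTMLParser

# Assumption: tags treated as paragraph boundaries (illustrative, not exhaustive).
BLOCK_TAGS = {"p", "div", "td", "th", "li", "dt", "dd", "blockquote", "pre",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Content of these tags is never visible text, so it is skipped.
SKIP_TAGS = {"script", "style"}


class BlockTextExtractor(HTMLParser):
    """Hypothetical helper: splits a page into text blocks at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        # Collapse whitespace and keep the block only if any text remains.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_blocks(html):
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


if __name__ == "__main__":
    page = "<div>Intro <b>text</b><p>First para</p><p>Second para</p></div>"
    for number, block in enumerate(extract_blocks(page), start=1):
        print(number, block)
```

Inline tags such as <b>, <a>, or <span> do not split a block here, so their text stays inside the surrounding paragraph. Whether <span> should count as a boundary depends on the site, which is why the tag set is kept in one place.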

Related

use regex to select words between html tags

Thanks for visiting my question. I'm trying to match sentences between tags. For example:
<h1> Most flavors, except the ones discussed below, have only one
metacharacter that matches both before a word and after a word. <p>
This is because any position between characters can never be both at
the start and at the end of a word. Using only one operator makes
things easier for you.<p>Word boundaries, as described above, are
supported by most regular expression flavors.
I'm trying to get 10 words from each tag.
output:
Most flavors, except the ones discussed below, have only one
This is because any position between characters can never be
Word boundaries, as described above, are supported by most regular
I find it's so tricky. Thanks for your help here!!!
As has already been linked in the comments, one of the most well-known answers of all time on this site is about how using regular expressions to parse HTML is probably not a good idea. For a more detailed and balanced overview of when it is and isn't a good idea to do so, check out this question as well.
But briefly, the answer depends on what you're trying to do. It's likely that you'll be better off finding an HTML/XML-parsing library for whatever language you're using, and extracting the text with that.
I'm a bit confused as to what your task actually is, as your code as shown isn't valid HTML, since <h1> at least requires a closing tag. But if you do need to use regex to do this, you will want to look at word boundaries and interval operators for limiting to 10, and perhaps lookbehind (or just capture groups) to match the tag without returning it.
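If regex really is required, the interval-operator and capture-group idea can be sketched in Python as follows. This is only an illustration under assumptions: the function name and the limit parameter are hypothetical, and words are taken to be whitespace-delimited runs (rather than \b boundaries) so that punctuation such as "flavors," stays attached to its word.

```python
import re


def first_words_after_tags(html, limit=10):
    """Return up to `limit` words following each tag (hypothetical helper)."""
    # <[^>]+>  matches any tag, which is not part of the returned text
    # \s*      skips whitespace right after the tag
    # (...)    captures up to `limit` words; [^\s<]+ stops a word at the next "<"
    pattern = re.compile(r"<[^>]+>\s*((?:[^\s<]+\s+){0,%d}[^\s<]+)" % (limit - 1))
    # Normalise line breaks inside a match to single spaces.
    return [" ".join(match.split()) for match in pattern.findall(html)]


if __name__ == "__main__":
    sample = "<h1> One two three <p>four five"
    print(first_words_after_tags(sample, limit=2))
```

The interval {0,9} plus one final word gives at most 10 words, and a tag followed by fewer words simply yields a shorter match.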
But again: if you're trying to parse actual HTML, you'd be better off using an HTML parser to get the tag content, and then getting the first 10 words using string operations. An example in JavaScript, which is a bit of a cheat because you get the HTML parsing for free, but it makes for an easy example:
for (const tag of document.querySelectorAll('body *')) {
  // Log the tag name followed by its first 10 whitespace-separated words
  console.log(`${tag.tagName}: ${tag.innerText.trim().split(/\s+/).slice(0, 10).join(' ')}`)
}
<h1>This is an h1 tag with a bunch of text in it that is really long</h1>
<p>Here's a p tag with some more text that's really long
<p>Here's a p tag with some more text that's really long
<p>Here's a p tag with some more text that's really long
<p>Here's a p tag with some more text that's really long

Find Replace text FOO with Style "Heading 1" with <h1>Foo</h1>

I am trying to find an easy way to convert my Word documents to HTML without the awful save-as that is built in. These are structured documents (designed for our screen-reader (JAWS) users), and so they use Heading 1, 2, 3, 4 & the Table of Contents.
We plan to convert these to DAISY audiobooks (https://en.wikipedia.org/wiki/DAISY_Digital_Talking_Book ) , so we need pretty clean, but structured, HTML to convert.
I tried the find-replace, using Styles, but it would just replace anything in the text part of the search. I could convert it from any one style to another, but adding text in the box messed it up.
(I think I see that CSS for DAISY means that instead of just <h2> it will have to be <level2 class=='section' <h2> and closing tags, but that's step 2 after I handle this part.)
I just want to be able to find any text using Style 2 and add text to the start of that line saying "yep, here's some style 2" so that I can do the HTML/CSS stuff.
Thanks!
You can do that with a simple Find/Replace. For example, specify the Heading 1 Style for the Find parameter and use:
Replace = <h1>^&</h1>
For a macro you could incorporate that into, see: Convert a Word Range to a String with HTML tags in VBA

Why does Qt::mightBeRichText() not detect HTML table tags as rich text?

I'm using an HTML table within a QML Text component. My problem is that textFormat: Text.AutoText does not automatically recognize my HTML table as rich text (see the QML Text documentation).
Searching for a solution, I found HTML formatting in QML Text, which is quite close to my problem.
I already knew the solution given there: just set textFormat: Text.RichText. But I cannot use it, as setting textFormat: Text.RichText also changes how the contentWidth of the QML Text component behaves.
Text {
    id: myPlainText
    width: 500
    wrapMode: Text.Wrap
    text: "Hallo stackoverflow.com"
    textFormat: Text.AutoText
}
Text {
    id: myRichText
    width: 500
    wrapMode: Text.Wrap
    text: "Hallo stackoverflow.com"
    textFormat: Text.RichText
}
Accessing myPlainText.contentWidth gives me the width actually used by the text, even if it is shorter than 500.
Accessing myRichText.contentWidth always gives me 500.
For me, the width actually used (which is what contentWidth gives when no RichText is involved) is important for layout reasons, as this is what my component is mostly used for. Hitting the width limit (e.g. 500) for HTML tables would be OK, even though I would prefer knowing the actual table width.
From the documentation:
If the text format is Text.AutoText the Text item will automatically determine whether the text should be treated as styled text. This determination is made using Qt::mightBeRichText() which uses a fast and therefore simple heuristic. It mainly checks whether there is something that looks like a tag before the first line break. Although the result may be correct for common cases, there is no guarantee.
As you can see, it distinguishes between plain and styled text.
The third category, RichText, is not supported by AutoText.
This means that for AutoText you need to resort to the reduced set of tags listed in the documentation:
<b></b> - bold
<strong></strong> - bold
<i></i> - italic
<br> - new line
<p> - paragraph
<u> - underlined text
<font color="color_name" size="1-7"></font>
<h1> to <h6> - headers
<a href=""> - anchor
<img src="" align="top,middle,bottom" width="" height=""> - inline images
<ol type="">, <ul type=""> and <li> - ordered and unordered lists
<pre></pre> - preformatted
&gt; &lt; &amp; - the entities for >, < and &
If you need the width of your text, try to use
myRichText.implicitWidth
This will give you the width of the text, if it is not wrapped.
Probably, due to its advanced possibilities, RichText always works with the maximum contentWidth. Therefore it is not possible to use e.g. elide together with RichText. The unexpected behavior of contentWidth, however, seems like a bug to me, either in the source or, more likely, in the documentation.

Match a chunk of text inside <p></p> where certain tags repeat more than 2 times

I'm struggling to find a solution to this.
I would like to match any chunk of text inside <p></p> tags that contains more than 2 <a></a> tags
Here's an example
<p style=""> (Reporting by Jason Lange; Additional reporting by Alistair Bell, Eric Walsh and Peter Cooney; Editing by Ros Russell and Eric Beech)</p>
I'm trying to work out a regex that would match this whole chunk of text inside the <p> </p> tags, but the only criterion is the number of <a></a> tags; I have no idea what the text itself will look like.
Here's the regex code I have tried:
<p.*?>(\s+|\n+|)((.*?|)<a.*?>(.*?|)</a>(.*?|)){2,}(\s+|\n+|)</p>
It doesn't work.
Any ideas?
It's probably better to address this entire problem by parsing the HTML into a DOM rather than by using a regexp.
If you must, you can try something along the lines of this (there are some edge cases that this solution will not handle):
<p[^>]*>(.*?<a[^>]*>.*?<\/a[^>]*>.*?){2,}<\/p[^>]*>
This will match an opening <p>, then text containing <a> and </a> at least twice, and then the closing </p>.
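The parser-based route recommended at the start of this answer could be sketched as follows, using only Python's standard html.parser. The class and function names and the min_links parameter are hypothetical, and the positions of the links in the demo string are made up, since the question's example no longer shows where its <a> tags were.

```python
from html.parser import HTMLParser


class LinkHeavyParagraphs(HTMLParser):
    """Collect the text of <p> elements holding more than `min_links` <a> tags."""

    def __init__(self, min_links=2):
        super().__init__(convert_charrefs=True)
        self.min_links = min_links
        self.matches = []
        self._in_p = False
        self._links = 0
        self._text = []

    def _finish_paragraph(self):
        if self._in_p and self._links > self.min_links:
            self.matches.append(" ".join("".join(self._text).split()))
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # An unclosed <p> ends where the next one begins.
            self._finish_paragraph()
            self._in_p, self._links, self._text = True, 0, []
        elif tag == "a" and self._in_p:
            self._links += 1

    def handle_endtag(self, tag):
        if tag == "p":
            self._finish_paragraph()

    def handle_data(self, data):
        if self._in_p:
            self._text.append(data)

    def close(self):
        super().close()
        self._finish_paragraph()


def paragraphs_with_many_links(html, min_links=2):
    parser = LinkHeavyParagraphs(min_links)
    parser.feed(html)
    parser.close()
    return parser.matches


if __name__ == "__main__":
    demo = ('<p>By <a href="#">A</a>, <a href="#">B</a> and <a href="#">C</a></p>'
            '<p>Just <a href="#">one</a> link</p>')
    print(paragraphs_with_many_links(demo))
```

Counting start tags while inside a paragraph avoids the backtracking and nesting problems of the regex, and the threshold stays a plain integer comparison.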
Try this:
/^<p.*(?=(\<\/a>).*(\<\/a>)).*<\/p>$/mg
I'm counting the closing </a> tags (I'm assuming that the HTML is properly formed).
https://regex101.com/r/oK4pM4/1

Regex to match text longer than x characters between html tags?

I have the task of migrating THE worst HTML product descriptions you will ever encounter. They consist of a mixture of tables and paragraphs. The majority are not even 100% valid HTML and there are plenty of Microsoft tags courtesy of MS Word. They are littered with inline style attributes, and most of the markup relies on the most bonkers set of CSS rules you will ever see.
Essentially I have come to the realisation that the only thing of use is the paragraphs of text. I cannot just grab the <p> tags, as sometimes the paragraphs do not use them and sometimes titles or single words have their own <p> tag.
So my question is: can I match text that is longer than x characters between HTML tags?
Ideally it would also ignore <br/> and <br>
Here is a link to an example of the html I am dealing with
Note it is just the description I am processing, not the whole page.
Group 1 of this regex will match n+ chars between tags (n = 100 in this example):
<[^>]+>([^<]{100,})<[^>]+>
Notes:
I have deliberately not matched for a matching closing tag (<([^>]+)>([^<]{100,})<\1>) because of OP's sloppy HTML - a tag is a tag
I have avoided using a lookbehind ((?<=<[^>]+>)) because it would have to match a tag of arbitrary length, which can cause backtracking problems (some languages, like Java, do not even support it).
Scanning through the site a little, it looks like many of the descriptions fall short of 100 characters. You might try a multi-pass approach, where in the first iteration, you capture all content from the first table following 'div id="tab1"'. From that starting point, it may be easier to identify and eliminate the parts you don't want, rather than extracting the parts you do want.
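As a rough illustration, the pattern from this answer can be wrapped in a small Python function. The function name and the min_chars parameter are hypothetical. Replacing <br> and <br/> with a space before matching is an assumption added here to cover the "ideally ignore <br>" wish in the question; it is not part of the original regex.

```python
import re


def long_text_runs(html, min_chars=100):
    """Return text runs of at least `min_chars` characters found between two tags."""
    # Assumption: line-break tags become a space so they do not split a run.
    html = re.sub(r"<br\s*/?>", " ", html, flags=re.IGNORECASE)
    # Same shape as the answer's regex: tag, captured text without "<", tag.
    pattern = re.compile(r"<[^>]+>([^<]{%d,})<[^>]+>" % min_chars)
    # Collapse internal whitespace in each captured run.
    return [" ".join(run.split()) for run in pattern.findall(html)]


if __name__ == "__main__":
    snippet = "<td><p>Short</p><p>A noticeably longer run<br>of description text.</p></td>"
    print(long_text_runs(snippet, min_chars=30))
```

Note that whitespace counts toward the length, so heavily indented markup may pass the threshold with little real text. Stripping or collapsing whitespace before measuring would be a stricter variant.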