I have a docx document which is structured into sections and subsections e.g.
Section A
texttexttext
texttexttext
1.1 texttexttext
texttexttext
(a) texttexttext
I want to use python-docx to extract the text. It is easy to get the text in the paragraphs but I do not know how to get the text of the section headings (e.g. "1." and "(a)" etc.). Is there an easy way to do this?
How easy this is depends on how rigorous the document author has been in constructing the document.
In the best case, the author has used styles for all section headings, and then you can just parse through the paragraphs picking out those with "Heading 1" style, for example.
from docx import Document

document = Document('example.docx')  # hypothetical file name
for paragraph in document.paragraphs:
    if paragraph.style.name == 'Heading 1':
        print(paragraph.text)
If the author instead applied character formatting like bold and font size to designate headings, your job will be tougher as these are much less likely to uniquely identify headings.
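A minimal sketch of the style-based approach, generalized to every heading level. It assumes the built-in English style names ("Heading 1", "Heading 2", ...) and relies only on the `paragraph.style.name` and `paragraph.text` attributes used above. The `iter_headings` and `fake_paragraph` helpers are hypothetical; the stand-in paragraphs are there only so the example runs without a .docx file.

```python
import re
from types import SimpleNamespace

# Matches built-in heading style names such as "Heading 1".
# Assumption: localized or custom styles will have different names.
HEADING_STYLE = re.compile(r"^Heading (\d+)$")


def iter_headings(paragraphs):
    """Yield (level, text) for every paragraph whose style is 'Heading N'."""
    for paragraph in paragraphs:
        match = HEADING_STYLE.match(paragraph.style.name)
        if match:
            yield int(match.group(1)), paragraph.text


def fake_paragraph(style_name, text):
    """Hypothetical stand-in with the same attributes as a python-docx paragraph."""
    return SimpleNamespace(style=SimpleNamespace(name=style_name), text=text)


sample = [
    fake_paragraph("Heading 1", "Section A"),
    fake_paragraph("Normal", "texttexttext"),
    fake_paragraph("Heading 2", "Subsection"),
]

# Print an indented outline of the headings only.
for level, text in iter_headings(sample):
    print("  " * (level - 1) + text)
```

With a real file, `iter_headings(Document(path).paragraphs)` should behave the same way. Numbers such as "1.1" or "(a)" that Word generates through automatic numbering are probably not part of `paragraph.text`, so they may need separate handling.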
I suggest using sections, as in the following example:
>>> document = Document()
>>> sections = document.sections
>>> sections
<docx.parts.document.Sections object at 0x1deadbeef>
>>> len(sections)
3
>>> section = sections[0]
>>> section
<docx.section.Section object at 0x1deadbeef>
>>> for section in sections:
...     print(section.start_type)
...
NEW_PAGE (2)
EVEN_PAGE (3)
ODD_PAGE (4)
I am trying to find an easy way to convert my Word documents to HTML without the awful save-as that is built in. These are structured documents (designed for our screen-reader (JAWS) users), and so they use Heading 1, 2, 3, 4 & the Table of Contents.
We plan to convert these to DAISY audiobooks (https://en.wikipedia.org/wiki/DAISY_Digital_Talking_Book), so we need pretty clean, but structured, HTML to convert.
I tried the find-replace, using Styles, but it would just replace anything in the text part of the search. I could convert it from any one style to another, but adding text in the box messed it up.
(I think I see that CSS for DAISY means that instead of just <h2> it will have to be <level2 class="section"><h2>, plus the closing tags, but that's step 2 after I handle this part.)
I just want to be able to find any text using Style 2 and add text to the start of that line saying "yep, here's some style 2" so that I can do the HTML/CSS stuff.
Thanks!
You can do that with a simple Find/Replace. For example, specify the Heading 1 Style for the Find parameter and use:
Replace = <h1>^&</h1>
For a macro you could incorporate that into, see: Convert a Word Range to a String with HTML tags in VBA
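If a script is an option instead of Word's Find/Replace, the same wrapping can be sketched in Python. This is only an illustration: it assumes paragraph objects that expose `style.name` and `text` (as python-docx paragraphs do) and the built-in "Heading N" style names. `paragraph_to_html` and `paragraphs_to_html` are hypothetical helpers.

```python
import html
import re
from types import SimpleNamespace

# Assumption: headings use the built-in style names "Heading 1" .. "Heading 6".
HEADING_STYLE = re.compile(r"^Heading ([1-6])$")


def paragraph_to_html(style_name, text):
    """Wrap text in <hN> for 'Heading N' styles and in <p> for anything else."""
    match = HEADING_STYLE.match(style_name)
    tag = f"h{match.group(1)}" if match else "p"
    return f"<{tag}>{html.escape(text)}</{tag}>"


def paragraphs_to_html(paragraphs):
    """Convert paragraph-like objects to HTML lines, skipping empty paragraphs."""
    return "\n".join(
        paragraph_to_html(p.style.name, p.text)
        for p in paragraphs
        if p.text.strip()
    )


# Stand-in objects so the sketch runs without a Word file.
sample = [
    SimpleNamespace(style=SimpleNamespace(name="Heading 2"), text="Tools & Tips"),
    SimpleNamespace(style=SimpleNamespace(name="Normal"), text="Body text."),
]
print(paragraphs_to_html(sample))
```

The DAISY-specific wrappers could be added in `paragraph_to_html` once the plain heading tags come out as expected.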
I can't find a guideline on how to indent HTML tags that span multiple lines, and the solution I am currently using doesn't really satisfy me.
Imagine we have an extremely long div declaration, such as:
<div data-something data-something-else data-is-html="true" class="html-class another-html-class yet-another-html-class a-class-that-makes-this-declaration-extremely-long" id="just-a-regular-id">
In order to avoid scrolling horizontally when I find these huge lines, I usually indent them in the following way:
<div
    data-something
    data-something-else
    data-is-html="true"
    class="html-class another-html-class yet-another-html-class a-class-that-makes-this-declaration-extremely-long"
    id="just-a-regular-id"
>
    <p>Some element inside the DIV</p>
</div>
I think this works pretty well in terms of readability, but I have some concerns:
Is this considered good practice?
Would you find it more readable to leave the closing > of the opening HTML tag inline with the last attribute, or on a new line as I did in the example above?
Do you have any other preferred way to deal with extremely long HTML declarations?
Feel free to share if you know some good resource about style guidelines for HTML that cover this special case, because I didn't find anything specific online.
The Google HTML/CSS Style Guide suggests wrapping long lines when it significantly improves readability, and offers three techniques, each of which includes the closing > on the last line of attributes:
1. Break long lines into multiple lines of acceptable length:
<div class="my-class" id="my-id" data-a="my value for data attribute a"
    data-b="my value for data attribute b" data-c="my value for data attribute c">
  The content of my div.
</div>
2. Break long lines by placing each attribute on its own indented line:
<div
    class="my-class"
    id="my-id"
    data-a="my value for data attribute a"
    data-b="my value for data attribute b"
    data-c="my value for data attribute c">
  The content of my div.
</div>
3. Similar to #2, except the first attribute is on the initial line and subsequent attributes are indented to match the first attribute:
<element-with-long-name class="my-class"
                        id="my-id"
                        data-a="my value for data attribute a"
                        data-b="my value for data attribute b"
                        data-c="my value for data attribute c">
</element-with-long-name>
In my opinion, #3 would not improve readability when the element contains content.
Personally I find it to be a good practice and I find it more readable using multiple lines. I use the same convention for websites that I make.
In my college we are taught the same convention and it clearly states:
Avoid Long Code Lines
When using an HTML editor, it is inconvenient to scroll right and left to read the HTML code.
Try to avoid code lines longer than 80 characters.
The rest of the convention can be found here:
https://www.w3schools.com/html/html5_syntax.asp
Would you find it more readable to leave the closing > of the opening HTML tag inline with the last attribute, or on a new line as I did in the example above?
I think leaving the closing > inline is more readable, but VS Code can't fold the element properly when it is written that way.
[Screenshots: unexpected folding, expected folding]
The following is my preferred method:
<div
    class="my-class"
    id="my-id"
    data-a="my value for data attribute a"
    data-b="my value for data attribute b"
    data-c="my value for data attribute c"
>The content of my div.
</div>
The crucial detail here is the lack of space between the closing > and the actual content.
All the following examples can be checked online:
€<span itemprop="price">13.50</span>
results in €13.50
€<span
    itemprop="price"
    Arbitrary carriage returns and spaces here
    BUT before closing the tag
>13.50
</span>
also results in €13.50
However
€<span
    itemprop="price"> <!-- Carriage return + spaces in this line -->
    13.50
</span>
or
€ <!-- Carriage return + spaces in this line -->
<span
    itemprop="price"> <!-- Carriage return + spaces in this line -->
    13.50
</span>
Both result in € 13.50 (Mind the gap!)
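The effect can be approximated outside a browser. The sketch below uses Python's `html.parser` to collect the text nodes and then collapses each run of whitespace into a single space, which is roughly what a browser does for inline content. `rendered_text` is a hypothetical helper and not a full rendering model.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the raw text nodes of a fragment, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def rendered_text(markup):
    """Approximate the visible text: join text nodes, collapse whitespace runs."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()


# Line breaks inside the tag, closing > directly before the content.
tight = '€<span\n    itemprop="price"\n>13.50\n</span>'
# Line break and indentation after the closing >, i.e. inside the content.
loose = '€<span\n    itemprop="price">\n    13.50\n</span>'

print(repr(rendered_text(tight)))
print(repr(rendered_text(loose)))
```

Whitespace inside the tag is part of the markup and disappears, while whitespace after the closing > belongs to the content and survives as a single space.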
In the style guide for a bulky HTML documentation of an existing system, which I have to maintain for a client, I found that text given in a code tag should be enclosed with spaces, like:
..., the element<code> STATE </code>matches datatype ...
In most cases the whole text is enclosed in <p> tags:
<p>..., the element<code> STATE </code>matches datatype ...</p>
Does anyone have an idea why I should write <code> STATE </code> with the spaces inside the tags and none before and after them?
One explanation could be that rendering the HTML leads to "better" (i.e. same / bigger width, ...) constant spaces between normal text and the code (the space in the code tag seems to be "bigger"). Is that approach meaningful? Or are there arguments against this rule, so that I could convince the program director to drop it?
This sounds like a way of enforcing a style without, for whatever reason, using CSS.
There's no reason to do this other than to conform to somebody's preference (your boss or a client, presumably, in this case).
To back this up, the HTML specification itself uses examples of <code> elements wrapped within <p> elements which do not follow this format:
Example 104
The following example shows how the element can be used in a paragraph to mark up element names and computer code, including punctuation.
<p>The <code>code</code> element represents a fragment of computer code.</p>
— Example 104 within the HTML5.1 specification
I need to take any random website and pull all chunks of text from the website.
I am calling this "paragraph disambiguation" (see "sentence disambiguation" in Wikipedia).
I don't care if these chunks themselves contain other HTML tags, as I can get rid of these after I extract the paragraph text.
I also need to distinguish between the paragraphs, as in: this is paragraph 1, this is paragraph 2, and so on.
I am aware that most paragraphs would typically be contained in a <p> tag. But this is not always the case. Text can also be contained in the following:
<div>
<span>
<td>
<li>
Are there any other HTML elements that might contain a block of text?
Is there any other methodology of extracting text blocks from a random webpage, like looking for "white words" and then finding their boundaries?
Thanks in advance
Jeff
Nearly all HTML elements may contain text:
p
dt
dd
td
th
And many more I can't recall at the moment. Take a look at the Complete list of HTML tags and see which is suitable to contain text, and which is not.
Use Python's Beautiful Soup and call .get_text() on the body element. This will give you all the text in the page.
From Documentation on get_text():
>>> from bs4 import BeautifulSoup
>>> markup = '\nI linked to <i>example.com</i>\n'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> soup.get_text()
'\nI linked to example.com\n'
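If the paragraphs need to stay separate rather than arrive as one blob of text, a standard-library sketch along the following lines may help. It treats a hand-picked set of block-level tags as paragraph boundaries, which is a heuristic and an assumption on my part; `BLOCK_TAGS`, `BlockTextExtractor` and `extract_blocks` are hypothetical names.

```python
from html.parser import HTMLParser

# Assumption: these tags mark the boundaries of a block of text.
BLOCK_TAGS = {"p", "div", "td", "th", "li", "dt", "dd", "blockquote", "pre",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Content of these tags is never visible text.
SKIP_TAGS = {"script", "style"}


class BlockTextExtractor(HTMLParser):
    """Split a page into text blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []
        self._skip_depth = 0

    def flush(self):
        # Normalize whitespace and store the block if it is not empty.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "br":
            self._buffer.append(" ")  # a line break separates words, not blocks

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buffer.append(data)


def extract_blocks(markup):
    """Return the text blocks of an HTML string, in document order."""
    parser = BlockTextExtractor()
    parser.feed(markup)
    parser.close()
    parser.flush()  # keep any trailing text outside a block tag
    return parser.blocks


page = ("<div><p>First <i>para</i>.</p><p>Second para.</p>"
        "<ul><li>Item one</li></ul><script>var x = 1;</script></div>")
for number, block in enumerate(extract_blocks(page), start=1):
    print(number, block)
```

Inline tags such as `<i>` or `<span>` do not start a new block here, so their text stays with the surrounding paragraph; adding `span` to `BLOCK_TAGS` would change that.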
I have the task of migrating THE worst HTML product descriptions you will ever encounter. They consist of a mixture of tables and paragraphs. The majority are not even 100% valid HTML and there are plenty of Microsoft tags courtesy of MS Word. They are littered with inline style tags, and most of it relies on the most bonky set of CSS rules you will ever see.
Essentially I have come to the realisation that the only thing of use is the paragraphs of text. I cannot just grab the <p> tags, as sometimes the paragraphs do not use them and sometimes titles or single words have their own <p> tag.
So my question is: can I match text that is longer than x characters between HTML tags?
Ideally it would also ignore <br/> and <br>
Here is a link to an example of the html I am dealing with
Note it is just the description I am processing, not the whole page.
Group 1 of this regex will match n+ chars between tags (n = 100 in this example):
<[^>]+>([^<]{100,})<[^>]+>
Notes:
I have deliberately not matched for a matching closing tag (<([^>]+)>([^<]{100,})<\1>) because of OP's sloppy HTML - a tag is a tag
I have avoided using a lookbehind ((?<=<[^>]+>)) because the match is of arbitrary length, which can cause backtracking problems (some languages, like Java, do not even support it).
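As a rough illustration, the regex above can be applied from Python like this. The sketch first replaces `<br>` and `<br/>` with a space so that they do not split a chunk, which is my own addition to cover the "ignore `<br/>`" wish; `long_text_chunks` and its `min_chars` parameter are hypothetical names.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def long_text_chunks(markup, min_chars=100):
    """Return text runs of at least min_chars characters found between two tags."""
    # Replace line-break tags with a space so they do not end a chunk early.
    markup = BR_TAG.sub(" ", markup)
    # Same pattern as in the answer, with the threshold made configurable.
    pattern = re.compile(r"<[^>]+>([^<]{%d,})<[^>]+>" % min_chars)
    chunks = (" ".join(chunk.split()) for chunk in pattern.findall(markup))
    # Drop runs that consisted of whitespace only.
    return [chunk for chunk in chunks if chunk]


sample = ("<div><p>Title</p>"
          "<p>This is a long product description<br/>that spans a line break.</p>"
          "</div>")
print(long_text_chunks(sample, min_chars=20))
```

Because the pattern consumes the tag that follows each chunk, two long chunks separated by a single tag would yield only the first one; well-separated paragraphs are not affected.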
Scanning through the site a little, it looks like many of the descriptions fall short of 100 characters. You might try a multi-pass approach, where in the first iteration, you capture all content from the first table following 'div id="tab1"'. From that starting point, it may be easier to identify and eliminate the parts you don't want, rather than extracting the parts you do want.