How can I get the ListNumber in a docx by using python-docx

I am trying to get the style of each paragraph in a docx by using "paragraph.style.name".
However, I can only get:
"Heading 1"
"Heading 2"
"Normal"
"Normal"
.
.
but no "List 1", "List 2" as expected.
How can I get the ListNumber of a list paragraph?
Or is there any other style information I can get besides "Normal"?

This is probably due to the list numbering being applied manually using the toolbar icon rather than by using a style. In this case the style would be unchanged, which would explain why it might appear as "Normal".
The list numbering details will be found in the paragraph properties and will involve several references to the numbering.xml part. Your best bet if you want to tackle it would be to start by browsing the XML; opc-diag is good for that.
But it's pretty complicated and I know of no examples you can follow.
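As a starting point for browsing that XML, here is a minimal sketch that uses only the Python standard library rather than python-docx. It reads word/document.xml and reports the w:numId and w:ilvl values found in each paragraph's properties. The function names are made up for illustration, and it assumes the numbering was applied directly to the paragraph; numbering inherited from a style, and the actual number format stored in numbering.xml, are not resolved here.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def list_numbering(document_xml):
    """Return (text, numId, ilvl) per paragraph; numId/ilvl are None if absent."""
    root = ET.fromstring(document_xml)
    rows = []
    for p in root.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        num_id = ilvl = None
        # Direct list formatting lives in <w:pPr><w:numPr>.
        num_pr = p.find("w:pPr/w:numPr", NS)
        if num_pr is not None:
            num_id_el = num_pr.find("w:numId", NS)
            ilvl_el = num_pr.find("w:ilvl", NS)
            if num_id_el is not None:
                num_id = int(num_id_el.get(f"{{{W}}}val"))
            if ilvl_el is not None:
                ilvl = int(ilvl_el.get(f"{{{W}}}val"))
        rows.append((text, num_id, ilvl))
    return rows


def list_numbering_from_docx(path):
    """Hypothetical convenience wrapper: read document.xml out of a .docx file."""
    with zipfile.ZipFile(path) as archive:
        return list_numbering(archive.read("word/document.xml"))
```

The numId value is the reference into numbering.xml mentioned above, and ilvl is the indent level within that list definition.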

How to relate two elements for accessibility?

I have the following layout in my web application:
To make it accessibility compliant, do I need to relate "Al Allbrook" to the "Requester" label? If so, how can we achieve that?
"Al Allbrook" is a link to the user's profile.
If they are not related, how will a screen reader know that "Al Allbrook" is the requester? The same goes for "Site".
In addition to what @andy said, you could also use a table. The first column could have a "requestor" table heading (<th scope="col">). The heading itself could be visually hidden if you don't want it to "clutter" the display but still be available to screen reader (SR) users. The second column would be something like "contact info" and the last column is "site". This allows an SR user to navigate across the row of the table, and they'll hear the column heading before the data cell.
Of course, you can do a combination of these techniques. Have a table and have extra information on the links. I would recommend aria-labelledby instead of aria-describedby. While both attributes will cause the extra information to be read by a SR (*), only the aria-labelledby attribute will be displayed in the list of links.
(*) Some SRs announce the aria-describedby attribute directly but other SRs will just tell you that there is a description associated with the link and you have to hit a different shortcut key to hear the description.
The nice thing about both attributes is that the element can refer to itself as part of the label. Kind of a recursive labeling but the "Accessible Name and Description Computation" rules handle the recursion.
if computing a name, and the current node has an aria-labelledby attribute that contains at least one valid IDREF, and the current node is not already part of an aria-labelledby traversal, process its IDREFs in the order they occur
It's probably easier to see an example of this.
<span id="comma" style="display:none">,</span>
...
<span id="requestor">Requestor</span>
<a id="myself" href="..." aria-labelledby="myself comma requestor">Al Allbrook</a>
Several things to note.
First is that the link is referring to itself in the aria-labelledby attribute (the 'myself' id).
Second is that I'm using a trick with screen readers by adding a comma in the label, "Al Allbrook, Requestor" so that the SR has a slight pause when reading the label, "Al Allbrook <pause> Requestor", rather than hearing it as if the guy's name was "Al Allbrook Requestor". Note that the comma itself has display:none so it's not visible, but since the comma element's ID is listed in aria-labelledby, it'll still be used. (See rule 2A in the Accessible Name url above)
Lastly, my example used a <span> for "Requestor" but you might want it to be a heading (<h3> or <h4> or whatever level is appropriate) instead.
For example:
<span id="comma" style="display:none">,</span>
...
<h3 id="requestor">Requestor</h3>
<a id="myself" href="..." aria-labelledby="myself comma requestor">Al Allbrook</a>
And then all this code could be in a <td> if you're using a table.
There are different ways to navigate through a site with a screen reader, so it depends on the navigation mode the user is using at the moment.
In DOM order
In this case, if your "Requester" is before the link in DOM, it will be read before the person's name. Also, the text right before and after a link can be read by means of certain shortcuts.
By accessing a list of links
There are different lists screen reader users can request, for example a list of all headings or a list of all links on the page.
If it's important to you to have the "requester" read when navigating to the link directly, you can link the two elements by means of aria-describedby or aria-labelledby.
Alternatively, you could add the text again to the link itself, hidden visually. Like "Al Allbrook, Requester".

Find Replace text FOO with Style "Heading 1" with <h1>Foo</h1>

I am trying to find an easy way to convert my Word documents to HTML without the awful save-as that is built in. These are structured documents (designed for our screen-reader (JAWS) users), and so they use Heading 1, 2, 3, 4 & the Table of Contents.
We plan to convert these to DAISY audiobooks (https://en.wikipedia.org/wiki/DAISY_Digital_Talking_Book), so we need pretty clean, but structured, HTML to convert.
I tried the find-replace, using Styles, but it would just replace anything in the text part of the search. I could convert it from any one style to another, but adding text in the box messed it up.
(I think CSS for DAISY means that instead of just <h2> it will have to be <level2 class=='section' <h2> plus closing tags, but that's step 2 after I handle this part.)
I just want to be able to find any text using Style 2 and add text to the start of that line saying "yep, here's some style 2" so that I can do the HTML/CSS stuff.
Thanks!
You can do that with a simple Find/Replace. For example, specify the Heading 1 Style for the Find parameter and use:
Replace = <h1>^&</h1>
For a macro you could incorporate that into, see: Convert a Word Range to a String with HTML tags in VBA
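If you would rather do this outside Word, the same idea can be sketched in Python. This is only an illustration, not the linked macro: wrap_by_style and docx_to_html are hypothetical names, the second function assumes python-docx is installed, and it assumes the built-in style names "Heading 1" to "Heading 6". Anything else is emitted as a plain <p>.

```python
import html
import re


def wrap_by_style(style_name, text):
    """Wrap text in <hN> when style_name is 'Heading N' (1-6), else in <p>."""
    match = re.fullmatch(r"Heading ([1-6])", style_name)
    tag = f"h{match.group(1)}" if match else "p"
    # Escape &, < and > so the paragraph text cannot break the markup.
    return f"<{tag}>{html.escape(text)}</{tag}>"


def docx_to_html(path):
    """Convert every paragraph of a .docx file to one tagged line of HTML."""
    # Imported here so wrap_by_style can be used without python-docx installed.
    from docx import Document

    return "\n".join(
        wrap_by_style(p.style.name, p.text) for p in Document(path).paragraphs
    )
```

Tables, images, and inline formatting are ignored in this sketch; it only covers the heading structure the question asks about.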

Replacing stuff of HTML using regex

I am editing a couple of hundred HTML files and I have to replace all the stuff manually, so I was wondering whether it could be done using regex. I don't think it is possible, but it might be, so please help me out.
Okay, so for example, I have many <p> tags in a file, each with a different class. eg:
<p class="class1">stuff here</p>
<p class="class2">more stuff here</p>
I wanted to replace the "stuff here" and "more stuff here" with something, for example
<p class="class1">[content]</p>
<p class="class2">[content]</p> .
I wanted to know if that is possible.
I'm using notepad++.
P.S. I'm new to regex.
I think notepad++ is great for stuff like this. Open up Find/Replace, and check the regular expressions box in the dialog's Search Mode section.
In the "Find what" field, try this:
<p class=(.*)>(.*)</p>
and in "Replace with":
<p class=\1>[content]</p>
The \1 here takes whatever the first (.*) found between class= and the angle bracket > that ends the tag, and puts it back unchanged, so the class name is preserved without you having to specify it. The second (.*) catches the current content inside the paragraph tag, which is what you want to replace. So where I wrote [content] in the "Replace with" field, that's where you'd put your new content. This does limit you to content that you can paste into the notepad++ find/replace dialog, but I think it has a pretty huge limit.
If I'm remembering that text field's limitations incorrectly, another thing you could do is just adjust my "Replace with" text to just replace the old text with some newlines:
<p class=\1>\n\n</p>
This will delete the old text and leave a clear line where it once was, making it easy to paste whatever you want into the normal editor pane.
The first way is probably better, if your new content will fit the Replace With field, because this regex works once per line. And you can click "Replace" a couple times, and if it's working, clicking "Replace all" will iterate through every <p> element in the file.
Note: this solution assumes that your <p> tags open and close within one line, as you typed them in your question. If they break across lines, you'll want to enable ". matches newline" in the Replace dialog, and you'll need more precise syntax than (.*) to catch your class name and the content to be replaced. Let me know if this is the case, and I'll fiddle with it and see if I can help more. The (.*) needs to change to (.*?) or something similar; the search needs to get less greedy, because if . matches newline, then .* matches any character any number of times, i.e., as much of the document as it can.
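For a couple of hundred files, the same replacement could also be scripted. Below is a minimal Python sketch of the non-greedy variant described in the note above, with DOTALL playing the role of ". matches newline". The function name is made up for illustration, and it assumes every <p> tag has exactly the form <p class=...>, as in the question.

```python
import re

# (.*?) is the lazy form: each group stops at the first place where the rest
# of the pattern can match, so one match never runs across several <p> tags.
PATTERN = re.compile(r"<p class=(.*?)>(.*?)</p>", re.DOTALL)


def replace_paragraph_content(html_text, new_content):
    """Replace the body of every <p class=...> element, keeping its class."""
    # A function is used as the replacement so that backslashes in
    # new_content are inserted literally instead of being read as escapes.
    return PATTERN.sub(
        lambda m: f"<p class={m.group(1)}>{new_content}</p>", html_text
    )
```

Like the Notepad++ version, this is a text-level shortcut rather than real HTML parsing, so nested or unusually formatted tags would need something sturdier.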

Using XPATH to find a li based on text contains doesn't work on text after a <b> tag

So I've got some li on a page and I'm trying to identify it with some XPATH. The only trouble is that I need to make sure all the text matches, so I need to identify on all the text, and there is a <b> in there that is giving me hassle (I'm using a Chrome add-in to validate the XPATH and it keeps telling me it's null when I try). Any suggestions welcome!
Here is the html on the page: -
<li>
Some pre text, <b>bold</b> nothing here is identified.
</li>
Here is what I've tried that doesn't work: -
//ul/li[contains(text(),'') and contains(text(),'bold') and contains(text(),'nothing here is identified')]
I also tried this just to see if it works (bear in mind my XPATH needs to check all the text within that li), but it won't identify it at all using any text after the bold tags...
//ul/li[contains(text(),'nothing here is identified')]
What obvious XPATH trickery am I missing...?
Cheers
You can use the following:
//ul/li[contains(.,'') and contains(.,'bold') and contains(.,'nothing here is identified')]
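Using text() gives you several separate text nodes rather than one string. Only the first and last pieces below are direct children of the <li>; "bold" is a text node inside the <b>. When more than one node is passed to contains(), XPath 1.0 silently tests only the first node, and XPath 2.0 raises a type error: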
Some pre text,
bold
nothing here is identified.
But using . (the string value of the context node) gives you a single string: the concatenation of all three text pieces mentioned above.
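The difference can be seen outside XPath too. Here is a small Python sketch using the standard library's ElementTree, which does not implement contains() in its path language, so this is an analogy rather than the XPath itself: li.text plays the role of the first text() node, and joining itertext() plays the role of the string value of ".". The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

MARKUP = "<ul><li>Some pre text, <b>bold</b> nothing here is identified.</li></ul>"


def first_text_node(li):
    """Text before the first child element: what the first text() node holds."""
    return li.text or ""


def string_value(li):
    """All descendant text joined in document order: the string value of '.'."""
    return "".join(li.itertext())


li = ET.fromstring(MARKUP).find("li")
print(repr(first_text_node(li)))
print(repr(string_value(li)))
```

A search for 'nothing here is identified' can only succeed against the second value, which is why the expression with . matches and the one with text() does not.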

Parsing some specific data from HTML

I'm working on a small app that requires me to parse an HTML site on the web.
My problem is as follows:
The parsing routine works fine for some of the information, BUT I have been searching for hours for a way to get some information that refuses to appear.
Here is the partial code structure I want to parse:
<body>
<header>
<nav>
<div.....>
<aside......>
<main>
<div .....>
<a ......>
<a ......>
</div>
.
.
.
<div id="general">
<h2> ........</h2>
<p>
<span class="label">text</span>
"text 2 to be parsed"
<br>
<span class="label">other text</span>
"text 3 to be parsed"
<br>
This is just an example of the structure; to be precise, the URL is http://www.ourairports.com/airports/EBBR/pilot-info.html
OK, it seems that the HTML code is not appearing in the preview, so in the source code of the page above, when you see [div id="general"], below it you have a [p] followed by [span class="label"]some text[/span], and just below that you have text between quotes. This happens on several lines and I need to catch that information.
I've tried //body/div/main/div[@id='general']/p as XpathQueryString, but the result is 1 node and it is empty,
also div[@id='general'], but the result is no node found,
with div[@id='general']/p/span the result is no node found,
with //div/p/span[@class='label'] the results are the titles between the <span> and </span> tags, but I want to retrieve the text between quotes just after them and I cannot figure out how. I think I've tried every combination (many more than those listed above) without success. Is there a special path to get to this text?
Thanks for your advice.
By the way, this is my very first post on stackoverflow.com and my first language is French, so I apologize in advance for any rule not followed or for my bad English.
Enjoy your day, evening, ... night on the keyboard.
Alain
Your first expression //body/div/main/div[@id='general']/p is expected to return a single node, the <p>. And it works exactly that way on the referenced website, as you observed. The expression reaches down to that node but no deeper, where the text is nested. However, you should get the text too, just encapsulated in HTML, with tags around it. A good XPath selector API, used properly, should return the HTML node that was matched, including the <p> tag itself.
If all you want in the end is just the text nodes, try the following:
Think of the text among the <span>s as HTML nodes, text() nodes.
//div[@id='general']/p/text()
This will match the "text to be parsed".
node() matches any node (including text between tags), and * matches any element node.
For any number of steps, use the double slash:
//div[@id='general']/p//text()
Now you match every text node under the <p> tag, regardless of the nesting level. And since text nodes are by definition leaf nodes (cannot contain other nodes), this guarantees that you will not match members of the same path down the tree more than once.
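To make the idea of "text nodes sitting next to the <span>s" concrete, here is a small Python sketch using the standard library's ElementTree. ElementTree's path language does not support text(), so the sketch reads each span's tail, which is the text node immediately following it. The sample markup is a hypothetical, well-formed stand-in for the real page (which would need an HTML parser), and label_values is a made-up name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the structure in the question.
SAMPLE = """<div id="general">
  <h2>General</h2>
  <p>
    <span class="label">text</span> text 2 to be parsed<br/>
    <span class="label">other text</span> text 3 to be parsed<br/>
  </p>
</div>"""


def label_values(xml_text):
    """Map each label span's text to the text node that follows it."""
    root = ET.fromstring(xml_text)
    pairs = {}
    for span in root.findall(".//p/span[@class='label']"):
        # span.tail is the text between </span> and the next element (<br/>).
        pairs[span.text.strip()] = (span.tail or "").strip()
    return pairs


print(label_values(SAMPLE))
```

With a full XPath engine, the /p/text() expression above would select these same text nodes directly, along with any whitespace-only ones.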
Some comments on your expressions:
//body is superfluous; there is only one body and HTML defines exactly where it is.
Nodes qualified by @id do not need to be preceded by selectors for their parents; start with //div[@id='something unique'].
Learn more about XPath. An API that properly returns selected "nodes" and not just concatenated text can play an important role in the understanding of how the expressions work in practice.