Obtaining font attributes of Table of Contents, List of Figures, List of Tables - python-docx

I am trying to read the font attributes (font name, font size, font colour) of the text in the Table of Contents, List of Figures and List of Tables of a Word document using python-docx (version 0.8.11).
The text in these areas can be found through the 'Hyperlink' style, as shown in the code snippet. This text can also be printed.
However, I am unable to read its font attributes. I tried to get the font with i.font, but I get the error: AttributeError: 'CT_R' object has no attribute 'font'. I am not sure what this means.
I would like to know whether it is possible to obtain the font attributes of text in the Table of Contents, List of Figures and List of Tables of a Word document. Any help would be appreciated. The code is shown below.
import docx  # python-docx library

# open docx file
document = docx.Document("file/path")
elements = document._body._body
rs = elements.xpath('.//w:r')
table_of_contents = [r for r in rs if r.style == "Hyperlink"]
for i in table_of_contents:
    print(i.text)  # print all text
    print(i.font)  # raises AttributeError: 'CT_R' object has no attribute 'font'
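The xpath query returns low-level CT_R XML elements rather than python-docx Run objects, and font is defined on Run, which is why the attribute lookup fails. One way to see what formatting is stored directly on those runs is to read the run properties (w:rPr) from the XML. The following is a minimal sketch using only the standard library, not python-docx itself; the function name hyperlink_run_fonts and the returned dictionary keys are made up for illustration. It only reports formatting set directly on the run; a value of None usually means the attribute is inherited from a style.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _attr(element, name):
    """Return the w:<name> attribute of element, or None if either is missing."""
    return None if element is None else element.get(f"{{{W}}}{name}")


def hyperlink_run_fonts(document_xml, style_id="Hyperlink"):
    """List direct font formatting of runs whose character style is style_id.

    hyperlink_run_fonts is a hypothetical helper, not part of python-docx.
    """
    root = ET.fromstring(document_xml)
    results = []
    for run in root.iter(f"{{{W}}}r"):
        rpr = run.find("w:rPr", NS)
        if rpr is None or _attr(rpr.find("w:rStyle", NS), "val") != style_id:
            continue
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        half_points = _attr(rpr.find("w:sz", NS), "val")  # w:sz is in half-points
        results.append({
            "text": text,
            "font": _attr(rpr.find("w:rFonts", NS), "ascii"),
            "size_pt": int(half_points) / 2 if half_points else None,
            "color": _attr(rpr.find("w:color", NS), "val"),
        })
    return results
```

The XML can be taken from the .docx archive with zipfile.ZipFile("file/path").read("word/document.xml") and passed straight to the function.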

Related

How to solve line discontinuity in html table (python fpdf2 generate PDF documents)

I am trying to generate a PDF document with fpdf2. The PDF contains a table, which is generated by:
class HTMLMixin:
    HTML2FPDF_CLASS = HTML2FPDF

    def write_html(self, text, *args, **kwargs):
        """Parse HTML and convert it to PDF"""
        kwargs2 = vars(self)
        # Method arguments must override class & instance attributes:
        kwargs2.update(kwargs)
        h2p = self.HTML2FPDF_CLASS(self, *args, **kwargs2)
        text = html.unescape(text)  # To deal with HTML entities
        h2p.feed(text)
I did not write this class; it is part of the library. Some columns, however, have more content, so their cells contain line breaks. Because of these line breaks, the vertical borders of the columns are rendered with gaps.
How can I get full, continuous side lines for each column in the table? (Just like for the column 'Teilnehmer'.)
This is a current limitation of the library, reported here on the project bug tracker: https://github.com/PyFPDF/fpdf2/issues/91
The documentation also lists the limitations of the HTMLMixin: https://pyfpdf.github.io/fpdf2/HTML.html#supported-html-features
Notes:
tables should have at least a first <th> row with a width attribute.
currently multi-line text in table cells is not supported. Contributions are welcome to add support for this feature! 😊

How to extract a description part from website with proper spacing?

I accessed the website with Beautiful Soup and retrieved the description part (a div class), but because it contains bullet points, I get output like this, with no spacing between the points (not readable):
DESCRIPTION:
COVID-19 ProjectionsGovernment-mandated social distancingHospital resource useAll bedsICU bedsInvasive ventilatorsDeaths per dayTotal deaths
Actually, I have both normal paragraphs and bullet points, so I cannot use li or ul to retrieve the bullet points alone.
This is my program for the description part:
def DESCRIPTION(self):
    print('\n' + "DESCRIPTION: ")
    for j in Data_Set_Info.soup.select('.iH9v7b'):
        k = j.get_text()
        print('\n' + k)
The HTML code for this webpage is:
<div class="iH9v7b"><p>COVID-19 Projections</p><ul><li>Government-mandated social distancing</li><li>Hospital resource use</li><ul><li>All beds</li><li>ICU beds</li><li>Invasive ventilators</li></ul><li>Deaths per day</li><li>Total deaths</li></ul><p></p></div>
The webpage is: https://datasetsearch.research.google.com/search?query=health&docid=B2%2BtssYi2L2wvQwVAAAAAA%3D%3D
This website has many datasets, and each dataset has a different description. I need to get every description with proper spacing using a single program. Thanks in advance.
If you just want to get all the text with spaces in between, you can specify the character used to join text from different elements as an argument to get_text, like so:
k = j.get_text(' ')
If you want to be able to preserve (potentially nested) lists in the output then you'll need to recursively search through j.contents. A one-size-fits-all solution is unlikely to work for that purpose and will probably need a bit of experimentation.
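As a rough illustration of the nested-list case, here is a minimal sketch that swaps in the standard library's html.parser instead of walking j.contents in Beautiful Soup, so it runs without extra dependencies. The names OutlineParser and html_to_outline are made up for this example. It indents each li by its ul/ol nesting depth and treats p, div and br as line breaks; real pages will likely need more cases.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect text as lines, indenting <li> items by list nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # current <ul>/<ol> nesting level
        self.lines = []     # finished output lines
        self._prefix = ""   # bullet prefix for the line being collected
        self._buf = []      # text pieces of the line being collected

    def flush(self):
        # Join the buffered pieces and normalise runs of whitespace.
        text = " ".join("".join(self._buf).split())
        if text:
            self.lines.append(self._prefix + text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        self.flush()
        if tag in ("ul", "ol"):
            self.depth += 1
        elif tag == "li":
            self._prefix = "  " * (self.depth - 1) + "- "
        elif tag in ("p", "div", "br"):
            self._prefix = ""

    def handle_endtag(self, tag):
        self.flush()
        if tag in ("ul", "ol"):
            self.depth -= 1

    def handle_data(self, data):
        self._buf.append(data)


def html_to_outline(html_text):
    """Return html_text as plain text with one line per paragraph or bullet."""
    parser = OutlineParser()
    parser.feed(html_text)
    parser.close()
    parser.flush()
    return "\n".join(parser.lines)
```

With Beautiful Soup, the input could be str(j) for each selected element, for example print(html_to_outline(str(j))).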
Documentation links:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children

Word html format: insert a custom TOC via field code

I am generating Word docs from html. Basically, I build a file with html and save it as a .doc. Then I open it in Word and apply a template. All good so far.
I would like to automatically generate a custom TOC via the HTML, i.e. when I am building the document. I need to insert a field code to do that, in the same way I do to add page numbering via the HTML, e.g.:
<span style="mso-field-code: PAGE " class="page-field"></span>
If I save my HTML doc as docx and apply a template, I can make a TOC based on the styles in the way one would normally create a TOC in Word. I customised the TOC so the Title style is the top level, followed by H1, H2, then H3. If I then toggle the field code on the TOC, the field code looks like this:
{ TOC \t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1" }
Now, I can add HTML like this to insert the TOC:
<div style="mso-field-code: TOC " class="toc-field">TOC goes HERE</div>
When I do that, if I right click the text "TOC goes HERE" I get the option to "Update field" and if I do that a TOC is generated using the default H1,H2,H3 tags.
But, what I can't work out is how to include the
\t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1"
part so my custom style sequence is applied. I have tried all sorts of combinations and it seems that adding anything after TOC causes Word to not make a field code.
Does anyone have any suggestions?
Update:
Based on the essential help from @slightlysnarky below, I thought I would summarise the outcome here, because the information I needed was in a Microsoft .chm file that was taken down many years ago. If you read the following extract from that help manual and compare it to the solution below, you will see how this all works.
Word marks and stores information for simple fields by means of the Span element with the mso-field-code style. The mso-field-code value represents the string value of the field code. Formatting in the original field code might be lost when saving as HTML if only the string value of the code is necessary for its calculation.
Word has a different way of storing field information to HTML for more complex fields, such as ones that have formatted text or long values. Word marks these fields with <!--[if supportFields]> so the data is not displayed in the browser. Word uses the Span element with the mso-element: field-begin, mso-element: field-separator, and mso-element: field-end attributes to contain the three respective parts of the field code: the field start, the separator between field code and field results, and the field end. Whenever possible, Word will save the field to HTML in the method that uses the least file space.
So, basically, add tags as shown below to your HTML at the point you wish the TOC to appear.
:-)
Word recognises a "complex field format" in HTML, along the same lines as it does in the Office Open XML format. So you can use
<span style='mso-element:field-begin'></span>TOC \t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1"
<span style='mso-element:field-separator'></span>This text will show but the user will need to update the field
<span style='mso-element:field-end'></span>
This construct is outlined in a Microsoft document called "Microsoft Office HTML and XML Reference". It's a Windows .exe that unpacks to a .chm Help file. You can get it here
The info. on encoding fields is in Getting Started with Microsoft Office 2000 HTML and XML->Microsoft Word->Fields
There may be a later version but that's the only one I could find.
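Since the document is being built programmatically, the complex-field markup can be generated from a list of style/level pairs. This is a minimal sketch; the function name toc_field_html and its parameters are made up for illustration, and it only assembles the three spans and the \t switch in the form shown in the answer.

```python
from html import escape


def toc_field_html(style_levels, placeholder="Update this field to build the TOC"):
    """Build Word's complex-field HTML for a TOC with a custom \\t switch.

    style_levels is a list of (style name, TOC level) pairs, for example
    [("Title", 1), ("Heading 1", 2)]. Hypothetical helper for illustration.
    """
    # The \t switch takes "Style,level,Style,level,..." in one quoted string.
    switch_arg = ",".join(f"{style},{level}" for style, level in style_levels)
    field_code = f'TOC \\t "{switch_arg}"'
    return (
        "<span style='mso-element:field-begin'></span>"
        f"{escape(field_code, quote=False)}\n"
        "<span style='mso-element:field-separator'></span>"
        f"{escape(placeholder)}\n"
        "<span style='mso-element:field-end'></span>"
    )


print(toc_field_html([("Heading 1", 2), ("Heading 2", 3), ("Heading 3", 4), ("Title", 1)]))
```

The placeholder text is what shows until the user updates the field in Word.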

Use R to extract sections of HTML document using <b> to indicate section header

I have a few thousand large documents saved locally, where they are all saved as HTML files. Each document is about 300 pages long, and has some sections that have titles in bold letters. My goal is to do a text search in these files, and when I find the given phrase, extract the whole section that contains this phrase. My idea was to parse the html text so that it becomes a list of paragraphs, find the location of the phrase, and then extract everything from the bold letters (title of this section) just prior to bold letters just after (title of the next section).
I tried a number of different ways, but none of them does what I want. The following was promising:
library(XML)  # provides htmlTreeParse, xpathApply and xmlValue
myhtmlfile = "I:/myfolder/myfile.html"
myhtmltxt2 = htmlTreeParse(myhtmlfile, useInternal = TRUE)
But while I can display the object "myhtmltxt2" and it looks like HTML with tags (which is what I need so that I can look for "<b>"), it is an external pointer. So I am not able to run the command below, because grep does not work on pointers.
test2 <- grep("myphrase", myhtmltxt2, ignore.case = TRUE)
Alternatively, I did this:
doc.text = unlist(xpathApply(myhtmltxt2, '//p', xmlValue))
test3 <- grep("myphrase", doc.text, ignore.case = TRUE)
But in this case I lose the HTML tags in doc.text, so I no longer have "<b>", which is what I was going to use to mark the section to extract. Is there a way of doing this?
I managed this with the following:
# read the whole file into one string, then split it at each bold section title
singleString <- paste(readLines(myhtmlfile), collapse=" ")
data11 = strsplit(singleString, "<p><b>", fixed = TRUE)
test2 <- unlist(data11)
myindex <- regexpr("Myphrase </b>", test2)
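The same split-on-title idea can be sketched in Python for comparison. This is only an illustration under the assumption made in the answer above, namely that every section title starts with the literal markup <p><b>; the function name sections_containing is made up. It searches the tag-stripped text of each section so that the phrase is not matched inside markup.

```python
import re


def sections_containing(html_text, phrase, marker="<p><b>"):
    """Return the sections (title through the next title) that mention phrase.

    Assumes each section title begins with the literal marker string.
    """
    chunks = html_text.split(marker)
    # chunks[0] is whatever precedes the first bold title, so skip it.
    sections = [marker + chunk for chunk in chunks[1:]]
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    matches = []
    for section in sections:
        plain = re.sub(r"<[^>]+>", " ", section)  # crude tag stripping
        if pattern.search(plain):
            matches.append(section)
    return matches
```

For files of about 300 pages this keeps everything in memory, which mirrors the readLines approach above.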

how to apply font properties on <span> while passing html to pdf using itextsharp

I am converting HTML to PDF using iTextSharp and I want to set the font size for <span> tags. How can I do this?
Currently I am using:
StyleSheet styles = new StyleSheet();
styles.LoadTagStyle(HtmlTags.SPAN, HtmlTags.FONTSIZE, "9f");
string contents = File.ReadAllText(Server.MapPath("~/PDF TEMPLATES/DeliveryNote.html"));
List<IElement> parsedHtmlElements = HTMLWorker.ParseToList(new StringReader(contents), styles);
But it didn't work.
The constants listed in HtmlTags are actually a hodgepodge of HTML tags and HTML and CSS properties and values and it can be a little tricky sometimes figuring out what to use.
In your case try HtmlTags.SIZE instead of HtmlTags.FONTSIZE and you should get what you want.
EDIT
I've never really seen a good tutorial on what properties do what, I usually just go directly to the source code. For instance, in the ElementFactory class there's a method called GetFont() that shows how font information is parsed. Specifically on line 130 (of revision 229) you'll see where the HtmlTags.SIZE is used. However, the actual value for the size is parsed in ChainedProperties in a method called AdjustFontSize(). If you look at it you'll see that it first looks for a value that ends with pt such as 12pt. If it finds that then it drops the pt and parses the number literally. If it doesn't end with pt it jumps over to HtmlUtilities to a method called GetIndexedFontSize(). This method is expecting either values like +1 and -1 for relative sizes or just integers like 2 for indexed sizes. Per the HTML spec user agents are supposed to accept values 1 through 7 for the font size and map those to a progressively increasing font size list. What this means is that your value of 9f is actually not a valid value to pass to this, you should probably be passing 9pt instead.
Anyway, you kind of have to jump around in the source to figure out what's being parsed where.
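The parsing order described above can be sketched in Python. This is not a port of the iTextSharp source, only an illustration of the three cases (a value ending in pt, a relative +1/-1 step, an index from 1 to 7). The function name resolve_font_size, the current_pt parameter and the size table are assumptions made for the example.

```python
# Point sizes for HTML font indexes 1..7. Illustrative values only,
# assumed for this sketch and not confirmed against iTextSharp.
INDEXED_SIZES = [8, 10, 12, 14, 18, 24, 36]


def resolve_font_size(value, current_pt=12.0):
    """Resolve an HTML-style size string to points, following the order above."""
    value = value.strip()
    if value.endswith("pt"):
        # "12pt": drop the suffix and parse the number literally.
        return float(value[:-2])
    if value.startswith(("+", "-")):
        # "+1" / "-1": step relative to the table entry nearest the current size.
        nearest = min(
            range(len(INDEXED_SIZES)),
            key=lambda i: abs(INDEXED_SIZES[i] - current_pt),
        )
        index = nearest + int(value)
    else:
        # "2": a 1-based index into the size table.
        index = int(value) - 1
    index = max(0, min(index, len(INDEXED_SIZES) - 1))
    return float(INDEXED_SIZES[index])
```

In this sketch "9pt" resolves to 9.0, while "9f" matches none of the three forms and raises ValueError, which is the sketch's way of showing why 9f is not a usable value.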