How to read Docx Meta-filed in Python? - python-docx

I checked out Python-docx module, it appears you can only read document.core_properties.
What about other meta fields in header or footer?
Currently I had to dump all the text using doc2txt and then search.

Related

Convert markdown table of contents to HTML

I am trying to convert a markdown document to HTML, using pandoc. I cannot get the HTML output to create the table of contents correctly.
Issue:
I have added a table of contents to the markdown doc, where clicking on each header takes the reader to the relevant section. I am using the format below, where clicking on 'Header Title' will send the reader to the section 'header' in the document:
[Header Title](#header)
I tried to convert this to HTML using the pandoc command
pandoc -i input.md -f markdown -t html -o input.html
This creates a valid HTML file I can open in Firefox, and the items in the table of contents show up as links - but when I click them, nothing happens (I am expecting it to jump to the relevant section)
This happens when I use either markdown or markdown_github as the input format (-i in pandoc)
Question:
How can I get the table of contents to show the expected behavior in HTML?
Or is the concept of 'table of contents' a wrong approach to HTML, and I should change my markdown code?
Apologies if I am going about this the wrong way, I have no experience with HTML / web documents.
I found a couple of similar questions but they seemed to be specific to other programming languages / tools, so any help how I can achieve this with markdown / pandoc is much appreciated.
I am using pandoc 1.19.2.4 on Ubuntu.
Example markdown:
- [Chapter 1](#chapter-1)
- [1. Reading a text file](#1-reading-a-text-file)
## Chapter 1
This post focuses on standard text processing tasks such as reading files and processing text.
### 1. Reading a text file
Reading a file.
Looking at your markdown file, you have used #1-reading-a-text-file as the id for the 1st subheading.
While converting it to HTML, the following line is generated for the subheading:
<h3 id="reading-a-text-file">1. Reading a text file</h3>
The problem is the mismatch of "#1" which is present in the table of contents, but not in the heading.
My guess is that pandoc does not allow HTML id to start with a number.
Changing the table of contents to the following should work:
- [Chapter 1](#chapter-1)
- [1. Reading a text file](#reading-a-text-file)

Translate text in a html text using python

I have written an API that translates text from english to Hindi language. However, if the text that is passed is an html text, then my code fails.
How do I get this right?
I have tried using py-translate, but this package is not able to convert html texts properly.
I have also tried using googleclient package, it is able to convert html text but can handle only one request at a time.
My API should be handle multiple requests and also be able to deal with html text translation.
Any help is appreciated.

Cannot convert DOCX to HTML with Python

I've tried it by using mammoth:
import mammoth
result = mammoth.convert_to_html("MyDocument.docx")
print (result.value)
I don't get an HTML, but this strange code:
kbW7yqZoo4h9pYM6yBxX1QFx2pCoPYflXfieIPbtqpT913Vk7OzcZdEk3eO7TbWjvZNTGilsfmRrPwDvB[...]
I've also tried to use docx2html, but I can't install it. When I run pip install docx2html I get this error:
SyntaxError: Missing parentheses in call to 'print'
Mammoth .docx to HTML converter
Mammoth is designed to convert .docx documents, such as those created by Microsoft Word, and convert them to HTML. Mammoth aims to produce simple and clean HTML by using semantic information in the document, and ignoring other details. For instance, Mammoth converts any paragraph with the style Heading 1 to h1 elements, rather than attempting to exactly copy the styling (font, text size, colour, etc.) of the heading.
There's a large mismatch between the structure used by .docx and the structure of HTML, meaning that the conversion is unlikely to be perfect for more complicated documents. Mammoth works best if you only use styles to semantically mark up your document.
The following features are currently supported:
Headings.
Lists.
Customisable mapping from your own docx styles to HTML. For instance, you could convert WarningHeading to h1.warning by providing an appropriate style mapping.
Tables. The formatting of the table itself, such as borders, is currently ignored, but the formatting of the text is treated the same as in the rest of the document.
Footnotes and endnotes.
Images.
Bold, italics, underlines, strikethrough, superscript and subscript.
Links.
Line breaks.
Text boxes. The contents of the text box are treated as a separate paragraph that appears after the paragraph containing the text box.
Comments.
Installation
pip install mammoth
Basic conversion
To convert an existing .docx file to HTML, pass a file-like object to mammoth.convert_to_html. The file should be opened in binary mode. For instance:
import mammoth
with open("document.docx", "rb") as docx_file:
result = mammoth.convert_to_html(docx_file)
html = result.value # The generated HTML
messages = result.messages # Any messages, such as warnings during conversion
You can also extract the raw text of the document by using mammoth.extract_raw_text. This will ignore all formatting in the document. Each paragraph is followed by two newlines.
with open("document.docx", "rb") as docx_file:
result = mammoth.extract_raw_text(docx_file)
text = result.value # The raw text
messages = result.messages # Any messages
You can use pypandoc module for that purpose. See below code
import pypandoc
output = pypandoc.convert_file('file.docx', 'docx', outputfile="file_converted.html")
The issue you're having is probably that mammoth doesn't create legit HTML files, just HTML snippets. Meaning it's missing the and tags.
Some browsers can still render the content from the file since they're advanced enough to do so, but I ran into a similar problem when trying to use the raw output.
A nifty workaround for this is to add this to your code to convert it to proper HTML files:
import mammoth
with open("test.docx", "rb") as docx_file:
result = mammoth.convert_to_html(docx_file)
html = result.value # The generated HTML
messages = result.messages # Any messages,
full_html = (
'<!DOCTYPE html><html><head><meta charset="utf-8"/></head><body>'
+ html
+ "</body></html>"
)
with open("test.html", "w", encoding="utf-8") as f:
f.write(full_html)
Where test.html is whatever the title you gave to your document.
I'm not taking credit for this, I found it here as well, but can't find the source post.
As stated in the documentation:
To convert an existing .docx file to HTML, pass a file-like object to
mammoth.convert_to_html. The file should be opened in binary mode. For
instance:
import mammoth
with open("document.docx", "rb") as docx_file:
result = mammoth.convert_to_html(docx_file)
html = result.value # The generated HTML
messages = result.messages # Any messages, such as warnings during conversion

iTextSharp convering database HTML content to an open document

I have a process that converts "text" content to PDF. All "text" is stored in a database.
We have recently added HTML content to the UI and database. So, I need to modify the pdf generation process to, "on the fly," convert HTML to PDF. I would like to use the XMLParser, but all of the examples show opening a new document - the process I need is to convert HTML to a PDF paragraph to push into an open document. Any ideas would be appreciated.

Problems with Parsing

How do I parse an html/txt file using only strtok and/or strsep?
I'm trying that saves the text parts of a wikipedia article to a .txt file. The first part of my code allows me to download the article in html form. The thing I should do next is to parse that html file and save it as a txt file.