I've tried it by using mammoth:
import mammoth
result = mammoth.convert_to_html("MyDocument.docx")
print (result.value)
I don't get an HTML, but this strange code:
kbW7yqZoo4h9pYM6yBxX1QFx2pCoPYflXfieIPbtqpT913Vk7OzcZdEk3eO7TbWjvZNTGilsfmRrPwDvB[...]
I've also tried to use docx2html, but I can't install it. When I run pip install docx2html I get this error:
SyntaxError: Missing parentheses in call to 'print'
Mammoth .docx to HTML converter
Mammoth is designed to convert .docx documents, such as those created by Microsoft Word, and convert them to HTML. Mammoth aims to produce simple and clean HTML by using semantic information in the document, and ignoring other details. For instance, Mammoth converts any paragraph with the style Heading 1 to h1 elements, rather than attempting to exactly copy the styling (font, text size, colour, etc.) of the heading.
There's a large mismatch between the structure used by .docx and the structure of HTML, meaning that the conversion is unlikely to be perfect for more complicated documents. Mammoth works best if you only use styles to semantically mark up your document.
The following features are currently supported:
Headings.
Lists.
Customisable mapping from your own docx styles to HTML. For instance, you could convert WarningHeading to h1.warning by providing an appropriate style mapping.
Tables. The formatting of the table itself, such as borders, is currently ignored, but the formatting of the text is treated the same as in the rest of the document.
Footnotes and endnotes.
Images.
Bold, italics, underlines, strikethrough, superscript and subscript.
Links.
Line breaks.
Text boxes. The contents of the text box are treated as a separate paragraph that appears after the paragraph containing the text box.
Comments.
Installation
pip install mammoth
Basic conversion
To convert an existing .docx file to HTML, pass a file-like object to mammoth.convert_to_html. The file should be opened in binary mode. For instance:
import mammoth
with open("document.docx", "rb") as docx_file:
result = mammoth.convert_to_html(docx_file)
html = result.value # The generated HTML
messages = result.messages # Any messages, such as warnings during conversion
You can also extract the raw text of the document by using mammoth.extract_raw_text. This will ignore all formatting in the document. Each paragraph is followed by two newlines.
with open("document.docx", "rb") as docx_file:
result = mammoth.extract_raw_text(docx_file)
text = result.value # The raw text
messages = result.messages # Any messages
You can use pypandoc module for that purpose. See below code
import pypandoc
output = pypandoc.convert_file('file.docx', 'docx', outputfile="file_converted.html")
The issue you're having is probably that mammoth doesn't create legit HTML files, just HTML snippets. Meaning it's missing the and tags.
Some browsers can still render the content from the file since they're advanced enough to do so, but I ran into a similar problem when trying to use the raw output.
A nifty workaround for this is to add this to your code to convert it to proper HTML files:
import mammoth
with open("test.docx", "rb") as docx_file:
result = mammoth.convert_to_html(docx_file)
html = result.value # The generated HTML
messages = result.messages # Any messages,
full_html = (
'<!DOCTYPE html><html><head><meta charset="utf-8"/></head><body>'
+ html
+ "</body></html>"
)
with open("test.html", "w", encoding="utf-8") as f:
f.write(full_html)
Where test.html is whatever the title you gave to your document.
I'm not taking credit for this, I found it here as well, but can't find the source post.
As stated in the documentation:
To convert an existing .docx file to HTML, pass a file-like object to
mammoth.convert_to_html. The file should be opened in binary mode. For
instance:
import mammoth
with open("document.docx", "rb") as docx_file:
result = mammoth.convert_to_html(docx_file)
html = result.value # The generated HTML
messages = result.messages # Any messages, such as warnings during conversion
Related
I'm creating a PDF with a large collection of quotes that I've imported into python with docx2python, using html=True so that they have some tags. I've done some processing to them so they only really have the bold, italics, underline, or break tags. I've sorted them and am trying to write them onto a PDF using the fpdf library, specifically the pdf.write_html(quote) method. The trouble comes with several special characters I have, so I am hoping to encode the PDF to UTF-8. To write with .write_html(), I had to create a new class as shown in their readthedocs under the .write_html() method at the very bottom of the left hand side:
from fpdf import FPDF, HTMLMixin
class htmlFPDF(FPDF, HTMLMixin):
pass
pdf = htmlFPDF()
pdf.add_page()
#set the overall PDF to utf-8 to preserve special characters
pdf.set_doc_option('core_fonts_encoding', 'utf-8')
pdf.write_html(quote) #[![a section of quote giving trouble with quotations][2]][2]
The list of quotes that I have going into the pdf all appear with their special characters and the html tags (<u> or <i>) in the debugger, but after the .write_html() step they then show up in the pdf file with mojibake, even before being saved, as seen through debugger. An example being "dayâ€ÂTMs demands", when it should be "day's demands" (the apostrophe is curled clockwise in the quote, but this textbox doesn't support).
I've tried updating the font I use by
pdf.add_font('NotoSans', '', 'NotoSans-Regular.ttf', uni=True)
pdf.set_font('NotoSans', '', size=12)
added after the .add_page() method, but this doesn't change the current font (or fix mojibake) on the PDF unless I use the more common .write(text_height, quote) method, which renders the underline/italicize tags into the PDF as text. The .write() method does preserve the special characters. I'm not trying to change the font really, but make sure that what's written onto the PDF preserves the special characters instead of mojibake them.
I've also attempted some .encode/.decode action before going into the .write_html(), as well as attempted some methods from the ftfy library. And tried adding '' to the start of each quote to no effect.
If anyone has ideas for a way to iterate through each line on the PDF that'd be terrific, since then I could use ftfy to fix the mojibake. But ideally, it would be some other html tag at the start of each quote or a way to change the font/encoding of the .write_html() method, maybe in the class declaration?
Or if I'm at a dead-end and should just split each quote on '<', use if statements to detect underlines, italicize, etc., and use the .write() method after all.
Extract docx to html works really bad with docx2python. I do this few month ago. I recommend PyDocX. docx2python are good for docx file content extracting, not converting it into a html.
Like this question, extract text from xml tags in an XML file using apach tika parser
I want to extract all text from text based files, including tagged content, the tags themselves, and other text in XML/HTML elements.
I've tried XML (application/xml), and HTML (text/html) and seen that AutoDetectParser returns less than the full text content.
I've also tried YAML (text/plain), and JSON (text/plain) which do return the full text content.
I understand that I can't do XML or HTML using the AutoDetectParser. What I can't find documented is a list of what types of files would need special handling.
To get full text content (even if that means a complete 'raw' copy of the file):
1. What Mimetypes should be parsed using a TXTParser?
2. What Mimetypes should be parsed using other parsers?
Basically, I'm asking what Mimetypes does the AutoDetectParser return less than the full text content?
Thanks
EDIT
My use case is to be able to extract text and metadata from a wide variety of input file formats including txt, xml, html, doc(x), ppt(x), pdf, ...
Essentially, I want to be able to handle any file type Tika can handle.
I am using code like this
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
try (InputStream stream = new FileInputStream(fileToExtract)){
parser.parse(stream, handler, metadata, context);
} catch ... {
}
I see the same results for XML files as the question referenced above.
What I am trying to find out is: where is it documented when the combination of AutoDetectParser and BodyContentHandler will return less than the full text of the input file.
When, or for what Mimetypes, do I need to switch the Parser and/or ContentHandler?
I don't see this information clearly documented, and I am hoping to avoid a trail and error approach.
I do the following:
from docx import Document
document = Document('text.docx')
document.paragraphs[42].text
And it gives me '' whatever number I enter, and for loop to find and replace a word does not work. But if I save the document with document.save('text2.docx'), the document is not empty.
The document is relatively big and contains many different formatting, images, tables, styles.
My task is to find and replace a word in docx document with some correction of the following word, so I will be glad, if you suggest another tool
I ran into this problem and was able to read the document using docx2txt: https://pypi.org/project/docx2txt/
I am trying to parse a 16Mb html file using lxml. My actual task is to get all the doc tags and for each doc tag if the value of docno tag matches my doc list I extract the content of doc tag.
self.doc_file_list is a list containing paths of such 16Mb files that I need to parse.
file is absolute path of the file.
This is the code I am using currently
for file in file(self.doc_file_list,'r'):
tree = etree.parse(file.strip(), parser)
doc = tree.findall('.//doc')
for elem in doc:
docno = elem.find('.//docno').text
if docno in self.doc_set:
print >> out, etree.tostring(elem)
I checked the content of tree using etree.tostring(tree) and it does not parse the complete file and only parses some kb of the actual file.
Note: I am not getting any error message but the parsed content of tree is incomplete so I am not able to get the whole list.
I was finally able to solve this problem. I checked the tree generated and it was not parsing the whole document. This is because the document was heavily broken. You can check this information on the link: lxml.de/parsing.html (removed http as stackoverflow did not let me add more than 2 links).
This issue of broken html document can be resolved using one of the following two approaches:
1. Instead of using html parser you can either use ElementSoup provided by lxml. It uses BeautifulSoup parser to handle broken html docs. Link: http://lxml.de/lxmlhtml.html
Note: This approach did not work out for me.
2. Another approach is to directly use BeautifulSoup directly and using the parsers provided by it. There are many parser options provided and you need to find out which one suits you the best. For me, html.parser worked.
Link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
Thanks all for the help.
This is kind of a corner case. I'm running Haskell, Text.XmlHtml (version 0.2.3). I'm getting my source data from Pandoc (version 1.12). My source files are all in Markdown format.
The corner deals with when I have raw Html directly in my Markdown file. This is, of course, supported by the Markdown format, and sometimes is the only way for me to get the kind of Table layout that I want. Pandoc reads the file just file, but then when it gets to the Html section, what it emits is roughly like this:
[ RawInline (Format "html") "<a href=\"abcdefg\">"
, RawInline (Format "html") "<img src=\"image.png\" />"
, RawInline (Format "html") "</a>" ]
So... converting this into a hierarchical tree could get very complicated. The desired result, in XmlHtml would be something like this:
Element "a" [("href", "abcdefg")] [Element "img" [("src", "image.png")]]
But that is very difficult to get when I'm dealing with a structure that was hierarchical (everything else Pandoc emits is nicely hierarchical) and suddenly is not, but that "not hierarchical" part is only findable by basically building an Html parser. That works on multiple strings that surround other structures.
ideally, I would like to emit is a simple TextNode:
TextNode "<img src=\"image.png\" />"
I could do that either by emitting a bunch of TextNodes, one for each RawInline, or by glomming together the RawInline elements. The point is that I want to emit a TextNode that has raw Html in it and have that ultimately rendered without any extra Html escaping.
My renderer is ultimately a Heist snippet, but that probably means it runs by way of Blaze.
My final alternative, which might work, is to go from Pandoc through the Blaze Html renderer and then through the XmlHtml parser to get something that I can embed into a Heist snippet. I'd just like to avoid that because it feels dirty.
(I think I would actually run into the same problem if I wanted to put Java script into my Markdown documents... which is technically allowed by the language but probably very evil.)
Is there a way to do this, or am I too limited by my tools?
Update
I tried the route of rendering from Pandoc to Blaze to XmlHtml. Turns out that I get the same result, with the Html put into the final nodes in escaped from and thus appearing in the browser. Here was my function (which was much shorter and easier than the full implementation I'd done...)
pandocToHtml :: Pandoc.Pandoc -> [XmlHtml.Node]
pandocToHtml = Text.Blaze.Renderer.XmlHtml.renderHtmlNodes . Pandoc.writeHtml Pandoc.def
Pandoc.def includes all of the "allow_raw_*" extensions, including allow_raw_html.
Final thing I can think to do is to apply my own piecemeal html parser (and then maybe contribute it to Pandoc). Which, in the end, couldn't be horribly hard.
The only ways to do this are to either construct the nodes yourself like this:
Element "a" [("href", "abcdefg")] [Element "img" [("src", "image.png")]]
...or run your markup through the parser. This is by design. The contents of TextNode will always be escaped. XmlHmtl is not designed for pandoc style markdown. It is designed for XML and HTML. So you have to get your documents into that format first. It seems to me like you should be able to use pandoc to render the markdown to HTML and then run the XmlHtml parser on that.
XmlHtml does have a mechanism for interpreting certain portions of a document as raw text (as required by the HTML spec). You can see which tags are interpreted as raw text here. In 0.2.2 we updated XmlHtml to give the user more control over what gets treated as raw text. If you want a node to be treated as raw, just add the xmlhtmlRaw attribute to the tag. If you want a node that gets treated as raw by default to not be treated as raw, then add the xmlhtmlNotRaw attribute.
I'm not sure why renderHtmlNodes . writeHtml def didn't work for you. It seems like that should work. If it didn't work, I think Pandoc's writeHtml may be buggy. Since that didn't work, you might try parseHtml . writeHtmlString (pseudocode).