Convert docx with (ordered) list to HTML

I'm trying to convert a large docx document containing a multi-level ordered list to HTML. An example of the document is at http://docdro.id/X1oyfBv (you will need to download it).
I have tried the following:
online converters such as html-cleaner and index.html (which only recognize one level of the list)
Save as HTML, which creates a horrendous file and still doesn't recognize the ol structure
renaming the file to .zip and opening the XML file inside, but I don't see an easy way to get the ol structure out of the w:... tags
saving it to google docs and running Omar Alzabir's script
http://omaralzabir.com/wp-content/uploads/2014/05/GoogleDocsEmail.jpg
By the way, if I create a new Word file with a multi-level ordered list and convert it, the lists are recognized as ol's. But the lists in the existing file are not recognized as ol's, even if I 'un-list' them and list them again. So possibly there is something wrong with how the original document was created (?)
Any suggestions much appreciated:) Or indications as to why this problem occurs
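For the XML route mentioned above, here is a minimal Python sketch of how the list nesting could be read from word/document.xml. It assumes each list paragraph carries its own w:numPr element with a w:ilvl child (the zero-based list level). The function names are made up for illustration. The sketch ignores w:numId (which list a paragraph belongs to), the number format defined in word/numbering.xml, and plain paragraphs that sit between list items. One possible reason a document's lists go unrecognized is that the numbering is attached through a paragraph style in word/styles.xml instead of the paragraph itself, in which case this sketch would report no level either.

```python
import html
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def list_levels(document_xml):
    """Return (level, text) per paragraph; level is None for non-list paragraphs."""
    root = ET.fromstring(document_xml)
    items = []
    for p in root.iter(f"{{{W}}}p"):
        ilvl = p.find("w:pPr/w:numPr/w:ilvl", NS)
        level = int(ilvl.get(f"{{{W}}}val")) if ilvl is not None else None
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        items.append((level, text))
    return items


def to_nested_ol(items):
    """Build nested <ol> markup from (level, text) pairs."""
    out, depth = [], -1
    for level, text in items:
        if level is None:
            continue  # not a list paragraph (simplification: it does not end the list)
        if level > depth:
            while depth < level:
                out.append("<ol>")
                depth += 1
                if depth < level:
                    out.append("<li>")  # placeholder item for a skipped level
        else:
            while depth > level:
                out.append("</li></ol>")
                depth -= 1
            out.append("</li>")
        out.append("<li>" + html.escape(text))
    out.extend(["</li></ol>"] * (depth + 1))
    return "".join(out)
```

To try it on a real file, the XML can be read with zipfile.ZipFile("input.docx").read("word/document.xml") and passed to list_levels; the file name is a placeholder.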

Are you asking how to save a Word-doc in HTML format, with multi-level ordered-lists?
Word-HTML has bugs in its multi-level ordered lists. For the list-items, the indentation tends to be incorrect and inconsistent. There's an example here.
Word-HTML has similar bugs in its multi-level unordered lists. An example is here.
I recently wrote a Python program that fixes these bugs, in Word's HTML. The program is part of WordWebNav (WWN), which is free and open-source.
WWN is an app that converts a Microsoft-Word document to a usable web-page. It adds some missing features in the Word-HTML web-page (e.g., a navigation pane), and it fixes bugs in the Word-HTML.

You can use pandoc: https://github.com/jgm/pandoc
It is an open-source, universal command-line tool for converting documents between markup formats.
You can use it like this:
pandoc -o output.html input.docx

How to change format of Warning admonition or add Caution in Sphinx HTML output

This seems like it should be straightforward but I've been prowling the documentation and web and haven't found the answer.
I want to output HTML doc from Sphinx. Ideally I'd like to have three levels of "note" type highlighted text boxes. ReST defines several "admonitions": (http://docutils.sourceforge.net/docs/ref/rst/directives.html#admonitions) but most of the Sphinx HTML themes include special formatting only for Note and Warning. (I am using one of the preinstalled themes, Classic.)
I have two questions:
1) How can I customize the color behind Warning in my documents?
2) How can I add a formatting style for Caution?
I see that these all end up with tags like <div class="admonition warning"> ... in the HTML output. But I can't find where the formatting for that class is defined. Is it in a stylesheet? Is it in a layout.html file or some other file?
Is there anything that explains how the various files in themes actually interact with each other? I haven't found a good primer. (I am no expert on css-based HTML either, so maybe that's part of the problem.)
Okay, I figured out more and have a working workaround. (I'm still not sure how I'm supposed to handle this.)
Looks like my HTML code is reading directly from a few cascading stylesheets stored along with the output in a directory called _static. There's classic.css, which inherits from basic.css.
I don't understand how these relate to the files named like basic.css_t that live in the Python Sphinx install.
To change things, should I (A) try altering the _t files? or (B) create an altered local copy of classic.css that lives in my source directory?
If I go with B, more questions.
Will it be overwritten by the values in the css_t template at build time? (I guess this is easy enough to test)
Is it good practice to use the same filename for a modified version of that stylesheet?
Here's a workaround that avoids those questions and seems to be doing what I want - from this: https://github.com/snide/sphinx_rtd_theme/issues/117
I created an override stylesheet that includes just the formatting I want to change.
I stored it in the _static of my source directory.
I defined it in my conf.py as follows:
html_context = {
    'css_files': [
        '_static/theme_overrides.css',
    ],
}
Now, that github discussion said that this wasn't a solution for all kinds of themes (including the RTD theme mentioned in the question) but I think I'm safe for now.
What more should I know?
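One thing that may help: Sphinx 1.8 and later has a dedicated html_css_files setting for adding stylesheets, so html_context is not needed. Below is a sketch of a conf.py fragment that assumes such a version. The file name theme_overrides.css is taken from the workaround above, and the CSS rules in the comments use placeholder colors. As far as I understand, the *.css_t files in the Sphinx install are templates that get rendered into _static at build time, and files from html_static_path are copied afterwards, so a local file with the same name as a theme stylesheet would replace it, not the other way round.

```python
# conf.py (fragment): a sketch, assuming Sphinx 1.8 or later

# Directories (relative to conf.py) whose contents are copied into the
# build output's _static directory.
html_static_path = ["_static"]

# Extra stylesheets to link from every page, given relative to html_static_path.
html_css_files = ["theme_overrides.css"]

# _static/theme_overrides.css could then hold rules such as
# (selectors match the <div class="admonition ..."> output; colors are placeholders):
#   div.admonition.warning { background-color: #ffe4b5; }
#   div.admonition.caution { background-color: #fff3cd; border: 1px solid #e0c060; }
```

Keeping only the changed rules in a separate override file avoids having to maintain a full modified copy of classic.css.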

Convert PDF into small chunks of data (many chunks per page)?

I have a PDF file and I need to get small pieces of data from it.
It is structured like this:
Page1:
Question 1
......................................
......................................
Question 2
......................................
......................................
Page End
I want to get Question 1 and Question 2 as separate html files, which contain text and image.
I've tried
pdftohtml -c pdffile.pdf output.html
I got files with PNG images, but how do I cut each image into smaller chunks that fit the size of each question (I want to separate each question into an individual file)?
P.S. I have a lot of PDF files, so a command-line tool would be nice.
I'll try to outline how I would go about it. You mention that every page in your PDF document might have multiple questions, and you basically want to have one HTML file for every question.
It's great if pdftohtml works for you, but I also found another decent command line utility that you might want to try out.
OK, so assuming you have an HTML file converted from your original PDF, you could use csplit or awk to split it into multiple files, using 'Question' as the delimiter in your case. (Side note: csplit and awk are Linux-specific utilities, but I'm sure there are alternatives if you are on Windows or a Mac. I haven't specifically tried the following code.)
From a relevant SO Post :
csplit input.txt '/^Question/' '{*}'
awk '/Question/{filename=NR".txt"}; {print >filename}' input.txt
So, assuming this works, you will have a couple of broken html files. Broken because they'll be unsanitized due to dangling < or > or some other stray HTML elements after the splitting.
So you could start by saving the initial .html as .txt, removing the html, head and body elements, and studying the general structure the program produces when it converts the PDF into HTML. I'm sure you'll see a pattern in how the string 'Question' is wrapped in an element, which is something you can then handle. That is why I mention .txt files in the code snippets.
You will basically have a bunch of text files with just the content html and not the usual starting tags for an html file because we removed that initially. Then it's only a matter of reading each file, just taking care of the element that surrounds the string 'Question' and adding the html, head and body elements around the content and saving them as .html files. You could do this in any programming language of your choice that supports file reading and writing (would be a fun exercise)
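As a rough illustration of that last step, here is a minimal Python sketch. It assumes the body-level HTML has already been extracted, that each question starts on a line containing 'Question' followed by a number, and that no element spans the boundary between two questions. The function names and the pattern are made up for illustration, and the sketch only handles the text, not the cropping of page images.

```python
import re

# Hypothetical marker: the word "Question" followed by a number.
QUESTION = re.compile(r"Question\s+\d+")


def split_questions(body_html):
    """Split body-level HTML into chunks, each starting at a 'Question N' line."""
    chunks, current = [], None
    for line in body_html.splitlines():
        if QUESTION.search(line):
            if current is not None:
                chunks.append(current)
            current = [line]
        elif current is not None:
            current.append(line)
        # Lines before the first question are dropped.
    if current is not None:
        chunks.append(current)
    return ["\n".join(chunk) for chunk in chunks]


def wrap(chunk, title):
    """Put the html, head and body elements back around one chunk."""
    return (
        "<!DOCTYPE html>\n<html>\n<head><meta charset=\"utf-8\">"
        f"<title>{title}</title></head>\n<body>\n{chunk}\n</body>\n</html>\n"
    )
```

Each wrapped chunk could then be written to its own file, for example question1.html, question2.html and so on.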
I hope this gets you started in the right direction.

MathType Word Document export using MathPage MathML

I need to convert MS Word 2003 documents to HTML, with MathML included where there are math equations. The quick solution I have found so far is to use the MathType add-in to export the whole document to HTML with MathML using its "Publish to MathPage" function.
However, it doesn't do the conversion properly. Most of the equations in the document are still in image format instead of MathML. The strange thing is that it converts the commas into MathML, but not the equations.
The original word document:
https://dl.dropbox.com/u/4625393/test12.doc
The key part of the converted html source:
https://gist.github.com/katat/5091021
Is this a bug in MathType?
Kata, I'm not sure what versions of Word and MathType you are using, but I was able to successfully create the MathPage with MathML. I am using Word 2013 and MathType 6.9. This is the page I created: http://dl.dropbox.com/u/17008533/187.xht
Not sure what could have gone wrong with yours. It does seem that you chose an appropriate "target" for the MathPage; it looks like you chose XHTML+MathML.
If you can give me some more details about what steps you're taking from start to finish, I'll try to help more. Also let me know what versions of the software you're using.

How to view xsd:documentation that is in HTML markup?

I am generating WSDL/XSD for SOAP services from a UML model using IBM Rational Software Architect (RSA). RSA allows you to document the classes and attributes in the model using rich-formatting.
For example, I have the following documentation on a Trailer class:
A wheeled Vehicle that is designed for towing by another
Vehicle. Known subtypes include:
Caravan
BoxTrailer
BoatTrailer
When the UML model is transformed to WSDL/XSD (using the out-of-the-box UML to WSDL transform), the formatting is preserved as HTML markup inside the xsd:documentation element:
<xsd:complexType name="Trailer">
  <xsd:annotation>
    <xsd:documentation>&lt;p&gt;
      A&amp;nbsp;wheeled &lt;strong&gt;Vehicle&lt;/strong&gt; that is designed for&amp;nbsp;towing by another &lt;strong&gt;Vehicle.&lt;/strong&gt; Known
      subtypes include:&amp;nbsp;
      &lt;/p&gt;
      &lt;ul&gt;
      &lt;li&gt;
      &lt;strong&gt;Caravan&lt;/strong&gt;
      &lt;/li&gt;
      &lt;li&gt;
      &lt;strong&gt;BoxTrailer&lt;/strong&gt;
      &lt;/li&gt;
      &lt;li&gt;
      &lt;strong&gt;BoatTrailer&lt;/strong&gt;
      &lt;/li&gt;
      &lt;/ul&gt;</xsd:documentation>
  </xsd:annotation>
</xsd:complexType>
Unfortunately, this is really hard to read and I've been searching (with no luck) for a program that can view WSDL/XSD with documentation in HTML markup.
XmlSpy 2008 can't do it, RSA can't do it (which is a bit surprising, as it generated the XSD in the first place), neither can any web browser I've tried.
I did write a JET template that extracted the documentation from the model and outputted to HTML, and I could probably write some XSLT to do something similar from the XSD, but I was hoping there's a program out there (ideally free) that could view the documentation as HTML.
Essentially, I'd like to be able to tell the consumers of our web service that they can view the WSDL in X program if they want to read the documentation - does anybody know the best solution to this?
Edit:
Thanks for the suggestions, but I think I have a solution! I didn't realise that RSA can export a WSDL to HTML (right-click on WSDL, export, HTML). The generated HTML has a graphical view of each schema element, the documentation for each element, as well as the original source, and everything is hyperlinked together.
Most importantly, the documentation is richly formatted again! One small caveat is that the &nbsp;'s appear literally in the HTML output. This seems to be because the ampersand is escaped in the HTML:
&amp;nbsp;
Instead it should be
&nbsp;
I will update my model-to-model transform to ensure that the &nbsp;'s are replaced with real spaces (I don't believe I'll need non-breaking spaces in the documentation), so the generated WSDL/XSD won't ever have them.
I highly doubt that the standard XML/XSD editors can interpret the HTML tags and generate appropriate documentation. Oxygen XML Editor does a decent job of understanding and converting the XML entities (like &lt; etc.), but HTML tags and entities are left as is. Below is the screenshot in design view.
The type of <xs:documentation> is <xs:any>, so you should actually be able to include your documentation without escaping the markup, provided that it is a well-formed XHTML fragment instead of HTML. I guess some XML Schema tools would be able to interpret the embedded XHTML and show it as formatted text.
Do note that if the markup is not escaped, it absolutely must be a well-formed XML fragment or the documentation element will cause your schema to be malformed. This also applies to HTML entities! If the documentation contains an (unescaped) entity reference (other than the 5 predefined XML entities), then your schema must either contain an external DTD reference or have an embedded DTD that defines the replacement text of that entity. In your case the documentation contains an &nbsp; entity reference. Probably the easiest fix is to replace such entities with the corresponding Unicode character/text or with character references (use &#160; for &nbsp;).
If you have a chance, try to include the documentation without escaping the markup and make sure that it will be well formed. Otherwise you probably need to process the documentation twice: 1) parse the schema and extract documentation 2) parse the documentation text again (possibly as HTML, not XML).
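The two-pass approach can be sketched in Python as follows. This assumes the markup inside xsd:documentation is escaped, as in the question, so the XML parser hands back the HTML as plain text. The function name is made up for illustration. If the documentation were embedded as unescaped XHTML, the text would be split across child elements and this sketch would not be enough.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"


def documentation_as_html(schema_xml):
    """Map each named schema component to the HTML text of its documentation."""
    root = ET.fromstring(schema_xml)
    docs = {}
    for component in root.iter():
        name = component.get("name")
        doc = component.find(f"{{{XSD}}}annotation/{{{XSD}}}documentation")
        if name is None or doc is None:
            continue
        # Pass 1 (the XML parser) has already turned &lt;p&gt; back into <p>.
        # What is left are HTML-only entities such as &nbsp;, which are
        # replaced with ordinary spaces here.
        docs[name] = (doc.text or "").replace("&nbsp;", " ").strip()
    return docs
```

The returned strings are HTML fragments, so the second pass (rendering or further parsing as HTML) can be done by writing them into a page and opening it in a browser.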
I've tried this with the latest build of QTAssistant and it shows like this in the Schema Help Panel only; I've put a feature request for the grid view, as well as the documentation generator to work the same. Is this what you're expecting?
The help panel shows the annotation of the schema object that is selected in the Graph/Diagram view. To display the help panel press F1.
This issue is fixed in RSA 8.0.4 - which now supports exporting to WSDL/XSD with plain text (as well as an option to sort the schema by type, then name alphabetically!).
To view the documentation in a WSDL/XSD generated from a UML model in prior versions of RSA, the easiest solution is to export the WSDL/XSD as HTML using RSA. You can do this by right-clicking on the WSDL/XSD, selecting Export, then selecting HTML.
The generated HTML has a graphical view of each schema element, the documentation for each element, as well as the original source, and everything is hyperlinked together.
Most importantly, the documentation (that's virtually unreadable in the WSDL/XSD) is richly formatted again! One small caveat is that the &nbsp;'s that RSA's documentation editor inserts also appear literally in the HTML output. This seems to be because the ampersand is escaped not only in the WSDL/XSD (which is good), but also in the HTML (bad!):
&amp;nbsp;
Instead it should be
&nbsp;
A simple workaround for this is to replace all &nbsp;'s in the WSDL/XSD with real spaces before generating the HTML.

How can I convert an OpenOffice Writer document (.odt) to multiple HTML files with navigation?

I have an OpenOffice Writer document (.odt) with a table of contents, sections, subsections, etc.
Is there a quick way to convert (export) this into multiple HTML files with a navigation sidebar, converting the sections into links?
You can:
Unzip the odt, parse the XML and make the HTML file yourself.
Use OpenOffice to export the document to HTML.
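For the first option, here is a minimal Python sketch of the parsing step. An .odt file is a zip archive whose text lives in content.xml, where headings are text:h elements with a text:outline-level attribute. The function names, the section file names and the CSS class names are made up for illustration, and writing out the per-section pages is left open.

```python
import html
import xml.etree.ElementTree as ET

TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def headings(content_xml):
    """Return (outline level, title) for every heading in content.xml."""
    root = ET.fromstring(content_xml)
    result = []
    for h in root.iter(f"{{{TEXT}}}h"):
        level = int(h.get(f"{{{TEXT}}}outline-level", "1"))
        title = "".join(h.itertext()).strip()
        result.append((level, title))
    return result


def navigation(items):
    """Render a flat sidebar list with one link per heading (file names are hypothetical)."""
    links = [
        f'<li class="level{level}"><a href="section{i}.html">{html.escape(title)}</a></li>'
        for i, (level, title) in enumerate(items, 1)
    ]
    return "<ul>\n" + "\n".join(links) + "\n</ul>"
```

The XML can be obtained with zipfile.ZipFile("document.odt").read("content.xml"); the file name is a placeholder.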
There are several ways to export HTML from OpenOffice or LibreOffice:
Use File > Export, then select file type XHTML. However, this creates one big HTML file, not multiple files.
Use File > Save as, then select file type HTML document. This creates one big HTML file which is similar, but not identical, to the one above.
Use File > Send > Create HTML document. In the following dialog, you can select a style used in the document, based on which the document is split into multiple HTML files. However, I did not get this to work properly. My document is always split on level 1, no matter what I select here.
Use File > Wizards > Web page. You will get multiple settings to choose from. However, this does not work at all for me. It either fails completely or does not produce the expected output.
The last two solutions were found on the OpenOffice Wiki at https://wiki.openoffice.org/wiki/Documentation/OOo3_User_Guides/Getting_Started/Saving_Writer_documents_as_web_pages
As a conclusion, I cannot provide a complete solution. I am still looking for a good way to solve this problem.