MathType Word Document export using MathPage MathML - html

I need to convert MS Word 2003 documents to HTML, with MathML for any math equations. The quickest solution I have found so far is the MathType add-in, which can export the whole document to HTML with MathML through its "Publish to MathPage" function.
However, it doesn't do the conversion properly. Most of the equations in the document are still images instead of MathML. The strange thing is that it converts the commas into MathML, but not the equations.
The original word document:
https://dl.dropbox.com/u/4625393/test12.doc
The key part of the converted html source:
https://gist.github.com/katat/5091021
Is this a bug in MathType?

Kata, I'm not sure what versions of Word and MathType you are using, but I was able to successfully create the MathPage with MathML. I am using Word 2013 and MathType 6.9. This is the page I created: http://dl.dropbox.com/u/17008533/187.xht
Not sure what could have gone wrong with yours. It does seem that you chose an appropriate "target" for the MathPage; it looks like you chose XHTML+MathML.
If you can give me some more details about what steps you're taking from start to finish, I'll try to help more. Also let me know what versions of the software you're using.


convert docx with (ordered) list to html

I'm trying to convert a large docx document with a multi-level ordered list to HTML. (See an example of the document here: http://docdro.id/X1oyfBv You should download it.)
I have tried the following:
online converters such as html-cleaner and index.html (which only recognize one level of the list)
save as HTML, which creates a horrendous file but still doesn't recognize the ol structure
saving the file as a zip and then opening the XML file, but I don't see an easy way to get the ol structure out of the w:... tags
saving it to google docs and running Omar Alzabir's script
http://omaralzabir.com/wp-content/uploads/2014/05/GoogleDocsEmail.jpg
By the way, if I create a new Word file with a multi-level ordered list and convert it, the list is recognized as ol's. But the list in the existing file is not recognized as ol's, even if I 'un-list' it and list it again. So possibly there is something wrong with how the original document was created (?)
Any suggestions are much appreciated :) as are indications of why this problem occurs.
Are you asking how to save a Word-doc in HTML format, with multi-level ordered-lists?
Word-HTML has bugs in its multi-level ordered lists. For the list-items, the indentation tends to be incorrect and inconsistent. There's an example here.
Word-HTML has similar bugs in its multi-level unordered lists. An example is here.
I recently wrote a Python program that fixes these bugs, in Word's HTML. The program is part of WordWebNav (WWN), which is free and open-source.
WWN is an app that converts a Microsoft-Word document to a usable web-page. It adds some missing features in the Word-HTML web-page (e.g., a navigation pane), and it fixes bugs in the Word-HTML.
You can use pandoc: https://github.com/jgm/pandoc
It is an open-source, general-purpose command-line tool for converting documents between markup formats.
You can use it like this:
pandoc -o output.html input.docx
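If a converter still flattens the list, it may help to check how the numbering is stored in the file itself. The following is a minimal Python sketch, not a converter: it assumes each list paragraph carries its own w:numPr element (with w:ilvl for the nesting level and w:numId for the list it belongs to) in word/document.xml. The function name list_levels is made up for this illustration. Paragraphs that get their numbering only through a paragraph style would not show up here, which could be one reason an existing document is not recognized as ol's.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def list_levels(docx_path):
    """Return (level, numId, text) for each paragraph with direct list numbering.

    Hypothetical helper for inspection only; docx_path may be a path or a
    file-like object.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    items = []
    for para in root.iter(f"{{{W_NS}}}p"):
        num_pr = para.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            continue  # no direct numbering on this paragraph
        ilvl = num_pr.find("w:ilvl", NS)
        num_id = num_pr.find("w:numId", NS)
        # A missing w:ilvl is treated as the top level (an assumption).
        level = int(ilvl.get(f"{{{W_NS}}}val")) if ilvl is not None else 0
        list_id = num_id.get(f"{{{W_NS}}}val") if num_id is not None else None
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        items.append((level, list_id, text))
    return items
```

Printing the result with indentation proportional to the level gives a quick view of the nesting that Word has stored, which can then be compared with what the converter produced.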

Losing superscript tag when converting HTML to DOCX using libreoffice

I have the following HTML:
<html><body><p>n<sup>th</sup></p></body></html>
I am using the command:
$ libreoffice --convert-to docx:"MS Word 2007 XML" test.html
to convert that HTML into a DOCX file. However, I notice that the resulting DOCX file does not actually mark the text as superscript the way <sup> does. It looks like it is using position and size to imitate the <w:vertAlign> tag:
<w:position w:val="8"/><w:sz w:val="19"/>
What I need to know is how to make LibreOffice write the <w:vertAlign> tag instead of using position and size.
Additional info:
I had a similar problem with bold and italics (<strong><em>) but was able to get the conversion to work correctly if I converted the strong and em tags to b and i tags respectively.
If you are looking to edit the HTML, it would be much better to use a tool that is suited for editing HTML, such as Notepad++ or Sublime (as examples).
If you need to have the HTML as a LibreOffice document for a specific reason, you could open the HTML file in Notepad and save as a text file with .txt as the extension. That should allow you to open the document in LibreOffice.
You can try using a WYSIWYG (What You See Is What You Get) editor like TinyMCE (http://www.tinymce.com/). There are lots of them online, and you can also find some desktop applications for that. But if you want to convert to docx, you can try http://htmltodocx.codeplex.com/ which is written in PHP, uses PHPWord, and is quite efficient.
Just create a Python script that replaces your unwanted tags with the <w:vertAlign> tag wherever needed.
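A rough sketch of such a script is below. It assumes the XML has already been read from word/document.xml inside the DOCX (which is a zip archive) and that LibreOffice writes the raised text exactly as quoted in the question: a positive w:position immediately followed by w:sz. The function name use_vert_align is made up for this example, and other run properties such as w:szCs are left untouched, so check the result in Word before relying on it.

```python
import re

# Matches the pair quoted in the question: a positive (raised) position
# immediately followed by a font size. This exact layout is an assumption.
RAISED_RUN = re.compile(r'<w:position w:val="[1-9]\d*"/>\s*<w:sz w:val="\d+"/>')


def use_vert_align(document_xml):
    """Swap each position+size pair for a superscript w:vertAlign element."""
    return RAISED_RUN.sub('<w:vertAlign w:val="superscript"/>', document_xml)
```

Runs that only set a size, without a raised position, are not changed.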
The command works fine if you replace 'docx' with 'xml', like this:
libreoffice --convert-to xml:"MS Word 2003 XML" test.html

Fixing malformed html that html tidy doesn't fix

Okay, so I've been using HTML Tidy to convert regular HTML web pages into XHTML suitable for parsing. The problem is that the test page I saved in Firefox apparently had its HTML somewhat pre-cleaned by Firefox during saving; call this file F. HTML Tidy works fine on file F, but fails on the raw data written to a file via .NET (file N). HTML Tidy complains about form tags being intermixed with table tags. The HTML isn't mine, so I can't just fix the source.
How do I clean up file N enough that it can be run through HTML Tidy? Is there a standard way of hooking into Firefox (completely programmatically, without having to use the mouse or keyboard), or another tool that will apply extra fixes to the HTML?
I had been using HTML tidy for some time, but then found that I was getting better results from TagSoup.
It can be used as a JAXP parser, converting non-wellformed HTML on the fly. I usually let it parse the input for Saxon XQuery transformations.
But it can also be used as a stand-alone utility, as an executable jar.
I wound up using SendKeys in C# and importing functions from user32.dll to set Firefox as the active window after launching it to the website I wanted (file:///myfilepathhere/).
SendKeys seemed to require running a windowed program, so I also added another executable which performs actions in its form_load() method.
By using alt+f, down six times, enter, wait for a bit, type full path file name, enter (twice) and then killing firefox, I was able to automate firefox's ability to clean some html up.

How to view xsd:documentation that is in HTML markup?

I am generating WSDL/XSD for SOAP services from a UML model using IBM Rational Software Architect (RSA). RSA allows you to document the classes and attributes in the model using rich-formatting.
For example, I have the following documentation on a Trailer class:
A wheeled Vehicle that is designed for towing by another
Vehicle. Known subtypes include:
Caravan
BoxTrailer
BoatTrailer
When the UML model is transformed to WSDL/XSD (using the out-of-the-box UML to WSDL transform), the formatting is preserved as HTML markup inside the xsd:documentation element:
<xsd:complexType name="Trailer">
<xsd:annotation>
<xsd:documentation>&lt;p&gt;
A&amp;nbsp;wheeled &lt;strong&gt;Vehicle&lt;/strong&gt; that is designed for&amp;nbsp;towing by another &lt;strong&gt;Vehicle.&lt;/strong&gt; Known
subtypes include:&amp;nbsp;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Caravan&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BoxTrailer&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BoatTrailer&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;</xsd:documentation>
</xsd:annotation>
</xsd:complexType>
Unfortunately, this is really hard to read and I've been searching (with no luck) for a program that can view WSDL/XSD with documentation in HTML markup.
XmlSpy 2008 can't do it, RSA can't do it (which is a bit surprising, as it generated the XSD in the first place), neither can any web browser I've tried.
I did write a JET template that extracted the documentation from the model and outputted to HTML, and I could probably write some XSLT to do something similar from the XSD, but I was hoping there's a program out there (ideally free) that could view the documentation as HTML.
Essentially, I'd like to be able to tell the consumers of our web service that they can view the WSDL in X program if they want to read the documentation - does anybody know the best solution to this?
Edit:
Thanks for the suggestions, but I think I have a solution! I didn't realise that RSA can export a WSDL to HTML (right-click on WSDL, export, HTML). The generated HTML has a graphical view of each schema element, the documentation for each element, as well as the original source, and everything is hyperlinked together.
Most importantly, the documentation is richly-formatted again! One small caveat is that the &nbsp;'s appear in the HTML output. This seems to be because the ampersand is escaped in the HTML:
&amp;nbsp;
Instead it should be
&nbsp;
I will update my model-to-model transform to ensure that the &nbsp;'s are replaced with real spaces (I don't believe I'll need non-breaking spaces in the documentation), so the generated WSDL/XSD won't ever have them.
I highly doubt that the standard XML/XSD editors can interpret the HTML tags and generate appropriate documentation. Oxygen XML Editor does a decent job of understanding and converting the XML entities (like &lt; etc.), but HTML tags and entities are left as is. Below is the screen shot in design view.
The type of <xs:documentation> is <xs:any>, so you should actually be able to include your documentation without escaping the markup, provided that it is a well-formed XHTML fragment instead of HTML. I guess some XML Schema tools would be able to interpret the embedded XHTML and show it as formatted text.
Do note that if the markup is not escaped, it absolutely must be a well-formed XML fragment, or the documentation element will cause your schema to be malformed. This also applies to HTML entities! If the documentation contains an (unescaped) entity reference (other than the 5 predefined XML entities), then your schema must either contain an external DTD reference or have an embedded DTD that defines the replacement text of that entity. In your case the documentation contains an &nbsp; entity reference. Probably the easiest fix is to replace such entities with the corresponding Unicode character/text or with character references (use &#160; for &nbsp;).
If you have a chance, try to include the documentation without escaping the markup and make sure that it will be well formed. Otherwise you probably need to process the documentation twice: 1) parse the schema and extract documentation 2) parse the documentation text again (possibly as HTML, not XML).
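The two-pass approach could look roughly like the Python sketch below. It assumes the markup inside xsd:documentation is escaped (so the XML parser hands it back as one HTML string) and that each annotation sits directly under a named xsd:complexType. TextCollector and documentation_text are names invented for this example, and the output is plain text with "- " bullets rather than rendered HTML.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XSD_NS = "http://www.w3.org/2001/XMLSchema"


class TextCollector(HTMLParser):
    """Second pass: read the documentation string as HTML and keep the text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &nbsp; into U+00A0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.parts.append("\n- ")  # start each list item on its own line

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        # Non-breaking spaces become ordinary spaces in the plain-text view.
        return "".join(self.parts).replace("\xa0", " ").strip()


def documentation_text(xsd_source):
    """First pass: parse the schema and map type names to documentation text."""
    root = ET.fromstring(xsd_source)
    result = {}
    for ctype in root.iter(f"{{{XSD_NS}}}complexType"):
        doc = ctype.find(f"{{{XSD_NS}}}annotation/{{{XSD_NS}}}documentation")
        if doc is None or not doc.text:
            continue
        collector = TextCollector()
        collector.feed(doc.text)
        collector.close()
        result[ctype.get("name")] = collector.text()
    return result
```

Using an HTML parser for the second pass, rather than an XML parser, means entities such as &nbsp; do not need a DTD to be understood.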
I've tried this with the latest build of QTAssistant and it shows like this in the Schema Help Panel only; I've put a feature request for the grid view, as well as the documentation generator to work the same. Is this what you're expecting?
The help panel shows the annotation of the schema object that is selected in the Graph/Diagram view. To display the help panel press F1.
This issue is fixed in RSA 8.0.4 - which now supports exporting to WSDL/XSD with plain text (as well as an option to sort the schema by type, then name alphabetically!).
To view the documentation in a WSDL/XSD generated from a UML model in prior versions of RSA, the easiest solution is to export the WSDL/XSD as HTML using RSA. You can do this by right-clicking on the WSDL/XSD, selecting export, then selecting HTML.
The generated HTML has a graphical view of each schema element, the documentation for each element, as well as the original source, and everything is hyperlinked together.
Most importantly, the documentation (that's virtually unreadable in the WSDL/XSD) is richly-formatted again! One small caveat is that the &nbsp;'s that RSA's documentation editor inserts also appear in the HTML output. This seems to be because the ampersand is not only escaped in the WSDL/XSD (which is good), but also in the HTML (bad!):
&amp;nbsp;
Instead it should be
&nbsp;
A simple workaround to this is to replace all &nbsp;'s in the WSDL/XSD with real spaces before generating the HTML.
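As a small Python sketch of that replacement, assuming the entity appears in the WSDL/XSD source in its escaped form (&amp;nbsp;) and that the file content has already been read into a string; the function name is made up for this example:

```python
def strip_escaped_nbsp(xsd_text):
    """Replace each escaped non-breaking-space entity with an ordinary space."""
    return xsd_text.replace("&amp;nbsp;", " ")
```

Running it over the file before the HTML export should leave the rest of the escaped markup untouched.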

My Browser won't interpret "ΧΨ" when I load the website I'm building

I pretty much built this website in firebug, then when I copied the code into a text document and tried loading it, firefox wouldn't interpret the "ΧΨ" in the source. However, it does a fantastic job using them while I'm typing this.
Wassup wid dat?
You can't just type a character into an HTML tag; it must be a valid character, and if it isn't, use the proper character reference. See this list:
http://htmlhelp.com/reference/html40/entities/symbols.html
You can use Entity, Decimal, or Hex to represent your character like this:
<p>&Chi;&Psi;</p>
That's the HTML representation of "ΧΨ"
Cheers
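For longer text, the decimal or hex references can be generated instead of looked up by hand. This is a minimal Python sketch; the function name to_char_refs is made up for this example, and it simply converts every non-ASCII character to a numeric character reference.

```python
def to_char_refs(text, hexadecimal=False):
    """Encode every non-ASCII character as a numeric character reference."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)  # plain ASCII stays as typed
        elif hexadecimal:
            out.append(f"&#x{code:X};")
        else:
            out.append(f"&#{code};")
    return "".join(out)


# Greek capital Chi (U+03A7) and Psi (U+03A8)
print(to_char_refs("<p>\u03a7\u03a8</p>"))        # <p>&#935;&#936;</p>
print(to_char_refs("<p>\u03a7\u03a8</p>", True))  # <p>&#x3A7;&#x3A8;</p>
```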