Converting .doc/.docx to HTML

I know that Word has an option to save a document as a web page, but the result doesn't suit my needs.
I want my document saved as a web page with a separate folder that includes the images and .css files. Word, however, only includes the images, not a CSS file.
How can I achieve this with Word or some other tool (it needs to be free)?

Given that Java is your preferred language, you could try docx4j [disclaimer: I manage it].
See the CreateHtml sample.
Note: docx4j supports docx only (as opposed to legacy binary .doc; those can be converted using LibreOffice).
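For the legacy .doc case mentioned above, a minimal sketch of driving LibreOffice from a script might look like the following. It assumes the soffice executable is on the PATH; the function names and file names are made up for illustration, and only the --headless, --convert-to and --outdir options of LibreOffice are relied on.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Return the LibreOffice command line that converts doc_path to .docx."""
    return [
        soffice,
        "--headless",            # run without opening a window
        "--convert-to", "docx",  # target format
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_doc_to_docx(doc_path, out_dir):
    """Run the conversion and return the path where the .docx should appear."""
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH (assumed install location)")
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".docx")
```

Calling convert_doc_to_docx("report.doc", "converted") would then leave a .docx in the converted folder, which docx4j can read.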

Related

Confusion about files ending in .thtml: what exactly are they?

I was going over my company's code base and saw a file format I had never seen before. The file names all end in .thtml.
What exactly is .thtml? I was told it is for template files, and every time I view one in VS Code I need to choose a language in the bottom right corner of the IDE (the default was plain text). What is the use of template files in web development? Are they substitutes for .html files?
HTML templates are HTML files enriched with variables, macros or other logic. They need to be preprocessed into ordinary HTML files before they can be viewed in a browser. HTML templates are very useful when you want to create a lot of static HTML files that share the same structure but have different contents.
There are several HTML templating engines out there, of which one happens to be named exactly thtml. (Of course this does not guarantee that it is the one your company uses.)
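To illustrate the idea of preprocessing a template into several ordinary HTML files, here is a small sketch using Python's built-in string.Template. It is not the thtml engine and says nothing about what your company uses; the placeholder names and page contents are invented for the example.

```python
import html
from string import Template

# A hypothetical page template: $title and $body are the variables.
PAGE = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1><p>$body</p></body></html>"
)


def render_page(title, body):
    """Fill the template; values are escaped so they cannot inject markup."""
    return PAGE.substitute(title=html.escape(title), body=html.escape(body))


# One template, several static pages with the same structure.
pages = {
    "about.html": render_page("About", "Who we are."),
    "contact.html": render_page("Contact", "How to reach us."),
}
```

Writing each value of pages to a file of the matching name gives plain HTML that a browser can open directly.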

Automatically convert markdown to html on browser refresh

Paul Irish gave some amazing insight on web tooling at Google I/O 2013. He was presenting slides that had been parsed into HTML from a Markdown source, i.e. a .md file.
However, one thing that surprised me was that when he edited the source Markdown for the slides in the Chrome DevTools Sources panel and hit refresh, the .md was automatically compiled again into the HTML shown in the browser. I understand that the changes he made in Chrome DevTools were also saved to his local file on the computer, but how did the Markdown file automatically get converted into HTML upon saving and refreshing the browser?
I am a complete beginner with Markdown and I would really like to have this functionality. Any help is deeply appreciated.
The whole purpose of markdown is that it is both human readable and machine readable. It is designed to be converted to HTML.
Depending on the language you are using, there are markdown parsers that create HTML for you.
For example, for PHP.
So, as an example, to have your server show the contents of say, homepage.md, your index.php file could have something like this:
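// markdown() is assumed to be provided by the Markdown parser you have included
$filename = basename( $_GET['file'] ); // basename() strips directories, so ?file=../secret cannot escape the folder
$content = markdown( file_get_contents( "path_to_markdown/{$filename}.md" ) );
print $content;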
And, to see it in your browser you would go to example.com/?file=homepage
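The same idea explains the refresh behaviour in the question: if the server converts the .md file on every request, each refresh shows the latest saved edits. Below is a rough sketch in Python using only the standard library. The toy_markdown function is a deliberately tiny stand-in (headings and paragraphs only), not a real Markdown parser, and the folder name, default page and port are assumptions.

```python
import html
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

MARKDOWN_DIR = Path("path_to_markdown")  # hypothetical folder holding the .md files


def toy_markdown(text):
    """Convert a tiny Markdown subset (one-line # headings, paragraphs) to HTML."""
    blocks = []
    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#") and "\n" not in block:
            level = min(len(block) - len(block.lstrip("#")), 6)
            title = block.lstrip("#").strip()
            blocks.append(f"<h{level}>{html.escape(title)}</h{level}>")
        else:
            blocks.append(f"<p>{html.escape(' '.join(block.split()))}</p>")
    return "\n".join(blocks)


class MarkdownHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Map /homepage to path_to_markdown/homepage.md; .name drops any directories.
        name = Path(self.path.split("?")[0]).name or "homepage"
        source = MARKDOWN_DIR / f"{name}.md"
        if not source.is_file():
            self.send_error(404)
            return
        # Re-read and re-convert on every request, so a refresh picks up saved edits.
        body = toy_markdown(source.read_text(encoding="utf-8")).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the development server (blocks until interrupted)."""
    HTTPServer(("localhost", port), MarkdownHandler).serve_forever()
```

After calling serve(), opening http://localhost:8000/homepage would show the converted file, and editing homepage.md followed by a refresh would show the change. In practice you would swap toy_markdown for a full Markdown library.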
I will do my best to answer this.
Markdown is a shorthand syntax that is converted to HTML, which the web browser then renders.
For example, this is taken from Stack Overflow:
The syntax is based on the way email programs
usually do quotations. You don't need to hard-wrap
the paragraphs in your blockquotes, but it looks much nicer if you do. Depends how lazy you feel.
It is like converting a file in Notepad++ from text to HTML: the file is formatted using the basic rules of that particular syntax.
Remember, too, that programs are not mind readers. If the Markdown is not valid, the corresponding HTML won't be either, just as a text file that is "supposed" to be HTML won't work as an HTML file if its syntax is incorrect.
Also, Markdown is not a total replacement for real code. It cannot cover the breadth and depth of a true coding language. I could liken it to pseudocode, but that is more of a lateral example.
In answer to your latest comment: if a second file is created from a first file with the format altered (in this case from Markdown to HTML), and the first file is then edited without the changes being written into the second file, the second file cannot be expected to change.
This is a good link a fellow SO user gave me:
https://stackoverflow.com/editing-help
Please feel free to edit, if I have made an error.
I haven't tried this extension for Chrome but it seems to automatically render markdown (.md) files in Chrome.
https://chrome.google.com/webstore/detail/markdown-preview/jmchmkecamhbiokiopfpnfgbidieafmd?hl=en
In Firefox, I use the following extension for the same functionality.
https://addons.mozilla.org/en-US/firefox/addon/markdown-viewer/
No need for a separate .html file, just save the text file with .md extension and open it in the browser.
Hope that helps.

Using IcePDF or PDFBox to generate HTML page from PDF

I want to use IcePDF or PDFBox to extract content from a PDF, but I don't know how to go on to generate HTML web pages from the extracted text and images.
You can convert pdf to html with PDFBox. Try this link.
If you add -html as a parameter when you extract text, you will get the PDF's text as HTML. It will not contain any images, graphics or other details, only the text extracted from the PDF in HTML format.
If you want to reproduce the exact look and feel of the PDF, there is no single-step method in PDFBox. To my knowledge, no library can create an exact HTML copy of a PDF. But with PDFBox you can extract the images, the text and its details, and from these you have to write your own logic to produce the HTML. We did a project converting PDF to HTML for azzist.com and accomplished the conversion using PDFBox. In azzist we convert résumés to HTML format. (There are still some font issues.)
Scribd, Google, Dropbox, Zoho, etc. have accomplished this conversion in a better way. You can look at any of these sites to see how well they do it. (You will not get the logic; you have to work it out yourself.)
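As a very rough idea of what "your own logic" could start from, the sketch below takes text lines and image file names that have already been extracted from one page (by PDFBox or any other tool) and wraps them in HTML. It is written in Python for brevity; the function name, the input shapes and the "page" class are assumptions, and it makes no attempt at reproducing layout or fonts.

```python
import html


def page_to_html(lines, image_files=()):
    """Wrap one page of extracted text lines and image file names in HTML."""
    parts = ['<div class="page">']
    paragraph = []
    for line in list(lines) + [""]:  # the trailing "" flushes the last paragraph
        if line.strip():
            paragraph.append(line.strip())
        elif paragraph:
            # A blank line ends the current paragraph.
            parts.append("<p>" + html.escape(" ".join(paragraph)) + "</p>")
            paragraph = []
    for name in image_files:
        parts.append('<img src="' + html.escape(name, quote=True) + '" alt="">')
    parts.append("</div>")
    return "\n".join(parts)
```

A real converter would also need the position and font details of each text run to get anywhere near the original look, which is where most of the work lies.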

PdfSharp, GDI+ and HTML printing

I currently have a "PrintingWebService" that I call from an AJAX page with all the information needed to construct a highly customized PDF printout using PdfSharp in its GDI+ mode, which takes DrawString and other commands that work basically just like GDI+, only they are drawn to the PDF.
I then save the PDF file to a location on the webserver and return the file name from the web service, and the AJAX page opens a new window with the pdf file.
So far it works well. However, there is one part of my AJAX page that I want to print out and haven't come up with a solution for yet: I've got a string of the HTML content of a TinyMCE editor that I want to display in the bottom part of the PDF page.
I'm looking for some sort of tool I could use for this purpose. Even something opensource that prints to GDI+ I could use by taking the source code and translating it to use PdfSharp's GDI+ (the class names are like XGraphics, with each class having X before the GDI+ name).
If I have to I will limit what HTML can be generated by TinyMCE and write my own renderer, but that will be a big challenge, so I'm looking for other solutions first.
I've stayed away from a printer-friendly page approach because I wanted to construct a page that is nearly identical to an existing WinForms printout, using my existing code. With PdfSharp I was able to convert all the code except the text area stuff (which used the RichTextBox and RTF in the WinForms version).
Tony,
I personally have used WebSupergoo's ABCPdf library with much success. You can actually render HTML directly to the PDF and it does fairly well in regards to accuracy.
Another option, free software that gives you the flexibility of writing HTML to PDF and that I have used in the past with much success, is iTextSharp.
Otherwise, I think you'll have to write something to render HTML to GDI.
Either way, you may want to consider using an HttpHandler that you map to using your web.config to generate the PDF file. This will allow for you to render the PDF to a bytestream and then dump it directly to the user (as opposed to having to save each PDF receipt to the web server). It will also allow for you to use the .pdf extension in the page that returns the receipt (PurchaseReceipt.pdf could be mapped to a HttpHandler)... making it more cross-browser friendly. Older versions of Adobe / Browsers will not display correctly if you start throwing a PDF byte stream from an ASPX page.
Hope this helps.

Best way to export html to Word without having MS Word installed?

Is there a way to export a simple HTML page to Word (.doc format, not .docx) without having Microsoft Word installed?
If you have only simple HTML pages, as you said, they can be opened directly with Word.
Otherwise, there are some libraries which can do this, but I don't have experience with them.
My last idea is that if you are using ASP.NET, you can set the response's Content-Type header to application/msword so that the page is saved as a Word document (it won't be a real Word doc, only HTML renamed to .doc so that Word can open it).
There's a tool called JODConverter which hooks into OpenOffice to expose its file format converters. It is available both as a webapp (it sits in Tomcat) that you post to and as a command-line tool. I've been firing HTML at it and converting to .doc and PDF successfully. It's part of a fairly big project that hasn't gone live yet, but I think I'm going to be using it.
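The header trick is not specific to ASP.NET. As a sketch of the same idea, here it is as a minimal Python WSGI application; the function names, the sample HTML and the default file name are placeholders, and the result is still HTML with a .doc name rather than a true Word file.

```python
def html_as_word_download(html_text, filename="document.doc"):
    """Return (headers, body) that make a browser hand the HTML to Word."""
    body = html_text.encode("utf-8")
    headers = [
        ("Content-Type", "application/msword"),
        ("Content-Disposition", f'attachment; filename="{filename}"'),
        ("Content-Length", str(len(body))),
    ]
    return headers, body


def app(environ, start_response):
    """A minimal WSGI application that serves a sample page as a .doc download."""
    headers, body = html_as_word_download(
        "<html><body><h1>Report</h1></body></html>"
    )
    start_response("200 OK", headers)
    return [body]
```

For a quick local try, the app could be served with wsgiref.simple_server.make_server("localhost", 8000, app).serve_forever().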
http://sourceforge.net/projects/jodconverter/
There is an open source project called HTMLtoWord that allows users to insert fragments of well-formed HTML (XHTML) into a Word document as formatted text.
HTMLtoWord documentation
While it is possible to make a ".doc" Microsoft Word file, it would probably be easier and more portable to make a ".rtf" file.
If you are working in Java, you can convert HTML to real docx content with code I released in docx4j 2.8.0. I say "real", because the alternative is to create an HTML altChunk, which relies on Word to do the actual conversion (when the document is first opened).
See the various samples prefixed ConvertInXHTML. The import process expects well formed XML, so you might have to tidy it first.
Well, there are many third party tools for this. I don't know if it gets any simpler than that.
Examples:
http://htmltortf.com/
http://www.brothersoft.com/windows-html-to-word-2008-56150.html
http://www.eprintdriver.com/to_word/HTML_to_Word_Doc.html
I also found a VBScript, but I'm guessing that requires you to have Word installed.
I presume from the "C#" tag you wish to achieve this programmatically.
Try Aspose.Words for .NET.
If it's just HTML, all you need to do is change the extension to .doc and Word will open it as if it were a Word document. However, if there are images to include or JavaScript to run, it can get a little more complicated.
I believe OpenOffice can both open .html files and create .doc files.
You can open HTML files with LibreOffice Writer and then export them as PDF from the File menu. Browsers can also export HTML as a PDF file.
Use this link to export to Word, but note that images won't work:
http://www.jqueryscript.net/other/Export-Html-To-Word-Document-With-Images-Using-jQuery-Word-Export-Plugin.html