Strip Word document of formatting but not images? - html

I need to add a bunch of Word documents to a wiki but want to clean up the resulting HTML so that, ideally, only text and image tags remain. Anyone up for a challenge? :o)
It's ok if the solution involves using a text editor and doing some "gymnastics" on it.

There are tools that perform much of this cleaning for you, like here or here, and Dreamweaver includes such a tool as well.
I don't know what these tools do with images, though. If you choose a more DIY route, I think this can help you.
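For the DIY route, here is a minimal sketch of what such a cleaner could look like, using only Python's standard html.parser. The tag whitelist (KEEP_TAGS), the class name and the function name are assumptions made for illustration, not what any of the tools above actually do. It drops every attribute and every tag outside the whitelist, keeps the text, and keeps img tags with only their src.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: these are the only tags worth keeping for a wiki import.
KEEP_TAGS = {"p", "h1", "h2", "h3", "ul", "ol", "li", "br", "img"}
VOID_TAGS = {"br", "img"}
# The contents of these elements are dropped entirely (Word emits big style blocks).
SKIP_CONTENT = {"style", "script", "head"}


class WordHtmlCleaner(HTMLParser):
    """Collects whitelisted tags (without attributes), text, and <img src>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in KEEP_TAGS:
            return
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.out.append('<img src="%s">' % escape(src, quote=True))
        else:
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in KEEP_TAGS or tag in VOID_TAGS:
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def clean_word_html(html_text):
    parser = WordHtmlCleaner()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Bold and italic markup is thrown away by this whitelist; adding "b", "i", "strong" or "em" to KEEP_TAGS would preserve it.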

I would copy the text out of Word and paste it into Notepad and then manually enter my images into the Wiki document.

Hi, I've worked a little with Open XML.
You can either loop through the Word document, checking each paragraph and converting each element into LiteralControls, or use LINQ to filter specific node sets. You could also just treat your Word file as an XML node set and navigate it with XPath, LINQ to XML, or the DOM.
Try downloading the Open XML SDK and its tools and start looking inside your documents.
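To illustrate the "treat the file as XML" idea without the .NET SDK, here is a rough sketch in Python using only the standard library. It relies on two facts about the format: a .docx file is a zip archive, and the main text lives in word/document.xml as w:p (paragraph) and w:t (text run) elements. The function names are made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by w:p and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraphs_from_document_xml(xml_bytes):
    """Return the plain text of every w:p element, in document order."""
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter("{%s}p" % W_NS):
        # A paragraph's text is split across one or more w:t runs.
        text = "".join(t.text or "" for t in p.iter("{%s}t" % W_NS))
        paragraphs.append(text)
    return paragraphs


def paragraphs_from_docx(path_or_file):
    """Open the .docx zip container and read its main document part."""
    with zipfile.ZipFile(path_or_file) as archive:
        return paragraphs_from_document_xml(archive.read("word/document.xml"))
```

This ignores images, tables and styles entirely, and paragraphs nested inside text boxes may be reported more than once; it is only meant as a starting point for looking inside a document.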

Related

.doc / .docx problem to display in browser

I am trying to find a solution for a website that I made so that Word documents no longer download automatically but instead display directly in a new page in the browser when I click them.
Can someone help me? How can I do that in HTML?
Thanks a lot in advance
There are two approaches you can take for this:
1. Ensure that all visitors to your site have a browser extension that can display Word documents. (This isn't something you can control for a typical website, but it might be an option for a company intranet.)
2. Convert the Word documents to a format that the browser can display (e.g. HTML). You could do this manually or with code (which could be client-side or server-side). The Word document formats are notoriously complex, so you would need to find a third-party library that could do this for you.

VB.NET: WYSIWYG page maker tutorial

I have a coursework assignment for which I have to make a WYSIWYG web page editor (as advanced as possible) in VB.NET (2010). It should have a visual editor with drag-and-drop support for several elements such as anchors, images, tables etc., and it should generate HTML based on that structure.
I don't know where to begin, though. I have some experience with VB.NET (I made a tabbed notepad, vaguely following a tutorial), but I don't know how to do this drag-and-drop thing in a RichTextBox.
I've searched for a tutorial, but most of them are just too simple: a text editor with a browser control rendering the HTML. I found one that is really nice and advanced, but it's in German :-|
So, if anyone knows of any resources or tutorials I could use to get started, I'd appreciate it.
I wouldn't start with a RichTextBox. Do you want to implement it in WPF or Windows Forms? (I would recommend WPF.)
WPF has relatively simple drag-and-drop support for elements (see http://msdn.microsoft.com/en-en/library/ms742859.aspx).
I would start with some simple elements (e.g. TextBoxes) and drag and drop them from some sort of toolbox onto a grid with fixed columns and rows (and later use a canvas), and then generate the HTML code from that.
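The "generate HTML from the grid" step is independent of the UI toolkit, so here is a small language-neutral sketch of it in Python rather than VB.NET. The element kinds ("text", "image", "anchor") and the dictionary layout are assumptions chosen for the example; a real editor would have its own model.

```python
from html import escape


def render(kind, value):
    """Turn one placed element into an HTML fragment."""
    if kind == "text":
        return escape(value)
    if kind == "image":
        return '<img src="%s">' % escape(value, quote=True)
    if kind == "anchor":
        return '<a href="%s">%s</a>' % (escape(value, quote=True), escape(value))
    return ""  # empty or unknown cell


def grid_to_html(rows, cols, placed):
    """placed maps (row, col) -> (kind, value) for each occupied grid cell."""
    lines = ["<table>"]
    for r in range(rows):
        lines.append("  <tr>")
        for c in range(cols):
            kind, value = placed.get((r, c), ("empty", ""))
            lines.append("    <td>%s</td>" % render(kind, value))
        lines.append("  </tr>")
    lines.append("</table>")
    return "\n".join(lines)
```

Emitting a table mirrors the fixed rows and columns directly; once elements live on a free-form canvas, absolutely positioned elements would be the closer equivalent.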
In general, most browser-based WYSIWYG editors are written in JavaScript using an editable DIV.
A good example is TinyMCE:
http://www.tinymce.com/
Download, including full source code, is available here:
http://www.tinymce.com/download/download.php
You can use CKEditor. It's one of the best WYSIWYG editors I have worked with. It's highly customizable and open source.
Here is the URL for the website:
http://ckeditor.com/

WYSIWYG HTML editor for grails gsp files

Does anyone know about a good HTML editor which can be configured in such a way that it is gsp aware?
What I mean is that at least tags such as <g:link> and <g:input> should be displayed as their html equivalent.
Yes I know: a perfect editor is hard to write and it is easier to edit the HTML sources (that's what I do), but there are people who prefer an HTML editor...
Update: yes, I am looking for a WYSIWYG HTML editor with which I can drag'n'drop some html elements to a page without changing the <g:...> tags which might already be contained in the page. In addition, this editor should have some gsp awareness, so that <g:...> tags are displayed in an appropriate way.
Update: still looking for something, so I started a bounty. What I need is something like this plugin: http://code.google.com/p/grails-form-builder-plugin/ but more evolved...
Bounty: not easy to select the right answer for the bounty. None of the answers is a solution to my problem, but I have decided that rschlachter points me in the right direction: a wysiwyg form editor is not the right solution for a developer...
I think there may be a flaw in the process here. You could build the page first in HTML and make any changes there before putting in any gsp elements. While the page is in HTML format people can continue to use WYSIWYG editors and then developers can add in the grails functionality.
It just seems like if you need/want to use a WYSIWYG editor, you shouldn't be modifying a gsp.
The iterations I prefer to use after I have gathered requirements are:
wireframe
mockup
html
gsp
If the GSPs are already there (i.e. you inherited the project or something), you could go back a step and create an HTML-only version of the page by pulling the GSP elements out and putting in images of them or replacing them with their HTML equivalents.
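As a rough illustration of "replacing them with their HTML equivalents", the sketch below swaps selected g: tags for plain HTML stand-ins with a regular expression. The mapping is hypothetical (it simply uses the two tags named in the question), and a regex will not cope with attribute values that contain ">", so treat it as a throwaway helper rather than a GSP parser.

```python
import re

# Hypothetical mapping from GSP tag names to plain HTML stand-ins.
GSP_TO_HTML = {"link": "a", "input": "input"}

# Matches opening, closing and self-closing <g:...> tags.
TAG_RE = re.compile(r"<(/?)g:([A-Za-z]+)\b([^>]*)>")


def gsp_to_static_html(gsp, mapping=GSP_TO_HTML):
    def replace_tag(match):
        closing, name, rest = match.groups()
        html_tag = mapping.get(name)
        if html_tag is None:
            return match.group(0)  # leave unmapped tags untouched
        if closing:
            return "</%s>" % html_tag
        return "<%s%s>" % (html_tag, rest)  # attributes are copied as-is

    return TAG_RE.sub(replace_tag, gsp)
```

The copied attributes (such as action="list") mean nothing to a browser, but they are harmless in a mockup and make it easier to put the original tags back later.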
The IBM Maqetta project seems to be going in the right direction:
http://maqetta.org/
Mercury editor might be worth looking at too.
http://jejacks0n.github.com/mercury/
There is one more editor that you might want to look at:
Aloha Editor - http://www.aloha-editor.org/
Orbeon could be an option:
http://www.orbeon.com/orbeon/home/
You might be able to do this with TinyMCE by configuring valid_elements or extended_valid_elements (docs). For example, if you want to replace <g:link> and <g:input>, you would do something like:
tinyMCE.init({
    valid_elements : "a/g:link,input/g:input"
});
Or, if you simply want to allow the additional elements, you could do something like:
tinyMCE.init({
    extended_valid_elements : "g:link,g:input"
});

Putting several hundred .doc pages into webpage

I have hundreds of .doc files with text that I need put on web pages.
I realize I could convert every .doc file to .txt, then use a server side include to embed the contents of each page into a webpage. This would save a lot of time because I could simply have one .php?txt=... page which will display a different .txt include depending on the link the user pressed to get there. This works perfectly content-wise.
However, all formatting is lost when a file is converted to .txt (titles should be in bold).
When I convert these .doc files to .html using Microsoft Word, the ~20-line documents become bloated .htm files of more than 300 lines (probably because each paragraph is put into a text box).
Dreamweaver's "Clean up Word HTML" helped a bit but the code was still extremely bloated.
How would you suggest going about this?
edit: I may have solved my own question, trying to embed Google docs into my page.
There is a program suite called wv (formerly mswordview). It includes a program called wvWare, which can transform Word documents to HTML.
Furthermore, you can take the HTML output from Word and send it through tidy. This corrects the markup and can usually handle the mistakes made by Word.
You can also try converting the Word documents to a DocBook intermediate format; you can then easily transform the DocBook to (X)HTML with existing tools.
MS Word is bloatware. Its own markup is bloated, and therefore any attempt to automatically convert it to HTML will inherit these problems. You end up with garbage like: <strong><strong></strong></strong> for no good reason.
Dreamweaver can clean it up a lot, but nothing short of strip/remarkup is going to get you clean results.
That's why most people use PDFs for this type of issue.
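For the specific kind of garbage mentioned above (empty, pointlessly nested inline tags), a small post-processing pass can help. The following is only a sketch: the list of inline tags is an assumption, and it uses a regular expression rather than a real HTML parser, so it is not a substitute for a full strip-and-remarkup.

```python
import re

# Assumption: inline tags that are safe to drop when they wrap nothing but whitespace.
INLINE = "strong|b|em|i|u|span|font"
EMPTY_RE = re.compile(r"<(%s)\b[^>]*>(\s*)</\1>" % INLINE, re.IGNORECASE)


def drop_empty_inline(html_text):
    """Remove empty inline elements, repeating until nothing changes.

    Repetition is what handles nesting: removing the inner empty
    <strong></strong> leaves the outer pair empty for the next pass.
    Any whitespace inside the removed element is kept so words do not merge.
    """
    previous = None
    while previous != html_text:
        previous = html_text
        html_text = EMPTY_RE.sub(r"\2", html_text)
    return html_text
```

For example, drop_empty_inline("<p>Title<strong><strong></strong></strong></p>") returns "<p>Title</p>".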
My immediate reaction would be to convert the docs to PDFs. That will normally preserve formatting quite well, and users typically have their browsers set up to view PDFs one way or another (and the few who don't are undoubtedly accustomed to being unable to view a lot of documents on a lot of sites).
Alright thanks everyone for your suggestions, but I wanted to make this page accessible to everyone without pdf viewers as well.
Google Docs allows you to bulk-upload your text files (and converts them for you, too).
You can then export them into an iframe to embed in any HTML document.

How do I convert PDF to HTML programmatically?

Are there any classes, COM objects, command-line utilities, or anything else that I can make an API for that can convert a PDF to an HTML document? Obviously the conversion might be a little rough, since PDFs can contain a lot more than HTML can describe. I found a utility called pdftohtml on SourceForge, but quite honestly it does a horrible job with the conversion. I don't care if the software is free or commercial, but is there anything out there at all that I can incorporate into my own software to do this sort of conversion at least decently? I know Google has developed its own method of doing this, since you can click "View as HTML" on a PDF attached to an email in Gmail, but I was hoping there was something available to the public.
Remember, PDF to HTML. I'm NOT worried about HTML to PDF.
Well, one solution I can think of is to write a small program that reads the PDF text using a library called iText and then generates HTML files.
Well, for Java-based PDF solutions, I guess we still don't have a clean way. All the solutions are primitive and amount to workarounds. There is no easy solution for:
1. Designing a template of a PDF
2. Then, at runtime in Java, populating data into this template, using either XML or other data sources
It is such a simple requirement, and nothing offers a good open-source and free solution yet!
Eclipse BIRT comes close, but it does not handle barcode elements out of the box.
You were looking for pdf2htmlEX (C++), which converts PDF to HTML without losing text or format.
To convert further to semantic HTML, you can process pdf2htmlEX output using my project Transcript (Python). It is however not lossless anymore and works best on documents not deviating too much from conventional visual layout.
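Since the question asks for something that can be wrapped in an API, here is a hedged sketch of calling the pdf2htmlEX command-line tool from Python. It assumes pdf2htmlEX is installed and on the PATH; the wrapper function names are made up for this example, and only the basic invocation with the --dest-dir option is used.

```python
import shutil
import subprocess


def pdf2htmlex_command(pdf_path, dest_dir):
    """Build the argument list; kept separate so it can be inspected or logged."""
    return ["pdf2htmlEX", "--dest-dir", dest_dir, pdf_path]


def convert_pdf(pdf_path, dest_dir="."):
    """Run pdf2htmlEX on one file, raising if the tool is missing or fails."""
    if shutil.which("pdf2htmlEX") is None:
        raise RuntimeError("pdf2htmlEX is not installed or not on PATH")
    subprocess.run(pdf2htmlex_command(pdf_path, dest_dir), check=True)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with file names that contain spaces.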