Word Document to HTML - html

I have looked over the answers on the best way to convert Word to HTML for free. What if I am willing to pay? The big issue is that these documents have several tables that need to be kept exact. The background colors and cell alignment have to match the original.

You're willing to pay? Try http://word-to-html.com/ or the even more expensive http://www.solutionsoft.com/convert-word-to-html.htm
The main thing the other answers miss is that Word does a horrid job of producing HTML, and otherwise reasonable tools like OpenOffice do an even worse job. The results are so bad that the usual approach takes two steps:
Step 1: Export HTML from Word
Step 2: Post process the result to make it usable
An example (free) cleaner is http://word2cleanhtml.com/.
If you have the choice, use Microsoft's "Web Page, Filtered" rather than the full HTML (you'll be much happier). Also consider a dark-horse candidate: email the document to yourself via Gmail, then choose "view as HTML".
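For what it's worth, the "post process" step can be sketched in a few lines of Python. This is only a rough illustration, assuming the usual Word clutter (conditional comments, `<o:p>` tags, `Mso*` classes, `mso-*` style declarations); the function name `clean_word_html` is made up for this example, and the regular expressions are not a complete cleaner. Note that it keeps ordinary CSS such as background colors, which matters for the tables in the question.

```python
import re

def clean_word_html(html: str) -> str:
    """Strip common Word-specific clutter from exported HTML (illustrative only)."""
    # Remove conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if.*?<!\[endif\]-->", "", html, flags=re.S)
    # Remove Office-namespaced tags such as <o:p> and </o:p>
    html = re.sub(r"</?[ovw]:[^>]*>", "", html)
    # Remove class="MsoNormal"-style attributes
    html = re.sub(r'\s+class="?Mso\w*"?', "", html)
    # Drop mso-* declarations inside style attributes, keeping other CSS
    html = re.sub(r"mso-[\w-]+\s*:[^;\"']*;?\s*", "", html)
    # Remove style attributes that are now empty
    html = re.sub(r'\s+style="\s*"', "", html)
    return html

sample = ('<p class="MsoNormal" style="mso-margin-top-alt:auto;background:#ff0">'
          '<!--[if gte mso 9]><xml></xml><![endif]-->Cell<o:p></o:p></p>')
print(clean_word_html(sample))  # <p style="background:#ff0">Cell</p>
```

Regex-based cleaning like this is fragile on real documents, so check the tables by eye afterwards.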

Word has an export (or save as) to HTML. Will that work?
It's Save As -- Other Formats -- Web Page, Filtered

What version of Word are you using?
Word has a "Save as HTML" option. Isn't this enough?

You would just do File > Save As and change the file type to HTML.


How to create language-dictionary database from text file?

I have a large text file, which is an Italian-English dictionary. A typical line is:
Mazzapícchio, a long pole that fishers vse to bob vp and down for Eeles, and also to make fish to stirre. Also a kind of meate or custard in some parts of Italie made with milke and egges.
(Yes, it's a 17th-century dictionary.)
I'm looking for the best/easiest way to turn this into a searchable database.
The search would need to ignore the diacritics; with everything up to the first comma as the 'entry'. There are some cross-references, e.g.: Mefíte, as Mephíte.
My first thought is simply to turn it into HTML, with anchor tags for the word/phrase up to the first comma. That should be easy enough with a bit of Grep. I could also add links to the crossrefs in the same way (using BBEdit to confirm each change). It would then be easy to query just using a browser's search field.
However, ideally, I'd like something that returned only (all) the matching results. XML/HTML Tagging is the easy bit: the problem is the front-end to access/query it.
I'm on MacOS. (I'm also investigating Apple's Dictionary format...)
Any ideas on how to proceed would be welcome. Thanks.
This is a huge question, with many choices in many areas.
A small start:
A searchable db. Look at https://solr.apache.org/
PHP to handle the front-end interaction with Solr and to serve your HTML search form and results.
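If Solr feels heavy for a single text file, a much smaller starting point is SQLite. The sketch below swaps in Python's built-in `sqlite3` module instead of Solr and PHP, so treat it as an alternative, not as what the answer above proposes. It assumes one entry per line, takes everything up to the first comma as the headword, and stores a diacritic-free copy for searching. The table layout and the helper names `fold`, `build_db` and `search` are invented for the example.

```python
import sqlite3
import unicodedata

def fold(text: str) -> str:
    """Lower-case and strip diacritics, so 'Mefíte' becomes 'mefite'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def build_db(lines, path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE entries (headword TEXT, folded TEXT, definition TEXT)")
    for line in lines:
        # Everything up to the first comma is the entry; the rest is the definition.
        headword, _, definition = line.partition(",")
        headword = headword.strip()
        db.execute("INSERT INTO entries VALUES (?, ?, ?)",
                   (headword, fold(headword), definition.strip()))
    db.commit()
    return db

def search(db, query):
    """Return all entries whose folded headword starts with the folded query."""
    return db.execute(
        "SELECT headword, definition FROM entries WHERE folded LIKE ? ORDER BY folded",
        (fold(query) + "%",),
    ).fetchall()

lines = [
    "Mazzapícchio, a long pole that fishers vse to bob vp and down for Eeles.",
    "Mefíte, as Mephíte.",
]
db = build_db(lines)
print(search(db, "mefite"))  # [('Mefíte', 'as Mephíte.')]
```

Cross-references such as "as Mephíte" could later be resolved by running `search` on the referenced headword.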

When "viewing source", some sites have neat markup, some sites don't. Why? (pic attached)

Notice how in the 'ugly' side, the doctype is all the way indented and some of the meta lines extend past the left indent.
How can I get my markup looking neat when viewing source in a browser? Is there a certain way to encode the code while using an editor? I use Notepad++ by the way.
Large blocks of unindented code like you see on the left-hand side are probably being written out server side, so although the tag that creates them is nicely indented in your HTML, the server script output will not honour that.
It's not about encoding, it's about writing neat source code, haha. If you are outputting from PHP or something similar, you can keep track of how far to indent each thing, or you can use some sort of template output function that keeps track of how many tags are open for you and indents the correct amount each time. But there is no point in having neat HTML; the only important thing is that it's valid. Developer Tools will make it neat for you when you're trying to debug, and actually removing all the whitespace used to make it neat can reduce your page size quite a bit.
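The "keep track of how many tags are open" idea can be sketched with Python's standard `html.parser`. This is a minimal illustration, not a production formatter: the class name `Indenter`, the two-space indent and the short `VOID` list are choices made for this example, and putting every text node on its own line can change rendering for inline elements.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for the sketch)
VOID = {"br", "hr", "img", "input", "link", "meta"}

class Indenter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0    # number of currently open tags
        self.lines = []

    def emit(self, text):
        self.lines.append("  " * self.depth + text)

    def handle_starttag(self, tag, attrs):
        self.emit(self.get_starttag_text())
        if tag not in VOID:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> do not change the depth
        self.emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        self.depth = max(self.depth - 1, 0)
        self.emit(f"</{tag}>")

    def handle_data(self, data):
        if data.strip():
            self.emit(data.strip())

def indent_html(html):
    parser = Indenter()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)

print(indent_html("<ul><li>One</li><li>Two</li></ul>"))
```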
The ugly ones probably look pretty in the underlying php or other source. Once generated into HTML it looks ugly, and very few programmers will try to make that pretty too - it's not worth it.
It's funny that what you list as "ugly" seems properly indented to me... at least from what I can tell from the screenshot.
In any case, it doesn't matter. Most of the time these days, sites are made with something dynamic, and a lot of the HTML formatting isn't explicitly output.
If you were to view the source on many of my sites, it is all rammed together on one line, as that is how I echo it out. I don't see the point in wasting bytes on line feeds. Especially these days with all of the browser tools available that reformat the source while debugging.
I use Eclipse to do my coding and I can use Source->Format to clean up my code and format it nicely.
For Notepad++, I believe you can use HTML tidy as per: Formatting code in Notepad++
TextFX -> HTML Tidy -> Tidy: Reindent XML
You really want your HTML code to look like this:
view-source:http://lightningsoul.com/
It uses the minimum amount of data to present itself to the browser. Remember that indentation and whitespace consume data just like any other character.

Translate HTML files to another language

I have a website with Dutch text which I want to translate to English. Is there a fast way of doing this while keeping the HTML tags (<strong>, <span>) intact? I know I can just copy the parsed text into a translator, but this will remove the formatting.
I also know that at the end I have to go through the text manually to fix some minor spelling and grammar.
Online translators are good to turn foreign text into something that can be understood, but they are useless for producing quality translations. Even if you fix obvious problems at the end, you will get an amateurish word-by-word translation. If you want your visitors to take you seriously, you should translate from scratch.
If you want to preserve the HTML formatting at the same time as translating, you will have to work directly with the HTML source and update the text yourself without touching the formatting.
You may be able to use an XML editor like XmlSpy that will let you edit text nodes directly without touching the tagging, but this requires that the HTML is actually XHTML. You may still need to translate some attributes (such as title and alt attributes).
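The "edit text nodes without touching the tagging" idea can also be scripted. Below is a rough sketch using Python's standard `html.parser`: tags are copied through verbatim and only text nodes are handed to a translate function. The class `TextNodeTranslator`, the `GLOSSARY` lookup and `glossary_translate` are stand-ins invented for this example; a real translation step (human or machine) would replace them. As noted above, attributes such as title and alt are not touched here.

```python
from html.parser import HTMLParser

class TextNodeTranslator(HTMLParser):
    """Re-emit HTML unchanged, passing only text nodes through `translate`."""

    def __init__(self, translate):
        super().__init__(convert_charrefs=False)
        self.translate = translate
        self.raw = False   # True while inside <script> or <style>
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag in ("script", "style"):
            self.raw = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.raw = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        skip = self.raw or not data.strip()
        self.out.append(data if skip else self.translate(data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

def translate_html(html, translate):
    parser = TextNodeTranslator(translate)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

# Hypothetical stand-in for a real translator: a tiny Dutch-to-English lookup
GLOSSARY = {"Dit is": "This is", "belangrijk": "important"}

def glossary_translate(text):
    core = text.strip()
    return text.replace(core, GLOSSARY.get(core, core), 1)

print(translate_html("<p>Dit is <strong>belangrijk</strong></p>", glossary_translate))
# <p>This is <strong>important</strong></p>
```

Translating fragment by fragment like this loses sentence context, which is one more reason the result needs a manual pass.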
Is a virtual translation a good option for you? If you paste the Google Translate script into your page source, it will translate the text on your site, and the formatting will stay in place too. http://translate.google.com/translate_tools

Why do I need Markdown?

Why do I need Markdown with a front-end editor like WMD? What does Markdown do to the content that's sent from the WMD editor?
How does Markdown store the content in the backend? Is it stored in the same form, like *bold*, or in some other format? Why can't I just do an HTML encode?
Sorry if I sounded very naïve.

It's probably helpful to take a step back and ask some of the larger questions. The issue Markdown is trying to solve is that of rich editing in the browser. Consider this: at some point, for any piece of software to enable rich text, it has to describe the richness in some manner, whatever that may be.
We could call that description of richness (for example "this bit of text is bold" or "this bit of text is a hyperlink") "markup": it marks up the text with meta "richness".
Implementations of rich text can take on two approaches, either a.) hide the markup from the user or b.) let them have access to the markup.
For those who choose to hide it, the end result is very often WYSIWYG. The user is oblivious to what is happening behind the scenes. The editor takes care of the details. Think MS Word as an example. No one manipulates the Word markup format as a regular end user.
For implementations which choose to expose the markup, a markup language is needed to let users interact with it. Such markup languages include HTML, which uses <tag>, and BB code, which uses [tag].
Markdown is one such language.
Unlike the languages just mentioned, Markdown is designed so that its markup reuses the plain ASCII conventions people already use. For example, it's common for people to put asterisks around text to set it off, *important*, and in Markdown this notation indicates italics.
In regards to storage, as Stephan pointed out, the system will most likely store the raw markdown, because the user will most likely need to have the possibility of editing, and the original markdown can be recalled for that purpose.
In most of the systems I've built, I store the markdown, and then normalize it to a 2nd field which caches the HTML rendering of the markdown. This way I don't have to do markdown->HTML rendering for every markdown field. It takes a little more space, but I'd rather the user have a faster response than use less DB storage space.
Care should also be taken when accepting Markdown from the browser, as it can easily contain <script> tags which need to be filtered out. Most Markdown implementations will also recognize HTML intermingled with Markdown formatting, so to be safe, you need to make sure your inputs and caches are sanitized properly.
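As a rough sketch of the "store the Markdown, cache the HTML" approach, here is a small Python example using the built-in `sqlite3` module. The `render` function is only a stand-in for a real Markdown library: it escapes all HTML and converts `*text*` to `<em>text</em>`, which is stricter than real Markdown (real implementations let inline HTML through and need a proper sanitizer). The table and function names are assumptions made for the example.

```python
import html
import re
import sqlite3

def render(markdown_text: str) -> str:
    """Stand-in renderer: escape all HTML, then turn *text* into <em>text</em>."""
    escaped = html.escape(markdown_text)
    return re.sub(r"\*([^*\n]+)\*", r"<em>\1</em>", escaped)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body_md TEXT, body_html TEXT)")

def save_post(post_id, markdown_text):
    # Keep the raw Markdown for later editing and cache the rendered HTML next to it.
    db.execute("INSERT OR REPLACE INTO posts VALUES (?, ?, ?)",
               (post_id, markdown_text, render(markdown_text)))
    db.commit()

def load_html(post_id):
    # Reads never re-render; they only return the cached field.
    return db.execute("SELECT body_html FROM posts WHERE id = ?", (post_id,)).fetchone()[0]

def load_markdown(post_id):
    return db.execute("SELECT body_md FROM posts WHERE id = ?", (post_id,)).fetchone()[0]

save_post(1, "This is *important* <script>alert(1)</script>")
print(load_html(1))
# This is <em>important</em> &lt;script&gt;alert(1)&lt;/script&gt;
```

The cache has to be rebuilt whenever the renderer changes, which is the price of the faster reads.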
The reason for using an alternate encoding system other than HTML is for security
Markdown and other such wiki style encoding systems do not usually support scripting languages
HTML supports scripting languages in many ways (
The two main security issues are:
Malware criminals use scripts in user-generated content to attack the reader's computer by exploiting known security holes.
Freeloaders use scripts to subvert the rest of the site by changing the content frame or styles, i.e. ads, menus, logos etc. This can also be criminal behaviour, if not just annoying.
By using an intermediate language such as Markdown you have total control over the rendered output.
Filtering HTML is possible, but it is also complex and risky.
The other significant reason for an alternate encoding system is enforcement of style. Normal HTML has too many options. By limiting the available options, users can only use certain styles. This usually makes for cleaner-looking and more readable content (compare SO to eBay).
The main reason for using Markdown is the readability of the marked-up text. For instance, you can send it in a plain-text email and the reader will still understand the emphasis and the bullets, the text will be divided into paragraphs, et cetera.
When you ask about storing data, it depends. If you enable Markdown in the WordPress blog engine, it stores data as the user has input it, in Markdown. In Stack Overflow, however, it seems like the data is stored as HTML. At least, the "Stack Overflow data dumps" contain HTML, not Markdown (I've seen people complaining that they have to convert it back).
If you use the WMD editor, you can show the user what the output will look like after being converted to HTML. Even though Markdown syntax is really simple, it is not hard to make mistakes. Hence, it is best to show users the output.
Another reason for using Markdown instead of a WYSIWYG control: a WYSIWYG control allows the user to use HTML in data you are displaying on your web page. So you have to be the one who decides when there is simply incorrect HTML and when it is an evil XSS/CSRF/whatever injection. In Markdown, you simply convert *something* to <em>something</em>, remove any unknown HTML elements and you're done.

Tools to reduce generated HTML size

I'm using google docs, and some templates we are using were created using MS-Office.
The resulting HTML is fat and ugly, and the 500KB per doc limitation on google makes some cleanup mandatory.
I was able to find redundant "style" attributes and move them to some CSS class, and rename the most redundant classes names to shorter ones, which makes me save about 50% of the original size.
Are you aware of any existing tools/scripts/libs which could do this painful job for me, or at least help me to write this magic tool?
Thanks in advance!
EDIT: I tried tidy, demoronizer, and a "manual rewrite":
- Input : 140KB
- Tidy'ed : 110KB
- Demoronized : 135KB
So my favorite answer will be "rewrite it!"
Thanks !
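For anyone wanting to script the "move redundant style attributes to a CSS class" step described in the question, here is a rough Python sketch. It is an illustration under simplifying assumptions, not a finished tool: it only handles double-quoted `style` attributes, it assumes the affected elements have no existing `class` attribute, and the names `hoist_styles`, `min_count` and the generated `s0`, `s1`, ... classes are invented for the example.

```python
import re
from collections import Counter

STYLE_ATTR = re.compile(r'\sstyle="([^"]*)"')

def hoist_styles(html, min_count=2):
    """Replace inline styles that repeat at least `min_count` times with short classes."""
    counts = Counter(STYLE_ATTR.findall(html))
    names = {}
    for i, (style, n) in enumerate(counts.most_common()):
        if n >= min_count:
            names[style] = f"s{i}"   # short class name for a frequently repeated style

    def swap(match):
        style = match.group(1)
        # Leave rarely used styles inline; a class would not save anything.
        return f' class="{names[style]}"' if style in names else match.group(0)

    body = STYLE_ATTR.sub(swap, html)
    css = "\n".join(f".{name}{{{style}}}" for style, name in names.items())
    return css, body

page = ('<p style="color:red">a</p><p style="color:red">b</p>'
        '<p style="margin:0">c</p>')
css, body = hoist_styles(page)
print(css)   # .s0{color:red}
print(body)  # <p class="s0">a</p><p class="s0">b</p><p style="margin:0">c</p>
```

The returned CSS would go into a single `<style>` block or an external stylesheet.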
MS-Office makes crappy HTML, period. You're better off spending time rebuilding the HTML from the original text than trying to walk through that minefield.
I made a few macros that do some search/replace functions on Word to do basic things like wrap <p> tags around paragraphs and stuff like that, then re-markup the whole thing from scratch.
You could try tidy; it will clean up many things.
Without commenting on its name, I could mention demoronizer, which the author describes as:
...a Perl program available for downloading from this site which corrects numerous errors and incompatibilities in HTML generated by, or edited with, Microsoft applications.
YMMV.
One of my favourite utilities now is actually Windows Live Writer. It does a neat job of stripping rubbish out of Word doc files. Some might disagree, but I use it quite often!