Tools to reduce generated HTML size - html

I'm using Google Docs, and some of our templates were created with MS Office.
The resulting HTML is fat and ugly, and Google's 500 KB per-document limit makes some cleanup mandatory.
I was able to find redundant "style" attributes and move them to CSS classes, and to rename the most frequently used class names to shorter ones, which saves about 50% of the original size.
Are you aware of any existing tools, scripts, or libraries that could do this painful job for me, or at least help me write this magic tool?
Thanks in advance !
EDIT: I tried tidy, demoronizer, and a manual rewrite:
- Input: 140 KB
- Tidy'ed: 110 KB
- Demoronized: 135 KB
So my favorite answer will be "rewrite it!"
Thanks !
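For reference, the manual cleanup described above (moving repeated "style" attributes into CSS classes with short names) could be sketched in Python roughly as follows. This is only a sketch under assumptions that are not from the post: style attributes are double-quoted, tags that already carry a class attribute are left alone, and the function name styles_to_classes and the generated class names (c0, c1, ...) are made up for illustration.

```python
import re
from collections import Counter

# Assumption: attributes are double-quoted and tags contain no stray < or >.
TAG = re.compile(r"<[a-zA-Z][^<>]*>")
STYLE_ATTR = re.compile(r'\sstyle="([^"]*)"', re.IGNORECASE)
CLASS_ATTR = re.compile(r"\sclass\s*=", re.IGNORECASE)


def _inline_style(tag_text):
    """Return the style value of an opening tag, or None if the tag is left alone."""
    if CLASS_ATTR.search(tag_text):
        return None  # do not create a second class attribute
    match = STYLE_ATTR.search(tag_text)
    if match is None:
        return None
    return match.group(1).strip() or None


def styles_to_classes(html, min_count=2):
    """Replace inline styles used at least min_count times with short class names.

    Returns a (css, html) pair; the class names c0, c1, ... are hypothetical.
    """
    counts = Counter()
    for tag in TAG.finditer(html):
        style = _inline_style(tag.group(0))
        if style:
            counts[style] += 1

    # The most frequent styles get the lowest-numbered names.
    names = {}
    for style, count in counts.most_common():
        if count >= min_count:
            names[style] = "c%d" % len(names)

    def rewrite(tag):
        text = tag.group(0)
        style = _inline_style(text)
        if style in names:
            return STYLE_ATTR.sub(' class="%s"' % names[style], text, count=1)
        return text

    body = TAG.sub(rewrite, html)
    css = "\n".join(".%s{%s}" % (name, style) for style, name in names.items())
    return css, body
```

The returned CSS would still have to be placed in a <style> element or an external stylesheet by hand, and styles that differ only in spacing or property order are treated as different here.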

MS-Office makes crappy HTML, period. You're better off spending time rebuilding the HTML from the original text than trying to walk through that minefield.
I made a few macros that do some search/replace functions on Word to do basic things like wrap <p> tags around paragraphs and stuff like that, then re-markup the whole thing from scratch.
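The same "start from the text and re-mark it up" idea could be sketched outside Word as well. The following is a rough Python illustration, not the macros mentioned above; it assumes the document has been saved as plain text with blank lines between paragraphs, and the function name paragraphs_to_html is made up.

```python
import html
import re


def paragraphs_to_html(text):
    """Wrap each blank-line-separated paragraph of plain text in <p> tags."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return "\n".join(
        # Collapse line breaks inside a paragraph and escape <, > and &.
        "<p>%s</p>" % html.escape(" ".join(block.split()), quote=False)
        for block in blocks
        if block.strip()
    )
```

Headings, lists, and tables would still need their own markup afterwards.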

You could try tidy; it will clean up many things.

Without commenting on its name, I could mention demoronizer, which the author describes as:
...a Perl program available for downloading from this site which corrects numerous errors and incompatibilities in HTML generated by, or edited with, Microsoft applications.
YMMV.

One of my favourite utilities now is actually Windows Live Writer - it does a neat job of stripping rubbish out of Word doc files. Some might disagree but I use it quite often!

Related

How do I remove excess whitespace in an HTML file? (And only excess whitespace)

I have a horrible, ugly HTML file that was spat out by a form generator and slightly modified to look nice. This HTML file needs to be translated, so I hooked up some scripts using po4a and csv2po, and that all works fairly well except for one thing: some of the base strings in our translation templates are surrounded by whitespace, and the translators get rather confused.
The other thing is I have this working with a Makefile (because that generated form is updated quite frequently and I'm a nerd). I'd like to keep it that way because it's nice for my workflow. So, I need a command line tool.
I'm really looking for the simplest solution in this case, so I ran the HTML file through HTML Tidy, and that removes the weird whitespace quite competently. However, it does a lot of stuff I don't need. It messes with the doctype (and it doesn't support an html5 doctype), and I've ended up with a really crazy command line just to get it to not mangle things. It is not very pleasant.
All I really want is a command line tool (not an online one) whose single goal in life is to look at my HTML file and format it nicely. Ideally not a "compressor" thing, but if that's the only option, suggestions would be nice :)
Stick it in an IDE or text editor like Notepad++ or NetBeans and hit the "format code" button, which is available in nearly every IDE?
I'm not sure if it is still being developed, but would HTML Tidy do the trick?

Writing a book for both print and HTML which can include code samples

I want to write a book on programming. I need to target both print and HTML.
In order not to get burned with the code examples, I need to be able to include parts of source code which have been marked up with start and end points to ensure the code is up to date and compiles. Extract the code from external files if you will.
I would like some simple format such as Txt2tags rather than LaTeX, since I can then use Word's fine spelling capabilities.
Any experiences you want to share?
It is important to note that by starting with Txt2tags you will be able to export your documents into LaTeX. To my knowledge this is a one-way street, so by starting with Txt2tags you can still have the flexibility of LaTeX, but by going with LaTeX you don't get the benefits of Txt2tags.
Firstly, don't dismiss LaTeX too rapidly. Although it can be a bit of a pain to spellcheck, it's still quite doable with tools like aspell.
That being said, I would highly recommend using emacs' org-mode. It will provide you with a nice foldable overview of your book's structure, and is much more readable in plain text than LaTeX. Additionally, since it uses emacs' native syntax highlighting when you export (to HTML, LaTeX, PDF, etc) you'll be able to write the code inline (between #+begin_src tags) and get a much more precise WYSIWYG view of the code snippets you include.
Since emacs will work with aspell out-of-the-box, you'll still be able to check spelling as you work. Also, it uses LaTeX as an export format, which means you can obtain the same professional/technical look that LaTeX affords.
I see it has been reported as a missing feature on the txt2tags homepage...
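Whichever markup is chosen, the "start and end points" asked for in the question could be handled by a small pre-processing script that pulls snippets out of source files that are compiled and tested separately. A minimal Python sketch follows; the marker syntax ("# BEGIN name" / "# END name") and the function name extract_snippet are assumptions for illustration, not a feature of Txt2tags, LaTeX, or Org-mode.

```python
import re


def extract_snippet(source, name):
    """Return the lines between '# BEGIN name' and '# END name' (markers excluded)."""
    begin = re.compile(r"^\s*#\s*BEGIN\s+%s\s*$" % re.escape(name))
    end = re.compile(r"^\s*#\s*END\s+%s\s*$" % re.escape(name))
    collected = []
    inside = False
    for line in source.splitlines():
        if not inside and begin.match(line):
            inside = True
        elif inside and end.match(line):
            return "\n".join(collected)
        elif inside:
            collected.append(line)
    # Fail loudly so a renamed or deleted snippet breaks the book build.
    raise ValueError("snippet %r not found or not closed" % name)
```

Raising an error for a missing marker means an out-of-date reference is caught when the book is built rather than in print.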

What language should I use for editing documents?

Document editors are nice but they have their limitations.
What is a good alternative to them?
I already know HTML and CSS and while they can do the job, they are ill-suited for printed documents.
I was thinking of learning LaTeX, because many scholars use it. But I wonder if someone would recommend another language, such as PostScript.
LaTeX is fine. You don't want to write PostScript by hand.
I’m using LaTeX almost exclusively nowadays, at least for text documents (everything from CVs and letters to manuals).
For quick one-off notes, I’m actually using Markdown (without a renderer. I just think that Markdown preserves document structure quite nicely even when used in text-only mode).
For presentations and spreadsheets, I use appropriate applications, though. In particular, I don’t think LaTeX is that well-suited to do the former (depending on your style of presentations, obviously. Mine have next to no text though …).
I finally got a chance to write an entire paper in LaTeX for my final semester of college and found it to be easier than I thought it would be. A couple of the nice things I found about it were:
- A fairly lightweight syntax for most things (tables being the only real offender, but no one can get text tables right).
- An extremely wide array of syntax for doing anything from automatically marking up a chemical formula to writing inline lists.
- Beautiful output automatically.
- Extremely easy to write modular documents where I might store a chapter in a file and then simply \include{} it in another. One particularly nice use I found for this was to include code that I had written in the document simply by referencing the files.
- Wonderful support for footnotes and bibliographic references.
- Libraries for just about anything you can imagine.
The major drawbacks are, IMHO:
- A lack of any real direction or life in the language. It feels dead, and not because it's done.
- A frustrating build process, although there are tools to help with that, from a simple bash script to a full-fledged Makefile.
If you're interested in learning LaTeX, I would recommend starting out by reading the Not So Short Introduction to LaTeX 2e PDF.
However, I decided against using LaTeX for most things that I write these days specifically because it feels dead and has a frustrating build process. I instead switched over to MultiMarkdown, as it is well supported and can be transformed into a large array of other formats, including LaTeX which can then be hand massaged if you really need to in order to get it the format expected by some publication. If you haven't played with MultiMarkdown or Markdown before, then I highly recommend checking them out. The syntax is extremely lightweight and natural, even compared to LaTeX. I find that except for some of the higher level typographical constructs, MultiMarkdown supports everything I need on a regular basis.
My 2 cents.
It depends on what you want to do. If you are planning to write a formal document, maybe for printing too, just go for LaTeX.
It is not as difficult as it may appear at the very beginning, and it is professional and fulfilling.
If the Web is your goal, go for HTML / CSS.
OpenOffice or Word would do the trick in most cases; do not underestimate them. If you are going to use them (for a job, for example), take the time to learn them.
To expand on zzzzBov's comment, LaTeX is SUPPOSED to allow the writer to concentrate on the content and let the compiler/documentclass handle formatting (and that usually is true). If you use HTML/CSS to format, you will probably spend more time (rather than less) on formatting. Imagine that the LaTeX documentclass is the CSS, only it is already written for you, and your LaTeX source is the content, only the tags are more functional (such as italics or equations) rather than existing to patch the HTML to the CSS (<div ...>). I recommend the LaTeX wikibook as an easy way to start, and the short-math-guide if you need mathematics. Enjoy!

How to extract meaningful text from HTML

I would like to parse an HTML page and extract the meaningful text from it. Does anyone know some good algorithms to do this?
I develop my applications on Rails, but I think Ruby is a bit slow at this, so a good C library would be appropriate, if one exists.
Thanks!!
PS: Please do not recommend anything with Java
UPDATE:
I found this link text
Sadly, it is in Python
Use Nokogiri, which is fast and written in C, for Ruby.
(Using regexp to parse recursive expressions like HTML is notoriously difficult and error prone and I would not go down that path. I only mention this in the answer as this issue seems to crop up again and again.)
With a real parser like for instance Nokogiri mentioned above, you also get the added benefit that the structure and logic of the HTML document is preserved, and sometimes you really need those clues.
Solutions integrating with Ruby
Use Nokogiri, as recommended by Amigable Clark kant
Use Hpricot
External Solutions
If your HTML is well-formed, you could use the Expat XML Parser for this.
For something more targeted toward HTML-only, the W3C actually released the code for the LibWWW, which contains a simple HTML parser (documentation).
Lynx is able to do this. It is open source, if you want to take a look at it.
You should strip all angle-bracketed parts from the text and then collapse the whitespace.
In theory, < and > should not appear anywhere else: pages contain the entities &lt; and &gt; instead of them.
Collapsing whitespace: convert all TABs, newlines, etc. to spaces, then replace every sequence of spaces with a single space.
UPDATE: And you should start after finding the <body> tag.
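The steps above (drop the tags, start at <body>, collapse whitespace) could look roughly like this in Python. Note that this sketch swaps the raw angle-bracket stripping for the standard library's html.parser, which also decodes entities such as &lt; and makes it easy to skip <script> and <style> contents; the names TextExtractor and extract_text are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text found inside <body>, ignoring script and style contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True: entities are decoded
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.chunks.append(data)


def extract_text(page):
    """Return the body text of an HTML page with all whitespace runs collapsed."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    # Joining with a space keeps adjacent blocks apart, at the cost of
    # splitting words that are interrupted by inline tags.
    return " ".join(" ".join(parser.chunks).split())
```

A page without an explicit <body> tag yields an empty string here, so fragments would need the in_body check removed.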

Word Document to HTML

I have looked over the answers on the best way to convert Word to HTML for free. What if I am willing to pay? The big issue is that these documents have several tables that need to be kept exact. The background colors and cell alignment have to match the original.
You're willing to pay? Try http://word-to-html.com/ or the even more expensive http://www.solutionsoft.com/convert-word-to-html.htm
The main thing the other answers miss is that Word does a horrid job of producing HTML. And otherwise reasonable tools like OpenOffice do an even worse job. The results are so incredibly bad that the usual approach is two steps:
Step 1: Export HTML from Word
Step 2: Post process the result to make it usable
An example (free) cleaner is http://word2cleanhtml.com/.
If you have the choice, use Microsoft's "Web Page, Filtered" rather than the full HTML (you'll be much happier). Also consider a dark horse candidate: email the document to yourself via Gmail, then "view as HTML".
Word has an export (or save as) to HTML. Will that work?
It's Save As -- Other Formats -- Web Page, Filtered
What version of Word are you using?
Word has an option "Save as HTML". Isn't this enough?
You would just do File > Save As > change file type to HTML.