csharp code to remove all extraneous microsoft html formatting - html

Is there any way to programmatically remove all the Microsoft HTML formatting that gets added and simply render the content as regular HTML?
I want to remove all the extra tags because I am trying to load the content into TinyMCE, but TinyMCE doesn't seem to be able to render it.

I've used the regular expressions from these articles:
http://tim.mackey.ie/CleanWordHTMLUsingRegularExpressions.aspx
How do I filter all HTML tags except a certain whitelist?
In my case I wanted to restrict everyone down to a small whitelist of tags. Especially those who paste from Word. TinyMCE has a property "valid_elements" which does exactly this.
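The whitelist idea can also be applied on the server before the markup reaches the editor. Below is a minimal Python sketch of that approach using only the standard library. The tag and attribute sets are hypothetical placeholders, not TinyMCE's defaults, and the filter is meant for tidying pasted Word markup, not as a security sanitizer (it does not vet URL schemes, for example).

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; adjust to whatever the editor should accept.
ALLOWED_TAGS = {"p", "br", "strong", "em", "ul", "ol", "li", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
# Elements that are dropped together with everything inside them.
DROP_CONTENT = {"style", "script", "xml"}


class WhitelistFilter(HTMLParser):
    """Re-emit only whitelisted tags; keep the text of everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        allowed = ALLOWED_ATTRS.get(tag, ())
        kept = [(k, v) for k, v in attrs if k in allowed and v is not None]
        attr_text = "".join(f' {k}="{escape(v)}"' for k, v in kept)
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in ALLOWED_TAGS or tag == "br":
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def clean_html(html):
    parser = WhitelistFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Tags outside the whitelist (span, o:p, font and so on) are removed but their text is kept, while style blocks are removed together with their contents. HTML comments, including Word's conditional comments, are dropped because the sketch never re-emits them.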

Related

Preserving custom made word styles in word html

I have to make a release notes document in Word based on an XML export from Jira.
Some of the fields that we need in the release notes document are rich text enabled. In reality the output is HTML-styled text.
To be able to mix the HTML text and the plain text from Jira, we opted to use Word HTML (a Word document saved as a web page) as a template.
Then we use XSLT to add the needed values from the XML export into the Word HTML template.
This seems to be working quite well, but we ran into a problem.
The release notes format in Word has a lot of styles (some of them custom built).
But in the document we get when opening the transformed file (which is technically HTML) we only get 2 styles: Normal and heading1.
(I prepared 2 screenshots, but I can't post images yet)
My question now is:
Is there a way to add these styles to the Word HTML so that our users can still use them to edit the document after it is generated?
I thought there had to be something I could add to insert these styles, but I can't seem to figure out what.
There is a style part in the HTML, so that is what I have primarily been looking at.
Any help would be appreciated, even if you just point me to some documentation about Word HTML, or to someone who might know this.
Kind regards
Peter
As an alternative approach, you could consider docx4j.XHTMLImport.
If your incoming XHTML contains class attribute values with the same names as styles in the target docx, then the converted content will use those styles.

How to remove specific HTML tags in Visual Studio

I need to copy/paste text from Microsoft Powerpoint to Visual studio 2010's aspx page. When I copy the text it copies several unwanted tags (like style tags, span, p tag etc.). How can I cleanup that copied text in Visual Studio? I have also installed Resharper, is it useful in removing unwanted tags? For example I want to remove all style tags from a document or want to remove all span tags. I want to cleanup/remove unwanted tags in a single command.
Once the text has been pasted, it is hard to distinguish unwanted tags from wanted ones automatically; perhaps a complicated Replace All with a regex would work. But there are ways to copy and paste plain text; look here: How to copy and paste code without rich text formatting?.
As of now, there is no easy way to remove specific HTML tags in Visual Studio. I would suggest using the find-and-replace feature of Notepad++, where you can easily write a regular expression to replace tags. I would also suggest these tools for cleaning up your HTML: WordToHTML and CleanHTML. I hope this helps resolve part of your concern.

Today's Google Doodle of Moog Synthesizer

I was inspecting today's Google Doodle of Moog Synth, dedicated to Robert Moog,
when I came across the following piece of html code:
<w id=moogk0></w>
<s id=moogk1></s>
<w id=moogk2></w>
<s id=moogk3></s>
(You can view the source & do a Ctrl+F for , you will get it in the first search result).
I googled about s & w tags but found nothing useful except that the s tag is sometimes used to strikeout text but is now deprecated.
Obviously Google would not be using deprecated tags, so I guess there's a lot more behind this than plain HTML. Can anybody explain the use of these tags? How does the browser recognise them? And if the browser does not recognise them, what is their use?
The browser doesn't recognise them.
But HTML was designed to ignore what it doesn't recognise, so these are simply not understood by the browser and get ignored. As elements without content, they will not get rendered at all either, though they will be part of the DOM.
However, these can be styled directly as elements in CSS and picked up by Javascript (getElementsByTagName and getElementById etc...).
In short, these elements provide a target for CSS and Javascript without any other impact on display.
Unknown elements are treated as inline elements (like span) by default; they can be restyled in CSS as needed and used in scripts.
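As a rough illustration of the point that unrecognised tags still become ordinary elements, here is a small Python sketch. It uses the standard library's tolerant HTML parser as a stand-in for a browser (an assumption of this example; it is not a browser DOM), and the helper name collect_elements is made up for the demonstration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag with its id, whether or not HTML defines the tag."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append((tag, dict(attrs).get("id")))


def collect_elements(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.elements


# The markup quoted in the question.
SNIPPET = "<w id=moogk0></w><s id=moogk1></s><w id=moogk2></w><s id=moogk3></s>"
print(collect_elements(SNIPPET))
```

The parser reports w and s like any other element, ids included, which is what lets a script or stylesheet target them by tag name or id.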
The tags you are talking about are user created XML tags.
If you need to display dynamic data in your HTML document, it will take a lot of work to edit the HTML each time the data changes.
With XML, data can be stored in separate XML files. This way you can concentrate on using HTML/CSS for display and layout, and be sure that changes in the underlying data will not require any changes to the HTML.
With a few lines of JavaScript code, you can read an external XML file and update the data content of your web page.

Clean HTML table of formatting?

Anyone know of a way to clean a <table> of all formatting leaving just the basic tags and text?
I have tried Komposer, which was useless and even added more formatting rubbish of its own. I then tried Aptana, but that only seems to be a text editor, so again no use at all.
Any ideas?
When you would like to clean HTML tables (e.g. when you copy them from Word or Excel to an HTML editor) you can use the online Table Cleaner at https://www.r2h.nl/tablecleaner
It strips all the formatting and returns only clean HTML code, so you will have a table without any styling.
How about using a text editor that supports find and replace using regular expressions (such as Notepad++) to remove the unwanted attributes using one regex, and the font tags using another regex?
To match the attributes you need to remove the following regex should do the job:
( style| class| height| width)=("[A-Za-z0-9:;_ -]*"|'[A-Za-z0-9:;_ -]*'|[A-Za-z0-9:;_-]*)
To match font tags, try
<font.*font>
(I've tested these regular expressions with http://gskinner.com/RegExr/).
Edit
It turns out that Notepad++ does not support the logical OR operator in regular expressions. An alternative would be to use another text editor that does, or to write a small app/script to perform the replacements.
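The "small script" option could look like the following Python sketch. The patterns are adapted from the ones above rather than copied: the attribute pattern accepts any quoted or unquoted value (real style values contain characters such as %, # and .), and the font pattern removes only the opening and closing tags so the text between them survives. The function name clean_table is hypothetical.

```python
import re

# Matches style/class/height/width attributes with a double-quoted,
# single-quoted, or unquoted value.
ATTR_RE = re.compile(
    r"""\s(?:style|class|height|width)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE,
)
# Matches <font ...> and </font> tags only, not the text they enclose.
FONT_RE = re.compile(r"</?font\b[^>]*>", re.IGNORECASE)


def clean_table(html):
    """Strip presentational attributes and font tags from a table fragment."""
    html = ATTR_RE.sub("", html)
    return FONT_RE.sub("", html)


if __name__ == "__main__":
    messy = ('<table width="100%" class=x><tr><td style=\'color: #f00;\' height=20>'
             '<font face="Arial">A</font></td></tr></table>')
    print(clean_table(messy))
```

Being regex-based, this can also hit text that merely looks like an attribute (for example " class=foo" inside a cell), so it is best suited to one-off cleanup of markup you can inspect afterwards.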

Text style affecting the whole site

I've got an input where the user can type either HTML or plain text. When the user copies and pastes text from MS Word, for example, it generates weird HTML. Then, when you view that topic, you can see that the whole page's style is affected. I don't really know whether the generated HTML has unclosed tags or something, but it looks like it does, and so the style of the page is affected.
Does anybody know how to "isolate" the html of that div(or whatever the container be) from the whole page's style?
Short of showing the content in an IFRAME, you can't really do that. What I usually do in this situation is apply tag stripping logic to the content as it comes in. You really don't want to allow arbitrary HTML from a security perspective, but even if you don't care what your users input, you should be stripping out invalid HTML tags (Word has a habit of creating tags with weird namespace-looking things like o:p) and running something like Tidy over the result to ensure every tag is properly closed. There are a number of Tidy libraries for .NET out there; here's one.
Here's a quick cut-and-paste of how I've done this in the past. Note that the class implements an interface from the project I used it in, but you get the general idea.
Text copied from Word can include <style> tags. The only sure way to isolate these styles is to put the input control in an <iframe>.
You can either sanitize the input or display it in an IFrame.
If it were me, I'd strip all but basic formatting (e.g., bold, italics) and use Tidy. That's what I end up doing: I strip and convert all the CSS styles of Word into <strong>, <em>, etc.
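A rough Python sketch of that last idea follows, under some stated assumptions: bold and italic are detected only from b/strong/i/em tags and from font-weight:bold or font-style:italic in a span's inline style, every other tag is dropped while its text is kept, and the Tidy pass is left out. The class and function names are made up for the example.

```python
from html import escape
from html.parser import HTMLParser

# Tags that already mean bold or italic, mapped to their semantic form.
BASIC = {"b": "strong", "strong": "strong", "i": "em", "em": "em"}


class BasicFormatting(HTMLParser):
    """Keep only bold/italic, converting Word-style inline CSS on spans."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_spans = []  # tags opened on behalf of each open <span>
        self.in_style = False

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True
        elif tag in BASIC:
            self.out.append(f"<{BASIC[tag]}>")
        elif tag == "span":
            style = (dict(attrs).get("style") or "").replace(" ", "").lower()
            opened = []
            if "font-weight:bold" in style:
                opened.append("strong")
            if "font-style:italic" in style:
                opened.append("em")
            self.out.extend(f"<{name}>" for name in opened)
            self.open_spans.append(opened)

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False
        elif tag in BASIC:
            self.out.append(f"</{BASIC[tag]}>")
        elif tag == "span" and self.open_spans:
            for name in reversed(self.open_spans.pop()):
                self.out.append(f"</{name}>")

    def handle_data(self, data):
        if not self.in_style:
            self.out.append(escape(data, quote=False))


def to_basic_formatting(html):
    parser = BasicFormatting()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because the <style> block and all class and style attributes are gone from the output, the result can no longer restyle the surrounding page; running it through Tidy afterwards would take care of any tags the input left unclosed.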