Store arbitrary characters in Semantic MediaWiki - mediawiki

I'm trying to store some text containing html tags in properties, which doesn't work. I created a form for a property with the data type 'text' and a template. Saving the form writes the text into the template, but it can't be displayed, I guess because it contains illegal characters.
What I'm trying to do:
I need a form to enter data containing html tags and special characters.
I'd like to be able to use a query to find all those pages and show that text using a template I provide to the ask query.
I also tried to use the free text option, but then I can't retrieve it using the ask query.
What would be the best, or at least a working solution to this?
Thanks a lot

Storing text with html tags is a bit tricky in Semantic MediaWiki.
The reason is the invention of the StripMarkers UNIQ/QINU by the MediaWiki developers.
When a page containing html tags is parsed, the parsing of those tags is sort of "postponed". This technical detail unfortunately makes it hard for extension developers like the SMW developers to handle such content. It also makes it hard for lay people to follow the discussion on how to solve the problem.
Here are two examples of SMW issues that are marked as "closed". That status means that following the configuration hints in the issue should solve your problem. If not, please ask a question on the SMW issue list or even ask for the issues to be reopened.
https://github.com/SemanticMediaWiki/SemanticMediaWiki/pull/794
https://github.com/SemanticMediaWiki/SemanticMediaWiki/issues/3707

On my wiki we ran into this and resolved it by replacing special characters (we had issues with [ ] =, but the same problem happens with < > tags too) with alternate Unicode characters, using the regex extension and a template, before setting the property with {{#set:}}. If you want to display the formatted text on the wiki directly, call that parameter separately without replacing the characters.
When you want to display the property, you can then run the reverse replacement with regex before displaying your now intact code (using the template result format to allow you to perform the operation on the output of the query).
To switch to special characters you can create this template
{{#regex:{{#regex:{{#regex:{{#regex:{{#regex:{{{1|}}}|/=/|꞊}}|/\[/|[}}|/\]/|]}}|/>/|≽}}|/</|≼}}
And to switch back you can use this as a template
{{#regex:{{#regex:{{#regex:{{#regex:{{#regex:{{{1|}}}|/꞊/|=}}|/[/|[}}|/]/|]}}|/≽/|>}}|/≼/|<}}
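The same substitute-then-restore idea can be sketched outside the wiki in Python. This is only an illustration of the technique, not part of the templates above: the function names are made up, and only the three replacements that are unambiguous in the templates (= < >) are included.

```python
# Hypothetical helper names; the mapping mirrors the =, < and > replacements
# used in the templates above. The bracket replacements are left out.
TO_SAFE = str.maketrans({"=": "꞊", "<": "≼", ">": "≽"})
FROM_SAFE = str.maketrans({"꞊": "=", "≼": "<", "≽": ">"})


def encode_for_property(text: str) -> str:
    """Swap markup-significant characters for look-alikes before storing."""
    return text.translate(TO_SAFE)


def decode_for_display(text: str) -> str:
    """Reverse the substitution when the stored value is displayed."""
    return text.translate(FROM_SAFE)


stored = encode_for_property('<b class="x">bold</b>')
restored = decode_for_display(stored)
```

One limitation of this approach: text that already contains one of the look-alike characters will not survive the round trip unchanged.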

Related

Maintaining font style/formatting into a form that doesn't support html/markdown

I have looked into the previous postings to do with this area but haven't found any relevant answers as perhaps I am asking the wrong question.
On the popular design site Dribbble, there seem to be interesting formatting changes in profile names that break from the conventions of the site's styling.
A lot of people have been adding special characters (ΔδΓ etc.) by pasting them into their profile form and saving changes, yet some users have somehow managed to enter formatted versions of their name, despite the profile form not supporting HTML or Markdown. You can see an example in the images below.
An example of copying the font to Google with maintained formatting
When opening in inspector, it also shows the formatted type
How could this be done in a simple text input form that doesn't support HTML/Markdown?
These are almost certainly Unicode characters, just like these characters that you reference in your question: ΔδΓ.
For example, Unicode's mathematical alphanumeric symbols section includes symbols that look like the ones in your screenshot. Since these are separate Unicode characters there is no need for additional formatting.
Users will need to have a font that supports those characters installed locally to view them.
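As a rough illustration of how such "formatted" names can be produced, the sketch below maps plain ASCII letters onto the Mathematical Bold letters of the Unicode mathematical alphanumeric symbols block. The function name is made up for this example, and this is only one of several styles in that block.

```python
import unicodedata

# Code points of MATHEMATICAL BOLD CAPITAL A and MATHEMATICAL BOLD SMALL A.
BOLD_UPPER_A = 0x1D400
BOLD_LOWER_A = 0x1D41A


def to_math_bold(text: str) -> str:
    """Replace ASCII letters with their Mathematical Bold counterparts."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(BOLD_UPPER_A + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(BOLD_LOWER_A + ord(ch) - ord("a")))
        else:
            out.append(ch)  # digits, spaces and punctuation stay as they are
    return "".join(out)


styled = to_math_bold("Jane Doe")
first_char_name = unicodedata.name(styled[0])
```

The result is still plain text, so it can be pasted into any input that accepts Unicode; whether it displays correctly depends on the viewer's fonts.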

Django templatetag for rendering a subset of html

I have some html (in this case created via TinyMCE) that I would like to add to a page. However, for security reasons, I don't want to just print everything the user has entered.
Does anyone know of a templatetag (a filter, preferably) that will allow only a safe subset of html to be rendered?
I realize that markdown and others do this. However, they also add additional markup syntax which could be confusing for my users, since they are using a rich text editor that doesn't know about markdown.
There's removetags, but it's a blacklisting approach which fails to remove tags when they don't look exactly like the well-formed tags Django expects, and of course since it doesn't attempt to remove attributes it is totally vulnerable to the 1,000 other ways of script-injection that don't involve the <script> tag. It's a trap, offering the illusion of safety whilst actually providing no real security at all.
HTML-sanitisation approaches based on regex hacking are almost inevitably a total fail. Using a real HTML parser to get an object model for the submitted content, then filtering and re-serialising in a known-good format, is generally the most reliable approach.
If your rich text editor outputs XHTML it's easy, just use minidom or etree to parse the document then walk over it removing all but known-good elements and attributes and finally convert back to safe XML. If, on the other hand, it spits out HTML, or allows the user to input raw HTML, you may need to use something like BeautifulSoup on it. See this question for some discussion.
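A minimal sketch of the parse, filter, and re-serialise approach, using only Python's standard html.parser. The allowed tags, attributes, and URL schemes are assumptions chosen for illustration, and the class and function names are hypothetical. It does not balance unclosed tags, so for production use an established sanitisation library is the safer choice.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist for this example; adjust to what your editor produces.
ALLOWED_TAGS = {"p", "b", "i", "strong", "em", "ul", "ol", "li", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_SCHEMES = ("http://", "https://", "mailto:")
DROP_WITH_CONTENT = {"script", "style"}


class WhitelistSanitizer(HTMLParser):
    """Re-emits only known-good elements and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tags are dropped, their text is kept
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, ()):
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # rejects javascript: and other schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))  # text is always re-escaped


def sanitize(html_text: str) -> str:
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Because the output is rebuilt from the parsed events rather than edited in place, anything the filter does not explicitly recognise never reaches the page.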
Filtering HTML is a large and complicated topic, which is why many people prefer the text-with-restrictive-markup languages.
Use HTML Purifier, html5lib, or another library that is built to do HTML sanitization.
You can use removetags to specify a list of tags to be removed:
{{ data|removetags:"script" }}

Rails - Escaping HTML using the h() AND excluding specific tags

I was wondering how to accomplish the following, and have not yet been able to find any answers online.
Let's say I have a string that contains the following:
my_string = "<strong>Hello</strong>, I am a <i>string</i>."
(in the preview window I see that this is actually formatting in BOLD and ITALIC instead of showing the "strong" and "i" tags)
Now, I would like to make this secure, using the html_escape() (or h()) method/function.
So I'd like to prevent users from inserting any javascript and/or stylesheets, however, I do still want to have the word "Hello" shown in bold, and the word "string" shown in italic.
As far as I can see, the h() method does not take any additional arguments, other than the piece of text itself.
Is there a way to escape only certain html tags, instead of all? Like either White or Black listing tags?
Example of what this might look like, of what I'm trying to say would be:
h(my_string, :except => [:strong, :i]) # => so basically, escape everything, but leave "strong" and "i" tags alone, do not escape these.
Is there any method or way I could accomplish this?
Thanks in advance!
Excluding specific tags is actually a pretty hard problem. The script tag in particular can be inserted in many different ways, and detecting them all is very tricky.
If at all possible, don't implement this yourself.
Use the white list plugin or a modified version of it. It's superb!
You can have a look at Sanitize as well (it seems better, though I have never tried it).
Have you considered using RedCloth or BlueCloth instead of actually allowing HTML? These methods provide quite a bit of formatting options and manage parsing for you.
Edit 1: I found this message when browsing around for how to remove HTML using RedCloth, might be of some use. Also, this page shows you how version 2.0.5 allows you to remove HTML. Can't seem to find any newer information, but a forum post found a vulnerability. Hopefully it has been fixed since that was from 2006, but I can't seem to find a RedCloth manual or documentation...
I would second Sanitize for removing HTML tags. It works really well. It removes everything by default and you can specify a whitelist for tags you want to allow.
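For the narrow case in the question (escape everything, but keep bare strong and i tags), one simple technique is to escape the whole string first and then restore only attribute-free whitelisted tags. The sketch below shows this in Python rather than Ruby; escape_except is a hypothetical name, not an existing helper, and this is not how Sanitize works internally.

```python
import re
from html import escape


def escape_except(text: str, allowed=("strong", "i")) -> str:
    """Escape all HTML, then re-enable bare <tag> and </tag> for allowed names."""
    escaped = escape(text, quote=True)
    names = "|".join(re.escape(name) for name in allowed)
    # Only matches tags with no attributes, e.g. &lt;i&gt; or &lt;/i&gt;.
    pattern = re.compile(rf"&lt;(/?)({names})&gt;", re.IGNORECASE)
    return pattern.sub(r"<\1\2>", escaped)


result = escape_except("<strong>Hello</strong>, I am a <i>string</i>. <script>x</script>")
```

Since the restore step only matches tags without attributes, something like an i tag carrying an onclick handler stays escaped. The trade-off is that this cannot allow any attributes at all, which is where a whitelist-based library becomes the better fit.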
Preventing XSS attacks is serious business. Follow hrnt's advice, and consider that there are probably an order of magnitude more exploits possible than that, due to obscure browser quirks. Although html_escape will lock things down pretty tightly, I think it's a mistake to use anything homegrown for this type of thing. You simply need more eyeballs and peer review for any kind of robustness guarantee.
I'm in the process of evaluating sanitize vs XssTerminate at the moment. I prefer the xss_terminate approach for its robustness: scrubbing at the model level will be quite reliable in a regular Rails app where all user input goes through ActiveRecord. However, Nokogiri and specifically Loofah seem to be a little more performant, more actively maintained, and definitely more flexible and Ruby-ish.
Update: I've just implemented a fork of ActsAsTextiled called ActsAsSanitiled that uses Sanitize (which has recently been updated to use nokogiri, by the way) to guarantee safety and well-formedness of the RedCloth output, all without needing any helpers in your templates.

Ultimate Website Testing String

I've been grappling with the fraught area of escaping user (text) input for web pages. The ultimate goal is to have user input displayed and stored exactly as typed in, without breaking anything.
To that end I have been using the following test string :
'"_$%^&*()+=-£{}[]/n/<>\#~;|,.?#:!&``"'
It seems to work well (even Stack Overflow or Twitter is not immune, hence the back ticks). My question is, will this string capture most escaping problems, for example going from a web page via Ajax and to a database and back again?
In fact how do I display this string in Stack Overflow without the back ticks?
Is there a better one, e.g. say one that will highlight encoding problems too?
When I'm testing, I'm using something like this
a’b<’>",!"/%$?$&?%(()%/"!"/&?%$/"&$/"?%&?-f¯Ñ112üêù
This is generally sufficient to highlight encoding issues, at least from what I can see.
Including a mathematical symbol such as Unicode U+2202 (∂) might be useful too.
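A test string is only useful if it is actually pushed through every hop. The sketch below sends the string above (plus U+2202) through a JSON payload, an in-memory SQLite table, and HTML escaping, then checks that it comes back unchanged. The table name and the round_trip function are invented for this example; a real application would substitute its own transport and database.

```python
import html
import json
import sqlite3

# The answer's test string with U+2202 (partial differential) appended.
TEST_STRING = 'a’b<’>",!"/%$?$&?%(()%/"!"/&?%$/"&$/"?%&?-f¯Ñ112üêù' + "\u2202"


def round_trip(text: str):
    """Return (value read back from the database, HTML-escaped value)."""
    # 1. Serialise as an Ajax-style JSON payload and decode it again.
    payload = json.dumps({"value": text}).encode("utf-8")
    received = json.loads(payload.decode("utf-8"))["value"]

    # 2. Store and reload using a parameterised query (never string concatenation).
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE notes (body TEXT)")
        conn.execute("INSERT INTO notes (body) VALUES (?)", (received,))
        (stored,) = conn.execute("SELECT body FROM notes").fetchone()
    finally:
        conn.close()

    # 3. Escape for display in an HTML page.
    return stored, html.escape(stored, quote=True)


stored, rendered = round_trip(TEST_STRING)
```

If stored differs from the input, the problem is in transport or storage; if unescaping rendered does not give the input back, the problem is in the output escaping.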
That seems like it should be all of them. The smartest thing to do would be to (depending on the language you're using) use a library that has been well tested, that can sanitize user input. Just ask around what other websites use.
See here: http://gendoh.com/2511063
The post itself is written in Korean, but you can see what makes the difference between the given patterns. (V1 to V3 are for generic web apps, while V4 and V5 are for JavaScript.)

Why do I need Markdown?

Why do I need Markdown with a front-end editor like WMD? What does Markdown do to the content that’s sent from the WMD editor?
How does Markdown store the content in the backend? Is it the same way like *bold* or in some other format? Why can’t I just do an html encode?
Sorry if I sounded very naïve.

It's probably helpful to take a step back and ask some of the larger questions. The issue Markdown is trying to solve is that of rich editing in the browser. Consider this: at some point, for any piece of software to enable rich text, it has to describe the richness in some manner, however that may be.
We could call that description of richness "markup" (by description of richness I mean something like "this bit of text is bold" or "this bit of text is a hyperlink"): it marks up the text with meta "richness".
Implementations of rich text can take on two approaches, either a.) hide the markup from the user or b.) let them have access to the markup.
For those who choose to hide it, the end result is very often WYSIWYG. The user is oblivious to what is happening behind the scenes. The editor takes care of the details. Think MS Word as an example. No one manipulates the Word markup format as a regular end user.
For implementations which choose to expose the markup, a markup language is needed to allow users to interact with it. Such markup languages would be things like HTML, with <tag>, or BB code, with [tag].
Markdown is one such of these languages.
As opposed to the former types I mentioned, Markdown has tried to design itself so that the markup renders common ASCII people already use. For example, it's common for people to asterisk their text to set it off, *important*, and this notation in Markdown is an indicator of italic.
In regards to storage, as Stephan pointed out, the system will most likely store the raw markdown, because the user will most likely need to have the possibility of editing, and the original markdown can be recalled for that purpose.
In most of the systems I've built, I store the markdown, and then normalize it to a 2nd field which caches the HTML rendering of the markdown. This way I don't have to do markdown->HTML rendering for every markdown field. It takes a little more space, but I'd rather the user have a faster response than use less DB storage space.
Care should also be taken when accepting Markdown from the browser, as it can easily contain <script> tags which need to be filtered out. Most Markdown implementations will also recognize HTML intermingled with Markdown formatting, so to be safe, you need to make sure your inputs and caches are sanitized properly.
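The store-the-source, cache-the-rendering pattern described above can be sketched as follows. Everything here is illustrative: Post and toy_render are hypothetical names, and toy_render is a stand-in that handles only *emphasis*, not a real Markdown implementation. It escapes all HTML before rendering, which is one simple way to keep the cached field safe.

```python
import html
import re
from dataclasses import dataclass, field
from typing import Callable


def toy_render(markdown_text: str) -> str:
    """Stand-in renderer: escape HTML, then turn *text* into <em>text</em>."""
    escaped = html.escape(markdown_text)
    return re.sub(r"\*([^*\n]+)\*", r"<em>\1</em>", escaped)


@dataclass
class Post:
    render: Callable[[str], str]   # any Markdown-to-HTML function
    body_markdown: str = ""        # what the user typed, kept for re-editing
    body_html: str = field(default="", init=False)  # cached rendering

    def __post_init__(self):
        self.body_html = self.render(self.body_markdown)

    def update(self, new_markdown: str) -> None:
        """Store the new source and refresh the cache in one step."""
        self.body_markdown = new_markdown
        self.body_html = self.render(new_markdown)


post = Post(toy_render, "this is *important*")
first_html = post.body_html
post.update("<script>x</script> *ok*")
```

Pages read body_html directly, so rendering happens once per edit rather than once per view, while the edit form is always populated from body_markdown.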
The reason for using an encoding system other than HTML is security.
Markdown and other such wiki-style encoding systems do not usually support scripting languages.
HTML supports scripting languages in many ways (
The two main security issues are:
Malware criminals use scripts in user-generated content to attempt malware actions on the content reader's computer, scripting access to known security holes.
Freeloaders use scripts to subvert the rest of the site by changing the content frame or styles, i.e. ads, menus, logos etc. This can also be criminal behaviour, if not just annoying.
By using an intermediate language such as Markdown you have total control over the rendered output.
Filtering HTML is possible, but is also complex and risky.
The other significant reason for an alternate encoding system is enforcement of style. Normal HTML has too many options. By limiting the available options, users can only use certain styles. This usually makes for cleaner-looking and more readable content (compare SO to eBay).
The main reason for using Markdown is the readability of the marked-up text. For instance, you can send it in a plain-text email and the reader will still understand the emphasis and bullets, the text will be divided into paragraphs, et cetera.
When you ask about storing data, it depends. If you enable Markdown in the WordPress blog engine, it stores data as the user has input it, in Markdown. In Stack Overflow, however, it seems like the data is stored as HTML. At least, the "Stack Overflow data dumps" contain HTML, not Markdown (I've seen people complaining that they have to convert it back).
If you use the WMD editor, you can show the user what the output will look like after being converted to HTML. Even though Markdown syntax is really simple, it is not hard to make mistakes. Hence, it is best to show users the output.
Another reason for using Markdown instead of a WYSIWYG control: a WYSIWYG control allows the user to use HTML in data you are displaying on your web page. So, you have to be the one who decides when there is simply incorrect HTML and when it is an evil XSS/CSRF/whatever injection. In Markdown, you simply convert *something* to <b>something</b>, remove any unknown HTML elements and you're done.