Amend HTML Grammar based on attributes in TextMate - html

I've recently started experimenting with jQuery Templates, which rely on your ability to wrap HTML within SCRIPT tags.
<script id="movieTemplate" type="text/x-jquery-tmpl">
<li>
<b>${Name}</b> (${ReleaseYear})
</li>
</script>
The problem is, TextMate naturally assumes that anything within SCRIPT tags is JavaScript. I'm sure it's possible to make TextMate treat the content differently based on the type attribute, but I'm struggling with some of the grammar being used in the bundle. I'm pretty confident that the line below is key, but I'm not sure where to start.
begin = '(?:^\s+)?(<)((?i:script))\b(?![^>]*/>)';
Has anyone already dealt with a similar scenario? Would someone be able to point me in the right direction?
Rich

begin = '(?:^\s+)?(<)((?i:script))\b(?!([^>]*text/x-jquery-tmpl[^>]*|[^>]*/>))';
This will stop TextMate from treating script tags that contain "text/x-jquery-tmpl" as JavaScript.

That's a regular expression. You could extend it to check for the type text/javascript like this:
begin = '(?:^\s+)?(<)((?i:script))\b(.*?type="text/javascript")(?![^>]*/>.*)';
I have only tested it with the if snippet, but it seems to work. When the type is text/javascript, TextMate expands it as JavaScript; for every other type it uses the PHP expansion (just as it does outside of script tags).
You can read more about how TextMate uses regular expressions here: Regex (TextMate Manual)

The capture groups are meaningful, so the added group must be non-capturing. You need to change it to this:
begin = '(?:^\s+)?(<)((?i:script))\b(?:.*?type="text/javascript")(?![^>]*/>)';
This keeps the current capture group configuration intact.
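The behaviour of the three variants above can be checked outside TextMate. The following is only a rough sketch in Python: its re module is not the regex engine TextMate uses, and the helper name starts_js_block and the sample tags are made up for illustration. It shows which opening tags each begin pattern would accept, and how many capture groups each pattern defines.

```python
import re

# Original rule from the bundle: any <script ...> tag that is not self-closing.
ORIGINAL = re.compile(r'(?:^\s+)?(<)((?i:script))\b(?![^>]*/>)')

# First answer: also reject tags that mention the template type.
SKIP_TEMPLATES = re.compile(
    r'(?:^\s+)?(<)((?i:script))\b(?!([^>]*text/x-jquery-tmpl[^>]*|[^>]*/>))'
)

# Second answer: require type="text/javascript", using a capturing group.
CAPTURING = re.compile(
    r'(?:^\s+)?(<)((?i:script))\b(.*?type="text/javascript")(?![^>]*/>.*)'
)

# Third answer: same requirement, but with a non-capturing group.
REQUIRE_JS = re.compile(
    r'(?:^\s+)?(<)((?i:script))\b(?:.*?type="text/javascript")(?![^>]*/>)'
)

# Sample tags (illustrative only).
TEMPLATE_TAG = '<script id="movieTemplate" type="text/x-jquery-tmpl">'
JS_TAG = '<script type="text/javascript">'
SELF_CLOSING_TAG = '<script src="a.js"/>'


def starts_js_block(pattern, line):
    """Return True if the pattern would open a JavaScript block on this line."""
    return pattern.search(line) is not None


if __name__ == "__main__":
    variants = [
        ("original", ORIGINAL),
        ("skip templates", SKIP_TEMPLATES),
        ("capturing", CAPTURING),
        ("require js", REQUIRE_JS),
    ]
    for name, pattern in variants:
        print(
            name,
            "| template:", starts_js_block(pattern, TEMPLATE_TAG),
            "| javascript:", starts_js_block(pattern, JS_TAG),
            "| capture groups:", pattern.groups,
        )
```

The group count matters because the grammar assigns scopes to captures by number; the non-capturing variant keeps the same two groups as the original rule.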

Related

Does MediaWiki support links inside highlighted code?

Assume I have the following code section:
<syntaxhighlight lang = "php">
function my_func($str) {
$arr = split($str, ' ');
}
</syntaxhighlight>
This would be highlighted with the help of the GeSHi extension. However, I would also like to make split a link to an external site with documentation explaining what this function does. Is there any way to do that in MediaWiki for highlighted code?
Since GeSHi works like the <pre> tag, the code is displayed as typed instead of being parsed as wikicode, so MediaWiki can't parse anything inside it. Therefore it's impossible to add a 'normal' link using wikicode.
The good news is that GeSHi already has exactly what you need!
First, you will need to set this in LocalSettings.php:
$wgSyntaxHighlightKeywordLinks = true;
Once that is set, each function will be a link to http://www.php.net/<function name> (since your example is using PHP code).
If you want the links to point somewhere else (your own site, maybe), you will need to edit the 'URLS' array in $IP/SyntaxHighlight_GeSHi/geshi/geshi/php.php
(More information is in GeSHi's documentation.)
And if you need links on functions for languages other than PHP, just edit the corresponding file instead. For example:
$IP/SyntaxHighlight_GeSHi/geshi/geshi/lolcode.php

Why do I need XSS library while I can use Html-encode?

I'm trying to understand why I need to use an XSS library when I can merely do HtmlEncode when sending data from the server to the client.
For example, here on Stackoverflow.com, in the editor, all the SO team needs to do is save the user input and display it HTML-encoded.
This way, there is never going to be an HTML tag that gets executed.
I'm probably wrong here, but can you please contradict my statement, or explain?
For example:
I know that an IMG tag, for example, can have onmouseover or onload attributes with which a user can run malicious scripts, but the IMG won't even run in the browser as an IMG, since it's &lt;img&gt; and not <img>.
So, where is the problem?
HTML-encoding is itself one feature an “XSS library” might provide. This can be useful when the platform doesn't have a native HTML encoder (eg scriptlet-based JSP) or the native HTML encoder is inadequate (eg not escaping quotes for use in attributes, or ]]> if you're using XHTML, or #{} if you're worried about cross-origin-stylesheet-inclusion attacks).
There might also be other encoders for other situations, for example injecting into JavaScript strings in a <script> block or URL parameters in an href attribute, which are not provided directly by the platform/templating language.
Another useful feature an XSS library could provide might be HTML sanitisation, for when you want to allow the user to input data in HTML format, but restrict which tags and attributes they use to a safe whitelist.
Another less-useful feature an XSS library could provide might be automated scanning and filtering of input for HTML-special characters. Maybe this is the kind of feature you are objecting to? Certainly trying to handle HTML-injection (an output stage issue) at the input stage is a misguided approach that security tools should not be encouraging.
HTML encoding is only one aspect of making your output safe against XSS.
For example, if you output a string to JavaScript using this code:
<script>
var enteredName = '<%=EnteredNameVariableFromServer %>';
</script>
You will want to hex-escape the variable for proper insertion into JavaScript, not HTML-encode it. Suppose the value of EnteredNameVariableFromServer is O'leary; then the rendered code, when properly encoded, will become:
<script>
var enteredName = 'O\x27leary';
</script>
In this case this prevents the ' character from breaking out of the string and into the JavaScript code context, and also ensures proper treatment of the variable (HTML-encoding it would result in the literal value O&#39;leary being used in JavaScript, affecting processing and display of the value).
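To make the difference between the two encodings concrete, here is a minimal sketch in Python. The function name js_string_escape is hypothetical and not taken from any particular XSS library; it simply hex-escapes every character that is not an ASCII letter or digit, which is one conservative way to build a JavaScript string literal safely.

```python
import html


def js_string_escape(value: str) -> str:
    """Escape a value for use inside a quoted JavaScript string literal.

    Hypothetical helper: ASCII letters and digits pass through, every other
    character becomes a \\xHH or \\uHHHH escape.
    """
    out = []
    for ch in value:
        code = ord(ch)
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        elif code < 0x100:
            out.append(f"\\x{code:02X}")
        elif code < 0x10000:
            out.append(f"\\u{code:04X}")
        else:
            # Characters outside the BMP need a UTF-16 surrogate pair.
            code -= 0x10000
            high = 0xD800 + (code >> 10)
            low = 0xDC00 + (code & 0x3FF)
            out.append(f"\\u{high:04X}\\u{low:04X}")
    return "".join(out)


if __name__ == "__main__":
    name = "O'leary"
    # Correct for a JavaScript string context.
    print("var enteredName = '" + js_string_escape(name) + "';")
    # HTML encoding is meant for HTML text and attributes, not for script.
    print(html.escape(name))
```

Inside a script block the browser does not decode HTML entities, so the HTML-encoded form would reach the JavaScript code as literal entity text rather than as an apostrophe.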
Side note:
Also, that's not quite true of Stack Overflow. Certain characters still have special meanings like in the <!-- language: lang-none --> tag. See this post on syntax highlighting if you're interested.

Eclipse - how to extend HTML editor to add custom tags?

I am writing an application, and inside the HTML code I have custom tags (of course these tags are parsed on the server side and the end user gets them as valid HTML code). Example of custom tag usage:
<html>
<body>
...
<Gallery type="grid" title="My Gallery" />
...
</body>
</html>
1.) How can I have eclipse recognize my custom tags inside of HTML code and add syntax highlighting to them?
2.) How can I add auto-suggestions to my custom tags? For example if I type "<Gallery " press "Ctrl+Space" - in the list of available attributes it shows me "type" and "title" and if I type "<Gallery type=" press "Ctrl+Space" I would see list of available values only for tag "Gallery" and its attribute "type".
Thanks in advance!
Not really what you want, but maybe it helps you:
You can try the Aptana plug-in for Eclipse. It lets you write your own regular expressions for HTML validation, so a custom tag would be ignored by the validator.
E.g.:
.gallery.
Eclipse allows you to add simple auto-suggestions via Templates. On Eclipse 3.7.1 (Indigo) + PHP Dev Tools (PDT) 3.0.0: Window > Preferences > Web > HTML Files > Editor > Templates
Sadly, there is no easy way: you have to roll your own parser for this, and then add both your extra elements and the base grammar (HTML) to it.
If you have your parser, you could use it to do syntax highlighting (strictly speaking, for that simple lexing is enough); and a good parser can support content assist (auto-suggestions in your terminology).
Caveats:
Creating a parser for HTML is not an easy task. Aiming at a commonly used subset may be feasible.
Even if a parser exists, the editor parts are still hard to get right.
Some help on the other hand: you could use some text editor generators to ease your work:
Eclipse IMP http://www.eclipse.org/imp/ can in theory handle any type of parser, but currently it is most optimized for LPG. The documentation is scarce, but the developers are helpful in the forums.
Xtext http://www.eclipse.org/Xtext/ has received a lot of attention for creating text editors for DSLs. The generated editors are quite nice out of the box, but it is not the best solution for large files. It has a really helpful developer community.
EMFText http://www.emftext.org/index.php/EMFText is a lesser-known entity. I don't know it in detail, but I guess it is similar to Xtext.
I know it's been a long time since this question was asked, but I hope this might help others like myself who arrive here in search of a solution.
This applies to Eclipse Mars.1 Release (4.5.1), and possibly earlier versions (I did not check).
Go to Window - Preferences.
Then, in the dialog that opens, go to Web - HTML Files - Editor - Validation.
On the right side:
Under "Ignore specified element names in validation", enter the list of custom elements you use (e.g. Gallery,tab,tabset,my-element-directives-*).
You might also want to do the same under "Ignore specified attribute names in validation" for your custom attributes (e.g. ng-*,my-attr-directives-*).
Two things to note:
After letting Eclipse do a full validation, you must also close the file and reopen it to have the warnings removed from the source code.
This method ignores those attributes under any element. I don't think there is a simple way to tell it to ignore some-attribute only if it's on some-element.
I find templates are an ok alternative but let's see if we can encourage a more robust solution; please take a moment and vote for this: https://bugs.eclipse.org/bugs/show_bug.cgi?id=422584
You need to add a new HTML template. To add a new template, complete the following steps:
1) From the Window menu, select Preferences.
2) In the Preferences page, select Web and XML > HTML Files > HTML Templates.
3) Click New.
4) Enter the new template name and a brief description of the template.
5) Using the Context drop-down list, specify the context in which the template is available.
6) In the Pattern field, enter the appropriate tags, attributes, or attribute values (the content of the template) to be inserted by content assist.
7) If you want to insert a variable, click the Variable button and select the variable to be inserted. For example, the word_selection variable indicates the word that is selected at the beginning of template insertion, and the cursor variable determines where the cursor will be after the template is inserted in the HTML document.
8) Click OK to save the new template.
You can edit, remove, import, or export a template by using the same Preferences page.
Reference : http://help.eclipse.org/kepler/index.jsp?topic=%2Forg.eclipse.wst.sse.doc.user%2Ftopics%2Ftsrcedt024.html

How do I match text in HTML that's not inside tags?

Given a string like this:
This is the foo link
... and a search string like "foo", I would like to highlight all occurrences of "foo" in the text of the HTML -- but not inside a tag. In other words, I want to get this:
This is the <b>foo</b> link
However, a simple search-and-replace won't work, because it will match part of the URL in the <a> tag's href.
So, to express the above in the form of a question: How do I restrict a regex so that it only matches text outside of HTML tags?
Note: I promise that the HTML in question will never be anything pathological like:
<img title="Haha! Here are some angle brackets to screw you up: ><" />
Edit: Yes, of course I'm aware that there are complex libraries in CPAN that can parse even the most heinous HTML, and thus alleviate the need for such a regex. On many occasions, that's what I would use. However, this is not one of those occasions, since keeping this script short and simple, without external dependencies, is important. I just want a one-line regex.
Edit 2: Again, I know that Template::Refine::Fragment can parse all my HTML for me. If I were writing an application I would certainly use a solution like that. But this isn't an application. It's barely more than a shell script. It's a piece of disposable code. Being a single, self-contained file that can be passed around is of great value in this case. "Hey, run this program" is a much simpler instruction than, "Hey, install a Perl module and then run this-- wait, what, you've never used CPAN before? Okay, run perl -MCPAN -e shell (preferably as root) and then it's going to ask you a bunch of questions, but you don't really need to answer them. No, don't be afraid, this isn't going to break anything. Look, you don't need to answer every question carefully -- just hit enter over and over. No, I promise, it's not going to break anything."
Now multiply the above across a great deal of users who are wondering why the simple script they've been using isn't so simple anymore, when all that's changed is to make the search term boldface.
So while Template::Refine::Fragment may be the answer to someone else's HTML parsing question, it's not the answer to this question. I just want a regular expression that works on the very limited subset of HTML that the script will actually be asked to parse.
If you can absolutely guarantee that there are no angle brackets in the HTML other than those used to open and close tags, this should work:
s%(>|\G)([^<]*?)($key)%$1$2<b>$3</b>%g
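The same idea, under the same no-stray-angle-brackets guarantee, can be sketched in Python for readers who want to see the mechanics spelled out. This is an illustration rather than a translation of the substitution above: it splits the string into tag and text segments and rewrites only the text segments. The function name highlight and the example URL are assumptions made for the demonstration.

```python
import re

# Splitting on a capturing group keeps the tags in the result, so the list
# alternates: text, tag, text, tag, ..., text.
TAG_SPLIT = re.compile(r"(<[^>]*>)")


def highlight(html: str, key: str) -> str:
    """Wrap occurrences of key in <b>...</b>, but only outside of tags.

    Assumes no '<' or '>' appears anywhere except as tag delimiters.
    """
    pieces = TAG_SPLIT.split(html)
    needle = re.compile(re.escape(key))
    # Even indices hold text between tags; odd indices hold the tags.
    for i in range(0, len(pieces), 2):
        pieces[i] = needle.sub(lambda m: "<b>" + m.group(0) + "</b>", pieces[i])
    return "".join(pieces)


if __name__ == "__main__":
    sample = 'This is the <a href="http://foo.example/">foo</a> link'
    print(highlight(sample, "foo"))
```

Because the URL lives inside a tag segment, it is never passed to the substitution, so the href stays untouched.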
In general, you want to parse the HTML into a DOM, and then traverse the text nodes. I would use Template::Refine for this:
#!/usr/bin/env perl
use strict;
use warnings;
use feature ':5.10';
use Template::Refine::Fragment;
my $frag = Template::Refine::Fragment->new_from_string('<p>Hello, world. This is a test of foo finding. Here is another foo.');
say $frag->process(
simple_replace {
my $n = shift;
my $text = $n->textContent;
$text =~ s/foo/<foo>/g;
return XML::LibXML::Text->new($text);
} '//text()',
)->render;
This outputs:
<p>Hello, world. This is a test of <foo> finding. Here is another <foo>.</p>
Anyway, don't parse structured data with regular expressions. HTML is not "regular", it's "context-free".
Edit: finally, if you are generating the HTML inside your program, and you have to do transformations like this on strings, "UR DOIN IT WRONG". You should build a DOM, and only serialize it when everything has been transformed. (You can still use TR, however, via the new_from_dom constructor.)
The following regex will match all text between tags or outside of tags:
<.*?>(.*?)<.*?>|>(.*?)<
Then you can operate on that as desired.
Try this one
(?=>)?(\w[^>]+?)(?=<)
it matches all words between tags
To strip off the variable size contents from even nested tags you can use this regex that is in fact a mini-regular grammar for that. (note: PCRE machine)
(?<=>)((?:\w+)(?:\s*))(?1)*

Limiting HTML Input into Text Box

How do I limit the types of HTML that a user can input into a textbox? I'm running a small forum using some custom software that I'm beta testing, but I need to know how to limit the HTML input. Any suggestions?
I'd suggest a slightly different approach:
Don't filter incoming user data (beyond preventing SQL injection). User data should be kept as pure as possible.
Filter all outgoing data from the database; this is where things like tag stripping should happen.
Keeping user data clean allows you more flexibility in how it's displayed. Filtering all outgoing data is a good habit to get into (along the lines of the "never trust data" meme).
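As a small illustration of filtering at the output stage, here is a sketch in Python. The function render_post and the surrounding markup are hypothetical; the point is only that the stored text stays exactly as the user typed it and is encoded at the moment it is written into a page.

```python
import html


def render_post(raw_body: str) -> str:
    """Build the HTML for one forum post from the unmodified stored text.

    Hypothetical helper: the raw text is kept as-is in storage and is only
    HTML-encoded here, when it is placed into the page.
    """
    return '<div class="post">' + html.escape(raw_body) + "</div>"


if __name__ == "__main__":
    stored = '<script>alert("hi")</script> and 1 < 2'
    print(render_post(stored))
```

The same stored value could later be rendered differently (plain text email, JSON, a whitelist-filtered HTML view) without having lost anything at input time.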
You didn't state what the forum was built with, but if it's PHP, check out:
http://htmlpurifier.org/
Library Features: Whitelist, Removal, Well-formed, Nesting, Attributes, XSS safe, Standards safe
Once the text is submitted, you could strip any/all tags that don't match your predefined set using a regex in PHP.
It would look something like the following:
find open tag (<)
if contents != allowed tag, remove tag (from <..>)
Parse the input provided and strip out all HTML tags that don't exactly match the list you are allowing. This can either be a complex regex, or you can do a stateful iteration through the char[] of the input string, building the allowed output string and stripping unwanted attributes on tags like img.
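The stateful approach can be sketched with Python's standard html.parser module. This is a minimal illustration, not a complete sanitiser: the whitelist contents and the names WhitelistFilter and filter_html are assumptions, all attributes are dropped rather than checked individually, and unbalanced tags are not repaired. For a real forum, a maintained library is the safer choice.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; adjust to whatever the forum should allow.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p"}


class WhitelistFilter(HTMLParser):
    """Rebuild the input, keeping only whitelisted tags without attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes are discarded entirely, so onclick, style, etc. vanish.
        if tag in ALLOWED_TAGS:
            self.parts.append("<" + tag + ">")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append("</" + tag + ">")

    def handle_data(self, data):
        # Text is re-encoded so it can never be read back as markup.
        self.parts.append(escape(data))


def filter_html(text: str) -> str:
    parser = WhitelistFilter()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    dirty = '<p onclick="x()">Hi <script>alert(1)</script><b>there</b></p>'
    print(filter_html(dirty))
```

Tags outside the whitelist are removed while their text content is kept as encoded text; comments are dropped because the parser's default comment handler does nothing.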
Use a different code system (BBCode, Markdown)
Find some code online that already does this, to use as a basis for your implementation. For example Slashcode must perform this, so look for its implementation in the Perl and use the regexes (that I assume are there)
Regardless what you use, be sure to be informed of what kind of HTML content can be dangerous.
e.g. a < script > tag is pretty obvious, but a < style > tag is just as bad in IE, because it can invoke JScript commands.
In fact, any style="..." attribute can invoke script in IE.
< object > would be one more tag to be wary of.
PHP comes with a simple function, strip_tags, to strip HTML tags. It allows certain tags to be exempted from stripping.
Example #1 strip_tags() example
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
The above example will output:
Test paragraph. Other text
<p>Test paragraph.</p> Other text
Personally, for a forum, I would use BBCode or Markdown because of the amount of support and the features they provide, such as live preview.