highlight words in html using regex & javascript - almost there - html

I am writing a jquery plugin that will do a browser-style find-on-page search. I need to improve the search, but don't want to get into parsing the html quite yet.
At the moment my approach is to take an entire DOM element and all nested elements and simply run a regex find/replace for a given term. In the replace I will simply wrap a span around the matched term and use that span as my anchor to do highlighting, scrolling, etc. It is vital that no characters inside any html tags are matched.
This is as close as I have gotten:
(?<=^|>)([^><].*?)(?=<|$)
It does a very good job of capturing all characters that are not in an html tag, but I'm having trouble figuring out how to insert my search term.
Input: Any html element (this could be quite large, eg <body>)
Search Term: 1 or more characters
Replace Txt: <span class='highlight'>$1</span>
UPDATE
The following regex does what I want when I'm testing with http://gskinner.com/RegExr/...
Regex: (?<=^|>)(.*?)(SEARCH_STRING)(?=.*?<|$)
Replacement: $1<span class='highlight'>$2</span>
However I am having some trouble using it in my javascript. With the following code chrome is giving me the error "Invalid regular expression: /(?<=^|>)(.*?)(Mary)(?=.*?<|$)/: Invalid group".
var origText = $('#'+opt.targetElements).data('origText');
var regx = new RegExp("(?<=^|>)(.*?)(" + $this.val() + ")(?=.*?<|$)", 'gi');
$('#'+opt.targetElements).each(function() {
var text = origText.replace(regx, '$1<span class="' + opt.resultClass + '">$2</span>');
$(this).html(text);
});
It's breaking on the group (?<=^|>) - is this something clumsy or a difference in the Regex engines?
UPDATE
The reason this regex breaks on that group is that Javascript does not support regex lookbehinds (at the time of writing; lookbehind was only added to the language later, in ES2018). For reference & possible solutions: http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript.
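One way to get the same effect without lookbehind is to split the HTML string on tags and run the replacement only on the pieces between them. The following is a minimal sketch of that idea, not the plugin's actual code: escapeRegExp and highlightOutsideTags are hypothetical names, and it assumes reasonably simple markup (a ">" inside an attribute value, comments, or script contents would confuse the tag pattern).

```javascript
// Escape characters that have a special meaning in a regex, so the
// user's search term is matched literally.
function escapeRegExp(term) {
    return term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Wrap every occurrence of term that lies outside a tag in a span.
function highlightOutsideTags(html, term, cssClass) {
    if (term === '') return html; // nothing to search for
    var termRe = new RegExp(escapeRegExp(term), 'gi');
    // Splitting on a capturing group keeps the tags in the result:
    // even indexes hold text, odd indexes hold tags.
    return html.split(/(<[^>]*>)/).map(function (part, i) {
        if (i % 2 === 1) return part; // a tag: leave untouched
        return part.replace(termRe, function (m) {
            return '<span class="' + cssClass + '">' + m + '</span>';
        });
    }).join('');
}
```

In the code from the question this would replace the regex and the replace() call, e.g. var text = highlightOutsideTags(origText, $this.val(), opt.resultClass);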

Just use jQuery's built-in text() method. It will return all the characters in a selected DOM element.
For the DOM approach (docs for the Node interface): Run over all child nodes of an element. If the child is an element node, run recursively. If it's a text node, search in the text (node.data) and if you want to highlight/change something, shorten the text of the node until the found position, and insert a highlight span with the matched text and another text node for the rest of the text.
Example code (adjusted, origin is here):
(function iterate_node(node) {
    if (node.nodeType === 3) { // Node.TEXT_NODE
        var text = node.data,
            match = /any regular expression/.exec(text); // indexOf also applicable
        if (match && match[0].length > 0) { // ignore empty matches
            var pos = match.index,
                length = match[0].length;
            node.data = text.substr(0, pos); // split into a part before...
            var rest = document.createTextNode(text.substr(pos + length)); // a part after
            var highlight = document.createElement("span"); // and a part between
            highlight.className = "highlight";
            highlight.appendChild(document.createTextNode(match[0]));
            node.parentNode.insertBefore(rest, node.nextSibling); // insert after
            node.parentNode.insertBefore(highlight, node.nextSibling);
            iterate_node(rest); // maybe there are more matches
        }
    } else if (node.nodeType === 1 && node.className !== "highlight") { // Node.ELEMENT_NODE, skipping our own spans
        // Copy the live childNodes list first: it grows while we insert nodes
        var children = Array.prototype.slice.call(node.childNodes);
        for (var i = 0; i < children.length; i++) {
            iterate_node(children[i]); // run recursive on DOM
        }
    }
})(content); // any dom node
There's also highlight.js, which might be exactly what you want.

Related

Is there a way of searching div element by class in GAS? [duplicate]

Is there a simple method to locate an XML node by its attribute in Google Apps Script? Here's an XML snippet:
<hd:components>
<hd:text name="ADM Custom admissions TE">
<hd:prompt>Admission</hd:prompt>
<hd:columnWidth widthType="minimum" minWidth="100"/>
</hd:text>
<hd:text name="ADM Insufficient heat end date TE">
<hd:prompt>To</hd:prompt>
</hd:text>
<hd:text name="ADM Insufficient heat start date TE">
<hd:prompt>From</hd:prompt>
</hd:text>
<hd:text name="ADM Third party payment period TE">
<hd:defMergeProps unansweredText="__________"/>
<hd:prompt>When (date or period)?</hd:prompt>
</hd:text>
For purposes of the XML file I'm trying to parse, the "name" attribute is a unique identifier, while what GAS thinks is the "name" for purposes of the XmlService.Element.getChild(name) method ("text" for each node shown in this snippet) is a non-unique classifier for the type of node. I'd like to be able to write a function to retrieve a specific node from this XML file with only the name attribute. XPath notation in other languages has this capability using the [@name='...'] predicate notation. Is there a way to do it in GAS, or do I need to write a function that walks through the XML until it finds a node with the right name attribute, or store it in some different type of data structure for fast searching if the XML file is sufficiently large?
Here's the snippet I started writing: it's fine if there's no built-in function, I just wondered if there was a better/faster way to do this. My function isn't so efficient, and I wondered if the XmlService had a more efficient internal data structure it's using to speed up searching. My approach is just to loop through all of the element's children until there's a match.
function getComponentFromXML(xml,name) {
for (var i = 0; i < xml.length; i++) {
var x = xml[i];
var xname = x.getAttribute('name').getValue();
if (xname == name) {
return getComponentAttributes(x);
}
}
}
There is no built-in search, so the only way is to read the list of elements looking for the one with the desired value of attribute 'name'. If elements is an array of elements to search through, you can do
var searchResults = elements.filter(function (e) {
return e.getAttribute('name') && e.getAttribute('name').getValue() == searchString;
});
(Both checks are needed to avoid an error when there is no 'name' attribute at all.)
How to obtain such an array of elements depends on the XML document. If, as in your example, the elements to search are the immediate children of the root element, then
var doc = XmlService.parse(xmlString);
var elements = doc.getRootElement().getChildren();
would be a quick and easy way to do this.
In general, to get all elements without recursion, the getDescendants method can be used. It returns an array of Content objects, which can be filtered down to Element objects:
var elements = doc.getDescendants().filter(function (c) {
return c.getType() == XmlService.ContentTypes.ELEMENT;
}).map(function (c) {
return c.asElement();
});
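If many lookups are made against the same document, the linear scan can be done once up front to build a lookup table, which is the "different data structure" the question asks about. This is only a sketch: indexByAttribute is a hypothetical helper (not part of XmlService), and it assumes the elements array was obtained as shown above.

```javascript
// Build a plain-object index from attribute value to element.
// Elements without the attribute are skipped; on duplicate values
// the first element wins.
function indexByAttribute(elements, attrName) {
    var index = Object.create(null); // no inherited keys to collide with
    elements.forEach(function (e) {
        var attr = e.getAttribute(attrName); // null when the attribute is absent
        if (attr && !(attr.getValue() in index)) {
            index[attr.getValue()] = e;
        }
    });
    return index;
}
```

Usage would then look like var byName = indexByAttribute(elements, 'name'); followed by byName['ADM Custom admissions TE'] for each lookup, with undefined returned for unknown names.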

How to set a certain number of spaces or indents before a Paragraph in Google Docs using Google Apps Script

I have a 20 line script, and I want to make sure that each paragraph is indented exactly once.
function myFunction() {
/*
This function turns the document's format into standard MLA.
*/
var body = DocumentApp.getActiveDocument().getBody();
body.setFontSize(12); // Set the font size of the document's contents to 12
body.setForegroundColor('#000000');
body.setFontFamily("Times New Roman");
// Loops through paragraphs in body and sets each to double spaced
var paragraphs = body.getParagraphs();
for (var i = 3; i < paragraphs.length; i++) { // Starts at 3 to exclude first 4 developer-made paragraphs
var paragraph = paragraphs[i];
paragraph.setLineSpacing(2);
// Left align the first cell.
paragraph.setAlignment(DocumentApp.HorizontalAlignment.LEFT);
// One indent
paragraph.editAsText().insertText(0, "\t"); // Adds one tab every time
}
var bodyText = body.editAsText();
bodyText.insertText(0, 'February 3, 1976\nMrs. Smith\nYour Name Here\nSocial Studies\n');
bodyText.setBold(false);
}
The code I have tried doesn't work. But my expected results are that for every paragraph in the for loop in myFunction(), there are exactly 4 spaces before the first word in each paragraph.
Here is a sample: https://docs.google.com/document/d/1sMztzhOehzheRdqumC6PLnvk4qJgUCSE0irjTZ0FjTQ/edit?usp=sharing
If the user uses Autoformat, but already has the paragraphs indented...
Update
I have investigated use of the Paragraph.setIndentFirstLine() method. When I set it to four, it sets it to 1 space. Now I realize this is because points and spaces are not the same thing. What number do I need to multiply by to get four spaces in points?
Let us consider a few basic indenting operations: manual and by script.
The following image shows how to indent current paragraph (cursor stays inside this one).
Please note, the units are centimetres. Also note, that the paragraph does not include leading spaces or tabs, we have no need of them.
Suppose we would like to get the indent values in the script and apply them to the next paragraph. Look at the code below:
function myFunction() {
var ps = DocumentApp.getActiveDocument().getBody().getParagraphs();
// We work with the 5-th and 6-th paragraphs indeed
var iFirst = ps[5].getIndentFirstLine();
var iStart = ps[5].getIndentStart();
var iEnd = ps[5].getIndentEnd();
Logger.log([iFirst, iStart, iEnd]);
ps[6].setIndentFirstLine(iFirst);
ps[6].setIndentStart(iStart);
ps[6].setIndentEnd(iEnd);
}
If you run and look at the log, you will see something like this: [92.69291338582678, 64.34645669291339, 14.173228346456694]. No surprise, we have typographic points instead of centimetres. (1cm=28.3465pt) So we can measure and modify any paragraph indent values precisely.
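To answer the "what number do I multiply by" part in code: the indent setters take typographic points, with 72 points to the inch and 2.54 cm to the inch. The helper names below are hypothetical, and the half-inch first-line indent mentioned in the comment is the usual MLA convention, an assumption rather than something measured from the sample document.

```javascript
// Convert inches to typographic points (72 pt per inch).
function inchesToPoints(inches) {
    return inches * 72;
}

// Convert centimetres to typographic points (2.54 cm per inch).
function cmToPoints(cm) {
    return cm / 2.54 * 72;
}

// In Apps Script, a half-inch first-line indent would then be:
//   paragraph.setIndentFirstLine(inchesToPoints(0.5));
```

This replaces the inserted tab character with a real paragraph indent, so running the script twice does not indent the paragraph twice.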
Addition
For some reason you might want to control the number of spaces at the beginning of a paragraph. This is also possible by scripting, but it has no effect on the paragraph's "left" or "right" indents.
The sample code below handles a similar task: it counts the leading spaces of the 5-th paragraph and puts the same number of spaces at the beginning of the next one.
function mySpaces() {
var ps = DocumentApp.getActiveDocument().getBody().getParagraphs();
// We work with the 5-th and 6-th paragraphs indeed
var spacesCount = getLeadingSpacesCount(ps[5]);
Logger.log(spacesCount);
var diff = getLeadingSpacesCount(ps[6]) - spacesCount;
if (diff > 0) {
ps[6].editAsText().deleteText(0, diff - 1);
} else if (diff < 0) {
var s = Array(1 - diff).join(' ');
ps[6].editAsText().insertText(0, s);
}
}
function getLeadingSpacesCount(p) {
var found = p.findText("^ +");
return found ? found.getEndOffsetInclusive() + 1 : 0;
}
We have used methods deleteText() and insertText() of the class Text for proper corrections and findText() to locate the spaces if any. Note, the last method argument is a string, representing a regular expression. It matches "all leading spaces", if they exist. See more details about regular expression syntax.

HTML: Escaping characters to avoid XSS attack [duplicate]

I'm writing the JS for a chat application I'm working on in my free time, and I need to have HTML identifiers that change according to user submitted data. This is usually something conceptually shaky enough that I would not even attempt it, but I don't see myself having much of a choice this time. What I need to do then is to escape the HTML id to make sure it won't allow for XSS or breaking HTML.
Here's the code:
var user_id = escape(id)
var txt = '<div class="chut">'+
'<div class="log" id="chut_'+user_id+'"></div>'+
'<textarea id="chut_'+user_id+'_msg"></textarea>'+
'<label for="chut_'+user_id+'_to">To:</label>'+
'<input type="text" id="chut_'+user_id+'_to" value='+user_id+' readonly="readonly" />'+
'<input type="submit" id="chut_'+user_id+'_send" value="Message"/>'+
'</div>';
What would be the best way to escape id to avoid any kind of problem mentioned above? As you can see, right now I'm using the built-in escape() function, but I'm not sure of how good this is supposed to be compared to other alternatives. I'm mostly used to sanitizing input before it goes in a text node, not an id itself.
Never use escape(). It's nothing to do with HTML-encoding. It's more like URL-encoding, but it's not even properly that. It's a bizarre non-standard encoding available only in JavaScript.
If you want an HTML encoder, you'll have to write it yourself as JavaScript doesn't give you one. For example:
function encodeHTML(s) {
return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/"/g, '&quot;');
}
However whilst this is enough to put your user_id in places like the input value, it's not enough for id because IDs can only use a limited selection of characters. (And % isn't among them, so escape() or even encodeURIComponent() is no good.)
You could invent your own encoding scheme to put any characters in an ID, for example:
function encodeID(s) {
if (s==='') return '_';
return s.replace(/[^a-zA-Z0-9.-]/g, function(match) {
return '_'+match[0].charCodeAt(0).toString(16)+'_';
});
}
But you've still got a problem if the same user_id occurs twice. And to be honest, the whole thing with throwing around HTML strings is usually a bad idea. Use DOM methods instead, and retain JavaScript references to each element, so you don't have to keep calling getElementById, or worrying about how arbitrary strings are inserted into IDs.
eg.:
function addChut(user_id) {
var log= document.createElement('div');
log.className= 'log';
var textarea= document.createElement('textarea');
var input= document.createElement('input');
input.value= user_id;
input.readOnly= true;
var button= document.createElement('input');
button.type= 'button';
button.value= 'Message';
var chut= document.createElement('div');
chut.className= 'chut';
chut.appendChild(log);
chut.appendChild(textarea);
chut.appendChild(input);
chut.appendChild(button);
document.getElementById('chuts').appendChild(chut);
button.onclick= function() {
alert('Send '+textarea.value+' to '+user_id);
};
return chut;
}
You could also use a convenience function or JS framework to cut down on the lengthiness of the create-set-appends calls there.
ETA:
I'm using jQuery at the moment as a framework
OK, then consider the jQuery 1.4 creation shortcuts, eg.:
var log= $('<div>', {className: 'log'});
var input= $('<input>', {readOnly: true, val: user_id});
...
The problem I have right now is that I use JSONP to add elements and events to a page, and so I can not know whether the elements already exist or not before showing a message.
You can keep a lookup of user_id to element nodes (or wrapper objects) in JavaScript, to save putting that information in the DOM itself, where the characters that can go in an id are restricted.
var chut_lookup= {};
...
function getChut(user_id) {
var key= '_map_'+user_id;
if (key in chut_lookup)
return chut_lookup[key];
return chut_lookup[key]= addChut(user_id);
}
(The _map_ prefix is because JavaScript objects don't quite work as a mapping of arbitrary strings. The empty string and, in IE, some Object member names, confuse it.)
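In environments that have the ES2015 Map type, the prefix workaround is not needed, because a Map accepts any string as a key, including the empty string and names such as "__proto__". A minimal sketch under that assumption; chutLookupMap and getOrCreateChut are hypothetical names, and the creation function is passed in so the helper does not depend on the DOM:

```javascript
var chutLookupMap = new Map();

// Return the stored value for userId, creating and storing it on first use.
function getOrCreateChut(userId, createChut) {
    if (!chutLookupMap.has(userId)) {
        chutLookupMap.set(userId, createChut(userId));
    }
    return chutLookupMap.get(userId);
}
```

With the addChut function above it would be called as getOrCreateChut(user_id, addChut).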
You can use this:
function sanitize(string) {
const map = {
'&': '&amp;',
'<': '&lt;',
'>': '&gt;',
'"': '&quot;',
"'": '&#x27;',
"/": '&#x2F;',
};
const reg = /[&<>"'/]/ig;
return string.replace(reg, (match)=>(map[match]));
}
Also see OWASP XSS Prevention Cheat Sheet.
You could use a simple regular expression to assert that the id only contains allowed characters, like so:
if(id.match(/^[0-9a-zA-Z]{1,16}$/)){
//The id is fine
}
else{
//The id is illegal
}
My example allows only alphanumerical characters, and strings of length 1 to 16, you should change it to match the type of ids that you use.
By the way, at line 6, the value property is missing a pair of quotes, an easy mistake to make when you quote on two levels.
I can't see your actual data flow, depending on context this check may not at all be needed, or it may not be enough. In order to make a proper security review we would need more information.
In general, about built-in escape or sanitize functions: don't trust them blindly. You need to know exactly what they do, and you need to establish that that is actually what you need. If it is not what you need, then code your own; most of the time a simple whitelisting regex like the one I gave you works just fine.
Since the text that you are escaping will appear in an HTML attribute, you must be sure to escape not only HTML entities but also HTML attributes:
var ESC_MAP = {
'&': '&amp;',
'<': '&lt;',
'>': '&gt;',
'"': '&quot;',
"'": '&#39;'
};
function escapeHTML(s, forAttribute) {
return s.replace(forAttribute ? /[&<>'"]/g : /[&<>]/g, function(c) {
return ESC_MAP[c];
});
}
Then, your escaping code becomes var user_id = escapeHTML(id, true).
For more information, see Foolproof HTML escaping in Javascript.
You need to take extra precautions when using user-supplied data in HTML attributes, because attributes have many more attack vectors than output inside HTML tags.
The only way to avoid XSS attacks is to encode everything except alphanumeric characters. Escape all characters with ASCII values less than 256 with the &#xHH; format. Unfortunately, this may cause problems in your scenario if you are using CSS classes and JavaScript to fetch those elements.
OWASP has a good description of how to mitigate HTML attribute XSS:
http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet#RULE_.233_-_JavaScript_Escape_Before_Inserting_Untrusted_Data_into_HTML_JavaScript_Data_Values
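As a rough illustration of the rule described above, the following sketch encodes every non-alphanumeric character with a code below 256 as &#xHH; and leaves everything else alone. The function name encodeForHTMLAttribute is hypothetical, and this is a sketch of the idea rather than a vetted security library.

```javascript
// Encode all characters below 256 that are not ASCII letters or digits
// as &#xHH; so they cannot break out of a quoted attribute value.
function encodeForHTMLAttribute(value) {
    return String(value).replace(/[^a-zA-Z0-9\u0100-\uFFFF]/g, function (c) {
        var hex = c.charCodeAt(0).toString(16).toUpperCase();
        return '&#x' + (hex.length < 2 ? '0' + hex : hex) + ';';
    });
}
```

The result is safe to place inside a quoted attribute value such as value="...", but as noted above it is not a valid id, since "&", "#" and ";" would then be part of the identifier.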
The following approach to prevent XSS looks like a good solution.
var sanitizeHTML = function (str) {
return str.replace(/[^\w. ]/gi, function (c) {
return '&#' + c.charCodeAt(0) + ';';
});
};
Here is a working example:
var sanitizeHTML = function (str) {
return str.replace(/[^\w. ]/gi, function (c) {
return '&#' + c.charCodeAt(0) + ';';
});
};
var app = document.querySelector('#app');
app.innerHTML = sanitizeHTML('<img src="x" onerror="alert(1)">');
<div id="app">
</div>
This solution was Provided Here.
Just to add to the answer by @SilentImp: if you need a TypeScript version...
export function sanitize(input: string) {
const map: Record<string, string> = {
'&': '&amp;',
'<': '&lt;',
'>': '&gt;',
'"': '&quot;',
"'": '&#x27;',
'/': '&#x2F;',
};
const reg = /[&<>"'/]/gi;
return input.replace(reg, (match) => map[match]);
}

Counting inner text letters of HTML element

Is there a way to count the letters of inner text of an HTML element, without counting the letters of inner element's texts?
I tried out the ".getText()" method of "WebElement" using the Selenium library, but this also counts the text of inner web elements (e.g. "<body><div>test</div></body>" results in 4 letters for both the "div" and the "body" element, instead of 0 for the "body" element).
Do I have to use an additional HTML parsing library, and if so, which one would you recommend?
I'm using Java 7...
Based on this answer for a similar question, I cooked you a solution:
The piece of JavaScript takes an element, iterates over all its child nodes and if they're text nodes, it reads them and returns them concatenated:
var element = arguments[0];
var text = '';
for (var i = 0; i < element.childNodes.length; i++)
if (element.childNodes[i].nodeType === Node.TEXT_NODE) {
text += element.childNodes[i].textContent;
}
return text;
I saved this script into a script.js file and loaded it into a single String via FileUtils.readFileToString(). You can use Guava's Files.toString(), too. Or just embed it into your Java code.
final String script = FileUtils.readFileToString(new File("script.js"), "UTF-8");
JavascriptExecutor js = (JavascriptExecutor)driver;
...
WebElement element = driver.findElement(By.anything("myElement"));
String text = (String)js.executeScript(script, element);
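The script above returns the element's own text; the count the question asks for still has to be derived from it. A small sketch of that last step, assuming "letters" means every non-whitespace character (countLetters is a hypothetical name):

```javascript
// Count the characters of a string, ignoring all whitespace.
// Note: length counts UTF-16 code units, so characters outside the
// Basic Multilingual Plane count as two.
function countLetters(text) {
    return text.replace(/\s/g, '').length;
}
```

This can either be applied on the Java side to the returned String in the equivalent way, or appended to script.js so that the script ends with return countLetters(text); in which case the Java side should expect a number rather than a String.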

JSFL: convert text from a textfield to a HTML-format string

I've got a deceptively simple question: how can I get the text from a text field AND include the formatting? Going through the usual docs I found out it is possible to get the text only. It is also possible to get the text formatting, but this only works if the entire text field uses only one kind of formatting. I need the precise formatting so that I convert it to a string with html-tags.
Personally I need this so I can pass it to a custom-made text field component that uses HTML for formatting. But it could also be used to simply export the contents of any text field to any other format. This could be of interest to others out there, too.
Looking for a solution elsewhere I found this:
http://labs.thesedays.com/blog/2010/03/18/jsfl-rich-text/
Which seems to do the reverse of what I need, convert HTML to Flash Text. My own attempts to reverse this have not been successful thus far. Maybe someone else sees an easy way to reverse this that I’m missing? There might also be other solutions. One might be to get the EXACT data of the text field, which should include formatting tags of some kind(XML, when looking into the contents of the stored FLA file). Then remove/convert those tags. But I have no idea how to do this, if at all possible. Another option is to cycle through every character using start- and endIndex, and storing each formatting kind in an array. Then I could apply the formatting to each character. But this will result in excess tags. Especially for hyperlinks! So can anybody help me with this?
A bit late to the party, but the following function takes a JSFL static text element as input and returns an HTML string (using the Flash-friendly <font> tag) based on the styles found in its textRuns array. It does a bit of basic regex work to clean up some tags and double spaces etc., and converts \r and \n to <br/> tags. It's probably not perfect, but hopefully you can see what's going on easily enough to change or fix it.
function tfToHTML(p_tf)
{
var textRuns = p_tf.textRuns;
var html = "";
for ( var i=0; i<textRuns.length; i++ )
{
var textRun = textRuns[i];
var chars = textRun.characters;
chars = chars.replace(/\n/g,"<br/>");
chars = chars.replace(/\r/g,"<br/>");
chars = chars.replace(/ {2,}/g," "); // collapse runs of spaces into one
chars = chars.replace(/\. <br\/>/g,".<br/>"); // drop the space between a full stop and a line break
var attrs = textRun.textAttrs;
var font = attrs.face;
var size = attrs.size;
var bold = attrs.bold;
var italic = attrs.italic;
var colour = attrs.fillColor;
if ( bold )
{
chars = "<b>"+chars+"</b>";
}
if ( italic )
{
chars = "<i>"+chars+"</i>";
}
chars = "<font size=\""+size+"\" face=\""+font+"\" color=\""+colour+"\">"+chars+"</font>";
html += chars;
}
return html;
}