Regex for index of string match spanning across several XML tags - html

I'm trying to insert a link in TLF. Normally you would simply use
var linkElement:LinkElement = textArea.textFlow.interactionManager.applyLink( ... );
The problem is that, if I create a link which spans across differently formatted text (bold, italic, etc), or heaven forbid across paragraphs and list items, it completely and utterly crashes and burns. Link formatting is completely lost, and list structures collapse.
Simply adding a LinkElement via addChild() doesn't work either, if we're going to keep both the formatting and the structure within the selected text.
Ripping out the textFlow for the selection with interactionManager.cutTextScrap(...), wrapping it in a LinkElement with interactionManager.applyLink( ... ), and then "pasting" back in... also creates a mess.
So I have to create my own link insertion routine.
What I've resolved to do is to:
1) convert the textflow tags to a string
2) find the start and end indexes of the selection within the textflow string
3) insert the following string at the start index:
</span><a href="[hrefVar]" target="[targetVar]"><span>
4) insert the following string at the end index:
</span></a><span>
5) reconvert the textflow string into a textflow object for the TextArea
And voila! Instant RTF link!
The only problem is... I have no idea how to write a regex that can find the start and end indexes of a string match inside XML markup when the match may be spread across several tags.
For instance, if the TextFlow is (abbreviated):
<TextFlow><p><span>Lorem Ip</span><span fontWeight="bold">sum do</span><span>
lor sit am</span><span fontStyle="italic">et, consectetur adipiscing elit.
</span></p></TextFlow>
Say, for instance, the user has selected "Ipsum dolor sit amet" to be converted into a link. I need to find the first and last indexes of "Ipsum dolor sit amet" within that RTF markup, and then insert the strings indicated in 3) & 4) above, so that the end result looks like this:
<TextFlow><p><span>Lorem </span><a href="http://www.google.ca" target="_blank">
<span>Ip</span><span fontWeight="bold">sum do</span><span>lor sit am</span>
<span fontStyle="italic">et</span></a><span>, consectetur adipiscing elit.
</span></p></TextFlow>
You might lose some style formatting, but I can fix that later parsing through the textflow formatting.
What I need is the regex to do step 2).
I know the regex to ignore tags and strip out the text between tags, and how to find a string match of the selected text in the stripped textflow text... but not how to find the match indexes within the original (unstripped) textflow string.
Anyone?

IMHO a better way is to walk through the string character by character instead of trying to do this with a regex.
Here is an idea for a quick and dirty way. This code needs to be improved, but it might point you in the right direction.
The main goal is to "throw out" the tags and match only the text, while counting how many characters have been passed in the process.
// Caveat: a literal < or > inside text content (instead of an escaped entity) would be mistaken for a tag boundary.
var sourceStr:String = '<TextFlow><p><span>Lorem Ip</span><span fontWeight="bold">sum do</span><span>lor sit am</span><span fontStyle="italic">et, consectetur adipiscing elit.</span></p></TextFlow>';
var searchStr:String = "Lorem Ipsum d";
var indexes:Object = firstLast(sourceStr, searchStr);
trace(indexes.startIndex, indexes.finishIndex);
// Returns the indexes, within sourceStr, of the first and last characters of the match (-1 for both if not found).
function firstLast(sourceStr:String, searchStr:String):Object
{
    var inTag:Boolean = false;
    var searchPos:int = 0;
    var startIndex:int = -1;
    var finishIndex:int = -1;
    for (var i:int = 0; i < sourceStr.length; i++)
    {
        var sourceChar:String = sourceStr.charAt(i);
        // Skip everything from "<" up to and including the next ">".
        if (inTag)
        {
            if (sourceChar == ">") inTag = false;
            continue;
        }
        if (sourceChar == "<")
        {
            inTag = true;
            continue;
        }
        if (sourceChar == searchStr.charAt(searchPos))
        {
            if (searchPos == 0) startIndex = i;
            searchPos++;
            if (searchPos == searchStr.length)
            {
                finishIndex = i;
                break;
            }
        }
        else if (searchPos > 0)
        {
            // Partial match failed: restart one character after where it began.
            i = startIndex; // the loop's i++ then moves on to the next character
            searchPos = 0;
        }
    }
    if (finishIndex == -1) startIndex = -1;
    return { startIndex:startIndex, finishIndex:finishIndex };
}
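The same walk can also be done in two passes: first record, for every character of the tag-free text, where it sits in the markup, then search the plain text normally and translate the hit back. Below is a minimal sketch in Python rather than ActionScript, purely to illustrate the bookkeeping. The function names are made up for this example, and it assumes well-formed markup with no literal < or > in text content and no entities such as &amp; inside the selection.

```python
from typing import List, Optional, Tuple

SAMPLE = ('<TextFlow><p><span>Lorem Ip</span><span fontWeight="bold">sum do</span>'
          '<span>lor sit am</span><span fontStyle="italic">et, consectetur '
          'adipiscing elit.</span></p></TextFlow>')


def text_offset_map(markup: str) -> Tuple[str, List[int]]:
    """Return the tag-free text plus, for each of its characters, its index in markup."""
    chars: List[str] = []
    positions: List[int] = []
    in_tag = False
    for i, ch in enumerate(markup):
        if in_tag:
            if ch == ">":
                in_tag = False
        elif ch == "<":
            in_tag = True
        else:
            chars.append(ch)
            positions.append(i)
    return "".join(chars), positions


def find_in_markup(markup: str, needle: str) -> Optional[Tuple[int, int]]:
    """Markup indexes of the first and last characters of needle, or None."""
    if not needle:
        return None
    text, positions = text_offset_map(markup)
    hit = text.find(needle)
    if hit == -1:
        return None
    return positions[hit], positions[hit + len(needle) - 1]


span = find_in_markup(SAMPLE, "Ipsum dolor sit amet")
if span is not None:
    start, end = span
    print(start, end, SAMPLE[start], SAMPLE[end])
```

Because the search itself runs on the plain text, the usual substring search handles restarts after a failed partial match, and the position table only has to translate the two end points.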

Related

How to extract the href value from links in HTML data based on the element's text?

I have been tasked with the coding of a web crawler that goes through several URLs (around 400, but the list could grow), each with a completely different html structure and extract the links containing certain information. The only thing the program knows beforehand is what are the keywords it should search for, but the html structure and any semantic cues as to where to look for those keywords is unknown.
So far, I have used the request-promise module for Node.js to send a request to the URL where the search for keywords will take place:
const htmlResult = await request.get(url);
htmlResult stores the response as a string, and I can save it both as an .txt or .html if needed.
The problem I have is that I don't know how to instruct the program how to extract a URL based on words that aren't necessarily present in the url string. An example might help clarify:
<a href="site/with/no/keywords-just-a-random-string" title="Keywords might be here, but title attribute might be absent"><span class="img"><img data-cfsrc="/thumbpdf/618a8nb4.jpg" alt="" style="display:none;visibility:hidden;"><noscript><img src="/thumbpdf/8bfa84.jpg" alt=""></noscript></span>
<h2>KEYWORDS ARE IN THIS TAG, WHICH IN TURN IS INSIDE THE <a> TAG</h2>
<span class="date--type">2 Nov 2021 </span>
<span class="tag">
other stuff with no keywords in it</span>
</a>
As you can see, this tag has a complex structure. The keywords I need to parse are inside an h2 tag which, in turn, is inside the a tag. But the a tag could also be like this:
<a href="...">KEYWORDS TO PARSE</a>
Here the keywords are simply within the a tag.
My question, thus, is how do I parse htmlResult (either as a string or saved as a .txt/.html file) and, once I get a match, instruct the program to extract the URL of the a tag within which I got the keyword match?
As I am using Node.js, I am open to using any tool available.
Could someone offer some advice on how to tackle this challenge?
Thanks so much in advance.
This is very quick and dirty, and I'm sure it can be further streamlined, but it should get you at least closer to where you need to be.
This assumes a bunch of <div> elements, each containing one of your <a> elements, all in one document (see link below). It uses XPath to locate the data:
// Evaluate an XPath expression and return an ordered snapshot of the matching nodes.
function xpathEval(xpath, context) {
    return document.evaluate(xpath, context, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
}
const desiredHrefs = [];
const targets = xpathEval("//div[@class='container']", document);
for (let i = 0; i < targets.snapshotLength; i++) {
    const target = targets.snapshotItem(i);
    // All attribute nodes and all text nodes below this container.
    const attribs = xpathEval('.//*/@*', target);
    const texts = xpathEval('.//*/text()', target);
    for (let k = 0; k < attribs.snapshotLength; k++) {
        const attribData = attribs.snapshotItem(k).textContent;
        if (attribData.includes("trainer") && attribData.includes("dog")) {
            // either
            // console.log(target.querySelector('a').getAttribute('href'))
            // or
            const href = xpathEval('.//a/@href', target);
            desiredHrefs.push(href.snapshotItem(0).textContent);
        }
    }
    for (let j = 0; j < texts.snapshotLength; j++) {
        const data = texts.snapshotItem(j).nodeValue.trim().toLowerCase();
        if (data.includes("trainer") && data.includes("dog")) {
            // either
            // console.log(target.querySelector('a').getAttribute('href'))
            // or
            const href = xpathEval('.//a/@href', target);
            desiredHrefs.push(href.snapshotItem(0).textContent);
        }
    }
}
// Print each matching href once.
for (const href of new Set(desiredHrefs)) {
    console.log(href);
}
You can see it in action here.
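If the pages have no common wrapper element to anchor an XPath on, another option is to walk the parsed document and, for each a element, collect its title attribute and all text nested inside it before testing for the keywords. The question uses Node.js; the sketch below is in Python with the standard library's html.parser, only to illustrate the idea. The LinkFinder class, the find_links helper and the sample markup are invented for this example, and it assumes a tags are not nested inside one another.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collect the href of every <a> whose title or nested text has all keywords."""

    def __init__(self, keywords):
        super().__init__()
        self.keywords = [k.lower() for k in keywords]
        self.matches = []
        self._href = None   # href of the <a> currently open, if any
        self._chunks = []   # title and text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href") is not None:
            self._href = attrs["href"]
            self._chunks = [attrs.get("title") or ""]

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join(self._chunks).lower()
            if all(k in text for k in self.keywords):
                self.matches.append(self._href)
            self._href = None


def find_links(html_text, keywords):
    finder = LinkFinder(keywords)
    finder.feed(html_text)
    return finder.matches


page = ('<div><a href="/x/123"><h2>Best Dog Trainer</h2><span>2 Nov</span></a>'
        '<a href="/y">Cat sitter</a>'
        '<a href="/z" title="dog trainer">photo</a></div>')
print(find_links(page, ["dog", "trainer"]))
```

The same structure should carry over to any Node.js HTML parser that exposes start-tag, text and end-tag events or a DOM to traverse.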

How to better deal with regex capture group between HTML tags?

I'm trying to capture this content inside the html tag document in the string below. The result yields the desired match, but also a weird entry "t", the last letter before the close tag.
I'm pretty new to regex and I wonder what is going on? What should I read up about?
PS: If I remove the () brackets around the pattern, only 't' is captured. I'm not sure I can see what difference the brackets (i.e. defining a capture group) make in this case.
example = '''ABCDE<DOCUMENT>
Lorem ipsum
dolor sit amet</DOCUMENT>
EFGHIJK.'''
re.findall(r'(<DOCUMENT>(.|\s)*<\/DOCUMENT>)', example)
Outputs:
[('<DOCUMENT>\nLorem ipsum\ndolor sit amet</DOCUMENT>', 't')]
Try using the re.DOTALL flag instead of using \s to capture whitespaces:
re.findall(r'(<DOCUMENT>.*<\/DOCUMENT>)', example, flags = re.DOTALL)
Explaining the issue
re.findall documentation states that:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group
You have two capturing groups (defined by the parentheses) in your regex:
1) one over the whole pattern, defined by the first and last parentheses
2) one over the .|\s pattern
That's why the result is a list containing one tuple with two elements: <DOCUMENT>\nLorem ipsum\ndolor sit amet</DOCUMENT> and t.
Because the * sits outside the inner capturing group, that group is matched many times, once per character. A repeated group keeps only its last match, which here is the final t of "amet" in the input string, so findall returns t as the value of that group.
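To see the effect of each set of parentheses in isolation, here is a small sketch that runs the question's pattern three ways: with both capturing groups, with the inner group made non-capturing via (?:...), and with no capturing groups at all. The variable names are just for illustration.

```python
import re

example = "ABCDE<DOCUMENT>\nLorem ipsum\ndolor sit amet</DOCUMENT>\nEFGHIJK."

# Two capturing groups: findall returns one tuple per match, and the
# repeated inner group keeps only what it matched on its final iteration.
two_groups = re.findall(r"(<DOCUMENT>(.|\s)*</DOCUMENT>)", example)

# (?:...) groups without capturing, so only the outer group is reported.
one_group = re.findall(r"(<DOCUMENT>(?:.|\s)*</DOCUMENT>)", example)

# No capturing groups: findall returns the whole match as a plain string.
no_group = re.findall(r"<DOCUMENT>(?:.|\s)*</DOCUMENT>", example)

print(two_groups)
print(one_group)
print(no_group)
```

With a single capturing group or none, findall returns a flat list of strings instead of a list of tuples.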
Here, we can use this expression with the s (DOTALL) flag:
<DOCUMENT>(.*?)<\/DOCUMENT>
or any of these expressions, which match newlines without needing any flag:
<DOCUMENT>([\s\S]*?)<\/DOCUMENT>
<DOCUMENT>([\d\D]*?)<\/DOCUMENT>
<DOCUMENT>([\w\W]*?)<\/DOCUMENT>
and our problem would likely be solved.
Please see this demo for explanation.
Test
import re
regex = r"<DOCUMENT>([\s\S]*?)<\/DOCUMENT>"
test_str = ("ABCDE<DOCUMENT>\n"
            "Lorem ipsum\n\n\n\n"
            "dolor sit amet</DOCUMENT>\n"
            "EFGHIJK.")
# [\s\S] already matches newlines, so no flag is needed here.
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    # Groups are numbered from 1; group 0 is the whole match.
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))

Vertical html table without repeating th tags

I'm generating a table using xslt, but for this question I'll keep that side out of it, as it relates more to the actual generated structure of a html table.
What I do is make a vertical table as follows, which suits the layout needed for the data concerned that originated in a spreadsheet. Example is contrived for brevity, actual data fields contain lengthy strings and many more fields.
Title: something or rather bla bla
Description: very long desription
Field1: asdfasdfasdfsdfsd
Field2: asdfasfasdfasdfsdfjasdlfksdjaflk
Title: another title
Description: another description
Field1:
Field2: my previous field was blank but this one is not, anyways
etc.
The only way so far I found to generate such a html table is using repeating tags for every field and every record e.g.:
<tr><th>Title</th><td>something or rather bla bla</td></tr>
<tr><th>Description</th><td>very long desription</td></tr>
...
<tr><th>Title</th><td>another title</td></tr>
<tr><th>Description</th><td>another description</td></tr>
...
Of course this is semantically incorrect but produces correct visual layout. I need it to be semantically correct html, as that's the only sane way of later attaching a filtering javascript facility.
The following is semantically correct, but it produces an extremely wide table with a single set of field headers on the left:
<tr><th>Title</th><td>something or rather bla bla</td><td>another title</td></tr>
<tr><th>Description</th><td>very long desription</td><td>another description</td></tr>
...
So to summarise, I need an HTML table (or other HTML structure) in which records appear one under another (visually) with repeating field headers, but the field headers must not be repeated in the actual code, because that would wreck any record-based filtering added later on.
Yo. Thanks for updating your question, and including some code. Typically you'd also post what you've tried to correct this issue - but I'm satisfied enough with this post.
Since you want the repeating headers in vertical layout (not something I've seen often, but I can understand the desire), you don't have to modify the HTML formatting, just use a bit more JavaScript to figure it out. I haven't gone through and checked to see if I'm doing things efficiently (I'm probably not, since there are so many loops), but in my testing the following can attach to a vertical table and filter using a couple variables to indicate how many rows there are in each entry.
Firstly, here's the HTML I'm testing this one with. Notice I have a div with the id of filters, and each of my filter inputs has a custom attribute named filter that matches the header of the rows they are supposed to filter:
<div id='filters'>
Title: <input filter='Title'><br>
Desc: <input filter='Description'>
</div>
<table>
<tr><th>Title</th><td>abcd</td></tr>
<tr><th>Description</th><td>efgh</td></tr>
<tr><th>Title</th><td>ijkl</td></tr>
<tr><th>Description</th><td>mnop</td></tr>
<tr><th>Title</th><td>ijkl</td></tr>
<tr><th>Description</th><td>mdep</td></tr>
<tr><th>Title</th><td>ijkl</td></tr>
<tr><th>Description</th><td>mnop</td></tr>
<tr><th>Title</th><td>ijkl</td></tr>
<tr><th>Description</th><td>mnop</td></tr>
</table>
Here are the variables I use at the start:
var filterTable = $('table');
var rowsPerEntry = 2;
var totalEntries = filterTable.find('tbody tr').length / rowsPerEntry; // .length instead of .size(), which was removed in jQuery 3.0
var currentEntryNumber = 1;
var currentRowInEntry = 0;
And this little loop will add a class for each entry (based on the rowsPerEntry as seen above) to group the rows together (this way all rows for an entry can be selected together with a class selector in jQuery):
filterTable.find('tbody tr').each(function(){
    $(this).addClass('entry' + currentEntryNumber);
    currentRowInEntry += 1;
    if(currentRowInEntry == rowsPerEntry){
        currentRowInEntry = 0;
        currentEntryNumber += 1;
    }
});
And the magic; on keyup for the filters run a loop through the total number of entries, then a nested loop through the filters to determine if that entry does not match either filter's input. If either field for the entry does not match the corresponding filter value, then we add the entry number to our hide array and move along. Once we've determined which entries should be hidden, we can show all of the entries, and hide the specific ones that should be hidden:
$('#filters input').keyup(function(){
    var hide = [];
    for(var i = 0; i < totalEntries; i++){
        var entryNumber = i + 1;
        if($.inArray(entryNumber, hide) == -1){
            $('#filters input').each(function(){
                var val = $(this).val().toLowerCase();
                var fHeader = $(this).attr('filter');
                var fRow = $('.entry' + entryNumber + ' th:contains(' + fHeader + ')').closest('tr');
                if(fRow.find('td').text().toLowerCase().indexOf(val) == -1){
                    hide.push(entryNumber);
                    return false;
                }
            });
        }
    }
    filterTable.find('tbody tr').show();
    $.each(hide, function(k, v){
        filterTable.find('.entry' + v).hide();
    });
});
It's no masterpiece, but I hope it'll get you started down the right path.
Here's a fiddle too: https://jsfiddle.net/bzjyfejc/

highlight words in html using regex & javascript - almost there

I am writing a jquery plugin that will do a browser-style find-on-page search. I need to improve the search, but don't want to get into parsing the html quite yet.
At the moment my approach is to take an entire DOM element and all nested elements and simply run a regex find/replace for a given term. In the replace I will simply wrap a span around the matched term and use that span as my anchor to do highlighting, scrolling, etc. It is vital that no characters inside any html tags are matched.
This is as close as I have gotten:
(?<=^|>)([^><].*?)(?=<|$)
It does a very good job of capturing all characters that are not in an html tag, but I'm having trouble figuring out how to insert my search term.
Input: Any html element (this could be quite large, eg <body>)
Search Term: 1 or more characters
Replace Txt: <span class='highlight'>$1</span>
UPDATE
The following regex does what I want when I'm testing with http://gskinner.com/RegExr/...
Regex: (?<=^|>)(.*?)(SEARCH_STRING)(?=.*?<|$)
Replacement: $1<span class='highlight'>$2</span>
However, I am having some trouble using it in my JavaScript. With the following code, Chrome is giving me the error "Invalid regular expression: /(?<=^|>)(.*?)(Mary)(?=.*?<|$)/: Invalid group".
var origText = $('#'+opt.targetElements).data('origText');
var regx = new RegExp("(?<=^|>)(.*?)(" + $this.val() + ")(?=.*?<|$)", 'gi');
$('#'+opt.targetElements).each(function() {
var text = origText.replace(regx, '$1<span class="' + opt.resultClass + '">$2</span>');
$(this).html(text);
});
It's breaking on the group (?<=^|>) - is this something clumsy or a difference in the Regex engines?
UPDATE
The reason this regex is breaking on that group is that JavaScript did not support regex lookbehinds at the time (they were only added to the language later, in ES2018). For reference & possible solutions: http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript.
Just use jQuery's built-in text() method. It will return all the characters in a selected DOM element.
For the DOM approach (docs for the Node interface): run over all child nodes of an element. If the child is an element node, recurse into it. If it's a text node, search in its text (node.data), and if you want to highlight or change something, shorten the node's text to end at the found position, then insert a highlight span with the matched text and another text node for the rest of the text.
Example code (adjusted, origin is here):
(function iterate_node(node) {
    if (node.nodeType === 3) { // Node.TEXT_NODE
        var text = node.data,
            pos = text.search(/any regular expression/g), //indexOf also applicable
            length = 5; // or whatever you found
        if (pos > -1) {
            node.data = text.substr(0, pos); // split into a part before...
            var rest = document.createTextNode(text.substr(pos+length)); // a part after
            var highlight = document.createElement("span"); // and a part between
            highlight.className = "highlight";
            highlight.appendChild(document.createTextNode(text.substr(pos, length)));
            node.parentNode.insertBefore(rest, node.nextSibling); // insert after
            node.parentNode.insertBefore(highlight, node.nextSibling);
            iterate_node(rest); // maybe there are more matches
        }
    } else if (node.nodeType === 1) { // Node.ELEMENT_NODE
        for (var i = 0; i < node.childNodes.length; i++) {
            iterate_node(node.childNodes[i]); // run recursive on DOM
        }
    }
})(content); // any dom node
There's also highlight.js, which might be exactly what you want.
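If you want to stay at the string level, one way to keep tags untouched without any lookbehind is to split the markup on tags and run the replacement only on the text pieces in between. The sketch below shows that idea in Python rather than JavaScript, purely as an illustration. The highlight function and its parameters are invented for this example. It assumes no attribute value contains a literal >, and it will not find a term that is split across several tags.

```python
import re

# The capturing group makes split() keep the tags in the result list.
TAG = re.compile(r"(<[^>]*>)")


def highlight(markup, term, css_class="highlight"):
    """Wrap each occurrence of term that lies outside tags in a span."""
    if not term:
        return markup
    needle = re.compile(re.escape(term), re.IGNORECASE)
    pieces = TAG.split(markup)
    # With one capturing group the list alternates: text, tag, text, tag, ...
    for i in range(0, len(pieces), 2):
        pieces[i] = needle.sub(
            lambda m: '<span class="{}">{}</span>'.format(css_class, m.group(0)),
            pieces[i],
        )
    return "".join(pieces)


print(highlight('<p class="mary">Mary had a little lamb</p>', "mary"))
```

The search term is escaped before it is compiled, so characters such as . or ( typed by the user are treated literally instead of as regex syntax.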

JSFL: convert text from a textfield to a HTML-format string

I've got a deceptively simple question: how can I get the text from a text field AND include the formatting? Going through the usual docs I found out it is possible to get the text only. It is also possible to get the text formatting, but this only works if the entire text field uses only one kind of formatting. I need the precise formatting so that I convert it to a string with html-tags.
Personally I need this so I can pass it to a custom-made text field component that uses HTML for formatting. But it could also be used to simply export the contents of any text field to any other format. This could be of interest to others out there, too.
Looking for a solution elsewhere I found this:
http://labs.thesedays.com/blog/2010/03/18/jsfl-rich-text/
Which seems to do the reverse of what I need: it converts HTML to Flash text. My own attempts to reverse this have not been successful thus far. Maybe someone else sees an easy way to reverse this that I'm missing? There might also be other solutions. One might be to get the EXACT data of the text field, which should include formatting tags of some kind (XML, when looking into the contents of the stored FLA file), and then remove or convert those tags. But I have no idea how to do this, if it is possible at all. Another option is to cycle through every character using startIndex and endIndex, storing each kind of formatting in an array. Then I could apply the formatting to each character. But this will result in excess tags, especially for hyperlinks! So can anybody help me with this?
A bit late to the party, but the following function takes a JSFL static text element as input and returns an HTML string (using the Flash-friendly <font> tag) based on the styles found in its textRuns array. It does a bit of basic regex work to clean up double spaces and the like, and to convert \r and \n to <br/> tags. It's probably not perfect, but hopefully you can see what's going on easily enough to change or fix it.
function tfToHTML(p_tf)
{
    var textRuns = p_tf.textRuns;
    var html = "";
    for ( var i=0; i<textRuns.length; i++ )
    {
        var textRun = textRuns[i];
        var chars = textRun.characters;
        // Convert line breaks to <br/> tags.
        chars = chars.replace(/\n/g,"<br/>");
        chars = chars.replace(/\r/g,"<br/>");
        // Collapse double spaces into single spaces.
        chars = chars.replace(/  /g," ");
        // Drop the space between a full stop and a line break (the dot is escaped so it only matches a literal ".").
        chars = chars.replace(/\. <br\/>/g,".<br/>");
        var attrs = textRun.textAttrs;
        var font = attrs.face;
        var size = attrs.size;
        var bold = attrs.bold;
        var italic = attrs.italic;
        var colour = attrs.fillColor;
        if ( bold )
        {
            chars = "<b>"+chars+"</b>";
        }
        if ( italic )
        {
            chars = "<i>"+chars+"</i>";
        }
        // Wrap every run in a <font> tag carrying its size, face and colour.
        chars = "<font size=\""+size+"\" face=\""+font+"\" color=\""+colour+"\">"+chars+"</font>";
        html += chars;
    }
    return html;
}