How to avoid html blocks with regex - html

I have to find all the strings surrounded by "[" and "]" using regex, but avoiding the ones inside the <table></table> block, for example:
<html>
<body>
<p><table>
<tbody>
<tr>
<td style="border-style: solid; border-width:1px;">
<span style="font-family: Courier;">[data1]</span>
</td>
<td style="border-style: solid; border-width:1px;">
<span style="font-family: Courier;">[data10]</span>
</td>
</tr>
</tbody>
</table>
</p>
<p>[data3] [data4] [data5]</p>
</body>
</html>
in this case only [data3], [data4] and [data5] should be found.
So far I have this:
#"(((?<!<span>)(\[[a-zA-Z_0-9]+)](?!<\/span>))|((?<!<span>)(\[[a-zA-Z_0-9]+)])|((\[[a-zA-Z_0-9]+)](?!<\/span>)))(?!.*\1)"
That finds all the [] blocks that are not surrounded by tags. I tried adding a negative lookahead and lookbehind for the <table> tag, but it doesn't work: it still gets the ones inside the table block.
Hope you guys can help me with this.

The regex below matches a <p>...</p> element whose content starts with "[" and ends with "]":
/<p.*?>\[(.*?)\]<\/p>/g
For the HTML above, it returns <p>[data3] [data4] [data5]</p>.
Once you have that string, run the regex below on it to extract each [data] token:
/\[(.*?)\]/g
This returns [data3], [data4] and [data5] from that string.
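A different way to reach the same result, sketched here in Python rather than with the answer's regex: first delete every <table>...</table> block, then collect the bracketed tokens from what is left. The function name and the shortened sample HTML are made up for illustration, and the sketch assumes tables are not nested (a non-greedy match would stop at the first closing tag).

```python
import re

# Shortened version of the HTML from the question (illustrative only).
HTML = """<html>
<body>
<p><table>
<tbody>
<tr>
<td><span style="font-family: Courier;">[data1]</span></td>
<td><span style="font-family: Courier;">[data10]</span></td>
</tr>
</tbody>
</table>
</p>
<p>[data3] [data4] [data5]</p>
</body>
</html>"""


def find_tokens_outside_tables(html):
    """Return every [token] that is not inside a <table>...</table> block."""
    # Step 1: drop each table block. DOTALL lets "." span line breaks;
    # the non-greedy ".*?" stops at the first closing tag (no nested tables).
    without_tables = re.sub(
        r"<table\b.*?</table\s*>", "", html, flags=re.DOTALL | re.IGNORECASE
    )
    # Step 2: collect the bracketed tokens that remain.
    return re.findall(r"\[[A-Za-z_0-9]+\]", without_tables)


print(find_tokens_outside_tables(HTML))  # ['[data3]', '[data4]', '[data5]']
```

For anything beyond simple, predictable markup, a real HTML parser is likely to be more reliable than either regex approach.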

Related

HTML editor removes FreeMarker tags inside <table> tag

I use FreeMarker to build various templates, from email to invoice templates. The issue I am facing right now is that FreeMarker code gets moved outside the table tag, since HTML doesn't allow any content directly inside it other than elements such as tbody, thead and tr.
I would be glad if anyone has an idea how to work around this.
Example:
<table>
<tbody>
[#assign eventDetails = []]
[#if items?? && items?has_content]
[#list items as item]
<tr>
<td style="padding: 5px;vertical-align: top;border-bottom: 1px solid #eee; text-align: center;">
${item.name}
</td>
</tr>
[/#list]
[/#if]
</tbody>
</table>
This becomes the following after being applied to the editor using element.innerHTML:
[#assign eventDetails = []]
[#if items?? && items?has_content]
[#list items as item]
[/#list]
[/#if]
<table>
<tbody>
<tr>
<td style="padding: 5px;vertical-align: top;border-bottom: 1px solid #eee; text-align: center;">
${item.name}
</td>
</tr>
</tbody>
</table>
Of course, it depends on the HTML editor you are using, but you should check whether the same thing happens if you do either of the following:
Use the <#...> syntax instead. To the editor that looks like some unknown tag, not character data, so maybe it doesn't react the same way.
Replace <tbody>...</tbody> with [@html.tbody]...[/@html.tbody]. Then the editor might not be confident enough to remove stuff. Or do the same with [@html.table], etc. Quite awkward, but it might be better than what you have now. (For that to actually work when running the template, you will of course have to define the tbody macro in the html namespace. It's not built in.)
Example of the last:
<#ftl output_format='HTML'>
<#macro table attrs...><@elementWithNested 'table' attrs><#nested></@></#macro>
<#macro tbody attrs...><@elementWithNested 'tbody' attrs><#nested></@></#macro>
<#macro elementWithNested elementName attrs>
<${elementName}<#if attrs?size != 0><#list attrs as k, v> ${k}="${v}"</#list></#if>>
<#nested>
</${elementName}>
</#macro>
If the above template is #import-ed as html, then <@html.table foo="bar">...</@html.table> etc. will work.

Generate HTML file with both quote ' and " from R

I need to create an HTML file from R. The problem is that the generated string has to contain both single quotes (for the JavaScript) and double quotes (for the styles).
The cat() function prints the text correctly, without the backslashes in front of ". But I did not find how to write it like this to an HTML file using write.table(Text, "index.html", sep="\t").
Thanks in advance for any help.
NB: I removed a "<" character in front of /script in order to be able to post it =)
For example:
Text=paste0('<html>
<script type="text/javascript">',
"function lang1(event) {
var iframe = document.getElementById('id1');
var target = event.target || event.srcElement;
iframe.src = event.target.innerHTML + '.html';
}
/script>",
'<body style="overflow:hidden; margin:0">
<div id="main">
<div id="content">
<table style="border: 0; height:100%;width:100%;">
<tr style="height:5%;">
<td colspan="2" style="text-align:center;">
<h2>',paste0("some text"),'</h2>
</td>
</tr>
<tr>
<td style="width: 10%;font-size:14px;">
<ul onclick="lang1(event);">',
paste('<li>',c("link1","link2"),'</li>',collapse=""),
'</ul>
</td>
<td style="width: 90%;">
<iframe id="id1" width="99%" height="99%"></iframe>
</td>
</tr>
</table>
</div>
</div>
</body>
</html>')
Text=gsub("\n","",Text)
I'm not sure I fully understand the question, but whenever I generate HTML files from text in R, I use the \ character to escape the quote marks that are needed in the JavaScript.
Then I open a file connection and use the writeLines function to write my text to the file correctly:
Text<-"<!doctype html>
<html lang=\"en\">
<head>
<meta charset=\"utf-8\">
<style>
body {
font-size : 16px;
font-family: \"Helvetica Neue\",Helvetica,Arial,sans-serif;
}
</style>
</head>
</html>
"
fileConn<-file("mywebpage.html")
writeLines(Text, fileConn)
close(fileConn)
Maybe that will help you.

Find specific element position in XPath after checking a condition

I have the following html I am working with: (a chunk of it here)
<table class="detailTable">
<tbody>
<tr>
<td class="detailTitle" align="top">
<h3>Credit Limit:</h3>
<h3>Current Balance:</h3>
<h3>Pending Balance:</h3>
<h3>Available Credit:</h3>
</td>
<td align="top">
<p>$677.77</p>
<p>$7.77</p>
<p>$7.77</p>
<p>$677.77</p>
</td>
<td class="detailTitle">
<h3>Last Statement Date:</h3>
<h4>Payment Address</h4>
</td>
<td>
<p> 05/19/2015 </p>
<p class="attribution">
</td>
</tr>
</tbody>
</table>
I need to first check if "Statement Date" exists, and then find its position. Then I need to get its value, which is in the corresponding <p> tag. I need to do this using XPath. Any suggestions?
So far I tried using //table[@class='detailTable'][1]//td[2]//p[position(td[contains(.,'Statement Date')])] but it doesn't work.
This is one possible way (formatted for readability):
//table[@class='detailTable']
//tr
/td[*[contains(.,'Statement Date')]]
/following-sibling::td[1]
/*[position()
=
count(
parent::td
/preceding-sibling::td[1]
/*[contains(.,'Statement Date')]/preceding-sibling::*
)+1
]
Explanation:
..../td[*[contains(.,'Statement Date')]] : from the beginning up to this part, the XPath finds the td element in which at least one child contains the text "Statement Date"
/following-sibling::td[1] : from the previously matched td, navigate to the nearest following sibling td ...
/*[position() = count(parent::td/preceding-sibling::td[1]/*[contains(.,'Statement Date')]/preceding-sibling::*)+1] : ...and return the child element whose position equals the position of the element that contains the text "Statement Date" in the previous td. Notice that count(preceding-sibling::*)+1 is used here to get the position index of the element containing "Statement Date".
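The positional matching described above can also be written out step by step. The following is only a sketch in Python using the standard library's xml.etree.ElementTree, not XPath itself; value_for_label is a hypothetical helper name, and the snippet is a well-formed copy of the question's table (the unclosed <p class="attribution"> is closed here so that an XML parser accepts it).

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the table from the question (assumption: the
# dangling <p class="attribution"> is closed so it parses as XML).
SNIPPET = """
<table class="detailTable">
  <tbody>
    <tr>
      <td class="detailTitle" align="top">
        <h3>Credit Limit:</h3>
        <h3>Current Balance:</h3>
      </td>
      <td align="top">
        <p>$677.77</p>
        <p>$7.77</p>
      </td>
      <td class="detailTitle">
        <h3>Last Statement Date:</h3>
        <h4>Payment Address</h4>
      </td>
      <td>
        <p> 05/19/2015 </p>
        <p class="attribution"></p>
      </td>
    </tr>
  </tbody>
</table>
"""


def value_for_label(table, label):
    """Return the text paired with `label`, or None if the label is absent."""
    for row in table.iter("tr"):
        cells = row.findall("td")
        # Look at every cell that has a following sibling cell.
        for cell_index, cell in enumerate(cells[:-1]):
            for child_index, child in enumerate(list(cell)):
                if label in "".join(child.itertext()):
                    # Same child position, but in the next <td>.
                    values = list(cells[cell_index + 1])
                    if child_index < len(values):
                        return (values[child_index].text or "").strip()
    return None


table = ET.fromstring(SNIPPET.strip())
print(value_for_label(table, "Statement Date"))  # 05/19/2015
```

The child index plays the role of count(preceding-sibling::*)+1 in the XPath, except that it is zero-based.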
You can do it this way:
//table[@class='detailTable'][1]//td[@class="detailTitle" and contains(./h3, 'Statement Date')]/following-sibling::td[1]/p[1]/text()
This will find the <td> that contains the Statement Date heading, and get the <td> immediately after it. Then it gets the text content of the first p in that <td>.

JSTL: I want to generate white spaces

I have a jsp that uses JSTL to manage the value of the beans and I need to align two elements without touching the rest of the page (that was not written by me)
I have two rows like this:
AAAAAAAAAAAAAAAA element one
element two
If "AAAAAAAAAAAAAAAA" (a variable) has a value, I want to align "element two" with "element one" (which is preceded by AAA..., so it sits further to the right than the second one).
So I want to obtain this (only if AAAA.. has a value):
AAAAAAAAAAAAAAAA element one
                 element two
The relevant part is:
<c:out value='${variableA}'/> <c:out value='${elementOne}'/>
<c:if test="${variableA != ''}">
--I need to insert spaces here--
</c:if>
<c:out value='${elementTwo}'/>
I tried inserting many &nbsp; entities, but they get inserted even when the "if" is false. I tried with div and p but without success, and I tried setting a JSTL variable to many "&nbsp;" but they get trimmed when converted into HTML.
Can someone help me putting some spaces over there? :(
Thank you very much :)
Use ${" "} to get a real space, not &nbsp;.
You need to change the <c:if> condition to an empty check, as below:
<c:out value='${variableA}'/> <c:out value='${elementOne}'/> <br>
<c:if test="${not empty variableA}">
</c:if>
<c:out value='${elementTwo}'/>
To keep the alignment, it is better to use a table instead of adding spaces:
<table>
<tr>
<td> <c:out value='${variableA}'/> </td>
<td> <c:out value='${elementOne}'/> </td>
</tr>
<tr>
<td> <c:out value='${variableB}'/> </td>
<td> <c:out value='${elementTwo}'/> </td>
</tr>
</table>
<style>
.left-div{
float: left;
width: 200px;
display: block;
}
.main-div{
width: 400px;
}
</style>
<div class="main-div">
<div class="left-div">AAAAAAAAAAA</div><div class="left-div">EL 1</div>
<div class="left-div"> </div><div class="left-div">EL 2</div>
</div>

How to get line from table with Jsoup

I have table without any class or id (there are more tables on the page) with this structure:
<table cellpadding="2" cellspacing="2" width="100%">
...
<tr>
<td class="cell_c">...</td>
<td class="cell_c">...</td>
<td class="cell_c">...</td>
<td class="cell">SOME_ID</td>
<td class="cell_c">...</td>
</tr>
...
</table>
I want to get only one row, which contains <td class="cell">SOME_ID</td> and SOME_ID is an argument.
UPD.
Currently I am doing it this way:
doc = Jsoup.connect("http://www.bank.gov.ua/control/uk/curmetal/detail/currency?period=daily").get();
Elements rows = doc.select("table tr");
Pattern p = Pattern.compile("^.*(USD|EUR|RUB).*$");
for (Element trow : rows) {
Matcher m = p.matcher(trow.text());
if(m.find()){
System.out.println(m.group());
}
}
But why do I need Jsoup if most of the work is done by the regexp? Just to download the HTML?
If you have a generic HTML structure that always is the same, and you want a specific element which has no unique ID or identifier attribute that you can use, you can use the css selector syntax in Jsoup to specify where in the DOM-tree the element you are after is located.
Consider this HTML source:
<html>
<head></head>
<body>
<table cellpadding="2" cellspacing="2" width="100%">
<tbody>
<tr>
<td class="cell">I don't want this one...</td>
<td class="cell">Neither do I want this one...</td>
<td class="cell">Still not the right one..</td>
<td class="cell">BINGO!</td>
<td class="cell">Nothing further...</td>
</tr> ...
</tbody>
</table>
</body>
</html>
We want to select and parse the text from the fourth <td> element.
We specify that we want the <td> element with index 3 (counting from zero) among its siblings by using td:eq(3). In the same way, we can select all <td> elements before index 3 by using td:lt(3). As you've probably figured out, eq stands for "equal" and lt for "less than".
Without using first() you will get an Elements object, but we only want the first one so we specify that. We could use get(0) instead too.
So, the following code
Element e = doc.select("td:eq(3)").first();
System.out.println("Did I find it? " + e.text());
will output
Did I find it? BINGO!
Some good reading in the Jsoup cookbook!
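Going back to the original question (get the one row whose <td class="cell"> equals a given ID), the matching can be done on parsed cells rather than on the row's flattened text. The following is a sketch in Python with the standard library's html.parser, not Jsoup; RowFinder, find_rows and the sample rows are invented for illustration.

```python
from html.parser import HTMLParser


class RowFinder(HTMLParser):
    """Collect the cell texts of each <tr> that has <td class="cell"> == target."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.rows = []           # matching rows, each a list of cell texts
        self._cells = None       # cells of the current row, None outside <tr>
        self._in_cell = False
        self._cell_class = None
        self._text = []
        self._matched = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._matched = [], False
        elif tag == "td" and self._cells is not None:
            self._in_cell = True
            self._cell_class = dict(attrs).get("class")
            self._text = []

    def handle_data(self, data):
        if self._in_cell:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            text = "".join(self._text).strip()
            self._cells.append(text)
            # Only the cell with the exact class "cell" is compared to the ID.
            if self._cell_class == "cell" and text == self.target:
                self._matched = True
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            if self._matched:
                self.rows.append(self._cells)
            self._cells = None


def find_rows(html, target):
    finder = RowFinder(target)
    finder.feed(html)
    finder.close()
    return finder.rows


# Invented sample data in the shape described in the question.
SAMPLE = """
<table cellpadding="2" cellspacing="2" width="100%">
<tr><td class="cell_c">1</td><td class="cell">EUR</td><td class="cell_c">Euro</td></tr>
<tr><td class="cell_c">2</td><td class="cell">USD</td><td class="cell_c">US dollar</td></tr>
</table>
"""

print(find_rows(SAMPLE, "USD"))  # [['2', 'USD', 'US dollar']]
```

Comparing the cell text exactly avoids false hits that a pattern such as ^.*(USD|EUR|RUB).*$ could produce when the same letters appear in another column.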