PowerShell: modifying HTML from ConvertTo-Html

I have a script that generates an array of objects that I want to email out in HTML format. That part works fine. I am trying to modify the HTML string to make certain rows a different font color.
Part of the html string looks like this (2 rows only):
<tr>
<td>ABL - Branch5206 Daily OD Report</td>
<td>'\\CTB052\Shared_Files\FIS-BIC Reporting\Report Output Files\ABL\Operations\Daily\ABL - Branch5206 Daily OD Report.pdf'</td>
<td>13124</td>
<td>4/23/2013 8:05:34 AM</td>
<td>29134</td>
<td>0</td>
<td>Delivered</td>
</tr>
<tr>
<td>ABL - Branch5206 Daily OD Report</td>
<td>'\\CTB052\Shared_Files\FIS-BIC Reporting\Report Output Files\ABL\Operations\Daily\ABL - Branch5206 Daily OD Report.xls'</td>
<td>15716</td>
<td>4/23/2013 8:05:34 AM</td>
<td>29134</td>
<td>0</td>
<td>Delivered</td>
</tr>
I tried regex to add a font color to the beginning and end of the rows where the row ends with "Delivered":
$email = [regex]::Replace($email, "<tr><td>(.*?)Delivered</td></tr>", '<tr><font color = green><td>$1Delivered</td></font></tr>')
This didn't work (I am not sure if you can set font color for a whole row like that).
Any ideas on how to do this easily/efficiently? I have to do it on several different statuses (like Delivered)

Disclaimer: HTML cannot be parsed reliably with regular expressions, so a regex will NOT provide a general solution to this problem. If your HTML structure is well known and you don't have any other <tr></tr> elements, though, the following might work. On that note, is there some reason you can't modify the HTML generation to do this, instead of waiting until the HTML has already been generated?
Try this command:
PS > $email = $email -replace '(?s)<tr>(.*?)<td>Delivered</td>(.*?)</tr>','<tr style="color: #FF0000">$1<td>Delivered</td>$2</tr>'
The first string is the pattern. The (?s) tells the parser to allow . to accept newlines; this is called "single line" mode. Then it grabs a <tr> element that contains the string <td>Delivered</td>. The two capture groups grab everything else in the <tr> element around the <td>Delivered</td> string. Take note of the question marks following the *s. * by itself is greedy and matches as much text as possible; *? matches as little text as possible. If we just used * here, it would treat your entire string as one match and only replace the first <tr>. One caveat: a lazy group can still run past a </tr>. If a match starts at the <tr> of a row that is not Delivered, it extends to the Delivered cell of a later row, so with mixed statuses the style can land on the wrong row.
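The greedy versus lazy difference is easy to see on a tiny input. This is a minimal sketch in Python rather than PowerShell; the two-row string is made up for illustration, and Python's re module treats * and *? the same way for this purpose.

```python
import re

# Two rows on one line, so the difference shows without needing (?s).
html = "<tr><td>a</td></tr><tr><td>b</td></tr>"

# Greedy: .* runs to the LAST </tr>, so both rows collapse into one match.
greedy = re.findall(r"<tr>(.*)</tr>", html)

# Lazy: .*? stops at the FIRST </tr> it can reach, giving one match per row.
lazy = re.findall(r"<tr>(.*?)</tr>", html)

print(greedy)
print(lazy)
```

The greedy list has a single element spanning both rows; the lazy list has one element per row.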
The second string is the replacement. It puts the <tr> element and its contents back in place with an added style attribute; $1 and $2 reinsert the text captured by the two groups.
One other minor note is the quoting. I tend toward single quotes anyway, but in this case, you're likely to have double quotes in the replacement string. So single quotes are probably the way to go.
As for how you could do this for different statuses, regular expressions really aren't designed for conditional content like that; it's like trying to use a screwdriver as a drill. You can hard code several replaces or loop over status/color pairs and build your pattern and replace strings from them. A full blown HTML parser would be more efficient if you can find one for .NET; you might try to get away with an XML parser if you can guarantee it's valid XML. Or, going back to my question at the beginning, you could modify the HTML generation. If your e-mails are few in number, though, this may not be a bottleneck worth addressing. Development time spent is also costly. See if it's fast enough and try a different route if not.
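Looping over status/color pairs could look roughly like the sketch below. It is written in Python for illustration, not PowerShell; the STATUS_COLOURS mapping and the colour_rows name are hypothetical. It also differs from the one-line pattern above in one respect: each capture uses (?:(?!</tr>).)*? so that it cannot cross a </tr>, which keeps a row without the status from being swept into a later row's match.

```python
import re

# Hypothetical status -> colour mapping; statuses and colours are assumptions.
STATUS_COLOURS = {"Delivered": "green", "Failed": "red"}


def colour_rows(html, colours):
    """Add a style attribute to every plain <tr> whose cells include a known status."""
    for status, colour in colours.items():
        cell = "<td>" + re.escape(status) + "</td>"
        # (?s) lets . match newlines; (?:(?!</tr>).)*? stays inside one row.
        pattern = (
            r"(?s)<tr>((?:(?!</tr>).)*?" + cell + r"(?:(?!</tr>).)*?)</tr>"
        )
        # Bind colour as a default argument so each pass uses its own value.
        html = re.sub(
            pattern,
            lambda m, c=colour: '<tr style="color: %s">%s</tr>' % (c, m.group(1)),
            html,
        )
    return html


sample = (
    "<tr>\n<td>A</td>\n<td>Failed</td>\n</tr>\n"
    "<tr>\n<td>B</td>\n<td>Delivered</td>\n</tr>"
)
result = colour_rows(sample, STATUS_COLOURS)
print(result)
```

Rows that were already styled in an earlier pass no longer start with a bare <tr>, so a later pass leaves them alone.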
Credit where it's due: I took the HTML style attribute from @FrankieTheKneeMan.

Related

Web scraping without id VBA

I'm trying to scrape a web page. Some elements were easy to get, but I have a problem with those that have no id, like this one:
<TABLE class=DisplayMain1 cellSpacing=1 cellPadding=0><TBODY>
<TR class=TitleLabelBig1>
<TD class=Title1 colSpan=100><SPAN style="FONT-FAMILY: arial narrow; FONT-WEIGHT: normal">Tool & </SPAN><BR>PE311934-1-1 </TD></TR></TBODY></TABLE>
I want this value: PE311934-1-1
I tried document.getElementsByClassName, but VBA gave me an error.
Any tips?
Use Regular Expressions and the XMLHttpRequest object in VBA
I made an add-in some time ago that does just that:
http://www.analystcave.com/excel-tools/excel-scrape-html-add/
If you just want the source code then here (GetElementByRegex function):
http://www.analystcave.com/excel-scrape-html-element-id/
Now the actual regex will be quite simple:
</SPAN><BR>(.*?)</TD></TR></TBODY></TABLE>
If it captures too many items, simply make the regex more specific.
You don't specify the error and there is not enough HTML to know how many elements there are on the page.
You may have forgotten to use an index with document.getElementsByClassName("Title1"), as it returns a collection.
For example, the first item would be: document.getElementsByClassName("Title1")(0)
In the same way, you could use a CSS querySelector such as .Title1
Which says the same thing i.e. select the elements with ClassName "Title1".
For the first instance simply use:
document.querySelector(".Title1")
For a nodeList of all matching elements, use:
document.querySelectorAll(".Title1")
and then iterate over its length.
You would access the .innerText property of the element, generally, to retrieve the required string.
For the snippet shown, assuming the item is the first .Title1 on the page, the CSS selector retrieves the text of that cell from your HTML: "Tool & " followed by "PE311934-1-1".
The resultant string can then be processed for what you want. This method, and regex, are fragile at best considering how easily an updated source page can break these methods.
In your above example, you can use the class name, .Title1, and then use Replace() to remove the Tool & .
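For comparison, here is the same class-based lookup done with an HTML parser instead of a regex. It is a minimal sketch in Python using only the standard library (not VBA); the CellText class is a hypothetical helper, and it assumes the value wanted is the last text chunk inside the Title1 cell.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the non-empty text chunks inside <td> elements with a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values keep their case.
        if tag == "td" and dict(attrs).get("class") == self.class_name:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.inside = False

    def handle_data(self, data):
        if self.inside and data.strip():
            self.chunks.append(data.strip())


snippet = (
    "<TABLE class=DisplayMain1 cellSpacing=1 cellPadding=0><TBODY>\n"
    "<TR class=TitleLabelBig1>\n"
    '<TD class=Title1 colSpan=100><SPAN style="FONT-FAMILY: arial narrow; '
    'FONT-WEIGHT: normal">Tool & </SPAN><BR>PE311934-1-1 </TD></TR></TBODY></TABLE>'
)

parser = CellText("Title1")
parser.feed(snippet)
parser.close()

wanted = parser.chunks[-1]  # the text after the <BR>
print(wanted)
```

Taking the last chunk plays the same role as the Replace() call that strips the "Tool & " prefix.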

regex to get tags previous and next

I have some html in a single string that may or may not have newlines. It could look something like this:
<table><tr><th>blah1</th></tr><tr><input class="inputClass"><span>open<pfelclose/>pfelsingle'pfeldouble"pfel</span></input></tr></table>
formatted nicely:
<table>
<tr>
<th>blah1</th>
</tr>
<tr>
<input class="inputClass">
<span>open<pfelclose/>pfelsingle'pfeldouble"pfel</span>
</input>
</tr>
</table>
I'd like to search this string for
(open<pfel|close/>pfel|single'pfel|double"pfel)
but also get two open tags before and two close tags after. So I'd like to get something like:
<input class="inputClass"><span>open<pfelclose/>pfelsingle'pfeldouble"pfel</span></input>
I cannot assume that input or span will be there, nor can I assume that there are necessarily two tags before or two tags after.
My attempt seems to always pull the entire start of the string:
.*[<]{0,2}?(open<pfel|close/>pfel|single'pfel|double"pfel)[/>]{0,2}?
The trouble with your situation is that you want to find matching tags (the open and close tags before and after the text you're searching for). Regex cannot do this. It isn't capable of parsing a nested structure like HTML. Regex parses regular languages, and HTML isn't one. Advanced Regex engines can sometimes be coerced into doing almost what you're trying to do here, but it's usually more trouble than it's worth.
Your solution in the comments is probably the correct one. Find what you're looking for with the regex, and then use an HTML parser to get what you need.

Get last </td></tr> with regular expression?

I need to get all tags between the last </td> and the closing </tr> in each row. The regular expression I use, <\/TD\s*>(.*?)<\/TR\s*>, retrieves everything from the first </TD> to the closing </TR> (marked with ** in the sample below).
<TABLE>
<TR><TD>TD11**</TD><TD>TD12</TD><TD>TD13</TD><SPAN><FONT>test1</FONT></SPAN></TR>**
<TR><TD>TD21**</TD><TD>TD22</TD><TD>TD23</TD><SPAN><FONT>test2</FONT></SPAN></TR>**
</TABLE>
But what I really need is:
<TABLE>
<TR><TD>TD11</TD><TD>TD12</TD><TD>TD13**</TD><SPAN><FONT>test1</FONT></SPAN></TR>**
<TR><TD>TD21</TD><TD>TD22</TD><TD>TD23**</TD><SPAN><FONT>test2</FONT></SPAN></TR>**
</TABLE>
It's not recommended to use regular expressions to parse HTML. HTML is not a regular language, so regex-based parsing of it is notoriously unreliable.
Here's a good blog post explaining the logic and offering alternatives:
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
</TD>((?:(?!</T[DR]>).)*)</TR>
The regex starts to match at the first </TD>, but fails as soon as it reaches the second </TD> because of the (?!</T[DR]>)., which matches any character that's not the first character of a </TD> or </TR> tag. That's optional because of the enclosing (?:...)*, so it tries to match the next part of the regex, which is </TR>. That fails too, so the match attempt is abandoned.
It tries again starting at the second </TD> and fails again. Finally, it starts matching at the third </TD> and successfully matches from there to the first </TR>.
You may want to specify "single-line" or "dot-matches-all" mode, in case there are newlines that didn't show in your example. You didn't specify a regex flavor, so I can't say exactly how to do that.
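As one concrete flavor, the pattern can be tried in Python, where dot-matches-all mode is the re.DOTALL flag. This is a minimal sketch run against the sample rows from the question; the variable names are made up.

```python
import re

rows = (
    "<TABLE>\n"
    "<TR><TD>TD11</TD><TD>TD12</TD><TD>TD13</TD><SPAN><FONT>test1</FONT></SPAN></TR>\n"
    "<TR><TD>TD21</TD><TD>TD22</TD><TD>TD23</TD><SPAN><FONT>test2</FONT></SPAN></TR>\n"
    "</TABLE>"
)

# (?!</T[DR]>). consumes one character at a time, but never the start of </TD> or </TR>,
# so only the LAST </TD> in a row can be followed all the way to </TR>.
pattern = re.compile(r"</TD>((?:(?!</T[DR]>).)*)</TR>", re.DOTALL)

tails = pattern.findall(rows)
print(tails)
```

Each element of tails is the markup between the last </TD> and the closing </TR> of one row.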

Regex: Extracting readable (non-code) text and URLs from HTML documents

I am creating an application that will take a URL as input, retrieve the page's html content off the web and extract everything that isn't contained in a tag. In other words, the textual content of the page, as seen by the visitor to that page. That includes 'masking' out everything encapsuled in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not enveloped within a tag (but is best left alone).
I have constructed this regex:
(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)
It correctly selects all the content that i want to ignore, and only leaves the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?
So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
This is supposed to work generically, without knowing any specific tags in the html. It's supposed to extract all text. Additionally, I need to preserve the original html so the page retains all its links and scripts - i only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes or script variables etc (so I can't just do a "replace with nothing" on all the matches I get, because even though I am then left with what I need, it's a hassle to reinsert that back into the correct places of the fully functional document).
I want to know if this is at all possible using regex (I know about HTML Agility Pack and XPath, but don't feel like using them).
Any suggestions?
Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).
what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.
That's what one would normally do. Or even simpler, replace every match of the markup pattern with an empty string and what you've got left is the stuff you're looking for.
It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.
Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
I would like to once again suggest the benefits of HTML Agility Pack.
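The unescaped > problem is easy to reproduce. A minimal sketch in Python, using only the generic "anything between < and >" alternative from the question's expression; the two sample strings are made up.

```python
import re

# The generic tag alternative from the question: lazily match from < to the next >.
TAG = re.compile(r"<[\s\S]*?>")

well_behaved = '<a title="a and b">x</a>'
# Valid HTML: a quoted attribute value may contain an unescaped >.
awkward = '<a title="a > b">x</a>'

clean = TAG.sub("", well_behaved)
broken = TAG.sub("", awkward)

print(repr(clean))
print(repr(broken))
```

In the second case the match ends at the > inside the attribute value, so the rest of the tag leaks into the "text".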
ETA: since you seem to want it, here's some examples of markup that looks like it'll trip up your expression.
<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
- very common URL char missing in group
- more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: url('link');"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)
and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
Regex is not reliable for retrieving the textual content of HTML documents. Regex cannot handle nested tags. Even if a document contains no nested tags, a regex still requires every tag to be properly closed.
If you are using PHP, for simplicity, I strongly recommend using the DOM (Document Object Model) extension to parse HTML documents and extract data from them. Most programming languages have a DOM library.
If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.
Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.
However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.
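The replace-with-empty-string idea can be sketched as follows, in Python, whose re module accepts the (?P<tag>...) syntax of the question's expression as written. The visible_text name and the sample page are assumptions for illustration, and the sketch inherits every limitation discussed in this thread.

```python
import re

# The markup pattern from the question, unchanged apart from the IGNORECASE flag.
MARKUP = re.compile(
    r"(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"  # whole <script>/<style> blocks
    r"|(?:<!--[\s\S]*?-->)"                           # comments
    r"|(?:<[\s\S]*?>)",                               # any other tag
    re.IGNORECASE,
)


def visible_text(html):
    """Return what is left after deleting everything the markup pattern matches."""
    return MARKUP.sub("", html)


page = (
    "<html><head><style>p{color:red}</style>"
    '<script>var a = "<b>";</script></head>'
    "<body><!-- note --><p>Hello <b>world</b></p></body></html>"
)

text = visible_text(page)
print(text)
```

The script block is removed as a unit, so the "<b>" inside the string literal never reaches the generic tag alternative.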
You cannot parse HTML with regular expressions.
Parsing HTML with regular expressions leads to sadness.
I know you're just doing it for fun, but there are so many packages out there that actually do the parsing the right way, AND do it reliably, AND have been tested.
Don't go reinventing the wheel, and doing it a way that is all but guaranteed to frustrate you down the road.
OK, so here's how I'm doing it:
Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):
(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)
Then in VB.Net:
' Needs "Imports System.IO" and "Imports System.Text.RegularExpressions" at the top of the file.
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)
The actual replacing of text happens here:
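Private Function MatchEvalFunction(ByVal match As Match) As String
    ' The "text" group is only non-empty for runs of plain text outside any tag.
    Dim plainText As String = match.Groups("text").Value
    If plainText IsNot Nothing AndAlso plainText <> "" Then
        ' Plain text: apply the word replacement.
        MatchEvalFunction = match.Value.Replace(plainText, plainText.Replace("Original word", "Replacement word"))
    Else
        ' Markup, script, style or comment: return it untouched.
        MatchEvalFunction = match.Value
    End If
End Function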
Voila. newHtml now contains an exact copy of the original, except every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all html and script code is preserved untouched. Of course, one could / would put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including function declaration and loading of html code etc. I'd be very interested in seeing a parallel solution, done in DOM etc for comparison (yes, I know this approach can be thrown off balance by certain occurrences of some nested tags quirks - in SCRIPT rewriting - but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well).
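The same match-evaluator idea can be sketched in Python, where re.sub accepts a function as the replacement. The function and variable names are hypothetical, and the text group uses + instead of * to skip empty matches, which is a small departure from the expression above. It shares the same known weaknesses.

```python
import re

# Same alternatives as above: markup first, then a named group for plain text.
TOKEN = re.compile(
    r"(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"
    r"|(?:<!--[\s\S]*?-->)"
    r"|(?:<[\s\S]*?>))"
    r"|(?P<text>[^<>]+)",
    re.IGNORECASE,
)


def replace_in_text(html, old, new):
    """Replace old with new in text runs only, leaving all markup untouched."""

    def evaluator(match):
        text = match.group("text")
        if text is None:
            # Tag, comment, script or style block: return it unchanged.
            return match.group(0)
        return text.replace(old, new)

    return TOKEN.sub(evaluator, html)


source = (
    '<p class="Original word">Original word</p>'
    '<script>var s = "Original word";</script>'
)
new_html = replace_in_text(source, "Original word", "Replacement word")
print(new_html)
```

Only the paragraph text changes; the attribute value and the script string keep the original wording.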
For Your Information,
Instead of regex, with jQuery it's possible to extract just the text from HTML markup. For that you can use the following pattern, where htmlString stands for the markup (for example, the result of $("#elementId").html()):
$("<div/>").html(htmlString).text()
You can refer to this JSFiddle.

PHP Regex: Get info between groups of HTML tags?

I have been programming a word-unscrambler. I need to parse the information between a group of tags and another, and put all matches into an array. The beginning tag is:
<tr> <td></td><td><li>
and the ending tag is:
</li></td> </tr>
I know some regular expressions, but I am unfamiliar with PHP.
<tr> <td><\/td><td><li>(.*)<\/li><\/td> <\/tr>
Test is here: http://rubular.com/regexes/13241
PHP supports Perl-compatible regexes through the preg_* functions. The POSIX ereg_* functions were deprecated in PHP 5.3 and removed in PHP 7, so preg_* is the set to use. I believe that the code posted by @Gazler should work, though it will need to be surrounded by /s, and quoted:
$reg = '/<tr> <td><\/td><td><li>(.*)<\/li><\/td> <\/tr>/';
There's a PHP PERL regex based test site here: http://j-r.camenisch.net/regex/
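Putting all matches into an array is the job of preg_match_all in PHP. The sketch below shows the same idea in Python with re.findall. The sample input is made up, and the group is written as (.*?) rather than (.*), on the assumption that several rows can sit on one line and each should yield its own match.

```python
import re

# Hypothetical input with two result rows on a single line.
html = (
    "<tr> <td></td><td><li>alpha</li></td> </tr>"
    "<tr> <td></td><td><li>beta</li></td> </tr>"
)

# Lazy group: stop at the first closing sequence so each row yields its own match.
pattern = re.compile(r"<tr> <td></td><td><li>(.*?)</li></td> </tr>")

words = pattern.findall(html)
print(words)
```

With a single capture group, findall returns a flat list of the captured strings, one per row.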