I know that I shouldn't be using REGEX for parsing HTML, and I promise to check out the HTML agility pack, but in the meantime, could some expert tell me if there's a pattern that will match this entire block?
<tr bgcolor="#f4f4ff"><td align="center"><font size="2">42</font></td>
<td align="center"><font size="2">35</font></td>
<td><font size="2"><b>Bears</b></font></td>
<td><font size="2">BV</font></td>
<td align="right"><font size="2"><b>$33,845</b></font></td>
<td align="right"><font size="2"><font color="#ff0000">-60.1%</font></font></td>
<td align="right"><font size="2">75</font></td>
<td align="right"><font size="2"><font color="#ff0000">-35</font></font></td>
<td align="right"><font size="2">$451</font></td>
<td align="right"><font size="2">$17,492,470</font></td>
<td align="right"><font size="2">-</font></td>
<td align="center"><font size="2">8</font></td>
</tr>
I'm using VBA, and regexoptions don't seem to be available. I've fiddled endlessly, and "should works" like
<tr[.\n]+tr>
<tr[.\s]+tr>
<tr[.\x0C\x\0A]+tr>
don't. I can match everything up to the first line break, then I hit a brick wall. Is there any workaround if the singleline option isn't available? Maybe using the VBA REPLACE function to change all vbcrlf instances to some other character before I try to match? And can somebody point me to an example of how much easier this would be with the HTML agility pack?
Okay, so you've heard the warnings against using regex to parse html?
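As far as I know, the VBScript regex engine that VBA uses has no DOTALL (single-line) mode, so to match across lines you have to fake a DOTALL dot with something like (?:.|[\r\n]) or [\s\S]. Note that inside a character class, as in [.\n], the dot is a literal period rather than a wildcard, which is why your attempts stop at the first line break.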
This regex will match your block:
<tr[\s\S]*?</tr>
But if there are nested rows, you will be sorry you used regex. :)
Sample code
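' Requires a reference to "Microsoft VBScript Regular Expressions 5.5"
Dim myRegExp As RegExp, myMatches As MatchCollection, matched_block As String
Set myRegExp = New RegExp
' [\s\S] matches any character, including line breaks
myRegExp.Pattern = "<tr[\s\S]*?</tr>"
' SubjectString holds the HTML to search
Set myMatches = myRegExp.Execute(SubjectString)
If myMatches.Count >= 1 Then
    matched_block = myMatches(0).Value
Else
    matched_block = ""
End If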
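The pattern itself is easy to sanity-check outside VBA. Here is a minimal Python sketch (Python rather than VBA, and the shortened sample row is made up) that contrasts the failing character-class attempts with the [\s\S] version:

```python
import re

# A shortened, made-up version of the row from the question, spread over three lines.
block = (
    '<tr bgcolor="#f4f4ff"><td align="center">42</td>\n'
    '<td align="center">35</td>\n'
    '</tr>'
)

# Without a DOTALL flag, "." stops at line breaks, so this finds nothing here.
dot_only = re.search(r"<tr.+tr>", block)

# Inside [...] the dot is a literal period, so this class only allows "." and "\n".
literal_dot_class = re.search(r"<tr[.\n]+tr>", block)

# [\s\S] means "whitespace or non-whitespace", i.e. any character at all.
fake_dotall = re.search(r"<tr[\s\S]*?</tr>", block)

print(dot_only, literal_dot_class)
print(fake_dotall.group(0))
```

The lazy *? keeps the match from running on to a later </tr> when the input contains more than one row.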
I am by no means an expert and only got BBEdit recently for a one-time project.
I am working on an HTML file that has lots of entries I would like to remove. What I wish to remove are all the tables that contain the string NOT CONVERTED, without removing the other tables, which follow pretty much the same pattern but contain different text.
<table border=0 width="100%">
<tr>
<td class="out" valign=top nowrap width=5%>30.12.2004
22:34:03 <font color=black><b>></b>TOM </font>
</td>
<td class="out"
align=left>{{{NOT CONVERTED}}}</td>
</tr>
</table>
<table border=0 width="100%"><tr><td class="timedel" valign=top nowrap
width=5%>30.12.2004 22:36:37 <font
color=black><b><</b>Benjamin </font></td><td class="incom" align=left>random string</td></tr></table>
<table border=0 width="100%"><tr><td class="incom" valign=top nowrap width=5%>30.12.2004
22:36:47 <font color=black><b><</b>Benjamin </font></td><td
class="incom" align=left>{{{NOT CONVERTED}}}</td></tr></table>
<table border=0 width="100%"><tr><td class="timedel" valign=top nowrap
width=5%>30.12.2004 22:36:47 <font
color=black><b><</b>Benjamin </font></td><td class="incom" align=left>random chat text</td></tr></table>
<table border=0 width="100%"><tr><td class="incom" valign=top nowrap width=5%>30.12.2004
22:36:50 <font color=black><b><</b>Benjamin </font></td><td
class="incom" align=left>{{{NOT CONVERTED}}}</td></tr></table>
I have 3000 of those tables in my html file and I wish to find and remove those tables.
The DATE, NAME and " >" are variables that differ in each table, the rest always has the same pattern.
How can I use the grep feature in this instance to identify this pattern and have it removed?
If you simply want to remove the entire table (from the opening <table> to the ending </table>), if the string "{{{NOT CONVERTED}}}" appears in the middle of it, then this pattern will match the entire table:
(?s)<table.+?>.+?{{{NOT CONVERTED}}}.+?</table>\n
(The (?s) at the beginning allows . to match across line breaks.)
Use "Replace All", replacing with nothing, to delete all of the eligible tables. Undo is your friend if it doesn't do what you need.
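One thing to watch for with a pattern of this shape: the lazy .+? in front of {{{NOT CONVERTED}}} is free to run past a </table>, so a match can begin at an earlier table that should be kept and swallow it along with the target. The Python sketch below shows this on made-up sample tables, together with one possible guard, a negative lookahead that stops the wildcard at </table>. As far as I know BBEdit's grep is PCRE-based, so the lookahead should carry over, but this sketch only exercises Python's re module; the names MARKER, NOT_END, naive and guarded are mine.

```python
import re

MARKER = r"\{\{\{NOT CONVERTED\}\}\}"
# One character, as long as it does not start a closing </table> tag.
NOT_END = r"(?:(?!</table>).)"

# Same shape as the pattern above.
naive = re.compile(r"(?s)<table.+?>.+?" + MARKER + r".+?</table>\n")
# Guarded variant: the wildcard can never cross a </table>.
guarded = re.compile(
    r"(?s)<table\b" + NOT_END + r"*?" + MARKER + NOT_END + r"*?</table>\n?"
)

keep_1 = '<table border=0><tr><td>first chat line</td></tr></table>\n'
drop = '<table border=0><tr><td>{{{NOT CONVERTED}}}</td></tr></table>\n'
keep_2 = '<table border=0><tr><td>second chat line</td></tr></table>\n'
sample = keep_1 + drop + keep_2

naive_result = naive.sub("", sample)      # also removes the first table
guarded_result = guarded.sub("", sample)  # removes only the marked table

print(naive_result)
print(guarded_result)
```

Either way, it is worth trying the pattern on a copy of the file first and comparing the number of tables before and after.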
Thanks Siegel again for your help - I played around with your code and this did the trick:
<table.+?>{{{NOT CONVERTED}}}.+?</table>
This successfully identified the tables that had the string not converted in them.
Thanks again !
With this HTML
<tr>
<td class="listOddRow">Chequing </td>
<td class="listOddRow">00-227-03</td>
<td class="listOddRow" nowrap="">0275-1</td>
<td class="listOddRow" align="right" nowrap="">$ 28.08</td>
</tr>
Does anyone know why this works
//td[contains(text(),"00-227-03")]/parent::tr//a
but not this? I want to remove the dashes from the text() before calling contains()
//td[contains(replace(text(), "-", ""),"0022703")]/parent::tr//a
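XPath 1.0 has no replace() function (it was added in XPath 2.0), but translate() can do this job: it replaces each character of the first string that appears in the second string with the character at the same position in the third string, and removes it if the third string has no character at that position. So you can use this XPath: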
//td[contains(translate(text(), "-", ""),"0022703")]/parent::tr//a
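To make the translate() semantics concrete, here is a small Python emulation of the XPath 1.0 rules. It is only an illustration of the rules, not an XPath engine, and the function name xpath_translate is made up:

```python
def xpath_translate(value: str, src: str, dst: str) -> str:
    """Emulate XPath 1.0 translate(value, src, dst)."""
    mapping = {}
    for i, ch in enumerate(src):
        if ch in mapping:
            continue  # the first occurrence in src decides the replacement
        # No character at this position in dst means "delete this character".
        mapping[ch] = dst[i] if i < len(dst) else None
    out = []
    for ch in value:
        replacement = mapping.get(ch, ch)  # unmapped characters pass through
        if replacement is not None:
            out.append(replacement)
    return "".join(out)


# The account number from the question, with the dashes stripped.
print(xpath_translate("00-227-03", "-", ""))
```

With an empty third argument every character listed in the second argument is simply dropped, which is what the expression above relies on.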
I have a bit of a dilemma. I need to parse a chunk of HTML with Jsoup; that chunk is later passed on to another class that handles the Jsoup elements. Unfortunately, when I pass in a chunk that represents part of a table, Jsoup throws out all of the HTML and gives me nothing but the text. Here is an example:
<tr>
<td>Declared</td>
<td>Other Supported Languages</td>
<td>/ATP_ETK_89078_1006/atp_etk_89078_1006_p4/nonshared/E-trak_API_Build/obfuscated/vna.dll</td>
<td align="right">1519616</td>
<td align="right"></td>
<td align="right"></td>
<td>COM DEV</td>
<td>Unspecified</td>
<td>License for COM DEV</td>
<td>Component (Dynamic Library)</td>
<td>100%</td>
<td style="text-align: center;"></td>
<td></td>
<td></td>
<td valign="top"></td>
</tr>
<tr>
<td>Declared</td>
<td>Other Supported Languages</td>
<td>/ATP_ETK_89078_1006/atp_etk_89078_1006_p4/nonshared/E-trak_API_Build/obfuscated/vna.dll</td>
<td align="right">1519616</td>
<td align="right"></td>
<td align="right"></td>
<td>COM DEV</td>
<td>Unspecified</td>
<td>License for COM DEV</td>
<td>Component (Dynamic Library)</td>
<td>100%</td>
<td style="text-align: center;"></td>
<td></td>
<td></td>
<td valign="top"></td>
</tr>
This is the fragment and as you can see it just represents two rows from a table.
However the Jsoup Doc produces the following:
<html>
<head></head>
<body>
Declared Other Supported Languages /ATP_ETK_89078_1006/atp_etk_89078_1006_p4/nonshared/E-trak_API_Build/obfuscated/vna.dll 1519616 COM DEV Unspecified License for COM DEV Component (Dynamic Library) 100%
Declared Other Supported Languages /ATP_ETK_89078_1006/atp_etk_89078_1006_p4/nonshared/E-trak_API_Build/obfuscated/vna.dll 1519616 COM DEV Unspecified License for COM DEV Component (Dynamic Library) 100%
</body>
</html>
Now if the original headers of the table were there including the table open/close headers it seems to work, but that defeats the entire purpose of this fragment parsing as the HTML docs can get quite huge.
ANY HELP would be greatly appreciated.
Tested with JSoup 1.7.1 --> same problem.
I guess the only way is to wrap your fragment in a <table> tag.
String html = ... // your html
Document doc = Jsoup.parse(html);
// doesn't work, as you said
String html = ... // your html
Document doc = Jsoup.parse("<table>" + html + "</table>");
// works
I don't know how you use Jsoup in your case, but maybe you can do something like this:
public String doSomethingWithFragment(String html)
{
    // Wrap the rows so the parser keeps the <tr>/<td> elements
    Document doc = Jsoup.parse("<table>" + html + "</table>");
    // Jsoup inserts a <tbody>; its children are the original rows
    Elements fragment = doc.select("tbody > *");
    // Do something with 'fragment' here ...
    return fragment.outerHtml();
}
In this example, fragment contains exactly the HTML you posted above, and you can do further things with it.
I know it's a really strange workaround, adding things and removing them again in the next step. But it works (I hope) :-)
We're creating html signatures for all the users within our domain, based on a simple html template.
...
<tr>
<td colspan="3" style="font-style:normal; font-size:12px;"><%Tel%></td>
</tr>
<tr>
<td colspan="3" style="font-style:normal; font-size:12px;"><%Mobile%></td>
</tr>
<tr>
<td colspan="3" style="font-style:normal; font-size:12px;"><%Fax%></td>
</tr>
...
The placeholders are replaced with the actual numbers for a user.
The following lines are a part of the generated signature, with telephone, mobile and fax numbers. If a user has no mobile number, the second tr-td is empty:
...
<tr>
<td colspan="3" style="font-style:normal; font-size:12px;">T +123 456 789</td>
</tr>
<tr>
<td colspan="3" style="font-style:normal; font-size:12px;"></td>
</tr>
<tr>
<td colspan="3" style="font-style:normal; font-size:12px;">F +123 456 789</td>
</tr>
...
When a line is left empty (like the second row), the HTML renders just fine in modern browsers, keeping the Tel and Fax lines close together.
However, once I add this template to Outlook 2003, Outlook adds an extra '&nbsp;' to the HTML, between the empty td tags. This results in a full empty line being shown between the tel and fax numbers.
Obviously, the user is annoyed with this extra line and cannot be bothered to remove the extra line manually each time. The signatures are read-only, so changing it in the settings is not an option.
Any ideas on why this happens, and how to fix this?
Edit: Apologies, Outlook version actually is 2003, not 2010.
Not sure if this will work but it's worth a shot. Have you tried just closing the tag like so:
<td colspan="3" style="font-style:normal; font-size:12px;"/>
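If the generation step is under your control, another option might be to leave the row out entirely when a value is missing, so that Outlook has no empty cell to pad. Below is a rough Python sketch of that idea. The function name, the regex and the sample values are made up for illustration, and the real generator may well work differently:

```python
import re
from html import escape

# One template row: <tr><td ...><%Name%></td></tr>, plus trailing whitespace.
ROW = re.compile(r"<tr>\s*<td[^>]*>\s*<%(\w+)%>\s*</td>\s*</tr>\s*")


def render_signature(template: str, values: dict) -> str:
    """Fill the placeholders; drop the whole row when the value is empty."""
    def fill(match):
        name = match.group(1)
        value = (values.get(name) or "").strip()
        if not value:
            return ""  # no row at all, so nothing for Outlook to pad
        return match.group(0).replace("<%" + name + "%>", escape(value))
    return ROW.sub(fill, template)


TEMPLATE = """<tr>
<td colspan="3" style="font-style:normal; font-size:12px;"><%Tel%></td>
</tr>
<tr>
<td colspan="3" style="font-style:normal; font-size:12px;"><%Mobile%></td>
</tr>
<tr>
<td colspan="3" style="font-style:normal; font-size:12px;"><%Fax%></td>
</tr>
"""

# A user without a mobile number.
signature = render_signature(
    TEMPLATE, {"Tel": "T +123 456 789", "Fax": "F +123 456 789"}
)
print(signature)
```

This only handles rows that consist of a single placeholder cell, which is the shape shown in the template above.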
Ok. This is a big one. I'll try and explain my issue but let me know if you need more info.
I would like to alter the HTML that SharePoint 2010 generates.
I'm going to use the HTML Agility Pack, which takes a string of HTML (among other objects) and lets me alter the source.
There are 2 ways of altering the full source in SP.
1. Using Control Adapters or extending controls, I can access the Page, MasterPage or even ContentPlaceHolder render methods, grab the HTML, alter it and then write it.
2. Use an HTTP Module with a filter and alter the output stream.
Unfortunately there are issues with both of these methods.
Number 2, the filter, works well, but you have to disable the output cache.
I can't do this. The sites I brand have massive amounts of traffic.
So filters are moot until the SharePoint 2010 team fixes / presents us with a workaround. I read somewhere in my travels they are aware and will be doing something about it.
Number 1 works great. I simply use the following and I can alter the HTML of the page but there is one big problem.
protected override void Render(HtmlTextWriter writer)
{
    // Render the page into an in-memory writer instead of the response
    StringBuilder sb = new StringBuilder();
    StringWriter sw = new StringWriter(sb);
    HtmlTextWriter htw = new HtmlTextWriter(sw);
    base.Render(htw);
    // String to hold the HTML from the StringBuilder of the HtmlTextWriter
    string html = sb.ToString();
    // Alter the HTML here using the HTML Agility Pack
    HtmlDocument hd = new HtmlDocument();
    hd.LoadHtml(html);
    html = hd.DocumentNode.OuterHtml;
    // Write the (altered) HTML to the real writer
    writer.Write(html);
}
This code works great, but SharePoint 2010 is appending data to the writer / or altering controls after the render override for the Page / MasterPage.
When I debug, step through and look at the html string, it looks like the following.
"<html .... ....</html>"
This is important to note because there is nothing before the first HTML tag. The Opening "<" is in position 0.
But when the HTML ends up in the browser I see the following.
DOMAIN\user<script type="text/javascript">
//<![CDATA[
var _spUserId=1;
//]]>
</script>
<html...
....</html>
Also, the user name is missing from the welcome area in the top right of the ribbon, because it is now written before the opening <html> tag.
If I don't write anything to the writer object in the render override, the page in the browser shows just the content that appears before the <html> tag. The funny thing is that it shouldn't show anything!
The same happens on other pages: on ANY page that has a list view, all of the items are rendered before the <html> tag. It's important to note that it's just the values, not the HTML. The HTML for the list items exists where it should, down in the list view.
For Example. (Page is "/_layouts/viewlsts.aspx" )
Without the render override.
<html...
...
//The Table where the data should be.
<tr>
<td class="ms-gb" colspan="5" style="white-space:nowrap;">
<h3 class="ms-standardheader">
Picture Libraries
</h3>
</td>
</tr>
<tr><td class="ms-vb2 ms-viewlsts-noitems" colspan="6">
There are no picture libraries. To create one, click <b>Create</b> above.
</td></tr>
<tr>
<td class="ms-gb" colspan="5" style="white-space:nowrap;">
<h3 class="ms-standardheader">
Lists
</h3>
</td>
</tr>
<tr><td class="ms-vb2 ms-viewlsts-noitems" colspan="6">
There are no lists. To create one, click <b>Create</b> above.
</td></tr>
<tr>
<td class="ms-gb" colspan="5" style="white-space:nowrap;">
<h3 class="ms-standardheader">
Discussion Boards
</h3>
</td>
</tr>
<tr class="ms-alternatingstrong">
<td class="ms-vb-icon">
<a id="viewlistDiscussionBoard" href="/Lists/Discussion%20Board/AllItems.aspx" >
<img border="0" alt="Discussion Board" src="/_layouts/images/itdisc.png" width="16" height="16" /></a>
</td>
<td class="ms-vb2" >
<a id="viewlistDiscussionBoard" href="/Lists/Discussion%20Board/AllItems.aspx">Discussion Board</a>
</td>
<td class="ms-vb2" width="40%" >
</td>
<td class="ms-vb2" width="3%" align="right">
1
</td>
<td class="ms-vb2" width="25%" >
<nobr>
3 days ago
</nobr>
</td>
</tr>
...</html>
With the render override.
DOMAIN\user<script type="text/javascript">
//<![CDATA[
var _spUserId=1;
//]]>
</script>Document Libraries"viewlistDocumentLibrary""/AnalyticsReports/Forms/AllItems.aspx""Customized Reports""/_layouts/images/itdl.png""viewlistDocumentLibrary""/AnalyticsReports/Forms/AllItems.aspx"Customized ReportsThis Document library has the templates to create Web Analytics custom reports for this site collection04 days ago"viewlistDocumentLibrary""/Style%20Library/Forms/AllItems.aspx""Style Library""/_layouts/images/itdl.png""viewlistDocumentLibrary""/Style%20Library/Forms/AllItems.aspx"Style LibraryUse the style library to store style sheets, such as CSS or XSL files. The style sheets in this gallery can be used by this site or any of its subsites.04 days agoPicture LibrariesThere are no picture libraries. To create one, click <b>Create</b> above.ListsThere are no lists. To create one, click <b>Create</b> above.Discussion Boards"viewlistDiscussionBoard""/Lists/Discussion%20Board/AllItems.aspx""Discussion Board""/_layouts/images/itdisc.png""viewlistDiscussionBoard""/Lists/Discussion%20Board/AllItems.aspx"Discussion Board13 days agoSurveysThere are no surveys. To create one, click <b>Create</b> above.Blog0
<html...
//The Table where the data should be.
...
<tr>
<td class="ms-gb" colspan="5" style="white-space:nowrap;">
<h3 class="ms-standardheader">
</h3>
</td>
</tr>
<tr><td class="ms-vb2 ms-viewlsts-noitems" colspan="6">
</td></tr>
<tr>
<td class="ms-gb" colspan="5" style="white-space:nowrap;">
<h3 class="ms-standardheader">
</h3>
</td>
</tr>
<tr class="ms-alternatingstrong">
<td class="ms-vb-icon">
<a id= href= >
<img border="0" alt= src= width="16" height="16" /></a>
</td>
<td class="ms-vb2" >
<a id= href=></a>
</td>
<td class="ms-vb2" width="40%" >
</td>
<td class="ms-vb2" width="3%" align="right">
</td>
<td class="ms-vb2" width="25%" >
<nobr>
</nobr>
</td>
</tr>
... </html>
My guess is that it has something to do with resource files.
Whatever the case, it's obvious that there is some sort of manipulation of the objects after the Page or MasterPage render methods. I just can't find it.
My HTTP Module filter stream has the HTML in the proper place.
So somewhere between the Page render and sending the HTML to the browser there is something going on.
Here are a couple of other people reporting the same issue without any useful responses.
Link 1
Link 2
I would appreciate any insight on this! THANKS!
The problem is caused by something in the page interfering with the output cache during render, and more specifically by the behaviour of the PostCacheSubstitutionTextHelper class. Your call to Render above probably triggers this.
The Welcome.ascx control behaves approximately like this (pseudo sequence diagram):
Page Render > Welcome.ascx Render > PersonalActions Render > PostCacheSubstitutionText.Render > PostCacheSubstitutionTextHelper.RenderAndRegisterSubstitutionCallbackHandler
(new instance of) PostCacheSubstitutionText.Render (called in delegate and now writes to HtmlTextWriter) > PostCacheSubstitutionTextHelper.RenderAndRegisterSubstitutionCallbackHandler > HttpContext.Response.WriteSubstitution(stuffFromNewInstanceOfPostCacheSubstitutionText)
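As a way to picture the effect of that sequence, here is a toy Python model. It is only my reading of the flow above, not SharePoint's or ASP.NET's actual code, and every class and function name in it is invented. The point is that a substitution-style control writes straight to the response, while the rest of the page goes into the buffer that the render override flushes afterwards:

```python
import io


class Response:
    """Toy stand-in for the HTTP response stream."""

    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

    def write_substitution(self, callback):
        # Bypasses whatever writer the page is currently rendering into.
        self.chunks.append(callback())


def render_page(writer, response):
    writer.write("<html><body>Welcome ")
    # The substitution-style control ignores `writer` and hits the response.
    response.write_substitution(lambda: "DOMAIN\\user")
    writer.write("</body></html>")


def render_with_override(response):
    buffer = io.StringIO()
    render_page(buffer, response)   # plays the role of base.Render(htw)
    html = buffer.getvalue()        # starts with "<html", as seen in the debugger
    response.write(html)            # plays the role of writer.Write(html)
    return html


response = Response()
captured = render_with_override(response)
sent = "".join(response.chunks)
print(captured)
print(sent)
```

In this model the captured string looks clean, yet the user name reaches the browser ahead of the opening tag and is missing from the welcome area, which matches the symptom described in the question.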
So enough with the SharePoint freakshow and over to the workaround :-)
Try adding this to your Welcome.ascx in the hive:
<%-- at the bottom of the directives in /CONTROLTEMPLATES/Welcome.ascx --%>
<%@ OutputCache Duration="1" VaryByParam="none" %>
Notice that the Duration is set to 1, since 0 is not allowed for user controls. This leaves a theoretical possibility of failure, but in our scenario it worked.
I ran into this same problem. I was able to work around the issue by removing this line from the master page.
<wssuc:Welcome id="IdWelcome" runat="server">
</wssuc:Welcome>