I am by no means an expert; I only got BBEdit recently for a one-time project.
I am working on an HTML file that has lots of entries I would like to remove. I want to remove every table that contains the string NOT CONVERTED, without removing the other tables, which follow much the same pattern but contain different text.
<table border=0 width="100%">
<tr>
<td class="out" valign=top nowrap width=5%>30.12.2004
22:34:03 <font color=black><b>></b>TOM </font>
</td>
<td class="out"
align=left>{{{NOT CONVERTED}}}</td>
</tr>
</table>
<table border=0 width="100%"><tr><td class="timedel" valign=top nowrap
width=5%>30.12.2004 22:36:37 <font
color=black><b><</b>Benjamin </font></td><td class="incom" align=left>random string</td></tr></table>
<table border=0 width="100%"><tr><td class="incom" valign=top nowrap width=5%>30.12.2004
22:36:47 <font color=black><b><</b>Benjamin </font></td><td
class="incom" align=left>{{{NOT CONVERTED}}}</td></tr></table>
<table border=0 width="100%"><tr><td class="timedel" valign=top nowrap
width=5%>30.12.2004 22:36:47 <font
color=black><b><</b>Benjamin </font></td><td class="incom" align=left>random chat text</td></tr></table>
<table border=0 width="100%"><tr><td class="incom" valign=top nowrap width=5%>30.12.2004
22:36:50 <font color=black><b><</b>Benjamin </font></td><td
class="incom" align=left>{{{NOT CONVERTED}}}</td></tr></table>
I have 3000 of those tables in my HTML file, and I wish to find and remove them.
The date, the name and the ">" vary from table to table; the rest always follows the same pattern.
How can I use the grep feature to identify this pattern and remove it?
If you simply want to remove the entire table (from the opening <table> to the closing </table>) whenever the string "{{{NOT CONVERTED}}}" appears inside it, then this pattern will match the entire table:
(?s)<table.+?>.+?{{{NOT CONVERTED}}}.+?</table>\n
(The (?s) at the beginning allows . to match across line breaks.)
Use "Replace All", replacing with nothing, to delete all of the eligible tables. Undo is your friend if it doesn't do what you need.
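One caveat with the pattern above: a lazy .+? is still free to run past a </table>, so a table without the marker can be swallowed together with the marked table that follows it. The following is a minimal Python sketch of a stricter variant, not BBEdit syntax; it assumes the tables are not nested, and the names NOT_CONVERTED_TABLE and strip_not_converted are made up for illustration.

```python
import re

# Each "(?:(?!</table>).)*?" step refuses to cross a closing </table>,
# so a match can never start in one table and end in another.
NOT_CONVERTED_TABLE = re.compile(
    r"<table\b[^>]*>"
    r"(?:(?!</table>).)*?"
    r"\{\{\{NOT CONVERTED\}\}\}"
    r"(?:(?!</table>).)*?"
    r"</table>\n?",
    re.S,  # let "." match line breaks, like (?s)
)

def strip_not_converted(html: str) -> str:
    """Delete every non-nested table that contains {{{NOT CONVERTED}}}."""
    return NOT_CONVERTED_TABLE.sub("", html)

# Hypothetical two-table sample modelled on the question's markup.
sample = (
    '<table border=0 width="100%"><tr><td class="incom">30.12.2004\n'
    '22:36:37</td><td class="incom" align=left>random string</td></tr></table>\n'
    '<table border=0 width="100%"><tr><td class="incom">30.12.2004\n'
    '22:36:47</td><td\nclass="incom" align=left>{{{NOT CONVERTED}}}</td></tr></table>\n'
)
cleaned = strip_not_converted(sample)
```

Only the second table is removed here; the "random string" table is left untouched.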
Thanks Siegel again for your help - I played around with your code and this did the trick:
<table.+?>{{{NOT CONVERTED}}}.+?</table>
This successfully identified the tables that had the string NOT CONVERTED in them.
Thanks again!
Hello, I want to match a table tag, followed by any characters but not another table, followed by an element with the id ContentPlaceHolder1, and finally followed by the closing </table> tag.
I wrote this regexp:
~\<table[^>]*>.*?ContentPlaceHolder1.+?<\/table>~is
In my text editor (EmEditor) it works fine, but in a PHP script it matches the first table tag of the page and all the code that follows.
Can anyone tell me what's wrong?
Thanks a lot
I am just assuming what you wish to achieve, and as Matt has commented on your question, a code snippet with an explanation of what exactly you are trying to achieve would help us help you.
So, in that context, I will try to guess the issue:
I'm guessing that your code has an element with id ContentPlaceHolder1 near the end and maybe nowhere else. What is leading me to assume that is that you are stating:
in a PHP script it matches the first table tag of the page and all the code that follows.
and also
want to match a table tag, followed by any characters but not another table
That is not what happens, though. In fact, your regex does the following:
1. Match the first <table> tag with any attributes there might be inside it ([^>]*)
2. Match any character as few times as possible (.*?)
3. Match ContentPlaceHolder1
4. Match one or more characters, but as few as possible to make a match (.+?)
5. Match a closing <\/table> tag
What I believe you are misinterpreting is step #2. This step does not skip over leading <table> tags; all it does is stop at the first occurrence of the keyword ContentPlaceHolder1 rather than a later one.
Consider the following example (please ignore that the html is broken, it's just an example):
<table border="3" cellpadding="10" cellspacing="10">
<td>
<table border="3" cellpadding="3" cellspacing="3">
<td>2nd table</td>
<some_element id="ContentPlaceHolder1"></some_element>
</table>
<td>2nd table</td>
<tr>
<td>2nd table</td>
<td>2nd table</td>
</tr>
</table>
<some_element id="ContentPlaceHolder1"></some_element>
</td>
<td> the cell next to this one has a smaller table inside of it, a table inside a table.</td>
</table>
Here, .*? does not instruct the regex engine to avoid matching a second <table> tag. What it does instead is make the engine stop at the first occurrence of the keyword ContentPlaceHolder1 rather than greedily matching up to the last one.
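To make that concrete, here is a small Python sketch of the same behaviour. It assumes the PHP ~...~is delimiters and flags correspond to re.I | re.S, and the two-table page with the ids "outer" and "inner" is made up for illustration.

```python
import re

# The asker's pattern with PHP's "is" modifiers expressed as Python flags.
PATTERN = re.compile(r"<table[^>]*>.*?ContentPlaceHolder1.+?</table>", re.I | re.S)

# Hypothetical page: an unrelated table, then the table holding the id.
page = (
    '<table id="outer"><tr><td>aaaaa</td></tr></table>\n'
    '<table id="inner"><tr><td><a id="ContentPlaceHolder1">link</a></td></tr></table>\n'
)

match = PATTERN.search(page)

# The lazy .*? only decides where the match ENDS. The match still BEGINS
# at the earliest <table> from which the rest of the pattern can succeed.
starts_at_first_table = match.group(0).startswith('<table id="outer">')
```

The match begins at the very first table and runs through the second one, which is the result reported for the PHP script.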
What you are trying to achieve can be done with a negative lookahead. A negative lookahead tells the regex engine to look ahead and to match the first part only if the second part does not follow. You can see this in practice in this demo, where I'm using a negative lookahead to instruct the regex engine to only match a <table> tag if it is not followed by another <table> tag (<table[^>]*>(?!.*<table[^>]*>)).
Please review my answer, and if it does not solve your issue, please add more information and a sample of your code so that we can provide further assistance.
Regards
Thanks for the answer.
This is a hypothetical page:
<table border="3" cellpadding="10" cellspacing="10">
<tr>
<td>aaaaa</td>
</tr>
</table>
<table border="3" cellpadding="10" cellspacing="10">
<tr>
<td>aaaaa</td>
</tr>
</table>
<table border="3" cellpadding="10" cellspacing="10">
<tr>
<td>
<!-- from here -->
<table border="3" cellpadding="3" cellspacing="3">
<tr>
<td>aaaa</td>
<td><a id="ContentPlaceHolder1">link</a></td>
</tr>
</table>
<!-- to here -->
</td>
</tr>
</table>
<table border="3" cellpadding="10" cellspacing="10">
<tr>
<td>aaaaa</td>
</tr>
</table>
<table border="3" cellpadding="10" cellspacing="10">
<tr>
<td>aaaaa</td>
</tr>
</table>
I want to match from the first comment to the second.
In other words, I want to match the complete table that contains the element with the id ContentPlaceHolder1.
My regexp in PHP matches starting from the first table tag of the page.
Thanks a lot
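One possible way to get just that inner table is to forbid the text between the opening tag and the id from containing another <table. The sketch below is in Python rather than PHP and is only an illustration: the name INNERMOST and the shortened page are made up, and it would still break if the wanted table itself contained a nested table after the id.

```python
import re

INNERMOST = re.compile(
    r"<table\b[^>]*>"        # an opening <table ...> tag
    r"(?:(?!<table\b).)*?"   # text that never starts another <table
    r"ContentPlaceHolder1"   # the id we are looking for
    r".*?</table>",          # up to the nearest closing tag
    re.I | re.S,
)

# Shortened version of the hypothetical page from the question.
page = """<table border="3"><tr><td>aaaaa</td></tr></table>
<table border="3"><tr><td>
<table border="3" cellpadding="3"><tr><td>aaaa</td>
<td><a id="ContentPlaceHolder1">link</a></td></tr></table>
</td></tr></table>
<table border="3"><tr><td>aaaaa</td></tr></table>"""

found = INNERMOST.search(page).group(0)
```

Starting from either of the first two tables fails, because another <table appears before the id, so the match begins at the table that directly contains it.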
I am trying to parse some text from a bibliographic database which contains non-standard tables. Individual fields of an article may or may not exist, but when they exist they always use the same tags. For example, all articles have a title, but only some of them have a keywords section. When a section is present, it is shown with standard tags like this:
<tr>
<td align="right" valign="top" nowrap="nowrap">Database Name: </td>
<td>Social Science Database</td>
</tr>
<tr>
<td align="right" valign="top" nowrap="nowrap">Journal: </td>
<td>Social Science and Education, 2011,8(4):29-42</td>
</tr>
<tr>
<td align="right" valign="top" nowrap="nowrap">Author: </td>
<td>James H.; Chaomei C.</td>
<td align="right" valign="top" nowrap="nowrap">Type: </td>
<td>Journal</td>
</tr>
<tr>
<td align="right" valign="top" nowrap="nowrap">Article Type: </td>
<td>Research Article</td>
</tr>
<tr>
<td align="right" valign="top" nowrap="nowrap">Retrieve Type: </td>
<td>Bibliographic</td>
</tr>
<tr><td align="right" valign="top" nowrap="nowrap">Language: </td>
<td>En</td>
</tr>
<tr>
<td align="right" valign="top" nowrap="nowrap">Abstract Language: </td>
<td>En</td>
</tr>
Here is my question. I am trying to parse the text with KNIME using XPath, but I couldn't get what I want. I want to find the <tr> that contains specific text and take the second <td> of that row. For example:
for "Database Name:" the XPath must return "Social Science Database".
I tried this code:
.//dns:tr//text()[contains(., 'Database Name:')]
But the result contains just the first <td>; I need the second one. I also tried this code, but it returns nothing:
.//dns:tr//text()[contains(., 'Database Name:')]/dns:td[*]
You can try this:
.//dns:tr//text()[contains(., 'Database Name:')]/../../dns:td[2]
.. takes you to the parent. Starting from the text node, you need to go two levels up (first to the <td>, then to the <tr>) and take the 2nd td.
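Note that the Author row in the sample holds two label/value pairs, so for "Type:" the second td of the row is the author value; in XPath, stepping from the label cell to its next sibling (/../following-sibling::dns:td[1]) avoids that. The same "next cell after the label" idea can be sketched outside KNIME in Python with the standard library. This is only an illustration: the helper name value_for is made up, and the sample is trimmed and has no namespace, so no dns: prefix is needed.

```python
import xml.etree.ElementTree as ET

# Trimmed, namespace-free version of the rows from the question.
ROWS = """<table>
<tr><td align="right">Database Name: </td><td>Social Science Database</td></tr>
<tr><td align="right">Author: </td><td>James H.; Chaomei C.</td>
<td align="right">Type: </td><td>Journal</td></tr>
</table>"""

def value_for(root, label):
    """Return the text of the cell that directly follows the label cell."""
    for tr in root.iter("tr"):
        cells = tr.findall("td")
        for i, cell in enumerate(cells[:-1]):
            # Substring test, like XPath contains().
            if label in "".join(cell.itertext()):
                return "".join(cells[i + 1].itertext()).strip()
    return None

root = ET.fromstring(ROWS)
database = value_for(root, "Database Name:")
kind = value_for(root, "Type:")
```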
I know that I shouldn't be using REGEX for parsing HTML, and I promise to check out the HTML agility pack, but in the meantime, could some expert tell me if there's a pattern that will match this entire block?
<tr bgcolor="#f4f4ff"><td align="center"><font size="2">42</font></td>
<td align="center"><font size="2">35</font></td>
<td><font size="2"><b>Bears</b></font></td>
<td><font size="2">BV</font></td>
<td align="right"><font size="2"><b>$33,845</b></font></td>
<td align="right"><font size="2"><font color="#ff0000">-60.1%</font></font></td>
<td align="right"><font size="2">75</font></td>
<td align="right"><font size="2"><font color="#ff0000">-35</font></font></td>
<td align="right"><font size="2">$451</font></td>
<td align="right"><font size="2">$17,492,470</font></td>
<td align="right"><font size="2">-</font></td>
<td align="center"><font size="2">8</font></td>
</tr>
I'm using VBA, and regexoptions don't seem to be available. I've fiddled endlessly, and "should works" like
<tr[.\n]+tr>
<tr[.\s]+tr>
<tr[.\x0C\x\0A]+tr>
don't. I can match everything up to the first line break, then I hit a brick wall. Is there any workaround if the singleline option isn't available? Maybe using the VBA REPLACE function to change all vbcrlf instances to some other character before I try to match? And can somebody point me to an example of how much easier this would be with the HTML agility pack?
Okay, so you've heard the warnings against using regex to parse html?
As far as I know, the VBScript regex engine doesn't have a DOTALL mode, so to match across lines we have to make a fake DOTALL dot with something like this: (?:.|[\r\n]) or [\s\S]
This regex will match your block:
<tr[\s\S]*?</tr>
But if there are nested rows, you will be sorry you used regex. :)
Sample code
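' Assumes a reference to "Microsoft VBScript Regular Expressions 5.5"
' and that SubjectString already holds the HTML to search.
Dim myRegExp, myMatches, matched_block
Set myRegExp = New RegExp
' [\s\S] matches any character, including line breaks.
myRegExp.Pattern = "<tr[\s\S]*?</tr>"
Set myMatches = myRegExp.Execute(SubjectString)
If myMatches.Count >= 1 Then
    ' Keep only the first matching row.
    matched_block = myMatches(0).Value
Else
    matched_block = ""
End If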
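The attempts in the question fail for a reason worth spelling out: inside a character class the dot loses its special meaning, so [.\n] matches only a literal "." or a line break. A short Python sketch shows the difference; Python is used here only as a convenient way to try the patterns, and the two-row snippet is a made-up abbreviation of the question's row.

```python
import re

# Abbreviated, hypothetical version of the row from the question.
html = (
    '<tr bgcolor="#f4f4ff"><td align="center">42</td>\n'
    '<td align="center">35</td>\n'
    '</tr>\n'
    '<tr><td>next row</td></tr>'
)

# "." inside [...] is a literal dot, so this cannot get past the space after "<tr".
failed = re.search(r"<tr[.\n]+tr>", html)

# [\s\S] is "whitespace or non-whitespace", i.e. any character at all,
# so no single-line / DOTALL option is needed.
block = re.search(r"<tr[\s\S]*?</tr>", html).group(0)
```

Because the quantifier is lazy, block stops at the first </tr> and does not run on into the next row.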
Here is the code for the table:
<table align="center" width="303" height="740" border="1" cellpadding="10">
<tr>
<th width="130" height="41" scope="col">URL1 - Normal</th>
<th width="121" scope="col">URL2 - Hover</th>
</tr>
<tr>
<td height="94"><img src="http://i1018.photobucket.com/albums/af309/5416339/ad-green.png"/></td>
<td><img src="http://i1018.photobucket.com/albums/af309/5416339/ad-green-h.png" alt=""/></td>
</tr>
<tr>
<td height="124"><img src="http://i1018.photobucket.com/albums/af309/5416339/ad-blue.png" alt=""/></td>
<td><img src="http://i1018.photobucket.com/albums/af309/5416339/ad-blue-h.png" alt=""/></td>
</tr>
<tr>
<td height="147"><img src="http://i1018.photobucket.com/albums/af309/5416339/ad-grey-h.png" alt=""/></td>
<td><img src="http://i1018.photobucket.com/albums/af309/5416339/ad-grey.png" alt=""/></td>
</tr>
<tr>
<td height="137"><img src="http://i1018.photobucket.com/albums/af309/5416339/ad-pink.png" alt=""/></td>
<td><img src="http://i1018.photobucket.com/albums/af309/5416339/ad-pink-h.png" alt=""/></td>
</tr>
<tr>
<td height="132"><img src="http://i1018.photobucket.com/albums/af309/5416339/ad-red.png" alt=""/></td>
<td><img src="http://i1018.photobucket.com/albums/af309/5416339/ad-red-h.png" alt=""/></td>
</tr>
<tr>
<td height="132"><img src="http://i1018.photobucket.com/albums/af309/5416339/ad-black.png" alt=""/></td>
<td><img src="http://i1018.photobucket.com/albums/af309/5416339/ad-black-h.png" alt=""/></td>
</tr>
</table>
When I insert the table, it leaves a gap in-between the table and the text. If I remove the table, then everything is fine. What's going wrong here?
Blogspot inserts line breaks for you... and they push the table down. (I haven't found a workaround yet.)
If you view the source, you can see them:
<table align="center" width="303" height="740" border="1" cellpadding="10"><br />
<tr><br />
<th width="130" height="41" scope="col">URL1 - Normal</th><br />
<th width="121" scope="col">URL2 - Hover</th><br />
</tr><br />
<tr><br />
<td height="94"><img src="http://i1018.photobucket.com/albums/af309/5416339/ad-green.png"/></td><br />
...
Because the BRs are invalid when directly inside a TABLE, TR, or after a TH or TD, the browser pushes those elements out of and above the table when rendering the DOM.
If you take a look at the source of the page, you'll notice a TON of <br/> tags interspersed with your table (but not contained in cell elements). They are rendered above the table.
It looks like your HTML is being parsed by something, and your line-breaks are being replaced with BR tags.
Quick solution: remove all linebreaks and just have the table code on one line :)
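If the table is long, collapsing it by hand is tedious. A tiny Python sketch of that step, assuming the blog editor turns every newline in the post into a <br /> (the function name collapse_lines is made up):

```python
def collapse_lines(table_html: str) -> str:
    """Join the markup onto a single line so no newline is left to convert."""
    # Assumes line breaks occur only between tags, as in the table above;
    # text or attribute values wrapped across lines would be glued together.
    return "".join(line.strip() for line in table_html.splitlines())

# Hypothetical shortened table.
table = """<table border="1">
  <tr>
    <th>URL1 - Normal</th>
    <th>URL2 - Hover</th>
  </tr>
</table>"""

one_line = collapse_lines(table)
```

The resulting one_line string can then be pasted into the post's HTML view.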
It has nothing to do with the table. It's the fact that there are 31 <br> (line break) tags before the table, which are what is creating the huge gap.
It sounds like BlogSpot (or whatever blog service you are using) is adding extra <br> tags based on how you're formatting the rest of your content. Edit the source of the page if possible and manually remove them...otherwise it becomes a support issue with whatever blog platform you're on.
This has nothing to do with anything in your table markup. Viewing the HTML source of that page shows about 30 <br> tags ahead of the table. They are obviously responsible for the extra space.
Why you get 30 <br> tags when inserting a table must have something to do with how blogspot.com is formatting your content. Your best bet is to try editing the HTML by hand to remove the <br> tags. If you can't do that, or if the <br> tags don't show up when editing the HTML, it's a question for customer service at Blogspot.
I am trying to make an HTML table like this:
Name            Price   Original Value
RED     ALL     50      10
        A       980     100
        B       75      45
YELLOW  ALL     500     100
        A       550     150
        B       80      40
I came up with this, but it's wrong and looks ugly :( http://jsbin.com/ayixi
Your example updated and working.
I don't know what you were doing in the example because you're missing data etc... The simplest thing to do is just show you how to go about it. Only one of your columns needs a colspan, and only one of your rows needs rowspans to span the rows... (the name column and the color grouping for the rows)
<style>
th {
text-align:left;
}
.endofrow td {
padding-bottom:1em;
}
</style>
<table width="50%" border=1>
<tr><th colspan=2>Name<th>Price<th>Original Value</tr>
<tr><td rowspan=3 valign=top>Red<td>ALL<td>50<td>10</tr>
<tr><td>A<td>980<td>100</tr>
<tr class="endofrow"><td>B<td>80<td>50</tr>
<tr><td rowspan=3 valign=top>Yellow<td>ALL<td>500<td>100</tr>
<tr><td>A<td>980<td>100</tr>
<tr class="endofrow"><td>B<td>80<td>50</tr>
</table>
(Note: I've left out the closing </td> and </th> tags, as the browser will fill them in and the table is easier to read without them.)
If you want a space between rows, don't use a <br> or a <br />, they are both a poor solution to the problem. You want to add a class to the final row of that group and put some padding in there. That's the most semantically correct thing to do, and you avoid mixing in line breaks where they don't belong.
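If the rows come from data rather than being typed by hand, the same rowspan and end-of-group pattern can be generated. This is a rough Python sketch under stated assumptions: the function name grouped_table and the data layout are invented for illustration, and the Name header spans two columns as in the requested layout.

```python
def grouped_table(groups):
    """groups maps a group name to a list of (label, price, original) rows."""
    lines = [
        "<table border=1>",
        "<tr><th colspan=2>Name</th><th>Price</th><th>Original Value</th></tr>",
    ]
    for name, rows in groups.items():
        for i, (label, price, original) in enumerate(rows):
            # Only the first row of a group carries the group cell,
            # which spans all rows of that group.
            lead = f"<td rowspan={len(rows)} valign=top>{name}</td>" if i == 0 else ""
            # Mark the last row so CSS can add padding after the group.
            cls = ' class="endofrow"' if i == len(rows) - 1 else ""
            lines.append(
                f"<tr{cls}>{lead}<td>{label}</td><td>{price}</td><td>{original}</td></tr>"
            )
    lines.append("</table>")
    return "\n".join(lines)

html = grouped_table({
    "Red": [("ALL", 50, 10), ("A", 980, 100), ("B", 75, 45)],
    "Yellow": [("ALL", 500, 100), ("A", 550, 150), ("B", 80, 40)],
})
```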
You need to look at the colspan and rowspan values. For example in your Table there is the following entry:
<td CLASS="trheadermain" colspan=2 rowspan=3 align="center" height=17>
<B>NAME</B></td>
The rowspan=3 is making the NAME label take up too much space.
There are some <br> elements where they don't belong:
</tr>
<br><br><br>
<tr height=20 bgColor=>
You may want to modernize your HTML: use <br /> in place of <br>, <strong> in place of <b>, colspan="2" in place of colspan=2, etc.
The rowspans on the Name, Price and Original Value cells are breaking your layout. It should work alright without these.
<td CLASS="trheadermain" colspan=2 rowspan=3 align="center" height=17 ><B>NAME</B></td>
<td rowspan=2 CLASS="trheadermain" ><B>Price</B></td>
<td rowspan=2 CLASS="trheadermain" ><B>Original Value</B></td>
->
<td CLASS="trheadermain" colspan=2 align="center" height=17 ><B>NAME</B></td>
<td CLASS="trheadermain" ><B>Price</B></td>
<td CLASS="trheadermain" ><B>Original Value</B></td>