Multiple Condition in grep regex - html

I want to use grep to search a file for all lines containing <div, except that I do not want lines between <!-- and -->.
I have this regex to find the lines containing <div: ^.*\<div.*$
I have this regex to exclude the lines between <!-- and -->: ^((?!\<!--.*?--\>).)*$ — but it doesn't work. It only matches when the comment is in a single line. How can I fix that?
How can I combine these in one grep line? Or do I have to type two greps?

grep does not support multiline searches like your search for <!-- ... -->. This can be worked around by using various helper commands, but in your case it's not worth it. It's better to just use a more powerful language, such as sed or AWK or Perl:
perl -ne '$on = 1 if m/<!--/; $on = "" if m/-->/; print if !$on and m/<div/' FILE
Edited to add: If you also want to discount instances of <!-- ... <div ... --> on a single line, you can write:
perl -ne ' my $line = $_;
  if ($in_comment && s/.*?-->//) {
    $in_comment = "";
  }
  while (!$in_comment && s/<!--(?:.*?(-->)|.*)//) {
    $in_comment = 1 if !defined $1;
  }
  print $line if !$in_comment && m/<div/
' FILE
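For comparison, the same comment-tracking state machine can be sketched in Python. This is only a sketch under the assumption that comments do not nest; the function name div_lines_outside_comments is made up for illustration and does not come from any library.

```python
def div_lines_outside_comments(lines):
    """Yield the lines that contain '<div' outside of <!-- ... --> comments."""
    in_comment = False  # carried over from one line to the next
    for line in lines:
        rest = line
        visible = []  # pieces of this line that are not inside a comment
        while rest:
            if in_comment:
                end = rest.find("-->")
                if end == -1:
                    rest = ""  # the comment continues on the next line
                else:
                    rest = rest[end + 3:]
                    in_comment = False
            else:
                start = rest.find("<!--")
                if start == -1:
                    visible.append(rest)
                    rest = ""
                else:
                    visible.append(rest[:start])
                    rest = rest[start + 4:]
                    in_comment = True
        if any("<div" in piece for piece in visible):
            yield line
```

To filter a file, pass the open file object, for example: for line in div_lines_outside_comments(open("FILE")): print(line, end="")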

Remove double quotes if delimiter value is not present in data

An input file is given in which every column is quoted and every line ends with a carriage return and a newline character. I need to do the following:
If a quoted field contains a newline, join it with its continuation so that the whole record ends up on one line (see the "ram" record in the example below).
Remove the double quotes around a column if it does not contain the delimiter (,).
Remove the carriage return characters (^M).
To exemplify, given the following input file
"name","address","age"^M
"ram","abcd,^M
def","10"^M
"abhi","xyz","25"^M
"ad","ram,John","35"^M
I would like to obtain the following output by means of a sed/perl/awk script/oneliner.
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35
What I have tried so far:
To append a broken line to the previous line:
sed '/^[^"]*"[^"]*$/{N;s/\n//}' sample.txt
To remove the Control-M characters:
perl -pne 's/\\r//g' sample.txt
But I did not get the final output that I need.
Use a library to parse CSV files. That is always advisable, and here you also have very specific reasons for it: embedded newlines and delimiters.
In Perl a good library is Text::CSV (which wraps Text::CSV_XS if installed). A basic example:
use warnings;
use strict;
use feature 'say';
use Text::CSV;
my $file = shift or die "Usage: $0 file.csv\n";
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $row = $csv->getline($fh)) {
    s/\n+//g for @$row;
    $csv->say(\*STDOUT, $row);
}
Comments
The binary option in the constructor is what handles newlines embedded in data
Once a line is read into the array reference $row I remove newlines in each field with a simplistic regex. By all means please improve this as/if needed
The pruning of $row works as follows. In a foreach loop each element is really aliased by the loop variable, so if that gets changed the array changes. I used default where elements are aliased by $_, which the regex changes so $row changes.
I like this compact shortcut because it has such a distinct look that I can tell from across the room that an array is being changed in place; so I consider it a sort-of-an-idiom. But if it is in fact confusing please by all means write out a full and proper loop
The processed output is printed to STDOUT. Or, open an output file and pass that filehandle to say (or to print in older module versions) so the output goes directly to that file
The above prints, for the sample input provided in the question
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35
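The same library-based approach can be sketched with Python's standard csv module. This is a minimal sketch, not a translation of the Perl program: the function name clean_csv is made up for illustration, and it assumes the whole input fits in memory as one string.

```python
import csv
import io


def clean_csv(text):
    """Re-emit CSV text with CR/LF removed from fields and only the necessary quotes."""
    # newline="" keeps \r\n untouched so the csv module can handle quoted newlines itself
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO()
    # QUOTE_MINIMAL quotes a field only if it contains the delimiter or a quote
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for row in reader:
        writer.writerow([field.replace("\r", "").replace("\n", "") for field in row])
    return out.getvalue()
```

When reading from a real file, open it with newline="" as well (open("sample.txt", newline="")), which is what the csv documentation recommends.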
This might work for you (GNU sed):
sed ':a;/[^"]$/{N;s/\n//;ba};s/"\([^",]*\)"/\1/g' file
The solution is in two parts:
Join broken lines to make whole ones.
Remove double quotes surrounding fields that do not contain commas.
If the current line does not end with double quotes, append the next line, remove the newline and repeat. Otherwise: remove double quotes surrounding fields that do not contain double quotes or commas.
N.B. This supposes that fields do not contain quoted double quotes. If they do, the condition for the first step would need to be amended and double quotes within fields would need to be catered for.
FPAT is the way to go with GNU awk; it handles comma-separated files. The steps:
Remove ^M
Join broken lines
Remove quotes
dos2unix sample.txt
awk '{printf "%s"(/,$/?"":"\n"),$0}' sample.txt > tmp && mv tmp sample.txt
"name","address","age"
"ram","abcd,def","10"
"abhi","xyz","25"
"ad","ram,John","35"
awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") $i=substr($i,2,length($i)-2)}1' sample.txt
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35
All in one go:
dos2unix sample.txt && awk '{printf "%s"(/,$/?"":"\n"),$0}' sample.txt | awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") $i=substr($i,2,length($i)-2)}1'
Normally you set the field separator (FS, or the -F option) to tell awk how fields are separated. FPAT="([^,]+)|(\"[^\"]+\")" instead tells gawk what a field looks like, using a regex. This regex is complicated and is often used with CSV.
(i=1;i<=NF;i++) loops through the fields on the line one by one.
if($i!~",") if the field does not contain a comma, then
$i=substr($i,2,length($i)-2) remove its first and last character, the "
If a field for some reason does not contain ", this is more robust:
awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") {n=split($i,a,"\"");$i=(n>1?a[2]:$i)}}1' file
It does nothing to a field that does not contain a double quote.
With perl, please try the following:
perl -e '
while (<>) {
    s/\r$//;            # remove trailing CR code
    $str .= $_;
}
while ($str =~ /("(("")|[^"])*"\n?)|((^|(?<=,))[^,]*((?=,)|\n))/g) {
    $_ = $&;
    if (/,/) {          # the element contains ","
        s/\n//g;        # then remove newline(s) if any
    } else {            # otherwise remove surrounding double quotes
        s/^"//s; s/"$//s;
    }
    push(@ary, $_);
    if (/\n$/) {        # newline terminates the element
        print join(",", @ary);
        @ary = ();
    }
}' sample.txt
Output:
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35

Use regular expression to extract img tag from HTML in Perl

I need to extract the captcha image from a URL and recognise it with Tesseract.
My code is:
#!/usr/bin/perl -X
###
$user = 'user'; #Enter your username here
$pass = 'pass'; #Enter your password here
###
#Server settings
$home = "http://perltest.adavice.com";
$url = "$home/c/test.cgi?u=$user&p=$pass";
###Add code here!
#Grab img from HTML code
#if ($html =~ /<img. *?src. *?>/)
#{
# $img1 = $1;
#}
#else
#{
# $img1 = "";
#}
$img2 = grep(/<img. *src=.*>/,$html);
if ($html =~ /\img[^>]* src=\"([^\"]*)\"[^>]*/)
{
my $takeImg = $1;
my @dirs = split('/', $takeImg);
my $img = $dirs[2];
}
else
{
print "Image not found\n";
}
###
die "<img> not found\n" if (!$img);
#Download image to server (save as: ocr_me.img)
print "GET '$img' > ocr_me.img\n";
system "GET '$img' > ocr_me.img";
###Add code here!
#Run OCR (using shell command tesseract) on img and save text as ocr_result.txt
system("tesseract ocr_me.img ocr_result");
print "GET '$txt' > ocr_result.txt\n";
system "GET '$txt' > ocr_result.txt";
###
die "ocr_result.txt not found\n" if (!-e "ocr_result.txt");
# check OCR results:
$txt = 'cat ocr_result.txt';
$txt =~ s/[^A-Za-z0-9\-_\.]+//sg;
$img =~ s/^.*\///;
print `echo -n "file=$img&text=$txt" | POST "$url"`;
As you see, I'm trying to extract the img src attribute. The solution from "use shell command tesseract in perl script to print a text output" did not work for me ($img1). I also used an adapted version of the solution from "How can I extract URL and link text from HTML in Perl?" ($img2).
If you need the HTML code from that page, here it is:
<html>
<head>
<title>Perl test</title>
</head>
<body style="font: 18px Arial;">
<nobr>somenumbersimg src="/JJ822RCXHFC23OXONNHR.png"
somenumbers<img src="/captcha/1533030599.png"/>
somenumbersimg src="/JJ822RCXHFC23OXONNHR.png" </nobr><br/><br/><form method="post" action="?u=user&p=pass">User: <input name="u"/><br/>PW: <input name="p"/><br/><input type="hidden" name="file" value="1533030599.png"/>Text: <input name="text"></br><input type="submit"></form><br/>
</body>
</html>
I got the error that the image was not found. I think the problem is a wrong regular expression. I cannot install any modules such as HTTP::Parser or similar.
Aside from the fact that using regular expressions on HTML isn't very reliable, your regular expression in the following code isn't going to work because it's missing a capture group, so $1 won't be assigned a value.
if ($html =~ /<img. *?src. *?>/)
{
$img = $1;
}
If you want to extract part of the text using a regular expression, you need to put that part inside parentheses (a capture group). For example:
$example = "hello world";
$example =~ /(hello) world/;
This sets $1 to "hello".
The regular expression itself doesn't make much sense: where you have ". *?", it matches any single character followed by zero or more spaces. Is that a typo for ".*?", which matches any number of characters but, unlike ".*", isn't greedy, so it stops as soon as the next part of the regex can match?
This regular expression is possibly closer to what you're looking for. It'll match the first img tag that has a src attribute that starts with "/captcha/" and store the image URL in $1
$html =~ m%<img[^>]*src="(/captcha/[^"]*)"%s;
To break down how it works: "m%....%" is just a different way of writing "/.../" that lets you put slashes in the regex without needing to escape them. "[^>]*" matches zero or more of any character except ">", so it won't run past the end of the tag. "(/captcha/[^"]*)" uses a capture group to grab everything inside the double quotes, which is the URL. The "/s" modifier on the end makes "." match newlines as well. This regex contains no ".", so the modifier has no effect here and probably isn't needed; the negated character classes "[^>]*" and "[^"]*" already match newlines, so an img tag split over multiple lines still matches.
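The same pattern can be tried out in Python, which makes it easy to check against the sample page. This is a minimal sketch; the names IMG_SRC and find_captcha_src are made up for illustration, and the /captcha/ prefix is taken from the HTML shown in the question.

```python
import re

# Same idea as the Perl regex: an img tag whose src attribute starts with /captcha/.
IMG_SRC = re.compile(r'<img[^>]*\bsrc="(/captcha/[^"]*)"')


def find_captcha_src(html):
    """Return the src of the first captcha img tag, or None if there is none."""
    match = IMG_SRC.search(html)
    return match.group(1) if match else None
```

Because the pattern requires the literal "<img", the decoy lines in the page that read "img src=..." without an opening angle bracket are skipped.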

Remove HTML comments from Markdown file

This would come in handy when converting from Markdown to HTML, for example, if one needs to prevent comments from appearing in the final HTML source.
Example input my.md:
# Contract Cancellation
Dear Contractor X, due to delays in our imports, we would like to ...
<!--
... due to a general shortage in the Y market
TODO make sure to verify this before we include it here
-->
best,
me <!-- ... or should i be more formal here? -->
Example output my-filtered.md:
# Contract Cancellation
Dear Contractor X, due to delays in our imports, we would like to ...
best,
me
On Linux, I would do something like this:
cat my.md | remove_html_comments > my-filtered.md
I am also able to write an AWK script that handles some common cases,
but as I understood, neither AWK nor any other of the common tools for simple text manipulation (like sed) are really up to this job. One would need to use an HTML parser.
How to write a proper remove_html_comments script, and with what tools?
I see from your comment that you mostly use Pandoc.
Pandoc version 2.0, released October 29, 2017, adds a new option --strip-comments. The related issue provides some context to this change.
Upgrading to the latest version and adding --strip-comments to your command should remove HTML comments as part of the conversion process.
It might be a bit counter-intuitive, but I would use an HTML parser.
Example with Python and BeautifulSoup:
import sys
from bs4 import BeautifulSoup, Comment
md_input = sys.stdin.read()
soup = BeautifulSoup(md_input, "html5lib")
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
# bs4 wraps the text in <html><head></head><body>…</body></html>,
# so we need to extract it:
output = "".join(map(str, soup.find("body").contents))
print(output)
Output:
$ cat my.md | python md.py
# Contract Cancellation
Dear Contractor X, due to delays in our imports, we would like to ...
best,
me
It shouldn't break any other HTML you might have in your .md files (it might change the code formatting a bit, but not its meaning).
Of course, test it thoroughly if you decide to use it.
Edit – Try it out online here: https://repl.it/NQgG (input is read from input.md, not stdin)
This awk should work
$ awk -v FS="" '{ for(i=1; i<=NF; i++){if($i$(i+1)$(i+2)$(i+3)=="<!--"){i+=3; p=1} else if(!p){printf "%s", $i} else if($i$(i+1)$(i+2)=="-->") {i+=2; p=0} } printf RS}' file
Dear Contractor X, due to delays in our imports, we would like to ...
best,
me
For better readability and explanation :
awk -v FS="" '    # empty FS: every character becomes its own field
{
    for(i=1; i<=NF; i++)                      # iterate through each character
    {
        if($i$(i+1)$(i+2)$(i+3)=="<!--")      # four characters form a comment start tag
        {                                     # then raise flag p and skip the rest of the tag
            i+=3; p=1
        }
        else if(!p)                           # outside a comment: print the character
            printf "%s", $i
        else if($i$(i+1)$(i+2)=="-->")        # three characters form the comment close tag
        {                                     # then reset the flag and skip the rest of the tag
            i+=2; p=0
        }
    }
    printf RS
}' file
If you open it with vim you could do:
:%s/<!--\_.\{-}-->//g
With \_. you allow the regular expression to match any character, including the newline character; the \{-} makes the match lazy, otherwise you would lose all content from the first to the last comment.
I have tried to use the same expression in sed, but it won't work.
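The lazy, newline-spanning match described above can also be written as a small Python filter. This is a sketch, not a parser: the name strip_html_comments is made up for illustration, and the regex would also strip comment-like text inside Markdown code blocks.

```python
import re

# DOTALL lets "." match newlines; ".*?" is lazy, so each comment ends at the first "-->".
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(text):
    """Remove every <!-- ... --> span, including ones that cover several lines."""
    return COMMENT.sub("", text)
```

Used as a stdin-to-stdout filter it would be called as sys.stdout.write(strip_html_comments(sys.stdin.read())) after importing sys. Lines that held only a comment are left behind as empty lines.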
My AWK solution, probably easier to understand than the one by @batMan, at least for high-level devs. The functionality should be about the same.
file remove_html_comments:
#!/usr/bin/awk -f
# Removes common, simple cases of HTML comments.
#
# Example:
# > cat my.html | remove_html_comments > my-filtered.html
#
# This is useful for example to pre-parse Markdown before generating
# an HTML or PDF document, to make sure the commented out content
# does not end up in the final document, not even as a comment
# in source code.
#
# Example:
# > cat my.markdown | remove_html_comments | pandoc -o my-filtered.html
#
# Source: hoijui
# License: CC0 1.0 - https://creativecommons.org/publicdomain/zero/1.0/
BEGIN {
    com_lvl = 0;
}
/<!--/ {
    if (com_lvl == 0) {
        line = $0
        sub(/<!--.*/, "", line)
        printf "%s", line
    }
    com_lvl = com_lvl + 1
}
com_lvl == 0
/-->/ {
    if (com_lvl == 1) {
        line = $0
        sub(/.*-->/, "", line)
        print line
    }
    com_lvl = com_lvl - 1;
}

getting specific filename from bash

So I have a perl module that uses a bash command to obtain the file(s) with certain "table" names. In my specific case, it is looking for tables with the name "event", but I need this to work with all names too.
Currently, I have the following code in my perl script to obtain MYI files with the name table, and I am receiving not only event_* but also event_extra_data_* as well. For my example, I only need the 2nd table that exists in my database for event_. As my test info, I have, currently,
event_1459161160_0
event_1459182760_0
event_extra_data_1459182745_0
event_extra_data_1459182760_0
which are partitioned tables from tables "event" and event_extra_data which is the value that the $table variable sees below.
Anyways, my question is, how do i limit this to only receiving event_1459182760_0.MYI and not event_extra_data_1459182760_0.MYI which it is currently getting?
elsif ($sql =~ /\{LAST\}/i )
{
$cmd = 'ls -1 /var/lib/mysql/sfsnort/'.$table.'_*MYI | grep -v template | tail -n1 | cut -d"/" -f6 | cut -d"." -f1';
$value = `$cmd`;
print "Search Value: $value\n";
if ($value eq "")
{
$sql = ""; # same as with FIRST
}
else
{
$sql =~ s/\{LAST\}/$value/g;
}
}
Don't parse ls - there's no point, and it's prone to causing problems.
I would point out that the glob function within perl supports a limited number of "regex-like" patterns. (Note - they aren't regexes, so don't get them mixed up.)
foreach my $filename ( glob "event_[0-9]*" ) {
#do something with $filename
}
If you're just after the last one in sorted order:
my ( $last ) = reverse sort glob "event_[0-9]*";
Given you have a single path, then you should be able to:
my ( $last ) = reverse sort glob "/var/lib/mysql/sfsnort/event_[0-9]*.MYI";
Note - that this works, assuming you're working with time() numeric values - it's doing an alphanumeric sort (and on directory names too).
If that isn't a valid assumption, you'll need a custom sort - which is quite easy, you can feed sort a subroutine to sort by.
Either:
sort { my ($a1) = $a =~ /(\d+)/; my ($b1) = $b =~ /(\d+)/; $b1 <=> $a1 }
To extract the first 'string of digits' from the path. (note - also includes directories).
Or use the -M file test:
sort { -M $a <=> -M $b }
Which will read modification time from the file (technically -M is age in days).
You can remove the reverse if you custom sort, just by swapping $a and $b.
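The pattern-plus-numeric-sort idea can be sketched in Python as well. This is only an illustration: the function names latest_partition and latest_partition_in are made up, fnmatch is used because it applies the same wildcard rules as a shell glob, and the table name is assumed not to contain wildcard characters itself.

```python
import fnmatch
import os
import re


def latest_partition(names, table):
    """Return the newest '<table>_<digits>...' file name without its extension, or None."""
    # "_[0-9]*" demands a digit right after the underscore, which excludes event_extra_data_*
    matches = fnmatch.filter(names, f"{table}_[0-9]*.MYI")
    if not matches:
        return None

    def numeric_key(name):
        # compare the digit runs as numbers rather than as strings
        return [int(number) for number in re.findall(r"\d+", name)]

    return os.path.splitext(max(matches, key=numeric_key))[0]


def latest_partition_in(directory, table):
    """Same as latest_partition, but reads the candidate names from a directory."""
    return latest_partition(os.listdir(directory), table)
```

With the four file names from the question, asking for "event" picks event_1459182760_0 and ignores both event_extra_data files.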
Though I think this would be better done all in perl, to answer your specific question about how to get event_* but not event_extra*: you could of course add that to your grep to filter it out, or you could use a different glob, like ${table}_[0-9]*, if there's always an _ followed by a digit after the table name.
In perl you could do it something like the following though:
opendir( DIR, '/var/lib/mysql/sfsnort/' );
my @files = sort grep { /^${table}_\d/ } readdir( DIR );
closedir( DIR );
$files[$#files] =~ /(^[^.]+)/;
my $value = $1;

Editing multiple HTML files using SED (or something similar)

I have about 1000 HTML files to edit which represent footnotes in a large technical document. I have been asked to go through the HTML files one by one and manually edit the HTML, to get it all on the straight and narrow.
I know that this could probably be done in a matter of seconds with SED as the changes to each file are similar. The body text in each file can be different but I want to change the tags to match the following:
<body>
<p class="Notes">See <i>R v Swain</i> (1992) 8 CRNZ 657 (HC).</p>
</body>
The text may change; for example, it could say 'See R v Pinky and the Brain (1992)' or something like that, but basically the body text should be that.
Currently, however, the body text may be:
<body>
<p class="Notes"><span class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB"><span><span
class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB" style="font-size: 10.0pt;">See <i>R v Pinky and the Brain</i> (1992) </span></span></span></span></span></p>
</body>
or even:
<body>
<p class="FootnoteText"><span class="FootnoteReference"><span lang="EN-US"
xml:lang="EN-US" style="font-size: 10.0pt;"><span><![endif]></span></span></span>See <i>R v Pinky and the Brain</i> (1992)</p>
</body>
Can anybody suggest a SED expression or something similar that would solve this?
Like this?:
perl -pe 's/Swain/Pinky and the Brain/g;' -i lots.html of.html files.html
The breakdown:
-e = "Use code on the command line"
-p = "Execute the code on every line of every file, and print out the line, including what changed"
-i = "Actually replace the files with the new content"
If you swap out -i with -i.old then lots.html.old and of.html.old (etc) will contain the files before the changes, in case you need to go back.
This will replace just Swain with Pinky and the Brain in all the files. Further changes would require more runs of the command. Or:
s/Swain/Pinky/g; s/Twain/Brain/g;
To swap Swain with Pinky and Twain with Brain everywhere.
Update:
If you can be sure about the incoming formatting of the data, then something like this may suffice:
# cat ff.html
<body>
<p class="Notes"><span class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB"><span><span
class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB" style="font-size: 10.0pt;">See <i>R v Twain</i> (1992) </span></span></span></span></span></p>
<p class="Notes"><span class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB"><span><span
class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB" style="font-size: 10.0pt;">See <i>R v Swain</i> (1992) </span></span></span></span></span></p>
</body>
# perl -pe 'BEGIN{undef $/;} s/<[pP][ >].*?See <i>(.*?)<\/i>(.*?)<.*?\/[pP]>/<p class="Notes">See <i>$1<\/i>$2<\/p>/gsm;' ff.html
<body>
<p class="Notes">See <i>R v Twain</i> (1992) </p>
<p class="Notes">See <i>R v Swain</i> (1992) </p>
</body>
Explanations:
BEGIN{undef $/;} = treat the whole document as one string, or else html that has newlines in it won't get handled properly
<[pP][ >] = the beginning of a p-tag (case insensitive)
.*? = lots of stuff, non-greedy-matched i.e. http://en.wikipedia.org/wiki/Regular_expression#Lazy_quantification
See <i> = literally look for that string - very important, since that seems to be the only common denominator
(.*?) = put more stuff into a parentheses group (to be used later)
<\/i> = the end i-tag
(.*?) = put more stuff into a parentheses group (to be used later)
<.*?\/[pP]> = the end p-tag and other possible tags mashed up before it (like all your spans)
and replace it with the string you want, where $1 and $2 are what got snagged in the parentheses before, i.e. the two (.*?) 's
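g = global search - so possibly more than one match per string
s = let . match newlines too, so a match can span several lines (the whole file is one string now, due to the BEGIN at the top)
m = let ^ and $ match at line boundaries; it has no effect here, since the pattern uses neither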
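For readers who prefer to experiment outside the shell, the same substitution can be sketched in Python. The names FOOTNOTE and normalise_paragraphs are made up for illustration; the pattern mirrors the Perl one and carries the same assumption that every footnote contains the literal text "See <i>".

```python
import re

# DOTALL plays the role of Perl's /s, so ".*?" can run across line breaks.
FOOTNOTE = re.compile(r"<[pP][ >].*?See <i>(.*?)</i>(.*?)<.*?/[pP]>", re.DOTALL)


def normalise_paragraphs(html):
    """Rewrite each footnote paragraph as <p class="Notes">See <i>...</i>...</p>."""
    return FOOTNOTE.sub(r'<p class="Notes">See <i>\1</i>\2</p>', html)
```

Because Python strings hold the whole file at once, there is no equivalent of the BEGIN{undef $/;} step; reading the file with open(path).read() already yields one string.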
First convert your HTML files to proper XHTML using http://tidy.sourceforge.net and then use xmlstarlet to do the necessary XHTML processing.
Note: Get the current version of xmlstarlet for in-place XML file editing.
Here's a simple, yet complete mini-example:
curl -s http://checkip.dyndns.org > dyndns.html
tidy -wrap 0 -numeric -asxml -utf8 2>/dev/null < dyndns.html > dyndns.xml
# test: print body text to stdout (dyndns.xml)
xml sel -T \
-N XMLNS="http://www.w3.org/1999/xhtml" \
-t -m "//XMLNS:body" -v '.' -n \
dyndns.xml
# edit body text in-place (dyndns.xml)
xml ed -L \
-N XMLNS="http://www.w3.org/1999/xhtml" \
-u "//XMLNS:body" -v '<p>NEW BODY TEXT</p>' \
dyndns.xml
# create new HTML file (by overwriting the original one!)
xml unesc < dyndns.xml > dyndns.html
To consolidate the span tags you may use tidy (version released on 25 March 2009) as well!
# get current tidy version: http://tidy.cvs.sourceforge.net/viewvc/tidy/tidy/
# see also: http://tidy.sourceforge.net/docs/quickref.html#merge-spans
tidy -q -c --merge-spans yes file.html
You will have to check your input files to verify some assumptions can be made. Based on your two examples, I have made the following assumptions. You will need to check them and take some sample input files to verify you have found all assumptions.
The file consists of a single footnote contained in a single <body></body> pair. The body tags are always present and well formed.
The footnote is buried somewhere inside a <p></p> pair and one or many <span></span> tags. <!...> tags can be discarded.
The following Perl script works for both examples you have supplied (on Linux with Perl 5.10.0). Before using it, make sure you have a backup of your original html files. By default, it will only print the result on stdout without changing any file.
#!/usr/bin/perl
$overwrite = 0;
# undefine the input record separator so that <FH> slurps the whole file
# into a scalar (an empty string would select paragraph mode instead)
undef $/;
foreach $filename (@ARGV)
{
# slurp entire file into $full_text
open FH, "<$filename";
$full_text = <FH>;
close FH;
if ($overwrite)
{
! -f "$filename.bak" && rename $filename, "$filename.bak";
}
# match everything that is found before the body tag, everything
# between and including the body tags, and what follows
# s modifier causes full_text to be considered a single long string
# instead of individual lines
($before_body, $body, $after_body) = ($full_text =~ m!(.*)<body>(.*)</body>(.*)!s);
#print $before_body, $body, $after_body;
# Discard unwanted tags from the body
$body =~ s%<span.*?>%%sg;
$body =~ s%</span.*?>%%sg;
$body =~ s%<p.*?>%%sg;
$body =~ s%</p.*?>%%sg;
$body =~ s%<!.*?>%%sg;
# Remaining leading and trailing whitespace likely to be newlines: remove
$body =~ s%^\s*%%sg;
$body =~ s%\s*$%%sg;
if ($overwrite)
{
open FH, ">$filename";
print FH $before_body, "<body>\n<p class=\"Notes\">$body</p>\n</body>", $after_body;
close FH;
}
else
{
print $before_body, "<body>\n<p class=\"Notes\">$body</p>\n</body>", $after_body;
}
}
To use it:
./script.pl file1.html
./script.pl file1.html file2.html
./script.pl *.html
Tweak it and when you're happy set $overwrite=1. The script creates a .bak only if it does not already exist.
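The tag-stripping steps of the script can be sketched as a pure function in Python, which makes it easy to test on sample strings before touching any file. The names BODY, UNWANTED_TAGS and normalise_footnote are made up for illustration, and the sketch keeps the script's assumptions: one <body></body> pair per file, and every tag whose name starts with "span" or "p" may be discarded.

```python
import re

BODY = re.compile(r"(.*)<body>(.*)</body>(.*)", re.DOTALL)
# opening and closing span and p tags, plus <!...> constructs such as <![endif]>
UNWANTED_TAGS = re.compile(r"</?span.*?>|</?p.*?>|<!.*?>", re.DOTALL)


def normalise_footnote(full_text):
    """Strip span/p/<!...> tags from the body and wrap what is left in one Notes paragraph."""
    match = BODY.match(full_text)
    if match is None:
        return full_text  # no body element found: leave the text untouched
    before, body, after = match.groups()
    body = UNWANTED_TAGS.sub("", body).strip()
    return f'{before}<body>\n<p class="Notes">{body}</p>\n</body>{after}'
```

As with the Perl script, it is safer to write the result to a new file first and compare it with the original before overwriting anything.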
If you have 1 entry per file, no rigid structure in these files, and possibly multiple lines, I would go for a php or perl script to process them file by file, while emitting suitable warnings when the patterns don't match.
Use
php -f thescript.php
to execute thescript.php, which contains
<?php
$path = "datapath/";
$dir = opendir($path);
while ( ( $fn = readdir($dir) ) !== false )
{
if ( preg_match("/html$/",$fn) ) process($path.$fn);
}
function process($file)
{
$in = file_get_contents($file);
$in2 = str_replace("\n"," ",strip_tags($in,"<i>"));
if ( preg_match("#^(.*)<i>(.*)</i>(.*)$#i",$in2,$match) )
{
list($dummy,$p0,$p1,$p2) = $match;
$out = "<body>$p0<i>$p1</i>$p2</body>";
file_put_contents($file.".out",$out);
} else {
print "Problem with $file? (stripped down to: $in2)\n";
file_put_contents($file.".problematic",$in);
}
}
?>
You could tweak this to your needs until the number of misses is low enough to do the last few by hand. You probably need to add some $p0 = trim($p0); etc. to sanitize everything.