Remove HTML comments from Markdown file

This would come in handy when converting from Markdown to HTML, for example, if one needs to prevent comments from appearing in the final HTML source.
Example input my.md:
# Contract Cancellation
Dear Contractor X, due to delays in our imports, we would like to ...
<!--
... due to a general shortage in the Y market
TODO make sure to verify this before we include it here
-->
best,
me <!-- ... or should i be more formal here? -->
Example output my-filtered.md:
# Contract Cancellation
Dear Contractor X, due to delays in our imports, we would like to ...
best,
me
On Linux, I would do something like this:
cat my.md | remove_html_comments > my-filtered.md
I am also able to write an AWK script that handles some common cases,
but as I understand it, neither AWK nor any of the other common tools for simple text manipulation (like sed) is really up to this job. One would need to use an HTML parser.
How to write a proper remove_html_comments script, and with what tools?

I see from your comment that you mostly use Pandoc.
Pandoc version 2.0, released October 29, 2017, adds a new option --strip-comments. The related issue provides some context to this change.
Upgrading to the latest version and adding --strip-comments to your command should remove HTML comments as part of the conversion process.

It might be a bit counter-intuitive, but I would use an HTML parser.
Example with Python and BeautifulSoup:
import sys
from bs4 import BeautifulSoup, Comment
md_input = sys.stdin.read()
soup = BeautifulSoup(md_input, "html5lib")
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
# bs4 wraps the text in <html><head></head><body>…</body></html>,
# so we need to extract it:
output = "".join(map(str, soup.find("body").contents))
print(output)
Output:
$ cat my.md | python md.py
# Contract Cancellation
Dear Contractor X, due to delays in our imports, we would like to ...
best,
me
It shouldn't break any other HTML you might have in your .md files (it might change the code formatting a bit, but not its meaning).
Of course, test it thoroughly if you decide to use it.
Edit – Try it out online here: https://repl.it/NQgG (input is read from input.md, not stdin)

This awk should work
$ awk -v FS="" '{ for(i=1; i<=NF; i++){if(!p && $i$(i+1)$(i+2)$(i+3)=="<!--"){i+=3; p=1} else if(p && $i$(i+1)$(i+2)=="-->"){i+=2; p=0} else if(!p){printf "%s", $i} } printf RS}' file
Dear Contractor X, due to delays in our imports, we would like to ...
best,
me
For better readability and explanation:
awk -v FS="" # Empty field separator: each character becomes its own field (works in gawk; POSIX leaves an empty FS unspecified), so the formatting is preserved
'{
    for(i=1; i<=NF; i++) # Iterate through each character
    {
        if(!p && $i$(i+1)$(i+2)$(i+3)=="<!--") # If these 4 chars form a comment start tag
        { # then raise flag p and skip the remaining 3 chars of the tag
            i+=3; p=1
        }
        else if(p && $i$(i+1)$(i+2)=="-->") # If these 3 chars form a comment close tag
        { # then reset the flag and skip the remaining 2 chars of the tag
            i+=2; p=0
        }
        else if(!p) # Outside a comment: print the character
            printf "%s", $i
    }
    printf RS
}' file
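For comparison, the same scanning idea can be sketched in Python. This is only an illustration, not part of the awk answer: the function name remove_comments is hypothetical, an unterminated comment is assumed to run to the end of the input, and, like the awk version, it knows nothing about Markdown code blocks, so comment markers inside code samples would be removed as well.

```python
def remove_comments(text: str) -> str:
    """Drop every <!-- ... --> span by scanning for the delimiters."""
    out = []
    pos = 0
    while True:
        start = text.find("<!--", pos)
        if start == -1:
            # No further comment: keep the rest of the text.
            out.append(text[pos:])
            break
        # Keep the text before the comment.
        out.append(text[pos:start])
        end = text.find("-->", start + 4)
        if end == -1:
            # Assumption: an unterminated comment swallows the rest.
            break
        pos = end + 3
    return "".join(out)


sample = "best,\nme <!-- ... or should i be more formal here? -->\n"
print(remove_comments(sample), end="")
```

Unlike the awk program, this works on the whole text at once, so a comment that spans several lines leaves behind only the newline that followed its closing tag.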

If you open it with vim you could do:
:%s/<!--\_.\{-}-->//g
With \_. you allow the regular expression to match any character, including the newline character; the \{-} makes the match lazy, otherwise you would lose all content between the first and the last comment.
I have tried to use the same expression in sed, but it won't work.
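Outside of vim, the same lazy, newline-spanning match can be written with Python's re module. This is a minimal sketch with a hypothetical function name, and it shares the limitation of any regex approach: it does not know whether a comment marker sits inside a Markdown code block.

```python
import re

# ".*?" is the lazy counterpart of vim's \{-}; re.DOTALL lets "." match
# newlines, which is what \_. does in vim.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(text: str) -> str:
    return COMMENT_RE.sub("", text)


sample = "me <!-- more formal? -->\n<!--\nTODO verify\n-->\nbest\n"
print(strip_html_comments(sample), end="")
```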

My AWK solution, probably easier to understand than the one from @batMan, at least for high-level devs. The functionality should be about the same.
file remove_html_comments:
#!/usr/bin/awk -f
# Removes common, simple cases of HTML comments.
#
# Example:
# > cat my.html | remove_html_comments > my-filtered.html
#
# This is useful for example to pre-parse Markdown before generating
# an HTML or PDF document, to make sure the commented out content
# does not end up in the final document, not even as a comment
# in source code.
#
# Example:
# > cat my.markdown | remove_html_comments | pandoc -o my-filtered.html
#
# Source: hoijui
# License: CC0 1.0 - https://creativecommons.org/publicdomain/zero/1.0/
BEGIN {
    com_lvl = 0;
}
/<!--/ {
    if (com_lvl == 0) {
        line = $0
        sub(/<!--.*/, "", line)
        printf "%s", line  # explicit format, so a "%" in the text is printed literally
    }
    com_lvl = com_lvl + 1
}
com_lvl == 0
/-->/ {
    if (com_lvl == 1) {
        line = $0
        sub(/.*-->/, "", line)
        print line
    }
    com_lvl = com_lvl - 1;
}

Related

transform multiline text into csv with awk sed and grep

I run a shell command that returns a list of repeated values like this (note the indentation):
Name: vm346
  cpu 1 (12%) 6150m (76%)
  memory 1130Mi (7%) 1130Mi (7%)
Name: vm847
  cpu 6 (75%) 30150m (376%)
  memory 12980Mi (87%) 12980Mi (87%)
Name: vm848
  cpu 3500m (43%) 17150m (214%)
  memory 6216Mi (41%) 6216Mi (41%)
I am trying to transform that data like this (in csv):
vm346,1,(12%),6150m,(76%),1130Mi,(7%),1130Mi,(7%)
vm847,6,(75%),30150m,(376%),12980Mi,(87%),12980Mi,(87%)
vm848,3500m,(43%),17150m,(214%),6216Mi,(41%),6216Mi,(41%)
The problem is that any given dataset like the one above is always on more than one line.
When I pipe that into awk it drives me mad, because even if I use:
BEGIN{ FS="\n" }
to try and stitch the data together in one line, it doesn't work. No matter what I do, awk keeps the name value as a separated line above everything else.
I am sorry I haven't much code to share but I have been spinning my wheels with this for a few hours now and I am running out of ideas...
I can solve this in Perl:
perl -ane 'print join ",", @F[1 .. $#F]; print $F[0] eq "memory" ? "\n" : ","'
It should be easy to translate it to awk if you need it.
How does it work?
-a splits each line on whitespace into the @F array
-n reads the input line by line and runs the code specified after -e for each line
We print all the elements but the first one separated by commas (see join)
We then look at the first column, if it's memory, we are at the last line of the block, so we print a newline, otherwise we print a comma
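The steps above translate almost line for line into Python. The following is a sketch under the same assumption as the Perl one-liner, namely that a line whose first field is "memory" always closes a block; the function name blocks_to_csv is made up for this example.

```python
def blocks_to_csv(lines):
    """Join all fields except the first; a "memory" line ends the row."""
    rows = []
    current = []
    for line in lines:
        fields = line.split()  # like perl -a: split on whitespace
        if not fields:
            continue
        current.extend(fields[1:])  # everything but the first field
        if fields[0] == "memory":  # last line of the block
            rows.append(",".join(current))
            current = []
    return rows


sample = [
    "Name: vm346",
    "  cpu 1 (12%) 6150m (76%)",
    "  memory 1130Mi (7%) 1130Mi (7%)",
]
for row in blocks_to_csv(sample):
    print(row)
```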
With AWK, one option is to set RS to "Name: ", and ignore the first record with NR > 1, e.g.
awk -v RS="Name: " 'BEGIN{OFS=","} NR > 1 {print $1, $3, $4, $5, $6, $8, $9, $10, $11}' file
#> vm346,1,(12%),6150m,(76%),1130Mi,(7%),1130Mi,(7%)
#> vm847,6,(75%),30150m,(376%),12980Mi,(87%),12980Mi,(87%)
#> vm848,3500m,(43%),17150m,(214%),6216Mi,(41%),6216Mi,(41%)
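Another option is to get rid of the first column and then join every three rows:
awk '{$1=""}1' | paste -sd'  \n' - | awk '{$1=$1}1' OFS=,
Here paste cycles through the delimiter list (space, space, newline), so every third line ends a row. Same idea with sed:
sed 's/^ *[^ ]* *//' | paste -sd'  \n' - | sed 's/  */,/g'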
Something else:
awk '
$1=="Name:" {
sep=ors
ors=ORS
} {
for (i=2;i<=NF;++i) {
printf "%s%s",sep,$i
sep=OFS
}
} END {printf "%s",ors}'
Or if you want to print an ORS based on the first field being "memory" (note that this program may end without printing a terminating ORS):
awk '{for (i=2;i<=NF;++i) printf "%s%s",$i,(i==NF && $1=="memory" ? ORS : OFS)}'
something else else:
awk -v OFS=, '
index($0,$1)==1 {
OFS=ors
ors=ORS
} {
$1=""
printf "%s",$0
OFS=ofs
} END {printf "%s",ors} BEGIN {ofs=OFS}'
This might work for you (GNU sed):
sed -nE '/^ +\S+ +/{s///;H;$!d};x;/./s/\s+/,/gp;x;s/^\S+ +//;h' file
In overview the sed program processes indented lines, already gathered lines (except in the case that the current line is the first line of the file) and non-indented lines.
Turn off implicit printing and enable extended regexp's. (-nE).
If the current line is indented, remove the indent, the first field and any following spaces, append the result to the hold space and if it is not the last line, delete it.
Otherwise, check the hold space for gathered lines and if found, replace one or more whitespaces by commas and print the result. Then prep the current line by removing the first field and any following spaces and replace the hold space with the result.
The solution seems logically back-to-front, but programming in this style avoids having to check for end-of-file multiple times and invoking labels and gotos.
N.B. This solution will work for any number of indented lines.
Here is a ruby to do that:
ruby -e '
s=$<.read
s.scan(/^([^ \t]+:)([\s\S]+?)(?=^\1|\z)/m). # parse blocks
map(&:last). # get data part
# parse and join the data fields:
map{|block| block.split(/\n[ \t]+[^ \t]+[ \t]+/)}.
map{|lines| lines.map(&:strip).join(" ").split().join(",")}.
each{|l| puts "#{l}"}
' file
vm346,1,(12%),6150m,(76%),1130Mi,(7%),1130Mi,(7%)
vm847,6,(75%),30150m,(376%),12980Mi,(87%),12980Mi,(87%)
vm848,3500m,(43%),17150m,(214%),6216Mi,(41%),6216Mi,(41%)
The advantage is that this is not dependent on the number of lines or the number of fields. It is parsing data that is in blocks of the form:
START: ([ \t]+[data_with_no_space])*\n
l1 ([ \t]+[data_with_no_space])*\n
...
START:
...
Works this way:
Parse the blocks with the regex passed to scan;
Save an array of the data elements;
Join the sub arrays and then split into data fields;
Join(',') to make a csv.
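The block-oriented idea, where a non-indented line starts a new record and any number of indented lines extend it, can also be sketched in Python. This assumes the input keeps its indentation, as the question states; the function name group_blocks is hypothetical.

```python
def group_blocks(text: str):
    """One CSV row per block; a block starts at a non-indented line."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split()
        if not line[0].isspace():
            # Header line such as "Name: vm346": drop the label, keep the rest.
            rows.append(fields[1:])
        elif rows:
            # Indented line: drop its first field (cpu, memory, ...).
            rows[-1].extend(fields[1:])
    return [",".join(row) for row in rows]


sample = (
    "Name: vm1\n"
    "  cpu 1 (2%)\n"
    "  memory 3Mi (4%)\n"
    "  disk 5 (6%)\n"
    "Name: vm2\n"
    "  cpu 7 (8%)\n"
)
for row in group_blocks(sample):
    print(row)
```

As with the Ruby version, nothing here depends on how many indented lines or fields a block has.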

Parsing JSON from shell script using JSON.sh

I'm working on parsing JSON data using JSON.sh. And I wanted to read data from json file (test.json) whose content will be something like,
{
"/home/ukrishnan/projects/test.yml": {
"LOG_DRIVER": "syslog",
"IMAGE": "mysql:5.6"
},
"/home/ukrishnan/projects/mysql/app.xml": {
"ENV_ACCOUNT_BRIDGE_ENDPOINT": "/u01/src/test/sample.txt"
}
}
And I try to parse this JSON using JSON.sh by using,
test_parser=`sh ./lib/JSON.sh < test/test.json`
echo $test_parser
It prints,
["/home/ukrishnan/projects/test.yml","LOG_DRIVER"] "syslog" ["/home/ukrishnan/projects/test.yml","IMAGE"] "mysql:5.6" ["/home/ukrishnan/projects/test.yml"] {"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"} ["/home/ukrishnan/projects/mysql/app.xml","ENV_ACCOUNT_BRIDGE_ENDPOINT"] "/u01/src/test/sample.txt" ["/home/ukrishnan/projects/mysql/app.xml"] {"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"} [] {"/home/ukrishnan/projects/test.yml":{"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"},"/home/ukrishnan/projects/mysql/app.xml":{"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"}}
Whereas, the same command (sh ./lib/JSON.sh < test/test.json), if I run through terminal, it is printing with line breaks,
["/home/ukrishnan/projects/test.yml","LOG_DRIVER"] "syslog"
["/home/ukrishnan/projects/test.yml","IMAGE"] "mysql:5.6"
["/home/ukrishnan/projects/test.yml"] {"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"}
["/home/ukrishnan/projects/mysql/app.xml","ENV_ACCOUNT_BRIDGE_ENDPOINT"] "/u01/src/test/sample.txt"
["/home/ukrishnan/projects/mysql/app.xml"] {"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"}
[] {"/home/ukrishnan/projects/test.yml":{"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"},"/home/ukrishnan/projects/mysql/app.xml":{"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"}}
I wanted to read this and assign to bash variables like,
file_name='/home/ukrishnan/projects/test.yml'
key='LOG_DRIVER'
value='syslog'
As I'm almost completely new to shell script and grep or awk, I don't have much idea of how to achieve this. Any help on this would be greatly appreciated.
I wrote a JSON serializer / deserializer for gawk, if you're interested. Save that script and modify it, replacing everything above # === FUNCTIONS === with the following:
#!/usr/bin/gawk -f
# capture JSON string from beginning to end into a scalar variable
{ json = json ORS $0 }
END {
# objectify JSON string to the multilevel array "obj"
deserialize(json, obj)
for (filename in obj) {
print "file_name=" quote(filename)
for (key in obj[filename]) {
# print key="value"
print key "=" quote(obj[filename][key])
}
}
}
Do chmod 755 json.awk and execute it. Output will resemble this:
$ ./json.awk test5.json
file_name="/home/ukrishnan/projects/mysql/app.xml"
ENV_ACCOUNT_BRIDGE_ENDPOINT="/u01/src/test/sample.txt"
file_name="/home/ukrishnan/projects/test.yml"
LOG_DRIVER="syslog"
IMAGE="mysql:5.6"
Hopefully the logic is reasonably easy to follow. If you prefer to output filename=, key=, and value= on every loop iteration, modify the nested for loops accordingly:
for (filename in obj) {
for (key in obj[filename]) {
print "file_name=" quote(filename)
print "key=" quote(key)
print "value=" quote(obj[filename][key])
}
}
That change will result in the following output:
$ ./json.awk test5.json
file_name="/home/ukrishnan/projects/mysql/app.xml"
key="ENV_ACCOUNT_BRIDGE_ENDPOINT"
value="/u01/src/test/sample.txt"
file_name="/home/ukrishnan/projects/test.yml"
key="LOG_DRIVER"
value="syslog"
file_name="/home/ukrishnan/projects/test.yml"
key="IMAGE"
value="mysql:5.6"
Anyway, with that output, you can do something silly in BASH like this to populate and act upon the variables:
#!/bin/bash
./json.awk test5.json | while read -r line; do {
eval "$line"
[ "${line/=*/}" = "value" ] && {
echo "bash: file_name=$file_name"
echo "bash: key=$key"
echo "bash: value=$value"
echo "------"
}
}; done
It'd probably be more graceful just to do all processing within gawk from start to finish and not mess with the polyglot handoff, though.
Getting back to json.awk, if you prefer to keep json.awk modular for easy reuse in future projects, you could remove everything above # === FUNCTIONS ===, create a separate main.awk containing the code block at the top of this answer, and @include "json.awk" as a helper library pretty much anywhere outside of END {...} (just below the shebang, for example).
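If a Python interpreter happens to be available (an assumption, since the question asks about shell tools), the same file/key/value walk needs only the standard json module. This sketch uses a hypothetical helper name, iter_settings, and shlex.quote so that the printed assignments are safe to read back into a shell.

```python
import json
import shlex


def iter_settings(json_text: str):
    """Yield (file_name, key, value) for a two-level JSON object."""
    doc = json.loads(json_text)
    for file_name, settings in doc.items():
        for key, value in settings.items():
            yield file_name, key, value


sample = """
{
  "/home/ukrishnan/projects/test.yml": {
    "LOG_DRIVER": "syslog",
    "IMAGE": "mysql:5.6"
  },
  "/home/ukrishnan/projects/mysql/app.xml": {
    "ENV_ACCOUNT_BRIDGE_ENDPOINT": "/u01/src/test/sample.txt"
  }
}
"""

for file_name, key, value in iter_settings(sample):
    print("file_name=" + shlex.quote(file_name))
    print("key=" + shlex.quote(key))
    print("value=" + shlex.quote(str(value)))
```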
JSON.sh (from http://json.org) offers a nice, bash-friendly means of flattening a JSON file; you have already shown what its output looks like in your question. Each line of the flattened form has the format:
[node] tab value
You have to think in terms of UNIX text processing to extract the information you want. Note that the lines you're interested in follow this pattern:
["filename","key"] tab "value"
In regex notation, we replace:
filename with (.*)
key with (.*)
tab with \t
value with (.*)
We can retrieve the first, second and third matching groups with \1, \2, \3 respectively.
When used in sed we also note that these symbols []() need to be escaped with a backslash \, resulting in the following script:
./lib/JSON.sh < test/test.json | sed 's/\["\(.*\)","\(.*\)\"]\t"\(.*\)"/\1,\2,\3/;t;d'
/home/ukrishnan/projects/test.yml,LOG_DRIVER,syslog
/home/ukrishnan/projects/test.yml,IMAGE,mysql:5.6
/home/ukrishnan/projects/mysql/app.xml,ENV_ACCOUNT_BRIDGE_ENDPOINT,/u01/src/test/sample.txt
Now we put the lines in a loop and for each line, we can extract out filename,key,value:
for line in $(./lib/JSON.sh < test/test.json | sed 's/\["\(.*\)","\(.*\)\"]\t"\(.*\)"/\1,\2,\3/;t;d')
do
IFS="," read -ra arr <<< "$line"
filename=${arr[0]}
key=${arr[1]}
value=${arr[2]}
cat <<EOF
filename : $filename
key : $key
value : $value
EOF
done
Which outputs:
filename : /home/ukrishnan/projects/test.yml
key : LOG_DRIVER
value : syslog
filename : /home/ukrishnan/projects/test.yml
key : IMAGE
value : mysql:5.6
filename : /home/ukrishnan/projects/mysql/app.xml
key : ENV_ACCOUNT_BRIDGE_ENDPOINT
value : /u01/src/test/sample.txt
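Because both halves of each flattened line are themselves valid JSON (an array for the path, a JSON value after the tab), they can also be parsed without a hand-written regex. The sketch below assumes the tab-separated layout described above and a hypothetical function name; it keeps only lines whose path has exactly two elements and whose value is a scalar.

```python
import json


def parse_flat_lines(lines):
    """Yield (filename, key, value) from JSON.sh-style flattened lines."""
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue
        path_text, value_text = line.split("\t", 1)
        path = json.loads(path_text)
        value = json.loads(value_text)
        # Skip the summary lines for whole objects and for the root.
        if len(path) == 2 and not isinstance(value, (dict, list)):
            yield path[0], path[1], value


flat = [
    '["/home/ukrishnan/projects/test.yml","LOG_DRIVER"]\t"syslog"',
    '["/home/ukrishnan/projects/test.yml","IMAGE"]\t"mysql:5.6"',
    '["/home/ukrishnan/projects/test.yml"]\t{"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"}',
]
for filename, key, value in parse_flat_lines(flat):
    print(f"filename : {filename}\nkey : {key}\nvalue : {value}")
```

This avoids the problem of values that contain commas or spaces, which would confuse a comma-joined intermediate format.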

Removing specific html tag

I would like to remove a specific set of html tags, here is what I have tried
$str_rep="<table></td></tr></table></td></tr></table></td></tr>";
local $^I = ""; # Enable in-place editing.
push(@files,"$report_file");
local @ARGV = @files; # Set files to operate on.
while (<>) {
s/(.*)$str_rep(.*)$/$1$2/g;
print;
}
The HTML file has only two lines: the first is the page header and the second has the full content, including a couple of tables. Now I am trying to remove some unwanted table closing tags, which would help me merge the tables together. Unfortunately it is removing everything after the replacement string. Where am I going wrong?
You should escape slashes /, and simply replace the matched string by an empty string :
$str_rep="<table><\/td><\/tr><\/table><\/td><\/tr><\/table><\/td><\/tr>";
local $^I = ""; # Enable in-place editing.
push(@files,"$report_file");
local @ARGV = @files; # Set files to operate on.
while (<>) {
s/$str_rep//g;
print;
}
Here you are:
my $report_file = 'input.html';
# see at this v - you forget about one \/ near table :)
my $str_rep="<\/table><\/td><\/tr><\/table><\/td><\/tr><\/table><\/td><\/tr>";
local $^I = ""; # Enable in-place editing.
push(@files,"$report_file");
local @ARGV = @files; # Set files to operate on.
while (<>) {
s/$str_rep//g;
print;
}
I use diff for input.html and target.html
Everything works fine!
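Since the fragment to delete is a fixed string rather than a pattern, a plain literal replacement avoids the escaping question altogether. As an illustration in Python (the function name remove_fragment is hypothetical, and the fragment is the one from the answer above):

```python
FRAGMENT = "</table></td></tr></table></td></tr></table></td></tr>"


def remove_fragment(html: str, fragment: str = FRAGMENT) -> str:
    """Delete every literal occurrence of fragment; no regex involved."""
    return html.replace(fragment, "")


page = "<p>first table</p>" + FRAGMENT + "<p>second table</p>"
print(remove_fragment(page))
```

If a regex is needed anyway, for example to allow optional whitespace between the tags, the fixed parts can be passed through re.escape first.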

JSON to fixed width file

I have to extract data from JSON file depending on a specific key. The data then has to be filtered (based on the key value) and separated into different fixed width flat files. I have to develop a solution using shell scripting.
Since the data is just key:value pair I can extract them by processing each line in the JSON file, checking the type and writing the values to the corresponding fixed-width file.
My problem is that the input JSON file is approximately 5GB in size. My method is very basic and would like to know if there is a better way to achieve this using shell scripting ?
Sample JSON file would look like as below:
{"Type":"Mail","id":"101","Subject":"How are you ?","Attachment":"true"}
{"Type":"Chat","id":"12ABD","Mode:Online"}
The above is a sample of the kind of data I need to process.
Give this a try:
#!/usr/bin/awk -f
{
line = ""
gsub("[{}\x22]", "", $0)
f=split($0, a, "[:,]")
for (i=1;i<=f;i++)
if (a[i] == "Type")
file = a[++i]
else
line = line sprintf("%-15s",a[i])
print line > file ".fixed.out"
}
I made assumptions based on the sample data provided. There is a lot based on those assumptions that may need to be changed if the data varies much from what you've shown. In particular, this script will not work properly if the data values or field names contain colons, commas, quotes or braces. If this is a problem, it's one of the primary reasons that a proper JSON parser should be used. If it were my assignment, I'd push back hard on this point to get permission to use the proper tools.
This outputs lines that have type "Mail" to a file named "Mail.fixed.out" and type "Chat" to "Chat.fixed.out", etc.
The "Type" field name and field value ("Mail", etc.) are not output as part of the contents. This can be changed.
Otherwise, both the field names and values are output. This can be changed.
The field widths are all fixed at 15 characters, padded with spaces, with no delimiters. The field width can be changed, etc.
Let me know how close this comes to what you're looking for and I can make some adjustments.
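For comparison, here is a sketch of the same layout built on a real JSON parser, which sidesteps the problem with colons and commas inside values. It is written in Python, which the question does not mention, so treat it as an assumption-laden illustration: the names format_record and split_by_type are invented, the 15-character width and the ".fixed.out" suffix mirror the awk script, and values longer than the width are not truncated.

```python
import json

FIELD_WIDTH = 15


def format_record(line: str):
    """Return (type, fixed-width text) for one JSON object per line."""
    record = json.loads(line)  # raises ValueError on malformed input
    record_type = str(record.pop("Type", "missing_type"))
    fixed = "".join(
        f"{name:<{FIELD_WIDTH}}{str(value):<{FIELD_WIDTH}}"
        for name, value in record.items()
    )
    return record_type, fixed


def split_by_type(lines, suffix=".fixed.out"):
    """Stream lines and append each record to the file for its type."""
    handles = {}
    try:
        for line in lines:
            if not line.strip():
                continue
            record_type, fixed = format_record(line)
            if record_type not in handles:
                handles[record_type] = open(record_type + suffix, "w", encoding="utf-8")
            handles[record_type].write(fixed + "\n")
    finally:
        for handle in handles.values():
            handle.close()


mail = '{"Type":"Mail","id":"101","Subject":"How are you ?","Attachment":"true"}'
print(format_record(mail))
```

Passing an open file object to split_by_type processes the input one line at a time, so the 5GB file never has to fit in memory.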
perl script
#!/usr/bin/perl -w
use strict;
use warnings;
no strict 'refs'; # for FileCache
use FileCache; # avoid exceeding system's maximum number of file descriptors
use JSON;
my $type;
my $json = JSON->new->utf8(1); #NOTE: expect utf-8 strings
while(my $line = <>) { # for each input line
# extract type
eval { $type = $json->decode($line)->{Type} };
$type = 'json_decode_error' if $@;
$type ||= 'missing_type';
# print to the appropriate file
my $fh = cacheout '>>', "$type.out";
print $fh $line; #NOTE: use cache if there are too many hdd seeks
}
corresponding shell script
#!/bin/bash
#NOTE: bash is used to create non-ascii filenames correctly
__extract_type()
{
perl -MJSON -e 'print from_json(shift)->{Type}' "$1"
}
__process_input()
{
local IFS=$'\n'
while read line; do # for each input line
# extract type
local type="$(__extract_type "$line" 2>/dev/null ||
echo json_decode_error)"
[ -z "$type" ] && local type=missing_type
# print to the appropriate file
echo "$line" >> "$type.out"
done
}
__process_input
Example:
$ ./script-name < input_file
$ ls -1 *.out
json_decode_error.out
Mail.out

Editing multiple HTML files using SED (or something similar)

I have about 1000 HTML files to edit which represent footnotes in a large technical document. I have been asked to go through the HTML files one by one and manually edit the HTML, to get it all on the straight and narrow.
I know that this could probably be done in a matter of seconds with SED as the changes to each file are similar. The body text in each file can be different but I want to change the tags to match the following:
<body>
<p class="Notes">See <i>R v Swain</i> (1992) 8 CRNZ 657 (HC).</p>
</body>
The text may change; for example, it could say 'See R v Pinky and the Brain (1992)' or something like that, but basically the body text should have that form.
Currently, however, the body text may be:
<body>
<p class="Notes"><span class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB"><span><span
class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB" style="font-size: 10.0pt;">See <i>R v Pinky and the Brain</i> (1992) </span></span></span></span></span></p>
</body>
or even:
<body>
<p class="FootnoteText"><span class="FootnoteReference"><span lang="EN-US"
xml:lang="EN-US" style="font-size: 10.0pt;"><span><![endif]></span></span></span>See <i>R v Pinky and the Brain</i> (1992)</p>
</body>
Can anybody suggest a SED expression or something similar that would solve this?
Like this?:
perl -pe 's/Swain/Pinky and the Brain/g;' -i lots.html of.html files.html
The breakdown:
-e = "Use code on the command line"
-p = "Execute the code on every line of every file, and print out the line, including what changed"
-i = "Actually replace the files with the new content"
If you swap out -i with -i.old then lots.html.old and of.html.old (etc) will contain the files before the changes, in case you need to go back.
This will replace just Swain with Pinky and the Brain in all the files. Further changes would require more runs of the command. Or:
s/Swain/Pinky/g; s/Twain/Brain/g;
To swap Swain with Pinky and Twain with Brain everywhere.
Update:
If you can be sure about the incoming formatting of the data, then something like this may suffice:
# cat ff.html
<body>
<p class="Notes"><span class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB"><span><span
class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB" style="font-size: 10.0pt;">See <i>R v Twain</i> (1992) </span></span></span></span></span></p>
<p class="Notes"><span class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB"><span><span
class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB" style="font-size: 10.0pt;">See <i>R v Swain</i> (1992) </span></span></span></span></span></p>
</body>
# perl -pe 'BEGIN{undef $/;} s/<[pP][ >].*?See <i>(.*?)<\/i>(.*?)<.*?\/[pP]>/<p class="Notes">See <i>$1<\/i>$2<\/p>/gsm;' ff.html
<body>
<p class="Notes">See <i>R v Twain</i> (1992) </p>
<p class="Notes">See <i>R v Swain</i> (1992) </p>
</body>
Explanations:
BEGIN{undef $/;} = treat the whole document as one string, or else html that has newlines in it won't get handled properly
<[pP][ >] = the beginning of a p-tag (case insensitive)
.*? = lots of stuff, non-greedy-matched i.e. http://en.wikipedia.org/wiki/Regular_expression#Lazy_quantification
See <i> = literally look for that string - very important, since that seems to be the only common denominator
(.*?) = put more stuff into a parentheses group (to be used later)
<\/i> = the end i-tag
(.*?) = put more stuff into a parentheses group (to be used later)
<.*?\/[pP] = the end p-tag and other possible tags mashed up before it (like all your spans)
and replace it with the string you want, where $1 and $2 are what got snagged in the parentheses before, i.e. the two (.*?) 's
g = global search - so possibly more than one per line
s = treat everything like one line (which it is now due to the BEGIN at the top)
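The pattern explained above carries over unchanged to other regex engines. As a cross-check, here is a sketch of the same substitution in Python, where re.DOTALL plays the role of the s flag and the whole document is passed in as one string; the function name normalise_notes is hypothetical.

```python
import re

# Same pieces as in the Perl expression: opening p-tag, lazy filler,
# the literal "See <i>", two lazy capture groups, and the closing p-tag.
NOTE_RE = re.compile(r"<[pP][ >].*?See <i>(.*?)</i>(.*?)<.*?/[pP]>", re.DOTALL)


def normalise_notes(html: str) -> str:
    return NOTE_RE.sub(r'<p class="Notes">See <i>\1</i>\2</p>', html)


doc = (
    "<body>\n"
    '<p class="Notes"><span class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB"><span><span\n'
    'class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB" style="font-size: 10.0pt;">'
    "See <i>R v Twain</i> (1992) </span></span></span></span></span></p>\n"
    "</body>"
)
print(normalise_notes(doc))
```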
First convert your HTML files to proper XHTML using http://tidy.sourceforge.net and then use xmlstarlet to do the necessary XHTML processing.
Note: Get the current version of xmlstarlet for in-place XML file editing.
Here's a simple, yet complete mini-example:
curl -s http://checkip.dyndns.org > dyndns.html
tidy -wrap 0 -numeric -asxml -utf8 2>/dev/null < dyndns.html > dyndns.xml
# test: print body text to stdout (dyndns.xml)
xml sel -T \
-N XMLNS="http://www.w3.org/1999/xhtml" \
-t -m "//XMLNS:body" -v '.' -n \
dyndns.xml
# edit body text in-place (dyndns.xml)
xml ed -L \
-N XMLNS="http://www.w3.org/1999/xhtml" \
-u "//XMLNS:body" -v '<p NEW BODY TEXT </p>' \
dyndns.xml
# create new HTML file (by overwriting the original one!)
xml unesc < dyndns.xml > dyndns.html
To consolidate the span tags you may use tidy (version released on 25 March 2009) as well!
# get current tidy version: http://tidy.cvs.sourceforge.net/viewvc/tidy/tidy/
# see also: http://tidy.sourceforge.net/docs/quickref.html#merge-spans
tidy -q -c --merge-spans yes file.html
You will have to check your input files to verify some assumptions can be made. Based on your two examples, I have made the following assumptions. You will need to check them and take some sample input files to verify you have found all assumptions.
The file consists of a single footnote contained in a single <body></body> pair. The body tags are always present and well formed.
The footnote is buried somewhere inside a <p></p> pair and one or many <span></span> tags. <!...> tags can be discarded.
The following Perl script works for both examples you have supplied (on Linux with Perl 5.10.0). Before using it, make sure you have a backup of your original html files. By default, it will only print the result on stdout without changing any file.
#!/usr/bin/perl
$overwrite = 0;
# undefine the input record separator so that <FH> slurps the whole file
# into a scalar ($/ = '' would select paragraph mode and stop at the first blank line)
undef $/;
foreach $filename (@ARGV)
{
# slurp entire file in $text variable
open FH, "<$filename";
$full_text = <FH>;
close FH;
if ($overwrite)
{
! -f "$filename.bak" && rename $filename, "$filename.bak";
}
# match everything that is found before the body tag, everything
# between and including the body tags, and what follows
# s modifier causes full_text to be considered a single long string
# instead of individual lines
($before_body, $body, $after_body) = ($full_text =~ m!(.*)<body>(.*)</body>(.*)!s);
#print $before_body, $body, $after_body;
# Discard unwanted tags from the body
$body =~ s%<span.*?>%%sg;
$body =~ s%</span.*?>%%sg;
$body =~ s%<p.*?>%%sg;
$body =~ s%</p.*?>%%sg;
$body =~ s%<!.*?>%%sg;
# Remaining leading and trailing whitespace likely to be newlines: remove
$body =~ s%^\s*%%sg;
$body =~ s%\s*$%%sg;
if ($overwrite)
{
open FH, ">$filename";
print FH $before_body, "<body>\n<p class=\"Notes\">$body</p>\n</body>", $after_body;
close FH;
}
else
{
print $before_body, "<body>\n<p class=\"Notes\">$body</p>\n</body>", $after_body;
}
}
To use it:
./script.pl file1.html
./script.pl file1.html file2.html
./script.pl *.html
Tweak it and when you're happy set $overwrite=1. The script creates a .bak only if it does not already exist.
If you have 1 entry per file, no rigid structure in these files, and possibly multiple lines, I would go for a php or perl script to process them file by file, while emitting suitable warnings when the patterns don't match.
use
php -f thescript.php
to execute thescript.php, which contains
<?php
$path = "datapath/";
$dir = opendir($path);
while ( ( $fn = readdir($dir) ) !== false )
{
if ( preg_match("/html$/",$fn) ) process($path.$fn);
}
function process($file)
{
$in = file_get_contents($file);
$in2 = str_replace("\n"," ",strip_tags($in,"<i>"));
if ( preg_match("#^(.*)<i>(.*)</i>(.*)$#i",$in2,$match) )
{
list($dummy,$p0,$p1,$p2) = $match;
$out = "<body>$p0<i>$p1</i>$p2</body>";
file_put_contents($file.".out",$out);
} else {
print "Problem with $file? (stripped down to: $in2)\n";
file_put_contents($file.".problematic",$in);
}
}
?>
You could tweak this to your needs until the number of misses is low enough to do the last few by hand. You probably need to add some $p0 = trim($p0); etc. to sanitize everything.
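The strip_tags($in,"<i>") step, which drops every tag except <i>, can be approximated with Python's standard html.parser module. This is a sketch rather than a drop-in replacement for the PHP script: the class and function names are invented, the output wrapper follows the target format from the question, and whitespace is collapsed to single spaces.

```python
from html.parser import HTMLParser


class KeepItalics(HTMLParser):
    """Collect text content, keeping only <i> and </i> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "i":
            self.parts.append("<i>")

    def handle_endtag(self, tag):
        if tag == "i":
            self.parts.append("</i>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")  # keep entities such as &amp; as-is

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def normalise_footnote(html: str) -> str:
    parser = KeepItalics()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.parts).split())  # collapse whitespace
    return f'<p class="Notes">{text}</p>'


messy = (
    '<body>\n<p class="Notes"><span class="FootnoteReference">'
    '<span lang="EN-GB">See <i>R v Swain</i> (1992) </span></span></p>\n</body>'
)
print(normalise_footnote(messy))
```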