Match any character (including newlines) in sed

I have a sed command that I want to run on a huge, terrible, ugly HTML file that was created from a Microsoft Word document. All it should do is remove any instance of the string
style='text-align:center; color:blue;
exampleStyle:exampleValue'
The sed command that I am trying to modify is
sed "s/ style='[^']*'//" fileA > fileB
It works great, except that whenever there is a new line inside of the matching text, it doesn't match. Is there a modifier for sed, or something I can do to force matching of any character, including newlines?
I understand that regexps are terrible at XML and HTML, blah blah blah, but in this case, the string patterns are well-formed in that the style attributes always start with a single quote and end with a single quote. So if I could just solve the newline problem, I could cut down the size of the HTML by over 50% with just that one command.
In the end, it turned out that Sinan Ünür's perl script worked best. It was almost instantaneous, and it reduced the file size from 2.3 MB to 850k. Good ol' Perl...

sed goes over the input file line by line which, as I understand it, means what you want is not possible in sed.
You could use the following Perl script (untested), though:
#!/usr/bin/perl
use strict;
use warnings;
{
local $/; # slurp mode
my $html = <>;
$html =~ s/ style='[^']*'//g;
print $html;
}
__END__
A one liner would be:
$ perl -e 'local $/; $_ = <>; s/ style=\047[^\047]*\047//g; print' fileA > fileB

Sed reads the input line by line, so processing text that spans more than one line is not simple... but it is not impossible either: you need to make use of sed branching. The following will work; I have commented it to explain what is going on (not the most readable syntax!):
sed "# if the line contains style=', branch to the label,
# otherwise process next line
/style='/b style
b
# the line contains 'style', try to do a replace
: style
s/ style='[^']*'//
# if the replace worked, then process next line
t
# otherwise append the next line to the pattern space and try again.
N
b style
" fileA > fileB
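As a quick check, here is the script run against a made-up sample where the attribute spans a line break:

```shell
# Invented two-line sample: the quoted attribute continues on line 2
printf "<p style='text-align:center;\ncolor:blue'>hi</p>\n" > fileA
# The script appends lines (N) until the substitution succeeds (t)
sed "/style='/b style
b
: style
s/ style='[^']*'//
t
N
b style
" fileA > fileB
cat fileB
```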

You could remove all CR/LF using tr, run sed, and then import into an editor that auto-formats.
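A minimal sketch of that approach, replacing newlines with spaces so adjacent words don't merge (the sample input is invented):

```shell
# Invented sample with a line break inside the attribute
printf "<p style='text-align:center;\ncolor:blue'>hi</p>\n" > fileA
# Flatten to one long line, then strip the style attributes
tr '\n' ' ' < fileA | sed "s/ style='[^']*'//g" > fileB
```

The output is a single line, which is why the re-format step in an editor matters afterwards.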

Another way:
$ cat toreplace.txt
I want to make \
this into one line
I also want to \
merge this line
$ sed -e 'N;N;s/\\\n//g;P;D;' toreplace.txt
Output:
I want to make this into one line
I also want to merge this line
The N loads another line, P prints the pattern space up to the first newline, and D deletes the pattern space up to the first newline.

You can try this:
awk '/style/&&/exampleValue/{
gsub(/style.*exampleValue\047/,"")
}
/style/&&!/exampleValue/{
gsub(/style.* /,"")
f=1
}
f &&/exampleValue/{
gsub(/.*exampleValue\047 /,"")
f=0
}
1
' file
Demo (displaying the input file first, then running the script):
# more file
this is a line
style='text-align:center; color:blue; exampleStyle:exampleValue'
this is a line
blah
blah
style='text-align:center; color:blue;
exampleStyle:exampleValue' blah blah....
# ./test.sh
this is a line
this is a line
blah
blah
blah blah....

Remove XML elements across several lines
My use case was pretty much the same, but I needed to match opening and closing tags from XML elements and remove them completely --including whatever was inside.
<xmlTag whatever="parameter that holds in the tag header">
<whatever_is_inside/>
<InWhicheverFormat>
<AcrossSeveralLines/>
</InWhicheverFormat>
</xmlTag>
Still, sed works on one single line. What we do here is trick it into appending subsequent lines to the current one so we can edit all the lines we like, then rewrite the output (\n is a legal character you can emit with sed to divide the lines again).
Inspired by the answer from @beano, and another answer on Unix StackExchange, I built up my working sed "program":
sed -s --in-place=.back -e '/\(^[ ]*\)<xmlTag/{ # whenever you encounter the xmlTag
$! { # do
:begin # label to return to
N; # append next line
s/\(^[ ]*\)<\(xmlTag\)[^·]\+<\/\2>//; # Attempt substitution (elimination) of pattern
t end # if substitution succeeds, jump to :end
b begin # unconditional jump to :begin to append yet another line
:end # label to mark the end
}
}' myxmlfile.xml
Some explanations:
I match <xmlTag without closing the > because my XML element contains parameters.
What precedes <xmlTag is a very helpful piece of RegExp to match any existing indentation: \(^[ ]*\) so you can later output it with just \1 (even if it was not needed this time).
The addition of ; in several places is so that sed will understand that the command (N, s or whichever) ends there and following character(s) are another command.
Most of my trouble was trying to find a RegExp that would match "anything in between". I finally settled on anything but · (i.e. [^·]\+), counting on that character never appearing in any of the data files. I needed to escape + because it is special for GNU sed.
my original files remain as .back, just in case something goes wrong --tests still do fail after modification-- and are flagged easily by version control for removal in bulk.
I use this kind of sed automation to evolve the .XML files that we use with serialized data to run our unit and integration tests. Whenever our classes change (lose or gain fields), the data have to be updated. I do that with a single `find` that executes a sed automation on the files that contain the modified class. We hold hundreds of xml data files.
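That find-driven bulk edit could look roughly like this. It is a sketch with invented file and class names, and a simple inline substitution stands in for the full sed program above; only the files that actually mention the class get touched, and each keeps a .back copy:

```shell
# Invented demo tree: one file mentions the class, one does not
mkdir -p demo
printf '<a>\n<ModifiedClass/>\n</a>\n' > demo/one.xml
printf '<a>\n<Other/>\n</a>\n' > demo/two.xml
# grep -l lists only matching files; xargs hands them to sed
find demo -name '*.xml' -exec grep -l 'ModifiedClass' {} + \
  | xargs sed --in-place=.back -e 's/ModifiedClass/RenamedClass/'
```

Files that never matched get no .back copy, which keeps version control noise down.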

Related

exclude words those may or may not end with slash

I am trying to exclude certain words from a dictionary file.
# cat en.txt
test
testing
access/p
batch
batch/n
batches
cross
# cat exclude.txt
test
batch
# grep -vf exclude.txt en.txt
access/p
cross
The words like "testing" and "batches" should be included in the results.
expected result:
testing
access/p
batches
cross
Because the word "batch" may or may not be followed by a slash "/". There can be one or more tags after slash (n in this case). But the word "batches" is a different word and should not match with "batch".
I would harness GNU AWK for this task following way, let en.txt content be
test
testing
access/p
batch
batch/n
batches
cross
and exclude.txt content be
test
batch
then
awk 'BEGIN{FS="/"}FNR==NR{arr[$1];next}!($1 in arr)' exclude.txt en.txt
gives output
testing
access/p
batches
cross
Explanation: I inform GNU AWK that / is the field separator (FS). Then, while processing the first file (when the per-file record number FNR equals the overall record number NR), I simply use the 1st column value as a key in array arr and skip to the next line, so nothing else happens. For the 2nd (and any following) files, I select lines whose 1st column is not (!) one of the keys of array arr.
(tested in GNU Awk 5.0.1)
Using grep matching whole words:
grep -wvf exclude.txt en.txt
Explanation (from man grep)
-w --word-regexp Select only those lines containing matches that form whole words.
-v --invert-match Invert the sense of matching, to select non-matching lines.
-f -f FILE Obtain patterns from FILE, one per line.
Output
testing
access/p
batches
cross
Since there are many words in a dictionary that may have a root in one of those to exclude, we cannot conveniently† use a look-up hash (built from the exclude list) but have to check all of them. One way to do that more efficiently is to use an alternation pattern built from the exclude list
use warnings;
use strict;
use feature 'say';
use Path::Tiny; # to read ("slurp") a file conveniently
my $excl_file = 'exclude.txt';
my $re_excl = join '|', split /\n/, path($excl_file)->slurp;
$re_excl = qr($re_excl);
while (<>) {
if ( m{^ $re_excl (?:/.)? $}x ) {
# say "Skip printing (so filter out): $_";
next;
}
say;
}
This is used as program.pl dictionary-filename and it prints the filtered list.
Here I've assumed that what may follow the root-word to exclude is / followed by one character, (?:/.)?, since examples in the question use that and there is no precise statement on it. The pattern also assumes no spaces around the word.
Please adjust as/if needed for what may actually follow /. For example, it'd be (?:/.+)? for at least one character, (?:/[np])? for a single character from a specific list (n or p), (?:/[^xy]+)? for characters not in a given list, etc.
The qr operator forms a proper regex pattern.
† Can still first strip non-word endings, then use a look-up, then put back those endings
use Path::Tiny; # to read a file conveniently
my %lu = map { $_ => 1 } path($excl_file)->lines({ chomp => 1 });
while (<>) {
chomp;
# [^\w-] protects hyphenated words (or just use \W)
# Or: s{(/.+)$}{}g; if "/" is the only possibility
my $tail = s/([^\w-].+)$// ? $1 : ''; # test s/// so a stale $1 is never used
next if exists $lu{$_};
$_ .= $tail;
say;
}
This will be far more efficient on large dictionaries and long lists of exclude words. However, it is also far more complex and probably fails some (unstated) requirements.

Find and replace text from a large txt file on ubuntu

I want to replace "}{" by "},{" to make a large txt file into valid json. Need help !!!
This can be achieved with sed:
sed -i 's/}{/},{/g' filename
sed is the command; -i means the changes are made in place to the file whose name you give at the end (and you should change filename, of course).
The substitution starts with the s: between the first pair of slashes you put what you want to replace, and between the second pair what you want instead. The g at the end makes sure that this search/replace is executed not just once, but for every match sed finds.
If you have any newlines present after the }, you can simply remove them all, you'll still get a valid JSON afterwards:
tr -d '\n' < filename | sed 's/}{/},{/g' > newfilename
This would simply delete all newlines (\n) and pass it to the command. It will create a new file, though.
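If you would rather join only the newlines that fall between } and { and keep all the others, the multi-line pattern-space trick works too. A sketch (the :a;N;$!ba idiom is GNU sed; the sample file is invented):

```shell
# Two objects separated by a newline instead of a comma
printf '{"a":1}\n{"b":2}\n' > filename
# :a;N;$!ba pulls the whole file into the pattern space first,
# so the substitution can see the newline between } and {
sed ':a;N;$!ba;s/}\n{/},{/g' filename > newfilename
cat newfilename
```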

Fart.exe replace two carriage returns in a row

I have a csv file where I'm trying to replace two carriage returns in a row with a single carriage return using Fart.exe. First off, is this possible? If so, the text within the CSV is laid out like the below, where "CRLF" is an actual carriage return.
,CRLF
CRLF
But I want it to be just this without the extra carriage return on the second line:
,CRLF
I thought I could just do the below but it won't work:
CALL "C:\tmp\fart.exe" -C "C:\tmp\myfile.csv" ,\r\n\r\n ,\r\n
I need to know what to change ,\r\n\r\n to in order to make this work. Any ideas how I could make this happen? Thanks!
As Squashman has suggested, you are simply trying to remove empty lines.
There is no need for a 3rd party tool to do this. You can simply use FINDSTR to discard empty lines:
findstr /v "^$" myFile.txt >myFile.txt.new
move /y myFile.txt.new *. >nul
However, this will only work if all the lines end with CRLF. If you have a unix formatted file that ends each line with LF, then it will not work.
A more robust option would be to use JREPL.BAT - a regular expression command line text processing utility.
jrepl "^$" "" /r 0 /f myFile.txt /o -
Be sure to use CALL JREPL if you put the command within a batch script.
FART processes one line at a time, and the CRLF is not considered to be part of the line. So you can't use a normal FART command to remove CRLF. If you really want to use FART, then you will need to use the -B binary mode. You also need to use -C to get support for the escape sequences.
I've never used FART, so I can't be sure - but I believe the following would work
call fart -B -C myFile.txt "\r\n\r\n" "\r\n"
If you have many consecutive empty lines, then you will need to run the FART command repeatedly until there are no more changes.

Find a string between 2 other strings in document

I have found a ton of solutions to do what I want, with only one exception.
I need to search a .html document and pull a string.
The line containing the string will look like this (1 line, no newlines)
<script type="text/javascript">g_initHeader(0);LiveSearch.attach(ge('oh2345v5ks'));var _ = g_items;_[60]={icon:'INV_Chest_Leather_09',name_enus:'Layered Tunic'};_[6076]={icon:'INV_Pants_11',name_enus:'Tapered Pants'};_[3070]={icon:'INV_Misc_Cape_01',name_enus:'Ensign Cloak'};</script>
The text I need to get is
INV_Chest_Leather_09
When I use awk, grep, and sed, I extract the data between icon:' and ',name_
The problem is, all three of these scripts scan the entire line and use the last occurring ',name_ thus I end up with
INV_Chest_Leather_09',name_enus:'Layered
Tunic'};_[6076]={icon:'INV_Pants_11',name_enus:'Tapered
Pants'};_[3070]={icon:'INV_Misc_Cape_01
Here's the last one I tried
grep -Po -m 1 "(?<=]={icon:').*(?=',name_)"
I've tried awk and sed too, and I don't really have a preference of which one to use.
So basically, I need to search the entire html file, find the first occurrence of icon:', extract the text right after it until the first occurrence after icon:' of ',name_.
With GNU awk for the 3rd arg to match():
$ awk 'match($0,/icon:\047([^\047]+)/,a){print a[1]}' file
INV_Chest_Leather_09
Simple perl approach:
perl -ne 'print "$1\n" if /\bicon:\047([^\047]+)/' file
The output:
INV_Chest_Leather_09
The .* in your regular expression is a greedy matcher, so the pattern will match till the end of the string and then backtrack to match the ,name_ portion. You could try replacing the .* with something like [^,]* (i.e. match anything except comma):
grep -Po -m 1 "(?<=]={icon:')[^,]*(?=',name_)"
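Another option is PCRE's lazy quantifier .*?, which stops at the first ',name_ rather than the last. Note that grep -o still prints every match on the line (the -m 1 only limits matching lines), so with several icon entries you may want to pipe through head -n 1. A sketch against a single-entry sample:

```shell
# Invented one-entry sample in the same format as the question
printf "_[60]={icon:'INV_Chest_Leather_09',name_enus:'Layered Tunic'};\n" > file
# .*? is lazy: it matches as little as possible before ',name_
grep -Po -m 1 "(?<=]={icon:').*?(?=',name_)" file
```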

How do I remove carriage returns in a txt file

I recently received some data as 99 pipe-delimited txt files; however, in some of them (I'll use dataaddress.txt as an example) there is a line break inside the address, e.g.
14 MakeUp Road
Hull
HU99 9HU
It's coming out on 3 rows rather than one; bear in mind there is data before and after this address, separated by pipes. It just seems to be this address issue which is causing me problems loading the txt file correctly using SSIS.
Rather than go back to the source, I wondered if there was a way to manipulate the txt file to remove these mid-record carriage returns without affecting the row-ending returns, if that makes sense.
I would use sed or awk. I will show you how to do this with awk, because it is more platform-independent. If you do not have awk, you can download a mawk binary from http://invisible-island.net/mawk/mawk.html.
The idea is as follows - tell awk that your line separator is something different, not carriage return or line feed. I will use comma.
Then use a regular expression to replace the string that you do not like.
Here is a test file I created. Save it as test.txt:
1,Line before ...
2,Broken line ... 14 MakeUp Road
Hull
HU99 9HU
3,Line after
And call awk as follows:
awk 'BEGIN { RS = ","; ORS=""; s=""; } $0 != "" { gsub(/MakeUp Road[\n\r]+Hull[\n\r]+HU99 9HU/, "MakeUp Road Hull HU99 9HU"); print s $0; s="," }' test.txt
I suggest that you save the awk code into a file named cleanup.awk. Here is the better formatted code with explanations.
BEGIN {
# This block is executed at the beginning of the file
RS = ","; # Tell awk our records are separated by comma
ORS=""; # Tell awk not to use record separator in the output
s=""; # We will print this as record separator in the output
}
{
# This block is executed for each line.
# Remember, our "lines" are separated by commas.
# For each line, use a regular expression to replace the bad text.
gsub(/MakeUp Road[\n\r]+Hull[\n\r]+HU99 9HU/, "MakeUp Road Hull HU99 9HU");
# Print the replaced text - $0 variable represents the line text.
print s $0; s=","
}
Using the awk file, you can execute the replacement as follows:
awk -f cleanup.awk test.txt
To process multiple files, you can create a bash script:
for f in *.txt; do
# Execute the cleanup.awk program for each file.
# Save the cleaned output to a file in the directory ../clean
awk -f cleanup.awk "$f" > "../clean/$f"
done
You can use sed to remove the line feed and carriage return characters:
sed ':a;N;$!ba;s/MakeUp Road[\n\r]\+/MakeUp Road /g' test.txt | sed ':a;N;$!ba;s/Hull[\n\r]\+/Hull /g'
Explanation:
:a create a label 'a'
N append the next line to the pattern space
$!ba if not the last line ($!), branch (go to) label 'a'
s substitute command, \n represents new line, \r represents carriage return, [\n\r]+ - match new line or carriage return in a sequence as many times as they occur (at least one), /g global match (as many times as it can)
sed will loop through steps 1 to 3 until it reaches the last line, at which point all lines fit in the pattern space and sed substitutes all the \n characters
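If hardcoding the address text is not practical, a more general sketch is to keep appending physical lines until a record holds the expected number of pipe-delimited fields (4 in this invented sample; adjust the count for your real layout):

```shell
# Invented sample: the first record's address is split over 3 lines
printf 'a|b|14 MakeUp Road\nHull\nHU99 9HU|d\na|b|c|d\n' > test.txt
# Accumulate lines until the record splits into 4 fields, then print it
awk '{ rec = (rec == "") ? $0 : rec " " $0 }
     split(rec, parts, "|") == 4 { print rec; rec = "" }' test.txt
```

This handles any broken field, not just a known address, as long as every record really does have the same number of pipes.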