Find/replace space with %20
I must replace all spaces in *.html files which are inside href="something something .pdf" with %20.
I found a regular expression for that task:
find : href\s*=\s*['"][^'" ]*\K\h|(?!^)\G[^'" ]*\K\h
replace : %20
That regular expression works in text editors like Notepad++ or Geany.
I want to use that regular expression from the Linux command line with sed or perl.
Solution (1):
cat test002.html | perl -ne 's/href\s*=\s*['\''"][^'\''" ]*\K\h|(?!^)\G[^'\''" ]*\K\h/%20/g; print;' > Work_OK01.html
Solution (2):
cat test002.html | perl -ne 's/href\s*=\s*[\x27"][^\x27" ]*\K\h|(?!^)\G[^\x27" ]*\K\h/%20/g; print;' > Work_OK02.html
The problem is that you don't properly escape the single quotes in your program.
If your program is
...[^'"]...
The shell literal might be
'...[^'\''"]...'
'...[^'"'"'"]...'
'...[^\x27"]...' # Avoids using a single quote to avoid escaping it.
So, you were going for
perl -ne 's/href\s*=\s*['\''"][^'\''" ]*\K\h|(?!^)\G[^'\''" ]*\K\h/%20/g; print;'
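To see which string a given shell literal really produces, you can print it with printf before handing it to perl. A minimal sketch (the variable names a, b and c are only for illustration):

```shell
# Each assignment builds the same 5-character string [^'"] in a different way.
a='[^'\''"]'     # close the quote, add an escaped quote, reopen the quote
b='[^'"'"'"]'    # close the quote, add a double-quoted quote, reopen the quote
c="[^'\"]"       # double quotes, with the inner double quote escaped

# Prints the same line three times.
printf '%s\n' "$a" "$b" "$c"
```

The \x27 form is different in kind: the shell passes it through unchanged and perl's regex engine turns it into a single quote.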
Don't try to do everything at once. Here are some far cleaner (i.e. far more readable) solutions:
perl -pe's{href\s*=\s*['\''"]\K([^'\''"]*)}{ $1 =~ s/ /%20/rg }eg' # 5.14+
perl -pe's{href\s*=\s*['\''"]\K([^'\''"]*)}{ (my $s = $1) =~ s/ /%20/g; $s }eg'
Note that -p is the same as -n, except that it causes a print to be performed for each line.
The above solutions make a large number of assumptions about the files that might be encountered[1]. All of these assumptions would go away if you use a proper parser.
If you have HTML files:
perl -MXML::LibXML -e'
my $doc = XML::LibXML->new->parse_html_file($ARGV[0]);
$_->setValue( $_->getValue() =~ s/ /%20/gr )
for $doc->findnodes(q{//@href});
binmode(STDOUT);
print($doc->toStringHTML());
' in_file.html >out_file.html
If you have XML (incl XHTML) files:
perl -MXML::LibXML -e'
my $doc = XML::LibXML->new->parse_file($ARGV[0]);
$_->setValue( $_->getValue() =~ s/ /%20/gr )
for $doc->findnodes(q{//@href});
binmode(STDOUT);
$doc->toFH(\*STDOUT);
' in_file.html >out_file.html
Assumptions made by the substitution-based solutions:
File uses an ASCII-based encoding (e.g. UTF-8, iso-latin-1, but not UTF-16le).
No newline between href and =.
No newline between = and the value.
No newline in the value of href attributes.
Nothing matching /href\s*=/ in text (incl CDATA sections).
Nothing matching /href\s*=/ in comments.
No other attributes have a name ending in href.
No single quote (') in href="...".
No double quote (") in href='...'.
No href= with an unquoted value.
Spaces in href attributes aren't encoded using a character entity (e.g. &#x20;).
Maybe more?
(SLePort makes similar assumptions even though they didn't document them. They also assume href attributes don't contain >.)
An XML parser would be better suited for that job (e.g. XMLStarlet, xmllint, ...), but if you don't have newlines in your a tags, the sed command below should work.
Using the t command and backreferences, it loops over and replaces all spaces up to the last " inside the a tags:
$ sed ':a;s/\(<a [^>]*href=[^"]*"[^ ]*\) \([^"]*">\)/\1%20\2/;ta' <<< '<a href="http://url with spaces">'
<a href="http://url%20with%20spaces">
You seem to have neglected to escape the quotes inside the string you pass to Perl. So Bash sees you giving perl the following arguments:
s/href\s*=\s*[][^', which results from the concatenation of the single-quoted string 's/href\s*=\s*[' and the double-quoted string "][^'"
]*Kh, unquoted, because \K and \h are not special characters in the shell so it just treats them as K and h respectively
Then Bash sees a pipe character |, followed by a subshell invocation (?!^), in which !^ gets substituted with the first argument of the last command invoked. (See "History Expansion > Word Designators" in the Bash man page.) For example, if your last command was echo myface, then (?!^) would look for the command named ?myface and run it in a subshell.
And finally, Bash gets to the sequence \G[^'" ]*\K\h/%20/g; print;', which is interpreted as the concatenation of G (from \G), [^, and the single-quoted string " ]*\K\h/%20/g; print;. Bash has no idea what to do with G[^" ]*\K\h/%20/g; print;, since it just finished parsing a subshell invocation and expects to see a semicolon, line break, or logical operator (or so on) before getting another arbitrary string.
Solution: properly quote the expression you give to perl. You'll need to use a combination of single and double quotes to pull it off, e.g.
perl -ne 's/href\s*=\s*['"'\"][^'\" ]*"'\K\h|(?!^)\G[^'"'\" ]*"'\K\h/%20/g; print;'
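When a quoted expression like this gets hard to read, one way to debug it is to print the argument instead of running it. A small sketch (arg is just an illustrative variable name); the output should be the regular expression exactly as the text editor accepted it, followed by /%20/g; print;

```shell
# Assign the same mixed-quote word to a variable and print it verbatim.
# %s makes printf leave the backslashes alone.
arg='s/href\s*=\s*['"'\"][^'\" ]*"'\K\h|(?!^)\G[^'"'\" ]*"'\K\h/%20/g; print;'
printf '%s\n' "$arg"
```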
Related
Should or should I not wrap quotes around variables in a shell script?
For example, is the following correct:
xdg-open $URL
[ $? -eq 2 ]
or
xdg-open "$URL"
[ "$?" -eq "2" ]
And if so, why?
General rule: quote it if it can either be empty or contain spaces (or any whitespace really) or special characters (wildcards). Not quoting strings with spaces often leads to the shell breaking apart a single argument into many.
$? doesn't need quotes since it's a numeric value. Whether $URL needs it depends on what you allow in there and whether you still want an argument if it's empty.
I tend to always quote strings just out of habit since it's safer that way.
In short, quote everything where you do not require the shell to perform word splitting and wildcard expansion.
Single quotes protect the text between them verbatim. It is the proper tool when you need to ensure that the shell does not touch the string at all. Typically, it is the quoting mechanism of choice when you do not require variable interpolation.
$ echo 'Nothing \t in here $will change'
Nothing \t in here $will change
$ grep -F '#&$*!!' file /dev/null
file:I can't get this #&$*!! quoting right.
Double quotes are suitable when variable interpolation is required. With suitable adaptations, it is also a good workaround when you need single quotes in the string. (There is no straightforward way to escape a single quote between single quotes, because there is no escape mechanism inside single quotes -- if there was, they would not quote completely verbatim.)
$ echo "There is no place like '$HOME'"
There is no place like '/home/me'
No quotes are suitable when you specifically require the shell to perform word splitting and/or wildcard expansion.
Word splitting (aka token splitting):
$ words="foo bar baz"
$ for word in $words; do
> echo "$word"
> done
foo
bar
baz
By contrast:
$ for word in "$words"; do echo "$word"; done
foo bar baz
(The loop only runs once, over the single, quoted string.)
$ for word in '$words'; do echo "$word"; done
$words
(The loop only runs once, over the literal single-quoted string.)
Wildcard expansion:
$ pattern='file*.txt'
$ ls $pattern
file1.txt file_other.txt
By contrast:
$ ls "$pattern"
ls: cannot access file*.txt: No such file or directory
(There is no file named literally file*.txt.)
$ ls '$pattern'
ls: cannot access $pattern: No such file or directory
(There is no file named $pattern, either!)
In more concrete terms, anything containing a filename should usually be quoted (because filenames can contain whitespace and other shell metacharacters). Anything containing a URL should usually be quoted (because many URLs contain shell metacharacters like ? and &). Anything containing a regex should usually be quoted (ditto ditto). Anything containing significant whitespace other than single spaces between non-whitespace characters needs to be quoted (because otherwise, the shell will munge the whitespace into, effectively, single spaces, and trim any leading or trailing whitespace).
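The whitespace munging mentioned last is easy to reproduce. A minimal sketch, assuming the default IFS (the variable names are illustrative):

```shell
padded='  two   spaces  '

unquoted=$(echo $padded)      # split into the words "two" and "spaces", rejoined by echo
quoted=$(echo "$padded")      # passed as one argument, whitespace preserved

# The brackets make the leading and trailing spaces visible.
printf '[%s]\n' "$unquoted" "$quoted"
```

The first line printed is [two spaces]; the second keeps every space of the original value.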
When you know that a variable can only contain a value which contains no shell metacharacters, quoting is optional. Thus, an unquoted $? is basically fine, because this variable can only ever contain a single number. However, "$?" is also correct, and recommended for general consistency and correctness (though this is my personal recommendation, not a widely recognized policy).
Values which are not variables basically follow the same rules, though you could then also escape any metacharacters instead of quoting them. For a common example, a URL with a & in it will be parsed by the shell as a background command unless the metacharacter is escaped or quoted:
$ wget http://example.com/q&uack
[1] wget http://example.com/q
-bash: uack: command not found
(Of course, this also happens if the URL is in an unquoted variable.) For a static string, single quotes make the most sense, although any form of quoting or escaping works here.
wget 'http://example.com/q&uack' # Single quotes preferred for a static string
wget "http://example.com/q&uack" # Double quotes work here, too (no $ or ` in the value)
wget http://example.com/q\&uack # Backslash escape
wget http://example.com/q'&'uack # Only the metacharacter really needs quoting
The last example also suggests another useful concept, which I like to call "seesaw quoting". If you need to mix single and double quotes, you can use them adjacent to each other. For example, the following quoted strings
'$HOME '
"isn't"
' where `<3'
"' is."
can be pasted together back to back, forming a single long string after tokenization and quote removal.
$ echo '$HOME '"isn't"' where `<3'"' is."
$HOME isn't where `<3' is.
This isn't awfully legible, but it's a common technique and thus good to know.
As an aside, scripts should usually not use ls for anything. To expand a wildcard, just ... use it.
$ printf '%s\n' $pattern # not ``ls -1 $pattern''
file1.txt
file_other.txt
$ for file in $pattern; do # definitely, definitely not ``for file in $(ls $pattern)''
> printf 'Found file: %s\n' "$file"
> done
Found file: file1.txt
Found file: file_other.txt
(The loop is completely superfluous in the latter example; printf specifically works fine with multiple arguments. stat too. But looping over a wildcard match is a common problem, and frequently done incorrectly.)
A variable containing a list of tokens to loop over or a wildcard to expand is less frequently seen, so we sometimes abbreviate to "quote everything unless you know precisely what you are doing".
Here is a three-point formula for quotes in general:
Double quotes
In contexts where we want to suppress word splitting and globbing. Also in contexts where we want the literal to be treated as a string, not a regex.
Single quotes
In string literals where we want to suppress interpolation and special treatment of backslashes. In other words, situations where using double quotes would be inappropriate.
No quotes
In contexts where we are absolutely sure that there are no word splitting or globbing issues or we do want word splitting and globbing.
Examples
Double quotes
literal strings with whitespace ("StackOverflow rocks!", "Steve's Apple")
variable expansions ("$var", "${arr[@]}")
command substitutions ("$(ls)", "`ls`")
globs where directory path or file name part includes spaces ("/my dir/"*)
to protect single quotes ("single'quote'delimited'string")
Bash parameter expansion ("${filename##*/}")
Single quotes
command names and arguments that have whitespace in them
literal strings that need interpolation to be suppressed ( 'Really costs $$!', 'just a backslash followed by a t: \t')
to protect double quotes ('The "crux"')
regex literals that need interpolation to be suppressed
use ANSI-C quoting ($'...') for literals involving special characters ($'\n\t')
use ANSI-C quoting ($'...') where we need to protect several single and double quotes ($'{"table": "users", "where": "first_name"=\'Steve\'}')
No quotes
around standard numeric variables ($$, $?, $# etc.)
in arithmetic contexts like ((count++)), "${arr[idx]}", "${string:start:length}"
inside [[ ]] expression which is free from word splitting and globbing issues (this is a matter of style and opinions can vary widely)
where we want word splitting (for word in $words)
where we want globbing (for txtfile in *.txt; do ...)
where we want ~ to be interpreted as $HOME (~/"some dir" but not "~/some dir")
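The tilde case from the last item can be checked like this. HOME is set to a made-up value here only so that the output is predictable:

```shell
HOME=/home/me                    # hypothetical value, for the demonstration only

unquoted_tilde=~/"some dir"      # tilde expands; the quoted part keeps its space
quoted_tilde="~/some dir"        # tilde stays a literal ~ character

printf '%s\n' "$unquoted_tilde" "$quoted_tilde"
```

This prints /home/me/some dir followed by ~/some dir.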
See also:
Difference between single and double quotes in Bash
What are the special dollar sign shell variables?
Quotes and escaping - Bash Hackers' Wiki
When is double quoting necessary?
I generally quote, as in "$var", to be safe, unless I am sure that $var does not contain spaces.
I do use $var as a simple way to join lines:
lines="`cat multi-lines-text-file.txt`"
echo "$lines" ## multiple lines
echo $lines ## all spaces (including newlines) are zapped
Whenever the https://www.shellcheck.net/ plugin for your editor tells you to.
Hi, I'm trying to add a defined text area (%-74s) using sed and printf in a Tcl script I have, but I'm not sure how to add the printf info to the line of code I have:
puts $f "sed -i "s/XXXTLEXXX/\$1/\" /$file";
I've tried a few combinations, but they all give errors.
Any help would be greatly appreciated.
Your problem is that you need to print a string with limited substitutions in it, yet that string contains $, " and \ characters. Those special characters mean that using a normal double-quoted word in Tcl is very awkward; you could use lots of backslashes to quote the Tcl metacharacters, but that's horrible when most of the string is in another language (shell/sed in your case). Here is a better option with string map and a brace-quoted word (which is free of substitutions):
set str {sed -i "s/XXXTLEXXX/$1/" /%FILE%}
puts $f [string map [list "%FILE%" $file] $str]
Note that you can do multiple substitutions in one string map, and that it does each substitution wherever it can. You can use a multi-line literal too. (%FILE% was chosen to be a literal that didn't otherwise occur in the string. Pick your own as you need them, but putting the name in helps with readability.)
I have a binary file containing some file paths. If the path starts with a certain string, the rest of the file path [\x20-\x7f]+ should be masked, leaving the general structure and size of the file intact!
So with a list of paths to search for is this:
/usr/local/bin/
/home/joe/
Then an occurrence like this in the binary data:
^@^@^@^@/home/joe/documents/hello.docx^@^@^@^@
Should be changed to this:
^@^@^@^@/home/joe/********************^@^@^@^@
What is the best way to do this? Do sed, perl or awk have a way? Or do I have to write a C or PHP program where I find the string and write strlen() number of mask characters in its place?
perl is a good choice for working on binary data. For sed and awk, only the GNU implementations can generally cope with binary data, the other ones would choke on the NUL byte or on long sequences between two newline characters, or on non-terminated lines.
perl -pi.back -e 's{(/usr/local/bin|/home/joe)/\K[\x20-\x7f]+}{
$& =~ s/./*/rg}ge' binary-file
You'd need not too old a version of perl for the /r flag (returns the result of the substitution instead of applying it on the variable) and \K (reset the start of the matched string).
By default, perl -p works on one line at a time. Since the newline character is not part of [\x20-\x7f], that's fine.
Here is some perl code that works, though I'm sure it can be optimised. It is a filter, so it reads all of stdin into $data, then for each string in the array @dirs it does a substitute for the pattern. The replacement however is not a fixed string but a function call replace($dir,$1) which is evaluated because of the e modifier to the substitute command.
#!/usr/bin/perl
use strict;
use warnings;
# Return the directory followed by one * for each character of the rest.
sub replace{
    my ($dir,$rest) = @_;
    $rest =~ s/./*/g;
    return $dir.$rest;
}
my @dirs = ('/usr/local/bin/','/home/joe/');
my $data = join("",<STDIN>);
foreach my $dir (@dirs){
    # \Q...\E makes the directory match literally, even if it contains regex metacharacters.
    $data =~ s|\Q$dir\E([\x20-\x7f]+)|replace($dir,$1)|ge;
}
print $data;
The function is given 2 arguments, the directory and the captured part of the pattern. It returns these concatenated after replacing each character in the captured string.
I have a large csv file with about 100 double quoted "text fields" per line. Many of the lines have a \r\n embedded in a double quoted field. The \r\n pair is also used for line termination.
How can the \r\n pairs be removed from the double quoted fields and not impact the \r\n line terminations.
I have tried creating individual sed scripts to identify the particular embedded sequences. That sort of worked, but the number of scripts became unmanageable. I have also tried using a tr -d '\r' command, but that did not work.
For this purpose you can use perl.
perl -0pe 's/"(.*?)([[:space:]]+)(.*?)"/"\1 \3"/g' input.csv > output.csv
perl -pe 's/../../g' file is analogous to sed -e 's/../../g' file but using Perl compatible regular expressions (PCRE).
With the option zero perl -0 we are considering the null character as input record separator (see xargs -0 ... or find . -print0 ...).
All versions of sed support Basic Regular Expressions (BRE), and some versions, such as GNU sed (using sed -E), also support Extended Regular Expressions (ERE). In both BRE and ERE the quantifiers * and + are always greedy: they match as many characters as possible. In some contexts, however, we need to match as few characters as possible. With PCRE we can make the quantifiers * and + lazy (also called minimal or reluctant) by writing *? and +? respectively.
In order to remove \r\n inside double quotes while preserving any other line break, we use lazy matching: "(.*?)([[:space:]]+)(.*?)". We also use the backreferences "\1 \3", omitting the second capture group, which contains any sequence of whitespace characters (\t, \r, \n, \v or \f), in particular sequences of \r\n.
This approach works in CSV files in which "lines have a \r\n embedded in a double quoted field" as is your case. If input.csv contains cells with three or more paragraphs separated by \r\n, we must modify the regular expression.
I've been racking my brain with this for the past half an hour and everything I've tried so far has failed miserably!
Within an html file, there is a field within tags, but the field itself is not separated with a space from the > sign so it's hard to read with awk. I would basically like to add a single space after the opening tag, but gsub and awk are refusing to cooperate.
I've tried
awk 'gsub("class\\\'\\\'>","class\\\'\\\'>")' filename
since one backslash is needed to escape the single quote, the second to escape the backslash itself, and the third to escape the sequence \' but Terminal (I'm working on a Mac) refuses to execute, and instead goes in the next line awaiting some other input from me.
Please help :(
In Bash, single quotes accept absolutely no kind of escape. Suppose e.g. I write this command:
$ echo '\''
>
Bash will consider the string opened by ' closed at the second ', generating a string containing only \. The next ', then, is considered the opening of a new string, so bash expects more input on the next line (signalled by the >).
If you are not aware of this fact, you may think that the string after the echo command below is still open, but it is closed:
$ echo 'will this string contain a single quote like \'
will this string contain a single quote like \
So, when you write
'gsub("class\\\'\\\'>","class\\\'\\\'> ")'
you are writing the string gsub("class\\\ concatenated with a backslash and a quote (\'); then a greater-than sign. After this, the "," is interpreted as a string containing a comma, because the single quote at the beginning of the expression was closed before. For now, the result is:
gsub("class\\\\'>,
After the comma, you have the string class, followed by a backslash and a quote, followed by another backslash and another quote, and finally by a greater than symbol and a space. This is the current string:
gsub("class\\\\'>,class\'\'>
This is not a valid awk expression! Anyway, it gets worse: the double quote " will start a string, which will contain a closing parenthesis and a single quote, but this string is never closed!
Summing up, your problem is that, if you open a string with ' in Bash, it will be closed at the next ', no matter how many backslashes you put before it.
Solution: you can play some tricks by opening and closing strings with ' and ", but that becomes cumbersome quickly. My suggested solution is to put your awk expression in a file. Then use awk's -f flag, which makes awk execute the program in the given file:
$ cat filename # The file to be changed
class''>
class>
class''>
$ cat mycode.awk # The awk script
gsub("class''>", "class''>[PSEUDOSPACE]")
$ awk -f mycode.awk filename # THE RESULT!
class''>[PSEUDOSPACE]
class''>[PSEUDOSPACE]
If you do not want to write a file, use a so-called here document:
$ awk -f- filename <<EOF
gsub("class''>", "class''>[PSEUDOSPACE]")
EOF
class''>[PSEUDOSPACE]
class''>[PSEUDOSPACE]
The problem is that you are escaping the ', so you are not finishing the command. For example:
echo \' > foo
echoes a single quote into the file named foo, and
echo \\\' > foo
writes a single backslash followed by a single quote.
In particular, you cannot escape a single quote inside a string, so
'foo\'bar'
is the string foo\ followed by the string bar followed by an unmatched open quote. It is exactly the same as writing "foo\\"bar'
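One way around the problem, sketched here as a suggestion rather than as the only fix, is to keep the single quote out of the awk program entirely and pass it in as a variable (q and out are illustrative names):

```shell
# The awk program stays inside plain single quotes; the quote character
# itself arrives through -v, where double quotes can protect it.
out=$(printf "class''>text\n" | awk -v q="'" '{ gsub("class" q q ">", "& "); print }')

# In the replacement, & stands for the matched text, so a space is appended to it.
printf '%s\n' "$out"
```

This prints class''> text, with a single space added after the >.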