I have a binary file containing some file paths. If a path starts with a certain string, the rest of the file path (matching [\x20-\x7f]+) should be masked, leaving the general structure and size of the file intact!
So if the list of paths to search for is this:
/usr/local/bin/
/home/joe/
Then an occurrence like this in the binary data:
^#^#^#^#/home/joe/documents/hello.docx^#^#^#^#
Should be changed to this:
^#^#^#^#/home/joe/********************^#^#^#^#
What is the best way to do this? Do sed, perl or awk have a way? Or do I have to write a C or PHP program where I find the string and write strlen() number of mask characters in its place?
perl is a good choice for working on binary data. For sed and awk, only the GNU implementations can generally cope with binary data; the others would choke on NUL bytes, on long stretches between two newline characters, or on non-terminated lines.
perl -pi.back -e 's{(/usr/local/bin|/home/joe)/\K[\x20-\x7f]+}{
$& =~ s/./*/rg}ge' binary-file
You'd need a not-too-old perl (5.14 or newer) for the /r flag (which returns the result of the substitution instead of applying it to the variable) and for \K (which resets the start of the matched string).
By default, perl -p works on one line at a time; since the newline character is not part of [\x20-\x7f], that's fine.
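As a quick sanity check, here is a sketch where \1 bytes stand in for the binary padding from the example (cat -v renders them as ^A):
$ printf '\1\1/home/joe/documents/hello.docx\1\1' |
>   perl -pe 's{(/usr/local/bin|/home/joe)/\K[\x20-\x7f]+}{$& =~ s/./*/rg}ge' |
>   cat -v
^A^A/home/joe/********************^A^A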
Here is some perl code that works, though I'm sure it can be optimised. It is a filter, so it reads all of stdin into $data; then, for each string in the array @dirs, it performs a substitution on the pattern. The replacement, however, is not a fixed string but a function call, replace($dir,$1), which is evaluated because of the e modifier on the substitute command.
#!/usr/bin/perl
use strict;

sub replace {
    my ($dir, $rest) = @_;
    $rest =~ s/./*/g;
    return $dir . $rest;
}

my @dirs = ('/usr/local/bin/', '/home/joe/');
my $data = join("", <STDIN>);

foreach my $dir (@dirs) {
    $data =~ s|$dir([\x20-\x7f]+)|replace($dir, $1)|ge;
}
print $data;
The function is given 2 arguments, the directory and the captured part of the pattern. It returns these concatenated after replacing each character in the captured string.
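Saved as, say, mask.pl (a name chosen here just for illustration) and made executable, the filter would be run as:
$ ./mask.pl < binary-file > masked-file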
Related
Should I or should I not wrap quotes around variables in a shell script?
For example, is the following correct:
xdg-open $URL
[ $? -eq 2 ]
or
xdg-open "$URL"
[ "$?" -eq "2" ]
And if so, why?
General rule: quote it if it can either be empty or contain spaces (or any whitespace really) or special characters (wildcards). Not quoting strings with spaces often leads to the shell breaking apart a single argument into many.
$? doesn't need quotes since it's a numeric value. Whether $URL needs it depends on what you allow in there and whether you still want an argument if it's empty.
I tend to always quote strings just out of habit since it's safer that way.
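For instance (a sketch with a made-up URL containing a space):
$ URL='http://example.com/my file.pdf'
$ xdg-open $URL     # word splitting: xdg-open receives two arguments
$ xdg-open "$URL"   # the URL is passed as a single argument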
In short, quote everything where you do not require the shell to perform word splitting and wildcard expansion.
Single quotes protect the text between them verbatim. It is the proper tool when you need to ensure that the shell does not touch the string at all. Typically, it is the quoting mechanism of choice when you do not require variable interpolation.
$ echo 'Nothing \t in here $will change'
Nothing \t in here $will change
$ grep -F '#&$*!!' file /dev/null
file:I can't get this #&$*!! quoting right.
Double quotes are suitable when variable interpolation is required. With suitable adaptations, they are also a good workaround when you need single quotes in the string. (There is no straightforward way to escape a single quote between single quotes, because there is no escape mechanism inside single quotes -- if there were, they would not quote completely verbatim.)
$ echo "There is no place like '$HOME'"
There is no place like '/home/me'
No quotes are suitable when you specifically require the shell to perform word splitting and/or wildcard expansion.
Word splitting (aka token splitting):
$ words="foo bar baz"
$ for word in $words; do
> echo "$word"
> done
foo
bar
baz
By contrast:
$ for word in "$words"; do echo "$word"; done
foo bar baz
(The loop only runs once, over the single, quoted string.)
$ for word in '$words'; do echo "$word"; done
$words
(The loop only runs once, over the literal single-quoted string.)
Wildcard expansion:
$ pattern='file*.txt'
$ ls $pattern
file1.txt file_other.txt
By contrast:
$ ls "$pattern"
ls: cannot access file*.txt: No such file or directory
(There is no file named literally file*.txt.)
$ ls '$pattern'
ls: cannot access $pattern: No such file or directory
(There is no file named $pattern, either!)
In more concrete terms, anything containing a filename should usually be quoted (because filenames can contain whitespace and other shell metacharacters). Anything containing a URL should usually be quoted (because many URLs contain shell metacharacters like ? and &). Anything containing a regex should usually be quoted (ditto ditto). Anything containing significant whitespace other than single spaces between non-whitespace characters needs to be quoted (because otherwise, the shell will munge the whitespace into, effectively, single spaces, and trim any leading or trailing whitespace).
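For example, unquoted expansion squeezes inner whitespace runs into single spaces and trims the outer whitespace:
$ var='  hello     world  '
$ echo $var
hello world
$ echo "$var"
  hello     world  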
When you know that a variable can only contain a value which contains no shell metacharacters, quoting is optional. Thus, an unquoted $? is basically fine, because this variable can only ever contain a single number. However, "$?" is also correct, and recommended for general consistency and correctness (though this is my personal recommendation, not a widely recognized policy).
Values which are not variables basically follow the same rules, though you could then also escape any metacharacters instead of quoting them. For a common example, a URL with a & in it will be parsed by the shell as a background command unless the metacharacter is escaped or quoted:
$ wget http://example.com/q&uack
[1] wget http://example.com/q
-bash: uack: command not found
(Of course, this also happens if the URL is in an unquoted variable.) For a static string, single quotes make the most sense, although any form of quoting or escaping works here.
wget 'http://example.com/q&uack' # Single quotes preferred for a static string
wget "http://example.com/q&uack" # Double quotes work here, too (no $ or ` in the value)
wget http://example.com/q\&uack # Backslash escape
wget http://example.com/q'&'uack # Only the metacharacter really needs quoting
The last example also suggests another useful concept, which I like to call "seesaw quoting". If you need to mix single and double quotes, you can use them adjacent to each other. For example, the following quoted strings
'$HOME '
"isn't"
' where `<3'
"' is."
can be pasted together back to back, forming a single long string after tokenization and quote removal.
$ echo '$HOME '"isn't"' where `<3'"' is."
$HOME isn't where `<3' is.
This isn't awfully legible, but it's a common technique and thus good to know.
As an aside, scripts should usually not use ls for anything. To expand a wildcard, just ... use it.
$ printf '%s\n' $pattern # not ``ls -1 $pattern''
file1.txt
file_other.txt
$ for file in $pattern; do # definitely, definitely not ``for file in $(ls $pattern)''
> printf 'Found file: %s\n' "$file"
> done
Found file: file1.txt
Found file: file_other.txt
(The loop is completely superfluous in the latter example; printf specifically works fine with multiple arguments. stat too. But looping over a wildcard match is a common problem, and frequently done incorrectly.)
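That is, the wildcard match can be handed to printf directly:
$ printf 'Found file: %s\n' $pattern
Found file: file1.txt
Found file: file_other.txt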
A variable containing a list of tokens to loop over or a wildcard to expand is less frequently seen, so we sometimes abbreviate to "quote everything unless you know precisely what you are doing".
Here is a three-point formula for quotes in general:
Double quotes
In contexts where we want to suppress word splitting and globbing. Also in contexts where we want the literal to be treated as a string, not a regex.
Single quotes
In string literals where we want to suppress interpolation and special treatment of backslashes. In other words, situations where using double quotes would be inappropriate.
No quotes
In contexts where we are absolutely sure that there are no word splitting or globbing issues or we do want word splitting and globbing.
Examples
Double quotes
literal strings with whitespace ("StackOverflow rocks!", "Steve's Apple")
variable expansions ("$var", "${arr[@]}")
command substitutions ("$(ls)", "`ls`")
globs where directory path or file name part includes spaces ("/my dir/"*)
to protect single quotes ("single'quote'delimited'string")
Bash parameter expansion ("${filename##*/}")
Single quotes
command names and arguments that have whitespace in them
literal strings that need interpolation to be suppressed ( 'Really costs $$!', 'just a backslash followed by a t: \t')
to protect double quotes ('The "crux"')
regex literals that need interpolation to be suppressed
use shell quoting for literals involving special characters ($'\n\t')
use shell quoting where we need to protect several single and double quotes ($'{"table": "users", "where": "first_name"=\'Steve\'}')
No quotes
around standard numeric variables ($$, $?, $# etc.)
in arithmetic contexts like ((count++)), and in the subscript/offset parts of ${arr[idx]} and ${string:start:length}
inside [[ ]] expression which is free from word splitting and globbing issues (this is a matter of style and opinions can vary widely)
where we want word splitting (for word in $words)
where we want globbing (for txtfile in *.txt; do ...)
where we want ~ to be interpreted as $HOME (~/"some dir" but not "~/some dir")
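To illustrate the last point (assuming a home directory of /home/me, as in the earlier examples):
$ echo ~/"some dir"
/home/me/some dir
$ echo "~/some dir"
~/some dir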
See also:
Difference between single and double quotes in Bash
What are the special dollar sign shell variables?
Quotes and escaping - Bash Hackers' Wiki
When is double quoting necessary?
I generally use quoted strings like "$var" to be safe, unless I am sure that $var does not contain spaces.
I do use $var as a simple way to join lines:
lines="`cat multi-lines-text-file.txt`"
echo "$lines" ## multiple lines
echo $lines ## all spaces (including newlines) are zapped
Whenever the https://www.shellcheck.net/ plugin for your editor tells you to.
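For example (a sketch; output abridged, and the exact layout varies between ShellCheck versions):
$ shellcheck myscript.sh

In myscript.sh line 3:
xdg-open $URL
         ^-- SC2086: Double quote to prevent globbing and word splitting.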
I am trying to exclude certain words from a dictionary file.
# cat en.txt
test
testing
access/p
batch
batch/n
batches
cross
# cat exclude.txt
test
batch
# grep -vf exclude.txt en.txt
access/p
cross
Words like "testing" and "batches" should be included in the results.
expected result:
testing
access/p
batches
cross
This is because the word "batch" may or may not be followed by a slash "/". There can be one or more tags after the slash (n in this case). But the word "batches" is a different word and should not match "batch".
I would harness GNU AWK for this task in the following way: let en.txt content be
test
testing
access/p
batch
batch/n
batches
cross
and exclude.txt content be
test
batch
then
awk 'BEGIN{FS="/"}FNR==NR{arr[$1];next}!($1 in arr)' exclude.txt en.txt
gives output
testing
access/p
batches
cross
Explanation: I inform GNU AWK that / is the field separator (FS). Then, when processing the first file (where the global record number equals the record number within the current file, that is, FNR==NR), I simply use the 1st column value as a key in the array arr and go to the next line, so nothing else happens. For the 2nd (and any following) file, I select lines whose 1st column is not (!) one of the keys of array arr.
(tested in GNU Awk 5.0.1)
Using grep matching whole words:
grep -wvf exclude.txt en.txt
Explanation (from man grep)
-w, --word-regexp      Select only those lines containing matches that form whole words.
-v, --invert-match     Invert the sense of matching, to select non-matching lines.
-f FILE, --file=FILE   Obtain patterns from FILE, one per line.
Output
testing
access/p
batches
cross
Since there are many words in a dictionary that may have a root in one of those to exclude, we cannot conveniently† use a look-up hash (built from the exclude list), but have to check all of them. One way to do that more efficiently is to use an alternation pattern built from the exclude list:
use warnings;
use strict;
use feature 'say';
use Path::Tiny; # to read ("slurp") a file conveniently
my $excl_file = 'exclude.txt';
my $re_excl = join '|', split /\n/, path($excl_file)->slurp;
$re_excl = qr($re_excl);
while (<>) {
    chomp;
    if ( m{^ $re_excl (?:/.)? $}x ) {
        # say "Skip printing (so filter out): $_";
        next;
    }
    say;
}
This is used as program.pl dictionary-filename and it prints the filtered list.
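With the en.txt and exclude.txt shown in the question, that gives:
$ perl program.pl en.txt
testing
access/p
batches
cross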
Here I've assumed that what may follow the root-word to exclude is / followed by one character, (?:/.)?, since examples in the question use that and there is no precise statement on it. The pattern also assumes no spaces around the word.
Please adjust as/if needed for what may actually follow /. For example, it'd be (?:/.+)? for at least one character, (?:/[np])? for any character from a specific list (n or p), (?:/[^xy]+)? for any characters not in a given list, etc.
The qr operator forms a proper regex pattern.
† One can still first strip non-word endings, then use a look-up, then put back those endings:
use Path::Tiny;  # to read a file conveniently

my %lu = map { $_ => 1 } path($excl_file)->lines({ chomp => 1 });

while (<>) {
    chomp;
    # [^\w-] protects hyphenated words (or just use \W)
    # Or: s{(/.+)$}{}; if "/" is the only possibility
    my $end = s/([^\w-].+)$// ? $1 : '';   # don't rely on $1 staying fresh across lines
    next if exists $lu{$_};
    say $_ . $end;
}
This will be far more efficient on large dictionaries and long lists of excluded words. However, it is also far more complex and probably fails some (unstated) requirements.
Hi, I'm trying to add a defined text area %-74s using sed and printf in a tcl script I have, but I'm not sure how to add the printf info to the line of code I have:
puts $f "sed -i "s/XXXTLEXXX/\$1/\" /$file";
Any help would be greatly appreciated. I've tried a few combinations, but they all error.
Your problem is that you need to print a string with limited substitutions in it, yet that string contains $, " and \ characters. Those special characters mean that using a normal double-quoted word in Tcl is very awkward; you could use lots of backslashes to quote the Tcl metacharacters, but that's horrible when most of the string is in another language (shell/sed in your case). Here is a better option with string map and a brace-quoted word (which is free of substitutions):
set str {sed -i "s/XXXTLEXXX/$1/" /%FILE%}
puts $f [string map [list "%FILE%" $file] $str]
Note that you can do multiple substitutions in one string map, and that it does each substitution wherever it can. You can use a multi-line literal too. (%FILE% was chosen to be a literal that didn't otherwise occur in the string. Pick your own as you need them, but putting the name in helps with readability.)
Find/replace space with %20
I must replace with %20 all spaces in *.html files that occur inside href="something something .pdf".
I found a regular expression for that task:
find : href\s*=\s*['"][^'" ]*\K\h|(?!^)\G[^'" ]*\K\h
replace : %20
That regular expression works in text editors like Notepad++ or Geany.
I want to use that regular expression from the Linux command line with sed or perl.
Solution (1):
cat test002.html | perl -ne 's/href\s*=\s*['\''"][^'\''" ]*\K\h|(?!^)\G[^'\''" ]*\K\h/%20/g; print;' > Work_OK01.html
Solution (2):
cat test002.html | perl -ne 's/href\s*=\s*[\x27"][^\x27" ]*\K\h|(?!^)\G[^\x27" ]*\K\h/%20/g; print;' > Work_OK02.html
The problem is that you don't properly escape the single quotes in your program.
If your program is
...[^'"]...
The shell literal might be
'...[^'\''"]...'
'...[^'"'"'"]...'
'...[^\x27"]...' # Avoids using a single quote to avoid escaping it.
So, you were going for
perl -ne 's/href\s*=\s*['\''"][^'\''" ]*\K\h|(?!^)\G[^'\''" ]*\K\h/%20/g; print;'
Don't try to do everything at once. Here are some far cleaner (i.e. far more readable) solutions:
perl -pe's{href\s*=\s*['\''"]\K([^'\''"]*)}{ $1 =~ s/ /%20/rg }eg' # 5.14+
perl -pe's{href\s*=\s*['\''"]\K([^'\''"]*)}{ (my $s = $1) =~ s/ /%20/g; $s }eg'
Note that -p is the same as -n, except that it causes a print to be performed for each line.
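A quick check on a made-up snippet:
$ echo '<a href="some file name.pdf">' |
>   perl -pe's{href\s*=\s*['\''"]\K([^'\''"]*)}{ $1 =~ s/ /%20/rg }eg'
<a href="some%20file%20name.pdf">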
The above solutions make a large number of assumptions about the files that might be encountered[1]. All of these assumptions would go away if you use a proper parser.
If you have HTML files:
perl -MXML::LibXML -e'
my $doc = XML::LibXML->new->parse_file($ARGV[0]);
$_->setValue( $_->getValue() =~ s/ /%20/gr )
for $doc->findnodes(q{//@href});
binmode(STDOUT);
print($doc->toStringHTML());
' in_file.html >out_file.html
If you have XML (incl XHTML) files:
perl -MXML::LibXML -e'
my $doc = XML::LibXML->new->parse_file($ARGV[0]);
$_->setValue( $_->getValue() =~ s/ /%20/gr )
for $doc->findnodes(q{//@href});
binmode(STDOUT);
$doc->toFH(\*STDOUT);
' in_file.html >out_file.html
Assumptions made by the substitution-based solutions:
File uses an ASCII-based encoding (e.g. UTF-8, iso-latin-1, but not UTF-16le).
No newline between href and =.
No newline between = and the value.
No newline in the value of href attributes.
Nothing matching /href\s*=/ in text (incl CDATA sections).
Nothing matching /href\s*=/ in comments.
No other attributes have a name ending in href.
No single quote (') in href="...".
No double quote (") in href='...'.
No href= with an unquoted value.
Spaces in href attributes aren't encoded using a character entity (e.g. &#32;).
Maybe more?
(SLePort makes similar assumptions even though they didn't document them. They also assume href attributes don't contain >.)
An XML parser would be better suited for that job (e.g. XMLStarlet, xmllint, ...), but if you don't have newlines in your a tags, the sed below should work.
Using the t command and backreferences, it loops over and replaces all spaces up to the last " inside the a tags:
$ sed ':a;s/\(<a [^>]*href=[^"]*"[^ ]*\) \([^"]*">\)/\1%20\2/;ta' <<< '<a href="http://url with spaces">'
<a href="http://url%20with%20spaces">
You seem to have neglected to escape the quotes inside the string you pass to Perl. So Bash sees you giving perl the following arguments:
s/href\s*=\s*[][^', which results from the concatenation of the single-quoted string 's/href\s*=\s*[' and the double-quoted string "][^'"
]*Kh, unquoted, because \K and \h are not special characters in the shell so it just treats them as K and h respectively
Then Bash sees a pipe character |, followed by a subshell invocation (?!^), in which !^ gets substituted with the first argument of the last command invoked. (See "History Expansion > Word Designators" in the Bash man page.) For example, if your last command was echo myface then (?!^) would look for the command named ?myface and runs it in a subshell.
And finally, Bash gets to the sequence \G[^'" ]*\K\h/%20/g; print;', which is interpreted as the concatenation of G (from \G), [^, and the single-quoted string " ]*\K\h/%20/g; print;. Bash has no idea what to do with G[^" ]*\K\h/%20/g; print;, since it just finished parsing a subshell invocation and expects to see a semicolon, line break, or logical operator (or so on) before getting another arbitrary string.
Solution: properly quote the expression you give to perl. You'll need to use a combination of single and double quotes to pull it off, e.g.
perl -ne 's/href\s*=\s*['"'\"][^'\" ]*"'\K\h|(?!^)\G[^'"'\" ]*"'\K\h/%20/g; print;'
I'm trying to pull variables from an API in JSON format and then put them back together, with one variable changed, and fire them back as a PUT.
The only issue is that every value has quote marks in it and must go back to the API separated by commas only.
Example of what it should see, with redacted information (the variables are inside the **'s):
curl -skv -u redacted:redacted -H Content-Type: application/json -X PUT -d'{properties:{basic:{request_rules:[**"/(req) testrule","/test-body","/(req) test - Admin","test-Caching"**]}}}' https://x.x.x.x:9070/api/tm/1.0/config/active/vservers/xxx-xx
Obviously, if I fire them as a plain array, I get spaces instead of commas. So I tried outputting it as a plain string:
longstr=$(echo ${valuez[@]})
output=$(echo $longstr |sed -e 's/" /",/g')
Due to the way bash interprets it, it seems to either handle the quotes wrong or something else. I guess it might well be the single ticks wrapping everything after the PUT -d as well, but I'm not sure how I can throw a variable into something that has single ticks.
If I put the raw data in manually it works, so it's either the way the variable is being sent or the single ticks. I don't get an error, and when I echo the line out it looks perfect.
Any ideas?
valuez=( "/(req) testrule" "/test-body" "/(req) test - Admin" "test-Caching" )
# Temporarily set IFS to some character which is known not to appear in the array.
oifs=$IFS
IFS=$'\014'
# Flatten the array with the * expansion giving a string containing the array's elements separated by the first character of $IFS.
d_arg="${valuez[*]}"
IFS=$oifs
# If necessary, quote or escape embedded quotation marks. (Implementation-specific, using doubled double quotes as an example.)
d_arg="${d_arg//\"/\"\"}"
# Substitute the known-to-be-absent character for the desired quote+separator+quote.
d_arg="${d_arg//$'\014'/\",\"}"
# Prepend and append quotes.
d_arg="\"$d_arg\""
# insert the prepared arg into the final string.
d_arg="{properties:{basic:{request_rules:[${d_arg}]}}}"
curl ... -d"$d_arg" ...
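With the sample valuez above, the assembled payload comes out as:
$ echo "$d_arg"
{properties:{basic:{request_rules:["/(req) testrule","/test-body","/(req) test - Admin","test-Caching"]}}}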
If you have gnu awk version 4 or above, which supports FPAT:
output=$(echo $longstr |awk '$1=$1' FPAT="(\"[^\"]+\")" OFS=",")
Explanation
FPAT #
This is a regular expression (as a string) that tells gawk to create the fields based on text that matches the regular expression. Assigning a value to FPAT overrides the use of FS and FIELDWIDTHS for field splitting. See Splitting By Content, for more information.
If gawk is in compatibility mode (see Options), then FPAT has no special meaning, and field-splitting operations occur based exclusively on the value of FS.
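For example, assuming (as this answer does) that the values in $longstr carry their literal quote marks:
$ longstr='"/(req) testrule" "/test-body" "/(req) test - Admin" "test-Caching"'
$ echo $longstr | awk '$1=$1' FPAT="(\"[^\"]+\")" OFS=","
"/(req) testrule","/test-body","/(req) test - Admin","test-Caching"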
valuez=( "/(req) testrule" "/test-body" "/(req) test - Admin" "test-Caching" )
csv="" sep=""
for v in "${valuez[@]}"; do csv+="$sep\"$v\""; sep=,; done
echo "$csv"
"/(req) testrule","/test-body","/(req) test - Admin","test-Caching"
If it's something you need to do repeatedly, put it into a function:
toCSV () {
    local csv sep val
    for val; do
        csv+="$sep\"$val\""
        sep=,
    done
    echo "$csv"
}
csv=$(toCSV "${valuez[@]}")