Why is my perl replace not working? - html

I have the following perl search and replace :
With Escapes :
perl -pi -w -e 's/ onclick=\"img=document.getElementById(\'img_1\')\; img.style.display = (img.style.display == \'none\' ? \'block\' : \'none\');return false"//' test.html
Without Escapes :
perl -pi -w -e 's/ onclick="img=document.getElementById('img_1'); img.style.display = (img.style.display == 'none' ? 'block' : 'none');return false"//' test.html
My objective is to replace onclick="img=document.getElementById('img_1'); img.style.display = (img.style.display == 'none' ? 'block' : 'none');return false" with nothing in the file test.html. What am I messing up?
I keep getting the error -sh: syntax error near unexpected token `)' which I suspect comes from a mistake in my escaping. Please help me out.

[ You didn't specify for which shell you are building the command. I'm going to assume sh or bash. ]
Problem #1: Converting text into a regex pattern
Many characters have a special meaning in regex patterns. Among those you used, (, ), . and ? are special. These need to be escaped.
If you want to match
onclick="img=document.getElementById('img_1'); img.style.display = (img.style.display == 'none' ? 'block' : 'none');return false"
You need to use the pattern
onclick="img=document\.getElementById\('img_1'\); img\.style\.display = \(img\.style\.display == 'none' \? 'block' : 'none'\);return false"
So your Perl code is
s/onclick="img=document\.getElementById\('img_1'\); img\.style\.display = \(img\.style\.display == 'none' \? 'block' : 'none'\);return false"//
Problem #2: Converting text (a Perl program) into a shell literal
This is the problem you asked about.
Sections quoted by single quotes ("'") end at the next single quote (unconditionally), so the following is wrong:
echo 'foo\'bar' # Third "'" isn't matched.
If you wanted to output "foo'bar", you could use
echo 'foo'\''bar' # Concatenation of "foo", "'" and "bar".
foo and bar are quoted by single quotes, and the single quote is escaped using \. \ works here because it's outside of single quotes.
So, the lesson is basically use '\'' instead of ' inside single quotes.
You want to pass the following program to Perl:
s/onclick="img=document\.getElementById\('img_1'\); img\.style\.display = \(img\.style\.display == 'none' \? 'block' : 'none'\);return false"//
To do so, you'll need to create a shell literal that produces the correct argument, meaning we want to surround the whole with single quotes and escape the existing single quotes. We get the following:
's/onclick="img=document\.getElementById\('\''img_1'\''\); img\.style\.display = \(img\.style\.display == '\''none'\'' \? '\''block'\'' : '\''none'\''\);return false"//'
As such, the entire command would be
perl -i -wpe's/onclick="img=document\.getElementById\('\''img_1'\''\); img\.style\.display = \(img\.style\.display == '\''none'\'' \? '\''block'\'' : '\''none'\''\);return false"//' test.html
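If you'd rather not do that escaping by hand, a small helper can do it for you. This is only a sketch: shell_quote is a hypothetical name, not a standard command, and it assumes a POSIX sh with sed available.

```shell
# shell_quote (hypothetical helper): wrap a string in single quotes,
# replacing every embedded ' with '\'' as described above.
shell_quote() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Print the shell literal for a short sample string.
shell_quote "getElementById('img_1')"
printf '\n'
```

You can paste the printed literal straight into a command line. Note that the command substitution drops trailing newlines, so this is meant for one-line programs like the one above.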

Use the following:
perl -lnE 'say quotemeta $_'
and feed it with your plain input:
onclick="img=document.getElementById('img_1'); img.style.display = (img.style.display == 'none' ? 'block' : 'none');return false"
you will get:
onclick\=\"img\=document\.getElementById\(\'img_1\'\)\;\ img\.style\.display\ \=\ \(img\.style\.display\ \=\=\ \'none\'\ \?\ \'block\'\ \:\ \'none\'\)\;return\ false\"
So using it:
perl -i -pe "s/onclick\=\"img\=document\.getElementById\(\'img_1\'\)\;\ img\.style\.display\ \=\ \(img\.style\.display\ \=\=\ \'none\'\ \?\ \'block\'\ \:\ \'none\'\)\;return\ false\"//" test.html
should work.

Some things are easier when you don't use a one-liner. There are fewer things to escape, and the result is less fragile.
use strict;
use warnings;
my $literal_string = q{ onclick="img=document.getElementById('img_1'); img.style.display = (img.style.display == 'none' ? 'block' : 'none');return false"};
while (<>) {
s/\Q$literal_string//;
print;
}
Then execute the script:
perl -i thescript.pl test.html

Related

Strange interaction between print and the ternary conditional operator

Ran into a strange interaction between print and the ternary conditional operator that I don't understand. If we do...:
print 'foo, ' . (1 ? 'yes' : 'no') . ' bar';
...then we get the output...:
foo, yes bar
...as we would expect. However, if we do...:
print (1 ? 'yes' : 'no') . ' bar';
...then we just get the output...:
yes
Why isn't " bar" getting appended to the output in the second case?
Let's do it, but for real -- that is, with warnings on
perl -we'print (1 ? "yes" : "no") . " bar"'
It prints
print (...) interpreted as function at -e line 1.
Useless use of concatenation (.) or string in void context at -e line 1.
yes
(but no newline at the end)
So since (1 ? "yes" : "no") is taken as the argument list for the print function then the ternary is evaluated to yes and that is the argument for print and so that, alone, is printed. As this is a known "gotcha," which can easily be done in error, we are kindly given a warning for it.
Then the string " bar" is concatenated (to the return value of print, which is 1), which is meaningless in void context, and for which we also get a warning.
One workaround is to prepend a +, forcing the interpretation of () as an expression
perl -we'print +(1 ? "yes" : "no") . " bar", "\n"'
Or call print as a function properly, with full parentheses
perl -we'print( (1 ? "yes" : "no") . " bar", "\n" )'
where I've added the newline in both cases.
See this post for a detailed discussion of a related example and precise documentation links.
If the first non-whitespace character after a function name is an opening parenthesis, then Perl will interpret that as the start of the function's parameter list and the matching closing parenthesis will be used as the end of the parameter list. This is one of the things that use warnings will tell you about.
The usual fix is to insert a + before the opening parenthesis.
$ perl -e "print (1 ? 'yes' : 'no') . ' bar'"
yes
$ perl -e "print +(1 ? 'yes' : 'no') . ' bar'"
yes bar

Bash regex match on JSON structure

I have a JSON structure like this which I've read into a bash variable as a string:
{
"elem1": "val1",
"THEELEM": "THEVAL",
"elem3": "val3"
}
I want to use regex to match on "THEELEM": "THEVAL". It works if I try individual words, where the JSON is stored in result as a string:
[[ $result =~ THEVAL ]] && echo "yes"
But I want to match on the key-pair like this:
[[ $result =~ "THEELEM": "THEVAL" ]] && echo "yes"
That gives me syntax issues. I've tried escaping, single-quotes and triple quotes to no avail. Any help appreciated.
Quoting works for me.
[[ $result =~ '"THEELEM": "THEVAL"' ]] && echo "yes"
Note that quoting the pattern disables recognition of special regular expression characters, and just searches for the literal substring. This is not a problem here, since you don't have any wildcards or other non-literal pattern characters. But if you did, you'd have to put the pattern in a variable, as in @noah's answer.
You can create a variable $expr to hold the string you want to match to and then use that for the regex.
expr='"THEELEM": "THEVAL"'
[[ $result =~ $expr ]] && echo "yes"
Inspired by this Stack Overflow post
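To see the difference between the quoted and the variable form, here is a small sketch. It assumes bash 3.2 or later, and the pattern with [[:space:]]* is my own variation on the question's pattern, added only to have a real regex metacharacter in it.

```shell
#!/usr/bin/env bash
# Sample JSON from the question, stored as a string.
result='{
"elem1": "val1",
"THEELEM": "THEVAL",
"elem3": "val3"
}'

# A pattern with a real regex construct: any amount of whitespace after the colon.
expr='"THEELEM":[[:space:]]*"THEVAL"'

# Unquoted expansion: the contents of $expr are used as a regular expression.
if [[ $result =~ $expr ]]; then regex_match=yes; else regex_match=no; fi

# Quoted expansion: the contents are searched for as a literal substring.
if [[ $result =~ "$expr" ]]; then literal_match=yes; else literal_match=no; fi

echo "regex: $regex_match, literal: $literal_match"
```

The unquoted form matches, while the quoted form does not, because the JSON does not contain the literal text [[:space:]]*.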

Use regular expression to extract img tag from HTML in Perl

I need to extract a captcha image from a URL and recognize it with Tesseract.
My code is:
#!/usr/bin/perl -X
###
$user = 'user'; #Enter your username here
$pass = 'pass'; #Enter your password here
###
#Server settings
$home = "http://perltest.adavice.com";
$url = "$home/c/test.cgi?u=$user&p=$pass";
###Add code here!
#Grab img from HTML code
#if ($html =~ /<img. *?src. *?>/)
#{
# $img1 = $1;
#}
#else
#{
# $img1 = "";
#}
$img2 = grep(/<img. *src=.*>/,$html);
if ($html =~ /\img[^>]* src=\"([^\"]*)\"[^>]*/)
{
my $takeImg = $1;
my @dirs = split('/', $takeImg);
my $img = $dirs[2];
}
else
{
print "Image not found\n";
}
###
die "<img> not found\n" if (!$img);
#Download image to server (save as: ocr_me.img)
print "GET '$img' > ocr_me.img\n";
system "GET '$img' > ocr_me.img";
###Add code here!
#Run OCR (using shell command tesseract) on img and save text as ocr_result.txt
system("tesseract ocr_me.img ocr_result");
print "GET '$txt' > ocr_result.txt\n";
system "GET '$txt' > ocr_result.txt";
###
die "ocr_result.txt not found\n" if (!-e "ocr_result.txt");
# check OCR results:
$txt = 'cat ocr_result.txt';
$txt =~ s/[^A-Za-z0-9\-_\.]+//sg;
$img =~ s/^.*\///;
print `echo -n "file=$img&text=$txt" | POST "$url"`;
As you can see, I'm trying to extract the img src attribute. The solution from "use shell command tesseract in perl script to print a text output" ($img1) did not work for me. I also tried an adapted version of the solution from "How can I extract URL and link text from HTML in Perl?" ($img2).
If you need the HTML code from that page, here it is:
<html>
<head>
<title>Perl test</title>
</head>
<body style="font: 18px Arial;">
<nobr>somenumbersimg src="/JJ822RCXHFC23OXONNHR.png"
somenumbers<img src="/captcha/1533030599.png"/>
somenumbersimg src="/JJ822RCXHFC23OXONNHR.png" </nobr><br/><br/><form method="post" action="?u=user&p=pass">User: <input name="u"/><br/>PW: <input name="p"/><br/><input type="hidden" name="file" value="1533030599.png"/>Text: <input name="text"></br><input type="submit"></form><br/>
</body>
</html>
I get the error that the image was not found. I think my problem is a wrong regular expression. I cannot install any modules such as HTTP::Parser or similar.
Aside from the fact that using regular expressions on HTML isn't very reliable, your regular expression in the following code isn't going to work because it's missing a capture group, so $1 won't be assigned a value.
if ($html =~ /<img. *?src. *?>/)
{
$img = $1;
}
If you want to extract parts of text using a regular expression, you need to put that part inside parentheses. For example:
$example = "hello world";
$example =~ /(hello) world/;
This will set $1 to "hello".
The regular expression itself doesn't make much sense: where you have ". *?", that'll match any single character followed by zero or more spaces. Is that a typo for ".*?", which matches any number of characters but, unlike the greedy ".*", stops as soon as the next part of the regex can match?
This regular expression is possibly closer to what you're looking for. It'll match the first img tag that has a src attribute that starts with "/captcha/" and store the image URL in $1
$html =~ m%<img[^>]*src="(/captcha/[^"]*)"%s;
To break down how it works: "m%....%" is just a different way of writing "/.../" that lets you put slashes in the regex without needing to escape them. "[^>]*" matches zero or more of any character except ">", so it won't run past the end of the tag. "(/captcha/[^"]*)" uses a capture group to grab everything inside the double quotes, which is the URL. The "/s" modifier on the end makes "." match newlines as well; this pattern contains no ".", so it isn't actually needed, and because "[^>]*" already matches newlines, the pattern still works if the img tag is split over multiple lines.

Function eats my space characters

I wrote a function
function! BrickWrap(data, icon_align, icon_left, icon_right)
if empty(a:data)
return ''
endif
let l:brick = (a:icon_align ==# 'left') ? a:icon_left . a:data :
\ (a:icon_align ==# 'right') ? a:data . a:icon_right :
\ (a:icon_align ==# 'surround') ? a:icon_left . a:data . a:icon_right :
\ (a:icon_align ==# 'none') ? a:data :
\ ''
if empty(l:brick)
echom 'BrickWrap(): Argument "icon_align" must be "left", "right", "surround" or "none".'
return ''
endif
return l:brick
endfunction
in order to format some data which gets displayed inside my statusline, e.g.:
set statusline+=%#User2#%{BrickWrap('some_data','surround','\ ','\ ')}
The example above should wrap the data with a space character on each side. But what actually happens is that it only appends a space character to the right, not to the left. In order to get a space character displayed on the left I have to pass two escaped space characters ('\ \ '). I should mention that this only happens in the statusline. If I :call the function directly, it works as expected.
Why does this happen?
To use backslashes and whitespace with :set, you need additional escaping, see :help option-backslash. So, your backslash in front of the space is already taken by the :set command. (You can check this via :set stl?)
If coming up with the correct escaping is too hard, an alternative is to use the :let command instead:
:let &statusline = '...'
However, you must then use only double quotes inside your statusline value, or deal with the quote-within-quote escaping.

How to check for a value before '.' in BASH

I'm outputting some values in JSON format, and it appears that a value starting with a '.' isn't valid JSON (the API doesn't seem to like these numbers inside " "). What would be the best way to check whether my value has anything in front of the '.', and if not, put a 0 there?
i.e.
value = .53
newvalue = 0.53
I'm not very good at doing anything more than simple functions in BASH at the moment, still trying to learn awk/sed and other useful things such as cut.
There might be a number of possible solutions given the nature of the input. However, given those unknowns an easy workaround would be to say:
[[ $value == \.* ]] && newvalue=0${value}
Example:
$ value=.42
$ [[ $value == \.* ]] && newvalue=0${value}
$ echo $newvalue
0.42
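Note that the one-liner above leaves newvalue unset when the value does not start with a '.'. A slightly fuller sketch that always produces a result could look like this. add_leading_zero is a hypothetical function name, and the handling of negative values such as -.5 is my own assumption about inputs you might see, not something from the question.

```shell
# add_leading_zero (hypothetical helper): print the value with a 0 inserted
# before a leading decimal point; print other values unchanged.
add_leading_zero() {
    case $1 in
        .*)  printf '0%s\n' "$1" ;;               # .53  -> 0.53
        -.*) printf '%s0%s\n' - "${1#-}" ;;       # -.5  -> -0.5 (assumed case)
        *)   printf '%s\n' "$1" ;;                # 1.5  -> 1.5
    esac
}

value=.53
newvalue=$(add_leading_zero "$value")
echo "$newvalue"
```

This uses only case and printf, so it should also work in plain sh, not just bash.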