Grep ignore special characters before applying regular expression - json

General
I am trying to recursively search through hundreds of JSON files under a specific directory for lines that match a specific regular expression.
grep -rh works great for searching recursively for specific lines, but I am having a problem applying a regular expression to the search, because every line in the JSON files begins with a " and ends in either ", or ".
Example: If I want to apply a regular expression to get all the lines that begin with zxc I will not be able to do it because the lines actually begin with "zxc
Code
The following command would work if the lines had no " at the beginning.
/bin/grep -rh -E "^(zxc)" "/etc/json_dir/"
The following command works, but I do not want grep to get hundreds of thousands of lines from all the JSON files and then apply a regular expression afterwards.
/bin/grep -rh -E ".*" "/etc/json_dir/" | /bin/sed -e 's/^"//g' -e 's/,$//g' -e 's/"$//g' | /bin/grep -E "^(zxc)"
Question
Is there a way for grep to ignore the " character at the beginning and the " and ", characters at the end of the lines before it applies a regular expression?
If there's no way, is there a way to do it with some other bash command, perl, python or some other language?

You can go with awk if I understand your question properly:
awk '{gsub(/^"|"$/,"") } # this part strips the leading and trailing " from each line
/^WHAT/ { print } # or any other processing
' **/*.json
Note: the **/* requires the globstar recursive-globbing option in (modern) bash.
You can shorten it somewhat to:
awk '/^"?WHAT/' **/* # this executes the default printing action
But awk|sed|grep might not be the right tool to search JSON.
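If jq is available, a minimal sketch of a JSON-aware search (assuming the files are valid JSON and you want the string values starting with zxc):
find /etc/json_dir -name '*.json' -exec jq -r '.. | strings | select(startswith("zxc"))' {} +
Here .. descends recursively through each document, strings keeps only string values, and startswith does the prefix test, so the surrounding quotes and commas never enter into it.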

Related

Calling Imagemagick from awk?

I have a CSV of image details I want to loop over in a bash script. awk seems like an obvious choice to loop over the data.
For each row, I want to take the values, and use them to do Imagemagick stuff. The following isn't working (obviously):
awk -F, '{ magick "source.png" "$1.jpg" }' images.csv
GNU AWK excels at processing structured text data. Although it can be used to run commands via its system function, it is less handy for that than some other languages; Python, for example, has the standard-library module subprocess, which is more feature-rich.
If you wish to use awk for this task anyway, then I suggest preparing output to be fed into bash. Say you have file.txt with the following content
file1.jpg,file1.bmp
file2.png,file2.bmp
file3.webp,file3.bmp
and you have the files listed in the 1st column in the current working directory, wish to convert them to the files shown in the 2nd column, and have access to the convert command, then you might do
awk 'BEGIN{FS=","}{print "convert \"" $1 "\" \"" $2 "\""}' file.txt | bash
which is equivalent to starting bash and doing
convert "file1.jpg" "file1.bmp"
convert "file2.png" "file2.bmp"
convert "file3.webp" "file3.bmp"
Observe that I have used literal " characters to enclose the filenames, so it should work with names containing spaces. Disclaimer: it might fail if a name contains special characters, e.g. ".
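If you would rather not pipe into bash, awk can also invoke the command itself through its built-in system function; a minimal sketch with the same quoting caveats:
awk 'BEGIN{FS=","}{system("convert \"" $1 "\" \"" $2 "\"")}' file.txt
This runs one convert per row directly, at the cost of losing the chance to inspect the generated commands before executing them.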

Getting JSON value from JSON String using Shell Script

I have this JSON String:
{"name":"http://someUrl/ws/someId","id":"someId"}
I just want to get the value for the "id" key and store it in some variable. I successfully tried using jq, but due to some constraints I need to achieve this just by using grep and string matching.
I tried this so far: grep -Po '"id":.*?[^\\]"', but that gives "id":"ws-4c906698-03a2-49c3-8b3e-dea829c7fdbe" as output. I just need the id value. Please help.
With a PCRE regex, you may use lookarounds. Thus, you need to put "id":" into the positive lookbehind construct, and then match 1 or more chars other than ":
grep -Po '(?<="id":")[^"]+'
where
(?<="id":") - requires a "id":" to appear immediately to the left of the current position (but the matched text is not added to the match value) and
[^"]+ - matches and adds to the match 1 or more chars other than ".
To get the values with escaped quotes:
grep -Po '(?<="id":")[^"\\]*(?:\\.[^"\\]*)*'
Here, (?<="id":") will still match the position right after "id":" and then the following will get matched:
[^"\\]* - zero or more chars other than " and \
(?:\\.[^"\\]*)* - zero or more consequent sequences of:
\\. - a \ and any char (any escape sequence)
[^"\\]* - zero or more chars other than " and \
See Jshon; it is a command-line JSON parser for shell-script usage.
echo '{"name":"http://someUrl/ws/someId","id":"someId"}' | jshon -e id
"someId"
Just noticed I read past the section stating you needed to use the standard tools available. If your admin doesn't allow Jshon, it is very likely that the system has Python available, which you could use:
echo '{"name":"http://someUrl/ws/someId","id":"someId"}' | python -c 'import sys, json; print json.load(sys.stdin)["id"]'
someId
Using grep for this is just asking for trouble, I would avoid it and opt for a proper Json parser as above.
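For reference, where jq is permitted (your constraint notwithstanding), it is a one-liner:
echo '{"name":"http://someUrl/ws/someId","id":"someId"}' | jq -r .id
someId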

Similar strings, different results

I'm creating a Bash script to parse the air pollution levels from the webpage:
http://aqicn.org/city/beijing/m/
There is a lot of stuff in the file, but this is the relevant bit:
"iaqi":[{"p":"pm25","v":[59,21,112],"i":"Beijing pm25 (fine
particulate matter) measured by U.S Embassy Beijing Air Quality
Monitor
(\u7f8e\u56fd\u9a7b\u5317\u4eac\u5927\u4f7f\u9986\u7a7a\u6c14\u8d28\u91cf\u76d1\u6d4b).
Values are converted from \u00b5g/m3 to AQI levels using the EPA
standard."},{"p":"pm10","v":[15,5,69],"i":"Beijing pm10
(respirable particulate matter) measured by Beijing Environmental
Protection Monitoring Center
I want the script to parse and display 2 numbers: the current PM2.5 and PM10 levels (the first number in each "v" array above, i.e. 59 and 15).
CITY="beijing"
AQIDATA=$(wget -q http://aqicn.org/city/$CITY/m/ -O -)
PM25=$(awk -v FS="(\"p\":\"pm25\",\"v\":\\\[|,[0-9]+)" '{print $2}' <<< $AQIDATA)
PM100=$(awk -v FS="(\"p\":\"pm10\",\"v\":\\\[|,[0-9]+)" '{print $2}' <<< $AQIDATA)
echo $PM25 $PM100
Even though I can get PM2.5 levels to display correctly, I cannot get PM10 levels to display. I cannot understand why, because the strings are similar.
Anyone here able to explain?
The following approach is based on two steps:
(1) Extracting the relevant JSON;
(2) Extracting the relevant information from the JSON using a JSON-aware tool -- here jq.
(1) Ideally, the web service would provide a JSON API that would allow one to obtain the JSON directly, but as the URL you have is intended for viewing with a browser, some form of screen-scraping is needed. There is a certain amount of brittleness to such an approach, so here I'll just provide something that currently works:
wget -O - http://aqicn.org/city/beijing/m |
gawk 'BEGIN{RS="function"}
$1 ~/getAqiModel/ {
sub(/.*var model=/,"");
sub(/;return model;}/,"");
print}'
(gawk or an awk that supports multi-character RS can be used; if you have another awk, then first split on "function", using e.g.:
sed $'s/function/\\\n/g' # three backslashes )
The output of the above can be piped to the following jq command, which performs the filtering envisioned in (2) above.
(2)
jq -c '.iaqi | .[]
| select(.p? =="pm25" or .p? =="pm10") | [.p, .v[0]]'
The result:
["pm25",59]
["pm10",15]
I think your problem is that you have a single line HTML file that contains a script that contains a variable that contains the data you are looking for.
Your field delimiters are either "p":"pm25","v":[ (resp. "p":"pm10","v":[ in the second command) or a comma and some digits.
For pm25 this works, because it comes first and there are no occurrences of ,21 or something similar before it.
However, for pm10, there are such comma-digit sequences associated with pm25 ahead of it, so the second field is the empty string between ,21 and ,112.
@karakfa has a hack that seems to work -- but he doesn't explain very well why it works.
What he does is use awk's record separator (which is usually a newline) and set it to any of :, ,, or [. So in your case, one of the records would be "pm25", because it is preceded by a colon, which is a separator, and followed by a comma, also a separator.
Once it hits the matching content ("pm25") it sets a counter to 4. Then, for this and the following records, it counts the counter down: "pm25" itself, "v", the empty string between : and [, until it reaches the record with the number you want to output: 4 && ! 3 is false, 3 && ! 2 is false, 2 && ! 1 is false, but 1 && ! 0 is true. Since there is no action block, awk simply prints this record, which is the value you want.
A more robust approach would probably be to use XPath to find the script, then use a JSON parser or similar to get the value.
chw21's helpful answer explains why your approach didn't work.
peak's helpful answer is the most robust, because it employs proper JSON parsing.
If you don't want to or can't use the third-party utility jq for JSON parsing, I suggest using sed rather than awk, because awk is not a good fit for field-based parsing of this data.
$ sed -E 's/^.*"pm25"[^[]+\[([0-9]+).+"pm10"[^[]+\[([0-9]+).*$/\1 \2/' <<< "$AQIDATA"
59 15
The above should work with both GNU and BSD/OSX sed.
To read the result into variables:
read pm25 pm10 < \
<(sed -E 's/^.*"pm25"[^[]+\[([0-9]+).+"pm10"[^[]+\[([0-9]+).*$/\1 \2/' <<< "$AQIDATA")
Note how I've chosen lowercase variable names, because it's best to avoid all upper-case variables in shell programming, so as to avoid conflicts with special shell and environment variables.
If you can't rely on the order of the values in the source string, use two separate sed commands:
pm25=$(sed -E 's/^.*"pm25"[^[]+\[([0-9]+).*$/\1/' <<< "$AQIDATA")
pm10=$(sed -E 's/^.*"pm10"[^[]+\[([0-9]+).*$/\1/' <<< "$AQIDATA")
awk to the rescue!
If you have to, you can use this hacky way with smart counters and hand-crafted delimiters. Setting RS instead of FS transfers the looping through fields to awk itself. Multi-character RS is not available in all awks (gawk supports it).
$ awk -v RS='[:,[]' '$0=="\"pm25\""{c=4} c&&!--c' file
59
$ awk -v RS='[:,[]' '$0=="\"pm10\""{c=4} c&&!--c' file
15

Expect: extract specific string from output

I am navigating a Java-based CLI menu on a remote machine with expect inside a bash script and I am trying to extract something from the output without leaving the expect session.
Expect command in my script is:
expect -c "
spawn ssh user@host
expect \"#\"
send \"java cli menu command here\r\"
expect \"java cli prompt\"
send \"java menu command\"
"
###I want to extract a specific string from the above output###
Expect output is:
Id Name
-------------------
abcd 12 John Smith
I want to extract abcd 12 from the above output into another expect variable for further use within the expect script. So that's the 3rd line, first field, using a double-space delimiter. The awk equivalent would be: awk -F '  ' 'NR==3 {print $1}'
The big issue is that the environment through which I am navigating with Expect is, as I stated above, a Java CLI based menu so I can't just use awk or anything else that would be available from a bash shell.
Getting out from the Java menu, processing the output and then getting in again is not an option as the login process lasts for 15 seconds so I need to remain inside and extract what I need from the output using expect internal commands only.
You can use a regexp in expect itself directly with the -re flag. Thanks to Donal for pointing out the single-quote and double-quote issues. I have given solutions using both ways.
I have created a file with the content as follows,
Id Name
-------------------
abcd 12 John Smith
This is nothing but your Java program's console output. I have tested this on my system, i.e. I just simulated your program's output with cat. Just replace the cat code with your program's commands. Simple. :)
Double Quotes :
#!/bin/bash
expect -c "
spawn ssh user@domain
expect \"password\"
send \"mypassword\r\"
expect {\\\$} { puts matched_literal_dollar_sign}
send \"cat input_file\r\"; # Replace this code with your java program commands
expect -re {-\r\n(.*?)\s\s}
set output \$expect_out(1,string)
#puts \$expect_out(1,string)
puts \"Result : \$output\"
"
Single Quotes :
#!/bin/bash
expect -c '
spawn ssh user@domain
expect "password"
send "mypasswordhere\r"
expect "\\\$" { puts matched_literal_dollar_sign}
send "cat input_file\r"; # Replace this code with your java program commands
expect -re {-\r\n(.*?)\s\s}
set output $expect_out(1,string)
#puts $expect_out(1,string)
puts "Result : $output"
'
As you can see, I have used {-\r\n(.*?)\s\s}. Here the braces prevent any variable substitutions. In your output, we have a 2nd line full of hyphens, then a newline, then your 3rd-line content. Let's decode the regex used.
-\r\n matches one literal hyphen and a newline together. This will match the last hyphen in the 2nd line and the newline, which in turn takes us to the 3rd line. Then .*? will match the required output (i.e. abcd 12) until it encounters a double space, which is matched by \s\s.
You might be wondering why the parentheses are needed: they are used to capture the sub-match patterns.
In general, expect saves the whole matched string in expect_out(0,string) and buffers all the matched/unmatched input in expect_out(buffer). Each sub-match is saved with subsequent numbering, such as expect_out(1,string), expect_out(2,string) and so on.
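A self-contained sketch of sub-match capture, with echo standing in for the Java menu output as above (the input string is hypothetical):
#!/usr/bin/env expect
# simulate the program output, then capture the first two fields before a double space
spawn echo "abcd 12  John Smith"
expect -re {(\S+\s\S+)\s\s}
puts "whole match: $expect_out(0,string)"
puts "sub-match 1: $expect_out(1,string)"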
As Donal pointed out, it is better to use the single-quote approach since it looks less messy. :)
Note that with double quotes it is not required to escape the \r with a backslash.
Update :
I have changed the regexp from -\r\n(\w+\s+\w+)\s\s to -\r\n(.*?)\s\s.
This way it matches your requirement: any number of characters (including single spaces) until the first occurrence of a double space in the output.
Now, let's come to your question. You mentioned that you tried -\r\n(\w+)\s\s. The problem there is with \w+: remember that \w+ will not match a space character, and your output has some spaces in it before the double space.
Which regexp to use depends on your requirements for the input string being matched; you can customize the regular expressions based on your needs.
Update version 2 :
What is the significance of .*? ? In regular expressions, * is a greedy operator and ? is our life saver. Let us consider the string
Stackoverflow is already overflowing with number of users.
Now, see the effect of the regular expression .*flow as below.
* matches any number of characters. More precisely, it matches the longest string possible while still allowing the pattern itself to match. So, due to this, .* in the pattern matched the characters Stackoverflow is already over and flow in pattern matched the text flow in the string.
Now, in order to prevent the .* from matching beyond the first occurrence of the string flow, we add the ? to it. It makes the pattern behave in a non-greedy manner.
Now, coming back to your question: if we had used .*\s\s, it would match as much of the line as possible (up to the last double space), since it tries to match as much as it can. This is the common behavior of regular expressions.
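A quick demonstration of the difference with grep, using the sentence above:
$ s='Stackoverflow is already overflowing with number of users.'
$ grep -oP '^.*flow' <<< "$s"    # greedy: matches up to the last flow
Stackoverflow is already overflow
$ grep -oP '^.*?flow' <<< "$s"   # non-greedy: stops at the first flow
Stackoverflow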
Update version 3:
Have your code in the following way.
x=$(expect -c "
spawn ssh user@host
expect \"password\"
send \"password\r\"
expect {\\\$} { puts matched_literal_dollar_sign}
send \"cat input\r\"
expect -re {-\r\n(.*?)\s\s}
if {![info exists expect_out(1,string)]} {
puts \"Match did not happen :(\"
exit 1
}
set output \$expect_out(1,string)
#puts \$expect_out(1,string)
puts \"Result : \$output\"
")
y=$?
# $x now contains the output from the 'expect' command, and $y contains the
# exit status
echo $x
echo $y
If the flow happened properly, the exit code will be 0; else it will be 1. This way, you can check the return value in the bash script.
See the Tcl documentation to learn about the info exists command.

Shell: Replacing each New Line "\n" character with "\\n"

I'm inserting a git diff of changed files into a JSON object to send using a curl request.
The problem is it doesn't like the newline characters being inserted into the JSON, but I'm not sure how to get around that. The tr tool didn't work, and this perl solution I'm using is close but just replaces them with spaces:
changedfiles=$(git diff --name-only $3..$4 | perl -p -e 's/\n/ /')
and changing it to this didn't help:
changedfiles=$(git diff --name-only $3..$4 | perl -p -e 's/\n/\\n/')
Can anyone point me in the right direction? It doesn't need to use perl, it just needs to work
(...being simple would be nice too)
Instead of trying to do ad-hoc escaping for characters that your immediate testing finds problematic, how about using an actual JSON library that handles all of them in a solid way?
Here's an example in bash using inlined python:
python -c '
import json
import sys
print(json.dumps({"data": sys.argv[1]}))
' "$(git diff --name-only $3..$4)"
It prints the JSON object { "data": "your command output here" } with standards-compliant escaping.
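If jq is available instead of Python, a similar sketch is a one-liner (-R reads raw text, -s slurps the whole input into a single string):
git diff --name-only $3..$4 | jq -Rs '{data: .}'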
This is what I think you want to do to get a quoted list of files separated by commas (i.e. for inserting into a JSON string):
git diff --name-only $3..$4 | perl -p -e 's/(.*)/"$1",/;s/\n//;s/""/","/'
This works if your files don't contain double quotes or special characters that need to be JSON escaped.
First, we put the files in quotes followed by a comma, then remove newlines, then change the "" between files to ",". Although, this is kind of a hack. Somewhat better might be:
git diff --name-only $3..$4 | perl -p -e '$/="";s/(.*)\n/"$1",/g;s/,$//'
Here we read in the whole input, newlines and all, do our substitution and remove the final comma.
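As a hedge against double quotes and backslashes in filenames, jq (if available) can produce the comma-separated quoted list with proper JSON escaping: jq -R . turns each line into an escaped JSON string, and paste joins them with commas.
git diff --name-only $3..$4 | jq -R . | paste -sd, -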