Similar strings, different results - json

I'm creating a Bash script to parse the air pollution levels from the webpage:
http://aqicn.org/city/beijing/m/
There is a lot of stuff in the file, but this is the relevant bit:
"iaqi":[{"p":"pm25","v":[59,21,112],"i":"Beijing pm25 (fine
particulate matter) measured by U.S Embassy Beijing Air Quality
Monitor
(\u7f8e\u56fd\u9a7b\u5317\u4eac\u5927\u4f7f\u9986\u7a7a\u6c14\u8d28\u91cf\u76d1\u6d4b).
Values are converted from \u00b5g/m3 to AQI levels using the EPA
standard."},{"p":"pm10","v":[15,5,69],"i":"Beijing pm10
(respirable particulate matter) measured by Beijing Environmental
Protection Monitoring Center
I want the script to parse and display 2 numbers: current PM2.5 and PM10 levels (the numbers in bold in the above paragraph).
CITY="beijing"
AQIDATA=$(wget -q 0 http://aqicn.org/city/$CITY/m/ -O -)
PM25=$(awk -v FS="(\"p\":\"pm25\",\"v\":\\\[|,[0-9]+)" '{print $2}' <<< $AQIDATA)
PM100=$(awk -v FS="(\"p\":\"pm10\",\"v\":\\\[|,[0-9]+)" '{print $2}' <<< $AQIDATA)
echo $PM25 $PM100
Even though I can get PM2.5 levels to display correctly, I cannot get PM10 levels to display. I cannot understand why, because the strings are similar.
Anyone here able to explain?

The following approach is based on two steps:
(1) Extracting the relevant JSON;
(2) Extracting the relevant information from the JSON using a JSON-aware tool -- here jq.
(1) Ideally, the web service would provide a JSON API that would allow one to obtain the JSON directly, but as the URL you have is intended for viewing with a browser, some form of screen-scraping is needed. There is a certain amount of brittleness to such an approach, so here I'll just provide something that currently works:
wget -O - http://aqicn.org/city/beijing/m |
gawk 'BEGIN{RS="function"}
$1 ~/getAqiModel/ {
sub(/.*var model=/,"");
sub(/;return model;}/,"");
print}'
(gawk or an awk that supports multi-character RS can be used; if you have another awk, then first split on "function", using e.g.:
sed $'s/function/\\\n/g' # three backslashes )
The output of the above can be piped to the following jq command, which performs the filtering envisioned in (2) above.
(2)
jq -c '.iaqi | .[]
| select(.p? =="pm25" or .p? =="pm10") | [.p, .v[0]]'
The result:
["pm25",59]
["pm10",15]

I think your problem is that you have a single line HTML file that contains a script that contains a variable that contains the data you are looking for.
Your field delimiters are either "p":"pm100", "v":[ or a comma and some digits.
For pm25 this works, because it is the first, and there are no occurrences of ,21 or something similar before it.
However, for pm10, there are some that are associated with pm25 ahead of it. So the second field contains the empty string between ,21 and ,112
#karakfa has a hack that seems to work -- but he doesn't explain very well why it works.
What he does is use awk's record separator (which is usually a newline) and sets it to either of :, ,, or [. So in your case, one of the records would be "pm25", because it is preceded by a colon, which is a separator, and succeeded by a comma, also a separator.
Once it hits the matching content ("pm25") it sets a counter to 4. Then, for this and the next records, it counts this counter down. "pm25" itself, "v", the empty string between : and [, and finally reaches one when hitting the record with the number you want to output: 4 && ! 3 is false, 3 && ! 2 is false, 2 && ! 1 is false, but 1 && ! 0 is true. Since there is no execution block, awk simply prints this record, which is the value you want.
A more robust work would probably be using xpath to find the script, then use some json parser or similar to get the value.

chw21's helpful answer explains why your approach didn't work.
peak's helpful answer is the most robust, because it employs proper JSON parsing.
If you don't want to or can't use third-party utility jq for JSON parsing, I suggest using sed rather than awk, because awk is not a good fit for field-based parsing of this data.
$ sed -E 's/^.*"pm25"[^[]+\[([0-9]+).+"pm10"[^[]+\[([0-9]+).*$/\1 \2/' <<< "$AQIDATA"
59 15
The above should work with both GNU and BSD/OSX sed.
To read the result into variables:
read pm25 pm10 < \
<(sed -E 's/^.*"pm25"[^[]+\[([0-9]+).+"pm10"[^[]+\[([0-9]+).*$/\1 \2/' <<< "$AQIDATA")
Note how I've chosen lowercase variable names, because it's best to avoid all upper-case variables in shell programming, so as to avoid conflicts with special shell and environment variables.
If you can't rely on the order of the values in the source string, use two separate sed commands:
pm25=$(sed -E 's/^.*"pm25"[^[]+\[([0-9]+).*$/\1/' <<< "$AQIDATA")
pm10=$(sed -E 's/^.*"pm10"[^[]+\[([0-9]+).*$/\1/' <<< "$AQIDATA")

awk to the rescue!
If you have to, you can use this hacky way using smart counters with hand-crafted delimiters. Setting RS instead of FS transfers looping through fields to awk itself. Multi-char RS is not available for all awks (gawk supports it).
$ awk -v RS='[:,[]' '$0=="\"pm25\""{c=4} c&&!--c' file
59
$ awk -v RS='[:,[]' '$0=="\"pm10\""{c=4} c&&!--c' file
15

Related

Calling Imagemagick from awk?

I have a CSV of image details I want to loop over in a bash script. awk seems like an obvious choice to loop over the data.
For each row, I want to take the values, and use them to do Imagemagick stuff. The following isn't working (obviously):
awk -F, '{ magick "source.png" "$1.jpg" }' images.csv
GNU AWK excels at processing structured text data, although it can be used to summon commands using system function it is less handy for that than some other language, e.g. python has module of standard library called subprocess which is more feature-rich.
If you wish to use awk for this task anyway, then I suggest preparing output to be feed into bash command, say you have file.txt with following content
file1.jpg,file1.bmp
file2.png,file2.bmp
file3.webp,file3.bmp
and you have files listed in 1st column in current working directory and wish to convert them to files shown in 2nd column and access to convert command, then you might do
awk 'BEGIN{FS=","}{print "convert \"" $1 "\" \"" $2 "\""}' file.txt | bash
which is equvialent to starting bash and doing
convert "file1.jpg" "file1.bmp"
convert "file2.png" "file2.bmp"
convert "file3.webp" "file3.bmp"
Observe that I have used literal " to enclose filenames, so it should work with names containing spaces. Disclaimer: it might fail if name containing special character, e.g. ".

Extract numeric values from awk return

For supervision system, I need to return 2 values about latency to my supervisor server thru nrpe.
Here the values that I'm working on (I put this in a file : test.txt) :
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"project_site":"AUB"},"value":[1575277537.052,"0.3889104875437488"]},{"metric":{"project_site":"VDR"},"value":[1575277537.052,"0.2267407994117705"]}]}}
I need to extract 0.3889104875437488 and 0.2267407994117705
I'm using this :
for i in $(""cat test.txt | awk -F ',' '{print $5 $NF}' | grep -o '[0.0001-9999.9]\+'""); do echo $i; done
I'm not sure that's the best method, especially since I have to add this : "AUB" for row 1 and "VDR" for row 2 before each line. Like :
AUB : 0.3889104875437488 seconds
VDR : 0.2267407994117705 seconds
Use jq for parsing JSON, for example:
$ jq -r '.data.result[] | "\(.metric.project_site) : \(.value[1]) seconds"' file
AUB : 0.3889104875437488 seconds
VDR : 0.2267407994117705 seconds
I have upvoted the answer by #oguzismail, and will repeat their suggestion to use jq instead if at all feasible.
If your input is not valid JSON, there are several things wrong with your approach, several of them related more to efficiency and common practice than outright erroneous.
Your regex is wrong. See below.
Avoid the useless cat.
If you are using Awk anyway, you don't need grep. See useless use of grep.
Quote your variable.
Only in this case, you want to remove the useless echo entirely. Capturing standard output so that you can echo it to standard output is simply a waste of processes (unless you specifically wanted to break the quoting, as a special case of the previous item; but that is not the case here).
It is unclear what you hope for the empty string "" to accomplish. After the shell is done with quote removal, ""cat is simply cat.
In some more detail, [0.0001-9999.9] matches a single character which is 0 or . or 0 (oh we mentioned that already, didn't we?) or 0 (ditto) or between 1 and 9 or 9 (etc etc). In short, grep is not at all the right tool for searching for number ranges; fortunately, Awk can do that easily too.
Here, then, is an attempt to refactor to remove these problems.
awk -F ',' '{ split("5:" NF, a, ":"); split("AUB:VDR", l, ":")
for (i=1; i<=2; i++) {
n = $a[i]; gsub(/[]}"]+/, "", n);
if (n >= 0.0001 && n <= 9999.9)
print l[i] ": " n " seconds"} }' test.txt
This is extremely brittle because it hard-codes the locations of the strings within the surface structure of the (not?) JSON data, which could change without warning.
The split is a hack to get the numbers 5 and NF into an array a. We create a second array with the same length for the corresponding labels. We then loop over the first array and use the numbers as indices into the current record's fields. We trim off any quoting and brackets, and then perform the numeric comparison on the thus extracted field. At the end, we add the corresponding label from the other array in front of the printed text.

Take only first json object from stream with jq, do not touch rest

This thread Split multiple input JSONs with jq helped me to solve one problem. But not other.
mkfifo xxs
exec 3<>xxs ## keep open file descriptor
echo '{"a":0,"b":{"c":"C"}}{"x":33}{"asd":889}' >&3
jq -nc input <&3 ## prints 1st object '{"a":0,"b":{"c":"C"}}' and reads out the rest
cat <&3 ## prints nothing
My problem is to make jq stop reading after first object is read, and do not touch other data in stream (fifo). So cat should show the rest of data: '{"x":33}{"asd":889}'.
How to achive that with jq?
jq doesn't have to read to the whole input to get the first value. This can be verified by feeding an infinite sequence of values to jq which takes the first value and exit:
yes '{}' | jq -n input
Though, the question assumes a bit more. Namely that jq can read a single JSON value from a named pipe and stop reading "right at that point" so the rest can be then read by cat.
mkfifo xxs
exec 3<>xxs ## keep open file descriptor
echo '1 2 3' >&3
jq -nc input <&3 >first ## Get first value
cat <&3 >rest ## Nothing to show jq read all data
This gets more complicated as we don't know where that first value ends and most Unix programs (jq included) read input in larger chunks to limit the number of read syscalls.
jq would need an option to read its input one byte at a time. And, while this could be implemented, it may be of limited utility.
The closest thing I can think of is to output the first value to
stderr and the rest to stdout.
jq -n 'input | stderr | inputs' <&3 2>first 1>rest
Input is processed in a streaming fashion (one input value at a time) and you can pipe stdout and/or stderr to something else. Though the whole input has to be valid JSON and it will be prettified while passing through jq (unlike with cat above).
If reading from a named pipe is not a requirement and you can afford to read the input from a file. Then, you can access the first value and the rest in two separate invocations.
echo '1 2 3' > in
jq -n 'input' in >first
jq -n 'input | inputs' in >rest
If stream processing is the goal, it may also be possible to do everything in a single jq script that processes its input incrementally.
This all assumes top-level values. Though, jq can also process nested structures incrementally using the --stream option.
If you want to partially read a stream you will probably need to do it yourself.
You could write a trivial C program to do this.
I doubt there are any off-the-shelf parsers you can find to specify stopping the read of a stream after n objects.
As mentioned before, most stream readers will use stdio and read all they can into a buffer.

Shell - read html into variable and filter sequence

I need to read a webpage with tables into a variable and filter the number of one cell out.
the html is like:
<tr><th>Totals:</th><td> 99999.9</td>
I need to get that 99999.9 number.
I tried:
value=$(curl -s -m 10 http://$host | egrep -o "Totals:</th><td> [0-9]\{5\}" | cut -d'> ' -f 2)
an other valid option is to check if the page is generated at least. I mean reading the html into an value and check if the value is full of html (maybe length).
any glue what is wrong about the curl command combined with the cut command?
thank you?
You should use a proper html parser for that. If you really want to do it with bash (which is error prone and can cause you lot of headache if the html getting more complex) you can do that in the following way:
# html="$(curl -s -m 10 http://$host)"
html="<tr><th>Totals:</th><td> 99999.9</td>"
# remove all whitespaces
# it is not guaranteed that your cell value will be on the same line with Totals:
html_cl="$(echo $html | tr -d ' \t\n\r\f')"
# strip .*Totals:</th><td> before the desired cell value
# strip </td>.* after the value
value="${html_cl##*Totals:</th><td>}"
value="${value%%</td>*}"
echo $value
Gives you the result:
99999.9
NOTE: If you have multiple Totals with the same tags then it will extract only the last one from your string.

How can I extract td from html in bash?

I am querying London postcode data from geonames:
http://www.geonames.org/postalcode-search.html?q=london&country=GB
I want to turn the output into a list of just the postcode identifiers (Bethnal Green, Islington, etc.). What is the best way to extract just the names in bash?
I'm not sure if you mean this \n delimited list (or one in brackets and comma delimited)
html='http://www.geonames.org/postalcode-search.html?q=london&country=GB'
wget -q "$html" -O - |
w3m -dump -T 'text/html'|
sed -nr 's/^ +[0-9]+ +(.*) +[A-Z]+[0-9]+ +United Kingdom.*/\1/p'
w3m is a: "WWW browsable pager with excellent tables/frames support"
output (first 10 lines)
London Bridge
Kilburn
Ealing
Wandsworth
Pimlico
Kensington
Leyton
Leytonstone
Plaistow
Poplar
I see the site offers (but not for free) web services with XML or JSON data... It would be the best way, since the HTML page is not meant to be parsed (easily).
Anyway, nothing is impossible, nonetheless using strictly only bash commands would be a lot hard, if not impossible; often several other common tools are piped in order to achieve the result. But then, sometimes it turns to be more conveniente to stick to a single tool like e.g. Perl, instead of combining cat, grep, awk, sed and whatever else.
Something like
sed -e 's/>/>\n/g' region.html |
egrep -i "^\s*[A-Z]+[0-9]+</td>" |
sed -e 's|</td>||g'
worked extracting 200 lines, assuming a specific format for the code.
ADD
If there's no limit to the software you can use to parse the data, then you could use a line like
wget -q "http://www.geonames.org/postalcode-search.html?q=london&country=GB" -O - |
sgrep '"<table class=\"restable\"" .. "</table>"' |
sed -e 's|/tr>|/tr>\n|g; s|</td>\s*<td[^>]*>|;|g; s|</th>\s*<th[^>]*>|;|g; s|<[^>]\+>||g; s|;; .*$| |g' |
grep -v "^\s*$" |
tail -n+2 | cut -d";" -f2,3
which extracts places and postal codes seperated by a ; like in a CSV, as well as awk:
wget -q "$html" -O - |
w3m -dump -T 'text/html' |
awk '/\s*[0-9]+ / { print substr($0, 11, 16); }'
which is based on the answer by Peter.O and extracts the same data... and so on. But in these cases, since you are not limited to the minimal tools found on most Unix or GNU systems, I would stick to one single widespread tool, e.g. perl.
If you have access to the mojo tool from the Mojolicious project this all becomes quite a lot easier:
mojo get 'http://www.geonames.org/postalcode-search.html?q=london&country=GB' '.restable > tr > td:nth-child(2)' text | grep ^'[a-zA-Z]'
The grep at the end is just to filter out some junk results; almost (but not quite) every other line is bad, because the page structure is slightly inconsistent. Otherwise you could say tr:nth-child(even) and get nice results.