Filter results using bash - json

To be clearer, look at the text file below.
https://brianbrandt.dk/web/var/www/public_html/.htpasswd
https://brianbrandt.dk/web/var/www/public_html/wp-config.php
https://briannajackson1.wordpress.org/high-entropy-misc.txt
https://briannajackson1.wordpress.org/Homestead.yaml
https://brickellmiami.centric.hyatt.com/dev
https://brickellmiami.centric.hyatt.com/django.log
https://brickellmiami.centric.hyatt.com/.dockercfg
https://brickellmiami.centric.hyatt.com/docker-compose.yml
https://brickellmiami.centric.hyatt.com/.docker/config.json
https://brickellmiami.centric.hyatt.com/Dockerfile
https://brideonashoestring.wordpress.org/web/var/www/public_html/config.php
https://brideonashoestring.wordpress.org/web/var/www/public_html/wp-config.php
https://brideonashoestring.wordpress.org/wp-config.php
https://brideonashoestring.wordpress.org/.wp-config.php.swp
https://brideonashoestring.wordpress.org/_wpeprivate/config.json
https://brideonashoestring.wordpress.org/yarn-debug.log
https://brideonashoestring.wordpress.org/yarn-error.log
https://brideonashoestring.wordpress.org/yarn.lock
https://brideonashoestring.wordpress.org/.yarnrc
https://bridgehome.adobe.com/etc/shadow
https://bridgehome.adobe.com/phpinfo.php
https://bridgetonema.wordpress.org/manifest.json
https://bridgetonema.wordpress.org/manifest.yml
https://bridge.twilio.com/.wp-config.php.swp
https://bridge.twilio.com/wp-content/themes/.git/config
https://bridge.twilio.com/_wpeprivate/config.json
https://bridge.twilio.com/yarn-debug.log
https://bridge.twilio.com/yarn-error.log
https://bridge.twilio.com/yarn.lock
https://bridge.twilio.com/.yarnrc
https://brightside.mtn.co.za/config.lua
https://brightside.mtn.co.za/config.php
https://brightside.mtn.co.za/config.php.txt
https://brightside.mtn.co.za/config.rb
https://brightside.mtn.co.za/config.ru
https://brightside.mtn.co.za/_config.yml
https://brightside.mtn.co.za/console
https://brightside.mtn.co.za/.credentials
https://brightside.mtn.co.za/CVS/Entries
https://brightside.mtn.co.za/CVS/Root
https://brightside.mtn.co.za/dasbhoard/
https://brightside.mtn.co.za/data
https://brightside.mtn.co.za/data.txt
https://brightside.mtn.co.za/db/dbeaver-data-sources.xml
https://brightside.mtn.co.za/db/dump.sql
https://brightside.mtn.co.za/db/.pgpass
https://brightside.mtn.co.za/db/robomongo.json
https://brightside.mtn.co.za/README.txt
https://brightside.mtn.co.za/RELEASE_NOTES.txt
https://brightside.mtn.co.za/.remote-sync.json
https://brightside.mtn.co.za/Resources.zip.manifest
https://brightside.mtn.co.za/.rspec
https://br.infinite.sx/db/dump.sql
https://br.infinite.sx/graphiql
The domain brightside.mtn.co.za, along with some other domains, is repeated more than 10 times. I want to drop brightside.mtn.co.za and every other domain that is repeated more than 10 times, and then output the remaining results. The output should look like:
https://br.infinite.sx/db/dump.sql
https://br.infinite.sx/graphiql
https://bridgetonema.wordpress.org/manifest.json
https://bridgetonema.wordpress.org/manifest.yml

[The following is a response to the original question, which was premised on JSON input.]
Since you need to count the items in a group, it would appear that you will find group_by( sub("/[^/]*$";"") ) useful.
For example, if you wanted to omit large groups entirely, as one interpretation of the stated requirements would seem to imply, you could use the following filter:
[.results[] | select(.status==301) | .url]
| group_by( sub("/[^/]*$";"") )
| map(select(length < 10) )
| .[][]

If the text input is in input.txt, then one solution using jq at the bash command line would be:
< input.txt jq -Rr '[inputs]
| group_by( sub("/[^/]*$";"") )
| map(select(length < 10) )
| .[][]'
(If you want the output as JSON strings, omit the -r option.)
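To see what grouping key sub("/[^/]*$";"") actually produces, here is a quick check on one of the sample URLs (a sketch, using a URL taken from the input above):
echo '"https://br.infinite.sx/db/dump.sql"' | jq 'sub("/[^/]*$";"")'
"https://br.infinite.sx/db"
That is, the key is everything up to the last path segment, which for these URLs usually coincides with the domain, or with the domain plus a directory for deeper paths.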
A more efficient solution
The above solution uses the built-in filter group_by/1, which sorts its input and is thus somewhat inefficient. For a very large number of input lines, a more efficient solution would be:
< input.txt jq -Rr '
def GROUPS_BY(stream; f):
  reduce stream as $x ({}; .[$x|f] += [$x] ) | .[] ;
GROUPS_BY(inputs; sub("/[^/]*$";""))
| select(length < 10)
| .[]'
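If you would rather group strictly by host name, as the wording of the question suggests, one hedged variant (assuming every input line begins with http:// or https://, and a jq with capture support, i.e. jq 1.5 or later) would be:
< input.txt jq -Rr '
def GROUPS_BY(stream; f):
  reduce stream as $x ({}; .[$x|f] += [$x] ) | .[] ;
GROUPS_BY(inputs; capture("^(?<host>https?://[^/]+)").host)
| select(length < 10)
| .[]'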

Related

How to make request in jq

I'm trying to make a request in jq:
cat testfile.txt | jq 'fromjson | select(.kubernetes.pod.memory.usage.bytes != null) .kubernetes.pod.memory.usage.bytes, ."#timestamp"'
My output is:
"2019-03-15T00:24:21.733Z"
"2019-03-15T00:25:10.169Z"
"2019-03-15T00:24:47.908Z"
105889792
"2019-03-15T00:25:04.446Z"
34557952
"2019-03-15T00:25:04.787Z"
How can I delete the excess dates?
For example, output only:
105889792
"2019-03-15T00:25:04.446Z"
34557952
"2019-03-15T00:25:04.787Z"
You just need to add a pipe after the select:
cat testfile.txt | jq 'fromjson | select(.kubernetes.pod.memory.usage.bytes != null) | .kubernetes.pod.memory.usage.bytes, ."#timestamp"'
Here's a DRYer (as in dry) solution:
.["#timestamp"] as $ts | .kubernetes.pod.memory.usage.bytes // empty | ., $ts
Note that this particular use of // assumes that you wish to treat null, false, and a missing key in the same way. If not, you can still use the same idea to stay DRY.
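Putting it together with the fromjson step from the question, an invocation sketch (assuming, as in the question, that each line of testfile.txt is itself a JSON-encoded string) would be:
jq 'fromjson | .["#timestamp"] as $ts | .kubernetes.pod.memory.usage.bytes // empty | ., $ts' testfile.txt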

jq fromdate: date does not match format "%Y-%m-%dT%H:%M:%SZ"

I have an example of the JSON output below. The LastAccessedDate value is exactly what AWS outputs when running the CLI command, so I have no control over the format of the date.
{
  "MyList": [
    {
      "Name": "MyName1",
      "LastAccessedDate": "2021-06-29T02:00:00+02:00"
    }
  ]
}
When trying to run a jq command to select using fromdate like this:
cat output.json | jq '.[] | .[] | select ( .LastAccessedDate | fromdate > "2021-01-01T02:00:00+02:00")'
then I get the error message:
jq: error (at <stdin>:8): date "2021-06-29T02:00:00+02:00" does not match format "%Y-%m-%dT%H:%M:%SZ"
Is there anything I could do to enable me to filter the output with jq?
I would also appreciate an explanation of what is wrong, so that I can understand it for future use cases.
Here is a jq function for converting certain ISO 8601 timestamps with timezone offsets to seconds since the beginning of the Epoch, thus facilitating comparisons of timestamps with different offsets.
# Convert a timestamp with a possibly empty timezone offset to seconds since the Epoch.
# Input should be a string of the form
# yyyy-mm-ddThh:mm:ss or yyyy-mm-ddThh:mm:ss<OFFSET>
# where <OFFSET> is Z, or has the form `+hh:mm` or `-hh:mm`.
# If no timezone offset is explicitly given, it is taken to be Z.
def datetime_to_seconds:
  if test("T.*[-+]")
  then capture("(?<datetime>^.*T[0-9:]+)(?<s>[-+])(?<hh>[0-9]+):?(?<mm>[0-9]*)")
    | (.datetime + "Z" | fromdateiso8601) as $seconds
    | (if .s == "+" then -1 else 1 end) as $plusminus
    | ([.hh, .mm] | map(tonumber) | .[0] *= 60 | add * 60 * $plusminus) as $offset
    | ($seconds + $offset)
  else . + (if test("Z") then "" else "Z" end) | fromdateiso8601
  end;
Note that the interpretation of + or - in the offset conforms with the principle encapsulated in the example:
The following times all refer to the same moment: "18:30Z", "22:30+04", "1130−0700", and "15:00−03:30".
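With that def in hand, one way to apply it to the question's comparison would be the following sketch, assuming datetime_to_seconds is prepended to this filter and the whole program is saved as, say, program.jq (a hypothetical filename):
.MyList[]
| select( (.LastAccessedDate | datetime_to_seconds)
          > ("2021-01-01T02:00:00+02:00" | datetime_to_seconds) )
Invocation: jq -f program.jq output.json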
One way would be to use a custom function to strip off the timezone part and reformat the date string into a format that jq can parse.
Note that this only works when your reference and source strings have the same timezone offset.
jq --arg ref "2021-01-01T02:00:00+02:00" '
def c(str): str | (split("+")[0] + "Z") | fromdate ;
.MyList | map(select( c(.LastAccessedDate) > c($ref) ))' json
From jq Manual
The fromdate builtin parses datetime strings. Currently fromdate only supports ISO 8601 datetime strings, but in the future it will attempt to parse datetime strings in more formats.

jq create output in many separate files

Given the following JSON:
[
{"_id":{"$oid":"6d2"},"jlo":"ΕΙ AJSB","dd":"d5f"},
{"_id":{"$oid":"c6d3"},"jlo":"ΕΙ ALKSB","dd":"5d9"},
{"_id":{"$oid":"b0cc6d4"},"jlo":"ΕΙ AGHTSB","dd":"1b1"},
{"_id":{"$oid":"6d2"},"jlo":"ΕPOWΙ AJSB","dd":"d5f"},
{"_id":{"$oid":"c6d3"},"jlo":"ΕGTΙ ALKSB","dd":"5d9"},
{"_id":{"$oid":"b0cc6d4"},"jlo":"ΕLKΙ AGHTSB","dd":"1b1"}
]
What I need to do is produce, for each discrete value of the dd element, a separate file containing the unique values of jlo, where each file is named via a one-to-one mapping that substitutes the dd code with a human-readable representation:
d5f:departmentone
5d9:departmentalt
1b1:departshort
Desired output, on a per-row basis: each unique value of jlo with the count of times it was found within each dd group, so in the end we get something like this:
first file named departmentone.txt:
ΕΙ AJSB 1
ΕPOWΙ AJSB 1
second file named departmentalt.txt
ΕΙ ALKSB 1
ΕGTΙ ALKSB 1
third file named departshort.txt
ΕΙ AGHTSB 2
I have tried with map and reduce, group_by, and sort_by, with really poor results.
Only one invocation of jq is necessary. To allocate the output to the separate files, you can combine this one invocation with a single invocation of awk, or you could use a shell loop as illustrated below.
First, here's an illustration of how the shell pipeline would look:
jq -r --rawfile dd2name dd2name.tsv -f group.jq input.json |
while IFS=$'\t' read -r f v ; do echo "$v" >> "$f" ; done
This assumes that the mapping to filenames is in a TSV file named dd2name.tsv, and that the following jq program is in group.jq:
def dict:
  split("\n") | map(select(length>0) | split("\t"))
  | INDEX(.[0]) | map_values(.[1]);
($dd2name | dict) as $dict
| ($dict | keys_unsorted[]) as $dd
| map(select(.dd == $dd))
| group_by(.jlo)
| map("\($dict[$dd])\t\(.[0].jlo) \(length)")[]
As the name suggests, the dict function creates a dictionary giving the mapping of .dd values to the filenames. It assumes the availability of INDEX. If your jq does not have INDEX, then now would be an excellent time to upgrade your jq; otherwise, its def can easily be copied from builtin.jq (google: builtin.jq "def INDEX"), or you could replace the last line by: | reduce .[] as $p ({}; .[$p[0]] = $p[1]);
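For reference, the INDEX-free version of dict would then read:
def dict:
  split("\n") | map(select(length>0) | split("\t"))
  | reduce .[] as $p ({}; .[$p[0]] = $p[1]);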
awk-based solution
The following invocation of awk can be used instead of the while ... done command above:
awk -F\\t 'fn && (fn!=$1) {close(fn)}; {fn=$1; print $2 >> fn}'
Season to taste
If the dd2name.tsv mapping file does not contain the ".txt" suffix, it can easily be added in any of a variety of ways, according to taste.
Note also that the proposed solutions above make some assumptions, notably that the .jlo values do not contain tabs, newlines, or NULs. If any of those assumptions is violated, then some tweaking will be required.
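If your mapping is in the colon-separated mapping.txt shown in the question, one way (a sketch) to produce a dd2name.tsv with the ".txt" suffix already attached would be:
awk -F: -v OFS='\t' '{print $1, $2 ".txt"}' mapping.txt > dd2name.tsv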
I'd do it in three steps: filter the array for the desired dd, group by jlo, then extract the jlo of the first (guaranteed to exist) item of each group along with the group's length:
map(select(.dd == "d5f")) | group_by(.jlo) | map("\(.[0].jlo) \(length)") | .[]
Full bash run:
jq --arg dd d5f --raw-output 'map(select(.dd == $dd)) | group_by(.jlo) | map("\(.[0].jlo) \(length)") | .[]' yourJsonFile > departmentone.txt
jq --arg dd 5d9 --raw-output 'map(select(.dd == $dd)) | group_by(.jlo) | map("\(.[0].jlo) \(length)") | .[]' yourJsonFile > departmentalt.txt
jq --arg dd 1b1 --raw-output 'map(select(.dd == $dd)) | group_by(.jlo) | map("\(.[0].jlo) \(length)") | .[]' yourJsonFile > departshort.txt
Supposing you have a file named "mapping.txt" with the following content:
d5f:departmentone
5d9:departmentalt
1b1:departshort
You could extract those codes and labels to generate the files :
while IFS=: read -r code label; do
jq --arg dd "$code" --raw-output 'map(select(.dd == $dd)) | group_by(.jlo) | map("\(.[0].jlo) \(length)") | .[]' yourJsonFile > "$label".txt
done < mapping.txt

Check for duplicate values for a specific JSON key

I have the following JSON records stored in a container
{"memberId":"123","city":"New York"}
{"memberId":"234","city":"Chicago"}
{"memberId":"345","city":"San Francisco"}
{"memberId":"123","city":"New York"}
{"memberId":"345","city":"San Francisco"}
I am looking to check whether there is any duplication of the memberId: ideally, return true/false, and then also return the duplicated values.
Desired Output:
true
123
345
Here's an efficient approach using inputs. It requires invoking jq with the -n command-line option. The idea is to create a dictionary that keeps count of each memberId string value.
The dictionary can be created as follows:
reduce (inputs|.memberId|tostring) as $id ({}; .[$id] += 1)
Thus, to produce a true/false indicator, followed by the duplicates if any, you could write:
reduce (inputs|.memberId|tostring) as $id ({}; .[$id] += 1)
| to_entries
| map(select(.value > 1))
| (length > 0), .[].key
(If all the .memberId values are known to be strings, then of course the call to tostring can be dropped. Conversely, if .memberId is both string and integer-valued, then the above program won't differentiate between occurrences of 1 and "1", for example.)
bow
The aforementioned dictionary is sometimes called a "bag of words" (https://en.wikipedia.org/wiki/Bag-of-words_model). This leads to the generic function:
def bow(stream):
  reduce stream as $word ({}; .[($word|tostring)] += 1);
The solution can now be written more concisely:
bow(inputs.memberId)
| to_entries
| map(select(.value > 1))
| (length > 0), .[].key
For just the values which have duplicates, one could write the more efficient query:
bow(inputs.memberId)
| keys_unsorted[] as $k
| select(.[$k] > 1)
| $k
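A complete invocation might look like this sketch, where records.json is a hypothetical file holding the records shown in the question:
jq -nr '
  def bow(stream):
    reduce stream as $word ({}; .[($word|tostring)] += 1);
  bow(inputs.memberId)
  | to_entries
  | map(select(.value > 1))
  | (length > 0), .[].key
' records.json
true
123
345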

Get count based on value bash

I have data in this format in a file:
{"field1":249449,"field2":116895,"field3":1,"field4":"apple","field5":42,"field6":"2019-07-01T00:00:10","metadata":"","frontend":""}
{"field1":249448,"field2":116895,"field3":1,"field4":"apple","field5":42,"field6":"2019-07-01T00:00:10","metadata":"","frontend":""}
{"field1":249447,"field2":116895,"field3":1,"field4":"apple","field5":42,"field6":"2019-07-01T00:00:10","metadata":"","frontend":""}
{"field1":249443,"field2":116895,"field3":1,"field4":"apple","field5":42,"field6":"2019-07-01T00:00:10","metadata":"","frontend":""}
{"field1":249449,"field2":116895,"field3":1,"field4":"apple","field5":42,"field6":"2019-07-01T00:00:10","metadata":"","frontend":""}
Here, each entry represents a row. I want a count of the rows with respect to the value of field1, like:
249449 : 2
249448 : 1
249447 : 1
249443 : 1
How can I get that?
with awk
$ awk -F'[,:]' -v OFS=' : ' '{a[$2]++} END{for(k in a) print k, a[k]}' file
You can use the jq command-line tool to interpret JSON data. uniq -c counts the number of occurrences.
% jq .field1 < $INPUTFILE | sort | uniq -c
1 249443
1 249447
1 249448
2 249449
(tested with jq 1.5-1-a5b5cbe on linux xubuntu 18.04 with zsh)
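If you want the output in the exact "value : count" layout requested, the uniq -c output can be reshaped, for example:
jq .field1 < $INPUTFILE | sort | uniq -c | awk '{print $2 " : " $1}'
249443 : 1
249447 : 1
249448 : 1
249449 : 2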
Here's an efficient jq-only solution:
reduce inputs.field1 as $x ({}; .[$x|tostring] += 1)
| to_entries[]
| "\(.key) : \(.value)"
Invocation: jq -nrf program.jq input.json
(Note in particular the -n option.)
Of course if an object-representation of the counts is satisfactory, then
one could simply write:
jq -n 'reduce inputs.field1 as $x ({}; .[$x|tostring] += 1)' input.json
Using datamash and some shell utils: change the non-data delimiters to squeezed tabs, count field 3 (it would be field 2, but there's a leading tab), reverse, then pretty-print as per the OP's spec:
tr -s '{":,}' '\t' < file | datamash -sg 3 count 3 | tac | xargs printf '%s : %s\n'
Output:
249449 : 2
249448 : 1
249447 : 1
249443 : 1