regex for json file with multiple piping

regex for json file with multiple piping - json

I have the following command to grab a json in unix:
wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json
Which gives me the following output format (with different results each time obviously):
{
"kind": "...",
"data": {
"modhash": "",
"whitelist_status": "...",
"children": [
e1,
e2,
e3,
...
],
"after": "...",
"before": "..."
}
}
where each element of the array children is an object structured as follows:
{
"kind": "...",
"data": {
...
}
}
Here is an example of a complete .json get (body is too long to post directly:
https://pastebin.com/20p4kk3u
I need to print the complete data object as present inside each element of the array children. I know I need pipe atleast twice, to initially get children [...], then data {...} from there on, and this is what I have so far:
wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json | tr -d '\r\n' | grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])' | grep -oP '"data"\s*:\s*\K({.+?})(?=\s*},)'
I'm new to regular expressions, so I'm not sure how to handle having brackets or curly braces within elements of what I'm grepping. The line above prints nothing to the shell and I'm not sure why. Any help is appreciated.

Code
wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json | tr -d '\r\n' | grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])' | grep -oP '"data"\s*:\s*\K({.+?})(?=\s*},)'
Something about regex
* == zero or more time
+ == one or more time
? == zero or one time
\s == a space character or a tab character or a carriage return character or a new line character or a vertical tab character or a form feed character
\w == is a word character and can to be from A to Z (upper or lower), from 0 to 9, included also underscore (_)
\d == all numbers from 0 to 9
\r == carriage return
\n == new line character (line feed)
\ == escape special characters so they can to be read as normal characters
[...] == search for character class. Example: [abc] search for a or b or c
(?=) == is a positive lookahead, a type of zero-width assertion. What it's saying is that the captured match must be followed by whatever is within the parentheses but that part isn't captured.
\K == match start at this position.
Anyway you can read more about regex from here: Regex Tutorial
Now i can try to explain the code
wget download the source.
tr remove all line feed e carriage return, so we have all the output in one line and can to be handle from grep.
grep -o option is used for only matching.
grep -P option is for perl regexp.
So here
grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])'
we have sayed:
match all the line from "children"
zero or more spaces
:
zero or more spaces
\[ escaped so it's a simple character and not a special
zero or more spaces
\K force submatch to start from here
( submatch
{.+?} all, in braces (the braces are included because after start submatch sign. See greedy, not greedy in the regex tutorial for understand how work .+?)
) close submatch
(?=\s*\]) stop submatch when zero or more space founded and simple ] is founded but not include it in the submatch.

If you want to get the children array try this but i'm not sure it's what you look for.
wget -O - https://www.reddit.com/r/NetflixBestOf/.json | sed -n '/children/,/],/p'

Related

Bulk update values in json files (writing files)

I have a set of JSON files in a local folder. What I want to do is change a particular string value in it, permanently. That means, deleting or modifying the old entry, writing a new one, and saving it.
Below is the format of the file:
{
"name": "ABC #1",
"description": "This is the description",
"image": "ipfs://NewUriToReplace/1.png",
"dna": "a56c520f57ba2a861de8c78099b4691f9dad6e87",
"edition": 1,
"date": 1641634646966,
"creator": "Team Dreamlabs",
"attributes": [
{
I want to change ABA #1 to ABC #9501 in this file, ABC #2 to ABC #9502 in the text file, and so on. How do I do that on MAC in one go?

As I understand from the example, you are adding a value of 9500 to your integers after the symbol #.
Because this kind of a replacement is a kind of string operation, a cycle with command sed might be used:
for f in *.json; do sed -i.bak 's/\("name": "ABC #\)\([0-9]\)",/\1950\2",/' $f; done
it just replaces a single digit to the new composition... Despite it responses to the example, obviously, it would not work for more than number #9.
Then we need to use a bash function:
function add_number() { old_number=$(cat $1 | sed -n 's/[ ]*"name": "ABC #\([0-9]*\)",/\1/p'); new_number=$(($old_number+9500)); sed -i.bak "s/\(\"name\": \"ABC #\)\([0-9]*\)\",/\1${new_number}\",/" $1; }; for f in *.json; do add_number $f ; done
The function add_number extracts the integer value, then adds a desired number to it and then replaces content of the file.
For both extraction and replacing the sed is used again.
At extraction flag -n allows to limit the amount of lines at sed output and mode p prints the result of replacement. Also, we do not want spaces symbols to pass into this assignment.
At replacement double quotes used in order to enable the bash to use the variable value inside of sed. Also, the real quotes are masked.
Regarding addition from the comment below, in order to make replacement in another line with tag edition (and using the same number), just a new replacement sed operation should be added with amended regular expression to fit this line.
Finally, the overall code in a better look:
function add_number() {
old_number=$(cat $1 | sed -n 's/[ ]*"name": "ABC #\([0-9]*\)",/\1/p')
new_number=$(($old_number+9500))
sed -i.bak "s/\(\"name\": \"ABC #\)[0-9]*\",/\1${new_number}\",/" $1
sed -i.bak "s/\(\"edition\": \)[0-9]*,/\1${new_number},/" $1
}
for f in *.json
do add_number $f
done
Those previous answers helped me to write this code:
using variables inside of sed
assigning the variable

If you are going to manipulate your JSON files on more than just this one occasion, then you might want to consider using tools that are designed to accomplish such tasks with ease.
One popular choice could be jq which is a "lightweight and flexible command-line JSON processor" that "has zero runtime dependencies" and is also available for OS X. By using jq within your shell, the following would be one way to accomplish what you have asked for.
Adding the numeric value 9500 to the number sitting in the field called edition:
jq '.edition += 9500' file.json
Interpreting a part of a string as number, adding again 9500 to it, and recomposing the string:
jq '.name |= ((./"#" | .[1] |= "\(tonumber + 9500)") | join("#"))' file.json
On the whole, iterating over your files, making both changes at once, writing to a temporary file and replacing the original on success, while having the value to be added as external variable:
v=9500
for f in *.json; do jq --argjson v $v '
.edition += $v | .name |= ((./"#" | .[1] |= "\(tonumber + $v)") | join("#"))
' "$f" > "$f.new" && mv "$f.new" "$f"
done
Here is an online "playground for jq", set up to simulate the application of my code from above to three imaginary files of yours. Feel free to edit the jq filter and/or the input JSON in order to see what could be possible using jq.

Use variables in JQ queries

I want to use the value of a variable USER_PROXY in the JQ query statement.
export USER_PROXY= "proxy.zyz.com:122"
BY refering the SO answer no:1 from HERE , and also the LINK, I made the following shell script.
jq -r --arg UPROXY ${USER_PROXY} '.proxies = {
"default": {
"httpProxy": "http://$UPROXY\",
"httpsProxy": "http://$UPROXY\",
"noProxy": "127.0.0.1,localhost"
}
}' ~/.docker/config.json > tmp && mv tmp ~/.docker/config.json
However, I see I get the bash error as below. What is it that is missing here. Why is JQ variable UPROXY not getting the value from USER_PROXY bash variable.

export USER_PROXY= "proxy.zyz.com:122"
You can't have a space here. This sets USER_PROXY to an empty string and tries to export a non-existant variable 'proxy.zyz.com:122'. You probably want
export USER_PROXY="proxy.zyz.com:122"
jq -r --arg UPROXY ${USER_PROXY} '.proxies = {
"default": {
"httpProxy": "http://$UPROXY\",
"httpsProxy": "http://$UPROXY\",
"noProxy": "127.0.0.1,localhost"
}
}' ~/.docker/config.json > tmp && mv tmp ~/.docker/config.json
You need quotes around ${USER_PROXY} otherwise any whitespace in it will break it. Instead use --arg UPROXY "${USER_PROXY}".
This isn't the syntax for using variables inside a string in jq. Instead of "...$UPROXY..." you need "...\($UPROXY)..."
You are escaping the " at the end of the string by putting a \ before it. I am not sure what you mean here. I think you perhaps meant to use a forward slash instead?
This last issue is the immediate cause of the error message you're saying. It says "syntax error, unexpected IDENT, expecting '}' at line 4" and then shows you what it found on line 4: "httpsProxy": .... It parsed the string from line 3, which looks like: "http://$UPROXY\"\n " because the escaped double quote doesn't end the string. After finding the end of the string on line 4, jq expects to find a } to close the object, or a , for the next key-value-pair, but it finds httpsProxy, which looks like an identifier. So that's what the error message is saying. It found an IDENTifier when it was expecting a } (or a , but it doesn't mention that).

Grep from txt file (JSON format)

I have a txt in a JSON format:
{
"items": [ {
"downloadUrl" : "some url",
"path": "yxxsf",
"id" : "abc",
"repository" : "example",
"format" : "zip",
"checksum" : {
"sha1" : "kdhjfksjdfasdfa",
"md5" : "skjfhkjshdfkjshfkjsdhf"
}
}],
"continuationToken" : null
}
I want to extract download url context (in this example i want "some url") using grep and store it in another txt file. TBH i have never used grep

Using grep
grep -oP 'downloadUrl"\s:\s"(.*)",' myfile > urlFile.txt
See this Regex in action: https://regex101.com/r/DvnXCO/1
A better way to do this is to use jq
Download jq for Windows: https://stedolan.github.io/jq/download/
jq ".items[0].downloadUrl" myfile > urlFile.txt

Although json string may contain double-quote character escaped by
a backslash, both a double quote and a backslash in URLs should be
percent-encoded according to RFC 3986. Then you can extract the URL with:
tr "[:space:]" " " < file.json | grep -Po '"downloadUrl"\s*:\s*\K"[^"]+"'
Let me use tr to pre-process the json file to convert all blank
characters to whitespaces. Then the following grep will work
if the name and the value pair are in the separate (but consecutive) lines.
The \K operator in the regex is a variable-length look-behind without
including the preceding pattern in the matched result.
Note that the command above works with the provided example but may not
be robust enough for arbitrary inputs. I'd still recommend to use jq
for the strict purpose.

If you want to use only grep:
grep downloadURL myfile > new_file.txt
If you prefer a cleaner option, adding cut command:
grep downloadURL myfile | cut -d\" -f4 > new_file.txt
BTW the image of the json file shows that you are using notepad (windows?)

Remove double quotes if delimiter value is not present in data

An input file is given, each line of which contains quotes for each column and carriage return/ new line character.
If the line contains new lines it has be appended with in the same
line which is inside the quotes i.e for example line 1
Removing of double quotes for each column if the delimiter(,) is
not present.
Removing of Carriage Return characters i.e(^M)
To exemplify, given the following input file
"name","address","age"^M
"ram","abcd,^M
def","10"^M
"abhi","xyz","25"^M
"ad","ram,John","35"^M
I would like to obtain the following output by means of a sed/perl/awk script/oneliner.
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35
Solutions which i have tired it so far
For appending with previous line
sed '/^[^"]*"[^"]*$/{N;s/\n//}' sample.txt
for replacing control-m characters
perl -pne 's/\\r//g' sample.txt
But i didn't achieve final output what i required below

Use a library to parse CSV files. Apart from always wanting to use a library for that here you also have very specific reasons, with embedded newlines and delimiters.
In Perl a good library is Text::CSV (which wraps Text::CSV_XS if installed). A basic example
use warnings;
use strict;
use feature 'say';
use Text::CSV;
my $file = shift or die "Usage: $0 file.csv\n";
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $row = $csv->getline($fh)) {
s/\n+//g for #$row;
$csv->say(\*STDOUT, $row);
}
Comments
The binary option in the constructor is what handles newlines embedded in data
Once a line is read into the array reference $row I remove newlines in each field with a simplistic regex. By all means please improve this as/if needed
The pruning of $row works as follows. In a foreach loop each element is really aliased by the loop variable, so if that gets changed the array changes. I used default where elements are aliased by $_, which the regex changes so $row changes.
I like this compact shortcut because it has such a distinct look that I can tell from across the room that an array is being changed in place; so I consider it a sort-of-an-idiom. But if it is in fact confusing please by all means write out a full and proper loop
The processed output is printed to STDOUT. Or, open an output file and pass that filehandle to say (or to print in older module versions) so the output goes directly to that file
The above prints, for the sample input provided in the question
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35

This might work for you (GNU sed):
sed ':a;/[^"]$/{N;s/\n//;ba};s/"\([^",]*\)"/\1/g' file
The solution is in two parts:
Join broken lines to make whole ones.
Remove double quotes surrounding fields that do not contain commas.
If the current line does not end with double quotes, append the next line, remove the newline and repeat. Otherwise: remove double quotes surrounding fields that do not contain double quotes or commas.
N.B. Supposes that fields do not contain quoted double quotes. If that is the case, the condition for the first step would need to be amended and double quotes within fields would need to catered for.

FPAT is the way to go using gnu awk, it handles comma separated files.
remove ^m
clean lines
remove qutes
.
dos2unix sample.txt
awk '{printf "%s"(/,$/?"":"\n"),$0}' sample.txt > tmp && mv tmp sample.txt
"name","address","age"
"ram","abcd,def","10"
"abhi","xyz","25"
"ad","ram,John","35"
awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") $i=substr($i,2,length($i)-2)}1' sample.txt
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35
All in one go:
dos2unix sample.txt && awk '{printf "%s"(/,$/?"":"\n"),$0}' sample.txt | awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") $i=substr($i,2,length($i)-2)}1'
Normally you set Filed Separator FS or F to tell how filed are separated. FPAT="([^,]+)|(\"[^\"]+\")" FPAT tells how the filed looks like using a regex. This regex is complicated and often used with CSV.
(i=1;i<=NF;i++) loop through on by one field on the line.
if($i!~",") if it does not contain comma, then
$i=substr($i,2,length($i)-2) remove first and last character, the "
If a field for some reason do not contain ", this is more robust:
awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") {n=split($i,a,"\"");$i=(n>1?a[2]:$i)}}1' file
It will not do any thing to a field not contains double quote.

With perl, please try the following:
perl -e '
while (<>) {
s/\r$//; # remove trailing CR code
$str .= $_;
}
while ($str =~ /("(("")|[^"])*"\n?)|((^|(?<=,))[^,]*((?=,)|\n))/g) {
$_ = $&;
if (/,/) { # the element contains ","
s/\n//g; # then remove newline(s) if any
} else { # otherwise remove surrounding double quotes
s/^"//s; s/"$//s;
}
push(#ary, $_);
if (/\n$/) { # newline terminates the element
print join(",", #ary);
#ary = ();
}
}' sample.txt
Output:
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35

Get the same SHA-224 sum in Factor as coreutils sha224sum

$ echo *
a b c
$ cat *
file 1
file 2
file 3
$ factor -e=" \
> USING: globs io sequences sorting io.files io.encodings.utf8 ; \
> \"*\" glob natural-sort [ utf8 file-lines ] map concat [ print ] each "
file 1
file 2
file 3
The outputs are the same using Factor's glob and the shell's glob. A diff on the outputs shows they match exactly.
$ factor -e=" \
> USING: math.parser checksums checksums.sha globs io sequences sorting io.files io.encodings.utf8 ; \
> \"*\" glob natural-sort [ utf8 file-lines ] map concat sha-224 checksum-lines bytes>hex-string print "
0feaf7d5c46b802404760778091ed1312ba82d4206b9f93c35570a1a
$ cat * | sha224sum
d1240479399e5a37f8e62e2935a7ac4b9352e41d6274067b27a36101
But the checksums don't match, nor will md5 checksums. Why is this? How do I get the same checksum in Factor as in coreutils sha224sum?
Changing the encoding to ascii doesn't change the output, nor does "\n" join sha-224 checksum-bytes instead of checksum-lines.

This odd behaviour is due to a bug in checksum-lines. factor/factor#1708
Thanks to jonenst for finding the problem, and calsioro for this code on the Factor mailing list:
This code:
[
{ "a" "b" "c" } 3 [1,b]
[ number>string "file " prepend [ write ] curry
ascii swap with-file-writer ] 2each
"*" glob natural-sort [ utf8 file-lines ] map concat
[ "\n" append ] map "" join ! Add newlines between and at the end
sha-224 checksum-bytes bytes>hex-string print
] with-test-directory
gives the same hash:
d1240479399e5a37f8e62e2935a7ac4b9352e41d6274067b27a36101

jonenst also pointed out that:
Also, regarding the three different lengths you get, "exercism/self-update/self-update.factor" is missing a '\n' char at the end of the last line. That's why you get surprising results.
If you're trying to checksum files, make sure they all end with a trailing newline.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

regex for json file with multiple piping - json

If you want to get the children array try this but i'm not sure it's what you look for. wget -O - https://www.reddit.com/r/NetflixBestOf/.json | sed -n '/children/,/],/p'

Related

Bulk update values in json files (writing files)

Use variables in JQ queries

Grep from txt file (JSON format)

Remove double quotes if delimiter value is not present in data

Get the same SHA-224 sum in Factor as coreutils sha224sum

Categories

Resources