This is an example of git log output in JSON format.
The issue is that, from time to time, the body key contains line breaks, which makes parsing this JSON file impossible unless they are fixed.
# start of cross-section
[{
"commit-hash": "11d07df4ce627d98bd30eb1e37c27ac9515c75ff",
"abbreviated-commit-hash": "11d07df",
"author-name": "Robert Lucian CHIRIAC",
"author-email": "robert.lucian.chiriac#gmail.com",
"author-date": "Sat, 27 Jan 2018 22:33:37 +0200",
"subject": "#fix(automation): patch versions aren't released",
"sanitized-subject-line": "fix-automation-patch-versions-aren-t-released",
"body": "Nothing else to add.
Fixes #24.",
"commit-notes": ""
},
# end of cross-section
I've been going through sed's manual page and the explanation is quite hard to digest. Does anyone have suggestions on how I can put the value of body on a single line and get rid of all those line breaks? The idea is to make the file valid so it can be parsed.
At the end, it should look like this:
...
"body": "Nothing else to add. Fixes #24."
...
This, using GNU awk for multi-char RS and patsplit(), will work whether or not there are escaped quotes in the input:
$ cat tst.awk
BEGIN { RS="^$"; ORS="" }
{
    gsub(/#/,"#A")                          # encode "#" so the markers below cannot clash with existing text
    gsub(/\\"/,"#B")                        # hide escaped quotes behind a marker
    nf = patsplit($0,flds,/"[^"]*"/,seps)   # split into "..." strings and the separators between them
    $0 = ""
    for (i=0; i<=nf; i++) {
        $0 = $0 gensub(/\s*\n\s*/," ","g",flds[i]) seps[i]   # collapse newlines inside each quoted string
    }
    gsub(/#B/,"\\\"")                       # put the escaped quotes back
    gsub(/#A/,"#")                          # put the "#" characters back
    print
}
$ awk -f tst.awk file
# start of cross-section
[{
"commit-hash": "11d07df4ce627d98bd30eb1e37c27ac9515c75ff",
"abbreviated-commit-hash": "11d07df",
"author-name": "Robert Lucian CHIRIAC",
"author-email": "robert.lucian.chiriac#gmail.com",
"author-date": "Sat, 27 Jan 2018 22:33:37 +0200",
"subject": "#fix(automation): patch versions aren't released",
"sanitized-subject-line": "fix-automation-patch-versions-aren-t-released",
"body": "Nothing else to add. Fixes #24.",
"commit-notes": ""
},
# end of cross-section
It replaces every escaped quote with a string that cannot exist in the input (which the first gsub() ensures), then operates on the "..." strings, then puts the escaped quotes back.
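As a minimal sketch of that placeholder round-trip (the sample string is made up, not from the question's data), note that the two substitution pairs undo each other exactly:
printf '%s\n' 'say \"hi\" # done' | awk '{
  gsub(/#/,"#A"); gsub(/\\"/,"#B")     # hide every "#", then hide every escaped quote
  gsub(/#B/,"\\\""); gsub(/#A/,"#")    # restore them in the reverse order
  print                                # prints the input back unchanged
}'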
You could try this but escaped double quotes in the string values will probably break it:
Using double quote as the field separator, we count how many fields are in each line.
We expect there to be 5 fields.
If there are 4, then we have an "open" string.
If we're in an open string, when we see 2 fields, that line contains the closing double quote
awk -F'"' '
NF == 4 {in_string = 1}
in_string && NF == 2 {in_string = 0}
{printf "%s%s", $0, in_string ? " " : ORS}
' file.json
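To see why those field counts signal an open string, here are the two body lines from the sample above run through the same field separator:
printf '%s\n' '"body": "Nothing else to add.' | awk -F'"' '{print NF}'   # 4: a string is opened but not closed
printf '%s\n' 'Fixes #24.",'                  | awk -F'"' '{print NF}'   # 2: this line closes it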
To handle the inner quotes problem, let's try replacing all escaped quotes with other text, handle the newlines, then restore the escaped quotes:
awk -F'"' -v escaped_quote_marker='!#_Q_#!' '
{gsub(/\\\"/, escaped_quote_marker)}
NF == 4 {in_string = 1}
in_string && NF == 2 {in_string = 0}
{
gsub(escaped_quote_marker, "\\\"")
printf "%s%s", $0, in_string ? " " : ORS
}
' <<END
[{
"foo":"bar",
"baz":"a string with \"escaped
quotes\" and \"newlines\"
."
}]
END
[{
"foo":"bar",
"baz":"a string with \"escaped quotes\" and \"newlines\" ."
}]
I assume git log is at least kind enough to escape quotes for you.
sed doesn't handle multi-line input easily. You may use perl in slurp mode:
perl -0777 -pe 's~("body":\h*"|\G(?<!^))([^\n"]*)\n+~$1$2 ~g' file
# start of cross-section
[{
"commit-hash": "11d07df4ce627d98bd30eb1e37c27ac9515c75ff",
"abbreviated-commit-hash": "11d07df",
"author-name": "Robert Lucian CHIRIAC",
"author-email": "robert.lucian.chiriac#gmail.com",
"author-date": "Sat, 27 Jan 2018 22:33:37 +0200",
"subject": "#fix(automation): patch versions aren't released",
"sanitized-subject-line": "fix-automation-patch-versions-aren-t-released",
"body": "Nothing else to add. Fixes #24.",
"commit-notes": ""
},
# end of cross-section
\G asserts position at the end of the previous match or the start of the string for the first match.
(?<!^) is a negative lookbehind to ensure we don't match at the start of the string.
The ("body":\h*"|\G(?<!^)) expression matches either "body": followed by the opening quote, or the position at the end of the previous match.
Related
I have a file containing multiple JSON objects and need to convert them to a single JSON array. I have bash and Excel installed, but cannot install any other tool.
{"name": "a","age":"17"}
{"name":"b","age":"18"}
To:
[{"name": "a","age":"17"},
{"name":"b","age":"18"}]
Assuming one object per line, as shown in the OP's question:
echo -n "["; while read line; do echo "${line},"; done < <(cat test.txt) | sed -re '$ s/(.*),/\1]/'
Result:
[{"name": "a","age":"17"},
{"name":"b","age":"18"}]
Inspired by https://askubuntu.com/a/475804
awk '(NR==FNR){count++} (NR!=FNR){ if (FNR==1) printf("["); printf("%s", $0); print (FNR==count)?"]":"," } END {if (count==0) print "[]"}' file file
A less compact but more readable version:
awk '
(NR==FNR) {
    count++;
}
(NR!=FNR) {
    if (FNR==1)
        printf("[");
    printf("%s", $0);
    if (FNR==count)
        print "]";
    else
        print ",";
}
END {
    if (count==0) print "[]";
}' file file
The trick is to give the same file to awk twice. Because NR==FNR is always true for the first file, the first pass is dedicated to counting the number of lines into the variable count.
The second pass (where NR!=FNR) applies the following algorithm to each line:
Write [ for the first line only
Then write the record, using printf instead of print in order to avoid the newline ending
Then write either ] or , depending on whether we are on the last line or not, using print in order to end with a newline
The END block is just a failsafe to output an empty array in case the file is empty.
Assumptions:
no requirement to (re)format the input data
Sample input:
$ cat raw.dat
{"name": "a","age":"17"}
{"name":"b","age":"18"}
{"name":"C","age":"23"}
One awk idea:
awk 'BEGIN {pfx="["} {printf "%s%s",pfx,$0; pfx=",\n"} END {printf "]\n"}' raw.dat
Where:
for each input line we printf the line without a terminating linefeed
for the first line we use a prefix (pfx) of [
for subsequent lines the prefix (pfx) is set to ,\n (ie, terminate the previous line with ,\n)
once the file has been processed we terminate the last input line with a printf "]\n"
requires a single pass through the input file
This generates:
[{"name": "a","age":"17"},
{"name":"b","age":"18"},
{"name":"C","age":"23"}]
Making sure @chepner's comment (re: a sed solution) isn't lost in the mix:
sed '1s/^/[/;2,$s/^/,/;$s/$/]/' raw.dat
This generates:
[{"name": "a","age":"17"}
,{"name":"b","age":"18"}
,{"name":"C","age":"23"}]
NOTE: I can remove this if @chepner wants to post this as an answer.
I have the following command to grab JSON in Unix:
wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json
Which gives me the following output format (with different results each time obviously):
{
"kind": "...",
"data": {
"modhash": "",
"whitelist_status": "...",
"children": [
e1,
e2,
e3,
...
],
"after": "...",
"before": "..."
}
}
where each element of the array children is an object structured as follows:
{
"kind": "...",
"data": {
...
}
}
Here is an example of a complete .json get (the body is too long to post directly):
https://pastebin.com/20p4kk3u
I need to print the complete data object present inside each element of the children array. I know I need to pipe at least twice, first to get children [...], then data {...} from there, and this is what I have so far:
wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json | tr -d '\r\n' | grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])' | grep -oP '"data"\s*:\s*\K({.+?})(?=\s*},)'
I'm new to regular expressions, so I'm not sure how to handle having brackets or curly braces within elements of what I'm grepping. The line above prints nothing to the shell and I'm not sure why. Any help is appreciated.
Code
wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json | tr -d '\r\n' | grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])' | grep -oP '"data"\s*:\s*\K({.+?})(?=\s*},)'
Something about regex
* == zero or more times
+ == one or more times
? == zero or one time
\s == any whitespace character (space, tab, carriage return, newline, vertical tab, or form feed)
\w == a word character: A to Z (upper or lower case), 0 to 9, or underscore (_)
\d == any digit from 0 to 9
\r == carriage return
\n == new line character (line feed)
\ == escapes special characters so they are read as normal characters
[...] == a character class. Example: [abc] matches a, b, or c
(?=) == a positive lookahead, a type of zero-width assertion. It says the match must be followed by whatever is within the parentheses, but that part isn't captured.
\K == resets the start of the reported match to this position.
Anyway you can read more about regex from here: Regex Tutorial
Now I'll try to explain the code.
wget downloads the source.
tr removes all line feeds and carriage returns, so we have all the output on one line and it can be handled by grep.
grep's -o option prints only the matching parts.
grep's -P option enables Perl-compatible regexps.
So here
grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])'
we are saying:
match the literal text "children"
zero or more spaces
:
zero or more spaces
\[ escaped so it's a plain character and not a special one
zero or more spaces
\K force the reported match to start from here
( open the submatch
{.+?} everything between the braces, matched non-greedily (the braces are included because they come after the match-start marker; see greedy vs. non-greedy in the regex tutorial to understand how .+? works)
) close the submatch
(?=\s*\]) stop the match where zero or more spaces followed by a literal ] are found, but don't include them in the match.
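As a toy illustration of \K and the lookahead (an invented one-line string, with the quotes around the key omitted for brevity):
echo 'children: [ {"a":1} ]' | grep -oP 'children\s*:\s*\[\s*\K({.+?})(?=\s*\])'
# prints: {"a":1}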
If you want to get the whole children array, try this, but I'm not sure it's what you're looking for:
wget -O - https://www.reddit.com/r/NetflixBestOf/.json | sed -n '/children/,/],/p'
$ echo *
a b c
$ cat *
file 1
file 2
file 3
$ factor -e=" \
> USING: globs io sequences sorting io.files io.encodings.utf8 ; \
> \"*\" glob natural-sort [ utf8 file-lines ] map concat [ print ] each "
file 1
file 2
file 3
The outputs are the same using Factor's glob and the shell's glob. A diff on the outputs shows they match exactly.
$ factor -e=" \
> USING: math.parser checksums checksums.sha globs io sequences sorting io.files io.encodings.utf8 ; \
> \"*\" glob natural-sort [ utf8 file-lines ] map concat sha-224 checksum-lines bytes>hex-string print "
0feaf7d5c46b802404760778091ed1312ba82d4206b9f93c35570a1a
$ cat * | sha224sum
d1240479399e5a37f8e62e2935a7ac4b9352e41d6274067b27a36101
But the checksums don't match, nor do MD5 checksums. Why is this? How do I get the same checksum in Factor as with coreutils' sha224sum?
Changing the encoding to ascii doesn't change the output, nor does "\n" join sha-224 checksum-bytes instead of checksum-lines.
This odd behaviour is due to a bug in checksum-lines. factor/factor#1708
Thanks to jonenst for finding the problem, and calsioro for this code on the Factor mailing list:
This code:
[
{ "a" "b" "c" } 3 [1,b]
[ number>string "file " prepend [ write ] curry
ascii swap with-file-writer ] 2each
"*" glob natural-sort [ utf8 file-lines ] map concat
[ "\n" append ] map "" join ! Add newlines between and at the end
sha-224 checksum-bytes bytes>hex-string print
] with-test-directory
gives the same hash:
d1240479399e5a37f8e62e2935a7ac4b9352e41d6274067b27a36101
jonenst also pointed out that:
Also, regarding the three different lengths you get, "exercism/self-update/self-update.factor" is missing a '\n' char at the end of the last line. That's why you get surprising results.
If you're trying to checksum files, make sure they all end with a trailing newline.
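A quick shell illustration of how much that one byte matters (run it to see two different digests):
printf 'hello\n' | sha224sum   # digest of the text with a trailing newline
printf 'hello'   | sha224sum   # a completely different digest without it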
Here's a sample file that I want to convert to json.
Name: Jack
Address: Fancy road and some special characters :"'$#|,
City
Country
ID: 1
The special characters are double quote, single quote, $, #, and pipe. I thought I could use the record separator in awk:
awk -F ":" '{RS="\n"}{print $1}'
However, what I get is:
Name:
Address
City
Country
ID
I experimented with changing the record separator to "^[a-zA-Z0-9]" to try to catch lines that don't start with a space, but somehow this doesn't work. Another attempt was to simply parse the file line by line and format the output conditionally on each line's content, but this is slow.
Ideally, I would convert the file to:
{
"Name": "Jack",
"Address": "Fancy road and some special characters :\"'$#|, City, Country",
"ID": "1"
}
I don't know why your question talks about non-empty lines when there are no empty lines in your example, but with GNU awk for the 3rd arg to match() and gensub():
$ cat tst.awk
BEGIN { printf "{" }
match($0,/^(\S[^:]+):\s*(.*)/,a) {                           # a "Key: value" header line
    prt()
    key = a[1]
    val = a[2]
    next
}
{ val = gensub(/,\s*$/,"",1,val) gensub(/^\s*/,", ",1) }     # continuation line: append it to val
END { prt(); print "\n}" }
function prt() {
    if (key != "") {
        printf "%s\n\"%s\": \"%s\"", (++c>1?",":""), key, gensub(/"/,"\\\\&","g",val)
    }
}
$ awk -f tst.awk file
{
"Name": "Jack",
"Address": "Fancy road and some special characters :\"'$#|, City, Country",
"ID": "1"
}
Some extra comments on the code:
match()
The match function searches the string, string, for the longest, leftmost substring matched by the regular expression, regexp. It returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string).
\S
Matches any character that is not whitespace. Think of it as shorthand for ‘[^[:space:]]’.
\s
Matches any whitespace character. Think of it as shorthand for ‘[[:space:]]’.
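As a one-line check of how that match() call splits a header line (GNU awk, using a line from the sample file):
echo 'Name: Jack' | gawk 'match($0,/^(\S[^:]+):\s*(.*)/,a){print "key=" a[1], "val=" a[2]}'
# prints: key=Name val=Jack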
I want to convert my sqlite data from my database to JSON format.
I would like to use this syntax:
sqlite3 -line members.db "SELECT * FROM members LIMIT 3" > members.txt
OUTPUT:
id = 1
fname = Leif
gname = Håkansson
genderid = 1
id = 2
fname = Yvonne
gname = Bergman
genderid = 2
id = 3
fname = Roger
gname = Sjöberg
genderid = 1
How can I do this with nice, structured code in a for loop?
(Only in Bash)
I have tried some awk and grep, but without great success yet.
Some tips would be nice.
I want a result similar to this:
[
{
"id":1,
"fname":"Leif",
"gname":"Hakansson",
"genderid":1
},
{
"id":2,
"fname":"Yvonne",
"gname":"Bergman",
"genderid":2
},
{
"id":3,
"fname":"Roger",
"gname":"Sjberg",
"genderid":1
}
]
If your sqlite3 is compiled with the json1 extension (or if you can obtain a version of sqlite3 with the json1 extension), then you can use it to generate JSON objects (one JSON object per row). For example:
select json_object('id', id, 'fname', fname, 'gname', gname, 'genderid', genderid) ...
You can then use a tool such as jq to convert the stream of objects into an array of objects, e.g. by piping the output of the sqlite3 command to jq -s .
(A less tiresome alternative might be to use the sqlite3 function json_array(), which produces an array, which you can reassemble into an object using jq.)
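Putting those two pieces together, a rough sketch (assuming the json1 extension is available and the members table from the question):
sqlite3 members.db \
  "SELECT json_object('id', id, 'fname', fname, 'gname', gname, 'genderid', genderid)
   FROM members LIMIT 3" | jq -s .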
If the json1 extension is unavailable, then you could use the following as a starting point:
awk 'BEGIN { print "["; }
function out() {if (n++) {print ","}; if (line) {print "{" line "}"}; line="";}
function trim(x) { sub(/^ */, "", x); sub(/ *$/, "", x); return x; }
NF==0 { out(); next};
{if (line) {line = line ", " }
i=index($0,"=");
line = line "\"" trim(substr($0,1,i-1)) ": \"" substr($0, i+2) "\""}
END {out(); print "]"} '
Alternatively, you could use the following jq script, which converts numeric strings that occur on the RHS of "=" to numbers:
def trim: sub("^ *"; "") | sub(" *$"; "");
def keyvalue: index("=") as $i
| {(.[0:$i] | trim): (.[$i+2:] | (tonumber? // .))};
[foreach (inputs, "") as $line ({object: false, seed: {} };
if ($line|trim) == "" then { object: .seed, seed : {} }
else {object: false,
seed: (.seed + ($line | keyvalue)) }
end;
.object | if . and (. != {}) then . else empty end ) ]
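A possible invocation, assuming the program above is saved as members.jq (a file name of my choosing): jq needs -n so that inputs sees every line, and -R to read those lines as raw strings.
sqlite3 -line members.db "SELECT * FROM members LIMIT 3" | jq -nR -f members.jq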
Just add the -json argument with SQLite 3.33.0 or higher to get JSON output:
$ sqlite3 -json database.db "select * from TABLE_NAME"
From the SQLite Release 3.33.0 release notes:
...
CLI enhancements:
Added four new output modes: "box", "json", "markdown", and "table".
The "column" output mode automatically expands columns to contain the longest output row and automatically turns ".header" on if it has
not been previously set.
The "quote" output mode honors ".separator"
The decimal extension and the ieee754 extension are built-in to the CLI
...
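Applied to the database from the question, that looks something like this (the jq pipe is optional, just to pretty-print):
sqlite3 -json members.db "SELECT * FROM members LIMIT 3" | jq .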
I think I would prefer to parse sqlite output with a single line per record rather than the very wordy output format you suggested with sqlite3 -line. So, I would go with this:
sqlite3 members.db "SELECT * FROM members LIMIT 3"
which gives me this to parse:
1|Leif|Hakansson|1
2|Yvonne|Bergman|2
3|Roger|Sjoberg|1
I can now parse that with awk if I set the input separator to | with
awk -F '|'
and pick up the 4 fields on each line with the following and save them in an array like this:
{ id[++i]=$1; fname[i]=$2; gname[i]=$3; genderid[i]=$4 }
Then all I need to do is print the output format you need at the end. However, you have double quotes in your output and they are a pain to quote in awk, so I temporarily use a pipe symbol (|) in place of each double quote and then, at the very end, get tr to replace all the pipe symbols with double quotes, just to make the code easier on the eye. So the total solution looks like this:
sqlite3 members.db "SELECT * FROM members LIMIT 3" | awk -F'|' '
# sqlite output line - pick up fields and store in arrays
{ id[++i]=$1; fname[i]=$2; gname[i]=$3; genderid[i]=$4 }
END {
printf "[\n";
for(j=1;j<=i;j++){
printf " {\n"
printf " |id|:%d,\n",id[j]
printf " |fname|:|%s|,\n",fname[j]
printf " |gname|:|%s|,\n",gname[j]
printf " |genderid|:%d\n",genderid[j]
closing=" },\n"
if(j==i){closing=" }\n"}
printf closing;
}
printf "]\n";
}' | tr '|' '"'
Sqlite-utils does exactly what you're looking for. By default, the output will be JSON.
Better late than never to plug jo.
Save the sqlite3 output to a text file.
Get jo (jo's also available in distro repos)
and use this bash script.
while read line
do
id=`echo $line | cut -d"|" -f1`
fname=`echo $line | cut -d"|" -f2`
gname=`echo $line | cut -d"|" -f3`
genderid=`echo $line | cut -d"|" -f4`
jsonline=`jo id="$id" fname="$fname" gname="$gname" genderid="$genderid"`
json="$json $jsonline"
done < "$1"
jo -a $json
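One way to wire it up, assuming the loop above is saved as to-json.sh (a name I made up) and that sqlite3's default pipe-separated output is used:
sqlite3 members.db "SELECT * FROM members LIMIT 3" > members.txt
bash to-json.sh members.txt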
Please don't create (or parse) json with awk. There are dedicated tools for this. Tools like xidel.
While first and foremost a html, xml and json parser, xidel can also parse plain text.
I'd like to offer a very elegant solution using this tool (with much less code than jq).
I'll assume your 'members.txt' as input.
First, create a sequence of the JSON objects-to-be:
xidel -s members.txt --xquery 'tokenize($raw,"\n\n")'
Or...
xidel -s members.txt --xquery 'tokenize($raw,"\n\n") ! (position(),.)'
1
id = 1
fname = Leif
gname = Håkansson
genderid = 1
2
id = 2
fname = Yvonne
gname = Bergman
genderid = 2
3
id = 3
fname = Roger
gname = Sjöberg
genderid = 1
...to better show you the individual items in the sequence.
Now you have 3 multi-line strings. To turn each item/string into another sequence where each item is a new line:
xidel -s members.txt --xquery 'tokenize($raw,"\n\n") ! x:lines(.)'
(x:lines(.) is a shorthand for tokenize(.,'\r\n?|\n'))
Now, for each line, tokenize on " = " (which creates yet another sequence) and save it to a variable. For the first line, for example, this sequence is ("id","1"); for the second line, ("fname","Leif"); etc.:
xidel -s members.txt --xquery 'tokenize($raw,"\n\n") ! (for $x in x:lines(.) let $a:=tokenize($x," = ") return ($a[1],$a[2]))'
Finally remove leading whitespace (normalize-space()), create a json object ({| {key-value-pair} |}) and put all json objects in an array ([ ... ]):
xidel -s members.txt --xquery '[tokenize($raw,"\n\n") ! {|for $x in x:lines(.) let $a:=tokenize($x," = ") return {normalize-space($a[1]):$a[2]}|}]'
Prettified + output:
xidel -s members.txt --xquery '
[
tokenize($raw,"\n\n") ! {|
for $x in x:lines(.)
let $a:=tokenize($x," = ")
return {
normalize-space($a[1]):$a[2]
}
|}
]
'
[
{
"id": "1",
"fname": "Leif",
"gname": "Håkansson",
"genderid": "1"
},
{
"id": "2",
"fname": "Yvonne",
"gname": "Bergman",
"genderid": "2"
},
{
"id": "3",
"fname": "Roger",
"gname": "Sjöberg",
"genderid": "1"
}
]
Note: For xidel-0.9.9.7173 and newer, --json-mode=deprecated is needed to create a JSON array with [ ]. The new (XQuery 3.1) way to create a JSON array is to use array{ }.
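For those newer versions, an untested sketch of the same query using the array{ } constructor instead of [ ]:
xidel -s members.txt --xquery '
  array{
    tokenize($raw,"\n\n") ! {|
      for $x in x:lines(.)
      let $a:=tokenize($x," = ")
      return {
        normalize-space($a[1]):$a[2]
      }
    |}
  }
'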