Combining two files into single file - json

I have a script which combines many JSON files into a single one, but the first file is printed twice in the final output file even though I have removed the first file from the list. Please advise how to print the first file only once.
The Bash script is pasted below.
#!/bin/bash
shopt -s nullglob
declare -a jsons
jsons=(o-*.json)
echo '[' > final.json
if [ ${#jsons[@]} -gt 0 ]; then
cat "${jsons[0]}" >> final.json
unset $jsons[0]
for f in "${jsons[#]}"; do # iterate over the rest
echo "," >>final.json
cat "$f" >>final.json
done
fi
echo ']' >>final.json

You can't use unset ${foo[0]} to remove an item from an array variable in bash.
$ foo=(a b c)
$ echo "${foo[#]}"
a b c
$ unset ${foo[0]}
$ echo "${foo[#]}"
a b c
You'll need to reset the array itself using array slicing.
$ foo=("${foo[#]:1}")
$ echo "${foo[#]}"
b c
See: How can I remove an element from an array completely?
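Applied to your script, a corrected version might look like this (a sketch; it slices the array directly in the for loop instead of reassigning it):
#!/bin/bash
shopt -s nullglob
jsons=(o-*.json)
echo '[' > final.json
if [ ${#jsons[@]} -gt 0 ]; then
    cat "${jsons[0]}" >> final.json
    for f in "${jsons[@]:1}"; do   # iterate over the rest, skipping the first
        echo "," >> final.json
        cat "$f" >> final.json
    done
fi
echo ']' >> final.json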

Related

How can I write a batch file using jq to find JSON files with a certain attribute and copy them to a new location

I have hundreds of thousands of lined JSON files that I need to split out based on whether or not they contain a certain value for an attribute, and then I need to convert them into valid JSON that can be read by another platform.
I'm using a batch file to do this and I've managed to convert them into valid json using the following:
for /r %%f in (*.json*) do jq -s -c "." "%%f" >> "C:\Users\me\my-folder\%%~nxf.json"
I just can't figure out how to only copy the files that contain a certain value. So logic should be:
Look at all the files in the folders and subfolders
If the file contains an attribute "event" with a value of "abcd123"
then: convert the file into valid json and persist it with the same filename over to location "C:\Users\me\my-folder\"
else: ignore it
Example of files it should select:
{"name":"bob","event":"abcd123"}
and
{"name":"ann","event":"abcd123"},{"name":"bob","event":"8745LLL"}
Example of files it should NOT select:
{"name":"ann","event":"778PPP"}
and
{"name":"ann","event":"778PPP"},{"name":"bob","event":"8745LLL"}
I'd love help figuring out the filtering part.
Since there are probably more file names than will fit on the command line, this response will assume a shell loop through the file names will be necessary, as the question itself envisions. Since I'm currently working with a bash shell, I'll present a bash solution, which hopefully can readily be translated to other shells.
The complication in the question is that the input file might contain one or more valid JSON values, or one or more comma-separated JSON values.
The key to a simple solution using jq is jq's -e command-line option, since this sets the return code to 0 if and only if
(a) the program ran normally; and (b) the last result was a truthy value.
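For example (a quick illustration of the exit codes, not part of the solution itself):
$ echo 'true' | jq -e . > /dev/null; echo $?
0
$ echo 'false' | jq -e . > /dev/null; echo $?
1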
For clarity, let's encapsulate the relevant selection criterion in two bash functions:
# If the input is a valid stream of JSON objects
function try {
jq -e -n 'any( inputs | objects; select( .event == "abcd123") | true)' 2> /dev/null > /dev/null
}
# If the input is a valid JSON array whose elements are to be checked
function try_array {
jq -e 'any( .[] | objects; select( .event == "abcd123") | true)' 2> /dev/null > /dev/null
}
Now a comprehensive solution can be constructed along the following lines:
find . -maxdepth 1 -type f -name '*.json' | while read -r f
do
    < "$f" try
    status=$?                  # capture the exit status before any test overwrites $?
    if [ $status = 0 ] ; then
        echo copy "$f"
    elif [ $status = 5 ] ; then
        (echo '['; cat "$f"; echo ']') | try_array
        if [ $? = 0 ] ; then
            echo copy "$f"
        fi
    fi
done
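To actually perform the conversion and copy rather than just echoing, the echo copy lines could be replaced with something along these lines (a sketch; DEST is a hypothetical destination directory you would set yourself):
DEST=/path/to/my-folder   # hypothetical destination directory
jq -s -c '.' "$f" > "$DEST/$(basename "$f")"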
Have you considered using findstr?
%SystemRoot%\System32\findstr.exe /SRM "\"event\":\"abcd123\"" "C:\Users\me\my-folder\*.json"
Please open a Command Prompt window, type findstr /?, press the ENTER key, and read its usage information. (You may want to consider the /I option too, for instance).
You could then use that within another for loop to propagate those files into a variable for your copy command.
batch-file example:
@For /F "EOL=? Delims=" %%G In (
'%SystemRoot%\System32\findstr.exe /SRM "\"event\":\"abcd123\"" "C:\Users\me\my-folder\*.json"'
) Do @Copy /Y "%%G" "S:\omewhere Else"
cmd example:
For /F "EOL=? Delims=" %G In ('%SystemRoot%\System32\findstr.exe /SRM "\"event\":\"abcd123\"" "C:\Users\me\my-folder\*.json"') Do @Copy /Y "%G" "S:\omewhere Else"

Using jq to combine json files, getting file list length too long error

Using jq to concat json files in a directory.
The directory contains a few hundred thousand files.
jq -s '.' *.json > output.json
returns an error that the file list is too long. Is there a way to write this that uses a method that will take in more files?
If jq -s . *.json > output.json produces "argument list too long", you could work around it using zargs in zsh:
$ zargs *.json -- cat | jq -s . > output.json
You could emulate that using find, as shown in @chepner's answer:
$ find -maxdepth 1 -name \*.json -exec cat {} + | jq -s . > output.json
"Data in jq is represented as streams of JSON values ... This is a cat-friendly format - you can just join two JSON streams together and get a valid JSON stream.":
$ echo '{"a":1}{"b":2}' | jq -s .
[
  {
    "a": 1
  },
  {
    "b": 2
  }
]
The problem is that the length of a command line is limited, and *.json produces too many arguments for one command line. One workaround is to expand the pattern in a for loop, which does not have the same limits as a command line, because bash can iterate over the result internally rather than having to construct an argument list for an external command:
for f in *.json; do
cat "$f"
done | jq -s '.' > output.json
This is rather inefficient, though, since it requires running cat once for each file. A more efficient solution is to use find to call cat with as many files as possible each time.
find . -name '*.json' -exec cat '{}' + | jq -s '.' > output.json
(You may be able to simply use
find . -name '*.json' -exec jq -s '.' '{}' + > output.json
as well; it may depend on what is in the files and how multiple calls to jq with the -s option compare to a single call.)
[EDITED to use find]
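If any file names contain spaces or newlines, a null-delimited variant of the same idea (a sketch) avoids word-splitting problems:
find . -maxdepth 1 -name '*.json' -print0 | xargs -0 cat | jq -s '.' > output.json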
One obvious thing to consider would be to process one file at a time, and then "slurp" them:
$ while IFS= read -r f ; do cat "$f" ; done < <(find . -maxdepth 1 -name "*.json") | jq -s .
This however would presumably require a lot of memory. Thus the following may be closer to what you need:
#!/bin/bash
# "slurp" a bunch of files
# Requires a version of jq with 'inputs'.
echo "["
while read -r f
do
jq -nr 'inputs | (., ",")' "$f"
done < <(find . -maxdepth 1 -name "*.json") | sed '$d'
echo "]"

Adding header to all .csv files in folder and include filename

I'm a command line newbie and I'm trying to figure out how I can add a header to multiple .csv files. The new header should have the following: 'TaxID' and 'filename'
I've tried multiple commands like sed, ed, awk, and echo, but if it worked, it only changed the first file it found (I used *.csv in my command), and I can only manage this for TaxID.
Can anyone help me to get the filename into the header as well and do this for all my csv files?
(Note, I'm using a Mac)
Thank you!
Here's one way to do it, there are certainly others:
$ for i in *.csv;do echo $i;cp "$i" "$i.bak" && { echo "TaxID,$i"; cat "$i.bak"; } >"$i";done
Here's a sample run:
$ cat file1.csv
1,2
3,4
$ cat file2.csv
a,b
c,d
$ for i in *.csv;do echo $i;cp "$i" "$i.bak" && { echo "TaxID,$i"; cat "$i.bak"; } >"$i";done
file1.csv
file2.csv
$ cat file1.csv.bak
1,2
3,4
$ cat file1.csv
TaxID,file1.csv
1,2
3,4
$ cat file2.csv.bak
a,b
c,d
$ cat file2.csv
TaxID,file2.csv
a,b
c,d
Breaking it down:
$ for i in *.csv; do
This loops over all the files ending in .csv in the current directory. Each will be put in the shell variable i in turn.
echo $i;
This just echoes the current filename so you can see the progress. This can be safely left out.
cp "$i" "$i.bak"
Copy the current file (whose name is in i) to a backup. This is both to preserve the file if something goes awry, and gives subsequent commands something to copy from.
&&
Only run the subsequent commands if the cp succeeds. If you can't make a backup, don't continue.
{
Start a group command.
echo "TaxID,$i";
Output the desired header.
cat "$i.bak";
Output the original file.
}
End the group command.
>"$i";
Redirect the output of the group command (the new header and the contents of the original file) to the original file. This completes one file.
done
Finish the loop over all the files.
For fun, here are a couple of other ways (one of which JRD beat me to), including one using ed!
$ for i in *.csv;do echo $i;perl -p -i.bak -e 'print "TaxID,$ARGV\n" if $. == 1' "$i";done
$ for i in *.csv;do echo $i;echo -e "1i\nTaxID,$i\n.\nw\nq\n" | ed "$i";done
Here is one way in perl that modifies the files in place by adding a header of TaxID,{filename}, skipping the header if it thinks one already exists.
ls
a.csv b.csv
cat a.csv
1,a.txt
2,b.txt
cat b.csv
3,c.txt
4,d.txt
ls *.csv | xargs -I{} -n 1 \
perl -p -i -e 'print "TaxID,{}\n" if !m#^TaxID# && !$h; $h = 1;' {}
cat a.csv
TaxID,a.csv
1,a.txt
2,b.txt
cat b.csv
TaxID,b.csv
3,c.txt
4,d.txt
You may want to create some backups of your files, or run on a few sample copies before running in earnest.
Explanatory:
List all files in the directory with the .csv extension
ls *.csv
"Pipe" the output of ls command into xargs so the perl command can run for each file. -I{} allows the filename to be subsequently referenced with {}. -n tells xargs to only pass 1 file at a time to perl.
| xargs -I{} -n 1
-p print each line of the input (file)
-i modifying the file in place
-e execute the following code
perl -p -i -e
Perl will implicitly loop over each line of the file and print it (due to -p). Print the header if we have not printed the header already and the current line doesn't already look like a header.
'print "TaxID,{}\n" if !m#^TaxID# && !$h; $h = 1;'
This is replaced with the filename.
{}
All told, in this example the commands to be run would be:
perl -p -i -e 'print "TaxID,a.csv\n" if !m#^TaxID# && !$h; $h = 1;' a.csv
perl -p -i -e 'print "TaxID,b.csv\n" if !m#^TaxID# && !$h; $h = 1;' b.csv
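If you'd rather avoid Perl altogether, a minimal plain-shell sketch of the same idea (with no check for an existing header) could be:
for f in *.csv; do
    printf 'TaxID,%s\n' "$f" | cat - "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done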

Bourne shell function return variable always empty

The following Bourne shell script, given a path, is supposed to test each component of the path for existence; then set a variable comprising only those components that actually exist.
#! /bin/sh
set -x # for debugging
test_path() {
path=""
echo $1 | tr ':' '\012' | while read component
do
if [ -d "$component" ]
then
if [ -z "$path" ]
then path="$component"
else path="$path:$component"
fi
fi
done
echo "$path" # this prints nothing
}
paths=/usr/share/man:\
/usr/X11R6/man:\
/usr/local/man
MANPATH=`test_path $paths`
echo $MANPATH
When run, it always prints nothing. The trace using set -x is:
+ paths=/usr/share/man:/usr/X11R6/man:/usr/local/man
++ test_path /usr/share/man:/usr/X11R6/man:/usr/local/man
++ path=
++ echo /usr/share/man:/usr/X11R6/man:/usr/local/man
++ tr : '\012'
++ read component
++ '[' -d /usr/share/man ']'
++ '[' -z '' ']'
++ path=/usr/share/man
++ read component
++ '[' -d /usr/X11R6/man ']'
++ read component
++ '[' -d /usr/local/man ']'
++ '[' -z /usr/share/man ']'
++ path=/usr/share/man:/usr/local/man
++ read component
++ echo ''
+ MANPATH=
+ echo
Why is the final echo $path empty? The $path variable within the while loop was incrementally set for each iteration just fine.
The pipe runs all commands involved in sub-shells, including the entire while ... loop. Therefore, all changes to variables in that loop are confined to the sub-shell and invisible to the parent shell script.
One way to work around that is putting the while ... loop and the echo into a list that executes entirely in the sub-shell, so that the modified variable $path is visible to echo:
test_path()
{
echo "$1" | tr ':' '\n' | {
while read component
do
if [ -d "$component" ]
then
if [ -z "$path" ]
then
path="$component"
else
path="$path:$component"
fi
fi
done
echo "$path"
}
}
However, I suggest using something like this:
test_path()
{
echo "$1" | tr ':' '\n' |
while read dir
do
[ -d "$dir" ] && printf "%s:" "$dir"
done |
sed 's/:$/\n/'
}
... but that's a matter of taste.
Edit: As others have said, the behaviour you are observing depends on the shell. The POSIX standard describes pipelined commands as run in sub-shells, but that is not a requirement:
Additionally, each command of a multi-command pipeline is in a subshell environment; as an extension, however, any or all commands in a pipeline may be executed in the current environment.
Bash runs them in sub-shells, but some shells run the last command in the context of the main script, when only the preceding commands in the pipeline are run in sub-shells.
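A quick way to see the effect in bash (a toy example, separate from the script above):
$ sum=0
$ printf '1\n2\n3\n' | while read -r n; do sum=$((sum + n)); done
$ echo "$sum"
0
The loop body ran in a subshell, so the parent shell's sum is unchanged.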
This should work in a Bourne shell that understands functions (and would work in Bash and other shells too):
test_path() {
echo $1 | tr ':' '\012' |
{
path=""
while read component
do
if [ -d "$component" ]
then
if [ -z "$path" ]
then path="$component"
else path="$path:$component"
fi
fi
done
echo "$path" # this prints nothing
}
}
The inner set of braces groups the commands into a unit, so path is only set in the subshell but is echoed from the same subshell.
Why is the final echo $path empty?
Until recently, Bash would give all components of a pipeline their own process, separate from the shell process in which the pipeline is run.
Separate process == separate address space, and no variable sharing.
In ksh93 and in recent Bash (may need a shopt setting), the shell will run the last component of a pipeline in the calling shell, so any variables changed inside the loop are preserved when the loop exits.
Another way to accomplish what you want is to make sure that the echo $path is in the same process as the loop, using parentheses:
#! /bin/sh
set -x # for debugging
test_path() {
path=""
echo $1 | tr ':' '\012' | ( while read component
do
[ -d "$component" ] || continue
path="${path:+$path:}$component"
done
echo "$path"
)
}
Note: I simplified the inner if. There was no else, so the test can be replaced with a shortcut. Also, the two path assignments can be combined into one, using the ${var:+...} parameter substitution trick.
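For reference, the ${var:+...} trick behaves like this (a small illustration):
$ path=""
$ component=/usr/share/man
$ path="${path:+$path:}$component"; echo "$path"
/usr/share/man
$ component=/usr/X11R6/man
$ path="${path:+$path:}$component"; echo "$path"
/usr/share/man:/usr/X11R6/man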
Your script works just fine with no change under Solaris 11, and probably also under most commercial Unixes like AIX and HP-UX, because on these OSes the underlying implementation of /bin/sh is provided by ksh. This would also be the case if /bin/sh were backed by zsh.
It likely doesn't work for you because your /bin/sh is implemented by one of bash, dash, mksh, or busybox sh, which all process each component of a pipeline in a subshell, while ksh and zsh both keep the last element of a pipeline in the current shell, saving an unnecessary fork.
It is possible to "fix" your script for it to work when sh is provided by bash by adding this line somewhere before the pipeline:
shopt -s lastpipe
or better, if you want to keep portability:
command -v shopt > /dev/null && shopt -s lastpipe
This will keep the script working for ksh and zsh, but it still won't work for dash, mksh, or the original Bourne shell.
Note that both bash and ksh behaviors are allowed by the POSIX standard.
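For illustration, here is the difference in bash (lastpipe only takes effect when job control is off, as in non-interactive shells; a toy example):
$ bash -c 'echo hi | read x; echo "[$x]"'
[]
$ bash -c 'shopt -s lastpipe; echo hi | read x; echo "[$x]"'
[hi]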

Read from file into variable - Bash Script take2

This is a second part of Read from file into variable - Bash Script
I have a bash script that reads strings in a file parses and assigns it to a variable. The file looks like this (file.txt):
database1 table1
database1 table4
database2
database3 table2
Using awk in the script:
s=$(awk '{$1=$1}1' OFS='.' ORS='|' file.txt)
LIST="${s%|}"
echo "$LIST"
database1.table1|database1.table4|database2|database3.table2
But I need to add some wildcards at the end of each substring. I need this result:
database1.table1.*|database1.table4.*|database2*.*|database3.table2.*
The conditions are: if we read database2 the output should be database2*.* and if we read a database and a table the output should be database1.table1.*
Use this awk:
s=$(awk '$0=="database2"{$0=$0 "*.*";print;next} {$2=$2 ".*"}1' OFS='.' ORS='|' file.txt)
LIST="${s%|}"
echo "$LIST"
database1.table1.*|database1.table4.*|database2*.*|database3.table2.*
Assuming the (slightly odd) regex is correct, the following awk script works for me on your example input.
BEGIN {OFS="."; ORS="|"}
!$2 {$1=$1"*"}
{$(NF+1)="*"}
1
Set OFS and ORS.
If we do not have a second field add a * to our first field.
Add a * as a final field.
Print the line.
Run as awk -f script.awk inputfile where the above script is in the script.awk (or whatever) file.
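If you prefer to keep the structure of your original command substitution, the same logic can be inlined (a sketch based on the script above):
s=$(awk 'BEGIN{OFS="."; ORS="|"} !$2{$1=$1"*"} {$(NF+1)="*"} 1' file.txt)
LIST="${s%|}"
echo "$LIST"
database1.table1.*|database1.table4.*|database2*.*|database3.table2.*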
I'd do it like this.
script.sh containing the following code:
#!/bin/bash
while IFS='' read -r line; do
    database=$(awk '{print $1}' <<< "$line")
    table=$(awk '{print $2}' <<< "$line")
    if [ "${table}" = '' ]; then
        list="${list}|${database}*.*"
    else
        list="${list}|${database}.${table}.*"
    fi
done < file.txt
list=$(cut -c 2- <<< "${list}")
echo "${list}"
exit 0
file.txt containing the following data:
database1 table1
database1 table4
database2
database3 table2
Script output is the following:
database1.table1.*|database1.table4.*|database2*.*|database3.table2.*
Tested in BASH version:
GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)