I am trying to increment a value in a CSV file, provided it matches a search string. Here is the script I am using:
awk -i inplace -F',' '$1 == "FL" { print $1, $2+1} ' data.txt
Contents of data.txt:
NY,1
FL,5
CA,1
Current Output:
FL 6
Intended Output:
NY,1
FL,6
CA,1
Thanks.
$ awk 'BEGIN{FS=OFS=","} $1=="FL"{++$2} 1' data.txt
NY,1
FL,6
CA,1
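If you also need the in-place edit from your original attempt, the same program can be combined with GNU awk's inplace extension. A minimal sketch (it overwrites data.txt, so test on a copy first):
gawk -i inplace 'BEGIN{FS=OFS=","} $1=="FL"{++$2} 1' data.txt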
Intended Output:
NY,1 FL,6 CA,1
I would harness GNU AWK for this task in the following way. Let file.txt content be
NY,1
FL,5
CA,1
then
awk 'BEGIN{FS=OFS=",";ORS=" "}{print $1,$2+($1=="FL")}' file.txt
gives output
NY,1 FL,6 CA,1
Explanation: I inform GNU AWK that the field separator (FS) and the output field separator (OFS) are , and that the output record separator (ORS) is a space, in accordance with your requirements. Then for each line I print the 1st field followed by the 2nd field increased by ($1=="FL"), which evaluates to 1 when the 1st field is FL and 0 when it is not. If you want to know more about FS, OFS or ORS, read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
(tested in gawk 4.2.1)
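Note that with ORS set to a space the output ends with a trailing space and no final newline. If a newline is wanted, one possible tweak (my addition, not part of the answer above) is an END block; the trailing space is still printed:
awk 'BEGIN{FS=OFS=",";ORS=" "}{print $1,$2+($1=="FL")}END{printf "\n"}' file.txt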
Use this Perl one-liner:
perl -i -F',' -lane 'if ( $F[0] eq "FL" ) { $F[1]++; } print join ",", @F;' data.txt
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F',' : Split into @F on comma, rather than on whitespace.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak. If you want to skip writing a backup file, just use -i and skip the extension.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
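For example, a quick demo of the in-place edit with a backup (demo.csv is a made-up copy of data.txt, not a file from the question):
cp data.txt demo.csv
perl -i.bak -F',' -lane 'if ( $F[0] eq "FL" ) { $F[1]++; } print join ",", @F;' demo.csv
cat demo.csv       # FL,5 has become FL,6
cat demo.csv.bak   # the untouched original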
I run a shell command that returns a list of repeated values like this (note the indentation):
Name: vm346
cpu 1 (12%) 6150m (76%)
memory 1130Mi (7%) 1130Mi (7%)
Name: vm847
cpu 6 (75%) 30150m (376%)
memory 12980Mi (87%) 12980Mi (87%)
Name: vm848
cpu 3500m (43%) 17150m (214%)
memory 6216Mi (41%) 6216Mi (41%)
I am trying to transform that data like this (in csv):
vm346,1,(12%),6150m,(76%),1130Mi,(7%),1130Mi,(7%)
vm847,6,(75%),30150m,(376%),12980Mi,(87%),12980Mi,(87%)
vm848,3500m,(43%),17150m,(214%),6216Mi,(41%),6216Mi,(41%)
The problem is that any given dataset like the one above is always on more than one line.
When I pipe that into awk it drives me mad, because even if I use:
BEGIN{ FS="\n" }
to try and stitch the data together in one line, it doesn't work. No matter what I do, awk keeps the name value as a separated line above everything else.
I am sorry I haven't much code to share but I have been spinning my wheels with this for a few hours now and I am running out of ideas...
I can solve this in Perl:
perl -ane 'print join ",", @F[1 .. $#F]; print $F[0] eq "memory" ? "\n" : ","'
It should be easy to translate it to awk if you need it.
How does it work?
-a splits each line on whitespace into the @F array
-n reads the input line by line and runs the code specified after -e for each line
We print all the elements but the first one separated by commas (see join)
We then look at the first column, if it's memory, we are at the last line of the block, so we print a newline, otherwise we print a comma
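As noted above, the same logic translates to awk fairly directly. One possible sketch (my rough translation, not part of the original answer): print fields 2..NF joined by commas, then a newline after each memory line and a comma otherwise.
awk '{
  for (i = 2; i <= NF; i++) printf "%s%s", $i, (i < NF ? "," : "")
  printf "%s", ($1 == "memory" ? "\n" : ",")
}' file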
With AWK, one option is to set RS to "Name: ", and ignore the first record with NR > 1, e.g.
awk -v RS="Name: " 'BEGIN{OFS=","} NR > 1 {print $1, $3, $4, $5, $6, $8, $9, $10, $11}' file
#> vm346,1,(12%),6150m,(76%),1130Mi,(7%),1130Mi,(7%)
#> vm847,6,(75%),30150m,(376%),12980Mi,(87%),12980Mi,(87%)
#> vm848,3500m,(43%),17150m,(214%),6216Mi,(41%),6216Mi,(41%)
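If the number of fields per block can vary, here is a sketch built on the same RS="Name: " trick (so, like the answer above, it relies on GNU awk's multi-character RS) that avoids hard-coding field positions by simply skipping the cpu/memory label fields. This is my own lightly tested variant, not part of the answer above:
awk -v RS="Name: " 'NR > 1 {
  out = $1
  for (i = 2; i <= NF; i++)
    if ($i != "cpu" && $i != "memory") out = out "," $i
  print out
}' file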
awk '{$1=""}1' | paste -sd'  \n' - | awk '{$1=$1}1' OFS=,
Get rid of the first column, then join every three rows (note the two spaces in the paste delimiter list: space, space, newline). Same idea with sed:
sed 's/^ *[^ ]* *//' | paste -sd'  \n' - | sed 's/  */,/g'
Something else:
awk -v OFS=, '              # OFS is set to "," so the output is comma-separated
$1=="Name:" {               # a new "Name:" block starts here
    sep=ors                 # first block: no prefix; later blocks: prefix with a newline
    ors=ORS
} {
    for (i=2;i<=NF;++i) {   # print every field except the first ("Name:", "cpu", "memory")
        printf "%s%s",sep,$i
        sep=OFS             # the remaining fields of the block are comma-separated
    }
} END {printf "%s",ors}'    # terminate the last line of output
Or if you want to print an ORS based on the first field being "memory" (note that this program may end without printing a terminating ORS):
awk -v OFS=, '{for (i=2;i<=NF;++i) printf "%s%s",$i,(i==NF && $1=="memory" ? ORS : OFS)}'
something else else:
awk -v OFS=, '
index($0,$1)==1 {           # the line is not indented, i.e. it is a "Name:" line
    OFS=ors                 # first block: join onto nothing; later blocks: start a new output line
    ors=ORS
} {
    $1=""                   # drop the first field; the assignment rebuilds $0 using the current OFS
    printf "%s",$0
    OFS=ofs                 # back to comma for the indented data lines
} END {printf "%s",ors} BEGIN {ofs=OFS}'
This might work for you (GNU sed):
sed -nE '/^ +\S+ +/{s///;H;$!d};x;/./s/\s+/,/gp;x;s/^\S+ +//;h' file
In overview, the sed program processes indented lines, already-gathered lines (except when the current line is the first line of the file), and non-indented lines.
Turn off implicit printing and enable extended regexp's. (-nE).
If the current line is indented, remove the indent, the first field and any following spaces, append the result to the hold space and if it is not the last line, delete it.
Otherwise, check the hold space for gathered lines and if found, replace one or more whitespaces by commas and print the result. Then prep the current line by removing the first field and any following spaces and replace the hold space with the result.
The solution may seem logically back-to-front, but programming in this style avoids having to check for end-of-file multiple times and avoids resorting to labels and gotos.
N.B. This solution will work for any number of indented lines.
Here is a ruby to do that:
ruby -e '
s=$<.read
s.scan(/^([^ \t]+:)([\s\S]+?)(?=^\1|\z)/m). # parse blocks
map(&:last). # get data part
# parse and join the data fields:
map{|block| block.split(/\n[ \t]+[^ \t]+[ \t]+/)}.
map{|lines| lines.map(&:strip).join(" ").split().join(",")}.
each{|l| puts "#{l}"}
' file
vm346,1,(12%),6150m,(76%),1130Mi,(7%),1130Mi,(7%)
vm847,6,(75%),30150m,(376%),12980Mi,(87%),12980Mi,(87%)
vm848,3500m,(43%),17150m,(214%),6216Mi,(41%),6216Mi,(41%)
The advantage is that this is not dependent on the number of lines or the number of fields. It is parsing data that is in blocks of the form:
START: ([ \t]+[data_with_no_space])*\n
l1 ([ \t]+[data_with_no_space])*\n
...
START:
...
Works this way:
Parse the blocks with the regex shown in the code above;
Save an array of the data elements;
Join the sub arrays and then split into data fields;
Join(',') to make a csv.
I have a directory with 2.5 million small JSON files in it. It's 104gb on disk. They're multi-line files.
I would like to create a set of JSON arrays from the files so that I can import them using mongoimport in a reasonable amount of time. The files can be no bigger than 16mb, but I'd be happy even if I managed to get them in sets of ten.
So far, I can use this to do them one at a time at about 1000/minute:
for i in *.json; do mongoimport --writeConcern 0 --db mydb --collection all --quiet --file $i; done
I think I can use "jq" to do this, but I have no idea how to make the bash loop pass 10 files at a time to jq.
Note that using bash find results in an error as there are too many files.
With jq you can use --slurp to create arrays, and -c to make multiline json single line. However, I can't see how to combine the two into a single command.
Please help with both parts of the problem if possible.
Here's one approach. To illustrate, I've used awk as it can read the list of files in small batches and because it has the ability to execute jq and mongoimport. You will probably need to make some adjustments to make the whole thing more robust, to test for errors, and so on.
The idea is either to generate a script that can be reviewed and then executed, or to use awk's system() command to execute the commands directly. First, let's generate the script:
ls *.json | awk -v group=10 -v tmpfile=json.tmp '
function out() {
print "jq -s . " files " > " tmpfile;
print "mongoimport --writeConcern 0 --db mydb --collection all --quiet --file " tmpfile;
print "rm " tmpfile;
files="";
}
BEGIN {n=1; files="";
print "test -r " tmpfile " && rm " tmpfile;
}
n % group == 0 {
out();
}
{ files = files " \""$0 "\"";
n++;
}
END { if (files) {out();}}
'
Once you've verified this works, you can either execute the generated script, or change the "print ..." lines to use "system(....)"
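For illustration, a sketch of what the direct-execution variant could look like: the same program, with each generated command run via system() instead of being printed for review. Adjust to taste before running it against real data:
ls *.json | awk -v group=10 -v tmpfile=json.tmp '
function out() {
  system("jq -s . " files " > " tmpfile);
  system("mongoimport --writeConcern 0 --db mydb --collection all --quiet --file " tmpfile);
  system("rm " tmpfile);
  files="";
}
BEGIN {n=1; files="";
  system("test -r " tmpfile " && rm " tmpfile);
}
n % group == 0 { out(); }
{ files = files " \"" $0 "\""; n++; }
END { if (files) { out(); } }
'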
Using jq to generate the script
Here's a jq-only approach for generating the script.
Since the number of files is very large, the following uses features that were only introduced in jq 1.5, so its memory usage is similar to the awk script above:
def read(n):
# state: [answer, hold]
foreach (inputs, null) as $i
([null, null];
if $i == null then .[0] = .[1]
elif .[1]|length == n then [.[1],[$i]]
else [null, .[1] + [$i]]
end;
  .[0] | select(.) );

# mongo(file) builds the import command string (it mirrors the awk version above)
def mongo(file):
  "mongoimport --writeConcern 0 --db mydb --collection all --quiet --file \(file)";

"test -r json.tmp && rm json.tmp",
(read($group|tonumber)
 | map("\"\(.)\"")
 | join(" ")
 | ("jq -s . \(.) > json.tmp", mongo("json.tmp"), "rm json.tmp") )
Invocation:
ls *.json | jq -nRr --arg group 10 -f generate.jq
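With hypothetical input files a.json through j.json and group=10, the generated script would look roughly like this, one jq/mongoimport/rm triple per batch (file names here are invented for illustration):
test -r json.tmp && rm json.tmp
jq -s . "a.json" "b.json" "c.json" "d.json" "e.json" "f.json" "g.json" "h.json" "i.json" "j.json" > json.tmp
mongoimport --writeConcern 0 --db mydb --collection all --quiet --file json.tmp
rm json.tmp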
Here is what I came up with. It seems to work and is importing at roughly 80 a second into an external hard drive.
#!/bin/bash
files=(*.json)
for((I=0;I<${#files[*]};I+=500)); do jq -c '.' "${files[@]:I:500}" | mongoimport --writeConcern 0 --numInsertionWorkers 16 --db mydb --collection all --quiet; echo $I; done
However, some are failing. I've imported 105k files but only 98547 appeared in the mongo collection. I think it's because some documents are > 16mb.
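If the suspicion is oversized documents, one quick way to look for individual files over MongoDB's 16 MB document limit is a size filter (with GNU find; 16M here means 16 MiB):
find . -maxdepth 1 -name '*.json' -size +16M -exec ls -lh {} +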
I often have to conduct very simple queries on tab separated files in bash. For example summing/counting/max/min all the values in the n-th column. I usually do this in awk via command-line, but I've grown tired of re-writing the same one line scripts over and over and I'm wondering if there is a known package or solution for this.
For example, consider the text file (test.txt):
apples joe 4
oranges bill 3
apples sally 2
I can query this as:
awk '{ val += $3 } END { print "sum: "val }' test.txt
Also, I may want a where clause:
awk '{ if ($1 == "apples") { val += $3 } } END { print "sum: "val }' test.txt
Or a group by:
awk '{ val[$1] += $3 } END { for(k in val) { print k": "val[k] } }' test.txt
What I would rather do is:
query 'sum($3)' test.txt
query 'sum($3) where $1 = "apples"' test.txt
query 'sum($3) group by $1' test.txt
@Wintermute posted a link to a great tool for this in the comments. Unfortunately it does have one drawback:
$ time gawk '{ a += $6 } END { print a }' my1GBfile.tsv
28371787287
real 0m2.276s
user 0m1.909s
sys 0m0.313s
$ time q -t 'select sum(c6) from my1GBfile.tsv'
28371787287
real 3m32.361s
user 3m27.078s
sys 0m1.983s
It also loads the entire file into memory; that will obviously be necessary in some cases, but it doesn't work for me as I often work with large files.
Wintermute's answer: Tools like q that can run SQL queries directly on CSVs.
Ed Morton's answer: see https://stackoverflow.com/a/15765479/1745001
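For completeness, a purely hypothetical sketch of a tiny wrapper in the spirit of the desired query syntax above (qsum and its arguments are made-up names; the optional condition is passed straight through to awk):
qsum() {  # usage: qsum <column-number> <file> [awk-condition]
  local col=$1 file=$2 cond=${3:-1}
  awk -v c="$col" "$cond { s += \$c } END { print \"sum: \" s }" "$file"
}
qsum 3 test.txt                   # sum: 9
qsum 3 test.txt '$1 == "apples"'  # sum: 6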
I wrote a quick shell script to emulate the situation of xkcd #981 (without hard links, just symlinks to parent dirs) and used a recursive function to create all the directories. Unfortunately this script does not provide the desired result, so I think my understanding of the scope of variable $count is wrong.
How can I properly make the function use recursion to create twenty levels of folders, each containing 3 folders (3^20 folders, ending in soft links back to the top)?
#!/bin/bash
echo "Generating folders:"
toplevel=$PWD
count=1
GEN_DIRS() {
for i in 1 2 3
do
dirname=$RANDOM
mkdir $dirname
cd $dirname
count=$(expr $count + 1)
if [ $count < 20 ] ; then
GEN_DIRS
else
ln -s $toplevel "./$dirname"
fi
done
}
GEN_DIRS
exit
Try this (amended version of the script) — it seems to work for me. I decline to test to 20 levels deep, though; at 8 levels deep, each of the three top-level directories occupies some 50 MB on a Mac file system.
#!/bin/bash
echo "Generating folders:"
toplevel=$PWD
GEN_DIRS()
{
cur=${1:?}
max=${2:?}
for i in 1 2 3
do
dirname=$RANDOM
if [ $cur -le $max ]
then
(
echo "Directory: $PWD/$dirname"
mkdir $dirname
cd $dirname
GEN_DIRS $((cur+1)) $max
)
else
echo "Symlink: $PWD/$dirname"
ln -s $toplevel "./$dirname"
fi
done
}
GEN_DIRS 1 ${1:-4}
The cur=${1:?} and max=${2:?} lines give names to the positional parameters ($1 and $2) passed to the function; the ${1:?} notation simply means that if you omit to pass a parameter $1, you get an error message from the shell (or sub-shell) and it exits.
The parentheses on their own around the mkdir/cd/recursive call mean that the commands in between are run in a sub-shell, so changes in directory inside the sub-shell do not affect the parent shell.
The level test [ $cur -le $max ] uses an arithmetic comparison (-le) instead of <. Note that an unquoted < inside [ ... ] is actually parsed by the shell as a redirection, and even as a string comparison (inside [[ ... ]]) it would be lexicographic, so level 9 would not be less than level 10. Using -le also means that the [ command is OK to use instead of the [[ command (although [[ would also work, I prefer the old-fashioned notation).
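A minimal demonstration of the sub-shell pattern, separate from the script (paths here are illustrative only):
cd /tmp
( mkdir -p demo && cd demo && pwd )   # prints /tmp/demo
pwd                                   # still /tmp: the cd did not escape the ( ... )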
I ended up creating a script like this:
#!/bin/bash
echo "Generating folders:"
toplevel=$PWD
level=0
maxlevel=4
function generate_dirs {
pushd "$1" >/dev/null || return
(( ++level ))
for i in 1 2 3; do
dirname=$RANDOM
if (( level < maxlevel )); then
echo "$PWD/$dirname"
mkdir "$dirname" && generate_dirs "$dirname"
else
echo "$PWD/$dirname (link to top)"
ln -sf "$toplevel" "$dirname"
fi
done
popd >/dev/null
(( --level ))
}
generate_dirs .
exit
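Hypothetical usage, assuming the script above is saved as generate_dirs.sh inside a disposable scratch directory (both names are made up):
mkdir -p /tmp/xkcd981 && cd /tmp/xkcd981
bash generate_dirs.sh    # prints each directory / link as it is created
find . -type l | head    # the deepest entries are symlinks back to the top level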