Interpolating Columnar Data - csv

I am trying to find a way to interpolate between two lines of data in a CSV file, likely using awk. Each line represents a data point; right now I have one at Hour 0 and one at Hour 6, and I want to fill in the missing hourly data between them.
Current CSV
lat,lon,fhr
33.90000,-76.50000,0
34.20000,-77.00000,6
Expected Interpolated Output
lat,lon,fhr
33.90000,-76.50000,0
33.95000,-76.58333,1
34.00000,-76.66667,2
34.05000,-76.75000,3
34.10000,-76.83333,4
34.15000,-76.91667,5
34.20000,-77.00000,6

Here is an awk file that should achieve this:
# initialize lastTime, also used as a flag to show that the 1st data line has been read
BEGIN { lastTime = -100 }
# match data lines
/^[0-9]/ {
    if (lastTime == -100) {
        # this is the first data line, print it
        print
    } else {
        if ($3 == lastTime + 1) {
            # increment of 1 hour, no need to interpolate
            print
        } else {
            # increment other than 1 hour, interpolate linearly
            # (printf keeps five decimal places, matching the input format)
            for (i = 1; i < $3 - lastTime; i++) {
                printf "%.5f,%.5f,%d\n", lastLat + ($1 - lastLat) * (i / ($3 - lastTime)), lastLon + ($2 - lastLon) * (i / ($3 - lastTime)), lastTime + i
            }
            print
        }
    }
    # save the current values for the next line
    lastTime = $3
    lastLon = $2
    lastLat = $1
}
/lat/ {
    # this is the header line, just print it
    print
}
Run it as:
awk -F, -f test.awk test.csv
This assumes your third column has integer values.

Related

Trying to read from specific fields of a CSV file

The code provided reads a CSV file and prints the count of all strings found, in descending order. However, I would like to know how to specify which fields should be read and counted. For example,
./example-awk.awk 1,2 file.csv would read strings from fields 1 and 2 and print the counts.
#!/bin/awk -f
BEGIN {
    FIELDS = ARGV[1];
    delete ARGV[1];
    FS = ", *"
}
{
    for(i = 1; i <= NF; i++)
        if(FNR != 1)
            data[++data_index] = $i
}
END {
    produce_numbers(data)
    PROCINFO["sorted_in"] = "#val_num_desc"
    for(i in freq)
        printf "%s\t%d\n", i, freq[i]
}
function produce_numbers(sortedarray)
{
    n = asort(sortedarray)
    for(i = 1; i <= n; i++)
    {
        freq[sortedarray[i]]++
    }
    return
}
This is the code I am currently working with; ARGV[1] will of course be the specified fields. I am unsure how to go about storing this value so I can use it.
For example, ./example-awk.awk 1,2 simple.csv with simple.csv containing
A,B,C,A
B,D,C,A
C,D,A,B
D,C,A,A
Should result in
D 3
C 2
B 2
A 1
Because it only counts strings in fields 1 and 2
EDIT (as per OP's request): The OP needs a solution that uses ARGV, so here is one (NOTE: cat script.awk is shown only to display the contents of the actual awk script).
cat script.awk
BEGIN{
    FS=","
    OFS="\t"
    for(i=1;i<(ARGC-1);i++){
        arr[ARGV[i]]
        delete ARGV[i]
    }
}
{
    for(i in arr){ value[$i]++ }
}
END{
    PROCINFO["sorted_in"] = "#ind_str_desc"
    for(j in value){
        print j,value[j]
    }
}
Now when we run it as follows:
awk -f script.awk 1 2 Input_file
D 3
C 2
B 2
A 1
My original solution: could you please try the following, written and tested with the shown samples. It is a generic solution in which the awk program has a variable named fields where you can list all the field numbers you want to deal with, separated by commas.
awk -v fields="1,2" '
BEGIN{
    FS=","
    OFS="\t"
    num=split(fields,arr,",")
    for(i=1;i<=num;i++){
        key[arr[i]]
    }
}
{
    for(i in key){
        value[$i]++
    }
}
END{
    for(i in value){
        print i,value[i]
    }
}' Input_file | sort -rk1
Output will be as follows.
D 3
C 2
B 2
A 1
Don't use a shebang to invoke awk in a shell script as that robs you of the ability to use the shell and awk separately for what they both do best. Use the shebang to invoke your shell and then call awk within the script. You also don't need to use gawk-only sorting functions for this:
$ cat tst.sh
#!/usr/bin/env bash
(( $# == 2 )) || { echo "bad args: $0 $*" >&2; exit 1; }
cols=$1
shift
awk -v cols="$cols" '
BEGIN {
    FS = ","
    OFS = "\t"
    split(cols,tmp)
    for (i in tmp) {
        fldNrs[tmp[i]]
    }
}
{
    for (fldNr in fldNrs) {
        val = $fldNr
        cnt[val]++
    }
}
END {
    for (val in cnt) {
        print val, cnt[val]
    }
}
' "${@:--}" |
sort -r
$ ./tst.sh 1,2 file
D 3
C 2
B 2
A 1
I decided to give it a go in the spirit of the OP's attempt, since kids don't learn if kids don't play. I tried ARGIND manipulation (it doesn't work), delete ARGV[], and some other things that also didn't work:
$ gawk '
BEGIN {
    FS=","
    OFS="\t"
    split(ARGV[1],t,/,/)   # field list picked from ARGV
    for(i in t)            # from vals to index
        h[t[i]]
    delete ARGV[1]         # ARGIND manipulation doesn't work
}
{
    for(i in h)            # subset of fields processed
        a[$i]++            # count hits
}
END {
    PROCINFO["sorted_in"]="#val_num_desc"   # ordering from OP's attempt
    for(i in a)
        print i,a[i]
}' 1,2 file
Output
D 3
B 2
C 2
A 1
You could as well drop the ARGV[] manipulation and replace the BEGIN block with:
$ gawk -v var=1,2 '
BEGIN {
    FS=","
    OFS="\t"
    split(var,t,/,/)   # field list picked from a var
    for(i in t)        # from vals to index
        h[t[i]]
} ...

Replacing a few sensitive characters in fields with XXX-masked fields in UNIX

I have a table which has been exported to a file in UNIX, with data in CSV format, e.g.:
File 1:
ACCT_NUM,EXPIRY_DT,FIRST_NAME,LAST_NAME
123456,09-09-2019,Prisi,Kumar
Now I need to mask ACCT_NUM and FIRST_NAME and replace the masked values in File 1; the output should look something like this:
File 2:
ACCT_NUM,EXPIRY_DT,FIRST_NAME,LAST_NAME
123XXX,09-09-2019,PRXXX,Kumar
I have separate masking functions for numerical and string fields; I need to know how to replace the masked columns in the original file.
I'm not sure what you want to do with FNR and what the point of assigning to array a should be. This is how I would do it:
$ cat x.awk
#!/bin/sh
awk -F, -vOFS=, '                  # Set input and output field separators.
NR == 1 {                          # First record?
    print                          # Just output.
    next                           # Then continue with next line.
}
NR > 1 {                           # Second and subsequent records?
    if (length($1) < 4) {          # Short account number?
        $1 = "XXX"                 # Replace the whole number.
    } else {
        sub(/...$/, "XXX", $1)     # Change last three characters.
    }
    if (length($3) < 4) {          # Short first name?
        $3 = "XXX"                 # Replace the whole name.
    } else {
        sub(/...$/, "XXX", $3)     # Change last three characters.
    }
    print                          # Output the changed line.
}'
Showtime!
$ cat input
ACCT_NUM,EXPIRY_DT,FIRST_NAME,LAST_NAME
123456,09-09-2019,Prisi,Kumar
123,29-12-2017,Jim,Kirk
$ ./x.awk < input
ACCT_NUM,EXPIRY_DT,FIRST_NAME,LAST_NAME
123XXX,09-09-2019,PrXXX,Kumar
XXX,29-12-2017,XXX,Kirk
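Since the question mentions already having separate masking functions for numerical and string fields, here is a minimal sketch of how such functions could be wired into the same structure; mask_num and mask_str are hypothetical placeholders for whatever masking rules you actually use:
awk -F, -vOFS=, '
function mask_num(v) {     # placeholder numeric masking rule
    return length(v) < 4 ? "XXX" : substr(v, 1, length(v) - 3) "XXX"
}
function mask_str(v) {     # placeholder string masking rule
    return length(v) < 4 ? "XXX" : substr(v, 1, length(v) - 3) "XXX"
}
NR == 1 { print; next }    # pass the header through untouched
{
    $1 = mask_num($1)      # ACCT_NUM
    $3 = mask_str($3)      # FIRST_NAME
    print
}' input > masked.csv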

Separating output records in AWK without a trailing separator

I have the following records:
31 Stockholm
42 Talin
34 Helsinki
24 Moscow
15 Tokyo
And I want to convert it to JSON with AWK. Using this code:
#!/usr/bin/awk
BEGIN {
    print "{";
    FS=" ";
    ORS=",\n";
    OFS=":";
};
{
    if ( !a[city]++ && NR > 1 ) {
        key = $2;
        value = $1;
        print "\"" key "\"", value;
    }
};
END {
    ORS="\n";
    OFS=" ";
    print "\b\b}";
};
Gives me this:
{
"Stockholm":31,
"Talin":42,
"Helsinki":34,
"Moscow":24,
"Tokyo":15, <--- I don't want this comma
}
The problem is the trailing comma on the last data line; it makes the JSON output invalid. How can I get this output instead:
{
"Stockholm":31,
"Talin":42,
"Helsinki":34,
"Moscow":24,
"Tokyo":15
}
Mind some feedback on your posted script?
#!/usr/bin/awk    # Just be aware that on Solaris this will be old, broken awk which you must never use
BEGIN {
    print "{";        # On this and every other line, the trailing semi-colon is a pointless null-statement, remove all of these.
    FS=" ";           # This is setting FS to the value it already has, so remove it.
    ORS=",\n";
    OFS=":";
};
{
    if ( !a[city]++ && NR > 1 ) {    # awk consists of <condition>{<action>} segments so move this condition out to the condition part;
                                     # also, you never populate a variable named "city" so `!a[city]++` won't behave sensibly.
        key = $2;
        value = $1;
        print "\"" key "\"", value;
    }
};
END {
    ORS="\n";         # no need to set ORS and OFS when the script will no longer use them.
    OFS=" ";
    print "\b\b}";    # why would you want to print a backspace???
};
so your original script should have been written as:
#!/usr/bin/awk
BEGIN {
    print "{"
    ORS=",\n"
    OFS=":"
}
!a[city]++ && (NR > 1) {
    key = $2
    value = $1
    print "\"" key "\"", value
}
END {
    print "}"
}
Here's how I'd really write a script to convert your posted input to your posted output though:
$ cat file
31 Stockholm
42 Talin
34 Helsinki
24 Moscow
15 Tokyo
$
$ awk 'BEGIN{print "{"} {printf "%s\"%s\":%s",sep,$2,$1; sep=",\n"} END{print "\n}"}' file
{
"Stockholm":31,
"Talin":42,
"Helsinki":34,
"Moscow":24,
"Tokyo":15
}
You have a couple of choices. An easy one would be to add the comma of the previous line as you are about to write out a new line (a short sketch follows after these steps):
Set a variable first = 1 in your BEGIN.
When about to print a line, check first. If it is 1, then just set it to 0. If it is 0, print out a comma and a newline:
if (first) { first = 0; } else { print ","; }
The point of this is to avoid putting an extra comma at the start of the list.
Use printf("%s", ...) instead of print ... so that you can avoid the newline when printing a record.
Add an extra newline before the close brace, as in: print "\n}";
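Putting those steps together, a minimal sketch of this approach (assuming the same input as above, in a file called file) could look like:
awk '
BEGIN { print "{"; first = 1 }
{
    if (first) { first = 0 } else { printf ",\n" }   # separator goes before every record after the first
    printf "\"%s\":%s", $2, $1                       # print the record itself without a newline
}
END { print "\n}" }
' file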
Also, note that if you don't care about the aesthetics, JSON doesn't really require newlines between items, etc. You could just output one big line for the whole enchilada.
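For example, a compact single-line variant (an untested sketch along the same lines) could be:
awk '{ printf "%s\"%s\":%s", (NR > 1 ? "," : "{"), $2, $1 } END { print "}" }' file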
You should really use a proper JSON tool, but here is how to do it with awk:
BEGIN {
    print "{"
}
NR==1 {
    s = "\""$2"\":"$1
    next
}
{
    s = s",\n\""$2"\":"$1
}
END {
    printf "%s\n%s",s,"}"
}
Outputs:
{
"Stockholm":31,
"Talin":42,
"Helsinki":34,
"Moscow":24,
"Tokyo":15
}
Why not use a JSON library? Don't force awk to do something it wasn't designed to do. Here is a solution using Python:
import json

d = {}
with open("file") as f:
    for line in f:
        (val, key) = line.split()
        d[key] = int(val)
print(json.dumps(d, indent=0, sort_keys=True))
This outputs:
{
"Helsinki": 34,
"Moscow": 24,
"Stockholm": 31,
"Talin": 42,
"Tokyo": 15
}

Remove Rows From CSV Where A Specific Column Matches An Input File

I have a CSV that contains multiple columns and rows [File1.csv].
I have another CSV file (just one column) that lists specific words [File2.csv].
I want to be able to remove rows from File1 if any column matches any of the words listed in File2.
I originally used this:
grep -v -F -f File2.csv File1.csv > File3.csv
This worked, to a certain extent. The issue I ran into was with columns that had more than one word in them (e.g. word1,word2,word3). File2 contained word2, but that row was not deleted.
I tried spreading the words apart to look like this: (word1 , word2 , word3), but the original command did not work.
How can I remove a row that contains a word from File2 and may have other words in it?
One way using awk.
Content of script.awk:
BEGIN {
    ## Split line with a double quote surrounded with spaces.
    FS = "[ ]*\"[ ]*"
}

## File with words, save them in a hash.
FNR == NR {
    words[ $2 ] = 1;
    next;
}

## File with multiple columns.
FNR < NR {
    ## Print the line as-is (and skip the checks below) if the eighth field has
    ## no interesting value or this is the first line of the file (header).
    if ( $8 == "N/A" || FNR == 1 ) {
        print $0
        next
    }

    ## Split the field of interest on commas. Traverse it searching for a
    ## word saved from the first file. Print the line only if none is found.
    ## Change due to an error pointed out in comments.
    ##--> split( $8, array, /[ ]*,[ ]*/ )
    ##--> for ( i = 1; i <= length( array ); i++ ) {
    len = split( $8, array, /[ ]*,[ ]*/ )
    for ( i = 1; i <= len; i++ ) {
    ## END change.
        if ( array[ i ] in words ) {
            found = 1
            break
        }
    }
    if ( ! found ) {
        print $0
    }
    found = 0
}
Assuming File1.csv and File2.csv have the content provided in the comments of Thor's answer (I suggest adding that information to the question), run the script like:
awk -f script.awk File2.csv File1.csv
With the following output:
"DNSName","IP","OS","CVE","Name","Risk"
"ex.example.com","1.2.3.4","Linux","N/A","HTTP 1.1 Protocol Detected","Information"
"ex.example.com","1.2.3.4","Linux","CVE-2011-3048","LibPNG Memory Corruption Vulnerability (20120329) - RHEL5","High"
"ex.example.com","1.2.3.4","Linux","CVE-2012-2141","Net-SNMP Denial of Service (Zero-Day) - RHEL5","Medium"
"ex.example.com","1.2.3.4","Linux","N/A","Web Application index.php?s=-badrow Detected","High"
"ex.example.com","1.2.3.4","Linux","CVE-1999-0662","Apache HTTPD Server Version Out Of Date","High"
"ex.example.com","1.2.3.4","Linux","CVE-1999-0662","PHP Unsupported Version Detected","High"
"ex.example.com","1.2.3.4","Linux","N/A","HBSS Common Management Agent - UNIX/Linux","High"
You could split up lines in File2.csv that contain multiple patterns.
The command below uses tr to convert lines containing word1,word2 into separate lines before using them as patterns. The <() construct temporarily acts as a file/fifo (tested in bash):
grep -v -F -f <(tr ',' '\n' < File2.csv) File1.csv > File3.csv

How to remove empty tables from a MySQL backup file

I have multiple large MySQL backup files, all from different DBs and with different schemas. I want to load the backups into our EDW, but I don't want to load the empty tables.
Right now I'm cutting out the empty tables using AWK on the backup files, but I'm wondering if there's a better way to do this.
If anyone is interested, this is my AWK script:
EDIT: I noticed today that this script has some problems; please beware if you actually want to try to use it. Your output may be WRONG... I will post my changes as I make them.
# File: remove_empty_tables.awk
# Copyright (c) Northwestern University, 2010
# http://edw.northwestern.edu
/^--$/ {
    i = 0;
    line[++i] = $0; getline
    if ($0 ~ /-- Definition/) {
        inserts = 0;
        while ($0 !~ / ALTER TABLE .* ENABLE KEYS /) {
            # If we already have an insert:
            if (inserts > 0)
                print
            else {
                # If we found an INSERT statement, the table is NOT empty:
                if ($0 ~ /^INSERT /) {
                    ++inserts
                    # Dump the lines before the INSERT and then the INSERT:
                    for (j = 1; j <= i; ++j) print line[j]
                    i = 0
                    print $0
                }
                # Otherwise we may yet find an insert, so save the line:
                else line[++i] = $0
            }
            getline # go to the next line
        }
        line[++i] = $0; getline
        line[++i] = $0; getline
        if (inserts > 0) {
            for (j = 1; j <= i; ++j) print line[j]
            print $0
        }
        next
    } else {
        print "--"
    }
}
{
    print
}
I can't think of any option in mysqldump that would skip the empty tables in your backup. Maybe the --where option, but I'm not sure you can do something generic with it. IMHO, a post-treatment in a second script is not that bad.
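If you would rather avoid post-processing the dump at all, one alternative is to ask information_schema which tables actually contain rows and pass only those to mysqldump. This is only a sketch, assuming a single database whose name here is a placeholder; note that TABLE_ROWS is an estimate for InnoDB tables, so a per-table SELECT COUNT(*) would be needed for an exact answer:
# Sketch: dump only the tables that information_schema reports as non-empty.
db="mydb"    # placeholder database name
tables=$(mysql -N -B -e \
  "SELECT table_name FROM information_schema.tables
   WHERE table_schema = '$db' AND table_rows > 0;")
mysqldump "$db" $tables > "${db}_nonempty.sql"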
Using regex and Perl one-liners. They work by matching the comment header plus whitespace up to the start of the next header. The first is for ordered dumps and the second is for non-ordered dumps.
perl -0777 -pi -e 's/--\s*-- Dumping data for table \`\w+\`\s*--\s*-- ORDER BY\: [^\n]+\s+(?=--)//g' "dump.sql"
perl -0777 -pi -e 's/--\s*-- Dumping data for table \`\w+\`\s*--\n(?!--)\s*(?=--)//g' "dump.sql"