python csv print first 10 rows only

I am working with a large CSV file with a lot of rows and columns. I need only the first 5 columns, but only when the value in column 1 of a row is 1. (Column 1 can only have the value 0 or 1.)
So far I can print out the first 5 columns but can't filter to show only the rows where column 1 equals 1. My .awk file looks like:
BEGIN {FS = ","}
NR!=1 {print $1", " $2", " $3", "$4", "$5}
I have tried things like $1>1, but with no luck; the output is always every row, regardless of whether the first column of each row is a 0 or 1.

Modifying your awk a bit:
BEGIN {FS = ","; OFS = ", "}
$1 == 1 {print $1, $2, $3, $4, $5; n++}
n == 10 {exit}
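The $1 == 1 pattern prints only the rows whose first field is 1, n counts the printed rows, and n == 10 {exit} stops after the first 10 matches (matching the title of the question). Assuming the program is saved as filter.awk and the data is in data.csv (both placeholder names), it could be run as:
awk -f filter.awk data.csv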

AWK: statistics operations of multi-column CSV data

With the aim of performing some statistical analysis of multi-column data, I am analyzing a big number of CSV files using the following bash + AWK routine:
#!/bin/bash
home="$PWD"
# folder with the outputs
rescore="${home}"/rescore
# folder with the folders to analyse
storage="${home}"/results
#cd "${home}"/results
cd "${storage}"
csv_pattern='*_filt.csv'
while read -r d; do
    awk -v rescore="$rescore" '
    FNR==1 {
        if (n)
            mean[suffix] = s/n
        prefix=suffix=FILENAME
        sub(/_.*/, "", prefix)
        sub(/\/[^\/]+$/, "", suffix)
        sub(/^.*_/, "", suffix)
        s=n=0
    }
    FNR > 1 {
        s += $3
        ++n
    }
    END {
        out = rescore "/" prefix ".csv"
        mean[suffix] = s/n
        print prefix ":", "dG(mean)" > out
        for (i in mean)
            printf "%s: %.2f\n", i, mean[i] >> out
        close(out)
    }' "${d}_"*/${csv_pattern} #> "${rescore}/"${d%%_*}".csv"
done < <(find . -maxdepth 1 -type d -name '*_*_*' | awk -F '[_/]' '!seen[$2]++ {print $2}')
Basically, the script takes an ensemble of CSV files belonging to the same prefix (defined as the naming pattern at the beginning of the directory containing the CSV, for example 10V1 from 10V1_cne_lig1) and calculates for it the mean value of the numbers in the third column:
# input *_filt.csv located in the folder 10V1_cne_lig1001
ID, POP, dG
1, 142, -5.6500
2, 10, -5.5000
3, 2, -4.9500
and adds one line to 10V1.csv, which is organized in a two-column format: i) the suffix of the folder containing the initial CSV; ii) the mean value calculated over all the numbers in the third column (dG) of the input CSV:
# this is two column format of output.csv: 10V1.csv
10V1: dG(mean)
lig1001: -5.37
In this way, for 100 CSV files such an output.csv should contain 100 lines with the mean values, etc.
I need to introduce a small modification to the AWK part of my routine that would add a 3rd column to the output CSV with the RMSD value (as a measure of the spread of the initial dG values used to calculate the MEAN value). Using AWK syntax, for a particular MEAN value the RMSD could be expressed as:
mean=$(awk -F , '{sum+=$3}END{printf "%.2f", sum/NR}' $csv)
rmsd=$(awk -v mean=$mean '{++n;sum+=($NF-mean)^2} END{if(n) printf "%.2f", sqrt(sum/n)}' $csv)
Here is the expected output for 5 means and 5 RMSD values calculated for 5 CSV logs (the first one corresponds to my example above):
10V1: dG(mean): RMSD (error)
lig1001 -5.37 0.30
lig1002 -8.53 0.34
lig1003 -6.57 0.25
lig1004 -9.53 0.00 # rmsd=0 since initial csv has only 1 line: no data variance
lig1005 -8.11 0.39
How could this addition be incorporated into my main bash-AWK code so that the third RMSD column (for each of the processed CSVs, using each of the calculated MEANs) is added to the output.csv?
You can calculate both the mean and the rmsd within the awk code.
Would you please try the following awk code:
awk -v rescore="$rescore" '
FNR==1 {
    if (n) {                    # calculate the results of the previous file
        m = s / n               # mean
        var = s2 / n - m * m    # variance
        if (var < 0) var = 0    # avoid an exception due to round-off error
        mean[suffix] = m        # store the mean in an array
        rmsd[suffix] = sqrt(var)
    }
    prefix=suffix=FILENAME
    sub(/_.*/, "", prefix)
    sub(/\/[^\/]+$/, "", suffix)
    sub(/^.*_/, "", suffix)
    s = 0                       # sum of $3
    s2 = 0                      # sum of $3 ** 2
    n = 0                       # count of samples
}
FNR > 1 {
    s += $3
    s2 += $3 * $3
    ++n
}
END {
    if (n) {                    # guard against division by zero on empty input
        m = s / n
        var = s2 / n - m * m
        if (var < 0) var = 0
        mean[suffix] = m
        rmsd[suffix] = sqrt(var)
    }
    out = rescore "/" prefix ".csv"
    print prefix ":", "dG(mean)", "dG(rmsd)" > out
    for (i in mean)
        printf "%s: %.2f %.2f\n", i, mean[i], rmsd[i] >> out
    close(out)
}'
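For reference, the identity used here is the one-pass variance formula: with s the sum of the values and s2 the sum of their squares, the variance is s2/n - (s/n)^2 and rmsd = sqrt(variance). A minimal standalone check against the sample file above, assuming it is saved as sample.csv (a placeholder name):
awk -F, 'FNR > 1 { s += $3; s2 += $3 * $3; ++n }
END { m = s / n; var = s2 / n - m * m; if (var < 0) var = 0
      printf "mean=%.2f rmsd=%.2f\n", m, sqrt(var) }' sample.csv
This prints mean=-5.37 rmsd=0.30, matching the lig1001 line of the expected output.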
Here is the version to print the lowest value of dG.
awk -v rescore="$rescore" '
FNR==1 {
    if (n) {                    # calculate the results of the previous file
        m = s / n               # mean
        var = s2 / n - m * m    # variance
        if (var < 0) var = 0    # avoid an exception due to round-off error
        mean[suffix] = m        # store the mean in an array
        rmsd[suffix] = sqrt(var)
        lowest[suffix] = min
    }
    prefix=suffix=FILENAME
    sub(/_.*/, "", prefix)
    sub(/\/[^\/]+$/, "", suffix)
    sub(/^.*_/, "", suffix)
    s = 0                       # sum of $3
    s2 = 0                      # sum of $3 ** 2
    n = 0                       # count of samples
    min = 0                     # lowest value of $3
}
FNR > 1 {
    s += $3
    s2 += $3 * $3
    ++n
    if ($3 < min) min = $3      # update the lowest value
}
END {
    if (n) {                    # just to avoid division by zero
        m = s / n
        var = s2 / n - m * m
        if (var < 0) var = 0
        mean[suffix] = m
        rmsd[suffix] = sqrt(var)
        lowest[suffix] = min
    }
    out = rescore "/" prefix ".csv"
    print prefix ":", "dG(mean)", "dG(rmsd)", "dG(lowest)" > out
    for (i in mean)
        printf "%s: %.2f %.2f %.2f\n", i, mean[i], rmsd[i], lowest[i] > out
}' file_*.csv
I've assumed all dG values are negative. If there is any chance a value is greater than zero, change the line min = 0, which initializes the variable, to a considerably larger value (10,000 or whatever).
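Alternatively, the sentinel can be avoided by seeding min from the first data row of each file; a minimal sketch, replacing the update line inside the FNR > 1 block:
if (n == 1 || $3 < min) min = $3   # n was just incremented, so n == 1 marks the first data row of a file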
Please apply your modifications regarding the filenames, if needed.
The suggestions by Ed Morton are also included although the results will be the same.

import csv to redis with hash datatype

awk -F, 'NR > 0{print "SET", "\"calc_"NR"\"", "\""$0"\"" }' files/calc.csv | unix2dos | redis-cli --pipe
I use the above command to import a CSV file into a Redis database with the string datatype. Something like:
set cal_1 product,cost,quantity
set cal_2 t1,100,5
How do I import it as the hash datatype in awk, with the row count in the key, the column headers as the hash field names, and the column values as the hash values?
HMSET calc:1 product "t1" cost 100 quantity 5
HMSET calc:2 product "t2" cost 500 quantity 4
Input file Example:
product cost quantity
t1 100 5
t2 500 4
t3 600 9
Can I get this result from awk?
For each row present in the CSV file:
HMSET calc_'row no' (1st row 1st column value) (current row 1st column value) (1st row 2nd column value) (current row 2nd column value) (1st row 3rd column value) (current row 3rd column value)
so for the above example,
HMSET calc_1 product t1 cost 100 quantity 5
HMSET calc_2 product t2 cost 500 quantity 4
HMSET calc_3 product t3 cost 600 quantity 9
for all the rows dynamically?
You can use the following awk command:
awk '{if(NR==1){col1=$1; col2=$2; col3=$3}else{product[NR]=$1;cost[NR]=$2;quantity[NR]=$3;tmp=NR}}END{printf "[("col1","col2","col3"),"; for(i=2; i<=tmp;i++){printf "("product[i]","cost[i]","quantity[i]")";}print "]";}' input_file.txt
on your input file:
product cost quantity
t1 100 5
t2 500 4
t3 600 9
it gives the following output:
[(product,cost,quantity),(t1,100,5)(t2,500,4)(t3,600,9)]
awk commands:
# gawk profile, created Fri Dec 29 15:12:39 2017
# Rule(s)
{
    if (NR == 1) {       # 1
        col1 = $1
        col2 = $2
        col3 = $3
    } else {
        product[NR] = $1
        cost[NR] = $2
        quantity[NR] = $3
        tmp = NR
    }
}

# END rule(s)
END {
    printf "[(" col1 "," col2 "," col3 "),"
    for (i = 2; i <= tmp; i++) {
        printf "(" product[i] "," cost[i] "," quantity[i] ")"
    }
    print "]"
}

awk delete field if other column match

I have a CSV file that looks like this:
col1,col2,col3,col4
1,2,a,01/01
2,3,b,01/02
3,4,5,c,01/03
2,5,6,c,01/03
The last 2 rows have been appended to the file, but they have an extra (third) column. I want to delete the third column from those last 2 rows (i.e. where column 4 == "c" and column 5 == "01/03").
The output I want is the file with the third column removed from the last 2 rows, so that every row has only 4 columns:
col1,col2,col3,col4
1,2,a,01/01
2,3,b,01/02
3,4,c,01/03
2,5,c,01/03
If it can be done in vim, that would be good too.
Here's a slightly different approach that avoids having to type the list of columns to be included:
awk -F, 'BEGIN {OFS=FS} NF==5 {for(i=3;i<=NF;i++){$i=$(i+1)}; NF--} 1'
This shifts every field from the 4th position one place to the left and then drops the now-empty last field; note that decrementing NF rebuilds the record in gawk and most modern awks, though POSIX leaves that behavior undefined.
The solution with an explicit listing of columns can also be written more compactly as follows:
awk -F, 'BEGIN {OFS=FS} NF == 5 {print $1, $2, $4, $5; next} 1'
This should do it
awk -F, 'BEGIN {OFS=","} {if (NF == 5) {print $1, $2, $4, $5} else {print}}' filename
$ awk 'BEGIN{FS=OFS=","} {print $1,$2,$(NF-1),$NF}' file
col1,col2,col3,col4
1,2,a,01/01
2,3,b,01/02
3,4,c,01/03
2,5,c,01/03

Convert MySql column with multiple values to proper table

I have a dataset in the form of a CSV file that is sent to me on a regular basis. I want to import this data into my MySQL database and turn it into a proper set of tables. The problem I am having is that one of the fields is used to store multiple values. For example, the field stores email addresses. It may have one email address or it may have two, three, four, etc. The field contents would look something like this: "user1@domain.com,user2@domain.com,user3@domain.com".
I need to be able to take the undetermined number of values from each field and then add them into a separate table so that they look like this:
user1@domain.com
user2@domain.com
user3@domain.com
I am not sure how I can do this. Thank you for the help.
Probably the simplest way is a brute force approach of inserting the first email, then the second, and so on:
insert into newtable(email)
select substring_index(substring_index(emails, ',', 1), ',', -1)
from emails
where (length(replace(emails, ',', ',,')) - length(emails)) >= 1;
insert into newtable(email)
select substring_index(substring_index(emails, ',', 2), ',', -1)
from emails
where (length(replace(emails, ',', ',,')) - length(emails)) >= 2;
insert into newtable(email)
select substring_index(substring_index(emails, ',', 3), ',', -1)
from emails
where (length(replace(emails, ',', ',,')) - length(emails)) >= 3;
And so on.
That is, extract the nth element from the list and insert that into the table. The where clause counts the number of commas in the list, which is a proxy for the length of the list: doubling every comma adds exactly one character per comma, so the length difference equals the comma count, and a list with at least n commas has at least n+1 elements.
You need to repeat this up to the maximum number of emails in the list.
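As a concrete illustration of the extraction step, using the sample addresses from the question, the nested substring_index calls pick out the nth element:
select substring_index(substring_index('user1@domain.com,user2@domain.com,user3@domain.com', ',', 2), ',', -1);
-- returns 'user2@domain.com': the inner call keeps the first 2 elements,
-- the outer call keeps the last element of that prefix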
Instead of importing the CSV file directly and then trying to fix the problems in it, I found the best way to attack this was to first pass the CSV through AWK.
AWK outputs three separate CSV files that follow the normal forms. I then import those tables and all is well.
info="`ncftpget -V -c -u myuser -p mypassword ftp://fake.com/data_map.csv`"

echo "$info" | \
awk -F, -v OFS="," 'NR > 1 {
    split($6, keyvalue, ";")
    for (var in keyvalue) {
        gsub(/.*:/, "", keyvalue[var])
        print $1, keyvalue[var]
    }
}' > ~/sqlrw/table1.csv

echo "$info" | \
awk -F, -v OFS="," 'NR > 1 {
    split($6, keyvalue, ";")
    for (var in keyvalue) {
        gsub(/:/, ",", keyvalue[var])
        print keyvalue[var]
    }
}' > ~/sqlrw/table2.csv

sort -u ~/sqlrw/table2.csv -o ~/sqlrw/table2.csv

echo "$info" | \
awk -F, -v OFS="," 'NR > 1 {
    print $1, $2, $3, $4, $5, $7, $8
}' > ~/sqlrw/table3.csv
Maybe using a simple PHP script would do the trick:
<?php
$file = file_get_contents("my_file.csv");
$tmp = explode(";", $file); // iirc lines in csv are terminated by a ;
for ($i=0; $i<count($tmp); $i++)
{
$field = $tmp[$i];
$q = "INSERT INTO my_table (emails) VALUES (`$field`)";
// or use $i as an id if don't have an autoincrement
$q = "INSERT INTO my_table (id, emails) VALUES ($i, `$field`)";
// execute query ....
}
?>
Hope this helps even if it's not pure SQL .....

Awk: How to cut similar part of 2 fields and then get the difference of remaining part?

Let's say I have 2 fields displaying epoch time in microseconds:
1318044415123456,1318044415990056
What I wanted to do is:
Cut the common part from both fields: "1318044415"
Get the difference of the remaining parts: 990056 - 123456 = 866600
Why am I doing this? Because awk uses IEEE 754 floating point rather than 64-bit integers, and I need the difference between the epoch times of 2 events in microseconds.
Thanks for any help!
EDIT:
Finally I found the largest number awk could handle on Snow Leopard 10.6.8: 9007199254740992 (that is 2^53, beyond which a double can no longer represent every integer exactly).
Try this: echo '9007199254740992' | awk -F ',' '{print $1 + 0}'
The version of awk was 20070501 (reported by awk --version)
Here is an awk script that meets your requirements:
BEGIN {
    FS = ","
}
{
    s1 = $1
    s2 = $2
    # strip the common leading digits from both fields
    while (length(s1) > 1 && substr(s1, 1, 1) == substr(s2, 1, 1))
    {
        s1 = substr(s1, 2)
        s2 = substr(s2, 2)
    }
    # the remaining suffixes are small enough to subtract exactly
    n1 = s1 + 0
    n2 = s2 + 0
    print n2 - n1
}
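A quick check with the sample pair from the question, assuming the script is saved as diff.awk (a placeholder name):
echo '1318044415123456,1318044415990056' | awk -f diff.awk
This prints 866600, the expected difference.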