I have a MySQL database with seven columns (chr, pos, num, iA, iB, iC, iD) and a file that contains 40 million lines, each containing one dataset. Each line has four tab-delimited columns; the first three columns always contain data, and the fourth column can contain up to four different key=value pairs separated by semicolons:
chr pos num info
1 10203 3 iA=0.34;iB=nerv;iC=45;iD=dskf12586
1 10203 4 iA=0.44;iC=45;iD=dsf12586;iB=nerv
1 10203 5
1 10213 1 iB=nerv;iC=49;iA=0.14;iD=dskf12586
1 10213 2 iA=0.34;iB=nerv;iD=cap1486
1 10225 1 iD=dscf12586
The key=value pairs in the info column appear in no particular order. I'm also not sure whether a key can occur twice (I hope not).
I'd like to write the data into the database. The first three columns are no problem, but extracting the values from the info column puzzles me, since the key=value pairs are unordered and not every key is present on every line.
For a similar dataset (with an ordered info column) I used a Java program together with regular expressions, which allowed me to (1) validate and (2) extract the data, but now I'm stranded.
How can I solve this task, preferably with a bash script or directly in MySQL?
You did not mention exactly how you want to write the data, but the awk example below shows how you can get each individual key and value in each line. Instead of the printf, you can plug in your own logic to write the data.
$ cat test.sh; echo "###########"; awk -f test.sh log
{
    if(length($4)) {                   # only process lines that actually have an info column
        split($4,array,";");           # break the info column into key=value pairs
        print "In " $1, $2, $3;
        for(element in array) {
            # everything up to and including "=" is treated as the key, the rest as the value
            key=substr(array[element],1,index(array[element],"="));
            value=substr(array[element],index(array[element],"=")+1);
            printf("found %s key and %s value for %d line from %s\n",key,value,NR,array[element]);
        }
    }
}
###########
In 1 10203 3
found iD= key and dskf12586 value for 1 line from iD=dskf12586
found iA= key and 0.34 value for 1 line from iA=0.34
found iB= key and nerv value for 1 line from iB=nerv
found iC= key and 45 value for 1 line from iC=45
In 1 10203 4
found iB= key and nerv value for 2 line from iB=nerv
found iA= key and 0.44 value for 2 line from iA=0.44
found iC= key and 45 value for 2 line from iC=45
found iD= key and dsf12586 value for 2 line from iD=dsf12586
In 1 10213 1
found iD= key and dskf12586 value for 4 line from iD=dskf12586
found iB= key and nerv value for 4 line from iB=nerv
found iC= key and 49 value for 4 line from iC=49
found iA= key and 0.14 value for 4 line from iA=0.14
In 1 10213 2
found iA= key and 0.34 value for 5 line from iA=0.34
found iB= key and nerv value for 5 line from iB=nerv
found iD= key and cap1486 value for 5 line from iD=cap1486
In 1 10225 1
found iD= key and dscf12586 value for 6 line from iD=dscf12586
Awk solution from @abasu that generates the INSERTs and also handles the unordered key=value pairs.
parse.awk:
NR>1 {                                 # skip the header line
    # Default every info column to SQL null; keys present on the line overwrite this below.
    col["iA"]=col["iB"]=col["iC"]=col["iD"]="null";
    if(length($4)) {
        split($4,array,";");           # break the info column into key=value pairs
        for(element in array) {
            split(array[element],keyval,"=");
            col[keyval[1]] = "'" keyval[2] "'";
        }
    }
    print "INSERT INTO tbl VALUES (" $1 "," $2 "," $3 "," col["iA"] "," col["iB"] "," col["iC"] "," col["iD"] ");";
}
Test/run :
$ awk -f parse.awk file
INSERT INTO tbl VALUES (1,10203,3,'0.34','nerv','45','dskf12586');
INSERT INTO tbl VALUES (1,10203,4,'0.44','nerv','45','dsf12586');
INSERT INTO tbl VALUES (1,10203,5,null,null,null,null);
INSERT INTO tbl VALUES (1,10213,1,'0.14','nerv','49','dskf12586');
INSERT INTO tbl VALUES (1,10213,2,'0.34','nerv',null,'cap1486');
INSERT INTO tbl VALUES (1,10225,1,null,null,null,'dscf12586');
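If you want to load the rows directly rather than keep an SQL file, the generated statements can be piped straight into the mysql command-line client. This is only a sketch: the database name mydb and the user name are placeholders.
$ awk -f parse.awk file | mysql -u username -p mydb
For 40 million rows it may also be worth batching the statements inside transactions, but the pipe above keeps the pure awk approach.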
How do I set the precision of a double value in a collection in the same way as a double column when exporting to a CSV file with COPY TO?
DOUBLEPRECISION works for the column (d below), but not for the key in the map.
CREATE TABLE test.digits (
name text,
d double,
digits map<double, int>,
PRIMARY KEY ((name))
);
INSERT INTO test.digits (name, d, digits) VALUES ('Fred', 0.1234567890123456789,
{
1.1234567890123456789 : 1,
2.1234567890123456789 : 2,
3.1234567890123456789 : 3
}
);
SELECT * from test.digits;
name | d | digits
------+----------+--------------------------------------
Fred | 0.123457 | {1.12346: 1, 2.12346: 2, 3.12346: 3}
COPY test.digits (name, d, digits) TO 'digits.csv' WITH header=true AND DELIMITER='|' AND NULL='' AND DOUBLEPRECISION=15;
cat digits.csv
name|d|digits
Fred|0.1234567890123457|{1.12346: 1, 2.12346: 2, 3.12346: 3}
If it's not possible to set this precision in a collection, is this a bug or a feature?
Try also adding a FLOATPRECISION option:
COPY my_keyspace.digits (name, d, digits) TO 'digits.csv'
WITH header=true AND DELIMITER='|' AND NULL='' AND DOUBLEPRECISION=15
AND FLOATPRECISION = 16;
➜ cat digits.csv
name|d|digits
Fred|0.1234567890123457|{1.1234567890123457: 1, 2.1234567890123457: 2, 3.1234567890123457: 3}
The cqlsh utility displays 5 digits of precision by default.
To enable more digits for the select, add or create a ~/.cassandra/cqlshrc file with:
[ui]
float_precision = 10
double_precision = 15
However, this will not resolve the COPY TO precision issue, which seems like a bug in the collections export, as implied by @fg78nc.
awk -F, 'NR > 0{print "SET", "\"calc_"NR"\"", "\""$0"\"" }' files/calc.csv | unix2dos | redis-cli --pipe
I use the above command to import a CSV file into a Redis database as the string datatype. Something like:
set calc_1 product,cost,quantity
set calc_2 t1,100,5
How do I import it as the hash datatype in awk, with the key name based on the row count, the field names taken from the column headers, and the values from the column values? Something like:
HMSET calc:1 product "t1" cost 100 quantity 5
HMSET calc:2 product "t2" cost 500 quantity 4
Example input file:
product cost quantity
t1 100 5
t2 500 4
t3 600 9
Can I get this result from awk? For each row present in the CSV file:
HMSET calc_<row no> <1st row 1st column> <current row 1st column> <1st row 2nd column> <current row 2nd column> <1st row 3rd column> <current row 3rd column>
So for the above example:
HMSET calc_1 product t1 cost 100 quantity 5
HMSET calc_2 product t2 cost 500 quantity 4
HMSET calc_3 product t3 cost 600 quantity 9
for all the rows dynamically?
You can use the following awk command:
awk '{if(NR==1){col1=$1; col2=$2; col3=$3}else{product[NR]=$1;cost[NR]=$2;quantity[NR]=$3;tmp=NR}}END{printf "[("col1","col2","col3"),"; for(i=2; i<=tmp;i++){printf "("product[i]","cost[i]","quantity[i]")";}print "]";}' input_file.txt
on your input file:
product cost quantity
t1 100 5
t2 500 4
t3 600 9
it gives the following output:
[(product,cost,quantity),(t1,100,5)(t2,500,4)(t3,600,9)]
The same awk program, expanded:
# gawk profile, created Fri Dec 29 15:12:39 2017
# Rule(s)
{
if (NR == 1) { # 1
col1 = $1
col2 = $2
col3 = $3
} else {
product[NR] = $1
cost[NR] = $2
quantity[NR] = $3
tmp = NR
}
}
# END rule(s)
END {
printf "[(" col1 "," col2 "," col3 "),"
for (i = 2; i <= tmp; i++) {
printf "(" product[i] "," cost[i] "," quantity[i] ")"
}
print "]"
}
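The output above is a flat list of tuples rather than Redis commands. If the goal is the HMSET form asked for in the question, the header row can be cached on the first line and replayed for every data row. This is only a sketch, assuming whitespace-separated input as in the example file and reusing the unix2dos | redis-cli --pipe pipeline from the original SET command:
awk 'NR == 1 { for (i = 1; i <= NF; i++) header[i] = $i; next }  # remember the column headers
{
    cmd = "HMSET calc_" (NR - 1)                                 # one hash per data row
    for (i = 1; i <= NF; i++) cmd = cmd " " header[i] " " $i     # append field/value pairs
    print cmd
}' input_file.txt | unix2dos | redis-cli --pipe
This prints HMSET calc_1 product t1 cost 100 quantity 5 and so on for every data row.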
Instead of having one column for each group of values, I made one column named "data" and used HTML like this:
<dt>Phone:</dt><dd>0 23 16/3 82 73 42 23</dd>
<dt>Phone:</dt><dd>0 21 61/81 26 73 13 22</dd>
<dt>Fax:</dt><dd>03 27/3 87 42 37 32</dd>
<dt>Website:</dt><dd>www.example.com</dd>
Now I've realized that wasn't very clever, so I made a column for each value. My new column names are "phone", "phone2", "fax" and "website".
I need SQL that, for example, selects everything between the delimiters <dt>Phone:</dt><dd> and </dd>, including the delimiters themselves, inserts this string into the "phone" column, and deletes it from the "data" column.
But I need to select the first string <dt>Phone:</dt><dd>0 23 16/3 82 73 42 23</dd>, not the second <dt>Phone:</dt><dd>0 21 61/81 26 73 13 22</dd>.
Can anybody give me a hint on how to do that?
For selecting the data between <dt>Phone:</dt><dd> and </dd> you can use SUBSTRING_INDEX, like this:
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(data, '</dd>', 1), '<dt>Phone:</dt><dd>', -1) as phone,
SUBSTRING_INDEX(SUBSTRING_INDEX(data, '<dt>Phone:</dt><dd>', -1), '</dd>', 1) as phone2,
SUBSTRING_INDEX(SUBSTRING_INDEX(data, '<dt>Fax:</dt><dd>', -1), '</dd>', 1) as fax,
SUBSTRING_INDEX(SUBSTRING_INDEX(data, '<dt>Website:</dt><dd>', -1), '</dd>', 1) as website
from data_col;
Update:
<dt>Phone:</dt> is not always at the top.
In case there is no specified order of the data in the "data" column, try this one:
SELECT
IF (temp.f_phone > 0, SUBSTR(data, temp.f_phone + LENGTH('<dt>Phone:</dt><dd>'), f_phone_end - temp.f_phone - LENGTH('<dt>Phone:</dt><dd>')), null) as PHONE_1,
IF (temp.s_phone > 0, SUBSTR(data, temp.s_phone + LENGTH('<dt>Phone:</dt><dd>'), s_phone_end - temp.s_phone - LENGTH('<dt>Phone:</dt><dd>')), null) as PHONE_2
from data_col dc
JOIN (
SELECT id, @f_phone := LOCATE('<dt>Phone:</dt><dd>', data) as f_phone,
LOCATE('</dd>', data, @f_phone+1) f_phone_end,
@s_phone := LOCATE('<dt>Phone:</dt><dd>', data, @f_phone+1) as s_phone,
LOCATE('</dd>', data, @s_phone+1) as s_phone_end
from data_col) temp ON temp.id = dc.id;
First find the starting position of each possible element (e.g. "phone", "phone2") and the position of the closing tag </dd>. Then use SUBSTR starting at the element's position plus the length of the delimiter, with length = end_position - start_position - delimiter_length.
I am working with a large CSV file with a lot of rows and columns. I need only the first 5 columns, and only for the rows where the value in column 1 is 1 (column 1 can only have the value 0 or 1).
So far I can print out the first 5 columns, but I can't filter to show only the rows where column 1 equals 1. My .awk file looks like:
BEGIN {FS = ","}
NR!=1 {print $1", " $2", " $3", "$4", "$5}
I have tried things like $1>1, but with no luck; the output is always every row, regardless of whether the first column of each row is a 0 or 1.
Modifying your awk a bit:
BEGIN {FS = ","; OFS = ", "}
$1 == 1 {print $1, $2, $3, $4, $5; n++}
n == 10 {exit}
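For reference, the same filter as a one-liner, without the 10-row cut-off; data.csv is just a placeholder file name:
$ awk -F, -v OFS=', ' '$1 == 1 {print $1, $2, $3, $4, $5}' data.csv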
How to count instances of strings in a tab-separated value (TSV) file?
The TSV file has hundreds of millions of rows, each of which is of the form
foobar1 1 xxx yyy
foobar1 2 xxx yyy
foobar2 2 xxx yyy
foobar2 3 xxx yyy
foobar1 3 xxx zzz
How do I count the instances of each unique integer in the second column over the entire file, and ideally add the count as the fifth value in each row?
foobar1 1 xxx yyy 1
foobar1 2 xxx yyy 2
foobar2 2 xxx yyy 2
foobar2 3 xxx yyy 2
foobar1 3 xxx zzz 2
I prefer a solution using only UNIX command line stream processing programs.
I'm not entirely clear on what you want to do. Do you want to add 0/1 as a fifth column depending on the value of the second column, or do you want the distribution of the values in the second column, totalled over the entire file?
In the first case, use something like awk -F'\t' '{ if($2 == valueToCheck) { c = 1 } else { c = 0 }; print $0 "\t" c }' < file.
In the second case, use something like awk -F'\t' '{ h[$2] += 1 } END { for(val in h) print val ": " h[val] }' < file.
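If the goal is exactly the output shown in the question (the total count of each second-column value appended to every row), a two-pass awk over the same file also works and does not assume the file is sorted. A sketch, where infile is a placeholder name; the first pass tallies column 2, the second pass appends each line's total:
$ awk -F'\t' -v OFS='\t' 'NR == FNR { count[$2]++; next } { print $0, count[$2] }' infile infile
Memory use stays proportional to the number of distinct values in column 2, at the cost of reading the file twice.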
One solution using perl, assuming the values of the second column are sorted; that is, once a value such as 2 is found, all lines with the same value are consecutive. The script keeps lines until it finds a different value in the second column, gets the count, prints them and frees the memory, so it shouldn't cause a problem regardless of how big the input file is:
Content of script.pl:
use warnings;
use strict;

my (%lines, $count);

while ( <> ) {
    ## Remove the trailing '\n'.
    chomp;

    ## Split the line on whitespace.
    my @f = split;

    ## Treat the line as malformed if it doesn't have four fields and skip it.
    next unless @f == 4;

    ## Save lines in a hash until a different value is found in the second column.
    ## The first line is special, because the hash will always be empty.
    ## On the last line avoid reading the next one, otherwise the lines
    ## saved in the hash would be lost.
    ## The hash will only have one key at a time.
    if ( exists $lines{ $f[1] } or $. == 1 ) {
        push @{ $lines{ $f[1] } }, $_;
        ++$count;
        next if ! eof;
    }

    ## At this point, the second field of the file has changed (or this is the last
    ## line), so print the previous lines saved in the hash, remove them and begin
    ## saving lines with the new value.
    ## The value of the second column will be the key of the hash, get it now.
    my ($key) = keys %lines;

    ## Read each line of the hash and print it, appending the count as the
    ## last field.
    while ( @{ $lines{ $key } } ) {
        printf qq[%s\t%d\n], shift @{ $lines{ $key } }, $count;
    }

    ## Clear the hash.
    %lines = ();

    ## Add the current line to the hash, reset the counter and repeat the whole
    ## process until the end of the file.
    push @{ $lines{ $f[1] } }, $_;
    $count = 1;
}
Content of infile:
foobar1 1 xxx yyy
foobar1 2 xxx yyy
foobar2 2 xxx yyy
foobar2 3 xxx yyy
foobar1 3 xxx zzz
Run it like:
perl script.pl infile
With the following output:
foobar1 1 xxx yyy 1
foobar1 2 xxx yyy 2
foobar2 2 xxx yyy 2
foobar2 3 xxx yyy 2
foobar1 3 xxx zzz 2