I have multiple large MySQL backup files, each from a different DB and with a different schema. I want to load the backups into our EDW, but I don't want to load the empty tables.
Right now I'm cutting out the empty tables using AWK on the backup files, but I'm wondering if there's a better way to do this.
If anyone is interested, this is my AWK script:
EDIT: I noticed today that this script has some problems; beware if you actually want to try to use it. Your output may be WRONG... I will post my changes as I make them.
# File: remove_empty_tables.awk
# Copyright (c) Northwestern University, 2010
# http://edw.northwestern.edu
/^--$/ {
    i = 0;
    line[++i] = $0; getline
    if ($0 ~ /-- Definition/) {
        inserts = 0;
        while ($0 !~ / ALTER TABLE .* ENABLE KEYS /) {
            # If we already have an insert:
            if (inserts > 0)
                print
            else {
                # If we found an INSERT statement, the table is NOT empty:
                if ($0 ~ /^INSERT /) {
                    ++inserts
                    # Dump the lines before the INSERT and then the INSERT:
                    for (j = 1; j <= i; ++j) print line[j]
                    i = 0
                    print $0
                }
                # Otherwise we may yet find an insert, so save the line:
                else line[++i] = $0
            }
            getline # go to the next line
        }
        line[++i] = $0; getline
        line[++i] = $0; getline
        if (inserts > 0) {
            for (j = 1; j <= i; ++j) print line[j]
            print $0
        }
        next
    } else {
        print "--"
    }
}
{
    print
}
I can't think of any option in mysqldump that would skip the empty tables in your backup. Maybe the --where option, but I'm not sure you can do something generic with it. IMHO, post-processing in a second script is not that bad.
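If you would rather skip the post-processing entirely, another possible approach (a rough sketch, not tested against your setup; the database name mydb and the credentials are placeholders) is to ask information_schema which tables actually contain rows and pass only those tables to mysqldump. Note that TABLE_ROWS is only an estimate for InnoDB tables, so use a per-table COUNT(*) if you need it to be exact:
#!/bin/sh
# List tables that appear non-empty (TABLE_ROWS is approximate for InnoDB).
TABLES=$(mysql -N -B -e "SELECT TABLE_NAME FROM information_schema.TABLES WHERE TABLE_SCHEMA = 'mydb' AND TABLE_ROWS > 0;")
# Dump only those tables; word splitting of $TABLES is intentional here.
mysqldump mydb $TABLES > mydb_nonempty.sql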
Using regex and Perl one-liners. They work by matching the comment header plus whitespace up to the start of the next header. The first is for ordered dumps and the second is for non-ordered dumps.
perl -0777 -pi -e 's/--\s*-- Dumping data for table \`\w+\`\s*--\s*-- ORDER BY\: [^\n]+\s+(?=--)//g' "dump.sql"
perl -0777 -pi -e 's/--\s*-- Dumping data for table \`\w+\`\s*--\n(?!--)\s*(?=--)//g' "dump.sql"
I would like to use AWK (Windows) to convert a text file with a single column into multiple columns, with the column count specified in the script or on the command line.
This question has been asked before, but my final data file needs to have the same column count on every line.
Example of input:
L1
L2
L3
L4
L5
L6
L7
Split into 3 columns with ";" as the separator:
L1;L2;L3
L4;L5;L6
L7;; <<< here two empty fields are created at the end, since this line has only one value.
I tried to modify variants of the typical solution given, NR%4 {printf $0",";next} 1;, plus a counter, but could not quite get it right.
I would prefer not to count the lines beforehand, which would mean reading the file more than once.
You may use this awk solution:
awk -v n=3 '{
    sub(/\r$/, "")                       # removes DOS line break, if present
    printf "%s", $0 (NR % n ? ";" : ORS)
}
END {
    # now we need to add empty columns in last record
    if (NR % n) {
        for (i = 1; i < (n - (NR % n)); ++i)
            printf ";"
        print ""
    }
}' file
L1;L2;L3
L4;L5;L6
L7;;
With your shown samples, please try the following code, which uses an xargs + awk combination to achieve the needed outcome.
xargs -n3 < Input_file |
awk -v OFS=";" '{if(NF==1){$0=$0";;"};if(NF==2){$0=$0";"};$1=$1} 1'
For an awk I would do:
awk -v n=3 '
{ printf("%s%s", $0, (NR % n > 0) ? ";" : ORS) }
END {
    if (NR % n) {                  # pad only when the last row is incomplete
        for (i = NR % n; i < n - 1; i++) printf(";")
        printf ORS
    }
}' file
Or, an alternative awk:
awk -v n=3 -v OFS=";" '
{ row = row ? row FS $0 : $0 }                  # accumulate fields for the current row
!(NR % n) { $0 = row; NF = n; print; row = "" } # rebuild $0: setting NF joins the n fields with OFS
END { if (NR % n) { $0 = row; NF = n; print } } # flush a final partial row, padded to n fields
' file
Or you can use ruby if you want more options:
ruby -le '
n=3
puts $<.read.
split($/).
each_slice(n).
map{|sl| sl.fill(sl.size...n) { "" }; sl.join(";") }.
join($\) # with -l, $/ (RS) and $\ (ORS) are set correctly for the platform
' file
Or, realize that paste is designed to do this:
paste -d';' - - - <file
(Use a - for each column desired)
Any of those prints (with n=3):
L1;L2;L3
L4;L5;L6
L7;;
(And work correctly for other values of n...)
I am trying to find a way to interpolate between two lines of data in a CSV file, likely using awk. Right now, the two lines represent data points at Hour 0 and Hour 6. I am looking to fill in the missing hourly data between Hour 0 and Hour 6.
Current CSV
lat,lon,fhr
33.90000,-76.50000,0
34.20000,-77.00000,6
Expected Interpolated Output
lat,lon,fhr
33.90000,-76.50000,0
33.95000,-76.58333,1
34.00000,-76.66667,2
34.05000,-76.75000,3
34.10000,-76.83333,4
34.15000,-76.91667,5
34.20000,-77.00000,6
Here is an awk file that should achieve this:
# initialize lastTime, also used as a flag to show that the 1st data line has been read;
# CONVFMT controls how the interpolated numbers are formatted when concatenated below
BEGIN { lastTime = -100; CONVFMT = "%.5f" }
# match data lines
/^[0-9]/ {
    if (lastTime == -100) {
        # this is the first data line, print it
        print;
    } else {
        if ($3 == lastTime + 1) {
            # increment of 1 hour, no need to interpolate
            print;
        } else {
            # increment other than 1 hour, interpolate
            for (i = 1; i < $3 - lastTime; i = i + 1) {
                print lastLat + ($1 - lastLat) * (i / ($3 - lastTime)) "," lastLon + ($2 - lastLon) * (i / ($3 - lastTime)) "," lastTime + i
            }
            print;
        }
    }
    # save the current values for the next line
    lastTime = $3;
    lastLon = $2;
    lastLat = $1;
}
/lat/ {
    # this is the header line, just print it
    print;
}
Run it as
awk -F, -f test.awk test.csv
I assume your third column has integral values.
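If you want a quick sanity check of the linear-interpolation arithmetic the script performs, here is a purely illustrative one-liner using the sample values from the question (hour 1 of 6):
awk 'BEGIN { printf "%.5f,%.5f,%d\n", 33.9 + (34.2 - 33.9) * 1/6, -76.5 + ((-77.0) - (-76.5)) * 1/6, 1 }'
It prints 33.95000,-76.58333,1, matching the second line of the expected output.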
Below are the files as they should be, and further down, what I have made so far. I think the source of the problem in my code is the delimiters, but I can't get it much better.
My source file uses ; as the delimiter, while the files for my database use , as the separator; also, the strings are enclosed in "":
The category file should be like this:
"1","1","testcategory","testdescription"
And the manufacturers file, like this:
"24","ASUS",NULL,NULL,NULL
"23","ASROCK",NULL,NULL,NULL
"22","ARNOVA",NULL,NULL,NULL
What I have at this moment:
- category file:
1;2;Alarmen en beveiligingen;
2;2;Apparatuur en toebehoren;
3;2;AUDIO;
- manufacturers file:
315;XTREAMER;NULL;NULL;NULL
316;XTREMEMAC;NULL;NULL;NULL
317;Y-CAM;NULL;NULL;NULL
318;ZALMAN;NULL;NULL;NULL
I experimented a bit with sed; first, on the categories file:
cut -d ";" -f1 /home/arno/pixtmp/pixtmp.csv |sort | uniq > /home/arno/pixtmp/categories_description-in.csv
sed 's/^/;2;/g' /home/arno/pixtmp/categories_description-in.csv > /home/arno/pixtmp/categories_description-in.tmp
sed -e "s/$/;/" /home/arno/pixtmp/categories_description-in.tmp > /home/arno/pixtmp/categories_description-in.tmp2
awk 'BEGIN{n=1}{printf("%s%s\n",n++,$0)}' /home/arno/pixtmp/categories_description-in.tmp2 > /home/arno/pixtmp/categories_description$
And then on the manufacturers file:
cut -d ";" -f5 /home/arno/pixtmp/pixtmp.csv |sort | uniq > /home/arno/pixtmp/manufacturers-in
sed 's/^/;/g' /home/arno/pixtmp/manufacturers-in > /home/arno/pixtmp/manufacturers-tmp
sed -e "s/$/;NULL;NULL;NULL/" /home/arno/pixtmp/manufacturers-tmp > /home/arno/pixtmp/manufacturers-tmp2
awk 'BEGIN{n=1}{printf("%s%s\n",n++,$0)}' /home/arno/pixtmp/manufacturers-tmp2 > /home/arno/pixtmp/manufacturers.ok
You were trying to solve the problem by using cut, sed, and AWK. AWK by itself is powerful enough to solve your problem.
I wrote one AWK program that can handle both of your examples. If NULL is not a special case and the manufacturers file is in a different format, you will need to make two AWK programs, but I think it should be clear how to do that.
All we do here is tell AWK that the "field separator" is the semicolon. Then AWK splits the input lines into fields for us. We loop over the fields, printing as we go.
#!/usr/bin/awk -f
BEGIN {
    FS = ";"
    DQUOTE = "\""
}
function add_quotes(s) {
    if (s == "NULL")
        return s
    else
        return DQUOTE s DQUOTE
}
NF > 0 {
    # if input ended with a semicolon, last field will be empty
    if ($NF == "")
        NF -= 1 # subtract one from NF to forget the last field
    if (NF > 0) {
        for (i = 1; i <= NF - 1; ++i)
            printf("%s,", add_quotes($i))
        printf("%s\n", add_quotes($i))
    }
}
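For example (the file name convert.awk is just a placeholder for wherever you save the script), feeding one of your manufacturers lines through it produces the quoting you asked for:
printf '315;XTREAMER;NULL;NULL;NULL\n' | awk -f convert.awk
# prints: "315","XTREAMER",NULL,NULL,NULL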
I have a CSV that contains multiple columns and rows [File1.csv].
I have another CSV file (just one column) that lists specific words [File2.csv].
I want to be able to remove rows from File1 if any column matches any of the words listed in File2.
I originally used this:
grep -v -F -f File2.csv File1.csv > File3.csv
This worked, to a certain extent. The issue I ran into was with columns that had more than one word in them (e.g. word1,word2,word3). File2 contained word2, but that row was not deleted.
I tried spreading the words apart to look like this: (word1 , word2 , word3), but the original command did not work.
How can I remove a row that contains a word from File2 when the column also has other words in it?
One way using awk.
Content of script.awk:
BEGIN {
    ## Split line with a double quote surrounded with spaces.
    FS = "[ ]*\"[ ]*"
}

## File with words, save them in a hash.
FNR == NR {
    words[ $2 ] = 1;
    next;
}

## File with multiple columns.
FNR < NR {
    ## Print the line as-is (skip the check) if the eighth field has no
    ## interesting value or if it is the first line of the file (header).
    if ( $8 == "N/A" || FNR == 1 ) {
        print $0
        next
    }

    ## Split the field of interest on commas. Traverse it searching for a
    ## word saved from the first file. Print the line only if none is found.
    ## Change due to an error pointed out in comments.
    ##--> split( $8, array, /[ ]*,[ ]*/ )
    ##--> for ( i = 1; i <= length( array ); i++ ) {
    len = split( $8, array, /[ ]*,[ ]*/ )
    for ( i = 1; i <= len; i++ ) {
    ## END change.
        if ( array[ i ] in words ) {
            found = 1
            break
        }
    }
    if ( ! found ) {
        print $0
    }
    found = 0
}
Assuming File1.csv and File2.csv have the content provided in the comments of Thor's answer (I suggest adding that information to the question), run the script like:
awk -f script.awk File2.csv File1.csv
With the following output:
"DNSName","IP","OS","CVE","Name","Risk"
"ex.example.com","1.2.3.4","Linux","N/A","HTTP 1.1 Protocol Detected","Information"
"ex.example.com","1.2.3.4","Linux","CVE-2011-3048","LibPNG Memory Corruption Vulnerability (20120329) - RHEL5","High"
"ex.example.com","1.2.3.4","Linux","CVE-2012-2141","Net-SNMP Denial of Service (Zero-Day) - RHEL5","Medium"
"ex.example.com","1.2.3.4","Linux","N/A","Web Application index.php?s=-badrow Detected","High"
"ex.example.com","1.2.3.4","Linux","CVE-1999-0662","Apache HTTPD Server Version Out Of Date","High"
"ex.example.com","1.2.3.4","Linux","CVE-1999-0662","PHP Unsupported Version Detected","High"
"ex.example.com","1.2.3.4","Linux","N/A","HBSS Common Management Agent - UNIX/Linux","High"
You could split lines in File2.csv that contain multiple patterns into separate lines.
The command below uses tr to convert lines like word1,word2 into separate lines before using them as patterns. The <() construct acts as a temporary file/FIFO (tested in bash):
grep -v -F -f <(tr ',' '\n' < File2.csv) File1.csv > File3.csv
I have a csv file with over 5k fields/columns with header names. I would like to import only some specific fields into my database.
I am using LOAD DATA LOCAL INFILE for other, smaller files that need to be imported:
LOAD DATA
LOCAL INFILE 'C:/wamp/www/imports/new_export.csv'
INTO TABLE table1
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
(colour,shape,size);
Assigning dummy variables for the columns to skip would be cumbersome. Also, I would prefer to reference fields by their headers, to future-proof the process in case the file gains additional fields.
I am considering running awk on the file before loading it into the database, but the examples I have found in my searches don't seem to work.
Any suggestions on the best approach for this would be appreciated.
This is similar to MvG's answer, but it doesn't require gawk 4 and thus uses -F as suggested in that answer. It also shows a technique for listing the desired fields and iterating over the list. This may make the code easier to maintain if there is a large list.
#!/usr/bin/awk -f
BEGIN {
    col_list = "colour shape size"   # continuing with as many as desired for output
    num_cols = split(col_list, cols)
    FS = OFS = ","
}
NR == 1 {
    for (i = 1; i <= NF; i++) {
        p[$i] = i   # remember column for name
    }
    # next # enable this line to suppress headers.
}
{
    delim = ""
    for (i = 1; i <= num_cols; i++) {
        printf "%s%s", delim, $p[cols[i]]
        delim = OFS
    }
    printf "\n"
}
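A hedged example of how this could slot into your loading workflow (the file names pick_columns.awk and filtered.csv are placeholders):
# Save the script above as pick_columns.awk, then pre-filter the big export:
awk -f pick_columns.awk new_export.csv > filtered.csv
# filtered.csv now holds only the colour, shape and size columns, so the
# LOAD DATA LOCAL INFILE statement from the question can point at it instead.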
Does your actual data have any commas? If not, you might be best served using cut:
cut -d, -f1,2,5,8-12
will select the listed fields, splitting lines at each comma. If any of your "-enclosed text fields does contain a comma, things will break, as cut knows nothing about the " characters.
Here is a full-featured solution which can deal with all kinds of quotes and commas in the values of the csv table, and can extract columns by name. It requires gawk and is based on the FPAT feature suggested in this answer.
BEGIN {
    # Allow simple values, quoted values and even doubled quotes
    FPAT = "\"[^\"]*(\"\"[^\"]*)*\"|[^,]*"
}
NR == 1 {
    for (i = 1; i <= NF; i++) {
        p[$i] = i   # remember column for name
    }
    # next # enable this line to suppress headers.
}
{
    print $p["colour"] "," $p["shape"] "," $p["size"]
}
Write this to a file, to be invoked by gawk -f file.awk.
As the column splitting and the index-by-header features are kind of orthogonal, you could use part of the script with non-GNU awk to select columns by name, using a simple -F, instead of FPAT.