Extract specific fields from text file - mysql

I have a csv file with over 5k fields/columns with header names. I would like to import only some specific fields into my database.
I am using LOAD DATA LOCAL INFILE for other, smaller files that need to be imported:
LOAD DATA
LOCAL INFILE 'C:/wamp/www/imports/new_export.csv'
INTO TABLE table1
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
(colour,shape,size);
Assigning dummy variables for the columns to skip might be cumbersome. Also, I would prefer to reference fields by their headers, to future-proof the import in case the file gains additional fields.
I am considering using awk on the file before loading it into the database, but the examples I have found while searching don't seem to work.
Any suggestions on best approach for this would be appreciated.

This is similar to MvG's answer, but it doesn't require gawk 4 and thus uses -F as suggested in that answer. It also shows a technique for listing the desired fields and iterating over the list. This may make the code easier to maintain if there is a large list.
#!/usr/bin/awk -f
BEGIN {
    col_list = "colour shape size"  # continue with as many columns as desired for output
    num_cols = split(col_list, cols)
    FS = OFS = ","
}
NR == 1 {
    for (i = 1; i <= NF; i++) {
        p[$i] = i                   # remember column number for each header name
    }
    # next # enable this line to suppress headers.
}
{
    delim = ""
    for (i = 1; i <= num_cols; i++) {
        printf "%s%s", delim, $p[cols[i]]
        delim = OFS
    }
    printf "\n"
}

Does your actual data have any commas? If not, you might be best served using cut:
cut -d, -f1,2,5,8-12
will select those fields, splitting each line at the commas. If any of your double-quote-enclosed text fields contains a comma, things will break, as cut knows nothing about the quotes.
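If you go the cut route, one way to look up the field numbers from the header line is a quick sketch like this (assuming the headers sit on the first line of new_export.csv):
head -1 new_export.csv | tr ',' '\n' | nl | grep -E 'colour|shape|size'
nl prints each header next to its position, which is the number to hand to cut -f.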

Here is a full-featured solution which can deal with all kinds of quotes and commas in the values of the csv table, and can extract columns by name. It requires gawk and is based on the FPAT feature suggested in this answer.
BEGIN {
    # Allow simple values, quoted values and even doubled quotes
    FPAT = "\"[^\"]*(\"\"[^\"]*)*\"|[^,]*"
}
NR == 1 {
    for (i = 1; i <= NF; i++) {
        p[$i] = i # remember column for name
    }
    # next # enable this line to suppress headers.
}
{
    print $p["colour"] "," $p["shape"] "," $p["size"]
}
Write this to a file, to be invoked by gawk -f file.awk.
As the column-splitting and index-by-header features are largely orthogonal, you could use part of this script with non-GNU awk to select columns by name, using a simple -F, instead of FPAT.
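A minimal sketch of that non-GNU variant, assuming none of the values contain quoted commas:
awk -F, 'NR == 1 { for (i = 1; i <= NF; i++) p[$i] = i }
         { print $p["colour"], $p["shape"], $p["size"] }' OFS=, new_export.csv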

Awk 4.1.4 Error when processing large file

I am using Awk 4.1.4 on CentOS 7.6 (x86_64) with 250 GB RAM to transform a row-wide csv file into a column-wide csv based on the last column (Sample_Key). Here is a small example of the row-wide csv:
Probe_Key,Ind_Beta,Sample_Key
1,0.6277,7417
2,0.9431,7417
3,0.9633,7417
4,0.8827,7417
5,0.9761,7417
6,0.1799,7417
7,0.9191,7417
8,0.8257,7417
9,0.9111,7417
1,0.6253,7387
2,0.9495,7387
3,0.5551,7387
4,0.8913,7387
5,0.6197,7387
6,0.7188,7387
7,0.8282,7387
8,0.9157,7387
9,0.9336,7387
This is what the correct output looks like for the small csv example above:
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
Here is the awk code (based on https://unix.stackexchange.com/questions/522046/how-to-convert-a-3-column-csv-file-into-a-table-or-matrix) to achieve the row-to-column-wide transformation:
BEGIN {
    printf "Probe_Key,ind_beta,Sample_Key\n";
}
NR > 1 {
    ks[$3 $1] = $2; # save the second column using the first and third as index
    k1[$1]++;       # save the first column
    k2[$3]++;       # save the third column
}
END {
    # After processing input
    for (i in k2) # loop over third column
    {
        printf "%s,", i; # print it as first value in the row
        for (j in k1) # loop over the first column (index)
        {
            if ( j < length(k1) )
            {
                printf "%s,", ks[i j]; # and print values ks[third_col first_col]
            }
            else
                printf "%s", ks[i j]; # print last value
        }
        print ""; # newline
    }
}
However, when I input a relatively large row-wide csv file (5 GB in size), I get tons of values without any commas in the output, then values start to appear with commas, then values without commas again, and this keeps going. Here is a small excerpt from the portion without commas:
0.04510.03580.81470.57690.8020.89630.90950.10880.66560.92240.05060.78130.86910.07330.03080.0590.06440.80520.05410.91280.16010.19420.08960.0380.95010.7950.92760.9410.95710.2830.90790.94530.69330.62260.90520.1070.95480.93220.01450.93390.92410.94810.87380.86920.9460.93480.87140.84660.33930.81880.94740.71890.11840.05050.93760.94920.06190.89280.69670.03790.8930.84330.9330.9610.61760.04640.09120.15520.91850.76760.94840.61340.02310.07530.93660.86150.79790.05090.95130.14380.06840.95690.04510.75220.03150.88550.82920.11520.11710.5710.94340.50750.02590.97250.94760.91720.37340.93580.84730.81410.95510.93080.31450.06140.81670.04140.95020.73390.87250.93680.20240.05810.93660.80870.04480.8430.33120.88170.92670.92050.71290.01860.93260.02940.91820
and when I use the largest row-wide csv file (126 GB in size), I get the following error:
ERROR (EXIT CODE 255) Unknow error code
How do I debug these two situations, given that the code works for small input sizes?
Instead of trying to hold all 5 GB (or 126 GB) of data in memory at once and printing everything out together at the end, here's an approach using sort and GNU datamash to group each set of values together as they come through its input:
$ datamash --header-in -t, -g3 collapse 2 < input.csv | sort -t, -k1,1n
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
This assumes your file is already grouped, with all identical third-column values together in blocks and the first/second columns already sorted in the appropriate order, like your sample input. If that's not the case, use the slower:
$ tail -n +2 input.csv | sort -t, -k3,3n -k1,1n | datamash -t, -g3 collapse 2
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
If you can get rid of that header line so sort can be passed the file directly instead of in a pipe, it might be able to pick a more efficient sorting method knowing the full size in advance.
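For example, a sketch that works from a temporary header-less copy of the file (input_noheader.csv is just an illustrative name):
$ tail -n +2 input.csv > input_noheader.csv
$ sort -t, -k3,3n -k1,1n input_noheader.csv | datamash -t, -g3 collapse 2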
If your data is already grouped by field 3 and sorted by field 1, you can simply do:
$ awk -F, 'NR==1 {next}
           {if (p != $3) {if (p) print v; v = $3 FS $2; p = $3}
            else v = v FS $2}
           END {print v}' file
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
If not, pre-sorting is a better idea than caching all the data in memory, which will blow up for large input files. A sketch of that combination is shown below.
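This reuses the sort invocation from the datamash answer above; since tail strips the header, the NR==1 check is no longer needed:
$ tail -n +2 file | sort -t, -k3,3n -k1,1n |
  awk -F, '{if (p != $3) {if (p) print v; v = $3 FS $2; p = $3}
            else v = v FS $2}
           END {print v}'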

Pasting Text Vertically Into .CSV File

I have this awk script to write a text file into a specific cell of a .csv file, but I am trying to have the text displayed vertically, not horizontally.
nawk -v r=2 -v c=3 '
BEGIN { FS = OFS = "," }
FNR == NR {
    val = sprintf("%s%s%s", val, NR > 1 ? " " : "", $0)
    next
}
FNR == r {
    $c = val
}
1' file new_one.csv
Want
the
text
like
this
Can't you do something like:
val = sprintf("%s%s%s", val, NR > 1 ? "\n" : "", $0)
Assuming a csv input file like:
a,b,c
d,e,f
g,h,i
k,l,m
and the data you want vertically like:
This file has words horizontally
here's one way to modify your script:
NR==FNR {gsub(" ","\n"); val="\""$0"\""; next}
This is going to replace all the single spaces in $0 with newlines. Then the whole line is assigned to val, but wrapped in double quotes per wikipedia's csv page.
Running this with the data files I created, from the command line (using slightly different syntax than yours for FS/OFS):
awk -F"," -v r=2 -v c=3 'NR==FNR{gsub(" ","\n"); val="\""$0"\""; next} FNR==r {$c=val} 1' OFS="," vert data
a,b,c
d,e,"This
file
has
words
horizontally"
g,h,i
k,l,m
where vert is the name of the vertical data and data is the name of the csv data file. Notice that f at [2,3] has been replaced with the altered input from the vert file.
Be aware that the row/column indexing you've chosen only works if none of the fields in the data file have internal commas in them and that awk isn't going to be your best friend for parsing csv files in general.

Remove Rows From CSV Where A Specific Column Matches An Input File

I have a CSV that contains multiple columns and rows [File1.csv].
I have another CSV file (just one column) that lists specific words [File2.csv].
I want to be able to remove rows from File1 if any column matches any of the words listed in File2.
I originally used this:
grep -v -F -f File2.csv File1.csv > File3.csv
This worked, to a certain extent. The issue I ran into was with columns that had more than one word in them (e.g. word1,word2,word3). File2 contained word2, but that row was not deleted.
I tried spreading the words apart to look like this: (word1 , word2 , word3), but the original command did not work.
How can I remove a row that contains a word from File2, even when the row has other words in it?
One way using awk.
Content of script.awk:
BEGIN {
    ## Split line with a double quote surrounded by spaces.
    FS = "[ ]*\"[ ]*"
}

## File with words, save them in a hash.
FNR == NR {
    words[ $2 ] = 1;
    next;
}

## File with multiple columns.
FNR < NR {
    ## Omit line if eighth field has no interesting value or is first line of
    ## the file (header).
    if ( $8 == "N/A" || FNR == 1 ) {
        print $0
        next
    }

    ## Split interesting field with commas. Traverse it searching for a
    ## word saved from the first file. Print line only if not found.
    ## Change due to an error pointed out in comments.
    ##--> split( $8, array, /[ ]*,[ ]*/ )
    ##--> for ( i = 1; i <= length( array ); i++ ) {
    len = split( $8, array, /[ ]*,[ ]*/ )
    for ( i = 1; i <= len; i++ ) {
    ## END change.
        if ( array[ i ] in words ) {
            found = 1
            break
        }
    }

    if ( ! found ) {
        print $0
    }
    found = 0
}
Assuming File1.csv and File2.csv have the content provided in the comments of Thor's answer (I suggest adding that information to the question), run the script like:
awk -f script.awk File2.csv File1.csv
With the following output:
"DNSName","IP","OS","CVE","Name","Risk"
"ex.example.com","1.2.3.4","Linux","N/A","HTTP 1.1 Protocol Detected","Information"
"ex.example.com","1.2.3.4","Linux","CVE-2011-3048","LibPNG Memory Corruption Vulnerability (20120329) - RHEL5","High"
"ex.example.com","1.2.3.4","Linux","CVE-2012-2141","Net-SNMP Denial of Service (Zero-Day) - RHEL5","Medium"
"ex.example.com","1.2.3.4","Linux","N/A","Web Application index.php?s=-badrow Detected","High"
"ex.example.com","1.2.3.4","Linux","CVE-1999-0662","Apache HTTPD Server Version Out Of Date","High"
"ex.example.com","1.2.3.4","Linux","CVE-1999-0662","PHP Unsupported Version Detected","High"
"ex.example.com","1.2.3.4","Linux","N/A","HBSS Common Management Agent - UNIX/Linux","High"
You could split lines containing multiple patterns in File2.csv into separate lines.
The command below uses tr to convert lines containing word1,word2 into separate lines before using them as patterns. The <(...) construct temporarily acts as a file/fifo (tested in bash):
grep -v -F -f <(tr ',' '\n' < File2.csv) File1.csv > File3.csv
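If you are also worried about a short word accidentally matching inside a longer one elsewhere on the line, grep's -w option limits the fixed strings to whole-word matches; a possible refinement, not something the question strictly requires:
grep -v -F -w -f <(tr ',' '\n' < File2.csv) File1.csv > File3.csv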

JSON to fixed width file

I have to extract data from a JSON file depending on a specific key. The data then has to be filtered (based on the key value) and separated into different fixed-width flat files. I have to develop a solution using shell scripting.
Since the data is just key:value pairs, I can extract them by processing each line in the JSON file, checking the type and writing the values to the corresponding fixed-width file.
My problem is that the input JSON file is approximately 5 GB in size. My method is very basic, and I would like to know if there is a better way to achieve this using shell scripting.
A sample JSON file would look like this:
{"Type":"Mail","id":"101","Subject":"How are you ?","Attachment":"true"}
{"Type":"Chat","id":"12ABD","Mode:Online"}
The above is a sample of the kind of data I need to process.
Give this a try:
#!/usr/bin/awk -f
{
    line = ""
    gsub("[{}\x22]", "", $0)
    f = split($0, a, "[:,]")
    for (i = 1; i <= f; i++)
        if (a[i] == "Type")
            file = a[++i]
        else
            line = line sprintf("%-15s", a[i])
    print line > (file ".fixed.out")
}
I made assumptions based on the sample data provided. There is a lot based on those assumptions that may need to be changed if the data varies much from what you've shown. In particular, this script will not work properly if the data values or field names contain colons, commas, quotes or braces. If this is a problem, it's one of the primary reasons that a proper JSON parser should be used. If it were my assignment, I'd push back hard on this point to get permission to use the proper tools.
This outputs lines that have type "Mail" to a file named "Mail.fixed.out" and type "Chat" to "Chat.fixed.out", etc.
The "Type" field name and field value ("Mail", etc.) are not output as part of the contents. This can be changed.
Otherwise, both the field names and values are output. This can be changed.
The field widths are all fixed at 15 characters, padded with spaces, with no delimiters. The field width can be changed, etc.
Let me know how close this comes to what you're looking for and I can make some adjustments.
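For example, if the script above were saved as split_by_type.awk (a hypothetical name), it could be run as:
awk -f split_by_type.awk input.json
which would leave Mail.fixed.out, Chat.fixed.out, and so on in the current directory.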
perl script
#!/usr/bin/perl -w
use strict;
use warnings;
no strict 'refs'; # for FileCache
use FileCache;    # avoid exceeding system's maximum number of file descriptors
use JSON;

my $type;
my $json = JSON->new->utf8(1); # NOTE: expect utf-8 strings

while (my $line = <>) {        # for each input line
    # extract type
    eval { $type = $json->decode($line)->{Type} };
    $type = 'json_decode_error' if $@;
    $type ||= 'missing_type';

    # print to the appropriate file
    my $fh = cacheout '>>', "$type.out";
    print $fh $line; # NOTE: use cache if there are too many hdd seeks
}
corresponding shell script
#!/bin/bash
# NOTE: bash is used to create non-ascii filenames correctly

__extract_type()
{
    perl -MJSON -e 'print from_json(shift)->{Type}' "$1"
}

__process_input()
{
    local IFS=$'\n'
    while read line; do # for each input line
        # extract type
        local type="$(__extract_type "$line" 2>/dev/null ||
                      echo json_decode_error)"
        [ -z "$type" ] && local type=missing_type

        # print to the appropriate file
        echo "$line" >> "$type.out"
    done
}

__process_input
Example:
$ ./script-name < input_file
$ ls -1 *.out
json_decode_error.out
Mail.out

How to remove empty tables from a MySQL backup file

I have multiple large MySQL backup files all from different DBs and having different schemas. I want to load the backups into our EDW but I don't want to load the empty tables.
Right now I'm cutting out the empty tables using AWK on the backup files, but I'm wondering if there's a better way to do this.
If anyone is interested, this is my AWK script:
EDIT: I noticed today that this script has some problems; please beware if you actually want to try to use it. Your output may be WRONG... I will post my changes as I make them.
# File: remove_empty_tables.awk
# Copyright (c) Northwestern University, 2010
# http://edw.northwestern.edu
/^--$/ {
    i = 0;
    line[++i] = $0; getline
    if ($0 ~ /-- Definition/) {
        inserts = 0;
        while ($0 !~ / ALTER TABLE .* ENABLE KEYS /) {
            # If we already have an insert:
            if (inserts > 0)
                print
            else {
                # If we found an INSERT statement, the table is NOT empty:
                if ($0 ~ /^INSERT /) {
                    ++inserts
                    # Dump the lines before the INSERT and then the INSERT:
                    for (j = 1; j <= i; ++j) print line[j]
                    i = 0
                    print $0
                }
                # Otherwise we may yet find an insert, so save the line:
                else line[++i] = $0
            }
            getline # go to the next line
        }
        line[++i] = $0; getline
        line[++i] = $0; getline
        if (inserts > 0) {
            for (j = 1; j <= i; ++j) print line[j]
            print $0
        }
        next
    } else {
        print "--"
    }
}
{
    print
}
I can't think of any option in mysqldump that would skip the empty tables in your backup. Maybe the --where option, but I'm not sure you can do something generic with it. IMHO, post-processing in a second script is not that bad.
Using regex and perl one-liners. They work by matching the comment header + whitespace + the start of the next header. The first is for ordered dumps and the second is for non-ordered dumps.
perl -0777 -pi -e 's/--\s*-- Dumping data for table \`\w+\`\s*--\s*-- ORDER BY\: [^\n]+\s+(?=--)//g' "dump.sql"
perl -0777 -pi -e 's/--\s*-- Dumping data for table \`\w+\`\s*--\n(?!--)\s*(?=--)//g' "dump.sql"