I have this file which reads
001,Time-in,2017-06-25 08:04:42,08:00:00,
001,Time-out,2017-06-25 17:04:42,17:00:00,
001,Time-in,2017-06-25 18:04:42,18:00:00,
001,Time-out,2017-06-25 22:04:42,22:00:00,
...
where field 1 is the ID number; 2 is the action performed; 3 is the exact timestamp; and 4 is the rounded-off time.
I would like to calculate the total hours per ID based on field 4. I know I can use the formula
((Out2+Out1)-(In2+In1))
or
((Out1-In1)+(Out2-In2))
to get the total hours, but I'm quite stuck as to how I should begin.
I would like to get this output:
001,13
002,12
..,..
..,..
Where field 1 is the ID and 2 will be the total hours computed.
Also, please note that the real file would be jumbled and not sorted like the example above. If any of the required entries are missing (e.g. one time-out missing), it should just print that it skipped that particular ID.
Any thoughts regarding this would be extremely helpful.
Thanks.
$ cat tst.awk
BEGIN { FS="[-, :]" }
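# with FS="[-, :]": $1 = ID, $3 = "in"/"out", $10 = rounded-off hour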
$3=="in" { tin[$1] += $10 }
$3=="out" { tout[$1] += $10 }
END {
    for (key in tin) {
        print key, tout[key] - tin[key]
    }
}
$ awk -f tst.awk file
001 13
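To also honor the requirement of skipping IDs with missing entries, one way to extend the script above is to count the Time-in and Time-out lines per ID and only report IDs where the counts match. The counting and the wording of the "skipped" message below are my own additions, not part of the answer above:
$ cat tst_skip.awk
BEGIN { FS="[-, :]"; OFS="," }
$3=="in"  { tin[$1]  += $10; nin[$1]++  }
$3=="out" { tout[$1] += $10; nout[$1]++ }
END {
    for (key in tin) {
        if (nin[key] == nout[key]) {
            print key, tout[key] - tin[key]
        } else {
            print "skipped " key " (" nin[key] " time-in vs " nout[key] " time-out entries)"
        }
    }
    # note: an ID that only ever appears with Time-out lines would not show up in tin at all
}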
(No error handling or error recovery below.)
I'd probably write a function to return epoch time, given an ISO timestamp. Epoch time makes the arithmetic easy. But it uses the full timestamp, not your rounded values.
function epoch_time(ts) {
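    # turn "YYYY-MM-DD HH:MM:SS" into "YYYY MM DD HH MM SS" for mktime() (a GNU awk function)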
gsub("-", " ", ts)
gsub(":", " ", ts)
return mktime(ts)
}
Assuming we can rely on the format of the input file (a BIG assumption), you can use pretty simple code to select and process each line of the input file.
$2 == "Time-in" {
timein = epoch_time($3)
}
$2 == "Time-out" {
timeout = epoch_time($3)
# Add the result to any existing value for this id number.
# Express in hours.
output[$1] += (((timeout - timein) / 60) / 60)
}
END {
for (key in output) print key, output[key]
}
So the full code would look like this:
# timestamp.awk
#
$2 == "Time-in" {
timein = epoch_time($3)
}
$2 == "Time-out" {
timeout = epoch_time($3)
# Add the result to any existing value for this id number.
# Express in hours.
output[$1] += (((timeout - timein) / 60) / 60)
}
END {
for (key in output) print key, output[key]
}
function epoch_time(ts) {
gsub("-", " ", ts)
gsub(":", " ", ts)
return mktime(ts)
}
. . . and I'd call it like this.
$ awk -F, -f timestamp.awk datafilename
For this data, I get the output shown below.
001,Time-in,2017-06-25 08:04:42,08:00:00,
001,Time-out,2017-06-25 17:04:42,17:00:00,
001,Time-in,2017-06-25 18:04:42,18:00:00,
001,Time-out,2017-06-25 22:04:42,22:00:00,
002,Time-in,2017-06-25 09:04:42,08:00:00,
002,Time-out,2017-06-25 17:04:42,17:00:00,
002,Time-in,2017-06-25 19:04:42,18:00:00,
002,Time-out,2017-06-25 22:04:42,22:00:00,
$ awk -F, -f timestamp.awk datafilename
002 11
001 13
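One caveat worth noting about timestamp.awk: timein is a single scalar, so this version assumes each ID's Time-in line is immediately followed by its Time-out line. For the jumbled input mentioned in the question, you would want to key it by ID instead (for example timein[$1]), or sort the file by ID and timestamp first.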
I have a large csv file with thousands of columns (50,000 rows and 25,000 columns).
I want to:
Obtain the list of all columns with the same values
Remove all columns obtained in step 1
Sample input
F1,F2,F3,F4
0,2,1,4
0,1,1,3
0,3,1,3
0,2,1,3
Sample output
Columns with same values: F1 F3
F2,F4
2,4
1,3
3,3
2,3
I have implemented Python-based solutions which work fine for small files but are too slow for large files (more than 8 GB).
A solution in any programming language (but robust and fast) will be appreciated.
another awk, using a subset of @Luuk's sample input
$ awk -F, 'NR==FNR && NR>1 {if(NR>2) for(i=1;i<=NF;i++) if($i!=p[i]) c[i];
split($0,p); next}
FNR==1 {n=asorti(c,m); max=m[n]}
{for(i in c) printf "%s",$i (i==max?RS:FS)}' file{,}
F2,F4
1,3
2,2
0,1
2,1
2,0
1,2
1,3
The trick is that it's enough to check whether any two consecutive rows differ in a given column.
There is some extra work to find the highest included column index (required for clean printing).
If you have thousands of columns, you can speed up the filtering by not re-checking columns that are already known to differ, i.e. by testing !(i in c) before comparing $i with p[i], as in the snippet below.
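For example, the first rule of the first script could become:
NR==FNR && NR>1 {if(NR>2) for(i=1;i<=NF;i++) if(!(i in c) && $i!=p[i]) c[i];
                 split($0,p); next}
(This only skips the string comparison for already-flagged columns; the overall two-pass structure is unchanged.)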
Perhaps this alternative will be the fastest: it removes a column from the search set as soon as a differing value is seen.
$ awk -F, 'NR==FNR && NR==1 {for(i=1;i<=NF;i++) a[i]; next}
NR==FNR {if(NR>2) for(i in a) if($i!=p[i]) {delete a[i]; c[i]};
split($0,p); next}
FNR==1 {n=asorti(c,m); max=m[n]}
{for(i in c) printf "%s",$i (i==max?RS:FS)}' file{,}
All of these solutions scan the file twice, so they can't be very fast, but I expect them to complete in a couple of minutes. Please post the timings if you can test.
$ cat tst.awk
BEGIN { FS=OFS="," }
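# note: relies on GNU awk for arrays of arrays (vals[i][v]) and length(array)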
NR==FNR {
    if ( NR > 1 ) {
        for (inFldNr=1; inFldNr<=NF; inFldNr++) {
            vals[inFldNr][$inFldNr]
        }
    }
    next
}
FNR==1 {
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        if ( length(vals[inFldNr]) > 1 ) {
            out2in[++numOutFlds] = inFldNr
        }
    }
}
{
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        printf "%s%s", $(out2in[outFldNr]), (outFldNr<numOutFlds ? OFS : ORS)
    }
}
$ awk -f tst.awk file file
F2,F4
2,4
1,3
3,3
2,3
EDIT: tested against a file that's 50,000 lines of 25,000 fields:
$ time awk -f tst.awk file.csv file.csv > out.csv
real 17m42.591s
user 17m29.421s
sys 0m3.858s
The above input file was created by running this script:
$ awk 'BEGIN{OFS=","; x=50000; y=25000; for (i=1;i<=x;i++) for (j=1;j<=y;j++) printf "%s%s", (i>1?substr(rand(),3,1):"F"j), (j<y?OFS:ORS)}' > file.csv
$ wc file.csv
50000 50000 2500113283 file.csv
$ ls -lh file.csv
-rw-r--r-- 1 MyLogin None 2.4G Nov 22 12:45 file.csv
I created a piece of code in .NET 5.0, which has the following output:
(EDIT: oops, i noticed i have compiled it for .NET 3.1 🙄)
WHICH COLUMNS HAVE UNIQUE VALUES
0: True
1: False
2: True
3: False
TIME AFTER CHECKING UNIQUE COLUMNS: 72,9047581 secs
TIME AFTER WRITING NON UNIQUE COLUMNS TO NEW FILE: 221,751107 secs
This is under 4 minutes. My PC (running Windows) might be quicker than yours, but the difference, compared to 1 hour, is too large... 😉
And I created an input file with 4 columns like your sample, like this:
D:\Temp>type random.txt
0,1,1,3
0,2,1,2
0,0,1,1
0,2,1,1
0,2,1,0
0,1,1,2
0,1,1,3
0,0,1,3
0,2,1,4
0,3,1,1
....
My example file has 2756616192 bytes and 344577024 lines.
The output looks like this:
D:\Temp>type randomNEW.txt
1,3
2,2
0,1
2,1
2,0
1,2
1,3
0,3
2,4
3,1
This file has 1722885120 bytes and also 344577024 lines.
The source looks like this (note: I am not very experienced in C#, just learning, so there might be things that can be improved!):
using System;
using System.IO;
namespace ConsoleApp59
{
class Program
{
static void Main(string[] args)
{
DateTime start = DateTime.Now;
bool[] a = new bool[4];
int[] first = new int[4];
int t = 0;
for(int i=0; i<4; i++ ) { a[i] = true; }
foreach (string line in File.ReadLines(@"d:\temp\random.txt"))
{
string[] s = line.Split(",");
int[] b = new int[4]{ Convert.ToInt32(s[0]), Convert.ToInt32(s[1]), Convert.ToInt32(s[2]), Convert.ToInt32(s[3]) };
if (t==0)
{
first = b;
}
else
{
for (int i = 0; i < 4; i++)
{
if (a[i]) a[i] = b[i] == first[i];
}
}
t++;
}
Console.WriteLine("WHICH COLUMNS HAVE UNIQUE VALUES");
for(int i=0; i<4; i++)
{
Console.WriteLine($"{i}: {a[i]}");
}
Console.WriteLine($"TIME AFTER CHECKING UNIQUE COLUMNS: { ((DateTime.Now-start).TotalSeconds)} secs");
StreamWriter n = new StreamWriter(@"D:\temp\randomNEW.txt");
foreach (string line in File.ReadLines(@"d:\temp\random.txt"))
{
string[] s = line.Split(",");
string output = "";
for (int i=0; i<4; i++ )
{
if (!a[i]) output += String.Format($"{s[i]},");
}
if (output.EndsWith(",")) output = output.Substring(0, output.Length - 1);
output += "\r\n";
n.Write(output.ToCharArray());
}
n.Close();
Console.WriteLine($"TIME AFTER WRITING NON UNIQUE COLUMNS TO NEW FILE: { ((DateTime.Now - start).TotalSeconds)} secs");
//Console.ReadLine();
}
}
}
EDIT: I just did a test on a Mac mini (Late 2012; macOS Catalina 10.15.7; 2.3 GHz; 16 GB RAM):
TIME AFTER CHECKING UNIQUE COLUMNS: 104.473532 secs
TIME AFTER WRITING NON UNIQUE COLUMNS TO NEW FILE: 300.272936 secs
I'm trying to develop a simple calculator that bends the rules of math. I want it to ignore the usual math rules and perform from right to left. The user inputs a whole string as a math problem.
For example:
input: 123 - 10 + 4 * 10
Should be solved like this:
123 - 10 + 4 * 10 = 123 - ( 10 + ( 4 * 10 ) ) = 73.
Here is what I currently have:
use strict;
use warnings;
use feature 'say';
while (<>) {                                    # while we get input
    my ($main, @ops) = reverse /[\d+\-*\/]+/g;  # extract the ops
    while (@ops) {                              # while the list is not empty
        $main = calc($main, splice @ops, 0, 2); # take 2 items off the list and process
    }
    say $main;                                  # print result
}
sub calc {
    my %proc = (
        "+" => sub { $_[0] + $_[1] },
        "-" => sub { $_[0] - $_[1] },
        "/" => sub { $_[0] / $_[1] },
        "*" => sub { $_[0] * $_[1] }
    );
    return $proc{$_[1]}($_[0], $_[2]);
}
Here is the output I get:
123 - 10 + 4 * 10 = ((123 - 10) + 4) * 10 = 1170
As you can see - it solves the problem from left to right. My question is - how can I reverse this? I want it to get solved from right to left. Any help will be appreciated, thanks.
This seems to do what you want.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my @tokens = @ARGV ? @ARGV : split /\s+/, '123 - 10 + 4 * 10';
if (grep { !/^[-+*\/%]$/ and /\D/ } @tokens) {
    die "Invalid input: @tokens\n";
}
while (@tokens >= 3) {
    my $expr = join ' ', @tokens[-3 .. -1];
    splice @tokens, -3, 3, eval $expr;
}
say "@tokens";
I split the input into tokens and then process the array of tokens three elements at a time, working from the end. Each time, I replace the final three tokens with the result of evaluating the expression.
I've used eval here instead of the dispatch table calculator that you've used. You always want to be pretty sure of your input when using eval, so I've included a quick and dirty validation step as well.
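One practical note: if you pass the expression as command-line arguments, quote the * (for example 123 - 10 + 4 '*' 10), otherwise the shell will expand it to the file names in the current directory.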
Taking a page from postfix/RPN evaluation strategies, and keeping two stacks, one for operations, one for numbers, makes for a simple implementation:
#!/usr/bin/env perl
use warnings;
use strict;
use feature 'say';
{
    my %proc = (
        "+" => sub { $_[0] + $_[1] },
        "-" => sub { $_[0] - $_[1] },
        "/" => sub { $_[0] / $_[1] },
        "*" => sub { $_[0] * $_[1] }
    );
    sub calc {
        my @tokens   = reverse split ' ', $_[0];
        my @opstack  = grep { m!^[+-/*]$! } @tokens;
        my @numstack = grep { $_ !~ m!^[+-/*]$! } @tokens;
        for my $op (@opstack) {
            splice @numstack, 0, 2, $proc{$op}->(@numstack[1,0]);
        }
        return $numstack[0];
    }
}
say calc("123 - 10 + 4 * 10");
A more robust version would enforce an operator between every pair of numbers and have other error/sanity checking, of course.
Why not use the most amusing parts of Perl?
This works and will return 73 if you enter the given test case:
#!/usr/bin/env perl
use warnings;
use strict;
use feature 'say';
while (<>) {            # while we get input
    chomp; s/ //g;
    1 while s/\d+[+\-*\/]\d+$/$&/ee;
    say;                # print result
}
If you want to understand how it works, just replace the no-op "1" with some STDERR output:
while (<>) {
    chomp; s/ //g;
    print STDERR "eval'd ($&) -> $_\n" while s/\d+[+\-*\/]\d+$/$&/ee;
    say;
}
> ./test.pl
123 - 10 + 4 * 10
eval'd (4*10) -> 123-10+40
eval'd (10+40) -> 123-50
eval'd (123-50) -> 73
73
In a comment to my answer to your previous question, I said you could reverse the calculation by using reverse, and I see you have implemented that code.
As you have noticed, I assume, this is not true, because it would also invert the operations. I.e. 123 - 50 would become 50 - 123. I was a little careless in that comment. You can however achieve the same effect if you just restore the order of the operands in the calc() call with another use of reverse.
$main = calc(reverse($main, splice @ops, 0, 2)); # take 2 items off the list and process
That would mean that your string 123 - 10 + 4 * 10 would first become a list
10 * 4 + 10 - 123
And then it would be called
calc(4, '*', 10) # 40
calc(10, '+', 40) # 50
calc(123, '-', 50) # 73
I have a large database that contains data about embedded devices in the field.
I've built a MySQL query that outputs data in this format, called "/tmp/data.csv":
Device_Serial_Number, Device_Location_1, Device_Location_2, Date_1, Date_2
"3782D822", "Springfield, MA", "123 Maple Street", "2016-05-02 13:43:00", "2016-05-05 03:22:44"
. . .
The output is thousands of lines long. Note that an individual Device_Serial_Number value can appear multiple times, each with a unique set of "Date_1", "Date_2" values.
What I need to do is create separate .csv files for each value of "Device_Location_1". On that report, each unique value of "Device_Serial_Number" has only 1 row, but all values of "Date_1" and "Date_2" associated with that "Device_Serial_Number" on the entire spreadsheet will appear on that same row.
Example:
Device_Serial_Number, Device_Location_1, Device_Location_2, Date_1, Date_2, Date_1, Date_2, Date_1, Date_2
"3782D822", "Springfield, MA", "123 Maple Street", "2016-05-02 13:43:00", "2016-05-05 03:22:44", "2016-05-06 12:45:23", "2016-05-06 14:23:11", "2016-05-17 15:46:21", "2016-05-18 08:09:13"
To do this I'm trying to use AWK within a Bash script. I have used a second MySQL query to get a list of unique device serial numbers, and have saved the results as "/tmp/devList.csv". I am attempting to read through each line of "devList.csv", and append a String variable with date strings that match that device list as found in "data.csv", then assign the Device_Serial_Number as the index on an associative array and the String of dates as the value.
Obviously this isn't working. I feel like this solution is way too complicated. Any help finding a working solution would be greatly appreciated.
awk -F, -v deviceList='/tmp/devList.csv' 'BEGIN { OFS=","; while (getline < deviceList) { device[$0]= "" } }
{
dates = $4 "," $5 ","
holder = devices[$1]
newValue = holder dates
device[$1] = newValue
}
END {
for (i in device)
if (device[$i] != "")
print > "/tmp/test_output.csv"
}' '/tmp/data.csv'
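For reference, here is a minimal sketch of one way this could be done in a single GNU awk pass, without the separate device list. It assumes every data field is double-quoted exactly as in the sample (so the unquoted header line yields no fields and is skipped), and the report_<location>.csv naming and /tmp paths are my own placeholders, not part of the question:
gawk '
BEGIN { FPAT = "\"[^\"]*\"" }           # each "..." is one field, so the comma inside Device_Location_1 is safe
NF >= 5 {
    key = $1                            # Device_Serial_Number (still quoted)
    loc[key]  = $2                      # Device_Location_1, used to pick the output file
    loc2[key] = $3
    dates[key] = (key in dates ? dates[key] ", " : "") $4 ", " $5
}
END {
    for (key in dates) {
        out = loc[key]
        gsub(/[^A-Za-z0-9]/, "_", out)  # sanitize the location for use in a file name
        file = "/tmp/report_" out ".csv"
        print key ", " loc[key] ", " loc2[key] ", " dates[key] >> file
        close(file)                     # append so several serials can share one location file
    }
}' /tmp/data.csv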
I have a file whose content is as below. I have only output two records here, but there are around 1000 records in a single file:
Record type : GR
address : 62.5.196
ID : 1926089329
time : Sun Aug 10 09:53:47 2014
Time zone : + 16200 seconds
address [1] : 61.5.196
PN ID : 412 1
---------- Container #1 (start) -------
inID : 101
---------- Container #1 (end) -------
timerecorded: Sun Aug 10 09:51:47 2014
Uplink data volume : 502838
Downlink data volume : 3133869
Change condition : Record closed
--------------------------------------------------------------------
Record type : GR
address : 61.5.196
ID : 1926089327
time : Sun Aug 10 09:53:47 2014
Time zone : + 16200 seconds
address [1] : 61.5.196
PN ID : 412 1
---------- Container #1 (start) -------
intID : 100
---------- Container #1 (end) -------
timerecorded: Sun Aug 10 09:55:47 2014
Uplink data volume : 502838
Downlink data volume : 3133869
Change condition : Record closed
--------------------------------------------------------------------
Record type : GR
address : 63.5.196
ID : 1926089328
time : Sun Aug 10 09:53:47 2014
Time zone : + 16200 seconds
address [1] : 61.5.196
PN ID : 412 1
---------- Container #1 (start) -------
intID : 100
---------- Container #1 (end) -------
timerecorded: Sun Aug 10 09:55:47 2014
Uplink data volume : 502838
Downlink data volume : 3133869
Change condition : Record closed
My goal is to convert this to a CSV or txt file like below:
Record type| address |ID | time | Time zone| address [1] | PN ID
GR |61.5.196 |1926089329 |Sun Aug 10 09:53:47 2014 |+ 16200 seconds |61.5.196 |412 1
Any guidance on how you think would be the best way to start this would be great. The sample I provided should give a clear idea, but in words: I want to read the header of each record once and put each record's data under the output header.
Thanks for your time and help or suggestions.
What you're doing is creating an Extract/Transform script (the ET part of an ETL). I don't know which language you're intending to use, but essentially any language can be used. Personally, unless this is a massive file, I'd recommend Python as it's easy to grok and easy to write with the included csv module.
First, you need to understand the format thoroughly.
How are records separated?
How are fields separated?
Are there any fields that are optional?
If so, are the optional fields important, or do they need to be discarded?
Unfortunately, this is all headwork: there's no magical code solution to make this easier. Then, once you have figured out the format, you'll want to start writing code. This is essentially a series of data transformations:
Read the file.
Split it into records.
For each record, transform the fields into an appropriate data structure.
Serialize the data structure into the CSV.
If your file is larger than memory, this can become more complicated; instead of reading and then splitting, for example, you may want to read the file sequentially and create a Record object each time the record delimiter is detected. If your file is even larger, you might want to use a language with better multithreading capabilities to handle the transformation in parallel; but those are more advanced than it sounds like you need to go at the moment.
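To make those steps concrete, here is a rough sketch of the same extract/transform idea in awk (kept in awk to stay consistent with the other answers on this page, rather than the Python suggested above). It assumes records end with a line made only of dashes, keeps just the seven header fields from the question, and uses a placeholder input file name, so treat it as a starting point rather than a finished converter:
awk '
BEGIN {
    OFS = "|"
    n = split("Record type,address,ID,time,Time zone,address [1],PN ID", hdr, ",")
    for (i = 1; i <= n; i++) printf "%s%s", hdr[i], (i < n ? OFS : ORS)
}
/^-+$/ {                                    # a line of dashes ends one record
    for (i = 1; i <= n; i++) printf "%s%s", rec[hdr[i]], (i < n ? OFS : ORS)
    split("", rec)                          # clear the record buffer
    next
}
{
    p = index($0, ":")                      # split on the FIRST colon only,
    if (p == 0) next                        # so times like 09:53:47 stay intact
    name = substr($0, 1, p - 1); gsub(/^[ \t]+|[ \t]+$/, "", name)
    val  = substr($0, p + 1);    gsub(/^[ \t]+|[ \t]+$/, "", val)
    rec[name] = val
}
END {                                       # flush the last record if the file does not end with dashes
    if ("Record type" in rec)
        for (i = 1; i <= n; i++) printf "%s%s", rec[hdr[i]], (i < n ? OFS : ORS)
}
' your_data_file.txt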
This is a simple PHP script that will read a text file containing your data and write a csv file with the results. If you are on a system which has command line PHP installed, just save it to a file in some directory, copy your data file next to it renaming it to "your_data_file.txt" and call "php whatever_you_named_the_script.php" on the command line from that directory.
<?php
$text = file_get_contents("your_data_file.txt");
$matches;
preg_match_all("/Record type[\s\v]*:[\s\v]*(.+?)address[\s\v]*:[\s\v]*(.+?)ID[\s\v]*:[\s\v]*(.+?)time[\s\v]*:[\s\v]*(.+?)Time zone[\s\v]*:[\s\v]*(.+?)address \[1\][\s\v]*:[\s\v]*(.+?)PN ID[\s\v]*:[\s\v]*(.+?)/su", $text, $matches, PREG_SET_ORDER);
$csv_file = fopen("your_csv_file.csv", "w");
if($csv_file) {
    if(fputcsv($csv_file, array("Record type","address","ID","time","Time zone","address [1]","PN ID"), "|") === FALSE) {
        echo "could not write headers to csv file\n";
    }
    foreach($matches as $match) {
        $clean_values = array();
        for($i=1;$i<8;$i++) {
            $clean_values[] = trim($match[$i]);
        }
        if(fputcsv($csv_file, $clean_values, "|") === FALSE) {
            echo "could not write data to csv file\n";
        }
    }
    fclose($csv_file);
} else {
    die("could not open csv file\n");
}
This script assumes that your data records are always formatted similar to the examples you have posted and that all values are always present. If the data file may have exceptions to those rules, the script probably has to be adapted accordingly. But it should give you an idea of how this can be done.
Update
Adapted the script to deal with the full format provided in the updated question. The regular expression now matches single data lines (extracting their values) as well as the record separator made up of dashes. The loop has changed a bit and does now fill up a buffer array field by field until a record separator is encountered.
<?php
$text = file_get_contents("your_data_file.txt");
// this will match whole lines
// only if they either start with an alpha-num character
// or are completely made of dashes (record separator)
// it also extracts the values of data lines one by one
$regExp = '/(^\s*[a-zA-Z0-9][^:]*:(.*)$|^-+$)/m';
$matches;
preg_match_all($regExp, $text, $matches, PREG_SET_ORDER);
$csv_file = fopen("your_csv_file.csv", "w");
if($csv_file) {
    // in case the number or order of fields changes, adapt this array as well
    $column_headers = array(
        "Record type",
        "address",
        "ID",
        "time",
        "Time zone",
        "address [1]",
        "PN ID",
        "inID",
        "timerecorded",
        "Uplink data volume",
        "Downlink data volume",
        "Change condition"
    );
    if(fputcsv($csv_file, $column_headers, "|") === FALSE) {
        echo "could not write headers to csv file\n";
    }
    $clean_values = array();
    foreach($matches as $match) {
        // first entry will contain the whole line
        // remove surrounding whitespace
        $whole_line = trim($match[0]);
        if(strpos($whole_line, '-') !== 0) {
            // this match starts with something else than -
            // so it must be a data field, store the extracted value
            $clean_values[] = trim($match[2]);
        } else {
            // this match is a record separator, write csv line and reset buffer
            if(fputcsv($csv_file, $clean_values, "|") === FALSE) {
                echo "could not write data to csv file\n";
            }
            $clean_values = array();
        }
    }
    if(!empty($clean_values)) {
        // there was no record separator at the end of the file
        // write the last entry that is still in the buffer
        if(fputcsv($csv_file, $clean_values, "|") === FALSE) {
            echo "could not write data to csv file\n";
        }
    }
    fclose($csv_file);
} else {
    die("could not open csv file\n");
}
Doing the data extraction using regular expressions is one possible method mostly useful for simple data formats with a clear structure and no surprises. As syrion pointed out in his answer, things can get much more complicated. In that case you might need to write a more sophisticated script than this one.
I have a large .csv file to process and my elements are arranged randomly like this:
xxxxxx,xx,MLOCAL,MREMOTE,33222,56,22/10/2012,18/10/2012
xxxxxx,xx,MREMOTE,MLOCAL,33222,56,22/10/2012,18/10/2012
xxxxxx,xx,MLOCAL,341993,22/10/2012
xxxxxx,xx,MREMOTE,9356828,08/10/2012
xxxxxx,xx,LOCAL,REMOTE,19316,15253,22/10/2012,22/10/2012
xxxxxx,xx,REMOTE,LOCAL,1865871,383666,22/10/2012,22/10/2012
xxxxxx,xx,REMOTE,1180306134,19/10/2012
where the fields LOCAL, REMOTE, MLOCAL or MREMOTE are laid out like this:
when they appear as a pair (LOCAL/REMOTE): if the 3rd field is MLOCAL and the 4th field is MREMOTE, then the 5th and 7th fields are the value and date of MLOCAL, and the 6th and 8th fields are the value and date of MREMOTE
when only one of them appears (only LOCAL or only REMOTE), the 4th and 5th fields are the value and date of field 3.
Now, I have split these rows using:
nawk 'BEGIN{
while (getline < "'"$filedata"'")
split($0,ft,",");
name=ft[1];
ID=ft[2]
?=ft[3]
?=ft[4]
....................
but because I can't find a pattern for the 3rd and 4th fields, I'm pretty stuck on how to keep assigning variable names to the array elements so I can use them for further processing.
Now, I tried to use a "case" statement, but it isn't working in awk or nawk (only in gawk does it work as expected). I also tried this:
if ( ft[3] == "MLOCAL" && ft[4]!= "MREMOTE" )
{
MLOCAL=ft[3];
MLOCAL_qty=ft[4];
MLOCAL_TIMESTAMP=ft[5];
}
else if ( ft[3] == MLOCAL && ft[4] == MREMOTE )
{
MLOCAL=ft[3];
MREMOTE=ft[4];
MOCAL_qty=ft[5];
MREMOTE_qty=ft[6];
MOCAL_TIMESTAMP=ft[7];
MREMOTE_TIMESTAMP=ft[8];
}
else if ( ft[3] == MREMOTE && ft[4] != MOCAL )
{
MREMOTE=ft[3];
MREMOTE_qty=ft[4];
MREMOTE_TIMESTAMP=ft[5];
..........................................
but it's not working either.
So, if you have any idea how to handle this, I would be grateful for a hint on how to find a pattern that covers all the possible situations above.
EDIT
I don't know how to thank you for all this help. Now, what I have to do is more complex than I wrote above; I'll try to describe it as simply as I can, otherwise I'll make you guys pretty confused.
My output should be like following:
NAME,UNIQUE_ID,VOLUME_ALOCATED,MLOCAL_VALUE,MLOCAL_TIMESTMP,MLOCAL_limit,LOCAL_VALUE,LOCAL_TIMESTAMP,LOCAL_limit,MREMOTE_VALUE,MREMOTE_TIMESTAMP,REMOTE_VALUE,REMOTE_TIMESTAMP
(where MLOCAL_limit and LOCAL_limit are the difference between VOLUME_ALLOCATED and MLOCAL_VALUE or LOCAL_VALUE, respectively)
So, in my output file, the field positions should be arranged like this:
4th field = MLOCAL_VALUE, 5th field = MLOCAL_TIMESTMP, 7th field = LOCAL_VALUE,
8th field = LOCAL_TIMESTAMP, 10th field = MREMOTE_VALUE, 11th field = MREMOTE_TIMESTAMP, 12th field = REMOTE_VALUE, 13th field = REMOTE_TIMESTAMP
Now, an example would be this:
for the following input: name,ID,VOLUME_ALLOCATED,MLOCAL,MREMOTE,33222,56,22/10/2012,18/10/2012
name,ID,VOLUME_ALLOCATED,REMOTE,234455,19/12/2012
I should process this line and the output should be this:
name,ID,VOLUME_ALLOCATED,33222,22/10/2012,MLOCAL_LIMIT, ,,,56,18/10/2012,,
The 7th, 8th, 9th, 12th, and 13th fields are empty because there is no info related to: LOCAL_VALUE, LOCAL_TIMESTAMP, LOCAL_limit, REMOTE_VALUE, and REMOTE_TIMESTAMP
OR
name,ID,VOLUME_ALLOCATED,,,,,,,,,234455,9/12/2012
The 4th, 5th, 6th, 7th, 8th, 9th, 10th, and 11th fields should be empty because there is no info about: MLOCAL_VALUE, MLOCAL_TIMESTAMP, MLOCAL_LIMIT, LOCAL_VALUE, LOCAL_TIMESTAMP, LOCAL_LIMIT, MREMOTE_VALUE, MREMOTE_TIMESTAMP
VOLUME_ALLOCATED is retrieved from another csv file (called "info.csv") based on the ID field, which is processed earlier in the script like this:
info.csv
VOLUME_ALLOCATED,ID,CLIENT
5242881,64,subscriber
567743,24,visitor
data.csv
NAME,64,MLOCAL,341993,23/10/2012
NAME,24,LOCAL$REMOTE,2347$4324,19/12/2012$18/12/2012
Now, my code is this:
#! /usr/bin/bash
input="info.csv"
filedata="data.csv"
outfile="out"
nawk 'BEGIN{
while (getline < "'"$input"'")
{
split($0,ft,",");
volume=ft[1];
id=ft[2];
client=ft[3];
key=id;
volumeArr[key]=volume;
clientArr[key]=client;
}
close("'"$input"'");
while (getline < "'"$filedata"'")
{
gsub(/\$/,","); # substitute the $ separator with comma
split($0,ft,",");
volume=volumeArr[id]; # Get the volume from the volumeArr, using "id" as key
segment=clientArr[id]; # Get the client mode from the clientArr, using "id" as key
NAME=ft[1];
id=ft[2];
here I'm stuck, I can't find the right way to set the rest of the
fields since I don't know how to handle the 3rd and 4th fields.
? =ft[3];
? =ft[4];
Sorry if I've made you pretty confused, but this is my current situation right now.
Thanks
You didn't provide the expected output from your sample input but here's a start to show how to get the values for the 2 different formats of input line:
$ cat tst.awk
BEGIN{ FS=","; OFS="\t" }
{
    delete value    # or use split("",value) if your awk can't delete arrays
    if ($4 ~ /LOCAL|REMOTE/) {
        value[$3] = $5
        date[$3] = $7
        value[$4] = $6
        date[$4] = $8
    }
    else {
        value[$3] = $4
        date[$3] = $5
    }
    print
    for (type in value) {
        printf "%15s%15s%15s\n", type, value[type], date[type]
    }
}
$ awk -f tst.awk file
xxxxxx,xx,MLOCAL,MREMOTE,33222,56,22/10/2012,18/10/2012
MREMOTE 56 18/10/2012
MLOCAL 33222 22/10/2012
xxxxxx,xx,MREMOTE,MLOCAL,33222,56,22/10/2012,18/10/2012
MREMOTE 33222 22/10/2012
MLOCAL 56 18/10/2012
xxxxxx,xx,MLOCAL,341993,22/10/2012
MLOCAL 341993 22/10/2012
xxxxxx,xx,MREMOTE,9356828,08/10/2012
MREMOTE 9356828 08/10/2012
xxxxxx,xx,LOCAL,REMOTE,19316,15253,22/10/2012,22/10/2012
REMOTE 15253 22/10/2012
LOCAL 19316 22/10/2012
xxxxxx,xx,REMOTE,LOCAL,1865871,383666,22/10/2012,22/10/2012
REMOTE 1865871 22/10/2012
LOCAL 383666 22/10/2012
xxxxxx,xx,REMOTE,1180306134,19/10/2012
REMOTE 1180306134 19/10/2012
and if you post the expected output we could help you more.
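For what it's worth, here is a rough sketch of how the value/date arrays above could be mapped onto the 13-column layout described in the question's EDIT, including the info.csv lookup from the question's own script. The limit columns are computed as VOLUME_ALLOCATED minus the corresponding value, and the empty-column handling is my reading of the EDIT, so check these assumptions against the real data:
awk '
BEGIN { FS = OFS = "," }
NR == FNR {                                   # first file: info.csv (VOLUME_ALLOCATED,ID,CLIENT)
    vol[$2] = $1
    next
}
{                                             # second file: data.csv
    gsub(/\$/, ",")                           # the $-separated pairs become ordinary fields
    delete value; delete date                 # (use split("",value) etc. if your awk cannot delete arrays)
    if ($4 ~ /LOCAL|REMOTE/) {
        value[$3] = $5; date[$3] = $7
        value[$4] = $6; date[$4] = $8
    } else {
        value[$3] = $4; date[$3] = $5
    }
    v = vol[$2]
    mlim = ("MLOCAL" in value) ? v - value["MLOCAL"] : ""
    llim = ("LOCAL"  in value) ? v - value["LOCAL"]  : ""
    print $1, $2, v,
          value["MLOCAL"], date["MLOCAL"], mlim,
          value["LOCAL"],  date["LOCAL"],  llim,
          value["MREMOTE"], date["MREMOTE"],
          value["REMOTE"],  date["REMOTE"]
}' info.csv data.csv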