TCL script to regroup csv file data - csv

I am new to Tcl scripting.
I have 2 columns in my CSV file.
Example: My data
A 12
D 18
B 33
A 11
D 49
I would like to extract the column 2 values grouped by the column 1 data.
Required output:
A: 12,11
B: 33
D: 18, 49
Can someone please help me write a Tcl script to perform this task?
Thank you in advance.

This is a very well-known problem, and Tcl's dict lappend is exactly the tool you need to solve it (together with csv::split from Tcllib to parse your input data).
package require csv
# Read split and collect
set data {}
while {[gets stdin line] >= 0} {
    lassign [csv::split $line] key value;   # You might need to configure this
    dict lappend data $key $value
}
# Write out
puts "";   # Blank header line
dict for {key values} $data {
    puts [format "%s: %s" $key [join $values ", "]]
}
The above is written as a stdin-to-stdout filter. Adapting to work with files is left as an exercise.
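For instance, a file-to-file version might look like the following minimal sketch, where the file names input.csv and output.txt are only placeholders:
package require csv

# Same grouping logic as above, but reading from and writing to files.
# "input.csv" and "output.txt" are placeholder names.
set in  [open "input.csv" r]
set out [open "output.txt" w]
set data {}
while {[gets $in line] >= 0} {
    lassign [csv::split $line] key value
    dict lappend data $key $value
}
close $in
dict for {key values} $data {
    puts $out [format "%s: %s" $key [join $values ", "]]
}
close $out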

Related

print dictionary keys and values in one column in tcl

I am a new learner of the Tcl scripting language. I am using Tcl version 8.5. I read a text file through a Tcl script and count the frequency of similar words. I used a foreach loop and a dictionary to count similar words and their frequencies, but the output of the program prints like this: alpha 4 beta 2 gamma 1 delta 1
But I want to print each key, value pair of the dictionary in one column, that is, each key, value pair printed line by line in the output. Following is my script in Tcl and its output at the end.
set f [open input.txt]
set text [read $f]
foreach word [split $text] {
    dict incr words $word
}
puts $words
Output of the above script:
alpha 4 beta 2 gamma 1 delta 1
You would do:
dict for {key value} $words {
puts "$key $value"
}
When reading the dict documentation, take care about which subcommands require a dictionaryVariable (like dict incr) and which require a dictionaryValue (like dict for).
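For example, a small illustration of the difference:
# dict incr takes the *name* of a variable that holds a dictionary ...
set counts {}
dict incr counts apple          ;# pass the variable name "counts"
dict incr counts apple
# ... whereas dict for and dict get take the dictionary *value* itself
dict for {k v} $counts {        ;# pass $counts, i.e. the value
    puts "$k -> $v"
}
puts [dict get $counts apple]   ;# prints 2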
For nice formatting, as suggested by Donal, here's a very terse method:
set maxWid [tcl::mathfunc::max {*}[lmap w [dict keys $words] {string length $w}]]
dict for {word count} $words {puts [format "%-*s = %s" $maxWid $word $count]}
Or, look at the source code for the parray command for further inspiration:
parray tcl_platform ;# to load the proc
info body parray

Parse CSV into key value pair

I have a CSV file which has hostnames and attached serial numbers. I want to create key-value pairs, with the key being the hostname and the value being the list of serial numbers. There can be one or many serial numbers.
For example:
A, 1, 2, 3, 4
B, 5, 6
C, 7, 8, 9
D, 10
I need to access key A and get {1 2 3 4} as output, and if I access D I should get {10}.
How should I do this? The version of Tcl I am using doesn't have packages such as csv available, and I won't be able to install them on the server, so I am looking for a solution that doesn't involve any packages.
For now, I am splitting the data on \n and processing each line. Then I split each line on "," so that I get the hostname and serial numbers in a list; I then use index 0 of the list as the hostname and the remaining values as the serial numbers. Is there a cleaner solution?
I'd do something like:
#!/usr/bin/env tclsh
package require csv
package require struct::queue
set filename "file.csv"
set fh [open $filename r]
set q [struct::queue]
csv::read2queue $fh $q
close $fh
set data [dict create]
while {[$q size] > 0} {
    set values [lassign [$q get] hostname]
    dict set data $hostname [lmap elem $values {string trimleft $elem}]
}
dict for {key value} $data {
puts "$key => $value"
}
then
$ tclsh csv.tcl
A => 1 2 3 4
B => 5 6
C => 7 8 9
D => 10
The repeated recommendation given here is to use the csv package for this purpose; see also the answer by #glenn-jackman. If it is unavailable, the time is better invested in getting it onto your server.
To get you started, however, you might want to adopt something along the lines of:
set dat {
A, 1, 2, 3, 4
B, 5, 6
C, 7, 8, 9
D, 10
}
set d [dict create]
foreach row [split [string trim $dat] \n] {
    set row [lassign [split $row ,] key]
    dict set d [string trim $key] [concat {*}$row]
}
dict get $d A
dict get $d D
Be warned, however, that such hand-knitted solutions typically only serve their purpose when you have full control over the data being processed and its representation. Again, time is better invested in obtaining the csv package.
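To illustrate the limitation, a quoted field that itself contains a comma is handled correctly by csv::split but not by a plain split (the line below is a made-up example):
package require csv

# A quoted field containing a comma breaks a naive split on ","
set line {A,"1,5",2}
puts [split $line ,]      ;# => A {"1} {5"} 2   (wrong: four fields)
puts [csv::split $line]   ;# => A 1,5 2         (right: three fields)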
I tried this way and got it working. Thanks again for your input. Yes, I know the csv package would make this easy, but I cannot install it on the server/product.
set multihost "host_slno.csv"
set fh1 [open $multihost r]
set data [read -nonewline $fh1]
close $fh1
set hostslnodata [split $data "\n"]
set hostslno [dict create]
foreach line $hostslnodata {
    set line1 [join [split $line ", "]]
    puts "$line1"
    if {[regexp {([A-Za-z0-9_\-]+)\s+(.*)} $line1 match hostname serial_numbers]} {
        dict lappend hostslno $hostname $serial_numbers
    }
}
puts [dict get $hostslno]
The source code of the csv package is available. If you are unable to install the full csv package, you can include the code from here:
http://core.tcl.tk/tcllib/artifact/2898cd911697ecdb
If you still can't use that option, then stripping out all the whitespace and splitting on "," is required.
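A minimal sketch of that approach, assuming the simple, unquoted host,serial,... layout from the question:
# Split each line on "," and trim the whitespace from every field
# (only safe for simple, unquoted CSV like the host/serial-number data).
set hostslno [dict create]
set fh [open "host_slno.csv" r]
while {[gets $fh line] >= 0} {
    if {[string trim $line] eq ""} continue
    set fields [lmap f [split $line ,] {string trim $f}]
    set serials [lassign $fields hostname]
    dict set hostslno $hostname $serials
}
close $fh
puts [dict get $hostslno A]   ;# e.g. => 1 2 3 4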
An alternative to the earlier answers is using string map:
set row [split [string map {" " ""} $row ] ,]
The string map removes all the spaces, and the result is then split on ",".
Once you have converted the lines of text into valid Tcl lists:
A 1 2 3 4
B 5 6
C 7 8 9
D 10
Then you can use the lindex and lrange commands to pluck off all the pieces.
foreach row $data {
    set server [lindex $row 0]
    set serial_numbers [lrange $row 1 end]
    dict set ...
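Put together, that loop might look like this minimal sketch (it assumes data already holds the cleaned-up rows, e.g. {A 1 2 3 4} {B 5 6} {C 7 8 9} {D 10}):
# Build the dict from rows that are already valid Tcl lists.
set hosts [dict create]
foreach row $data {
    set server [lindex $row 0]
    set serial_numbers [lrange $row 1 end]
    dict set hosts $server $serial_numbers
}
puts [dict get $hosts A]   ;# => 1 2 3 4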
One possibility:
set hostslno [dict create]
set multihost "host_slno.csv"
set fh1 [open $multihost]
while {[gets $fh1 line] >= 0} {
    set numbers [lassign [regexp -inline -all {[^\s,]+} $line] hostname]
    dict set hostslno $hostname $numbers
}
close $fh1
puts [dict get $hostslno A]

How to get selective data from a file in TCL?

I am trying to parse selective data from a file based on certain keywords using Tcl. For example, I have a file like this:
...
...
..
...
data_start
30 abc1 xyz
90 abc2 xyz
214 abc3 xyz
data_end
...
...
...
How do I catch only the 30, 90 and 214 between "data_start" and "data_end"? What I have so far (Tcl newbie):
proc get_data_value { data_file } {
    set lindex 0
    set fp [open $data_file r]
    set filecontent [read $fp]
    while {[gets $filecontent line] >= 0} {
        if { [string match "data_start" ]} {
            #Capture only the first number?
            #Use regex? or something else?
            if { [string match "data_end" ] } {
                break
            } else {
                ##Do Nothing?
            }
        }
    }
    close $fp
}
If your file is small, then you can use the read command to slurp the whole data into a variable and then apply regexp to extract the required information.
input.txt
data_start
30 abc1 xyz
90 abc2 xyz
214 abc3 xyz
data_end
data_start
130 abc1 xyz
190 abc2 xyz
1214 abc3 xyz
data_end
extractNumbers.tcl
set fp [open input.txt r]
set data [read $fp]
close $fp
set result [regexp -inline -all {data_start.*?\n(\d+).*?\n(\d+).*?\n(\d+).*?data_end} $data]
foreach {whole_match number1 number2 number3} $result {
    puts "$number1, $number2, $number3"
}
Output:
30, 90, 214
130, 190, 1214
Update:
Reading a larger file's content into a single variable may cause the program to crash, depending on the memory of your PC. When I tried to read a file of size 890 MB with the read command on a Win7 laptop with 8 GB RAM, I got an "unable to realloc 531631112 bytes" error message and tclsh crashed. After some benchmarking I found that it is able to read a file of size 500,015,901 bytes, but the program will then consume about 500 MB of memory, since it has to hold the data.
Also, having a variable hold this much data is not efficient when it comes to extracting the information via regexp. So, in such cases, it is better to read the content line by line, as sketched below.
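A line-by-line version of the same extraction might look like this sketch, which tracks whether we are inside a data_start/data_end block:
# Extract the leading numbers line by line instead of slurping the whole
# file, so memory use stays roughly constant even for very large files.
set fp [open input.txt r]
set inside 0
while {[gets $fp line] >= 0} {
    if {$line eq "data_start"} {
        set inside 1
        set block {}
    } elseif {$line eq "data_end"} {
        set inside 0
        puts [join $block ", "]
    } elseif {$inside && [regexp {^(\d+)} $line -> num]} {
        lappend block $num
    }
}
close $fp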
Load all the data from the file into a variable. Set start and end tokens and seek to those positions. Process each item line by line. Tcl uses lists of strings separated by whitespace, so we can process the items in a line with foreach {a b c} $line {...}.
tcl:
set data {...
...
..
...
data_start
30 abc1 xyz
90 abc2 xyz
214 abc3 xyz
data_end
...
...
...}
set i 0
set start_str "data_start"
set start_len [string length $start_str]
set end_str "data_end"
set end_len [string length $end_str]
while {[set start [string first $start_str $data $i]] != -1} {
    set start [expr {$start + $start_len}]
    set end [string first $end_str $data $start]
    set end [expr {$end - 1}]
    set item [string range $data $start $end]
    set lines [split $item "\n"]
    foreach {line} $lines {
        foreach {a b c} $line {
            puts "a=$a, b=$b, c=$c"
        }
    }
    set i [expr {$end + $end_len}]
}
output:
a=30, b=abc1, c=xyz
a=90, b=abc2, c=xyz
a=214, b=abc3, c=xyz
I'd write that as
set fid [open $data_file]
set nums {}   ;# collect the numbers here, so the final puts also works if nothing matches
set p 0
while {[gets $fid line] != -1} {
    switch -regexp -- $line {
        {^data_end}   {set p 0}
        {^data_start} {set p 1}
        default {
            if {$p && [regexp {^(\d+)\M} $line -> num]} {
                lappend nums $num
            }
        }
    }
}
close $fid
puts $nums
or, even
set nums [exec sed -rn {/data_start/,/data_end/ {/^([[:digit:]]+).*/ s//\1/p}} $data_file]
puts $nums
My favorite method would be to declare procs for each of the acceptable tokens and utilize the unknown mechanism to quietly ignore the unacceptable ones.
proc 30 args {
    ... handle 30 $args
}
proc 90 args {
    ... process 90 $args
}
rename unknown original_unknown
proc unknown args {
    # This space was deliberately left blank
}
source datafile.txt
rename original_unknown unknown
You'll be using Tcl's built-in parsing, which should be considerably faster. It also looks better in my opinion.
You can also put the line-handling logic into your unknown-procedure entirely:
rename unknown original_unknown
proc unknown {first args} {
    process $first $args
}
source input.txt
rename original_unknown unknown
Either way, the trick is that Tcl's own parser (implemented in C) will be breaking up the input lines into tokens for you -- so you don't have to implement the parsing in Tcl yourself.
This does not always work -- if, for example, the input is using multi-line syntax (without { and }) or if the tokens are separated with something other than white space. But in your case it should do nicely.
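For instance, the second variant only needs a definition for the handler to be complete; here is a minimal sketch in which process and its output format are made up for illustration:
# Sketch of the unknown-based variant. Every data line such as "30 abc1 xyz"
# is executed as a command, so "30" arrives as the command name and the rest
# as its arguments. "process" is a made-up handler that keeps only the lines
# whose first word is a number.
proc process {first rest} {
    if {[string is integer -strict $first]} {
        puts "number=$first rest=$rest"
    }
}

rename unknown original_unknown
proc unknown {first args} {
    process $first $args
}

source datafile.txt   ;# the data file from the question

rename original_unknown unknown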

Split Lines seperated by spaces to arrays

I have a text file that contains output from a program. It reads like this:
1 2
23 24
54 21
87 12
I need the output to be
arr[1]=2
arr[23]=24
arr[54]=21
arr[87]=12
and so on.
Each line is separated by a space. How can I parse the lines to the array format as described above, using Tcl? (I am doing this for NS2, by the way.)
With awk:
awk '{ print "arr[" $1 "]=" $2 }' filename
You have mentioned that each line is separated by a space, but gave content separated by newlines. I assume that each line is separated by a newline and that, within each line, the array index and its value are separated by a space.
If your text file contains only those texts given as below
1 2
23 24
54 21
87 12
Then, you first read the whole file into a string.
set fp [open "input.txt" r]
set content [ read $fp ]
close $fp
Now, with array set we can easily convert them into an array.
# If your Tcl version is less than 8.5, use the line of code below
eval array set legacy {$content}
foreach index [array names legacy] {
    puts "array($index) = $legacy($index)"
}
# If you have Tcl 8.5 or later, use the line of code below
array set latest [list {*}$content]
foreach index [array names latest] {
    puts "array($index) = $latest($index)"
}
If your file has other content along with these input lines, then you can pick out the pairs alone using regexp and add the elements to the array one by one in the classical way, as in the sketch below.
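Something along these lines, assuming the index/value pairs are both plain numbers (the file name input.txt is a placeholder):
# Pick out only the "number number" pairs from a mixed file and add them
# to the array one entry at a time.
set fp [open "input.txt" r]
while {[gets $fp line] >= 0} {
    if {[regexp {^\s*(\d+)\s+(\d+)\s*$} $line -> key value]} {
        set arr($key) $value
    }
}
close $fp
parray arr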
You can use this in BASH:
declare -A arr
while read -r k v ; do
    arr[$k]=$v
done < file
Testing:
declare -p arr
declare -A arr='([23]="24" [54]="21" [87]="12" [1]="2" )'

Looking for a search string in a file and using those lines for processing in TCL

To be more precise:
I need to look into a file abc.txt whose contents are something like this:
files/f1/atmp.c 98 100
files/f1/atmp1.c 89 100
files/f1/atmp2.c !! 75 100
files/f2/btmp.c 92 100
files/f2/btmp2.c !! 85 100
files/f3/xtmp.c 92 100
The script needs to find "!!" and use those lines to print out the following as output:
atmp2.c 75
btmp2.c 85
Any help?
This should do the trick.
set data {files/f1/atmp.c 98 100
files/f1/atmp1.c 89 100
files/f1/atmp2.c !! 75 100
files/f2/btmp.c 92 100
files/f2/btmp2.c !! 85 100
files/f3/xtmp.c 92 100}
set lines [split $data \n]
foreach line $lines {
    set match [regexp {(\S+)\s+!!\s+(\d+)} $line -> file num]
    if {$match} {puts "$file $num"}
}
Although regexp has an -all switch, I don't think we can use it here with match variables, as those only hold the last match when -all is given.
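That said, regexp -all -inline returns every match and its submatches as one flat list, which can then be walked three elements at a time; a sketch using the same data variable as above:
# -all -inline returns {fullMatch file num fullMatch file num ...},
# so iterate over the flat result in groups of three.
foreach {full file num} [regexp -all -inline {(\S+)\s+!!\s+(\d+)} $data] {
    puts "$file $num"
}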
If your file isn't huge, you can slurp the whole thing into memory, split the lines into a Tcl list, and then iterate through the list looking for a match. For example:
set fh [open foo]
set lines [read $fh]
close $fh
set lines [split $lines "\n"]
foreach line $lines {
    if {[regexp {.*/(\S+\.c)\s*!!\s*(\d+)} $line match file data]} {
        puts "$file $data"
    }
}
This will successfully return just the lines with "!!" in them. With your posted corpus, the results are:
atmp2.c 75
btmp2.c 85
I might be tempted in this case to exec to awk:
set output [exec awk {$2 == "!!" {print $1, $3}} abc.txt]
puts $output
The trick is to combine the code that reads lines from the file with a regular expression that detects matching lines and extracts the relevant parts (a one-step process with regexp). The only tricky part is working out what exactly to use as the regular expression, so that you get exactly what you want. I'm going to guess that you're after the parts of the filenames after the /, that those filenames won't contain spaces, and that the number you're after is the entirety of the first digit sequence after the double exclamation. (Other formats are possible, some of which are easier to extract with other tools such as scan.) That would give us something like this:
set f [open abc.txt]
while {[gets $f line] >= 0} {
    if {[regexp {([^\s/]+)\s+!!\s+(\d+)} $line -> name value]} {
        # Or do whatever you want with these
        puts "$name $value"
    }
}
close $f
(The gets command with two arguments returns the length of line read, or -1 on failure. For normal files the only failure mode is EOF, so we can just terminate the loop when we get a negative value. Other kinds of channels can be more complex…)
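As an illustration of the scan alternative mentioned above, here is a sketch that assumes the marker is always the literal second word of a "path !! value total" line:
# Use scan instead of regexp: rely on the fixed word layout and on the
# "!!" marker sitting right after the file path. scan returns the number
# of successful conversions, so 2 means both the path and the value matched.
set f [open abc.txt]
while {[gets $f line] >= 0} {
    if {[scan $line {%s !! %d} path value] == 2} {
        puts "[file tail $path] $value"
    }
}
close $f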