Splitting a huge file into smaller files

Splitting a huge file into smaller files - tcl

How can I split a huge file into n number of smaller files using Tcl? The file name to split and number of files to be created have to be given through command line. Here is what I have so far:
proc splitter { file no } {
set lnum 0
set file_open [open $file r]
while {[gets $file_open line] >= 0 } {
incr lnum
}
puts "$lnum"
set num [expr $lnum/$no]
close $file_open
}

Here is one way to split text files, which has the advantage of not holding much in memory at once. (You can also split binary files, but then you need to use read instead of gets, and also to consider whether there are record boundaries in the data; text is mostly simpler.)
#!/usr/bin/env tclsh8.5
proc splitter {filename fileCount} {
set targetFileSize [expr {[file size $filename] / $fileCount}]
set n 0
set fin [open $filename]
while {[gets $fin line]} {
if {![info exist fout]} {
set fout [open $filename.split_[incr n] w]
}
puts $fout $line
if {[tell $fout] > $targetFileSize} {
close $fout
unset fout
}
}
if {[info exist fout]} {
close $fout
}
close $fin
}
splitter {*}$argv; # Connect to outside command line

use the global argv array to access command line parameters
after you read the file to count the lines, instead of closing the file handle, you can seek back to the top of the file.
if you're on *nix, have you considered using exec to call out to split?

Related

Search in file for number, increment and replace

I have a VHDL file which has a line like this:
constant version_nr :integer := 47;
I want to increment the number in this line in the file. Is there a way to accomplish this with TCL?

This is principally a string operation. The tricky bit is finding the line to operate on and picking the number out of it. This can be occasionally awkward, but it is mainly a matter of choosing a suitable regular expression (as this is the kind of parsing task that they excel at). A raw RE to do the matching would be this:
^\s*constant\s+version_nr\s*:integer\s*:=\s*\d+\s*;\s*$
This is essentially converting all possible places for a whitespace sequence into \s* (except where whitespace is mandatory, which becomes \s+) and matching the number with \d+, i.e., a digit sequence. We then add in parentheses to capture the interesting substrings, which are the prefix, the number, and the suffix:
^(\s*constant\s+version_nr\s*:integer\s*:=\s*)(\d+)(\s*;\s*)$
Now we have enough to make the line transform (which we'll do as a procedure so we can give it a nice name):
proc lineTransform {line} {
set RE {^(\s*constant\s+version_nr\s*:integer\s*:=\s*)(\d+)(\s*;\s*)$}
if {[regexp $RE $line -> prefix number suffix]} {
# If we match, we increment the number...
incr number
# And reconcatenate it with the prefix and suffix to make the new line
set line $prefix$number$suffix
}
return $line
}
In Tcl 8.7 (which you won't be using yet) you can write this as this more succinct form:
proc lineTransform {line} {
# Yes, this version can be a single (long) line if you want
set RE {^(\s*constant\s+version_nr\s*:integer\s*:=\s*)(\d+)(\s*;\s*)$}
regsub -command $RE $line {apply {{- prefix number suffix} {
# Apply the increment when the RE matches and build the resulting line
string cat $prefix [incr number] $suffix
}}}
}
Now that we have a line transform, we've just got to apply that to all the lines of the file. This is easily done with a file that fits in memory (up to a few hundred MB) but requires additional measures for larger files as you need to stream from one file to another:
proc transformSmallFile {filename} {
# Read data into memory first
set f [open $filename]
set data [read $f]
close $f
# Then write it back out, applying the transform as we go
set f [open $filename w]
foreach line [split $data "\n"] {
puts $f [transformLine $line]
}
close $f
}
proc transformLargeFile {filename} {
set fin [open $filename]
# The [file tempfile] command makes working with temporary files easier
set fout [file tempfile tmp [file normalize $filename]]
# A streaming transform; requires that input and output files be different
while {[gets $fin line] >= 0} {
puts $fout [transformLine $line]
}
# Close both channels; flushes everything to disk too
close $fin
close $fout
# Rename our temporary over the original input file, replacing it
file rename $tmp $filename
}

Tcl/Tk write in a specific line

I want to write in a specific line in Textdocument but there´s a Problem with my code, i don´t know where the bug is.
set fp [open C:/Users/user/Desktop/tst/settings.txt w]
set count 0
while {[gets $fp line]!=-1} {
incr count
if {$count==28} {
break
}
}
puts $fp "TEST"
close $fp
The File only contains TEST.
Has anybody an idea?

With short text files (these days, short is up to hundreds of megabytes!) the easiest way is to read the whole file into memory, do the text surgery there, and then write the whole lot back out. For example:
set filename "C:/Users/user/Desktop/tst/settings.txt"
set fp [open $filename]
set lines [split [read $fp] "\n"]
close $fp
set lines [linsert $lines 28 "TEST"]
# Read a line with lindex, find a line with lsearch
# Replace a line with lset, replace a range of lines with lreplace
set fp [open $filename w]
puts $fp [join $lines "\n"]
close $fp
Doing it this way is enormously easier, and avoids a lot of complexities that can happen with updating a file in place; save those for gigabyte-sized files (which won't be called settings.txt in any sane world…)

You are using 'w' as access argument, which truncates the file. So you will loose all data from file while opening. Read more about open command
You can use 'r+' or 'a+'.
Also To write after a particular line you can move the pointer to the desired location.
set fp [open C:/Users/user/Desktop/tst/settings.txt r+]
set count 0
while {[gets $fp line]!=-1} {
incr count
if {$count==28} {
break
}
set offset [tell $fp]
}
seek $fp $offset
puts $fp "TEST"
close $fp
To replace a complete line it would be easier to do in following way. Rewrite all the lines and write new data on the desired line.
set fp [open C:/Users/user/Desktop/tst/settings.txt r+]
set count 0
set data [read $fp]
seek $fp 0
foreach line [split $data \n] {
incr count
if {$count==28} {
puts $fp "TEST"
} else {
puts $fp $line
}
}
close $fp

package require fileutil
set filename path/to/settings.txt
set count 0
set lines {}
::fileutil::foreachLine line $filename {
incr count
if {$count == 28} {
break
}
append lines $line\n
}
append lines TEST\n
::fileutil::writeFile $filename $lines
This is a simple and clean way to do it. Read the lines up to the point where you want to write, and then write back those lines with your new content added.

I'd suggest it would be easier to spawn an external program that specializes in this:
exec sed -i {28s/.*/TEST/} path/to/settings.txt

Tcl: Check for same content between two files

I have a program which I made in vimscript which checks two files if they are the same. It makes a system call to diff to verify if they are differents or not.
I need something similar in Tcl but without resorting to external commands or system calls. I don't need to know the difference or have comparison between the files, just to return 1 if both files have the same content or 0 if the contents are different.

proc comp_file {file1 file2} {
# optimization: check file size first
set equal 0
if {[file size $file1] == [file size $file2]} {
set fh1 [open $file1 r]
set fh2 [open $file2 r]
set equal [string equal [read $fh1] [read $fh2]]
close $fh1
close $fh2
}
return $equal
}
if {[comp_file /tmp/foo /tmp/bar]} {
puts "files are equal"
}

For a straight binary comparison, you can just work a chunk at a time. (4kB is probably quite enough per chunk though you can pick larger values; I/O overhead will dominate in any case.) The simplest way to express this is with a loop inside a try…finally (requires Tcl 8.6):
proc sameContent {file1 file2} {
set f1 [open $file1 "rb"]
set f2 [open $file2 "rb"]
try {
while 1 {
if {[read $f1 4096] ne [read $f2 4096]} {
return 0
} elseif {[eof $f1]} {
# The same if we got to EOF at the same time
return [eof $f2]
} elseif {[eof $f2]} {
return 0
}
}
} finally {
close $f1
close $f2
}
}
Otherwise, we can take advantage of the fact that we can see if a variable has been set to keep the logic fairly simple (which is quite a lot less clear) to make code that works in older versions of Tcl:
proc sameContent {file1 file2} {
set f1 [open $file1]
fconfigure $f1 -translation binary
set f2 [open $file2]
fconfigure $f2 -translation binary
while {![info exist same]} {
if {[read $f1 4096] ne [read $f2 4096]} {
set same 0
} elseif {[eof $f1]} {
# The same if we got to EOF at the same time
set same [eof $f2]
} elseif {[eof $f2]} {
set same 0
}
}
close $f1
close $f2
return $same
}
Both are invoked in the same way:
if {[sameContent "./foo.txt" "some/dir/bar.txt"]} {
puts "They're the same contents, byte-for-byte"
} else {
puts "A difference was found"
}

Reading multiple lines from a file using TCL?

How do I read more than a single line in a file using tcl? That is by default the gets command reads till a new line is found, how do I change this behaviour to read a file till a specific character is found?

If you don't mind reading over a bit, you can do it by looping with gets or read in a loop:
set data ""
while {[gets $chan line] >= 0} {
set idx [string first $whatToLookFor $line]
if {$idx == -1} {
append data $line\n
} else {
# Decrement idx; don't want first character of $whatToLookFor
append data [string range $line 0 [incr idx -1]]
break
}
}
# Data has everything up to but not including $whatToLookFor
If you're looking for multiline patterns, I suggest reading the whole file into memory and working on that. It's just so much easier than trying to write a correct matcher:
set data [read $chan]
set idx [string first $whatToLookFor $data]
if {$idx > -1} {
set data [string range $data 0 [incr idx -1]]
}
This latter form will also work just fine with binary data. Just remember to fconfigure $chan -translation binary first if you're doing that.

Use fconfigure.
set fp [open "somefile" r]
fconfigure $fp -eofchar "char"
set data [read $fp]
close $fp

In addition to Donal's good advice, you could get a list of records by reading the whole file and splitting on the record separator:
package require textutil::split
set records [textutil::splitx [read $chan] "record_separator"]
Documentation

Parsing a file with Tcl

I have a file in here which has multiple set statements. However I want to extract the lines of my interest. Can the following code help
set in [open filename r]
seek $in 0 start
while{ [gets $in line ] != -1} {
regexp (line to be extracted)
}

Other solution:
Instead of using gets I prefer using read function to read the whole contents of the file and then process those line by line. So we are in complete control of operation on file by having it as list of lines
set fileName [lindex $argv 0]
catch {set fptr [open $fileName r]} ;
set contents [read -nonewline $fptr] ;#Read the file contents
close $fptr ;#Close the file since it has been read now
set splitCont [split $contents "\n"] ;#Split the files contents on new line
foreach ele $splitCont {
if {[regexp {^set +(\S+) +(.*)} $ele -> name value]} {
puts "The name \"$name\" maps to the value \"$value\""
}
}
How to run this code:
say above code is saved in test.tcl
Then
tclsh test.tcl FileName
FileName is full path of file unless the file is in the same directory where the program is.

First, you don't need to seek to the beginning straight after opening a file for reading; that's where it starts.
Second, the pattern for reading a file is this:
set f [open $filename]
while {[gets $f line] > -1} {
# Process lines
if {[regexp {^set +(\S+) +(.*)} $line -> name value]} {
puts "The name \"$name\" maps to the value \"$value\""
}
}
close $f
OK, that's a very simple RE in the middle there (and for more complicated files you'll need several) but that's the general pattern. Note that, as usual for Tcl, the space after the while command word is important, as is the space between the while expression and the while body. For specific help with what RE to use for particular types of input data, ask further questions here on Stack Overflow.

Yet another solution:
as it looks like the source is a TCL script, create a new safe interpreter using interp which only has the set command exposed (and any others you need), hide all other commands and replace unknown to just skip anything unrecognised. source the input in this interpreter

Here is yet another solution: use the file scanning feature of Tclx. Please look up Tclx for more info. I like this solution for that you can have several scanmatch blocks.
package require Tclx
# Open a file, skip error checking for simplicity
set inputFile [open sample.tcl r]
# Scan the file
set scanHandle [scancontext create]
scanmatch $scanHandle {^\s*set} {
lassign $matchInfo(line) setCmd varName varValue; # parse the line
puts "$varName = $varValue"
}
scanfile $scanHandle $inputFile
close $inputFile
Yet another solution: use the grep command from the fileutil package:
package require fileutil
puts [lindex $argv 0]
set matchedLines [fileutil::grep {^\s*set} [lindex $argv 0]]
foreach line $matchedLines {
# Each line is in format: filename:line, for example
# sample.tcl:set foo bar
set varName [lindex $line 1]
set varValue [lindex $line 2]
puts "$varName = $varValue"
}

I've read your comments so far, and if I understand you correctly your input data file has 6 (or 9, depending which comment) data fields per line, separated by spaces. You want to use a regexp to parse them into 6 (or 9) arrays or lists, one per data field.
If so, I'd try something like this (using lists):
set f [open $filename]
while {[gets $f line] > -1} {
# Process lines
if {[regexp {(\S+) (\S+) (\S+) (\S+) (\S+) (\S+)} $line -> name source drain gate bulk inst]} {
lappend nameL $name
lappend sourceL $source
lappend drainL $drain
lappend gateL $gate
lappend bulkL $bulk
lappend instL $inst
}
}
close $f
Now you should have a set of 6 lists, one per field, with one entry in the list for each item in your input file. To access the i-th name, for example, you grab $nameL[$i].
If (as I suspect) your main goal is to get the parameters of the device whose name is "foo", you'd use a structure like this:
set name "foo"
set i [lsearch $nameL $name]
if {$i != -1} {
set source $sourceL[$i]
} else {
puts "item $name not found."
set source ''
# or set to 0, or whatever "not found" marker you like
}

set File [ open $fileName r ]
while { [ gets $File line ] >= 0 } {
regex {(set) ([a-zA-Z0-0]+) (.*)} $line str1 str2 str3 str4
#str2 contains "set";
#str3 contains variable to be set;
#str4 contains the value to be set;
close $File
}

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Splitting a huge file into smaller files - tcl

use the global argv array to access command line parameters after you read the file to count the lines, instead of closing the file handle, you can seek back to the top of the file. if you're on *nix, have you considered using exec to call out to split?

Related

Search in file for number, increment and replace

Tcl/Tk write in a specific line

Tcl: Check for same content between two files

Reading multiple lines from a file using TCL?

Parsing a file with Tcl

Categories

Resources