Change UTF-8 encoding of a file to ISO 8859-15 in Tcl

I have written a Tcl script that starts by reading the file My_Text_File.txt:
set myfile [open My_Text_File.txt]
set file_data [read $myfile]
The file My_Text_File.txt is encoded in UTF-8, but it must be encoded in ISO 8859-15 (also referred to as Latin-9). Is there a way to extend the Tcl code so that it converts a UTF-8 encoded text file to an ISO 8859-15 encoded one?
I would like to emphasize that the conversion from UTF-8 to ISO 8859-15 must be done inside the Tcl code.
Thanks in advance!

You have to read your original file, converting from UTF-8 to Tcl's native Unicode encoding, then write the contents to a temporary file using ISO 8859-15 encoding, and finally replace the original with the temporary file. Tcl has a few commands that make this easy:
#!/usr/bin/env tclsh
# See `encoding names` for the list of character encodings supported
# by your version of tcl
proc convert_file {file to_encoding {from_encoding ""}} {
    set infile [open $file]
    # Assume original file is in the default system encoding if no
    # explicit from encoding is given.
    if {$from_encoding ne ""} {
        chan configure $infile -encoding $from_encoding
    }
    # Create a temporary file to write the re-encoded text to
    set outfile [file tempfile temp_name]
    chan configure $outfile -encoding $to_encoding
    # Efficiently read everything from one channel and write to another.
    chan copy $infile $outfile
    chan close $infile
    chan close $outfile
    # Replace the original with the temporary file (copy, then delete)
    file copy -force $temp_name $file
    file delete -force $temp_name
}
convert_file My_Text_File.txt iso8859-15 utf-8

If the file isn't too big, it's easy to just read everything into memory and write it out again with the new encoding.
# Very simple conversion utility
proc convertFile {filename fromEncoding toEncoding} {
    # Read a file in a given encoding
    set f [open $filename]
    chan configure $f -encoding $fromEncoding
    set contents [chan read $f]
    chan close $f
    # Write a file in a given encoding
    set f [open $filename w]
    chan configure $f -encoding $toEncoding
    chan puts -nonewline $f $contents
    chan close $f
}
# Apply to the particular case we care about
convertFile My_Text_File.txt utf-8 iso8859-15
For a large file, you need to stream the data from one file to another (and then you can rename the target file afterwards).
Beware! Converting UTF-8 to ISO 8859-15 can lose information if there are characters in the source text that are not present in the target encoding.
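To find out in advance whether a conversion would lose anything, one option is to round-trip each character through the target encoding and see whether it survives. This is only a sketch: the proc name unencodableChars is made up for illustration, and it assumes that an unrepresentable character either comes back changed or makes encoding convertto raise an error, depending on the Tcl version.

```tcl
# Return the distinct characters of $text that $encoding cannot represent.
# (unencodableChars is a hypothetical helper name, not a built-in command.)
proc unencodableChars {text encoding} {
    set bad {}
    foreach ch [split $text ""] {
        # Some Tcl versions substitute a fallback character, others raise
        # an error; treat both outcomes as "not representable".
        if {[catch {encoding convertto $encoding $ch} bytes]
                || [encoding convertfrom $encoding $bytes] ne $ch} {
            if {[lsearch -exact $bad $ch] < 0} {
                lappend bad $ch
            }
        }
    }
    return $bad
}

# e-acute and the euro sign both exist in ISO 8859-15
puts [unencodableChars "caf\u00e9 \u20ac5" iso8859-15]
# Hiragana "a" and the snowman do not
puts [unencodableChars "\u3042 and \u2603" iso8859-15]
```

Running this over the file contents before converting tells you which characters would be replaced.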

Related

Difference between file size and string bytelength of text file content?

I'm experimenting with coding a very small application-specific local server in Tcl and don't understand the proper method of determining Content-length. I read that it is bytes or decimal number of octets.
In the code below, [file size "index.html"] returns the correct length such that the browser read/loads all of the content; but [string bytelength $html] is too small and the browser does not read to the end.
Why is this and is there a better method? Thank you.
if { $op eq "GET" } {
    if { $arg eq "/" } {
        set fp [open "index.html" r]
        set html [read $fp]
        set resp "HTTP/1.1 200 OK\n"
        append resp "Connection: Keep-Alive\n"
        append resp "Content-Type: text/html; charset: utf-8\n"
        append resp "Content-length: [file size "index.html"]\n\n"
        #append resp "Content-length: [string bytelength $html]\n\n"
        append resp $html
        puts stdout $resp
        puts $so $resp
        close $fp
        unset html resp
    }
    # Remainder of if $arg
}
The result of file size is the number of bytes that the file takes up on disk, and is exactly the number reported by the OS. (It's also the offset you'd be at if you opened the file and seeked to the end.)
If you were to read the file in in binary mode, the string length of what you read would be the same as the file size. When the file is read in (default) text mode it's different because it depends on the encoding that the file is read with; encodings like UTF-8 can use multiple bytes to describe a character and string length reports the number of characters in a string.
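The difference is easy to see with a small experiment. The sketch below writes a short string containing two non-ASCII characters to a temporary file as UTF-8 and then compares the three numbers. The sample text is made up, and file tempfile assumes Tcl 8.6 or later.

```tcl
# Sample text: 7 characters, two of them outside ASCII
set text "caf\u00e9 \u20ac5"

# Write it as UTF-8 without any newline translation
set f [file tempfile path]
fconfigure $f -encoding utf-8 -translation lf
puts -nonewline $f $text
close $f

# Read it back as text (characters)
set f [open $path r]
fconfigure $f -encoding utf-8
set asText [read $f]
close $f

# Read it back in binary mode (bytes)
set f [open $path rb]
set asBytes [read $f]
close $f

set size [file size $path]
puts "file size:     $size"
puts "text length:   [string length $asText]"
puts "binary length: [string length $asBytes]"
file delete $path
```

The file size and the binary length agree (10 bytes here), while the text length is the character count (7), because the e-acute takes two bytes in UTF-8 and the euro sign takes three.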
The string bytelength command reports the number of bytes used by the data when it is encoded using Tcl's internal encoding (which is rather similar to UTF-8 but not exactly; there are specific denormalizations). That encoding is not normally exposed to the outside world, and is only really of interest to C extensions. Of course, those C extensions can get the length of a string for themselves easily anyway: it's produced (as an OUT parameter because the string itself is the return value) by Tcl_GetStringFromObj() so string bytelength isn't very useful. Indeed, I've only ever found one (1) legitimate use for it, and a better job of integration work with that extension would have got rid of it.
The value reported by string bytelength is not the amount of storage currently used by a value, but rather just (closely related to, by a static difference) the amount of storage used by the standard “string” interpretation. If the value has any other (“internal”) representation as well, which is common (numbers, binary data, true-unicode data, lists, dictionaries, command names, channel handlers, executable code, all those may have additional representation data) then that is not counted.
In your case, you want to open the file in binary mode and use that. And also do this:
set filename "index.html"
set fp [open $filename rb]; # NB: rb — b is for BINARY; this is important
set size [file size $filename]
# HTTP spec says headers are ISO 8859-1 and CRLF-separated
fconfigure $so -encoding iso8859-1 -translation crlf
set headers ""
append headers "HTTP/1.1 200 OK\n"
append headers "Connection: Keep-Alive\n"
# Detecting the content type of a file is its own chunk of complexity
append headers "Content-Type: text/html; charset=utf-8\n"
append headers "Content-length: $size\n"
puts stdout $headers
puts $so $headers
# Ship the data in binary mode; fcopy is VERY efficient
fconfigure $so -translation binary
fcopy $fp $so -size $size
close $fp
Writing HTTP messages to the console is a bit messy because of the mixed encoding used; it's not normally a good idea to write the body of a file. But for debugging you would do:
set data [read $fp]
puts stdout $data
# Additional -nonewline to not add a line terminator
puts -nonewline $so $data
However, the fcopy command (also called chan copy in newer Tcl as part of a command systematization effort) is much more efficient when moving binary data from one place to another. The only way we could make it significantly more efficient would be to move the copy into the OS kernel.
tl;dr: You don't want to use string bytelength. What it does is subtly not useful.
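If the body is generated in memory rather than read from a file, one way to get the right Content-Length is to encode the string explicitly and count the resulting bytes. A minimal sketch, with contentLength as a hypothetical helper name:

```tcl
# Number of bytes $body occupies once encoded for the wire.
# (contentLength is a made-up helper name for illustration.)
proc contentLength {body {encoding utf-8}} {
    return [string length [encoding convertto $encoding $body]]
}

set html "<p>caf\u00e9</p>"
puts "characters: [string length $html]"
puts "utf-8 bytes: [contentLength $html]"
puts "iso8859-1 bytes: [contentLength $html iso8859-1]"
```

The encoding passed here should match both the charset announced in Content-Type and the -encoding configured on the socket when the body is written.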

How to copy a row in a .csv file into a column in another .csv file in Tcl?

I wish to copy a specific row in a .csv file to a specific column in another .csv file using Tcl.
What I've tried is to copy the row I wanted into a new .csv file and then copy this row manually into my .csv file. But I wish to automate all this and directly copy a row in the .csv into a column in an existing .csv file.
Here is what I tried:
package require csv
set fp [open "filenameSource.csv" r]
set secondColumnData {}
while {[gets $fp line]>=0} {
    if {[llength $line]>0} {
        lappend secondColumnData [lindex [split $line ","] 1]
    }
}
close $fp
puts $secondColumnData
set filename "Destination.csv"
set fileId [open $filename "w"]
puts -nonewline $fileId $secondColumnData
close $fileId
Is there a way to have a pointer at row x in the source file and copy it to a specific destination in the Destination file?
I am new to Tcl. Please provide an example.
Thanks,
IEK
One thing you'll need to learn as a newcomer to Tcl is that there's a lot of useful code in Tcllib, a suite of packages written by the Tcl community. In this case, the csv and struct::matrix packages make this task trivial (as I understand it), which is great because CSV files have some tricky aspects that aren't obvious.
package require csv
package require struct::matrix
# Read the source data
set srcMatrix [struct::matrix]
set f [open "filenameSource.csv" r]
csv::read2matrix $f $srcMatrix , auto; # "auto" grows the new, empty matrix to fit the data
close $f
# Read the destination data so we can UPDATE it
set dstMatrix [struct::matrix]
set f [open "Destination.csv" r+]
csv::read2matrix $f $dstMatrix , auto; # "auto" grows the new, empty matrix to fit the data
# Leaving the file open; we're going to rewrite it…
# Do the copying operation; I assume you know which row and column to copy from/to
$dstMatrix set column 2 [$srcMatrix get row 34]
# Write back
chan seek $f 0
csv::writematrix $dstMatrix $f; # NB: the matrix comes first, then the channel
chan truncate $f; # Make sure there's no junk left if the file shortened
close $f

How to remove the last newline character from a file in Tcl

I need code to remove only the last newline character from a file in Tcl.
Suppose a file contains:
aaa 11
bbb 12
cc 14
newline character
How can I remove that newline character from the file in Tcl?
Please help me with this!
Seeking and truncating are your friends. (Requires Tcl 8.5 or later.)
set f [open "theFile.txt" r+]
# Skip to where last newline should be; use -2 on Windows (because of CRLF)
chan seek $f -1 end
# Save the offset for later
set offset [chan tell $f]
# Only truncate if we're really sure we've got a final newline
if {[chan read $f] eq "\n"} {
    # Do the truncation!
    chan truncate $f $offset
}
close $f
For removing data from anywhere other than the end of the file, it's easiest to rewrite the file (either by loading the data all into memory or by streaming and transforming to a new file that you move back over, the latter being harder but necessary with large files). Truncation can only work at the end.
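As a rough illustration of the load-into-memory approach, the sketch below removes one line from a file by reading everything, dropping the line, and writing the rest back. The proc name removeLine and the zero-based line index are choices made for this example, and it assumes the file fits comfortably in memory.

```tcl
# Remove the line with the given zero-based index by rewriting the file.
# (removeLine is a hypothetical helper; suitable for small files only.)
proc removeLine {filename lineIndex} {
    set f [open $filename r]
    set lines [split [read $f] "\n"]
    close $f

    # Drop the one unwanted element; everything else is kept in order
    set lines [lreplace $lines $lineIndex $lineIndex]

    set f [open $filename w]
    puts -nonewline $f [join $lines "\n"]
    close $f
}
```

Because the content is split on newlines and joined back the same way, a trailing newline in the original file is preserved. For example, removeLine theFile.txt 1 deletes the second line.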

How to mask the sensitive information contained in a file using tcl?

I'm trying to implement a Tcl script which reads a text file, masks all the sensitive information (such as passwords, IP addresses, etc.) contained in it, and writes the output to another file.
As of now I'm just substituting this data with ** or ##### and searching the entire file with regexp to find the stuff which I need to mask. But since my text file can be 100K lines of text or more, this is turning out to be incredibly inefficient.
Are there any built in tcl functions/commands I can make use of to do this faster? Do any of the add on packages provide extra options which can help get this done?
Note: I'm using tcl 8.4 (But if there are ways to do this in newer versions of tcl, please do point me to them)
Generally speaking, you should put your code in a procedure to get best performance out of Tcl. (You have got a few more related options in 8.5 and 8.6, such as lambda terms and class methods, but they're closely related to procedures.) You should also be careful with a number of other things:
Put your expressions in braces (expr {$a + $b} instead of expr $a + $b) as that enables a much more efficient compilation strategy.
Pick your channel encodings carefully. (If you do fconfigure $chan -translation binary, that channel will transfer bytes and not characters. However, gets is not very efficient on byte-oriented channels in 8.4. Using -encoding iso8859-1 -translation lf will give most of the benefits there.)
Tcl does channel buffering quite well.
It might be worth benchmarking your code with different versions of Tcl to see which works best. Try using a tclkit build for testing if you don't want to go to the (minor) hassle of having multiple Tcl interpreters installed just for testing.
The idiomatic way to do line-oriented transformations would be:
proc transformFile {sourceFile targetFile RE replacement} {
    # Open for reading
    set fin [open $sourceFile]
    fconfigure $fin -encoding iso8859-1 -translation lf
    # Open for writing
    set fout [open $targetFile w]
    fconfigure $fout -encoding iso8859-1 -translation lf
    # Iterate over the lines, applying the replacement to every match
    while {[gets $fin line] >= 0} {
        regsub -all -- $RE $line $replacement line
        puts $fout $line
    }
    # All done
    close $fin
    close $fout
}
If the file is small enough that it can all fit in memory easily, this is more efficient because the entire match-replace loop is hoisted into the C level:
proc transformFile {sourceFile targetFile RE replacement} {
    # Open for reading
    set fin [open $sourceFile]
    fconfigure $fin -encoding iso8859-1 -translation lf
    # Open for writing
    set fout [open $targetFile w]
    fconfigure $fout -encoding iso8859-1 -translation lf
    # Apply the replacement over all lines
    regsub -all -line -- $RE [read $fin] $replacement outputlines
    # -nonewline: the data read already ends with the file's final newline
    puts -nonewline $fout $outputlines
    # All done
    close $fin
    close $fout
}
Finally, regular expressions aren't necessarily the fastest way to do matching of strings (for example, string match is much faster, but accepts a far more restricted type of pattern). Transforming one style of replacement code to another and getting it to go really fast is not 100% trivial (REs are really flexible).
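One way to combine the two is to use string match as a cheap pre-filter and only run the regular expression on lines that can possibly contain something to mask. The sketch below assumes, purely for illustration, that passwords appear as password=VALUE; the proc name maskLine and that pattern are not taken from the question.

```tcl
# Mask password values in one line of text.
# (maskLine and the "password=" format are assumptions for this example.)
proc maskLine {line} {
    # Cheap glob test first: most lines will not contain the keyword at all
    if {![string match "*password=*" $line]} {
        return $line
    }
    # Only lines that pass the filter pay for the regular expression
    regsub -all -- {password=\S+} $line {password=****} line
    return $line
}

puts [maskLine "user=bob password=hunter2 host=example"]
puts [maskLine "nothing sensitive here"]
```

Whether the pre-filter pays off depends on how rare the sensitive lines are, so it is worth timing both variants on representative data.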
Especially for very large files, as mentioned, reading the whole file into a variable is not the best approach. Once your system runs out of memory, you can't prevent your app from crashing. For data that is separated by line breaks, the easiest solution is to read one line at a time and process it.
Just to give you an example:
# Open old and new file
set old [open "input.txt" r]
set new [open "output.txt" w]
# No input configuration is needed: gets already reads one line at a time
# (-buffering only controls when output is flushed)
# Until the end of the file is reached:
while {[gets $old ln] != -1} {
    # Mask sensitive information on variable ln
    ...
    # Write back line to new file
    puts $new $ln
}
# Close channels
close $old
close $new
I can't think of any better way to process large files in Tcl; please feel free to tell me about a better solution. But Tcl was not made to process large data files. For real performance, you may want to use a compiled language instead of a scripting language.
Edit: Replaced ![eof $old] in while loop.
A file with 100K lines is not that much (unless every line is 1K chars long :) so I'd suggest you read the entire file into a var and make the substitution on that var:
set fd [open file r+]
set buf [read $fd]
set buf [regsub -all -- $passwdPattern $buf "****"]; # passwdPattern: placeholder for your own pattern
# write it back
seek $fd 0; # This is not safe! See potrzebie's comment for details.
puts -nonewline $fd $buf
close $fd
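A variant that avoids overwriting the file through the same open channel is to write the masked text to a separate file and then rename it over the original. This is only a sketch: the proc name maskFileInPlace and the ".tmp" suffix are made up for the example, and it assumes the file fits in memory and that nothing else uses that temporary name.

```tcl
# Mask every match of $pattern in $filename, replacing the file's content.
# (maskFileInPlace and the ".tmp" suffix are assumptions for this example.)
proc maskFileInPlace {filename pattern mask} {
    # Read the whole file
    set in [open $filename r]
    set buf [read $in]
    close $in

    # Replace every match
    regsub -all -- $pattern $buf $mask buf

    # Write to a sibling file, then move it over the original
    set tmp "$filename.tmp"
    set out [open $tmp w]
    puts -nonewline $out $buf
    close $out
    file rename -force $tmp $filename
}
```

Since the new content goes into a fresh file, nothing from the old content can be left over when the masked text is shorter than the original.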

Bad file size in video.dat in ns-2

I am using a Tcl script which takes a movie file trace and converts it into a binary file, which is then used by the application agent in ns-2. Here is the snippet of the script that converts the movie file trace into a binary file:
set original_file_name Verbose_Silence_of_the_Lambs_VBR_H263.dat
set trace_file_name video.dat
set original_file_id [open $original_file_name r]
set trace_file_id [open $trace_file_name w]
set last_time 0
while {[eof $original_file_id] == 0} {
    gets $original_file_id current_line
    if {[string length $current_line] == 0 ||
        [string compare [string index $current_line 0] "#"] == 0} {
        continue
    }
    scan $current_line "%d%s%d" next_time type length
    set time [expr 1000*($next_time-$last_time)]
    set last_time $next_time
    puts -nonewline $trace_file_id [binary format "II" $time $length]
}
close $original_file_id
close $trace_file_id
But when I used this created video.dat file further for traffic generation used by application agent I got the following error:
Bad file siz in video.dat
Segmentation fault
Kindly have a look at this. Also, what is the meaning of binary format "II" in the code? I have not found it mentioned in the tcl-binary(n) documentation; is it outdated and no longer supported?
The problem is probably that you don't open your file in binary mode.
Change
set trace_file_id [open $trace_file_name w]
to
set trace_file_id [open $trace_file_name wb]
Otherwise Tcl will transform the output, e.g. it replaces \n with \r\n on Windows.
(Also, any byte value > 127 is treated as a Unicode code point and converted to your system encoding, which corrupts the binary data.)
While such transformations are fine for text files, they cause problems with binary files.
Fortunately only a single character is needed to fix that: b as modifier for open
Edit: I just looked up in the change list for Tcl, the b modifier for open was added with 8.5. I usually only use 8.5 or 8.6, so if you are using an older version of Tcl, add the following line after the open:
fconfigure $trace_file_id -translation binary
The b modifier is just a shortcut for that.
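As for the "II" in the question: in a binary format string, each I packs one argument as a 32-bit integer in big-endian byte order, so "II" produces an 8-byte record holding two integers. The sketch below uses made-up values to show the record size and a round trip through a binary-mode file; file tempfile assumes Tcl 8.6 or later.

```tcl
# Pack two integers as 32-bit big-endian values: 4 bytes each, 8 in total
set record [binary format "II" 40000 1234]
puts "record length: [string length $record]"

# Unpack them again to confirm the round trip
binary scan $record "II" time length
puts "time=$time length=$length"

# Written through a binary-mode channel, each record adds exactly 8 bytes
set f [file tempfile path]
fconfigure $f -translation binary
puts -nonewline $f $record$record
close $f
set size [file size $path]
puts "file size for two records: $size"
file delete $path
```

With a text-mode channel the size can come out differently, because bytes above 127 and newline bytes may be rewritten on output.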