Tcl partial string match of one list against another - tcl

I'm trying to find the items in list1 that are partial string matches against items from list2 using Tcl.
I'm using this, but it's very slow. Is there any more efficient way to do this?
set list1 [list abc bcd cde]
set list2 [list ab cd]
set l_matchlist [list]
foreach item1 $list1 {
foreach item2 $list2 {
if {[string match -nocase "*${item2}*" $item1]} {
lappend l_matchlist $item1
break
}
}
}
my actual lists are very long and this takes a long time. Is this the best way to do this?

In addition to being slow, there is also a problem if list2 contains elements that have glob wildcard characters, such as '?' and '*'.
I expect the following method will work faster. At least it fixes the issue mentioned above:
set list1 [list abc BCD ace cde]
set list2 [list cd ab de]
set l_matchlist [list]
foreach item2 $list2 {
lappend l_matchlist \
{*}[lsearch -all -inline -nocase -regexp $list1 (?q)$item2]
}
The -regexp option in combination with (?q) may seem strange at first. It uses regexp matching and then tells regexp to treat the pattern as a literal string. But this has the effect of performing the partial match that you're after.
This differs with your version in that it may produce the results in a different order and the same item from list1 may be reported multiple times if it matches more than one item in list2.
If that is undesired, you can follow up with:
set l_matchlist [lmap item1 $list1 {
if {$item1 ni $l_matchlist} continue
set item1
}]
Of course that will reduce some of the speed gains achieved earlier.

You could cheat a bit and turn it from a list-processing task to a string processing task. The latter are usually quite a bit faster in Tcl.
Below I first turn list1 into a string with the original list elements separated by the ASCII field separator character "\x1F". Then the result can be gotten in a single loop via a regular expression search. The regular expression finds the first substring bounded by the field separator chars that contains item2:
# convert list to string:
set string1 \x1F[join $list1 \x1F]\x1F
set l_matchlist [list]
foreach item2 $list2 {
# escape out regexp special chars:
set item2 [regsub -all {\W} $item2 {\\&}]
# use append to assemble regexp pattern
set item2 [append x {[^\x1F]*} $item2 {[^\x1F]*}][unset x]
if {[regexp -nocase $item2 $string1 match]} {
lappend l_matchlist $match
}
}

Related

Is there a simple way to parse a line of Tcl into its command and its arguments (not just splitting by whitespace)

Suppose I have a string which is also a Tcl command.
set line {lsort -unique [list a b c a]}
How can I convert this string into a list equivalent to this?
{
{lsort}
{-unique}
{[list a b c a]}
}
Because of whitespace inside the square brackets, I can't just use lindex.
For example:
> lindex $line 2
--> [list
The reason I'm asking is because I have a large Tcl script that I want to parse and re-write. I would like certain lines in the re-written script to have swapped argument order or some numerical arguments scaled by a factor.
I know I could parse the string character by character, keeping track of {}, [], and " characters, but this feels like re-inventing something that might already exist. I've been looking at the info and interp commands but couldn't find anything there.
I used info complete successfully in this proc.
proc command_to_list {command} {
# split by whitespace
set words [regexp -all -inline {\S+} $command]
set spaces [regexp -all -inline {\s+} $command]
set output_list [list]
set buffer ""
foreach word $words space $spaces {
append buffer $word
if {[info complete $buffer]} {
lappend output_list $buffer
set buffer ""
} else {
append buffer $space
}
}
return $output_list
}
This proc will group whitespace separated 'words' until they have no unmatched curlies, double quotes, or square brackets. Whitespace is preserved inside of matching pairs of curlies, double quotes or square brackets.
> set command {foreach {k v} [list k1 v1 k2 v2] {puts "$k $v"}}
> foreach word [command_to_list $command] {puts $word}
foreach
{k v}
[list k1 v1 k2 v2]
{puts "$k $v"}

How to copy exactly in tcl?

I created a list using:
set list1 { o\\/one o\\/two o\\/three }
now I want to copy this list to another list by adding { } to each item
my new list should become :
{ {o\\/one} {o\\/two} {o\\/three} }
I tried using
foreach a $list1 {
set x "{$a}"
append new_list " " "{$a}"
lappend new_list1 $x
}
newlist → {o\/one} {o\/two} {o\/three}
newlist1 → {{o\/one}} {{o\/two}} {{o\/three}}
Please help?
Your original list has these items in it (as you can verify with lindex):
puts [lindex $list1 0] → o\/one
puts [lindex $list1 1] → o\/two
puts [lindex $list1 2] → o\/three
Any list that has those elements in it, however encoded, is pairwise-equivalent. The canonical form (as produced by Tcl's own list operations) of the list is:
{o\/one} {o\/two} {o\/three}
Perhaps the easiest way of obtaining that is:
set list2 [lrange $list1 0 end]
The lrange command uses Tcl's standard list-to-string engine (shared with a great many other commands). That prefers to not add braces, but prefers adding braces to adding backslashes; backslashes are a last resort because they're ugly and hard to read. But it works with arbitrary contents in the elements; just blindly adding braces is vulnerable to tricky edge cases.
Another way of getting the above canonical form is this (provided you're not stuck on versions of Tcl so old they're no longer supported):
set list2 [list {*}$list1]
[EDIT]: If you've got a string with some things in it separated by spaces, you might want to convert it into a proper list; this is useful particularly when the input data contains list metacharacters like braces and (relevant in this case) backslashes. There are two main ways to do this:
set theList [split $inputString]
set theList [regexp -all -inline {\S+} $inputString]
They differ in what happens when the input string has two (or more) spaces between two words:
set inputString "a b c d"; # NB: two spaces between b and c
puts [split $inputString]; # ==> a b {} c d
puts [regexp -all -inline {\S+} $inputString]; # ==> a b c d
There are use-cases for both.

list searching to find exact matches using TCL lsearch

I have a list and need to search some strings in this list. My list is like following:
list1 = {slt0_reg_11.CK slt0_reg_11.Q slt0_reg_12.CK slt0_reg_12.Q}
I am trying to use lsearch to check if above list includes some strings or not. Strings are like:
string1 = {slt0_reg_1 slt0_reg_1}
I am doing the following to check this:
set listInd [lsearch -all -exact -nocase -regexp $list1 $string1]
This commands gives the indexes if list1 includes $string1 (This is what I want). However, problem is if I have a string like slt0_reg_1, the above command identifies the first two elements of the list (slt0_reg_11.CK slt0_reg_11.Q) because these covers the string I search.
How can I make exact search?
It sound like you want to add in word-boundary constraints (\y) to your RE. (Don't use -exact and -regexp at the same time; only one of those modes can be used on any run because they change the comparison engine used.) A little care must be taken because we can't enclose the RE in braces as we want to do variable substitution within it.
set list1 {slt0_reg_11.CK slt0_reg_11.Q slt0_reg_12.CK slt0_reg_12.Q}
foreach str {slt0_reg_11 slt0_reg_1} {
set matches [lsearch -all -regexp $list1 "\\y$str\\y"]
puts "$str: $matches"
}
Prints:
slt0_reg_11: 0 1
slt0_reg_1:
If you want to compare your list for an exact match of the part before the dot against another list, you may be better off using lmap:
set index -1
set listInd [lmap str $list1 {
incr index
if {[lindex [split $str .] 0] ni $string1} continue
set index
}]

Replace same strings with swap difference?

To manipulate Strings in Tcl, we use the string command.
If you need to replace comma:
set value { 10.00 }
puts [string map -nocase { . , } $value]
# Return: 10,00
We can replace several strings:
set text "This is a replacement test text"
puts [string map -nocase { e E s S a A } $text]
# Returns: THIS IS A TEXT OF REPLACEMENT TEST
Of course, we can replace words:
set text "This is a replacement test text"
puts [string map -nocase {test TEST a {second}} $text]
# Returns: This is the second replacement TEST text.
So far so good!
But one question that does not want to be silent is .. How to replace more than one identical occurrence in the sentence, giving a DIFFERENT substitution for each of them?
For example:
set time {10:02:12}
puts [string map -nocase { { : +} {: =} } $time]
I would like this result: 10 + 02 = 12
proc seqmap {str match args} {
set rc $str
foreach l [lreverse [regexp -all -indices -inline ***=$match $str]] \
replacement [lreverse $args] {
set rc [string replace $rc {*}$l $replacement]
}
return $rc
}
seqmap 10:02:12 : { + } { = }
=> 10 + 02 = 12
I'm using lreverse in case the replacement has a different length than the string it replaces. The indices would be off if the replacements were done from left to right.
The ***= is used to avoid special treatment of wildcard characters in the match string.
Of course, things get a lot more complicated if you want to handle the case where the number of occurrences doesn't match the number of provided substitutions. And even more if you want to replace several different strings.
This version handles the complications mentioned above:
proc seqmap {map str} {
# Transform the map into a dict with each key containing a list of replacements
set mapdict {}
foreach {s r} $map {dict lappend mapdict $s $r}
# Build a map where each key maps to a unique tag
# At the same time build a dict that maps our tags to the replacements
# First map the chosen tag character in case it is present in the string
set newmap {# #00}
set mapdict [dict map {s r} $mapdict {
lappend newmap $s [set s [format #%02d [incr num]]]
set r
}]
# Add the tag character to the dict so it can be mapped back
dict set mapdict #00 #
# Map the tags into the string
set rc [string map $newmap $str]
# Locate the positions where the tags ended up
set match [regexp -all -indices -inline {#\d\d} $rc]
# Create a list of replacements matching the tags
set replace [lmap l $match {
# Extract the tag
set t [string range $rc {*}$l]
# Obtain a replacement for this tag
set s [lassign [dict get $mapdict $t] r]
# Return the used replacement to the end of the list
dict set mapdict $t [linsert $s end $r]
# Add the replacement to the list
set r
}]
# Walk the two lists in reverse order, replacing the tags with the selected replacements
foreach l [lreverse $match] r [lreverse $replace] {
set rc [string replace $rc {*}$l $r]
}
# Done
return $rc
}
You call it just like you would string map, so with a key-value mapping and the string to perform the replacements on. Any duplicated keys specify the subsequent values to be substituted for each occurrence of the key. When the list is exhausted it starts over from the beginning.
So puts [seqmap {: + : = : *} 10:02:12] => 10+02=12
And puts [seqmap {: + : =} 10:02:12:04:16] => 10+02=12+04=16
As presented, the command can handle up to 99 unique keys. But it can easily be updated if more are needed.

How I can get unmatched part of string using TCL?

I am comparing two strings, how I can get the part of string which did not match between these two
This is an interesting problem that requires a longest common subsequence algorithm. Tcl's got one of those already in Tcllib, but it's for lists. Fortunately, we can convert a string into a list of characters with split:
package require struct::list
set a "the quick brown fox"
set b "the slow green fox"
set listA [split $a ""]; set lenA [llength $listA]
set listB [split $b ""]; set lenB [llength $listB]
set correspondences [struct::list longestCommonSubsequence $listA $listB]
set differences [struct::list lcsInvertMerge $correspondences $lenA $lenB]
Now we can get the parts that didn't match up by picking the parts from the differences that are added, changed or deleted:
set common {}
set unmatchedA {}
set unmatchedB {}
foreach diff $differences {
lassign $diff type rangeA rangeB
switch $type {
unchanged {
lappend common [join [lrange $listA {*}$rangeA] ""]
}
added {
lappend unmatchedB [join [lrange $listB {*}$rangeB] ""]
}
changed {
lappend unmatchedA [join [lrange $listA {*}$rangeA] ""]
lappend unmatchedB [join [lrange $listB {*}$rangeB] ""]
}
deleted {
lappend unmatchedA [join [lrange $listA {*}$rangeA] ""]
}
}
}
puts common->$common
# common->{the } ow {n fox}
puts A->$unmatchedA
# A->{quick br}
puts B->$unmatchedB
# B->sl { gree}
In this case, we see the following correspondences (. is a spacer I've inserted to help line things up):
the quick br..ow.....n fox
the ........slow green fox
Whether this is exactly what you want, I don't know (and there's more detail in the computed differences; they're just a bit hard to read). You can easily switch to doing a word-by-word correspondence instead if that's more to your taste. It's pretty much just removing the split and join…
If you have a string and you want to remove a fixed substring, for example
set str "this is a larger? string"
set substr "a larger?"
Then you can do this:
set parts [split [string map [list $s2 \uffff] $s1] \uffff]
# returns the list: {this is } { string}
That globally replaces the substring within the larger string with a single character, then splits the result on that same character.