How can I verify that an input string contains numbers (0-9) and multiple . (special char) in Tcl

I get three different entries, "10576.53012.46344.35174", "10", and "Doc-15", in a foreach loop. Out of these 3 entries, I want 10576.53012.46344.35174. How can I verify that the current string contains multiple . characters and numbers?
I'm new to Tcl; any suggestions are welcome.

This is the sort of task that is a pretty good fit for regular expressions.
The string 10576.53012.46344.35174 is matched by a RE like this: ^\d{5}(?:\.\d{5}){3}$ though you might want something a little less strict (e.g., allowing a variable number of digits per group instead of exactly 5, or a variable number of dot-separated groups instead of exactly 4).
You test if a string matches a regular expression with the regexp command:
if {[regexp {^\d{5}(?:\.\d{5}){3}$} $theVarWithTheString]} {
    puts "the regular expression matched $theVarWithTheString"
}
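If you want the looser form mentioned above, a minimal sketch (any number of digits per group, and two or more dot-separated groups) might be:
if {[regexp {^\d+(?:\.\d+)+$} $theVarWithTheString]} {
    puts "looks like dot-separated numbers: $theVarWithTheString"
}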
An alternative approach is to split the string by . and check that each group is what you want:
set goodBits 0
set badString 0
foreach group [split $theVarWithTheString "."] {
    if {![string is integer -strict $group]} {
        set badString 1
        break
    }
    incr goodBits
}
if {!$badString && $goodBits == 4} {
    puts "the string was OK"
}
I greatly prefer the regular expression approach myself (with occasional help from string is as appropriate). Writing non-RE validators can be challenging and tends to require a lot of code.
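To tie this back to the foreach loop in the question, one possible arrangement (a sketch; isDottedNumbers is my own helper name, not from the answer above):
proc isDottedNumbers {s} {
    # true for two or more groups of digits separated by single dots
    return [regexp {^\d+(?:\.\d+)+$} $s]
}

foreach entry {10576.53012.46344.35174 10 Doc-15} {
    if {[isDottedNumbers $entry]} {
        puts "wanted: $entry"
    }
}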

Related

How can I define a Raku grammar to parse TSV text?

I have some TSV data
ID Name Email
1 test test@email.com
321 stan stan@nowhere.net
I would like to parse this into a list of hashes
@entities[0]<Name> eq "test";
@entities[1]<Email> eq "stan@nowhere.net";
I'm having trouble with using the newline metacharacter to delimit the header row from the value rows. My grammar definition:
use v6;
grammar Parser {
    token TOP { <headerRow><valueRow>+ }
    token headerRow { [\s*<header>]+\n }
    token header { \S+ }
    token valueRow { [\s*<value>]+\n? }
    token value { \S+ }
}
my $dat = q:to/EOF/;
ID Name Email
1 test test@email.com
321 stan stan@nowhere.net
EOF
say Parser.parse($dat);
But this is returning Nil. I think I'm misunderstanding something fundamental about regexes in raku.
Probably the main thing that's throwing it off is that \s matches horizontal and vertical space. To match just horizontal space, use \h, and to match just vertical space, \v.
One small recommendation I'd make is to avoid including the newlines in the token. You might also want to use the alternation operators % or %%, as they're designed for handling this type of work:
grammar Parser {
    token TOP {
        <headerRow> \n
        <valueRow>+ %% \n
    }
    token headerRow { <.ws>* %% <header> }
    token valueRow { <.ws>* %% <value> }
    token header { \S+ }
    token value { \S+ }
    token ws { \h* }
}
The result of Parser.parse($dat) for this is the following:
「ID Name Email
1 test test@email.com
321 stan stan@nowhere.net
」
headerRow => 「ID Name Email」
header => 「ID」
header => 「Name」
header => 「Email」
valueRow => 「 1 test test@email.com」
value => 「1」
value => 「test」
value => 「test@email.com」
valueRow => 「 321 stan stan@nowhere.net」
value => 「321」
value => 「stan」
value => 「stan@nowhere.net」
valueRow => 「」
which shows us that the grammar has successfully parsed everything. However, let's focus on the second part of your question: you want the result to be available in a variable. To do that, you'll need to supply an actions class, which is very simple for this project. You just make a class whose methods match the names of your grammar's tokens (although very simple ones, like value/header, that don't require special processing besides stringification, can be ignored). There are some more creative/compact ways to handle this, but I'll go with a fairly rudimentary approach for illustration. Here's our class:
class ParserActions {
    method headerRow ($/) { ... }
    method valueRow ($/) { ... }
    method TOP ($/) { ... }
}
Each method has the signature ($/), which is the regex match variable. So now, let's ask what information we want from each token. In the header row, we want each of the header values. So:
method headerRow ($/) {
    my @headers = $<header>.map: *.Str;
    make @headers;
}
Any token with a quantifier on it will be treated as a Positional, so we could also access each individual header match with $<header>[0], $<header>[1], etc. But those are match objects, so we just quickly stringify them. The make command allows other tokens to access this special data that we've created.
Our valueRow method will look nearly identical, because the $<value> tokens are what we care about.
method valueRow ($/) {
    my @values = $<value>.map: *.Str;
    make @values;
}
When we get to the last method, we will want to create the array of hashes.
method TOP ($/) {
    my @entries;
    my @headers = $<headerRow>.made;
    my @rows = $<valueRow>.map: *.made;
    for @rows -> @values {
        my %entry = flat @headers Z @values;
        @entries.push: %entry;
    }
    make @entries;
}
Here you can see how we access the stuff we processed in headerRow() and valueRow(): you use the .made method. Because there are multiple valueRows, to get each of their made values we need to map over them (this is a situation where I tend to write my grammar to have simply <header><data>, and define data as being multiple rows, but this is simple enough that it's not too bad).
Now that we have the headers and rows in two arrays, it's simply a matter of making them an array of hashes, which we do in the for loop. The flat @x Z @y just interleaves the elements, and the hash assignment Does What We Mean; there are also other ways to build the hash you want, one of which is sketched just below.
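For instance (one such alternative, sketched here rather than taken from the original answer), the Z=> form of the zip metaoperator pairs the two lists directly:
my %entry = @headers Z=> @values;  # each header paired with its value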
Once you're done, you just make it, and then it will be available in the made of the parse:
say Parser.parse($dat, :actions(ParserActions)).made
-> [{Email => test@email.com, ID => 1, Name => test} {Email => stan@nowhere.net, ID => 321, Name => stan} {}]
It's fairly common to wrap these into a method, like
sub parse-tsv($tsv) {
    return Parser.parse($tsv, :actions(ParserActions)).made
}
That way you can just say
my @entries = parse-tsv($dat);
say @entries[0]<Name>;  # test
say @entries[1]<Email>; # stan@nowhere.net
TL;DR: you don't. Just use Text::CSV, which is able to deal with every format.
I will show how the venerable Text::CSV can be put to use:
use Text::CSV;
my $text = q:to/EOF/;
ID Name Email
1 test test@email.com
321 stan stan@nowhere.net
EOF
my @data = $text.lines.map: *.split(/\t/).list;
say @data.perl;
my $csv = csv( in => @data, key => "ID");
print $csv.perl;
The key part here is the data munging that converts the initial file into an array or arrays (in #data). It's only needed, however, because the csv command is not able to deal with strings; if data is in a file, you're good to go.
The last line will print:
${" 1" => ${:Email("test\#email.com"), :ID(" 1"), :Name("test")}, " 321" => ${:Email("stan\#nowhere.net"), :ID(" 321"), :Name("stan")}}%
The ID field will become the key to the hash, and the whole thing a hash of hashes.
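If the data already lives in a file, the call might look something like this (a sketch; data.tsv is a hypothetical file name, and I'm assuming the csv routine accepts sep to select the tab separator):
my $csv = csv(in => "data.tsv", sep => "\t", key => "ID");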
TL;DR: regexes backtrack; tokens don't. That's why your pattern isn't matching. This answer focuses on explaining that, and on how to trivially fix your grammar. However, you should probably rewrite it, or use an existing parser, which is what you should definitely do if you just want to parse TSV rather than learn about raku regexes.
A fundamental misunderstanding?
I think I'm misunderstanding something fundamental about regexes in raku.
(If you already know the term "regexes" is a highly ambiguous one, consider skipping this section.)
One fundamental thing you may be misunderstanding is the meaning of the word "regexes". Here are some popular meanings folk assume:
Formal regular expressions.
Perl regexes.
Perl Compatible Regular Expressions (PCRE).
Text pattern matching expressions called "regexes" that look like any of the above and do something similar.
None of these meanings are compatible with each other.
While Perl regexes are semantically a superset of formal regular expressions, they are far more useful in many ways but also more vulnerable to pathological backtracking.
While Perl Compatible Regular Expressions are compatible with Perl in the sense they were originally the same as standard Perl regexes in the late 1990s, and in the sense that Perl supports pluggable regex engines including the PCRE engine, PCRE regex syntax is not identical to the standard Perl regex used by default by Perl in 2020.
And while text pattern matching expressions called "regexes" generally do look somewhat like each other, and do all match text, there are dozens, perhaps hundreds, of variations in syntax, and even in semantics for the same syntax.
Raku text pattern matching expressions are typically called either "rules" or "regexes". The use of the term "regexes" conveys the fact that they look somewhat like other regexes (although the syntax has been cleaned up). The term "rules" conveys the fact they are part of a much broader set of features and tools that scale up to parsing (and beyond).
The quick fix
With the above fundamental aspect of the word "regexes" out of the way, I can now turn to the fundamental aspect of your "regex"'s behavior.
If we switch three of the patterns in your grammar for the token declarator to the regex declarator, your grammar works as you intended:
grammar Parser {
    regex TOP { <headerRow><valueRow>+ }
    regex headerRow { [\s*<header>]+\n }
    token header { \S+ }
    regex valueRow { [\s*<value>]+\n? }
    token value { \S+ }
}
The sole difference between a token and a regex is that a regex backtracks whereas a token doesn't. Thus:
say 'ab' ~~ regex { [ \s* a ]+ b };  # 「ab」
say 'ab' ~~ token { [ \s* a ]+ b };  # 「ab」
say 'ab' ~~ regex { [ \s* \S ]+ b }; # 「ab」
say 'ab' ~~ token { [ \s* \S ]+ b }; # Nil
During processing of the last pattern (that could be and often is called a "regex", but whose actual declarator is token, not regex), the \S will swallow the 'b', just as it temporarily will have done during processing of the regex in the prior line. But, because the pattern is declared as a token, the rules engine (aka "regex engine") does not backtrack, so the overall match fails.
That's what's going on in your OP.
The right fix
A better solution in general is to wean yourself from assuming backtracking behavior, because it can be slow and even catastrophically slow (indistinguishable from the program hanging) when used in matching against a maliciously constructed string or one with an accidentally unfortunate combination of characters.
Sometimes regexes are appropriate. For example, if you're writing a one-off and a regex does the job, then you're done. That's fine. That's part of the reason that / ... / syntax in raku declares a backtracking pattern, just like regex. (Then again you can write / :r ... / if you want to switch on ratcheting -- "ratchet" means the opposite of "backtrack", so :r switches a regex to token semantics.)
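For example (my own small illustration, reusing the failing pattern from earlier):
say 'ab' ~~ / :r [ \s* \S ]+ b /; # Nil -- :r ratchets, so no backtracking, same as a token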
Occasionally backtracking still has a role in a parsing context. For example, while the grammar for raku generally eschews backtracking, and instead has hundreds of rules and tokens, it nevertheless still has 3 regexes.
I've upvoted @user0721090601++'s answer because it's useful. It also addresses several things that immediately seemed to me to be idiomatically off in your code, and, importantly, sticks to tokens. It may well be the answer you prefer, which will be cool.

How do I search for a string in this JSON with Python

My JSON file looks something like:
{
    "generator": {
        "name": "Xfer Records Serum",
        ....
    },
    "generator": {
        "name": "Lennar Digital Sylenth1",
        ....
    }
}
I ask the user for a search term, and the input is searched for in the name key only. All matching results are returned; this means that if I input just 's', both of the above would be returned. Also, please explain how to return the names of all the objects that are generators. The simpler the method, the better it will be for me. I use the json library; if another library is required, that's not a problem.
Before switching to JSON I tried XML but it did not work.
If your goal is just to search all name properties, this will do the trick:
import re

def search_names(term, lines):
    name_search = re.compile(r'\s*"name"\s*:\s*"(.*' + term + r'.*)",?$', re.I)
    return [x.group(1) for x in [name_search.search(y) for y in lines] if x]

with open('path/to/your.json') as f:
    lines = f.readlines()

print(search_names('s', lines))
which would return both names you listed in your example.
The way the search_names() function works is it builds a regular expression that will match any line starting with "name": " (with varying amount of whitespace) followed by your search term with any other characters around it then terminated with " followed by an optional , and the end of string. Then applies that to each line from the file. Finally it filters out any non-matching lines and returns the value of the name property (the capture group contents) for each match.
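One refinement worth considering (my own hedged addition, not part of the original answer): if the user's search term may contain regex metacharacters such as . or (, escape it with re.escape so it is matched literally:
import re

def search_names_literal(term, lines):
    # re.escape keeps characters like '.' in the term from acting as regex syntax
    name_search = re.compile(r'\s*"name"\s*:\s*"(.*' + re.escape(term) + r'.*)",?$', re.I)
    return [m.group(1) for m in (name_search.search(line) for line in lines) if m]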

Get the element after/before the matched one in json with jq

Suppose I have the following json input to the jq command:
[
{"type":"dog", "set":"foo"},
{"type":"bird", "set":"bar"},
{"type":"cat", "set":"blaz"},
{"type":"fish", "set":"mor"}
]
I know that there is an element in this array whose type is "bird", in this case, the second element. But I want its next (or previous) sibling, that is, the element after (before) it, in this case, the third (first) element. How can I get it in jq?
Also, I have another question: If the matched element (that is, the element whose value of type I know) is the last one in the array, I want it to get the first one as next (that is, cycle through the array) instead of returning nothing. The same whether the matched element is the first one (then I want to get the last element).
For the sake of specificity, let's suppose you want to extract the (before, focus, after) triple as an array, where before and after wrap around as described. For simplicity, let's also suppose the source array is of length at least 2.
Next, for ease of exposition and understanding, we will define a helper function for extracting the three items:
# $i is assumed to be a valid index into the input array,
# which is assumed to be of length at least 2
def triple($i):
  if $i == 0 then [last] + .[0:2]
  elif $i == (length-1) then .[$i-1: $i+2] + [first]
  else .[$i-1: $i+2]
  end;
Now we have only to find the index, and use it:
(map(.type) | index("bird")) as $i
| if $i then triple($i) else empty end
Using this approach, other variants can easily be obtained.
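For example, to get just the next sibling with wrap-around (a variant sketch along the same lines, not from the original answer), modular indexing suffices:
(map(.type) | index("bird")) as $i
| if $i then .[($i + 1) % length] else empty end
Run against the sample input, this yields {"type":"cat","set":"blaz"}.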
Let me also offer an alternative solution, based on jtc, a walk-path Unix tool for JSON, where you "encode" your query logic right into the path.
E.g., finding the "type":"bird" record and then its preceding sibling (in the parent array) looks like this:
bash $ <file.json jtc -w'[type]:<bird> [-1]<idx>k [-1]>idx<t-1' -r
{ "set": "foo", "type": "dog" }
let me break it down for you:
[type]:<bird> - will recursively find the record "type":"bird"
[-1]<idx>k - will step up one tier in the JSON tree (selecting the parent, effectively the entire record {"type":"bird", "set":"bar"}) and will memorize its array index in the namespace idx
[-1]>idx<t-1 - will again step up one level in the JSON (selecting the top array) and will search (non-recursively) for the entry whose index (stored in idx) is offset by -1
Equally, one can select the next sibling:
bash $ <file.json jtc -w'[type]:<bird>[-1]<idx>k[-1]>idx<t1'
{ "set": "blaz", "type": "cat" }
Or, select the first entry (based on the last match):
bash $ <file.json jtc -w'[type]:<fish>[-1]<idx>k[-1]>idx<t-1000' -r
{ "set": "foo", "type": "dog" }
(just use any sufficiently low value as the relative quantifier; it will get normalized to the first entry)
PS> Disclosure: I'm the creator of the jtc tool

Ruby search for match in large json

I have a pretty huge JSON file of short lines from a screenplay. I am trying to match keywords against the lines in the JSON file so I can pull matching lines out of the JSON.
The json file structure is like this:
[
"Yeah, well I wasn't looking for a long term relationship. I was on TV. ",
"Ok, yeah, you guys got to put a negative spin on everything. ",
"No no I'm not ready, things are starting to happen. ",
"Ok, it's forgotten. ",
"Yeah, ok. ",
"Hey hey, whoa come on give me a hug... "
]
(plus lots more...2444 lines in total)
So far I have this, but it's not making any matches.
# screenplay is read in from a json file
@screenplay_lines = JSON.parse(@jsonfile.read)
@text_to_find = ["relationship", "negative", "hug"]
@matching_results = []
@screenplay_lines.each do |line|
  if line.match(Regexp.union(@text_to_find))
    @matching_results << line
  end
end
puts "found #{@matching_results.length} matches..."
puts @matching_results
I'm not getting any hits, so I'm not sure what's not working. Plus, I'm sure it's a pretty expensive process doing it this way with a large amount of data. Any ideas? Thanks.
Yes, Regexp matching is slower than just checking whether a String is included in a line of text. But this also depends on the number of keywords, the length of the lines, and much more, so it is best to run at least a micro-benchmark.
lines = [
"Yeah, well I wasn't looking for a long term relationship. I was on TV. ",
"Ok, yeah, you guys got to put a negative spin on everything. ",
"No no I'm not ready, things are starting to happen. ",
"Ok, it's forgotten. ",
"Yeah, ok. ",
"Hey hey, whoa come on give me a hug... "
]
keywords = ["relationship","negative","hug"]
def find1(lines, keywords)
  regexp = Regexp.union(keywords)
  lines.select { |line| regexp.match(line) }
end

def find2(lines, keywords)
  lines.select { |line| keywords.any? { |keyword| line.include?(keyword) } }
end

def find3(lines, keywords)
  regexp = Regexp.union(keywords)
  lines.select { |line| regexp.match?(line) }
end

require 'benchmark/ips'

Benchmark.ips do |x|
  x.compare!
  x.report('match') { find1(lines, keywords) }
  x.report('include?') { find2(lines, keywords) }
  x.report('match?') { find3(lines, keywords) }
end
In this setup the include? variant is way faster:
Comparison:
include?: 288083.4 i/s
match?: 91505.7 i/s - 3.15x slower
match: 65866.7 i/s - 4.37x slower
Please note:
I've moved creation of the regexp out of the loop; it does not need to be created for every line. Creating a regexp is an expensive operation (your variant clocked in at 1/5 of the speed of the version with the regexp outside the loop). A sketch applying this to your code follows these notes.
match? is only available in Ruby 2.4+; it is faster because it does not assign any match results (it is side-effect free).
I would not worry too much about performance for 2500 lines of text. If it is fast enough, then stop searching for a better solution.
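Applied to the code from the question, those notes come out roughly like this (a sketch; screenplay.json stands in for the actual file path):
require 'json'

@jsonfile = File.open('screenplay.json')   # as in the question
@screenplay_lines = JSON.parse(@jsonfile.read)
@text_to_find = ["relationship", "negative", "hug"]

regexp = Regexp.union(@text_to_find)       # built once, outside the loop
@matching_results = @screenplay_lines.select { |line| regexp.match?(line) }

puts "found #{@matching_results.length} matches..."
puts @matching_results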
Another possible solution: try the json_expressions gem.

Decoding and using JSON data in Perl

I am confused about accessing the contents of some JSON data that I have decoded. Here is an example
I don't understand why this solution works and my own does not. My questions are rephrased below
my $json_raw = getJSON();
my $content = decode_json($json_raw);
use Data::Dumper;
print Dumper($content);
At this point my JSON data has been transformed into this
$VAR1 = { 'items' => [ 1, 2, 3, 4 ] };
My guess tells me that, once decoded, the object will be a hash with one element that has the key items and an array reference as the value.
$content{'items'}[0]
where $content{'items'} would obtain the array reference, and the outer $...[0] would access the first element in the array and interpret it as a scalar. However, this does not work; I get the error message "use of uninitialized value [...]".
However, the following does work:
$content->{items}[0]
where $content->{items} yields the array reference and [0] accesses the first element of that array.
Questions
Why does $content{'items'} not return an array reference? I even tried @{$content{'items'}}, thinking that, once I got the value from $content{'items'}, it would need to be interpreted as an array. But still, I receive the uninitialized value error.
How can I access the array reference without using the arrow operator?
A beginner's answer to a beginner :) Surely not as professional as it should be, but maybe it helps you.
use strict;   #use this at all times
use warnings; #this too - helps a lot!
use JSON;
my $json_str = ' { "items" : [ 1, 2, 3, 4 ] } ';
my $content = decode_json($json_str);
You wrote:
My guess tells me that, once decoded, the object will be a hash with
one element that has the key items and an array reference as the value.
Yes, it is a hash, but decode_json returns a reference, in this case a reference to a hash. From the docs:
expects an UTF-8 (binary) string and tries to parse that
as an UTF-8 encoded JSON text,
returning the resulting reference.
In the line
my $content = decode_json($json_str);
you are assigning to a SCALAR variable (not to a hash).
Because you know: it is a reference, you can do the next:
printf "reftype:%s\n", ref($content);
#print: reftype:HASH ^
#therefore the +------- is a SCALAR value containing a reference to hash
It is a hashref, so you can dump all its keys:
print "key: $_\n" for keys %{$content}; #or in short %$content
#prints: key: items
You can also assign the value of "items" (an arrayref) to a scalar variable:
my $aref = $content->{items}; #$hashref->{key}
#or
#my $aref = ${$content}{items}; #${$hashref}{key}
but NOT
#my $aref = $content{items}; #throws error if "use strict;"
#Global symbol "%content" requires explicit package name at script.pl line 20.
$content{items} requests a value from the hash %content, and you never defined/assigned such a variable. $content is a scalar variable, not the hash variable %content.
{
#in perl 5.20 you can also
use 5.020;
use experimental 'postderef';
print "key-postderef: $_\n" for keys $content->%*;
}
Now step deeper - to the arrayref - again you can print out the reference type
printf "reftype:%s\n", ref($aref);
#reftype:ARRAY
print all elements of array
print "arr-item: $_\n" for #{$aref};
but again NOT
#print "$_\n" for #aref;
#dies: Global symbol "#aref" requires explicit package name at script.pl line 37.
{
#in perl 5.20 you can also
use 5.020;
use experimental 'postderef';
print "aref-postderef: $_\n" for $aref->#*;
}
Here is a simple rule:
my @arr; #array variable
my $arr_ref = \@arr; #scalar - containing a reference to @arr
@{$arr_ref} is the same as @arr
^^^^^^^^^^^ - array reference in curly brackets
If you have an $arr_ref, use @{$arr_ref} everywhere you would use the array.
my %hash; #hash variable
my $hash_ref = \%hash; #scalar - containing a reference to %hash
%{$hash_ref} is the same as %hash
^^^^^^^^^^^ - hash reference in curly brackets
If you have a $hash_ref, use %{$hash_ref} everywhere you would use the hash.
For the whole structure, the following
say $content->{items}->[0];
say $content->{items}[0];
say ${$content}{items}->[0];
say ${$content}{items}[0];
say ${$content->{items}}[0];
say ${${$content}{items}}[0];
prints the same value 1.
$content is a hash reference, so you always need to use an arrow to access its contents. $content{items} would refer to a %content hash, which you don't have. That's where you're getting that "use of uninitialized value" error from.
I actually asked a similar question here
The answer:
In Perl, a function can only really return a scalar or a list.
Since hashes can be initialized or assigned from lists (e.g. %foo = (a => 1, b => 2)), I guess you're asking why json_decode returns something like { a => 1, b => 2 } (a reference to an anonymous hash) rather than (a => 1, b => 2) (a list that can be copied into a hash).
I can think of a few good reasons for this:
in Perl, an array or hash always contains scalars. So in something like { "a": { "b": 3 } }, the { "b": 3 } part has to be a scalar; and for consistency, it makes sense for the whole thing to be a scalar in the same way.
if the hash is quite large (many keys at top-level), it's pointless and expensive to iterate over all the elements to convert it into a list, and then build a new hash from that list.
in JSON, the top-level element can be either an object (= Perl hash) or an array (= Perl array). If json_decode returned a list in the former case, it's not clear what it would return in the latter case. After decoding the JSON string, how could you examine the result to know what to do with it? (And it wouldn't be safe to write %foo = json_decode(...) unless you already knew that you had a hash.) So json_decode's behavior works better for any general-purpose library code that has to use it without already knowing very much about the data it's working with.
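To make that last point concrete, here is a small sketch of my own (not from the linked answer) showing how ref lets you examine the decoded result:
use strict;
use warnings;
use JSON qw(decode_json);

my $json_text = '{ "items": [1, 2, 3, 4] }';
my $data = decode_json($json_text);

# ref() reports what kind of reference decode_json produced
if (ref $data eq 'HASH') {
    print "top level is a JSON object\n";
}
elsif (ref $data eq 'ARRAY') {
    print "top level is a JSON array\n";
}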
I have to wonder exactly what you passed to json_decode, because my results differ from yours.
#!/usr/bin/perl
use JSON qw(decode_json);
use Data::Dumper;
my $json = '["1", "2", "3", "4"]';
my $fromJSON = decode_json($json);
print Dumper($fromJSON);
The result is $VAR1 = [ '1', '2', '3', '4' ];
Which is an array ref, where your result is a hash ref
So did you pass in a hash with element items which was a reference to an array?
In my example you would get the array by doing
my @array = @{ $fromJSON };
In yours
my @array = @{ $content->{'items'} };
I don't understand why you dislike the arrow operator so much!
The decode_json function from the JSON module will always return a data reference.
Suppose you have a Perl program like this
use strict;
use warnings;
use JSON;
my $json_data = '{ "items": [ 1, 2, 3, 4 ] }';
my $content = decode_json($json_data);
use Data::Dump;
dd $content;
which outputs this text
{ items => [1 .. 4] }
showing that $content is a hash reference. Then you can access the array reference, as you found, with
dd $content->{items};
which shows
[1 .. 4]
and you can print the first element of the array by writing
print $content->{items}[0], "\n";
which, again as you have found, shows just
1
which is the first element of the array.
As @cjm mentions in a comment, it is imperative that you use strict and use warnings at the start of every Perl program. If you had those in place in the program where you tried to access $content{items}, your program would have failed to compile, and you would have seen the message
Global symbol "%content" requires explicit package name
which is a (poorly-phrased) way of telling you that there is no %content so there can be no items element.
The scalar variable $content is completely independent from the hash variable %content, which you are trying to access when you write $content{items}. %content has never been mentioned before and it is empty, so there is no items element. If you had tried @{$content->{items}} then it would have worked, as would @{${$content}{items}}.
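For example, with the same $content as above, dereferencing the whole array first also avoids the arrow:
my @items = @{ $content->{items} };
print $items[0], "\n"; # prints 1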
If you really have a problem with the arrow operator, then you could write
print ${$content}{items}[0], "\n";
which produces the same output; but I don't understand what is wrong with the original version.