I'm reading through the book "SAS Functions by Example - Second Edition" and having trouble understanding one function because of an example and the output the book gives for it.
Function: FINDC
Purpose: To locate a character that appears or does not appear within a string. With optional arguments, you can define the starting point for the search, set the direction of the search, ignore case or trailing blanks, or look for characters except the ones listed.
Syntax: FINDC(character-value, find-characters <,'modifiers'> <,start>)
Two of the modifiers are i and k:
i ignore case
k count only characters that are not in the list of find-characters
So now one of the examples has this:
Note: STRING1 = "Apples and Books"
FINDC(STRING1,"aple",'ki')
For the output, they say it returns 1 because of the position of "A" in Apples. This is what confuses me, because I thought the k modifier says to find characters that are not in the find-characters list. So why does it match "A" when that letter, with case ignored, is in the find-characters list? To me, this example should return 6, for the "s" in Apples.
Is anyone able to help explain the k modifier to me any better, and why the output for this answer is 1 instead of 6?
Edit 1
Reading the SAS documentation online, I found this example which seems to contradict the book I'm reading:
Example 3: Searching for Characters and Using the K Modifier
This example searches a character string and returns the characters that do
not appear in the character list.
data _null_;
   string = 'Hi, ho!';
   charlist = 'hi';
   j = 0;
   do until (j = 0);
      j = findc(string, charlist, "k", j+1);
      if j = 0 then put +3 "That's all";
      else do;
         c = substr(string, j, 1);
         put +3 j= c=;
      end;
   end;
run;
SAS writes the following output to the log:
j=1 c=H
j=3 c=,
j=4 c=
j=6 c=o
j=7 c=!
That's all
So, is the book wrong?
The book is wrong. Running the example returns 6, the position of the "s" in Apples:
data _null_;
   STRING1 = "Apples and Books";
   x = FINDC(STRING1, "aple", 'ki');
   put x=;
   if x then do;
      ch = char(string1, x);
      put ch=;
   end;
run;
x=6
ch=s
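To make the effect of the two modifiers concrete, here is a minimal Python sketch that emulates only the i and k behaviour described above. The function name findc and its handling of start are a simplified stand-in written for illustration, not the SAS implementation; other modifiers, negative start values and trailing-blank handling are not modelled.

```python
def findc(value, find_chars, modifiers="", start=1):
    """Return the 1-based position of the first qualifying character, or 0.

    Simplified emulation (assumption): only the 'i' and 'k' modifiers
    and a positive start position are handled.
    """
    mods = modifiers.lower()
    ignore_case = "i" in mods   # i: ignore case
    complement = "k" in mods    # k: look for characters NOT in the list
    if ignore_case:
        find_chars = find_chars.lower()
    for pos in range(max(start, 1), len(value) + 1):
        ch = value[pos - 1]
        if ignore_case:
            ch = ch.lower()
        in_list = ch in find_chars
        # Without k we want a listed character; with k an unlisted one.
        if in_list != complement:
            return pos
    return 0


print(findc("Apples and Books", "aple", "ki"))  # first character outside the list
print(findc("Apples and Books", "aple", "i"))   # first character inside the list
```

In this sketch the 'ki' call skips A, p, p, l and e and stops at the "s" in position 6, while dropping the k gives 1, the value the book reports.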
I am working on generating an HTML page using a CGI script in Perl.
I need to filter some sequences to check whether they contain a specific pattern; if they do, I need to print those sequences on my page with 50 bases per line and highlight the pattern in the sequences. My sequences are in a hash called %hash; the keys are the names, the values are the actual sequences.
my %hash2;
foreach my $key (keys %hash) {
    if ($hash{$key} =~ s!(aaagg)!<b>$1</b>!) {
        $hash2{$key} = $hash{$key}
    }
}
foreach my $key (keys %hash2) {
    print "<p> <b> $key </b> </p>";
    print "<p>$_</p>\n" for unpack '(A50)*', $hash2{$key};
}
This method does the job, but highlighting the pattern "aaagg" this way messes up the unpacking of the lines (unpack '(A50)*'), because the sequences now contain the extra characters of the bold tags, which are included in the unpacking count. Besides making the lines different lengths, it is a big problem if a tag falls between two lines when unpacking 50 characters: the tag remains open and everything after it is bold.
The script below uses a single randomly generated DNA sequence of length 243 (generated using http://www.bioinformatics.org/sms2/random_dna.html) and a variable length pattern.
It works by first recording the positions which need to be highlighted instead of changing the sequence string. The highlighting is inserted after the sequence is split into chunks of 50 bases.
The highlighting is done in reverse order to minimize bookkeeping busy work.
#!/usr/bin/env perl
use utf8;
use strict;
use warnings;
use YAML::XS;
my $PRETTY_WIDTH = 50;
# I am using bold-italic so the highlighting
# is visible on Stackoverflow, but in real
# life, this would be something like:
# my @PRETTY_MARKUP = ('<span class="highlighted-match">', '</span>');
my @PRETTY_MARKUP = ('<b><i>', '</i></b>');
use constant { BAŞ => 0, SON => 1, ROW => 0, COL => 1 };
my $sequence = q{ccggtgagacatccagttagttcactgagccgacttgcatcagtcatgcttttccccgtaatgagggccccatattcaggccgtcgtccggaattgtcttggatccggaatgcagcttttctcaccgcttgatgaacattcactgaatatctgacgccgcgaaaacagggtcactagcctgtttccggtcgcccgagaccggcgagtttgtggtatcgcgagcgcccccgggcggtagggtct};
my $wanted = 'c..?gg';
my @pos;
while ($sequence =~ /($wanted)/g) {
    push @pos, [ pos($sequence) - length($1), pos($sequence) ];
}
print Dump \@pos;
my @output = unpack "(A$PRETTY_WIDTH)*", $sequence;
print Dump \@output;
while (my $pos = pop @pos) {
    my @rc = map pos_to_rc($_, $PRETTY_WIDTH), @$pos;
    substr($output[ $rc[$_][ROW] ], $rc[$_][COL], 0, $PRETTY_MARKUP[$_]) for SON, BAŞ;
}
print Dump \@output;
sub pos_to_rc {
    my $r = int( $_[0] / $_[1] );
    my $c = $_[0] - $r * $_[1];
    [ $r, $c ];
}
Output:
C:\...\Temp> perl s.pl
---
- - 0
- 4
- - 76
- 80
- - 87
- 91
- - 97
- 102
- - 104
- 108
- - 165
- 170
- - 184
- 188
- - 198
- 202
- - 226
- 231
---
- ccggtgagacatccagttagttcactgagccgacttgcatcagtcatgct
- tttccccgtaatgagggccccatattcaggccgtcgtccggaattgtctt
- ggatccggaatgcagcttttctcaccgcttgatgaacattcactgaatat
- ctgacgccgcgaaaacagggtcactagcctgtttccggtcgcccgagacc
- ggcgagtttgtggtatcgcgagcgcccccgggcggtagggtct
---
- <b><i>ccgg</i></b>tgagacatccagttagttcactgagccgacttgcatcagtcatgct
- tttccccgtaatgagggccccatatt<b><i>cagg</i></b>ccgtcgt<b><i>ccgg</i></b>aattgt<b><i>ctt
- gg</i></b>at<b><i>ccgg</i></b>aatgcagcttttctcaccgcttgatgaacattcactgaatat
- ctgacgccgcgaaaa<b><i>caggg</i></b>tcactagcctgttt<b><i>ccgg</i></b>tcgcccgaga<b><i>cc
- gg</i></b>cgagtttgtggtatcgcgagcgcc<b><i>cccgg</i></b>gcggtagggtct
Especially since this turns out to have been a homework assignment, it is now up to you to take this and apply it to all sequences in your hash table.
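For comparison, the "record the positions first, insert the markup after splitting" idea can be sketched in Python as below. This is a variation under stated assumptions rather than a port of the Perl script: the function name highlight_wrapped and its parameters are made up for illustration, and a match that straddles a line break is split into two separately tagged pieces so that no tag stays open across lines.

```python
import re


def highlight_wrapped(sequence, pattern, width=50, markup=("<b>", "</b>")):
    """Split sequence into lines of `width` bases and wrap matches in markup.

    Match positions are taken from the untouched sequence, so the tags
    never count towards the line width.
    """
    spans = [m.span() for m in re.finditer(pattern, sequence)]
    lines = []
    for start in range(0, len(sequence), width):
        chunk = sequence[start:start + width]
        end = start + len(chunk)
        # Work right to left so earlier offsets in the chunk stay valid.
        for s, e in reversed(spans):
            lo, hi = max(s, start), min(e, end)
            if lo < hi:  # this match overlaps the current line
                lo -= start
                hi -= start
                chunk = (chunk[:lo] + markup[0] + chunk[lo:hi]
                         + markup[1] + chunk[hi:])
        lines.append(chunk)
    return lines


for line in highlight_wrapped("ttaaaggtt", "aaagg", width=4):
    print(line)
```

With the deliberately tiny width of 4, the single match is cut at the line break and each half gets its own pair of tags.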
Is it possible for a MySQL database to generate a 5 or 6 digit code consisting only of numbers and letters when I insert a record? If so, how?
Just like goo.gl, bit.ly and jsfiddle do it. For example:
http://bit.ly/3PKQcJ
http://jsfiddle.net/XzKvP
cZ6ahF, 3t5mM, xGNPN, xswUdS...
So UUID_SHORT() will not work because it returns a value like 23043966240817183
Requirements:
Must be unique (non-repeating)
Can be but not required to be based off of primary key integer value
Must scale (grow by one character when all possible combinations have been used)
Must look random. (item 1234 cannot be BCDE while item 1235 be BCDF)
Must be generated on insert.
Would greatly appreciate code examples.
Try this:
SELECT LEFT(UUID(), 6);
I recommend using Redis for this task, actually. It has the features this task needs. Foremost, it is very good at searching a big list for a value.
We will create two lists, buffered_ids, and used_ids. A cronjob will run every 5 minutes (or whatever interval you like), which will check the length of buffered_ids and keep it above, say, 5000 in length. When you need to use an id, pop it from buffered_ids and add it to used_ids.
Redis has sets, which are unique items in a collection. Think of it as a hash where the keys are unique and all the values are "true".
Your cronjob, in bash:
# Integer logarithm: "log N X" prints floor(log_N(X)); "log X" uses base 2.
log(){ local x=$1 n=2 l=-1; if [ "$2" != "" ]; then n=$x; x=$2; fi; while ((x)); do let l+=1 x/=n; done; echo $l; }
scale=$(redis-cli SCARD used_ids)
scale=$(log 16 "$scale")
scale=$(( scale + 6 ))
while [ "$(redis-cli SCARD buffered_ids)" -lt 5000 ]; do
    uuid=$(tr -cd '[:alnum:]' < /dev/urandom | head -c "${1:-$scale}")
    if [ "$(redis-cli SISMEMBER used_ids "$uuid")" = 1 ]; then
        continue
    fi
    redis-cli SADD buffered_ids "$uuid"
done
To grab the next uid for use in your application (in pseudocode because you did not specify a language)
$uid = redis('SPOP buffered_ids');
redis('SADD used_ids ' . $uid);
Edit: actually, there's a race condition there. To safely pop a value, add it to used_ids first, then remove it from buffered_ids.
$uid = redis('SRANDMEMBER buffered_ids');
redis('SADD used_ids ' . $uid);
redis('SREM buffered_ids ' . $uid);
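The buffering scheme can be tried out without a Redis server. The sketch below is only an in-memory model: two Python sets stand in for the Redis sets buffered_ids and used_ids, and the names int_log, code_length, refill and take are hypothetical helpers invented for this illustration. It mirrors the cron job's length rule (integer log base 16 of the number of used ids, plus 6) and the "add to used first, then remove from the buffer" order.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # numbers and letters only


def int_log(base, x):
    """floor(log_base(x)) for x >= 1, and -1 for x == 0, like the bash helper."""
    result = -1
    while x:
        result += 1
        x //= base
    return result


def code_length(used_count, base_length=6):
    """Length of new codes; grows slowly as more codes are used."""
    return int_log(16, used_count) + base_length


def refill(buffered, used, target=5000):
    """Top up the buffer with random codes that have never been used."""
    length = code_length(len(used))
    while len(buffered) < target:
        code = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if code not in used:
            buffered.add(code)  # a set silently ignores duplicates


def take(buffered, used):
    """Hand out one code: mark it used first, then drop it from the buffer."""
    code = next(iter(buffered))
    used.add(code)
    buffered.discard(code)
    return code


buffered, used = set(), set()
refill(buffered, used, target=20)
print(take(buffered, used), len(buffered), len(used))
```

This model is single-process, so it says nothing about concurrent clients; in a real deployment the ordering of the two set operations is what the edit above is about.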
I have a large .csv file to process and my elements are arranged randomly like this:
xxxxxx,xx,MLOCAL,MREMOTE,33222,56,22/10/2012,18/10/2012
xxxxxx,xx,MREMOTE,MLOCAL,33222,56,22/10/2012,18/10/2012
xxxxxx,xx,MLOCAL,341993,22/10/2012
xxxxxx,xx,MREMOTE,9356828,08/10/2012
xxxxxx,xx,LOCAL,REMOTE,19316,15253,22/10/2012,22/10/2012
xxxxxx,xx,REMOTE,LOCAL,1865871,383666,22/10/2012,22/10/2012
xxxxxx,xx,REMOTE,1180306134,19/10/2012
where the fields LOCAL, REMOTE, MLOCAL and MREMOTE appear as follows:
when they appear as a pair (LOCAL/REMOTE or MLOCAL/MREMOTE), for example the 3rd field is MLOCAL and the 4th field is MREMOTE, then the 5th and 7th fields are the value and date of MLOCAL, and the 6th and 8th fields are the value and date of MREMOTE
when they appear alone (only LOCAL or only REMOTE), then the 4th and 5th fields are the value and date of field 3.
Now, I have split these rows using:
nawk 'BEGIN{
while (getline < "'"$filedata"'")
split($0,ft,",");
name=ft[1];
ID=ft[2]
?=ft[3]
?=ft[4]
....................
but because I can't find a pattern for the 3rd and 4th fields, I'm stuck on how to assign variable names to each of the array elements so that I can use them for further processing.
I tried to use a "case" statement, but it doesn't work in awk or nawk (it only works as expected in gawk). I also tried this:
if ( ft[3] == "MLOCAL" && ft[4]!= "MREMOTE" )
{
MLOCAL=ft[3];
MLOCAL_qty=ft[4];
MLOCAL_TIMESTAMP=ft[5];
}
else if ( ft[3] == MLOCAL && ft[4] == MREMOTE )
{
MLOCAL=ft[3];
MREMOTE=ft[4];
MOCAL_qty=ft[5];
MREMOTE_qty=ft[6];
MOCAL_TIMESTAMP=ft[7];
MREMOTE_TIMESTAMP=ft[8];
}
else if ( ft[3] == MREMOTE && ft[4] != MOCAL )
{
MREMOTE=ft[3];
MREMOTE_qty=ft[4];
MREMOTE_TIMESTAMP=ft[5];
..........................................
but that doesn't work either.
So, if you have any idea how to handle this, I would be grateful for a hint towards a pattern that covers all the possible situations above.
EDIT
I don't know how to thank you for all this help. What I have to do is more complex than what I wrote above; I'll try to describe it as simply as I can so I don't confuse you.
My output should be like following:
NAME,UNIQUE_ID,VOLUME_ALOCATED,MLOCAL_VALUE,MLOCAL_TIMESTMP,MLOCAL_limit,LOCAL_VALUE,LOCAL_TIMESTAMP,LOCAL_limit,MREMOTE_VALUE,MREMOTE_TIMESTAMP,REMOTE_VALUE,REMOTE_TIMESTAMP
(where MLOCAL_limit and LOCAL_limit are the result of subtracting MLOCAL_VALUE or LOCAL_VALUE from VOLUME_ALOCATED)
So, in my output file, fields position should be arranged like:
4th field =MLOCAL_VALUE,5th field =MLOCAL_TIMESTMP,7th field=LOCAL_VALUE,
8th field=LOCAL_TIMESTAMP,10th field=MREMOTE_VALUE,11th field=MREMOTE_TIMESTAMP,12th field=REMOTE_VALUE,13th field=REMOTE_TIMESTAMP
Now, an example would be this:
for the following input: name,ID,VOLUME_ALLOCATED,MLOCAL,MREMOTE,33222,56,22/10/2012,18/10/2012
name,ID,VOLUME_ALLOCATED,REMOTE,234455,19/12/2012
I should process this line and the output should be this:
name,ID,VOLUME_ALLOCATED,33222,22/10/2012,MLOCAL_LIMIT, ,,,56,18/10/2012,,
7th ,8th, 9th,12th, and 13th fields are empty because there is no info related to: LOCAL_VALUE,LOCAL_TIMESTAMP,LOCAL_limit,REMOTE_VALUE, and REMOTE_TIMESTAMP
OR
name,ID,VOLUME_ALLOCATED,,,,,,,,,234455,19/12/2012
The 4th, 5th, 6th, 7th, 8th, 9th, 10th and 11th fields should be empty because there is no info about: MLOCAL_VALUE,MLOCAL_TIMESTAMP,MLOCAL_LIMIT,LOCAL_VALUE,LOCAL_TIMESTAMP,LOCAL_LIMIT,MREMOTE_VALUE,MREMOTE_TIMESTAMP
VOLUME_ALLOCATED is retrieved from other csv file (called "info.csv") based on the ID field which is processed earlier in the script like:
info.csv
VOLUME_ALLOCATED,ID,CLIENT
5242881,64,subscriber
567743,24,visitor
data.csv
NAME,64,MLOCAL,341993,23/10/2012
NAME,24,LOCAL$REMOTE,2347$4324,19/12/2012$18/12/2012
Now, my code is this:
#! /usr/bin/bash
input="info.csv"
filedata="data.csv"
outfile="out"
nawk 'BEGIN{
while (getline < "'"$input"'")
{
split($0,ft,",");
volume=ft[1];
id=ft[2];
client=ft[3];
key=id;
volumeArr[key]=volume;
clientArr[key]=client;
}
close("'"$input"'");
while (getline < "'"$filedata"'")
{
gsub(/\$/,","); # substitute the $ separator with comma
split($0,ft,",");
volume=volumeArr[id]; # Get the volume from the volumeArr, using "id" as key
segment=clientArr[id]; # Get the client mode from the clientArr, using "id" as key
NAME=ft[1];
id=ft[2];
here I'm stuck, I can't find the right way to set the rest of the
fields since I don't know how to handle the 3rd and 4th fields.
? =ft[3];
? =ft[4];
Sorry if this is confusing, but this is my current situation.
Thanks
You didn't provide the expected output from your sample input but here's a start to show how to get the values for the 2 different formats of input line:
$ cat tst.awk
BEGIN{ FS=","; OFS="\t" }
{
    delete value    # or use split("",value) if your awk can't delete arrays
    if ($4 ~ /LOCAL|REMOTE/) {
        value[$3] = $5
        date[$3] = $7
        value[$4] = $6
        date[$4] = $8
    }
    else {
        value[$3] = $4
        date[$3] = $5
    }
    print
    for (type in value) {
        printf "%15s%15s%15s\n", type, value[type], date[type]
    }
}
$ awk -f tst.awk file
xxxxxx,xx,MLOCAL,MREMOTE,33222,56,22/10/2012,18/10/2012
        MREMOTE             56     18/10/2012
         MLOCAL          33222     22/10/2012
xxxxxx,xx,MREMOTE,MLOCAL,33222,56,22/10/2012,18/10/2012
        MREMOTE          33222     22/10/2012
         MLOCAL             56     18/10/2012
xxxxxx,xx,MLOCAL,341993,22/10/2012
         MLOCAL         341993     22/10/2012
xxxxxx,xx,MREMOTE,9356828,08/10/2012
        MREMOTE        9356828     08/10/2012
xxxxxx,xx,LOCAL,REMOTE,19316,15253,22/10/2012,22/10/2012
         REMOTE          15253     22/10/2012
          LOCAL          19316     22/10/2012
xxxxxx,xx,REMOTE,LOCAL,1865871,383666,22/10/2012,22/10/2012
         REMOTE        1865871     22/10/2012
          LOCAL         383666     22/10/2012
xxxxxx,xx,REMOTE,1180306134,19/10/2012
         REMOTE     1180306134     19/10/2012
and if you post the expected output we could help you more.
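The same branching on the 4th field can be written in Python, which may make the two row layouts easier to see. This is a rough sketch, not a replacement for the awk script: parse_row is a hypothetical helper name, and the extra length check on paired rows is an assumption added here for safety.

```python
import re

TYPE_RE = re.compile(r"LOCAL|REMOTE")  # same test the awk script applies to $4


def parse_row(line):
    """Return (name, id, {type: (value, date)}) for one input row."""
    fields = line.strip().split(",")
    name, ident, rest = fields[0], fields[1], fields[2:]
    if len(rest) >= 6 and TYPE_RE.search(rest[1]):
        # Paired layout: type1,type2,value1,value2,date1,date2
        t1, t2, v1, v2, d1, d2 = rest[:6]
        return name, ident, {t1: (v1, d1), t2: (v2, d2)}
    # Single layout: type,value,date
    t, v, d = rest[:3]
    return name, ident, {t: (v, d)}


rows = [
    "xxxxxx,xx,MLOCAL,MREMOTE,33222,56,22/10/2012,18/10/2012",
    "xxxxxx,xx,REMOTE,1180306134,19/10/2012",
]
for row in rows:
    print(parse_row(row))
```

Because every row ends up as a dictionary keyed by type, missing types can be written out as empty fields when the final output line is assembled.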