Regex for s3 bucket name - json

I'm trying to create an s3 bucket through cloudformation. I tried using regex ^([0-9a-z.-]){3,63}$, but it also accepts the patterns "..." and "---" which are invalid according to new s3 naming conventions. (Ref: https://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html) Please help?

Answer
The simplest and safest regex is:
(?!(^xn--|.+-s3alias$))^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$
It ensures that names work for all cases - including when you are using S3 Transfer Acceleration. Also, as it doesn't include any backslashes, it's easier to use in string contexts.
Alternative
If you need S3 bucket names that include dots (and you don't use S3 Transfer Acceleration), you can use this instead:
(?!(^((2(5[0-5]|[0-4][0-9])|[01]?[0-9]{1,2})\.){3}(2(5[0-5]|[0-4][0-9])|[01]?[0-9]{1,2})$|^xn--|.+-s3alias$))^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$
Explanation
The Amazon S3 bucket naming rules as of 2022-05-14 are:
Bucket names must be between 3 (min) and 63 (max) characters long.
Bucket names can consist only of lowercase letters, numbers, dots (.), and hyphens (-).
Bucket names must begin and end with a letter or number.
Bucket names must not be formatted as an IP address (for example, 192.168.5.4).
Bucket names must not start with the prefix xn--.
Bucket names must not end with the suffix -s3alias.
Buckets used with Amazon S3 Transfer Acceleration can't have dots (.) in their names.
This regex matches all the rules (including rule 7):
(?!(^xn--|.+-s3alias$))^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$
The first group (?!(^xn--|-s3alias$)) is a negative lookahead that ensures that the name doesn't start with xn-- or end with -s3alias (satisfying rules 5 and 6).
The rest of the expression ^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$ ensures that:
the name starts with a lowercase letter or number (^[a-z0-9]) and ends with a lowercase letter or number ([a-z0-9]$) (rule 3).
the rest of the name consists of 1 to 61 lowercase letters, numbers or hyphens ([a-z0-9-]{1,61}) (rule 2).
the entire expression matches names from 3 to 63 characters in length (rule 1).
Lastly, we don't need to worry about rule 4 (which forbids names that look like IP addresses) because rule 7 implicitly covers this by forbidding dots in names.
If you do not use Amazon S3 Transfer Acceleration and want to permit more complex bucket names, then you can use this more complicated regular expression:
(?!(^((2(5[0-5]|[0-4][0-9])|[01]?[0-9]{1,2})\.){3}(2(5[0-5]|[0-4][0-9])|[01]?[0-9]{1,2})$|^xn--|.+-s3alias$))^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$
The main change is the addition of the expression to match IPv4 addresses (while the spec simply says that bucket names must not be formatted as IP addresses, as IPv6 addresses contain colons, they are already forbidden by rule 2.)

The following regex fulfils the AWS specifications provided the fact you don't want to allow . in the bucket name (which is a recommendation, otherwise Transfer Acceleration can't be enabled):
^((?!xn--)(?!.*-s3alias$)[a-z0-9][a-z0-9-]{1,61}[a-z0-9])$
This one is good because it allows to be incorporated in more complex checks simply replacing ^ and $ with other strings, thus allowing for ARN checks and so on.
EDIT:
added -s3alias exclusion as per the comment by #ryanjdillon

I've adapted Zak's answer a little bit. I found it was a little too complicated and threw out valid domain names. Here's the new regex (available with tests on regex101.com**):
(?!^(\d{1,3}\.){3}\d{1,3}$)(^[a-z0-9]([a-z0-9-]*(\.[a-z0-9])?)*$)
The first part is the negative lookahead (?!^(\d{1,3}\.){3}\d{1,3}$), which only matches valid IP addresses. Basically, we try to match 1-3 numbers followed by a period 3 times (\d{1,3}\.){3}) followed by 1-3 numbers (\d{1,3}).
The second part says that the name must start with a lowercase letter or a number (^[a-z0-9]) followed by lowercase letters, numbers, or hyphens repeated 0 to many times ([a-z0-9-]*). If there is a period, it must be followed by a lowercase letter or number ((\.[a-z0-9])?). These last 2 patterns are repeated 0 to many times (([a-z0-9-]*(\.[a-z0-9])?)*).
The regex does not attempt to enforce the size restrictions set forth by AWS (3-63 characters). That can either be handled by another regex (.{3,6}) or by checking the size of the string.
** At that link, one of the tests I added are failing, but if you switch to the test area and type in the same pattern, it passes. It also works if you copy/paste it into the terminal, so I assume that's a bug on the regex101.com side.

Regular expression for S3 Bucket Name:
String S3_REPORT_NAME_PATTERN = "[0-9A-Za-z!\\-_.*\'()]+";
String S3_PREFIX_PATTERN = "[0-9A-Za-z!\\-_.*\\'()/]*";
String S3_BUCKET_PATTERN = "(?=^.{3,63}$)(?!^(\\d+\\.)+\\d+$)(^(([a-z0-9]|[a-z0-9][a-z0-9\\-]*[a-z0-9])\\.)*([a-z0-9]|[a-z0-9][a-z0-9\\-]*[a-z0-9])$)";

I used #Zak regex but it isn't 100% correct. I used this for all rules for AWS bucket name. I make validation step by step so it looks like this:
Bucket names must be at least 3 and no more than 63 characters long -> ^.{3,63}$
Bucket names must not contain uppercase characters or underscores -> [A-Z_]
Bucket names must start with a lowercase letter or number -> ^[a-z0-9]
Bucket names must not be formatted as an IP address (for example, 192.168.5.4) ->^(\d+\.)+\d+$. That is more restricted then AWS.
Bucket names must be a series of one or more labels. Adjacent labels are separated by a single period (.) -> In python if ".." in bucket_name:
.. Each label must end with a lowercase letter or a number ->^(.*[a-z0-9]\.)*.*[a-z0-9]$

var bucketRGEX = new RegExp(/(?=^.{3,63}$)/);
var bucketRGEX1 = new RegExp(/(?!^(\d+\.)+\d+$)/);
var bucketRGEX2 = new RegExp(/(^(([a-z0-9]|[a-z0-9][a-z0-9\-]*[a-z0-9])\.)*([a-z0-9]|[a-z0-9][a-z0-9\-]*[a-z0-9])$)/);
var result = bucketRGEX.test(bucketName);
var result1 = bucketRGEX1.test(bucketName);
var result2 = bucketRGEX2.test(bucketName);
console.log('bucketName '+bucketName +' result '+result);
console.log('bucketName '+bucketName +' result1 '+result1);
console.log('bucketName '+bucketName +' result 2 '+result2);
if(result && result1 && result2)
{
//condition pass
}
else
{
//not valid bucket name
}

AWS issued new guidelines where '.' is considered not recommended and bucket names starting with 'xn--' are now prohibited (https://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html). If you disallow '.' the regex becomes much more readable:
(?=^.{3,63}$)(?!xn--)([a-z0-9](?:[a-z0-9-]*)[a-z0-9])$

I tried passing a wrong bucket name to the S3 API itself to see how it validates. Looks like the following are valid regex patterns as returned in the API response.
Bucket name must match the regex
^[a-zA-Z0-9.\-_]{1,255}$
or be an ARN matching the regex
^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$

Edit: Modified the regexp to allow required size (3-63) and add some other options.
The names must be DNS-compliant, so you could try with:
^[A-Za-z0-9][A-Za-z0-9\-]{1,61}[A-Za-z0-9]$
See: https://regexr.com/3psne
Use this if you need to use periods:
^[A-Za-z0-9][A-Za-z0-9\-.]{1,61}[A-Za-z0-9]$
See: https://regexr.com/3psnb
Finally, if you want to disallow two consecutive 'non-word' characters, you can use:
^[A-Za-z0-9](?!.*[.-]{2})[A-Za-z0-9\-.]{1,61}[A-Za-z0-9]$
See: https://regexr.com/3psn8
Based on: Regexp for subdomain

If you do not use transfer accelration, you can simply remove the period option. This code accounts for all the rules, including
no double dots
no slash/doto combo in a ow
no dot/slash combo
it also allows for you to put a trailing slash at the bucket name or not, depending if you want your user to do so
(?!^|xn--)[a-z0-9]{1}[a-z0-9-.]{1,61}[a-z0-9]{1}(?<!-s3alias|\.\..{1,63}|.{1,63}\.\.|\.\-.{1,63}|.{1,63}\.\-|\-\..{1,63}|.{1,63}\-\.)(?=$|\/)

Related

Capture a value from a repeating group on every iteration (as opposed to just last occurrence)

How does one capture a value recursively with regex, where value is a part of a group that repeats?
I have a serialized array in mysql database
These are 3 examples of a serialized array
a:2:{i:0;s:2:"OR";i:1;s:2:"WA";}
a:1:{i:0;s:2:"CA";}
a:4:{i:0;s:2:"CA";i:1;s:2:"ID";i:2;s:2:"OR";i:3;s:2:"WA";}
a:1 stands for array:{number of elements}
then in between {} i:0 means element 0, i:1 means element 1 etc.
then the actual value s:2:"CA" means string with length of 2
so I have 2 elements in first array, 1 element in the second and 4 elements in the last
I have this data in mysql database and I DO NOT HAVE an option to parse this with back-end code - this has to be done in mysql (10.0.23-MariaDB-log)
the repeating pattern is inside of the curly braces
the number of repeats is variable (as in 3 examples each has a different number of repeating patterns),
the number of repeating patterns is defined by the number at 3rd position (if that helps)
for the first example it's a:2:
and so there are 2 repeating blocks:
i:0;s:2:"OR";
i:1;s:2:"WA";
I only care to extract the values in bold
So I came up with this regex
^a:(?:\d+):\{(?:i:(?:\d+);s:(?:\d+):\"(\w\w)\";)+}$
it captures the values I want all right but problem is it only captures the last one in each repeating group
so going back to the example what would be captured is
WA
CA
WA
What I would want is
OR|WA
CA
CA|ID|OR|WA
these are the language specific regex functions available to me:
https://mariadb.com/kb/en/library/regular-expressions-functions/
I don't care which one is used to solve the problem
Ultimately I need this in as sensible form that can be presented to the client e.g. CA,ID,OR or CA|ID|OR
Current thoughts are perhaps this isn't possible in a one liner, and I have to write a multi-step function where
extract the repeating portion between the curly braces
then somehow iterate over each repeating portion
then use the regex on each
then return the results as one string with separated elements
I doubt if such a capture is possible. However, this would probably do the job for your specific purpose.
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(str1, '^a:\\d+:\{', ''),
'i:\\d+;s:\\d+:\"(\\w\\w)\";',
'\\1,'
),
'\,?\}$',
''
)
Basically, this works with the input string (or column) str1 like
remove the first part
replace every cell with the string you want
remove the last 2 characters, ,}
and voila! You get a string CA,ID,OR.
Aftenote
It may or may not work well when the original array before serialised is empty (it depends how it is serialised).

Regex getting the tags from an <a href= ...> </a> and the likes

I've tried the answers I've found in SOF, but none supported here : https://regexr.com
I essentially have an .OPML file with a large number of podcasts and descriptions.
in the following format:
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
What regex I can use to so I can just get the title and the link:
Software Engineering Daily
http://softwareengineeringdaily.com/feed/podcast/
Brief
There are many ways to go about this. The best way is likely using an XML parser. I would definitely read this post that discusses use of regex, especially with XML.
As you can see there are many answers to your question. It also depends on which language you are using since regex engines differ. Some accept backreferences, whilst others do not. I'll post multiple methods below that work in different circumstances/for different regex flavours. You can probably piece together from the multiple regex methods below which parts work best for you.
Code
Method 1
This method works in almost any regex flavour (at least the normal ones).
This method only checks against the attribute value opening and closing marks of " and doesn't include the possibility for whitespace before or after the = symbol. This is the simplest solution to get the values you want.
See regex in use here
\b(text|xmlUrl)="[^"]*"
Similarly, the following methods add more value to the above expression
\b(text|xmlUrl)\s*=\s*"[^"]*" Allows whitespace around =
\b(text|xmlUrl)=(?:"[^"]*"|'[^']*') Allows for ' to be used as attribute value delimiter
As another alternative (following the comments below my answer), if you wanted to grab every attribute except specific ones, you can use the following. Note that I use \w, which should cover most attributes, but you can just replace this with whatever valid characters you want. \S can be used to specify any non-whitespace characters or a set such as [\w-] may be used to specify any word or hyphen character. The negation of the specific attributes occurs with (?!text|xmlUrl), which says don't match those characters. Also, note that the word boundary \b at the beginning ensures that we're matching the full attribute name of text and not the possibility of other attributes with the same termination such as subtext.
\b((?!text|xmlUrl)\w+)="[^"]*"
Method 2
This method only works with regex flavours that allow backreferences. Apparently JGsoft applications, Delphi, Perl, Python, Ruby, PHP, R, Boost, and Tcl support single-digit backreferences. Double-digit backreferences are supported by JGsoft applications, Delphi, Python, and Boost. Information according this article about numbered backreferences from Regular-Expressions.info
See regex in use here
This method uses a backreference to ensure the same closing mark is used at the start and end of the attribute's value and also includes the possibility of whitespace surrounding the = symbol. This doesn't allow the possibility for attributes with no delimiter specified (using xmlUrl=http://softwareengineeringdaily.com/feed/podcast/ may also be valid).
See regex in use here
\b(text|xmlUrl)\s*=\s*(["'])(.*?)\2
Method 3
This method is the same as Method 2 but also allows attributes with no delimiters (note that delimiters are now considered to be space characters, thus, it will only match until the next space).
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(.*?)\2|(\S*))
Method 4
While Method 3 works, some people might complain that the attribute values might either of 2 groups. This can be fixed by either of the following methods.
Method 4.A
Branch reset groups are only possible in a few languages, notably JGsoft V2, PCRE 7.2+, PHP, Delphi, R (with PCRE enabled), Boost 1.42+ according to Regular-Expressions.info
This also shows the method you would use if backreferences aren't possible and you wanted to match multiple delimiters ("([^"])"|'([^']*))
See regex in use here
\b(text|xmlUrl)\s*=\s*(?|"([^"]*)"|'([^']*)'|(\S*))
Method 4.B
Duplicate subpatterns are not often supported. See this Regular-Expresions.info article for more information
This method uses the J regex flag, which allows duplicate subpattern names ((?<v>) is in there twice)
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(?<v>.*?)\2|(?<v>\S*))
Results
Input
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
Output
Each line below represents a different group. New matches are separated by two lines.
text
Software Engineering Daily
xmlUrl
http://softwareengineeringdaily.com/feed/podcast/
Explanation
I'll explain different parts of the regexes used in the Code section that way you understand the usage of each of these parts. This is more of a reference to the methods above.
"[^"]*" This is the fastest method possible (to the best of my knowledge) to grabbing anything between two " symbols. Note that it does not check for escaped backslashes, it will match any non-" character between two ". Whilst "(.*?)" can also be used, it's slightly slower
(["'])(.*?)\2 is basically shorthand for "(.*?)"|'(.*?)'. You can use any of the following methods to get the same result:
(?:"(.*?)"|'(.*?)')
(?:"([^"])"|'([^']*)') <-- slightly faster than line above
(?|) This is a branch reset group. When you place groups inside it like (?|(x)|(y)) it returns the same group index for both matches. This means that if x is captured, it'll get group index of 1, and if y is captured, it'll also get a group index of 1.
For simple HTML strings you might get along with
Url=(['"])(.+?)\1
Here, take group $2, see a demo on regex101.com.
Obligatory: consider using a parser instead (see here).

How to replace a character by another in a variable

I want to know if there is a way to replace a character by another in a variable. For example replacing every dots with underscores in a string variable.
I haven't tried it, but based on the Variables specification, the way I'd try to approach this would be to try to match on the text before and after a dot, and then make new variables based on the matches. Something like:
set "value" "abc.def";
if string :matches "${value}" "*.*" {
set "newvalue" "${1}_${2}
}
This will, of course, only match on a single period because Sieve doesn't include any looping structures. While there's a regex match option, I'm not aware of any regex replacement Sieve extensions.
Another approach to complex mail filtering you can do with Dovecot (if you do need loops and have full access to the mail server) is their Dovecot-specific extensions like vnd.dovecot.pipe which allows the mail administrator to define full programs (written in whatever language one wishes) to process mail on its way through.
Following #BluE's comment, if your use case is to store e-mails in folders per recipient address or something like that, perhaps you don't actually want a generic character replace function but some way to create mailboxes with dots in their names. In the case of dovecot, there seem to be a solution: [Dovecot] . (dot) in maildir folder names
https://wiki2.dovecot.org/Plugins/Listescape
Ensure one of the files in /etc/dovecot/conf.d contains this line:
mail_plugins = listescape
Then you can filter mailing lists into separate boxes based on their IDs.
This Sieve script snippet picks the ID from the x-list-id header:
if exists "x-list-id" {
if header :regex "x-list-id" "<([\.#a-z_0-9-]+)" {
set :lower "listname" "${1}";
fileinto :create "mailing_list\\${listname}";
} else {
keep;
}
stop;
}

mumps query related to %%

What is the meaning of I $E(R%%,I%%)>1 ? and why using %%?
Actually, if you are talking about Standard MUMPS (not any particular implementation)
the R%% is illegal syntax. I have seen the non-standard use of % in extensions to MUMPS, such as EsiObjects or InterSystems Cache Object Script, but the use in the question above is actually nonsense in standard MUMPS.
There is no particular significance to %%. Its just part of the variable name and I still don't understand MUMPS community obsession with using % in variable names and making them more obscure.
so the statement means IF $EXTRACT(R%%,I%%)>1 i.e if the extracted value from the string R%% at position I%% is greater than 1, do some more obscure stuff.
$EXTRACT(string,from) extracts a
single character in the position
specified by from. The from value can
be an integer count from the beginning
of the string, an asterisk specifying
the last character of the string, or
an asterisk with a negative integer
specifying a count backwards from the
end of the string.
Link to documentation: http://docs.intersystems.com/cache20102/csp/docbook/DocBook.UI.Page.cls?KEY=RCOS_fextract

In MATLAB, what ASCII characters are allowed to be in a function name?

I have a set of objects that I read information out of that contain information that ends up becoming a MATLAB m file. One piece of information ends up being a function name in MATLAB. I need to remove all of the not-allowed characters from that string before writing the M file out to the filesystem. Can someone tell me what characters make up the set of allowed characters in a function name for MATLAB?
Legal names follow the pattern [A-Za-z][A-Za-z0-9_]*, i.e. an alphabetic character followed by zero or more alphanumeric-or-underscore characters, up to NAMELENGTHMAX characters.
Since MATLAB variable and function naming rules are the same, you might find genvarname useful. It sanitizes arbitrary strings into legal MATLAB names.
The short answer...
Any alphanumeric characters or underscores, as long as the name starts with a letter.
The longer answer...
The MATLAB documentation has a section "Working with M-Files" that discusses naming with a little more detail. Specifically, it points out the functions NAMELENGTHMAX (the maximum number of characters in the name that the OS will pay attention to), ISVARNAME (to check if the variable/function name is valid), and ISKEYWORD (to display restricted keywords).
Edited:
this may be more informative:
http://scv.bu.edu/documentation/tutorials/MATLAB/functions.html