Regex Select numeric string between elements - html

I have been searching stack and Google and have figured out some of what I'm trying to do. Here is the string
?w=2796&
I need to get the numerics between w= and &
So it should return w=2796
Here's what I have so far: w=(.*?)&
This does work but is returning the &
Is there a way to to return only w=numeric
This regex is also returning some other elements which I don't need in the HTML in some instances.

It's going to depend on which regex engine you use. & may have special meaning. I think the simplest approach here is something like /w=(\d+)/ or /w=([0-9]+)/ assuming no negatives or decimals in your numbers

Related

Regex getting the tags from an <a href= ...> </a> and the likes

I've tried the answers I've found in SOF, but none supported here : https://regexr.com
I essentially have an .OPML file with a large number of podcasts and descriptions.
in the following format:
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
What regex I can use to so I can just get the title and the link:
Software Engineering Daily
http://softwareengineeringdaily.com/feed/podcast/
Brief
There are many ways to go about this. The best way is likely using an XML parser. I would definitely read this post that discusses use of regex, especially with XML.
As you can see there are many answers to your question. It also depends on which language you are using since regex engines differ. Some accept backreferences, whilst others do not. I'll post multiple methods below that work in different circumstances/for different regex flavours. You can probably piece together from the multiple regex methods below which parts work best for you.
Code
Method 1
This method works in almost any regex flavour (at least the normal ones).
This method only checks against the attribute value opening and closing marks of " and doesn't include the possibility for whitespace before or after the = symbol. This is the simplest solution to get the values you want.
See regex in use here
\b(text|xmlUrl)="[^"]*"
Similarly, the following methods add more value to the above expression
\b(text|xmlUrl)\s*=\s*"[^"]*" Allows whitespace around =
\b(text|xmlUrl)=(?:"[^"]*"|'[^']*') Allows for ' to be used as attribute value delimiter
As another alternative (following the comments below my answer), if you wanted to grab every attribute except specific ones, you can use the following. Note that I use \w, which should cover most attributes, but you can just replace this with whatever valid characters you want. \S can be used to specify any non-whitespace characters or a set such as [\w-] may be used to specify any word or hyphen character. The negation of the specific attributes occurs with (?!text|xmlUrl), which says don't match those characters. Also, note that the word boundary \b at the beginning ensures that we're matching the full attribute name of text and not the possibility of other attributes with the same termination such as subtext.
\b((?!text|xmlUrl)\w+)="[^"]*"
Method 2
This method only works with regex flavours that allow backreferences. Apparently JGsoft applications, Delphi, Perl, Python, Ruby, PHP, R, Boost, and Tcl support single-digit backreferences. Double-digit backreferences are supported by JGsoft applications, Delphi, Python, and Boost. Information according this article about numbered backreferences from Regular-Expressions.info
See regex in use here
This method uses a backreference to ensure the same closing mark is used at the start and end of the attribute's value and also includes the possibility of whitespace surrounding the = symbol. This doesn't allow the possibility for attributes with no delimiter specified (using xmlUrl=http://softwareengineeringdaily.com/feed/podcast/ may also be valid).
See regex in use here
\b(text|xmlUrl)\s*=\s*(["'])(.*?)\2
Method 3
This method is the same as Method 2 but also allows attributes with no delimiters (note that delimiters are now considered to be space characters, thus, it will only match until the next space).
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(.*?)\2|(\S*))
Method 4
While Method 3 works, some people might complain that the attribute values might either of 2 groups. This can be fixed by either of the following methods.
Method 4.A
Branch reset groups are only possible in a few languages, notably JGsoft V2, PCRE 7.2+, PHP, Delphi, R (with PCRE enabled), Boost 1.42+ according to Regular-Expressions.info
This also shows the method you would use if backreferences aren't possible and you wanted to match multiple delimiters ("([^"])"|'([^']*))
See regex in use here
\b(text|xmlUrl)\s*=\s*(?|"([^"]*)"|'([^']*)'|(\S*))
Method 4.B
Duplicate subpatterns are not often supported. See this Regular-Expresions.info article for more information
This method uses the J regex flag, which allows duplicate subpattern names ((?<v>) is in there twice)
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(?<v>.*?)\2|(?<v>\S*))
Results
Input
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
Output
Each line below represents a different group. New matches are separated by two lines.
text
Software Engineering Daily
xmlUrl
http://softwareengineeringdaily.com/feed/podcast/
Explanation
I'll explain different parts of the regexes used in the Code section that way you understand the usage of each of these parts. This is more of a reference to the methods above.
"[^"]*" This is the fastest method possible (to the best of my knowledge) to grabbing anything between two " symbols. Note that it does not check for escaped backslashes, it will match any non-" character between two ". Whilst "(.*?)" can also be used, it's slightly slower
(["'])(.*?)\2 is basically shorthand for "(.*?)"|'(.*?)'. You can use any of the following methods to get the same result:
(?:"(.*?)"|'(.*?)')
(?:"([^"])"|'([^']*)') <-- slightly faster than line above
(?|) This is a branch reset group. When you place groups inside it like (?|(x)|(y)) it returns the same group index for both matches. This means that if x is captured, it'll get group index of 1, and if y is captured, it'll also get a group index of 1.
For simple HTML strings you might get along with
Url=(['"])(.+?)\1
Here, take group $2, see a demo on regex101.com.
Obligatory: consider using a parser instead (see here).

Get Unique String from a longer string when the unique string is at 2 different locations

In a web application that I am creating tests for, there are 2 sets of strings from which I wish to get a substring (which is unique) to use for identifying that element on the Web Page:
Parent Form:
InputText-eLeType-AQAAAAAAAAAAAAAAAAAAAVWZ-bMs-bms_9999999_3512-bMs-obj-bMsDot-com-bMsDot-bmssolutions-bMsDot-COMPONENT-bMsDot-bms_9999999_109-bMs-textField-bMs-ABNylGGXXu8IPwjI4jMM5y1K
SubForm:
InputText-eLeType-AQAAAAAAAAAAAAAAAAAAAVXJ-bMs-bms_FK_9999999_406_ID-bMs-obj-bMsDot-com-bMsDot-bmssolutions-bMsDot-COMPONENT-bMsDot-bms_9999999_177-bMs-searchLookupField-bMs-ABNylGGXXu8IPwjI4jMM5y1K-bMs-AQAAAAAAAAAAAAAAAAAAAVWZ-bMs-PRIMARY9999999_480-bMs-obj-bMsDot-com-bMsDot-bmssolutions-bMsDot-COMPONENT-bMsDot-bms_9999999_109
I wish to get the substring from both of these using a single function, so that I don't have to create a different functions for each type I encounter:
Substring in the above 2 provided strings is:
ABNylGGXXu8IPwjI4jMM5y1K
This substring can change for each element on the web page, but is unique for each element of the page and so useful to identify.
I cannot use the full string, as it changes for each environment or if I generate a new environment to host the web pages (the complete string depends on the Meta Data).
We tried doing it for the Parent Form, by using the "-" as the delimiter and identifying the last -bMs- and then taking the string, but that does not work for the SubForm.
So, my main question is, is there some RegEx that can be created to extract only that string (composed of alphabets [upper & lower case] and numbers) from the full string? Or some other simpler way to identify that string?
You could try a combination of positive Lookbehind, [A-Z] and [a-z]. Try this code:
(?<=-bMs-)[A-Z]{3}[a-z]\w+
Demo: https://regex101.com/r/YUZiFa/1
It seems to work without even the positive Lookbehind
[A-Z]{3}[a-z]\w+
Demo: https://regex101.com/r/YUZiFa/2
If you're happy to base the selection of the elements on the previous one, then this might work for you:
(?<=searchLookupField-bMs-|textField-bMs-)\w+
Example
And if you wanted to be extra certain, you could append a second lookahead to the end.
(?<=searchLookupField-bMs-|textField-bMs-)\w+(?=-bMs-|$)
Example
If these don't work, or if the whole string varies greatly, then some more examples would help us narrow it down and come up with a great answer!

Restrict text field to mathematical expressions - Convert String to mathematical expression

How can I restrict the input of a TextFieldsuch that it can only contain mathematical expressions?
Accepted inputs would be:
"3+5"
"-5 + 6"
"3/2(6*4)"
"6--5"
"+5-3"
etc..
And rejected inputs would be:
"5+++3"
"6(7)"
"6-6-+-7"
and so on.
Basically; the syntax I want it to be restricted to is the kind of syntax that programming languages normally use for evaluating mathematical expressions, kinda like the syntax input you'd expect from your everyday calculator.
I'm making a program in which I want the user to be able to input numbers and/or calculations into a text box, instead of having to use a calculator to do it and then arduously type out a number with 7 decimal places.
I've done a little look around and I've seen a lot of stuff involving Regex, postfix, BNF, and the like. A lot of it looked very complicated, too complex for my understanding, and none of it had anything to do with AS3.
However, I've had a thought about making this problem a whole lot simpler by just converting the string into a mathematical expression that AS3 can understand, and let Flash handle the errors using try catch, but I don't know how to do that either (Number("3+5")resulted to NaN).
I'm currently restricting text input to just numbers using Event.CHANGE, like this:
function Restrict(event:Event):void
{
if (event.currentTarget.text.indexOf(".") == -1)
{
event.currentTarget.restrict = "0-9.";
}
else
{
event.currentTarget.restrict = "0-9";
}
}
and it's seemed to work well so far.
I intend to implement this new restriction in this manner, but if there is a much more efficient way of restricting text input, please feel free to include it in a response.
Just to reiterate for clarity, I am asking how to implement functionality that will enable someone to input a mathematical expression into a TextField, and the program will register that input as an expression and calculate it.
Thanks for reading.
EDIT: I've done a bit more research and I've stumbled upon a Reverse Polish Notation calculator/parser/utility class/library/thingy that looks very useful. Seems kinda similar to the Executer class that fsbmain mentioned, but it looks a lot simpler to use and easier for me to understand.
However the problem still remains that I'd have to find an efficient way to restrict the syntax of user input to mathematical expressions, but at least now I have at least two ways of converting the string into a number for calculations.
That's quite a tough question actually, even definition for valid mathematical expressions which you mentioned is very complicated itself, i.e. expression 6-6-+-7 is a valid from as3 syntax point of view and gives result 7.
Regarding second part of your question:
converting the string into a mathematical expression that AS3 can understand
That's not possible to do with only native as3 means since eval-like functions are gone since as2 time, but you can try to use some as3-written syntax translator, i.e. Executer from flash-console project:
var exec:Executer = new Executer();
var res:* = exec.exec(this, "6-7");
trace("exec = " + res); //output "-1"
Although it's failed with some complex expressions from your question:
var exec:Executer = new Executer();
var res:* = exec.exec(this, "6-6-+-7");
trace("exec = " + res); //output "- 7"

mySQL replace string + additional string with a static value

Unlike PHP, I don't believe mySQL has any preg_replace() feature, only matching via REGEXP. Here are the strings I have in the code:
http://ourcompany.com/theapplestore/...
http://ourcompany.com/anotherstore/...
http://ourcompany.com/yetanotherstore/...
As you can see, there is a constant in there, http://ourcompany.com/, but there is also a variable string namely theapplestore, anotherstore, etc. etc.
I want to replace the constant string, plus the variable string(s), and then the trailing slash (/) after the variable string(s), with a single shortcode value, namely {{store url=''}}
EDIT
If it helps, the store codes are always the same length, they are going to be
sch131785
sch185399
sch634019
etc.
i.e., they are all 9 characters long
How would I do this? Thanks.
I thought this might be useful: there is currently NO WAY to do this in mysql. Find using REGEXP, yes; replace, no. That said, there is another post with an extension library mentioned, sagi:
Is there a MySQL equivalent of PHP's preg_replace?
MariaDB-10.0.5 has REGEXP_REPLACE(), REGEXP_INSTR() and REGEXP_SUBSTR()
You can use following regex,
(ourcompany.com\/\w+\/)
Demo
Uses the concept of Group Capture

How should substring() work?

I do not understand why Java's [String.substring() method](http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#substring(int,%20int%29) is specified the way it is. I can't tell it to start at a numbered-position and return a specified number of characters; I have to compute the end position myself. And if I specify an end position beyond the end of the String, instead of just returning the rest of the String for me, Java throws an Exception.
I'm used to languages where substring() (or substr()) takes two parameters: a start position, and a length. Is this objectively better than the way Java does it, and if so, can you prove it? What's the best language specification for substring() that you have seen, and when if ever would it be a good idea for a language to do things differently? Is that IndexOutOfBoundsException that Java throws a good design idea, or not? Does all this just come down to personal preference?
There are times when the second parameter being a length is more convenient, and there are times when the second parameter being the "offset to stop before" is more convenient. Likewise there are times when "if I give you something that's too big, just go to the end of the string" is convenient, and there are times when it indicates a bug and should really throw an exception.
The second parameter being a length is useful if you've got a fixed length of field. For instance:
// C#
String guid = fullString.Substring(offset, 36);
The second parameter being an offset is useful if you're going up to another delimited:
// Java
int nextColon = fullString.indexOf(':', start);
if (start == -1)
{
// Handle error
}
else
{
String value = fullString.substring(start, nextColon);
}
Typically, the one you want to use is the opposite to the one that's provided on your current platform, in my experience :)
I'm used to languages where
substring() (or substr()) takes two
parameters: a start position, and a
length. Is this objectively better
than the way Java does it, and if so,
can you prove it?
No, it's not objectively better. It all depends on the context in which you want to use it. If you want to extract a substring of a specific length, it's bad, but if you want to extract a substring that ends at, say, the first occurrence of "." in the string, it's better than if you first had to compute a length. The question is: which requirement is more common? I'd say the latter. Of course, the best solution would be to have both versions in the API, but if you need the length-based one all the time, using a static utility method isn't that horrible.
As for the exception, yeah, that's definitely good design. You asked for something specific, and when you can't get that specific thing, the API should not try to guess what you might have wanted instead - that way, bugs become apparent more quickly.
Also, Java DOES have an alternative substring() method that returns the substring from a start index until the end of the string.
second parameter should be optional, first parameter should accept negative values..
If you leave off the 2nd parameter it will go to the end of the string for you without you having to compute it.
Having gotten some feedback, I see when the second-parameter-as-index scenario is useful, but so far all of those scenarios seem to be working around other language/API limitations. For example, the API doesn't provide a convenient routine to give me the Strings before and after the first colon in the input String, so instead I get that String's index and call substring(). (And this explains why the second position parameter in substr() overshoots the desired index by 1, IMO.)
It seems to me that with a more comprehensive set of string-processing functions in the language's toolkit, the second-parameter-as-index scenario loses out to second-parameter-as-length. But somebody please post me a counterexample. :)
If you store this away, the problem should stop plaguing your dreams and you'll finally achieve a good night's rest:
public String skipsSubstring(String s, int index, int length) {
return s.subString(index, index+length);
}