How to better deal with regex capture group between HTML tags?

I'm trying to capture the content inside the <DOCUMENT> tag in the string below. The result contains the desired match, but also a weird extra entry, "t", the last letter before the closing tag.
I'm pretty new to regex and I wonder what is going on. What should I read up on?
PS: If I remove the () brackets around the whole pattern, only 't' is captured. I can't see what difference the brackets (i.e. defining a capture group) make in this case.
example = '''ABCDE<DOCUMENT>
Lorem ipsum
dolor sit amet</DOCUMENT>
EFGHIJK.'''
re.findall(r'(<DOCUMENT>(.|\s)*<\/DOCUMENT>)', example)
Outputs:
[('<DOCUMENT>\nLorem ipsum\ndolor sit amet</DOCUMENT>', 't')]

Try using the re.DOTALL flag instead of using \s to capture whitespaces:
re.findall(r'(<DOCUMENT>.*<\/DOCUMENT>)', example, flags = re.DOTALL)
Explaining the issue
re.findall documentation states that:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group
You have two capturing groups (defined by the parentheses) in your regex:
over the whole pattern, defined by the first and last parentheses
over the .|\s pattern
That's why the result is a list containing one tuple with two elements: <DOCUMENT>\nLorem ipsum\ndolor sit amet</DOCUMENT> and t.
Because the * sits outside the inner capturing group, that group is matched many times, once per character. A repeated group keeps only its last match, which here is the final t of "amet" in the input string, so findall returns t as the value of that group.
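To make the explanation above concrete, here is a minimal sketch comparing the original pattern with two variants that use a non-capturing group, (?:...). The variable names are only illustrative.

```python
import re

example = "ABCDE<DOCUMENT>\nLorem ipsum\ndolor sit amet</DOCUMENT>\nEFGHIJK."

# Two capturing groups: findall returns one tuple per match, and the
# repeated inner group only keeps its last iteration (the final "t").
two_groups = re.findall(r"(<DOCUMENT>(.|\s)*</DOCUMENT>)", example)

# Inner group made non-capturing: one group left, so findall returns strings.
one_group = re.findall(r"(<DOCUMENT>(?:.|\s)*</DOCUMENT>)", example)

# No capturing groups at all: findall returns the whole match.
no_group = re.findall(r"<DOCUMENT>(?:.|\s)*</DOCUMENT>", example)

print(two_groups)
print(one_group)
print(no_group)
```

This also explains the PS in the question: without the outer brackets the only remaining group is the inner one, so findall reports just its last match.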

Here, we can use this expression with the s (DOTALL) flag:
<DOCUMENT>(.*?)<\/DOCUMENT>
or any of these expressions, which need no flag because each character class already matches newlines:
<DOCUMENT>([\s\S]*?)<\/DOCUMENT>
<DOCUMENT>([\d\D]*?)<\/DOCUMENT>
<DOCUMENT>([\w\W]*?)<\/DOCUMENT>
and our problem would likely be solved. The m flag used in the test below only changes how ^ and $ behave, so it has no effect on these patterns.
Test
import re
regex = r"<DOCUMENT>([\s\S]*?)<\/DOCUMENT>"
test_str = ("ABCDE<DOCUMENT>\n"
"Lorem ipsum\n\n\n\n"
"dolor sit amet</DOCUMENT>\n"
"EFGHIJK.")
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
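The lazy quantifier (*?) in the expressions above starts to matter once the input holds more than one block. A small sketch, using a made-up input with two <DOCUMENT> blocks:

```python
import re

# Hypothetical input containing two separate blocks.
text = "<DOCUMENT>one</DOCUMENT> gap <DOCUMENT>two\nlines</DOCUMENT>"

# Greedy: .* runs to the last closing tag and swallows everything between.
greedy = re.findall(r"<DOCUMENT>(.*)</DOCUMENT>", text, flags=re.DOTALL)

# Lazy: .*? stops at the first closing tag, giving one result per block.
lazy = re.findall(r"<DOCUMENT>(.*?)</DOCUMENT>", text, flags=re.DOTALL)

print(greedy)
print(lazy)
```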


Can I include conditional logic in VS Code snippets?

I would like to write a snippet in VS Code that writes a "switch" expression (in Javascript), but one where I can define the number of cases.
Currently there is a snippet that produces the outline of a switch expression with 1 case, and allows you to tab into the condition, case name, and the code contained within.
I want to be able to type "switch5" ("5" being any number) and a switch with 5 cases to be created, where I can tab through the relevant code within.
I know the snippets are written in a JSON file, can I include such conditional logic in this, or is it not possible?
Thanks!
The short answer is that you cannot do this in a standard VS Code snippet, because a snippet cannot dynamically evaluate input beyond its designated variables. There are some limited workarounds, which I'll mention next.
You might type your various case values first and then trigger a snippet that transforms them into a switch statement (I and others have written answers on SO about this). It is sort of doing it backwards, but it might be possible.
There are extensions, however, that do allow you to evaluate javascript right in a snippet or setting and output the result. macro-commander is one such extension. I'll show another simpler extension doing what you want: HyperSnips.
In your javascript.hsnips:
snippet `switch(\d)` "add number of cases to a switch statement" A
``
let numCases = Number(m[1])   // 'm' is an array of regex capture groups
let caseString = ''
if (numCases) {   // if not 'switch0'
    let tabStopNum = 1
    caseString = `switch (\${${tabStopNum++}:key}) {\n`
    for (let index = 0; index < m[1]; index++) {
        caseString += `\tcase \${${tabStopNum++}:value}:\n\t\t\$${tabStopNum++}\n`
        caseString += '\t\tbreak;\n\n'
    }
    caseString += '\tdefault:\n'
    caseString += '\t\tbreak;\n}\n'
}
rv = `${caseString}`   // return value
``
endsnippet
The trickiest part was getting the unknown number of tabstops to work correctly. This is how I did it:
\${${tabStopNum++}:key}
which will resolve to ${n:defaultValue} where n gets incremented every time a tabstop is inserted. And :defaultValue is an optional default value to that tabstop. If you don't need a defaultValue just use \$${tabStopNum++} there.
See https://stackoverflow.com/a/62562886/836330 for more info on how to set up HyperSnips.
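To see what the tabstop numbering produces, here is a rough Python sketch of the same string-building logic, independent of HyperSnips. The function name build_switch_snippet is hypothetical; it only mirrors the snippet body above so the generated text can be inspected.

```python
def build_switch_snippet(num_cases: int) -> str:
    """Build a VS Code style snippet body for a switch with num_cases cases."""
    if num_cases <= 0:          # mirrors the 'switch0' guard above
        return ""
    tab_stop = 1
    lines = [f"switch (${{{tab_stop}:key}}) {{"]
    tab_stop += 1
    for _ in range(num_cases):
        # Each case uses two tabstops: one with a default value, one bare.
        lines.append(f"\tcase ${{{tab_stop}:value}}:")
        lines.append(f"\t\t${tab_stop + 1}")
        lines.append("\t\tbreak;")
        lines.append("")
        tab_stop += 2
    lines.append("\tdefault:")
    lines.append("\t\tbreak;")
    lines.append("}")
    return "\n".join(lines) + "\n"


print(build_switch_snippet(2))
```

With two cases the tabstops run from $1 (the switch key) through $5, so tabbing visits the key, then each case value and case body in order.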

How to replace numbered list elements with an identifier containing the number

I have gotten amazing help here today!
I'm trying to do something else. I have a numbered list of questions in a Google Doc, and I'd like to replace the numbers with something else.
For example, I'd like to replace the numbers in a list such as:
9. The Earth is closest to the Sun in which month of the year?
~July
~June
=January
~March
~September
10. In Australia (in the Southern Hemisphere), when are the days the shortest and the nights the longest?
~in late December
~in late March
=in late June
~in late April
~days and nights are pretty much the same length throughout the year in Australia
With:
::Q09:: The Earth is closest to the Sun in which month of the year?
~July
~June
=January
~March
~September
::Q11:: In Australia (in the Southern Hemisphere), when are the days the shortest and the nights the longest?
~in late December
~in late March
=in late June
~in late April
~days and nights are pretty much the same length throughout the year in Australia
I've tried using suggestions from previous posts but have come up only with things such as the following, which doesn't seem to work.
Thank you for being here!!!
function questionName2(){
  var body = DocumentApp.getActiveDocument().getBody();
  var text = body.editAsText();
  var pattern = "^[1-9]";
  var found = body.findText(pattern);
  var matchPosition = found.getStartOffset();
  while(found){
    text.insertText(matchPosition,'::Q0');
    found = body.findText(pattern, found);
  }
}
Regular expressions
Text.findText(searchPattern) uses a string that will be parsed as a regular expression using Google's RE2 library for the searchPattern. Using a string in this way requires we add an extra backslash whenever we are removing special meaning from a character, such as matching the period after the question number, or using a character matching set like \d for digits.
^\\s*\\d+?\\. will match a set of digits, of any non-zero length, that begin a line, with any length (including zero) of leading white space. \d is for digits, + is one or more, and the combination +? makes the match lazy. The lazy part is not required here, but it's my habit to default to lazy to avoid bugs. An alternative would be \d{1,2} to specifically match 1 to 2 digits.
To extract just the digits from the matched text, we can use a JavaScript RegExp object. Unlike the Doc regular expression, this regular expression will not require extra backslashes and will allow us to use capture groups using parentheses.
^\s*(\d+?)\. is almost the same as above, except no extraneous slashes and we will now "save" the digits so we can use them in our replacement string. We mark what we want to save using parentheses. Because this will be a normal JavaScript regular expression literal, we will wrap the whole thing in slashes: /^\s*(\d+?)\./, but the starting and ending / are just to indicate this is a RegExp literal.
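The capture-group replacement described above can be tried outside of Google Docs. This is a minimal Python sketch of the same regular expression, not Apps Script; the function name renumber is hypothetical, and zfill stands in for the zero padding.

```python
import re

def renumber(line: str) -> str:
    """Replace a leading 'N.' with '::QNN::', zero-padded to two digits."""
    return re.sub(
        r"^\s*(\d+?)\.",
        lambda m: "::Q" + m.group(1).zfill(2) + "::",
        line,
    )

print(renumber("9. The Earth is closest to the Sun in which month of the year?"))
print(renumber("~July"))  # no leading number, so the line is left unchanged
```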
Text elements and text strings
Text.findText can return more than just the exact match we asked for: it returns the entire element that contains the text plus indices for what the regular expression matched. In order to perform search and replace with capture groups, we have to use the indices to delete the old text and then insert the new text.
The following assignments get us all the data we need to do the search and replace: first the element, then the start & stop indices, and finally extracting the matched text string using slice (note that slice uses an exclusive end, whereas the Doc API uses an inclusive end, hence the +1).
var found = DocumentApp.getActiveDocument().getBody().findText(pattern);
var matchStart = found.getStartOffset();
var matchEnd = found.getEndOffsetInclusive();
var matchElement = found.getElement().asText();
var matchText = matchElement.getText().slice(matchStart, matchEnd + 1);
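The inclusive versus exclusive end index is easy to get wrong, so here is a small Python sketch of the same bookkeeping on a plain string. Python slices, like JavaScript's slice, use an exclusive end; the helper replace_span is hypothetical and simply imitates a delete followed by an insert.

```python
import re

def replace_span(text: str, start: int, end_inclusive: int, replacement: str) -> str:
    """Delete text[start..end_inclusive] and insert replacement at start."""
    return text[:start] + replacement + text[end_inclusive + 1:]

line = "9. The Earth is closest to the Sun in which month of the year?"
m = re.search(r"^\s*\d+?\.", line)

start = m.start()
end_inclusive = m.end() - 1                   # m.end() is exclusive
match_text = line[start:end_inclusive + 1]    # hence the +1 when slicing

updated = replace_span(line, start, end_inclusive, "::Q09::")
print(match_text)
print(updated)
```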
Caveats
As Tanaike pointed out in the comments, this assumes the numbers are not List Items (which generate their numbers automatically) but numbers you typed in manually. If you are using an automatically generated list of numbers, the API does not allow you to edit the format of the numbering.
This answer also assumes that in the example, when you mapped "9." to "::Q09::" and "10." to "::Q11::", that the mapping of 10 to 11 was a typo. If this was intended, please update the question to clarify the rules for why the numbering might change.
Also assumed is that the numbers are supposed to be less than 100, given the example zero padding of "Q09". The example should be flexible enough to allow you to update this to a different padding scheme if needed.
Full example
Since the question did not use any V8 features, this assumes the older Rhino environment.
/**
 * Replaces "1." with "::Q01::"
 */
function updateQuestionNumbering(){
  var text = DocumentApp.getActiveDocument().getBody();
  var pattern = "^\\s*\\d+?\\.";
  var found = text.findText(pattern);
  while(found){
    var matchStart = found.getStartOffset();
    var matchEnd = found.getEndOffsetInclusive();
    var matchElement = found.getElement().asText();
    var matchText = matchElement.getText().slice(matchStart, matchEnd + 1);
    matchElement.deleteText(matchStart, matchEnd);
    matchElement.insertText(matchStart, matchText.replace(/^\s*(\d+?)\./, replacer));
    found = text.findText(pattern, found);
  }
  /**
   * @param {string} _ - full match (ignored)
   * @param {string} number - the sequence of digits matched
   */
  function replacer(_, number) {
    return "::Q" + padStart(number, 2, "0") + "::";
  }
  // use String.prototype.padStart() in V8 environment
  // above usage would become `number.padStart(2, "0")`
  function padStart(string, targetLength, padString) {
    while (string.length < targetLength) string = padString + string;
    return string;
  }
}

How to write regex expression for this type of text?

I'm trying to extract the price from the following HTML.
<td>$75.00/<span class='small font-weight-bold text-
danger'>Piece</span></small> *some more text here* </td>
What is the regex expression to get the number 75.00?
Is it something like:
<td>$*/<span class='small font-weight-bold text-danger'>
The dollar sign is a special character in regex, so you need to escape it with a backslash. Also, you only want to capture digits, so you should use character classes. Note that the sample has a slash between the price and <span, so the pattern has to include it.
<td>\$(\d+[.]\d\d)/<span
As the other respondent mentioned, regex changes a bit with each implementing language, so you may have to make some adjustments, but this should get you started.
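As a quick check of the escaping advice, here is a minimal Python sketch. It assumes the class attribute sits on one line, and it includes the slash that follows the price in the sample HTML.

```python
import re

html = ("<td>$75.00/<span class='small font-weight-bold text-danger'>"
        "Piece</span></small> *some more text here* </td>")

# \$ matches a literal dollar sign; the group captures digits, a dot, two digits.
m = re.search(r"<td>\$(\d+\.\d\d)/<span", html)
price = m.group(1) if m else None
print(price)
```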
I think you can go with /[0-9]+\.[0-9]+/.
[0-9] matches a single digit. On its own, it would match the 7 in this example.
The + afterwards says that it should look for one or more digits. So [0-9]+ will match 75. It stops there because the character after the 5 is a period.
Next we add a period to the regex and make sure it's escaped. An unescaped period means "any character"; escaped, it matches only a literal period. So we have /[0-9]+\./ so far.
Then we just need to add [0-9]+ so it will find the remaining digit(s) too.
It's important that you don't give it the global flag, like this: /[0-9]+\.[0-9]+/g, unless you want it to find more than just the first number/period combination.
There is another regex you can use. It uses the parentheses to group the part you're looking for like this: /<td>\$(.+)<span/
It will match everything from <td>$ up to <span. From there you can filter out the group/part you're looking for. See the examples below.
// JavaScript
const text = "<td>$something<span class='small font-weight..."
const regex = /<td>\$(.+)<span/g
const match = regex.exec(text) // this will return an Array
console.log( match[1] ) // prints out "something"
# Python
import re
text = "<td>$something<span class='small font-weight..."
regex = re.compile(r"<td>\$(.+)<span")
print( regex.search(text).group(1) )  # prints out "something"
As an alternative you could use a DOMParser.
Wrap your <td> inside a table, use for example querySelector to get your element and get the first node from the childNodes.
That would give you $75.00/.
To remove the $ and the trailing forward slash you could use slice or use a regex like \$(\d+\.\d+) and get the value from capture group 1.
let html = `<table><tr><td>$75.00/<span class='small font-weight-bold text-
danger'>Piece</span></small> *some more text here* </td></tr></table>`;
let parser = new DOMParser();
let doc = parser.parseFromString(html, "text/html");
let result = doc.querySelector("td");
let textContent = result.childNodes.item(0).nodeValue;
console.log(textContent.slice(1, -1));
console.log(textContent.match(/\$(\d+\.\d+)/)[1]);
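The same parse-first idea can be sketched in Python with the standard library's html.parser instead of DOMParser. The class name FirstCellText is hypothetical, and the class attribute is assumed to sit on one line; the parser just records the first text node found inside a <td>.

```python
from html.parser import HTMLParser

class FirstCellText(HTMLParser):
    """Record the first text node that appears inside a <td> element."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.first_text = None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and self.first_text is None:
            self.first_text = data

html = ("<table><tr><td>$75.00/<span class='small font-weight-bold text-danger'>"
        "Piece</span></small> *some more text here* </td></tr></table>")

parser = FirstCellText()
parser.feed(html)
parser.close()

# Strip the leading "$" and the trailing "/" from the first text node.
price = parser.first_text.strip().lstrip("$").rstrip("/")
print(price)
```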

insert html into string at several positions at the same time

So I have a peptide, which is a string of letters corresponding to amino acids.
Say the peptide is
peptide_sequence = "VEILANDQGNR"
and it has a modification on L at position 4 and on R at position 11.
I would like to insert <span class="modified_aa"> and </span> before and after the letters at those positions, all at the same time.
Here is what I tried:
My modifications are stored in an array pep_mods of objects modification containing an attribute location with the position, in this case 4 and 11
pep_mods.each do |m|
  peptide_sequence.gsub(peptide_sequence[m.position.to_i-1], "<span class=\"mod\">#{@peptide_sequence[m.location.to_i-1]}</span>" )
end
But since there are two modifications, after the first span tag is inserted the positions in the string all shift.
How could I achieve what I intend to do? I hope that was clear.
You should work backwards: make the modifications starting with the last one. That way the indices of the earlier modifications are unchanged.
You might need to sort the array of indices in reverse order - then you can use the code you currently have.
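The work-backwards idea can be sketched in Python as follows. The function name mark_positions and the plain list of 1-based positions are assumptions made for illustration; the original stores positions on modification objects.

```python
def mark_positions(sequence: str, positions, css_class: str = "mod") -> str:
    """Wrap the letters at the given 1-based positions in <span> tags."""
    result = sequence
    # Highest position first, so earlier indices are not shifted by inserts.
    for pos in sorted(set(positions), reverse=True):
        i = pos - 1
        wrapped = f'<span class="{css_class}">{result[i]}</span>'
        result = result[:i] + wrapped + result[i + 1:]
    return result

print(mark_positions("VEILANDQGNR", [4, 11]))
```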
Floris's answer is correct, but if you want to do it the hard way (O(n^2) instead of O(n log n)), here is the basic idea.
Instead of relying on gsub you can iterate over the characters checking if each has an index corresponding to one of the modifications. If the index matches, perform the modification. Otherwise, keep the original character.
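# Strings have no each_with_index, so iterate over the characters instead.
modified = peptide_sequence.each_char
                           .with_index
                           .map do |c, i|
  # Locations are 1-based, string indices are 0-based.
  if pep_mods.any? { |m| m.location.to_i - 1 == i }
    %Q{<span class="mod">#{c}</span>}
  else
    c
  end
end.join('')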
Ok, just in case this is helpful for anyone else, this is how I finally did it:
I first converted the peptide sequence to an array :
pep_seq_arr = peptide_sequence.split("")
then used each_with_index as Casey mentioned
pep_seq_arr.each_with_index do |aa, i|
  pep_mods.each do |m|
    pep_seq_arr[i] = "<span class='mod'>#{aa}</span>" if i == m.location.to_i-1
  end
end
and finally joined the array:
pep_seq_arr.join
It was easier than I first thought

Why does my use of Perl's split function not split?

I'm trying to split an HTML document into its head and body:
my @contentsArray = split( /<\/head>/is, $fileContents, 1);
if( scalar @contentsArray == 2 ){
    $bodyContents = $dbh->quote(trim($contentsArray[1]));
    $headContents = $dbh->quote(trim($contentsArray[0]) . "</head>");
}
That is what I have. $fileContents contains the HTML code. When I run this, it doesn't split. Does anyone know why?
The third parameter to split is how many results to produce, so if you want to apply the expression only once, you would pass 2.
Note that this does actually limit the number of times the pattern is used to split the string (to one fewer than the number passed), not just limit the number of results returned, so this:
print join ":", split /,/, "a,b,c", 2;
outputs:
a:b,c
not:
a:b
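For comparison, here is a small Python sketch of the same split. Be aware that the counting differs: Perl's third argument is the maximum number of fields, while Python's maxsplit is the maximum number of splits, so Perl's limit of 2 corresponds to maxsplit=1. The sample HTML string is made up.

```python
import re

# Equivalent of Perl's split /,/, "a,b,c", 2
parts = re.split(r",", "a,b,c", maxsplit=1)
print(":".join(parts))

# Splitting a (made-up) document once at the closing head tag, ignoring case.
html = "<html><HEAD><title>t</title></HEAD><body>b</body></html>"
pieces = re.split(r"</head>", html, maxsplit=1, flags=re.IGNORECASE)
if len(pieces) == 2:
    head_contents = pieces[0] + "</head>"
    body_contents = pieces[1]
    print(head_contents)
    print(body_contents)
```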
Sorry, figured it out. I thought the 1 was how many times it would find the expression, not a limit on the number of results. Changed it to 2 and it works.