-- Converts tabs to spaces
function detab(text)
    local tab_width = 4
    local function rep(match)
        local spaces = -match:len()
        print("match:"..match)
        while spaces<1 do spaces = spaces + tab_width end
        print("Found "..spaces.." spaces")
        return match .. string.rep(" ", spaces)
    end
    text = text:gsub("([^\n]-)\t", rep)
    return text
end
str=' thisisa string'
--thiis is a string
print("length: "..str:len())
print(detab(str))
print(str:gsub("\t"," "))
I have this piece of code from markdown.lua that converts tabs to spaces (as its name suggests). What I have managed to figure out is that it searches from the beginning of the string until it finds a tab and passes the matched substring to the 'rep' function. It does this repeatedly until there are no more matches.
My problem is in trying to figure out what the rep function is doing, especially in the while loop. Why does the loop stop at 1? Why does it count up?
Surprisingly, it counts the number of spaces in the string; how exactly is a mystery.
If you compare its output with the output from the last gsub replacement, you'll find that they are different. Detab maintains the alignment of the characters while the gsub replacement doesn't. Why is that so?
Bonus question. When I switch on whitespace display in SciTE, I can see that the tab before the 't' is longer than the tab before the third 's'. Why are they different?
From analyzing the rep function, this is what it appears to be doing. First, it takes the length of the match string passed in and makes it negative (i.e. multiplies it by -1). In the while loop it keeps adding tab_width to spaces until it becomes positive.
It might be easier to visualize this using a number line:
<--|----|-------|----|----|----|----|----|----|----|----|--->
  -n -spaces   -2   -1    0    1    2                   n
In essence, the loop is trying to figure out how many "tab_widths" can fit into spaces before it "overflows". Here it's using the transition from 0 to 1 as the cutoff point. After the loop, spaces holds the amount by which it overflowed.
In fact, the while loop is mimicking a mathematical operation you might know as modulo. In other words the inner rep function can be rewritten as this:
local function rep(match)
    local spaces = tab_width - match:len() % tab_width
    return match .. string.rep(" ", spaces)
end
This differs from the outer str:gsub("\t", " "), which indiscriminately substitutes every tab character with 4 spaces. In the detab function, on the other hand, the number of spaces that replaces the tab character depends on the length of the matching capture.
eg.
matching length is 1, replace tab with 3 spaces
matching length is 2, replace tab with 2 spaces
matching length is 3, replace tab with 1 space
matching length is n, replace tab with tab_width - (n % tab_width) spaces
etc.
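For anyone who wants to experiment with this outside Lua, here is a minimal Python sketch of the same idea. It is a port I wrote for illustration, not code from markdown.lua: the lazy Lua pattern ([^\n]-) becomes [^\n]*? and the while loop is replaced by the modulo form shown above.

```python
import re

TAB_WIDTH = 4  # same value as tab_width in the Lua snippet


def detab(text, tab_width=TAB_WIDTH):
    """Expand tabs so that the text after each tab lands on a tab stop."""
    def rep(m):
        prefix = m.group(1)  # text between the previous tab (or line start) and this tab
        # Pad up to the next multiple of tab_width; a full tab_width
        # when the prefix already ends exactly on a tab stop.
        spaces = tab_width - len(prefix) % tab_width
        return prefix + " " * spaces
    return re.sub(r"([^\n]*?)\t", rep, text)


def naive_detab(text, tab_width=TAB_WIDTH):
    """Replace every tab with a fixed number of spaces, ignoring alignment."""
    return text.replace("\t", " " * tab_width)


print(repr(detab("ab\tc\td")))        # 'ab  c   d'
print(repr(naive_detab("ab\tc\td")))  # 'ab    c    d'
```

With detab, "c" and "d" start on columns that are multiples of four; with the naive version they just drift by four spaces each time.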
To answer the bonus question:
Tab characters align to tabstops. A tabstop is eight characters. The first tab starts on column six so it needs to pad three spaces. The second tab starts on column 16 so it only needs to be one space wide.
The loop stops when spaces becomes a positive number because the loop has been adding spaces in tab_width increments until it has enough spaces to be longer than the matched text. When it then combines that number of spaces with the matched text it has constructed a string which is padded to the correct tabstop.
That's also why the gsub differs. The gsub isn't treating tabs as tabstop characters but rather as four spaces. So the second tab doesn't pad to the tabstop but instead expands to four spaces.
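The tabstop arithmetic for the bonus question can be checked with a few lines of Python. This is only a sketch; the function name is made up, and it assumes 1-based column numbers and the eight-character tabstop mentioned above.

```python
def tab_display_width(column, tabstop=8):
    """Columns a tab occupies when it starts at the given 1-based column."""
    return tabstop - (column - 1) % tabstop


print(tab_display_width(6))   # 3
print(tab_display_width(16))  # 1
```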
Related
I'm building a word unscrambler using MySQL. Think of it like the game SCRABBLE: there is a string representing the letter tiles, and the query should return all words that can be constructed from these letters. I was able to achieve that using this query:
SELECT * FROM words
WHERE word REGEXP '^[hello]{2,}$'
AND NOT word REGEXP 'h(.*?h){1}|e(.*?e){1}|l(.*?l){2}|l(.*?l){2}|o(.*?o){1}'
The first part of the query makes sure that the output words are constructed from the letter tiles; the second part takes care of the letter occurrences. So the above query will return words like: hello, hell, hole, etc.
My issue is when there is a blank tile (a wildcard). For example, if the string was "he?lo", the "?" can be replaced with any letter, so it should output words like: helio, helot.
Can someone suggest a modification to the query that will make it support wildcards and also take care of the occurrences? (There can be up to 2 blank tiles.)
I've got something that comes close. With a single blank tile, use:
SELECT * FROM words
WHERE word REGEXP '^[acre]*.[acre]*$'
AND word NOT REGEXP 'a(.*?a){1}|r(.*?r){1}|c(.*?c){1}|e(.*?e){1}'
with 2 blank tiles use:
SELECT * FROM words
WHERE word REGEXP '^[acre]*.[acre]*.[acre]*$'
AND word NOT REGEXP 'a(.*?a){1}|r(.*?r){1}|c(.*?c){1}|e(.*?e){1}'
The . in the first regexp allows a character that isn't one of the tiles with a letter on it.
The only problem with this is that the second regexp prevents duplicates of the lettered tiles, but a blank should be allowed to duplicate one of the letters. I'm not sure how to fix this. You could add 1 to the counts in {}, but then it would allow you to duplicate multiple letters even though you only have one blank tile.
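If doing the check outside SQL is an option, counting letters avoids the duplicate problem, because a blank is only spent on letters the lettered tiles cannot cover. The following is a rough Python sketch of that idea, not a MySQL solution; the function name, the "?" convention for blanks and the min_len default (taken from the {2,} in the question) are my own assumptions.

```python
from collections import Counter


def can_build(word, tiles, min_len=2):
    """Return True if `word` can be formed from `tiles`; '?' marks a blank tile."""
    blanks = tiles.count("?")
    available = Counter(tiles.replace("?", ""))
    needed = Counter(word)
    # Counter subtraction keeps only positive counts, i.e. the letters
    # that the lettered tiles cannot supply and a blank must stand in for.
    missing = sum((needed - available).values())
    return len(word) >= min_len and missing <= blanks


for candidate in ["hello", "helio", "helot", "hellos"]:
    print(candidate, can_build(candidate, "he?lo"))
```

With the tiles "he?lo" this accepts hello (the blank duplicates the l), helio and helot, and rejects hellos, which would need two blanks.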
A possible starting point:
Sort the letters in the words; sort the letters in the tiles (eg, "ehllo", "acer", "aerr").
That will avoid some of the ORing, but still has other complexities.
If this is really Scrabble, what about the need to attach to an existing letter or letters? And do you primarily want to find a way to use all 7 letters?
I would like to regex replace Plus in the below text, but only when it's not wrapped in a header tag:
<h4 class="Somethingsomething" id="something">Plus plan</h4>The <b>Plus</b> plan starts at $14 per person per month and comes with everything from Basic.
In the above I would like to replace the second "Plus" but not the first.
My regex attempt so far is:
(?!<h\d*>)\bPlus\b(?!<\\h>)
Meaning:
Do not capture the following if it is inside an <h + 1 digit and 0 or more characters, ending with a closing <\h>
Capture only if the group "Plus" is surrounded by spaces or white space
However - this captures both occurrences. Can someone point out my mistake and correct this?
I want to use this in VBA but should be a general regex question, as far as I understand.
Somewhat related but not addressing my problem in regex
Not relevant, as not RegEx
You can use
\bPlus\b(?![^>]*<\/h\d+>)
To use the match inside the replacement pattern, use the $& backreference in your VBA code.
Details:
\bPlus\b - a whole word Plus
(?![^>]*<\/h\d+>) - a negative lookahead that fails the match if, immediately to the right of the current location, there are
[^>]* - zero or more chars other than >
<\/h - </h string
\d+ - one or more digits
> - a > char.
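To see the lookahead at work on the sample text, here is a small Python sketch (Python's re module rather than VBA, and the replacement word "Pro" is just a placeholder I picked):

```python
import re

# Same pattern as above; the slash needs no escaping in Python.
HEADER_SAFE_PLUS = re.compile(r"\bPlus\b(?![^>]*</h\d+>)")


def replace_outside_headers(html, replacement):
    """Replace the word Plus unless a closing header tag follows before any '>'."""
    return HEADER_SAFE_PLUS.sub(replacement, html)


sample = ('<h4 class="Somethingsomething" id="something">Plus plan</h4>'
          'The <b>Plus</b> plan starts at $14 per person per month.')
print(replace_outside_headers(sample, "Pro"))
```

The header keeps "Plus plan" and only the bold "Plus" is replaced. Note the limitation of this approach: because [^>]* cannot cross a ">", a header that contains another tag between "Plus" and the closing </hN> would not be protected.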
I want to create a pattern for an HTML input field that needs to have at least 10 numbers in it and may also have spaces and a plus sign on top of that, but it's not required.
It's important that numbers and spaces can be mixed though. Also, the whole field can only have 17 characters all in all.
I'm not sure if it's even possible. I started doing something like that:
pattern="[0-9+\s]{10,17}*"
But like this, it's not guaranteed that there are at least 10 numbers.
Thanks in advance! Hope the question doesn't exist already, I looked but couldn't find it.
You can use
pattern="(?:[+\s]*\d){10,17}[+\s]*"
The regex matches
(?:[+\s]*\d){10,17} - ten to seventeen occurrences of zero or more + or whitespaces and then a digit
[+\s]* - zero or more + or whitespaces.
Note the pattern is anchored by default (it is wrapped with ^(?: and )$), so nothing else is allowed.
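A quick way to try the pattern is a Python sketch like the one below. re.fullmatch mimics the implicit anchoring of the pattern attribute. The pattern on its own counts digits, not total characters, so the 17-character limit is checked separately here; that extra check and the function name are my additions (in HTML the maxlength attribute could play that role).

```python
import re

PHONE_PATTERN = re.compile(r"(?:[+\s]*\d){10,17}[+\s]*")
MAX_LENGTH = 17  # total field length from the question


def is_valid_phone(value):
    """At least 10 digits, optional + and spaces, at most 17 characters overall."""
    return len(value) <= MAX_LENGTH and PHONE_PATTERN.fullmatch(value) is not None


for value in ["+49 123 456 7890", "123 456", "1 2 3 4 5 6 7 8 9 0 1"]:
    print(repr(value), is_valid_phone(value))
```

The last value has enough digits for the regex but is 21 characters long, so only the length check rejects it.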
I'm attempting to parse and format some text from an HTML file into Word. I'm doing this by capturing each paragraph into an array and then writing it into the word document one paragraph at a time. However, there are superscripted references sprinkled throughout the text. I'm looking for a way to superscript these references in the new Word file and thought I would use regex and split to make this work. Here is an example paragraph:
$p = "This is an example sentence.1 The number is a reference note that should be superscripted and can be one or two digits long."
Here is the code I tried to split and select the digit(s):
[regex]::Split($p,"(\d{1,2})")
This works for single and double digits. However, if there are more than two digits, it still splits it, but moves the extra numbers to the next line. Like so:
This is an example sentence.
10
0
The number is a reference note that should be superscripted and can be one or two digits long.
This is important because there are sometimes larger numbers (3-10 digits) in the text that I don't want to split on. My goal is to take a block of text with reference note numbers and separate out the notes so I can perform formatting functions on them when I write it out to the Word file. Something like this (untested):
$paragraphs | % {
    $a = @([regex]::Split($_,"(\d{1,2})"))
    $a | % {
        $text = $_
        if ($text -match "(\d{1,2})")
        {
            $objSelection.Font.SuperScript = 1
            $objSelection.TypeText("$text")
            $objSelection.Font.SuperScript = 0
        }
        Else
        {
            $objSelection.Style="Normal"
            $objSelection.TypeText("$text")
        }
    }
    $text = "`v"
    $objSelection.TypeText("$text")
    $objSelection.TypeParagraph()
}
EDIT:
The following regex expression works when I test it with the above loop in its own script:
"(?<![\d\s])(\d{1,2})(?!\d)"
However, when I run it in the parent script, I get the following error:
Cannot find an overload for "Split" and the argument count: "2"
$a = [regex]::Split($_,"(?<![\d\s])(\d{1,2})(?!\d)")
How would I go about troubleshooting this error?
You may use
[regex]::Split($p,"(?<![\d\s])(\d{1,2})(?!\d)\s*")
It only matches and captures one or two digits that are neither followed nor preceded with another digit, and not preceded with any whitespace char. Any trailing whitespace is matched with \s* and is thus removed from the items that are added into the resulting array.
Details
(?<![\d\s]) - a negative lookbehind that fails the match if, immediately to the left of the current position, there is a digit or a whitespace
(\d{1,2}) - Group 1: one or two digits
(?!\d) - that cannot be followed with another digit (it is a negative lookahead that fails the match if its pattern matches immediately to the right of the current location)
\s* - 0+ whitespaces.
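The splitting behaviour can be reproduced with Python's re.split, which also keeps captured groups in the result. This is only a sketch for checking the pattern, not PowerShell code, and the helper name is made up:

```python
import re

NOTE_PATTERN = re.compile(r"(?<![\d\s])(\d{1,2})(?!\d)\s*")


def split_notes(paragraph):
    """Split a paragraph around one- or two-digit reference notes, keeping the notes."""
    return NOTE_PATTERN.split(paragraph)


print(split_notes("This is an example sentence.1 The number is a reference note."))
print(split_notes("This is an example sentence.100 No split happens here."))
print(split_notes("There were 12 items in 2019."))
```

The first call yields the sentence, the note "1" and the following text as three items. The other two return the paragraph unsplit, because "100" has more than two digits and "12" and "2019" are preceded by a space.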
People,
Currently I have a string MySQL field, Class, on a table.
It's a code plus a description. I need to extract the description only (without whitespace at the beginning of the string).
The data in this field is formed as follows:
N.N Description (without any digits or dots) or N.N. Description (without any digits or dots)
Where N is a number between 1 and 10.
I've tried this multiple replace, but two cases remain with one leading whitespace that I could not remove:
' Suspension'
and
' Reduction'
My multiple replace is:
REPLACE(REPLACE(REPLACE(TRIM(BOTH ' ' FROM (REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(class,'.',''),'1',''),'0',''),'2',''),'3',''),'4',''),'5',''),'6',''),'7',''),'8',''),'9',''))),'\r',''),'\t',''),'\n','')
What could this leading whitespace be, so that I can replace it? What could be missing?
Or does anyone have a better idea how to solve this?
Probably you are replacing with whitespace, and thus the modified string ends up with it. Why can't you use the TRIM() function on your final replaced string to get rid of those leading spaces?
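If the cleanup can happen in application code, a single regular expression that strips the leading code together with any surrounding whitespace may be simpler than the nested REPLACE calls. Below is a hedged Python sketch; the function name is made up, and the guess that the stubborn character is something like a non-breaking space (which a plain TRIM of ' ' would not remove) is only an assumption. In Python, \s on str patterns also matches non-breaking spaces.

```python
import re

# Leading whitespace, the N.N or N.N. code, then any whitespace after it.
CODE_PREFIX = re.compile(r"^\s*\d+\.\d+\.?\s*")


def strip_code(value):
    """Return the description without the leading N.N / N.N. code."""
    return CODE_PREFIX.sub("", value)


print(repr(strip_code("1.2 Suspension")))
print(repr(strip_code("10.3.\u00a0Reduction")))  # non-breaking space after the code
```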