>>> a = '{"key1": "aaaaaaaaaaaaaaaaa", "key2": "bbbbbbbbbbbbbbbbbbbbbbbb"}'
>>> len(a)
64
>>> textwrap.wrap(a, 32, drop_whitespace=False)
['{"key1": "aaaaaaaaaaaaaaaaa", ', '"key2": ', '"bbbbbbbbbbbbbbbbbbbbbbb"}']
I was expecting
['{"key1": "aaaaaaaaaaaaaaaaa", "k', 'ey2": "bbbbbbbbbbbbbbbbbbbbbbb"}']
Am I missing something?
Your expectation is wrong, according to the official documentation:
Wraps the single paragraph in text (a string) so every line is at most width characters long. Returns a list of output lines, without final newlines.
[...]
Text is preferably wrapped on whitespaces and right after the hyphens in hyphenated words; only then will long words be broken if necessary, unless TextWrapper.break_long_words is set to false.
Your expected output is literally broken off after 32 characters, whereas the actual output is split into segments of 30, 8, and 27 characters long – broken only on the whitespace characters in the original string.
The second segment is much shorter than the others because the first segment plus the next non-whitespace run "key2": is longer than 32 characters, and this short run plus the following run is also longer than 32 characters. A break in the middle of a non-whitespace run occurs only when there is no possibility at all to break on a space or hyphen.
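The difference between the two behaviours can be sketched in Python as follows. This is only an illustration: the string is rebuilt with 23 b's (an assumption, chosen because it gives the length of 64 reported in the question), and fixed_width_chunks is a hypothetical helper, not part of textwrap.

```python
import textwrap

# Assumption: 23 b's, so that len(a) == 64 as reported in the question.
a = '{"key1": "' + "a" * 17 + '", "key2": "' + "b" * 23 + '"}'


def fixed_width_chunks(text, width):
    """Cut text every `width` characters, ignoring word boundaries (hypothetical helper)."""
    return [text[i:i + width] for i in range(0, len(text), width)]


# textwrap breaks at whitespace, so lines can be shorter than the width.
wrapped = textwrap.wrap(a, 32, drop_whitespace=False)

# Plain slicing breaks strictly every 32 characters, even inside a word.
chunked = fixed_width_chunks(a, 32)

print(wrapped)
print(chunked)
```

If the "expected" output from the question is what you actually need, plain slicing as in fixed_width_chunks is probably closer to it than textwrap.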
I'm trying to write a query to identify which rows have special characters in them, but I want it to ignore spaces.
So far I've got
SELECT word FROM `games_hangman_words` WHERE word REGEXP '[^[:alnum:]]'
Currently this matches every row that contains any special character, including a space. What I want is for it to ignore the space character.
So if I have these rows
Alice
4 Kings
Another Story
Ene-tan
Go-Busters Logo
Lea's Request
I want it to match
Ene-tan, Go-Busters Logo and Lea's Request
Simply extend your character class:
... WHERE word REGEXP '[^[:alnum:] ]' ...
for only a "regular" space (ASCII 32) or
... WHERE word REGEXP '[^[:alnum:][:space:]]' ...
for all kinds of whitespace characters.
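To check the idea outside the database, here is a small Python sketch run against the sample rows from the question. Python's re module has no POSIX bracket classes such as [:alnum:], so the ASCII ranges A-Za-z0-9 stand in for it here; that substitution is an assumption and only fits ASCII data.

```python
import re

rows = [
    "Alice",
    "4 Kings",
    "Another Story",
    "Ene-tan",
    "Go-Busters Logo",
    "Lea's Request",
]

# Anything that is neither an ASCII letter/digit nor a plain space (ASCII 32).
special_except_space = re.compile(r"[^A-Za-z0-9 ]")

# Same idea, but tolerating every whitespace character, not just the space.
special_except_whitespace = re.compile(r"[^A-Za-z0-9\s]")

matches = [word for word in rows if special_except_space.search(word)]
print(matches)
```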
I want to update the page. All the sentences in Chinese contain extra spaces: every Chinese character has a space before it.
The page will be a mess if I search for \S to find all the extra spaces and then delete them all.
It will take a lot of time to search for \S to find every extra space in the code and then fix the specific Chinese characters one by one.
(I just saw that you are doing it in an editor. The following is in JavaScript. You can install Node.js and write a simple program to read in each line and replace it with the correct content, write it back out to a file. For example, Google for Node fs.)
You could use:
const s = "Oscar list 奧 斯 卡 提 名 名 單 出 爐 - 最 佳 導演全男班 today";
const result = s.replace(/([^\u0000-\u00FF])[ \t]*(?![\u0000-\u00FF])/gu, "$1");
console.log(result);
// stringify to show string:
console.log(JSON.stringify(result));
Basically, it says: if a character is outside the usual 8-bit extended ASCII range (it is captured as $1), is followed by any number of spaces or tabs, and what comes next is not an 8-bit extended ASCII character (this is only a lookahead, so it is not consumed), then replace the whole match with just $1.
You can restrict it to 7-bit ASCII if you want, in which case the class becomes [^\u0000-\u007F].
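For readers who would rather not install Node.js, the same substitution can be sketched in Python. This is a port of the pattern above under the assumption that Python's re module treats the \uXXXX escapes and the lookahead the same way; the variable names are illustrative.

```python
import re

s = "Oscar list 奧 斯 卡 提 名 名 單 出 爐 - 最 佳 導演全男班 today"

# A character outside U+0000..U+00FF, optional spaces/tabs, and then
# (lookahead only) something that is not in U+0000..U+00FF either.
pattern = re.compile(r"([^\u0000-\u00FF])[ \t]*(?![\u0000-\u00FF])")

# Keep only the captured character, dropping the spaces that followed it.
result = pattern.sub(r"\1", s)
print(result)
```

Spaces next to Latin text or punctuation (for example around the hyphen) are left alone, because the lookahead rejects those positions.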
I'm attempting to parse and format some text from an HTML file into Word. I'm doing this by capturing each paragraph into an array and then writing it into the word document one paragraph at a time. However, there are superscripted references sprinkled throughout the text. I'm looking for a way to superscript these references in the new Word file and thought I would use regex and split to make this work. Here is an example paragraph:
$p = "This is an example sentence.1 The number is a reference note that should be superscripted and can be one or two digits long."
Here is the code I tried to split and select the digit(s):
[regex]::Split($p,"(\d{1,2})")
This works for single and double digits. However, if there are more than two digits, it still splits it, but moves the extra numbers to the next line. Like so:
This is an example sentence.
10
0
The number is a reference note that should be superscripted and can be one or two digits long.
This is important because there are sometimes larger numbers (3-10 digits) in the text that I don't want to split on. My goal is to take a block of text with reference note numbers and separate out the notes so I can perform formatting functions on them when I write it out to the Word file. Something like this (untested):
$paragraphs | % {
    $a = @([regex]::Split($_,"(\d{1,2})"))
    $a | % {
        $text = $_
        if ($text -match "(\d{1,2})")
        {
            $objSelection.Font.SuperScript = 1
            $objSelection.TypeText("$text")
            $objSelection.Font.SuperScript = 0
        }
        Else
        {
            $objSelection.Style="Normal"
            $objSelection.TypeText("$text")
        }
    }
    $text = "`v"
    $objSelection.TypeText("$text")
    $objSelection.TypeParagraph()
}
EDIT:
The following regular expression works when I test it with the above loop in its own script:
"(?<![\d\s])(\d{1,2})(?!\d)"
However, when I run it in the parent script, I get the following error:
Cannot find an overload for "Split" and the argument count: "2"
$a = [regex]::Split($_,"(?<![\d\s])(\d{1,2})(?!\d)")
How would I go about troubleshooting this error?
You may use
[regex]::Split($p,"(?<![\d\s])(\d{1,2})(?!\d)\s*")
It only matches and captures one or two digits that are neither followed nor preceded by another digit, and that are not preceded by a whitespace character. Any trailing whitespace is matched by \s* and is therefore removed from the items added to the resulting array.
Details
(?<![\d\s]) - a negative lookbehind that fails the match if, immediately to the left of the current position, there is a digit or a whitespace
(\d{1,2}) - Group 1: one or two digits
(?!\d) - that cannot be followed with another digit (it is a negative lookahead that fails the match if its pattern matches immediately to the right of the current location)
\s* - 0+ whitespaces.
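The behaviour of the pattern can also be checked with a short Python sketch. Python is used here only for illustration (the original code is PowerShell); re.split, like [regex]::Split, keeps the text of capturing groups in the result. The test strings other than the example paragraph are made up.

```python
import re

# One or two digits, not preceded by a digit or whitespace, not followed
# by a digit; trailing whitespace is consumed but not captured.
reference = re.compile(r"(?<![\d\s])(\d{1,2})(?!\d)\s*")

p = ("This is an example sentence.1 The number is a reference note that "
     "should be superscripted and can be one or two digits long.")

parts = reference.split(p)
for part in parts:
    print(repr(part))

# Longer numbers and numbers after a space are left untouched.
untouched_long = reference.split("Note.100 More text")
untouched_spaced = reference.split("There are 12 items.")
```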
-- Converts tabs to spaces
function detab(text)
    local tab_width = 4
    local function rep(match)
        local spaces = -match:len()
        print("match:"..match)
        while spaces<1 do spaces = spaces + tab_width end
        print("Found "..spaces.." spaces")
        return match .. string.rep(" ", spaces)
    end
    text = text:gsub("([^\n]-)\t", rep)
    return text
end
str=' thisisa string'
--thiis is a string
print("length: "..str:len())
print(detab(str))
print(str:gsub("\t"," "))
I have this piece of code from markdown.lua that converts tabs to spaces (as its name suggests). What I have managed to figure out is that it searches from the beginning of the string until it finds a tab and passes the matched substring to the 'rep' function. It does this repeatedly until there are no more matches.
My problem is in trying to figure out what the rep function is doing, especially in the while loop. Why does the loop stop at 1? Why does it count up?
Surprisingly, it counts the number of spaces in the string; how exactly is a mystery.
If you compare its output with the output from the last gsub replacement, you'll find that they are different. Detab maintains the alignment of the characters while the gsub replacement doesn't. Why is that so?
Bonus question. When I switch on whitespace in Scite, I can see that the tab before the 't' is longer than the tab before the third 's'. Why are they different?
From analyzing the rep function, this is what it appears to be doing. First, it takes the length of the match string passed in and negates it (i.e. multiplies it by -1). Then the while loop keeps adding tab_width to spaces until it becomes positive.
It might be easier to visualize this using a number line:
<--|----|-------|----|----|----|----|----|----|----|----|--->
-n -spaces -2 -1 0 1 2 n
In essence, the loop is trying to figure out how many "tab_widths" can fit into spaces before it "overflows". Here it uses the transition from 0 to 1 as the cutoff point. After the loop, spaces holds the amount it overflowed by.
In fact, the while loop is mimicking a mathematical operation you might know as modulo. In other words the inner rep function can be rewritten as this:
local function rep(match)
    local spaces = tab_width - match:len() % tab_width
    return match .. string.rep(" ", spaces)
end
This differs from the outer str:gsub("\t", " "), which indiscriminately substitutes every tab character with 4 spaces. In the detab function, on the other hand, the number of spaces that replaces a tab character depends on the length of the matching capture.
eg.
matching length is 1, replace tab with 3 spaces
matching length is 2, replace tab with 2 spaces
matching length is 3, replace tab with 1 space
matching length is n, replace tab with tab_width - (n % tab_width) spaces
etc.
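The equivalence between the loop and the modulo expression, and the rule in the list above, can be checked with a small Python sketch. It is a port written for illustration, not the markdown.lua code itself; the function names are made up, and the Lua pattern "([^\n]-)\t" is approximated with the lazy Python pattern ([^\n]*?)\t.

```python
import re

TAB_WIDTH = 4


def spaces_by_loop(length, tab_width=TAB_WIDTH):
    """Mirror of the while loop: start at -length, add tab_width until positive."""
    spaces = -length
    while spaces < 1:
        spaces += tab_width
    return spaces


def spaces_by_modulo(length, tab_width=TAB_WIDTH):
    """The same quantity computed directly with the modulo operator."""
    return tab_width - length % tab_width


def detab(text, tab_width=TAB_WIDTH):
    """Replace each tab with enough spaces to reach the next tab stop."""
    def rep(m):
        run = m.group(1)  # text between the previous tab (or line start) and this tab
        return run + " " * spaces_by_modulo(len(run), tab_width)
    return re.sub(r"([^\n]*?)\t", rep, text)


print(repr(detab("a\tbc\tdef\tg")))
print(repr("a\tbc\tdef\tg".replace("\t", " " * TAB_WIDTH)))
```

Measuring only the run since the previous tab is enough, because each earlier replacement already padded the line to a multiple of the tab width.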
To answer the bonus question:
Tab characters align to tabstops. A tabstop is eight characters. The first tab starts on column six so it needs to pad three spaces. The second tab starts on column 16 so it only needs to be one space wide.
The loop stops when spaces becomes a positive number because the loop has been adding spaces in 'indent' increments until it has enough spaces to be longer than the matched text. When it then combines that number of spaces with the matched text it has constructed a string which is padded to the correct tabstop.
That's also why the gsub differs. The gsub isn't treating tabs as tabstop characters but rather as four spaces. So the second tab doesn't pad to the tabstop but instead expands to four spaces.
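Python's built-in str.expandtabs applies the same tab stop rule, so it can illustrate the bonus answer. The sample line below is made up so that, counting columns from 1 with a tab stop every eight characters (both assumptions taken from the description above), the first tab falls on column 6 and the second on column 16.

```python
# Hypothetical line: 5 characters, a tab, 7 characters, a tab, 1 character.
line = "abcde\tfghijkl\tz"

# Tab stop semantics: each tab pads to the next multiple of 8.
aligned = line.expandtabs(8)

# Naive substitution: each tab becomes four spaces regardless of column.
naive = line.replace("\t", " " * 4)

print(repr(aligned))
print(repr(naive))
```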
I found a strange issue when browsing the older Ext documentation, http://extjs.cachefly.net/ext-3.2.1/docs/?class=Ext.grid.PropertyGrid
The layout of the inheritance box (top right) is somewhat shattered.
broken layout http://img339.imageshack.us/img339/374/bildschirmfoto20110427u.png
But after executing
var resblock = document.getElementById('docs-Ext.grid.PropertyGrid').getElementsByClassName('res-block-inner')[0];
resblock.innerHTML = resblock.innerHTML; // should be a no-op(?)
everything is okay.
okay layout http://img204.imageshack.us/img204/374/bildschirmfoto20110427u.png
How can that be? A bug in Firefox 4?
Edit
A minimal testcase: http://jsfiddle.net/uZ3eC/
Yes, it looks like a bug in the way Firefox 4 handles line endings.
The resblock element is a <pre> element containing a number of text nodes, which deal with new lines and indentations. When they are constructed through the scripts, they contain a CARRIAGE RETURN (U+000D) followed by a sequence of non-breaking spaces.
However, after running resblock.innerHTML = resblock.innerHTML; they now contain a LINE FEED (U+000A) followed by the non-breaking spaces.
It seems that Firefox 4 is only treating the line feed character as a line break, and rendering the parts of the class hierarchy on new lines.
Edit: What Boris said.
The HTML5 draft spec Section 8.2.2.3 Preprocessing the input stream says:
U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF) characters are treated specially. Any CR characters that are followed by LF characters must be removed, and any CR characters not followed by LF characters must be converted to LF characters. Thus, newlines in HTML DOMs are represented by LF characters, and there are never any CR characters in the input to the tokenization stage.
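The preprocessing rule quoted above can be sketched in a few lines of Python. This only illustrates the rule as quoted, not how any browser implements it; normalize_newlines is a hypothetical name.

```python
def normalize_newlines(text):
    """Apply the quoted rule: drop a CR that precedes an LF, turn any other CR into LF."""
    # CR LF pairs collapse to a single LF first ...
    text = text.replace("\r\n", "\n")
    # ... then every remaining (lone) CR becomes an LF.
    return text.replace("\r", "\n")


sample = "Object\r  Observable\r\n    Component\n      Panel"
normalized = normalize_newlines(sample)
print(repr(normalized))
```

This is consistent with what was observed: after the markup goes through the parser again via innerHTML, the lone CR characters come back as LF characters.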