Don't see special font characters in run.text (sometimes) - python-docx

I have a Word document that mixes some Wingdings characters with Cambria text. When I look into the runs, I sometimes see a run.text of length 1 whose character is a code point such as 0xf063, with run.font.name set to e.g. Wingdings 2. This is as expected. But often I see an empty run.text (font name still Wingdings). The characters must nevertheless be there, because when I append the run unchanged to a new paragraph, I can see them in Word. However, when I duplicate the run (as best I can), the characters are lost, probably because I miss something while copying. So my question is: where are the characters stored when run.text is empty, and what do I have to take care of when I duplicate such a run?
The characters are not lost during duplication if run.text is not empty. So the problem originates when the document is read: sometimes the character ends up in run.text and sometimes somewhere else, and I cannot predict which.
I just had the idea to unzip the docx and look into document.xml. There I see:
<w:r w:rsidRPr="00946796">
  <w:rPr>
    <w:color w:val="EE9512"/>
    <w:lang w:val="de-DE"/>
  </w:rPr>
  <w:t xml:space="preserve">YYYYYYY</w:t>
</w:r>
<w:r w:rsidR="009E034B" w:rsidRPr="00695B07">
  <w:rPr>
    <w:rFonts w:ascii="Wingdings 3" w:hAnsi="Wingdings 3"/>
    <w:color w:val="EE9512"/>
  </w:rPr>
  <w:sym w:font="Wingdings 2" w:char="F038"/>
</w:r>
So when run.text is empty, the chars are in a w:sym element, else in a w:t element.

You can see the special character as a "symbol" here:
<w:r w:rsidR="009E034B" w:rsidRPr="00695B07">
  <w:rPr>
    <w:rFonts w:ascii="Wingdings 3" w:hAnsi="Wingdings 3"/>
    <w:color w:val="EE9512"/>
  </w:rPr>
  <w:sym w:font="Wingdings 2" w:char="F038"/> <!-- <<==== this line -->
</w:r>
I haven't researched this in depth, but I expect the distinction here is that the glyphs in this "font" are not stylized versions of the Unicode code point at which they appear.
For example, there are no "A", "B", "C" characters in this font; those positions are taken by arrows or something similar instead.
I imagine the distinction is important because you couldn't get good results by substituting a similar font if Wingdings 2 is not installed on the current machine. So at least the font-substitution behavior would be different for this symbol than for regular characters.
There is no API support yet for symbols in runs, so you'd need to use lxml calls to access these elements, perhaps something like:
from docx.oxml.ns import qn

# run is a python-docx Run; run._r is its underlying lxml <w:r> element
syms = run._r.xpath("./w:sym")
for sym in syms:
    print("font == %s" % sym.get(qn("w:font")))
    print("char == %s" % sym.get(qn("w:char")))
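If python-docx is not at hand, the same lookup can be sketched directly against the unzipped document.xml with the standard library. This is only an illustration: the helper name run_symbols and the sample fragment are made up for the example, and the namespace URI is assumed to be the usual WordprocessingML main namespace.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def run_symbols(run_xml):
    """Return (font, char) pairs for every w:sym child of a w:r element."""
    run = ET.fromstring(run_xml)
    return [
        (sym.get("{%s}font" % W_NS), sym.get("{%s}char" % W_NS))
        for sym in run.findall("{%s}sym" % W_NS)
    ]


# Hypothetical fragment modelled on the run shown in the question.
SAMPLE_RUN = (
    '<w:r xmlns:w="%s">'
    '<w:rPr><w:rFonts w:ascii="Wingdings 3" w:hAnsi="Wingdings 3"/></w:rPr>'
    '<w:sym w:font="Wingdings 2" w:char="F038"/>'
    "</w:r>" % W_NS
)

print(run_symbols(SAMPLE_RUN))
```

A run that only has a w:t child yields an empty list, which mirrors the "sometimes in run.text, sometimes not" behaviour described in the question.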

After a few more hours I think I see the complete picture. First, as scanny wrote above, python-docx does not handle w:sym elements at all (yet?), so they are lost after reading the docx unless you resort to lxml. Then, why do I sometimes see a Wingdings character in w:t and sometimes in w:sym? If you use Word's Symbol chooser (a window showing all the characters in a font, where you select one and press "Insert" at the bottom), you get a w:sym element. If you just set the font to Wingdings and type the suitable character on the keyboard (e.g. an 8 for the Wingdings 2 circle with a dot inside), you get a w:t element.
With that knowledge I managed to remove all w:sym elements from my document. To determine the "suitable" character, google for "Wingdings translator".
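The "suitable" character can probably also be computed instead of looked up. The sketch below assumes that Word stores symbol-font characters shifted into the private use range F000 to F0FF, so subtracting 0xF000 gives the key to type; that assumption matches the example above, where w:char="F038" corresponds to typing 8, but it should be verified for other fonts. The helper name sym_char_to_typed is hypothetical.

```python
def sym_char_to_typed(char_hex):
    """Map a w:sym w:char value such as "F038" to the character to type.

    Assumption: values in F000..F0FF are the plain code point shifted into
    the private use area; anything else is returned as its own code point.
    """
    code = int(char_hex, 16)
    if 0xF000 <= code <= 0xF0FF:
        code -= 0xF000
    return chr(code)


print(sym_char_to_typed("F038"))  # the w:char value from the question
```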

Related

Is it possible to remove extra space only for Chinese character but keep necessary code symbol in html?

I want to clean up a page in which all the Chinese sentences contain extra spaces: every Chinese character has a space before it.
The page will be a mess if I search for \s to find all the whitespace and then delete all of it, because the spaces the code needs would be removed as well.
It would take a lot of time to search for \s and then remove the extra spaces around the Chinese characters one by one.
(I just saw that you are doing it in an editor. The following is in JavaScript. You can install Node.js and write a simple program to read in each line and replace it with the correct content, write it back out to a file. For example, Google for Node fs.)
You could use:
const s = "Oscar list 奧 斯 卡 提 名 名 單 出 爐 - 最 佳 導演全男班 today";
const result = s.replace(/([^\u0000-\u00FF])[ \t]*(?![\u0000-\u00FF])/gu, "$1");
console.log(result);
// stringify to show string:
console.log(JSON.stringify(result));
Basically it says: if a character is outside the usual 8-bit extended ASCII range (captured as $1), is followed by some spaces or tabs, and the character after those is also not 8-bit extended ASCII (checked with a lookahead, so it is not consumed), then replace the whole match with just $1.
You can restrict it to 7-bit ASCII instead if you want, by using [^\u0000-\u007F] and [\u0000-\u007F].
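For readers who would rather script this outside the browser or Node, here is a rough Python port of the same pattern. It is a sketch, not a tested tool for arbitrary HTML: the function name squeeze_cjk_spaces is made up, and it works on plain strings, so markup inside the Chinese text is not treated specially.

```python
import re

# Same idea as the JavaScript pattern: a character above U+00FF, then any
# spaces or tabs, provided the next character is not in U+0000..U+00FF.
CJK_GAP = re.compile(r"([^\u0000-\u00FF])[ \t]*(?![\u0000-\u00FF])")


def squeeze_cjk_spaces(text):
    """Remove spaces and tabs that sit between two non-Latin-1 characters."""
    return CJK_GAP.sub(r"\1", text)


sample = "Oscar list 奧 斯 卡 提 名 名 單 出 爐 - 最 佳 導演全男班 today"
print(squeeze_cjk_spaces(sample))
```

The spaces around "-" and before "today" survive because the character on one side of them is ASCII.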

What are the exponent characters (in non-formatted text)? How can I create these exponent characters?

I'm searching for a list of exponents like ¹²³ and so on, and the same for letters. Note that these remain superscripted even in plain text.
Does something like this exist? If not, how can I create those?
(I need them for a website-project)
Unicode versions of superscripted/subscripted characters exist for all ten digits but not for all letters. They remain superscripted/subscripted in a plain-text environment without the need for formatting tags such as <sup>/<sub>.
However, as of Unicode 14, the superscript letters that do exist are scattered across different Unicode ranges and are mainly used for phonetic transcription. They also serve compatibility purposes, especially where the text cannot carry markup superscripts and subscripts.
Exponent characters:
These are mostly used for mathematical and referencing usage.
- ⁰ [U+2070]
- ¹ [U+00B9, Latin-1 Supplement]
- ² [U+00B2, Latin-1 Supplement]
- ³ [U+00B3, Latin-1 Supplement]
- ⁴ [U+2074]
- ⁵ [U+2075]
- ⁶ [U+2076]
- ⁷ [U+2077]
- ⁸ [U+2078]
- ⁹ [U+2079]
- ⁺ [U+207A]
- ⁻ [U+207B]
- ⁼ [U+207C]
- ⁽ [U+207D]
- ⁾ [U+207E]
- ⁿ [U+207F]
- ⁱ [U+2071]
The "linear", "squared", and "cubed" superscripts (¹, ², ³) are the most familiar and are found in Latin-1 Supplement. All the others are found in Superscripts and Subscripts. For the digits outside Latin-1 Supplement, add the digit to 0x2070 to obtain the code point (for example, 0x2070 + 4 = U+2074 for ⁴). See this Wikipedia article and the official Unicode codepage segment.
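The code point rule from the list above can be sketched in a few lines of Python. The function names are made up for the illustration, and only non-negative integers are handled.

```python
# Superscript 1, 2 and 3 live in Latin-1 Supplement; the remaining digits
# are U+2070 plus the digit value.
LATIN1_SUPERSCRIPTS = {1: "\u00b9", 2: "\u00b2", 3: "\u00b3"}


def superscript_digit(digit):
    """Return the Unicode superscript character for a digit 0..9."""
    if not 0 <= digit <= 9:
        raise ValueError("expected a single decimal digit")
    return LATIN1_SUPERSCRIPTS.get(digit) or chr(0x2070 + digit)


def superscript_number(n):
    """Convert a non-negative integer to a string of superscript digits."""
    return "".join(superscript_digit(int(ch)) for ch in str(n))


print("x" + superscript_number(2), "10" + superscript_number(23))
```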
Interesting notes
There are also subtle differences between <sup> superscripts and Unicode superscripts: the Unicode ones are entirely separate code points, and some fonts include professionally designed superscript glyphs because <sup> superscripts may look thin.
Compare x² with x<sup>2</sup>, and similarly x⁺ with x<sup>+</sup> (the first of each pair uses Unicode, the second uses markup).
The best solution is to use markup, such as <sup>.
You can't create the characters, but you can format them as superscripts if you are generating HTML.
To find out which ones exist, use a Unicode character search resource and look for "superscript" to get a listing.
This query, for example:
https://www.fileformat.info/info/unicode/char/search.htm?q=superscript&preview=entity
As you can see, all digits are available (more than once, even), but very few letters.
However, if you intend to generate HTML output, the <sup> tag will work for any text you want, and give the necessary semantic meaning to the text - you can read about it and try it online here: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/sup

Mysql to XSLT replace \r\n

I'm using data from a MySQLi query and placing it into an XSLT file that eventually replaces the document.xml file in Word. The issue I am having is with \r and \n coming into my Word document.
The XSLT files are quite large so I will not paste all the code here, however, an example of one of the fields is below:
<w:p w:rsidR="00C61454" w:rsidRPr="00430555" w:rsidRDefault="00AE5B7C" w:rsidP="00C61454">
  <w:pPr>
    <w:rPr>
      <w:rFonts w:ascii="Calibri" w:hAnsi="Calibri" w:cs="Segoe UI"/>
    </w:rPr>
  </w:pPr>
  <w:proofErr w:type="spellStart"/>
  <w:r w:rsidRPr="00AE5B7C">
    <w:rPr>
      <w:rFonts w:ascii="Calibri" w:hAnsi="Calibri" w:cs="Segoe UI"/>
      <w:sz w:val="18"/>
      <w:szCs w:val="18"/>
    </w:rPr>
    <w:t><xsl:value-of select="comments"/></w:t>
  </w:r>
  <w:proofErr w:type="spellEnd"/>
</w:p>
In the above code, comments is being placed from the mysqli query into the xslt.
There are a few other fields that may contain \r\n so something that works for the entire file would be best.
If doing a replace in the SELECT script works, I can take that route too, however, being a novice, I wouldn't know how to replace multiple items (the \r and the \n), nor what to replace them with to create the line feed / carriage return in Word.
In XSLT/XPath, normalize-space() will consolidate embedded whitespace and trim surrounding whitespace. Perhaps it could help with your unwanted \r\n characters.
See also How to add a newline (line break) in XML file? (Semantics of newline and line break characters can depend upon the processing application.)
Finally, in OOXML (the markup standard behind Microsoft Word's DOCX format), use <w:br/> (between w:t text elements within w:r run elements) to force a hard line break.
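To make the w:br suggestion concrete, here is a small sketch of the target structure, written in Python rather than XSLT purely for illustration. The helper name run_with_breaks is hypothetical, run properties (w:rPr) are left out, and the same split-and-interleave logic would have to be reproduced in whatever layer feeds the template.

```python
import re
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def run_with_breaks(text):
    """Build a w:r element with one w:t per line and a w:br between lines."""
    run = ET.Element("{%s}r" % W_NS)
    # Treat \r\n, a lone \r and a lone \n all as one line break.
    lines = re.split(r"\r\n|\r|\n", text)
    for index, line in enumerate(lines):
        if index:
            ET.SubElement(run, "{%s}br" % W_NS)
        t = ET.SubElement(run, "{%s}t" % W_NS)
        t.text = line
    return run


run = run_with_breaks("first line\r\nsecond line\nthird line")
print(ET.tostring(run, encoding="unicode"))
```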

HTML's handling of white-space characters depends on context - but what are the rules?

The Unicode catalogue includes a number of white-space characters, some of which don't appear to work in any context in HTML documents - but some of which, rather usefully, do.
Here is an example:
<h1 title="Hi! As a title attribute, 
I can contain horizontal tabs 
and carriage returns
and line feeds.">HTML's handling of &#009; | &#010; | &#013;</h1>
<p>Hello. As a paragraph element, I can't contain horizontal tabs 
or carriage returns
or line feeds.</p>
<input type="submit" value="I am a value attribute and
like title I can also handle line feeds" /><br />
<input type="submit" value="I am another value attribute. Like title I can handle horizontal tabs" /><br />
<input type="submit" value="I am a third value attribute. 
Unlike title I can't handle carriage returns" />
Is there any official spec or series of guidelines which detail which white-space characters can be deployed in HTML documents and where?
It's a little unclear what you mean by work, but I'm going to assume you mean rendering, at which point what happens is really up to CSS.
https://www.w3.org/TR/CSS2/text.html#white-space-model defines how most whitespace characters are normalized away, unless you adjust the white-space property.
Note that the display of tooltips (such as those from the title attribute) and form controls (such as those from input elements) is not defined by any standard, which effectively leaves it up to browsers.
Disclaimer: this answer was composed for the question as originally written, making explicit references to ASCII control characters. It was apparently a red herring so the information here may look confusing now.
First of all, I don't think anybody uses ASCII any more. In 2016 the only sensible encoding is UTF-8. In any case, UTF-8 is a superset of ASCII (and you can use ASCII anyway), so the question is still valid.
Secondly, your example isn't correct. All the HTML entities you mention are printable characters:
&#009; is 'CHARACTER TABULATION' (U+0009) (i.e. a tab)
&#013; is 'CARRIAGE RETURN (CR)' (U+000D) (i.e. a legacy MacOS line feed)
&#010; is 'LINE FEED (LF)' (U+000A) (i.e. a Unix line feed)
(And please note that Windows line feeds are a combination of CR+LF.)
If you're really talking about control characters:
EOT End of Transmission
ACK Acknowledgement
BEL Bell
...
... we first need to understand that HTML is meant to be plain text (as such, its MIME content type is text/html). The HTML5 Living Standard provides a definition of control character that's wider than the ASCII one, but in any case such characters don't seem to be allowed:
Any occurrences of any characters in the ranges U+0001 to U+0008,
U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters
U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE,
U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF,
U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE,
U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF,
U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse
errors. These are all control characters or permanently undefined
Unicode characters (noncharacters).
Any character that is a not a Unicode character, i.e. any isolated
surrogate, is a parse error. (These can only find their way into the
input stream via script APIs such as document.write().)
If you actually refer to the characters in your example, some of them are treated as exceptions in the parsing stage:
U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
characters are treated specially. Any LF character that immediately
follows a CR character must be ignored, and all CR characters must
then be converted to LF characters. Thus, newlines in HTML DOMs are
represented by LF characters, and there are never any CR characters in
the input to the tokenization stage.
... but I suspect you are only interested in white-space collapsing:
In HTML, only the following characters are defined as white space
characters:
ASCII space (&#x0020;)
ASCII tab (&#x0009;)
ASCII form feed (&#x000C;)
Zero-width space (&#x200B;)
[...]
In particular, user agents should collapse input white space sequences
when producing output inter-word space.
[...]
The PRE element is used for preformatted text, where white space is
significant.
In other words, consecutive white space characters become a simple space (except inside <pre> tag). (I could only find a link for HTML 4 but that's something that hasn't changed significantly).
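The two behaviours quoted above (newline normalisation, then collapsing) can be approximated in a few lines of Python. This is a rough sketch of the default case only: the function names are made up, the zero-width space is deliberately left out, and real browsers follow the more detailed CSS white-space rules.

```python
import re

# Space, tab, form feed and LF; CR is handled by normalize_newlines first.
COLLAPSIBLE = re.compile("[ \t\f\n]+")


def normalize_newlines(text):
    """Turn CR LF pairs and lone CRs into LF, as the parsing stage does."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def collapse_whitespace(text):
    """Replace every run of collapsible white space with a single space."""
    return COLLAPSIBLE.sub(" ", normalize_newlines(text))


print(repr(collapse_whitespace("Hello.\tAs a paragraph\r\n  element")))
```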
Is there any official spec or series of guidelines? Sure there are: you have the official W3C recommendations and the WHATWG specs, but they're basically technical documentation mostly addressed at browser vendors: extensive, comprehensive and hard to decipher into plain English ;-)

Hex color code multiple #

While editing some old ColdFusion code I found a <td> which has a bgcolor attribute. Its value was ##89969E. I copied the code to an HTML file and found out that the color was different from the one in ColdFusion.
Then I noticed the double #, so I removed one and the color was the same. Why does the color change when adding/removing a #?
jsFiddle
As a basic premise, the additional hash is interpreted as an erroneous digit and replaced with a zero, so ##89969E is read as the digit string 089969E instead of 89969E, which shifts every digit by one place and changes the color. Note that hex color codes can have up to 8 digits after the hash when they include an alpha (transparency) channel.
Missing digits are treated as 0[...]. An incorrect digit is simply
interpreted as 0. For example the values #F0F0F0, F0F0F0, F0F0F, #FxFxFx and FxFxFx are all the same.
When color strings longer than 8 characters or shorter than 4
characters are used, things start to get strange.
However there are a lot of nuances - you can find out more about this here, and for some fairly entertaining results this creates, have a little read here
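For anyone who wants to experiment, the following Python sketch follows the steps of the HTML Standard's rules for parsing a legacy color value as I understand them. It is an approximation, not a reference implementation: named colors, the "transparent" keyword and the special handling of characters above U+FFFF are left out, and the function name parse_legacy_color is made up.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def parse_legacy_color(value):
    """Approximate legacy (bgcolor-style) color parsing; returns '#rrggbb'."""
    value = value.strip()
    if not value:
        raise ValueError("empty color value")
    # Short form such as #abc expands to #aabbcc.
    if len(value) == 4 and value[0] == "#" and all(c in HEX_DIGITS for c in value[1:]):
        return "#" + "".join(c * 2 for c in value[1:]).lower()
    value = value[:128]
    if value.startswith("#"):
        value = value[1:]  # only ONE leading hash is dropped
    # Anything that is not a hex digit (including a second '#') becomes 0.
    value = "".join(c if c in HEX_DIGITS else "0" for c in value)
    # Pad with zeros until the digits split evenly into three components.
    while len(value) == 0 or len(value) % 3:
        value += "0"
    size = len(value) // 3
    parts = [value[i * size:(i + 1) * size] for i in range(3)]
    if size > 8:
        parts = [p[-8:] for p in parts]
        size = 8
    # Drop shared leading zeros, then keep the first two digits of each part.
    while size > 2 and all(p[0] == "0" for p in parts):
        parts = [p[1:] for p in parts]
        size -= 1
    return "#%02x%02x%02x" % tuple(int(p[:2], 16) for p in parts)


for candidate in ("#89969E", "##89969E", "F0F0F", "#FxFxFx"):
    print(candidate, "->", parse_legacy_color(candidate))
```

Under these steps the single-hash and double-hash values come out as different colors, which is consistent with what the question observed.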