I'm using data from a MySQLi query and placing it into an XSLT file that eventually replaces the document.xml file in Word. The issue I am having is with \r and \n coming into my Word document.
The XSLT files are quite large, so I will not paste all the code here; however, an example of one of the fields is below:
<w:p w:rsidR="00C61454" w:rsidRPr="00430555" w:rsidRDefault="00AE5B7C" w:rsidP="00C61454">
<w:pPr>
<w:rPr>
<w:rFonts w:ascii="Calibri" w:hAnsi="Calibri" w:cs="Segoe UI"/>
</w:rPr>
</w:pPr>
<w:proofErr w:type="spellStart"/>
<w:r w:rsidRPr="00AE5B7C">
<w:rPr>
<w:rFonts w:ascii="Calibri" w:hAnsi="Calibri" w:cs="Segoe UI"/>
<w:sz w:val="18"/>
<w:szCs w:val="18"/>
</w:rPr>
<w:t><xsl:value-of select="comments"/></w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
In the above code, comments is being placed from the mysqli query into the xslt.
There are a few other fields that may contain \r\n so something that works for the entire file would be best.
If doing a replace in the SELECT script works, I can take that route too, however, being a novice, I wouldn't know how to replace multiple items (the \r and the \n), nor what to replace them with to create the line feed / carriage return in Word.
In XSLT/XPath, normalize-space() will consolidate embedded whitespace and trim surrounding whitespace. Perhaps it could help with your unwanted \r\n characters.
See also How to add a newline (line break) in XML file? (Semantics of newline and line break characters can depend upon the processing application.)
Finally, in OOXML (the markup standard behind Microsoft Word's DOCX format), use <w:br/> (between w:t text elements within w:r run elements) to force a hard line break.
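To make the last point concrete, here is a minimal Python sketch of the splitting logic. The asker's pipeline is PHP and XSLT, so this is only a model of the idea and not a drop-in fix; the helper name `text_to_run_content` is hypothetical. It assumes the field should become one w:t per line, with a w:br between lines, all inside a single w:r.

```python
import re
from xml.sax.saxutils import escape

def text_to_run_content(text):
    """Return the inner XML of a w:r: one w:t per line, w:br between lines."""
    # Treat \r\n, a lone \r and a lone \n all as a single line break.
    lines = re.split(r"\r\n|\r|\n", text)
    parts = []
    for line in lines:
        # escape() protects &, < and >; xml:space="preserve" keeps edge spaces.
        parts.append('<w:t xml:space="preserve">%s</w:t>' % escape(line))
    return "<w:br/>".join(parts)

print(text_to_run_content("first line\r\nsecond & last"))
```

The same split-and-join could be done in PHP before the data reaches the XSLT, or with a recursive template inside the stylesheet; which is more convenient depends on the rest of the pipeline.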
The goal is to prepare an HTML file to be transformed to Markdown using PowerShell.
The PowerShell script includes these lines:
-replace '<pre.*?>(.*?)</pre>', '`$1`'`
-replace '<code.*?>(.*?)</code>', '`<b>$1</b>`'`
Sometimes the HTML includes text <pre><code>text</code></pre> text. Sometimes it only includes text <code>text</code> text.
Because Markdown interprets text surrounded by single backticks (`) to be "code" for stylistic purposes, I want the PowerShell search/replace to:
If <pre>...</pre> is present, replace <pre>...</pre> with backticks, not <code>...</code>.
If <pre>...</pre> is absent, replace <code>...</code> with backticks.
(If I am going about it all wrong, I would be grateful to know.)
I must be going in the wrong direction, because no regex I have tried works.
^(?!.*?[</pre>]).*$<code.*?>(.*?)</code> (no matches)
^((?!</pre>$).)*<code.*?>(.*?)</code> (matches even when </pre> is present)
^(?!</pre>$).*<code.*?>(.*?)</code> (matches even when </pre> is present)
Etc.
Can anyone point me in the right direction? Thank you for any help.
(I know there are tools that transform HTML to Markdown automatically and am using one - this is just a unique preparatory step based on irregularities in our specific output.)
@'
...
... <pre><code>bingo</code></pre> ...
... <code>bongo</code> ...
...
'@ -replace '(?s)(?:(?:<pre>\s*)?<code>)(.*?)(?:</code>(?:\s*</pre>)?)', '`$1`'
Note: For brevity and simplicity, I'm assuming that the opening <pre> and <code> tags contain neither attributes nor whitespace before their closing >, and, similarly, that the closing tags contain no whitespace before their closing >. It is variability like this that makes it generally preferable to use a dedicated HTML parser rather than regular expressions.
The above yields:
...
... `bingo` ...
... `bongo` ...
...
(?s) is the SingleLine inline regex option that makes . match newlines too (in case the value to enclose in `...` spans multiple lines - though note that in later Markdown rendering those newlines may be lost).
(?:...) constructs are non-capturing subexpressions, useful for subexpressions that are needed for logical reasons, without needing what they match to be referenced later.
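For readers who want to experiment with the pattern outside PowerShell, the following is a rough Python equivalent. It is a sketch under the same assumptions as the note above (no attributes or stray whitespace inside the tags); the function name `code_to_backticks` is made up for illustration. Python's `re` module writes the backreference as `\1` where .NET uses `$1`.

```python
import re

# Same pattern as in the answer: an optional <pre>, then <code>...</code>,
# then an optional </pre>. (?s) lets . match newlines as well.
PATTERN = re.compile(r"(?s)(?:(?:<pre>\s*)?<code>)(.*?)(?:</code>(?:\s*</pre>)?)")

def code_to_backticks(html):
    # \1 is the only capturing group: the text between <code> and </code>.
    return PATTERN.sub(r"`\1`", html)

sample = "... <pre><code>bingo</code></pre> ...\n... <code>bongo</code> ..."
print(code_to_backticks(sample))
```

Because the optional `<pre>` and `</pre>` parts sit inside the overall match, they are consumed together with the `<code>` tags, so a single replacement covers both cases.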
I have a Word document that mixes some Wingdings characters with Cambria text. When I look into the runs, I sometimes see a run.text of length 1 whose character has a hex value such as 0xf063, with run.font.name being, e.g., Wingdings 2. This is as expected. But often I see an empty run.text (the font name is still Wingdings). Nevertheless, the characters must be there, because when I append the run unchanged to a new paragraph, I can see them in Word. When I duplicate the run instead (as best I can), the characters are lost, probably because I miss something when copying. So my question is: where are the characters stored when run.text is empty, and what do I have to watch for when I duplicate such a run?
The characters are not lost during run duplication, however, if the run.text is not empty. Thus, the problem originates when the document is read, and sometimes the character is in run.text, and sometimes somewhere else. Which one is unpredictable to me.
I just had the idea to unzip the doc and look into document.xml. There I see
<w:r w:rsidRPr="00946796">
<w:rPr> <w:color w:val="EE9512"/>
<w:lang w:val="de-DE"/>
</w:rPr>
<w:t xml:space="preserve">YYYYYYY
</w:t>
</w:r>
<w:r w:rsidR="009E034B" w:rsidRPr="00695B07">
<w:rPr>
<w:rFonts w:ascii="Wingdings 3" w:hAnsi="Wingdings 3"/>
<w:color w:val="EE9512"/>
</w:rPr>
<w:sym w:font="Wingdings 2" w:char="F038"/>
</w:r>
So when run.text is empty, the chars are in a w:sym element, else in a w:t element.
You can see the special character as a "symbol" here:
<w:r w:rsidR="009E034B" w:rsidRPr="00695B07">
<w:rPr>
<w:rFonts w:ascii="Wingdings 3" w:hAnsi="Wingdings 3"/>
<w:color w:val="EE9512"/>
</w:rPr>
<w:sym w:font="Wingdings 2" w:char="F038"/> <!-- <<==== this line -->
</w:r>
I haven't researched this in depth, but I expect the distinction here is that the glyphs in this "font" are not stylized versions of the Unicode code point at which they appear.
For example, there are no "A", "B", "C" characters in this font; those positions are taken by arrows or something similar instead.
I imagine the distinction is important because you couldn't get good results by substituting a similar font if Wingdings 2 is not installed on the current machine. So at least that font-substitution behavior would be different for this symbol than for regular characters.
There is no API support yet for symbols in runs, so you'd need to use lxml calls to access these elements, perhaps something like:
from docx.oxml.ns import qn

# `run` is a python-docx Run; `run._r` is its underlying lxml <w:r> element
syms = run._r.xpath("./w:sym")
for sym in syms:
    print("font == %s" % sym.get(qn("w:font")))
    print("char == %s" % sym.get(qn("w:char")))
After a few more hours I think I see the complete picture. First, as scanny wrote above, python-docx does not handle w:sym elements at all (yet?), so these are lost after reading the docx, unless you resort to lxml. Then, why do I sometimes see a Wingdings character in w:t, sometimes in w:sym? Well, if I use the Word Symbol chooser (a window with all the characters in a font, where you can select one and then press "Insert" at the bottom), then you get a w:sym element. If you just set the font to Wingdings, and then type the suitable character on the keyboard (e.g. an 8 for a Wingdings 2 Circle with Dot inside), then you get a w:t element.
Thus I managed to remove all w:sym elements. To determine the "suitable" character, google for "Wingdings translator".
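To illustrate where the character lives in each case, here is a small standalone sketch. It deliberately uses the standard-library ElementTree on a hand-written w:r fragment instead of python-docx, so that it runs without a document; the helper name `run_pieces` is hypothetical. It assumes the w:char attribute is a hexadecimal code point, as in the F038 example above.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the w: prefix in document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

RUN_XML = (
    '<w:r xmlns:w="%s">'
    '<w:rPr><w:rFonts w:ascii="Wingdings 3" w:hAnsi="Wingdings 3"/></w:rPr>'
    '<w:sym w:font="Wingdings 2" w:char="F038"/>'
    '</w:r>' % W
)

def run_pieces(run_element):
    """Yield (font, text) for each w:t and w:sym child, in document order."""
    for child in run_element:
        if child.tag == "{%s}t" % W:
            # Ordinary text: the font comes from the run properties.
            yield (None, child.text or "")
        elif child.tag == "{%s}sym" % W:
            # Symbol: the font and code point are attributes of w:sym itself.
            font = child.get("{%s}font" % W)
            code = int(child.get("{%s}char" % W), 16)
            yield (font, chr(code))

pieces = list(run_pieces(ET.fromstring(RUN_XML)))
print(pieces)
```

Note that the symbol carries its own font ("Wingdings 2" here), which may differ from the w:rFonts setting of the run, so a duplicate that only copies run.text and the run font would not reproduce it.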
The Unicode catalogue includes a number of white-space characters, some of which don't appear to work in any context in HTML documents - but some of which, rather usefully, do.
Here is an example:
<h1 title="Hi! As a title attribute,
I can contain horizontal tabs
and carriage returns
and line feeds.">HTML's handling of &#009; | &#010; | &#013;</h1>
<p>Hello. As a paragraph element, I can't contain horizontal tabs
or carriage returns
or line feeds.</p>
<input type="submit" value="I am a value attribute and
like title I can also handle line feeds" /><br />
<input type="submit" value="I am another value attribute. Like title I can handle horizontal tabs" /><br />
<input type="submit" value="I am a third value attribute.
Unlike title I can't handle carriage returns" />
Is there any official spec or series of guidelines which detail which white-space characters can be deployed in HTML documents and where?
It's a little unclear what you mean by work, but I'm going to assume you mean rendering, at which point what happens is really up to CSS.
https://www.w3.org/TR/CSS2/text.html#white-space-model defines how most whitespace characters are normalized away, unless you adjust the white-space property.
Note that the display of tooltips (such as from the title attribute) and form controls (such as from input elements) is not defined by any standard, leaving that effectively up to browsers.
Disclaimer: this answer was composed for the question as originally written, making explicit references to ASCII control characters. It was apparently a red herring so the information here may look confusing now.
First of all, I don't think anybody uses ASCII any more. In 2016 the only sensible encoding is UTF-8. In any case, UTF-8 is a superset of ASCII (and you can use ASCII anyway), so the question is still valid.
Secondly, your example isn't correct. All the HTML entities you mention are printable characters:
&#009; is 'CHARACTER TABULATION' (U+0009) (i.e. a tab)
&#013; is 'CARRIAGE RETURN (CR)' (U+000D) (i.e. a legacy Mac OS line ending)
&#010; is 'LINE FEED (LF)' (U+000A) (i.e. a Unix line ending)
(And please note that Windows line endings are a combination of CR+LF.)
If you're really talking about control characters:
EOT End of Transmission
ACK Acknowledgement
BEL Bell
...
... we first need to understand that HTML is meant to be plain text (as such, its MIME content type is text/html). The HTML5 Living Standard provides a definition of control character that's wider than the ASCII one, but in any case such characters don't seem to be allowed:
Any occurrences of any characters in the ranges U+0001 to U+0008,
U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters
U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE,
U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF,
U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE,
U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF,
U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse
errors. These are all control characters or permanently undefined
Unicode characters (noncharacters).
Any character that is a not a Unicode character, i.e. any isolated
surrogate, is a parse error. (These can only find their way into the
input stream via script APIs such as document.write().)
If you actually mean the characters in your example, some of them are treated as exceptions in the parsing stage:
U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
characters are treated specially. Any LF character that immediately
follows a CR character must be ignored, and all CR characters must
then be converted to LF characters. Thus, newlines in HTML DOMs are
represented by LF characters, and there are never any CR characters in
the input to the tokenization stage.
... but I suspect you are only interested in white-space collapsing:
In HTML, only the following characters are defined as white space
characters:
ASCII space (&#x0020;)
ASCII tab (&#x0009;)
ASCII form feed (&#x000C;)
Zero-width space (&#x200B;)
[...]
In particular, user agents should collapse input white space sequences
when producing output inter-word space.
[...]
The PRE element is used for preformatted text, where white space is
significant.
In other words, consecutive white space characters become a single space (except inside a <pre> element). (I could only find a link for HTML 4, but that's something that hasn't changed significantly.)
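The two rules quoted above (newline normalization at parse time, then white-space collapsing at render time) can be modelled roughly as follows. This is a simplified sketch: it assumes the default white-space handling and ignores finer points of the CSS model, such as stripping spaces at the start and end of a line. The function names are invented for illustration.

```python
import re

def normalize_newlines(text):
    # CR LF pairs and lone CRs both become LF, as in the parsing rule quoted above.
    return text.replace("\r\n", "\n").replace("\r", "\n")

def collapse_whitespace(text):
    # Outside <pre>, any run of space, tab, LF or form feed is rendered
    # as a single inter-word space.
    return re.sub(r"[ \t\n\f]+", " ", normalize_newlines(text))

print(repr(collapse_whitespace("tabs\tand\r\nline   feeds")))  # 'tabs and line feeds'
```

This is why the paragraph in the question's example shows no tabs or line breaks: by the time the text is laid out, each run of white space has been reduced to one space.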
Is there any official spec or series of guidelines? Sure there are: you have the official W3C recommendations and the WHATWG specs, but they're basically technical documentation mostly addressed at browser vendors: extensive, comprehensive and hard to decipher into plain English ;-)
I've two questions:
If I extract text using the text() or as_trimmed_text() function and want to push it into some element, do I need to use HTML::Entities::encode_entities?
my $text=$node->as_trimmed_text();
$a->push_content($text); # Do I need to use encode_entities here?
Secondly, after processing and generating the whole HTML document using as_HTML(), the output sometimes contains special characters, for example Â (&Acirc;), as an extra character where all I see in Dreamweaver is a single space.
I have two answers:
Assuming that you want the content of $a to be the same as the content of $node, you do not need to call encode_entities, as push_content inserts the passed string as a text node rather than parsing it as markup. OTOH, if the content of $node is <span> (represented in HTML source as &lt;span&gt;) and you actually want $a to display &lt;span&gt; (represented in HTML source as &amp;lt;span&amp;gt;), you would call encode_entities on it.
Chances are that your input text contains raw UTF-8 characters which the code is interpreting as Latin-1 or a similar encoding. The "single space" characters are actually U+00A0, non-breaking space, which is represented in UTF-8 by the two bytes 0xc2 0xa0, which when interpreted in Latin-1 are "Â" and non-breaking space.
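The byte-level effect described here is easy to reproduce. The following short Python sketch is only a demonstration of the encoding mismatch, not of the Perl code in question; it encodes a non-breaking space as UTF-8 and then decodes the bytes as Latin-1.

```python
nbsp = "\u00a0"                    # NO-BREAK SPACE

raw = nbsp.encode("utf-8")
print(raw)                         # b'\xc2\xa0'

# Reading those two bytes as Latin-1 yields two characters:
# U+00C2 (the "Â") followed by U+00A0 (still a non-breaking space).
misread = raw.decode("latin-1")
print([hex(ord(ch)) for ch in misread])   # ['0xc2', '0xa0']

# Reversing the wrong step recovers the original character.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == nbsp)            # True
```

The practical fix is usually to make sure the input is decoded as UTF-8 when read and the output is encoded as UTF-8 (and declared as such) when written.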
Certain fields in our mysql db appear to contain newline characters so that if I SELECT on them something like the following will be returned for a single SQL call:
Life to be sure is nothing much to lose
But young men think it is and we were young
If I want to preserve the line breaks when displaying this field on a webpage, is the standard solution to write a script to replace '\r\n' with a br HTML tag, or is there a better way?
Thanks!
Assuming PHP here...
nl2br() inserts <br /> before every newline (\r\n, \n or \r). Don't forget to escape the content first, to prevent XSS attacks. See below:
<?php echo nl2br(htmlspecialchars($content)); ?>
HTML is a markup language. Regardless of how many line breaks you put in the source code, none of them show up in the presentation (assuming, of course, that you aren't using <pre> or white-space: pre). HTML uses the <br> element to represent a line break. So you do indeed need to convert the real, invisible line breaks denoted by the characters 0x0A (newline, line feed, LF, \n) and/or 0x0D (carriage return, CR, \r) to an HTML <br> element.
In most programming languages you can just do this by a string replace of "\n" by "<br>".
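As an illustration in a language other than PHP, here is a minimal Python sketch of the same escape-then-replace approach. The function name `nl2br` is borrowed from PHP for familiarity and is not a built-in; unlike PHP's version, this one replaces the newline characters and does not keep them after the tag.

```python
import html
import re

def nl2br(text):
    # Escape first, so that markup in the stored text cannot inject tags.
    escaped = html.escape(text)
    # Then turn \r\n, a lone \r or a lone \n into a <br> element.
    return re.sub(r"\r\n|\r|\n", "<br>", escaped)

poem = ("Life to be sure is nothing much to lose\r\n"
        "But young men think it is and we were young")
print(nl2br(poem))
```

Matching `\r\n` before the single characters matters: otherwise a Windows line ending would produce two `<br>` elements.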
You can wrap it in <pre> .. </pre>.