MS-Word 2010 can't unescape HTML entities, e.g. &mdash; &rsquo; &rdquo;

I was using a JSP with struts2-core-2.0.6.jar and ognl-2.6.11.jar to generate an MS-Word document, and it worked perfectly. After I upgraded to struts2-core-2.3.28.1.jar and ognl-3.0.14.jar, MS-Word can no longer open the generated file, because the Struts tag <s:property> in the newer version escapes some special characters such as — as HTML entities (&mdash;), which MS-Word can't recognize.
For example, I have a field called "nameAndURL" in a database table that stores a string containing "—" together with a URL, like this:
vincent—http://localhost/a/http.action?dataFormat=html&ymdhms=20130101000000
and for some reason I can't convert my data to this:
vincent—http://localhost/a/http.action?dataFormat=html&ymdhms=20130101000000
or:
vincent—http://localhost/a/http.action?dataFormat=html&ymdhms=20130101000000
jsp code:
<s:iterator value="nameAndURL">
<w:p wsp:rsidR="00AA5956" wsp:rsidRDefault="00571D82">
<w:pPr>
<w:ind w:left="426" w:first-line-chars="0" w:first-line="0" />
<w:jc w:val="left" />
</w:pPr>
<w:r>
<w:rPr>
<wx:font wx:val="宋体" />
<w:sz w:val="18" />
<w:sz-cs w:val="18" />
</w:rPr>
<w:t><s:property escapeHtml="true"/></w:t>
</w:r>
</w:p>
</s:iterator>
The problem is that when I use <s:property value="nameAndURL" escapeHtml="true"/>, the character — is escaped as &mdash;, which MS-Word can't recognize. The invalid MS-Word markup looks like this:
<w:r>
<w:t>vincent&mdash;http://localhost/a/http.action?dataFormat=html&amp;ymdhms=20130101000000</w:t>
</w:r>
But if I use <s:property value="nameAndURL" escapeHtml="false"/>, the character "—" is left unescaped, which MS-Word can recognize, but & is also left as a bare &, which MS-Word can't recognize. The invalid MS-Word markup looks like this:
<w:r>
<w:t>vincent—http://localhost/a/http.action?dataFormat=html&ymdhms=20130101000000</w:t>
</w:r>
How can I make MS-Word recognize &mdash; or a bare &? Or how can I make <s:property> leave — unescaped but still escape & as &amp;?
And why is "—" escaped as &mdash; after I upgraded to struts2-core-2.3.28.1.jar and ognl-3.0.14.jar?
Thanks for your answer.

If you want to escape only &, I'd suggest doing it server-side, in the action (or in a business layer), and leaving the rest unescaped in the page.
This is less safe, however, so you should also strip at least <script> blocks.
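The idea of escaping only the XML markup characters, while leaving characters such as — untouched, can be sketched as follows. This is a minimal illustration in Python rather than Struts/Java code; the helper name escape_markup_only is hypothetical, and a real action would do the equivalent in Java before the value reaches the JSP (with escapeHtml="false" on the tag).

```python
from xml.sax.saxutils import escape


def escape_markup_only(text):
    """Escape &, < and > for XML; every other character is left as-is."""
    return escape(text)


# Sample value from the question; \u2014 is the em dash character.
value = ("vincent\u2014http://localhost/a/http.action"
         "?dataFormat=html&ymdhms=20130101000000")
escaped = escape_markup_only(value)
print(escaped)
```

Because < is escaped as well, a literal <script> in the data can no longer be interpreted as markup, which covers part of the safety concern mentioned above.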

Related

Don't see special font characters in run.text (sometimes)

I have a word document mixing some Wingdings characters with Cambria text. When I look into the runs, I see sometimes a run.text with length 1 and the character is in hex e.g. 0xf063. The run.font.name is e.g. Wingdings 2. This is as expected. But often I see an empty run.text (font name still Wingdings). Nevertheless, the characters must be there, because, when I append the run to a new paragraph, I can see them in Word, at least when I pass them just through. When I however duplicate the run (as best as I can), the characters are lost, probably, because, when I dup the run, I miss something. So my question is, where are the characters stored when run.text is empty, and what do I have to observe when I duplicate such a run.
The characters are not lost during run duplication, however, if the run.text is not empty. Thus, the problem originates when the document is read, and sometimes the character is in run.text, and sometimes somewhere else. Which one is unpredictable to me.
I just had the idea to unzip the doc and look into document.xml. There I see
<w:r w:rsidRPr="00946796">
<w:rPr> <w:color w:val="EE9512"/>
<w:lang w:val="de-DE"/>
</w:rPr>
<w:t xml:space="preserve">YYYYYYY
</w:t>
</w:r>
<w:r w:rsidR="009E034B" w:rsidRPr="00695B07">
<w:rPr>
<w:rFonts w:ascii="Wingdings 3" w:hAnsi="Wingdings 3"/>
<w:color w:val="EE9512"/>
</w:rPr>
<w:sym w:font="Wingdings 2" w:char="F038"/>
</w:r>
So when run.text is empty, the chars are in a w:sym element, else in a w:t element.

You can see the special character as a "symbol" here:
<w:r w:rsidR="009E034B" w:rsidRPr="00695B07">
<w:rPr>
<w:rFonts w:ascii="Wingdings 3" w:hAnsi="Wingdings 3"/>
<w:color w:val="EE9512"/>
</w:rPr>
<w:sym w:font="Wingdings 2" w:char="F038"/> <!-- <<==== this line -->
</w:r>
I haven't researched this in depth, but I expect the distinction here is that the glyphs in this "font" are not stylized versions of the unicode codepoint at which they appear.
For example, there are no "A", "B", "C" characters in this font, those positions are taken by arrows or something instead.
I imagine the distinction is important because you couldn't get good results by substituting a similar font if Wingdings 2 is not installed on the current machine. So at least that font-substitution behavior would be different for this symbol than for regular characters.
There is no API support yet for symbols in runs, so you'd need to use lxml calls to access these elements, perhaps something like:
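from docx.oxml.ns import qn

# `run` is a python-docx Run object; `_r` is its underlying lxml element
syms = run._r.xpath("./w:sym")
for sym in syms:
    # w:font names the symbol font, w:char holds the hex code point
    print("font == %s" % sym.get(qn("w:font")))
    print("char == %s" % sym.get(qn("w:char")))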

After a few more hours I think I see the complete picture. First, as scanny wrote above, python-docx does not handle w:sym elements at all (yet?), so these are lost after reading the docx, unless you resort to lxml. Then, why do I sometimes see a Wingdings character in w:t, sometimes in w:sym? Well, if I use the Word Symbol chooser (a window with all the characters in a font, where you can select one and then press "Insert" at the bottom), then you get a w:sym element. If you just set the font to Wingdings, and then type the suitable character on the keyboard (e.g. an 8 for a Wingdings 2 Circle with Dot inside), then you get a w:t element.
Thus I managed to remove all w:sym elements. To determine the "suitable" character, google for "Wingdings translator".
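As a rough sketch of how a w:sym element could be mapped back to the character you would type, the snippet below parses a run like the one shown earlier using only the standard library. It assumes that symbol fonts place their glyphs in the private-use range F000 to F0FF, so that subtracting 0xF000 gives the typed character; that mapping and the helper name symbols_in_run are assumptions for illustration, not python-docx behavior.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# A run shaped like the one found in document.xml above.
RUN_XML = (
    '<w:r xmlns:w="%s">'
    '<w:rPr><w:rFonts w:ascii="Wingdings 3" w:hAnsi="Wingdings 3"/></w:rPr>'
    '<w:sym w:font="Wingdings 2" w:char="F038"/>'
    '</w:r>' % W_NS
)


def symbols_in_run(run_element):
    """Return (font, code point, typed character) for each w:sym child."""
    found = []
    for sym in run_element.findall("{%s}sym" % W_NS):
        font = sym.get("{%s}font" % W_NS)
        code = int(sym.get("{%s}char" % W_NS), 16)
        # Assumption: symbol fonts use the private-use range F000-F0FF,
        # so dropping the F000 offset yields the key typed in that font.
        typed = chr(code - 0xF000) if 0xF000 <= code <= 0xF0FF else chr(code)
        found.append((font, code, typed))
    return found


run = ET.fromstring(RUN_XML)
result = symbols_in_run(run)
print(result)
```

For the sample run this yields the character 8 in Wingdings 2, which is consistent with the "8 for a Wingdings 2 Circle with Dot" observation above.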

Mysql to XSLT replace \r\n

I'm using data from a MySQLi query and placing it into an XSLT file that eventually replaces the document.xml file in Word. The issue I am having is with \r and \n coming into my Word document.
The XSLT files are quite large so I will not paste all the code here, however, an example of one of the fields is below:
<w:p w:rsidR="00C61454" w:rsidRPr="00430555" w:rsidRDefault="00AE5B7C" w:rsidP="00C61454">
<w:pPr>
<w:rPr>
<w:rFonts w:ascii="Calibri" w:hAnsi="Calibri" w:cs="Segoe UI"/>
</w:rPr>
</w:pPr>
<w:proofErr w:type="spellStart"/>
<w:r w:rsidRPr="00AE5B7C">
<w:rPr>
<w:rFonts w:ascii="Calibri" w:hAnsi="Calibri" w:cs="Segoe UI"/>
<w:sz w:val="18"/>
<w:szCs w:val="18"/>
</w:rPr>
<w:t><xsl:value-of select="comments"/></w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
In the above code, comments is being placed from the mysqli query into the xslt.
There are a few other fields that may contain \r\n so something that works for the entire file would be best.
If doing a replace in the SELECT script works, I can take that route too, however, being a novice, I wouldn't know how to replace multiple items (the \r and the \n), nor what to replace them with to create the line feed / carriage return in Word.

In XSLT/XPath, normalize-space() will consolidate embedded whitespace and trim surrounding whitespace. Perhaps it could help with your unwanted \r\n characters.
See also How to add a newline (line break) in XML file? (Semantics of newline and line break characters can depend upon the processing application.)
Finally, in OOXML (the markup standard behind Microsoft Word's DOCX format), use <w:br/> (between w:t text elements within w:r run elements) to force a hard line break.
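One way to apply the w:br advice is to pre-process each field before it reaches the template, splitting on line endings and emitting one w:t per line. The sketch below shows that transformation in Python for illustration only; the function name text_to_run_xml is hypothetical, and in the asker's setup the same logic would have to live in the PHP layer or in an XSLT template.

```python
import re
from xml.sax.saxutils import escape


def text_to_run_xml(text):
    """Build the inner XML of a w:r run: one w:t per line, w:br between."""
    # Treat \r\n, a lone \r and a lone \n as the same line break.
    lines = re.split(r"\r\n|\r|\n", text)
    parts = ['<w:t xml:space="preserve">%s</w:t>' % escape(line)
             for line in lines]
    return "<w:br/>".join(parts)


run_inner = text_to_run_xml("first line\r\nsecond & last line")
print("<w:r>%s</w:r>" % run_inner)
```

Escaping each line with escape() also keeps a stray & or < in the data from breaking document.xml.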

UTF-8 Dingbats In ColdFusion

I'm trying to display some hex and decimal encoded special characters (UTF-8 Dingbats) in a coldfusion (v11) page.
<cfloop>
<td id="..." align="center">&#9986;</td>
<td id="..." align="center">&#x2702;</td>
</cfloop>
Based on the compiler error it most certainly seems like an issue with the pound (#) character, which of course is a special character in coldfusion.
So, is what I'm trying to do even possible, maybe by escaping #?

In ColdFusion the # is used to output variables within a <cfoutput> block. For example:
<cfoutput>The time is #now()#</cfoutput>
If you need to preserve the #, then you need to escape it, which you can do with a double #. For example:
<cfoutput>My dingbats: &##9986; &##x2702;</cfoutput>
If you're not inside a cfoutput block then you don't need to escape it. For example:
<cfoutput>My dingbat: &##9986;</cfoutput><br>
My dingbat: &#9986;

WordML bold font on right-to-left language

I need help formatting text as bold and right-to-left.
So far I have managed to get only one of the two, but not both at the same time.
I've tried this:
<w:p>
<w:pPr>
<w:pStyle w:val="D12"/>
<w:bidi/>
</w:pPr>
<w:r>
<w:rPr>
<w:rtl/>
<w:bold w:lang w:bidi="HE"/>
</w:rPr>
<w:t>טקסט</w:t>
</w:r>
</w:p>
and it didn't work.

Use <w:b-cs/> rather than <w:bold w:lang w:bidi="HE"/>.
To find other tags, create a document in Word that contains the style you need, save it in the Word 2003 XML format, open it in Notepad, and see which tags Microsoft Word used.

Is &gt; ever necessary?

I have been developing websites and XML interfaces for 7 years now, and have never been in a situation where it was really necessary to use &gt; for a >. So far, all disambiguation could be handled by quoting <, &, " and ' alone.
Has anyone ever been in a situation (related to, e.g., SGML processing, browser issues, XSLT, ...) where you found it indispensable to escape the greater-than sign as &gt;?
Update: I just checked with the XML spec, where it says, for example, about character data in section 2.4:
Character Data
[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
So even there, the > isn't mentioned as something special, except as part of the ending sequence of a CDATA section.
The single case where the > is of any significance would be the end of a CDATA section, ]]>, but then again, if you quoted it, the quoted form (i.e., the literal string ]]&gt;) would land literally in the output (since it's CDATA).

You don't absolutely need to, because almost any XML interpreter will understand what you mean. But if you don't, you are still using a special character without any protection.
XML is all about semantics, and this is not really semantically compliant.
About your update, you forgot this part:
The right angle bracket (>) may be represented using the string " &gt; ", and must, for compatibility, be escaped using either " &gt; " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.
The use case given in the documentation is more about something like this:
<xmlmarkup>
]]>
</xmlmarkup>
Here the ]]> part could be a problem with old SGML parsers, so it must be escaped as ]]&gt; for compatibility reasons.
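A quick way to see this rule in action is to feed both forms to a parser. The sketch below uses Python's standard-library ElementTree (backed by expat) as one example of a conforming parser; other parsers may be more lenient, and the helper name parses is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def parses(document):
    """Return True if the parser accepts the string as well-formed XML."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# ]]> appearing in content without an open CDATA section.
raw = "<xmlmarkup>]]></xmlmarkup>"
# The same content with > escaped by a standard XML escaper.
safe = "<xmlmarkup>%s</xmlmarkup>" % escape("]]>")

print(parses(raw))
print(parses(safe))
print(safe)
```

Since ordinary escaping helpers replace every > anyway, generated XML rarely hits this case; it mostly matters for hand-written content.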

I used one not 19 hours ago to pass a strict XML validator. Another case is when you use them in actual HTML/XML content text (rather than in attributes), like this: &lt;.
Sure, a lax parser will accept almost anything you throw at it, but if you're ever worried about XSS, &lt; is your friend.
Update: Here's an example where you need to escape > in Firefox:
<?xml version="1.0" encoding="utf-8" ?>
<test>
]]>
</test>
Granted, it still isn't an example of having to escape a lone >.

Not so much as an author of (X)HTML documents, but more as a user of sloppily written comment fields on websites that "offer" to let you insert HTML.
I mean, if you build your site the right way, you wouldn't hardcode your content anyway, right? So your call to htmlentities or whatever (long time no see, PHP) would take care of replacing special characters for you.
So sure, you wouldn't manually type &gt;, but I hope you take measures so that > is automatically replaced.

I just thought of another example where you need to quote > in HTML5 (not XHTML5) documents: when you need it in an attribute value without quotes (whether to do that at all is debatable, of course).
<img src=arrow.png alt=&gt;>
should be equivalent to XHTML
<img src="arrow.png" alt=">" />
But then again, (?<!X)HTML is not SGML.

Imagine that you have the following text this is a not a ]]> nice day and you decide to surround it by CDATA sections <![CDATA[this is a not a ]]> nice day]]>.
In order to avoid that (and to allow parsing of SGML fragments with unterminated marked sections), clause 10.4 of ISO 8879:1986 declares that the occurrence of ]]> outside a marked section is an error.
Also, in the times of SGML, marked sections were very popular, as they were not only used for CDATA (as in XML), but also for RCDATA (only entities and character references allowed) and IGNORE and INCLUDE (which allowed for recognition of markup inside them).
For instance, in SGML one could write:
<!ENTITY % WHATTODO "INCLUDE">
<![%WHATTODO;[<b>]]></b>]]>
Which is equivalent to:
<b>]]></b>
