WordML bold font on right-to-left language - wordml

I need help formatting text as both bold and right-to-left.
So far I have managed to get only one of the two, but not both at the same time.
I've tried this:
<w:p>
<w:pPr>
<w:pStyle w:val="D12"/>
<w:bidi/>
</w:pPr>
<w:r>
<w:rPr>
<w:rtl/>
<w:bold w:lang w:bidi="HE"/>
</w:rPr>
<w:t>טקסט</w:t>
</w:r>
</w:p>
and it didn't work.

Use <w:b-cs/> rather than <w:bold w:lang w:bidi="HE"/>.
To find other tags, create a document in Word that contains the style you need, save it in the Word 2003 XML format, then open it in Notepad to see which tags Microsoft Word used.
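For illustration, here is a minimal Python sketch that assembles the corrected run with the standard library. The helper name make_bold_rtl_run is hypothetical, and adding <w:b/> next to <w:b-cs/> is my assumption (so that any Latin text in the same run is bold too); compare the output with what Word itself saves before relying on it.

```python
import xml.etree.ElementTree as ET

# Word 2003 WordML namespace (the XML format mentioned above).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W)


def make_bold_rtl_run(text):
    """Build a <w:r> whose text is right-to-left and bold (hypothetical helper)."""
    run = ET.Element(f"{{{W}}}r")
    rpr = ET.SubElement(run, f"{{{W}}}rPr")
    ET.SubElement(rpr, f"{{{W}}}b")     # bold for non-complex scripts (assumption)
    ET.SubElement(rpr, f"{{{W}}}b-cs")  # bold for complex scripts such as Hebrew
    ET.SubElement(rpr, f"{{{W}}}rtl")   # marks the run as right-to-left
    ET.SubElement(run, f"{{{W}}}t").text = text
    return run


run_xml = ET.tostring(make_bold_rtl_run("טקסט"), encoding="unicode")
print(run_xml)
```

The paragraph-level <w:bidi/> from the question still goes in <w:pPr>; this sketch only covers the run.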

Related

Random Letter html Tag

I was wondering if you can use a random letter as an html tag. Like, f isn't a tag, but I tried it in some code and it worked just like a span tag. Sorry if this is a bad question, I've just been curious about it for a while, and I couldn't find anything online.
I was wondering if you can use a random letter as an html tag.
Yes and no.
"Yes" - in that it works, but it isn't correct: when you have something like <z> it only works because the web (HTML+CSS+JS) has a degree of forwards compatibility built-in: browsers will render HTML elements that they don't recognize basically the same as a <span> (i.e. an inline element that doesn't do anything other than reify a range of the document's text).
However, to use HTML5 Custom Elements correctly you need to conform to the Custom Elements specification which states:
The name of a custom element must contain a dash (-). So <x-tags>, <my-element>, and <my-awesome-app> are all valid names, while <tabs> and <foo_bar> are not. This requirement is so the HTML parser can distinguish custom elements from regular elements. It also ensures forward compatibility when new tags are added to HTML.
So if you use <my-z> then you'll be fine.
The HTML Living Standard document, as of 2021-12-04, indeed makes an explicit reference to forward-compatibility in its list of requirements for custom element names:
https://html.spec.whatwg.org/#valid-custom-element-name
They start with an ASCII lower alpha, ensuring that the HTML parser will treat them as tags instead of as text.
They do not contain any ASCII upper alphas, ensuring that the user agent can always treat HTML elements ASCII-case-insensitively.
They contain a hyphen, used for namespacing and to ensure forward compatibility (since no elements will be added to HTML, SVG, or MathML with hyphen-containing local names in the future).
They can always be created with createElement() and createElementNS(), which have restrictions that go beyond the parser's.
Apart from these restrictions, a large variety of names is allowed, to give maximum flexibility for use cases like <math-α> or <emotion-😍>.
So, by example:
<a>, <q>, <b>, <i>, <u>, <p>, <s>
No: these single-letter elements are already used by HTML.
<z>
No: element names that don't contain a hyphen (-) cannot be custom elements. Present-day browsers interpret them as invalid/unrecognized markup, which they nevertheless (largely) treat the same as a <span> element.
<a:z>
No: using a colon to use an XML element namespace is not a thing in HTML5 unless you're using XHTML5.
<-z>
No - the element name must start with a lowercase ASCII character from a to z, so - is not allowed.
<a-z>
Yes - this is fine.
<a-> and <a-->
Unsure - these two names are curious:
The HTML spec says the name must match the grammar rule [a-z] (PCENChar)* '-' (PCENChar)*.
The * denotes "zero-or-more" which is odd, because that implies the hyphen doesn't need to be followed by another character.
PCENChar represents a huge range of visible characters permitted in element names. Curiously, this includes -, so by that rule <a--> should be valid.
But note that -- is a reserved character sequence in the greater SGML-family (including HTML and XML) which may cause weirdness. YMMV!
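To make the grammar rule concrete, here is a rough Python sketch that checks names against it. I transcribed the PCENChar ranges and the list of reserved hyphenated names from the spec by hand, so treat both as assumptions and check them against the current text of the standard. The function name is made up for this example.

```python
import re

# PCENChar ranges from the grammar quoted above (hand-transcribed; an assumption).
_PCEN = (
    r"\-.0-9_a-z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D"
    r"\u037F-\u1FFF\u200C-\u200D\u203F-\u2040\u2070-\u218F"
    r"\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD"
    r"\U00010000-\U000EFFFF"
)

# [a-z] (PCENChar)* '-' (PCENChar)*
_NAME = re.compile(f"[a-z][{_PCEN}]*-[{_PCEN}]*")

# Hyphenated names already used by SVG and MathML, which the spec excludes.
_RESERVED = {
    "annotation-xml", "color-profile", "font-face", "font-face-src",
    "font-face-uri", "font-face-format", "font-face-name", "missing-glyph",
}


def is_valid_custom_element_name(name):
    """Return True if name matches the grammar and is not reserved."""
    return name not in _RESERVED and _NAME.fullmatch(name) is not None


for candidate in ["z", "a:z", "-z", "a-z", "a-", "a--", "math-α"]:
    print(candidate, is_valid_custom_element_name(candidate))
```

Note that this only tests the grammar. It says nothing about whether a parser copes well with odd names such as <a-->.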

Don't see special font characters in run.text (sometimes)

I have a Word document that mixes some Wingdings characters with Cambria text. When I look into the runs, I sometimes see a run.text of length 1 whose character is, in hex, e.g. 0xf063, and the run.font.name is e.g. Wingdings 2. This is as expected. But often I see an empty run.text (with the font name still Wingdings). The characters must nevertheless be there, because when I append the run unchanged to a new paragraph, I can see them in Word. When I duplicate the run (as best I can), however, the characters are lost, probably because I miss something when copying. So my question is: where are the characters stored when run.text is empty, and what do I have to take care of when I duplicate such a run?
The characters are not lost during duplication if run.text is not empty. So the problem arises when the document is read: sometimes the character is in run.text and sometimes it is somewhere else, and I cannot predict which.
I just had the idea to unzip the doc and look into document.xml. There I see
<w:r w:rsidRPr="00946796">
<w:rPr> <w:color w:val="EE9512"/>
<w:lang w:val="de-DE"/>
</w:rPr>
<w:t xml:space="preserve">YYYYYYY
</w:t>
</w:r>
<w:r w:rsidR="009E034B" w:rsidRPr="00695B07">
<w:rPr>
<w:rFonts w:ascii="Wingdings 3" w:hAnsi="Wingdings 3"/>
<w:color w:val="EE9512"/>
</w:rPr>
<w:sym w:font="Wingdings 2" w:char="F038"/>
</w:r>
So when run.text is empty, the chars are in a w:sym element, else in a w:t element.
You can see the special character as a "symbol" here:
<w:r w:rsidR="009E034B" w:rsidRPr="00695B07">
<w:rPr>
<w:rFonts w:ascii="Wingdings 3" w:hAnsi="Wingdings 3"/>
<w:color w:val="EE9512"/>
</w:rPr>
<w:sym w:font="Wingdings 2" w:char="F038"/> <!-- <<==== this line -->
</w:r>
I haven't researched this in depth, but I expect the distinction here is that the glyphs in this "font" are not stylized versions of the unicode codepoint at which they appear.
For example, there are no "A", "B", "C" characters in this font, those positions are taken by arrows or something instead.
I imagine the distinction is important because you couldn't get good results by substituting a similar font if Wingdings 2 is not installed on the current machine. So at least that font-substitution behavior would be different for this symbol than for regular characters.
There is no API support yet for symbols in runs, so you'd need to use lxml calls to access these elements, perhaps something like:
from docx.oxml.ns import qn

# run is a python-docx Run; run._r is its underlying lxml <w:r> element
syms = run._r.xpath("./w:sym")
for sym in syms:
    print("font == %s" % sym.get(qn("w:font")))
    print("char == %s" % sym.get(qn("w:char")))
After a few more hours I think I see the complete picture. First, as scanny wrote above, python-docx does not handle w:sym elements at all (yet?), so these are lost after reading the docx, unless you resort to lxml. Then, why do I sometimes see a Wingdings character in w:t, sometimes in w:sym? Well, if I use the Word Symbol chooser (a window with all the characters in a font, where you can select one and then press "Insert" at the bottom), then you get a w:sym element. If you just set the font to Wingdings, and then type the suitable character on the keyboard (e.g. an 8 for a Wingdings 2 Circle with Dot inside), then you get a w:t element.
Thus I managed to remove all w:sym elements. To determine the "suitable" character, google for "Wingdings translator".
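For anyone who wants to keep the w:sym elements rather than remove them, here is a small sketch of reading both kinds of run content. It uses only the standard library on a cut-down copy of the XML above, not python-docx, and the function name run_contents is made up for this example.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used in document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">YYYYYYY</w:t></w:r>
  <w:r>
    <w:rPr><w:rFonts w:ascii="Wingdings 3" w:hAnsi="Wingdings 3"/></w:rPr>
    <w:sym w:font="Wingdings 2" w:char="F038"/>
  </w:r>
</w:p>"""


def run_contents(run):
    """Return the text and symbol items of one <w:r>, in document order."""
    items = []
    for child in run:
        if child.tag == f"{{{W}}}t":
            items.append(("text", child.text or ""))
        elif child.tag == f"{{{W}}}sym":
            font = child.get(f"{{{W}}}font")
            code = int(child.get(f"{{{W}}}char"), 16)  # w:char is a hex code
            items.append(("sym", font, chr(code)))
    return items


paragraph = ET.fromstring(SAMPLE)
for run in paragraph.iter(f"{{{W}}}r"):
    print(run_contents(run))
```

When duplicating a run, copying the w:sym child along with the w:t children should keep the symbol. I have only tried this on XML shaped like the sample.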

Mysql to XSLT replace \r\n

I'm using data from a MySQLi query and placing it into an XSLT file that eventually replaces the document.xml file in Word. The issue I am having is with \r and \n coming into my Word document.
The XSLT files are quite large so I will not paste all the code here, however, an example of one of the fields is below:
<w:p w:rsidR="00C61454" w:rsidRPr="00430555" w:rsidRDefault="00AE5B7C" w:rsidP="00C61454">
<w:pPr>
<w:rPr>
<w:rFonts w:ascii="Calibri" w:hAnsi="Calibri" w:cs="Segoe UI"/>
</w:rPr>
</w:pPr>
<w:proofErr w:type="spellStart"/>
<w:r w:rsidRPr="00AE5B7C">
<w:rPr>
<w:rFonts w:ascii="Calibri" w:hAnsi="Calibri" w:cs="Segoe UI"/>
<w:sz w:val="18"/>
<w:szCs w:val="18"/>
</w:rPr>
<w:t><xsl:value-of select="comments"/></w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
In the above code, comments is being placed from the mysqli query into the xslt.
There are a few other fields that may contain \r\n so something that works for the entire file would be best.
If doing a replace in the SELECT script works, I can take that route too, however, being a novice, I wouldn't know how to replace multiple items (the \r and the \n), nor what to replace them with to create the line feed / carriage return in Word.
In XSLT/XPath, normalize-space() will consolidate embedded whitespace and trim surrounding whitespace. Perhaps it could help with your unwanted \r\n characters.
See also How to add a newline (line break) in XML file? (Semantics of newline and line break characters can depend upon the processing application.)
Finally, in OOXML (the markup standard behind Microsoft Word's DOCX format), use <w:br/> (between w:t text elements within w:r run elements) to force a hard line break.
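As a rough sketch of that last point, here is the splitting logic in Python rather than XSLT. It assumes the field contains real carriage-return and line-feed characters (not the literal text \r\n), and text_to_run is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def text_to_run(text):
    """Build a <w:r> with one <w:t> per line and a <w:br/> between lines."""
    run = ET.Element(f"{{{W}}}r")
    # Treat \r\n, a lone \r and a lone \n each as a single line break.
    lines = re.split(r"\r\n|\r|\n", text)
    for index, line in enumerate(lines):
        if index > 0:
            ET.SubElement(run, f"{{{W}}}br")
        if line:
            ET.SubElement(run, f"{{{W}}}t").text = line
    return run


run = text_to_run("first line\r\nsecond line\nthird line")
print(ET.tostring(run, encoding="unicode"))
```

The same idea carries over to XSLT as a recursive template that emits a w:t, then a w:br, then calls itself on the remainder of the string.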

MS-Word 2010 can't unescape HTML entities, e.g. &mdash; &rsquo; &rdquo;

I was using a JSP with struts2-core-2.0.6.jar and ognl-2.6.11.jar to generate an MS Word document, and it worked perfectly. After I upgraded to struts2-core-2.3.28.1.jar and ognl-3.0.14.jar, MS Word can no longer open the document, because some special characters such as — are now escaped as &mdash; by the Struts tag <s:property>, and MS Word does not recognize &mdash;.
For example, I have a field called "nameAndURL" in a database table whose string value contains "—" followed by a URL, like this:
vincent—http://localhost/a/http.action?dataFormat=html&ymdhms=20130101000000
and for various reasons I cannot convert my data to this:
vincent—http://localhost/a/http.action?dataFormat=html&ymdhms=20130101000000
or:
vincent—http://localhost/a/http.action?dataFormat=html&ymdhms=20130101000000
jsp code:
<s:iterator value="nameAndURL">
<w:p wsp:rsidR="00AA5956" wsp:rsidRDefault="00571D82">
<w:pPr>
<w:ind w:left="426" w:first-line-chars="0" w:first-line="0" />
<w:jc w:val="left" />
</w:pPr>
<w:r>
<w:rPr>
<wx:font wx:val="宋体" />
<w:sz w:val="18" />
<w:sz-cs w:val="18" />
</w:rPr>
<w:t><s:property escapeHtml="true"/></w:t>
</w:r>
</w:p>
</s:iterator>
The problem is that when I use <s:property value="nameAndURL" escapeHtml="true"/>, the character — is escaped as &mdash;, which MS Word does not recognize. The broken WordML looks like this:
<w:r>
<w:t>vincent&mdash;http://localhost/a/http.action?dataFormat=html&amp;ymdhms=20130101000000</w:t>
</w:r>
But if I use <s:property value="nameAndURL" escapeHtml="false"/>, the character — is left unescaped, which MS Word can read, but & is also left as a bare &, which MS Word cannot read. The broken WordML looks like this:
<w:r>
<w:t>vincent—http://localhost/a/http.action?dataFormat=html&ymdhms=20130101000000</w:t>
</w:r>
How can I make MS Word recognize &mdash; or a bare &? Or how can I make <s:property> leave — unescaped but still escape & as &amp;?
Also, why is — escaped as &mdash; after I upgraded to struts2-core-2.3.28.1.jar and ognl-3.0.14.jar?
Thanks for your answers.
If you want to escape only &, I'd suggest doing it server-side, in the action (or in a business layer), and leaving the rest unescaped in the page.
You'll be less safe this way, however, so you should also strip at least <script> blocks.
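The underlying rule is that XML predefines only five entities (&amp;, &lt;, &gt;, &quot; and &apos;), so a named HTML entity such as &mdash; is undefined in WordML, while a literal — is fine. Here is a small sketch of XML-only escaping, written in Python with the standard library just to show the effect. It is not Struts code, and the Java equivalent would need its own XML escaper.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "vincent—http://localhost/a/http.action?dataFormat=html&ymdhms=20130101000000"

# escape() rewrites only &, < and >; the em dash stays a literal character.
safe = escape(raw)
print(safe)

# The escaped value parses as XML and comes back as the original string.
element = ET.fromstring(f"<t>{safe}</t>")
print(element.text == raw)
```

Because < and > are escaped as well, markup in the data ends up as plain text in the document.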

tcl XML parsing error

I am trying to parse an XML file using the dom package, but I get this error:
unterminatedattribute {invalid attribute list around line 4}
Here is the simple test:
package require dom;
set XML "
<Top>
<Name name='name' />
<Group number=1>
<Member name='name1' test='test1' l=100/>
</Group>
</Top>"
set doc [::dom::parse $XML]
set root [$doc cget -documentElement]
set node [$root cget -firstChild]
puts "[$node cget -nodeValue]"
That “XML” is actually formally invalid; all attribute values must be quoted. If you can, fix that.
set XML "
<Top>
<Name name='name' />
<Group number='1'>
<Member name='name1' test='test1' l='100'/>
</Group>
</Top>"
If you can't fix that, you might try using tDOM instead in HTML mode (which is a lot laxer about well-formedness constraints, though it also lower-cases all element and attribute names). Mind you, even with that it still fails on your particular input document:
% package require tdom
0.8.3
% set doc [dom parse -html $XML]
error "Unterminated element 'group' (within 'member')" at position 114
">
<group number=1>
<member name='name1' test='test1' l=100/>
</group> <--Error--
</Top>"
Fixing your document is the #1 thing to do!
The problem is that you have to enclose the attribute values in " or '. After fixing your XML the parsing was successful.
I usually don't use the dom package, instead I use the tdom package.
The tdom package has a -html option that enables loose parsing.
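The quoting rule is not specific to Tcl: any conforming XML parser rejects unquoted attribute values. As a quick cross-check, here is a sketch using Python's standard library instead of the Tcl packages. The helper is_well_formed is a made-up name.

```python
import xml.etree.ElementTree as ET

BROKEN = """<Top>
<Name name='name' />
<Group number=1>
<Member name='name1' test='test1' l=100/>
</Group>
</Top>"""

# Quote the two bare attribute values, as in the corrected document above.
FIXED = BROKEN.replace("number=1", "number='1'").replace("l=100", "l='100'")


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False on a parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(BROKEN))
print(is_well_formed(FIXED))
```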