What is Run-level content in python-docx?

I am a bit confused about the concept of 'Run-level content' in python-docx. I get that if I want to check whether a paragraph is in bold or not, I need to check run.bold, but what exactly is a run?
The official definition is: A run is the object most closely associated with inline content; text, pictures, and other items that are flowed between the block-item boundaries within a paragraph.
So, is it single-character-level content in the paragraph? Am I missing anything here?

A simple way to understand a run in Word is a sequence of characters that all share the same character formatting.
So if you have a sentence like this and want a bold word to appear, you can't tell the sentence to be bold (that would bold too much) and you don't want to tell each individual character to be bold (that would bold too little at a time).
So you group the characters into runs and apply character formatting to the run (and that is juuuust right :).
The example sentence would need three runs. One before the bold word, one for the bold word itself, and one for after the bold word. The middle run would be set bold; the other two would have no special formatting.
There are more things to know about runs, like they are subordinate to a paragraph (so the same run can't start in one paragraph and end in another), but this is the main gist of the concept.
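To make the grouping concrete, here is a minimal sketch in plain Python. It does not use python-docx; `split_into_runs` and the per-character `bold_flags` list are hypothetical names made up for this illustration. It only models the idea that a run is a maximal stretch of consecutive characters sharing the same formatting.

```python
from itertools import groupby


def split_into_runs(text, bold_flags):
    """Group consecutive characters that share the same bold flag.

    Hypothetical helper for illustration only, not part of python-docx.
    Returns a list of (run_text, is_bold) tuples.
    """
    runs = []
    pairs = zip(text, bold_flags)
    for is_bold, group in groupby(pairs, key=lambda pair: pair[1]):
        runs.append(("".join(ch for ch, _ in group), is_bold))
    return runs


sentence = "You want a bold word here."
start = sentence.index("bold")
# One flag per character: True only for the characters of the word "bold"
flags = [start <= i < start + len("bold") for i in range(len(sentence))]

runs = split_into_runs(sentence, flags)
for run_text, is_bold in runs:
    print(repr(run_text), is_bold)
```

The sentence comes out as three runs: one before the bold word, the bold word itself, and one after it, which matches the description above.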

Related

What's the semantically correct way to represent one sentence that contains a multi-line quote?

I have a sentence which includes a quote of a multi-line poem. I'm wondering what's the correct way to express it in HTML.
Example
Yesterday the poem
Would it be ok if I took some of your time?
Would it be ok if I wrote you a rhyme?
kept bouncing in my head.
I believe this whole example should live in a <p> tag, since it wouldn't make sense to split "Yesterday the poem" or "kept bouncing in my head." away from the rest: they're not valid independently. The whole example together is one sentence.
For a poem represented this way I would normally use a <blockquote>. However, you can't nest a blockquote inside a p, so I need to change something.
What's the semantically correct way to represent my sentence?

Correct sentence and paragraph casing

I have a database full of descriptions in all lower case. This doesn't look very nice. I am trying to figure out a way to correct the casing, i.e., capitalize the first letter after every period, and after every paragraph or new line. Something along the lines of,
UPDATE table_name SET description = REPLACE(description, '. a','. A')
but of course looping through all possible cases, etc.
EDIT: Here is an example.
I need to change,
this sentence needs to be properly capitalized. and this sentence too. there could be multiple sentences in this paragraph.
a new paragraph should be capitalized as well. so it looks nicer.
to this,
This sentence needs to be properly capitalized. And this sentence too. There could be multiple sentences in this paragraph.
A new paragraph should be capitalized as well. So it looks nicer.
Also, new lines are in the form of '\n'. I hope that clarifies things.
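For reference, the transformation described above can be sketched outside the database. This is Python rather than SQL, and it assumes the descriptions can be read, transformed in application code, and written back; `capitalize_sentences` is a hypothetical helper name. It will also capitalize after abbreviations such as "e.g. ", so treat it as a starting point only.

```python
import re

# A lowercase letter at the start of the text, after sentence-ending
# punctuation plus whitespace, or directly after a newline.
_SENTENCE_START = re.compile(r"(^|[.!?]\s+|\n)([a-z])")


def capitalize_sentences(text):
    """Upper-case the first letter of each sentence and each new line."""
    return _SENTENCE_START.sub(
        lambda m: m.group(1) + m.group(2).upper(), text
    )


example = (
    "this sentence needs to be properly capitalized. and this sentence too. "
    "there could be multiple sentences in this paragraph.\n"
    "a new paragraph should be capitalized as well. so it looks nicer."
)
fixed = capitalize_sentences(example)
print(fixed)
```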

Close tags dropping below highlighted line

I have minimal experience with HTML script so this may all go horribly wrong here.
Alright, so I have a very simple yet very time-consuming task: taking complete papers and converting them into HTML. I'm using Sublime Text 3 with the Emmet plugin.
Basically,
This is the first header
This is the first paragraph that needs to be tagged
This is the second header
This is the second paragraph that needs to be tagged
So, super simple: I need to put header tags on the headers and paragraph tags on the paragraphs.
What I have been doing is holding Ctrl and manually highlighting the desired text, as it is all rather random. The problem is that it takes forever to highlight the text manually like that.
I am aware of other ways to highlight, such as Ctrl + L for the line. The problem is that my close tags end up under the highlighted line.
Example:
<h2>This is the first header
</h2><p>This is the first paragraph that needs to be tagged
</p>
It's not a big deal but it makes the code harder to go through later and really chaotic.
The same problem persists if I click the corresponding number of the line.
Seeing as I have hundreds of pages to enter and even more headers, paragraphs, and pictures to properly tag, I'm looking for a solution to the tag dropping below the line or a faster method of entering text.
So, is there a fast method for entering text from a Word document into Sublime Text and quickly getting the corresponding tags? e.g. <h2>,<h3>,<p>,<ul>,<li> and so on.
Any help will save my sanity, thanks.
When you select a line with Ctrl+L, it automatically selects the entire line, and moves the cursor down to the first position on the following line. There are two ways around this. The first is to place the cursor in the first position on the line you want to select, then just hit Shift+End and the line will be selected, with the cursor now sitting in the last position on that same line. Alternatively, use Ctrl+L, then hit Shift+← (left arrow) to move the cursor from the first position on the next line to the last position on the selected line. Either way, you can now hit the key combo in Emmet for inserting a tag pair, and you're all set.
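The effect can be modelled in a few lines of Python. This is only an illustration of why the closing tag drops, not how Sublime Text or Emmet is implemented; `wrap_in_tag` is a hypothetical helper. A whole-line selection includes the trailing line break, so the closing tag lands after it.

```python
def wrap_in_tag(selection: str, tag: str) -> str:
    """Wrap the selected text in an opening and closing tag."""
    return f"<{tag}>{selection}</{tag}>"


line = "This is the first header\n"

# Whole-line selection: the line break is part of the selection,
# so the closing tag ends up on the next line.
with_newline = wrap_in_tag(line, "h2")

# Selection that stops at the last character of the line:
# the closing tag stays on the same line.
without_newline = wrap_in_tag(line.rstrip("\n"), "h2") + "\n"

print(repr(with_newline))
print(repr(without_newline))
```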

Why is a trailing punctuation mark rendered at the start with direction:rtl?

This is more a sort of curiosity. While working on a multilingual web application I noticed that certain characters like punctuation marks (!?.;,) at the end of a block element are rendered as if they were placed at the beginning instead when the writing direction is right-to-left (as is the case for certain Asian languages I do not speak).
In other words, the string
Hello, World!
is rendered as
!Hello, World
when placed in a div block with direction: rtl
This becomes even more evident if the text is split in two parts and given different colors: a contiguous chunk of text at the end is rendered in two separate regions:
http://jsfiddle.net/22Qk9/
What's the point of this behavior? I guess this must be a peculiarity of (all?) right-to-left languages which is automatically handled by the browser, so I don't need to care about it, or should I?
If you want to fix this behavior, add the LRM character (U+200E LEFT-TO-RIGHT MARK, written &lrm; in HTML) at the end. It's a non-printing character.
Source : http://dotancohen.com/howto/rtl_right_to_left.html
Example : http://jsfiddle.net/yobjj6ed/
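As a rough sketch of the same idea in Python (the helper name `append_lrm` is made up for this example), the mark is simply appended to the string. The standard `unicodedata` module shows that it is a character with strong left-to-right directionality, which is why the trailing punctuation ends up enclosed by left-to-right characters on both sides.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, zero-width and invisible


def append_lrm(text: str) -> str:
    """Append an LRM so trailing punctuation stays with the LTR text."""
    return text + LRM


fixed = append_lrm("Hello, World!")
lrm_name = unicodedata.name(LRM)
lrm_class = unicodedata.bidirectional(LRM)

print(len("Hello, World!"), len(fixed))
print(lrm_name, lrm_class)
```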
The reason is that the exclamation mark “!” has the BiDi class ON ('Other Neutrals'), which means effectively that it adapts to the directionality of the surrounding text. In the example case, it is therefore placed to the left of the text before it. This is quite correct for languages written right to left: the terminating punctuation mark appears at the end, i.e. on the left.
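The Bidi class of a character can be inspected with Python's standard `unicodedata` module. The sketch below is only a lookup, not an implementation of the Bidi algorithm, and `bidi_classes` is a hypothetical helper name.

```python
import unicodedata


def bidi_classes(text):
    """Return (character, Bidi class) pairs for each character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# A Latin letter, an exclamation mark, and the Hebrew letter alef (U+05D0)
classes = dict(bidi_classes("d!\u05d0"))
for ch, cls in classes.items():
    print(f"U+{ord(ch):04X}", cls)
```

The letters carry a strong direction (L or R), while the exclamation mark is neutral and takes its direction from context.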
Normally, you use the CSS code direction: rtl or, preferably, the HTML attribute dir=rtl for texts in a language that is written right to left, and only for them. For them, this behavior is a solution, not a problem.
If you instead use direction: rtl or dir=rtl just for special effects, like making table columns laid out right to left, then you need to consider the implications. For example, in the table case, you would need to set direction to ltr for each cell of the table (unless you want them to be rendered as primarily right to left text).
If you have, say, an English sentence quoted inside a block of Arabic text, then you need to set the directionality of an element containing the English text to ltr, e.g.
<blockquote dir=ltr>Hello, World!</blockquote>
A similar case (just with Arabic inside English text) is discussed as use case 6 in the W3C document What you need to know about the bidi algorithm and inline markup (which has a few oddities, though, like using cite markup for quoted text, against W3C recommendations).
The accepted answer (https://stackoverflow.com/a/20799360/477420) works if you can control the markup/CSS around the value. If you have no control over the HTML, the following approach could work.
If you don't know whether the page will be rendered RTL or LTR, but some text is definitely LTR (i.e. English-only), you can wrap the value in LRE/PDF marks to signify that it is an LTR region. The text will then be rendered LTR irrespective of the page's direction.
This works when your code renders text without being able to change the markup around it. For example, you are rendering the value for a "song title" or "company name" field in some nested child component (or on the server side) without the ability to control the surrounding HTML elements.
One drawback of this and similar approaches that add marks to the text (like the LRM proposal in this question) is that copying such a value from the resulting HTML page will generally preserve the marks, even though they are invisible and zero-width. For most cases that is fine, but consider whether it is a problem for you.
Approximate sample code (some company names end in "Inc.", and as-is the dot ends up at the beginning when rendered on an RTL page):
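// companyName = "Alphabet Inc.": the dot should stay at the end, even on an RTL page
function wrapAsLtr(companyName) {
    // Assumed helper: treat pure-ASCII strings as definitely LTR
    const stringIsDefinitelyAscii = (s) => /^[\x00-\x7F]*$/.test(s);
    if (stringIsDefinitelyAscii(companyName)) {
        // U+202A is LRE, U+202C is PDF
        companyName = "\u202A" + companyName + "\u202C";
    }
    return companyName;
}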
Details on LRE/PDF symbols can be found in https://unicode.org/reports/tr9/#Explicit_Directional_Embeddings:
LRE U+202A LEFT-TO-RIGHT EMBEDDING
Treat the following text as embedded left-to-right.
PDF U+202C POP DIRECTIONAL FORMATTING
End the scope of the last LRE, RLE, RLO, or LRO.
Some approaches for figuring out whether a string has RTL characters can be found in "How to detect whether a character belongs to a Right To Left language?", "JavaScript: how to check if character is RTL?", and "How to detect if a string contains any Right-to-Left character?".
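A minimal Python sketch of the wrapping idea follows. Note that it swaps the "definitely ASCII" check for a different test: it looks for characters whose Bidi class is R or AL using the standard `unicodedata` module. The names `contains_rtl` and `wrap_if_ltr` are hypothetical, and treating "no strong RTL character" as "safe to wrap" is an assumption of this sketch.

```python
import unicodedata

LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

# Bidi classes of strongly right-to-left characters:
# "R" (e.g. Hebrew letters) and "AL" (Arabic letters)
RTL_CLASSES = {"R", "AL"}


def contains_rtl(text: str) -> bool:
    """True if any character has a strong right-to-left Bidi class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


def wrap_if_ltr(text: str) -> str:
    """Wrap text in LRE/PDF unless it contains strong RTL characters."""
    if contains_rtl(text):
        return text
    return LRE + text + PDF


wrapped = wrap_if_ltr("Alphabet Inc.")
arabic_word = "\u0639\u0645\u064a\u0644"  # Arabic letters, left untouched
untouched = wrap_if_ltr(arabic_word)
```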

What is the principle of displaying different languages (Arabic and English) together?

Such as this sentence:
عفواً يبدو أن النظام لا يستطيع تحديد أنك من عملاء STC أم لا، فإذا كنت عميل STC الرجاء الضغط على زر "إعادة المحاولة"، وإذا لم تكن من عملاء STC الرجاء الضغط على زر " لست عميل STC
Arabic is RTL and English is LTR. Sometimes after copy and paste the text gets scrambled. When I move the cursor inside the sentence between English and Arabic characters, it jumps in a very strange way. I am also confused about how this is stored in memory. Can anyone help explain this?
In memory this is all stored as a sequence of Unicode code points (hopefully; there were very weird things before that, but let's not go there) – that's the text itself, how it is represented in the computer. The text is independent of writing direction at first; it's just a sequence of characters.
This sequence goes through a rendering engine that knows the Unicode Bidi algorithm and thus can shape the text into glyphs to display at a particular position. Every character in Unicode has a Bidi property that controls how it behaves in such contexts. This specifies that a is an LTR character while א is an RTL character; it controls that parentheses are correctly mirrored in RTL contexts (an opening parenthesis is still ( in the text, even though you see )); and several characters can appear in both contexts. This is all very simplified, and there are quite a few things at work there. Finally, multiple glyphs can overlay each other (e.g. diacritics) or form ligatures; those are then graphemes, which are essentially what we perceive as a “letter”.
Cursor movement is easy to do then, because the cursor can only be between two graphemes (it gets more complicated at the start of an LTR or RTL segment, but let's leave it at that for now) and → moves it forwards through them while ← moves it backwards. In RTL, forwards means left, of course; it follows the text direction. What order the two graphemes have relative to each other doesn't really matter in positioning the cursor.
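The storage order and the mirroring property can be checked from Python with the standard `unicodedata` module. This is only an inspection sketch; the sample string (Latin letters, then an Arabic word in parentheses, written with escapes) is made up for the example.

```python
import unicodedata

# "STC (" + four Arabic letters + ")", written with escapes so the
# source code itself is not reordered by the editor's Bidi rendering
text = "STC (\u0639\u0645\u064a\u0644)"

# Code points in the order they are stored (logical order),
# regardless of how a renderer lays them out on screen
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)

# Parentheses are flagged as mirrored in bidirectional text; letters are not
paren_is_mirrored = bool(unicodedata.mirrored("("))
letter_is_mirrored = bool(unicodedata.mirrored("a"))
print(paren_is_mirrored, letter_is_mirrored)
```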
I admit though, that it can be confusing to see mixed RTL and LTR text, but I guess people in Arabic- or Hebrew-speaking countries are quite used to it.
Regarding the problem that the correct text layout is sometimes lost when you copy-paste text, I guess the most common problem is application or layout engine support for the respective script. If the layout engine does not know how to lay out Arabic text, all you get are the characters in their logical order from left to right. No ligatures are formed, no text direction is applied. For example, browsers have quite good support by now for this kind of thing, but if I take the Arabic text and paste it into Word it will look wrong (was the case for Word 2007; PowerPoint did it fine, though). There is sadly no easy fix for that, but generally the text you copied is exactly the same, it's just the display that's wrong.
Disclaimer: I have lurked for a long time on the Unicode mailing list, but I'm by no means an expert on these things. I speak two languages and both are trivial as far as layout is concerned. This is a recollection of how I think it might work and might not be actual fact.
The letters are stored in logical order; meaning that a sentence such as "Hello! Salaam!" is in fact stored with the letters in precisely that order.
In addition to that, however, Unicode directionality information tells the text layout engine that the "Salaam" part of the sentence should be reversed when displayed, so the final text layout becomes "Hello! maalaS!", as well it should be.
This information comes either from a character's natural Bidi classification (e.g. غ is a right-to-left letter) or from explicit Unicode LTR and RTL marks, U+200E and U+200F.
If you pay attention, the cursor doesn't in fact jump strangely, it always follows logical character order.