Standards describing how navigation is handled in non-monospace text? - language-agnostic

For a recent project I've been working on a simple word processor, and because I need fine-grained control I have had to implement a lot of the text shaping myself. Most of this is fairly straightforward and described in detail in places like here and here.
It's less obvious how to handle pressing down or up on the keyboard when dealing with non-monospaced text split across many lines. In monospaced text the algorithm is simple: move the text caret down one line and the same number of characters to the right that it was. But what about variable-width fonts? I've tried an algorithm like this (in pseudocode):
; Return text offset into next line after navigating down
function moveCaretDown():
    Move text caret to start of next line
    targetPixelOffset := previous pixel offset of caret in line above
    textOffsetIntoLine := 0
    pixelOffsetIntoLine := 0
    prevDelta := Infinity
    for each char in text of new line:
        delta := abs(pixelOffsetIntoLine - targetPixelOffset)
        ; We are now further from the desired caret offset than before, so the
        ; previous slot must be the closest one to the caret's previous horizontal offset.
        if (delta > prevDelta):
            return textOffsetIntoLine - 1
        prevDelta := delta
        pixelOffsetIntoLine += measureWidth(char)
        textOffsetIntoLine++
    ; Else return the offset of the last character in the line
    return length of new line - 1
But I've found its behavior differs from text inputs in major web browsers and/or text editors (I can come up with some specific examples if needed). Is there some standard algorithm for this used by GUI toolkits or text shaping libraries? I was surprised I couldn't find a W3C standard on it, for example, considering this is behavior needed in every web browser.
* Inserting line breaks into a string at the correct places, handling ragged or fully-justified text, etc.

I don't think there's a standard other than to follow the Principle of Least Astonishment. Nowadays, that typically means seeing what the major applications do, since that will likely be familiar behavior to the user.
On the current line, you know the current horizontal offset. Let's call it x. I'm talking about the pixel position, not the number of characters or glyphs since the beginning of that line.
On the destination line, there is a set of horizontal offsets where the caret can be placed (e.g., between glyphs). You want to pick the one that is closest to your current x.
Furthermore, if the user moves the caret vertically several times in a row, you probably want to find the nearest to the original x. The caret may wiggle horizontally a bit as the user moves up and down, but you don't want it to drift. Once the user does something that intentionally changes the horizontal offset (e.g., inserts a character, uses a horizontal arrow, clicks the mouse, etc.) that's the best time to update x.
If you already have code to find the closest caret position to a mouse click, you might be able to re-use it as though the user had clicked the point exactly one line above or below the current x.
I've also seen some editors (including monospace text editors) that treat the end of the line as a special case. So if you move up or down when you're at the end of a line, you move to the end of the preceding or succeeding line. That seems a nice way to handle ragged right text and short lines at the end of a paragraph.
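The idea of remembering the original x and snapping to the nearest caret slot on each line can be sketched in Python. This is only an illustration under stated assumptions: each line is represented as a list of per-glyph advance widths (non-negative, left-to-right only), and the names caret_slots, nearest_slot, Caret and goal_x are made up for this example, not taken from any toolkit.

```python
from bisect import bisect_left
from itertools import accumulate


def caret_slots(advances):
    """Return the x offset of every caret slot on a line.

    `advances` holds one advance width per glyph, so a line with n glyphs
    has n + 1 slots: before the first glyph, between glyphs, after the last.
    """
    return [0.0, *accumulate(advances)]


def nearest_slot(advances, target_x):
    """Return the index of the caret slot whose x offset is closest to target_x."""
    slots = caret_slots(advances)
    i = bisect_left(slots, target_x)  # first slot at or to the right of target_x
    if i == 0:
        return 0
    if i == len(slots):
        return len(slots) - 1  # target lies past the end of a short line
    # Pick whichever neighbour is closer; ties go to the left slot.
    return i if slots[i] - target_x < target_x - slots[i - 1] else i - 1


class Caret:
    """Caret that keeps a 'goal' x across consecutive vertical moves."""

    def __init__(self, lines):
        self.lines = lines  # hypothetical model: one list of advances per line
        self.line = 0
        self.offset = 0
        self.goal_x = 0.0

    def move_horizontal(self, delta):
        glyph_count = len(self.lines[self.line])
        self.offset = max(0, min(glyph_count, self.offset + delta))
        # A deliberate horizontal move is the moment to update the goal x.
        self.goal_x = caret_slots(self.lines[self.line])[self.offset]

    def move_vertical(self, delta):
        new_line = self.line + delta
        if 0 <= new_line < len(self.lines):
            self.line = new_line
            # goal_x is intentionally left unchanged so the caret does not drift.
            self.offset = nearest_slot(self.lines[new_line], self.goal_x)
```

With lines = [[10, 10, 10, 10], [4], [10, 10, 10, 10]], moving right three times and then down twice puts the caret at the end of the short middle line and then back at offset 3 on the last line, because goal_x stays at 30 throughout.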

Related

PyMuPDF detects two paragraphs with close text block coordinates as one block

I have a problem when I use fitz to detect a PDF's layout: two paragraphs are detected as one text block if the line spacing between them is small.
For example, I want to detect the text and the isolated formula as two text blocks, but at the moment fitz detects them as one text block. How can I handle this?
Should I extract the word coordinates and sort them into normal reading order, or use some similar method?
PyMuPDF also has ways to adjust the granularity of text extraction: there are more levels between and beyond block extraction and word extraction.
You can extract by line or by text span (both are a higher level than word) and by character (the level below word). All of them deliver the wrapping rectangles of the respective text, plus a plethora of text font properties (font size, font weight, font style, font color) and the writing direction.
Here is an example that extracts lines of text:
import fitz  # PyMuPDF

# `page` is assumed to be a fitz.Page taken from an already opened document
details = page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)  # skips images!
for block in details["blocks"]:  # delivers the block level
    for line in block["lines"]:  # the lines in this block
        bbox = fitz.Rect(line["bbox"])  # wraps this line
        line_text = "".join(span["text"] for span in line["spans"])
Please do have a look at this picture in the documentation - it shows an overview of the dictionary layout: https://pymupdf.readthedocs.io/en/latest/_images/img-textpage.png.
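Once you have the lines, one possible way to separate a paragraph from a nearby formula is to split a block wherever the vertical gap between consecutive lines exceeds a threshold. The sketch below works on a plain dictionary shaped like one block of the "dict" output (keys "lines", "bbox", "spans", "text"), so it runs without a PDF. The function name split_block_by_gap and the max_gap value are assumptions for illustration, and the threshold would need tuning for your documents.

```python
def split_block_by_gap(block, max_gap=2.0):
    """Split one "dict"-style text block into groups of line texts.

    A new group starts whenever the top of a line lies more than `max_gap`
    points below the lowest line bottom seen so far.
    """
    groups = []
    lowest_bottom = None
    for line in block["lines"]:
        x0, y0, x1, y1 = line["bbox"]
        text = "".join(span["text"] for span in line["spans"])
        if lowest_bottom is None or y0 - lowest_bottom > max_gap:
            groups.append([])  # gap is large enough: start a new group
        groups[-1].append(text)
        lowest_bottom = y1 if lowest_bottom is None else max(lowest_bottom, y1)
    return groups


# Hand-made block: two tightly spaced text lines, then a formula further down.
example_block = {
    "lines": [
        {"bbox": (0, 0, 100, 10), "spans": [{"text": "First line, "}, {"text": "continued"}]},
        {"bbox": (0, 11, 100, 21), "spans": [{"text": "Second line"}]},
        {"bbox": (20, 30, 80, 40), "spans": [{"text": "E = mc^2"}]},
    ]
}
print(split_block_by_gap(example_block, max_gap=5.0))
```

If the gap between the paragraph and the formula is no larger than the normal line spacing, a gap threshold alone will not separate them, and other cues from the same dictionary (font, size, horizontal indentation) would have to be used instead.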

Restricting typing at a distance from the end on an input field

I have an input field, at the end of which I've created a character counter:
The problem is that now, it is possible to type beneath the counter which is no good:
I would like the typing area to be restricted a certain distance before the end of the input field, something like this:
I am aware of maxlength, but the letters have different widths (you can fit 183 "i" but only 57 "W"), which would make for a really unintuitive typing experience if your typing were cut off in the middle of the field.
Two possible solutions occur to me.
1. Simply shorten the input and position the counter next to it, then style a common parent element to look like the input. This is the simpler and less error-prone solution.
2. This way is a bit more complicated, but basically what you would do is create a hidden element somewhere (NB: not display: none;) with the same font size/weight/family and attach a keydown event handler to the input field.
In this handler you copy the contents of the input to the hidden element, measure its width in pixels and compare that to the width of your input. If the difference is too small, you return false from the handler, making sure you're not preventing the user from pressing delete or backspace first.
It should be noted, however, that this method is pretty difficult to get right and I would consider it to be the "dirty" solution.
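The decision logic of the second approach can be sketched independently of the browser. The Python below is only a model of that logic: in a real page the measurement would come from the hidden element, whereas here measure is a stand-in using a made-up table of character widths. The names accepts_key, WIDTHS, field_width and reserved_width are all hypothetical.

```python
def accepts_key(text, key, measure, field_width, reserved_width):
    """Decide whether a keystroke may be applied to the input's current text.

    `measure` maps a string to its rendered width in pixels; `reserved_width`
    is the space kept free at the end of the field for the counter.
    """
    if key in ("Backspace", "Delete"):
        return True  # never block keys that shorten the text
    if len(key) != 1:
        return True  # arrows, Tab and similar keys add no width
    return measure(text + key) <= field_width - reserved_width


# Made-up per-character widths in pixels, standing in for real measurement.
WIDTHS = {"i": 3.0, "W": 10.0}


def measure(text):
    """Sum the assumed width of each character (6.0 px for unknown ones)."""
    return sum(WIDTHS.get(ch, 6.0) for ch in text)


print(accepts_key("i" * 10, "i", measure, field_width=100, reserved_width=40))
print(accepts_key("W" * 6, "W", measure, field_width=100, reserved_width=40))
```

The point of comparing measured widths rather than character counts is that narrow and wide letters are cut off at the same visual position.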

Close tags dropping below highlighted line

I have minimal experience with HTML script so this may all go horribly wrong here.
Alright so I have a very simple yet very time consuming task of taking complete papers and converting them into HTML script. I'm using Sublime Text 3 with Emmet plugin.
Basically,
This is the first header
This is the first paragraph that needs to be tagged
This is the second header
This is the second paragraph that needs to be tagged
So super simple I need to put header tags on the headers and paragraph tags on the paragraphs.
What I have been doing is holding Ctrl and manually highlighting the desired text as it is all rather random. Problem is that takes forever to manually highlight the text like that.
I am aware of other ways to highlight such as Ctrl + L for the line. Problem is my close tags end up under the highlighted line.
Example:
<h2>This is the first header
</h2><p>This is the first paragraph that needs to be tagged
</p>
It's not a big deal but it makes the code harder to go through later and really chaotic.
The same problem persists if I click the corresponding number of the line.
Seeing as I have hundreds of pages to enter and even more headers, paragraphs, and pictures to properly tag; I'm looking for a solution to the tag dropping below the line or a faster method to entering text.
So, is there a fast method for entering text from a word document to Sublime text and quickly get the corresponding tags? e.g. <h2>,<h3>,<p>,<ul>,<li> and so on.
Any help will save my sanity, thanks.
When you select a line with Ctrl+L, it automatically selects the entire line and moves the cursor down to the first position on the following line. There are two ways around this. The first is to place the cursor in the first position on the line you want to select, then just hit Shift+End and the line will be selected, with the cursor now sitting in the last position on that same line. Alternatively, use Ctrl+L, then hit Shift+← (left arrow) to move the cursor from the first position on the next line to the last position on the selected line. Either way, you can now hit the key combo in Emmet for inserting a tag pair, and you're all set.

What is the principle of displaying different languages (Arabic and English) together?

Such as this sentence:
عفواً يبدو أن النظام لا يستطيع تحديد أنك من عملاء STC أم لا، فإذا كنت عميل STC الرجاء الضغط على زر "إعادة المحاولة"، وإذا لم تكن من عملاء STC الرجاء الضغط على زر " لست عميل STC
Arabic is RTL and English is LTR. Sometimes after copying and pasting, the text gets out of order. When I move the cursor inside the sentence between English and Arabic characters, it jumps in a very strange way. I am also confused about how this is stored in memory. Can anyone help to explain this?
In memory this is all stored as a sequence of Unicode code points (hopefully; there were very weird things before that, but let's not go there). That's the text itself, how it is represented in the computer. The text is independent of writing direction at first; it's just a sequence of characters.
This sequence goes through a rendering engine that knows the Unicode Bidi algorithm and thus can shape the text into glyphs to display at a particular position. Every character in Unicode has a Bidi property that controls how it behaves in such contexts. This specifies that a is an LTR character while א is an RTL character; it controls that parentheses are correctly mirrored in RTL contexts (an opening parenthesis is still ( in the text, even though you see )); and several characters can appear in both contexts. This is all very simplified, and there are quite a few things at work there. Finally, multiple glyphs can overlay each other (e.g. diacritics) or form ligatures; those are then graphemes, which are essentially what we perceive as a “letter”.
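The per-character Bidi property mentioned above can be inspected with Python's standard unicodedata module. This is just a small illustration; the helper name bidi_info is made up here.

```python
import unicodedata


def bidi_info(ch):
    """Return (Bidi class, mirrored flag) for a single character."""
    return unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch))


# Latin a, Hebrew alef, Arabic ghain, opening parenthesis, digit one, space
for ch in ["a", "\u05d0", "\u063a", "(", "1", " "]:
    bidi_class, mirrored = bidi_info(ch)
    print(f"U+{ord(ch):04X} class={bidi_class} mirrored={mirrored}")
```

Here L means strongly left-to-right, R and AL are strongly right-to-left (AL is the Arabic-letter variant), and classes such as ON, EN and WS belong to characters whose direction is resolved from their surroundings. The mirrored flag marks characters like parentheses that are displayed flipped in RTL runs.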
Cursor movement is then easy to do, because the cursor can only be between two graphemes (it gets more complicated at the start of an LTR or RTL segment, but let's leave it at that for now): → moves it forwards through them while ← moves it backwards. In RTL text forwards means left, of course; it follows the text direction. What order the two graphemes have relative to each other doesn't really matter in positioning the cursor.
I admit though, that it can be confusing to see mixed RTL and LTR text, but I guess people in Arabic- or Hebrew-speaking countries are quite used to it.
Regarding the problem that the correct text layout is sometimes lost when you copy-paste text, I guess the most common cause is missing application or layout engine support for the respective script. If the layout engine does not know how to lay out Arabic text, all you get are the characters in their logical order from left to right. No ligatures are formed, no text direction applied. For example, browsers have quite good support by now for this kind of thing, but if I take the Arabic text and paste it into Word it will look wrong (was the case for Word 2007; PowerPoint did it fine, though). There is sadly no easy fix for that, but generally the text you copied is exactly the same; it's just the display that's wrong.
Disclaimer: I have lurked for a long time on the Unicode mailing list, but I'm by no means an expert on these things. I speak two languages and both are trivial as far as layout is concerned. This is a recollection of how I think it might work and might not be actual fact.
The letters are stored in logical order; meaning that a sentence such as "Hello! Salaam!" is in fact stored with the letters in precisely that order.
In addition to that, however, certain Unicode flags are also attached to the text that inform the text layout engine that the "Salaam" part of the sentence should be reversed when displayed, so the final text layout becomes "Hello! maalaS!", as well it should be.
These flags are set either through natural Bidi classification (e.g. غ) or through use of the Unicode LTR and RTL marks, U+200E and U+200F.
If you pay attention, the cursor doesn't in fact jump strangely; it always follows logical character order.
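To see the logical order for yourself, you can list the code points of a mixed string in the order they are stored. The sketch below uses only Python's standard unicodedata module; the helper name logical_order and the sample string are made up for illustration.

```python
import unicodedata


def logical_order(text):
    """List (index, code point, Unicode name) in the order the text is stored."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
    ]


# "Hello! " followed by an Arabic word and "!", written with escapes so the
# source itself is unaffected by how an editor displays RTL text.
sample = "Hello! \u0633\u0644\u0627\u0645!"
for index, code_point, name in logical_order(sample):
    print(index, code_point, name)

# The two directional marks mentioned above:
for index, code_point, name in logical_order("\u200e\u200f"):
    print(code_point, name)
```

The indices increase in reading order for both the Latin and the Arabic part, even though the Arabic letters are drawn right to left on screen. Moving the cursor "forwards" simply advances through these indices.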

Dealing with very tall textboxes and pagination in SSRS 2005

I have a report in SQL Server Reporting Services 2005. It makes use of a page header and footer and has no subreports. The body portion contains a few smaller elements and then a simple single column table. The table has a single header row and a single detail row. The header is just a label, basically. The detail row is a single textbox with a simple Fields!FieldName.Value as its output.
The problem is that FieldName, in this case, is a highly variable length string. It can be anything from a sentence up to 8000 characters (usually no more than 2 pages' worth). The text can contain line/paragraph breaks (returns) but no other special formatting. Everything is fine so long as the content fits on one page. Once the text exceeds a single page (8.5x11), it is cut off abruptly. Since this is a pagination problem, it is only visible when exporting to PDF or when viewing the report in Print Layout.
It seems as though there is a maximum size the row can grow to on the first page and then it chops it off and starts it up on the second. But this cutoff is not carefully managed in relation to the text. It can occur right in the middle of a line, causing it to show the top halves of the letters on the first page and the bottom halves at the top of the second page.
Obviously, this is unacceptable, as it looks very unprofessional and can impair the readability of the line that was so messily split. I also can never be sure it'll split badly, as sometimes it more or less ends the page evenly, though usually I can still see the hanging tails of certain letters on the next page (g and p for instance).
The secondary problem is that I'd really like the table row header to repeat on each page. Setting the obvious property, "RepeatOnNewPage" has no effect. I suspect this is because it's still trying to show the single really vertically tall row. It seems like it's okay repeating headers and splitting pages nicely between detail rows. But because this is basically just a big block of text, and thus just one really tall row, it doesn't split it nicely.
What can I do or use to solve this problem? I can live without the repeating header so long as it just doesn't cut off text in the middle of a line.
Unfortunately, page break fine tuning is one of the biggest weak points of SSRS.
I can only suggest that you break up the long text into multiple rows before SSRS ever gets it. You'd want to parse the text to look for word breaks. The result will be odd looking breaks in the output since you won't know where the break will come on a line in the printed report. However, it'd be much more readable than cutting text in half.
If the text is composed of reasonably sized paragraphs, you could parse it out that way instead.
You might even go so far as to measure the text using SQLCLR and the System.Drawing.Graphics.MeasureString method to fine tune the output, but I wouldn't recommend that route for the faint of heart.
In SSRS 2008 R2 and Visual Studio 2008:
Left-click (not right-click) a textbox and go to the Properties window (lower right side of VS), then set KeepTogether = false.
The text will then break cleanly between lines and continue on the next page.
Just thought to add here as searching for this doesn't return many results.
I have done what JC has suggested in the past where I've broken down the text into paragraphs and each paragraph would in effect be its own row. Works pretty well given the limitations of SSRS.
One thing to be careful about is that you would need to make sure that your paragraphs sort properly. In most cases it would display them in the correct order, but adding in a column with sortID to give some sorting hints to the table would probably be a good idea.
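The pre-splitting described in these answers could be done in whatever layer feeds the report. As a rough sketch of the idea only, written in Python rather than SQL, the function below turns one long string into rows of (sort ID, text), one per paragraph, optionally breaking long paragraphs at word boundaries. The name split_into_rows and the max_chars parameter are assumptions for illustration.

```python
import textwrap


def split_into_rows(text, max_chars=None):
    """Split text into (sort_id, piece) rows, one per non-empty paragraph.

    If max_chars is given, paragraphs longer than that are further broken
    at word boundaries so no single row becomes very tall.
    """
    rows = []
    for paragraph in text.splitlines():
        if not paragraph.strip():
            continue  # skip blank lines between paragraphs
        pieces = textwrap.wrap(paragraph, max_chars) if max_chars else [paragraph]
        for piece in pieces:
            rows.append((len(rows) + 1, piece))  # 1-based sort ID
    return rows


print(split_into_rows("First paragraph.\r\nSecond paragraph.\n\nThird paragraph."))
```

Each row then becomes its own detail row in the table, and sorting on the first element keeps the paragraphs in their original order.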
In the end, the cut-off-text problem was due to non-standard padding on the textbox in question.
For whatever reason, having padding any greater than the defaults (2pt all around) seemed to cause its pagination to go sour. I imagine it is due to the algorithm not taking padding into consideration when deciding where to break the paragraph. With default padding, the line always ends cleanly and nicely on each page.
As a workaround (since I liked the extra white space the padding gave to the layout), I used a rectangle to achieve the border and made the textbox inside it smaller than the rectangle by about an eighth of an inch. This gave the box some inner padding while still apparently allowing the pagination to correctly determine when to break up lines.
Still, a lot of unnecessary headache.