Does <br> correspond to LINE SEPARATOR (U+2028)?

Let's say I have the following text (in typing order from left to right), where the line break is
U+2028, capital letters represent Arabic letters, and everything else represents itself.
foo FOO
!BAR#
I put them in HTML like this:
<p dir="auto">foo FOO<br>!BAR#</p>
Chromium and Firefox both display them as:
foo OOF
!RAB#
Based on my understanding of the Unicode Bidirectional Algorithm (and also from viewing
the plain text in a text editor), the '!' and '#' should be displayed next to each
other, like this:
foo OOF
RAB!#
Is this a bug in the browsers, or does <br> not actually correspond to U+2028? And
how do I insert U+2028 (or get its semantics)?
Browsers display only blank horizontal space for U+2028 (written &#x2028; in HTML).

The visually rendered output of this HTML...
<p dir="auto"> foo FOO <br> !BAR# </p>
will be
foo FOO
!BAR#
because the <br> tag produces a line break.

Related

How do I make BitBucket recognize line breaks in commit comments?

I work with mercurial, and use long(ish), multi-line commit comments.
Recently, I've put my project on BitBucket.org, and have noticed that when my commit comments are appended to issue pages (see this SO question for information on how/when that happens), the newlines are replaced with spaces, while double-newlines stay double-newlines.
How should I mark single-newlines in commit messages so that BitBucket acknowledges them? I'd like to do this in the least-obtrusive way for when I read the comments normally from the command line.
Break paragraphs (generating <p> tags) with a blank line. Break lines (generating a <br> tag) by ending the first line with two or more spaces, e.g.
Line one␣␣
Line two
Bitbucket formats comments using Markdown, which has this to say about paragraphs and line breaks:
Paragraphs and Line Breaks
A paragraph is simply one or more consecutive lines of text, separated by one or more blank lines. (A blank line is any line that looks like a blank line — a line containing nothing but spaces or tabs is considered blank.) Normal paragraphs should not be indented with spaces or tabs.
The implication of the "one or more consecutive lines of text" rule is that Markdown supports "hard-wrapped" text paragraphs. This differs significantly from most other text-to-HTML formatters (including Movable Type’s "Convert Line Breaks" option) which translate every line break character in a paragraph into a <br /> tag.
When you do want to insert a <br /> break tag using Markdown, you end a line with two or more spaces, then type return.
Yes, this takes a tad more effort to create a <br />, but a simplistic "every line break is a <br />" rule wouldn’t work for Markdown. Markdown’s email-style blockquoting and multi-paragraph list items work best — and look better — when you format them with hard breaks.
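The two-trailing-spaces rule can also be applied mechanically. The following is only a sketch: `addHardBreaks` is a hypothetical helper, not part of Bitbucket or Mercurial. It appends two spaces to every line that is directly followed by another non-blank line, and leaves blank-line paragraph breaks alone.

```javascript
// Hypothetical helper (not a Bitbucket or Mercurial API): make single
// newlines survive Markdown rendering by ending those lines with two spaces.
function addHardBreaks(message) {
  const lines = message.split("\n");
  return lines
    .map((line, i) => {
      const next = lines[i + 1];
      const isBlank = line.trim() === "";
      const nextIsBlankOrMissing = next === undefined || next.trim() === "";
      // Paragraph breaks and final lines need no hard break.
      if (isBlank || nextIsBlankOrMissing) return line;
      // Normalise any existing trailing whitespace to exactly two spaces.
      return line.trimEnd() + "  ";
    })
    .join("\n");
}

console.log(JSON.stringify(addHardBreaks("Line one\nLine two\n\nNext paragraph")));
```

Whether you want trailing spaces stored in your commit history is a separate question. The sketch only shows the transformation.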

Difference between "&#9;" and "&nbsp;" or "&#32;"

Hello, I am trying to compile an EPUB v2.0 with HTML code extracted from InDesign. I have noticed there are a lot of "special characters" either at the beginning of a paragraph or at the end. For example
<p class="text_indent0px font_size0_8em line_height1_325 margin_bottom1px margin_left0px margin_right0px sans_serif floatleft">E<span class="small_caps">VELYNE</span>&#9;</p>
What is this &#9;
and can I either get rid of it or replace it with a "&nbsp;"?
&#9;
is the ASCII code for a tab, so I guess the paragraphs were indented with tabs.
If you want to replace them with &nbsp;, then use 4 of them.
That would be a horizontal tab (i.e. the same as using the tab key).
If you want to replace it, I would suggest doing a find/replace using an ePub editor like Sigil (http://sigil-ebook.com/).
&#9; represents the horizontal tab.
Similarly, &#32; represents a space.
To replace &#9; you have to use &nbsp;.
In the HTML notation &#{number};, {number} is the decimal character code (the same as the ASCII code for values below 128). Therefore, &#9; is a tab, which typically collapses to a single space in HTML unless you use CSS (or the <pre> tag) to treat it as preformatted text.
Therefore, it's not safe to replace it with a non-breaking or regular space unless you can guarantee that it's not being displayed as a tab anywhere.
div:first-child {
white-space: pre;
}
<div>&#9;Test</div>
<div>&#9;Test</div>
<pre>&#9;Test</pre>
See https://developer.mozilla.org/en-US/docs/Web/CSS/white-space and http://ascii.cl/
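As a rough illustration of why the tab usually shows up as a single space, the sketch below approximates the collapsing applied under the default `white-space: normal`. `collapseWhitespace` is a hypothetical name, and real CSS processing has more steps (for example, dropping spaces at the edges of a line), so treat it as a simplified model only.

```javascript
// Simplified model of default whitespace collapsing: runs of spaces, tabs
// and line breaks become one space. U+00A0 (no-break space) is deliberately
// not in the character class, so it is never collapsed.
function collapseWhitespace(text) {
  return text.replace(/[ \t\n\r\f]+/g, " ");
}

console.log(JSON.stringify(collapseWhitespace("\tTest")));          // tab becomes one space
console.log(JSON.stringify(collapseWhitespace("a\u00A0\u00A0b"))); // no-break spaces kept
```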
&nbsp; is the entity used to represent a non-breaking space
&#32; is the decimal character code of the space we enter using the keyboard spacebar
&#9; is the decimal character code of the horizontal tab
&nbsp; and &#32; both represent a space, but &nbsp; is non-breaking (no line break can occur at it) and multiple sequential occurrences will not be collapsed into one, whereas in the same case &#32; will collapse to one space
= approx. 4 spaces and approx. 8 spaces
There are four types of character reference scheme used.
Using decimal character codes (regex-pattern: &#[0-9]+;),
Using hexadecimal character codes (regex-pattern: &#x[a-f0-9]+;),
Using named character codes (regex-pattern: &[a-z]+;),
Using the actual characters (regex-pattern: .).
All of these notations render the same way; only the coding style differs. For example, if you need to display a Latin small letter e with diaeresis, you could use any of the conventions below:
&#235; (decimal notation),
&#xeb; (hexadecimal notation),
&euml; (HTML notation),
ë (actual character).
Likewise, as you asked, which should be used: (a) &#9; (decimal notation), (b) &nbsp; (HTML notation), or (c) &#32; (decimal notation)?
So, by the above analogy, (a), (b) and (c) are three different notations for three different characters.
For your information, (a) is a horizontal tab, (b) is the non-breaking space, which is &#160; in decimal notation, and (c) is the decimal notation for the normal space character.
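To make the notations concrete, here is a minimal sketch that decodes a single character reference. `decodeCharRef` and the tiny `NAMED` table are illustrative assumptions: HTML defines far more named references than the two listed, and a real parser handles additional edge cases.

```javascript
// Illustrative only: a deliberately tiny table of named references.
const NAMED = new Map([
  ["nbsp", 0xa0], // no-break space
  ["euml", 0xeb], // Latin small letter e with diaeresis
]);

// Decode one reference such as "&#9;", "&#xeb;" or "&nbsp;".
// Anything unrecognised is returned unchanged.
function decodeCharRef(ref) {
  let m;
  if ((m = /^&#([0-9]+);$/.exec(ref))) {
    return String.fromCodePoint(parseInt(m[1], 10)); // decimal notation
  }
  if ((m = /^&#x([0-9a-f]+);$/i.exec(ref))) {
    return String.fromCodePoint(parseInt(m[1], 16)); // hexadecimal notation
  }
  if ((m = /^&([a-z]+);$/i.exec(ref)) && NAMED.has(m[1])) {
    return String.fromCodePoint(NAMED.get(m[1])); // named notation
  }
  return ref;
}

console.log(decodeCharRef("&#9;") === "\t");       // (a) horizontal tab
console.log(decodeCharRef("&nbsp;") === "\u00A0"); // (b) no-break space
console.log(decodeCharRef("&#32;") === " ");       // (c) ordinary space
```

The three references decode to three different characters, which is why they are not interchangeable.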
Technically, a space at the end of a paragraph is meaningless, so you could simply discard them all. If you still need such whitespace to be preserved, use it inside <pre> elements, not in <p> or <div>.
Hope this helps...

Truly selecting in HTML tags in Vim

In an HTML doc say I have this:
<p>
fdhjfkdj hfkjdfhkjdfhkjdh dfhdkf kjdh kjdhkjdhk
fhkdj hdjfhjkdh kjdh kjdf jkdhf d
jfdfhkdjfhkjdf
fjdj fhkd fdhfkjd hfkjdfhkjdf kdhfd
fdhjkfjk dhjdfhkjdf kjdfhdk fhdk
</p>
If I use the normal vit command in Vim, it selects the text inside when I yank it, but if I try to do anything such as tab over or run gq, it affects the entire <p>. For example, doing vit then gq ends up looking something like
<p> fdhjfkdj hfkjdfhkjdfhkjdh dfhdkf kjdh kjdhkjdhk lkd sldj lks jlkdf
jlsdkf jlsdf jdl dlsjl fhkdj hdjfhjkdh kjdh kjdf jkdhf d jfdfhkdjfhkjdf fjdj
fhkd fdhfkjd hfkjdfhkjdf kdhfd fdhjkfjk dhjdfhkjdf kjdfhdk fhdk </p>
Indenting it won't indent the text, but the whole tag. How do I truly select only the text inside so I can run commands on it like the ones above?
That's because the inner HTML of the paragraph starts immediately after the starting <p> tag, so it includes the newline character immediately after it (which you'll also see after vit). As you've recognized, reformatting and indenting are line-based, so that single character counts.
To make the text object work like you want, you need to move to the start of the selection (o), then reduce it to the next line (easiest with j; for indenting and formatting, the exact start column isn't important, anyway). So the sequence for reformatting would be:
vitojgq
If you want something quicker, you need to write your own text object. Have a look at my CountJump plugin, or the textobj-user plugin; they can help with defining one.

Is it allowed to use other tags inside <title>?

Is it correct practice or valid syntax to use other tags inside a <title>?
An example for multi-language title
<html lang=en>
<title>Some title in English and a <i lang=fr>word in French</i></title>
See http://www.w3.org/TR/html401/struct/global.html#h-7.4.2:
Titles may contain character entities (for accented characters, special characters, etc.), but may not contain other markup (including comments).
(my emphasis)
No, it may not
http://www.w3.org/Provider/Style/TITLE.html
You can try to use whatever you want, but it will all be used as the title string, without any additional parsing/processing from the browser (if that's what you expect). The specification says you should refrain from placing markup in the title, though.
TLDR: The <title> tag (1) must contain text (it must not be empty), (2) must only contain text (i.e. no other elements), and (3) must contain text that is not just white-space.
In HTML 5, the Content Model of the title element is:
Text that is not inter-element white space.
where inter-element white space is any Text node that is either empty or only contains sequences of space characters:
U+0020 SPACE
U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+000D CARRIAGE RETURN (CR)
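The content model above can be expressed as a small check. This is a sketch only: `isInterElementWhitespace` and `isValidTitleText` are hypothetical names, and the check covers just the text rule (it assumes the title has already been confirmed to contain no child elements).

```javascript
// The five "space characters" listed above: SPACE, TAB, LF, FF, CR.
const ONLY_SPACE_CHARS = /^[\u0020\u0009\u000A\u000C\u000D]*$/;

// True for a Text node that is empty or consists solely of space characters.
function isInterElementWhitespace(text) {
  return ONLY_SPACE_CHARS.test(text);
}

// A title's text is acceptable when it is not inter-element whitespace.
function isValidTitleText(text) {
  return !isInterElementWhitespace(text);
}

console.log(isValidTitleText("Some title in English")); // true
console.log(isValidTitleText(" \t\n"));                 // false
console.log(isValidTitleText(""));                      // false
```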

Inserting HTML tag in the middle of Arabic word breaks word connection (cursive)

From wikipedia:
Cursive (from Latin curro, currere, cucurri, cursum, to run, hasten) is any style of handwriting that is designed for writing notes and letters quickly by hand. In the Arabic, Latin, and Cyrillic writing systems, the letters in a word are connected, making a word one single complex stroke.
In these scripts, when we wrap part of a single word in a tag such as <span> to apply a custom CSS style, it breaks the connection between the letters. Is there any solution for this?
For example, this is a normal Arabic word: كتب
But when we color the last letter differently using a span tag, the last letter is rendered disconnected from the rest of the word,
because the first two letters are in one element and the last is in another in order to color it.
Is there something I can do to avoid breaking the word?
Here is the full html:
<p>كت<span style="color: Red;">ب</span></p>
I'm not sure if there's any HTML way to do it, but you can fix it by adding a zero-width joiner Unicode character before the opening span tag:
<p>كت&#x200d;<span style="color: Red;">ب</span></p>
You can use the actual Unicode character instead of the HTML character reference, of course, but that wouldn't be visible here. Or you can use the prettier &zwj; entity.
Here it is in action (using an invisible <b> tag, since I can't do color here), without the joiner:
كتب
and with the joiner:
كت‍ب
It's supposed to work without the joiner as far as I understand it, though, and it does in some browsers, but clearly not all of them.
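If the markup is generated by script, the joiner can be written as the escape `\u200D`. The sketch below is an assumption-laden illustration, not the answerer's code: `colorLastLetter` is a hypothetical helper that places a zero-width joiner before the opening span tag, as described above, and it does not check whether the preceding letter actually connects forward.

```javascript
const ZWJ = "\u200D"; // ZERO WIDTH JOINER, the same character as &zwj; / &#x200d;

// Hypothetical helper: wrap the last letter of `word` in a colored span and
// put a joiner before the opening tag so the preceding letter keeps its
// connecting form.
function colorLastLetter(word, color) {
  const letters = Array.from(word); // split by code point
  const last = letters.pop();
  return letters.join("") + ZWJ + '<span style="color: ' + color + ';">' + last + "</span>";
}

console.log(colorLastLetter("\u0643\u062A\u0628", "Red")); // the word كتب
```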
Update 2020/5
Google Chrome (checked version 81.0.4044.138) and Firefox (76.0.1) have solved this issue when rendering Arabic and Farsi words, and there is no longer any need to handle the situation manually. Simply wrapping the keyword as <span style="color:red">Keyword</span> works fine with both connecting and non-connecting characters.
For this reason, you probably can not see the difference between Correct and Wrong examples below:
Main post:
After 7 years of accepted answer I would like to add a new answer with more practical details as my native language is Farsi. I assume that we want to replace a keyword within a long word. This answer considers the following details:
1- Sometimes it is not enough to add &zwj; only to the previous character, because the next character should also have a tail to complete the connection.
body{font-size:36pt;}
span{color:red}
Wrong: مک&zwj;<span>انیک</span>
<br>
Correct: مک&zwj;<span>&zwj;انیک</span>
2- We may also need to add &zwj; after the keyword to connect it to the next character.
body{font-size:36pt;}
span{color:red}
Wrong: مک&zwj;<span>&zwj;انیک</span>ی
<br>
Correct: مک&zwj;<span>&zwj;انیک&zwj;</span>&zwj;ی
3- There are some characters that accept tail before but not after. So we have to exclude them from accepting tail after them. This is the list of non-connecting characters to next characters: ا آ د ذ ر ز ژ و
4- Finally, to respect search engines and scrapers, I recommend using JavaScript (jQuery) to replace keywords after DOM ready to keep the page source clean.
This is my final code with regards to all details above:
$(document).ready(function(){
var tail="\u200D";
var keyword="ستر";
$(".searchableContent").each(function(){
var htm=$(this).html();
/*
preserve keywords which have space both before and after
with a temp sign say #fullHolder#
*/
htm=htm.split(' '+keyword+' ').join(' #fullHolder# ');
/*
preserve keywords which have only space after
with a temp sign say #preHolder#
*/
htm=htm.split(keyword+' ').join('#preHolder#'+' ');
/*
preserve keywords which have only space before
with a temp sign say #nextHolder#
*/
htm=htm.split(' '+keyword).join(' '+'#nextHolder#');
/*
replace remaining keywords with marked up span.
Add tail to both side of span to make sure it is
connected to both letters before and after
*/
htm=htm.split(keyword).join(tail+'<span style="color:#ff0000">'+tail+keyword+tail+'</span>'+tail);
//Deal #preHolder# by adding tail only before the keyword
htm=htm.split('#preHolder#'+' ').join(tail+'<span style="color:#ff0000">'+tail+keyword+'</span>'+' ');
//Deal #nextHolder# by adding tail only after the keyword
htm=htm.split(' '+'#nextHolder#').join(' '+'<span style="color:#ff0000">'+keyword+tail+'</span>'+tail);
//Deal #fullHolder# by adding markup only without tail
htm=htm.split(' '+'#fullHolder#'+' ').join(' '+'<span style="color:#ff0000">'+keyword+'</span>'+' ');
//Remove all possible combination of added tails to non-connecting characters
var nonConnectings=['ا','آ','د','ذ','ر','ز','ژ','و'];
for (var x = 0; x < nonConnectings.length; x++) {
htm=htm.split(nonConnectings[x]+tail).join(nonConnectings[x]);
htm=htm.split(nonConnectings[x]+'<span style="color:#ff0000">'+tail).join(nonConnectings[x]+'<span style="color:#ff0000">');
htm=htm.split(nonConnectings[x]+'</span>'+tail).join(nonConnectings[x]+'</span>');
}
$(this).html(htm);
})
})
div{font-size:26pt}
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<div class="searchableContent">
سترون - بستری - آستر - بستر - استراحت
</div>
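For comparison, the same rules can be sketched as a pure function with no jQuery and no placeholder strings. This is an alternative illustration, not the code above: `highlightKeyword` and `connectsForward` are hypothetical names, the function works on plain text rather than HTML, and it treats any whitespace (not only a space) as a word boundary.

```javascript
const ZWJ = "\u200D";
// Letters that do not connect to the following letter: ا آ د ذ ر ز ژ و
const NON_CONNECTING = new Set("\u0627\u0622\u062F\u0630\u0631\u0632\u0698\u0648");

// True when `ch` is a character that can join to whatever follows it.
function connectsForward(ch) {
  return ch !== "" && !/\s/.test(ch) && !NON_CONNECTING.has(ch);
}

function highlightKeyword(text, keyword) {
  if (!keyword) return text;
  const open = '<span style="color:#ff0000">';
  const close = "</span>";
  let out = "";
  let pos = 0;
  for (let i = text.indexOf(keyword, pos); i !== -1; i = text.indexOf(keyword, pos)) {
    const before = i > 0 ? text[i - 1] : "";
    const after = text[i + keyword.length] || "";
    // Join on the left only if the previous character connects forward.
    const joinBefore = connectsForward(before);
    // Join on the right only if something follows and the keyword's own
    // last letter connects forward.
    const joinAfter =
      after !== "" && !/\s/.test(after) && connectsForward(keyword[keyword.length - 1]);
    const left = joinBefore ? ZWJ : "";
    const right = joinAfter ? ZWJ : "";
    out += text.slice(pos, i) + left + open + left + keyword + right + close + right;
    pos = i + keyword.length;
  }
  return out + text.slice(pos);
}

console.log(highlightKeyword("\u0628\u0633\u062A\u0631\u06CC", "\u0633\u062A\u0631")); // بستری
```

Because the keyword ستر ends in ر, which never connects forward, no joiner is added on the right in this example. The result would then need to be assigned as the element's HTML, as the jQuery version does.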