I am working with bidirectional text (mixed English and Hebrew) for a project. The text is displayed in HTML, so sometimes an LTR or RTL mark (&lrm; or &rlm;) is required to make 'weak characters' like punctuation display properly. These marks are not present in the source text due to technical limitations, so we need to add them for the final displayed text to appear correct.
For instance, the following text: (example: מדגם) sample renders as sample (מדגם :example) in right-to-left mode. The corrected string would look like (example: מדגם) sample and would render as sample (מדגם (example:.
We'd like to do on-the-fly insertion of these marks rather than re-authoring all the text. At first this seems simple: just append a directional mark to each instance of punctuation. However, some of the text that needs to get modified on-the-fly contains HTML and CSS. The reasons for this are unfortunate and unavoidable.
Short of parsing HTML/CSS, is there a known algorithm for on-the-fly insertion of Unicode directional marks (pseudo-strong characters)?
I don't know of an algorithm to insert directional marks into an HTML string safely without parsing it. Parsing the HTML into a DOM and manipulating the text nodes is the safest way of ensuring you don't accidentally add directional marks to text inside <script> and <style> tags.
Here is a short Python script which might help you transform your files automatically. The logic should be easy to translate into other languages if necessary. I'm not familiar enough with the RTL rules you're trying to encode, but you can tweak the regexp r'(\W)([^\W]+)(\W)' and the substitution pattern '\u200e' + r'\1\2\3' + '\u200e' to get your expected result:
import re
import lxml.html

# a non-word character, a run of word characters, and another non-word character
_RE_REPLACE = re.compile(r'(\W)([^\W]+)(\W)', re.M)
# wrap each match in LEFT-TO-RIGHT MARK characters (U+200E)
_SUBSTITUTION = '\u200e' + r'\1\2\3' + '\u200e'

def _replace(text):
    if not text:
        return text
    return _RE_REPLACE.sub(_SUBSTITUTION, text)

text = '''
<html><body>
<div>sample (\u05de\u05d3\u05d2\u05dd :example)</div>
<script type="text/javascript">var foo = "ignore this";</script>
<style type="text/css">div { font-size: 18px; }</style>
</body></html>
'''

# convert the text into an html dom
tree = lxml.html.fromstring(text)
body = tree.find('body')

# iterate over all descendants of the <body> tag
for node in body.iterdescendants():
    # transform the text that trails after the current html tag
    node.tail = _replace(node.tail)
    # ignore text inside script and style tags
    if node.tag in ('script', 'style'):
        continue
    # transform text inside the current html tag
    node.text = _replace(node.text)

# render the modified tree back to html
print(lxml.html.tostring(tree, encoding='unicode'))
Output:
python convert.py
<html><body>
<div>sample (מדגם :example)</div>
<script type="text/javascript">var foo = "ignore this";</script>
<style type="text/css">div { font-size: 18px; }</style>
</body></html>
I'm a bit stuck. I have scraped a website and would now like to convert it into markdown. My html looks like this:
Some text more text, and more text. Some text more text, and more text.
Once in a while <span class="bold">something is bold</span>.
Then some more text. And <span class="bold">more bold stuff</span>.
There are html to markdown modules available, however, they would only work if the text <b> looked like this </b>.
How could I go through the HTML and, every time I find a span which is supposed to bold something, turn that piece of the HTML into bold markdown, that is, make it **look like this**?
Try this one https://github.com/domchristie/to-markdown, an HTML to Markdown converter written in JavaScript.
It can be extended by passing in an array of converters to the options object:
toMarkdown(stringOfHTML, { converters: [converter1, converter2, …] });
In your case, the converter can be
{
  filter: 'span',
  replacement: function(content) {
    return '**' + content + '**';
  }
}
Refer to its readme for more details.
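If a JavaScript converter is not an option, the same idea can be sketched in Python with only the standard library. This is an illustrative sketch, not part of to-markdown: the class name BoldSpanToMarkdown and the function spans_to_markdown are made up for this example, and it assumes that bold text is always marked with a span whose class list contains "bold". All other tags are simply dropped and their text is kept.

```python
from html.parser import HTMLParser


class BoldSpanToMarkdown(HTMLParser):
    """Rewrite <span class="bold">...</span> as **...** and drop other tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        # One entry per open <span>: True if that span carries the "bold" class.
        self.span_stack = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            is_bold = "bold" in classes
            self.span_stack.append(is_bold)
            if is_bold:
                self.out.append("**")

    def handle_endtag(self, tag):
        # Close the markdown emphasis only if the matching span was bold.
        if tag == "span" and self.span_stack:
            if self.span_stack.pop():
                self.out.append("**")

    def handle_data(self, data):
        self.out.append(data)


def spans_to_markdown(html_text):
    parser = BoldSpanToMarkdown()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(spans_to_markdown('Once in a while <span class="bold">something is bold</span>.'))
```

Tracking the open spans on a stack means a nested, non-bold span does not close the emphasis early.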
Notepad++ is an open-source editor that supports regex. This picture shows the basic idea.
You know how to use an editor to find and replace strings. In an editor like Notepad++ you can look for string patterns, replace parts of the patterns and keep what's left. In your case, you want to find strings that are framed by HTML markup. The regex in the 'Find what' edit box expresses that, with the special notation ([^<]*) meaning "save zero or more of any character other than '<' for use in the replacement string". The 'Replace with' edit box says to use what was saved (as \1) in the expression **\1**, which gives you what you prefer to have in the text file. It remains to click on 'Replace all'.
To be able to do this you need to install Notepad++ and learn some basic Perl regex. To get this dialogue box press Ctrl-H. Of course, if you get it wrong there's always Ctrl-Z.
Texts and/or markups are rendered to output as-is without any html-encoding as we already expect.
For the following, the plain text with markup must be html-encoded.(We don't care about the code output here.)
#{ var theVar = "xyz"; }
some text & other text >>#theVar
So, the html in the output;
some text & other text >>xyz
So, when we want to write some static text that needs to be html-encoded we have to use constructs like;
#{ var theVar = "xyz"; }
#("some text & other text >>")#theVar
to get the following html in the output;
some text &amp; other text &gt;&gt;xyz
and for clarity when viewed in browser;
some text & other text >>xyz
So, is there a simple way of doing this? Some shortcut to html encode texts instead of using #("...") for each text which will start to look nasty when there are multiples of them.
What would be the best practice? How do you do this?
So, it is not a big concern when we specify UTF-8 encoding for the document. With UTF-8, characters do not need to be HTML-encoded as entity references, except for the special characters (<, >, &, ", ').
Even using & by itself is not wrong for lenient browsers, but there are ambiguous cases to consider, like volt&. So it is better to HTML-encode all of these special characters.
Check the W3 Consortium articles "When to use escapes" section;
http://www.w3.org/International/questions/qa-escapes#use
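As a rough illustration of "encode only the special characters", here is a minimal Python sketch using the standard library's html.escape. The helper name encode_text and the variable the_var are assumptions made for this example (the_var mirrors theVar in the question); the question's own template language is not shown.

```python
import html


def encode_text(text):
    """HTML-encode only the special characters: & < > " and '."""
    return html.escape(text, quote=True)


the_var = "xyz"  # hypothetical value, mirroring theVar in the question
static_text = "some text & other text >>"

# Encode the static text and the variable separately, then concatenate.
encoded = encode_text(static_text) + encode_text(the_var)
print(encoded)  # some text &amp; other text &gt;&gt;xyz

# Non-ASCII characters are left alone; with a UTF-8 document that is fine.
print(encode_text("Hergé"))  # Hergé
```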
I use an html parser (Neko) in order to extract the free-text of an html document.
Since I'm interested in text's semantic I must give special attention to the distance between words as it appears in browser.
for example:
<H1>My
title</H1>
<P>Hello
World</P>
Is rendered as:
My title
Hello world
While containing the paragraph inside <pre> tags or with style:
<style>
p { white-space:pre; }
</style>
would result:
My title
Hello
World
which I would like to treat differently, since "Hello" in that case is not semantically tied to the word "World". As said in other posts, there's a difference between what parsing does and what rendering does. I'm interested in the connection between words as it appears after rendering, since parsing obviously doesn't collapse whitespace the way a browser does when displaying the page.
Is there any way to extract whitespace-collapsed text from html as it's read on browser?
I have not used Neko before, but you will need to access the styles of the elements and see if the white-space property is set to pre, pre-wrap, or pre-line.
If it is either pre or pre-wrap, do not modify the text.
Else if it is pre-line, only replace groups of spaces/tabs with a single space.
Else, replace any whitespace group in the text with a single space.
Here's an example using jQuery:
function getRenderedText(obj) {
  var text = obj.text();
  var renderedText;
  switch (obj.css('white-space')) {
    case 'pre':
    case 'pre-wrap':
      // whitespace is preserved as written
      renderedText = text;
      break;
    case 'pre-line':
      // line breaks are kept; runs of spaces/tabs collapse (note the g flag)
      renderedText = text.replace(/[ \t]+/g, ' ');
      break;
    default:
      // every run of whitespace, including line breaks, collapses
      renderedText = text.replace(/\s+/g, ' ');
  }
  return renderedText;
}
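The same decision table can be sketched in Python for use outside the browser. This is only an approximation under stated assumptions: the function name rendered_text is made up, the caller is assumed to already know the effective white-space value of the element (Neko will not compute CSS for you), and finer rendering details such as trimming whitespace at line ends are ignored.

```python
import re


def rendered_text(text, white_space="normal"):
    """Approximate how a browser collapses whitespace for a white-space value."""
    if white_space in ("pre", "pre-wrap"):
        # Whitespace is preserved exactly as written.
        return text
    if white_space == "pre-line":
        # Line breaks survive; runs of spaces and tabs collapse to one space.
        return re.sub(r"[ \t]+", " ", text)
    # normal / nowrap: every run of whitespace collapses to one space.
    return re.sub(r"\s+", " ", text)


print(repr(rendered_text("Hello\n  World")))            # 'Hello World'
print(repr(rendered_text("Hello\n  World", "pre")))     # 'Hello\n  World'
print(repr(rendered_text("Hello\n  World", "pre-line")))  # 'Hello\n World'
```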
Just look at this basic info on w3schools
http://www.w3schools.com/cssref/pr_text_white-space.asp
and a bit better explained with examples:
http://css-tricks.com/almanac/properties/w/whitespace/
I also think that you have to put "Hello" in one <p> and "World" in another for the effect to work.
Otherwise they both go to the right.
I have a strange problem:
In the database, I have a literal ampersand lt semicolon:
&lt;div
Whenever it's printed into an HTML textarea tag, the source code of the page shows the &gt; as >.
How do I stop this decoding?
You can't stop entities being decoded in a textarea since the content of a textarea is not (unlike a script or style element) intrinsic CDATA, even though error recovery may sometimes give the impression that it is.
The definition of the textarea element is:
<!ELEMENT TEXTAREA - - (#PCDATA) -- multi-line text field -->
i.e. it contains PCDATA which is described as:
Document text (indicated by the SGML construct "#PCDATA"). Text may contain character references. Recall that these begin with & and end with a semicolon (e.g., Herg&eacute;'s adventures of Tintin contains the character entity reference for the e acute character).
This means that when you type (the invalid HTML of) "start of tag" (<) the browser corrects it to "less than sign" (&lt;), but when you type "start of entity" (&), which is allowed, no error correction takes place.
You need to write what you mean. If you want to include some HTML as data then you must convert any character with special meaning to its respective character reference.
If the data is:
&lt;div
Then the HTML must be:
<textarea>&amp;lt;div</textarea>
You can use the standard functions for converting this (e.g. PHP's htmlspecialchars or Perl's HTML::Entities module).
NB 1: If you were using XHTML (and really using it; it doesn't count if you serve it as text/html) then you could use an explicit CDATA block:
<textarea><![CDATA[&lt;div]]></textarea>
NB 2: Or if browsers implemented HTML 4 correctly
OK, but the question is: why does it decode them anyway? Assuming I've added &, saved the textarea, it will be saved as <, but displayed as <; saving it again will convert it back to < (but it will remain < in the database); saving again will save it as < in the database. Why does the textarea decode it?
The server sends (to the browser) data encoded as HTML.
The browser sends (to the server) data encoded as application/x-www-form-urlencoded (or multipart/form-data).
Since the browser is not sending the data as HTML, the characters are not represented as HTML entities.
If you take the data received from the client and then put it into an HTML document, then you must encode it as HTML first.
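To make the two encodings concrete, here is a small Python sketch of the round trip, using only the standard library. The field name somename and the sample value are illustrative assumptions; a real application would receive the body from its web framework rather than build it with urlencode.

```python
import html
from urllib.parse import parse_qs, urlencode

# What the browser would POST: the form data is URL-encoded, not HTML-encoded.
body = urlencode({"somename": "&lt;div"})
print(body)  # somename=%26lt%3Bdiv

# The server decodes the form encoding and gets the literal characters back.
value = parse_qs(body)["somename"][0]
print(value)  # &lt;div

# Before placing the value in an HTML document it must be HTML-encoded.
markup = '<textarea name="somename">' + html.escape(value) + "</textarea>"
print(markup)  # <textarea name="somename">&amp;lt;div</textarea>

# The browser decodes entities once, so the user sees the stored text again.
assert html.unescape(html.escape(value)) == value
```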
In PHP, this can be done using htmlentities(). Example below.
<?php
$content = "This string contains the TM symbol: &trade;";
print "<textarea>". htmlentities($content) ."</textarea>";
?>
Without htmlentities(), the textarea would interpret and display the TM symbol (™) instead of "&trade;".
http://php.net/manual/en/function.htmlentities.php
You have to be sure that this is rendered to the browser:
<textarea name="somename">&amp;lt;div</textarea>
Essentially, this means that the & in &lt; has to be HTML-encoded to &amp;. How to do it will depend on the technologies you're using.
UPDATE: Think about it like this. If you want to display <div> inside a textarea, you'll have to encode < and > because otherwise <div> would be a normal HTML element to the browser:
<textarea name="somename">&lt;div&gt;</textarea>
Having said this, if you want to display &lt;div&gt; inside a textarea, you'll have to encode & again, because the browser decodes HTML entities when rendering HTML. It has nothing to do with your database.
You can serve your DB-content from a separate page and then place it in the textarea using a Javascript (jQuery) Ajax-call:
request = $.ajax({
  type: "GET",
  url: "url-with-the-troubled-content.php",
  success: function(data) {
    document.getElementById('id-of-text-area').value = data;
  }
});
Explained at
http://www.endtask.net/how-to-prevent-a-textarea-element-from-decoding-html-entities/
I had the same problem and I just made two replacements on the text to show from the database before letting it into the text area:
myString = Replace(myString, "&", "&amp;")
myString = Replace(myString, "<", "&lt;")
Replacement no. 1 tricks the textarea into showing the codes.
Replacement no. 2: without it you cannot show the text "</textarea>" inside the textarea (it would end the textarea tag).
(Asp / vbscript code above, translate to a replace method of your language choice)
I found an alternative solution for reading and working with in-browser, simply read the element's text() using jQuery, it returns the characters as display characters and allows me to write from a textarea to a div's innerHTML using the property via html()...
With only JS and HTML...
...to answer the actual question, with a bare-minimal example:
<textarea id=myta></textarea>
<script id=mytext type=text/plain>
&trade;
</script>
<script> myta.value = mytext.innerText; </script>
Explanation:
Script tags do not render HTML or entities. By storing text in a script tag, it will remain unadulterated; the problem is that the browser will try to execute it as JavaScript. So we use an empty textarea and store the text in a script tag (here, the first one).
To prevent it from executing, we change the MIME type to text/plain instead of its default, which is text/javascript.
Then to populate the textarea, we copy the script tag's content to it (here done in the second script tag).
The only caveats I have found with this are you have to use JavaScript and you cannot include script tags directly in it.
I'm a beginner in web development, and I'm trying to insert line breaks in my XML file.
This is what my XML looks like:
<musicpage>
<song>
<title>Song Title</title>
<lyric>Lyrics</lyric>
</song>
<song>
<title>Song Title</title>
<lyric>Lyrics</lyric>
</song>
<song>
<title>Song Title</title>
<lyric>Lyrics</lyric>
</song>
<song>
<title>Song Title</title>
<lyric>Lyrics</lyric>
</song>
</musicpage>
I want to have line breaks in between the sentences for the lyrics. I tried everything from /n,
and other codes similar to it, PHP parsing, etc., and nothing works! Have been googling online for hours and can't seem to find the answer. I'm using the XML to insert data to an HTML page using Javascript.
Does anyone know how to solve this problem?
And this is the JS code I used to insert the XML data to the HTML page:
<script type="text/javascript">
  var xhttp;
  if (window.XMLHttpRequest) {
    xhttp = new XMLHttpRequest();
  } else {
    // fallback for old versions of Internet Explorer
    xhttp = new ActiveXObject("Microsoft.XMLHTTP");
  }
  // synchronous request for the lyrics file
  xhttp.open("GET", "xml/musicpage_lyrics.xml", false);
  xhttp.send("");
  var xmlDoc = xhttp.responseXML;
  // the elements in the XML above are named <song>
  var x = xmlDoc.getElementsByTagName("song");
  for (var i = 0; i < x.length; i++) {
    document.write("<p class='msg_head'>");
    document.write(x[i].getElementsByTagName("title")[0].childNodes[0].nodeValue);
    document.write("</p><p class='msg_body'>");
    document.write(x[i].getElementsByTagName("lyric")[0].childNodes[0].nodeValue);
    document.write("</p>");
  }
</script>
@icktoofay was close with the CDATA:
<myxml>
<record>
<![CDATA[
Line 1 <br />
Line 2 <br />
Line 3 <br />
]]>
</record>
</myxml>
In XML a line break is a normal character. You can do this:
<xml>
<text>some text
with
three lines</text>
</xml>
and the contents of <text> will be
some text
with
three lines
If this does not work for you, you are doing something wrong. Special "workarounds" like encoding the line break are unnecessary. Stuff like \n won't work, on the other hand, because XML has no escape sequences*.
* Note that &#10; is the character entity that represents a line break in serialized XML. "XML has no escape sequences" refers to the situation where you interact with a DOM document, setting node values through the DOM API. This is where neither &#10; nor things like \n will work, but an actual newline character will. How this character ends up in the serialized document (i.e. "file") is up to the API and should not concern you.
Since you seem to wonder where your line breaks go in HTML: Take a look into your source code, there they are. HTML ignores line breaks in source code. Use <br> tags to force line breaks on screen.
Here is a JavaScript function that inserts <br> into a multi-line string:
function nl2br(s) { return s.split(/\r?\n/).join("<br>"); }
Alternatively you can force line breaks at new line characters with CSS:
div.lines {
white-space: pre-line;
}
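If the conversion happens on the server instead, the <br> approach can be sketched in Python as follows. This is a minimal sketch, not a port of any particular library: the function name nl2br is reused only for familiarity, and it additionally HTML-escapes the text first, on the assumption that the lyrics are plain text and must not be interpreted as markup.

```python
import html
import re


def nl2br(text):
    """Escape the text for HTML, then turn each line break into <br>."""
    escaped = html.escape(text)
    # Handle both Unix (\n) and Windows (\r\n) line endings.
    return re.sub(r"\r?\n", "<br>", escaped)


print(nl2br("some text\nwith\nthree lines"))
# some text<br>with<br>three lines
```

Escaping before inserting the <br> tags matters: doing it the other way round would escape the tags themselves.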
just use <br> at the end of your lines.
At the end of your lines, simply add the following special character:
That special character defines the carriage-return character.
In the XML: use literal line-breaks, nothing else needed there.
The newlines will be preserved for Javascript to read them [1]. Note that any indentation-spaces and preceding or trailing line-breaks are preserved too (the reason you weren't seeing them is that HTML/CSS collapses whitespace into single space-characters by default).
Then the easiest way is: In the HTML: do nothing, just use CSS to preserve the line-breaks
.msg_body {
white-space: pre-line;
}
But this also preserves your extra lines from the XML document, and doesn't work in IE 6 or 7 [2].
So clean up the whitespace yourself; this is one way to do it (linebreaks for clarity - Javascript is happy with or without them [3]) [4]
[get lyric...].nodeValue
.replace(/^[\r\n\t ]+|[\r\n\t ]+$/g, '')
.replace(/[ \t]+/g, ' ')
.replace(/ ?([\r\n]) ?/g, '$1')
and then preserve those line-breaks with
.msg_body {
white-space: pre; /* for IE 6 and 7 */
white-space: pre-wrap; /* or pre-line */
}
or, instead of that CSS, add a .replace(/\r?\n/g, '<br />') after the other JavaScript .replaces.
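For reference, the three clean-up steps above translate almost one-to-one into Python regular expressions. This is a sketch for checking the logic outside the browser; the function name clean_lyric and the sample string are made up for this example.

```python
import re


def clean_lyric(raw):
    """Trim the ends, collapse spaces/tabs, and strip spaces around line breaks."""
    # 1. Remove leading and trailing whitespace, including blank lines.
    text = re.sub(r"^[\r\n\t ]+|[\r\n\t ]+$", "", raw)
    # 2. Collapse each run of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # 3. Drop the single space left directly before or after a line break.
    text = re.sub(r" ?([\r\n]) ?", r"\1", text)
    return text


sample = "\n\nfoo  bar baz\n\n\tmore lyrics \nare good\n\n"
print(repr(clean_lyric(sample)))
# 'foo bar baz\n\nmore lyrics\nare good'
```

Blank lines inside the lyric are kept on purpose, so verse breaks survive the clean-up.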
(Side note: Using document.write() like that is also not ideal and sometimes vulnerable to cross-site scripting attacks, but that's another subject. In relation to this answer, if you wanted to use the variation that replaces with <br>, you'd have to escape <,&(,>,",') before generating the <br>s.)
--
[1] reference: sections "Element White Space Handling" and "XML Schema White Space Control" http://www.usingxml.com/Basics/XmlSpace#ElementWhiteSpaceHandling
[2] http://www.quirksmode.org/css/whitespace.html
[3] except for a few places in Javascript's syntax where its semicolon insertion is particularly annoying.
[4] I wrote it and tested these regexps in Linux Node.js (which uses the same Javascript engine as Chrome, "V8"). There's a small risk some browser executes regexps differently. (My test string (in javascript syntax) "\n\nfoo bar baz\n\n\tmore lyrics \nare good\n\n")
<song>
<title>Song Title</title>
<lyrics>
<line>This is the very first line</line>
<line>Number two and I'm still feeling fine</line>
<line>Number three and a pattern begins</line>
<line>Add lines like this and everyone wins!</line>
</lyrics>
</song>
(Sung to the tune of Home on the Range)
If it was mine I'd wrap the choruses and verses in XML elements as well.
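With one element per line, the consumer no longer has to care about whitespace at all. As a rough sketch of reading such a document, here is a Python version using the standard library's ElementTree; the function name lyrics_to_html and the shortened sample document are assumptions made for this example.

```python
import html
import xml.etree.ElementTree as ET

# Shortened sample in the one-element-per-line layout shown above.
SONG_XML = """<song>
  <title>Song Title</title>
  <lyrics>
    <line>This is the very first line</line>
    <line>Number two and I'm still feeling fine</line>
  </lyrics>
</song>"""


def lyrics_to_html(song_xml):
    """Join the text of every <line> element with <br> tags."""
    song = ET.fromstring(song_xml)
    lines = [line.text or "" for line in song.iter("line")]
    # Escape each line so lyric text cannot be mistaken for markup.
    return "<br>".join(html.escape(text, quote=False) for text in lines)


print(lyrics_to_html(SONG_XML))
# This is the very first line<br>Number two and I'm still feeling fine
```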
If you use CDATA, you could embed the line breaks directly into the XML I think. Example:
<song>
<title>Song Title</title>
<lyric><![CDATA[Line 1
Line 2
Line 3]]></lyric>
</song>
<description><![CDATA[first line<br/>second line<br/>]]></description>
If you are using CSS to style (Not recommended.) you can use display:block;, however, this will only give you line breaks before and after the styled element.