I have a peculiar situation where I have written VBA code that performs a few find/replace actions on HTML special character codes.
I am trying to publish this code on my website in HTML format, but the parts of my code that reference codes like &quot; or &amp; are turning into the actual HTML characters (e.g. &).
Is there an HTML trick to prevent this from happening, so I can literally display the text &quot; or &amp; in my HTML code?
You can escape it by writing the leading ampersand as its XML escape code (&amp;) and leaving the rest of the code as plain text in the HTML.
<body>
<ul>
<li>&amp;amp; = &amp;</li>
<li>&amp;apos; = &apos;</li>
<li>&amp;quot; = &quot;</li>
<li>&amp;lt; = &lt;</li>
<li>&amp;gt; = &gt;</li>
</ul>
</body>
To write the actual text value " or & in HTML code, you can use their corresponding character entity references.
For ", you can use &quot; or &#34;.
For &, you can use &amp; or &#38;.
For example, to write the text Hello & World in HTML code, you would write:
<h1>Hello &amp; World</h1>
(Output='Hello & World')
<h1>Gayan said, &quot;I love Coding!&quot;</h1>
(Output='Gayan said, "I love Coding!"')
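For a longer code listing, the escaping can be automated instead of done by hand. A minimal sketch, assuming Python is available as a preprocessing step before publishing; the helper name escape_for_html is hypothetical and not from the thread, and it relies on the standard library's html.escape:

```python
import html


def escape_for_html(source: str) -> str:
    """Escape a code listing so a browser shows it character for character."""
    # quote=False leaves " and ' alone; only &, < and > are rewritten.
    return html.escape(source, quote=False)


# A VBA-style line that mentions an entity code literally.
line = 'text = Replace(text, "&quot;", Chr(34))'
print(escape_for_html(line))
# prints: text = Replace(text, "&amp;quot;", Chr(34))
```

Pasting the printed line into the page makes the browser display the original VBA line, including the literal &quot;.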
I need to write Persian words in left-to-right mode so that I can write a math formula in an HTML textarea, but I can't get it working with direction: ltr; or any other direction-based solution.
I tested text-align, direction, the dir attribute, and other things...
I want my result to look like this:
User writes: سجاد+آرش+تست+تست
HTML input Show this: تست + تست+ آرش + سجاد
It's because when you write تست+آرش, HTML sees it as one word; only a space (" ") makes HTML break the word.
You can place your Persian words inside ( and ).
So instead of using this:
سجاد+آرش+تست+تست
use this:
(سجاد)+(آرش)+(تست)+(تست)
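If the formula is processed before it is shown, the parentheses can be added automatically. A rough sketch in Python, assuming the operators are limited to + - * / = and that the input is not already parenthesized (the function name wrap_operands is made up for this illustration):

```python
import re

# Anything that is not an operator, a parenthesis, or whitespace counts as an operand.
OPERAND = re.compile(r"[^+\-*/=()\s]+")


def wrap_operands(formula: str) -> str:
    """Put every operand of a formula inside parentheses."""
    return OPERAND.sub(lambda m: "(" + m.group(0) + ")", formula)


print(wrap_operands("سجاد+آرش+تست+تست"))
# prints: (سجاد)+(آرش)+(تست)+(تست)
```

Running it twice on the same text would wrap the operands twice, so it should only be applied to raw input.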
I am writing html files from a stack. This is a bit of a pain because for every line I have to write something like the following if the file contains quotes.
write "<div id=hidden-" & quote & myKanton & quote && "style=" & quote & "display:block;" & quote && "class=" & quote & "popuptable" & quote & ">" & LF to file tOutputFileCH
Now I have to add a lot of html code again and I'm wondering if there is an easier way to be able to do something like:
write escaped("my html numbers and "txt" with quotes") to file
I do not need variables within the html text.
Often, people use functions like
function q theText
replace "'" with quote in theText
return theText
end q
which can be used as
write q("<div id=hidden-'" & myKanton & "' style='display:block;' class='popuptable'>" & LF) to file tOutputFileCH
You can use a string like in above example but you can also use any container:
get q(myVariable)
put q(it) into field 1
put q(field 1) into field 2
put q(url myUrl) into url myOtherUrl
put q(the cProperty of me) into myVar
-- etc etc etc
You can also use ´ or ` instead of ' if you change the q function.
By the way, I noticed that you don't include hidden- in the quotes. Are you sure that's correct?
HTML allows use of quotes and single quotes, so you can...
put "<div style='border:1px'>" into tHTML
LiveCode's format function allows you to escape double quotes...
put format("my html numbers and \"txt\" with quotes") into tData
It is working now. I put the html lines in a custom stack property and use that as input when writing the file. Works perfectly. It even seems to work without the q function.
write ( the cMapOverlay of stack "AfaConverter" ) & LF to file tOutputFileCH
I also tried that because a line such as
onmouseover="nhpup.popup($('#hidden-VS').html(), {'width': 400});" href="./kantone/index_kanton_VS.html"
is trouble for q without adaptations: every ' is replaced with ", which is a problem.
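The limitation can be reproduced in a few lines. The sketch below is Python rather than LiveCode, and its q is only a stand-in written for this illustration; it mirrors the blanket replacement of ' with " and shows how the nested quotes in an event handler end up closing the attribute early:

```python
def q(text: str) -> str:
    """Python stand-in for the LiveCode q function: turn every ' into "."""
    return text.replace("'", '"')


# Works when single quotes are used only as attribute delimiters.
simple = "<div class='popuptable'>"
print(q(simple))
# prints: <div class="popuptable">

# Breaks when the attribute value itself contains quoted JavaScript strings.
nested = "onmouseover=\"nhpup.popup($('#hidden-VS').html());\""
print(q(nested))
# prints: onmouseover="nhpup.popup($("#hidden-VS").html());"
```

In the second result the attribute value now ends right after `$(`, which is why storing the HTML verbatim in a custom property avoids the issue.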
There are some good answers here. Let me suggest another approach. You could use a quoting function, but in a slightly different way:
function q pString
return quote & pString & quote
end q
Then use the LiveCode merge() function. Merge evaluates any LiveCode expression or variable enclosed in [[ ]] and incorporates it into the enclosing quoted text:
write merge("my html numbers and [[q("txt")]]") to file
How would I write the entity name in HTML and not have it do its function? Example: I'm writing a tutorial and want to tell someone how to use the non-breaking space in their code (&nbsp;). So, how do I actually write out "&" "n" "b" "s" "p" ";" but have it be fluid, with no spaces?
You can use &amp; instead of &
So &nbsp; will be &amp;nbsp;
You will need to write out part of the code; in this example, the ampersand. Instead of writing &nbsp;, write out the ampersand as &amp;, and then write nbsp;. Your final result should be &amp;nbsp;, which will display &nbsp; on the webpage.
You could simply use the HTML for the ampersand, as in &amp;nbsp;, which would display what you're looking for, i.e. &nbsp;
JavaScript can be used to change the text of an HTML element; the example below puts the non-breaking space entity, as literal text, into a span element.
<p>A common character entity used in HTML is the non-breaking space: <span id="myid"></span></p>
<script>
document.getElementById("myid").textContent = "&nbsp;";
</script>
I would like to extract from a general HTML page, all the text (displayed or not).
I would like to remove:
any HTML tags
any JavaScript
any CSS styles
Is there a regular expression (one or more) that will achieve that?
Remove javascript and CSS:
<(script|style).*?</\1>
Remove tags
<.*?>
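Applied in Python, the two expressions look roughly like this. It is only a sketch: the DOTALL and IGNORECASE flags are an assumption added so that multi-line and upper-case script blocks are matched, and strip_markup is a name invented for the example:

```python
import re

# Step 1: whole <script>...</script> and <style>...</style> blocks.
SCRIPT_OR_STYLE = re.compile(r"<(script|style).*?</\1>", re.DOTALL | re.IGNORECASE)
# Step 2: every remaining tag.
ANY_TAG = re.compile(r"<.*?>", re.DOTALL)


def strip_markup(html: str) -> str:
    """Remove script/style blocks first, then all remaining tags."""
    without_blocks = SCRIPT_OR_STYLE.sub("", html)
    return ANY_TAG.sub("", without_blocks)


page = "<p>Hi</p><script>var a = 1;</script><style>p{}</style><b>there</b>"
print(strip_markup(page))
# prints: Hithere
```

Note that the text of adjacent elements is glued together, so replacing tags with a space instead of an empty string may be preferable.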
You can't really parse HTML with regular expressions. It's too complex. REs won't handle <![CDATA[ sections correctly at all. Further, some kinds of common HTML things like &lt;text&gt; will work in a browser as proper text, but might baffle a naive RE.
You'll be happier and more successful with a proper HTML parser. Python folks often use something like Beautiful Soup to parse HTML and strip out tags and scripts.
Also, browsers, by design, tolerate malformed HTML. So you will often find yourself trying to parse HTML which is clearly improper, but happens to work okay in a browser.
You might be able to parse bad HTML with RE's. All it requires is patience and hard work. But it's often simpler to use someone else's parser.
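As an illustration of the parser route, here is a minimal sketch that uses html.parser from the Python standard library in place of Beautiful Soup (a substitution made so that the example has no third-party dependency). The class and function names are invented for this example:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside script/style elements.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<p>Fish &amp; chips</p><script>var x = '<b>no</b>';</script>"
    "<!-- note --><p>done</p></body></html>"
)
print(extract_text(sample))
# prints: Fish & chips done
```

The parser also decodes entities and ignores comments, two things the regular expressions have to handle separately.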
Needed a regex solution (in php) that would return the plain text just as well (or better than) PHPSimpleDOM, only much faster. Here is the solution that I came up with:
function plaintext($html)
{
    // remove comments and any content found in the comment area (strip_tags only removes the actual tags).
    $plaintext = preg_replace('#<!--.*?-->#s', '', $html);
    // put a space between list items (strip_tags just removes the tags).
    $plaintext = preg_replace('#</li>#', ' </li>', $plaintext);
    // remove all script and style tags, including their content
    $plaintext = preg_replace('#<(script|style)\b[^>]*>(.*?)</(script|style)>#is', "", $plaintext);
    // replace br tags with a space (strip_tags would remove them without leaving a gap)
    $plaintext = preg_replace("#<br[^>]*?>#", " ", $plaintext);
    // remove all remaining html
    $plaintext = strip_tags($plaintext);
    return $plaintext;
}
When I tested this on some complicated sites (forums seem to contain some of the tougher html to parse), this method returned the same result as PHPSimpleDOM plaintext, only much, much faster. It also handled the list items (li tags) properly, where PHPSimpleDOM did not.
As for the speed:
SimpleDom: 0.03248 sec.
RegEx: 0.00087 sec.
37 times faster!
Contemplating doing this with regular expressions is daunting. Have you considered XSLT? The XPath expression to extract all of the text nodes in an XHTML document, minus script & style content, would be:
//body//text()[not(ancestor::script)][not(ancestor::style)]
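Python's built-in xml.etree.ElementTree does not support text() or the ancestor axis, so the sketch below approximates the same selection with a recursive walk instead of evaluating the XPath directly. It assumes well-formed XHTML in the standard XHTML namespace; text_nodes is a name made up for the example:

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"


def text_nodes(elem, skipped=("script", "style")):
    """Yield text under elem in document order, ignoring script/style subtrees."""
    local_name = elem.tag.rsplit("}", 1)[-1]  # drop the namespace prefix
    if local_name in skipped:
        return
    if elem.text:
        yield elem.text
    for child in elem:
        yield from text_nodes(child, skipped)
        # Text that follows a child belongs to the parent, so always keep it.
        if child.tail:
            yield child.tail


doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>T</title><style>p{}</style></head>"
    "<body><p>Hello <b>big</b> world</p><script>var a;</script></body>"
    "</html>"
)
body = ET.fromstring(doc).find(XHTML + "body")
print("".join(text_nodes(body)))
# prints: Hello big world
```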
Using perl syntax for defining the regexes, a start might be:
!<body.*?>(.*)</body>!smi
Then applying the following replace to the result of that group:
!<script.*?</script>!!smi
!<[^>]+/[ \t]*>!!smi
!</?([a-z]+).*?>!!smi
/<!--.*?-->//smi
This of course won't format things nicely as a text file, but it strips out all the HTML (mostly; there are a few cases where it might not work quite right). A better idea, though, is to use an XML parser in whatever language you are using to parse the HTML properly and extract the text from that.
The simplest way for simple HTML (example in Python):
import re
text = "<p>This is my> <strong>example</strong>HTML,<br /> containing tags</p>"
# Split into tags and text runs, keep only the text runs, and join them with spaces.
" ".join([t.strip() for t in re.findall(r"<[^>]+>|[^<]+", text) if "<" not in t])
Returns this:
'This is my> example HTML, containing tags'
Here's a function to remove even the most complex HTML tags.
function strip_html_tags( $text )
{
    $text = preg_replace(
        array(
            // Remove invisible content
            '#<head[^>]*?>.*?</head>#siu',
            '#<style[^>]*?>.*?</style>#siu',
            '#<script[^>]*?.*?</script>#siu',
            '#<object[^>]*?.*?</object>#siu',
            '#<embed[^>]*?.*?</embed>#siu',
            '#<applet[^>]*?.*?</applet>#siu',
            '#<noframes[^>]*?.*?</noframes>#siu',
            '#<noscript[^>]*?.*?</noscript>#siu',
            '#<noembed[^>]*?.*?</noembed>#siu',
            // Add line breaks before & after blocks
            '#<((br)|(hr))#iu',
            '#</?((address)|(blockquote)|(center)|(del))#iu',
            '#</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))#iu',
            '#</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))#iu',
            '#</?((table)|(th)|(td)|(caption))#iu',
            '#</?((form)|(button)|(fieldset)|(legend)|(input))#iu',
            '#</?((label)|(select)|(optgroup)|(option)|(textarea))#iu',
            '#</?((frameset)|(frame)|(iframe))#iu',
        ),
        array(
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
            "\n\$0", "\n\$0",
        ),
        $text );
    // Remove all remaining tags and comments and return.
    return strip_tags( $text );
}
If you're using PHP, try Simple HTML DOM, available at SourceForge.
Otherwise, Google html2text, and you'll find a variety of implementations for different languages that basically use a series of regular expressions to suck out all the markup. Be careful here, because tags without endings can sometimes be left in, as well as special characters such as &amp; (which is &).
Also, watch out for comments and JavaScript; I've found them particularly annoying to deal with using regular expressions, which is why I generally prefer to let a free parser do all the work for me.
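The leftover entities can be decoded once the tags are gone. A small Python sketch using the standard library's html.unescape; the tag pattern is deliberately naive and the function names are invented here. It also shows why the order of the two steps matters:

```python
import html
import re

TAG = re.compile(r"<[^>]+>")  # naive tag pattern, good enough for the demo


def tags_then_entities(markup: str) -> str:
    """Strip tags first, then decode entities such as &amp; and &lt;."""
    return html.unescape(TAG.sub("", markup))


def entities_then_tags(markup: str) -> str:
    """Decode first: escaped text like &lt;b&gt; turns into a tag and is lost."""
    return TAG.sub("", html.unescape(markup))


sample = "<p>Fish &amp; chips, written as &lt;b&gt;bold&lt;/b&gt;</p>"
print(tags_then_entities(sample))
# prints: Fish & chips, written as <b>bold</b>
print(entities_then_tags(sample))
# prints: Fish & chips, written as bold
```

Decoding after stripping keeps text that the page author had escaped on purpose.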
I believe you can just do
document.body.innerText
Which will return the content of all text nodes in the document, visible or not.
[edit (olliej): sigh nevermind, this only works in Safari and IE, and i can't be bothered downloading a firefox nightly to see if it exists in trunk :-/ ]
Can't you just use the WebBrowser control available with C# ?
System.Windows.Forms.WebBrowser wc = new System.Windows.Forms.WebBrowser();
wc.DocumentText = "<html><body>blah blah<b>foo</b></body></html>";
System.Windows.Forms.HtmlDocument h = wc.Document;
Console.WriteLine(h.Body.InnerText);
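// Requires: using System.Text.RegularExpressions; and a reference to System.Web.
// html is a string holding the page source (the original passed your_htmlfile.html here).
string decode = System.Web.HttpUtility.HtmlDecode(html);
Regex objRegExp = new Regex("<(.|\n)+?>");
// Strip every tag from the decoded text (the original passed an undefined variable g here).
string replace = objRegExp.Replace(decode, "");
// k stands for any extra substring to strip; the original does not define it.
replace = replace.Replace(k, string.Empty);
// Trim returns a new string, so the result has to be assigned.
replace = replace.Trim("\t\r\n ".ToCharArray());
Then take a label and set label.Text = replace; to see the output on the label.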