How could I make the "percent sign" searchable in .CHM?

We use strings containing a percent sign, like "%abc", "abc%def", or "xyz%", in our HTML files. We compile with HTML Help Workshop. We see the percent signs in the compiled CHM, but they are not searchable. Searching with the placeholder ?abc does not work either. Any ideas?

Related

How to stop converting "greater than" and "less than" symbols to entities in TinyMCE?

For some reason I need to save the > and < symbols AS IS in my TinyMCE 4.2.2 editor instance.
I know I can set entity_encoding to raw but, while this option prevents most symbols from being translated to entities, "greater than" and "less than" always get converted to &lt; and &gt;.
Does anyone know if a special flag or option is available for that?
Best regards.
I had this exact question since I want to treat the content of the TinyMCE editor as XML that possibly contains child elements. So if there's a <sub>3</sub> in the editor field, I want to treat that as an XML element, not just some text. Anyway, this is something that TinyMCE just doesn't allow for. The TinyMCE 4.x docs state several times that:
The base entities < > & ' and " will always be entity encoded into their named equivalents. Though ' and " will only be encoded within attribute values and < > will only be encoded within text nodes. This is correct according to the HTML and XML specifications.
See here and here. There's also some old discussion here about this on the TinyMCE help forums.
The conclusion is that TinyMCE does not allow you to get unescaped less-than or greater-than symbols, so you must convert them back from HTML entity codes after you get the text back from TinyMCE.
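For example, here is a minimal server-side sketch of that conversion step (Python shown; html.unescape does the work, and the variable name and sample content are made up for illustration):

import html

# Content as returned by TinyMCE and posted to the server; the angle
# brackets arrive entity-encoded even with entity_encoding set to raw.
editor_content = "H&lt;sub&gt;2&lt;/sub&gt;O"

# Convert the entities back to literal characters so the string can be
# handled as XML with real child elements.
raw_xml = html.unescape(editor_content)
print(raw_xml)  # H<sub>2</sub>O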

How to define a TOC in HTML for kindlegen to recognize

I'm converting a book written in DocBook into a single-page HTML file. The HTML contains a TOC:
<div class="toc">
<dl>
<dt><span class="preface">Preface</span></dt>
<dt><span class="chapter"><a href="#installation-und-versionsauswahl">1. Version Selection and
Installation</a></span></dt>
[...]
I'd like to use kindlegen to convert the HTML into a file I can use with a Kindle. That works without a problem. BUT the TOC is not recognized as a TOC. The Kindle user can't access the TOC directly with the TOC button.
What do I have to change so that kindlegen recognizes the TOC in my HTML file?
I'd recommend reading the official Kindle publishing guidelines from Amazon.
AFAIK kindlegen can't do that; you need a proper NCX file, or an OPF with the TOC setting properly configured.
See also this short tutorial.
In case it's useful: I knocked up a quick PHP script to generate very basic NCX and OPF files to support the TOC without having to break up the document. I wrote the script based on an MS Word document saved as HTML (so it is hard-coded to use those style names). Just noting it here in case it's useful to anyone who comes across this post in the future. http://alankent.me/2016/03/05/creating-a-kindle-book-using-microsoft-word-quick-note/
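Along the same lines, here is a rough Python sketch (not the script from that link) that writes a minimal toc.ncx for a single-file book. The file name, book title, and anchor ids are assumptions based on the question above:

chapters = [
    ("Preface", "book.html#preface"),
    ("1. Version Selection and Installation",
     "book.html#installation-und-versionsauswahl"),
]

nav_points = "\n".join(
    ('    <navPoint id="nav-{0}" playOrder="{0}">\n'
     '      <navLabel><text>{1}</text></navLabel>\n'
     '      <content src="{2}"/>\n'
     '    </navPoint>').format(i, title, src)
    for i, (title, src) in enumerate(chapters, start=1)
)

ncx = '''<?xml version="1.0" encoding="UTF-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head><meta name="dtb:uid" content="example-book-id"/></head>
  <docTitle><text>My Book</text></docTitle>
  <navMap>
{0}
  </navMap>
</ncx>'''.format(nav_points)

with open("toc.ncx", "w", encoding="utf-8") as f:
    f.write(ncx)

The OPF still has to point at this NCX via its spine's toc attribute (e.g. <spine toc="ncx"> with a matching manifest item) for kindlegen to pick it up as the TOC-button target.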

How does a parser (for example, HTML) work?

For argument's sake, let's assume an HTML parser.
I've read that it tokenizes everything first, and then parses it.
What does tokenize mean?
Does the parser read every character one by one, building up a multidimensional array to store the structure?
For example, does it read a < and then begin to capture the element, and then once it meets a closing > (outside of an attribute), is the element pushed onto an array or stack somewhere?
I'm interested for the sake of knowing (I'm curious).
If I were to read through the source of something like HTML Purifier, would that give me a good idea of how HTML is parsed?
Tokenizing can be composed of a few steps. For example, if you have this HTML code:
<html>
<head>
<title>My HTML Page</title>
</head>
<body>
<p style="special">
This paragraph has special style
</p>
<p>
This paragraph is not special
</p>
</body>
</html>
the tokenizer may convert that string into a flat list of significant tokens, discarding whitespace (thanks, SasQ, for the correction):
["<", "html", ">",
"<", "head", ">",
"<", "title", ">", "My HTML Page", "</", "title", ">",
"</", "head", ">",
"<", "body", ">",
"<", "p", "style", "=", "\"", "special", "\"", ">",
"This paragraph has special style",
"</", "p", ">",
"<", "p", ">",
"This paragraph is not special",
"</", "p", ">",
"</", "body", ">",
"</", "html", ">"
]
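A minimal sketch of such a first pass (Python; the two lexical modes, inside and outside a tag, already hint at the context problem discussed further down):

import re

# Simplified first-pass tokenizer: "text" mode outside tags, "markup"
# mode inside tags; emits a flat list of lexemes, dropping whitespace.
MARKUP_RE = re.compile(r'</|<|>|=|"|[A-Za-z][A-Za-z0-9-]*|[^<>="\s]+')

def tokenize(source):
    tokens, pos, in_tag = [], 0, False
    while pos < len(source):
        if not in_tag:
            end = source.find("<", pos)
            end = len(source) if end == -1 else end
            text = source[pos:end].strip()
            if text:
                tokens.append(text)            # a plain-text run
            pos, in_tag = end, True
        else:
            m = MARKUP_RE.match(source, pos)
            if m is None:                      # skip whitespace inside tags
                pos += 1
                continue
            tokens.append(m.group())
            if m.group() == ">":
                in_tag = False
            pos = m.end()
    return tokens

print(tokenize('<p style="special">This paragraph has special style</p>'))
# ['<', 'p', 'style', '=', '"', 'special', '"', '>',
#  'This paragraph has special style', '</', 'p', '>']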
There may be multiple tokenizing passes that convert the list of tokens into a list of even higher-level tokens, as the following hypothetical HTML parser might produce (still a flat list):
[("<html>", {}),
("<head>", {}),
("<title>", {}), "My HTML Page", "</title>",
"</head>",
("<body>", {}),
("<p>", {"style": "special"}),
"This paragraph has special style",
"</p>",
("<p>", {}),
"This paragraph is not special",
"</p>",
"</body>",
"</html>"
]
then the parser converts that list of tokens to form a tree or graph that represents the source text in a manner that is more convenient to access/manipulate by the program:
("<html>", {}, [
("<head>", {}, [
("<title>", {}, ["My HTML Page"]),
]),
("<body>", {}, [
("<p>", {"style": "special"}, ["This paragraph has special style"]),
("<p>", {}, ["This paragraph is not special"]),
]),
])
at this point, the parsing is complete; and it is then up to the user to interpret the tree, modify it, etc.
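A toy sketch of that last step (assuming well-formed input with matching tags; a real HTML parser also has to handle error recovery and implied tags):

def build_tree(tokens):
    root_children = []
    stack = [("<root>", {}, root_children)]    # synthetic root node
    for token in tokens:
        if isinstance(token, tuple):           # opening tag: ("<p>", {...})
            node = (token[0], token[1], [])
            stack[-1][2].append(node)
            stack.append(node)
        elif token.startswith("</"):           # closing tag: pop one level
            stack.pop()
        else:                                  # plain text child
            stack[-1][2].append(token)
    return root_children[0]

tokens = [("<p>", {"style": "special"}),
          "This paragraph has special style",
          "</p>"]
print(build_tree(tokens))
# ('<p>', {'style': 'special'}, ['This paragraph has special style'])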
First of all, you should be aware that parsing HTML is particularly ugly -- HTML was in wide (and divergent) use before being standardized. This leads to all manner of ugliness, such as the standard specifying that some constructs aren't allowed, but then specifying required behavior for those constructs anyway.
Getting to your direct question: tokenization is roughly equivalent to taking English, and breaking it up into words. In English, most words are consecutive streams of letters, possibly including an apostrophe, hyphen, etc. Mostly words are surrounded by spaces, but a period, question mark, exclamation point, etc., can also signal the end of a word. Likewise for HTML (or whatever) you specify some rules about what can make up a token (word) in this language. The piece of code that breaks the input up into tokens is normally known as the lexer.
At least in a normal case, you do not break all the input up into tokens before you start parsing. Rather, the parser calls the lexer to get the next token when it needs one. When it's called, the lexer looks at enough of the input to find one token, delivers that to the parser, and no more of the input is tokenized until the next time the parser needs more input.
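A small sketch of that pull arrangement (Python generators make it easy to see that nothing past the requested token has been scanned yet):

import re

def lexer(source):
    # re.finditer scans lazily, so each token is found only when asked for
    for match in re.finditer(r"\S+", source):
        yield match.group()

def parse(source):
    tokens = lexer(source)               # no scanning has happened yet
    print("parser got:", next(tokens))   # lexer runs just far enough
    print("parser got:", next(tokens))   # ...and a bit further on demand

parse("one two three four")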
In a general way, you're right about how a parser works, but (at least in a typical parser) it uses a stack during the act of parsing a statement; what it builds to represent a statement is normally a tree (an Abstract Syntax Tree, aka AST), not a multidimensional array.
Based on the complexity of parsing HTML, I'd reserve looking at a parser for it until you've read through a few others first. If you do some looking around, you should be able to find a fair number of parsers/lexers for things like mathematical expressions that are probably more suitable as an introduction (smaller, simpler, easier to understand, etc.)
Don't miss the W3C's notes on parsing HTML5.
For an interesting introduction to scanning/lexing, search the web for Efficient Generation of Table-Driven Scanners. It shows how scanning is ultimately driven by automata theory. A collection of regular expressions is transformed into a single NFA. The NFA is then transformed into a DFA to make state transitions deterministic. The paper then describes a method to transform the DFA into a transition table.
A key point: scanners use regular expression theory but likely don't use existing regular expression libraries. For better performance, state transitions are coded as giant case statements or in transition tables.
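A toy illustration of the transition-table idea (hand-written rather than generated, and recognizing just a Name-like token):

def char_class(c):
    # Collapse the input alphabet into a few character classes
    if c.isalpha():
        return "letter"
    if c.isdigit():
        return "digit"
    return "other"

# state -> {character class -> next state}; a missing entry means "stop"
TRANSITIONS = {
    "start":   {"letter": "in_name"},
    "in_name": {"letter": "in_name", "digit": "in_name"},
}
ACCEPTING = {"in_name"}

def scan_name(text, pos=0):
    state, end = "start", pos
    while end < len(text):
        nxt = TRANSITIONS[state].get(char_class(text[end]))
        if nxt is None:
            break
        state, end = nxt, end + 1
    return text[pos:end] if state in ACCEPTING else None

print(scan_name("h1 class"))   # -> 'h1'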
Scanners guarantee that correct words (tokens) are used. Parsers guarantee that the words are used in the correct combination and order. Scanners use regular-expression and automata theory. Parsers use grammar theory, especially context-free grammars.
A couple parsing resources:
http://www.cs.utk.edu/~eijkhout/594-LaTeX/handouts/parsing/parsing-tutorial.pdf
http://cryptodrm.engr.uconn.edu/c244lect/L2.pdf
HTML and XML syntax (and others based on SGML) are quite hard to parse, and they don't fit well into the lexing scenario, because they're not regular. In parsing theory, a regular grammar is one without any recursion, that is, without self-similar, nested patterns, or parenthesis-like wrappers which have to match each other. But HTML/XML/SGML-based languages do have nested patterns: tags can be nested. Syntax with nesting patterns is higher up in Chomsky's classification: it's context-free or even context-sensitive.
But back to your question about lexer:
Each syntax consists of two kinds of symbols: non-terminal symbols (those which unwind into other syntax rules) and terminal symbols (those which are "atomic" - they are leaves of the syntax tree and don't unwind into anything else). Terminal symbols are often just the tokens. Tokens are pumped one by one from the lexer and matched to their corresponding terminal symbols.
Those terminal symbols (tokens) often have a regular syntax, which is easier to recognize (and that's why it's factored out to the lexer, which is more specialized for regular grammars and can do the job more quickly than the more general approach needed for non-regular grammars).
So, to write a lexer for an HTML/XML/SGML-like language, you need to find parts of the syntax which are atomic enough and regular, to be dealt with easily by the lexer. And here the problem arises, because it's not obvious at first which parts these are. I struggled with this problem for a long time.
But Lie Ryan above has done a very good job of recognizing these parts. Bravo to him for that! The token types are as follows (with a rough pattern sketch after the list):
TagOpener: < lexeme, used for starting tags.
TagCloser: > lexeme, used for ending tags.
ClosingTagMarker: / lexeme used in closing tags.
Name: an alphanumeric sequence starting with a letter, used for tag names and attribute names.
Value: text which can contain a variety of different characters, spaces, etc. Used for the values of attributes.
Equals: = lexeme, used for separating attribute names from its values.
Quote: ' lexeme, used for enclosing attribute values.
DoubleQuote: " lexeme, used for enclosing attribute values.
PlainText: Any text not containing < character directly and not covered by the above types.
You can also have some tokens for entity references, like &nbsp; or &amp;. Probably:
EntityReference: a lexeme consisting of & followed by some alphanumeric characters and ended with ;.
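A rough sketch of those token types written down as patterns (Python regexes; first match wins here, and the lexical-context switching from the afterthought below is left out):

import re

TOKEN_PATTERNS = [
    ("TagOpener",        re.compile(r"<")),
    ("TagCloser",        re.compile(r">")),
    ("ClosingTagMarker", re.compile(r"/")),
    ("Equals",           re.compile(r"=")),
    ("Quote",            re.compile(r"'")),
    ("DoubleQuote",      re.compile(r'"')),
    ("Name",             re.compile(r"[A-Za-z][A-Za-z0-9:-]*")),
    ("EntityReference",  re.compile(r"&[A-Za-z0-9#]+;")),
    ("Value",            re.compile(r"[^<>\"'=\s]+")),
    ("PlainText",        re.compile(r"[^<&]+")),
]

def match_token(text, pos):
    # Return the first token type whose pattern matches at this position;
    # a real lexer would also take the current lexical context into account.
    for name, pattern in TOKEN_PATTERNS:
        m = pattern.match(text, pos)
        if m:
            return name, m.group()
    return None

print(match_token("<html>", 0))       # ('TagOpener', '<')
print(match_token("&amp; rest", 0))   # ('EntityReference', '&amp;')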
Why did I use separate tokens for ' and " rather than one token for an attribute value? Because a regular syntax couldn't recognize which of these characters should end the sequence - it depends on the character which started it (the ending character has to match the starting character). This "parenthesizing" is considered non-regular syntax. So I promote it to a higher level - to the parser. It'd be its job to match these tokens (starting and ending) together (or none at all, for simple attribute values not containing spaces).
Afterthought:
Unfortunately, some of these tokens may occur only inside other markup. So the use of lexical contexts is needed, which after all is another state machine controlling the state machines recognizing particular tokens. And that's why I've said that SGML-like languages don't fit well into the schema of lexical analysis.
This is how the HTML5 parser works: [diagram omitted; see the "overview of the parsing model" figure in the HTML5 specification]

What's the HTML character entity for the # sign?

What's the HTML character entity for the # sign? I've looked around for "pound" (which keeps returning the currency), and "hash" and "number", but what I try doesn't seem to turn into the right character.
You can search for the individual character at fileformat.info. Enter # as the search string and the first hit will lead you to U+0023. Scroll down a bit to the second table, Encodings, where you'll see the following entries:
HTML Entity (decimal) &#35;
HTML Entity (hex) &#x23;
For # we have &num;.
Bear in mind, though, that it is a new entity (IE9 can't recognize it, for instance). For wide support you'll have to resort, as others said, to the numeric references &#35; and, in hex, &#x23;.
If you need to find out others, there are some very useful tools around.
The "#" -- like most Unicode characters -- has no particular name assigned to it in the W3C list of "Character entity references": http://www.w3.org/TR/html4/sgml/entities.html
So in HTML it is either represented by itself as "#" or by a numeric character reference, "&#35;" or "&#x23;" (without quotes), as described in "HTML Document Representation": http://www.w3.org/TR/html4/charset.html
Alas, all three of these are useless for escaping it in a URL.
To transmit a "#" character to the web server in a URL, you want to use "URL encoding" aka "percent encoding" as described in RFC 3986, and replace each "#" with a "%23" (without quotes).
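For instance (Python shown; any URL-encoding routine behaves the same way):

from urllib.parse import quote

# '#' is not in quote()'s default safe set, so it gets percent-encoded
print(quote("tag#section"))   # -> tag%23section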
There is no HTML character entity for the # character, as the character has no special meaning in HTML.
You have to use a numeric character reference like &#35; if you wish to HTML-encode it for some reason.
&#35; or &#x23;
http://www.asciitable.com/ has information. Wikipedia also has pages for most Unicode characters.
http://en.wikipedia.org/wiki/Number_sign
The numeric reference is &#35;.
You can display "#" in the following ways:
&num; or &#35; or &#x23;
In addition, you can display "♯" (which is different from "#") in the following ways:
&#9839; or &#x266F; or &sharp;
We've got some wild answers here, and we might have to split hairs over whether it qualifies as an HTML entity, but what I believe you're looking for is the named anchor.
This allows references to different sections within an HTML document via hyperlink, and it specifically uses the octothorpe (hash symbol, number symbol, pound symbol):
exampleDomain.com/exampleSilo/examplePage.html#2ndBase
would be an example of how it's used.
&num; is the best option because it is the only one that doesn't include the # (hash) in it. Supported by old browsers or not, it is the best practice going forward.
(What is the point of encoding something using the same symbol you are encoding?)

Unicode characters in URLs

In 2010, would you serve URLs containing UTF-8 characters in a large web portal?
Unicode characters are forbidden as per the RFC on URLs (see here). They would have to be percent-encoded to be standards-compliant.
My main point, though, is serving the unencoded characters for the sole purpose of having nice-looking URLs, so percent encoding is out.
All major browsers seem to be parsing those URLs okay no matter what the RFC says. My general impression, though, is that it gets very shaky when leaving the domain of web browsers:
URLs getting copied and pasted into text files, emails, even web sites with a different encoding
HTTP Client libraries
Exotic browsers, RSS readers
Is my impression correct that trouble is to be expected here, and thus it's not a practical solution (yet) if you're serving a non-technical audience and it's important that all your links work properly even if quoted and passed on?
Is there some magic way of serving nice-looking URLs in HTML
http://www.example.com/düsseldorf?neighbourhood=Lörick
that can be copy+pasted with the special characters intact, but work correctly when re-used in older clients?
Use percent encoding. Modern browsers will take care of display and paste issues and make it human-readable, e.g. http://ko.wikipedia.org/wiki/위키백과:대문
Edit: when you copy such a URL in Firefox, the clipboard will hold the percent-encoded form (which is usually a good thing), but if you copy only a part of it, it will remain unencoded.
What Tgr said. Background:
http://www.example.com/düsseldorf?neighbourhood=Lörick
That's not a URI. But it is an IRI.
You can't include an IRI in an HTML4 document; the type of attributes like href is defined as URI and not IRI. Some browsers will handle an IRI here anyway, but it's not really a good idea.
To encode an IRI into a URI, take the path and query parts, UTF-8-encode them, then percent-encode the non-ASCII bytes:
http://www.example.com/d%C3%BCsseldorf?neighbourhood=L%C3%B6rick
If there are non-ASCII characters in the hostname part of the IRI, e.g. http://例え.テスト/, they have to be encoded using Punycode instead.
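A short Python sketch of those two steps (the example values are the ones from this answer):

from urllib.parse import quote

# Path and query: UTF-8-encode, then percent-encode the non-ASCII bytes
path = quote("/düsseldorf", safe="/")          # '/d%C3%BCsseldorf'
query = "neighbourhood=" + quote("Lörick")     # 'neighbourhood=L%C3%B6rick'
print("http://www.example.com" + path + "?" + query)
# -> http://www.example.com/d%C3%BCsseldorf?neighbourhood=L%C3%B6rick

# Hostname: encode with IDNA/Punycode instead of percent-encoding
host = "例え.テスト".encode("idna").decode("ascii")
print(host)   # the xn--... form of the hostname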
Now you have a URI. It's an ugly URI. But most browsers will hide that for you: copy and paste it into the address bar or follow it in a link and you'll see it displayed with the original Unicode characters. Wikipedia has been using this for years, e.g.:
http://en.wikipedia.org/wiki/ɸ
The one browser whose behaviour is unpredictable and doesn't always display the pretty IRI version is...
...well, you know.
Depending on your URL scheme, you can make the UTF-8 encoded part "not important". For example, if you look at Stack Overflow URLs, they're of the following form:
http://stackoverflow.com/questions/2742852/unicode-characters-in-urls
However, the server doesn't actually care if you get the part after the identifier wrong, so this also works:
http://stackoverflow.com/questions/2742852/これは、これを日本語のテキストです
So if you had a layout like this, then you could potentially use UTF-8 in the part after the identifier and it wouldn't really matter if it got garbled. Of course this probably only works in somewhat specialised circumstances...
Not sure if it is a good idea, but as mentioned in other comments and as I interpret it, many Unicode chars are valid in HTML5 URLs.
E.g., the href docs (http://www.w3.org/TR/html5/links.html#attr-hyperlink-href) say:
The href attribute on a and area elements must have a value that is a valid URL potentially surrounded by spaces.
Then the definition of "valid URL" points to http://url.spec.whatwg.org/, which defines URL code points as:
ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "#", "_", "~", and code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFFD, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD.
The term "URL code points" is then used in a few parts of the parsing algorithm, e.g. for the relative path state:
If c is not a URL code point and not "%", parse error.
Also, the validator at http://validator.w3.org/ passes URLs containing characters like "你好", and does not pass URLs with characters like spaces ("a b").
Related: Which characters make a URL invalid?
While all of these comments are true, you should note that since ICANN approved Arabic (Persian) and Chinese characters for registration as domain names, all of the browser makers (Microsoft, Mozilla, Apple, etc.) have to support Unicode in URLs without any encoding, and those URLs should be searchable by Google, etc.
So this issue will resolve itself soon.
For me this is the correct way; this just worked:
<?php
// Decode the percent-encoded URL so the Arabic characters display as-is
$linker = rawurldecode($link);
echo $linker;
This worked, and now links are displayed properly:
http://newspaper.annahar.com/article/121638-معرض--جوزف-حرب-في-غاليري-جانين-ربيز-لوحاته-الجدية-تبحث-وتكتشف-وتفرض-الاحترام
Link found on:
http://www.galeriejaninerubeiz.com/newsite/news
Use the percent-encoded form. Some (mainly old) computers, for example some running Windows XP, do not support Unicode, but rather ISO encodings. That is the reason percent-encoded URLs were invented. Also, if you give a user a URL printed on paper that contains characters which cannot be easily typed, that user may have a hard time typing it (or may just ignore it). The percent-encoded form can even be used on many of the oldest machines that ever existed (although they don't support the internet, of course).
There is a downside, though, as percent-encoded characters are longer than the original ones, possibly resulting in really long URLs. But just try to ignore it, or use a URL shortener (I would recommend goo.gl in this case, which makes a 13-character-long URL). Also, if you don't want to register for a Google account, try bit.ly (bit.ly makes slightly longer URLs, with the length being 14 characters).