XML, S-Expressions, and overlapping scope... What's it called?

I was reading "XML is not S-Expressions". XML scoping is kind of strict, as are S-expressions. And in every programming language I've seen, you can't have the following:
<b>BOLD <i>BOTH </b>ITALIC</i> == BOLD BOTH ITALIC
It's not even expressible with S-Expressions:
(bold "BOLD" (italic "BOTH" ) "ITALIC" ) == :(
Does any programming language support this kind of "overlapping" scoping? Could there be any practical use for it?

Overlapping markup structures have many practical uses. Consider, for example, applications of concurrent markup for text analysis in the humanities. The International Workshop on Markup of Overlapping Structures noted that:
Overlapping structures are ubiquitous, appearing in applications of textual markup as varied as aircraft maintenance manuals and ancient scriptural and liturgical works. The “overlap issue” raises its ugly head whenever text encoding looks beyond the snapshot view of a particular hierarchy to represent and process multiple concurrent aspects of a text, including features that reflect the text’s evolution across multiple versions and variants, whether typographic or presentational, structural, annotational or referential, taxonomic or topical.
Overlap is a problem in texts as diverse as technical documents and product manuals (versioning), legal codes (effectivity), literary works (prosodic versus dramatic structure, rhetorical structures, annotation), sacred texts (chapter plus verse reference versus sentence structure and commentary), and language corpora (multiple layers of linguistic annotation).
The Text Encoding Initiative (TEI) publishes Guidelines to handle non-nesting information and provides an XML syntax for overlap. They stated in 2004 that:
[N]o solution has yet been suggested which combines all the desirable attributes of formal simplicity, capacity to represent all occurring or imaginable kinds of structures, suitability for formal or mechanical validation, and clear identity with the notations needed for simpler cases (i.e. cases where the textual features do nest properly).
Some options to handle overlapping structures include:
SGML has a CONCUR feature that can be used to support overlapping structures, although Goldfarb (the author of the standard) writes that "I therefore recommend that CONCUR not be used to create multiple logical views of a document".
GODDAG provides a data structure for representing documents with overlapping structures.
XCONCUR is an experimental markup language with the major goal to provide a convenient method to express concurrent hierarchies in an XML-like fashion.

There probably isn't any programming language that supports overlapping scopes in its formal definition. While technically possible, it would make the implementation more complex than it needs to be. It would also make the language ambiguous, accepting as valid what is very likely a mistake.
The only practical use I can think of right now is that it's less typing and reads more intuitively, just as writing attributes in markup feels more intuitive without unnecessary quotes, as in <foo id=45 /> instead of <foo id="45" />.
I think that enforcing nested structures makes for more efficient processing, too. By enforcing nested structures, the parser can push and pop nodes onto a single stack to keep track of the list of open nodes. With overlapped scopes, you'd need an ordered list of open scopes that you'd have to append to whenever you come across a begin-new-scope token, and then scan each time you come across an end-scope token to see which open scope is most likely to be the one it closes.
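Here is a minimal sketch of that difference in Python (the token format is invented for illustration):

def check_nested(tokens):
    # tokens: e.g. [("open", "b"), ("open", "i"), ("close", "i"), ("close", "b")]
    stack = []
    for kind, name in tokens:
        if kind == "open":
            stack.append(name)
        elif not stack or stack[-1] != name:
            raise SyntaxError("unexpected close of " + name)
        else:
            stack.pop()
    return not stack

def check_overlapping(tokens):
    open_scopes = []            # ordered list instead of a stack
    for kind, name in tokens:
        if kind == "open":
            open_scopes.append(name)
        else:
            # scan backwards for the most recently opened scope of this name,
            # which may close "through" scopes opened after it
            for i in range(len(open_scopes) - 1, -1, -1):
                if open_scopes[i] == name:
                    del open_scopes[i]
                    break
            else:
                raise SyntaxError("close of " + name + " that was never opened")
    return not open_scopes

check_nested rejects the <b><i></b></i> sequence, while check_overlapping accepts it; the backwards scan on every close token is exactly the extra cost described above.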
Although no programming languages support overlapping scopes, there are HTML parsers that support it as part of their error-recovery algorithms, including the ones in all major browsers.
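You can watch that recovery happen with html5lib, a Python implementation of the HTML5 parsing algorithm (a sketch, assuming the html5lib package is installed):

import xml.etree.ElementTree as ET
import html5lib

# The overlapping markup from the question:
doc = html5lib.parse("<b>BOLD <i>BOTH </b>ITALIC</i>", namespaceHTMLElements=False)
print(ET.tostring(doc.find("body"), encoding="unicode"))
# The "adoption agency" recovery re-nests the overlapping tags, roughly:
# <body><b>BOLD <i>BOTH </i></b><i>ITALIC</i></body>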
Also, the switch statement in C allows for constructs that look something like overlapping scopes, as in Duff's Device:
int n = (count + 7) / 8;          /* count > 0 assumed; was left undefined above */
switch (count % 8)
{
case 0: do { *to = *from++;       /* 'to' is deliberately not incremented:       */
case 7:      *to = *from++;       /* in Duff's original it is a memory-mapped    */
case 6:      *to = *from++;       /* output register                             */
case 5:      *to = *from++;
case 4:      *to = *from++;
case 3:      *to = *from++;
case 2:      *to = *from++;
case 1:      *to = *from++;
        } while (--n > 0);
}
So, in theory, a programming language could give scopes similar semantics in general, allowing these kinds of optimization tricks when needed, but readability would be very low.
The goto statement, along with break and continue in some languages, also lets you structure programs to behave like overlapped scopes:
BOLD: while (bold)
{
    styles.add(bold)
    print "BOLD"
    while (italic)
    {
        styles.add(italic)
        print "BOTH"
        break BOLD    // exits the outer loop while the italic scope is still open
    }
}
italic-continued:
styles.remove(bold)
print "ITALIC"

Related

creating a common embedding for two languages

My task deals with multiple languages (English and Hindi). For that I need a common embedding to represent both languages.
I know there are methods for learning multilingual embeddings, like MUSE, but these represent the two embeddings in a common vector space; they are obviously similar, but not the same.
So I wanted to know if there is any method or approach that can learn to represent both embeddings as a single embedding that represents both languages.
Any lead is strongly appreciated!!!
I think a good lead would be to look at past work that has been done in the field. A good overview to start with is Sebastian Ruder's talk, which gives you a multitude of approaches, depending on the level of information you have about your source/target language. This is basically what MUSE is doing, and I'm relatively sure that it is considered state-of-the-art.
The basic idea in most approaches is to map the embedding spaces onto each other such that you minimize some (usually Euclidean) distance between the two (see p. 16 of the link). This obviously works best if you have a known dictionary and can precisely map the different translations, and works even better if the two languages have similar linguistic properties (not so sure about Hindi and English, to be honest).
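That minimization has a neat closed form when the mapping is constrained to be orthogonal (the Procrustes solution, which MUSE's refinement step relies on). A sketch in numpy, assuming you already have row-aligned matrices X and Y holding the embeddings of known dictionary pairs:

import numpy as np

def procrustes_align(X, Y):
    # Find the orthogonal W minimizing ||X W - Y||_F, where row i of X
    # (source language) is a known translation of row i of Y (target language).
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# W = procrustes_align(X, Y)
# source_vector @ W now lives in the target embedding space.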
Another recent approach is the one by Multilingual-BERT (mBERT), or similarly, XLM-RoBERTa, but those learn embeddings based on a shared vocabulary. This might again be less desirable if you have morphologically dissimilar languages, and also has the drawback that they incorporate a bunch of other, unrelated, languages.
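If you want to try that shared-vocabulary route, getting sentences from both languages into one space is only a few lines with the transformers library (the model name and mean pooling here are just one reasonable choice, not a recommendation):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(sentence):
    # Mean-pooled hidden states: one vector space shared by ~100 languages.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, n_tokens, 768)
    return hidden.mean(dim=1).squeeze(0)

english = embed("Where is the library?")
hindi = embed("पुस्तकालय कहाँ है?")
print(torch.cosine_similarity(english, hindi, dim=0))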
Otherwise, I'm unclear on what exactly you are expecting from a "common embedding", but happy to extend the answer once clarified.

What is the definition of "language feature" in programming

I see that for such a widely used term I cannot find any definition or dedicated Wikipedia article. I also use the term but I'm not sure whether I use it correctly in all cases.
By looking at sites that describe language features (e.g. http://es6-features.org) I can have a sense of what they are, but without specific bounds.
I also see that features are usually categorized (like on the above site). But again I cannot find any site that describes a categorization of programming language features.
The word feature for programming languages is used with the meaning of characteristic or attribute, so there doesn't need to be a specific definition. See the word's definition through Google.
fea·ture (noun; plural noun: features)
1. a distinctive attribute or aspect of something: "safety features like dual air bags". Synonyms: characteristic, attribute, quality, property, trait, hallmark, trademark.
2. a part of the face, such as the mouth or eyes, making a significant contribution to its overall appearance. Synonyms: face, countenance, physiognomy.
3. (Linguistics) a distinctive characteristic of a linguistic unit, especially a speech sound or vocabulary item, that serves to distinguish it from others of the same type.

l20n with HTML markup?

How would I use l20n if I wanted to create something like this:
About <strong>Firefox</strong>
I want to translate the phrase as a whole but I also want the markup. I don't want to have to do this:
<aboutBrowser "About {{ browserBrandShortName }}">
<aboutBrowserStrong "About <strong>{{ browserBrandShortName }}</strong>">
...as the translation itself is now duplicated.
I understand that this might not be in the scope of l20n, but it is probably a common enough case in the real world. Is there some kind of established way to go about this?
Sometimes duplicating the translation is the best thing you can do. Redundancy is good in localization: it allows you to make fewer assumptions about translations into other languages. One of the core principles of L20n is that only the localizer will know what they really need.
Your solution is actually okay
The solution that you proposed is actually quite good. It's entirely possible that the emphasis that you're trying to express with <strong> will have some unknown implications in some language that we might not be aware of. For instance, some languages might use declensions or postpositions to mean "about something", in which case you—as a developer—shouldn't make too many assumptions about the exact position of the <strong> element. It might be that the entire translation will be a single word surrounded by <strong>.
Here's your code again, formatted using L20n's multiline string literals:
<aboutBrowser "About {{ browserBrandShortName }}">
<aboutBrowserEmphasized """
About <strong>{{ browserBrandShortName }}</strong>
""">
Note that for this to work as expected, you'll need to add a data-l10n-overlay attribute to the DOM node with data-l10n-id=aboutBrowserEmphasized. Otherwise, < and > will be escaped.
Making few assumptions matters
Let me digress quickly and bring up Bug 859035 — Do not use the same "unknown" entity for Size & Author when installing a WebApp from Firefox OS. The English-speaking developer assumed that they could use the "unknown" adjective to qualify both the size and the author in the installation dialog. However, in certain languages, like French or Polish, the adjective must agree with the noun in gender and number. So even though in English we can only have one string:
<unknown "Unknown">
…other languages might require two separate strings for each of the contexts they're used in. In French, you'd say "auteur inconnu" (unknown author, masculine) but "taille inconnue" (unknown size, feminine):
<unknownSize "inconnue">
<unknownAuthor "inconnu">
In English, this means some redundancy:
<unknownSize "Unknown">
<unknownAuthor "Unknown">
…but that's OK, because in the end the quality of localization is improved. It is generally a good practice to use unique strings everywhere and reuse sparingly. Ideally, you'd allow different translations for all strings. Even something as simple and common as "Yes" and "No" can be tricky if you consider languages like Welsh:
Welsh doesn't have a single word to use every time for yes and no questions. The word used depends on the form of the question. You must generally answer using the relevant form of the verb used in the question, or in questions where the verb is not the first element you use either 'ie' / 'nage'.

Is HTML a context-free language?

Reading some related questions made me think about the theoretical nature of HTML.
I'm not talking about XHTML-like code here. I'm talking about stuff like this crazy piece of markup, which is perfectly valid HTML(!)
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html<head>
<title//
<p ltr<span id=p></span</p>
</>
So given the enormous complexity that SGML injects here, is HTML a context-free language? Is it a formal language anyway? With a grammar?
What about HTML5?
I'm new to the concept of formal languages, so please bear with me. And yes, I have read the wikipedia article ;)
Context Free is a concept from language theory that has important implications in parser implementation. A Context-Free Language can be described by a Context-Free Grammar, one in which all rules have a single non-terminal symbol on the left of the arrow:
X→δ
That simple restriction allows X to be substituted by the right-hand side of the rules in which it appears on the left, without regard to what came before or after. For example, if while deriving or parsing one arrives at:
αXλ
one is sure that
αδλ
is also valid. Examples of non-context-free rules would be:
XY→δ
Xa→δ
aX→δ
Those would require knowing what can be derived around X to determine whether a rule applies, and that leads to non-determinism (what's around X would also like to know what it derives to), which is a no-no in parsing, and in any case we want a language to be well-defined.
The only way to prove that a language is context-free is by proving that there's a context-free grammar for it, which is not an easy task. Most programming languages one comes about are already described by CFGs, so the job is done. But there are other languages, including programming languages, that are described using logic or plain English, so work is required to find if they are context-free.
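As a concrete example tying back to the first question, properly nested bold/italic markup is generated by a grammar in which every rule has a single non-terminal on the left:

S → ε | S S | text | <b> S </b> | <i> S </i>

The overlapped string <b>x<i>y</b>z</i> has no derivation in this grammar; that is exactly the strictness the original question ran into.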
For HTML, the answer about its context-freedom is yes. SGML is a well-defined Context-Free Language, and HTML defined on top of it is also a CFL. Parsers and grammars for both languages abound on the Web. At any rate, the fact that there exist LL(k) grammars for valid HTML is enough proof that the language is context-free, because LL is a proven subset of CF.
But the way HTML evolved over the life of the Web forced browsers to treat it as not that well defined. Modern Web browsers will go out of their way to try to render something sensible out of almost anything they find. The grammars they use are not CFGs, and the parsers are far more complex than the ones required for SGML/HTML.
HTML is defined at several levels.
At the lexical level there are the rules for valid characters, identifiers, strings, and so on.
At the next level is XML, which consists of the opening and closing <tags> that define a hierarchical document structure. You can use XML or something XML-like for any purpose, like Apache Ant does for build scripts.
At the next level are the tags that are valid in HTML, and the rules about which tags may be nested within which tags.
At the next level are the rules about which attributes are valid for which tags, and about the languages that can be embedded in HTML, like CSS and JavaScript.
Finally, you have the semantic rules about what a given HTML document means.
The syntactic part is defined well enough that it can be verified. The semantic part is much larger than the syntactic one, and is defined in terms of browser actions regarding HTTP, and the Document Object Model (DOM), and how a model should be rendered to the screen.
In the end:
Parsing correct HTML is extremely easy (it's context-free and LL/LR).
Parsing the HTML that actually exists over the Web is difficult.
Implementing the semantics (a browser) over HTML/CSS/DOM is extremely difficult.
Valid HTML is not a context-free language.
First of all, HTML being an application of SGML is fiction for all practical purposes, so analyzing SGML to answer the question is useless. (However, the SGML fiction probably isn't context-free, either.)
It's more useful to look at the actually defined HTML parsing algorithm. It works on two levels: tokenization and tree building. What HTML calls tokenization is a higher-level operation than what is usually called tokenization when talking about parsers. In the case of HTML, tokenization splits a stream of characters into units like start tags, end tags, comments and text. The tokenizer expands character references. Usually, when talking about parsers, you'd probably treat stuff like the less-than sign as "tokens" and would consider character references to consist of tokens instead of being resolved by the tokenizer.
If you consider the process of splitting the input stream into tokens, that level of the HTML language is regular (except for feedback from the tree builder).
However, there are three complications. The first is that splitting the input stream into tokens is just the first step: it's the tree builder's side that actually cares about the identifiers in the tokens. The second is that the tree builder feeds back into the tokenizer, so some state transitions made by the tokenizer depend on the state of the tree builder! The third is that valid documents in the language are defined by rules that apply to the output of the tree-building stage, and those rules are complex enough that they can't be fully defined using tree automata (as evidenced by RELAX NG not being expressive enough to describe all the validity constraints).
This isn't an actual proof, but you can probably develop real proofs by working from complications #2 and #3.
Note that the case of invalid documents is not particularly interesting as a question of whether the language is context-free in the sense of there being a context-free grammar that generates all the possible strings with no regard to the parse tree having some intelligible interpretation in terms of the tree that an HTML parser generates. The HTML parser will successfully consume all possible strings, so in that sense, all possible strings are in the "invalid HTML" language.
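Complication #2 is easy to observe concretely: whether the tokenizer treats <b> as a start tag or as raw text depends on which element the tree builder currently has open. A sketch, again assuming the html5lib package (and its default etree output):

import html5lib

# The same character sequence "<b>x</b>" in two tree-builder states:
in_div = html5lib.parse("<div><b>x</b></div>", namespaceHTMLElements=False)
in_script = html5lib.parse("<script><b>x</b></script>", namespaceHTMLElements=False)

print(in_div.find(".//b") is not None)     # True: tokenized as real tags
print(in_script.find(".//script").text)    # '<b>x</b>': tokenized as raw text

The tokenizer switches to the script data state only because the tree builder has opened a script element; that is the feedback loop.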
Edit: Interesting questions left as exercise to the reader:
Is HTML without parse errors but ignoring validity a context-free language?
Is HTML without parse errors and ignoring general validity but with only valid element names allowed a context-free language?
(Complication #2 applies in both cases.)
NO
See Edit Below
It depends.
If you are talking about the subset consisting of only theoretical HTML, then yes.
If you also include real life, working HTML that is accessed and used successfully by millions of people daily on many of the top sites on the internet then NO.
That is what gives HTML flexibility. The parsing engine adds tags, closes tags, and takes care of stuff that a theoretical CFG can't do. If you took automata theory you might remember that a production rule in a formal grammar cannot be empty (a.k.a. epsilon/lambda) on the lhs (left-hand side). Since the parsing engine is basically using knowledge that a formal grammar and automaton couldn't have, it isn't restricted by that, and its 'grammar' would have epsilon/lambda -> result rules, where the specific epsilon/lambda rule is chosen based on information not available in the grammar.
Since I don't think empty lhs are allowed in any formal grammars, HTML cannot be defined by a formal grammar and is not a formal language at all.
Sure, HTML5 might try to move towards a 'more formal' language description but the likelihood that it becomes a context free language in reality (i.e. strings not matched by the grammar are rejected) is about the likelihood XHTML 2.0 takes the world by storm and replaces HTML altogether (XHTML is the attempt they made to make HTML a formal language...it was rejected en masse due to its fragility).
Noteworthy is the fact that HTML 5 is the FIRST HTML standard to be defined before being implemented! That's right, HTML 1-4 consist of random ideas someone just implemented in a browser, and were collected into standards after the fact based on which features were popularly used and widely implemented. Then they tried XHTML, which totally failed to be adopted. Even 'xhtml' on the web is automatically parsed as HTML under almost every circumstance to prevent stuff from just breaking with a cryptic syntax error. Now you can see how we got here and why it is unlikely to be formalized any time soon.
Lesson: "In theory, there is no difference between theory and practice. In practice, there is." - Yogi Berra
EDIT:
Actually, after reading through the documents, it turns out that HTML, even according to the HTML 4.01 specification, doesn't conform to SGML. To see for yourself, view the HTML 4.01 Strict document type definition (DTD) at http://www.w3.org/TR/html4/strict.dtd and note the following lines:
The HTML 4.01 specification includes additional
syntactic constraints that cannot be expressed within
the DTDs.
So I would say that it is probably not a CFL due to those features (although technically this doesn't disprove the hypothesis that there is some possible PDA that accepts HTML 4.01, it does prevent the argument that SGML is a CFL and therefore HTML is a CFL).
HTML5 flip-flops, abandoning any implied conformance to SGML, but is presumably describable by a CFG. However, it will still provide best-effort parsing not based on a CFG, so IMO the current situation (i.e. the language specification is defined formally, with invalid strings still being accepted, parsed and rendered in a best-effort fashion) is unlikely to change drastically for a long, long, long time.
HTML5 is different from previous HTML versions in that it strictly defines the parsing behaviour of code that isn't completely correct. Pre-HTML5 parsers vary and each do their best to 'guess' the intention of the code author.

What is semantic markup, and why would I want to use that?

Like it says.
Using semantic markup means that the (X)HTML code you use in a page contains metadata describing its purpose -- for example, an <h2> that contains an employee's name might be marked class="employee-name". Originally some people hoped search engines would use this information, but as the web has evolved, semantic markup has mostly been used to provide hooks for CSS.
With CSS and semantic markup, you can keep the visual design of the page separate from the markup. This results in bandwidth savings, because the design only has to be downloaded once, and easier modification of the design because it's not mixed in to the markup.
Another point is that the elements used should have a logical relationship to the data contained within them. For example, tables should be used for tabular data, <p> should be used for textual paragraphs, <ul> should be used for unordered lists, etc. This is in contrast to early web designs, which often used tables for everything.
Semantics literally means using "meaningful" language; in Web Development, this basically means using tags and identifiers which describe the content.
For example, applying IDs such as #Navigation, #Header and #Content to your <div> tags, rather than #Left and #Main, or using unordered lists for a list of navigational links, rather than a table.
The main benefits are in future maintenance; you can easily change the layout or the presentation without losing the meaning of your content. Your navigation bar can move from the left to the right, or your links can be displayed horizontally rather than vertically, without losing their meaning.
From http://www.digital-web.com/articles/writing_semantic_markup/ :
semantic markup is markup that is descriptive enough to allow us and the machines we program to recognize it and make decisions about it. In other words, markup means something when we can identify it and do useful things with it. In this way, semantic markup becomes more than merely descriptive. It becomes a brilliant mechanism that allows both humans and machines to “understand” the same information.
Besides the already mentioned goal of allowing software to 'understand' the data, there are more practical applications in using it to translate between ontologies, or for mapping between dissimilar representations of data, without having to translate or standardize the data (which can result in a loss of information, and typically prevents you from improving your understanding in the future).
There were at least two sessions at OSCon this year related to the use of semantic technologies. One was on BigData (slides are available at http://en.oreilly.com/oscon2008/public/schedule/proceedings); the other was from the guys at FreeBase.
BigData was using it to map between two dissimilar data models (including the use of query languages which were specifically created for working with semantic data sets). FreeBase is mapping between different data sets and then performing further analysis to derive meaning across those data sets.
Related topics to look into: OWL, OQL, SPARQL, Franz (AllegroGraph, RacerPRO and TopBraid).
Here is an example of an HTML5, semantically tagged website that I've been working on that uses the recently accepted schema.org vocabulary (expressed as microdata, as specified at http://schema.org) along with the new, more semantic tagging elements of HTML5.
http://blog-to-book.com/view/stuff/about/semantic%20web
Google has a handy semantic tagging test tool that will show you how adding semantic tags to content enables search engines to 'understand' far more about your web pages.
Here is the test tool: http://www.google.com/webmasters/tools/richsnippets?url=http%3A%2F%2Fblog-to-book.com%2Fview%2Fstuff%2Fabout%2Fsemantic+web&view=
Notice how Google now knows that the 'things' on the page are books, and that they have an isbn13 identifier. Adding additional metadata, such as price and author, enables further inferences to be made.
Hope this points you in some interesting directions. More detailed semantic tagging can be achieved using the GoodRelations ontology, which is pretty much the most comprehensive I can think of right now.