Is there a proper term for "curly braces" among programmers - { } - terminology

{ }
I'm wondering if there is a more professional term than "curly brace" that is used by programmers.
A Google search said they are also referred to as "brackets", but in my experience when people say brackets they mean the square "[ ]".

In most contexts, use whatever floats your boat; the names vary by region. If you feel extra formal, you might use "curly bracket" instead.
Wikipedia shows many variations on names for the { } symbols. https://en.wikipedia.org/wiki/Bracket#Names_for_various_bracket_symbols
You could also imply them, sometimes, by talking about blocks, loops, switch block, if block, and other similar structures if the programming language in question features them. Programmers will just visualize the brackets as being present, as necessary:
The else-block is never executed because X.
Another way: imply the type by its relative position/usage. If the context is right, it'll be a curly bracket.
You're missing a closing bracket for the while-loop, that's why the compiler failed.

As a working mathematician in publishing I find it best to call { } braces, [ ] brackets, and ( ) parens (or parentheses). We call < > angle brackets, and | | abs or absolute-value brackets. All that matters is that printers get your meaning without wasting time.

Maybe you are searching for the term scope, because in programming languages like Java, C#, JavaScript, etc., the curly braces usually define a scope.
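For instance, a minimal Java sketch (the class and variable names are just for illustration):
public class ScopeDemo {
    public static void main(String[] args) {
        int x = 1;
        {                   // these braces open a new, nested scope
            int y = x + 1;  // y exists only inside this block
            System.out.println(y);
        }
        // y is no longer in scope here; using it would not compile
        System.out.println(x);
    }
}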

Related

How do you replace the content of html tags in vim?

For instance, if I want to replace <person>Nancy</person> with <person>Henry</person> for all occurrences of <person>*</person> in vim?
Currently, I have:
%s:/'<person>*<\/person>/<person>Henry<\/person>
But obviously, this is wrong.
For a single substitution, Vim offers the handy cit (change inner tag) command.
For a global substitution, the answer depends on how well-structured your tag (soup) is. HTML / XML have a quite flexible syntax, so you can express the same content in various ways, and it becomes increasingly harder to construct a regular expression that matches them all. (Attempting to catch all cases is futile; see this famous answer.)
:%s/\v(<person>).\{-}(<\/person>)/\1Henry\2/g
does what you want but yeah, what Ingo said.
\v means "very magic": it's a convenient way to avoid backslashitis.
(something) (or \(something\) without the \v modifier) is a sub-expression; you can have up to nine of them in your search pattern and reuse those capture groups with \1...\9 in your replacement pattern, or even later in your search pattern. \0 represents the whole match. Here, the opening tag is referenced as \1 and the closing tag as \2.

What is the optimal regex for parsing HTML (even though you shouldn't)? Is there a perfect one? [closed]

Okay, we all know quite well that attempting to parse HTML with regex brings upon the wrath of Cthulhu. And there are some great responses as to why you shouldn't. I accept these, and have posted these links on questions more than once.
But let's put this question within the following scope: we have no option other than Regex to parse HTML. Why? It doesn't matter. But assume for the moment our developers want to lose their minds to Tony the Pony and take the best shot at doing the impossible. If this blows your mind, assume the question to be theoretical then. Whatever floats your boat. Just consider the idea of parsing HTML with regex, even though you shouldn't.
Here we see a claim that it is not possible to do, at least with perfection. But then there's a very wise comment beneath it from @NikiC:
This answer draws the right conclusion ("It's a bad idea to parse HTML with Regex") from wrong arguments ("Because HTML isn't a regular language"). The thing that most people nowadays mean when they say "regex" (PCRE) is well capable not only of parsing context-free grammars (that's trivial actually), but also of context-sensitive grammars (see https://stackoverflow.com/a/7434814/1222420)
Truth is, you can do some incredibly powerful things with modern regex, even if rather verbose. But many make this problem sound like the Halting Problem: you can try, but there will always be another case for which your solution breaks.
So here's the question, and it's a bit of a 2-parter.
Is it possible to generate a perfect regular expression for parsing HTML?
If so, is the proof constructive? Do we only know we can, or has it been done?
If it is not possible, what is the most accurate one out there?
First of all let's get this straight:
Regex's incompatibility with HTML parsing is NOT a claim. Repeat after me: "Not a claim".
It's a scientifically proven and well-known fact. Furthermore, the world was not created in 7 days and Bigfoot ain't real either. End of discussion.
But let's put this question within the following scope: we have no option other than Regex to parse HTML. Why? It doesn't matter
Funny that you write it doesn't matter, given that the "why" is exactly what makes what you're planning to do either partially possible or completely impossible. If there was one thing here that mattered, it'd be the "why".
If the "why" is "validation", then the answer is per definition: not possible. Validation requires no less than 100% language coverage. And regular expressions, being a subset of context-free grammars, therefor cannot cover 100%. By definition.
If the "why" however is "extraction", then you can get quite good results using regex. Never 100% reliable, but good enough for most cases.
Truth is, you can do some incredibly powerful things with modern regex, even if rather verbose.
The sheer length, redundancy and complexity of this pattern (the 6400+ character email regex quoted further down) show that while it may not be impossible to describe valid email addresses in regex, it is at least disproportionately difficult, and the result actually resembles a brute-force dictionary list more than a clean grammar. And while we're at it: date-string validation is even worse. Leap years, just to begin with.
To put my differentiation between "validation" and "extraction" into perspective:
To validate a simple email address one needs a monolithic, 6400+ character long regular expression.
To "extract" the domain name from an email address however the simple #([^\s]+) or (?<=#)[^\s]+ would cover pretty much (if not exactly) 100%. Assuming the string is isolated and known to be a valid email address.
Is it possible to generate a perfect regular expression for parsing HTML?
You basically answered this one yourself by writing "perfect": No.
If so, is the proof constructive? Do we only know we can, or has it been done?
It's not about "is it just that nobody has managed to do it yet?" but more about "it's been mathematically proven to be impossible!". QED
If it is not possible, what is the most accurate one out there?
Given that it's by definition not possible the only correct answer to this would be "none".
The best approximation of a regex for parsing all (or as much as possible) of HTML would be an infinitely long regex pattern along the lines of x|y|z|… with x, y, z … being all (brute-forced) possible productions of the grammar of HTML chained together in an infinitely long logical OR. It would be a proper regex (even to the truest terms of regex), cover all of HTML (it lists and hence matches all possible strings after all), be only theoretically possible (or at least feasible, just like the Turing machine) and practically utterly useless.
Regex can describe type-3 Chomsky languages (regular languages), while HTML (and most other programming/markup languages) is a type-2 Chomsky language (context-free). The regular languages are a subset of the context-free languages. A grammar of type-n can always cover a subset of a language of type-(n-x), but never all of it. Therefore regex can only describe a subset of HTML. Bigfoot is a claim. This is a fact.
Being strictly left- or right-linear, regular expressions have NO understanding of balance or nesting (neither "S→aSa" nor the mixed-linear "S→aA, A→Sb, S→ε"). Therefore you can NOT parse HTML with them.
A quick example for "S→aSa" (balanced nesting):
<div>
  <div>
    ...
  </div>
</div>
Yep, the very core of what's HTML/XML is incompatible with regex. Pretty damn bad position to begin with, isn't it? HTML parsing via regex is literally rotten in its core. Broken by design. Guaranteed to fail.
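To make the nesting problem concrete, here is a small sketch in Java (class name and sample markup are only illustrative); the naive non-greedy pattern pairs the outer opening tag with the inner closing tag because it cannot count nesting depth:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestingDemo {
    public static void main(String[] args) {
        String html = "<div>outer <div>inner</div> tail</div>";
        // A naive, non-greedy attempt at matching a whole div element:
        Matcher m = Pattern.compile("<div>.*?</div>").matcher(html);
        if (m.find()) {
            // Prints "<div>outer <div>inner</div>" - the OUTER opening tag
            // got paired with the INNER closing tag.
            System.out.println(m.group());
        }
    }
}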
And one for "S→aA,A→Sb,S→ε" (counting):
It is impossible to validate the correct (matching) number of <td> per row:
<table>
  <tr>
    <td>1</td>
    <td>2</td>
    <td>3</td>
    <td>4</td>
  </tr>
  <tr>
    <td>1</td>
    <td>2</td>
    <td>3</td>
    <td>4</td>
  </tr>
</table>
Another thing to keep in mind: as soon as you reduce the scope of what you can recognize of language "X" you're NO LONGER recognizing language "X", but a NEW and self-contained subset language "Y".
In the field of languages it is either all or nothing. There is no in between.
Now to those saying "PCRE can do it!": yes, with recursive patterns it can, but then it is effectively a context-free grammar…
…by which it not only would no longer be a regular expression, and would thus fail the test:
we have no option other than Regex
but would also still be the wrong tool to begin with. There are dedicated parsers for such tasks. Use 'em.
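As an aside, a small sketch of the "dedicated parser" route in Java, using the jsoup library (an assumption: jsoup is not mentioned in the thread, and the <person> example is borrowed from the vim question above):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParserDemo {
    public static void main(String[] args) {
        String html = "<p><person>Nancy</person> and <person>Nancy</person></p>";
        Document doc = Jsoup.parse(html);
        // The parser understands nesting, so selecting and rewriting
        // elements needs no hand-rolled pattern matching.
        for (Element e : doc.select("person")) {
            e.text("Henry"); // replace the element's text content
        }
        System.out.println(doc.body().html()); // document with replacements applied
    }
}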
The email matching regex (as linked by OP) is a nightmare to read, let alone maintain:
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(
?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]
|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) ... (6400+ chars)
While here is an excerpt of the very same specification in the form of a proper context-free grammar:
address     =  mailbox                      ; one addressee
            /  group                        ; named list
group       =  phrase ":" [#mailbox] ";"
mailbox     =  addr-spec                    ; simple address
            /  phrase route-addr            ; name & addr-spec
route-addr  =  "<" [route] addr-spec ">"
route       =  1#("@" domain) ":"           ; path-relative
addr-spec   =  local-part "@" domain        ; global address
local-part  =  word *("." word)             ; uninterpreted
                                            ; case-preserved
domain      =  sub-domain *("." sub-domain)
sub-domain  =  domain-ref / domain-literal
domain-ref  =  atom                         ; symbolic reference
Some people, when confronted with HTML, think "I know, I'll use regular expressions."
Now they have two metric f*cktons of problems.
So tell me. Why on earth would anyone really been far even as decided to use even go want to do look more like?

What's a good Lucene analyzer for text and source code?

What would be a good Lucene analyzer to use for documents that are a mix of text and diverse source code?
For example, I want "C" and "C++" to be considered different words, and I want Charset.forName("utf-8") to be split between the class name and method name, and for the parameter to be considered either one or two words.
A good example dataset for what I'd like to look at is StackOverflow itself. I believe that StackOverflow uses Lucene.NET for search; does it use a stock analyzer, or has it been heavily customized?
You're probably best to use the WhitespaceTokenizer and customize it to strip off punctuation. For example, we strip off all punctuation except '+' and '-', so that words such as C++ are kept while opening and closing quotes, brackets, etc. are removed. In reality, though, for something like this you might have to add the document twice using different tokenizers to catch the different parts of the document, i.e. once with the StandardTokenizer and once with a WhitespaceTokenizer: the StandardTokenizer will split all your code, e.g. between class and method names, while the whitespace one will pick up words such as C++. Obviously it also depends on the language, as e.g. Scala allows some punctuation characters in method names.
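As a rough illustration of that tokenization rule (split on whitespace, then trim punctuation other than '+' and '-' from the token edges), here is a plain-Java sketch; it is not Lucene's actual analyzer API, whose class names and signatures vary by version, so treat it as an assumption to be adapted into a custom tokenizer or token filter:
import java.util.ArrayList;
import java.util.List;

public class CodeAwareTokenizerSketch {
    // Keep letters, digits, '+' and '-'; everything else is trimmable punctuation.
    private static boolean keep(char c) {
        return Character.isLetterOrDigit(c) || c == '+' || c == '-';
    }

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            int start = 0, end = raw.length();
            while (start < end && !keep(raw.charAt(start))) start++;  // strip leading quotes, "(", ...
            while (end > start && !keep(raw.charAt(end - 1))) end--;  // strip trailing ")", ".", ...
            if (start < end) tokens.add(raw.substring(start, end));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "C++" survives intact, while the quotes, parentheses and full stop are trimmed.
        System.out.println(tokenize("I prefer \"C++\" (not C)."));
        // -> [I, prefer, C++, not, C]
    }
}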

Looking for good bracket characters for a template engines code blocks

I am looking for a good character pair to use for enclosing template code within a template, for the next version of our in-house template engine.
The current one uses plain { }, but this makes the parser very complex, since it has to distinguish between real code blocks and random { } characters in the literal text of the template.
I think a dual-character combination like the one used in ASP.NET or PHP is a better approach, but the question is which character pair I should use, or whether there is some perfect single character that is never used and that's easy to write.
Some criteria that need to be fulfilled:
Cannot be changed by HTMLEncode; the sources will be editable through web-based HTML editors and plain textareas and need to stay the same no matter which editor is used.
Regex will be used to clean code parts after editing in an HTML editor that might have encoded the internal part of the code block like & chars.
Should be reasonably easy to write on both English and Swedish keyboard layouts.
Should be a very rare combination, the template will generate HTML and Text and could include CSS and Javascript literal text with JSON, so any combination that might collide with those is bad unless very rare. That means that {{}} is out as it can occur in JSON.
The code within the code block will contain spaces, underscores, dollar and many more combinations, not only fieldnames but if/while constructs as well.
The parser is generated with Antlr
I am looking for suggestions and objections to find one or more combinations that would work in as many situations as possible, possibly multiple alternative pairs for different situations.
Template-Toolkit defaults to [% template directives %], which works reasonably well.
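If you settle on [% ... %], the "clean up entities that an HTML editor added inside code blocks" step mentioned in the criteria might look roughly like this Java sketch (the class name, method names and entity list are assumptions, not part of any existing engine):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateCleanupSketch {
    // Matches a [% ... %] directive, non-greedily, across lines.
    private static final Pattern DIRECTIVE =
            Pattern.compile("\\[%(.*?)%\\]", Pattern.DOTALL);

    public static String decodeDirectives(String template) {
        Matcher m = DIRECTIVE.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Undo the encoding an HTML editor may have applied inside the block.
            String code = m.group(1)
                    .replace("&amp;", "&")
                    .replace("&lt;", "<")
                    .replace("&gt;", ">")
                    .replace("&quot;", "\"");
            m.appendReplacement(out, Matcher.quoteReplacement("[%" + code + "%]"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeDirectives("Hello [% IF a &amp;&amp; b %]both[% END %]"));
        // -> Hello [% IF a && b %]both[% END %]
    }
}
The literal text outside the directives is left untouched, which is the point of having a rare, unambiguous delimiter pair.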

HTML Escaping - Reg expressions?

I'd like to HTML-escape a specific phrase automatically and logically. The phrase is currently a statement with words highlighted with quotation marks; within the statement, quotation or inch marks could also be used to describe a distance.
The phrase could be:
Paul said "It missed us by about a foot". In fact it was only about 9".
To escape this phrase it should really be
Paul said &#8220;It missed us by about a foot&#8221;. In fact it was only about 9&#8243;.
Which gives
Paul said “It missed us by about a foot”. In fact it was only about 9″.
I can't think of a sample phrase that would also need a plain &quot; escape, but that could be there as well!
I'm looking for some help on how to identify which of the escape values to replace " characters with at runtime. The phrase was just an example and it could be anything but should be correctly formed i.e. an opening and closing quote would be present if we are to correctly escape the text.
Would I use a regular expression to find a quoted phrase in the text, i.e. two " characters before a full stop, and then replace the first with
&#8220;
then the second with
&#8221;
If I found a single " would I replace it with
&quot;
unless it was after a number, where I would replace it with
&#8243;
How would I deal with multiple quotes within a sentence?
"It just missed" Paul said "by a foot".
This would really stump me.....
<pre>"It just missed" Paul said "by 9" almost".</pre>
The above should read when escaped correctly. (I'm showing the actual characters this time)
“It just missed” Paul said “by 9″ almost”.
Obviously an edge case but I wondered if it's possible to escape this at runtime without an understanding of the content? If not help on the more obvious phrases would be appreciated.
I would do this in two passes:
The first pass searches for any "s which are immediately preceded by numbers and does that replacement:
s/([0-9])"/\1″/g
Depending on the text you're dealing with, you may want/need to extend this regex to also recognize numbers that are spelled out as words; I've only checked for digits for the sake of simplicity.
With all of those taken care of, a second pass can then easily convert pairs of "s as you've described:
s/"([^"]*)"/“\1”/g
Note the use of [^"]* rather than .* - we want to find two sets of double-quotes with any number of non-double-quote characters between them. By adding that restriction, there won't be any problems handling strings with multiple quoted sections. (This could also be accomplished using the non-greedy .*?, but a negated character class more clearly states your intent and, in most regex implementations, is more efficient.)
A stray, mismatched " somewhere in the string, or an inch marker which is missed by the first pass, can still cause problems, of course, but there's no way to avoid that possibility without implementing understanding of the content.
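For what it's worth, a sketch of those same two passes in Java (the class and method names are only illustrative):
public class SmartQuotes {
    public static String escape(String s) {
        // Pass 1: a straight quote right after a digit is an inch mark.
        s = s.replaceAll("([0-9])\"", "$1\u2033");          // ″ DOUBLE PRIME
        // Pass 2: remaining pairs of straight quotes become curly quotes.
        s = s.replaceAll("\"([^\"]*)\"", "\u201C$1\u201D"); // “ ... ”
        return s;
    }

    public static void main(String[] args) {
        System.out.println(escape("Paul said \"It missed us by about a foot\". In fact it was only about 9\"."));
        // -> Paul said “It missed us by about a foot”. In fact it was only about 9″.
    }
}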
What you've described is basically a hidden Markov model:
http://en.wikipedia.org/wiki/Hidden_Markov_model
You have a set of input symbols (your original text and ambiguous punctuation) and a set of output symbols (original text and more fine-grained punctuation), but no good way of really observing the connection between the two in a programmatic way. You could write some rules to cover some of the edge cases, but that will basically never work for the multiple-quotes situation. In this case you can't really use a regex for the same reason, but with an HMM and a bunch of training text you could probably make some pretty good guesses.
Sorry, that's probably not very helpful if you're trying to get something ready for deployment, but the input has greater ambiguity than the output, so your only option is to consider the context, and that basically means either a very lengthy set of rules or some kind of machine-learning approach.
Interesting question, though - it would be neat to see what kind of performance you could get. Maybe someone's already written a paper on it?
I wondered if it's possible to escape this at runtime without an understanding of the content?
Considering that you're adding semantic meaning to the punctuation which is currently encoded in the other text... no, not really.
Regular expressions would be the easiest tool for at least part of it. I'd suggest looking for /\d+"/ for the inch number cases. But for quotes delimiters, after you'd looked for any other special cases or phrases, it may be easier to use an algorithm for matching pairs, like with parentheses and brackets: tokenize and count. Then test on real-world input and refine.
But I really have to ask: why?
I am not sure if it is possible at all to do that without understanding the meaning of the sentence. I tend to doubt it.
My first attempt would be the following.
go from left to right through the string
alternate replacing straight double quotes with left and right double quotes, but replace with a double prime if there is a number immediately to the left
if the quotation marks are unbalanced at the end of the string, go back until you find a number followed by a double prime and change that double prime into a left or right double quote, depending on the preceding double quotes
I am quite sure that you can easily construct inputs that defeat this strategy. But it is still the easy case - the hard work starts when you have to deal with nested quotation marks. (A rough sketch of the first two steps follows below.)
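Here is that rough sketch in Java; it implements only steps 1 and 2 and leaves out the rebalancing in step 3 (the class and method names are only illustrative):
public class AlternatingQuotes {
    public static String convert(String s) {
        StringBuilder out = new StringBuilder();
        boolean open = true; // does the next plain quote open or close a quotation?
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"') {
                if (i > 0 && Character.isDigit(s.charAt(i - 1))) {
                    out.append('\u2033');                   // ″ inch mark after a number
                } else {
                    out.append(open ? '\u201C' : '\u201D'); // “ or ”
                    open = !open;
                }
            } else {
                out.append(c);
            }
        }
        // Step 3 (not implemented): if the quotes ended up unbalanced, walk back
        // and reinterpret one of the earlier choices.
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("\"It just missed\" Paul said \"by 9\" almost\"."));
        // -> “It just missed” Paul said “by 9″ almost”.
    }
}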
I know this is off the wall, but have you considered Mechanical Turk? This is the sort of problem humans excel at, and computers, currently, are terrible at. Choosing the correct punctuation requires understanding of the meaning of the sentence, so a regex is bound to fail for edge cases.
You could try something like this. First replace the quotations with this regular expression:
"((?:[^"\d]+|\d"?)*)"
And then the inch sign:
(\d+)"
Here’s an example in JavaScript:
'"It just missed" Paul said "by 9" almost"'.replace(/"((?:[^"\d]*|\d["']?)+)"/g, "“$1”").replace(/(\d+)"/g, "$1″");