Language support for recursive comments - language-agnostic

Most languages I've worked with don't have support for recursive comments.
Is there any reason why language designers would choose not to implement this?
Is it deceptively complex?
Would it have undesired results?
Example of a recursive comment:
/*
for (int j = 0; j <= SourceTexture.Height; j += SampleSize)
{
...
}
// Comment within comment below:
/*for (int i = 0; i < TextureColour.Length; i++)
{
...
}*/
sourceTexture.SetData<Color>(TextureColour);
*/
EDIT: I understand the argument of the answers so far (problems occur when you have comment tokens in strings). However, the reason for my confusion is that you already have that problem today.
For example, I know the code below wouldn't give the expected result.
/*
char *str = "/* string";
// Are we now 1 level inside a comment or 2 levels?
*/
printf("Hello world");
/*
char *str2 = "string */";
*/
But in my mind that's no different to an unexpected result in the case below:
/*
CODE "*/";
*/
Which would also yield an unexpected/undesired result.
So, while it could be a problem for recursive comments, I don't see it as a reason not to support them, because it is already a problem for non-recursive comments. As a programmer I know the compiler behaves like this and I work around it. I don't think it's much more effort to work around the same problem with recursive comments.

Is there any reason why language designers would choose not to
implement this?
It makes the lexical analysis more difficult to implement.
Is it deceptively complex?
IMHO, no, but this is subjective.
Would it have undesired results?
Hard to tell. You have already discovered that even normal block comments can cause problems:
/* print ("*/"); */
I know 2 languages that have nesting block comments: Haskell and Frege.
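To give a rough idea of the extra lexer work, here is a minimal sketch in Python of stripping nested /* ... */ comments with a depth counter. The function name strip_nested_comments is made up for illustration, and the sketch deliberately ignores string literals, so comment tokens inside strings are still miscounted, exactly as discussed in the question.

```python
def strip_nested_comments(source: str) -> str:
    """Remove /* ... */ comments, allowing them to nest, using a depth counter."""
    out = []
    depth = 0  # how many comments are currently open
    i = 0
    while i < len(source):
        pair = source[i:i + 2]
        if pair == "/*":
            depth += 1
            i += 2
        elif pair == "*/" and depth > 0:
            depth -= 1
            i += 2
        else:
            # Keep the character only when we are outside every comment.
            if depth == 0:
                out.append(source[i])
            i += 1
    if depth != 0:
        raise ValueError("unterminated comment")
    return "".join(out)


print(repr(strip_nested_comments("a /* b /* c */ d */ e")))
```

A non-nesting lexer only needs a boolean "inside a comment" flag; the counter is the whole difference.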

I will make an example and perhaps it will be clearer:
/*
char *str = "/* string";
// Are we now 1 level inside a comment or 2 levels?
*/
printf("Hello world. Will this be printed? Or is it a comment?");
/*
char *str2 = "string */";
*/
You couldn't parse comments inside a comment without interpreting what is inside the comment. But you can't interpret what is inside a comment because it's a comment, so it is by definition "human text" and not "language".

Although C's multi-line comments can't be nested, the effect of recursive comments can more-or-less be achieved in C using #if 0 ... #endif (and I strongly recommend you use that when you want to disable a block of code, for exactly that reason).
Even the C preprocessor, designed to be as dumb as a post, would be perfectly capable of handling nested comments, just as it has to be capable of handling nested #if directives with false conditions. So it's not really to do with anything being difficult to define or parse, since although it makes comments more complex, they'd still be no more complex than other things done in preprocessing.
But, using #if 0 ... #endif requires of course that there not be any unmatched #endif in the code you're trying to exclude.
Fundamentally comments cannot be both (a) completely unstructured and (b) recursive. Either by happenstance or deliberate choice, C has gone with (a) -- commented text doesn't have to obey any syntax constraints other than not containing the comment-terminator sequence (or trigraph equivalent such as *??/<newline>/).

I believe it is simply never considered to begin with, and it becomes an "unimportant" feature addition as things develop. Also, it requires a lot more validation.
Example scenario...
MyLang version 1: Objective
Provide Multi-line commenting
Developer: hmmm.. I know, every time I find a /* I will comment everything until the next */ - easy!
MyLang version 1 release
1 day later...
User: erm... I can't do recursive comments, help me.
Support: Please hold.
30 mins later...
Support Manager -> Developer: User cannot do recursive comments.
Developer: (What's a recursive...) hang on...
30 mins later
Developer: yeah, we don't support recursive commenting.

Related

Is there something like a Safe Navigation Operator that can be used on Arrays?

I have used the Safe Navigation Operator on objects that are loaded by asynchronous calls and it is pretty amazing. I thought I could do the same for arrays, but it produces a template parse error in my Angular code. I know *ngIf is an alternative solution, but is there a simpler (in terms of code) way, just like the Safe Navigation Operator?
<div class="mock">
<h5>{{data?.title}}</h5> //This works
<h6>{{data?.body}}</h6> //This works
<h6>{{simpleData?[0]}}</h6> // This is what I tried to implement
</div>
Is there something like a Safe Navigation Operator that can be used on Arrays?
Yes, what you are looking for is known as the Optional Chaining operator (JavaScript / TypeScript).
The syntax shown in the MDN JavaScript documentation is:
obj.val?.prop
obj.val?.[expr]
obj.arr?.[index]
obj.func?.(args)
So, to achieve what you want, you need to change your example from:
<h6>{{simpleData?[0]}}</h6>
To:
<h6>{{simpleData?.[0]}}</h6>
                 ^
Also see How to use optional chaining with array in Typescript?.
is there a simpler (in terms of code) way, just like the Safe Navigation Operator?
There is ternary operator.
condition ? expr1 : expr2
<h6>{{simpleData?simpleData[0]:''}}</h6>
Of course it's a matter of taste, but in such cases I tend to use a shorter approach:
<h6>{{(simpleData || [])[0]}}</h6>
The other answers amount to the same thing, but I find foo && foo[0] to be the most readable. The right side of the logical-and operator won't be evaluated if the left side is falsy, so you safely get undefined (or I guess null, if you don't believe Douglas Crockford.) with minimal extra characters.
For that matter, you asked for a "simpler" solution, but actually *ngIf is probably correct for the use case you gave. If you use any of the answers here, you'll wind up with an empty h6 tag that you didn't need. If you make the tag itself conditional, you can just put foo[0] in the handlebars and be confident that it won't be evaluated when foo is still undefined, plus you never pollute the page with an empty tag.

How to generate hash from ~200k text/html that would match/compare to similar text?

I would like to make a sort of hash key out of a text (in my case html) that would match/compare to the hash of other similar text
ex of matching texts:
"2012/10/01 This is my webpage #1"+ 100k_of_same_text + random_words_1 + ..
"2012/10/02 This is my webpage #2"+ 100k_of_same_text + random_words_2 + ..
...
"2012/10/02 This is my webpage #2"+ 100k_of_same_text + random_words_3 + ..
So far I've thought of removing numbers and tags, but that would still leave the random words.
Is there anything out there that does this?
I have root access to the server, so I can add any UDF that is necessary, and if needed I can do the processing in C or other languages.
The ideal would be a function like generateSimilarHash(text) and another function compareSimilarHashes(hash1,hash2) that would return the percentage of matching text.
A function like compare(text1,text2) would not work in my case, as I have many pages to compare (~20 million at the moment).
Any advice is welcomed!
UPDATE:
I'm referring to a hash function as it is described on Wikipedia:
A hash function is any algorithm or subroutine that maps large data
sets of variable length to smaller data sets of a fixed length.
The fixed-length part is not necessary in my case.
It sounds like you need to use a program like diff.
If you are just trying to compare text, a hash is not the way to go, because slight differences in input cause completely different output. (That is why hashes are used to store passwords and to verify text.) Character-difference programs are pretty complicated; unless you are really interested in how they work and want to write your own, I would just use a solution like the one shown here, which uses sdiff to get a percentage.
Percentage value with GNU Diff
You could use some sort of Levenshtein distance algorithm. This works for small pieces of text, but I'm fairly sure that something similar can be applied to large chunks of text.
Ref: http://en.m.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance
I've found out that tag order in webpages can create a very distinctive pattern, that remains the same even if portions of text / css / script change. So I've made a string generated by the tag order (ex: html head meta title body div table tr td span bold... => "hhmtbdttsb...") and then I just do exact matches between these strings. I can even apply the Levenshtein distance algorithm and get accurate results.
If I didn't have html, I would have used the punctuation/end-lines for splitting, or something similar.
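The tag-order idea above can be sketched roughly as follows in Python. This is only an illustration under assumptions: the names TagOrderParser, tag_signature, levenshtein and similarity are invented here, the signature uses the first letter of each opening tag as in the "hhmtbdttsb..." example, and the percentage is simply one minus the edit distance divided by the longer signature's length.

```python
from html.parser import HTMLParser


class TagOrderParser(HTMLParser):
    """Collects the first letter of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.letters = []

    def handle_starttag(self, tag, attrs):
        self.letters.append(tag[0])


def tag_signature(html: str) -> str:
    """Build a string such as 'hhtbdp' from the order of opening tags."""
    parser = TagOrderParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.letters)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(sig1: str, sig2: str) -> float:
    """Return a 0.0-1.0 score; 1.0 means identical signatures."""
    longest = max(len(sig1), len(sig2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(sig1, sig2) / longest


page1 = "<html><head><title>A</title></head><body><div><p>2012/10/01 one</p></div></body></html>"
page2 = "<html><head><title>B</title></head><body><div><p>2012/10/02 two</p></div></body></html>"
print(tag_signature(page1), similarity(tag_signature(page1), tag_signature(page2)))
```

Exact matches can be found with an ordinary index on the signature column; the edit distance is only needed for near matches and is much more expensive to run across millions of pages.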

Regex to extract text from inside an HTML tag

I know this has been asked at least a thousand times but I can't find a proper regex that will match a name in this string here:
<td><div id="topbarUserName">Donald</div></td>
I want to get the name 'Donald' and the regex that's the closest is >[a-zA-Z0-9]+ but the result is >Donald.
I'm coding in PureBasic (its syntax is similar to that of Basic) and it uses the PCRE library for regular expressions.
Can anyone help?
Josh's pattern will work if you only make use of the numbered group, not the whole match. If you have to use the whole match, use something like (?<=>)(\w+?)(?=<)
Either way, regex is widely known to not be good for parsing HTML.
Explanation:
(?<=) is used to check if something appears before the current item.
\w+? will match any "word"-character, one or more times, but stop whenever the rest of the pattern matches something, for this situation the ? could have been left out.
(?=) is used to check if something appears after the current item.
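As a quick illustration of the difference between using the whole match and using the numbered group, here is a small Python sketch. Python's re module is not PCRE, so treat it as an assumption that PureBasic's PCRE behaves the same way for these particular constructs (it should, since fixed-width lookbehind and lookahead are supported by both).

```python
import re

html = '<td><div id="topbarUserName">Donald</div></td>'

# Lookarounds assert the surrounding > and < without consuming them,
# so the whole match is just the name.
whole = re.search(r"(?<=>)\w+(?=<)", html)

# Without lookarounds the delimiters are part of the whole match,
# and the name has to be read from group 1.
grouped = re.search(r">(\w+)<", html)

print(whole.group(0))
print(grouped.group(0))
print(grouped.group(1))
```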
Try this
It should capture anything that is a letter / number
>([\w]+)<
Also I'm not exactly sure what your project limitations are, but it would be much easier to do something like this
$('#topbarUserName').text();
in jQuery instead of using a regex.
>([a-zA-Z]+) should do the trick. Remember to get the grouping right.
Why not do it with plain old basic string functions?
a.w = FindString(HTMLstring.s, "topbarUserName") ; 0 if the id is not found
If a > 0
  a = a + 16 ; skip "topbarUserName" (14 chars) plus the closing "> (2 chars)
  b.w = FindString(HTMLstring, "<", a)
  If b > 0
    c.w = b - a
    Donald.s = Mid(HTMLstring, a, c)
  EndIf
EndIf
Debug Donald

Regex for unclosed HTML tags

Does someone have a regex to match unclosed HTML tags? For example, the regex would match the <b> and second <i>, but not the first <i> or the first's closing </i> tag:
<i><b>test<i>ing</i>
Is this too complex for regex? Might it require some recursive, programmatic processing?
I'm sure some regex guru can cobble something together that approximates a solution, but it's a bad idea: HTML isn't regular. Consider either a HTML parser that's capable of identifying such problems, or parsing it yourself.
Yes, it requires recursive processing, potentially quite deep (or a fancy loop, of course); it is not going to be done with a regex. You could make a regex that handles a few levels of nesting, but not one that will work on just any HTML file. This is because the parser has to remember which tags are open at any given point in the stream, and regexes aren't good at that.
Use a SAX parser with some counters, or use a stack with pop off/push on to keep your state. Think about how to code this game to see what I mean about html tag depth. http://en.wikipedia.org/wiki/Tower_of_Hanoi
As @Pesto said, HTML isn't regular; you would have to build HTML grammar rules and apply them recursively.
If you are looking to fix HTML programmatically, I have used a component called HTML Tidy with considerable success. There are builds of it for most platforms (COM+, .NET, PHP etc.).
If you just need to fix it manually, I'd recommend a good IDE. Visual Studio 2008 does a good job, so does the latest Dreamweaver.
No, that's too complex for a regular expression. Your problem is equivalent to checking an arithmetic expression for properly balanced brackets, which needs at least a pushdown automaton.
In your case you should split the HTML code into opening tags, closing tags and text nodes (e.g. with a regular expression). Store the result in a list. Then you can iterate through the node list and push every opening tag onto a stack. If you encounter a closing tag in your node list, you must check that the topmost stack entry is an opening tag of the same type. Otherwise you have found the HTML syntax error you were looking for.
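The steps above can be sketched in Python as follows. This is a rough illustration, not a real HTML parser: the TAG pattern and the find_tag_errors name are assumptions made for this example, comments and attribute values containing > are not handled, and void elements such as <br> written without a slash will be reported as unclosed.

```python
import re

# Group 1: "/" for a closing tag, group 2: tag name, group 3: "/" for <tag />.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>")


def find_tag_errors(html: str) -> list:
    stack = []   # (name, offset) of opening tags that are not closed yet
    errors = []
    for m in TAG.finditer(html):
        closing, name, self_closing = m.group(1), m.group(2).lower(), m.group(3)
        if self_closing:
            continue  # <br/> style tags open and close in one token
        if not closing:
            stack.append((name, m.start()))
        elif stack and stack[-1][0] == name:
            stack.pop()  # closing tag matches the most recently opened tag
        else:
            errors.append(f"unexpected </{name}> at offset {m.start()}")
    # Whatever is left on the stack was never closed.
    errors.extend(f"unclosed <{name}> at offset {pos}" for name, pos in stack)
    return errors


print(find_tag_errors("<i><b>test<i>ing</i>"))
```

Note that a strict stack pairs the </i> with the nearest open <i>, so for the question's sample it reports the first <i> and the <b> as unclosed rather than the <b> and the second <i>.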
I've got a case where I am dealing with single, self-contained lines. The following regular expression worked for me: <[^/]+$ which matches a "<" and then anything that's not a "/".
You can use RegEx to identify all the html begin/end elements, and then enumerate with a Stack, Push new elements, and Pop the closing tags. Try this in C# -
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static bool ValidateHtmlTags(string html)
{
    // Group 2 captures an opening tag name, group 4 a closing tag name.
    string expr = "(<([a-zA-Z]+)\\b[^>]*>)|(</([a-zA-Z]+) *>)";
    Regex regex = new Regex(expr, RegexOptions.IgnoreCase);
    var stack = new Stack<Tuple<string, string>>();
    bool valid = true;
    foreach (Match match in regex.Matches(html))
    {
        string element = match.Value;
        string beginTag = match.Groups[2].Value;
        string endTag = match.Groups[4].Value;
        if (beginTag == "")
        {
            // A closing tag must match the most recently opened tag.
            // Check Count first: Peek() throws on an empty stack.
            if (stack.Count > 0 && stack.Peek().Item1 == endTag)
                stack.Pop();
            else
            {
                valid = false;
                break;
            }
        }
        else if (!element.EndsWith("/>"))
        {
            // Write more informative message here if desired
            string message = string.Format("Char({0})", match.Index);
            stack.Push(new Tuple<string, string>(beginTag, message));
        }
    }
    if (stack.Count > 0)
        valid = false;
    // Alternative return stack.Peek().Item2 for more informative message
    return valid;
}
I suggest using Nokogiri:
Nokogiri::HTML::DocumentFragment.parse(html).to_html

Qt Regex matches HTML Tag InnerText

I have a html file with one <pre>...</pre> tag. What regex is necessary to match all content within the pre's?
QString pattern = "<pre>(.*)</pre>";
QRegExp rx(pattern);
rx.setCaseSensitivity(cs);
int pos = 0;
QStringList list;
while ((pos = rx.indexIn(clipBoardData, pos)) != -1) {
list << rx.cap(1);
pos += rx.matchedLength();
}
list.count() is always 0
HTML is not a regular language, you do not use regular expressions to parse it.
Instead, use QXmlSimpleReader to load the XML, then QXmlQuery to find the PRE node and then extract its contents.
DO NOT PARSE HTML USING Regular Expressions!
Instead, use a real HTML parser, such as this one
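To show what the parser-based approach looks like in general, here is a minimal sketch using Python's standard html.parser module. It is not Qt code and is not the parser linked above; the PreCollector and pre_contents names are invented for this illustration.

```python
from html.parser import HTMLParser


class PreCollector(HTMLParser):
    """Collects the text found inside each top-level <pre> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while we are inside a <pre>
        self.blocks = []  # one string per <pre> element

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1
            if self.depth == 1:
                self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.blocks[-1] += data


def pre_contents(html: str) -> list:
    parser = PreCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


print(pre_contents("<html><body><pre>line 1\nline 2</pre><p>x</p></body></html>"))
```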
I did it using substrings:
int begin = clipBoardData.indexOf("<pre");
int end = clipBoardData.indexOf("</body>");
QString result = clipBoardData.mid(begin, end-begin);
The result includes the <pre> tags, but I found out that's even better ;)
I have to agree with the others. Drupal 6.x and older use regex to do a lot of work on the HTML data. It quickly breaks if you create pages of 64 KB or more. So using a DOM, or just indexOf() as you've done, is a better and much faster solution.
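Now, for those interested in knowing more about regex: Qt's regex syntax is modeled on Perl's. With QRegularExpression (Qt 5 and later) you can use the lazy operator, and your regex would become:
(<pre>.*?</pre>)+
to get each <pre> block in your code (although if you have only one, the question mark and the plus are not required; add QRegularExpression::DotMatchesEverythingOption if a block spans several lines). Note that no delimiters at the start and end of the regular expression are required here.
QRegExp does not support the lazy ? on individual quantifiers; call setMinimal(true) instead, which makes every quantifier in the pattern non-greedy:
QRegExp re("<pre>.*</pre>", Qt::CaseInsensitive);
re.setMinimal(true); // non-greedy matching for the whole pattern
re.indexIn(html_input);
QStringList list = re.capturedTexts(); // list[0] is the whole match
Now list[0] should hold the first <pre> block; to collect further blocks, call indexIn() again starting after the previous match.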
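The difference between greedy and lazy matching mentioned above is easy to see in a small experiment. The sketch below uses Python's re module rather than Qt, purely as an illustration; re.DOTALL is needed there so that . also matches newlines.

```python
import re

html = "<pre>first</pre>\n<p>between</p>\n<pre>second\nblock</pre>"

# Greedy: .* runs to the LAST </pre>, swallowing everything in between.
greedy = re.findall(r"<pre>(.*)</pre>", html, flags=re.DOTALL)

# Lazy: .*? stops at the FIRST </pre> after each <pre>.
lazy = re.findall(r"<pre>(.*?)</pre>", html, flags=re.DOTALL)

print(len(greedy), len(lazy))
```

With a single <pre> block both patterns give the same result, which is why the lazy operator only matters once there are several blocks.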