Redacted comments in MS's source code for .NET [duplicate] - open-source

The Reference Source page for stringbuilder.cs has this comment in the ToString method:
if (chunk.m_ChunkLength > 0)
{
    // Copy these into local variables so that they
    // are stable even in the presence of ----s (hackers might do this)
    char[] sourceArray = chunk.m_ChunkChars;
    int chunkOffset = chunk.m_ChunkOffset;
    int chunkLength = chunk.m_ChunkLength;
What does this mean? Is ----s something a malicious user might insert into a string to be formatted?

The source code for the published Reference Source is pushed through a filter that removes objectionable content. Verboten words are one category; Microsoft programmers do use profanity in their comments. Another is the names of individual developers, whose identities Microsoft wants to hide. Such a word or name is substituted by dashes.
In this case you can tell what used to be there from the CoreCLR, the open-sourced version of the .NET Framework. It is a verboten word:
// Copy these into local variables so that they are stable even in the presence of race conditions
That comment was hand-edited from the original you looked at before being submitted to GitHub; Microsoft also doesn't want to accuse its customers of being hackers. The original said races, which the filter turned into ----s :)

In the CoreCLR repository you have a fuller quote:
Copy these into local variables so that they are stable even in the presence of race conditions
GitHub
Basically: it's a threading consideration.
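To make that concrete, here is a minimal, hypothetical Java sketch of the same defensive pattern (illustrative code, not the actual StringBuilder source): the field is snapshotted into a local, so another thread swapping the field cannot invalidate the bounds check before the copy runs.
public final class ChunkCopy {
    // Another thread may replace this array reference at any time.
    private volatile char[] chunkChars = new char[16];

    public char[] copyPrefix(int length) {
        char[] source = this.chunkChars;   // stable local snapshot
        if (length < 0 || length > source.length) {
            throw new IllegalArgumentException("length out of range: " + length);
        }
        // The check above and the copy below both use the snapshot,
        // so a concurrent swap of chunkChars cannot cause an overrun.
        char[] dest = new char[length];
        System.arraycopy(source, 0, dest, 0, length);
        return dest;
    }
}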

In addition to the great answer by @Jeroen, this is more than just a threading consideration. It's also there to prevent someone from intentionally creating a race condition and causing a buffer overflow that way. Later in the code, the length of that local variable is checked. If the code checked the length of the field accessible to other threads instead, it could have changed on a different thread between the time the length was checked and wstrcpy was called:
// Check that we will not overrun our boundaries.
if ((uint)(chunkLength + chunkOffset) <= ret.Length && (uint)chunkLength <= (uint)sourceArray.Length)
{
    // Imagine that another thread has changed the chunk.m_ChunkChars array here!
    // We would now be in big trouble; our attempt to prevent a buffer overflow has been thwarted!
    // Oh wait, we're OK, because we're using a local variable that the other thread can't access anyway.
    fixed (char* sourcePtr = sourceArray)
        string.wstrcpy(destinationPtr + chunkOffset, sourcePtr, chunkLength);
}
else
{
    throw new ArgumentOutOfRangeException("chunkLength", Environment.GetResourceString("ArgumentOutOfRange_Index"));
}
}
chunk = chunk.m_ChunkPrevious;
} while (chunk != null);
Really interesting question though.

I don't think that this is the case: the code in question copies to local variables to prevent bad things from happening if the StringBuilder instance is mutated on another thread.
I think the ---- may relate to a four letter swear word...

Related

How to add back comments/whitespace in a translator using ANTLR4's visitor model

I'm currently writing a TSQL (Sybase/Microsoft SQL) to MySQL translator using the ANTLR4 visitor approach.
I'm able to push comments and whitespaces to different channels so that I can use that information later.
What's not super clear is:
how do I get the data back?
and more importantly how do I plug the comments and whitespaces back into my translated MySQL code?
Re: #1, this seems to work to get the list of all tokens including the comments/whitespaces:
public static List<Token> getHiddenTokensFromString(String sqlIn, int hiddenChannel) {
    CharStream charStream = CharStreams.fromString(sqlIn);
    CaseChangingCharStream upper = new CaseChangingCharStream(charStream, true);
    TSqlLexer lexer = new TSqlLexer(upper);
    CommonTokenStream commonTokenStream = new CommonTokenStream(lexer, hiddenChannel);
    commonTokenStream.fill();
    List<Token> hiddenTokens = commonTokenStream.getTokens();
    return hiddenTokens;
}
Re #2, what makes it particularly challenging is that as part of the translation, lines of SQL have to be moved around, some lines removed and some lines added.
Any help will be greatly appreciated.
Thanks.
The ANTLR4 lexer creates a number of tokens, each with an index (a running number). Provided you didn't just skip a token, all tokens are available for later inspection, once the parsing step is done, regardless of their channels (the channel is actually just a number property on a token).
So, given a token you want to translate, get its index and then ask the token stream for the tokens with the next smaller or next higher index. These are usually the hidden whitespace and comment tokens.
Once you have the whitespace token use its start and stop index to get the original text from the char stream. And since you know where you are in the translation process when you do that, it should be easy to know where to insert the original text.
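For what it's worth, here is a hedged Java sketch of that lookup using the ANTLR4 runtime's BufferedTokenStream.getHiddenTokensToLeft (the surrounding class and method names are illustrative, not from the question):
import org.antlr.v4.runtime.BufferedTokenStream;
import org.antlr.v4.runtime.Token;
import java.util.Collections;
import java.util.List;

public final class HiddenTokenLookup {

    // Returns the hidden tokens (comments/whitespace on hiddenChannel) that
    // immediately precede the token at tokenIndex, or an empty list if there are none.
    public static List<Token> hiddenTokensBefore(BufferedTokenStream tokens,
                                                 int tokenIndex,
                                                 int hiddenChannel) {
        List<Token> hidden = tokens.getHiddenTokensToLeft(tokenIndex, hiddenChannel);
        return hidden != null ? hidden : Collections.emptyList();
    }

    // Reassembles the original text of those hidden tokens so it can be
    // re-emitted next to the translated statement.
    public static String originalText(List<Token> hiddenTokens) {
        StringBuilder sb = new StringBuilder();
        for (Token t : hiddenTokens) {
            sb.append(t.getText());
        }
        return sb.toString();
    }
}
In a visitor method you would typically call hiddenTokensBefore(tokens, ctx.getStart().getTokenIndex(), hiddenChannel) and prepend the returned text to the translated statement.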

Is there any advantage in specifying types of variables and return type of functions?

I always set the types of my variables and functions, a habit I brought over from learning Java; it seems like the right thing to do.
But I often see "weak typing" in other people's code, and I can't really argue against it, since I don't know what the real advantages of keeping everything strongly typed are.
I think my question is clear, but I'll give some examples:
var id = "Z226";
function changeId(newId){
    id = newId;
    return newId;
}
My code would be like this:
var id:String = "Z226";
function changeId(newId:String):String{
    id = newId;
    return newId;
}
Yes, the big advantages are:
faster code execution, because the runtime knows the type and does not have to evaluate the call dynamically
better tool support: auto-completion and code hints work with typed arguments and return types
far better readability
You get performance benefits from strongly typing. See http://gskinner.com/talks/quick/#45
I also find strongly typed code to be much more readable, but I guess depending on the person they may not care.
As pointed out by florian, two advantages of strong typing are that development tools can use the information to provide better code hinting and code completion, and that the type, as an explicit indicator of how the variable or method is intended to be used, makes the code much easier to understand.
The question of performance seems to be up for debate. However, this answer on stackoverflow suggests that typed is definitely faster than untyped in certain benchmark tests but, as the author states, not so much that you would notice it under normal conditions.
However, I would argue that the biggest advantage of strong typing is that you get a compiler error if you attempt to assign or return a value of the wrong type. This helps prevent the kind of pernicious bug which can only be tracked down by actually running the program.
Consider the following contrived example, in which ActionScript automatically converts the result to a string before returning. Strongly typing the method's parameter and return type would ensure that the program does not compile and that a compile-time error is issued. This could potentially save you hours of debugging.
function increment(value) {
    return value + 1;
}
trace(increment("1"));
// 11
While the points in the other answers about code hinting and error checking are accurate, I want to address the claim about performance. It's really not all that true. In theory, strong typing allows the compiler to generate code that's closer to native. With the current VM, though, such optimization doesn't happen. Here and there the AS3 compiler will employ an integer instruction instead of a floating-point one. Otherwise the type indicators don't have much effect at runtime.
For example, consider the following code:
function hello():String {
    return "Hello";
}
var s:String = hello() + ' world';
trace(s);
Here're the AVM2 op codes resulting from it:
getlocal_0
pushscope
getlocal_0
getlocal_0
callproperty 4 0 ; call hello()
pushstring 12 ; push ' world' onto stack
add ; concatenate the two
initproperty 5 ; save it to var s
findpropstrict 7 ; look up trace
getlocal_0 ; push this onto stack
getproperty 5 ; look up var s
callpropvoid 7 1 ; call trace
returnvoid
Now, if I remove the type indicators, I get the following:
getlocal_0
pushscope
getlocal_0
getlocal_0
callproperty 3 0
pushstring 11
add
initproperty 4
findpropstrict 6
getlocal_0
getproperty 4
callpropvoid 6 1
returnvoid
It's exactly the same, except all the name indices are one less since 'String' no longer appears in the constant table.
I'm not trying to discourage people from employing strong typing. One just shouldn't expect miracles on the performance front.
EDIT: In case anyone is interested, I've put my AS3 bytecode disassembler online:
http://flaczki.net46.net/codedump/
I've improved it so that it now dereferences the operands.

DbContext.ChangeTracker, DbContext.Entry() inconsistencies

Under the debugger I have a case where DbContext.ChangeTracker.Entry(e) returns an entry with a State of Detached. When I enumerate the results of DbContext.ChangeTracker.Entries() and the entries of the underlying ObjectContext when looking for e, I find an entry with a State of Unchanged (expected).
What is going on?
Here are some additional details:
using POCO entities.
change Tracking is on
proxy creation is off
lazy loading is off
the problem does not occur when saving an entity for the first time (e.g. adding it to the context); it occurs when loading an existing entity into the context and then trying to make changes to it. This is an aggregate root with many "reference" entities that aren't supposed to change
Equals is overridden on the entities and IEquatable<T> is implemented. That code is generated by T4.
I am using a generic repository implementation that is declaratively configured to generate rules for saving (e.g. whether entities should be added, attached/modified, or attached/unchanged). It seems to be doing this in the right order. For example, the aggregate root is added/attached last, because attaching it first brings in other entities in a modified state (adding those first as unchanged prevents this).
(Answered in a question edit. Converted to a community wiki answer. See Question with no answers, but issue solved in the comments (or extended in chat).)
The OP wrote:
I have "solved" the problem, but I still want to know what's going on, because my solution doesn't do anything to address the root cause. My "solution" looks for an entity in the change tracker (I have also looked via the context.Entry() and context.Set().Local -- when I do it with this code (I did it as a loop instead of LINQ so I could set breakpoints), it works:
private DbEntityEntry GetChangeTrackedEntry(IEntity mine, Type type)
{
    foreach (var en in context.ChangeTracker.Entries())
    {
        if (en.Entity.GetType() != type)
            continue;
        if (((IEntity)en.Entity).Id != mine.Id)
            continue;
        return en;
    }
    return null;
}
When I attempt to look up an entity (via the change tracker, the set, etc.) using mine directly, that's when I end up with the Detached case.
I thought perhaps there were cases of EF using ReferenceEquals, but @Ladislav's comment may indicate something wrong with the Equals implementation.
If anyone has a further explanation they can edit that into this community wiki answer.

Are hard-coded STRINGS ever acceptable?

Similar to Is hard-coding literals ever acceptable?, but I'm specifically thinking of "magic strings" here.
On a large project, we have a table of configuration options like these:
Name         Value
----         -----
FOO_ENABLED  Y
BAR_ENABLED  N
...
(Hundreds of them).
The common practice is to call a generic function to test an option like this:
if (config_options.value('FOO_ENABLED') == 'Y') ...
(Of course, this same option may need to be checked in many places in the system code.)
When adding a new option, I was considering adding a function to hide the "magic string" like this:
if (config_options.foo_enabled()) ...
However, colleagues thought I'd gone overboard and objected to doing this, preferring the hard-coding because:
That's what we normally do
It makes it easier to see what's going on when debugging the code
The trouble is, I can see their point! Realistically, we are never going to rename the options for any reason, so about the only advantage I can think of for my function is that the compiler would catch any typo like fo_enabled(), but not 'FO_ENABLED'.
What do you think? Have I missed any other advantages/disadvantages?
If I use a string once in the code, I don't generally worry about making it a constant somewhere.
If I use a string twice in the code, I'll consider making it a constant.
If I use a string three times in the code, I'll almost certainly make it a constant.
if (config_options.isTrue('FOO_ENABLED')) {...
}
Restrict your hard coded Y check to one place, even if it means writing a wrapper class for your Map.
if (config_options.isFooEnabled()) {...
}
Might seem okay until you have 100 configuration options and 100 methods (so here you can make a judgement about future application growth and needs before deciding on your implementation). Otherwise it is better to have a class of static strings for parameter names.
if (config_options.isTrue(ConfigKeys.FOO_ENABLED)) {...
}
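As a rough Java sketch of that last approach (ConfigKeys and the isTrue wrapper are names borrowed from the snippets above; the Map-backed store is an assumption):
import java.util.Map;

// Central home for the option names, so a typo becomes a compile error.
final class ConfigKeys {
    static final String FOO_ENABLED = "FOO_ENABLED";
    static final String BAR_ENABLED = "BAR_ENABLED";

    private ConfigKeys() {}
}

// Single place that knows about the "Y"/"N" convention.
final class ConfigOptions {
    private final Map<String, String> values;

    ConfigOptions(Map<String, String> values) {
        this.values = values;
    }

    boolean isTrue(String name) {
        return "Y".equals(values.get(name));
    }
}
Call sites then read as if (config.isTrue(ConfigKeys.FOO_ENABLED)) { ... }, keeping a single naked "Y" in the whole codebase.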
I realise the question is old, but it came up on my margin.
AFAIC, the issue here has not been identified accurately, either in the question or the answers. Forget about "hardcoding strings" or not, for a moment.
The database has a Reference table, containing config_options. The PK is a string.
There are two types of PKs:
Meaningful Identifiers, that the users (and developers) see and use. These PKs are supposed to be stable, they can be relied upon.
Meaningless Id columns which the users should never see, that the developers have to be aware of, and code around. These cannot be relied upon.
It is ordinary and normal to write code using the absolute value of a meaningful PK: IF CustomerCode = "IBM" ... or IF CountryCode = "AUS", etc.
Referencing the absolute value of a meaningless PK is not acceptable (due to auto-increment, gaps being changed, and values being replaced wholesale).
.
Your reference table uses meaningful PKs. Referencing those literal strings in code is unavoidable. Hiding the value will make maintenance more difficult; the code is no longer literal; your colleagues are right. Plus there is the additional redundant function that chews cycles. If there is a typo in the literal, you will soon find that out during Dev testing, long before UAT.
Hundreds of functions for hundreds of literals is absurd. If you do implement a function, then normalise your code and provide a single function that can be used for any of the hundreds of literals. In which case, we are back to a naked literal, and the function can be dispensed with.
The point is, the attempt to hide the literal has no value.
.
It cannot be construed as "hardcoding"; that is something quite different. I think that is where your issue is: identifying these constructs as "hardcoded". It is just referencing a meaningful PK literally.
Now, from the perspective of any single code segment, if you use the same value a few times, you can improve the code by capturing the literal string in a variable and then using the variable in the rest of the code block. Certainly not a function. But that is an efficiency and good-practice issue. Even that does not change the effect of IF CountryCode = @cc_aus.
You really should use constants and not hard-coded literals.
You can say they won't be changed, but you never know, and it is best to make it a habit to use symbolic constants.
In my experience, this kind of issue is masking a deeper problem: failure to do actual OOP and to follow the DRY principle.
In a nutshell, capture the decision at startup time by an appropriate definition for each action inside the if statements, and then throw away both the config_options and the run-time tests.
Details below.
The sample usage was:
if (config_options.value('FOO_ENABLED') == 'Y') ...
which raises the obvious question, "What's going on in the ellipsis?", especially given the following statement:
(Of course, this same option may need to be checked in many places in the system code.)
Let's assume that each of these config_option values really does correspond to a single problem domain (or implementation strategy) concept.
Instead of doing this (repeatedly, in various places throughout the code):
Take a string (tag),
Find its corresponding other string (value),
Test that value as a boolean-equivalent,
Based on that test, decide whether to perform some action.
I suggest encapsulating the concept of a "configurable action".
Let's take as an example (obviously just as hypothetical as FOO_ENABLED ... ;-) that your code has to work in either English units or metric units. If METRIC_ENABLED is "true", convert user-entered data from metric to English for internal computation, and convert back prior to displaying results.
Define an interface:
public interface MetricConverter {
    double toInches(double length);
    double toCentimeters(double length);
    double toPounds(double weight);
    double toKilograms(double weight);
}
which identifies in one place all the behavior associated with the concept of METRIC_ENABLED.
Then write concrete implementations of all the ways those behaviors are to be carried out:
public class NullConv implements MetricConverter {
    public double toInches(double length) {return length;}
    public double toCentimeters(double length) {return length;}
    public double toPounds(double weight) {return weight;}
    public double toKilograms(double weight) {return weight;}
}
and
// lame implementation, just for illustration!!!!
public class MetricConv implements MetricConverter {
    public static final double LBS_PER_KG = 2.2D;
    public static final double CM_PER_IN = 2.54D;
    public double toInches(double length) {return length / CM_PER_IN;}
    public double toCentimeters(double length) {return length * CM_PER_IN;}
    public double toPounds(double weight) {return weight * LBS_PER_KG;}
    public double toKilograms(double weight) {return weight / LBS_PER_KG;}
}
At startup time, instead of loading a bunch of config_options values, initialize a set of configurable actions, as in:
MetricConverter converter = (metricOption()) ? new MetricConv() : new NullConv();
(where the expression metricOption() above is a stand-in for whatever one-time-only check you need to make, including looking at the value of METRIC_ENABLED ;-)
Then, wherever the code would have said:
double length = getLengthFromGui();
if (config_options.value('METRIC_ENABLED') == 'Y') {
    length = length / 2.54D;
}
// do some computation to produce result
// ...
if (config_options.value('METRIC_ENABLED') == 'Y') {
    result = result * 2.54D;
}
displayResultingLengthOnGui(result);
rewrite it as:
double length = converter.toInches(getLengthFromGui());
// do some computation to produce result
// ...
displayResultingLengthOnGui(converter.toCentimeters(result));
Because all of the implementation details related to that one concept are now packaged cleanly, all future maintenance related to METRIC_ENABLED can be done in one place. In addition, the run-time trade-off is a win; the "overhead" of invoking a method is trivial compared with the overhead of fetching a String value from a Map and performing String#equals.
I believe that the two reasons you have mentioned, a possible misspelling in the string that cannot be detected until run time and the (admittedly slim) possibility of a name change, justify your idea.
On top of that, you get typed functions. Right now it seems you only store booleans, but what if you need to store an int, a string, etc.? I would rather use a typed get_foo() than get_string("FOO") or get_int("FOO").
I think there are two different issues here:
In the current project, the convention of using hard-coded strings is already well established, so all the developers working on the project are familiar with it. It might be a sub-optimal convention for all the reasons that have been listed, but everybody familiar with the code can look at it and instinctively knows what the code is supposed to do. Changing the code so that in certain parts, it uses the "new" functionality will make the code slightly harder to read (because people will have to think and remember what the new convention does) and thus a little harder to maintain. But I would guess that changing over the whole project to the new convention would potentially be prohibitively expensive unless you can quickly script the conversion.
On a new project, symbolic constants are the way to go, IMO, for all the reasons listed. Especially because anything that lets the compiler catch at compile time errors that would otherwise be caught by a human at run time is a very useful convention to establish.
Another thing to consider is intent. If you are on a project that requires localization, hard-coded strings can be ambiguous. Consider the following:
const string HELLO_WORLD = "Hello world!";
print(HELLO_WORLD);
The programmer's intent is clear. Using a constant implies that this string does not need to be localized. Now look at this example:
print("Hello world!");
Here we aren't so sure. Did the programmer really not want this string to be localized or did the programmer forget about localization while he was writing this code?
I too prefer a strongly-typed configuration class if it is used throughout the code. With properly named methods you don't lose any readability. If you need to do conversions from strings to another data type (decimal/float/int), you don't need to repeat the conversion code in multiple places, and you can cache the result so the conversion only takes place once. You've already got the basis of this in place, so I don't think it would take much to get used to the new way of doing things.
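As a sketch of what such a strongly-typed configuration class might look like (a hypothetical Java example; the option names and the raw Map source are assumptions, not the poster's actual code):
import java.util.Map;

// Typed accessors over the raw name/value table; each value is parsed once
// at construction time instead of on every lookup.
public final class AppConfig {
    private final boolean fooEnabled;
    private final int maxRetries;

    public AppConfig(Map<String, String> raw) {
        this.fooEnabled = "Y".equals(raw.get("FOO_ENABLED"));
        this.maxRetries = Integer.parseInt(raw.getOrDefault("MAX_RETRIES", "3"));
    }

    public boolean isFooEnabled() { return fooEnabled; }

    public int maxRetries() { return maxRetries; }
}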

api documentation and "value limits": do they match?

Do you often see in API documentation (as in 'javadoc of public functions', for example) the description of "value limits" as well as the classic documentation?
Note: I am not talking about comments within the code
By "value limits", I mean:
can a parameter support a null value (or an empty String, or ...)?
can a 'return value' be null, or is it guaranteed to never be null (or can it be "empty", or ...)?
Sample:
What I often see (without having access to source code) is:
/**
 * Get all reader names for this current Report. <br />
 * <b>Warning</b> The Report must have been published first.
 * @param aReaderNameRegexp filter in order to return only readers matching the regexp
 * @return array of reader names
 */
String[] getReaderNames(final String aReaderNameRegexp);
What I like to see would be:
/**
 * Get all reader names for this current Report. <br />
 * <b>Warning</b> The Report must have been published first.
 * @param aReaderNameRegexp filter in order to return only readers matching the regexp
 *        (can be null or empty)
 * @return array of reader names
 *         (null if the Report has not yet been published,
 *         empty array if no reader matches the criteria,
 *         reader names matching the regexp, or all readers if the regexp is null or empty)
 */
String[] getReaderNames(final String aReaderNameRegexp);
My point is:
When I use a library with a getReaderNames() function in it, I often do not even need to read the API documentation to guess what it does. But I need to be sure how to use it.
My only concern when I want to use this function is: what should I expect in terms of parameters and return values? That is all I need to know to safely set up my parameters and safely test the return value, yet I almost never see that kind of information in API documentation...
Edit:
This can influence the usage or not for checked or unchecked exceptions.
What do you think? Value limits and API documentation: do they belong together or not?
I think they can belong together but don't necessarily have to belong together. In your scenario, it seems like it makes sense that the limits are documented in such a way that they appear in the generated API documentation and intellisense (if the language/IDE support it).
I think it does depend on the language as well. For example, Ada has a native data type that is a "restricted integer", where you define an integer variable and explicitly indicate that it will only (and always) be within a certain numeric range. In that case, the datatype itself indicates the restriction. It should still be visible and discoverable through the API documentation and intellisense, but wouldn't be something that a developer has to specify in the comments.
However, languages like Java and C# don't have this type of restricted integer, so the developer would have to specify it in the comments if it were information that should become part of the public documentation.
I think those kinds of boundary conditions most definitely belong in the API. However, I would (and often do) go a step further and indicate WHAT those null values mean. Either I indicate it will throw an exception, or I explain what the expected results are when the boundary value is passed in.
It's hard to remember to always do this, but it's a good thing for users of your class. It's also difficult to maintain if the contract the method presents changes (like null values are changed to not be allowed)... you have to be diligent also to update the docs when you change the semantics of the method.
Question 1
Do you often see in API documentation (as in 'javadoc of public functions' for example) the description of "value limits" as well as the classic documentation?
Almost never.
Question 2
My only concern when I want to use this function is: what should I expect in terms of parameters and return values? That is all I need to know to safely set up my parameters and safely test the return value, yet I almost never see that kind of information in API documentation...
If I used a function improperly, I would expect a RuntimeException thrown by the method, or a RuntimeException in another (sometimes very distant) part of the program.
Comments like @param aReaderNameRegexp filter in order to ... (can be null or empty) seem to me a way to implement Design by Contract in natural language inside Javadoc.
Using Javadoc to enforce Design by Contract was the approach of iContract, now resurrected as JcontractS, which lets you specify invariants, preconditions, and postconditions in a more formalized way than natural language.
Question 3
This can influence the usage or not for checked or unchecked exceptions.
What do you think ? value limits and API, do they belong together or not ?
The Java language doesn't have a Design by Contract feature, so you might be tempted to use exceptions, but I agree with you that you have to be aware of when to choose checked and unchecked exceptions. You might use unchecked IllegalArgumentException or IllegalStateException, or you might use unit testing, but the major problem is how to communicate to other programmers that such code is about Design by Contract and should be considered a contract before changing it too lightly.
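A small sketch of that "contract via unchecked exceptions" idea, reusing the getReaderNames example from the question (the Report class and its fields are assumptions made for illustration):
import java.util.List;

public final class Report {
    private final boolean published;
    private final List<String> readerNames;

    public Report(boolean published, List<String> readerNames) {
        this.published = published;
        this.readerNames = readerNames;
    }

    /**
     * Returns the reader names matching the given regexp.
     * Contract (do not change lightly): aReaderNameRegexp may be null or empty,
     * meaning "all readers"; the Report must already be published.
     * @throws IllegalStateException if the Report has not been published yet
     */
    public String[] getReaderNames(String aReaderNameRegexp) {
        if (!published) {
            throw new IllegalStateException("Report must be published before asking for reader names");
        }
        String effective = (aReaderNameRegexp == null || aReaderNameRegexp.isEmpty())
                ? ".*"
                : aReaderNameRegexp;
        return readerNames.stream()
                .filter(name -> name.matches(effective))
                .toArray(String[]::new);
    }
}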
I think they do, and I have always placed comments in the header files (C++) accordingly.
In addition to valid input/output/return comments, I also note which exceptions are likely to be thrown by the function (since I often want to use the return value for... well, returning a value, I prefer exceptions over error codes).
//File:
// Should be a path to the texture file to load; if it is not a full path (e.g. "c:\example.png") it will attempt to find the file using the paths provided by the DataSearchPath list
//Return: A pointer to a Texture instance is returned; in the event of an error, an exception is thrown. When you are finished with the texture you should call the Free() method.
//Exceptions:
//except::FileNotFound
//except::InvalidFile
//except::InvalidParams
//except::CreationFailed
Texture *GetTexture(const std::string &File);
@Fire Lancer: Right! I forgot about exceptions, but I would like to see them mentioned, especially the unchecked 'runtime' exceptions that this public method could throw.
@Mike Stone:
you have to be diligent also to update the docs when you change the semantics of the method.
Mmmm, I sure hope that the public API documentation is at the very least updated whenever a change that affects the contract of the function takes place. If not, that API documentation could be dropped altogether.
To add food for thought (and go along with @Scott Dorman), I just stumbled upon the future of Java 7 annotations.
What does that mean? That certain 'boundary conditions', rather than being in the documentation, would be better off in the API itself, and automatically used at compilation time to generate appropriate 'assert' code.
That way, if a '@CheckForNull' is in the API, the writer of the function might get away with not even documenting it! And if the semantics change, its API will reflect that change (like 'no more @CheckForNull', for instance).
That kind of approach suggests that documentation, for 'boundary conditions', is an extra bonus rather than a mandatory practice.
However, that does not cover the special values of the return object of a function. For that, complete documentation is still needed.
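For illustration, here is a sketch of how JSR-305-style annotations (javax.annotation.CheckForNull and javax.annotation.Nullable, assuming the jsr305 jar is on the classpath) could carry part of that contract, with Javadoc still covering the special return values:
import javax.annotation.CheckForNull;
import javax.annotation.Nullable;

public interface ReportReaders {

    /**
     * @param aReaderNameRegexp filter; null or empty means "all readers"
     * @return null if the Report has not been published yet,
     *         an empty array if no reader matches,
     *         otherwise the matching reader names
     */
    @CheckForNull
    String[] getReaderNames(@Nullable String aReaderNameRegexp);
}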