Free text search integrated with code coverage - language-agnostic

Is there any tool which will allow me to perform a free text search over a system's code, but only over the code which was actually executed during a particular invocation?
To give a bit of background, when learning my way around a new system, I frequently find myself wanting to discover where some particular value came from, but searching the entire code base turns up far more matches than I can reasonably assess individually.
For what it's worth, I've wanted this in Perl and Java at one time or another, but I'd love to know if any languages have a system supporting this feature.

You can generally twist a code coverage tool's arm and get a report that shows the paths that have been executed during a given run. This report should show the code itself, with the first few columns marked up according to the coverage tool's particular notation on whether a given path was executed.
You might be able to use this straight up, or you might have to preprocess it and either remove the code that was not executed, or add a new notation on each line that tells whether it was executed (most tools will only show path information at control points).
So from a coverage tool you might get a report like this:
T-  if(sometest)
    {
x       somecode;
    }
    else
    {
-       someother_code;
    }
The notation T- indicates that the if statement only ever evaluated to true, and so only the first branch of the code executed. The 'x' notation further down indicates that that line was executed.
You should be able to form a regex that matches only when the first column contains a T, F, or x so you can capture all the control statements executed and lines executed.
Sometimes you'll only get coverage information at each control point, which then requires you to parse the source file and mark the executed lines yourself. Not as easy, but not impossible either.
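As a rough sketch of that preprocessing step in Python (the T/F/x markers in the first column are an assumption matching the example above; every coverage tool has its own notation):

import re
import sys

# Keep only lines whose first-column marker says the path was taken,
# then search the surviving code for the value you're hunting.
EXECUTED = re.compile(r'^[TFx]')  # marker characters are tool-specific

def grep_executed(report_path, needle):
    with open(report_path) as report:
        for lineno, line in enumerate(report, 1):
            if EXECUTED.match(line) and needle in line:
                print('%s:%d: %s' % (report_path, lineno, line.rstrip()))

if __name__ == '__main__':
    grep_executed(sys.argv[1], sys.argv[2])

Point it at the annotated report instead of the raw source and your free-text search is automatically restricted to code that actually ran.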
Still, this sounds like an interesting question where the solution is probably more work than it's worth...
-Adam

Related

Racket: Using "csv-reading" package within a function

I am using csv-reading to read from a CSV file and convert it into a list.
When I call it at the top level, like this
> (call-with-input-file "to-be-asked.csv" csv->list)
I am able to read csv file and convert it into list of lists.
However, if I call the same thing within a function, I get an error.
> (read-from-file "to-be-asked.csv")
csv->list: undefined;
cannot reference an identifier before its definition
in module: top-level
I don't see what's going wrong. I have added (require csv-reading) before the function call.
My read-from-file code is:
(define (read-from-file file-name)
  (call-with-input-file file-name csv->list))
EDIT
I am using Racket within Emacs via Geiser. When I (exit) the REPL buffer and reopen it with C-c C-z, I see the error.
When I kill the buffer and start Geiser again, it works properly.
Is this a problem with Geiser and Emacs?
You've hit the classic problem with what I'll call resident programming environments (I don't know the right word for them). A resident programming environment is one where you talk to a running instance of the language, successively modifying its state.
The problem with these environments is that the state of the running language instance is more-or-less opaque and in particular it can get out of sync with the state you can see in files or buffers. That means that it can become obscure why something is happening and, worse, you can get into states where the results you get from the resident environment are essentially unreproducible later. This matters a lot for things like Jupyter notebooks where people doing scientific work can end up with results which they can't reproduce because the notebook was evaluated out of sequence or some of it was not evaluated at all.
On the other hand, these environments are an enormous joy to use which is why I use them. That outweighs the problems for me: you just have to be careful you can recreate the session and be willing to do so occasionally.
In this case you probably had something like this in the buffer/file:
(require csv-reading)

(define (read-from-file file-name)
  (call-with-input-file file-name csv->list))
But you either failed to evaluate the first form at all, or (worse!) you evaluated the forms out of order. If you did this in Common Lisp or any traditional Lisp it would all be fine: evaluating the first form would make the second form work. But Racket decides once and for all what csv->list means (or does not mean) at the point where read-from-file is defined, and a later require won't fix that. You then end up in the mysterious situation where the function you defined does not work, but if you define a new function which uses csv->list, it will work. This is because Racket has much cleverer semantics than CL, but also semantics not designed for this kind of interactive development as far as I can tell (certainly DrRacket strongly discourages it).

Is there anyway to undo only parts of code that are highlighted in sublime text

Are there any plugins that would do this? Let's say you highlight a code block; when you press undo, it undoes the last change within that code block.
I've done a lot of digging into Sublime's internals, and I don't think this is possible. Commands (processes executed when you select a menu item or hit a key combination) are implemented in one of two ways: either in Python, making use of the API, or internally in C++ and compiled directly into the executable or a library. If a command is implemented in pure Python, such as delete_word (source in Packages/Default/delete_word.py), you can edit the source if necessary or take portions to use in your own code. However, if a command is implemented internally, there's not much you can do to modify it, unless it has options that are documented somewhere. You basically have to use it as-is.
Which brings us to the undo/redo commands, and the edit history. As far as I can tell (since Sublime is not open source - yet), this entire functionality is completely implemented internally, with only the command names exposed. I have been completely unable to find any way of viewing or accessing the actual changes made to the undo/redo stack. The command_history() method of the sublime.View class accesses the commands in the undo/redo stack, but not the actual changes they made.
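To illustrate how little is exposed, here is a minimal plugin sketch (assuming the Sublime Text 3 Python API) that walks the undo stack via command_history():

import sublime_plugin

class DumpUndoStackCommand(sublime_plugin.TextCommand):
    # Prints the command names and arguments on the undo stack -- which
    # is everything the API exposes. The actual text changes each
    # command made are absent, so a selection-scoped undo can't be
    # built on top of this.
    def run(self, edit):
        index = 0
        while True:
            # index 0 is the most recent command, -1 the one before it...
            name, args, repeat = self.view.command_history(index)
            if not name and args is None:
                break  # ran off the end of the history
            print(index, name, args, repeat)
            index -= 1

Save it under Packages/User and run view.run_command('dump_undo_stack') from the console: you'll see entries like ('insert', {'characters': ...}, 1), but nothing about the regions they affected.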
So, all of this is to say that one likely could not make a plugin that could access the change history of an arbitrary selection in Sublime. One of the major issues (aside from the fact that the change history of the view is not accessible) is that the text you select now might not correspond to anything at a certain point in the history - it might not exist, or could have been altered so fundamentally that it would be essentially impossible to identify which changes should be associated with the selection and which should not. I have never heard of a similar feature in any other editor, most likely for that exact reason.

Reusing existing functions

When adding a new feature to an existing system, if you come across an existing function that almost does what you need, is it best practice to:
Copy the existing function and make your changes on the new copy (knowing that copying code makes your fellow devs cry).
-or-
Edit the existing function to handle both the existing case and your new case, risking that you may introduce new bugs into existing parts of the system (which makes the QA team cry)?
If you edit the existing function where do you draw the line before you should just create a new independent function (based on a copy)...10% of the function, 50% of the function?
You can't always do this, but one solution here would be to split your existing function into smaller parts, allowing you to use the parts you need without editing all the code, and making it easier to edit small pieces of code.
That being said, if you think you could introduce new bugs into existing parts of the system without noticing it, you should probably think about using unit tests.
Rule of thumb I tend to follow is that if I can cover the new behaviour by adding an extra parameter (or new valid value) to the existing function, while leaving the code more-or-less "obviously the same" in the existing case, then there's not much danger in changing a function.
For example, old code:
def utf8len(s):
    return len(s.encode('utf8'))  # or maybe something more memory-efficient
New use case - I'm writing some code in a style that uses the null object pattern, so I want utf8len(None) to return None instead of throwing an exception. I could define a new function utf8len_nullobjectpattern, but that's going to get quite annoying quite quickly, so:
def utf8len(s):
    if s is not None:
        return len(s.encode('utf8'))  # old code path is untouched
    else:
        return None  # new code path introduced
Then even if the unit tests for utf8len were incomplete, I can bet that I haven't changed the behavior for any input other than None. I also need to check that nobody was ever relying on utf8len to throw an exception for a None input, which is a question of (1) quality of documentation and/or tests; and (2) whether people actually pay any attention to defined interfaces, or just Use The Source. If the latter, I need to look at calling sites, but if things are done well then I pretty much don't.
Whether the old allowed inputs are still treated "obviously the same" isn't really a question of what percentage of code is modified, it's how it's modified. I've picked a deliberately trivial example, since the whole of the old function body is visibly still there in the new function, but I think it's something that you know when you see it. Another example would be making something that was fixed configurable (perhaps by passing a value, or a dependency that's used to get a value) with a default parameter that just provides the old fixed value. Every instance of the old fixed thing is replaced with (a call to) the new parameter, so it's reasonably easy to see on a diff what the change means. You have (or write) at least some tests to give confidence that you haven't broken the old inputs via some stupid typo, so you can go ahead even without total confidence in your test coverage.
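Sticking with the utf8len example, a sketch of that make-it-configurable move (encoded_len is a hypothetical name):

# Before: the encoding was fixed.
#     def utf8len(s):
#         return len(s.encode('utf8'))

# After: the fixed encoding becomes a default parameter whose default
# is the old fixed value, so every existing call site is unchanged.
def encoded_len(s, encoding='utf8'):
    return len(s.encode(encoding))

assert encoded_len('abc') == 3           # old behaviour preserved
assert encoded_len('abc', 'utf16') == 8  # new capability (BOM + 2 bytes/char)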
Of course you want comprehensive testing, but you don't necessarily have it. There are also two competing maintenance imperatives here: 1 - don't duplicate code, since if it has bugs in it, or behavior that might need to change in future, then you're duplicating the bugs / current behavior. 2 - the open/closed principle, which is a bit high-falutin' but basically says, "write stuff that works and then don't touch it". 1 says that you should refactor to share code between these two similar operations, 2 says no, you've shipped the old one, either it's usable for this new thing or it isn't, and if it isn't then leave it alone.
You should always strive to avoid code duplication. Therefore I would suggest that you try to write a new function that modifies the return value of the already existing function to implement your new feature.
I do realize that in some cases it might not be possible to do that. In those cases you should consider rewriting the existing function without changing its interface. Introducing new bugs while doing so should be prevented by unit tests that can be run on the modified function before you add it to the project code.
And if you need only part of the existing function, consider extracting a new function from the existing one and using this new "helper" function in both the existing function and your new one. Again, confirm everything is working as intended via unit tests.
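A minimal sketch of that extraction (all names hypothetical):

# The shared work moves into a helper that both functions call.
def normalize(record):
    return {key.lower(): value.strip() for key, value in record.items()}

def load_record(record):
    # existing function: interface and behaviour unchanged
    return normalize(record)

def load_record_with_defaults(record, defaults):
    # new function reuses the helper instead of copying its body
    return {**defaults, **normalize(record)}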

What does “Data is just dumb code, and code is just smart data” mean? [closed]

I just came across an idea in The Structure And Interpretation of Computer Programs:
Data is just dumb code, and code is just smart data
I fail to understand what it means. Can someone help me understand it better?
This is one of the fundamental lessons of SICP and one of the most powerful ideas of computer science. It works like this:
What we think of as "code" doesn't actually have the power to do anything by itself. Code defines a program only within a context of interpretation -- outside of that context, it is just a stream of characters. (Really a stream of bits, which is really a stream of electrical impulses. But let's keep it simple.) The meaning of code is defined by the system within which you run it -- and this system just treats your code as data that tells it what you wanted to do. C source code is interpreted by a C compiler as data describing an object file you want it to create. An object file is treated by the loader as data describing some machine instructions you want to queue up for execution. Machine instructions are interpreted by the CPU as data defining the sequence of state transitions it should undergo.
Interpreted languages often contain mechanisms for treating data as code, which means you can pass code into a function in some form and then execute it -- or even generate code at run time:
#!/usr/bin/perl
# Note that the above line explicitly defines the interpretive context for the
# rest of this file. Without the context of a Perl interpreter, this script
# doesn't do anything.
sub foo {
    my ($expression) = @_;
    # $expression is just a string that happens to be valid Perl
    print "$expression = " . eval("$expression") . "\n";
}
foo("1 + 1 + 2 + 3 + 5 + 8");              # sum of first six Fibonacci numbers
foo(join(' + ', map { $_ * $_ } (1..10))); # sum of first ten squares
Some languages, like Scheme, have a concept of "first-class functions", which means that you can treat a function as data and pass it around without evaluating it until you really want to.
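The same idea in Python, as a quick sketch:

# A function is a value: store it, pass it around, call it later.
def twice(f, x):
    return f(f(x))

def increment(n):
    return n + 1

print(twice(increment, 3))  # 5 -- the function travelled as data, then ran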
The upshot is that the division between "code" and "data" is pretty much arbitrary, a function of perspective only. The lower the level of abstraction, the "smarter" the code has to be: it has to contain more information about how it should be executed. On the other hand, the more information the interpreter supplies, the more dumb the code can be, until it starts to look like data with no smarts at all.
One of the most powerful ways to write code is as a simple description of what you need: Data which will be turned into code describing how to get you what you need by the interpretive context. We call this "declarative programming".
For a concrete example, consider HTML. HTML does not describe a Turing-complete programming language. It is merely structured data. Its structure contains some smarts that let it control the behavior of its interpretive context -- but not a lot of smarts. On the other hand, it contains more smarts than the paragraphs of text that appear on an average web page: Those are pretty dumb data.
In the context of security: Due to buffer overflows, what you thought of as data and thus harmless (such as an image) can become executed as code and p0wn your machine.
In the context of software development: Many developers are very afraid of "hardcoding" things and very keen on extracting parameters that might have to change into configuration files. This is often based on the idea that config files are just "data" and thus can be changed easily (perhaps by customers) without raising the issues (compilation, deployment, testing) that changing anything in the code would.
What these developers don't realize is that since this "data" influences the behaviour of the program, it really is code; it could break the program and the only reason not to require complete testing after such a change is that, if done correctly, the configurable values have a very specific, well-documented effect and any invalid value or a broken file structure will be caught by the program.
However, what all too often happens is that the config file structure becomes a programming language in its own right, complete with control flow and everything - one that's badly documented, has a quirky syntax and parser and which only the most experienced developers in the team can touch without breaking the application completely.
So, in a language like Scheme, even code is treated as first class data. You can treat functions and lambda expressions much like you treat other code, say passing them into other functions and lambda expressions. I recommend continuing with the text as this will all become quite clear.
This is something you come to understand from writing a compiler.
One common step in compilers is to transform the program into an abstract syntax tree. The representation is often a tree such as [+, 2, 3], where + is the root and 2 and 3 are its children.
Lisp languages simply treat this as their data. So there is no separation between data and code; both are lists that look like AST trees.
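A toy illustration of that in Python (a hypothetical evaluator, not how any real Lisp works): the "program" is an ordinary nested list, and running it is just a function over that data.

import operator

OPS = {'+': operator.add, '*': operator.mul}  # binary operators only

def evaluate(expr):
    # A program is a nested list: [operator, operand, operand]
    if isinstance(expr, list):
        op, left, right = expr
        return OPS[op](evaluate(left), evaluate(right))
    return expr  # a bare number is its own value

print(evaluate(['+', 2, ['*', 3, 4]]))  # 14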
Code is definitely data, but data is definitely not always code. Let's take a basic example - a customer name. It has nothing to do with code; it's a functional (essential) aspect of an application, as opposed to a technical (accidental) one.
You could probably say that any technical/accidental data is code and that functional/essential data is not.

What's the difference between data and code?

To take an example, consider a set of discounts available to a supermarket shopper.
We could define these rules as data in some standard fashion (lists of qualifying items, applicable dates, coupon codes) and write generic code to handle these. Or, we could write each as a chunk of code, which checks for the appropriate things given the customer's shopping list and returns any applicable discounts.
You could reasonably store the rules as objects, serialised into Blobs or stored in code files, so that each rule could choose its own division between data and code, to allow for future rules that wouldn't fit the type of generic processor considered above.
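To make the two options concrete, here is a minimal sketch of each (all names and fields hypothetical):

from datetime import date

# Option 1: rules as data, with one generic interpreter.
RULES = [
    {"item": "apples", "discount": 0.10, "until": date(2025, 1, 31)},
    {"item": "bread",  "discount": 0.05, "until": date(2025, 6, 30)},
]

def discounts_from_data(basket, today):
    return [rule["discount"] for rule in RULES
            if rule["item"] in basket and today <= rule["until"]]

# Option 2: each rule is its own chunk of code.
def apples_rule(basket, today):
    return 0.10 if "apples" in basket and today <= date(2025, 1, 31) else None

def discounts_from_code(basket, today, rules=(apples_rule,)):
    results = (rule(basket, today) for rule in rules)
    return [d for d in results if d is not None]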
It's often easy to criticise code that mixes data in, via if statements that check for 6 different things that should be in a file or a database, but is there a rule that helps in the edge cases?
Or is this the point of Object Oriented design, to stop us worrying about the line between data and code?
To clarify, the underlying question is this: How would you code the above example? Is there a rule of thumb that made you decide what is data and what is code?
(Note: I know, code can be compiled, but in a world of dynamic languages and JIT compilation, even that is a blurry concept.)
Fundamentally, there is of course no difference between data and code, but for real software infrastructures, there can be a big difference. Apart from obvious things like, as you mentioned, compilation, the biggest issue is this:
Most sufficiently large projects are designed to produce "releases" that are one big bundle, produced in 3-month (or longer) cycles, tested extensively and cannot be changed afterwards except in tightly controlled ways. "Code" most definitely cannot be changed, so anything that does need to be changed has to be factored out and made "configuration data", so that changing it becomes palatable to those whose job it is to ensure that a release works.
Of course, in most cases bad configuration data can break a release just as thoroughly as bad code, so the whole thing is largely an illusion - in reality it doesn't matter whether it's code or "configuration data" that changes, what matters is that the interface between the main system and the parts that change is narrow and well-defined enough to give you a good chance that the person who does the change understands all consequences of what he's doing.
This is already harder than most people think when it's really just a few strings and numbers that are configured (I've personally witnessed a production mainframe system crash because it had one boolean value set differently than another system it was talking to). When your "configuration data" contains complex logic, it's almost impossible to achieve. But the situation isn't going to be any better just because you use a badly-designed ad hoc "rules configuration" language instead of "real" code.
This is a rather philosophical question (which I like) so I'll answer it in a philosophical way: with nothing much to back it up. ;)
Data is the part of a system that can change. Code defines behavior; the way in which data can change into new data.
To put it more accurately: Data can be described by two components: a description of what the datum is supposed to represent (for instance, a variable with a name and a type) and a value.
The value of the variable can change according to rules defined in code. The description does not change, of course, because if it does, we have a whole new piece of information.
The code itself does not change, unless requirements (what we expect of the system) change.
To a compiler (or a VM), code is actually the data on which it performs its operations. However, the to-be-compiled code does not specify behavior for the compiler, the compiler's own code does that.
It all depends on the requirement. If the data is lookup data that changes frequently, you don't really want to do it in code; but things like the day of the week should not change for the next 200 years or so, so code that.
You might consider changing your topic, as the first thing I thought of when I saw it was the age-old LISP discussion of code vs data. Luckily, in Scheme code and data look the same, but that's about it: you can never accidentally mix code with data, as is very possible in LISP with unhygienic macros.
Data are information that are processed by instructions called Code. I'm not sure I feel there's a blurring in OOD, there are still properties (Data) and methods (Code). The OO theory encapsulates both into a gestalt entity called a Class but they are still discrete within the Class.
How flexible you want to make your code is a matter of choice. Including constant values (what you are doing by using if statements as described above) is inflexible without re-processing your source, whereas using dynamically sourced data is more flexible. Is either approach wrong? I would say it really depends on the circumstances. As leppie said, there are certain 'data' points that are invariant, like the days of the week, that can be hard-coded, but even there it may be advantageous to do it dynamically in certain circumstances.
"In Lisp, your code is data, and your data is code."
"In Prolog, clauses are terms, and terms are clauses."
The important note is that you want to separate out the part of your code that will execute the same every time, (i.e. applying a discount) from the part of your code which could change (i.e. the products to be discounted, or the % of the discount, etc.)
This is simply for safety. If a discount changes, you won't have to re-write your discount code, you'll only need to go into your discounts repository (DB, or app file, or xml file, or however you choose to implement it) and make a small change to a number.
Also, if the discount data is separated out into an XML file, then you can give the entire application to a manager and, with sufficient instructions, they won't need to pester you whenever they want to change the discount rates.
When you mix in data and code, you are exponentially increasing the odds of breaking when anything changes. So, as leppie said, you need to extract the constantly changing parts, and put them in a separate place.
Huge difference. Data is a given to the system, while code is a part of the system.
Wrong data is merely senseless: if our code (the handler) is good, then what you put in is what you get out; it is not the system's fault that you meant something else. But if the code is bad, the system is bad.
For example, consider some JSON, some bad code parser.js written by me, and, let's say, the good V8. To my system, the bad parser.js is code, and my system works incorrectly. But to Google's system, my bad parser is just data, which says nothing about the quality of V8.
The question is very practical, not sophistry.
Systems engineering (https://en.wikipedia.org/wiki/Systems_engineering) tries to give a good answer (and make money doing so).
Data is information. It's not about where you decide to put it, be it a db, config file, config through code or inside the classes.
The same happens for behaviors / code. It's not about where you decide to put it or how you choose to represent it.
The line between data and code (program) is blurry. It's ultimately just a question of terminology - for example, you could say that data is everything that is not code. But, as you wrote, they can be happily mixed together (although usually it's better to keep them separate).
Code is any data which can be executed. Now since all data is used as input to some program at some point in time, it can be said that all data is executed by a program! Thus your program acts as a virtual machine for your data. Hence, in theory, there is no difference between data and code!
In the end what matters is software engineering/development considerations like performance, efficiency etc. For example, data-driven programs may not be as efficient as programs with hard-coded (and hence fragile) conditional statements. Hence I choose to define code as any data which can be efficiently executed, and everything else as plain data.
It's a tradeoff between flexibility and efficiency. Executable data (like XML rules) offers more flexibility (sometimes) while the same data/rules when coded as part of the application will run more efficiently but changing it frequently becomes cumbersome. In other words executable data is easy to deploy but is inefficient and vice-versa. So ultimately the decision rests with you - the software designer.
Please correct me if I'm wrong.
The relationship between code and data is as follows:
Code, after being compiled into a program, processes data during execution.
A program can extract data, transform data, load data, and generate data...
Also, a program can extract code, transform code, load code, and generate code too...
Hence code without a compiler or interpreter is useless, while data is always worth something; but once compiled, code can do all of the above.
For example: a source control system processes source code - here the source code itself is code. Backup scripts process files - here the files are data. And so on.
I would say that the distinction between data, code and configuration is something to be made within the context of a particular component. Sometimes it's obvious, sometimes less so.
For example, to a compiler, the source code it consumes and the object code it creates are both data - and should be separated from the compiler's own code.
In your case you seem to be describing the option of a particularly powerful configuration file, which can contain code. Much as, for example, the GIMP lets you 'configure' plugins using Scheme. As the developer of the component that reads this configuration, you would think of it as data. When working at a different level -- writing the configuration -- you would think of it as code.
This is a very powerful way of designing.
Applying this to the underlying question ("How would you code the above example?"), one option might be to adopt or design a high level Domain Specific Language (DSL) for specifying rules. At startup, or when first required, the server reads the rule and executes it.
Provide an admin interface allowing the administrator to
test a new rule file
replace the current configuration with that from a new rule file
... all of which would happen at runtime.
A DSL might be something as simple as a table parser or an XML parser, or it could be something as sophisticated as a scripting language. From C, it's easy to embed Python or Lua. From Java it's easy to embed Groovy or Clojure.
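At the simple end of that spectrum, the "DSL" might be no more than a table parser; a sketch in Python (the rule-file format is hypothetical):

from datetime import date

def parse_rules(text):
    # One rule per line: item,discount,expiry -- e.g. "apples,0.10,2025-01-31"
    rules = []
    for line in text.strip().splitlines():
        item, discount, until = line.split(',')
        rules.append({"item": item,
                      "discount": float(discount),
                      "until": date.fromisoformat(until)})
    return rules

Because the rules live in a plain file, the admin interface above could validate a candidate file simply by running it through parse_rules before swapping it in.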
You could switch in compiled code at runtime, with clever linking or classloader tricks. This seems more difficult and less valuable than the embedded DSL option, in my opinion.
The best practical answer to this question I found is this:
Any class that needs to be serialized, now or in any foreseeable future, is data.
Everything else is code.
That's why, for example, Java's HashMap is data - although it has a lot of code, API methods and a specific implementation (i.e., it might look like code at first glance).