Use of templates to switch between input versions of data and analyze outputs in palantir-foundry

We are looking to build a single pipeline within a code repository that cleans, harmonizes, and transforms data to features of interest. We would like to apply that single pipeline code on different inputs and then test how the outputs look.
For example, we would like to test the pipeline on synthetic data, version 1 of 'real' data that includes only retrospective data, and version 2 of 'real' data that includes retrospective and prospective data. The comparison of the outputs could be what percent of patients had diabetes in version 1 compared to version 2.
I saw that you can template code repositories in Foundry. Is this a viable option? Could you template your code repository and apply it to the three scenarios I have described? Is there a better option?

If your data scale is reasonably small, I would recommend going down the test-driven path of development here instead of trying to compare and contrast results across a wide variety of datasets. You'll likely find the iteration time long and exact comparison of results difficult.
For this, you should follow the method I lay out here and create a representative dataset for each input you expect, as a .csv file in your repo. You can then feed these as inputs to your core code and inspect the outputs with ease.
This will let you 'tighten' your code much more easily and quickly, after which you can run the same logic on real, full-scale data and generate your outputs as you wish.
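As an illustration only (this is a generic pytest/PySpark sketch, not Foundry-specific API), the pattern might look like the following; the fixture file names, the column names, and the harmonize function are hypothetical stand-ins for your real pipeline logic:

# test_harmonize.py - a minimal sketch, assuming the core cleaning/harmonization
# logic is factored into a pure function that takes and returns a Spark DataFrame.
import pytest
from pyspark.sql import SparkSession, functions as F

def harmonize(df):
    # Stand-in for the real pipeline logic: flag diabetic patients.
    return df.withColumn("has_diabetes", F.col("diagnosis_code").startswith("E11"))

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("pipeline-tests").getOrCreate()

@pytest.mark.parametrize("fixture_csv", [
    "fixtures/synthetic.csv",         # synthetic data
    "fixtures/real_v1_retro.csv",     # version 1: retrospective only
    "fixtures/real_v2_retro_pro.csv", # version 2: retrospective + prospective
])
def test_diabetes_rate_is_plausible(spark, fixture_csv):
    df = spark.read.csv(fixture_csv, header=True, inferSchema=True)
    out = harmonize(df)
    rate = out.filter("has_diabetes").count() / out.count()
    assert 0.0 <= rate <= 1.0  # tighten per dataset as expectations firm up

This keeps the comparison you describe (e.g. percent of patients with diabetes in version 1 vs. version 2) as an explicit, repeatable assertion rather than a one-off manual inspection.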
Templating code is possible, but it should be adopted with great care. If what you're really solving for is comparing and contrasting the execution of your code on arbitrary schemas, then use test-driven, in-repo development. If what you're after is running a core set of logic across a wide variety of inputs once the code is working, then generated transforms will work great. If what you're really after is rolling out a large codebase of transformations across differently-permissioned projects, where each rollout must be completely independent of and configured separately from the others, then you might consider templates. I would stick to test-driven development and generated transforms until you prove otherwise.

Related

Build system that is not file-centric

We have a software infrastructure which works pretty much like a software build system: Information is gathered from different sources and used to generate some outputs. Like in traditional software builds we have different types of output, dependency trees, etc.
The main difference is that our sources, intermediate results and outputs are not inherently file-based. Rather, they're (uniquely addressable) data objects.
Right now we're mapping our data structure to files and directories, in combination with a traditional build system (SCons), but that does not scale, both in terms of performance and (more importantly) maintainability. Hence I'm looking for an infrastructure that's built for this purpose from the ground up.
As an illustration, assume you have 3 XML documents A, B and C. Let's say that B/foo/bar is to be calculated from A/x/y and A/x/z, and that similarly C/a/b is calculated from A/x/y. I need an infrastructure to
Implement these relationships (i.e. the transformations and their dependencies)
Automatically re-build the relevant parts after changes are made
One major problem with using files is that, if I map A, B and C to some files A.xml, B.xml and C.xml and use a traditional build system, then any change to A.xml will trigger a rebuild of B.xml and C.xml, even if A/x/y and A/x/z (the original dependencies of B) are not modified. For fine-grained dependency resolution I would therefore need to map each of A, B and C not to a file, but to a directory where each sub-directory represents an element, files represent attributes, etc. As I said, this does not scale for us.
(Please note that our system is not actually based on XML)
Right now I'm looking for any existing software, infrastructure or concept which points into this direction, regardless of implementation language and underlying data structures.
It sounds like you need an active object database management system (ODBMS) like GemStone/S. ODBMSs provide the traditional persistence services without the old cost of mapping data structures to files, and with the well-known benefits of object technology. Since you mention dependency trees and addressable objects: in ODBMSs, navigational references are stored as part of the data, allowing arbitrarily complex interaction patterns among objects to be represented and accessed. This is especially true if you anticipate a system which makes heavy use of inheritance, object nesting and cross-referencing.
Although an object engine may seem oversized for your requirements, it is common for large-scale production business systems to store and execute methods using ODBMSs, within a concurrent and multiuser environment. It doesn't come for free, because you have to invest in the human part of the equation (education and experience), but once the initial fear is overcome it will pay a return on the investment.
For re-building (subscribed) parts after changes (notifications from announcers) are made, you may use the Observer design pattern, or one of its variants (SASE or the Announcements framework), to implement your announce/subscribe architecture. Event frameworks of this type deal with intrinsic problems which are hard to solve in traditional file-based solutions, as you have noticed already. For example, it is typical for a dependency mechanism to have to manage the replacement of an object (in your example, an XML document) by another one. Any modern event framework should ensure that when an object is replaced, all dependents attached to the old object are updated to reference the new one.
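To make the announce/subscribe idea concrete, here is a minimal Python sketch (assumed names and structure, not GemStone/S or any particular framework) of element-level dependency tracking along the lines of the A/B/C example in the question:

# Derived values subscribe to the exact paths they read, so changing A/x/y
# re-computes B/foo/bar and C/a/b, while changing an unrelated path triggers nothing.
from collections import defaultdict

class Store:
    def __init__(self):
        self.values = {}                      # path -> value
        self.subscribers = defaultdict(list)  # path -> [recompute callbacks]

    def set(self, path, value):
        self.values[path] = value
        for recompute in self.subscribers[path]:
            recompute()

    def derive(self, target, deps, fn):
        def recompute():
            self.values[target] = fn(*(self.values[d] for d in deps))
        for d in deps:
            self.subscribers[d].append(recompute)
        recompute()

store = Store()
store.set("A/x/y", 2)
store.set("A/x/z", 3)
store.derive("B/foo/bar", ["A/x/y", "A/x/z"], lambda y, z: y + z)
store.derive("C/a/b", ["A/x/y"], lambda y: y * 10)
store.set("A/x/y", 5)  # only B/foo/bar and C/a/b are recomputed
print(store.values["B/foo/bar"], store.values["C/a/b"])  # prints: 8 50

An ODBMS gives you this kind of object-level granularity with persistence and concurrency on top, rather than something you maintain by hand.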
Finally, there is a free GemStone/S stack which includes an object dependency framework, so you can experiment with a real object database.
So nothing comes to mind that solves exactly your problem, but there are a few tools that might get you a little closer than you are now:
1) You might be able to throw something together using FUSE, which would give you better control over how your data objects are mapped out to files. FUSE basically allows you to construct arbitrary file systems from whatever backing data you want (the Python bindings are pretty friendly, but there are a number of other language interfaces available as well). Then you could use a traditional build tool and work with file-like objects that correspond more closely to your data; see the sketch after this list.
2) CMake has a pretty extensible language for writing custom targets that you might be able to press into service. Unfortunately its language is fairly idiosyncratic and has something of a steep learning curve, so it wouldn't be my first choice.
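If you do go the FUSE route, a rough, untested Python sketch (using the fusepy bindings; the object paths, contents and mountpoint below are made up) of exposing addressable data objects as a read-only file tree might look like this:

# Expose an in-memory dict of "data objects" as a read-only file system, so a
# conventional build tool sees each element (A/x/y, A/x/z, ...) as its own file.
import errno
import stat
import time
from fuse import FUSE, Operations, FuseOSError

OBJECTS = {"/A/x/y": b"1\n", "/A/x/z": b"2\n"}  # hypothetical backing data
MOUNT_TIME = time.time()  # a real mapping would surface each object's own change time

class ObjectFS(Operations):
    def getattr(self, path, fh=None):
        if path == "/" or any(p.startswith(path + "/") for p in OBJECTS):
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2,
                        st_mtime=MOUNT_TIME, st_ctime=MOUNT_TIME, st_atime=MOUNT_TIME)
        if path in OBJECTS:
            return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1,
                        st_size=len(OBJECTS[path]),
                        st_mtime=MOUNT_TIME, st_ctime=MOUNT_TIME, st_atime=MOUNT_TIME)
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        prefix = "" if path == "/" else path
        names = {p[len(prefix) + 1:].split("/")[0]
                 for p in OBJECTS if p.startswith(prefix + "/")}
        return [".", ".."] + sorted(names)

    def read(self, path, size, offset, fh):
        return OBJECTS[path][offset:offset + size]

if __name__ == "__main__":
    FUSE(ObjectFS(), "/mnt/objects", foreground=True, ro=True)

Because each element gets its own file (and, in a real mapping, its own modification time), the build tool's timestamp-based dependency checks become as fine-grained as your object model.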

Parsing language for both binary and character files

The problem:
You have some data, and your program needs input in a specified form - for example, strings that represent numbers. You are searching for a way to transform the original data into the format you need.
And the problem is: the source can be anything. It can be XML, property lists, or binary data in which the needed values are deeply embedded in binary junk. And your output format may vary as well: it can be number strings, floats, doubles....
You don't want to write a program for each case. You want routines that give you commands capable of transforming the data into the form you wish. They would surely include regular expressions, but the whole thing would be well designed and offer capabilities that are often much easier to use and more powerful.
ADDITION:
Many users have this problem and hope that their programs can convert, read and write data which is given to them by other sources. If they can't, those users are stuck, or they turn to programs like business intelligence suites. That is NOT the problem.
I am talking about a tool for a developer who knows what he is doing, but who is also tired of writing such routines in a general-purpose language every time. A professional data manipulation tool: something like a hex editor, regex, vi, grep and a parser melted together, accessible through routines or a REPL.
If you have the spec of the data format, you can access and transform the data at once - no need to debug or to plan meticulously how to program the transformation. I am searching for a solution because I don't believe the problem is new.
It allows:
joining/grouping/merging of results
inserting/deleting/finding/replacing
write macros which allow a command chain to be executed repeatedly
meta-grouping (lists->tables->n-dimensional tables)
Example (No, I am not looking for a solution to this, it is just an example):
You want to read XML strings embedded in a binary file with variable-length records. Your tool reads the record length and deletes the junk surrounding your text. Now it splits open the XML and extracts the strings. The numbers are Indian digit glyphs and contain decimal commas instead of decimal points, so your tool transliterates them into ASCII and replaces the commas with points. Now the results must be stored in matrices of variable length.... etc. etc.
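For comparison, here is roughly what that example takes in plain Python; the record layout (a 4-byte little-endian length prefix) and the XML element names are invented purely for illustration:

# Parse length-prefixed binary records, pull out an embedded XML fragment,
# and normalize localized digit glyphs / decimal commas into floats.
import re
import struct
import unicodedata
import xml.etree.ElementTree as ET

def records(blob):
    """Yield the payload of each length-prefixed record in the binary blob."""
    pos = 0
    while pos < len(blob):
        (length,) = struct.unpack_from("<I", blob, pos)
        yield blob[pos + 4 : pos + 4 + length]
        pos += 4 + length

def to_float(text):
    """Convert localized digit glyphs and decimal commas into a Python float."""
    ascii_digits = "".join(str(unicodedata.digit(c)) if c.isdigit() else c for c in text)
    return float(ascii_digits.replace(",", "."))

def extract(blob):
    rows = []
    for payload in records(blob):
        match = re.search(rb"<record>.*?</record>", payload, re.DOTALL)
        if not match:
            continue  # record contained only junk
        root = ET.fromstring(match.group(0).decode("utf-8"))
        rows.append([to_float(v.text) for v in root.iter("value")])
    return rows  # a ragged "matrix" of variable-length rows

None of it is hard, but all of it is plumbing - which is exactly the kind of thing the hoped-for tool would provide as ready-made commands.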
I am searching for a good language / language-design and if possible, an implementation.
Which design do you like, or, even if it does not fulfill all the conditions, which one would you not want to be without?
EDIT: The question is whether a solution for this problem exists and, if yes, which implementations are available. You DO NOT implement your own sorting algorithm if Quicksort, Mergesort and Heapsort are available. You DO NOT invent your own text parsing method if you have regular expressions. You DO NOT invent your own 3D language for graphics if OpenGL/Direct3D is available. There are existing solutions, or at least papers describing the problem and giving suggestions. And there are people who may have worked on and experienced such problems and who can give ideas and suggestions. The idea that this problem is totally new and that I should work it out and implement it myself without background knowledge seems to me, I must admit, totally off the mark.
UPDATE:
Unfortunately I had less time than anticipated to delve into the subject because our development team is currently in a hot phase. But I have contacted the author of TextTransformer and he kindly answered my questions.
I have investigated TextTransformer (http://www.texttransformer.de) in the meantime, and as far as I can see it offers a complete and efficient solution if you are going to parse character data.
For anyone who wants to try implementing a good parsing language, the smallest set of operators that, if (!) powerful enough, could directly transform any input data into any output data seems to be:
Insert/Remove: Self-explanatory
Group/Ungroup: Split the input data into a set of tokens and organize them into groups and supergroups (data structures, lists, tables etc.)
Transform
Substitution: Change the content of the tokens (special operation: replace)
Transposition: Change the order of tokens (swap, merge etc.)
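As a sanity check that such a small operator set composes usefully, a throwaway Python sketch of a few of these operators (the names and the sample input are of course made up) could look like this:

# Composable operators over token lists: Group (tokenize), Remove, Substitution,
# and meta-grouping of a flat list into a table, chained by a reusable pipeline.
import re
from functools import reduce

def tokenize(pattern):     # Group: split raw text into tokens
    return lambda text: re.findall(pattern, text)

def remove(predicate):     # Remove: drop tokens matching a predicate
    return lambda tokens: [t for t in tokens if not predicate(t)]

def substitute(old, new):  # Substitution: change the content of tokens
    return lambda tokens: [t.replace(old, new) for t in tokens]

def group(size):           # Meta-grouping: lists -> tables
    return lambda tokens: [tokens[i:i + size] for i in range(0, len(tokens), size)]

def pipeline(*ops):        # Macro: a reusable chain of commands
    return lambda data: reduce(lambda acc, op: op(acc), ops, data)

clean = pipeline(
    tokenize(r"[0-9,]+"),
    remove(lambda t: t == ","),
    substitute(",", "."),
    group(2),
)
print(clean("junk 3,14 junk 2,72 | 1,41 junk 1,73"))
# [['3.14', '2.72'], ['1.41', '1.73']]

Whether this stays pleasant at scale is exactly the open question, but it shows the operators are enough to get from raw text to a structured result.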
Have you investigated TextTransformer?
I have no experience with this, but it sounds pretty good and the author makes quite competent posts in the comp.compilers newsgroup.
You still have to do some programming work, though.
For a programmer, I would suggest:
Perl against a SQL backend.
For a non-programmer, what it sounds like you're looking for is some sort of business intelligence suite.
This suggestion may broaden the scope of your search too much... but here it is:
You could either reuse, as-is, or otherwise get "inspiration" from the [open source] code of the SnapLogic framework.
Edit (answering the comment on SnapLogic documentation etc.)
I agree, the SnapLogic documentation leaves a bit to be desired, in particular for people in your situation, i.e. when just trying to get a quick overview of what SnapLogic can do and whether it would generally meet their needs, without investing much time in learning the system in earnest.
Also, I realize that the scope and typical uses of SnapLogic differ somewhat from the requirements expressed in the question, and I should have taken the time to better articulate the possible connection.
So here goes...
A salient and powerful feature of SnapLogic is its ability to [virtually] codelessly create "pipelines" i.e. processes made from pre-built components;
Components addressing the most common needs of Data Integration tasks at-large are supplied with the SnapLogic framework. For example, there are components to
read in and/or write to files in CSV or XML or fixed length format
connect to various SQL backends (for either input, output or both)
transform/format [readily parsed] data fields
sort records
join records for lookups and general "denormalized" record building (akin to SQL joins but applicable to any input [of reasonable size])
merge sources
filter records within a source (to select and, at a later step, work on, say, only records with attribute "State" equal to "NY")
see this list of available components for more details
A relatively weak area of SnapLogic's functionality (for the purpose described by the OP) is parsing. The standard components only read generic file formats (XML, RSS, CSV, fixed length, DBMSes...), so structured (or semi-structured?) files like the one described in the question, with binary and text mixed together, are unlikely to ever be handled by a standard component.
You'd therefore need to write your own parsing logic, in Python or Java, respecting the SnapLogic API of course so the module can later "play nice" with the other ones.
BTW, the task of parsing the files described could be done in one of two ways: with a "monolithic" reader component (i.e. one which takes in the whole file and produces an array of readily parsed records), or with a multi-component approach, whereby an input component reads in and parses the file at "record" level (or line level or block level, whatever that may be), and other standard or custom SnapLogic components are used to create a pipeline which effectively expresses the logic of parsing a record (or block or...) into its individual fields/attributes.
The second approach is of course more modular, and may be applicable if the goal is to process many different file formats, whereby each new format requires piecing together components with little or no coding. Whatever approach is used for the input/parsing of the file(s), the SnapLogic framework remains available to create pipelines that then process the parsed input in various fashions.
My understanding of the question therefore prompted me to suggest SnapLogic as a possible framework for the problem at hand, because I understood the feature gap concerning the "codeless" parsing of oddly-formatted files, but also saw some commonality of features with regard to creating various processing pipelines.
I also hedged my suggestion, with an expression like "inspire oneself from", because of the possible feature gap, but also because of the relative lack of maturity of the SnapLogic offering and its apparent commercial/open-source ambivalence.
(Note: this statement is neither a critique of the technical maturity/value of the framework per se, nor a critique of business-oriented use of open source, but rather a warning that business/commercial pressures may shape the offering in various directions.)
To summarize:
Depending on the specific details of the vision expressed in the question, SnapLogic may be worthy of consideration, provided one understands that "some-assembly-required" will apply, in particular in the area of file parsing, and that the specific shape and nature of the product may evolve (but then again it is open source so one can freeze it or bend it as needed).
A more generic remark is that SnapLogic is based on Python, which is a very swell language for coding various connectors, conversion logic etc.
In reply to Paul Nathan: you mentioned writing throwaway code as something rather unpleasant. I don't see why it should be. After all, all of our code will be thrown away and replaced eventually, no matter how perfectly we write it. So my opinion is that writing throwaway code is pretty much OK, as long as you don't spend too much time writing it.
So it seems that there are two approaches to solving your problem: either a) find some specific tool intended for the purpose (parse data, perform some basic operations on it and store it in some specific structure) or b) use some general-purpose language with lots of libraries and code it yourself.
I don't think approach a) is viable, because sooner or later you'll bump into an obstacle not covered by the tool and you'll spend your time and nerves hacking the tool, or mailing the authors and waiting for them to implement what you need. I may well be wrong, so if you do find a perfect tool, please drop a link here (I myself do a lot of data processing in my day job and I can't swear that I couldn't do it more efficiently).
Approach b) may at first seem "unpleasant", but given a nice high-level expressive language with a bunch of useful libraries (regexps, XML manipulation, parser construction...) it shouldn't be too hard, and may gradually be turned into a DSL for this very purpose. Besides Perl, which was already mentioned, Python and Ruby sound like good candidates for such languages (I bet some Lisp derivatives too, but I have no experience there).
You might find AntlrWorks useful if you go so far as defining formal grammars for what you're parsing.

Why should I use code generators

I have encountered this topic lately and couldn't understand why code generators are needed.
Can you explain why I should use them in my projects and how they can ease my life?
Examples would be great, as well as pointers to where I can learn a little more about this topic.
At least you have framed the question from the correct perspective =)
The usual reasons for using a code generator are given as productivity and consistency because they assume that the solution to a consistent and repetitive problem is to throw more code at it. I would argue that any time you are considering code generation, look at why you are generating code and see if you can solve the problem through other means.
A classic example of this is data access: you could generate 250 classes (one for each table in the schema), effectively creating a table-gateway solution, or you could build something more like a domain model and use NHibernate / ActiveRecord / LightSpeed / [pick your ORM] to map a rich domain model onto the database.
While both the hand-rolled solution and the ORM are effectively code generators, the primary difference is when the code is generated. With the ORM it is an implicit step that happens at run-time and is therefore one-way by its nature. The hand-rolled solution requires an explicit step to generate the code during development, and the likelihood that the generated classes will need customising at some point creates problems when you re-generate the code. The explicit step that must happen during development introduces friction into the development process and often leads to code that violates DRY (although some argue that generated code can never violate DRY).
Another reason for touting code generation comes from the MDA / MDE world (Model Driven Architecture / Engineering). I don't put much stock in this, but rather than providing a number of poorly expressed arguments, I'm simply going to co-opt someone else's - http://www.infoq.com/articles/8-reasons-why-MDE-fails.
IMHO code generation is the only solution in an exceedingly narrow set of problems and whenever you are considering it, you should probably take a second look at the real problem you are trying to solve and see if there is a better solution.
One type of code generation that really does enhance productivity is "micro code-generation", where the use of macros and templates allows a developer to generate new code directly in the IDE and tab / type their way through placeholders (e.g. namespace / class name etc). This sort of code generation is a feature of ReSharper and I use it heavily every day. The reason that micro-generation succeeds where most large-scale code generation fails is that the generated code is not tied back to any other resource that must be kept in sync, and therefore once the code is generated, it is just like all the other code in the solution.
#John
Moving the creation of "basic classes" from the IDE into XML / a DSL is often seen when doing big-bang development - a classic example would be developers trying to reverse-engineer the database into a domain model. Unless the code generator is very well written, it simply introduces an additional burden on the developer, in that every time they need to update the domain model they either have to context-switch and update the XML / DSL, or they have to extend the domain model and then port those changes back to the XML / DSL (effectively doing the work twice).
There are some code generators that work very well in this space (the LightSpeed designer is the only one I can think of atm) by acting as the engine for a design surface, but often these code generators generate terrible code that cannot be maintained (e.g. the WinForms / WebForms design surfaces, the EF1 design surface) and therefore rapidly undo any productivity benefits gained from using the code generator in the first place.
Well, it's either:
you write 250 classes, all pretty much the same but slightly different, e.g. to do data access; it takes you a week, and it's boring, error-prone and annoying
OR:
you invest 30 minutes into generating a code template, and let a generation engine handle the grunt work in another 30 minutes
So a code generator gives you:
speed
reproducibility
a lot less errors
a lot more free time! :-)
Excellent examples:
Linq-to-SQL T4 templates by Damien Guard to generate one separate file per class in your database model, using the best kept Visual Studio 2008 secret - T4 templates
PLINQO - same thing, but for Codesmith's generator
and countless more.....
Any time you need to produce large amounts of repetitive boilerplate code, a code generator is the guy for the job. The last time I used a code generator was when creating a custom Data Access Layer for a project, where the skeleton for the various CRUD actions was created based on an object model. Instead of hand-coding all those classes, I put together a template-driven code generator (using StringTemplate) to make them for me. The advantages of this approach were:
It was faster (there was a large amount of code to generate)
I could regenerate the code on a whim in case I detected a bug (code can sometimes have bugs in the early versions)
Less error-prone: when we had an error in the generated code, it was everywhere, which means it was more likely to be found (and, as noted in the previous point, it was easy to fix and regenerate the code).
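The pattern is simple enough that a throwaway sketch fits in a few lines. Here is an analogous Python version (using the standard library's string.Template rather than StringTemplate, with a made-up object model) that stamps out CRUD skeletons:

# Generate one repository class per model entry from a single template.
from string import Template

CLASS_TEMPLATE = Template("""\
public class ${name}Repository
{
    public ${name} GetById(int id) { /* SELECT ... FROM ${table} WHERE id = @id */ }
    public void Insert(${name} item) { /* INSERT INTO ${table} ... */ }
    public void Update(${name} item) { /* UPDATE ${table} SET ... */ }
    public void Delete(int id)       { /* DELETE FROM ${table} WHERE id = @id */ }
}
""")

MODEL = [("Customer", "customers"), ("Order", "orders"), ("Product", "products")]

for name, table in MODEL:
    with open(f"{name}Repository.generated.cs", "w") as f:
        f.write(CLASS_TEMPLATE.substitute(name=name, table=table))

Fixing a bug in the skeleton means fixing the template once and regenerating, which is the point of the list above.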
Using GUI builders that generate code for you is a common practice. Thanks to this you don't need to create all the widgets manually; you just drag & drop them and then use the generated code. For simple widgets this really saves time (I have used this a lot with wxWidgets).
Really, when you are using almost any programming language, you are using a "code generator" (except for assembly or machine code.) I often write little 200-line scripts that crank out a few thousand lines of C. There is also software you can get which helps generate certain types of code (yacc and lex, for example, are used to generate parsers to create programming languages.)
The key here is to think of your code generator's input as the actual source code, and think of the stuff it spits out as just part of the build process. In which case, you are writing in a higher-level language with fewer actual lines of code to deal with.
For example, here is a very long and tedious file I (didn't) write as part of my work modifying the Quake2-based game engine CRX. It takes the integer values of all #defined constants from two of the headers, and makes them into "cvars" (variables for the in-game console.)
http://meliaserlow.dyndns.tv:8000/alienarena/lua_source/game/cvar_constants.c
Here is the short Bash script which generated that code at compile-time:
http://meliaserlow.dyndns.tv:8000/alienarena/lua_source/autogen/constant_cvars.sh
Now, which would you rather maintain? They are both equivalent in terms of what they describe, but one is vastly longer and more annoying to deal with.
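This is not the linked Bash script, but a small Python sketch of the same technique; the header parsing is simplified, and the emitted register_cvar() call and the include are invented for illustration:

# Scan C headers for integer #define constants and emit a C file that registers
# each one as a console cvar.
import re
import sys

DEFINE = re.compile(r"^#define\s+(\w+)\s+(-?\d+)\s*$")

def constants(header_paths):
    for path in header_paths:
        with open(path) as f:
            for line in f:
                m = DEFINE.match(line)
                if m:
                    yield m.group(1), int(m.group(2))

with open("cvar_constants.c", "w") as out:
    out.write('#include "game_headers.h"\n\nvoid register_constant_cvars(void)\n{\n')
    for name, value in constants(sys.argv[1:]):
        out.write(f'    register_cvar("{name}", {value});\n')
    out.write("}\n")

Run it as part of the build (e.g. python gen_cvars.py header1.h header2.h) and treat the generated .c file as a build artifact, not as source to edit.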
The canonical example of this is data access, but I have another example. I've worked on a messaging system that communicates over serial port, sockets, etc., and I found I kept having to write classes like this over and over again:
using System;
using System.IO;
using System.Text;

public class FooMessage
{
    public FooMessage()
    {
    }

    public FooMessage(int bar, string baz, DateTime blah)
    {
        this.Bar = bar;
        this.Baz = baz;
        this.Blah = blah;
    }

    // Deserialize the fields from the wire format.
    public void Read(BinaryReader reader)
    {
        this.Bar = reader.ReadInt32();
        this.Baz = Encoding.ASCII.GetString(reader.ReadBytes(30));
        this.Blah = new DateTime(reader.ReadInt16(), reader.ReadByte(),
            reader.ReadByte());
    }

    // Serialize the fields in the same order and sizes as Read.
    public void Write(BinaryWriter writer)
    {
        writer.Write(this.Bar);
        writer.Write(Encoding.ASCII.GetBytes(
            this.Baz.PadRight(30).Substring(0, 30)));
        writer.Write((Int16)this.Blah.Year);
        writer.Write((byte)this.Blah.Month);
        writer.Write((byte)this.Blah.Day);
    }

    public int Bar { get; set; }
    public string Baz { get; set; }
    public DateTime Blah { get; set; }
}
Try to imagine, if you will, writing this code for no fewer than 300 different types of messages. The same boring, tedious, error-prone code being written, over and over again. I managed to write about 3 of these before I decided it would be easier for me to just write a code generator, so I did.
I won't post the code-gen code, it's a lot of arcane CodeDom stuff, but the bottom line is that I was able to compact the entire system down to a single XML file:
<Messages>
  <Message ID="12345" Name="Foo">
    <ByteField Name="Bar"/>
    <TextField Name="Baz" Length="30"/>
    <DateTimeField Name="Blah" Precision="Day"/>
  </Message>
  (More messages)
</Messages>
How much easier is this? (Rhetorical question.) I could finally breathe. I even added some bells and whistles so it was able to generate a "proxy", and I could write code like this:
var p = new MyMessagingProtocol(...);
SetFooResult result = p.SetFoo(3, "Hello", DateTime.Today);
In the end I'd say this saved me writing a good 7500 lines of code and turned a 3-week task into a 3-day task (well, plus the couple of days required to write the code-gen).
Conclusion: Code generation is only appropriate for a relatively small number of problems, but when you're able to use one, it will save your sanity.
A code generator is useful if:
The cost of writing and maintaining the code generator is less than the cost of writing and maintaining the repetition that it is replacing.
The consistency gained by using a code generator will reduce errors to a degree that makes it worthwhile.
The extra difficulty of debugging generated code will not make debugging so much less efficient that it outweighs the benefits from 1 and 2.
For domain-driven or multi-tier apps, code generation is a great way to create the initial model or data access layer. It can churn out the 250 entity classes in 30 seconds (or in my case 750 classes in 5 minutes). This then leaves the programmer to focus on enhancing the model with relationships, business rules or deriving views within MVC.
The key thing here is when I say initial model. If you are relying on the code generation to maintain the code, then the real work is being done in the templates. (As stated by Max E.) And beware of that because there is risk and complexity in maintaining template-based code.
If you just want the data layer to be "automagically created" so you can "make the GUI work in 2 days", then I'd suggest going with a product/toolset which is geared towards the data-driven or two-tier application scenario.
Finally, keep in mind "garbage in=garbage out". If your entire data layer is homogeneous and does not abstract from the database, please please ask yourself why you are bothering to have a data layer at all. (Unless you need to look productive :) )
How 'bout an example of a good use of a code generator?
This uses T4 templates (a code generator built into Visual Studio) to generate compressed CSS from .less files:
http://haacked.com/archive/2009/12/02/t4-template-for-less-css.aspx
Basically, it lets you define variables, real inheritance, and even behavior in your style sheets, and then create normal css from that at compile time.
Everyone talks here about simple code generation, but what about model-driven code generation (like MDSD or DSM)? This helps you move beyond the simple ORM/member accessors/boilerplate generators and into code generation of higher-level concepts for your problem domain.
It's not productive for one-off projects, but even for these, model-driven development introduces additional discipline, better understanding of employed solutions and usually a better evolution path.
Like 3GLs and OOP provided an increase in abstraction by generating large quantities of assembly code based on a higher level specification, model-driven development allows us to again increase the abstraction level, with yet another gain in productivity.
MetaEdit+ from MetaCase (mature) and ABSE from Isomeris (my project, in alpha, info at http://www.abse.info) are two technologies on the forefront of model-driven code generation.
What is needed really is a change in mindset (like OOP required in the 90's)...
I'm actually adding the finishing touches to a code generator I'm using for a project I've been hired on. We have huge XML files of definitions, and in a day's worth of work I was able to generate over 500 C# classes. If I want to add functionality to all the classes, say an attribute on all the properties, I just add it to my code-gen, hit go, and bam! I'm done.
It's really nice, really.
There are many uses for code generation.
Writing code in a familiar language and generating code for a different target language.
GWT - Java -> Javascript
MonoTouch - C# -> Objective-C
Writing code at a higher level of abstraction.
Compilers
Domain Specific Languages
Automating repetitive tasks.
Data Access Layers
Initial Data Models
Ignoring all preconceived notions of code-generation, it is basically translating one representation (usually higher level) to another (usually lower level). Keeping that definition in mind, it is a very powerful tool to have.
The current state of programming languages has by no means reached its full potential and it never will. We will always be abstracting to get to a higher level than where we stand today. Code generation is what gets us there. We can either depend on the language creators to create that abstraction for us, or do it ourselves. Languages today are sophisticated enough to allow anybody to do it easily.
If by code generator you also mean snippets, try the difference between typing ctor + TAB and writing the constructor by hand each time in your classes. Or check how much time you save using the snippet that creates a switch statement for an enum with many values.
If you're paid by LOC and work for people who don't understand what code generation is, it makes a lot of sense. This is not a joke, by the way - I have worked with more than one programmer who employs this technique for exactly this purpose. Nobody gets paid by LOC formally any more (that I know of, anyway), but programmers are generally expected to be productive, and churning out large volumes of code can make someone look productive.
As an only slightly tangential point, I think this also explains the tendency of some coders to break a single logical unit of code into as many different classes as possible (ever inherit a project with LastName, FirstName and MiddleInitial classes?).
Here's some heresy:
If a task is so stupid that it can be automated at program-writing time (i.e. the source code can be generated by a script from, let's say, XML), then the same can also be done at run-time (i.e. some representation of that XML can be interpreted at run-time) or with some meta-programming. So in essence the programmer was lazy, did not attempt to solve the real problem, but took the easy way out and wrote a code generator. In Java / C#, look at reflection; in C++, look at templates.
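To make the run-time half of that argument concrete, here is a sketch in Python (not one of the languages named above, but it keeps the example short) that builds message classes dynamically from the same kind of XML spec shown earlier, instead of generating source files; the field handling is heavily simplified and the spec is hypothetical:

# Interpret the message spec at run-time: one class per <Message>, created with type().
import xml.etree.ElementTree as ET

SPEC = """
<Messages>
  <Message ID="12345" Name="Foo">
    <Field Name="Bar"/>
    <Field Name="Baz"/>
    <Field Name="Blah"/>
  </Message>
</Messages>
"""

def build_message_classes(spec_xml):
    classes = {}
    for msg in ET.fromstring(spec_xml).iter("Message"):
        fields = [f.get("Name") for f in msg.iter("Field")]

        def __init__(self, *values, _fields=fields):
            for name, value in zip(_fields, values):
                setattr(self, name, value)

        cls = type(msg.get("Name") + "Message", (object,), {
            "__init__": __init__,
            "fields": fields,
            "message_id": int(msg.get("ID")),
        })
        classes[cls.__name__] = cls
    return classes

classes = build_message_classes(SPEC)
foo = classes["FooMessage"](3, "Hello", "2024-01-01")
print(foo.message_id, foo.Bar, foo.Baz)  # 12345 3 Hello

Whether this beats generating code is exactly the trade-off being argued here: you lose compile-time checking and some performance, but there is no generated artifact to keep in sync.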

Structuring RSpec file structure and code for tests with very large coverage?

I've just started looking at a project that has >20k unit tests written in RSpec (the project itself isn't written in Ruby; just the test cases). The current number of test cases is expected to grow dramatically in the future as more functionality is added. What's already happened (over an extended period) is that RSpec started out being a particularly good solution for testing this project, but as the project has grown, the fairly ad-hoc structure of their RSpec test cases has bitten them badly. One of the biggest problems they've got is with taxonomy in their test code - the structure (or lack thereof) in the naming of their test cases, fixtures, helper code etc.
As you can imagine, with >20k unit tests there are a lot of methods with very similar names, using helper methods that are all loaded from a global namespace.
To highlight just one area where the problem is biting, there's ~10 databases within this application. Checking the structure of tables/columns/views/constraints/stored procs/... for all of these databases is something that (quite reasonably) is covered within the existing RSpec unit tests. However, the total number of DDL entities within this collection of databases that needs to be checked is probably >10000; covering the whole range of DB structural checks and being able to selectively test only subsets of DB structure requires either:
10000 separate methods (and I'm ruling out that option straight away!),
OR a fairly complex naming convention within the test cases (i.e. something encompassing the DB name + table name + column name + ...),
OR passing e.g. DB name, table name, column name, ... to a generic method,
OR separation of concerns via namespaces (and I'm not aware of an elegantly scalable way of doing this within RSpec),
OR some clever metaprogramming (that I suspect could ultimately make an already messy structure even more difficult to follow).
What exists now is a bit of all of the above, as far as I can tell, with not a lot of obvious planning...
Do you have any tips or references I could look at to try to straighten out this mess, and give them some sort of scalable structure to their RSpec testing? In particular, suggestions on how to structure the various RSpec files for very large projects would be very welcome.
The RSpec Book is an excellent resource for RSpec tips and tricks, and for BDD methodology in general (although your focus appears to be more about testing). There are several ways to simplify and DRY up specs to make them easier to manage, including shared examples (Chapter 12) and macros (Chapter 17).
I also recommend David Chelimsky's blog.
Still, it looks like your project could be a real challenge. Of the approaches you mentioned, I think using macros with DB, table and column as parameters is the most promising.

What's the difference between data and code?

To take an example, consider a set of discounts available to a supermarket shopper.
We could define these rules as data in some standard fashion (lists of qualifying items, applicable dates, coupon codes) and write generic code to handle these. Or, we could write each as a chunk of code, which checks for the appropriate things given the customer's shopping list and returns any applicable discounts.
You could reasonably store the rules as objects, serialised into Blobs or stored in code files, so that each rule could choose its own division between data and code, to allow for future rules that wouldn't fit the type of generic processor considered above.
It's often easy to criticise code that mixes data in, via if statements that check for 6 different things that should be in a file or a database, but is there a rule that helps in the edge cases?
Or is this the point of Object Oriented design, to stop us worrying about the line between data and code?
To clarify, the underlying question is this: How would you code the above example? Is there a rule of thumb that made you decide what is data and what is code?
(Note: I know, code can be compiled, but in a world of dynamic languages and JIT compilation, even that is a blurry concept.)
Fundamentally, there is of course no difference between data and code, but for real software infrastructures, there can be a big difference. Apart from obvious things like, as you mentioned, compilation, the biggest issue is this:
Most sufficiently large projects are designed to produce "releases" that are one big bundle, produced in 3-month (or longer) cycles, tested extensively and not changed afterwards except in tightly controlled ways. "Code" most definitely cannot be changed, so anything that does need to change has to be factored out and made "configuration data", so that changing it becomes palatable to those whose job it is to ensure that a release works.
Of course, in most cases bad configuration data can break a release just as thoroughly as bad code, so the whole thing is largely an illusion - in reality it doesn't matter whether it's code or "configuration data" that changes, what matters is that the interface between the main system and the parts that change is narrow and well-defined enough to give you a good chance that the person who does the change understands all consequences of what he's doing.
This is already harder than most people think when it's really just a few strings and numbers that are configured (I've personally witnessed a production mainframe system crash because it had one boolean value set differently from another system it was talking to). When your "configuration data" contains complex logic, it's almost impossible to achieve. But the situation isn't going to be any better just because you use a badly-designed ad hoc "rules configuration" language instead of "real" code.
This is a rather philosophical question (which I like) so I'll answer it in a philosophical way: with nothing much to back it up. ;)
Data is the part of a system that can change. Code defines behavior; the way in which data can change into new data.
To put it more accurately: Data can be described by two components: a description of what the datum is supposed to represent (for instance, a variable with a name and a type) and a value.
The value of the variable can change according to rules defined in code. The description does not change, of course, because if it does, we have a whole new piece of information.
The code itself does not change, unless requirements (what we expect of the system) change.
To a compiler (or a VM), code is actually the data on which it performs its operations. However, the to-be-compiled code does not specify behavior for the compiler, the compiler's own code does that.
It all depends on the requirement. If the data is lookup-like data that changes frequently, you don't really want to put it in code; but something like the day of the week should not change for the next 200 years or so, so code that.
You might consider changing your topic, as the first thing I thought of when I saw it was the age-old Lisp discussion of code vs data. Luckily, in Scheme, code and data look the same, but that's about it; you can never accidentally mix code with data there, as is very possible in LISP with unhygienic macros.
Data is information that is processed by instructions called Code. I'm not sure I see a blurring in OOD; there are still properties (Data) and methods (Code). OO theory encapsulates both into a gestalt entity called a Class, but they are still discrete within the Class.
How flexible you want to make your code is a matter of choice. Including constant values (which is what you are doing by using if statements as described above) is inflexible without re-processing your source, whereas using dynamically sourced data is more flexible. Is either approach wrong? I would say it really depends on the circumstances. As Leppie said, there are certain 'data' points that are invariant, like the days of the week, which can be hard-coded, but even there it may be advantageous to do it dynamically in certain circumstances.
In Lisp, your code is data, and your data is code.
In Prolog, clauses are terms, and terms are clauses.
The important note is that you want to separate out the part of your code that will execute the same every time (i.e. applying a discount) from the part of your code which could change (i.e. the products to be discounted, or the % of the discount, etc.).
This is simply for safety. If a discount changes, you won't have to re-write your discount code, you'll only need to go into your discounts repository (DB, or app file, or xml file, or however you choose to implement it) and make a small change to a number.
Also, if the discount code is separated into an XML file, then you can give the entire application to a manager, and with sufficient instructions, they won't need to pester you whenever they want to change the discount rates.
When you mix in data and code, you are exponentially increasing the odds of breaking when anything changes. So, as leppie said, you need to extract the constantly changing parts, and put them in a separate place.
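As a minimal sketch of that separation (in Python, with a hypothetical rule shape and shopping-list format): the generic engine below never changes, while the rules - which in practice would live in a database table, JSON or XML file rather than inline - are the part that does.

# Generic discount engine: rules are plain data, the evaluation logic is code.
from datetime import date

RULES = [
    {"name": "Summer cereal promo", "qualifying_items": {"cereal", "milk"},
     "percent_off": 10, "valid_until": "2030-08-31", "coupon": None},
    {"name": "Loyalty coupon", "qualifying_items": {"coffee"},
     "percent_off": 25, "valid_until": "2030-12-31", "coupon": "LOYAL25"},
]

def applicable_discounts(shopping_list, coupons, rules, today=None):
    today = today or date.today()
    items = {line["item"] for line in shopping_list}
    for rule in rules:
        if date.fromisoformat(rule["valid_until"]) < today:
            continue
        if rule["coupon"] and rule["coupon"] not in coupons:
            continue
        if rule["qualifying_items"] <= items:  # all qualifying items are in the basket
            yield rule["name"], rule["percent_off"]

basket = [{"item": "cereal"}, {"item": "milk"}, {"item": "coffee"}]
print(list(applicable_discounts(basket, coupons={"LOYAL25"}, rules=RULES)))
# [('Summer cereal promo', 10), ('Loyalty coupon', 25)]

A new discount is then a new row of data, not a code change; only a genuinely new kind of rule forces you back into the code.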
Huge difference. Data is a given to the system, while code is part of the system.
Wrong data is merely meaningless: if our code (the handler) is good, then what you put in is what you get out; it is not the system's fault if you meant something else. But if the code is bad, the system is bad.
As an example, consider some JSON, some bad code parser.js written by me, and, let's say, the good V8 engine. For my system the bad parser.js is code, and my system works wrongly. But for Google's system my bad parser is data, and it says nothing about the quality of V8.
The question is very practical, not sophistic.
https://en.wikipedia.org/wiki/Systems_engineering tries to give a good answer (and make money).
Data is information. It's not about where you decide to put it, be it a db, config file, config through code or inside the classes.
The same happens for behaviors / code. It's not about where you decide to put it or how you choose to represent it.
The line between data and code (program) is blurry. It's ultimately just a question of terminology - for example, you could say that data is everything that is not code. But, as you wrote, they can be happily mixed together (although usually it's better to keep them separate).
Code is any data which can be executed. Now since all data is used as input to some program at some point of time, it can be said that this data is executed by a program! Thus your program acts as a virtual machine for your data. Hence in theory there is no difference between data and code!
In the end what matters is software engineering/development considerations like performance, efficiency etc. For example data driven programs may not be as efficient as programs which have hard coded (and hence fragile) conditional statements. Hence I choose to define code as any data which can be efficiently executed and all else being plain data.
It's a tradeoff between flexibility and efficiency. Executable data (like XML rules) offers more flexibility (sometimes) while the same data/rules when coded as part of the application will run more efficiently but changing it frequently becomes cumbersome. In other words executable data is easy to deploy but is inefficient and vice-versa. So ultimately the decision rests with you - the software designer.
Please correct me if I am wrong.
The relationship between code and data is as follows:
code, once compiled into a program, processes the data during execution
a program can extract data, transform data, load data, generate data ...
Also:
a program can extract code, transform code, load code, and generate code too...
Hence code without a compiler or interpreter is useless, while data is always worth something; but code, once compiled, can do all of the activities above.
For example:
a source control system processes source code - here the source code itself is code
backup scripts process files - here the files are data, and so on...
I would say that the distinction between data, code and configuration is something to be made within the context of a particular component. Sometimes it's obvious, sometimes less so.
For example, to a compiler, the source code it consumes and the object code it creates are both data - and should be separated from the compiler's own code.
In your case you seem to be describing the option of a particularly powerful configuration file, which can contain code. Much as, for example, the GIMP lets you 'configure' plugins using Scheme. As the developer of the component that reads this configuration, you would think of it as data. When working at a different level -- writing the configuration -- you would think of it as code.
This is a very powerful way of designing.
Applying this to the underlying question ("How would you code the above example?"), one option might be to adopt or design a high level Domain Specific Language (DSL) for specifying rules. At startup, or when first required, the server reads the rule and executes it.
Provide an admin interface allowing the administrator to
test a new rule file
replace the current configuration with that from a new rule file
... all of which would happen at runtime.
A DSL might be something as simple as a table parser or an XML parser, or it could be something as sophisticated as a scripting language. From C, it's easy to embed Python or Lua. From Java it's easy to embed Groovy or Clojure.
You could switch in compiled code at runtime, with clever linking or classloader tricks. This seems more difficult and less valuable than the embedded DSL option, in my opinion.
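As a sketch of that runtime-swap idea (in Python; the DSL line format, file names and RuleStore shape are all invented): rules live in a tiny table-style file, and an admin action can validate a candidate file and atomically swap it in without a redeploy.

# A tiny table-style DSL: each non-comment line is "<items, comma separated> | <percent off>".
import threading

def parse_rules(text):
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        items, percent = line.split("|")
        rules.append((frozenset(i.strip() for i in items.split(",")), int(percent)))
    return rules

class RuleStore:
    def __init__(self, path):
        self._lock = threading.Lock()
        self._rules = parse_rules(open(path).read())

    def test_and_replace(self, candidate_path):
        candidate = parse_rules(open(candidate_path).read())  # raises if malformed
        with self._lock:
            self._rules = candidate                           # atomic swap

    def discounts_for(self, basket):
        with self._lock:
            return [pct for items, pct in self._rules if items <= set(basket)]

# Usage, e.g. from an admin endpoint:
#   store = RuleStore("rules.dsl")                 # rules.dsl contains:  cereal, milk | 10
#   store.test_and_replace("rules_candidate.dsl")  # validate, then swap at runtime
#   store.discounts_for(["cereal", "milk"])        # -> [10]

A scripting-language rule file (Python, Lua, Groovy, ...) slots into the same shape; the table parser just gets replaced by the embedded interpreter.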
The best practical answer to this question I found is this:
Any class that needs to be serialized, now or in any foreseeable future, is data.
Everything else is code.
That's why, for example, Java's HashMap is data - although it has a lot of code, API methods and a specific implementation (i.e., it might look like code at first glance).