How do I know if my Foundry job is using incremental computation? - palantir-foundry

I want to know if a job I'm debugging is using incremental computation or not since it's necessary for my debugging techniques.

There are two ways to tell: the job's Spark Details will indicate this (if it's using Python), as will its code.
Spark Details
If you navigate to the Spark Details page as noted here, you'll notice there's a tab for Snapshot / Incremental. In this tab, and if your job is using Python, you'll get a description of if your job is running using incremental computation. If the page reports No Incremental Details Found and you ran the job recently, this means it is not using Incremental computation. However, if your job is somewhat old (typically more than a couple of days), this may not be correct as these Spark details are removed for retention reasons.
A quick way to check if your job's information has been removed due to retention is to navigate to the Query Plan tab and see if any information is present. If nothing is present, this means your Spark details have been deleted and you will need to re-run your job in order to see anything. If you want a more reliable way of determining if a job is using incremental computation, I'd suggest following the second method below.
Code
If you navigate to the code backing the Transform, you'll want to look for a couple indicators, depending on the language used.
Python
The Transform will have an #incremental() decorator on it if it's using incremental computation. This doesn't indicate however whether it will choose to write or read an incremental view. The backing code can choose what types of reads or writes it wishes to do, so you'll want to inspect the code more closely to see what it's written to do.
from transforms.api import transform, Input, Output, incremental
#incremental() # This being present indicates incremental computation
#transform(...)
def my_compute_function(...):
...
Java
The Transform will have the getReadRange and getWriteMode methods overridden in the backing code.

Related

how to create custom node in knime?

I have added all the plugins of Knime in Eclipse and I want to create my Own custom node. but I am not able to understand how to pass the data from one node to another node.
I saw one node which has been provided by the Knime itself which is " File Reader " node. Now I want the source code of this node or jar file for this node But I am not able to find it out.
I am searching with the similar name in eclipse plugin folder but still I didn't get it.
Can someone please tell me how to pass the data from one node to another node and how to identify the classes or jar for any node given by knime and source code also.
Assuming that your data is a standard datatable, then you need to subclass NodeModel, with a call to the supertype constructor:
public MyNodeModel(){
//One incoming table, one outgoing table
super(1,1);
}
You need to override the default #execute(BufferedDataTable[] inData, ExecutionContext exec) method - this is where the meat of the node work is done and the output table created. Ideally, if your input and output table have a one-to-one row mapping then use a ColumnRearranger class (because this reduces disk IO considerably, and if you need it, allows simple parallelisation of your node), otherwise your execute method needs to iterate through the incoming datatable and generate an output table.
The #configure(DataTableSpec[] inSpecs) method needs to be implemented to at the least provide a spec for the output table if this can be determined before the node is executed (it normally can, and this allows downstream nodes also to be configures, but the 'Transpose' node is an example of a node which cannot do so).
There are various other methods which you also need to implement, but in some cases these will be empty methods.
In addition to the NodeModel, you need to implement some other classes too - a NodeFactory, optionally a NodeSettingsPane and optionally a NodeView.
In Eclipse you can view the sources for many nodes, and also the KNIME community 'book' pages all have a link to their source code. Take a look at https://tech.knime.org/developer-guide and https://tech.knime.org/developer/example for a step-by-step guide. Also, questions to the knime forums (including a developer forum) generally get rapid responses - and KNIME run a Developer Training Course a few times a year if you want to spend a few days learning more. And last but not least, it is worth familiarising yourself with the noding guidelines which describe the best practice of how your node should behave
Source code for KNIME nodes are now available on git hub.
Alternatively you can check under your project>plugin dependencies>knime-base.jar>org.knime.base.node.io.filereader for file reader source code in eclipse KNIME SDK.
Knime-base.jar will be added to your project by default when created with KNIME SDK.

How can I check for downstream components in an SSIS custom transform?

I am working on a custom SSIS component that has 4 asynchronous outputs. It works just fine but now I have a user request for an enhancement and I am not sure how to handle it. They want to use the component in another context where only 2 of the 4 outputs will be well defined. I foolishly said that this would be trivial for me to support, I planned to just look to see if the two "undefined" streams were even connected, if not then I would skip over that part of the processing.
My problem is that I cannot figure out if an output is connected at run time, I had hoped that the output pipeline or output buffer would be missing. It doesn't look like that is the case; even when they are not hooked up the output and buffer are present.
Does anyone know where I should be looking to see if an output has a downstream consumer or not?
Thanks!
Edit: I was never able to figure out how to do this reliably, so I ended up making this behaviour configurable by the user. It is not automatic like I would have hoped but the difference I found between the BIDS environment and the DTExec environment pushed me to the conclusion that a component probably should not be making assumptions about the component graph it is embedded in.

What are logging libraries for?

This may be a stupid question, as most of my programming consists of one-man scientific computing research prototypes and developing relatively low-level libraries. I've never programmed in the large in an enterprise environment before. I've always wondered, what are the main things that logging libraries make substantially easier than just using good old fashioned print statements or file output, simple programming logic and a few global variables to determine how verbosely things get logged? How do you know when a few print statements or some basic file output ain't gonna cut it and you need a real logging library?
Logging helps debug problems especially when you move to production and problems occur on people's machines you can't control. Best laid plans never survive contact with the enemy, and logging helps you track how that battle went when faced with real world data.
Off the shel logging libraries are easy to plug in and play in less than 5 minutes.
Log libraries allow for various levels of logging per statement (FATAL, ERROR, WARN, INFO, DEBUG, etc).
And you can turn up or down logging to get more of less information at runtime.
Highly threaded systems help sort out what thread was doing what. Log libraries can log information about threads, timestamps, that ordinary print statements can't.
Most allow you to turn on only portions of the logging to get more detail. So one system can log debug information, and another can log only fatal errors.
Logging libraries allow you to configure logging through an external file so it's easy to turn on or off in production without having to recompile, deploy, etc.
3rd party libraries usually log so you can control them just like the other portions of your system.
Most libraries allow you to log portions or all of your statements to one or many files based on criteria. So you can log to both the console AND a log file.
Log libraries allow you to rotate logs so it will keep several log files based on many different criteria. Say after the log gets 20MB rotate to another file, and keep 10 log files around so that log data is always 100MB.
Some log statements can be compiled in or out (language dependent).
Log libraries can be extended to add new features.
You'll want to start using a logging libraries when you start wanting some of these features. If you find yourself changing your program to get some of these features you might want to look into a good log library. They are easy to learn, setup, and use and ubiquitous.
There are used in environments where the requirements for logging may change, but the cost of changing or deploying a new executable are high. (Even when you have the source code, adding a one line logging change to a program can be infeasible because of internal bureaucracy.)
The logging libraries provide a framework that the program will use to emit a wide variety of messages. These can be described by source (e.g. the logger object it is first sent to, often corresponding to the class the event has occurred in), severity, etc.
During runtime the actual delivery of the messaages is controlled using an "easily" edited config file. For normal situations most messages may be obscured altogether. But if the situation changes, it is a simpler fix to enable more messages, without needing to deploy a new program.
The above describes the ideal logging framework as I understand the intention; in practice I have used them in Java and Python and in neither case have I found them worth the added complexity. :-(
They're for logging things.
Or more seriously, for saving you having to write it yourself, giving you flexible options on where logs are store (database, event log, text file, CSV, sent to a remote web service, delivered by pixies on a velvet cushion) and on what is logged at runtime, rather than having to redefine a global variable and then recompile.
If you're only writing for yourself then it's unlikely you need one, and it may introduce an external dependency you don't want, but once your libraries start to be used by others then having a logging framework in place may well help your users, and you, track down problems.
I know that a logging library is useful when I have more than one subsystem with "verbose logging," but where I only want to see that verbose data from one of them.
Certainly this can be achieved by having a global log level per subsystem, but for me it's easier to use a "system" of some sort for that.
I generally have a 2D logging environment too; "Info/Warning/Error" (etc) on one axis and "AI/UI/Simulation/Networking" (etc) on the other. With this I can specify the logging level that I care about seeing for each subsystem easily. It's not actually that complicated once it's in place, indeed it's a lot cleaner than having if my_logging_level == DEBUG then print("An error occurred"); Plus, the logging system can stuff file/line info into the messages, and then getting totally fancy you can redirect them to multiple targets pretty easily (file, TTY, debugger, network socket...).

How to make hudson aggregate results provided in build artifacts over several builds

I have a hudson job that performs a stress test, torturing a virtual machine for several hours with some CPU- and IO-intensive tasks. The build scripts write a few interesting results into several files which are then stored as build artifacts. For example, one result is the time it took to perform certain operations.
I need to monitor the development of these results. For example, I need to know when the time for certain operations suddenly increases. So I need to aggregate these results over several (all?) builds. The ideal scenario would be if I could download the aggregated data from hudson.
I've been thinking about several possibilities to do this, but they all seem quite complicated. That's when I thought someone else might have had that problem already.
Maybe there already are some plugins doing this?
If you can write a script to extract the relevant numbers from the log files, you can use the Plot Plugin to visualize the data. We use this for simple stuff like tracking the executable size of build artifacts.
The Plot Plugin is more manual than the Perf Plugin mentioned by #Tao, but it might be easier to integrate depending on how much data murging the Perf Plugin requires.
Update: Java-style properties files (which are used as input to the Plot Plugin) are just simple name-value pairs in a text file, e.g.:
YVALUE=1234
Here's a build script that shows a (very stupid) example:
echo YVALUE=$RANDOM > buildtime.properties
This example plots a random number with each build.
I have not persoanlly use this plugin yet, but this might fits your need if you can just generate the xml file according to this plugin's format according to its description.
PerfPublisher Plugin
What about creating the results as JUnit results (XML files) so the results can be recorded by Hudson and will be aggregated by Hudson for different builds.

What's the difference between data and code?

To take an example, consider a set of discounts available to a supermarket shopper.
We could define these rules as data in some standard fashion (lists of qualifying items, applicable dates, coupon codes) and write generic code to handle these. Or, we could write each as a chunk of code, which checks for the appropriate things given the customer's shopping list and returns any applicable discounts.
You could reasonably store the rules as objects, serialised into Blobs or stored in code files, so that each rule could choose its own division between data and code, to allow for future rules that wouldn't fit the type of generic processor considered above.
It's often easy to criticise code that mixes data in, via if statements that check for 6 different things that should be in a file or a database, but is there a rule that helps in the edge cases?
Or is this the point of Object Oriented design, to stop us worrying about the line between data and code?
To clarify, the underlying question is this: How would you code the above example? Is there a rule of thumb that made you decide what is data and what is code?
(Note: I know, code can be compiled, but in a world of dynamic languages and JIT compilation, even that is a blurry concept.)
Fundamentally, there is of course no difference between data and code, but for real software infrastructures, there can be a big difference. Apart from obvious things like, as you mentioned, compilation, the biggest issue is this:
Most sufficiently large projects are designed to produce "releases" that are one big bundle, produced in 3-month (or longer) cycles, tested extensively and cannot be changed afterwards except in tightly controlled ways. "Code" most definitely cannot be changed, so anything that does need to be changed has to be factored out and made "configuration data" so that changing it becomes palatable those whose job it is to ensure that a release works.
Of course, in most cases bad configuration data can break a release just as thoroughly as bad code, so the whole thing is largely an illusion - in reality it doesn't matter whether it's code or "configuration data" that changes, what matters is that the interface between the main system and the parts that change is narrow and well-defined enough to give you a good chance that the person who does the change understands all consequences of what he's doing.
This is already harder than most people think when it's really just a few strings and numbers that are configured (I've personally witnessed a production mainframe system crash because it had one boolean value set differently than another system it was talking to). When your "configuration data" contains complex logic, it's almost impossible to achieve. But the situation isn't going to be any better ust because you use a badly-designed ad hoc "rules configuration" language instead of "real" code.
This is a rather philosophical question (which I like) so I'll answer it in a philosophical way: with nothing much to back it up. ;)
Data is the part of a system that can change. Code defines behavior; the way in which data can change into new data.
To put it more accurately: Data can be described by two components: a description of what the datum is supposed to represent (for instance, a variable with a name and a type) and a value.
The value of the variable can change according to rules defined in code. The description does not change, of course, because if it does, we have a whole new piece of information.
The code itself does not change, unless requirements (what we expect of the system) change.
To a compiler (or a VM), code is actually the data on which it performs its operations. However, the to-be-compiled code does not specify behavior for the compiler, the compiler's own code does that.
It all depends on the requirement. If the data is like lookup data and changes frequently you dont really want to do it in code, but things like Day of the Week, should not chnage for the next 200 years or so, so code that.
You might consider changing your topic, as the first thing I thought of when I saw it, was the age old LISP discussion of code vs data. Lucky in Scheme code and data looks the same, but thats about it, you can never accidentally mix code with data as is very possible in LISP with unhygienic macros.
Data are information that are processed by instructions called Code. I'm not sure I feel there's a blurring in OOD, there are still properties (Data) and methods (Code). The OO theory encapsulates both into a gestalt entity called a Class but they are still discrete within the Class.
How flexible you want to make your code in a matter of choice. Including constant values (what you are doing by using if statements as described above) is inflexible without re-processing your source, whereas using dynamically sourced data is more flexible. Is either approach wrong? I would say it really depends on the circumstances. As Leppie said, there are certain 'data' points that are invariate, like the days of the week that can be hard coded but even there it may be advantageous to do it dynamically in certain circumstances.
In Lisp, your code is data, and your
data is code
In Prolog clauses are terms, and terms
are clauses.
The important note is that you want to separate out the part of your code that will execute the same every time, (i.e. applying a discount) from the part of your code which could change (i.e. the products to be discounted, or the % of the discount, etc.)
This is simply for safety. If a discount changes, you won't have to re-write your discount code, you'll only need to go into your discounts repository (DB, or app file, or xml file, or however you choose to implement it) and make a small change to a number.
Also, if the discount code is separated into an XML file, then you can give the entire application to a manager, and with sufficient instructions, they won't need to pester you whenever they want to change the discount rates.
When you mix in data and code, you are exponentially increasing the odds of breaking when anything changes. So, as leppie said, you need to extract the constantly changing parts, and put them in a separate place.
Huge difference. Data is a given to system while code is a part of system.
Wrong data is senseless: our code===handler is good and what you put that you take, it is not a trouble of system that you meant something else. But if code is bad - system is bad.
In example, let's consider some JSON, some bad code parser.js by me and let's say good V8. For my system bad parser.js is a code and my system works wrong. But for Google system my bad parser is data that no how says about quality of V8.
The question is very practical, no sophistic.
https://en.wikipedia.org/wiki/Systems_engineering tries to make good answer and money.
Data is information. It's not about where you decide to put it, be it a db, config file, config through code or inside the classes.
The same happens for behaviors / code. It's not about where you decide to put it or how you choose to represent it.
The line between data and code (program) is blurry. It's ultimately just a question of terminology - for example, you could say that data is everything that is not code. But, as you wrote, they can be happily mixed together (although usually it's better to keep them separate).
Code is any data which can be executed. Now since all data is used as input to some program at some point of time, it can be said that this data is executed by a program! Thus your program acts as a virtual machine for your data. Hence in theory there is no difference between data and code!
In the end what matters is software engineering/development considerations like performance, efficiency etc. For example data driven programs may not be as efficient as programs which have hard coded (and hence fragile) conditional statements. Hence I choose to define code as any data which can be efficiently executed and all else being plain data.
It's a tradeoff between flexibility and efficiency. Executable data (like XML rules) offers more flexibility (sometimes) while the same data/rules when coded as part of the application will run more efficiently but changing it frequently becomes cumbersome. In other words executable data is easy to deploy but is inefficient and vice-versa. So ultimately the decision rests with you - the software designer.
Please correct me if I wrong.
Relationship between code and data is as follows:
code after compiled to a program processes the data while execution
program can extract data, transform data, load data, generate data ...
Also
program can extract code, transform code, load code, generate code tooooooo...
Hence code without compiled or interperator is useless, data is always worth..., but code after compiled can do all the above activities....
For eg)
Sourcecontrolsystem process Sourcecodes
here source code itself is a code
Backupscripts process files
here files is a data and so on...
I would say that the distinction between data, code and configuration is something to be made within the context of a particular component. Sometimes it's obvious, sometimes less so.
For example, to a compiler, the source code it consumes and the object code it creates are both data - and should be separated from the compiler's own code.
In your case you seem to be describing the option of a particularly powerful configuration file, which can contain code. Much as, for example, the GIMP lets you 'configure' plugins using Scheme. As the developer of the component that reads this configuration, you would think of it as data. When working at a different level -- writing the configuration -- you would think of it as code.
This is a very powerful way of designing.
Applying this to the underlying question ("How would you code the above example?"), one option might be to adopt or design a high level Domain Specific Language (DSL) for specifying rules. At startup, or when first required, the server reads the rule and executes it.
Provide an admin interface allowing the administrator to
test a new rule file
replace the current configuration with that from a new rule file
... all of which would happen at runtime.
A DSL might be something as simple as a table parser or an XML parser, or it could be something as sophisticated as a scripting language. From C, it's easy to embed Python or Lua. From Java it's easy to embed Groovy or Clojure.
You could switch in compiled code at runtime, with clever linking or classloader tricks. This seems more difficult and less valuable than the embedded DSL option, in my opinion.
The best practical answer to this question I found is this:
Any class that needs to be serialized, now or in any foreseeable future, is data.
Everything else is code.
That's why, for example, Java's HashMap is data - although it has a lot of code, API methods and specific implementation (i.e., it might look as code at first glance).