We are looking to build a single pipeline within a code repository that cleans, harmonizes, and transforms data to features of interest. We would like to apply that single pipeline code on different inputs and then test how the outputs look.
For example, we would like to test the pipeline on synthetic data, version 1 of 'real' data that includes only retrospective data, and version 2 of 'real' data that includes retrospective and prospective data. The comparison of the outputs could be what percent of patients had diabetes in version 1 compared to version 2.
I saw that you could template code repositories in foundry. Is this a viable option? Could you template your code repository and apply to the three scenarios I have provided? Is there a better option?
If your data scale is reasonably small, I would recommend going down the test-driven path of development here instead of trying to compare and contrast results across a wide variety of datasets. You'll find the iteration time and difficulty in exactly comparing results probably quite high.
For this, you should follow the method I lay out here and create representative datasets for each input you expect as a .csv file in your repo, then you can incorporate these schemas as a unique input to your core code and inspect the outputs with ease.
This will let you 'tighten' your code much easier and faster, after which you can then run this logic on real full-scale data and generate your outputs as you wish.
Templating code is possible but should be incorporated with great care. If what you're truly solving for is comparing and contrasting the execution of your code on arbitrary schemas, then you should use test-driven in-repo development. If what you're after is running a core set of logic across a wide variety of outputs after the code is working, then generated transforms is going to work great. If what you're really after is rolling out a large codebase of transformations across differently-permissioned projects where each needs to be completely independent / configured separately of the other, then maybe you should consider templates. I would stick to test-driven development and generated transforms until you prove otherwise.
I'm documenting the current state of a javascript package which is comprised of several modules predominantly consisting of standalone functions. As the result of using callbacks extensively, the package includes nested calls between standalone functions from multiple packages.
With this in mind, does anyone know what is the best way to represent calls between standalone functions in a sequence diagram?
Are the details of the standalone function worth it?
A common wisdom recommend to avoid the trap of UML as graphical programming language. Things that are easier expressed in code and easy to understand by readers better stay as code. Prefer to use UML to give the big picture, and explain complex relationships that are less obvious to spot in the code.
Automate obvious documentation?
Manually modelling a very precise sequence diagram is time consuming. Moreover, such diagram is quickly outdated with the next version of code.
Therefore, if your interest is to give an overview on how the function relate to each other, you may be interested to provide instead a visual overview using a simpler call graph. The reader can grasp the overall structure easily and look for more details in the code:
The advantage is that this task can be automated, using one of the many call graph generators available on the market (just google for javascript call graph generator to find some). There's by the way an excellent book on further automating documentation, that I can only recommend with enthousiasm: "Living Documentation: Continuous Knowledge Sharing by Design, First Edition"
If you have to set the focus on the detailed chronological sequencing of the calls a call graph would however not be sufficient. In this case, the sequence diagrams may then indeed be more relevant.
UML sequence diagrams with standalone functions?
A sequence diagram shows interactions between lifelines within an enclosing classifier. Usually a lifeline is used to represent an object, i.e. an instance of a class, but its definition is flexible enough to accommodate with any participant in an interaction.
Individual standalone functions can moreover be considered as individual objects that instantiate a more general class of functions (that's the concept behind a functor, like C++ std::function). This is particularly relevant in javascript where functions can be assigned to variables or used as parameters. So you may just use a lifeline that clarifies this. Up to you to decide how you will name the call message (e.g. operator()(a,b,c) or using its real name for readability ? ):
You can also group a bunch of related standalone functions into a pseudo-class that would represent in your model a module, compilation unit or namespace. Although a module is not stricto sensu an object, you may in your modeling deal with it as if it was a class with only one (anonymous) instance (i.e. its state would be the global variables defined in the module scope. The related standalone functions could be seen as operations of this imaginary class). The lifeline would correspond to a module, and function calls would be represented as synchronous messages either to another module or a message to itself with nested activation to visualise the “nested” calls.
I try to code floating adder;
https://github.com/ElectronNest/FPU/blob/master/FloatAdd.scala
This is half way.
The normalization is huge code part, so I would like to use for-loop or some equivalent representation method.
Is it possible to use loop or we need strict coding?
Best,
S.Takano
This is a very general and large question. The equivalent of a for loop in hardware can be implemented using a number of techniques, pretty much all of them involving registers to hold state information. Looking at your code I would suggest that you start a little smaller and work on syntax, I see many syntax errors currently. I use IntelliJ community edition as an editor because it does a great job with helping to get the code properly structured. I also would strongly recommend starting from the chisel-template repository. It has the proper layout and examples of a working circuit and unit testing harness. Then start with a smaller implementation that does something simple like just pass input to output and runs in a test harness, then slowly build up the circuit to achieve your goals.
Good luck!
Welcome and thank you for your interest in Chisel!
I would like to echo Chick's suggestion to start from something small that compiles and simulates and build up from there. In particular, the linked code above conflates some Scala vs. Chisel constructs (eg. Scala's if else, vs. Chisel's when, .elsewhen, .otherwise), as well as some Verilog vs. Chisel concepts (eg. bit indexing with [high:low] vs. Chisel's (high, low))
In case you haven't seen it, I would suggest taking a look at the Chisel Bootcamp which helps explain how to use constructs like for loops to generate hardware.
I'll also plug my own responses to this question on the chisel-users mailing list where I tried to explain some of the intuition behind writing Chisel generators, including differentiating if and when and using for loops.
I was consulting several references to discover how I may output trained Weka models into Java source code so that I may use the classifiers I am training in actual code for research applications I have been developing.
As I was playing with Weka 3.7, I noticed that while it does output Java code to its main text buffer when use simpler classification (supervised in my case this time) methods such as J48 decision tree, it removes the option (rather, it voids it by removing the ability to checkmark it and fades the text) to output Java code for RandomTree and RandomForest (which are the ones that give me the best performance in my situation).
Note: I am clicking on the "More Options" button and checking "Output source code:".
Does Weka not allow you to output RandomTree or RandomForest as Java code? If so, why? Or if it does and just doesn't put it in the output buffer (since RF is multiple decision trees which I imagine it doesn't want to waste buffer space), how does one go digging up where in the file system Weka outputs java code by default?
Are there any tricks to get Weka to give me my trained RandomForest as Java code? Or is Serialization of the output *.model files my only hope when it comes to RF and RandomTree?
Thanks in advance to those who provide help.
NOTE: (As an addendum to the answer provided below) If you run across a similar situation (requiring you to use your trained classifier/ML model in your code), I recommend following the links posted in the answer that was provided in response to my question. If you do not specifically need the Java code for the RandomForest, as an example, de-serializing the model works quite nicely and fits into Java application code, fulfilling its task as a trained model/hardened algorithm meant to predict future unlabelled instances.
RandomTree and RandomForest can't be output as Java code. I'm not sure for the reasoning why, but they don't implement the "Sourceable" interface.
This explains a little about outputting a classifier as Java code: Link 1
This shows which classifiers can be output as Java code: Link 2
Unfortunately I think the easiest route will be Serialization, although, you could maybe try implementing "Sourceable" for other classifiers on your own.
Another, but perhaps inconvenient solution, would be to use Weka to build the classifier every time you use it. You wouldn't need to load the ".model" file, but you would need to load your training data and relearn the model. Here is a starters guide to building classifiers in your own java code http://weka.wikispaces.com/Use+WEKA+in+your+Java+code.
Solved the problem for myself by turning the output of WEKA's -printTrees option of the RandomForest classifier into Java source code.
http://pielot.org/2015/06/exporting-randomforest-models-to-java-source-code/
Since I am using classifiers with Android, all of the existing options had disadvantages:
shipping Android apps with serialized models didn't reliably work across devices
computing the model on the phone took too much resources
The final code will consist of three classes only: the class with the generated model + two classes to make the classification work.
The problem:
You have some data and your program needs specified input. For example strings which are numbers. You are searching for a way to transform the original data in a format you need.
And the problem is: The source can be anything. It can be XML, property lists, binary which
contains the needed data deeply embedded in binary junk. And your output format may vary
also: It can be number strings, float, doubles....
You don't want to program. You want routines which gives you commands capable to transform the data in a form you wish. Surely it contains regular expressions, but it is very good designed and it offers capabilities which are sometimes much more easier and more powerful.
ADDITION:
Many users have this problem and hope that their programs can convert, read and write data which is given by other sources. If it can't, they are doomed or use programs like business
intelligence. That is NOT the problem.
I am talking of a tool for a developer who knows what is he doing, but who is also dissatisfied to write every time routines in a regular language. A professional data manipulation tool, something like a hex editor, regex, vi, grep, parser melted together
accessible by routines or a REPL.
If you have the spec of the data format, you can access and transform the data at once. No need to debug or plan meticulously how to program the transformation. I am searching for a solution because I don't believe the problem is new.
It allows:
joining/grouping/merging of results
inserting/deleting/finding/replacing
write macros which allows to execute a command chain repeatedly
meta-grouping (lists->tables->n-dimensional tables)
Example (No, I am not looking for a solution to this, it is just an example):
You want to read xml strings embedded in a binary file with variable length records. Your
tool reads the record length and deletes the junk surrounding your text. Now it splits open
the xml and extracts the strings. Being Indian number glyphs and containing decimal commas instead of decimal points, your tool transforms it into ASCII and replaces commas with points. Now the results must be stored into matrices of variable length....etc. etc.
I am searching for a good language / language-design and if possible, an implementation.
Which design do you like or even, if it does not fulfill the conditions, wouldn't you want to miss ?
EDIT: The question is if a solution for the problem exists and if yes, which implementations are available. You DO NOT implement your own sorting algorithm if Quicksort, Mergesort and Heapsort is available. You DO NOT invent your own text parsing
method if you have regular expressions. You DO NOT invent your own 3D language for graphics if OpenGL/Direct3D is available. There are existing solutions or at least papers describing the problem and giving suggestions. And there are people who may have worked and experienced such problems and who can give ideas and suggestions. The idea that this problem is totally new and I should work out and implement it myself without background
knowledge seems for me, I must admit, totally off the mark.
UPDATE:
Unfortunately I had less time than anticipated to delve in the subject because our development team is currently in a hot phase. But I have contacted the author of TextTransformer and he kindly answered my questions.
I have investigated TextTransformer (http://www.texttransformer.de) in the meantime and so far I can see it offers a complete and efficient solution if you are going to parse character data.
For anyone who will give it a try to implement a good parsing language, the smallest set of operators to directly transform any input data to any output data if (!) they were powerful enough seems to be:
Insert/Remove: Self-explaining
Group/Ungroup: Split the input data into a set of tokens and organize them into groups
and supergroups (datastructures, lists, tables etc.)
Transform
Substituition: Change the content of the tokens (special operation: replace)
Transposition: Change the order of tokens (swap,merge etc.)
Have you investigated TextTransformer?
I have no experience with this, but it sounds pretty good and the author makes quite competent posts in the comp.compilers newsgroup.
You still have to some programming work.
For a programmer, I would suggest:
Perl against a SQL backend.
For a non-programmer, what it sounds like you're looking for is some sort of business intelligence suite.
This suggestion may broaden the scope of your search too much... but here it is:
You could either reuse, as-is, or otherwise get "inspiration" from the [open source] code of the SnapLogic framework.
Edit (answering the comment on SnapLogic documentation etc.)
I agree, the SnapLogic documentation leaves a bit to be desired, in particular for people in your situation, i.e. when just trying to quickly get an overview of what SnapLogic can do, and if it would generally meet their needs, without investing much time and learn the system in earnest.
Also, I realize that the scope and typical uses of of SnapLogic differ, somewhat, from the requirements expressed in the question, and I should have taken the time to better articulate the possible connection.
So here goes...
A salient and powerful feature of SnapLogic is its ability to [virtually] codelessly create "pipelines" i.e. processes made from pre-built components;
Components addressing the most common needs of Data Integration tasks at-large are supplied with the SnapLogic framework. For example, there are components to
read in and/or write to files in CSV or XML or fixed length format
connect to various SQL backends (for either input, output or both)
transform/format [readily parsed] data fields
sort records
join records for lookup and general "denormalized" record building (akin to SQL joins but applicable to any input [of reasonnable size])
merge sources
Filter records within a source (to select and, at a later step, work on say only records with attribute "State" equal to "NY")
see this list of available components for more details
A relatively weak area of functionality of SnapLogic (for the described purpose of the OP) is with regards to parsing. Standard components will only read generic file formats (XML, RSS, CSV, Fixed Len, DBMSes...) therefore structured (or semi-structured?) files such as the one described in the question, with mixed binary and text and such are unlikely to ever be a standard component.
You'd therefore need to write your own parsing logic, in Python or Java, respecting the SnapLogic API of course so the module can later "play nice" with the other ones.
BTW, the task of parsing the files described could be done in one of two ways, with a "monolithic" reader component (i.e. one which takes in the whole file and produces an array of readily parsed records), or with a multi-component approach, whereby an input component reads in and parse the file at "record" level (or line level or block level whatever this may be), and other standard or custom SnapLogic components are used to create a pipeline which effectively expresses the logic of parsing a record (or block or...) into its individual fields/attributes.
The second approach is of course more modular and may be applicable if the goal is to process many different files format, whereby each new format requires piecing together components with no or little coding. Whatever the approach used for the input / parsing of the file(s), the SnapLogic framework remains available to create pipelines to then process the parsed input in various fashion.
My understanding of the question therefore prompted me to suggest SnapLogic as a possible framework for the problem at hand, because I understood the gap in feature concerning the "codeless" parsing of odd-formatted files, but also saw some commonality of features with regards to creating various processing pipelines.
I also edged my suggestion, with an expression like "inspire onself from", because of the possible feature gap, but also because of the relative lack of maturity of the SnapLogic offering and its apparent commercial/open-source ambivalence.
(Note: this statement is neither a critique of the technical maturity/value of the framework per-se, nor a critique of business-oriented use of open-source, but rather a warning that business/commercial pressures may shape the offering in various direction)
To summarize:
Depending on the specific details of the vision expressed in the question, SnapLogic may be worthy of consideration, provided one understands that "some-assembly-required" will apply, in particular in the area of file parsing, and that the specific shape and nature of the product may evolve (but then again it is open source so one can freeze it or bend it as needed).
A more generic remark is that SnapLogic is based on Python which is a very swell language for coding various connectors, convertion logic etc.
In reply to Paul Nathan you mentioned writing throwaway code as something rather unpleasant. I don't see why it should be so. After all, all of our code will be thrown away and replaced eventually, no matter how perfect we wrote it. So my opinion is that writing throwaway code is pretty much ok, if you don't spend too much time writing it.
So, it seems that there are two approaches to solving your solution: either a) find some specific tool intended for the purpose (parse data, perform some basic operations on it and storing it in some specific structure) or b) use some general purpose language with lots of libraries and code it yourself.
I don't think that approach a) is viable because sooner or later you'll bump into an obstacle not covered by the tool and you'll spend your time and nerves hacking the tool, or mailing the authors and waiting for them to implement what you need. I might as well be wrong, so please if you find a perfect tool, drop here a link (I myself am doing lots of data processing in my day job and I can't swear that I couldn't do it more efficiently).
Approach b) may at first seem "unpleasant", but given a nice high-level expressive language with bunch of useful libraries (regexps, XML manipulation, creating parsers...) it shouldn't be too hard, and may be gradually turned into a DSL for the very purpose. Beside Perl which was already mentioned, Python and Ruby sound like good candidates for these languages (I bet some Lisp derivatives too, but I have no experience there).
You might find AntlrWorks useful if you go so far as defining formal grammars for what you're parsing.