Related
Somebody please re-tag with appropriate tags
Hello,
This is my story but I guess it holds true for all programmers.
We begin programming with some simple Hello World program. We practice & add functions/classes to the program. But they still maintain the Hello World style. function calling some other functions standard library.
But when it comes to real world projects(I'm just familiar with OpenSource). Lot more other things come into picture. Then begins the hardships of this newbie programmer.
Project Flow:
Program is not running as expected. Make use of Debugger
Making use of third party libraries. Today, we have
library in every popular language for
almost everything we need.
Multiple persons working on same project. Using Version Control
Systems.
Project is growing big. Build Automation
Lot of people started using your application. You need to port it to
different platforms (operating
systems/architectures). Need for
Cross Compiliation
I don't know why but we need Unit Testing Framework and/or unit tests
What else???
The problem in this is the lack of knowledge of this newbie programmer about existence of these things.
What I mean is when I started looking into some real world projects(Opensource). I didn't know what is this? and why we need to do this?
$./configure
$make
$make install
Recently I became aware of the keyword "Build Automation". I was in need of some library which was available for linux but I needed it in windows. I didn't know that its called "Cross compilation" and tools like MinGW/MSYS exist for this purpose. I had to learn these things in the hard way. I wish some one has told me about existence of such things. That would have saved my lot of time.
Today I ran into performance problem and was feeling the need for something. I guess the thing I'm looking for is Profiler. Thanks to my involvement in opensource projects. Even though I didn't realized/felt the need for this, I'm aware of term Unit Testing.
Though this (hard)way of learning things has some big advantages like now, I'm able to figure out solution or some unknown thing very quickly & unlike my other friends I don't get struck at any point. But I hate the wastage of time involved. You do not believe how much I time I wasted in figuring out the Makefiles & Gnu Build System
So, what am I looking for this in this post?
Please complete the Project Flow. I want to see what all things are involved.
For each of the tasks in the Project Flow list. I want to see following information.
Most popular solutions/tools availabe.
Wikipedia list to all alternatives.
[optional] Suggest some good books/tutorials/guides for learning about this. Or link to relavent SO posts/tags.
I know somethings are language & OS specific. I would say we have only handful of major platforms Linux/Unix, Windows, Java, .NET and handful of major languages C, C++, Java, .NET, Python. Address these languages. Its more than sufficient.
Example:
Making use of libraries:
Libraries are distributed in any of the following forms
Source Distrubtion
Static Libraries(*.lib for windows / *.a for linux)
Dynamic Libraries (.dll for windows /.so for linux)
.NET assemblies
I don't know about java
Resources (Now, once I know the above info. I can search on my own for resources)
http://en.wikipedia.org/wiki/Library_(computing)
How To Write Shared Libraries
http://www.tenouk.com/ModuleBB.html
Note:
Please not that I'm not asking to suggest info on how to learn each of these things. I'm asking about what more such kind of things are involved and alternatives for each of them.
Your question seems to be too broad, but in comments you asked for examples of different methods. This would at least enable you to find some other pointers and make your own decisions.
Rational Unified Process (RUP)/Unified Process (UP)
eXtreme Programming (XP)
Scrum
Test Driven Development (TDD)/Behaviour Driven Development (BDD)
Lean Software development
Kanban
Feature Driven Development (FDD)
Those and many more are software development methods, which comprise of different principles, processes and roles. Most of them also come with some sets of practices for different purposes, e.g. ones you mention in your question. Those are quite widely used in real world, although many companies use different combinations and also some own processes and methods. What is not part of what is used so widely, but might give you some interesting views on development are formal methods, which are precise, but most of the time too unpractical.
Google has described a novel framework for distributed processing on Massive Graphs.
http://portal.acm.org/citation.cfm?id=1582716.1582723
I wanted to know if similar to Hadoop (Map-Reduce) are there any open source implementations of this framework?
I am actually in process of writing a Pseudo distributed one using python and multiprocessing module and thus wanted to know if someone else has also tried implementing it.
Since public information about this framework is extremely scarce. (A link above and a blog post at Google Research)
Apache Giraph http://giraph.apache.org
Phoebus https://github.com/xslogic/phoebus
Bagel https://github.com/mesos/spark/pull/48
Hama http://hama.apache.org/
Signal-Collect http://code.google.com/p/signal-collect/
HipG http://www.cs.vu.nl/~ekr/hipg/
The main Hadoop project for distributed graph processing is the Hama project. Its still in incubation though.
The project has broken its work into two areas; a matrix package and a graph package.
Update:
A better option would be the Apache Giraph project which is based on Google Pregel.
Yes, a new project called Golden Orb, which is an open-source Pregel implementation written in Java that runs on both HBASE and Cassandra.
It has been submitted to Apache incubator for approval, and Ravel, the company behind Golden Orb, said they are releasing it this month (http://www.raveldata.com/goldenorb/).
See http://www.quora.com/Graph-Databases/What-open-source-graph-databases-support-horizontal-scaling
UPDATE: GraphX is GraphLab2 on Spark implemented by Joey Gonzalez, the creator of GraphLab2.
Spark's unique primitives make GraphX-Pregel the fastest JVM-based Pregel implementation. Spark is written in Scala, but Spark has a Java and Python API.
See...
GraphX: A Resilient Distributed Graph System on Spark (PDF)
Introduction to GraphX, by Joseph Gonzalez, Reynold Xin - UC Berkeley AmpLab 2013 (YouTube)
My Hacker News comment/overview on Spark.
P.S. There is also Bagel, which was the first cut at Pregel on Spark. It works; however, GraphX will be the way forward.
Two projects from Carnegie Mellon University provide Pregel-style computation
on graphs:
GraphLab http://graphlab.org
GraphChi http://graphchi.org
The programming model is not exactly same as Pregel, as they are not based on messaging but on modifying the graph (edge, vertex) data directly. Basically, it is easy to emulate Pregel in these framework.
There is also Signal/Collect a framework written in Scala and now using Akka
http://code.google.com/p/signal-collect/
https://github.com/uzh/signal-collect
From their website:
In Signal/Collect an algorithm is written from the perspective of vertices and edges. Once a graph has been specified the edges will signal and the vertices will collect. When an edge signals it computes a message based on the state of its source vertex. This message is then sent along the edge to the target vertex of the edge. When a vertex collects it uses the received messages to update its state. These operations happen in parallel all over the graph until all messages have been collected and all vertex states have converged.
Many algorithms have very simple and elegant implementations in Signal/Collect. You find more information about the programming model and features in the project wiki. Please take the time to explore some of the example algorithms below.
I create a framework called Phoebus. It is an implementation of Pregel written in Erlang. Checkout my blog entry for applying Pregel model to path finding as well..
Apache Giraph is currently in Incubator and under very active development, with committers from LinkedIn, Twitter, Facebook and academia looking to bring it up to production scale very quickly. It is pretty directly modeled on Pregel and was originally developed at Yahoo! Research. We're looking for new contributors and have several introductory JIRA issues to help people get started with the project. We'd love to have you get involved.
Stanford Students have developed an open Source implementation of Pregel.
http://infolab.stanford.edu/gps/
Has anybody seen a GP implemented with online learning rather than the standard offline learning? I've done some stuff with genetic programs and I simply can't figure out what would be a good way to make the learning process online.
Please let me know if you have any ideas, seen any implementations, or have any references that I can look at.
Per the Wikipedia link, online learning "learns one instance at a time." The online/offline labels usually refer to how training data is feed to a supervised regression or classification algorithm. Since genetic programming is a heuristic search that uses an evaluation function to evaluate the fitness of its solutions, and not a training set with labels, those terms don't really apply.
If what you're asking is if the output of the GP algorithm (i.e. the best phenotype), can be used while it's still "searching" for better solutions, I see no reason why not, assuming it makes sense for your domain/application. Once the fitness of your GA/GP's population reaches a certain threshold, you can apply that solution to your application, and continue to run the GP, switching to a new solution when a better one becomes available.
One approach along this line is an algorithm called rtNEAT, which attempts to use a genetic algorithm to generate and update a neural network in real time.
I found a few examples by doing a Google scholar search for online Genetic Programming.
An On-Line Method to Evolve Behavior and to Control a Miniature Robot in Real Time with Genetic Programming
It actually looks like they found a way to make GP modify the machine code of the robot's control system during actual activities - pretty cool!
Those same authors went on to produce more related work, such as this improvement:
Evolution of a world model for a miniature robot using genetic programming
Hopefully their work will be enough to get you started - I don't have enough experience with genetic programming to be able to give you any specific advice.
It actually looks like they found a way to make GP modify the machine code of the robot's control system during actual activities - pretty cool!
Yes, the department at Uni Dortmund was heavily into linear GP :-)
Direct execution of GP programs vs. interpreted code has some advantages, though in these days you'd probably rather want to go with dynamic languages such as Java, C# or Obj-C that allow you to write classes/methods at runtime while still you can still benefit from some runtime rather than run on the raw CPU.
The online-learning approach doesn't seem like anything absolutely novel or different from 'classic GP' to me.
From my understanding it's just a case of extending the set of training/fitness/test cases during runtime?
Cheers,
Jay
Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I would like to work on a programming project in my spare time and would like to know
if there is a project where I can help the science community in some way?
Sure, plenty! I see I'm not the first to think of numerical computation libraries like Numpy/Scipy - the code in that is actually fairly mature but they could certainly use help documenting. There's also GNU Octave, which does much of the same things as Numpy but doesn't require Python. A slightly related area in which there's a lot of work to do is computer algebra systems (CAS), basically open source equivalents of Mathematica; for example Maxima, and more are listed at http://sage.math.washington.edu/home/wdj/sigsam/opensource_math.html. You could also help with visualization libraries, i.e. creation of 2D and 3D plots and figures. For Scipy the most commonly used plot generator is Matplotlib, for example. There are also loads of more specialized data visualization tools that I'm sure you can find with a few searches.
One area that I personally think needs a lot of work is creating GUIs for the programs mentioned in the previous paragraph; one major advantage that commercial programs like Matlab and Mathematica enjoy over their open source equivalents is easy-to-use graphical interfaces. Having a nice usable interface would be great for scientists who may not be skilled in command-line-fu, but open source projects have a long way to go if they're going to catch up.
Projects like scipy and numpy are largely contributed by the scientific community. I'm sure they would appreciate any help you thought you could provide.
I know BOINC is always looking for help
Edit: Here is their programming help page http://boinc.berkeley.edu/trac/wiki/DevProjects
The Bio* projects like BioPerl, BioPython, or BioRuby would certainly like some help, too.
http://sourceforge.net/search/?type_of_search=soft&words=science
In addition to searching open source projects online, you can try to contact your local university and ask if any of their researchers (students or faculty) need development help.
If you are still looking, feel free to contact me via my profile page - I know of a hardware product that needs software - it is used for research (chemistry and biology)
The nuclear ad particle physics communities make heavy use of ROOT, which is developed using an open source methodology. They accept suggestions and patches without much trouble. The main work is in C++, but there are binding and support for other languages as well.
I'm sure that other disciplines have their own domain specific tools. For instance, I know that there are open Computational Fluid Dynamics and Finite Element systems.
Have a look around. While domain knowledge would be helpful, most big tools are going to need help with routine stuff like RDBMS access, GUIs, documentation, and so on...
You can discover the current problems of Science by reading the abstracts of the academic journals. e.g. the Bioinformatics journal.
A few examples:
Find a faster/efficient methods to assemble a huge set of short DNA reads:
Find a way to build an efficient social scientific network
Find a way to compare thousand of human genomes
....
you could also propose your help on Nature Network:Collaboration or FriendFeed: The life scientists
There are many exicting opportunities in chemistry. There is a strong Open Source community, much of which is organized under the Blue Obelisk (http://www.blueobelisk.org). There have been major contributions in visualisation and algorithms which did not need previous chemical knowledge and the community is very welcoming to anyone who wishes to help.
For an example of the standard which has been achieved take a look at Jmol which visualizes molecules and other chemistry in 3D (http://www.jmol.org);
There is also real opportunity to do porting between platforms/languages. The commonest ones are Java, Python, C++ and we have been working in C#. You don't have to be an ace programmer either - contributions to data standards, data resources, tutorials, packaging, installers, testing, etc. are all highly valued.
Some of these projects are within the top 100-500 projects on Sourceforge.
Don't forget that if you find a project to be a bit over your head or you aren't able to really contribute, but you still like the idea of it, you can always donate!
I have been doing a little reading on Flow Based Programming over the last few days. There is a wiki which provides further detail. And wikipedia has a good overview on it too. My first thought was, "Great another proponent of lego-land pretend programming" - a concept harking back to the late 80's. But, as I read more, I must admit I have become intrigued.
Have you used FBP for a real project?
What is your opinion of FBP?
Does FBP have a future?
In some senses, it seems like the holy grail of reuse that our industry has pursued since the advent of procedural languages.
1. Have you used FBP for a real project?
We've designed and implemented a DF server for our automation project (dispatcher, component iterface, a bunch of components, DF language, DF compiler, UI). It is written in bare C++, and runs on several Unix-like systems (Linux x86, MIPS, avr32 etc., Mac OSX). It lacks several features, e.g. sophisticated flow control, complex thread control (there is only a not too advanced component for it), so it is just a prototype, even it works. We're now working on a full-featured server. We've learnt lot during implementing and using the prototype.
Also, we'll make a visual editor some day.
2. What is your opinion of FBP?
2.1. First of all, dataflow programming is ultimate fun
When I met dataflow programming, I was feel like 20 years ago, when I met programming first. Altough, DF programming differs from procedural/OOP programming, it's just a kind of programming. There are lot of things to discover, even sooo simple ones! It's very funny, when, as an experienced programmer, you met a DF problem, which is a very-very basic thing, but it was completely unknown for you before. So, if you jump into DF programming, you will feel like a rookie programmer, who first met the "cycle" or "condition".
2.2. It can be used only for specific architectures
It's just a hammer, which are for hammering nails. DF is not suitable for UIs, web server and so on.
2.3. Dataflow architecture is optimal for some problems
A dataflow framework can make magic things. It can paralellize procedures, which are not originally designed for paralellization. Components are single-threaded, but when they're organized into a DF graph, they became multi-threaded.
Example: did you know, that make is a DF system? Try make -j (see man, what -j is used for). If you have multi-core machine, compile your project with and without -j, and compare times.
2.4. Optimal split of the problem
If you're writing a program, you often split up the problem for smaller sub-problems. There are usual split points for well-known sub-problems, which you don't need to implement, just use the existing solutions, like SQL for DB, or OpenGL for graphics/animation, etc.
DF architecture splits your problem a very interesting way:
the dataflow framework, which provides the architecture (just use an existing one),
the components: the programmer creates components; the components are simple, well-separated units - it's easy to make components;
the configuration: a.k.a. dataflow programming: the configurator puts the dataflow graph (program) together using components provided by the programmer.
If your component set is well-designed, the configurator can build such system, which the programmer has never even dreamed about. Configurator can implement new features without disturbing the programmer. Customers are happy, because they have personalised solution. Software manufacturer is also happy, because he/she don't need to maintain several customer-specific branches of the software, just customer-specific configurations.
2.5. Speed
If the system is built on native components, the DF program is fast. The only time loss is the message dispatching between components compared to a simple OOP program, it's also minimal.
3. Does FBP have a future?
Yes, sure.
The main reason is that it can solve massive multiprocessing issues without introducing brand new strange software architectures, weird languages. Dataflow programming is easy, and I mean both: component programming and dataflow configuration building. (Even dataflow framework writing is not a rocket science.)
Also, it's very economic. If you have a good set of components, you need only put the lego bricks together. A DF program is easy to maintain. The DF config building requires no experienced programmer, just a system integrator.
I would be happy, if native systems spread, with doors open for custom component creating. Also there should be a standard DF language, which means that it can be used with platform-independent visual editors and several DF servers.
Interesting discussion! It occurred to me yesterday that part of the confusion may be due to the fact that many different notations use directed arcs, but use them to mean different things. In FBP, the lines represent bounded buffers, across which travel streams of data packets. Since the components are typically long-running processes, streams may comprise huge numbers of packets, and FBP applications can run for very long periods - perhaps even "perpetually" (see a 2007 paper on a project called Eon, mostly by folks at UMass Amherst). Since a send to a bounded buffer suspends when the buffer is (temporarily) full (or temporarily empty), indefinite amounts of data can be processed using finite resources.
By comparison, the E in Grafcet comes from Etapes, meaning "steps", which is a rather different concept. In this kind of model (and there are a number of these out there), the data flowing between steps is either limited to what can be held in high-speed memory at one time, or has to be held on disk. FBP also supports loops in the network, which is hard to do in step-based systems - see for example http://www.jpaulmorrison.com/cgi-bin/wiki.pl?BrokerageApplication - notice that this application used both MQSeries and CORBA in a natural way. Furthermore, FBP is natively parallel, so it lends itself to programming of grid networks, multicore machines, and a number of the directions of modern computing. One last comment: in the literature I have found many related projects, but few of them have all the characteristics of FBP. A list that I have amassed over the years (a number of them closer than Grafcet) can be found in http://www.jpaulmorrison.com/cgi-bin/wiki.pl?FlowLikeProjects .
I do have to disagree with the comment about FBP being just a means of implementing FSMs: I think FSMs are neat, and I believe they have a definite role in building applications, but the core concept of FBP is of multiple component processes running asynchronously, communicating by means of streams of data chunks which run across what are now called bounded buffers. Yes, definitely FSMs are one way of building component processes, and in fact there is a whole chapter in my book on FBP devoted to this idea, and the related one of PDAs (1) - http://www.jpaulmorrison.com/fbp/compil.htm - but in my opinion an FSM implementing a non-trivial FBP network would be impossibly complex. As an example the diagram shown in
is about 1/3 of a single batch job running on a mainframe. Every one of those blocks is running asynchronously with all the others. By the way, I would be very interested to hearing more answers to the questions in the first post!
1: http://en.wikipedia.org/wiki/Pushdown_automaton Push-down automata
Whenever I hear the term flow based programming I think of LabView, conceptually. Ie component processes who's scheduling is driven primarily by a change to its input data. This really IS lego programming in the sense that the labview platform was used for the latest crop of mindstorm products. However I disagree that this makes it a less useful programming model.
For industrial systems which typically involve data collection, control, and automation, it fits very well. What is any control system if not data in transformed to data out? Ie what component in your control scheme would you not prefer to represent as a black box in a bigger picture, if you could do so. To achieve that level of architectural clarity using other methodologies you might have to draw a data domain class diagram, then a problem domain run time class relationship, then on top of that a use case diagram, and flip back and forth between them. With flow driven systems you have the luxury of being able to collapse a lot of this information together accurately enough that you can realistically design a system visually once the components are build and defined.
One question I never had to ask when looking at an application written in labview is "What piece of code set this value?", as it was inherent and easy to trace backwards from the data, and also mistakes like multiple untintended writers were impossible to create by mistake.
If only that was true of code written in a more typically procedural fashion!
1) I build a small FBP framework for an anomaly detection project, and it turns out to have been a great idea.
You can also have a look at some of the KNIME videos, that give a good idea of what a flow based framework feels like when the framework is put together by a great team. Admittedly, it is batch based and not created for continuous operation.
By far the best example of flow based programming, however, is UNIX pipes which is one of the oldest, most overlooked FBP framework. I don't think I have to elaborate on the power of nix pipes...
2) FBP is a very powerful tool for a large set of problems. The intrinsic parallelism is a great advantage, and any FBP framework can be made completely network transparent by using adapter modules. Smart frameworks are also absurdly fault tolerant, and able to dynamically reload crashed modules when necessary. The conceptual simplicity also allows cleaner communication with everybody involved in a project, and much cleaner code.
3) Absolutely! Pipes are here to stay, and are one of the most powerful feature of unix. The power inherent in a FBP framework compared to a static program are many, and trivialise change, to the point where some frameworks can be reconfigured while running with no special measures.
FBP FTW! ;-)
In automotive development, they have a language agnostic messaging protocol which is part of the MOST specification (Media Oriented Systems Transport), this was designed to communicate between components over a network or within the same device. Systems usually have both a real and visualized message bus - therefore you effectively have a form of flow based programming.
That was what made the light bulb go on for me several years ago and brought me here. It really is a fantastic way to work and so much more fun than conventional programming. The message catalog form the central specification and point of reference. It works well for both developers and management. i.e. Management are able to browse the message catalog instead of looking at source.
With integrated logging also referencing the catalog to produce intelligible analysis things can get really productive. I have real world experience of developing commercial products in this way. I am interested in taking things further, particularly with regards to tools and IDEs. Unfortunately I think many people within the automotive sector have missed the point about how great this is and have failed to build on it. They are now distracted by other fads and failed to realize that there was far more to most development than the physical bus.
I've used Spring Web Flow extensively in Java Web applications to model (typically) application processes, which tend to be complex wizard-like affairs with lots of conditional logic as to which pages to display. Its incredibly powerful. A new product was added and I managed to recut the existing pieces into a completely new application process in an hour or two (with adding a couple of new views/states).
I also looked into using OS Workflow to model business processes but that project got canned for various reasons.
In the Microsoft world you have Windows Workflow Foundation ("WWF"), which is becoming more popular, particularly in conjunction with Sharepoint.
FBP is just a means of implementing a finite state machine. It's nothing new.
I realize that it is not exactly the same thing, but this model has been used for years in PLC programming. ISO calls it Sequential Flow Chart, but many people call it Grafcet after a popular implementation. It offers parallel processing and defines transitions between states.
It's being used in the Business Intelligence world these days to mashup and process data. Data processing steps like ETL, querying, joining , and producing reports can be done by the end-user. I'm a developer on an open system - ComposableAnalytics.com In CA, the flow-based apps can be shared and executed via the browser.
This is what MQ Series, MSMQ and JMS are for.
This is cornerstone of Web Services and Enterprise Service Bus implementations.
Products like TIBCO and Sun's JCAPS are basically flow-based without using this particular buzz-word.
Most of the work of the application is done with small modules that pass messages through a processing network.