I'm currently participating in a programming contest (http://contest.github.com), which has as goal, to create a recommendation engine. I started coding in ruby, but soon realised it wasn't fast enough for the algorithms I had in mind. So I switched to C, which is the only non-scripting language I know. It was fast, of course, but I cringed every time I had to write a for loop, to go through the elements of an array (which was very often).
That's when it dawned: I wish I knew a fast, yet high-level language, to program all these intensive computations with ease!
So I looked at my options, but there are a lot of options these days! Here the best candidates I've found over the months, with something which bothers me about each of them (that hopefully you can clear up):
Clojure: I'm not sure I want to get into the whole lisp thing, I like my syntax and cruft. I could be convinced, though.
Haskell: Too academic? I don't really care for pure functional, I just want something which works. But it has nice syntax, and I don't mind static typing.
Scala: Weird language. I tried it out but it feels messy/inconsistent to me.
OCaml: Also wondering if this is too academic? The poor concurrency support also bothers me.
Arc: Paul Graham's lisp, too obscure, and again, I'm not sure I want to learn a lisp. But I trust this man!
Any advice? I really like the functional languages, for their ability to manipulate lists with ease, but I'm open to other options too. I'd like something about as fast as Java..
The kind of things I want to be able to do with lists are like (ruby):
([1, 2, 3, 4] - [2, 3]).map {|i| i * 2 } # which results in [2, 8]
I would also prefer an open-source language.
Thanks
Out of the languages that you've listed, neither Haskell nor Arc match your "fast" requirement - both are slower than Java. Your idea that Haskell is faster than Java and approaches C is most likely coming from one well-known flawed test that tried to measure performance by implementing sort. One thing that they've missed is that Haskell is lazy, and thus you need to use the results of the sort for it to actually perform that; and they measured performance simply by remembering current time, "calling" the sort function, and checking the time delta. C version of the test faithfully performed the sort, Haskell version simply returned a thunk for lazy evaluation which was never called.
In practice, there are a number of reasons why Haskell cannot be that fast even in theory. First, because of pervasive lazy evaluation, it often cannot pass around raw values, and has to generate thunks for expressions - the optimizer can trim down on those in trivial cases, but not for more complicated ones. Second, polymorphic Haskell functions are implemented as runtime-polymorphic, and not like C++ templates where every new type parameter instantiates a new version of code that is optimally compiled. Obviously, this necessitates extra boxing/unboxing. In the end, Haskell will struggle to beat any decent VM (such as HotSpot JVM, or CLR in .NET 2.0+), much less C/C++.
Now that's settled in, let's move on to the rest. Scala uses JVM as a backend, and thus is not going to be any faster than Java - and if you use higher-level abstractions, it will most likely be slower somewhat, but probably in the same ballpark. Clojure also runs on JVM, but it's also dynamically typed, and that carries an unavoidable performance penalty (I heard it does clever tricks to mitigate that to some extent, but some of it really is unavoidable no matter what).
That leaves OCaml, and out of your list, it is the only language that had actually been conclusively shown to reach the performance of C/C++ compilers on valid tests. It should be noted however that this would not be typical of idiomatic OCaml code - for example, its polymorphism is also runtime, similar to Haskell, and that carries the appropriate penalty; also, its OOP system is structural rather than nominal, which precludes an optimal vtable-based implementation; so that is going to be slower than C++, too (I'd expect perf penalty close to that of Objective-C dispatch compared to C++ dispatch, but I don't have any numbers to back that up). So you can beat C++ in OCaml if you steer away from certain language features, but unfortunately, it's those features that make OCaml so attractive in the first place.
My advice would be this: if you really need speed, go with C++. It can be fairly high-level if you use high-level libraries such as STL and Boost. It doesn't have some high-level language abstractions you might be used to, but libraries can compensate for that - sometimes fully, sometimes in part. For example, you don't have to write a for-loop to iterate over an array - you can use std::for_each, std::copy_if, std::transform, std::accumulate and similar algorithms (which are mostly analogous to map, filter, fold, and similar traditional FP primitives), and also Boost.Lambda to cut down on boilerplace.
Why not simple Java or C#? Should be faster then Ruby, more high level then C and have a huge userbase.
Your criticism of pretty much everything seems to be that it's "weird" or "too academic." But what does that mean? It's the sort of vague criticism that you can throw at any unfamiliar language that isn't totally mainstream (i.e., not C, C++, Objective-C, Java, Ruby, Python or PHP). There's nothing about all those languages that's inherently good for academia and bad for anything else. Try to break down your analysis a little further: Specifically, what is it that troubles you about those languages? You might find that your brain is just instinctively pushing away something unfamiliar. If that's the case, learning one of those languages might be a good way to expand your mind.
Alternatively: It sounds like you're looking for a functional language, so you might look at F#. It's a first-class CLR language created by Microsoft, so it doesn't carry any "academic" mental baggage, and it's very similar to OCaml.
newLISP is fast, small, integrates extremely easily with C, and it has quite a few statistical functions built-in.
Haskell is my current preference as a performant, high-level language. I've also heard very good things about OCaml, but haven't personally used it much.
Scala and Clojure will have similar performance to Java -- slow, slow, slow! Sure, they'll be faster than Ruby, but what isn't?
Arc is a set of macros for MzScheme, and is not particularly fast. If you want a performant LISP, try Common LISP -- it can be compiled to machine code.
How about Delphi / FreePascal? They're native code & fast. I do a lot of real-time graphics & processing with them. They dont require that you work 'low level', but you can if you need to. Plus you can embed assembler if needed for extra performance. FreePascal is cross platform if you want to stay off Windows.
D might fit the bill? Compiles to machine code but allows for programming using higher-level concepts.
Python can be made to run fast, especially using the NumPy package. Relevant links below:
http://www.scipy.org/PerformancePython
Cython and numpy speed
You seem uncomfortable with any language that doesn't look like one you already use. That's going to limit you, so I'd suggest one you won't be comfortable with if you're interested in expanding your horizons. I'm not saying you'll want to continue with any particular language (I have a definite preference never to touch Tcl again), but you should try it sometime.
There are nice fast implementations of Common Lisp, and that's an easy language to write functional programs in. Besides, if you can get along with it, you'll find a lot of neat things you can do with it.
Computation? Fortran. Beats the pants off of anything else.
If you don't mind .NET...
F# - based on O'Caml, multiparadigm language with full access to .NET Framework. Included officially in .NET FW 4.0
Nemerle - see F# and add to that a POWERFUL metaprogramming capabilities.
After your update:
If you want to manipulate lists easily you should go with Common Lisp. It is only 2 times slower that C in average (and actually faster in some things), it is great for list processing and it is multi-paradigm (imperative, functional and OO) - so you don't have to stick to functional-only programming. SBCL is a good Common Lisp to try first, IMO.
And don't get bothered by strange "lispy" things like parentheses. It is not only quite stupid to judge language by its syntax, rather than semantics, but also parentheses are one of the greatest strengths of LISP, because they eliminate differences between data and expressions and you can manipulate language itself to make it fit your needs.
Don't listen to people who advice C++/C#/Java. Java functional part is non-existant. C++ functional part is terrible. C# delegates makes me sick because of their complexity. They are not REAL multi-paradigm imperative/functional languages, they are imperative/OO languages that have some small functional bits, you can't do real functional programming in them.
C++ or alternatively C# and mono.
Honestly, to accomplish much in the world of software engineering, you will likely have to wrap your head around these languages you find distasteful. Java, C, C++, C#, etc. are likely to come up in a career that involves programming.
Looks like you've done some interesting work. I encourage you to push your technical skills harder. It will be worth the effort.
Alternatively, Python might be good, given your interests. You might find Smalltalk interesting, or even ATS.
For some ideas, look at the Language Shootout and analysis by Oscar Boykin. You have already discovered this, but comparing Ruby to C we see that Ruby is between 14 and 600 times slower (several tests are more than 100 times slower). He also points out that Python is faster than Ruby. The benchmarks for all languages is interesting.
Also interesting are benchmarks from Dan Corlan.
You might consider python; it supports writing modules in C or C++, so you can get it working in a high-level language, profile it, rework the algorithms, and if it still isn't fast enough, translate the hotspots to C or C++ for speed.
Consider Tcl, combined with C. Do the really compute-intensive stuff in C since that's what you know how to do, then use Tcl as the glue to combine the high level code with your C-based code.
I make this recommendation not because Tcl is necessarily the best language for the job (there really is no "best" for something like this) but because you'll learn a lot about the concept of combining the strengths of two different languages. It's an important technique that could serve you well in your career whether it's Tcl/C, Lua/C, Groovy/Java, Python/C, etc.
Python with pyrex or psyco may be a better fit? Probably not as fast as C, but you can see some significant speedups from regular Python.
If you want something that's "about as fast as Java," the obvious solution is JRuby.
If you install Netbeans (use the download button under the Ruby column), JRuby is the default interpreter. It doesn't get much easier!
If your problem is C's clunky loops, I'd suggest looking at Ada. It allows you to loop through a whole array with a simple statement like so:
for I in array_name'range loop
--'// Code goes here
end loop;
For AI projects, I'd also suggest you look into using Clips, which is a freely-available inference engine.
Rather than OCAML, you might consider F# -- it's source compatible with OCAML (or you can use a lighter weight syntax) and it supports actor-style concurrency with what it calls asynchronous workflows (which are really an almost-monad for applying asynchronous execution).
Not that -- as Scala shows -- you need to have actor style concurrency baked into the language, if you build it into a library. The rest is just syntactic sugar.
Learn C++ and familiarize yourself with its standard library. It won't be that hard to learn as you already 'speak' C, but keep in mind that C++ is not just a better C, it's another language with its own concepts and methods.
Why not Erlang?
It's not too much like the languages you already know, so you can learn new concepts
It has some interesting capabilities for multiprocessing
It's not out of academia. Erlang was a commercial language first.
There are at least two significant open source applications written in it: CouchDB and Wings3d
I believe going through C C++ and Java or .net then moving on from here to any one way java or .net , because c is more machine oriented and C++ and java will give you hands on with object oriented learning, then later on switching to python (to really appreciate the clean code than in C C++ and JAVA ).
Related
I'm making the transition from Java to PHP/Javascript and discovering all the practical aspects of using a weakly typed language.
As I'm in a position to fully compare the two I'd like to know the pros and cons of each approach. Also, are there any other forms of typing out there?
A weakly dynamically typed programming language (like PHP) made that the programmer's mistakes occur as non-coherent behaviours (for instance, the program gonna display stupid informations).
With a strongly dynamically typed language (like python), the programming mistakes causes error message. It makes the mistakes easier to uncover and diagnosis but in general the program became not usable after the message has been shown.
Finally, with a strongly statically typed language (like Java, Ada, OCaml, Haskell, ...) some mistakes can be uncovered at compile time and hence reduce the risk to provide an bugged program. (but the release occurs later)
Yes. Python uses Dynamic Typing.
Generally it's a matter of personal preference and the role that the architects of a given language's intended use.
PHP (a scripting language) for example makes sense to be weakly typed, as the tasks it generally performs are far less complex, and require less constraints then say a compiled language.
Regarding your final question, Mathematica is said to be "typeless."
High-level, typeless, dynamic language with consistent symbolic syntax and semantics across all data, functions, and interfaces
PHP/javascript can be used to develope better looking UI's than Java. PHP will be having less constraints and easy to learn and execute than java.
I don't want an automated solution.
When you have to translate a program from a language to another, what do you do? You prefer to rewrite it from the beginning or copy and paste it and change only what need to be changed?
What's the best choice?
It depends on
the goals (quick hack for one time use? long-lived production project for work?)
the resources I have (how many man-hours? Test suit and/or functional spec for old code? familiarity with both languages?)
most importantly, differences between the languages. Both the conceptual (OO? functional? reflection? control structures?) as well as available libraries.
Please note that this bullet is not as trivial as it seems - this depends in large part on how idiomatic the original program is - as an example, some people write very "C-like" Perl code (e.g. using C control flow and very C++ like OO design) which can be trivially copied to C or C++, and some people write incredibly intricate idiomatic Perl using functional programming, closures and reflective capabilties; which can't be obviously translated into C/C++.
Also, quality of the original code. E.g. good program will have separated business logic, usually expressed in standard configuration and control flow that's easier to directly clone.
E.g. translating from PHP to Perl for a hack job, you can often start out with copying, since many PHP constructs can be 1-to-1 mapped onto equivalent Perl constructs (just take your pick of Perl templating web library). The resulting code won't be GOOD Perl but will be Good Enough for some purposes.
On the other hand, translating, say, LISP code to Java, you're better off just translating the original code into a functionality specification and re-write from scratch. Your example of Python and JavaScript is probably in the same box.
Usually you have two languages that share at least some concepts (e.g. both have OO, and some imperative control structures) and thus you end up with some combination of the two approaches - parts of the code can be "thoughtlessly" translated, parts need to be re-written from scratch.
The more of the second (complete rewrite) approach, the better quality idiomatic and powerful code you end up with.
Normally, a rewrite using the original for inspiration and direction is the best choice.
The way you may do something in one language might very well not be the right way to do it in another.
When it comes to copy-paste, it is rare that you can just do that - languages are different and follow different syntax rules.
Of course, this all depends on the source and destination languages.
With your comment - javascript and python, I would say a rewrite is the best option.
That probably mostly depends on the similarity between the two languages and the meaning of "translation".
For instance, translating a bit of C89 code to C++ might not be so hard considering it should compile out of the box when you copy-paste (C++ is a compatible super-set to C89). I would hardly consider that a "translation", though.
On the other hand, translating Java to Haskell would certainly require a complete rewrite as the language paradygms, even worse than the syntax, are completely different.
Consider Wikipedia's list of programming languages. I'm too lazy to count how many of them are on this list, but let's assume there are 100.
If you want to translate one of them into another, that means that there are at least 100*99 = 9900 possible combinations for translation.
And that's an awful lot. Since most languages are unique, translation is very, very dependent on source and destination language.
Consider this Pascal to C converter. The author states it took him one and a half year to make a good translator for these particular languages. Obviously, this isn't a trivial task.
Depending on your ambitions, you might spend either one day, or many years translating a program from language A to language B.
How long this takes depends on your skill, size of your source code, complexity of languages A and B and their similarity.
As you can see, this isn't a trivial task and is highly dependent on your situation.
Usually compilers translate from the language they support to assembly. Or at most to an assembly-like language (bytecode), like GIMPLE/GENERIC for GCC or Python/Java/.NET bytecode.
Wouldn't it be simpler for a compiler translate to a simpler language, which already implement a big subset of their grammar?
For example an Objective-C compiler, which is 100% compatible with C, could add the semantics only for the syntax it extends to C's, translating it into C. I can see many advantages of doing this; one could use this Objective-C compiler to translate its code into C in order to compile the generated C code with a different compiler that doesn't support C++ (but that optimizes more, or that compiles quicker, or able to compile for more architectures). Or one would be able to use the generated C code in a project where only C is allowed.
I guess/hope that if things were working like this, it would have been a lot easier to write extensions for current languages (eg: adding to C++ keywords to ease the implementation of common patterns, or, still in C++, removing the declare before use rule by moving inline member functions to the end of header files)
What kind of penalties would there be? Generated code would be very difficult to be understood by humans? Compilers wouldn't be able to optimize as much as they can now? What else?
This is actually used by a lot of languages, through the use of Intermediate languages. The biggest example for this would be Pascal, which had the Pascal-P system: Pascal was compiled into a hypothetical assembly language. To port pascal would only mean making a compiler for this assembly language: a task a lot simpler than porting the entire pascal compiler. After writing this compiler, you'd only need to compile the (machine-independent) pascal compiler that was written in this.
Bootstrapping is also used quite often in programming language design. Many languages have their compilers written in the same language(Haskell comes to mind here). By doing this, writing a new functionality for the language simply means translating that idea into the current language, putting it into the compiler, then recompiling.
I don't think the problem with this method is really the readability of generated code(I don't sift through assembly byte-code generated through compilers, personally), but one of optimization. Many ideas in higher-level programming languages( weak-typing comes to mind) are hard to automatically translate into lower-level system languages such as C. There's a reason why GCC tends to do its optimization before code generation.
But for the most part, compilers do translate into simpler languages except for maybe the most basic of system languages.
Incidentally, as a counterexample, Tcl is one language that is known to be very-very hard (if not totally impossible) to translate to C. Over the last 20 years there have been a couple of projects that tried this, even one promise of a commercial product but none have materialized.
In part it is because Tcl is a very dynamic language (as any language with an eval function is). In part it is because the only way to know if something is code or data is to run the program.
Since Objective-C is a strict superset of C and C++ contains a very large amount that is a lot like C, to parse either you effectively already need to be able to parse C. In which case, outputting to machine code and outputting to more C code aren't substantially different in processing cost, the main cost to the user being that compiling now takes as long as it originally did plus the amount of time a second compiler takes.
Any attempt to copy and paste the stuff that looks like C and translate the rest around it would be prone to problems. Firstly, C++ isn't a strict superset of C so things that look like C don't necessarily compile exactly the same anyway (especially versus C99). And even if they did, supposing a user made an error in their C stuff, compilers don't tend to provide error information in a machine readable format, so it'd be really hard for the Objective-C to C layer to give the user a meaningful error after receiving e.g. "error at line 99".
That said, many compiler suites, like GCC and even more so like the upcoming Clang + LLVM, use an intermediate form to decouple the bit that knows about the specifics of one architecture from the bit that knows the specifics of a particular language. However, it tends to be more of a data structure than something intentionally easy to express as a written language.
So: compilers don't work like this for purely practical reasons.
Haskell is actually compiled this way: the GHC compiler first translates the source code to an intermediary functional language (which is less rich than Haskell self), performs optimizations and then lowers the whole thing to C code which is then compiled by GCC. This solutions has problems tough, and projects were started to replace this backend.
http://blog.llvm.org/2010/05/glasgow-haskell-compiler-and-llvm.html
There is a compilers construction stack which is fully based on this idea. Any new language is implemented as a trivial translation into a lower level language or a combination of languages which are already defined within this stack.
http://www.meta-alternative.net/mbase.html
However, in order to be able to do so, you'd need at least some metaprogramming capabilities in every little language you add to a hierarchy. This requirement adds some severe limitations on languages semantics.
Programming languages seem to go through several stages. Firstly, someone dreams up a new language, Foo Language. The compiler/interpreter is written in another language, usually C or some other low level language. At some point, FooL matures and grows, and eventually someone, somewhere will write a compiler and/or interpreter for FooL in FooL itself.
My question is this: What is the minimal subset of language features such that someone could implement that language in itself?
Compiler can be written even using a Turing machine - a Universal Turing Machine is basically a compiler/interpreter of any Turing machine, so any Turing-complete language should be enough :)
In theory, surprisingly little. A computability theorist would say that all you need is mu-recursion or a Turing machine or the like.
However, from a practical point of view, you're not going to be very happy trying to implement a programming language in a Turing machine. I would say that, at a minimum, you would want to have all the usual control-flow constructs, the primitive datatypes, subroutines, as well as arrays and structs. That should be enough to let you implement that subset of the language in the language itself -- and you can then bootstrap yourself up from there.
One option is a read-eval-print loop. This can be used to build many higher-level constructs. I believe this is the path taken by LISP.
I am unsure about the beginnings of C, but I think it started with a few system calls to implement branching, loops, assignment and single-character I/O, and built from there.
Id assume a assembler would make the cut.
My question is this: What is the minimal subset of language features such that someone could implement that language in itself?
There is no requirement for the language to be useful for anything other than compiling itself? I present to you Useless, the language in which every text is a proper program and means "a program that takes any input and produces itself" (this is also known as Useless compiler).
I am attempting to determine prior art for the following idea:
1) user types in some code in a language called (insert_name_here);
2) user chooses a destination language from a list of well-known output candidates (javascript, ruby, perl, python);
3) the processor translates insert_name_here into runnable code in destination language;
4) the processor then runs the code using the relevant system call based on the chosen language
The reason this works is because there is a pre-established 1 to 1 mapping between all language constructs from insert_name_here to all supported destination languages.
(Disclaimer: This obviously does not produce "elegant" code that is well-tailored to the destination language. It simply does a rudimentary translation that is runnable. The purpose is to allow developers to get a quick-and-dirty implementation of algorithms in several different languages for those cases where they do not feel like re-inventing the wheel, but are required for whatever reason to work with a specific language on a specific project.)
Does this already exist?
The .NET CLR is designed such that C++.Net, C#.Net, and VB.Net all compile to the same machine language, and you can "decompile" that CLI back in to any one of those languages.
So yes, I would say it already exists though not exactly as you describe.
There are converters available for different languages. The problem you are going to have is dealing with libraries. While mapping between language statements might be easy, finding mappings between library functions will be very difficult.
I'm not really sure how useful that type of code generator would be. Why would you want to write something in one language and then immediately convert it to something else? I can see the rationale for 4th Gen languages that convert diagrams or models into code but I don't really see the point of your effort.
Yes, a program that transform a program from one representation to another does exist. It's called a "compiler".
And as to your question whether that is always possible: as long as your target language is at least as powerful as the source language, then it is possible. So, if your target language is Turing-complete, then it is always possible, because there can be no language that is more powerful than a Turing-complete language.
However, there does not need to be a dumb 1:1 mapping.
For example: the Microsoft Volta compiler which compiles CIL bytecode to JavaScript sourcecode has a problem: .NET has threads, JavaScript doesn't. But you can implement threads with continuations. Well, JavaScript doesn't have continuations either, but you can implement continuations with exceptions. So, Volta transforms the CIL to CPS and then implements CPS with exceptions. (Newer versions of JavaScript have semi-coroutines in the form of generators; those could also be used, but Volta is intended to work across a wide range of JavaScript versions, including obviously JScript in Internet Explorer.)
This seems a little bizarre. If you're using the term "prior art" in its most common form, you're discussing a potentially patentable idea. If that is the case, you have:
1/ Published the idea, starting the clock running on patent filing - I'm assuming, perhaps incorrectly, that you're based in the U.S. Other jurisdictions may have other rules.
2/ Told the entire planet your idea, which means it's pretty much useless to try and patent it, unless you act very fast.
If you're not thinking about patenting this and were just using the term "prior art" in a laypersons sense, I apologize. I work for a company that takes patents very seriously and it's drilled into us, in great detail, what we're allowed to do with information before filing.
Having said that, patentable ideas must be novel, useful and non-obvious. I would think that your idea would not pass on the third of these since you're describing a language translator which would have the prior art of the many pascal-to-c and fortran-to-c converters out there.
The one glimmer of hope would be the ability of your idea to generate one of multiple output languages (which p2c and f2c don't do) but I think even that would be covered by the likes of cross compilers (such as gcc) which turn source into one of many different object languages.
IBM has a product called Visual Age Generator in which you code in one (proprietary) language and it's converted into COBOL/C/Java/others to run on different target platforms from PCs to the big honkin' System z mainframes, so there's your first problem (thinking about patenting an idea that IBM, the biggest patenter in the world, is already using).
Tons of them. p2c, f2c, and the original implementation s of C++ and Objective C strike me immediately. Beyond that, it's kind of hard to distinguish what you're describing from any compiler, especially for us old guys whose compilers generated ASM code for an intermediate represetation anyway.