Is CouchDB best suited for dynamic languages? - json

I'm familiar with CouchDB, and the idea of mapping its results to Scala objects, as well as finding some natural way to interact with it, came immediately.
But I see that dynamic languages such as Ruby and JavaScript do things very well with the JSON/document-centric/schema-free approach of CouchDB.
Is there any good approach to working with Couch in static languages?

I understand that CouchDB works purely with JSON objects. Since JSON is untyped, it's tempting to believe that it's more naturally suited for dynamic languages. However, XML is generally untyped too, and Scala has very good library support for creating and manipulating XML. For an exploration of Scala's XML features, see: http://www.ibm.com/developerworks/library/x-scalaxml/
Likewise with JSON. With the proper library support, dealing with JSON can feel natural even in static languages. For one approach to dealing with JSON data in Scala, see this article: http://technically.us/code/x/weaving-tweed-with-scala-and-json/
With object databases in general, sometimes it's convenient to define a "model" (using, for example, a class in the language) and use JSON or XML or some other untyped document language to be a serialized representation of the class. Proper library support can then translate between the serialized form (like JSON) and the in-memory data structures, with static typing and all the goodies that come with it. For one example of this approach, see Lift's Record which has added conversions to and from JSON: http://groups.google.com/group/liftweb/msg/63bb390a820d11ba

I wonder if you asked the right question. Why are you using Scala, and not dynamic languages? Probably because of some goodness that Scala provides you that is important for you and, I assume, your code quality. Then why aren't you using a "statically typed" (i.e. schema-based) database either? Once again I'm just assuming, but the ability to respond to change comes to mind. Production SQL databases have a horrible tendency of being very difficult to change and refactor.
So, your data is weakly typed, and your code is strongly typed. But somewhere you'll need to make the transition. This means that somewhere, you'll have a "schema" for your data even though the database has none. This schema is defined by the classes you're mapping Couch documents onto. This makes perfect sense; most uses of Couch that I've seen have a key such as "type" and for each type at least some common set of keys. Whether to hand-map the JSON to these Scala classes or to use e.g. fancy reflection tools (slower but pretty), or some even fancier Scala feature that I'm yet new to is a detail. Start with the easy-but-slow one, then see if it's fast enough.
The big thing occurs when your classes, i.e. your schema, change. Instead of ALTER'ing your tables, you can just change the class, ensure that you do something smart if for some document a key you expect is missing (because it was based on an older version of the class), and off you go. Responding to change has never been easier, and still your code is as statically typed as it can get.
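To make the "class is the schema" idea concrete, here is a minimal sketch. It is in Python with type hints rather than Scala, and the User model, its keys and the from_doc helper are all made up for illustration. It ignores keys the class does not know about (such as "_id" and "type") and falls back to the class default when an older document lacks a newer key.

```python
import json
from dataclasses import MISSING, dataclass, fields
from typing import Any


@dataclass(frozen=True)
class User:
    # Hypothetical model: these field names are assumptions, not a real schema.
    name: str
    email: str
    nickname: str = ""  # added in a later version of the class


def from_doc(cls, doc: dict[str, Any]):
    """Build a typed instance from a decoded JSON document."""
    kwargs = {}
    for f in fields(cls):
        if f.name in doc:
            value = doc[f.name]
            # Only plain types (str, int, ...) are handled in this sketch.
            if not isinstance(value, f.type):
                raise TypeError(f"{f.name}: expected {f.type.__name__}")
            kwargs[f.name] = value
        elif f.default is MISSING and f.default_factory is MISSING:
            # No default to fall back on, so the document is unusable.
            raise ValueError(f"missing required key: {f.name}")
    return cls(**kwargs)


# A document written before "nickname" existed still loads.
old_doc = json.loads('{"_id": "u1", "type": "user", "name": "Ann", "email": "ann@example.com"}')
user = from_doc(User, old_doc)
```

The transition from weakly typed data to typed code happens in exactly one place (from_doc), which is also where you decide what "something smart" means for a missing key.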
If this is not good enough for you, and you want no schema at all, then you're effectively saying that you don't want to use classes to define and manipulate your data. That's fine too (though I can't imagine a use), but then the question is not about dynamic vs static languages, but about whether to use class-based OO languages at all.

Related

Can we achieve homoiconicity in Unison?

At first glance, it seems like Unison may be homoiconic due to the fact that "code is data", at least in the sense that Unison code is stored as cryptographic hashes in a durable fashion. However, directly working with cryptographic hashes doesn't seem very practical, perhaps no more than directly working with compiled bytecode for the JVM. So maybe it is best to break this down into two parts:
1. Is Unison currently homoiconic?
2. Could Unison be homoiconic, with additional code-generation and AST manipulation features?
On 1, I'd say no. When Unison is eventually self-hosted, then sure, the compiler data structures could be made available as a library.
However, since Unison has builtins to convert any Unison value or code to a well-defined serialized form, you can write a library that parses that form into some Unison data structures that represent that code. That is actually what the in-progress Unison JIT compiler does. And the library that Dan developed for this will be something people could use for other purposes (like I could imagine using it to write plugins that would generate a JSON serializer for arbitrary Unison values, for instance).
Maybe some people would say the existence of said library counts as homoiconicity now. Like it doesn't really matter if the compiler internally represents code as a Unison data structure as long as you have a function for converting code to a Unison data structure.
Aside: I dislike the term homoiconicity. It's a fancy piece of jargon that isn't even particularly well-defined.

What should I read on json and html parsers to build one myself?

I want to create a json and html parser to deepen my knowledge in them (I don't want to reinvent it to be "more efficient", as you could think).
What should I read to succeed with it?
P.S: I know about parsing laws, but couldn't find some on json.
P.P.S: C++ implementation is my target.
JSON is specified in RFC 8259 (using ABNF) and ECMA-404 (using railroad diagrams). Since they both define the same grammar, which of the two you use is unimportant; go for the one you find easier.
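To give a feel for the size of the job, here is a rough recursive-descent sketch with one function per grammar rule. It is written in Python for brevity, but the structure should carry over to C++ more or less directly. All function names are my own, and it is deliberately incomplete: it does not combine UTF-16 surrogate pairs from \u escapes and does not reject raw control characters inside strings, so treat it as a starting point rather than a conforming parser.

```python
import re

NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")
HEX4 = re.compile(r"[0-9a-fA-F]{4}")
ESCAPES = {'"': '"', "\\": "\\", "/": "/", "b": "\b",
           "f": "\f", "n": "\n", "r": "\r", "t": "\t"}

def parse_json(text):
    value, pos = parse_value(text, skip_ws(text, 0))
    pos = skip_ws(text, pos)
    if pos != len(text):  # a document is exactly one value
        raise ValueError(f"trailing data at offset {pos}")
    return value

def skip_ws(text, pos):
    while pos < len(text) and text[pos] in " \t\n\r":
        pos += 1
    return pos

def parse_value(text, pos):
    ch = text[pos:pos + 1]  # empty string at end of input
    if ch == "{":
        return parse_object(text, pos + 1)
    if ch == "[":
        return parse_array(text, pos + 1)
    if ch == '"':
        return parse_string(text, pos + 1)
    for literal, result in (("true", True), ("false", False), ("null", None)):
        if text.startswith(literal, pos):
            return result, pos + len(literal)
    match = NUMBER.match(text, pos)
    if not match:
        raise ValueError(f"unexpected input at offset {pos}")
    token = match.group()
    number = float(token) if any(c in token for c in ".eE") else int(token)
    return number, match.end()

def parse_string(text, pos):  # pos is just past the opening quote
    out = []
    while pos < len(text):
        ch = text[pos]
        if ch == '"':
            return "".join(out), pos + 1
        if ch != "\\":
            out.append(ch)
            pos += 1
            continue
        esc = text[pos + 1:pos + 2]
        if esc == "u" and HEX4.fullmatch(text, pos + 2, pos + 6):
            out.append(chr(int(text[pos + 2:pos + 6], 16)))
            pos += 6
        elif esc in ESCAPES:
            out.append(ESCAPES[esc])
            pos += 2
        else:
            raise ValueError(f"bad escape at offset {pos}")
    raise ValueError("unterminated string")

def parse_array(text, pos):  # pos is just past "["
    items = []
    pos = skip_ws(text, pos)
    if text[pos:pos + 1] == "]":
        return items, pos + 1
    while True:
        item, pos = parse_value(text, skip_ws(text, pos))
        items.append(item)
        pos = skip_ws(text, pos)
        if text[pos:pos + 1] == "]":
            return items, pos + 1
        if text[pos:pos + 1] != ",":
            raise ValueError(f"expected ',' or ']' at offset {pos}")
        pos += 1

def parse_object(text, pos):  # pos is just past "{"
    obj = {}
    pos = skip_ws(text, pos)
    if text[pos:pos + 1] == "}":
        return obj, pos + 1
    while True:
        pos = skip_ws(text, pos)
        if text[pos:pos + 1] != '"':
            raise ValueError(f"expected string key at offset {pos}")
        key, pos = parse_string(text, pos + 1)
        pos = skip_ws(text, pos)
        if text[pos:pos + 1] != ":":
            raise ValueError(f"expected ':' at offset {pos}")
        obj[key], pos = parse_value(text, skip_ws(text, pos + 1))
        pos = skip_ws(text, pos)
        if text[pos:pos + 1] == "}":
            return obj, pos + 1
        if text[pos:pos + 1] != ",":
            raise ValueError(f"expected ',' or '}}' at offset {pos}")
        pos += 1
```

Each function takes the text and a position and returns the parsed value plus the new position, which keeps the parser free of shared state and makes every rule easy to test on its own.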
JSON parsing is pretty simple. HTML, on the other hand, is a huge project, made more complicated by the absence of a versioned authoritative standard which makes it a bit of a moving target.
HTML parsing as currently defined by the "living standard" is a procedure which probably cannot be encapsulated in a context-free grammar. No real attempt is made to use grammatical descriptions in the standard, although it is possible to extract at least a lexical grammar, if you ignore the sections dealing with the handling of lexical errors.
Certainly, you could write a parser for a well-behaved subset, but that parser might not cope well with many of the "HTML" documents you will want to process. Personally, for learning purposes, I'd suggest trying your hand at XML. (Also see XML Namespaces.)

Get Map value like plain old Javascript objects

I'm new to Immutable.js, so this is a very trivial question.
It looks like I can't get a Map value like with plain old Javascript objects, e.g. myMap.myKey. Apparently I have to write myMap.get('myKey')
I am very surprised by this behavior. Is there a reason for that? Is there any extension to Immutable.js which would allow me to type myMap.myKey?
Came back to elaborate on my comment, but SO doesn't allow that after certain time. Converting it into an answer.
The question you have asked has come up several times from people who are new to Immutable, yours truly included. It's in one of the rants I wrote a while ago.
It starts to make sense when you look at it from an immutability perspective. If you expose value types as your own properties, they won't be immutable because they are value types and could be assigned to.
Nonetheless, it's frustrating to spread these getters all across your components/views. If you can afford it, you should try to use the Record type. It offers traditional access to members (except in IE 8). Better still, you can extend from this type and add helper getters/setters (e.g. user.getName(), user.setName('thebat') instead of user.get('name')/set('name', 'thebat')) to abstract your model's internal structure from your views. However, there are challenges to overcome, like nested structures and de-serialization of objects.
If the above is not your cup of tea, I'd recommend swallowing the bitter pill :).
I think you are missing the concept Immutable was built on:
Immutable data cannot be changed once created, leading to much simpler
application development, no defensive copying, and enabling advanced
memoization and change detection techniques with simple logic.
Persistent data presents a mutative API which does not update the data
in-place, but instead always yields new updated data.
One way or another you may transform Immutable data structures to plain old JS objects as: myMap.toJS()
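For what it's worth, the two ideas discussed here (plain property access on a fixed-shape record, and "updates" that return new data instead of changing the original) can be sketched outside Immutable.js as well. The following is a Python analogue using a frozen dataclass, not Immutable.js code; the User class and with_name helper are made up for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class User:
    name: str
    age: int

    def with_name(self, name: str) -> "User":
        # Returns a new User; the original instance is left untouched.
        return replace(self, name=name)


alice = User(name="alice", age=30)
renamed = alice.with_name("thebat")

# Reading works like a plain object (alice.name), but assignment is refused.
try:
    alice.name = "mallory"
    assignment_blocked = False
except FrozenInstanceError:
    assignment_blocked = True
```

Because the set of fields is fixed up front, attribute access is safe to offer here, which is roughly the trade-off a record type makes compared with a general-purpose map.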

What's the use of abstract syntax trees?

I am learning on my own about writing an interpreter for a programming language, and I have read about Abstract Syntax Trees. I have an idea of what they are, but I do not see their use.
Why are ASTs useful?
They represent the logic/syntax of the code, which is naturally a tree rather than a list of lines, without getting bogged down in concrete syntax issues such as where you place your asterisk.
The logic can then be manipulated in a manner more consistent and convenient from the backend's POV, which can be (and is, for everything but Lisps ;) very different from how we write the concrete syntax.
The main benefit of using an AST is that you separate the parsing and validation logic from the implementation piece. Interpreters built around ASTs really are easier to understand and maintain. If you have a problem parsing some strange syntax, you look at the AST parser; if a piece of code is not producing the expected results, then you look at the code that interprets the AST.
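A small sketch of that separation, in Python, for a toy language of integers with + and * only. The node classes and function names are my own invention. The parse function knows about tokens and precedence; the evaluate function knows nothing about text at all.

```python
import re
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Num:
    value: int


@dataclass(frozen=True)
class BinOp:
    op: str  # "+" or "*"
    left: "Node"
    right: "Node"


Node = Union[Num, BinOp]


def parse(source: str) -> Node:
    # Toy tokenizer: characters it does not recognise are silently skipped.
    tokens = re.findall(r"[0-9]+|[+*()]", source)
    node, pos = _sum(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return node


def _sum(tokens, pos):
    node, pos = _product(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = _product(tokens, pos + 1)
        node = BinOp("+", node, right)
    return node, pos


def _product(tokens, pos):
    node, pos = _atom(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "*":
        right, pos = _atom(tokens, pos + 1)
        node = BinOp("*", node, right)
    return node, pos


def _atom(tokens, pos):
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        node, pos = _sum(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return node, pos + 1
    if tok.isdigit():
        return Num(int(tok)), pos + 1
    raise SyntaxError(f"unexpected token {tok!r}")


def evaluate(node: Node) -> int:
    # The interpreter only walks the tree; parentheses and spacing are gone.
    if isinstance(node, Num):
        return node.value
    left, right = evaluate(node.left), evaluate(node.right)
    return left + right if node.op == "+" else left * right
```

If "1 + 2 * 3" comes out wrong, you can print the tree returned by parse and see immediately whether the bug is in the parser (wrong tree) or in evaluate (right tree, wrong answer).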
The other great advantage is when your syntax requires "lookahead", e.g. if your syntax allows a subroutine to be used before it is defined, it is trivial to validate the existence of a subroutine when you are using an AST; it's much more difficult with an "on the fly" parser.
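Here is roughly what that check looks like once the whole program is available as a tree. This is a Python sketch with made-up node types (SubDef and Call), assuming the parser has already produced them: one pass collects every definition, a second pass checks every call, so the order in the source no longer matters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    name: str


@dataclass(frozen=True)
class SubDef:
    name: str
    body: tuple  # nested Call / SubDef nodes


def undefined_calls(program):
    """Return the names of called subroutines that are never defined."""
    # Pass 1: collect all top-level definitions, wherever they appear.
    defined = {node.name for node in program if isinstance(node, SubDef)}
    missing = []

    # Pass 2: walk every node and check each call against the full set.
    def visit(node):
        if isinstance(node, Call):
            if node.name not in defined:
                missing.append(node.name)
        else:
            for child in node.body:
                visit(child)

    for node in program:
        visit(node)
    return missing
```

For example, a program that calls main before defining it passes the check, while a call to a subroutine that is never defined anywhere is reported.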
You need "syntax trees" to represent the structure of most programming languages, in order to carry out analysis or transformation on documents that contain programming language text. (You can see some fancy examples of this via my bio).
Whether that tree is abstract (AST) or concrete (CST) is a matter of taste, convenience, and engineering sweat. The term CST is specifically used to describe the parse derivation tree when a grammar is used to deconstruct source code; it usually contains tree elements for lots of concrete syntax such as statement terminator semicolons. AST is used to mean "something simpler than the CST", e.g., leaving out semicolon tree nodes because they don't affect program analysis much, and thus writing analyzers that process ASTs is less conceptual and engineering effort than writing the same analyzer on a CST. A better way to understand this is to realize that the AST is usually an isomorphic equivalent of the CST, that is, you should be able to regenerate the CST from it. If you want to transform the source text and regenerate it, then the CST is often a better choice as it loses less information from the original program (and my fancy example uses this approach).
I think you will find the SO discussion on abstract vs. concrete syntax trees pretty helpful.
In general you are going to parse your code into some form of AST; it may be more or less of a formal model. So I think what Kirk Woll was getting at by his comment above is that when you parse the language, you very often use the parser to create some sort of data model of the raw content of what you are reading, generally organized in a tree fashion. So by that definition an AST is hard to avoid unless you are doing a very simple translator.
I use ANTLR often for parsing complex languages and in that context there is a slightly more specific meaning of an AST. ANTLR has a handy way of generating an AST in the parser grammar using pretty simple actions. You then write a generally much simpler parser for this AST, which you can operate on like a much simpler version of the language you are processing. Whether the extra work of building two parsers is a net gain is a function of the language complexity and what you are planning on doing with it once you have parsed it.
A good book on the subject that you may want to take a look at is "Language Implementation Patterns" by Terence Parr, the ANTLR author. He addresses this topic pretty thoroughly. That said, I didn't really get ASTs until I started using them, so that (as usual) is the best way to understand them.
Late to the question but I thought I'd add something. You don't actually have to build an AST. It is possible to emit instructions directly as you parse the source code. In this case, the AST is implied in the parsing grammar. For simple languages, especially dynamically typed languages, this is a perfectly OK strategy. For more complex languages or where you need to further analyze the source code, an AST can be very useful. For example, if your language is statically typed, i.e. your variables are declared with fixed types, then the AST can be used to check that you're not assigning the wrong type to a variable. E.g. assigning a string to a variable that is declared to hold an integer would be wrong, and this can be caught more conveniently with the AST.
Also, as others have mentioned, the AST offers a clean separation between syntax analysis and code generation and makes the code much more modular.
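The type-checking case mentioned above might look something like this. It is only a Python sketch with invented node types (Declare and Assign) for a language whose only types are "int" and "str" and whose assignments take literals; a real checker would also handle expressions and scopes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Declare:
    name: str
    type_name: str  # "int" or "str" in this toy language


@dataclass(frozen=True)
class Assign:
    name: str
    value: object  # a literal value produced by the parser


def type_errors(statements):
    """Walk the statement nodes in order and collect type errors."""
    declared = {}
    errors = []
    for stmt in statements:
        if isinstance(stmt, Declare):
            declared[stmt.name] = stmt.type_name
        elif stmt.name not in declared:
            errors.append(f"{stmt.name}: assignment to undeclared variable")
        else:
            actual = type(stmt.value).__name__
            expected = declared[stmt.name]
            if actual != expected:
                errors.append(f"{stmt.name}: cannot assign {actual} to {expected}")
    return errors
```

The check runs over the tree before any code is generated or executed, which is exactly the step that has nowhere to live when instructions are emitted directly during parsing.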

Cross-platform and language (de)serialization

I'm looking for a way to serialize a bunch of C++ structs in the most convenient way so that the serialization is portable across C++ and Java (at a minimum) and across 32bit/64bit, big/little endian platforms. The structures to be serialized just contain data, i.e. they're pure data objects with no state or behavior.
The idea is that we serialize the structs into an octet blob that we can store in a database "generically" and read out later on, thus avoiding changing the database whenever a struct changes and also avoiding assigning each data member to a field, i.e. we only want one table to hold everything "generically" as a binary blob. This should make less work for developers and require fewer changes when structures change.
I've looked at boost.serialize but don't think there's a way to enable compatibility with Java. And likewise for inheriting Serializable in Java.
If there is a way to do it by starting with an IDL file that would be best as we already have IDL files that describe the structures.
Cheers in advance!
I stumbled here, having a very similar question. 6 years later, this might not be useful to you, but hopefully it will be to others.
There are a lot of alternatives, unfortunately with no clear winner (although one could argue that JSON is the clear winner). Even Google has released multiple competing technologies (all of them apparently being used internally):
FlatBuffers: this one seems to meet the requirements from the original question, has interesting benchmarks and supports some form of IDL (I'm personally not familiar with IDL)
Protocol Buffers: mentioned previously.
XFJSON: 5%-12% smaller than JSON.
Not to forget the alternatives posted in the other answers. Here are a few more:
YAML: JSON minus all the double quotes, but using indentation instead. It's more human readable, but probably less efficient, especially as it gets larger.
BSON (Binary JSON)
MessagePack (Another compacted JSON)
With so many variations, JSON is clearly the winner in terms of simplicity/convenience and cross-platform access. It has gained even more popularity in the last couple years, with the rise of JavaScript. A lot of people probably use that as a de-facto solution, without giving it much thought (that's what I originally did :P).
However, if size becomes an issue, but you prefer to keep things simple and not use one of the more advanced libraries, you could just compress JSON using zlib (that's what I'm doing now), or some other cross-platform algorithm (but that's a whole other topic).
To speed up JSON handling in C++, you could also use RapidJSON.
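The compress-the-JSON option mentioned above is only a few lines in most languages. Here is a sketch in Python using the standard json and zlib modules; the pack and unpack names are mine. zlib's format is the same everywhere, so C++ and Java code can read the resulting blob with their own zlib bindings.

```python
import json
import zlib


def pack(record: dict) -> bytes:
    # Compact separators and sorted keys keep the text small and repeatable.
    text = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return zlib.compress(text.encode("utf-8"))


def unpack(blob: bytes) -> dict:
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

How much you save depends entirely on the data: repetitive records compress well, while very small ones can even grow slightly because of the zlib header.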
I'm surprised Jon Skeet hasn't already pounced on this one :-)
Protocol Buffers is pretty much designed for this sort of scenario -- passing structured data cross-language.
That said, if you're using a database the way you suggest, you really shouldn't be using a full-strength RDBMS like Oracle or SQL Server but rather a lightweight key-value store such as Berkeley DB or one of the many "cloud table" engines.
If I want to go really, really cross-language, I would normally suggest JSON, because of the ease of JavaScript support and the abundance of libraries, as well as it being human readable and modifiable (I prefer it to XML as I find it smaller in terms of chars, faster, and more readable). It's not the most efficient in terms of space, however, and a more machine-readable format like Protocol Buffers or Thrift would have advantages there (Thrift can be generated from an IDL, but it is also made for encoding services, so it could be heavier than you want).
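To illustrate the space trade-off and the endianness concern from the question, here is a hand-rolled fixed layout using Python's struct module. This is not the Protocol Buffers or Thrift wire format, just a sketch; the record fields (id, temperature, flags) are assumptions made up for the example.

```python
import struct

# Hypothetical record: id (uint32), temperature (float64), flags (uint16).
# The "<" prefix pins little-endian byte order and standard sizes with no
# padding, so the bytes are the same on 32-bit, 64-bit and big-endian hosts.
LAYOUT = struct.Struct("<IdH")


def encode(sensor_id: int, temperature: float, flags: int) -> bytes:
    return LAYOUT.pack(sensor_id, temperature, flags)


def decode(blob: bytes) -> tuple:
    return LAYOUT.unpack(blob)
```

The record always takes 14 bytes, noticeably less than the equivalent JSON text, but both sides must agree on the layout out of band, and any change to the fields breaks old readers. Schema-driven formats exist largely to manage that problem for you.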
You need ASN.1! (Some people refer to this as binary XML.) ASN.1 is very compact and thus ideal to transfer data between two systems. And for those who don't think this is ever used: several Internet protocols are based upon the ASN.1 model for data serialization!
Unfortunately, there aren't many libraries available for Java or C++ that will support ASN.1. I had to work with it several years ago and just couldn't find a good, free or inexpensive tool to allow support for ASN.1 in C++. At Objective Systems they are selling ASN.1/XML solutions but it's extremely expensive. (The ASN.1 compiler for C++ and Java, that is!) It costs you an arm and a leg at least! (But then you will have a tool that you can use with only one hand...)
I'd suggest saving the data with SQLite database. The structs can be stored as database rows in SQLite tables.
The resulting database file is binary compatible across many different platforms and can be stored as a BLOB in your main database. I believe the file size is comparable to compressed XML file with the same data, but memory usage during processing will be significantly less than XML DOM.
Why haven't you chosen XML? It suits your requirements perfectly. Both C++ and Java allow for an easy implementation.
Furthermore, I have doubts about your idea of storing everything as a blob in the database. Use a relational database for what it has been designed for, or switch to some object-oriented database like http://www.versant.com/en_US/products/objectdatabase which supports both Java and C++.
There is also Avro. See this question for a comparison of Apache Thrift, Protocol Buffers, mes and so on.