Why can you only prepend to lists in functional languages? - language-agnostic

I have only used 3 functional languages -- Scala, Erlang, and Haskell, but in all 3 of these, the correct way to build lists is to prepend the new data to the front and then reverse the list, instead of just appending to the end. Of course, you could append to the lists, but that results in an entirely new list being constructed.
Why is this? I could imagine it's because lists are implemented internally as linked lists, but why couldn't they just be implemented as doubly linked lists so you could append to the end with no penalty? Is there some reason all functional languages have this limitation?

Lists in functional languages are immutable / persistent.
Prepending to an immutable list is cheap because you only have to allocate a single node whose next pointer points to the head of the previous list. There is no need to change the original list: since it is only a singly linked list, anyone holding a pointer to the previous head cannot see the update.
Adding a node to the end of the list requires modifying the last node to point to the newly created node. That is not possible because the node is immutable. The only option is to create a new node which has the same value as the last node and points to the newly created tail. This process must repeat itself all the way up to the front of the list, resulting in a brand new list which is a copy of the first list except for the new tail node. Hence it is more expensive.
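To make the cost difference concrete, here is a minimal sketch in Python (used here only as neutral notation, it is not one of the languages from the question). The names Node, prepend, append and to_pylist are made up for the illustration and are not any library's API.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    """One immutable cell of a singly linked list (hypothetical type)."""
    value: int
    next: Optional["Node"]


def prepend(value, lst):
    # O(1): allocate one node that points at the existing list.
    return Node(value, lst)


def append(lst, value):
    # O(n): every node on the way to the end is copied,
    # because no existing node may be modified.
    if lst is None:
        return Node(value, None)
    return Node(lst.value, append(lst.next, value))


def to_pylist(lst):
    # Helper for inspection only: walk the cells into a Python list.
    out = []
    while lst is not None:
        out.append(lst.value)
        lst = lst.next
    return out


original = prepend(1, prepend(2, prepend(3, None)))  # 1 -> 2 -> 3
with_front = prepend(0, original)                    # 0 -> 1 -> 2 -> 3
with_back = append(original, 4)                      # 1 -> 2 -> 3 -> 4

shares_old_list = with_front.next is original   # old list reused untouched
copied_chain = with_back.next is not original.next  # fresh nodes were built
```

In both cases the original list still reads 1, 2, 3. The difference is only in how many new nodes had to be allocated.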

Because there is no way to append to a list in O(1) without modifying the original (which you don't do in functional languages).

Because it's faster
They certainly could support appending, but it's so much faster to prepend that they limit the API. It's also kind of non-functional to append, as you must then modify the last element or create a whole new list. Prepend works in an immutable, functional, style by its nature.

That is the way in which lists are defined. A list is defined as a linked list terminated by a nil; this is not just an implementation detail. This, coupled with the fact that these languages have immutable data (at least Erlang and Haskell do), means that you cannot implement them as doubly linked lists. Adding an element would then modify the list, which is illegal.

By restricting list construction to prepending, it means that anybody else that is holding a reference to some part of the list further down, will not see it unexpectedly change behind their back. This allows for efficient list construction while retaining the property of immutable data.
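The prepend-then-reverse idiom from the question could look roughly like this. It is a sketch in Python with cons cells modelled as (head, tail) tuples; cons, reverse, squares_up_to and to_pylist are illustrative names, not taken from any of the languages discussed.

```python
def cons(head, tail):
    # A cell is an immutable (head, tail) pair; None is the empty list.
    return (head, tail)


def reverse(lst):
    # One O(n) pass that itself only ever prepends.
    acc = None
    while lst is not None:
        head, lst = lst
        acc = cons(head, acc)
    return acc


def squares_up_to(n):
    # Accumulate by prepending (O(1) per element), then reverse once.
    acc = None
    for i in range(1, n + 1):
        acc = cons(i * i, acc)
    return reverse(acc)


def to_pylist(lst):
    # Helper for inspection only.
    out = []
    while lst is not None:
        head, lst = lst
        out.append(head)
    return out


result = squares_up_to(4)
tail_seen_by_someone_else = result[1]   # a reference "further down" the list
longer = cons(0, result)                # building on top changes nothing below
```

The total work is O(n) for the prepends plus O(n) for the single reverse, and the holder of tail_seen_by_someone_else still sees exactly the same cells after longer is built.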

Related

Is there a programming term for selecting or inserting similar to "upsert"?

Is there a programming term for a widget/service that has the ability to either select an existing item or create/insert a new item? Maybe like how "upsert" is a combination of updating existing or inserting new?
Edit
Sorry, maybe my use of "upsert" as an example term is causing confusion. I'm looking for a term that describes (selecting or creating), not (creating or updating). For example, a widget that can query existing contacts to assign to a company, but also allows ad-hoc creation of contacts as well. I'm hoping to use the term in the name of such widgets to succinctly describe their function.
In web services based on representational state transfer (REST), the PUT method has semantics which can mean create or modify. Based on this, I would propose put as a possibility.
This also happens to match Java usage of the word: consider the put method on Hashtable. It maps a key to a value and returns the old value corresponding to the key, if it existed. This implies it can either exist or not already, and the action will succeed. Mozilla uses the same semantics for put on IDBObjectStore accessed in JavaScript.
It seems like where put is used for updating a collection, it typically has the meaning of create or update.
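For illustration, the create-or-update behaviour described for put could be sketched like this in Python. The put function and the contacts table are hypothetical and only mirror the "return the old value if there was one" behaviour mentioned above.

```python
def put(table, key, value):
    """Create or replace the entry for key; return the previous value or None."""
    previous = table.get(key)
    table[key] = value
    return previous


contacts = {}
first = put(contacts, "alice", "alice@example.com")    # key did not exist yet
second = put(contacts, "alice", "alice@example.org")   # key already existed
```

Either call succeeds, and the caller can tell from the return value which of the two cases happened.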
Based on the clarification - something like provide, specify, define or identify might be acceptable options for communicating the contents of a collection created by giving existing instances and/or creating new ones on the spot. That is, I can provide, specify, define or identify a real number by making reference to some well-known ones (pi, sqrt(2), etc.) or by creating a new one (e.g., by listing some digits or describing how it is computed).

Is there an implementation of a binary search tree annotated with sub-tree size

I have been researching the tree data structure described at this link (near the bottom):
http://sigpipe.macromates.com/2009/08/13/maintaining-a-layout/
It is mentioned that this data structure could be a finger tree. However, after more research around finger trees, I've found that this lacks the "fingers" that make finger trees finger trees. Instead, it seems this is just an annotated binary tree (annotated with subtree size).
Do you know of an existing implementation (in any language) of this data-structure that I could use as a reference for my own implementation (though, preferably not an implementation in a functional programming language)?
Or, what would be the most optimal way of retrofitting the subtree size annotations into an existing tree data-structure?
Thanks!
Simon Tatham's Counted B-Trees are similar. If the node count is replaced with a width of buffer, as in tweak, these provide operations like ropes.
In fact, from reading the page you reference, I see that it was being used like a piece table or line table for an editor.
In the paper "Positional Delta Trees to reconcile updates with read-optimized data storage", the authors present a tree whose behavior, with regard to the invariants it holds between its nodes, bears a striking resemblance to Xanadu's enfilades, to which the Counted B-tree is also similar.
I've got a project on github called Boost.Intrusive Annotated Trees that aims to provide generic support for annotations like subtree count in Boost.Intrusive. Subtree count was my original use case for it.
Currently it requires C++11 variadic templates and only supports the rbtree, but it works, and I hope to remove both of those restrictions in time.
Update: Now builds with C++03. Still only supports rbtree.
When used with a subtree count annotation it's similar to what jordan describes in the answer above - it calculates (left+right+1) at each node. The implementation is quite different - it works with any node and/or value traits; the annotation updates are integrated into the rbtree algorithms instead, which keeps the number of recalculations done minimal.
I've implemented something similar based on a question I asked the other day. I added annotations to the boost::intrusive::rbtree/avltree nodes to calculate the size of each subtree (for each node, count = node->left->count + node->right->count + 1). I perform this update on insertion/deletion/rebalance of the tree by using the boost value_traits hook for set_parent, set_left, and set_right. Pretty much, as stated in the site you referenced, after each node update, update the current node's size and then traverse up the tree until you hit the root, updating each node's size as you go.
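As a language-neutral reference, the update rule described above (recompute count = left + right + 1, then walk up to the root) can be sketched in Python. This is a plain unbalanced binary search tree, not the Boost.Intrusive code; Node, update_to_root, insert and select are hypothetical names, and a balanced tree would additionally have to recompute counts after rotations.

```python
class Node:
    def __init__(self, key, parent=None):
        self.key = key
        self.parent = parent
        self.left = None
        self.right = None
        self.count = 1          # number of nodes in the subtree rooted here


def size(node):
    return node.count if node is not None else 0


def update_to_root(node):
    # Recompute count = left + right + 1 here and at every ancestor.
    while node is not None:
        node.count = size(node.left) + size(node.right) + 1
        node = node.parent


def insert(root, key):
    """Insert key (no rebalancing) and return the root of the tree."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key, node)
                break
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key, node)
                break
            node = node.right
    update_to_root(node)
    return root


def select(node, k):
    """Return the k-th smallest key (0-based) using the size annotations."""
    while node is not None:
        left = size(node.left)
        if k < left:
            node = node.left
        elif k == left:
            return node.key
        else:
            k -= left + 1
            node = node.right
    raise IndexError("index out of range")


root = None
for key in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, key)
in_order = [select(root, i) for i in range(size(root))]
```

With the counts in place, select finds the element at a given rank by descending one path from the root, so its cost is proportional to the height of the tree.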
The problem comes when you want to insert into the tree at a specific position. Pretty much the moment you do this, you'll invalidate the key-ordering invariant for the tree structure. This means you won't be able to perform efficient O(log n) lookups by key. But, if you wanted that, you probably wouldn't need the size annotations anyway.
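A rough sketch of what positional insertion looks like once keys are dropped and only the size annotations steer the descent. This is Python again, unbalanced for brevity, with hypothetical names (Node, insert_at, get), and it does not validate the index.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None
        self.count = 1          # number of nodes in the subtree rooted here


def size(node):
    return node.count if node is not None else 0


def insert_at(node, index, value):
    """Insert value so it ends up at position index; return the subtree root."""
    if node is None:
        return Node(value)
    left = size(node.left)
    if index <= left:
        node.left = insert_at(node.left, index, value)
    else:
        # Skip the left subtree and this node when descending right.
        node.right = insert_at(node.right, index - left - 1, value)
    node.count = size(node.left) + size(node.right) + 1
    return node


def get(node, index):
    """Return the value stored at position index (0-based)."""
    while node is not None:
        left = size(node.left)
        if index < left:
            node = node.left
        elif index == left:
            return node.value
        else:
            index -= left + 1
            node = node.right
    raise IndexError("index out of range")


root = None
for index, value in [(0, "x"), (0, "q"), (2, "m"), (1, "z")]:
    root = insert_at(root, index, value)
items = [get(root, i) for i in range(size(root))]
```

The stored values end up in whatever order the positions dictate, which is exactly why key-based lookup is no longer available on such a tree.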

Namespaces and records in erlang

Erlang obviously has a notion of namespace, we use things like application:start() every day.
I would like to know if there is such a thing as namespace for records. In my application I have defined record user. Everything was fine until I needed to include rabbit.hrl from RabbitMQ which also defines user, which is conflicting with mine.
Online search didn't yield much to resolve this. I have considered renaming my user record and prefixing it with something, say "myapp_user". This will fix this particular issue, until I suspect I hit another conflict say with my record "session".
What are my options here? Is adding a prefix myapp_ to all my records a good practice, or is there a real support for namespaces with records and I am just not finding it?
EDIT: Thank you everyone for your answers. What I've learned is that the records are global. The accepted answer made it very clear. I will go with adding prefixes to all my records, as I have expected.
I would argue that Erlang has no namespaces whatsoever. Modules are global (with the exception of a very unpopular extension to the language), names are global (either to the node or the cluster), pids are global, ports are global, references are global, etc.
Everything is laid flat. The namespacing in Erlang is thus done by convention rather than by any other means. This is why you have <appname>_app, <appname>_sup, etc. as module names. The registered processes also likely follow that pattern, and ETS tables, and so on.
However, you should note that records themselves are not global things: as JUST MY correct OPINION has put it, records are simply a compiler trick over tuples. Because of this, they're local to a module definition. Nobody outside of the module will see a record unless they also include the record definition (either by copying it or with a header file, the latter being the best way to do it).
Now I could argue that because you need to include .hrl files and record definitions on a per-module basis, there is no such thing as namespacing records; they're rather scoped in the module, like a variable would be. There is no reason to ever namespace them: just include the right one.
Of course, it could be the case that you include record definitions from two modules, and both records have the same name. If this happens, renaming the records with a prefix might be necessary, but this is a rather rare occurrence in my experience.
Note that it's also generally a bad idea to expose records to other modules. One of the problems of doing so is that all modules depending on yours now have to include its .hrl file. If your module then changes the record definition, you will have to recompile every other module that depends on it. A better practice is to implement functions to interact with the data. Note that get(Key, Struct) isn't always a good idea. If you can pick meaningful names (age, name, children, etc.), your code and API should make more sense to readers.
You'll either need to name all of your records in a way that is unlikely to conflict with other records, or you need to just not use them across modules. In most circumstances I'll treat records as opaque data structures and add functionality to the module that defines the record to access it. This will avoid the issue you've experienced.
I may be slapped down soundly by I GIVE TERRIBLE ADVICE here with his deeper knowledge of Erlang, but I'm pretty sure there are no namespaces for records in Erlang. The record name is just an atom grafted onto the front of the tuple that the compiler builds for you behind the scenes. (Records are pretty much just a hack on tuples, you see.) Once compiled there is no meaningful "namespace" for a record.
For example, let's look at this record.
-record(branch, {element, priority, left, right}).
When you instantiate this record in code...
#branch{element = Element, priority = Priority, left = nil, right = nil}.
...what comes out the other end is a tuple like this:
{branch, Element, Priority, nil, nil}
That's all the record is at this point. There is no actual "record" object and thus namespacing doesn't really make any sense. The name of the record is just an atom tacked onto the front. In Erlang it's perfectly acceptable for me to have that tuple and another that looks like this:
{branch, Twig, Flower}
There's no problem at the run-time level with having both of these.
But...
Of course there is a problem having these in your code as records since the compiler doesn't know which branch I'm referring to when I instantiate. You'd have to, in short, do the manual namespacing you were talking about if you want the records to be exposed in your API.
That last point is the key, however. Why are you exposing records in your API? The code I took my branch record from uses the record as a purely opaque data type. I have a function to build a branch record and that is what will be in my API if I want to expose a branch at all. The function takes the element, priority, etc. values and returns a record (read: a tuple). The user has no need to know about the contents. If I had a module exposing a (biological) tree's structure, it too could return a tuple that happens to have the atom branch as its first element without any kind of conflict.
Personally, to my tastes, exposing records in Erlang APIs is code smell. It may sometimes be necessary, but most of the time it should remain hidden.
There is only one record namespace and, unlike functions and macros, there can only be one record with a given name. However, for record fields there is one namespace per record, which means that there are no problems in having fields with the same name in different records. This is one reason why the record name must always be included in every record access.

Quickest way to represent array in mysql for retrieval

I have an array of php objects that I want to store into a mysql database table. The only way I can think of is just have a table to represent the object with a unique id and a separate table to store the array (there could be a column array_id and an object_id) but retrieving would require a join I believe which could get expensive. Is there a better way? I don't care much about storage space or insertion time as much as retrieval time.
I don't necessarily need this to work for associative arrays but if the solution could, that would be preferred.
Building a tree structure (read as Array) in mysql can be tricky but it is done all of the time. Almost any forum with nested threads has some mechanism to store a tree structure. As another poster said they do not have to be expensive.
The real question is how you want to use the data. If you need to be able to add/remove data fields from individual nodes in the tree then you can use one of two models
1) Adjacency List Model
2) Modified Preorder Tree Traversal Algorithm
(They sound scary, but it's not that bad I promise.)
The first one listed is probably the more common you will encounter and the second is the one I have begun to use more frequently and has some nice benefits once you wrap your head around it. Take a look at this page--it has an EXCELLENT writeup about both.
http://articles.sitepoint.com/article/hierarchical-data-database
As another poster said though, if you don't need to change the data with queries or search inside the text then use a PHP function to store it in a single field.
$array = array('something'=>'fun', 'nothing'=>'to do');
$storage_array = serialize($array);
//INSERT INTO DB
//DRAW OUT OF DB
$array = unserialize($row['stored_array']);
Presto-changeo, that one is easy.
If you are comfortable with not being able to SQL search through the data within the array, you could add a single column to the table, and serialize the array into it. You would have to deserialize it on retrieval.
You could use JSON / PHP serialization or whatever is more appropriate for the language you're developing in.
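A minimal sketch of the single-column approach with JSON. To keep it runnable anywhere it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the items table and stored_array column are made-up names. The same idea carries over to PHP with json_encode / json_decode.

```python
import json
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, stored_array TEXT)")

data = {"something": "fun", "nothing": "to do"}

# Serialize on the way in ...
conn.execute(
    "INSERT INTO items (id, stored_array) VALUES (?, ?)",
    (1, json.dumps(data)),
)

# ... and deserialize on the way out: one row, no join.
row = conn.execute(
    "SELECT stored_array FROM items WHERE id = ?", (1,)
).fetchone()
restored = json.loads(row[0])
```

Retrieval is a single primary-key lookup, at the price that the database cannot search or index the individual elements.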
Joins don't have to be so expensive - you can define an index.
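For comparison, here is a sketch of the two-table layout from the question with an index on the lookup columns. It again uses sqlite3 as a stand-in for MySQL. The table names, the position column and the index name are assumptions added for the example.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE objects (object_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE array_items (
        array_id  INTEGER NOT NULL,
        position  INTEGER NOT NULL,   -- hypothetical: keeps element order
        object_id INTEGER NOT NULL
    );
    -- The index lets the database find one array's rows without a full scan.
    CREATE INDEX idx_array_items ON array_items (array_id, position);
""")

conn.executemany(
    "INSERT INTO objects VALUES (?, ?)",
    [(1, "apple"), (2, "banana"), (3, "cherry")],
)
conn.executemany(
    "INSERT INTO array_items VALUES (?, ?, ?)",
    [(10, 0, 3), (10, 1, 1), (11, 0, 2)],
)

rows = conn.execute(
    """
    SELECT o.name
    FROM array_items AS a
    JOIN objects AS o ON o.object_id = a.object_id
    WHERE a.array_id = ?
    ORDER BY a.position
    """,
    (10,),
).fetchall()
names = [name for (name,) in rows]
```

The join touches only the rows of the requested array, and the objects table is reached through its primary key.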

What's the best way to model an unordered list (i.e., a set)?

What's the most natural way to model a group of objects that form a set? For example, you might have a bunch of user objects who are all subscribers to a mailing list.
Obviously you could model this as an array, but then you have to order the elements and whoever is using your interface might be confused as to why you're encoding arbitrary ordering data.
You can use a hash where the members are keys that map to "1" or "true", but in most languages there are restrictions on what data types a hash key can be.
What's the standard way to do this in modern languages (PHP, Perl, Ruby, Python, etc)?
In Python, you would use the set datatype. A set can contain any hashable object, so if you have a custom class you need to store in a set and the default hashing behaviour is not appropriate, you can implement __hash__ (together with a matching __eq__) to get the behaviour you want.
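A short sketch of what that looks like for the mailing-list example. The EmailAddress class and the choice to compare addresses case-insensitively are assumptions made for the illustration.

```python
class EmailAddress:
    def __init__(self, address):
        self.address = address

    def _key(self):
        # Assumption for this sketch: addresses compare case-insensitively.
        return self.address.lower()

    def __eq__(self, other):
        if not isinstance(other, EmailAddress):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Objects that compare equal must also hash equal.
        return hash(self._key())


subscribers = {EmailAddress("Ann@example.com"), EmailAddress("bob@example.com")}
subscribers.add(EmailAddress("ann@EXAMPLE.com"))   # duplicate, set is unchanged
is_member = EmailAddress("BOB@example.com") in subscribers
```

The set carries no ordering information, and membership tests do not require scanning all subscribers.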
C# has the HashSet<T> generic collection.
public class EmailAddress // probably needs to override GetHashCode()
{
...
}
var addresses = new HashSet<EmailAddress>();
Most modern languages are going to have some form of Set data structure. Java has HashSet, which implements the Set interface.
In PHP you can use an array to store your data. Either search the array before you add a new element, or use array_unique to remove duplicates after inserting all elements.
In C, as a stand-in for understanding the machine directly:
For small, discrete and well defined ranges: use a bitwise array to indicate the presence of each possible item (set for present, unset for absent).
Use a hash-table for all other cases.
Write functions to implement adding and removing items, testing for presence or absence, testing for sub-sets, etc as needed.
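The bitwise-array variant can be sketched as follows. The answer talks about C; this sketch uses a Python integer as the bit array purely to keep it short, and the function names are made up. Items are assumed to be small non-negative integers.

```python
def add(bits, item):
    # Set the bit for item (present).
    return bits | (1 << item)


def remove(bits, item):
    # Clear the bit for item (absent).
    return bits & ~(1 << item)


def contains(bits, item):
    return bool((bits >> item) & 1)


def is_subset(a, b):
    # a is a subset of b when a has no bit that b lacks.
    return (a & ~b) == 0


def union(a, b):
    return a | b


def intersection(a, b):
    return a & b


evens = 0
for n in (0, 2, 4, 6):
    evens = add(evens, n)

small = add(add(0, 2), 4)
without_four = remove(small, 4)
```

Union, intersection and subset tests are each a single bitwise operation on the whole set, which is what makes this representation attractive for small, dense ranges.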
As the other answers note, however, if you just want the functionality, use a language feature or third-party library that is already well debugged.
A lot of the time hash-based sets are the correct thing to use, but if you don't need to do key-based lookups and don't worry about enforcing unique values, a vector or list is fine. There is overhead to a hash table, after all.
You seem to be concerned that people will think that the order in the vector is important, but I think that it is a common enough usage that, with documentation, you shouldn't confuse people.
It really depends on how you want to access and use the data.
An array is usually the simplest way to store data when there are no other requirements. Other data types are usually chosen for specific reasons (you want to append data, you want to search data in constant time, you need quick set union/intersection, etc.). If your only concern is the abstraction, you could wrap it in some kind of unordered facade.
In Perl I would use a hash, definitely. In other languages I would lament the lack of a hash.