Triple Stores vs Relational Databases [closed] - relational-database

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I was wondering what are the advantages of using Triple Stores over a relational database?

The viewpoint of the CTO of a company that extensively uses RDF Triplestores commercially:
Schema flexibility - it's possible to do the equivalent of a schema change to an RDF store live, and without any downtime, or redesign - it's not a free lunch, you need to be careful with how your software works, but it's a pretty easy thing to do.
More modern - RDF stores are typically queried over HTTP, so it's very easy to fit them into service architectures without hacky bridging solutions or performance penalties. They also handle internationalised content better than typical SQL databases - e.g. you can have multiple values in different languages.
Standardisation - the level of standardisation of implementations using RDF and SPARQL is much higher than SQL. It's possible to swap out one triplestore for another, though you have to be careful you're not stepping outside the standards. Moving data between stores is easy, as they all speak the same language.
Expressivity - it's much easier to model complex data in RDF than in SQL, and the query language makes it easier to do things like LEFT JOINs (called OPTIONAL in SPARQL). Conversely though, if your data is very tabular, then SQL is much easier.
Provenance - SPARQL lets you track where each piece of information came from, and you can store metadata about it, letting you easily run sophisticated queries that only take into account data from certain sources, with a certain trust level, or from some date range, etc.
There are downsides though. SQL databases are generally much more mature and have more features than typical RDF databases. Things like transactions are often much cruder, or non-existent. Also, the cost per unit of information stored in RDF vs. SQL is noticeably higher. It's hard to generalise, but it can be significant if you have a lot of data - though at least in our case it's an overall financial benefit, given the flexibility and power.
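The expressivity point above says SPARQL's OPTIONAL plays the role of SQL's LEFT JOIN. A minimal SQLite sketch of the correspondence (table and column names are made up for illustration): rows with no match still appear, with NULL where SPARQL would leave the variable unbound.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emails (person_id INTEGER, email TEXT);
    INSERT INTO people VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO emails VALUES (1, 'alice@example.org');
""")

# SPARQL analogue: SELECT ?name ?email WHERE { ?p :name ?name OPTIONAL { ?p :email ?email } }
rows = conn.execute("""
    SELECT p.name, e.email
    FROM people p LEFT JOIN emails e ON e.person_id = p.id
    ORDER BY p.id
""").fetchall()
print(rows)  # [('Alice', 'alice@example.org'), ('Bob', None)]
```

Bob has no email, but he still appears in the result set, just as he would stay bound in a SPARQL OPTIONAL block.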

Both commenters are correct, especially since Semantic Web is not a database, it's a bit more general than that.
But I guess you might mean triple store, rather than Semantic Web in general, as triple store v. relational database is a somewhat more meaningful comparison. I'll preface the rest of my answer by noting that I'm not an expert in relational database systems, but I have a little bit of knowledge about triple stores.
Triple (or quad) stores are basically databases for data on the semantic web, particularly RDF. That's kind of where the similarity between triple stores and relational databases ends. Both store data, both have query languages, both can be used to build applications on top of; so I guess if you squint your eyes, they're pretty similar. But the type of data each stores is quite different, so the two technologies optimize for different use cases and data structures, and they're not really interchangeable.
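To make the data-model difference concrete, here is a toy in-memory triple store in Python. Everything below is an illustrative sketch (made-up data, no real triplestore works like this internally): data is just a set of (subject, predicate, object) statements, and a query is a pattern with wildcards, roughly like a single SPARQL triple pattern.

```python
def match(triples, s=None, p=None, o=None):
    """Return all (subject, predicate, object) triples matching the pattern;
    None acts as a wildcard, like a variable in a SPARQL basic graph pattern."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

triples = {
    ("alice", "knows", "bob"),
    ("alice", "name", "Alice"),
    ("bob", "name", "Bob"),
}

# "Schema changes" are just new predicates -- no ALTER TABLE needed:
triples.add(("alice", "email", "alice@example.org"))

print(match(triples, s="alice"))  # everything we know about alice
print(match(triples, p="name"))   # all name statements
```

Note there is no fixed set of columns: adding a new kind of fact is just adding a triple, which is the schema flexibility the first answer describes.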
A lot of people have done work on overlaying a triples view of the world on top of a relational database, and that can work, but it will also be slower than a system dedicated to storing and retrieving triples. Part of the problem is that SPARQL, the standard query language used by triple stores, can require a lot of self joins, something relational databases are not optimized for. If you look at benchmarks such as SP2B, you can see that Oracle, which just overlays SPARQL support on its relational system, runs in the middle or at the back of the pack when compared with systems that more natively support RDF.
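The self-join problem can be sketched with SQLite (an illustrative setup, not how Oracle's RDF layer actually works): store triples in a single three-column table, and each extra pattern in the query becomes another join against that same table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("alice", "knows", "bob"),
    ("bob", "name", "Bob"),
])

# SPARQL analogue: SELECT ?name WHERE { :alice :knows ?x . ?x :name ?name }
rows = conn.execute("""
    SELECT t2.o
    FROM triples t1
    JOIN triples t2 ON t2.s = t1.o   -- one self-join per extra triple pattern
    WHERE t1.s = 'alice' AND t1.p = 'knows' AND t2.p = 'name'
""").fetchall()
print(rows)  # [('Bob',)]
```

A realistic SPARQL query with ten triple patterns becomes a ten-way self-join over the same table, which is the access pattern relational optimizers tend to handle poorly.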
Of course, the RDF systems would probably get crushed by Oracle if they were doing SQL queries over relational data. But that's kind of the point, you pick the tool that's well suited for the application you want to build.
So if you're thinking about building a semantic web application, or just trying to get some familiarity in the area, I'd recommend ultimately going with a dedicated triple store.
I won't delve into reasoning and how that plays into query answering in triple stores, as that's yet another discussion, but it's another important distinction between relational systems and triple stores that do reasoning.

Some triplestores (Virtuoso, Jena SDB) are based on relational databases and simply provide an RDF / SPARQL interface. So to rephrase the question slightly: are triplestores built from the ground up as triplestores more performant than those that aren't? #steve-harris definitely knows the answer to that ;) but I wager a yes.
Secondly, what features do triplestores have that RDBMSs don't? The simple answer is support for SPARQL, RDF, OWL, etc. (i.e. the Semantic Web technology stack), and to make it a fair fight, it's better to define the value of SPARQL based on SPARQL 1.1 (it has considerably more features than 1.0). This provides support for federation (so, so cool), property path expressions and entailment regimes, along with a standard set of update and graph management protocols (which SPARQL 1.0 didn't have and sorely lacked). Also, #steve-harris points out that transactions are not part of the standard (a can of worms), although many vendors provide non-standardised mechanisms for transactions (Virtuoso supports JDBC and Hibernate compliant connection pooling and management, along with all the transactional features of Hibernate).
The big drawback in my mind is that not many triplestores support all of SPARQL 1.1 (since it is still not in recommendation) and this is where the real benefits lie.
Having said that, I am and always have been an advocate of substituting RDBMS with triplestores and platforms I deliver run entirely off triplestores (Volkswagen in my last role was an example of this), deprecating the need for RDBMS. An additional advantage is that Object to RDF mapping is more flexible and provides more options and flexibility than traditional ORM (also known as putting a square peg in a round hole).

Also you can still use a database but use RDF as a data exchange format which is very flexible.


Why are there so many data structures absent from high level languages?

Why is it that higher level languages (Javascript, PHP, etc.) don't offer data structures such as linked lists, queues, binary trees, etc. as part of their standard library? Is it for historical/practical/cultural reasons, or is there something more fundamental that I'm missing?
linked lists
You can implement a linked list fairly easily in most dynamic languages, but they aren't that useful. For most cases, dynamic arrays (which most dynamic languages have built-in support for) are a better fit: they have better memory usage and cache coherence, better performance on indexed lookup, and decent performance on insert and delete. At the application level, there aren't many use cases where you really need a linked list.
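To back the claim that rolling your own is easy in a dynamic language, here is a minimal singly linked list sketch in Python (names are made up for illustration):

```python
class Node:
    """One cell of a singly linked list: a value plus a reference onward."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def to_list(head):
    """Walk the chain from head and collect the values."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out

head = Node(1, Node(2, Node(3)))
print(to_list(head))  # [1, 2, 3]
```

In practice, a built-in dynamic array (a Python list, a Javascript array) does the same job with better locality, which is exactly why the standard libraries don't bother.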
queues
Easily implemented using a dynamic array.
binary trees
Binary trees are easily implemented in most dynamic languages. Binary search trees are, as ever, a chore to implement, but rarely needed. Hashtables will give you about the same performance and often better for many use cases. Most dynamic languages provide a built-in hashtable (or dictionary, or map, or table, or whatever you want to call it).
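For illustration, here are the Python built-ins that cover these cases: collections.deque as a queue, and a dict as the hashtable that usually stands in for a binary search tree (sample data is made up).

```python
from collections import deque

q = deque(["first", "second"])  # double-ended queue with O(1) ends
first = q.popleft()             # O(1) FIFO dequeue
print(first)                    # first

index = {}                      # dict as hashtable: average O(1) insert/lookup
index["alice"] = 30
print("alice" in index, index["alice"])  # True 30
```

A dict gives up the sorted iteration a balanced search tree provides, but for plain key lookup, which is the common application-level need, it is at least as fast.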
It's definitely important to know these foundational data structures, and if you're writing low-level code, you'll find yourself using them. At the application level where most dynamic languages are used though, hashtables and dynamic arrays really do cover 95% of your data structure needs.
There are many definitions of "high level" when it comes to languages, but I will take it in this context to refer to languages that are purpose-built for a specific domain (4GL anyone?). Such "specific domains" are typically restricted in scope, for example: web page construction, report writing, database querying, etc. Within that limited scope, there is frequently little need for anything but the most basic data structures.
Is There a Need?
Let's consider the case of Javascript. The scope of this language was originally very bounded, being a scripting language that ran within the confines of a web browser. It was concerned primarily with providing a small amount of dynamic behaviour on otherwise static web pages. Furthermore, the limitations of the technology made it impractical to write large components in that environment (notably performance and the sandbox model).
Since Javascript was confined to addressing "small problems", there was little need for a rich set of data structures. As data structures go, Javascript's map is very flexible. You must remember that Basic and FORTRAN went a long way providing only arrays -- maps are considerably more flexible than that. Javascript appears to be undergoing a transformation, escaping the sandbox. Some very ambitious systems are being built in Javascript, both within and outside of the browser. And the technology is advancing to keep up with it (witness the new Javascript engines, persistence models, and so on). I anticipate that the demand for more interesting data structures will increase, and that the demand will be met.
Library capabilities generally appear as needs arise. Many of the basic data structures are so easy to implement that it hardly seems worth adding them to a library -- especially if that library needs to go through some sort of standardization process. This is why so many languages (of all levels) do not provide them out of the box. But I think that there is another force at work that will change all that... the rise of multiprogramming.
A New Need Arising?
It wasn't too long ago that the code that most developers wrote ran within the confines of a single thread. But now, our systems are full of threads, web workers, agents, coroutines, clusters, clouds and all manner of concurrent systems. This changes the whole complexion of implementing data structures from scratch.
In a single-threaded context, it is trivial to implement a linked list in almost any language. But add concurrency to the mix and now it takes a great deal of effort to get it right. One really needs to be a specialist to stand a chance at all. That is why you see rich collection frameworks in all the latest languages. The need to share data structures across thread boundaries (or worse) is being fulfilled.
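This is visible in Python's standard library: queue.Queue is exactly the kind of "industrial strength" shared structure the answer describes, already safe to hand across threads, where a hand-rolled list would need its own locking. A minimal producer/consumer sketch (worker function and sentinel value are made up for illustration):

```python
import queue
import threading

q = queue.Queue()
processed = []

def worker():
    while True:
        item = q.get()
        if item is None:   # sentinel: time to stop
            break
        processed.append(item)
        q.task_done()      # a hand-rolled linked list would need locking here

t = threading.Thread(target=worker)
t.start()
for i in range(5):
    q.put(i)
q.join()        # block until every enqueued item was marked done
q.put(None)     # tell the worker to exit
t.join()
print(processed)  # [0, 1, 2, 3, 4]
```

All of the locking and condition-variable work lives inside Queue; getting the same guarantees from scratch is the specialist effort the answer refers to.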
History... But Not The Future
So, to summarize, I think the reason why rich data structures are conspicuously absent from many languages is largely historical. The need was not great enough to justify the effort. But there are new forces at work, in the form of highly concurrent systems, that force language libraries to provide industrial strength implementations of the richer data structures.
My intuitive answer would be that these languages defer the higher-level data structures to the programmer to implement him/herself. This allows the programmers to custom tailor the specific data structure to the problem being solved by the software. Often in an organization, many of these DSes are packaged in libraries for re-use in a large-scale application.

How established are ORMs (object relational mapping) in the world of databases

I'm not a database admin or architect, so I have to ask those who do it 24/7. How established is the concept of an ORM (object relational mapping) in the world of database administration and architecture? Is it still happening, widely approved but still in its early stages, or is generally disapproved? I'm learning this area and would like to get a feel for whether it's going to be knowledge appreciated by the wider segment of this field.
A lot of places are using them, that doesn't mean they are using them well or that they are a good idea for the long term health of the database. Doesn't mean they aren't either, usually it just means the people choosing them don't think about how this affects database design, maintenance and performance over time.
ORMs are widely used. Django (a Python web application framework) ships with its own ORM, and SQLAlchemy is popular elsewhere in the Python world. Hibernate is popular for Java programs. ORMs make SQL application development faster, reduce the amount of boilerplate code that you must write, and hide which database you are using from the rest of the application. Performance may suffer when using an ORM, so be prepared to customize things as needed.
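A toy sketch of the core thing an ORM automates, mapping a class to a table and generating the SQL for you (the class, table, and helper here are invented for illustration; real ORMs like Hibernate or Django's add caching, dirty tracking, migrations, and much more):

```python
import sqlite3

class User:
    def __init__(self, name, email):
        self.name, self.email = name, email

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")

def save(user):
    # The "M" in ORM: object attributes become columns, so the call site
    # never writes SQL by hand.
    conn.execute("INSERT INTO users VALUES (?, ?)", (user.name, user.email))

save(User("alice", "a@example.org"))
print(conn.execute("SELECT name FROM users").fetchall())  # [('alice',)]
```

The boilerplate reduction in the answer above is this pattern repeated for every table and every query, which is why hand-coding it gets old quickly.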
Widely used and definitely the present and near future. Database access through a handcoded layer of SQL generation was always fraught with drudgery and typos, and was unwieldy at best. ORMs let you use a persistence store in a programming way.
I thought this blog argued for it well: http://jonkruger.com/blog/category/fluent-nhibernate/ and SO posts like this (nHibernate versus LLBLGen Pro) show just how many people are using them.
I can tell you from my experience. We are a $2.5B solar manufacturing company and we are basing our next generation of products on ORM technology. We're using Linq-To-SQL, quite successfully. We are very happy with it.
The concept has been around for at least 20 years.
If you take a look at any decent web framework, whether it's Java, Ruby, PHP, C# or Python, they all incorporate ORMs. Generally, it's perceived as being a more professional choice unless you have specific needs for high performance or custom SQL.

What are the factors in choosing a specific Database Management System? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
Why are there so many database management systems? I am not a DB expert and I've never thought about using a database other than MySQL.
Programming languages offer different paradigms, so it makes sense to choose a specific language for your purpose.
Question
What are the factors in choosing a specific Database management system ?
Different Strokes for Different Folks:
The .NET people like the homogeneous stack that Microsoft SQL Server provides.
Oracle is the 'Please use in Enterprise Applications only' DBMS.
MySQL and PostgreSQL are used by the Open-Source crowd.
SQLite is great for an embedded DBMS.
Microsoft Access is great for a One-Person Microsoft Office Integrated Database (or, for people that don't know any better)
I know next to nothing about non-relational DBMSs: NoSQL, MongoDB, db4o, CouchDB, BigTable. I'd recommend a different question to address those, since their aims are different than traditional RDBMSs.
DBMSs have been around for many, many years and have been very important to IT infrastructure in the past, the present, and the future. So a lot of people have tried to get into the business - just as there are a lot of office suites, internet browsers, etc.
What are factors to choose a specific DB management system ?
Licensing
Platform
Performance
Supported programming language
etc, etc
If the paradigms are the same, it's also a market-share issue (has that been overlooked?). Otherwise, Peter's answer is worth considering.
There's a noticeable absence of e.g. column-oriented (LucidDB), platform independent (Derby), and in-memory (hsqldb, although Derby fits here as well) databases, and probably others, as classified by their key properties.
"This answer doesn't really answer 'why', it just answers 'who'."
True, but I guess the answer to 'why' might be that there have been so many 'who' 's who all thought they could do it "better than the others".
With "better" meaning whichever gap the particular 'who' chose to pick on:
nearly watertight guarantee of read success through MVCC locking, as opposed to more traditional two-phase locking.
no fee, as opposed to million-dollar fees
easy interfacing with language XYZ, which the others don't have
...
My personal pet issue is support for CREATE ASSERTION. It's been in the SQL standard since 1992, and none of the big elephants know how to support it. I do.
For the most part, if you are writing for the RDBMS/SQLish market, the number one question you should probably ask is, "What do I already know about? What does my staff know about?" If you have an answer for that, then you should probably pick that SQL engine first. My inner database geek cringes at this answer, but the truth is that unless your developers are among the tiny fraction that really get relational databases anyway, you're going to fall into the same deep ruts of standard database mistakes as everyone else, and the main question is going to be whether you can get your system to go fast enough.
This is probably true if you've swallowed a pitcher of the NoSQL beverage of choice as well, since there too you have to pick the thing you understand.
If you are already in a position to understand all these differences, however, then you will understand that the answer is "it depends". The usual four dimensions come down to these: execution speed for a given workload profile (this is a matter of whether the database is excellent at the particular kind of problem: some are faster for lookup, for instance, where others are better under high concurrency writing); SQL conformance in the target areas (e.g. Oracle has funny -- i.e. wrong -- NULL handling, MySQL is all over the map, Postgres folds unquoted identifiers to lower case); money cost both immediately and over the long haul (include hardware requirements, costs of hiring people, licenses); and maybe features you want (if you want Oracle's RAC, you have to buy Oracle).
Database systems offer different paradigms too. For instance, MySQL or MSSQL are relational, while db4o is object-oriented, and MongoDB is document-oriented.

Does a "thin data access layer" mainly imply writing SQL by hand?

When you say "thin data access layer", does this mainly mean you are talking about writing your SQL manually as opposed to relying on an ORM tool to generate it for you?
That probably depends on who says it, but for me a thin data access layer would imply that there is little to no additional logic in the layer beyond basic data storage abstraction: probably no support for targeting multiple RDBMSs, no layer-specific caching, no advanced error handling (retry, failover), etc.
Since ORM tools tend to supply many of those things, a solution with an ORM would probably not be considered "thin". Many home-grown data access layers would also not be considered "thin" if they provide features such as the ones listed above.
Depends on how we define the word "thin". It's one of the most abused terms I hear, rivaled only by "lightweight".
That's one way to define it, but perhaps not the best. An ORM layer does a lot besides just generate SQL for you (e.g., caching, marking "dirty" fields, etc.) That "thin" layer written in lovingly crafted SQL can become pretty bloated by the time you implement all the features an ORM is providing.
I think "thin" in this context means:
It is lightweight;
It has a low performance overhead; and
You write minimal code.
Writing SQL certainly fits this bill but there's no reason it couldn't be an ORM either although most ORMs that spring to mind don't strike me as lightweight.
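As a sketch of what such a "thin" layer looks like in practice (the table and function names are made up for illustration): hand-written SQL behind a few small functions, with no caching, no retry logic, and no multi-RDBMS abstraction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")

def insert_order(total):
    """Hand-written SQL; the layer adds nothing beyond parameter binding."""
    cur = conn.execute("INSERT INTO orders (total) VALUES (?)", (total,))
    return cur.lastrowid

def get_order(order_id):
    """Returns a bare tuple -- no object mapping, no dirty tracking."""
    return conn.execute("SELECT id, total FROM orders WHERE id = ?",
                        (order_id,)).fetchone()

oid = insert_order(10.0)
print(get_order(oid))  # (1, 10.0)
```

Low overhead and minimal code, per the criteria above, at the cost of rewriting this boilerplate for every entity in the system.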
I think it depends on the context.
It could very well mean that, or it may simply mean that your business objects map directly onto a simple underlying relational table structure: one table per class, one column per class attribute, so that the translation of business object structure to database table structure is "thin" (i.e. not complex). This could still be handled by an ORM, of course.
It may mean that there is no or minimal logic employed on the database such as avoiding the use of stored procedures. As other people have mentioned it depends on the statement's context as to the most likely meaning.
I thought data access layers were always supposed to be thin... DALs aren't really the place to have logic.
Maybe the person you talked to is talking about a combination of a business layer and a data access layer; where the business layer is non-existent (e.g. a very simple app, or perhaps all of the business rules are in the database, etc).

Is there a business proven cloud store / Key=>Value Database? (Open Source) [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 8 years ago.
I have been looking at cloud computing / storage solutions for a long time (inspired by Google Bigtable), but I can't find an easy-to-use, business-ready solution.
I'm searching for a simple, fault-tolerant, distributed Key=>Value DB like Amazon's SimpleDB.
I've seen things like:
The CouchDB Project: a simple, distributed, fault-tolerant database. But it understands only JSON - no XML connectors, etc.
Eucalyptus : Nice Amazon EC2 interfaces. Open Standards & XML. But less distributed and less fault-tolerant? There are also a lot of open tickets with XEN/VMWare issues.
Cloudstore / Kosmosfs : Nice distributed, fault tolerant fs. But it's hard to configure. Are there any java connectors?
Apache Hadoop: a nice system that does much more than store data. It uses its own Hadoop Distributed File System and has been tested on clusters with 2000 nodes.
Amazon SimpleDB: I can't find an open-source alternative! It's a nice but expensive system for huge amounts of data, and you're locked into Amazon.
Are there other, better solutions out there? Which one is the best to choose? Which one has the fewest single points of failure (SPOF)?
How about memcached?
The High Scalability blog covers this issue; if there's an open source solution for what you're after, it'll surely be there.
Other projects include:
Project Voldemort
Lightcloud - Key-Value Database
Ringo - Distributed key-value storage for immutable data
Another good list: Anti-RDBMS: A list of distributed key-value stores
MongoDB is another option which is very similar to CouchDB, but using query language very similar to SQL instead of map/reduce in JavaScript. It also supports indexes, query profiling, replication and storage of binary data.
It has a huge amount of documentation, which might be overwhelming at first, so I would suggest starting with the Developer's Tour.
Wikipedia says that Yahoo both contributes to Hadoop and uses it in production (article linked from Wikipedia). So I'd say it counts for business-provenness, although I'm not sure whether it counts as a K/V database.
Not on your list is the Friendfeed system of using MySQL as a simple schema-less key/value store.
It's hard for me to understand your priorities. CouchDB is simple, fault-tolerant, and distributed, but somehow you exclude it because it doesn't have XML. Are XML and Java connectors an unstated requirement?
(Anyway, CouchDB should in fact be excluded because it's young, its API isn't stable, and it's not a key-value store.)
I use Google's Google Base api, it's Xml, free, documented, cloud based, and has connectors for many languages. I think it will fill your bill if you want free hosting too.
Now if you want to host your own servers, Tokyo Cabinet is your answer. It's key=>value based, uses flat files, and is the fastest database out there right now (very barebones compared to, say, Oracle, but incredibly good at storing and accessing data - about 1 million records per second, with about 10 bytes of overhead, depending on the storage engine). As for being business ready, Tokyo Cabinet is the heart of a service called Mixi, which is the equivalent of Japan's Facebook+MySpace, with several million heavy users, so it's actually very battle proven.
If you want something like Bigtable, you can't go past HBase or Hypertable - they're both open-source Bigtable clones. One thing to consider, though, is if your requirements really are 'big enough' for Bigtable. It scales up to thousands of tablet servers, and as such, has quite a bit of infrastructure under it to enable that (for example, handling the expectation of regular node failures).
If you don't anticipate growing to, at the very least, tens of tablet servers, you might want to consider one of the proposed alternatives: you can't beat BerkeleyDB for simplicity, or MySQL for ubiquity. If all you need is a key/value datastore, you can put a simple 'dict' wrapper around your database interface and switch out your backend if you outgrow one.
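The "dict wrapper" idea above can be sketched in a few lines of Python (the class and schema here are invented for illustration): callers see only a key/value interface, so the SQLite backend could later be swapped for BerkeleyDB, memcached, or a Bigtable clone without touching them.

```python
import sqlite3

class KVStore:
    """Dict-style key/value interface over a swappable backend (here SQLite)."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

    def __setitem__(self, key, value):
        # Upsert: last write wins, like a plain dict assignment.
        self._db.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))

    def __getitem__(self, key):
        row = self._db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

store = KVStore()
store["user:1"] = "alice"
print(store["user:1"])  # alice
```

Because the interface is just __getitem__/__setitem__, application code reads like ordinary dict access regardless of which backend sits underneath.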
You might want to look at hypertable which is modeled after google's bigtable.
Use CouchDB.
What's wrong with JSON?
JSON to XML is trivial.
You might want to take a look at this (using MySQL as key-value store):
http://bret.appspot.com/entry/how-friendfeed-uses-mysql
Cloudera is a company that commercializes Apache Hadoop, with some value-add of course, like productization, configuration, training & support services.
Instead of looking for something inspired by Google's Bigtable, why not just use Bigtable directly? You could write a front-end on Google App Engine.
A good compilation of storage tools for your question:
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
Tokyo Cabinet has also received some attention, as it supports table schemas, key-value pairs, and hash tables. It uses Lua as an embedded scripting platform and HTTP as its communication protocol. Here is a great demonstration.