Does making a relational-like query language operate over bags rather than sets give it better performance properties? - relational-database

One of the core differences between SQL and the relational algebra is that SQL operates over bags whilst the relational algebra operates over sets.
Are there any performance benefits to designing SQL this way? Could a pure relational algebra which operates strictly over sets ever compete with SQL on performance? For which relational operators is ensuring uniqueness of rows expensive?
Performance here means strictly the amount of time it takes to execute queries.
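For a concrete sense of what set semantics cost, here is a minimal sketch in SQLite (via Python's sqlite3 module, purely as a stand-in): UNION ALL keeps bag semantics and can simply concatenate its inputs, while UNION must run an extra duplicate-elimination step (typically a sort or hash).

```python
# Sketch: bag semantics (UNION ALL) vs set semantics (UNION) in SQLite.
# UNION must deduplicate its result; UNION ALL just concatenates.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r (x INTEGER)")
conn.execute("CREATE TABLE s (x INTEGER)")
conn.executemany("INSERT INTO r VALUES (?)", [(i % 10,) for i in range(1000)])
conn.executemany("INSERT INTO s VALUES (?)", [(i % 10,) for i in range(1000)])

bag = conn.execute("SELECT x FROM r UNION ALL SELECT x FROM s").fetchall()
set_ = conn.execute("SELECT x FROM r UNION SELECT x FROM s").fetchall()

print(len(bag))   # 2000: duplicates preserved, no dedup work needed
print(len(set_))  # 10: duplicates eliminated at the cost of a dedup step
```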

Related

Is it possible to express a not null constraint using relational calculus?

I understand that relational calculus is based on first-order logic and as such has no concept of null values; however, a not-null constraint can be expressed in a relational algebra query using an anti-join. Is there an equivalent mechanism to express such a query using only relational calculus?
For example, could a basic SQL query in the form:
SELECT * FROM x WHERE y IS NOT NULL
be expressed using relational calculus?
E. F. Codd proposed introducing nulls to the relational model, but he never seemed to deal with the consequences. In his book, "The Relational Model for Database Management", he proposed using two different kinds of null and a four-valued logic. He suggested such a system would need a tautology detection algorithm to make sure the right result (or at least a useful, comprehensible result) would be returned for some queries. It seems to me that such a scheme must be impractical and doomed to fail, although I have no proof. To me it seems unlikely that users would be able to understand tautology detection properly.
Under Codd's scheme, tautologies like x = x would presumably evaluate to true, even in the presence of nulls. The authors of SQL did not follow Codd's scheme, of course, and therein lies the difficulty. There is no single consistent set of rules for the treatment of nulls, either in theory or in working software, so unless you explain such a system and its rules, your question is unanswerable.
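For illustration, SQL's actual three-valued treatment can be seen in a few lines of SQLite (table and column names follow the question's example; SQLite is just a convenient stand-in). Note how `y = y` does not evaluate to TRUE for a null, so it filters exactly like `y IS NOT NULL`:

```python
# Sketch: SQL's three-valued logic in SQLite. Unlike Codd's proposal,
# y = y is not TRUE when y is NULL, so WHERE y = y behaves like
# WHERE y IS NOT NULL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (y INTEGER)")
conn.executemany("INSERT INTO x VALUES (?)", [(1,), (None,), (3,)])

not_null = conn.execute("SELECT y FROM x WHERE y IS NOT NULL").fetchall()
self_eq  = conn.execute("SELECT y FROM x WHERE y = y").fetchall()

print(not_null)  # [(1,), (3,)]
print(self_eq)   # [(1,), (3,)] -- NULL = NULL is unknown, not TRUE
```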

For a many to many relationship is it better to use relational database or nosql?

For a many to many relationship is it better to use relational database or nosql?
Let's assume you have a bunch of users, and each user can have friends drawn from the same users table, so it's essentially a many-to-many relationship with itself. A many-to-many relationship in a relational database will create a third table. Now, assuming this users table is huge, with millions of people in it, this third table would be gigantic if each person has, say, more than 10 friends. Wouldn't it be more efficient (and just overall more intuitive) for friends to be stored as a JSON list in a NoSQL database, as shown below?
{"user1": {"friendslist": ["user2", "user3", "user4"]}}
{"user2": {"friendslist": ["user1", "user3", "user4"]}}
{"user3": {"friendslist": ["user1", "user2", "user4"]}}
{"user4": {"friendslist": ["user1", "user2", "user3"]}}
So this is also a data structures question: it comes down to B-tree vs. hash table, if I'm not mistaken.
It does seem more intuitive to the untrained. That's why the network data model is still so prevalent even though the relational model has been around for decades.
"Better" depends on how you want to use it, and "more efficient" depends on the database engine, indexes and various other factors. I prefer the relational model since I can formulate any reasonable question that can be logically derived from the data and get a correct answer. For example, if I wanted to find friends of friends, I could join a relational many-to-many table with itself. I could find cycles and cliques of any particular size. I could easily declare a unique constraint on pairs of friends.
It's possible to do these things without a relational database but I doubt it would be as easy or concise.
The particular data structure used by the database engine has nothing to do with the relational concept, though it is relevant to efficiency. For more info on which data structure would be used, you'll need to look at particular database management systems and their storage engines.
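The friends-of-friends query and the unique constraint on pairs mentioned above can be sketched in a few lines of SQLite (table and column names are illustrative, echoing the Friended(U,F) convention used elsewhere in this thread):

```python
# Sketch: a many-to-many friendship table joined with itself to find
# friends of friends, with a uniqueness constraint on pairs.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE friended (
    u TEXT NOT NULL,
    f TEXT NOT NULL,
    UNIQUE (u, f)        -- declared unique constraint on pairs of friends
)""")
pairs = [("user1", "user2"), ("user2", "user3"), ("user3", "user4")]
conn.executemany("INSERT INTO friended VALUES (?, ?)", pairs)

# Friends of friends of user1: join the table with itself.
fof = conn.execute("""
    SELECT DISTINCT b.f
    FROM friended a JOIN friended b ON a.f = b.u
    WHERE a.u = 'user1' AND b.f <> 'user1'
""").fetchall()
print(fof)  # [('user3',)]
```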
Why would a relational implementation be "gigantic"? Why would your structure be "more efficient"? You are making a lot of unfounded assumptions that it would be good for you to think about. (Learn some relational basics. And the relational take on relational vs NoSQL.)
Re "intuitive": the obvious relational organization for when U friended F is a table Friended holding rows where... "U friended F", i.e. Friended(U,F) for short. If you want the Us where U friended x, that's the rows where Friended(U,x), i.e. the rows in PROJECT U RESTRICT F x Friended, i.e. the rows in PROJECT U (Friended WHERE F=x), depending on whether you want to think in logic, relations or a mix. What's your query for that?

Using a relational interface in terms of predicates and tables does not require or preclude any particular implementation. The entire NoSQL movement is a sad consequence of users and vendors failing to understand the relational model as an interface to data, not as a storage structure. A DBMS for a NoSQL use case needs only to be a relational DBMS with better support for arbitrary types in querying and implementation.
From my answer to Adjustable, versioned graph database:
There is an obvious 1:1 correspondence between your states at a given time and a relational database with a given schema. So there is an obvious 1:1 correspondence between your set of states over time and a changing-schema database, i.e. a variable whose value is a database plus metadata, manipulated by both DDL and DML update commands. So there is no evidence that you shouldn't just use a relational DBMS.
Relational DBMSs allow generic querying with automated implementation at a certain computational complexity with certain opportunities for optimization. Any application can have specialized queries that make a specialized data structure and operators a better choice. But you must design your application and know about such special aspects to justify this. As it is, with the obvious correspondences between your states and relational states, this has not been justified.
Just because you can draw a picture of your application state as of some time using a graph does not mean that you need a graph database. What matters is what specialized queries/expressions you will be evaluating. You should understand what these are in terms of your problem domain, which is probably most easily expressed both via some specialized data structure and operators, and relationally. Then you can compare the expressive and computational demands across a specialized data structure, a relational representation, and the models of particular graph databases.
Of course there are specialized applications where we use optimized special operators and storage. But that merits justification, and from a relational perspective it should be supported by an extendible relational DBMS.

Theoretical basis for CRUD Operations on data

I know that RDBMSs are based on the Relational Model, supported by Relational Algebra.
Various theoretical concepts from the relational algebra, like selection, projection and joins, are implemented in query languages like SQL. But these operations are primarily the R (Read) of CRUD (Create, Read, Update, Delete).
CRUD is the holy grail of programming, especially in the enterprise world.
I wanted to know which programming-language-independent theoretical foundation (mathematical or otherwise) the INSERTs, UPDATEs and DELETEs are modeled on. Does such a theory even exist?
If such a theory exists, it could probably explain, among other things, constraints on databases.
Eg:
You cannot update a single row (tuple) without specifying a unique column (a WHERE clause).
Or,
If a one-to-many relationship is deleted, the entities on the many side get deleted (those in the table which houses the other table's primary key as a foreign key).
For the sake of simplicity let us assume all CRUD is operated on Relational Models only.
The reason I am asking is that I need to do deep R&D for a product that hopes to automate CRUD. I know, I know, people have tried and failed, but I'd still like to be pointed to some theoretical foundation, please!
EDIT This will also help in the design of ORMs which can produce all CRUD operations independent of the underlying DB model.
EDIT I just found this link -> https://cs.stackexchange.com/questions/43672/a-relational-algebra-extended-to-model-the-full-dml-crud-domain This is similar to what I'm asking; unfortunately, the OP's question circles into a specific implementation!
In relational terms, CREATE, UPDATE and DELETE operations are all assignments. E.g. inserting relation I into relation variable T can be accomplished by:
T = T UNION I;
Any practical relational language ought to have syntax shortcuts for these operations. See Tutorial D for example.
CRUD can be reduced to relations, relational algebra, variables and (optionally) type theory. A database is seen as a set of relation variables, similar to variables in any imperative programming language except that they hold relations rather than scalar values. Queries apply a sequence of relational algebra operators to the values stored in relation variables. Read queries return the result to the caller. Create, Update and Delete queries assign the result back to the original relation variable.
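A minimal sketch of this "CRUD as assignment to relation variables" view, modelling a relation as a Python frozenset of tuples (the variable names T, I and D echo the assignment example above and are otherwise illustrative):

```python
# Sketch: a relation variable as a frozenset of tuples; every write is
# an assignment of a new relation value to the variable.
T = frozenset({("alice", 30), ("bob", 25)})

# CREATE (insert): assign the union of the old value and the new rows.
I = frozenset({("carol", 41)})
T = T | I

# DELETE: assign the set difference.
D = frozenset({("bob", 25)})
T = T - D

# UPDATE: delete the old rows and insert the modified ones, as one assignment.
T = (T - {("alice", 30)}) | {("alice", 31)}

print(sorted(T))  # [('alice', 31), ('carol', 41)]
```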
One problem with ORMs is that they confuse rows for entities, tables for entity sets and columns for attributes. Chen's original paper stated that entities are represented by values and attributes are one-to-one relations represented by pairs of values. Another problem is trying to manipulate a row at a time when the underlying system works with sets. Another is trying to abstract over a very high-level declarative data sublanguage.
I don't want ORMs, I want my objects to talk in SQL with each other, but that's a different topic.
This is too long for a comment.
"Relational" databases only loosely implement the relational algebra. The "relational" in relational algebra, for instance, refers (among other things) to the relationship between "attributes" (columns) and their values within a "tuple" (a row in a table). In most SQL databases, all rows in a table ("tuples") have the same columns; that is not a requirement of relational algebra. Another example is duplicates within tables: relational algebra deals with sets of "tuples", where duplicates are not allowed, yet relational databases allow duplicates in tables unless a primary key is explicitly defined.
The semantics around CRUD are driven more by the ACID properties of databases (atomicity, consistency, isolation, and durability). These properties drive the transactional semantics of relational databases.
In my experience, successful practical applications usually differ from theoretical underpinnings.

Why we use Dimensional Model over Denormalized relational Model?

I am confused about some questions and need answers to them.
If our relational model is also denormalized, then why do we prefer the dimensional model?
What is the reason we prefer the dimensional model over the relational model?
Historical data can also be stored in an OLTP system, and you can perform reporting on any OLTP system, so why do we use a dimensional model and a data warehouse?
What is the difference between a dimension and a denormalized table?
Thanks in advance
Short answer:
If your lookups / retrievals from your OLTP tables are fast enough, and your specific search requirements do not have such complications as are described below, then there should not be a need to get into any dimensional star-schemas.
Long answer:
Dimensional and Denormalized models have different purposes. Dimensional models are generally used for data warehousing scenarios, and are particularly useful where super-fast query results are required for computed numbers such as "quarterly sales by region" or "by salesperson". Data is stored in the Dimensional model after pre-calculating these numbers, and updated as per some fixed schedule.
But even without a data warehouse involved, a Dimensional model could be useful, and its purpose could complement that of the Denormalized model, as in the following example:
A Dimensional model enables fast search. Joins between the dimension tables and the fact table are set up in a star-schema. Searching for John Smith would be simplified because we'll search for John OR Smith only in the relevant dimension table, and fetch the corresponding person ids from the fact table (fact table FKs point to dimension table PKs), thereby getting all persons with either of the 2 keywords in their name. (A further enhancement would enable us to search for all persons having variations of "John Smith" in their names e.g. John, Jon, Johnny, Jonathan, Smith, Psmith, Smythe by building snowflake dimensions.)
A Denormalized model, on the other hand, enables fast retrieval, such as returning a lot of columns about a specific item without having to join multiple tables together.
So in the above scenario, we would first use the Dimensional model to get a set of IDs for the persons of our interest, and then use the Denormalized table to get full details of those selected IDs without having to do any further joins.
This kind of search would be very slow if we directly queried the denormalized tables, because a text search would need to be done on the PersonName column. It becomes even slower if we try to include the name variations, or if we need to add more search criteria.
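The search pattern described above can be sketched with SQLite (all table, column and person names here are illustrative): the text search hits only the small dimension table, and the fact table is then accessed by key alone.

```python
# Sketch: star-schema search. Text matching runs against the small
# person-name dimension; the fact table is then filtered by key only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_person (person_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (person_key INTEGER REFERENCES dim_person, amount REAL);
INSERT INTO dim_person VALUES (1, 'John Smith'), (2, 'Jane Doe'), (3, 'Jon Smythe');
INSERT INTO fact_sales VALUES (1, 100.0), (2, 50.0), (1, 75.0), (3, 20.0);
""")

# 1. Text search only on the dimension table...
keys = [k for (k,) in conn.execute(
    "SELECT person_key FROM dim_person WHERE name LIKE '%John%' OR name LIKE '%Smith%'")]
# 2. ...then hit the fact table by key alone.
placeholders = ",".join("?" * len(keys))
total = conn.execute(
    f"SELECT SUM(amount) FROM fact_sales WHERE person_key IN ({placeholders})",
    keys).fetchone()[0]
print(keys, total)  # [1] 175.0
```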
Excellent reference:
An excellent reference for learning about the vast (and very interesting) topic of Dimensional Modeling is Ralph Kimball's The Data Warehouse Lifecycle Toolkit. Its companion volume The Data Warehouse Toolkit covers a large number of actual use cases.
Hope this helps!
A dimensional model uses denormalisation as one of its techniques in order to optimise the database for:
- query performance, and
- user understanding.
OLTP systems are typically hard to report from and also slow, being optimised as they are for OLTP (insert, update, delete) performance and also to protect transactional integrity.
A data warehouse, using a dimensional model, still uses relational techniques but is instead optimised to consider the experience of getting the data out over getting the data in.
Truth is, you can't always report easily from any OLTP system: the tables are often obscurely titled without considering people are going to want to get at the data to make business decisions. Reporting tools that generate SQL also struggle to make performant queries on your typical normalised schema.
Modern advances in OLTP technologies provide alternatives to dimensional models that address performance issues, but still do not tackle the typical steps made in creating a dimensional model, to make the database tables easier to comprehend and navigate.
A dimension is a table that is intended to represent a business concept or entity, giving context to a particular measurement of a business process (or 'fact'). Dimensions are typically denormalised in a dimensional model, both to reduce the number of tables to comprehend/navigate and to reduce the number of joins for performance reasons. For example, a Product dimension may contain Brand information, whereas in an OLTP model these would be separate tables; this allows users to filter a fact by Brand directly without traversing multiple tables.
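A minimal sketch of that Product/Brand denormalisation in SQLite (schema and names are illustrative): the brand lives directly on the Product dimension, so filtering the fact by brand needs a single join.

```python
# Sketch: Brand folded into the Product dimension, so a fact table can
# be filtered by brand with one join instead of a Product->Brand hop.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                          product_name TEXT,
                          brand_name TEXT);   -- denormalised: no Brand table
CREATE TABLE fact_sales (product_key INTEGER REFERENCES dim_product,
                         quantity INTEGER);
INSERT INTO dim_product VALUES (1, 'Widget', 'Acme'), (2, 'Gadget', 'Globex');
INSERT INTO fact_sales VALUES (1, 10), (2, 5), (1, 3);
""")

qty = conn.execute("""
    SELECT SUM(f.quantity)
    FROM fact_sales f JOIN dim_product p ON f.product_key = p.product_key
    WHERE p.brand_name = 'Acme'
""").fetchone()[0]
print(qty)  # 13
```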
I agree with @Rich, mainly with the fact that the dimensional model uses denormalized tables. I started following Kimball's book, as @Krishna indicates, about 2 years ago.
I think you will get answers to all your questions/doubts if you read this book.
Please note: if you are aiming for some kind of BI solution, then in my opinion you should follow dimensional modelling. This is for ease of reporting, and for being truer and closer to the business process.
You can perhaps also report directly from the OLTP system, but your reporting solution may not survive the test of users' ever-changing demands. Dimensional modelling is done while remaining close to the natural business process. At the same time, it remains flexible enough that any add-on process can be accommodated easily, like fitting a piece into a puzzle you are close to solving.

Using NoSQL database for relational purpose

Non-relational databases are attracting more attention day by day. The main limitation is that today's complicated data really are connected. Isn't it convenient to connect databases the way we connect tables in an RDBMS? Of course, I just mean simple cases. Imagine three tables: Articles, Tags, and Relationships. In an RDBMS like MySQL, we can run three queries to
1. Find ID of a given tag
2. Find Articles connected with the captured Tag ID
3. Fetch the contents of Articles tagged with the term
Instead of three queries, we could perform a single query with a JOIN. I think three queries in a key/value database like BerkeleyDB are faster than a JOIN query in MySQL.
Is this idea practical? Or are there other issues that make this approach inadvisable?
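The two approaches in the question can be sketched side by side, with Python dicts standing in for a key/value store and SQLite for the RDBMS (all names and data are illustrative):

```python
# Sketch: three key/value lookups vs one JOIN for the Articles/Tags/
# Relationships example in the question.
import sqlite3

# Key/value version: three separate lookups.
tag_ids = {"python": 1}
tag_articles = {1: [10, 11]}
articles = {10: "Intro", 11: "Advanced"}

tid = tag_ids["python"]                       # 1. find the ID of the tag
aids = tag_articles[tid]                      # 2. find connected article IDs
contents_kv = [articles[a] for a in aids]     # 3. fetch the article contents

# RDBMS version: one JOIN query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE articles (id INTEGER PRIMARY KEY, content TEXT);
CREATE TABLE relationships (tag_id INTEGER, article_id INTEGER);
INSERT INTO tags VALUES (1, 'python');
INSERT INTO articles VALUES (10, 'Intro'), (11, 'Advanced');
INSERT INTO relationships VALUES (1, 10), (1, 11);
""")
contents_sql = [c for (c,) in conn.execute("""
    SELECT a.content
    FROM tags t
    JOIN relationships r ON r.tag_id = t.id
    JOIN articles a ON a.id = r.article_id
    WHERE t.name = 'python'
""")]
print(contents_kv)           # ['Intro', 'Advanced']
print(sorted(contents_sql))  # ['Advanced', 'Intro']
```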
NoSQL databases can support relational data models just fine. You're just left to implement the relational mapping yourself in your application, and that effort is typically not insignificant.
In some applications this extra effort will be worthwhile. Perhaps you only have a small number of tables and the joins you need are very simple. Or perhaps you've done some performance evaluation between a traditional relational DBMS and a NoSQL alternative and found that the NoSQL option is more appropriate for your needs for any number of reasons (performance, scalability, flexibility, whatever).
You should keep one thing in mind, however. A typical SQL DBMS is basically a NoSQL DB with an optimized, well-built relational engine in front of it. Some databases even let you bypass the relational layer and treat their system like a pure NoSQL DB.
Therefore, the moment you start to build your own relational mappings and joins on top of a NoSQL DB you should ask yourself, "Didn't someone build this for me already?" The answer may well be "yes", and the solution might be to go with a traditional SQL DBMS.
To answer the "3 query" part of your question specifically, the answer is "maybe". You certainly might be able to make such a query run faster in a NoSQL DB than in an RDBMS, but you need to keep in mind that there are more things to consider here than just the raw speed of your query:
The technical debt you will incur as you build join-like functionality that you wouldn't have had to build otherwise
The time it will take you to build, test and optimize your query code which will likely be more significant than writing a simple SQL query
Any difference in transactional guarantees or other typical product features (replication, management tools, etc) which you may lose or gain depending on the NoSQL option you choose
The ability to hire DBAs who know how to run your database from an operational perspective
You might review that list and say to yourself, "No big deal, I'm running a simple app with only a few thousand DB entries and I'll maintain it myself". If so, knock yourself out - Berkeley (and other NoSQL options) would work fine. I've used Berkeley many times for those kinds of applications. But you may have a different answer if you are building the back-end for a significantly-sized SaaS product which might soon have millions of users and very complex queries.
We can't give a one-size-fits-all answer, unfortunately. You'll have to make the judgement call yourself based on the needs of your application and organization.
Sure, a single record join is pretty speedy in either solution, but that's not the big advantage of joins. Joins are useful when you're joining many, many rows with many, many other rows. Imagine if, in your example, you wanted to do that for 100 different tags. Without joins, you're talking 300 queries to SQL's one.
Another solution on NoSQL systems is playOrm. It does joins, but only within partitions, so a table can be of unbounded size as long as each partition stays on par with the size of an RDBMS table. It also does the fancy Hibernate-style work for you, with all the related annotations, though it has some differences, and it will be adding Embedded for use when you denormalize. It makes things much easier. Typically, dealing with NoSQL is a pain because of all the translation logic you have to do and all the manual indexing, updates, and removals from the index; playOrm does all this for you instead.