Consider a hierarchical file system where each folder maintains a version history (i.e. name and additional properties may change). I need to implement this in MySQL 5.1 but a future version may be ported to SQL Server 2012.
I understand there are several options for tree-structures in databases:
Adjacency List
Nested Set (may cause extremely slow insertions)
Nested intervals (complex stuff, requires support for recursion...)
These techniques have been discussed on Stack Overflow before. However, my case adds another dimension: I need to maintain a history for each node. The data that needs to be maintained can be seen as a list of properties, e.g. name, date, type...
Some premises
The database is expected to handle 5-10 simultaneous clients.
The tree is expected to grow to 1,000-5,000 parent nodes (with an arbitrary number of leaves).
Nodes may be inserted at any time.
Nodes/leaves may never be updated or deleted. Instead, a version history is maintained.
Reorganization of nodes is not permitted. (Though, if possible, this would be nice to have!)
Multiple clients may simultaneously add/modify tree nodes. Hence, the clients need to continuously re-read the tree structure (no need for real-time updates).
Order of importance: Traceability (crucial), performance, scalability.
Q: What is the preferred technique of choice for the tree structure and its version controlled node data? SQL samples are appreciated, but not mandatory.
Versioning is extremely tricky because you are dealing with time varying data and neither of the databases you have suggested (or any others that I am aware of) have native support for doing this simply.
Please have a read of Developing Time-Oriented Database Applications in SQL; the book may be almost 15 years old but the issues are largely unchanged.
The fact that you mention "Traceability (crucial)" suggests that you are going to want to get this right.
When considering a simple report that shows just the hierarchy, the questions you need to ask are whether you need to know:
what the tree looks like today, using today's data (yes, obviously)
what the tree looks like now, using last week's data
what the tree looks like a week ago, using today's data
what the tree looks like a week ago, using last week's data
what the tree looks like a week ago, using the week before last's data
The issues you face arise because you are dealing with time-varying data that is updated at a different time from the real-world process it models, a process which may itself involve temporal data. Anyway, read the book.
If this is a non-issue (i.e. the tree is static), then #didierc is correct in his comment: the nodes of the tree can refer to an external versioning table. However, if you also need to store versioning information about the hierarchy itself, this approach won't work if implemented naively (using whatever model).
To give a concrete example, consider a simple tree valid on 1/1/13: A->B->C. On 1/2/13 this changes to A->D->B->C. If you run a query on 1/3/13, referring back to 1/2/13, which tree do you want to retrieve?
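To make the external-versioning idea concrete, here is a minimal sketch (MySQL 5.1-compatible) of an adjacency list in which both the node properties and the parent link carry validity ranges; all table and column names are illustrative, not taken from the question.

CREATE TABLE node (
  node_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
) ENGINE=InnoDB;

-- One row per change to a node's properties (name, type, ...).
CREATE TABLE node_version (
  node_id    INT UNSIGNED NOT NULL,
  valid_from DATETIME     NOT NULL,
  valid_to   DATETIME     NOT NULL DEFAULT '9999-12-31 00:00:00',
  name       VARCHAR(255) NOT NULL,
  type       VARCHAR(50)  NOT NULL,
  PRIMARY KEY (node_id, valid_from),
  FOREIGN KEY (node_id) REFERENCES node (node_id)
) ENGINE=InnoDB;

-- One row per change to a node's position in the hierarchy.
CREATE TABLE node_parent_version (
  node_id    INT UNSIGNED NOT NULL,
  parent_id  INT UNSIGNED NULL,   -- NULL = root
  valid_from DATETIME     NOT NULL,
  valid_to   DATETIME     NOT NULL DEFAULT '9999-12-31 00:00:00',
  PRIMARY KEY (node_id, valid_from),
  FOREIGN KEY (node_id)   REFERENCES node (node_id),
  FOREIGN KEY (parent_id) REFERENCES node (node_id)
) ENGINE=InnoDB;

-- "Which parent did each node have, and what was it called, as of a given date?"
-- (One level at a time; MySQL 5.1 has no recursive CTEs, so walk the levels in
-- the application or with repeated self-joins.)
SET @as_of = '2013-01-02';
SELECT p.parent_id, p.node_id, v.name
FROM   node_parent_version p
JOIN   node_version v
       ON  v.node_id = p.node_id
       AND @as_of >= v.valid_from AND @as_of < v.valid_to
WHERE  @as_of >= p.valid_from AND @as_of < p.valid_to;

Inserting a new version then means closing the current row's valid_to and adding a new row, never updating in place, which keeps the traceability requirement intact.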
Good luck
Related
I want to build an application that will serve a lot of people (more than 2 million), so I think that I should use Google Cloud Datastore. However, I also know that there is an option to use Google Cloud SQL and still serve a lot of people using MySQL (like what Facebook and YouTube do).
Is it a correct assumption that I should use Datastore rather than the relational Cloud SQL with this many users? Thank you in advance.
To give an intelligent answer, I would need to know a lot more about your app. But... I'll outline the biggest gotchas I've found...
Google Datastore is effectively a distributed hierarchical data store. To get the scalability they wanted there had to be some compromises. As a developer you will find that these are anywhere from easy to work around, difficult to work around, or impossible to work around. The latter is far more likely than you would ever assume.
If you are accustomed to relational databases and the ability to manipulate data across multiple tables within the same transaction, you are likely to pull your hair out with datastore. The biggest(?) gotcha is that transactions are only supported across a limited number of entity groups (5 at the current time). To give a simple example, say you had a simple parent-child relationship and you needed to update child records under more than 5 parents at the same time within a transaction... can't be done (yes, really). If you reorganize your data structures and try to put all of the former child records under a single entity so they can be updated in a single transaction, you will come across another limitation... the fact that you can't reliably update the same entity group more than once per second (yes, really). And if you query an entity type across parents without specifying the root entity of each, you will get what is euphemistically referred to as "eventual consistency"... which means it isn't (yes, really).
The above is all in Google's documentation, but you are likely to gloss over it if you are just getting started (of course it can handle it!).
It is not strictly true that Facebook and YouTube are using MySQL to serve the majority of their content to the majority of their users. They both mainly use very large NoSQL stores (Cassandra and BigTable) for scalability, and probably use MySQL for smaller scale work that demands more complex relational storage. Try to use Datastore if you can, because you can start for free and will also save money when handling large volumes of data.
It depends on what you mean by 'a lot of people', what sort of data you have, and what you want to do with it.
Cloud SQL is designed for applications that need a SQL database, which can handle any query you can write in SQL, and ensures your data is always in a consistent state.
Cloud SQL can serve up to 3,200 concurrent queries, depending on the tier. If the queries are simple and can be served from RAM, they should take just a few ms; assuming each user issues about one request per second, that could support tens of thousands of simultaneously active users. If, however, they are doing more complex queries such as searches, or writing a lot of data, then it will be less.
If you have a simple set of queries, are less concerned about immediate consistency, or expect much more traffic, then you should look at datastore.
From CouchDB guide:
Maintaining consistency within a single database node is relatively easy for most databases. The real problems start to surface when you try to maintain consistency between multiple database servers. If a client makes a write operation on server A, how do we make sure that this is consistent with server B, or C, or D? For relational databases, this is a very complex problem with entire books devoted to its solution. You could use multi-master, master/slave, partitioning, sharding, write-through caches, and all sorts of other complex techniques.
Why is it hard to maintain consistency between database servers in a relational model? And why is the CouchDB approach simpler and easier?
Couch simplifies it in two ways.
First, it has a higher level replication model built in and enforced by the system.
Second, its data elements are coarser, giving the optimistic locking and conflict resolution models less to work with.
As a general rule, RDBMSs do not natively support optimistic locking. Many frameworks built on top of them do, but not the DBMSs themselves. Some may support it internally, but if they do, it's not exposed to the end users.
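For reference, the pattern those frameworks typically layer on top is a version (or timestamp) column checked on every write; a minimal sketch with invented table and column names:

-- Read the row, remember its version, then try to write it back:
UPDATE orders
SET    status  = 'shipped',
       version = version + 1
WHERE  order_id = 42
AND    version  = 7;   -- the version number the client originally read
-- 0 rows affected means somebody else updated the row first:
-- the application must retry or report a conflict.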
Couch supports optimistic locking/versioning intrinsically, and relies upon this for its replication.
In an RDBMS, most larger-order data items are broken up into their normalized, relational components. A simple order may well be composed of a half dozen tables, each with their own row structure. But the combination of tables and their relationships is what makes up "an order". Given this finer-grained representation of the order, it's difficult for the database to capture the concept of "change" at the higher level. What does "an order changed" mean? The database sees a collection of nodes and relationships, not higher-order meta objects like "orders".
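As a rough illustration (invented, heavily trimmed table names), "an order" in an RDBMS is really the join of several rows:

CREATE TABLE customers   (customer_id INT PRIMARY KEY, name VARCHAR(100));
CREATE TABLE orders      (order_id INT PRIMARY KEY, customer_id INT, ordered_at DATETIME);
CREATE TABLE order_lines (order_id INT, line_no INT, product_id INT, qty INT,
                          PRIMARY KEY (order_id, line_no));
CREATE TABLE products    (product_id INT PRIMARY KEY, sku VARCHAR(40), price DECIMAL(10,2));
-- "The order" only exists as the join of rows in these tables; a change to any
-- one of them may or may not mean "the order" changed.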
The application can define change, but the database cannot do so as readily.
Now this is not so much an issue if you're replicating the entire database, but it's significantly more complicated if you're replicating a portion of the database.
In Couch, an order, for example, is an entire document. Change the document, and the entire order "changes", and thus the entire order is replicated. In an RDBMS, if a line item changed, then it's easy enough to detect that one line changed, but does that mean the "order" changed? What if an item that the order refers to changed; does that change the order? You can see how this gets more complicated.
All of this can be built on top of the RDBMS, but then it's the application doing change management and facilitating replication, not the database.
However, no matter what support CouchDB offers, it can only go so far, and that caveat is highlighted in the quote:
When two versions of a document conflict during replication, the winning version is saved as the most recent version in the document’s history. Instead of throwing the losing version away, as you might expect, CouchDB saves this as a previous version in the document’s history, so that you can access it if you need to. This happens automatically and consistently, so both databases will make exactly the same choice.
It is up to you to handle conflicts in a way that makes sense for your application. You can leave the chosen document versions in place, revert to the older version, or try to merge the two versions and save the result.
During replication, Couch simply has deterministic rules to make two systems consistent. But consistent doesn't make them correct. When Couch detects two documents in conflict, it picks one, deterministically, and the winner stomps on the loser. But as far as your application is concerned, the loser may have been "right", or the correct document is the fusion of the two documents.
You have to write the logic to handle those merges. And this is a fundamental issue with all master-master replication schemes: the technique to determine "who wins", and the "now what" problem when two different opinions on what the data should look like arrive at the same crossroads.
No system can handle that for you. All a system can do is pick some set of rules it follows, or lets you configure to handle the problem, because the problem is almost always application dependent.
If the simpler model that Couch supports and projects for you works, then that's great. If it doesn't, then you're kind of stuck. Many RDBMSs have solid support for Master-Slave replication, as it's a simpler model, and with that support it's pretty much transparent to the end user application.
I am required to make a general schema of a huge database that I have never used.
The problem is that I do not know how or where to start, because, size aside, I have no idea what each table is for. I can guess at some of them, but for the majority the generic field names tell me nothing.
Do you have any advice? What could I do?
There is no documentation about the database and the creators are not able to help me because they are in another company now.
Thank you very much in advance.
This isn't going to be easy.
Start by gathering any documentation, notes, etc. that exist. Also, it'll greatly help to have a thorough understanding of the type of data being stored, and of the application. Keep ample notes of your discoveries, and build the documentation that should have been built before.
If your database contains declared foreign keys, you can start there, and at least get down the relationships between the tables, keeping in mind that this may be incomplete. As #John Watson points out, if the relationships are declared, there are tools to do this for you.
Check for stored functions and procedures, including triggers, though these are somewhat uncommon in MySQL databases. Triggers especially will often yield clues ("every update to table X inserts a new row to table Y" -> "table Y is probably a log or audit table").
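In MySQL, both the declared relationships and the triggers/routines can be pulled straight out of information_schema; a sketch, assuming the schema is named mydb:

-- Declared foreign keys: which column points at which table.
SELECT table_name, column_name, referenced_table_name, referenced_column_name
FROM   information_schema.KEY_COLUMN_USAGE
WHERE  table_schema = 'mydb'
AND    referenced_table_name IS NOT NULL;

-- Triggers, stored procedures and functions.
SHOW TRIGGERS FROM mydb;
SHOW PROCEDURE STATUS WHERE Db = 'mydb';
SHOW FUNCTION STATUS  WHERE Db = 'mydb';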
Some of the tables are hopefully obvious, and if you know what is related to them, you may be able to start figuring out those related tables.
Hopefully you have access to application code, which you can grep and read to find clues. Access to a test environment which you can destroy repeatedly will be useful too ("what happens if I change this in the app, where does the database change?"; "what happens if I scramble these values?"; etc.). You can dump tables and use diff on them, provided you dump them ordered by primary or unique key.
Doing queries like SELECT DISTINCT foo FROM table can help you see what different things can be in a column.
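A slight refinement that often tells you more, since the frequency of each value is itself a clue (foo and some_table are placeholders, as in the example above):

SELECT foo, COUNT(*) AS occurrences
FROM   some_table
GROUP BY foo
ORDER BY occurrences DESC;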
If it's possible to start from a mostly-empty database (e.g., minimal data to get the app to run), you can observe what changes as you add data through the app. It's much quicker to dump the database when it's small, and the same goes for diffing it and for reading through the output. Some things are easier to understand in a tiny database, but some things are more difficult: when you have a huge dataset and a column is always 3, you can be much more confident it always is.
You can watch SQL traffic from the application(s) to get an idea of what tables and columns they access for each function, and how they join them. Watching SQL traffic can be done in application-specific ways (e.g., DBI trace) or server-specific ways (turn on the general query log) or with a packet tracer like Wireshark or tcpdump. Which is appropriate is going to depend on the environment you're working in. E.g., if you have to do this on a production system, you probably want Wireshark. If you are doing this in dev/test, the disadvantage of the MySQL query log is that all the apps may very well be mixed together, and if multiple people are hitting the apps it'll get confusing. The app-specific log probably won't suffer from this, but of course the app may not have that.
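If you go the general query log route, it can be switched on at runtime on MySQL 5.1 and later; do this on a dev/test server, and treat the file path as an example only:

SET GLOBAL general_log_file = '/tmp/mysql-general.log';
SET GLOBAL general_log      = 'ON';
-- ... click through the application feature you are investigating ...
SET GLOBAL general_log      = 'OFF';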
Keep in mind the various ways data can be stored. For example, all of the following could represent May 1, 1980 (a quick MySQL sanity check follows the list):
1980-05-01 — As a DATE, TIMESTAMP, or text.
2444360.5 — Julian day (the .5 fraction puts it at midnight UTC)
44360 — Modified Julian day
326001600 — UNIX timestamp (seconds since Jan 1 1970 UTC); this value is midnight assuming local time is US Eastern Time
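A quick way to sanity-check a suspected encoding is to convert it in MySQL. A sketch using the values above; 2440587.5 is the Julian day of the UNIX epoch (1970-01-01 00:00 UTC) and MJD = JD - 2400000.5:

SET time_zone = '+00:00';                       -- work in UTC to avoid surprises
SELECT FROM_UNIXTIME(326001600);                -- 1980-05-01 04:00:00 UTC = midnight US Eastern
SELECT UNIX_TIMESTAMP('1980-05-01 00:00:00') / 86400 + 2440587.5;   -- 2444360.5 (Julian day)
SELECT 2444360.5 - 2400000.5;                   -- 44360 (Modified Julian day)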
There may be things in the database which are denormalized, and some of them may be denormalized incorrectly. E.g., you may be wondering "why does this user have a first name Bob in one table, and a first name Joe in another?" and the answer is "data corruption".
There may be columns that aren't used. There may be entire tables that aren't used. Despite this, they may still have data from older versions of the app (or other, no-longer-in-use apps), queries run from the MySQL console, etc.
There may be things which aren't visible in the application anywhere, but are used. Their purpose may be completely non-obvious without knowledge of the algorithms implemented in the app(s). For example, a search function in an app may store all kinds of precomputed information about the documents to search and their connections. Worse, these tables may only be updated by batch jobs, so changing a document won't touch them (making you mistakenly believe they have nothing to do with documents). Then, you come in the next morning, and the table is mysteriously very different. Though, in the search case, a query log when running search will tell you.
Try using the free MySQL Workbench (it's specific to MySQL).
I have reverse engineered databases this way and also ended up with great Entity Relationship Diagrams!
I've worked with SQL for 20 years and this product really is great (it's free, from the mysql folks themselves).
It can have occasional problems and crashes, at least it did on Ubuntu 10, but they've been relatively rare and far outweighed by the benefits! It's also actively developed, so bugs are actually fixed on an ongoing basis.
Assuming that nobody bothered to declare foreign keys in the table definitions, and the database belongs to an application which is in use, after grabbing the current schema the next step for me would be to enable logging of all queries (hoping that the application does NOT use a trivial ORM like [x]hibernate) to identify joins and data semantics.
This perl script may be helpful.
Does anyone out there have any great ideas for a massively scalable hierarchical datastore? It needs rapid inserts and the ability to have many users of the site requesting reports on the number of nodes below a certain node in the hierarchy.
This is the scenario....
I will have a very large number of nodes getting added per hour. Let's say I want to add 1 million nodes per hour. They will likely be appearing all over the hierarchy. Ideally the scale will be in the billions of nodes, but 50 million is a target to aim for. I need to be able to calculate at any time the number of nodes below any given point, and there will likely be many people doing this at the same time. Think of it as a report that many users (100,000 concurrent, perhaps) will be calling for at any one time; they might request all nodes below a certain node.
The database could either be created by a single process reading out of a flat table formatted as an adjacency list (rapid inserts, slow reporting) or it could be a standard design where users of the web site are updating the hierarchy directly if the datastore exists to cope with the massive number of nodes being created.
I already have this implemented in Django using Treebeard and MySQL. I am using a Materialised Path method and it is fairly good but I want lightning speed in comparison. With a datastore of 30,000 nodes I am achieving 120 inserts at the bottom of the tree per minute running on a 2 year old laptop. I want a lot more than this obviously and think that maybe there is a better datastore to use. Maybe PyTables, BigTable, MongoDB or Cassandra?
Easy integration into Python/Django would be good, but I can always write this part of the system in another language if I have to. If we used a single process to read the flat datastore and process it into a really efficient hierarchical datastore that is perfect for reporting, I guess I would have no concurrency issues, which negates the need for transactions.
Anyway, that's enough info to get us started. Is this easy using the right technology?
Have you looked at the Neo4J graph database? It seems pretty darn capable, and has a Python wrapper and some support (in development) for Django. Neo runs on Java, and you can use it either with Jython or JPype and CPython.
I'm wondering if some other non-relational database would be a good fit for activity streams - sort of like what you see on Facebook, Flickr (http://www.flickr.com/activity), etc. Right now, I'm using MySQL but it's pretty taxing (I have tens of millions of activity records) and since they are basically read-only once written and always viewed chronologically, I was thinking that an alternative DB might work well.
The activities are things like:
6 PM: John favorited Bacon
5:30 PM: Jane commented on Snow Crash
5:15 PM: Jane added a photo of Bacon to her album
The catch is that unlike Twitter and some other systems, I can't just simply append activities to lists for each user who is interested in the activity - if I could it looks like Redis would be a good fit (with its list operations).
I need to be able to do the following:
Pull activities for a set or subset of people who you are following ("John" and "Jane"), in reverse date order
Pull activities for a thing (like "Bacon") in reverse date order
Filter by activity type ("favorite", "comment")
Store at least 30 million activities
Ideally, if you added or removed a person who you are following, your activity stream would reflect the change.
I have been doing this with MySQL. My "activities" table is as compact as I could make it, the keys are as small as possible, and it is indexed appropriately. It works, but it just feels like the wrong tool for this job.
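For context, an illustrative (not the asker's actual) version of such a compact table, with indexes matching the access patterns listed above, might look like this:

CREATE TABLE activities (
  activity_id BIGINT UNSIGNED  NOT NULL AUTO_INCREMENT PRIMARY KEY,
  actor_id    INT UNSIGNED     NOT NULL,  -- who did it ("Jane")
  verb        TINYINT UNSIGNED NOT NULL,  -- 1=favorite, 2=comment, ...
  object_type TINYINT UNSIGNED NOT NULL,  -- 1=photo, 2=book, ...
  object_id   INT UNSIGNED     NOT NULL,  -- the "Bacon" / "Snow Crash" id
  created_at  TIMESTAMP        NOT NULL DEFAULT CURRENT_TIMESTAMP,
  KEY ix_actor_time  (actor_id, created_at),                -- stream for followed people
  KEY ix_object_time (object_type, object_id, created_at),  -- stream for a thing
  KEY ix_actor_verb  (actor_id, verb, created_at)           -- filter by activity type
) ENGINE=InnoDB;

-- Stream for a set of followed users, newest first (ids are illustrative):
SELECT *
FROM   activities
WHERE  actor_id IN (101, 102)   -- "John", "Jane"
ORDER BY created_at DESC
LIMIT 50;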
Is anybody doing anything like this outside of a traditional RDBMS?
Update November 2009: It's too early to answer my own question, but my current solution is to stick with MySQL but augment with Redis for fast access to the fresh activity stream data. More information in my answer here: How to implement the activity stream in a social network...
Update August 2014: Years later, I'm still using MySQL as the system of record and using Redis for very fast access to the most recent activities for each user. Dealing with schema changes on a massive MySQL table has become a non-issue thanks to pt-online-schema-change
I'd really, really suggest staying with MySQL (or an RDBMS) until you fully understand the situation.
I have no idea how much performance you need or how much data you plan on storing, but 30M rows is not very many.
If you need to optimise certain range scans, you can do this with (for example) InnoDB by choosing an (implicitly clustered) primary key judiciously, and/or denormalising where necessary.
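A sketch of that idea with invented names: InnoDB stores rows in primary-key order, so putting the range-scanned columns at the front of the PK keeps each user's recent rows physically together and makes the scan cheap.

CREATE TABLE user_activities (
  user_id     INT UNSIGNED    NOT NULL,
  created_at  DATETIME        NOT NULL,
  activity_id BIGINT UNSIGNED NOT NULL,
  payload     VARBINARY(255)  NOT NULL,
  PRIMARY KEY (user_id, created_at, activity_id)   -- the implicit InnoDB clustering key
) ENGINE=InnoDB;

-- Reads one contiguous slice of the clustered index:
SELECT *
FROM   user_activities
WHERE  user_id = 101
AND    created_at >= NOW() - INTERVAL 7 DAY
ORDER BY created_at DESC;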
But like most things, make it work first, then fix performance problems you detect in your performance test lab on production-grade hardware.
EDIT: Some other points:
Key/value databases such as Cassandra, Voldemort, etc. do not generally support secondary indexes.
Therefore, you cannot do a CREATE INDEX
Most of them also don't do range scans (even on the main index) because they mostly use hashing to implement partitioning.
Therefore they also don't do range expiry (DELETE FROM tbl WHERE ts < NOW() - INTERVAL 30 DAY)
Your application must do ALL of this itself or manage without it; secondary indexes are really the killer
ALTER TABLE ... ADD INDEX takes quite a long time in e.g. MySQL with a large table, but at least you don't have to write much code to do it. In a "nosql" database, it will also take a long time BUT also you have to write heaps and heaps of code to maintain the new secondary index, expire it correctly, AND modify your queries to use it.
In short... you can't use a key/value database as a shortcut to avoid ALTER TABLE.
I am also planning on moving away from SQL. I have been looking at CouchDB, which looks promising. Looking at your requirements, I think all of them can be done with CouchDB views and the list API.
It seems to me that what you want to do -- query a large set of data in several different ways and order the results -- is exactly and precisely what RDBMSs were designed for.
I doubt you would find any other datastore that would do this as well as a modern commercial DBMS (Oracle, SQL Server, DB2, etc.), or any open source tool that would accomplish this any better than MySQL.
You could have a look at Google's BigTable, which is not a relational database but can present an 'object'-y personality to your program. It's exceptionally good for free-format text searches and complex predicates. As the whole thing (at least the version you can download) is implemented in Python, I doubt it would beat MySQL in a query marathon.
For a project I once needed a simple database that was fast at lookups, doing lots of reads and just an occasional write. I just ended up writing my own file format.
While you could do this too, it is pretty complex, especially if you need to support it from a web server. With a web server, you would at least need to protect every write to the file and make sure it can be read from multiple threads. The design of this file format is something you should work out as well as possible, with plenty of testing and experimentation. One minor bug could prove fatal for a web project in this style, but if you get it working, it can work really well and extremely fast.
But for 99.999% of all situations, you don't want such a custom solution. It's easier to just upgrade the hardware, move to Oracle, SQL Server or InterBase, use a dedicated database server, use faster hard disks, install more memory, upgrade to a 64-bit system. Those are the more generic tricks to improve performance with the least effort.
I'd recommend learning about message queue technology. There are several open-source options available, and also robust commercial products that would serve up the volume you describe as a tiny snack.
CouchDB is schema-free, and it's fairly simple to retrieve a huge amount of data quickly, because you are working only with indexes. You are not "querying" the database each time, you are retrieving only matching keys (which are pre-sorted making it even faster).
"Views" are re-indexed everytime new data is entered into the database, but this takes place transparently to the user, so while there might be potential delay in generating an updated view, there will virtually never be any delay in retrieving results.
I've just started to explore building an "activity stream" solution using CouchDB, and because the paradigm is different, my thinking about the process had to change from the SQL thinking.
Rather than figure out how to query the data I want and then process it on the page, I instead generate a view that keys all documents by date, so I can easily create multiple groups of data, just by using the appropriate date key, essentially running several queries simultaneously, but with no degradation in performance.
This is ideal for activity streams. I can isolate everything by date, or, along with date isolation, further filter results to a particular subtype, etc., by creating a view as needed; and because the view itself is just JavaScript and all data in CouchDB is JSON, virtually everything can be done client-side to render your page.