I want to build an application that uses data from several endpoints.
Let's say I have:
JSON API for getting cinema data
XML Export for getting data about ???
Another JSON API for something else
A CSV file for some more stuff ...
In my application I want to bring all this data together and build views for it and so on ...
My idea was to set up a database and create schemas for all these data sources, so I can write some kind of "import scripts" that I can call whenever I want to get the latest data.
I thought of schemas because I want to be able to easily adapt to a new API with any kind of schema.
Please enlighten me about the possibilities and best practices out there (theory and practice if possible :P)
You are totally right on making a database. But the real problem is probably not going to be how to store your data. It's going to be how to make it fit together logically and semantically.
I suggest you first take a good look at what your endpoints can provide. Get several samples from every source and analyze them if you can. How will you know which data is new? How can you match it against existing data and against data from other sources? If existing data changes or gets deleted, how will you detect and handle that? What if sources disagree on something? How and when should you run the synchronization? What will you do if one of your sources goes down? Etc.
It is extremely difficult to make data consistent if your data sources are not. As a rule, if the sources are different, they are not consistent. Thus the proverb "garbage in, garbage out". We, humans, have no problem dealing with small inconsistencies, but algorithms cannot work correctly if there are discrepancies. Even if everything fits together on paper, one usually forgets that data can change over time...
At least that's my experience in such cases.
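To make the "which data is new, and how do I match it" questions concrete, here is a minimal Python sketch of an import script. It assumes every source can be mapped onto one common record shape and that each source has a stable id. The payload fields (id, title, key, label), the table name item and the source labels are all made up for illustration; they are not taken from any real API.

```python
import csv
import io
import json
import sqlite3

def from_cinema_json(text):
    """Map a (hypothetical) cinema JSON payload onto the common record shape."""
    for item in json.loads(text):
        yield {"source": "cinema_api", "source_id": str(item["id"]), "name": item["title"]}

def from_csv(text):
    """Map a (hypothetical) CSV export onto the same record shape."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"source": "csv_export", "source_id": row["key"], "name": row["label"]}

def import_records(conn, records):
    """Upsert records keyed by (source, source_id), so re-running an import is safe."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS item ("
        " source TEXT NOT NULL, source_id TEXT NOT NULL, name TEXT,"
        " PRIMARY KEY (source, source_id))"
    )
    for rec in records:
        # An existing row with the same key is overwritten with the latest data.
        conn.execute(
            "INSERT OR REPLACE INTO item (source, source_id, name)"
            " VALUES (:source, :source_id, :name)",
            rec,
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
import_records(conn, from_cinema_json('[{"id": 1, "title": "Rex"}]'))
import_records(conn, from_csv("key,label\na1,Odeon\n"))
# Second run of the same source: the changed title replaces the old row.
import_records(conn, from_cinema_json('[{"id": 1, "title": "Rex Cinema"}]'))
rows = conn.execute("SELECT source, source_id, name FROM item ORDER BY source").fetchall()
```

This only covers new and changed records. Detecting deletions and resolving disagreements between sources still needs rules of your own.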
I'm not sure whether you want to display all the data in the same view or create different views for each of the sources. If you want to display the data in the same view, like a grid, I would recommend using inheritance or an interface, depending on your data and needs. I would recommend setting this structure up in the database too, using different tables for the different sources and a parent table that is related to all of them and has a type associated with it.
Here's a good thread with discussion about choosing an interface or inheritance.
Inheritance vs. interface in C#
And here are some examples of representing inheritance in a database.
How can you represent inheritance in a database?
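As a rough illustration of the parent-table idea, here is a small Python/sqlite3 sketch. The table and column names (entry, cinema, csv_item, title) are invented for the example, and this is only one of several ways to represent inheritance in a relational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Parent table: one row per record, whatever its source.
CREATE TABLE entry (
    id   INTEGER PRIMARY KEY,
    type TEXT NOT NULL CHECK (type IN ('cinema', 'csv_item'))
);
-- One child table per source, sharing the parent's primary key.
CREATE TABLE cinema (
    entry_id INTEGER PRIMARY KEY REFERENCES entry(id),
    name     TEXT NOT NULL,
    city     TEXT
);
CREATE TABLE csv_item (
    entry_id INTEGER PRIMARY KEY REFERENCES entry(id),
    label    TEXT NOT NULL
);
""")

def add_cinema(conn, name, city):
    """Insert the parent row first, then the source-specific row."""
    cur = conn.execute("INSERT INTO entry (type) VALUES ('cinema')")
    conn.execute(
        "INSERT INTO cinema (entry_id, name, city) VALUES (?, ?, ?)",
        (cur.lastrowid, name, city),
    )
    return cur.lastrowid

def add_csv_item(conn, label):
    cur = conn.execute("INSERT INTO entry (type) VALUES ('csv_item')")
    conn.execute("INSERT INTO csv_item (entry_id, label) VALUES (?, ?)", (cur.lastrowid, label))
    return cur.lastrowid

add_cinema(conn, "Rex", "Berlin")
add_csv_item(conn, "Odeon")

# A single grid-style listing across all sources, driven by the parent table.
rows = conn.execute("""
    SELECT e.id, e.type, COALESCE(c.name, i.label) AS title
    FROM entry e
    LEFT JOIN cinema c   ON c.entry_id = e.id
    LEFT JOIN csv_item i ON i.entry_id = e.id
    ORDER BY e.id
""").fetchall()
```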
I am coming from an object-relational database background. I understand Couchbase is schema-less, but data migration will still happen as the application develops.
In SQL we have management tools to alter a table, or I can write a migration script in SQL to migrate from a version 1 table to a version 2 table.
But with documents, say we have a JSON document UserProfile:
UserProfile
{
"Owner": "Rich guy!",
"Car": "Cool car"
}
We might want to add a last-visit field and allow a user to have multiple cars, so the updated document becomes:
UserProfile
{
"Owner": "Rich guy!",
"Car": ["Cool car", "Another car"],
"LastVisit": "2015-09-29"
}
But for easier maintenance, I want all other UserProfile documents to follow the same format, with the "Car" field as an array.
From my experience in SQL, I could write a migration script that supports migrating between different versions of a table: from version 1 to version 2...N.
So how should I write such migration code? Will I really have to write an app (executable) using the Couchbase SDK to migrate all the documents each time?
What would be a good way to do a migration like this?
Essentially, your problem breaks down into two parts:
Finding all the documents that need to be updated.
Retrieving and updating said documents.
You can do this in one of two ways: using a view that gives you the document ids, or using a DCP stream to get all the documents from the bucket. The view only gives you the ids of the documents, so you basically iterate over all the ids, and then retrieve, update and store each one using regular key-value methods. The DCP protocol, on the other hand, gives you the actual documents.
The advantage of using a view is that it's very simple to implement, works with any language SDK, and it lets you write your own logic around the process to make it more robust and safe. The disadvantage is having to build a view just for this, and also that if the data keeps changing, you must retrieve the ENTIRE view result at once, because if you try to page over the view with offsets, the ordering of results can change, thus giving you an inconsistent snapshot of the data.
The advantage of using DCP to stream all documents is that you're guaranteed to get a consistent snapshot of your data even if it's constantly changing, and also that you get the whole document directly as part of the stream, so you don't need to retrieve it separately - just update and store back to the database. The disadvantage is that it's currently only implemented in the Java SDK and is considered an experimental feature. See this blog for a simple implementation.
The third - and most convenient for an SQL user - way to do this is through the N1QL query language that's introduced in Couchbase 4. It has the same data manipulation commands as you would expect in SQL, so you could basically issue a command along the lines of UPDATE myBucket SET prop = {'field': 'value'} WHERE condition = 'something'. The advantage of this is pretty clear: it both finds and updates the documents all at once, without writing a single line of program code. The disadvantage is that the DML commands are considered "beta" in the 4.0 release of Couchbase, and that if the data set is too large, then it might not actually work due to timing out at some point. And of course, the fact that you need Couchbase 4.0 in the first place.
I don't know of any official tool currently to help with data model migrations, but there are some helpful code snippets depending on the SDK you use (see e.g. bulk updates in Java).
For now you will have to write your own script. The basic process is as follows:
Make sure all your documents have a model_version attribute that you increment after each migration.
Before a migration, update your application code so it can handle both the old and the new model_version, and so that new documents are written in the new model.
Write a script that iterates through all the old-model documents in your bucket (you need a view that emits the document key), makes the update you want, increments model_version and saves the document back.
In a high-concurrency environment it's important to have good error handling and monitoring. For example, you could have a view that counts how many documents are in each model_version.
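The steps above can be sketched in Python roughly as follows. To avoid guessing at SDK calls, the bucket is just a plain dict standing in for Couchbase; in a real script the loop would iterate over keys emitted by your view and use your SDK's get/store operations. The function names are made up, and treating a missing model_version as version 1 is an assumption.

```python
def migrate_v1_to_v2(doc):
    """Return a v2 copy of a v1 UserProfile: "Car" becomes an array, "LastVisit" is added."""
    new_doc = dict(doc)
    car = new_doc.get("Car")
    if car is None:
        new_doc["Car"] = []
    elif not isinstance(car, list):
        new_doc["Car"] = [car]
    new_doc.setdefault("LastVisit", None)
    new_doc["model_version"] = 2
    return new_doc

# One migration function per source version, so v1 -> v2 -> ... -> vN can be chained.
MIGRATIONS = {1: migrate_v1_to_v2}
LATEST_VERSION = 2

def migrate_bucket(bucket):
    """Bring every document up to LATEST_VERSION; return how many were changed."""
    migrated = 0
    for key in list(bucket):
        doc = bucket[key]
        # Assumption: documents without the attribute predate versioning.
        version = doc.get("model_version", 1)
        changed = False
        while version < LATEST_VERSION:
            doc = MIGRATIONS[version](doc)
            version = doc["model_version"]
            changed = True
        if changed:
            bucket[key] = doc
            migrated += 1
    return migrated

bucket = {
    "user::1": {"Owner": "Rich guy!", "Car": "Cool car"},
    "user::2": {"Owner": "Other guy", "Car": ["A", "B"],
                "LastVisit": "2015-09-29", "model_version": 2},
}
first_run = migrate_bucket(bucket)
second_run = migrate_bucket(bucket)  # nothing left to do, so the script is safe to re-run
```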
You can use Couchmove, which is a Java migration tool that works like Flyway DB.
You can execute N1QL queries with this tool to migrate your documents and keep track of your changes.
If I understood correctly, the crux here is getting and then updating every Couchbase document. This can be done with a view, provided that you understand that views are only 'eventually consistent' (unlike read/write actions, which are strongly consistent).
If (at migration time) no new documents are added to your bucket, then your view would be up to date and should return the entire set of documents to be migrated. Easy.
On the other hand, if new documents continue to be written into your bucket, and these documents need to be migrated, then you will have to run your migration code continually to catch all these new docs (since the view won't return them until it is updated, a few seconds later).
In this 2nd scenario, while migration is happening, your bucket will contain a heterogeneous collection of docs: some that have been migrated already, some that are about to be migrated and some that your view has not 'seen' yet (because they were recently added) and would only be migrated once you re-run the migration code.
To make the migration process efficient, you'll need to find a way to differentiate between already-migrated items and yet-to-be-migrated items. You can add a field to each doc with its 'version number' and update it during the migration. Your view should be defined to only select documents with older 'version number' and ignore already-migrated items.
I suggest you read more about couchbase views - here and on their site.
Regarding your migration: There are two aspects here: (1) getting the list of document ids that need to be updated and (2) the actual update.
The actual update is simple: you retrieve the doc and save it again with the new format. There's no explicit schema. Where once you added a column in SQL and populated it, you now just add a field in the JSON doc (in all the docs). All migrated docs should have this field. Side note: Things get a little more complicated if (while you're migrating) the document can be updated by another process. This requires special handling (read about CAS if that's the case).
Getting all the relevant doc keys requires that you define a view and query it. It's beyond the scope of this answer (and is very well documented). Once you have all the keys, you simply iterate over them one by one and update them.
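For the CAS side note, the usual pattern is: read the doc together with its CAS value, write it back only if the CAS is unchanged, and retry on a mismatch. Here is a self-contained Python sketch of that pattern. InMemoryStore is a toy stand-in written for this example, not the Couchbase SDK, and the method names are assumptions.

```python
class CasMismatch(Exception):
    """Raised when the document changed between our read and our write."""

class InMemoryStore:
    """Toy key-value store with compare-and-swap; stands in for a real SDK."""
    def __init__(self):
        self._data = {}  # key -> (cas, doc)

    def upsert(self, key, doc):
        cas = self._data.get(key, (0, None))[0] + 1
        self._data[key] = (cas, dict(doc))
        return cas

    def get(self, key):
        cas, doc = self._data[key]
        return cas, dict(doc)

    def replace(self, key, doc, cas):
        current_cas, _ = self._data[key]
        if current_cas != cas:
            raise CasMismatch(key)
        self._data[key] = (current_cas + 1, dict(doc))

def update_with_retry(store, key, transform, max_attempts=5):
    """Read, transform and write back; re-read and retry if someone else wrote first."""
    for _ in range(max_attempts):
        cas, doc = store.get(key)
        try:
            store.replace(key, transform(doc), cas)
            return True
        except CasMismatch:
            continue  # another writer got in between: start over from a fresh read
    return False

def car_to_array(doc):
    if not isinstance(doc["Car"], list):
        doc["Car"] = [doc["Car"]]
    return doc

store = InMemoryStore()
store.upsert("user::1", {"Owner": "Rich guy!", "Car": "Cool car"})
ok = update_with_retry(store, "user::1", car_to_array)
```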
With N1QL, Couchbase provides the same schema migration capabilities as you have in RDBMS or object-relational database. For the example in your question, you can place the following query in a migration script:
UPDATE UserProfile
SET Car = TO_ARRAY(Car),
LastVisit = NOW_STR();
This will migrate all the documents in your bucket to your new schema. Note that update statements in Couchbase provide document-level atomicity, not statement-level atomicity. But since this update is idempotent (repeatable), you can run it multiple times if you run into errors. Note: similar to the last paragraph of David's answer above.
PROBLEM
I am developing an app where the data model will be very similar to JSFiddle's. A user will create a new entry that will be assigned a GUID in the database. My question is how to handle when other users want to modify/fork/version the original entry. JSFiddle handles this by versioning the entry (so the URL becomes something like jsfiddle.net/GUID/1).
What is the benefit to JSFiddle's method over assigning a new GUID to the modified version and just recording a relationship to the original entry in the database?
It seems like no matter what I will have to create a new entry in the database that will essentially be a modified copy of the original.
Also, there will be both registered and anonymous users just like JSFiddle. The registered users should be able to log in and see all of their own entries and possibly the versions/forks that exist off of their own entries (though this isn't currently a requirement).
Am I missing something? Is there a right and wrong way to do this?
TECH
Using parse.com's RESTful API for data CRUD; node on the server.
What is the benefit to JSFiddle's method over assigning a new GUID to the modified version and just recording a relationship to the original entry in the database?
I would imagine none, both would require the same copy operation and the same double query (in MongoDB) to get the parent.
The only difference is what field you go by.
Am I missing something?
Not that I can see.
Is there a right and wrong way to do this?
It seems as though you have this pretty well covered frankly.
MVCC does seem the right way to do this in some respects, but you don't have to go the whole hog. If you did, there might be cause to change to a database that has it built in, like CouchDB, because an implementation in MongoDB would sit on top of its existing lock mechanisms; it's like adding a lock on a lock.
I have a database component that I'm trying to make as general as possible. Is it possible to accomplish this:
Take in a custom class that I don't have the definition for
Recreate that class locally from the foreign instance
Basically I can't include the definition of objects that will be stored in the database but I want the database to process the raw data of whatever class passed in, store it, and be able to provide it again as an object.
Ideally I could cast it back to its custom class when the Object gets back from the database.
It sounds like what you are asking for is serialization.
Serialization in AS3 is possible through a few different methods. I recommend you refer to this article as it describes the method quite clearly.
To elaborate, once you serialize your object, you send it to the server and pair it with a key in a database. Then you can deserialize it back into the original object by downloading it from the server again.
I think you're going to find that there are a lot of pitfalls with what you want to do. I suspect that you'll find over the long haul that you can solve the problem in other ways, since someone, somewhere needs a definition of the Class you're instantiating (you also need to think about what happens if you have two instances with conflicting definitions).
Probably a better way to go is to make your app more data driven--where every object can be built based on the data about it. If you need to be able to swap out implementations, consider storing the Class definitions in external swfs and downloading those based on paths or other information stored in the database. Again, you need to consider what will happen if the implementations collide between multiple swfs.
Can you expand on what you're trying to do? It's easier to give you clearer instructions with more information.
I have lots of stuff in an app.config, and when changes are necessary, an app restart is required. Bad for my 24x7 web server system (it really is 24x7, not even 23x7). I would like to use a good strategy for keeping the config information in a DB table and query/use it as needed. I googled around a bit and am coming up dry. Does anyone have any suggestions before I re-invent the wheel?
Thanks.
I needed exactly this for my recent application, and couldn't use any application server specific techniques as I needed some console apps run on cronjobs to access them too.
I basically made a couple of small tables to create a registry-style configuration database. I have a table of keys (which all have parent-keys so they can be arranged in a tree structure) and a table of values which are attached to keys. All keys and values are named, so my access functions look like this:
openKey("/my_app");
createKey("basic_settings");
openKey("basic_settings");
createValue("log_directory","c:\logs");
getValue("/my_app/basic_settings","log_directory");
The tree structure allows you to logically separate similar data (e.g. you can have a "log_directory" value under several different keys) and avoids having the overly verbose names you find in properties files.
All the values are just strings (varchar2 in the db), so there's some overhead in converting booleans and numbers: but it's only config data, so who cares?
I also create a "settings_changed" value that has a datetime string in it, so any app can quickly tell if it needs to refresh its configuration (you obviously need to remember to set it when you change anything though).
There may be tools out there to do this kind of thing already, but this was only a day's worth of coding and works a treat. I added command line tools to edit and upload/download parts or all of the tree, then made a quick graphical editor in Java Swing.
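For anyone who wants a starting point, here is a rough Python/sqlite3 sketch of the same registry-style idea. It is not my original code. The table names, function names and the "/meta" key used to hold settings_changed are all made up for the example, and setting a value stamps settings_changed automatically rather than leaving it to the caller.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE config_key (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES config_key(id),  -- NULL for top-level keys
    name      TEXT NOT NULL
);
CREATE TABLE config_value (
    key_id INTEGER NOT NULL REFERENCES config_key(id),
    name   TEXT NOT NULL,
    value  TEXT NOT NULL,                          -- everything is stored as a string
    PRIMARY KEY (key_id, name)
);
"""

def open_key(conn, path, create=False):
    """Walk a path such as "/my_app/basic_settings" and return the id of the last key."""
    parent = None
    for part in path.strip("/").split("/"):
        row = conn.execute(
            "SELECT id FROM config_key WHERE name = ? AND parent_id IS ?",
            (part, parent),
        ).fetchone()
        if row is None:
            if not create:
                raise KeyError(path)
            cur = conn.execute(
                "INSERT INTO config_key (parent_id, name) VALUES (?, ?)", (parent, part)
            )
            parent = cur.lastrowid
        else:
            parent = row[0]
    return parent

def _put(conn, path, name, value):
    key_id = open_key(conn, path, create=True)
    conn.execute(
        "INSERT OR REPLACE INTO config_value (key_id, name, value) VALUES (?, ?, ?)",
        (key_id, name, value),
    )

def set_value(conn, path, name, value):
    """Store a value and stamp settings_changed so other apps know to refresh."""
    _put(conn, path, name, str(value))
    _put(conn, "/meta", "settings_changed", datetime.now(timezone.utc).isoformat())
    conn.commit()

def get_value(conn, path, name):
    row = conn.execute(
        "SELECT value FROM config_value WHERE key_id = ? AND name = ?",
        (open_key(conn, path), name),
    ).fetchone()
    if row is None:
        raise KeyError(f"{path}:{name}")
    return row[0]

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
set_value(conn, "/my_app/basic_settings", "log_directory", r"c:\logs")
log_dir = get_value(conn, "/my_app/basic_settings", "log_directory")
changed_at = get_value(conn, "/meta", "settings_changed")
```

A long-running app can keep its settings cached and re-read them only when the settings_changed string differs from the one it saw last.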