I'm trying to build a "relationship" in CouchDB for a Dropbox-like scenario with:
Users
Folders
Files
So far I'm struggling with whether to reference or embed the above things, and I haven't tackled permissions yet. In my scenario I just want to store the path to the files and don't want to work with attachments. Here's what I have:
Option 1 (Separate Documents)
Here I just chain everything together, and (at least to me) it seems to be a copy of an RDBMS model, which should not be the goal when using NoSQL.
{
"id": "user1",
"type": "user",
"folders": [
"folder1",
"folder2"
]
}
{
"id": "folder1",
"type": "folder",
"path": "\\user1\\pictures",
"files": [
"file1",
"file2"
]
}
{
"id": "file1",
"type": "file",
"name": "myDoc.txt",
}
Option 2 (Separate Documents)
In this option I would leave the user document as it is and put the user's id into the folder document for referencing purposes.
{
"id": "user1",
"type": "user",
}
{
"id": "folder1",
"type": "folder",
"path": "\\user1\\pictures",
"owner" "user1",
"files": [
"file1",
"file2"
]
}
{
"id": "file1",
"type": "file",
"name": "myDoc.txt",
}
Option 3 (Embedded Documents)
Similar to Option 2, here I would drop the third document type (file) and embed everything into the folder document. I read that this is only an option if you don't have too many items to store, and I don't know how many items a user will store, for example.
{
"id": "user1",
"type": "user",
}
{
"id": "folder1",
"type": "folder",
"path": "\\user1\\pictures",
"owner" "user1",
"files": [{
"id": "file1",
"type": "file",
"name": "myDoc1.txt"
}, {
"id": "file2",
"type": "file",
"name": "myDoc2.txt"
}
]
}
Option 4
I could also put everything into just one document, but in this scenario that makes no sense. The JSON documents would get too big over time, which is not desirable with regard to performance / load time.
Conclusion
For me, none of the above options seems to fit my scenario, and I would appreciate some input from you on how to design a proper database schema in CouchDB. Or maybe one of the above options is already a good start and I just don't see it.
To provide you with a concrete idea, I'd model a Dropbox clone something like this:
Shares: The root folder that is shared. There is no need to model subfolders, as they don't have different permissions. Here I can set the physical location of the folder and the users that are allowed to use it. I'd expect there to be only a few shares per user, so you can keep the list of shares in memory.
Files: The actual files in the share. Depending on your use case, there's no need to keep the files in a database, as the filesystem is already a great file database! If you need to hash and deduplicate files (as Dropbox does), then you might create a cache in CouchDB.
This would be the document structure:
{
"_id": "share.pictures",
"type": "share",
"owner": "Alice",
"writers": ["Bob", "Carl"],
"readers": ["Dorie", "Eve", "Fred"],
"rootPath": "\\user1\pictures"
},
{
"_id": "file.2z32236e2sdwhatever",
"type": "file",
"path": ["vacations", "2017 maui"],
"filename": "DSC1234.jpg",
"size": 12356789,
"hash": "1235a",
"createdAt": "2017-07-29T15:03:20.000Z",
"share": "share.pictures"
},
{
"_id": "file.sdfwhatever",
"type": "file",
"path": ["vacations", "2015 alaska"],
"filename": "DSC12345.jpg",
"size": 11,
"hash": "acd5a",
"createdAt": "2017-07-29T15:03:20.000Z",
"share": "share.pictures"
}
This way you can build a CouchDB view of files by share and path and query it by folder:
function (doc) {
if (doc.type === 'file') emit([doc.share].concat(doc.path), doc.size);
}
If you want, you can also add a reduce function with just _sum and get a hierarchical size calculator for free (well, almost)!
Assuming you called the database 'dropclone' and added the view to a design document called 'dropclone' with the view name 'files', you would query it like this:
http://localhost:5984/dropclone/_design/dropclone/_view/files?startkey=["share.pictures","vacations"]&endkey=["share.pictures","vacations",{}]
You'd get 12356800 as a result (the sum of both file sizes; note that the emitted keys contain the full path, so a startkey/endkey range is used rather than an exact key match).
For
http://localhost:5984/dropclone/_design/dropclone/_view/files?startkey=["share.pictures","vacations"]&endkey=["share.pictures","vacations",{}]&reduce=false&include_docs=true
You would get both files as a result.
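Putting it together, the design document itself could look like this (a sketch using the names from above):
{
  "_id": "_design/dropclone",
  "views": {
    "files": {
      "map": "function (doc) { if (doc.type === 'file') emit([doc.share].concat(doc.path), doc.size); }",
      "reduce": "_sum"
    }
  }
}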
You can also put the whole share name and path into the _id, because then you can directly access each file by its known path. You can still keep the path redundantly in the document, or leave it out and split the _id into its path components dynamically.
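A minimal sketch of that dynamic splitting, assuming a hypothetical _id layout of share/.../filename:
// hypothetical _id layout: "share.pictures/vacations/2017 maui/DSC1234.jpg"
function splitId(id) {
  var parts = id.split('/');
  return {
    share: parts[0],
    path: parts.slice(1, parts.length - 1),
    filename: parts[parts.length - 1]
  };
}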
Other approaches would be:
Use one CouchDB database per share and use CouchDB's _security mechanism to manage the access.
Split files into chunks, hash them and store the chunk hashes for each file. This way you can virtualize and deduplicate the complete file system. This is what Dropbox does behind the scenes to save storage space.
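A chunked file document might then look roughly like this (a sketch; the chunk size and chunk hashes are made up):
{
  "_id": "file.2z32236e2sdwhatever",
  "type": "file",
  "filename": "DSC1234.jpg",
  "size": 12356789,
  "chunkSize": 4194304,
  "chunks": ["1235a", "9bc2f", "77ab0"]
}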
One thing you shouldn't do is store the files themselves in CouchDB; this gets messy quite quickly. npm experienced that some years ago, and they had to move away from this model in a huge engineering effort.
Data Modeling starts with the queries the application will use.
If your queries are such that a user sees all his/her folders, and opening a folder displays all docs and sub-folders beneath it, then option 1 is a natural fit for the queries.
However, there is one very important question you need to answer first, especially for CouchDB: how large will your database be? If you need a DB partitioned across multiple nodes, performance will suffer, possibly to the point that the DB becomes unresponsive, because opening a folder with many docs would mean searching every partition. This is because partitioning is decided by the hash of the ID, over which the user has no control. Performance will be fine for a small single-node (or non-partitioned) DB.
Option 2 requires you to build an index on "owner", which suffers for the same reason as option 1.
Options 3/4 are forms of denormalization, which address the above performance issue. However, if the docs are large and updated often, the overhead of storage and the cost of compaction may be significant. You need benchmarking for your specific workloads.
In summary, if your target DB will be big and partitioned, there is no easy answer. Careful prototyping and benchmarking would be needed.
I want to allow only uploads of specific file types to Azure Storage to trigger an Azure Function.
Current function.json file:
{
"scriptFile": "__init__.py",
"bindings": [{
"name": "myblob",
"type": "blobTrigger",
"direction": "in",
"path": "{name}.json",
"connection": "storage-dev"
}]
}
Would I just add another path value like this...
"path": "{name}.json",
"path": "{name}.csv"
...or an array of values like this...
"path": [
"{name}.csv",
"{name}.json"
]
Can't seem to find an example in the docs.
EDIT:
Thank you @BowmanZhu! Your guidance was awesome.
Changed the trigger to Event Grid.
I was actually able to create a single Advanced Filter rather than creating multiple Subscriptions.
You want a blobtrigger to monitor two or more paths at the same time.
I can tell you simply: it's not possible. This is why you can't find the relevant documentation; there is no such thing. If you must use blob triggers for your requirements, you can only use multiple blob triggers, one per path.
But you have another option: an Event Grid trigger.
You just need to create multiple event subscriptions (or, as your edit shows, a single one with an advanced filter) and point them at the same endpoint function.
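For illustration, a minimal function.json for an Event Grid-triggered Python function might look like this; the binding name is arbitrary, and the actual file-type filtering lives on the Event Subscription (for example, a single StringEndsWith advanced filter on data.url carrying both ".json" and ".csv" as values):
{
  "scriptFile": "__init__.py",
  "bindings": [{
    "name": "event",
    "type": "eventGridTrigger",
    "direction": "in"
  }]
}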
We are designing an Elasticsearch model for events, their schedules, and the venues where the events take place.
The design is as follows:
Examples of queries we might need:
Find events which are Concerts, between 1/7/2017 and 7/7/2017
Find artists who perform in London where the event is a Theatre play
Find events which are Movies and have a Score > 70%
Find users who attend the event AwesomeEvent
Find venues whose locality is London and which have any event planned from today onward
I've read the Elastic docs and a few articles like this one, as well as some Stack Overflow questions. But I'm still not sure about our model, because it's very specific.
Examples of possible usage:
1) Using nested pattern
{
"title": "Event",
"body": "This great event is going to be...",
"Schedules": [
{
"name": "Schedule 1",
"start": "7.7.2017",
"end": "8.7.2017"
},
{
"name": "Schedule 2",
"start": "10.7.2017",
"end": "11.7.2017"
}
],
"Performers": [
{
"name": "Performer 1",
"genre": "Rock"
},
{
"name": "Performer 2",
"genre": "Pop"
}
],
...
}
Pros:
A flatter model, which sticks to the "key:value" approach
Entity carries all information by itself
Cons:
A lot of redundant data
More complex entities
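For reference, the nested pattern would be backed by a mapping along these lines (a sketch in ES 5.x syntax; the type name event and the field types are my assumptions):
{
  "mappings": {
    "event": {
      "properties": {
        "title": { "type": "text" },
        "body": { "type": "text" },
        "Schedules": {
          "type": "nested",
          "properties": {
            "name": { "type": "keyword" },
            "start": { "type": "date", "format": "d.M.yyyy" },
            "end": { "type": "date", "format": "d.M.yyyy" }
          }
        },
        "Performers": {
          "type": "nested",
          "properties": {
            "name": { "type": "keyword" },
            "genre": { "type": "keyword" }
          }
        }
      }
    }
  }
}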
2) Parent / Child relation between following entities (simplified)
{
"title": "Event",
"body": "This great event is going to be...",
}
{
"title": "Schedule",
"start": "7.7.2017",
"end": "8.7.2017"
}
{
"name": "Performer",
"genre": "Rock"
}
Pros:
Avoiding to duplicate redundant data
Cons:
More joins (even though parent/child are stored on the same shard)
The model is not that flat; I'm not sure about the performance
So far we have a relational database where the model works fine, but it's not fast enough. For example, when you imagine a cinema, one event (movie) can have thousands of schedules in different localities, and we want to achieve a very fast response for the filtering I described in the first part.
I'd appreciate any suggestions leading to a properly designed data model. I would also be glad for a review of my assumptions (some of them might be wrong).
It's hard to denormalize your data. For example, the number of performers in an event is unknown; so if you were to have specific fields for performers, you would need performer1.firstname, performer1.lastname, performer2.firstname, performer2.lastname, etc. If you use a nested field instead, you simply define a nested field Performers under the event index with the correct sub-field mappings, and then you can add as many performers as you want. This will enable you to look up events by performer or performers by event. The same applies to the rest of the indices.
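For example, with such a mapping, looking up events by performer could be done with a nested query like this (a sketch; field names follow the example above):
{
  "query": {
    "nested": {
      "path": "Performers",
      "query": {
        "term": { "Performers.name": "Performer 1" }
      }
    }
  }
}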
As far as parent-child vs nested goes, parent-child gives you more independence, since child documents are stored as separate documents rather than inside the parent. Nested fields can additionally specify the "include_in_parent" option to automatically denormalize the sub-fields into the parent document for you.
In scanning the docs I cannot find how to update part of a document.
for example - say the whole document looks like this:
{
"Active": true,
"Barcode": "123456789",
"BrandID": "9f3751ef-f14f-464a-bb86-854e99cf14c0",
"BuyCurrencyOverride": ".37",
"BuyDiscountAmount": "45.00",
"ID": "003565a3-4a0d-47d9-befb-0ac642cb8057",
}
but I only want to work with part of the document as I don't want to be selecting / updating the whole document in many cases:
{
"Active": false,
"Barcode": "999999999",
"BrandID": "9f3751ef-f14f-464a-bb86-854e99cf14c0",
"ID": "003565a3-4a0d-47d9-befb-0ac642cb8057",
}
How can I use N1QL to update just those fields? UPSERT completely replaces the whole document, and the UPDATE statement documentation is not that clear.
Thanks
The answer to your question depends on why you want to update only part of the document (e.g., are you concerned about network bandwidth?), and how you want to perform the update (e.g., from the web console? from a program using the SDK?).
The 4.5 sub-document API, for which you provided a link in your comment, is a feature only available via the SDK (e.g., from Go or Java programs), and the goal of that feature is to reduce network bandwidth by not transmitting entire documents around. Does your use case include programmatic document modifications via the SDK? If so, then the sub-document API is a good way to go.
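For illustration, here is a rough sketch using the Node.js SDK's sub-document API (SDK 2.x style; the bucket name and the assumption that the document key equals the ID field are mine):
var couchbase = require('couchbase');
var cluster = new couchbase.Cluster('couchbase://localhost');
var bucket = cluster.openBucket('mybucket');

// mutate only two fields; the rest of the document is left untouched
bucket.mutateIn('003565a3-4a0d-47d9-befb-0ac642cb8057')
  .replace('Active', false)
  .replace('Barcode', '999999999')
  .execute(function (err, result) {
    if (err) throw err;
    console.log('fields updated');
  });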
Using the "UPDATE" statement in N1QL is a good way to change any number of documents that match a pattern for which you can specify a "WHERE" clause. As noted above, it works very similarly to the "UPDATE" statement in SQL. To use your example above, you could change the "Active" field to false in any documents where the BuyDiscountAmount was "45.00":
UPDATE `mybucket` SET Active = false WHERE BuyDiscountAmount = "45.00"
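If you know the document key, you can also target a single document and set only the fields from the question (a sketch; the bucket name mybucket and the use of the ID field as the document key are assumptions):
UPDATE `mybucket`
USE KEYS "003565a3-4a0d-47d9-befb-0ac642cb8057"
SET Active = false, Barcode = "999999999"
RETURNING *;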
When running N1QL UPDATE queries, almost all the network traffic will be between the Query, Index, and Data nodes of your cluster, so a N1QL update does not cause much network traffic into/out-of your cluster.
If you provide more details about your use case, and why you want to update only part of your documents, I could provide more specific advice on the right approach to take.
The sub-doc API introduced in Couchbase 4.5 is currently not used by N1QL. However, you can use the UPDATE statement to update parts of one or more documents.
http://developer.couchbase.com/documentation/server/current/n1ql/n1ql-language-reference/update.html
Let me know if you have any questions.
-Prasad
It is simple, like a SQL query:
update `Employee` set District='SambalPur' where EmpId="1003"
and here is the response:
{
"Employee": {
"Country": "India",
"District": "SambalPur",
"EmpId": "1003",
"EmpName": "shyam",
"Location": "New-Delhi"
}
}
I'm new to Couchbase, and I'm looking for a solution to scale my social network. Couchbase looks very interesting, especially its easy-to-scale features.
But I'm struggling with creating a view for a specific kind of document.
My documents looks like this:
{
"id": 9476182,
"authorid": 86498,
"content": "some text here",
"uid": 41,
"accepted": "N",
"time": "2014-12-09 09:58:03",
"type": "testimonial"
}
{
"id": 9476183,
"authorid": 85490,
"content": "some text here",
"uid": 41,
"accepted": "Y",
"time": "2014-12-09 10:44:01",
"type": "testimonial"
}
What I'm looking for is a view that would be equivalent to this SQL query:
SELECT * FROM bucket WHERE (uid='$uid' AND accepted='Y') OR
(uid='$uid' AND authorid='$logginid')
This way I could fetch all of a user's testimonials, even the ones not yet approved, if the user viewing the testimonials page is the owner of that page; otherwise, show all of the given user's testimonials where accepted == "Y", plus the testimonials not yet approved but written by the user who is viewing the page.
If you could give me some tips about this, I'd be very grateful.
Unlike SQL, you cannot directly pass input parameters into views; however, you can emulate this to some extent by filtering on ranges.
While this does not exactly match the SQL, I would suggest you simply filter testimonials based on the user ID and then do the rest of the filtering on the client side. I am assuming that in most cases there will not even be any pending testimonials, so you will not really end up with a lot of unnecessary data.
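A minimal sketch of that approach (the map function follows the documents above; loggedInUserId is a placeholder for the viewer's id):
// map function: index testimonials by the profile (uid) they belong to
function (doc, meta) {
  if (doc.type === 'testimonial') {
    emit(doc.uid, { authorid: doc.authorid, accepted: doc.accepted });
  }
}

// client side: query the view with ?key=41 and keep only the visible rows
var visible = rows.filter(function (row) {
  return row.value.accepted === 'Y' || row.value.authorid === loggedInUserId;
});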
Note that it is possible to filter this using views entirely, however it would require:
Bigger keys OR
Multiple views OR
Multiple queries
In general it is recommended to keep the emitted keys small, as this improves performance; so it is better to stick with the above-mentioned solution.
A coworker and I are in a heated debate regarding the design of a REST service. For most of our API, GET calls to collections return something like this:
GET /resource
[
{ "id": 1, ... },
{ "id": 2, ... },
{ "id": 3, ... },
...
]
We now must implement a call to a collection of properties whose identifying attribute is "name" (not "id" as in the example above). Furthermore, there is a finite set of properties and the order in which they are sent will never matter. The spec I came up with looks like this:
GET /properties
[
{ "name": "{PROPERTY_NAME}", "value": "{PROPERTY_VALUE}", "description": "{PROPERTY_DESCRIPTION}" },
{ "name": "{PROPERTY_NAME}", "value": "{PROPERTY_VALUE}", "description": "{PROPERTY_DESCRIPTION}" },
{ "name": "{PROPERTY_NAME}", "value": "{PROPERTY_VALUE}", "description": "{PROPERTY_DESCRIPTION}" },
...
]
My coworker thinks it should be a map:
GET /properties
{
"{PROPERTY_NAME}": { "value": "{PROPERTY_VALUE}", "description": "{PROPERTY_DESCRIPTION}" },
"{PROPERTY_NAME}": { "value": "{PROPERTY_VALUE}", "description": "{PROPERTY_DESCRIPTION}" },
"{PROPERTY_NAME}": { "value": "{PROPERTY_VALUE}", "description": "{PROPERTY_DESCRIPTION}" },
...
}
I cite consistency with the rest of the API as the reason to format the response collection my way, while he cites that this particular collection is finite and the order does not matter. My question is, which design best adheres to RESTful design and why?
IIRC how you return the properties of a resource does not matter in a RESTful approach.
http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm
From an API client's point of view I would prefer your solution, considering it explicitly states that the name of a property is XYZ.
Your coworker's solution implies the key is the name, but how would I know for sure (without reading the API documentation)? Try not to assume anything about your consuming clients; just because you know what it means (and it is probably easy enough to guess) does not mean it is obvious for your clients.
On top of that, it could break consuming clients if you ever decide to change that key from a name back to an ID, which in this case you have already done in the past. All the clients would then need to change their code, whereas with your solution they would not have to, unless they need the newly added id (or some other property).
To me the approach would depend on how you need to use the data. Are the property names known beforehand by the consuming system, such that a map lookup could be used to access the desired record directly without iterating over each item? Would there be a method such as...
GET /properties/{PROPERTY_NAME}
If you need to look up properties by name and that sort of method is NOT available, then I would agree with the map approach; otherwise, I would go with the array approach to provide consistent results when querying the resource for a full collection.
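For what it's worth, a client that wants map-style lookups can build one from the array response in a couple of lines (plain JavaScript sketch; properties is the parsed response body):
// turn the array of properties into a name -> property map
var byName = {};
properties.forEach(function (p) { byName[p.name] = p; });

// byName['{PROPERTY_NAME}'].value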
I think returning a map is fine as long as the result is not paginated or sorted server side.
If you need the result to be paginated and sorted on the server side, going for the list approach is a much safer bet, as not all clients might preserve the order of a map.
In fact, in JavaScript there is no built-in guarantee that maps will stay sorted (see also https://stackoverflow.com/a/5467142/817385).
The client would need to implement some logic to restore the sort order, which can become especially painful when the server and client use different collations for sorting.
Example
// server sent response sorted with german collation
var map = {
'ä':{'first':'first'},
'z':{'second':'second'}
}
// but we sort the keys with the default Unicode collation algorithm
Object.keys(map).sort().forEach(function(key){console.log(map[key])})
// Object {second: "second"}
// Object {first: "first"}
A bit late to the party, but for whoever stumbles upon this with similar struggles...
I would definitely agree that consistency is very important, and would generally say that an array is the most appropriate way to represent a list. APIs should also be designed to be useful in general, preferably without optimizing for a specific use case. Sure, that could make the use case you're facing today a bit easier to implement, but it will probably make you want to hit yourself when you're implementing a different one tomorrow. All that being said, for quite a few applications the map-formed response would simply be easier (and possibly faster) to work with.
Consider:
GET /properties
[
{ "name": "{PROPERTY_NAME}", "value": "{PROPERTY_VALUE}", "description": "{PROPERTY_DESCRIPTION}" },
...
]
and
GET /properties/*
{
"{PROPERTY_NAME}": { "value": "{PROPERTY_VALUE}", "description": "{PROPERTY_DESCRIPTION}" },
...
}
So / gives you a list whereas /* gives you a map. You might read the * in /* as a wildcard for the identifier, so you're actually requesting the entities rather than the collection. The keys in the response map are simply the expansions of that wildcard.
This way you can maintain consistency across your API while the client can still enjoy the map-format response when preferred. Also you could probably implement both options with very little extra code on your server side.