Couchbase: N1QL JOIN performance issue - json

I'm getting familiar with Couchbase (I'm starting with the Server Community Edition); my goal is to migrate our current SQLite database to Couchbase in order to build an efficient real-time synchronization mechanism with mobile devices.
The first steps have been positive so far, we've created buckets (one bucket per SQLite table) and imported all data (one JSON document per SQLite row).
Also, in order to allow complex queries and filtering, we've created indices (both primary and secondary) for all buckets.
To summarize, we have two main buckets:
1) players, which contains documents with the following structure
{
"name": "xxx",
"transferred": false,
"value": n,
"playmaker": false,
"role": "y",
"team": "zzz"
}
2) marks, with the following structure (where the "player" field is a reference to a document ID in the players bucket)
{
"drawgoal": 0,
"goal": 0,
"owngoal": 0,
"enter": 1,
"mpenalty": 0,
"gotgoal": 0,
"ycard": 0,
"assist": 0,
"wingoal": 0,
"mark": 6,
"penalty": 0,
"player": "xxx",
"exit": 0,
"fmark": 6,
"team": "yyy",
"rcard": 0,
"source": "zzz",
"day": 1,
"spenalty": 0
}
So far so good. However, when I try to run complex N1QL queries that require a JOIN, performance is pretty bad compared to SQLite.
For instance, this query takes around 3 seconds to be executed:
select mark.*, player.`role` from players player join marks mark on key mark.player for player where mark.type = "xxx" and mark.day = n order by mark.team asc, player.`role` desc;
We currently have 600 documents in players (disk used = 16MB, RAM used = 12MB) and 20K documents in marks (disk used = 70MB, RAM used = 17MB), which should not be much from my point of view.
Are there any settings I can tune to improve JOIN performance? Any specific index I can create?
Is this performance degradation the price to pay to have more flexibility and more features compared to SQLite?
Should I avoid as much as possible using JOIN in Couchbase and instead duplicate data where needed?
Thanks

I found the answer :)
By changing the query to:
select marks.*, players.`role` from marks join players on keys marks.player where marks.day = n and marks.type = "xxx" order by marks.team asc, players.`role` desc;
execution time drops to less than 300 milliseconds. Apparently, inverting the JOIN (from marks to players) dramatically improves the performance.
The reason this query is much faster than the other one is that Couchbase evaluates it as follows:
1) it first retrieves all marks documents matching the filtering conditions
2) it then joins them with the corresponding players documents
By doing so, the number of documents to join is much lower, hence the execution time drops.

I think you've left some details out, so I'm going to fill in the blanks with my guesses. First, a JSON document can't have a field like "value": n. It needs to be a string like "n" or a number like 1. I have assumed you mean a literal number, so I put 1 in there.
Next, let's look at your query:
select m.*, p.`role`
from players p
join marks m on key m.player for p
where m.type = "xxx"
and m.day = 1
order by m.team asc, p.`role` desc;
Again, you had m.day = n, so I put m.day = 1. This query does not run without an index. I'm going to assume you created a primary index (which will scan the whole bucket, and is not ideal for production):
create primary index on players;
create primary index on marks;
The query still doesn't run, so you must have added an index on the 'players' field in marks:
create index ix_marks_player on marks(player);
The query runs, but returns no results, because your example documents are missing a "type": "xxx" field. So I added that field, and now your query runs.
Look at the Plan Text by just clicking "plan text" (If you were using Enterprise, you would see a visual version of the Plan diagram).
The plan text shows that the query is using a PrimaryScan on the players bucket. Indeed, your query is attempting to join every player document. So as the player bucket grows, the query will get slower.
In your answer here on SO, you say that a different query to get the same data works faster:
select m.*, p.`role`
from marks m
join players p on keys m.player
where m.day = 1
and m.type = "xxx"
order by m.team asc, p.`role` desc;
You swapped the join, but looking at the plan text, you are still running a PrimaryScan. This time it's scanning all the marks documents. I'm assuming you have fewer of those (either fewer total, or since you are filtering on day you have fewer to join).
So my answer is basically: do you always need to join all of the documents?
If so, why? If not, I suggest you modify your query to add a LIMIT/OFFSET (perhaps for paging) or some other filter so you aren't querying for everything.
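For instance, a paged version of the faster query might look something like this (a sketch; the limit/offset values are placeholders to adjust for your paging):
select m.*, p.`role`
from marks m
join players p on keys m.player
where m.day = 1
and m.type = "xxx"
order by m.team asc, p.`role` desc
limit 50 offset 0;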
One more point: it looks like you are using buckets for organizational purposes. This isn't strictly wrong, but it's not really going to scale. Buckets are distributed across the cluster, so you are limited in the number of buckets you can reasonably use (there may even be a hard limit at 10 buckets).
I don't know your use case, but often it's better to use a "type"/"_type"/"docType"/etc value in your documents for organization instead of relying on buckets.
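As a rough sketch (assuming a single bucket named mybucket and a docType field; both names are placeholders), the "tables" then become filters in your queries and partial indexes:
create index ix_marks_day on mybucket(day) where docType = "mark";
select d.* from mybucket d where d.docType = "mark" and d.day = 1;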

The first steps have been positive so far, we've created buckets (one bucket per SQLite table) and imported all data (one JSON document per SQLite row)
You have a problem here. You have tried to map a SQL database schema to a document database schema without regard for best practices or even dire warnings in Couchbase's documentation.
First, you should be using one bucket. A bucket is more like a database than a table (although it's more complex than that) and Couchbase recommends using a single bucket per cluster unless you have a very good reason not to. It helps with performance, scaling, and resource utilization. Each of your documents should have a field that indicates the type of data. This is what separates your "tables". I use a field named '_type'. Eg. you will have 'player' and 'mark' document types.
Second, you should rethink importing the data as one row per document. Document databases give you different schema options and some are very useful for improving performance. You certainly can keep it this way, but it's probably not optimal. This is a common pitfall that developers run into when first using a NoSQL database.
One good example is in one to many relationships. Instead of having many mark documents for a single player document, you can embed the marks as an array inside the player document. The document can store arrays of objects!
Eg.
{
"name": "xxx",
"transferred": false,
"value": n,
"playmaker": false,
"role": "y",
"team": "zzz",
"_type": "player",
"marks": [
"mark": {
"drawgoal": 0,
"goal": 0,
"owngoal": 0,
"enter": 1,
},
"mark": {
"drawgoal": 0,
"goal": 0,
"owngoal": 0,
"enter": 1,
},
"mark": {
"drawgoal": 0,
"goal": 0,
"owngoal": 0,
"enter": 1,
}
]
}
You can do this for team and role as well, but it sounds like that would denormalize things in ways you may not be ready to deal with, and it isn't always a good idea.
Couchbase can index inside the JSON, so you can still use N1QL to query the marks from all players. This also lets you pull a player's document and marks in a single key:value call, which is the fastest kind.
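A sketch of what that can look like, assuming a single bucket named mybucket, the _type field, and the embedded marks array above (index and bucket names are placeholders):
create index ix_player_marks_day on mybucket(distinct array m.day for m in marks end) where _type = "player";
select p.name, m.* from mybucket p unnest p.marks as m where p._type = "player" and m.day = 1;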

Related

Dictionary or fixed-size list for efficient MongoDB storage

I've read about how BSON works as opposed to JSON, but I still couldn't come to a conclusion about which of the following is stored more efficiently in MongoDB:
Ex1:
[
{ "f1": "smth", "f2": 0.8, "f3": [[1,2],[3,4]], "f4": 0 },
{ "f1": "smth", "f2": 0.8, "f3": [[1,2],[3,4]], "f4": 0 },
{ "f1": "smth", "f2": 0.8, "f3": [[1,2],[3,4]], "f4": 0 }
]
Ex2:
[
["smth", "smth", "smth"],
[0.8,0.8,0.8],
[[[1,2],[3,4]],[[1,2],[3,4]],[[1,2],[3,4]]],
[0,0,0]
]
Duplicate values aside, I suspect that because of the repeated dictionary keys in Ex1 (i.e., "f1", "f2", "f3", "f4"), Ex2 would take up less storage space, especially once the number of documents in the DB is in the millions. I do realize, of course, that in Ex2 the meaning of each array position is implicit rather than directly declared (as it is in Ex1 via "f1", etc.).
First, your application needs to work. It doesn't matter how fast it is if it doesn't provide useful functionality. Assuming you are implementing a real project, such projects usually not only have non-trivial requirements, but those requirements also change over time. Optimizing your data model (which is quite difficult to change regardless of the database used) at the expense of making your application totally inflexible is generally going to end with the failure of the project.
You can shorten field names if you want. Mongoid for example provides this functionality out of the box.
"Fixed-size list" is not a meaningful term with respect to MongoDB. All arrays can be of any size and the size is encoded in the array. MongoDB isn't implemented like a relational database with fixed row size if you use certain types.
As prasad said, your second option is probably going to become unusable in a hurry once you start trying to query it in any meaningful way. But if you use MongoDB as a write-only data store and your schema is fixed for the life of the project, then yes, your data will take up less space on disk and will be faster to insert if you omit field names and use arrays.
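If you want to measure rather than guess, the mongo shell can report the BSON size of each shape directly. An illustrative sketch, wrapping one element of each of your examples in a document and comparing with Object.bsonsize:
var ex1 = { docs: [ { f1: "smth", f2: 0.8, f3: [[1,2],[3,4]], f4: 0 } ] };
var ex2 = { docs: [ ["smth"], [0.8], [[[1,2],[3,4]]], [0] ] };
Object.bsonsize(ex1)  // field names are stored again for every element
Object.bsonsize(ex2)  // arrays only store numeric indices ("0", "1", ...) as keys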
On the other hand, if you want an inexpensive bulk data store that is still queryable, try https://docs.mongodb.com/datalake/.

What is the impact (performance wise) on using linq statements like Where, GroupJoin etc on a mobile app in Xamarin Forms

The question might sound a bit vague and misleading, but I will try to explain it.
In Xamarin.Forms, I would like to present a list of products. The data come from an API call that delivers JSON.
The format of the data is as follows: A list of products and a list of sizes for each product. An example is the following:
{
"product": {
"id": 1,
"name": "P1",
"imageUrl": "http://www.image.com"
}
}
{
"sizes": [
{
"productId": 1,
"size": "S",
"price": 10
},
{
"productId": 1,
"size": "M",
"price": 12
}
]
}
It seems to me that I have 2 options:
The first is to deliver the data from the API call in the above format and transform it into the list I want to present by using LINQ's GroupJoin method (hence the title of my question).
The second option is to deliver the finalized list as JSON and just present it in the mobile application without any transformation.
The first option delivers less data but uses a LINQ statement to restructure it, while the second option delivers more data that is already structured in the desired way.
Obviously, delivering less data is preferable (the first option), but my question is: will using LINQ's GroupJoin "kill" the performance of the application?
Just for clarification, the list that will be presented in the mobile application will have 2 items and the items will be the following:
p1-size: s – price 10
p2-size: m – price 12
Thanks
I've had rather complex sets of LINQ statements; I think the most lists I was working with at once was six, with a few thousand items in a couple of those lists and hundreds or fewer in the others, joined and filtered with Where, and the performance impact was negligible. This was in a Xamarin.Forms PCL on Droid/iOS.
(I did manage really bad performance once when I was calling LINQ on a LINQ on a LINQ, rather than calling LINQ on a list; i.e., I had to make sure I called ToList() on a given LINQ statement before using it in another join, which is understandable given LINQ's deferred/lazy execution.)
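For reference, the kind of GroupJoin you describe is small enough that it should not be a problem. A sketch, assuming simple Product/Size classes deserialized from your JSON and System.Linq in scope:
var rows = products
    .GroupJoin(
        sizes,
        p => p.Id,
        s => s.ProductId,
        (p, productSizes) => new { p.Name, Sizes = productSizes.ToList() })
    .ToList();  // materialize once so the query isn't re-evaluated later
For a couple of products and a handful of sizes this is effectively instantaneous; deserializing the JSON will dominate the cost.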

Database schema or framework to support keyword searches

About to add keyword/tags to one of the business objects in our database; let's call the table users. I've considered adding a tags table and a usertags join table, but I can't see an easy way to perform queries that combine AND and OR. For example, I'd like to be able to return all the users that have tags A AND B, as well as query for users with tag A OR B. OR queries are easy, but AND queries are harder.
I've considered even putting all the user records into a json backed database so I could have all the users duplicated like this:
{
user_id:1,
keyword:"A",
keyword:"B"
}
etc.
but I'm not sure how performant a database like MongoDB is when running queries like this.
Yet another option is to have a tags field on the user table, and use REGEX queries. In some ways I like this the best, since it means it's much easier to have ad hoc queries, but I'm worried about performance.
Note that the tag isn't the only field that we need to search by, so ideally we'd have a solution that supports date range searches as well as searches against other fields.
I can only really talk of MongoDB for that matter, so I'll stick to it.
Let's assume a more accurate model like
{
_id: "foo#bar.com",
keywords: [ "A", "B" ],
joined: ISODate("2014-12-28T12:00:00.123Z"),
tags: [ "C", "D" ],
location: { type: "Point", coordinates: [ 38.1200538, -86.9141607 ] },
notes: "Lorem ipsum dolor sic amet."
}
Performance in MongoDB is determined more or less by two factors: whether a field you query is indexed and whether the index is in RAM. In general, MongoDB tries to keep at least all indices in RAM, plus as big a subset of the data as possible. Indexing a field is quite easy. To stick with your first requirement, we index the keywords field:
db.yourCollection.ensureIndex({ keywords: 1})
What happens now is that MongoDB will create a list of keywords and a link to the respective documents. So if you do a query for keyword "A"
db.yourCollection.find({keywords: "A"})
only the documents actually containing the keyword "A" will be read and returned. This is called an index scan. If there wasn't an index on "keywords", MongoDB would have read each and every document in the collection, checking whether the keywords field contained "A" and adding the matching documents to the result set, which is called a collection scan.
Now, checking for a document that has both the "A" and the "B" keyword is rather simple:
db.yourCollection.find({ keywords: { $all: [ "A", "B" ] } })
and the "A OR B" case is just as easy with $or:
db.yourCollection.find({ $or: [ { keywords: "A" }, { keywords: "B" } ] })
Since we have indexed the "keywords" field, the logical check is done in RAM and the respective documents are added to the result set.
As for regex searches, they are absolutely possible and quite fast for indexed fields, as long as the pattern is a case-sensitive prefix expression:
db.yourCollection.find({keywords: /^C/})
will return all documents which contain keywords beginning with the letter "C" using an index scan. (Case-insensitive patterns such as /^c.*/i generally cannot make effective use of the index.)
As for your requirement for doing queries on date ranges:
db.yourCollection.find({joined:
{
$gte: ISODate("2014-12-28T00:00:00.000Z"),
$lt: ISODate("2014-12-29T00:00:00.000Z")
}
})
will return all users who joined on the Dec 28, 2014. Since we haven't created an index on the field yet, a collection scan would have been used. Of course, you can create an index on the "joined" field.
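A minimal sketch of that, in the same ensureIndex style used above:
db.yourCollection.ensureIndex({ joined: 1 })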
So, let's assume you want to find all users with a keyword "A" from Santa Claus, IN:
db.yourCollection.find({
keywords: "A",
location: {
$nearSphere : {
$geometry: {
type : "Point",
coordinates: [ 38.1200538, -86.9141607 ]
},
$minDistance: 0,
$maxDistance: 10000
}
}
})
This will return... Nothing, iirc, since we have to create a geospatial index first:
db.yourCollection.ensureIndex( { location : "2dsphere" } )
Now the mentioned query will work as expected.
Conclusion
Your requirements can be fulfilled by MongoDB with proper indexing and good performance. However, you might want to dig into MongoDB's restrictions.
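For example, a single compound index can support the combined keyword + date-range case from above (a sketch; field order matters when range filters are involved):
db.yourCollection.ensureIndex({ keywords: 1, joined: 1 })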
You might want to read a bit more. Here are my suggestions:
Introduction to MongoDB
Index documentation
Data modelling introduction

Creating Family Tree with Neo4J

I have a set of data for a family tree in Neo4J and am trying to build a Cypher query that produces a JSON data set similar to the following:
{Name: "Bob",
parents: [
{Name: "Roger",
parents: [
Name: "Robert",
Name: "Jessica"
]},
{Name: "Susan",
parents: [
Name: "George",
Name: "Susan"
]}
]}
My graph has a PARENT relationship between MEMBER nodes (i.e. MATCH (p:Member)-[:PARENT]->(c:Member)). I found Nested has_many relationships in cypher and neo4j cypher nested collect, which end up grouping all parents together for the main child node I am searching for.
Adding some clarity based on feedback:
Every member has a unique identifier. The unions are currently all associated with the PARENT relationship. Everything is indexed so that performance will not suffer. When I run a query to just get back the node graph I get the results I expect. I'm trying to return an output that I can use for visualization purposes with D3. Ideally this will be done with a Cypher query as I'm using the API to access neo4j from the frontend being built.
Adding a sample query:
MATCH (p:Person)-[:PARENT*1..5]->(c:Person)
WHERE c.FirstName = 'Bob'
RETURN p.FirstName, c.FirstName
This query returns a list of each parent for five generations, but instead of showing the hierarchy, it's listing 'Bob' as the child for each relationship. Is there a Cypher query that would show each relationship in the data at least? I can format it as I need to from there...
Genealogical data might comply with the GEDCOM standard and include two types of nodes: Person and Union. The Person node has its identifier and the usual demographic facts. The Union nodes have a union_id and the facts about the union. In GEDCOM, Family is a third element bringing these two together, but in Neo4j I found it suitable to also include the union_id in Person nodes. I used 5 relationships: father, mother, husband, wife and child. The family is then two parents with an inward vector and each child with an outward vector. The image illustrates this.
This is very handy for visualizing connections and generating hypotheses. For example, consider the attached picture and my ancestor Edward G Campbell, the product of union 1917, where three brothers married three Vaught sisters from union 8944 and two married Gaither sisters from union 2945. Also note, in the upper left, how Mahala Campbell married her step-brother John Greer Armstrong. Next to Mahala is an Elizabeth Campbell who is connected by marriage to other Campbells, but is likely directly related to them. Similarly, you can hypothesize about Rachael Jacobs in the upper right and how she might relate to the other Jacobs.
I use bulk inserts which can populate ~30,000 Person nodes and ~100,000 relationships in just over a minute. I have a small .NET function that returns the JSON from a DataView; this generic solution works with any DataView, so it is scalable. I'm now working on adding other data, such as locations (lat/long), documentation (particularly documents linking people, such as a census), etc.
You might also have a look at Rik van Bruggen's blog posts on his family data.
Regarding your query
You already create a path pattern here: (p:Person)-[:PARENT*1..5]->(c:Person). You can assign it to a variable, e.g. tree, and then operate on that variable, returning the tree, or nodes(tree), or rels(tree), or operating on those collections in other ways:
MATCH tree = (p:Person)-[:PARENT*1..5]->(c:Person)
WHERE c.FirstName = 'Bob'
RETURN nodes(tree), rels(tree), tree, length(tree),
[n in nodes(tree) | n.FirstName] as names
See also the cypher reference card: http://neo4j.com/docs/stable/cypher-refcard and the online training http://neo4j.com/online-training to learn more about Cypher.
Don't forget to
create index on :Person(FirstName);
I'd suggest building a method to flatten out your data into an array. If the objects don't have UUIDs, you would probably want to give them IDs as you flatten, and then have a parent_id key for each record.
You can then run it as a set of cypher queries (either making multiple requests to the query REST API, or using the batch REST API) or alternatively dump the data to CSV and use cypher's LOAD CSV command to load the objects.
An example cypher command with params would be:
CREATE (:Member {uuid: {uuid}, name: {name}})
And then running through the list again with the parent and child IDs:
MATCH (m1:Member {uuid: {uuid1}}), (m2:Member {uuid: {uuid2}})
CREATE (m1)<-[:PARENT]-(m2)
Make sure to have an index on the ID for members!
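If you go the LOAD CSV route mentioned above, a sketch might look like this (assuming a members.csv with uuid, name and parent_uuid columns; adjust the names to your data):
LOAD CSV WITH HEADERS FROM 'file:///members.csv' AS row
MERGE (m:Member {uuid: row.uuid})
SET m.name = row.name;

LOAD CSV WITH HEADERS FROM 'file:///members.csv' AS row
MATCH (parent:Member {uuid: row.parent_uuid}), (child:Member {uuid: row.uuid})
CREATE (parent)-[:PARENT]->(child);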
The only way I have found thus far to get the data I am looking for is to actually return the relationship information, like so:
MATCH ft = (person {firstName: 'Bob'})<-[:PARENT*1..5]-(p:Person)
RETURN EXTRACT(n in nodes(ft) | {firstName: n.firstName}) as parentage
ORDER BY length(ft);
Which will return a dataset I am then able to morph:
["Bob", "Roger"]
["Bob", "Susan"]
["Bob", "Roger", "Robert"]
["Bob", "Susan", "George"]
["Bob", "Roger", "Jessica"]
["Bob", "Susan", "Susan"]

Improving Grails CreateCriteria query speed with joins

I have a Grails application that does a rather huge createCriteria query pulling from many tables. I noticed that the performance is pretty terrible and have pinpointed it to the Object manipulation I do afterwards, rather than the createCriteria itself. My query successfully gets all of the original objects I wanted, but it is performing a new query for each element when I am manipulating the objects. Here is a simplified version of my controller code:
def hosts = Host.createCriteria().list(max: maxRows, offset: rowOffset) {
// Lots of if statements for filters, etc.
}
def results = hosts?.collect{ [ cell: [
it.hostname,
it.type,
it.status.toString(),
it.env.toString(),
it.supporter.person.toString()
...
]]}
I have many more fields, including calls to methods that perform their own queries to find related objects. So my question is: How can I incorporate joins into the original query so that I am not performing tons of extra queries for each individual row? Currently querying for ~700 rows takes 2 minutes, which is way too long. Any advice would be great! Thanks!
One benefit you get from using criteria is that you can easily fetch associations eagerly. As a result, you would not face the well-known N+1 problem when referencing associations.
You have not mentioned the logic in your criteria, but for ~700 rows I would definitely go for something like this:
def hosts = Host.createCriteria().list(max: maxRows, offset: rowOffset) {
...
//associations are eagerly fetched if a DSL like below
//is used in Criteria query
supporter{
person{
}
}
someOtherAssoc{
//Involve logic if required
//eq('someOtherProperty', someOtherValue)
}
}
If you feel that tailoring a criteria query is cumbersome, then you can very well fall back to HQL and use join fetch to fetch associations eagerly.
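A rough HQL sketch of the same eager fetch (the association names are assumed from the criteria above, and the pagination map mirrors your original call):
def hosts = Host.findAll(
    "select h from Host h " +
    "join fetch h.supporter s " +
    "join fetch s.person " +
    "join fetch h.someOtherAssoc",
    [max: maxRows, offset: rowOffset])
Pagination together with fetch joins is fine here because the fetched associations are to-one; avoid combining it with collection fetches, which Hibernate would have to paginate in memory.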
I would expect this to reduce the turnaround time to less than 5 seconds for ~700 records.