Database schema or framework to support keyword searches

We're about to add keywords/tags to one of the business objects in our database; let's call the table users. I've considered adding a tags table and a usertags table, but I can't see an easy way to perform queries that combine AND and OR. For example, I'd like to be able to return all the users that have tag A AND B, as well as query for users with tag A OR B. OR queries are easy, but AND queries are not.
I've considered even putting all the user records into a json backed database so I could have all the users duplicated like this:
{
user_id:1,
keyword:"A",
keyword:"B"
}
etc.
but I'm not sure how performant a database like MongoDB is when running queries like this.
Yet another option is to have a tags field on the user table, and use REGEX queries. In some ways I like this the best, since it means it's much easier to have ad hoc queries, but I'm worried about performance.
Note that the tag isn't the only field that we need to search by, so ideally we'd have a solution that supports date range searches as well as searches against other fields.

I can only really talk of MongoDB for that matter, so I'll stick to it.
Let's assume a more accurate model like
{
_id: "foo#bar.com",
keywords: [ "A", "B" ],
joined: ISODate("2014-12-28T12:00:00.123Z"),
tags: [ "C", "D" ],
location: { type: "Point", coordinates: [ -86.9141607, 38.1200538 ] },
notes: "Lorem ipsum dolor sit amet."
}
Performance in MongoDB is determined more or less by two factors: whether a field you query is indexed and whether the index is in RAM. In general, MongoDB tries to keep at least all indices in RAM, plus as big a subset of the data as possible. Indexing a field is quite easy. To stick with your first requirement, we index the keywords field (createIndex replaces the older, deprecated ensureIndex):
db.yourCollection.createIndex({ keywords: 1 })
What happens now is that MongoDB will create a list of keywords and a link to the respective documents. So if you do a query for keyword "A"
db.yourCollection.find({keywords: "A"})
only the documents actually containing the keyword "A" will be read and returned. This is called an index scan. If there wasn't an index on "keywords", MongoDB would have to read each and every document in the collection, check whether the keywords field contains "A", and add the matching documents to the result set, which is called a collection scan.
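Now, checking for documents that have both the "A" and the "B" keyword is rather simple with $all:
db.yourCollection.find({ keywords: { $all: [ "A", "B" ] } })
For documents that have either keyword, use $or (or the shorter $in) instead:
db.yourCollection.find({$or: [ {keywords:"A"}, {keywords:"B"} ] })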
Since we have indexed the "keywords" field, the logical check is done in RAM and the respective documents are added to the result set.
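To make the index idea a bit more concrete, here is a minimal in-memory sketch in plain JavaScript. It is not how MongoDB is implemented, only an illustration of the principle: a map from keyword to the set of document ids, with AND as a set intersection and OR as a union. The names buildKeywordIndex, findAll and findAny, and the sample documents, are made up for this example.

```javascript
// Hypothetical in-memory stand-in for an index on an array field.
function buildKeywordIndex(docs) {
  const index = new Map(); // keyword -> Set of document ids
  for (const doc of docs) {
    for (const kw of doc.keywords) {
      if (!index.has(kw)) index.set(kw, new Set());
      index.get(kw).add(doc._id);
    }
  }
  return index;
}

// AND: ids that appear under every requested keyword.
function findAll(index, keywords) {
  const sets = keywords.map((kw) => index.get(kw) || new Set());
  if (sets.length === 0) return [];
  return [...sets[0]].filter((id) => sets.every((s) => s.has(id))).sort();
}

// OR: ids that appear under at least one requested keyword.
function findAny(index, keywords) {
  const result = new Set();
  for (const kw of keywords) {
    for (const id of index.get(kw) || []) result.add(id);
  }
  return [...result].sort();
}

// Invented sample data.
const sampleDocs = [
  { _id: "u1", keywords: ["A", "B"] },
  { _id: "u2", keywords: ["A"] },
  { _id: "u3", keywords: ["B", "C"] },
];
const keywordIndex = buildKeywordIndex(sampleDocs);
console.log(findAll(keywordIndex, ["A", "B"])); // users having A AND B
console.log(findAny(keywordIndex, ["A", "B"])); // users having A OR B
```

Neither lookup touches the documents themselves, only the keyword-to-id map, which is why it matters that the index fits in RAM.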
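As for regex searches, they are absolutely possible, and case-sensitive prefix patterns are quite fast on indexed fields:
db.yourCollection.find({keywords: /^C/})
will return all documents which contain keywords beginning with the letter "C", scanning only the matching range of the index. Adding the i flag (/^C/i) makes the match case insensitive, but MongoDB generally cannot use an index effectively for case-insensitive regular expressions, so expect that variant to be noticeably slower on large collections.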
As for your requirement for doing queries on date ranges:
db.yourCollection.find({joined:
{
$gte: ISODate("2014-12-28T00:00:00.000Z"),
$lt: ISODate("2014-12-29T00:00:00.000Z")
}
})
will return all users who joined on Dec 28, 2014. Since we haven't created an index on the field yet, a collection scan is used. Of course, you can create an index on the "joined" field.
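If you build such filters in application code, a small helper can produce the half-open range for a given day. The following is only a sketch in plain JavaScript; dayRange is a hypothetical helper, and it assumes that "day" means a UTC calendar day.

```javascript
// Build a half-open [start, end) range covering one UTC calendar day.
// dayRange is a hypothetical helper; its result has the same shape as the
// { $gte, $lt } filter used in the query above.
function dayRange(isoDay) {
  const start = new Date(isoDay + "T00:00:00.000Z");
  if (Number.isNaN(start.getTime())) {
    throw new RangeError("expected a date like 2014-12-28");
  }
  // A UTC day is always 24 hours long, so adding milliseconds is safe here.
  const end = new Date(start.getTime() + 24 * 60 * 60 * 1000);
  return { $gte: start, $lt: end };
}

const joinedFilter = { joined: dayRange("2014-12-28") };
console.log(
  joinedFilter.joined.$gte.toISOString(),
  joinedFilter.joined.$lt.toISOString()
);
```

Using $lt with the start of the next day, rather than $lte with 23:59:59, avoids missing documents stamped in the last fraction of a second.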
So, let's assume you want to find all users with a keyword "A" from Santa Claus, IN:
db.yourCollection.find({
keywords: "A",
location: {
$nearSphere : {
$geometry: {
type : "Point",
coordinates: [ -86.9141607, 38.1200538 ]
},
$minDistance: 0,
$maxDistance: 10000
}
}
})
This will not work yet: $nearSphere requires a geospatial index, so we have to create one first:
db.yourCollection.createIndex( { location : "2dsphere" } )
Now the mentioned query will work as expected.
Conclusion
Your requirements can be fulfilled by MongoDB, with good performance given proper indexing. However, you might want to dig into MongoDB's restrictions.
You might want to read a bit more. Here are my suggestions:
Introduction to MongoDB
Index documentation
Data modelling introduction

Related

what is view in couchbase

I am trying to understand what exactly couchbase view is used for, I have gone through some materials in docs, but the 'view' concept does not settle me quite well.
Are views in Couchbase analogous to views in an RDBMS?
https://docs.couchbase.com/server/6.0/learn/views/views-basics.html
A view performs the following on the Couchbase unstructured (or
semi-structured) data:
Extract specific fields and information from the data files.
Produce a view index of the selected information.
How do the view and the index work together here? It seems there is a separate index for the view. So if a document is updated, are both indexes updated?
https://docs.couchbase.com/server/6.0/learn/views/views-store-data.html
In addition, the indexing of data is also affected by the view system
and the settings used when the view is accessed.
Helpful post:
Views in Couchbase

You can think of Couchbase Map/Reduce views as similar to materialized views, yes. Except that you create them with JavaScript functions (a map function and optionally a reduce function).
For example:
function(doc, meta)
{
emit(doc.name, [doc.city]);
}
This will look at every document, and save a view of each document that contains just city, and has a key of name.
For instance, let's suppose you have two documents;
[
key 1 {
"name" : "matt",
"city" : "new york",
"salary" : "100",
"bio" : "lorem ipsum dolor ... "
},
key 2 {
"name" : "emma",
"city" : "columbus",
"salary" : "120",
"bio" : "foo bar baz ... "
}
]
Then, when you 'query' this view, instead of full documents, you'll get:
[
key "matt" {
"city" : "new york"
},
key "emma" {
"city" : "columbus"
}
]
This is a very simple map. You can also use reduce functions like _count, _sum, _stats, or your own custom.
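To see what the map step produces, here is a small local simulation in plain JavaScript. It does not use the Couchbase API: runView is a hypothetical helper, emit is passed in as a third argument (in Couchbase it is provided globally), and countRows only mimics what the built-in _count reduce returns without a group.

```javascript
// Local simulation of a map view; runView and the injected emit are
// hypothetical stand-ins, not the Couchbase API.
function runView(docs, mapFn) {
  const rows = [];
  for (const [id, doc] of Object.entries(docs)) {
    const emit = (key, value) => rows.push({ id, key, value });
    mapFn(doc, { id }, emit);
  }
  // View rows are kept sorted by key.
  return rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

// Mimics the result of the built-in _count reduce over all rows.
function countRows(rows) {
  return rows.length;
}

const viewDocs = {
  "1": { name: "matt", city: "new york", salary: "100" },
  "2": { name: "emma", city: "columbus", salary: "120" },
};

const viewRows = runView(viewDocs, (doc, meta, emit) => {
  emit(doc.name, [doc.city]);
});
console.log(viewRows, countRows(viewRows));
```

Each row carries the emitted key, the emitted value, and the id of the source document, which is why a view query can return far less data than the full documents.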
The results of this view are stored alongside the data on each node (and updated whenever the data is updated). However, you should probably stay away from Couchbase views because:
Views are stored alongside the data on each node. So when reading it, data has to be pulled from every node, combined, and pulled again. "Scatter/gather"
JavaScript map/reduce doesn't give you all the query capabilities you might want. You can't do stuff like 'joins', for instance.
Couchbase has SQL++ (aka N1QL), which is more concise, declarative, and uses global indexes (instead of scatter/gather), so it will likely be faster and reduce strains during rebalance, etc.
Views are deprecated as of Couchbase Server 7.0 (and are not available in Couchbase Capella at all).

Retrieving object from firebase which contains the exact items in its attributes in an efficient way

I am using Firebase and Xamarin Forms to deploy an app. I am trying to figure out how to get an object (or several) matching one criterion. Let's say I have a collection of characters, and each of them has different attributes like name, age and city; the last attribute is an array of strings saying what kind of tools they have.
For example, having these three characters in the collection:
{ 'characters':
{
'char001': {
'name': 'John',
"tools":[ "knife", "MagicX", "laser", "fire" ]
},
'char002': {
'name': 'Albert',
"tools":[ "MagicX" ]
},
'char003': {
'name': 'Chris',
"tools":[ "pistol", "knife", "magicX" ]
}
}
}
I want to retrieve the character(s) who has a knife and magicX, so the query will give me as a result: char001, and char003.
That said, I have a large set of data, like +10.000 characters in the collection plus each character can have up to 10 items in tools.
I can retrieve the objects if the attribute tools were just one string, but with tools as an array I have to iterate through all the items of each character to see which of them have a knife, repeat the same procedure for magicX, and then take the intersection of the two results. In terms of speed, this is very slow.
I would like to do it on the back-end side directly, and just receive the correct data.
How could I perform the query in firebase?
Thank you so much in advance,
Cheers.

In Firebase, this is easy, assuming that characters is a collection...
If that's the case, one way to do it is to structure your "character" documents like so:
'char001': {
name: "John",
tools: {
knife: true,
magicX: true,
laser: true
}
}
This way, you'll be able to perform compound EQUALITY queries and get back all the characters with the tools you're searching for. Something like:
db.collection('characters').where('tools.knife', '==', true).where('tools.magicX', '==', true)
Mind you, you can combine up to 10 equality clauses in a query.
I hope this helps, search for "firestore compound queries" for more info.
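As a rough sketch of the data migration this implies, the following plain JavaScript turns a tools array into the map-of-booleans shape and checks the "has all tools" condition locally. It does not call Firestore; toolsToMap and hasAllTools are hypothetical helpers, and the sample data assumes the tool names have been normalized to one spelling ("magicX"), since field names are case sensitive.

```javascript
// Convert a tools array into the map-of-booleans shape shown above.
function toolsToMap(tools) {
  const map = {};
  for (const tool of tools) map[tool] = true;
  return map;
}

// Local equivalent of chaining equality filters:
// every required tool must be present and set to true.
function hasAllTools(character, required) {
  return required.every((tool) => character.tools[tool] === true);
}

// Sample data modelled on the question, with tool names normalized.
const characters = {
  char001: { name: "John", tools: toolsToMap(["knife", "magicX", "laser", "fire"]) },
  char002: { name: "Albert", tools: toolsToMap(["magicX"]) },
  char003: { name: "Chris", tools: toolsToMap(["pistol", "knife", "magicX"]) },
};

const matchingIds = Object.keys(characters).filter((id) =>
  hasAllTools(characters[id], ["knife", "magicX"])
);
console.log(matchingIds);
```

In the real application the filtering would be done by the chained where clauses on the server; the local check is only there to show which documents the new shape lets you match.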

Couchbase: N1QL JOIN performance issue

I'm getting familiar with Couchbase (I'm getting started with the Server Community Edition), my goal is to migrate our current SQLite database to Couchbase in order to build an efficient real-time synchronization mechanism with mobile devices.
The first steps have been positive so far, we've created buckets (one bucket per SQLite table) and imported all data (one JSON document per SQLite row).
Also, in order to allow complex queries and filtering, we've created indices (both primary and secondary) for all buckets.
To summarize, we have two main buckets:
1) players, which contains documents with the following structure
{
"name": "xxx",
"transferred": false,
"value": n,
"playmaker": false,
"role": "y",
"team": "zzz"
}
2) marks, with the following structure (where the "player" field is a reference to a document ID in the players bucket)
{
"drawgoal": 0,
"goal": 0,
"owngoal": 0,
"enter": 1,
"mpenalty": 0,
"gotgoal": 0,
"ycard": 0,
"assist": 0,
"wingoal": 0,
"mark": 6,
"penalty": 0,
"player": "xxx",
"exit": 0,
"fmark": 6,
"team": "yyy",
"rcard": 0,
"source": "zzz",
"day": 1,
"spenalty": 0
}
So far so good; however, when I try to run complex N1QL queries that require a JOIN, performance is pretty bad compared to SQLite.
For instance, this query takes around 3 seconds to be executed:
select mark.*, player.`role` from players player join marks mark on key mark.player for player where mark.type = "xxx" and mark.day = n order by mark.team asc, player.`role` desc;
We currently have 600 documents in players (disk used = 16MB, RAM used = 12MB) and 20K documents in marks (disk used = 70MB, RAM used = 17MB), which should not be much from my point of view.
Are there any settings I can tune to improve JOIN performance? Any specific index I can create?
Is this performance degradation the price to pay to have more flexibility and more features compared to SQLite?
Should I avoid as much as possible using JOIN in Couchbase and instead duplicate data where needed?
Thanks

I found the answer :)
By changing the query to:
select marks.*, players.`role` from marks join players on keys marks.player where marks.day = n and marks.type = "xxx" order by marks.team asc, players.`role` desc;
execution time drops to less than 300 milliseconds. Apparently, inverting the JOIN (from marks to players) dramatically improves the performance.
The reason why this query is much faster than the other one is that Couchbase evaluates the query as follows:
first retrieves all marks documents matching the filtering conditions
then tries to join them with players documents
By doing so, the number of documents to join is much lower, hence the execution time drops.
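The evaluation order described above can be sketched in plain JavaScript: filter the marks first, then fetch each matching player by key. This is only an illustration of the idea, not what the N1QL engine literally executes; marksWithRole and the sample data are invented for the example.

```javascript
// Players keyed by document id, so the join is a direct key lookup.
const playersByKey = new Map([
  ["p1", { name: "p1", role: "A" }],
  ["p2", { name: "p2", role: "D" }],
]);

const markDocs = [
  { player: "p1", day: 1, type: "xxx", team: "t2", mark: 6 },
  { player: "p2", day: 2, type: "xxx", team: "t1", mark: 7 },
  { player: "p2", day: 1, type: "xxx", team: "t1", mark: 5 },
];

function marksWithRole(marks, players, day, type) {
  return marks
    // 1. Reduce the left side of the join with the filter conditions.
    .filter((m) => m.day === day && m.type === type)
    // 2. Join by key; marks without a matching player are dropped (inner join).
    .flatMap((m) => {
      const p = players.get(m.player);
      return p ? [{ ...m, role: p.role }] : [];
    })
    // 3. Order by team ascending, then role descending.
    .sort((a, b) => {
      if (a.team !== b.team) return a.team < b.team ? -1 : 1;
      if (a.role !== b.role) return a.role < b.role ? 1 : -1;
      return 0;
    });
}

const joined = marksWithRole(markDocs, playersByKey, 1, "xxx");
console.log(joined);
```

Only the marks that survive the filter trigger a player lookup, so the cost of the join grows with the size of the filtered set rather than with the size of either bucket.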

I think you've left some details out, so I'm going to fill in the blanks with my guesses. First, a JSON document can't have a field like "value": n. It needs to be a string like "n" or a number like 1. I have assumed you mean a literal number, so I put 1 in there.
Next, let's look at your query:
select m.*, p.`role`
from players p
join marks m on key m.player for p
where m.type = "xxx"
and m.day = 1
order by m.team asc, p.`role` desc;
Again, you had m.day = n, so I put m.day = 1. This query does not run without an index. I'm going to assume you created a primary index (which will scan the whole bucket, and is not ideal for production):
create primary index on players;
create primary index on marks;
The query still doesn't run, so you must have added an index on the 'player' field in marks:
create index ix_marks_player on marks(player);
The query runs, but returns no results, because your example documents are missing a "type": "xxx" field. So I added that field, and now your query runs.
Look at the Plan Text by just clicking "plan text" (If you were using Enterprise, you would see a visual version of the Plan diagram).
The plan text shows that the query is using a PrimaryScan on the players bucket. Indeed, your query is attempting to join every player document. So as the player bucket grows, the query will get slower.
In your answer here on SO, you say that a different query to get the same data works faster:
select m.*, p.`role`
from marks m
join players p on keys m.player
where m.day = 1
and m.type = "xxx"
order by m.team asc, p.`role` desc;
You swapped the join, but looking at the plan text, you are still running a PrimaryScan. This time it's scanning all the marks documents. I'm assuming you have fewer of those (either fewer total, or since you are filtering on day you have fewer to join).
So my answer is basically: do you always need to join all of the documents?
If so, why? If not, I suggest you modify your query to add a LIMIT/OFFSET (perhaps for paging) or some other filter so you aren't querying for everything.
One more point: it looks like you are using buckets for organizational purposes. This isn't strictly wrong, but it's not really going to scale. Buckets are distributed across the cluster, so you are limited in the number of buckets you can reasonably use (there may even be a hard limit at 10 buckets).
I don't know your use case, but often it's better to use a "type"/"_type"/"docType"/etc value in your documents for organization instead of relying on buckets.

The first steps have been positive so far, we've created buckets (one bucket per SQLite table) and imported all data (one JSON document per SQLite row)
You have a problem here. You have tried to map a SQL database schema to a document database schema without regard for best practices or even dire warnings in Couchbase's documentation.
First, you should be using one bucket. A bucket is more like a database than a table (although it's more complex than that) and Couchbase recommends using a single bucket per cluster unless you have a very good reason not to. It helps with performance, scaling, and resource utilization. Each of your documents should have a field that indicates the type of data. This is what separates your "tables". I use a field named '_type'. Eg. you will have 'player' and 'mark' document types.
Second, you should rethink importing the data as one row per document. Document databases give you different schema options and some are very useful for improving performance. You certainly can keep it this way, but it's probably not optimal. This is a common pitfall that developers run into when first using a NoSQL database.
One good example is in one to many relationships. Instead of having many mark documents for a single player document, you can embed the marks as an array inside the player document. The document can store arrays of objects!
Eg.
{
"name": "xxx",
"transferred": false,
"value": n,
"playmaker": false,
"role": "y",
"team": "zzz",
"_type": "player",
"marks": [
{
"drawgoal": 0,
"goal": 0,
"owngoal": 0,
"enter": 1
},
{
"drawgoal": 0,
"goal": 0,
"owngoal": 0,
"enter": 1
},
{
"drawgoal": 0,
"goal": 0,
"owngoal": 0,
"enter": 1
}
]
}
You can do this for team and role as well, but it sounds like that would denormalize things which you may not be ready to deal with and isn't always a good idea.
Couchbase can index inside the JSON, so you can still use N1QL to query the marks from all players. This also lets you pull a player's document and marks in a single key:value call, which is the fastest kind.
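If you go down this route, the import step has to fold the mark rows into their player documents. A minimal sketch of that transformation in plain JavaScript follows; embedMarks is a hypothetical helper, the sample rows are invented, and it assumes every mark row carries the player's document id in its "player" field, as in the question.

```javascript
// Fold one-row-per-mark data into player documents with an embedded array.
function embedMarks(playersById, markRows) {
  const docs = {};
  for (const [id, player] of Object.entries(playersById)) {
    docs[id] = { ...player, _type: "player", marks: [] };
  }
  for (const row of markRows) {
    // Drop the foreign key: the enclosing player document implies it.
    const { player, ...mark } = row;
    // Rows pointing at an unknown player are skipped in this sketch.
    if (docs[player]) docs[player].marks.push(mark);
  }
  return docs;
}

// Invented sample data in the shape of the two original buckets.
const playerRows = { xxx: { name: "xxx", role: "y", team: "zzz" } };
const markRows = [
  { player: "xxx", day: 1, mark: 6 },
  { player: "xxx", day: 2, mark: 7 },
  { player: "missing", day: 1, mark: 5 },
];

const embedded = embedMarks(playerRows, markRows);
console.log(JSON.stringify(embedded, null, 2));
```

Whether embedding pays off depends on how the marks are read and written; a document that grows with every match day is rewritten as a whole each time, so this is a trade-off rather than a free win.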

Why provide custom reference keys instead of childByAutoId in Firebase

I am just learning Firebase and I would like to know why one would need custom reference keys instead of just using childByAutoId. The examples from docs showed mostly similar to the following:
{
"users": {
"alovelace": {
"name": "Ada Lovelace",
"contacts": { "ghopper": true },
},
"ghopper": { ... },
"eclarke": { ... }
}
}
but why not use something like
{
"users": {
"gFlmT9skBHfxf7vCBCbhmxg6dll1": {
"name": "Ada Lovelace",
"contacts": { "ghopper": true },
},
"gFlmT9skBHfxf7vCBCbhmxg6dll2": { ... },
"gFlmT9skBHfxf7vCBCbhmxg6dll3": { ... }
}
}
Though I would prefer the first example for readability purposes. Aside from that, would there be any impact regarding Firebase features and other development related things like querying, updating, etc? Thanks!

Firebase's childByAutoId method is great for generating the keys in a collection:
Where the items need to be ordered by their insertion time
Where the items don't have a natural key
Where it is not a problem if the same item occurs multiple times
In a collection of users, none of these conditions (usually) apply: the order doesn't matter, users can only appear in the collection once, and the items do have a natural key.
That last one may not be clear from the sample in the documentation. Users stored in the Firebase Database usually come from a different system, often from Firebase Authentication. That system gives users a unique ID, in the case of Firebase Authentication called the UID. This UID is a unique identifier for the user. So if you have a collection of users, using their UID as the key makes it easy to find the users based on their ID. In the documentation samples, just read the keys as if they are (friendly readable versions of) the UID of that user.
In your example, imagine that you've read the node for Ada Lovelace and want to look up her contacts. You'd need to run a query on /users, which gets more and more expensive as you add users. But in the model from the documentation you know precisely what node you need to read: /users/ghopper.
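The difference can be illustrated with a plain JavaScript object standing in for the database tree. This is not the Firebase SDK; contactNames is a hypothetical helper, and the names and contacts other than Ada Lovelace's are invented sample values.

```javascript
// A plain object standing in for the /users node, keyed by the natural key.
const tree = {
  users: {
    alovelace: { name: "Ada Lovelace", contacts: { ghopper: true } },
    ghopper: { name: "Grace Hopper", contacts: { alovelace: true } },
    eclarke: { name: "E. Clarke", contacts: {} },
  },
};

// With natural keys, each contact is a direct path read: /users/<contactKey>.
function contactNames(db, userKey) {
  const user = db.users[userKey];
  if (!user) return [];
  return Object.keys(user.contacts || {})
    .filter((contactKey) => db.users[contactKey] !== undefined)
    .map((contactKey) => db.users[contactKey].name);
}

console.log(contactNames(tree, "alovelace"));
```

With auto-generated keys, the "ghopper" entry in the contacts map would not name a path, so finding that user would take a query over all of /users instead of a single read.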

Filter posts by multiple tags to return posts that have all those tags, with good performance

StackOverflow lets you search for posts by tags, and lets you filter by an intersection of tags, e.g. ruby x mysql x tags. But typically it's inefficient to retrieve such lists from MySQL using multiple joins on the taggings. What's a more performant way to implement filter-by-multiple-tag queries?
Is there a good NoSQL approach to this problem?

In a NoSQL or document-oriented scenario, you'd have the actual tags as part of your document, likely stored as a list. Since you've tagged this question with "couchdb", I'll use that as an example.
A "post" document in CouchDB might look like:
{
"_id": <generated>,
"question": "Question?",
"answers": [... list of answers ...],
"tags": ["mysql", "tagging", "joins", "nosql", "couchdb"]
}
Then, to generate a view keyed by tags:
{
"_id": "_design/tags",
"language": "javascript",
"views": {
"all": {
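"map": "function(doc) { if (doc.tags) { doc.tags.forEach(function(tag) { emit(tag, null); }); } }"
}
}
}
In CouchDB, you can issue an HTTP POST with multiple keys, if you wish. An example is in the documentation. Because the map function emits one row per tag, such a request returns the rows for any of the given tags; to get only the posts that carry all of them, intersect the returned document IDs on the client.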
Note: Setting the value to null, above, helps keep the views small. Use include_docs=true in your query if you want to see the actual documents as well.
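A multi-key view request gives you rows, not an intersection, so the "all of these tags" part has to happen somewhere. Below is a minimal client-side sketch in plain JavaScript. It assumes a view that emits one row per tag, so that each row has the shape { id, key } with the tag as key; idsWithAllTags and the sample rows are invented for the example.

```javascript
// Given view rows of the shape { id, key } (key = a single tag),
// return the ids of documents that matched every requested tag.
function idsWithAllTags(rows, tags) {
  const wanted = new Set(tags);
  const matchedByDoc = new Map(); // doc id -> Set of matched tags
  for (const row of rows) {
    if (!wanted.has(row.key)) continue;
    if (!matchedByDoc.has(row.id)) matchedByDoc.set(row.id, new Set());
    matchedByDoc.get(row.id).add(row.key);
  }
  return [...matchedByDoc]
    .filter(([, matched]) => matched.size === wanted.size)
    .map(([id]) => id)
    .sort();
}

// Invented rows, as a multi-key request might return them.
const tagRows = [
  { id: "post1", key: "mysql" },
  { id: "post1", key: "nosql" },
  { id: "post2", key: "mysql" },
  { id: "post3", key: "nosql" },
  { id: "post3", key: "mysql" },
  { id: "post3", key: "couchdb" },
];

console.log(idsWithAllTags(tagRows, ["mysql", "nosql"]));
```

The work is proportional to the number of rows returned for the requested tags, so very popular tags make this approach more expensive.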
