"space as AND" text search with couchbase - couchbase

With Couchbase, I would like to retrieve documents by matching words in their titles.
The user will enter a string, and the spaces will be treated as logical ANDs.
Let's say I have these documents:
{title : "My blue car is wonderful", ...}
{title : "the sky is blue", ... }
{title : "mais ou est donc or ni car", ...}
{title : "president's car is blue", ...}
If the user enters "car blue" in the web interface, I would like to find:
{title : "My blue car is wonderful", ...}
{title : "president's car is blue", ...}
How can I do that with Couchbase?

Unfortunately, Couchbase itself is not well suited to free-text search. Luckily for you, it has a native plugin for integration with Elasticsearch, which is great for free-text search. The Couchbase transport plugin replicates all data from your cluster to an Elasticsearch cluster in near real time; you can then use Elasticsearch's full-text search abilities to provide this sort of functionality.
To get started with the Couchbase transport plugin go here:
http://www.couchbase.com/couchbase-server/connectors/elasticsearch
More in depth article on setting up replication and configuration:
http://docs.couchbase.com/couchbase-elastic-search/#indexing-and-querying-data
Here is the link to ES documentation on text querying:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/query-dsl-text-query.html
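Once the data has been replicated, the "space as AND" behaviour maps naturally onto an Elasticsearch match query with the and operator. A minimal sketch of the request body, assuming the title ends up indexed in a field called title (depending on how the transport plugin maps your documents, the field may be nested, e.g. doc.title):

{
  "query" : {
    "match" : {
      "title" : {
        "query" : "car blue",
        "operator" : "and"
      }
    }
  }
}

With the operator set to and, every term the user typed must be present, so "car blue" matches only the documents containing both "car" and "blue".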

Related

What is a view in Couchbase?

I am trying to understand what exactly a Couchbase view is used for. I have gone through some material in the docs, but the 'view' concept has not quite settled for me.
Are views in Couchbase analogous to views in an RDBMS?
https://docs.couchbase.com/server/6.0/learn/views/views-basics.html
A view performs the following on the Couchbase unstructured (or semi-structured) data:
Extract specific fields and information from the data files.
Produce a view index of the selected information.
How do views and indexes work here? There seems to be a separate index per view, so if a document is updated, are both indexes updated?
https://docs.couchbase.com/server/6.0/learn/views/views-store-data.html
In addition, the indexing of data is also affected by the view system and the settings used when the view is accessed.
Helpful post:
Views in Couchbase
You can think of Couchbase Map/Reduce views as similar to materialized views, yes. Except that you create them with JavaScript functions (a map function and optionally a reduce function).
For example:
function (doc, meta) {
  // key: the document's name; value: an array containing just its city
  emit(doc.name, [doc.city]);
}
This will look at every document, and save a view of each document that contains just city, and has a key of name.
For instance, let's suppose you have two documents:
[
  key 1 {
    "name" : "matt",
    "city" : "new york",
    "salary" : "100",
    "bio" : "lorem ipsum dolor ... "
  },
  key 2 {
    "name" : "emma",
    "city" : "columbus",
    "salary" : "120",
    "bio" : "foo bar baz ... "
  }
]
Then, when you 'query' this view, instead of full documents, you'll get:
[
  key "matt" {
    "city" : "new york"
  },
  key "emma" {
    "city" : "columbus"
  }
]
This is a very simple map. You can also use reduce functions like _count, _sum, _stats, or your own custom reduce function.
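For instance, here is a rough sketch of a view that counts documents per city using the built-in _count reduce (the field name is just illustrative):

Map function (emits one row per document, keyed by city):
function (doc, meta) {
  emit(doc.city, null);
}
Reduce function: the built-in _count

Querying that view with group=true would then return one row per city with the number of documents in it.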
The results of this view are stored alongside the data on each node (and updated whenever the data is updated). However, you should probably stay away from Couchbase views because:
Views are stored alongside the data on each node. So when reading a view, the query has to be sent to every node and the partial results combined ("scatter/gather").
JavaScript map/reduce doesn't give you all the query capabilities you might want. You can't do things like joins, for instance.
Couchbase has SQL++ (aka N1QL), which is more concise, declarative, and uses global indexes (instead of scatter/gather), so it will likely be faster and put less strain on the cluster during rebalances (see the sketch after this list).
Views are deprecated as of Couchbase Server 7.0 (and not available in Couchbase Capella at all).
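For comparison, a rough SQL++ (N1QL) sketch that returns the same name/city pairs as the map function above; the bucket name users and the index name are assumptions:

CREATE INDEX idx_name ON users(name);

SELECT u.name, u.city
FROM users u
WHERE u.name IS NOT MISSING;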

Will MongoDB overwrite my custom id attributes of my JSON when inserting it into a Collection?

I'm struggling to understand MongoDB's handling of ids. Right now, I have a JSON file which I would like to put into a MongoDB database. Roughly, the file looks like this:
{
  id: 'HARRYPOTTER-1',
  title: 'Harry Potter and the Philosophers Stone',
  price: 10
}
I would now like to put this file into MongoDB. Will my id attribute get lost? Will MongoDB want to overwrite it with its own unique id?
I have made sure that my id attributes are unique and I am making use of them elsewhere, so I am a little worried now. But maybe I understood things incorrectly.
Thanks a lot in advance!
1. MongoDB adds an _id field to any document that doesn't have one.
2. If _id is already present, MongoDB won't overwrite it; it keeps your value (and throws a duplicate-key error if that _id already exists in the collection).
3. If the document has an id field (without the underscore), MongoDB doesn't care: it won't modify it, and rules 1 and 2 apply as usual.
Let's run an example in the mongo shell:
> db.random.insert({
... id: 'HARRYPOTTER-1',
... title: 'Harry Potter and the Philosophers Stone',
... price: 10
... })
WriteResult({ "nInserted" : 1 })
And now inspect the inserted document
> db.random.findOne()
{
"_id" : ObjectId("5f954cc93b09d63a06f7a4a9"),
"id" : "HARRYPOTTER-1",
"title" : "Harry Potter and the Philosophers Stone",
"price" : 10
}
You can see that an _id has been created. Your id field doesn't matter here and is not overwritten.
PS: The right tool for putting that JSON file into a MongoDB database is mongoimport (not mongorestore).
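A minimal sketch of such an import; the database, collection, and file names are assumptions:

mongoimport --db bookstore --collection books --file books.json

If the file contains a JSON array of documents rather than one document per line, add the --jsonArray flag.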
For more details refer to the docs.

Solr facet doesn't segment text

I am a beginner with Solr. I pushed books.json into Solr; a document looks like
{
  "id" : "978-0641723445",
  "cat" : ["book","hardcover"],
  "name" : "The Lightning Thief",
  "author" : "Rick Riordan",
  "series_t" : "Percy Jackson and the Olympians",
  "sequence_i" : 1,
  "genre_s" : "fantasy",
  "inStock" : true,
  "price" : 12.50,
  "pages_i" : 384
}
Then I changed the schema for "name" to
<field name="name" type="text_general"/>
with everything else unchanged. The Analysis screen in Solr shows correct segmentation. However, when I run the query http://localhost:8983/solr/testCore/select?facet.field=name&facet=on&indent=on&q=*:*&wt=json
the output is not segmented:
"facet_counts":{
"facet_queries":{},
"facet_fields":{
"name":[
"Lucene in Action, Second Edition",1,
"Sophie's World : The Greek Philosophers",1,
"The Lightning Thief",1,
"The Sea of Monsters",1]},
"facet_ranges":{},
"facet_intervals":{},
"facet_heatmaps":{}}}
Can anyone explain why?
After changing the definition of a field (unless you're only changing the "query" part of an analysis chain that has separate query and index chains), you'll have to reindex your content.
Since the facet module works on the actual tokens generated in the index, you have to clean out the old index and reindex all your content, so that each value is processed again and divided into tokens matching the behaviour you're looking for.
If all your documents are still present when you're reindexing (so all the old ids are still there), you don't have to clean out the index first, since all the old tokens will be overwritten. But to be sure you can delete everything first, then reindex your content and see the new tokens.
You can also do this while in production as long as a commit doesn't happen in between; first issue the delete, then reindex and then commit. Until the commit happens all the old data will still be available (but be aware that other threads or other code can issue a commit while you're working on the index, so be sure you're the only one issuing commits first).
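A rough sketch of that delete / reindex / commit sequence against the core from the question, using Solr's JSON update endpoint (the file name books.json comes from the question; adjust the paths and document format to your setup):

# Issue the delete, but don't commit yet
curl 'http://localhost:8983/solr/testCore/update' -H 'Content-Type: application/json' -d '{"delete": {"query": "*:*"}}'

# Reindex the documents (assuming books.json is a JSON array of documents)
curl 'http://localhost:8983/solr/testCore/update' -H 'Content-Type: application/json' --data-binary @books.json

# Commit once everything is back in, so searchers never see the emptied index
curl 'http://localhost:8983/solr/testCore/update' -H 'Content-Type: application/json' -d '{"commit": {}}'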

Best database for my type of application? MySQL? MongoDB? PostgreSQL? CouchDB? [closed]

Closed. This question is opinion-based and is not currently accepting answers. It was closed 7 years ago.
I am currently writing an application where I have to store a huge amount of data. My application is written in Node.js, and I'm using the cluster and async modules to make use of my whole system.
Here are some properties of my application and the environment I am using:
Workstation:
CPU: 6 Cores at 3.5 GHz
RAM: 16 GB
Nodejs: latest version
Current Database: MySQL
OS: Windows 10
Application:
Currently uses 6 workers, each taking 0.1% CPU and 80 MB of RAM
Gets data for the database via RPC calls in JSON format
Data:
Blocks (currently ~376,000), increasing by one roughly every 10 minutes. Example data for one block:
{
"hash" : "000000000fe549a89848c76070d4132872cfb6efe5315d01d7ef77e4900f2d39",
"confirmations" : 88029,
"size" : 189,
"height" : 227252,
"version" : 2,
"merkleroot" : "c738fb8e22750b6d3511ed0049a96558b0bc57046f3f77771ec825b22d6a6f4a",
"tx" : [
"c738fb8e22750b6d3511ed0049a96558b0bc57046f3f77771ec825b22d6a6f4a"
],
"time" : 1398824312,
"nonce" : 1883462912,
"bits" : "1d00ffff",
"difficulty" : 1.00000000,
"chainwork" : "000000000000000000000000000000000000000000000000083ada4a4009841a",
"previousblockhash" : "00000000c7f4990e6ebf71ad7e21a47131dfeb22c759505b3998d7a814c011df",
"nextblockhash" : "00000000afe1928529ac766f1237657819a11cfcc8ca6d67f119e868ed5b6188"
}
Transactions (currently ~84,850,717), increasing by ~1.3 transactions every second. Example data for one transaction:
{
"hex" : "0100000001268a9ad7bfb21d3c086f0ff28f73a064964aa069ebb69a9e437da85c7e55c7d7000000006b483045022100ee69171016b7dd218491faf6e13f53d40d64f4b40123a2de52560feb95de63b902206f23a0919471eaa1e45a0982ed288d374397d30dff541b2dd45a4c3d0041acc0012103a7c1fd1fdec50e1cf3f0cc8cb4378cd8e9a2cee8ca9b3118f3db16cbbcf8f326ffffffff0350ac6002000000001976a91456847befbd2360df0e35b4e3b77bae48585ae06888ac80969800000000001976a9142b14950b8d31620c6cc923c5408a701b1ec0a02088ac002d3101000000001976a9140dfc8bafc8419853b34d5e072ad37d1a5159f58488ac00000000",
"txid" : "ef7c0cbf6ba5af68d2ea239bba709b26ff7b0b669839a63bb01c2cb8e8de481e",
"version" : 1,
"locktime" : 0,
"vin" : [
{
"txid" : "d7c7557e5ca87d439e9ab6eb69a04a9664a0738ff20f6f083c1db2bfd79a8a26",
"vout" : 0,
"scriptSig" : {
"asm" : "3045022100ee69171016b7dd218491faf6e13f53d40d64f4b40123a2de52560feb95de63b902206f23a0919471eaa1e45a0982ed288d374397d30dff541b2dd45a4c3d0041acc001 03a7c1fd1fdec50e1cf3f0cc8cb4378cd8e9a2cee8ca9b3118f3db16cbbcf8f326",
"hex" : "483045022100ee69171016b7dd218491faf6e13f53d40d64f4b40123a2de52560feb95de63b902206f23a0919471eaa1e45a0982ed288d374397d30dff541b2dd45a4c3d0041acc0012103a7c1fd1fdec50e1cf3f0cc8cb4378cd8e9a2cee8ca9b3118f3db16cbbcf8f326"
},
"sequence" : 4294967295
}
],
"vout" : [
{
"value" : 0.39890000,
"n" : 0,
"scriptPubKey" : {
"asm" : "OP_DUP OP_HASH160 56847befbd2360df0e35b4e3b77bae48585ae068 OP_EQUALVERIFY OP_CHECKSIG",
"hex" : "76a91456847befbd2360df0e35b4e3b77bae48585ae06888ac",
"reqSigs" : 1,
"type" : "pubkeyhash",
"addresses" : [
"moQR7i8XM4rSGoNwEsw3h4YEuduuP6mxw7"
]
}
},
{
"value" : 0.10000000,
"n" : 1,
"scriptPubKey" : {
"asm" : "OP_DUP OP_HASH160 2b14950b8d31620c6cc923c5408a701b1ec0a020 OP_EQUALVERIFY OP_CHECKSIG",
"hex" : "76a9142b14950b8d31620c6cc923c5408a701b1ec0a02088ac",
"reqSigs" : 1,
"type" : "pubkeyhash",
"addresses" : [
"mjSk1Ny9spzU2fouzYgLqGUD8U41iR35QN"
]
}
},
{
"value" : 0.20000000,
"n" : 2,
"scriptPubKey" : {
"asm" : "OP_DUP OP_HASH160 0dfc8bafc8419853b34d5e072ad37d1a5159f584 OP_EQUALVERIFY OP_CHECKSIG",
"hex" : "76a9140dfc8bafc8419853b34d5e072ad37d1a5159f58488ac",
"reqSigs" : 1,
"type" : "pubkeyhash",
"addresses" : [
"mgnucj8nYqdrPFh2JfZSB1NmUThUGnmsqe"
]
}
}
],
"blockhash" : "00000000103e0091b7d27e5dc744a305108f0c752be249893c749e19c1c82317",
"confirmations" : 88192,
"time" : 1398734825,
"blocktime" : 1398734825
}
Problem:
The MySQL database is pushing the CPU to 100% while using only 500 MB of RAM. My bottleneck is currently the MySQL database, which cannot keep up with the speed and volume of data from my application and takes a lot of CPU power.
What I am looking for:
A database that can handle my application even when I increase the number of workers
It should be easy to retrieve the information and to select data which has dependencies (blocks are linked to transactions through the tx <--> txid values)
Should be able to hold even more data in the future, given the steady increase of data
Needs to be accessible by multiple workers at the same time
Bonus: some sort of notification (channel) to my application when data changes
I hope someone can suggest a database that is suitable for my type of project, and perhaps give a rough guess at the storage that will be needed.
You can also suggest a database which I haven't mentioned in the title.
Relational databases are useful when you have, well, lots of relations between things, and particularly when you’ll want to traverse through those relations while querying. For example, you could have a bunch of customers, each having a number of orders, which are all from suppliers, which are in locations; you might want to query for all customers who have at least five orders from suppliers in a particular location. Or maybe you want to know the total number of orders from suppliers, grouped by location. Relational databases are excellent at this.
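For instance, the kind of query a relational database shines at might look roughly like this (the table and column names are made up for illustration):

-- Customers with at least five orders from suppliers in a given location
SELECT c.id, c.name, COUNT(*) AS order_count
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN suppliers s ON s.id = o.supplier_id
WHERE s.location = 'Berlin'
GROUP BY c.id, c.name
HAVING COUNT(*) >= 5;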
Your data does have relations, yes. However, it sounds like you aren’t planning on trying to traverse through them or aggregate them very much, and your data, once stored, will seldom if ever change. That sounds to me like a document store would better suit you.
Out of the databases you list, MongoDB and Redis could be considered document stores. You said you had only 512 MB of RAM; that kind of disqualifies Redis, which loves to store all of its data in RAM, with throwing it onto disk as an afterthought. I’m not sure what balance MongoDB tries to strike, but I believe that while it uses RAM somewhat liberally, it also does try to get it to disk, eventually. (Some people poke fun at it, saying that it doesn’t try very hard at durability. It looks like you’re storing data that’s publicly available, so that shouldn’t be too much of a problem—if you lose some most-recently-written data, you can just repopulate it from a public source.)
In the comments, you pointed out that you were very commonly going to be querying for all transactions in a block. MongoDB should be able to deal with that use-case with ease. The only thing you’ll need to make sure of is that you create an index on the block-ID column (field? I’m not sure what MongoDB calls them), which should allow for that kind of query to be efficiently executed.
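A minimal sketch in the mongo shell, assuming the transactions live in a transactions collection and the linking field is called blockhash (both names are assumptions):

// Index the field that links a transaction to its block
db.transactions.createIndex({ blockhash: 1 })

// An "all transactions in this block" query can then use the index
db.transactions.find({ blockhash: "00000000103e0091b7d27e5dc744a305108f0c752be249893c749e19c1c82317" })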

Google autocomplete filtered by business type

I am trying to use Google Maps API to give me an autocomplete to find nearby bars by typing in the bar you are looking for. This doesn't seem to fall into the user stories of the API. I am having problems figuring out what combination of tricks I need to use to accomplish this.
The autocomplete function does not have granular enough place types (establishment, geocode) to filter only for bars so my predictions are full of gas stations, law offices and graveyards.
The nearbysearch is granular enough to filter type=bar but cannot be used as an autocomplete because the name and keyword parameters are exact matches. So when I search for "craw" I get ZERO_RESULTS not "Crawdaddy's".
Next I thought I would get the predictions, do a radar search of the same location applying the type=bar filter, and only keep predictions that appear in the radar results, matched by reference number. No go: either the radar search uses a different reference number than autocomplete, or it is not returning the nearest type=bar results to my location.
So ultimately, am I on a fool's errand here, or is there some way to implement an autocomplete that is filtered by business type?
Thanks,
Tal
I'm sorry to be the bearer of bad news. I'll go through your three possible ideas:
The autocomplete will match on keywords. Lawyers usually call themselves barristers, so you've got the wildcard screwing you up on that one. However, there is an optional types parameter that will allow you to filter by type there, if that is any help.
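For reference, a sketch of an autocomplete request restricted by type against the Place Autocomplete web service (the input, location, and key values are only illustrative); note that this parameter only accepts broad collections such as establishment or geocode, which is exactly the granularity problem you describe:

https://maps.googleapis.com/maps/api/place/autocomplete/json?input=craw&types=establishment&location=50,2&radius=20000&sensor=true&key=YOUR_API_KEY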
I am actually having the opposite results on this one. Using the following parameters:
query=etoi
types=bar
location=50,2
sensor=true
radius=20000
Yields as first result:
{
  "geometry" : {
    "location" : {
      "lat" : 50.1089160,
      "lng" : 1.831840
    }
  },
  "icon" : "http://maps.gstatic.com/mapfiles/place_api/icons/bar-71.png",
  "id" : "3ec8233a732b4ee70dbb03034e2e20f84517763d",
  "name" : "L'Etoile du Jour",
  "rating" : 4.30,
  "reference" : "CnRuAAAAgOI6rLfSjaWyY_MRdK8zybHJOmAoqBLEtAgIxaZN5_UAS7WbWSYBukIro9ZCuiXSa9_HCOeHUmKPKkS6j9lxrET8cRX089azCKfvbR-lMFmzUb3Sd2VoWr02yPGJhXDBT7TjMpPPiuTWsZCY0Mcy9xIQHiE-o5v_EURALkxNElUPnRoUXcuht7Ov6k64DT1eA8-t9NR6-O8",
  "types" : [ "bar", "restaurant", "food", "establishment" ],
  "vicinity" : "2 Chaussée Marcadé, Abbeville"
},
So I am guessing that there is more at play, but you'll have to ask Google on this one or provide your exact search params (excluding key) for me to look into.
This is a terrible idea, as radarSearch counts 5x as many queries against your credit balance as any other API call, and it does not return IDs, so you cannot cross-reference. This is most likely a dead end.