Couchbase: How to Delete Multiple Documents by Using a Document ID Pattern

I am new to the Couchbase database.
I have populated a bucket with 10,000 documents and I want to remove them by document ID pattern using a N1QL DELETE query. For example, the keys look like :ao.sl3:eid:89049032000001000000000016677381, so I want to use a pattern such as ':ao.sl3:eid%' to delete all the documents.
I want to use the Couchbase Web UI query editor to delete the documents.
Thanks

You will need a primary index to do this, but you can simply use the LIKE operator:
DELETE
FROM mybucketname
WHERE META().id LIKE ':ao.sl3:eid%'
Some things to keep in mind:
A primary index is a very dangerous thing to have in production. If this is a one-time thing, make sure to remove that primary index after your DELETE has run.
If this is something you will be doing in production on a regular basis, you may want to consider an alternate approach. Depending on your use case, you may want to look at TTL, N1QL paging, eventing, or an alternate data model that can use a more efficient index.
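For the one-time case, the full sequence might look like this sketch (mybucketname stands in for your bucket name):
CREATE PRIMARY INDEX ON mybucketname;
DELETE FROM mybucketname WHERE META().id LIKE ':ao.sl3:eid%';
-- remove the index once the cleanup is done
DROP PRIMARY INDEX ON mybucketname;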

It has been a few years, but I am just following up on any loose ends. As Matthew Groves indicated, you could use the Couchbase Eventing service.
The function to do what you want is quite trivial: four lines (eight if you count comments).
// example the keys are like :ao.sl3:eid:89049032000001000000000016677381
// delete any key starting with a pattern like ":ao.sl3:eid:"
// make a binding of a bucket alias called src_bkt in read+write mode
// then deploy with a Feed Boundary of Everything.
function OnUpdate(doc, meta) {
    if (!meta.id.startsWith(':ao.sl3:eid:')) return;
    delete src_bkt[meta.id];
}
For more information on Couchbase Eventing, refer to eventing-overview and eventing-examples.

Related

Solr indexing structure with MySQL

I have three to five search fields in my application and am planning to integrate this with Apache Solr. I tried to do the same with a single table and it is working fine. Here are my questions.
Can we index multiple tables in the same core? Or should I create a separate core for each index (I guess this concept is wrong)?
Suppose I have four tables: users, careers, education, and location. I have two search boxes on a PHP page, where one is to search for simple locations (just like an autocomplete box) and another is to search for a keyword across the careers and education tables. If multiple indexes are possible under a single core:
2.1 How do we define the query here?
2.2 Can we specify the index name in the query (like a table name in MySQL)?
Links which can answer my concerns are enough.
If you're expecting to query the same data as part of the same request, such as auto-completing users, educations and locations at the same time, indexing them to the same core is probably what you want.
The term "core" is probably identical to the term "index" in your usage, and having multiple sets of data in the same index will usually be achieved through having a field that indicates the type of document (and then applying a filter query if you want to get documents of only one type, such as fq=type:location. You can use the grouping feature of Solr to get separate result sets of documents back for each query as well.
If you're only ever going to query the data separately, having them in separate indexes is probably the way to go, as you'll be able to scale and perform analysis and tuning independently for each index in that case (and avoid always having to add a filter query to get the type of content you're looking for).
Specifying the index name is the same as specifying the core, and is part of the URL to Solr: http://localhost:8983/solr/index1/ or http://localhost:8983/solr/index2/.

MySQL --> MongoDB: Keep IDs or create mapping?

We are going to migrate our database from MySQL to MongoDB.
Some URLs pointing at our web application use database IDs (e.g. http://example.com/post/5)
At the moment I see two possibilities:
1) Keep existing MySQL IDs and use them as MongoDB IDs. IDs of new documents will get new MongoDB ObjectIDs.
2) Generate new MongoDB ObjectIDs for all documents and create a mapping with MySQLId --> MongoDBId for all external links with old IDs in it.
#2 will mess up my PHP app a little, but I could imagine that #1 will cause problems with indexes or sharding.
What is the best practice here to avoid problems?
1) Keep existing MySQL IDs and use them as MongoDB IDs. IDs of new documents will get new MongoDB ObjectIDs.
ObjectIDs are very useful when you don't have (or want) a natural primary key for your documents, but mixing ObjectIDs and numerical IDs as primary keys can only cause you problems later on with queries. I would suggest a different route: keep existing MySQL IDs and use them as MongoDB IDs, and create new documents with numerical IDs, as you would do for MySQL. This way you don't have to mix data types in one field.
2) Generate new MongoDB ObjectIDs for all documents and create a mapping with MySQLId --> MongoDBId for all external links with old IDs in it.
This can also work, but, as you said, you need to map your new and old IDs. This is extra work which you can avoid if you leave your IDs unchanged.
I could imagine that #1 will cause problems with indexes or sharding?
ObjectIDs and MySQL AUTO_INCREMENT IDs are both monotonically increasing, so there wouldn't be much difference if they are used as shard keys (you will probably use hashed shard keys in that case; you can read more details here).
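For reference, enabling a hashed shard key from the mongo shell would look something like this (the database and collection names are made up):
sh.enableSharding("mydb")
sh.shardCollection("mydb.posts", { _id: "hashed" })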
Edit
Which problems could occur when mixing ObjectIDs and numeric IDs?
If you're doing simple equality checks (i.e., get a doc with {_id: 5} or {_id: ObjectId("53aeb2dcb9f8955d1a927b97")}) you will have no problems. However, range queries will be more complicated:
As an example:
db.coll.find({_id : { $gt : 5}})
This query will return only documents with numerical IDs.
This query:
db.coll.find({_id : { $gt : ObjectId("53aeb2dcb9f8955d1a927b97")}});
will return only documents that have ObjectIds.
Obviously, you can use $or to find either, but my point is that your queries won't be as straightforward as with non-mixed IDs.
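For completeness, the $or form covering both ID types would look something like:
db.coll.find({ $or: [ { _id: { $gt: 5 } }, { _id: { $gt: ObjectId("53aeb2dcb9f8955d1a927b97") } } ] });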

How to check if a node is already indexed in the neo4j-spatial index?

I'm running the latest Neo4j v2, with the spatial plugin installed. I have managed to index almost all of the nodes I need indexed in the geo index. One of the problems I'm struggling with is how I can easily check whether a node has already been indexed.
I can't find any REST endpoint to get this information, and it is not easy to get to it with Cypher. I tried the following query, which seems to give me the result I want, except that the runtime is unacceptable.
MATCH (a)-[:RTREE_REFERENCE]->(b) where b.id=989898 return b;
As the geo index only stores a reference to the indexed node (as an id property on a node referenced by the RTREE_REFERENCE relationship), I figured this could be the way to go.
This query currently takes 14459 ms when run from the neo4j-shell.
My database is not big: about 41,000 nodes in total that I want to add to the spatial index.
There must be a better way to do this. Any idea and or pointer would be greatly appreciated.
Since you know the ID of your data node, you can access it directly in Cypher without an index, and just check for the incoming RTREE_REFERENCE relationship:
START n=node(989898) MATCH (p)-[r:RTREE_REFERENCE]->(n) RETURN r;
As a side note, your Cypher had the syntax 'WHERE b.id=989898', but if this is an internal node ID, then that will not work, since b.id will look for a property with key 'id'. For the internal node ID, use 'id(n)'.
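In MATCH form, the same internal-ID lookup would be:
MATCH (p)-[r:RTREE_REFERENCE]->(n) WHERE id(n) = 989898 RETURN r;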
If your 'id' is actually a node property (and not its internal ID), then I think #deemeetree's suggestion below is better, using an index over this property.
Right now your request seems to be scouring through all the nodes in the network that are related via :RTREE_REFERENCE and checking the id property for each of them.
Why don't you instead start your search from the node ID you need and then get the paths from there?
I also don't quite understand why you need to return the node that you're defining, but anyway.
As you're running Neo4j, I recommend adding labels to your nodes (all of them in the example below):
START n=node(*) SET n:YOUR_LABEL_NAME
Then create an index on the labeled nodes by the id property:
CREATE INDEX ON :YOUR_LABEL_NAME(id)
Once you've done that, run a query like this:
MATCH (b:YOUR_LABEL_NAME {id: "989898"}), (a)-[:RTREE_REFERENCE]->(b) RETURN a, b;
That should increase the speed of your query.
Let me know if that works and please explain why you were querying b in your original question if you already knew it...

Derive a new Sphinx index from an existing Sphinx index as the data source?

I'm using Sphinx to index a large MySQL data table of products with a daily cron job. When a new products index is created, I would also like to create an index of merchants with the top n products, using Sphinx's multi-valued attribute (MVA). It's a relatively simple grouping operation. Is there a way to instruct Sphinx to use its own index (the product index mentioned above) to create another index (the merchants index)?
Not directly, but it might sort of be possible. Frankly, it's probably more hassle than just creating the second index directly from MySQL.
Sphinx doesn't really store the original text when building an index, so you would need to duplicate all the required columns as attributes so that the data is stored. You could then build a second index by running a SphinxQL query (in sql_query).
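A rough sketch of what that source could look like in sphinx.conf, pointing the "MySQL" connection at searchd's SphinxQL listener instead of MySQL itself (the port, index, and column names are assumptions):
source merchants_src
{
    type      = mysql
    sql_host  = localhost
    sql_port  = 9306    # searchd's SphinxQL port rather than MySQL's 3306
    sql_user  = root
    sql_pass  =
    sql_db    = unused
    sql_query = SELECT id, merchant_id FROM products_index LIMIT 100000
}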
You might also run into issues with max_matches, unless you get creative with ranged queries.
So, as it's not a built-in feature, it will be hard to make it work well.

Best primary key for storing URLs

Which is the best primary key to store website addresses and page URLs?
To avoid using an auto-increment ID (which is not really tied to the data), I designed the schema to use a SHA1 hash of the URL as the primary key.
This approach is useful in many ways: for example, I don't need to read the last ID from the database, so I can prepare all table updates by calculating the key and do the real update in a single transaction, with no constraint violations.
Anyway, I read two books that tell me I am wrong. In "High Performance MySQL" it is said that a random key is not good for the DB optimizer. Moreover, in each of Joe Celko's books, he says the primary key should be some part of the data.
The problem is that the natural keys for URLs are... the URLs themselves. And while a site address can be short (www.something.com), there is no imposed limit on URL length (see http://www.boutell.com/newfaq/misc/urllength.html).
Consider that I have to store (and work with) some millions of them.
Which is the best key, then? Auto-increment IDs, URLs, or hashes of URLs?
You'll want an autoincrement numeric primary key. For the times when you need to pass ids around or join against other tables (for example, optional attributes for a URL), you'll want something small and numeric.
As for what other columns and indexes you want, it depends, as always, on how you're going to use them.
A column storing a hash of each URL is an excellent idea for almost any application that uses a significant number of URLs. It makes SELECTing a URL by its full text about as fast as it's going to get. A second advantage is that if you make that column UNIQUE, you don't need to worry about making the column storing the actual URL unique, and you can use REPLACE INTO and INSERT IGNORE as simple, fast atomic write operations.
I would add that using MySQL's built-in MD5() function is just fine for this purpose. Its only disadvantage is that a dedicated attacker can force collisions, which I'm quite sure you don't care about. Using the built-in function makes, for example, some types of joins much easier. It can be a tiny bit slower to pass a full URL across the wire ("SELECT url FROM urls WHERE hash=MD5('verylongurl')" instead of "WHERE hash='32charhexstring'"), but you'll have the option to do that if you want. Unless you can come up with a concrete scenario where MD5() will let you down, feel free to use it.
The hard question is whether and how you're going to need to look up URLs in ways other than their full text: for example, will you want to find all URLs starting with "/foo" on any "bar.com" host? While "LIKE '%bar.com%/foo%'" will work in testing, it will fail miserably at scale. If your needs include things like that, you can come up with creative ways to generate non-UNIQUE indexes targeted at the type of data you need... maybe a domain_name column, for starters. You'll have to populate those columns from your application, almost certainly (triggers and stored procedures are a lot more trouble than they're worth here, especially if you're concerned about performance -- don't bother).
The good news is that relational databases are very flexible for that sort of thing. You can always add new columns and populate them later. I would suggest for starters: int unsigned auto_increment primary key, unique hash char(32), and (assuming 64K chars suffices) text url.
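A sketch of that starting point, using the MD5() approach described above (the table and column names are placeholders):
CREATE TABLE urls (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    hash CHAR(32) NOT NULL UNIQUE,
    url  TEXT NOT NULL
);
-- INSERT IGNORE is a simple atomic write: duplicates are silently skipped
INSERT IGNORE INTO urls (hash, url) VALUES (MD5('http://example.com/foo'), 'http://example.com/foo');
-- lookup by full URL goes through the hash index
SELECT id, url FROM urls WHERE hash = MD5('http://example.com/foo');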
Presumably you're talking about an entire URL, not just a hostname, including CGI parameters and other stuff.
SHA-1 hashing the URLs makes all the keys long, and makes sorting out trouble fairly obscure. I had to use indexes on hashes once to obscure some confidential data while maintaining the ability to join two tables, and the performance was poor.
There are two possible approaches. One is the naive and obvious one; it will actually work well in mySQL. It has advantages such as simplicity, and the ability to use URL LIKE 'whatever%' to search efficiently.
But if you have lots of URLs concentrated in a few domains ... for example ....
http://stackoverflow.com/questions/3735390/best-primary-key-for-storing-urls
http://stackoverflow.com/questions/3735391/how-to-add-a-c-compiler-flag-to-extconf-rb
etc, you're looking at indexes which vary only in the last characters. In this case you might consider storing and indexing the URLs with their character order reversed. This may lead to a more efficiently accessed index.
(The Oracle table server product happens to have a built-in way of doing this with a so-called reverse index.)
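In MySQL you could emulate a reversed index with an extra column (the url_rev column is an assumption for illustration):
-- store the reversed URL alongside the original and index it
ALTER TABLE urls ADD COLUMN url_rev VARCHAR(255), ADD INDEX (url_rev);
UPDATE urls SET url_rev = REVERSE(url);
-- exact lookups then hit an index whose entries differ early rather than late
SELECT url FROM urls WHERE url_rev = REVERSE('http://stackoverflow.com/questions/3735390/best-primary-key-for-storing-urls');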
If I were you I would avoid an autoincrement key unless you have to join more than two tables ON TABLE_A.URL = TABLE_B.URL or some other join condition with that kind of meaning.
Depends on how you use the table. If you mostly select with WHERE url='<url>', then it's fine to have a one-column table. If you can use an autoincrement id to identify a URL in all places in your app, then use the autoincrement id.