How to check if a node is already indexed in the neo4j-spatial index? - neo4j-spatial

I'm running the latest Neo4j v2 with the spatial plugin installed. I have managed to index almost all of the nodes I need in the geo index. One problem I'm struggling with is how to easily check whether a node has already been indexed.
I can't find any REST endpoint that provides this information, and it is not easy to get at with Cypher. I tried the following query, which seems to give me the result I want, except that the runtime is unacceptable:
MATCH (a)-[:RTREE_REFERENCE]->(b) where b.id=989898 return b;
The geo index only stores a reference to the indexed node, as the value of an id property on a node reached through the RTREE_REFERENCE relationship, so I figured this could be the way to go.
This query currently takes 14459 ms when run from the neo4j-shell.
My database is not big: about 41000 nodes in total that I want to add to the spatial index.
There must be a better way to do this. Any ideas or pointers would be greatly appreciated.

Since you know the ID of your data node, you can access it directly in Cypher without an index, and just check for the incoming RTREE_REFERENCE relationship:
START n=node(989898) MATCH (p)-[r:RTREE_REFERENCE]->(n) RETURN r;
As a side note, your Cypher used 'WHERE b.id=989898', but if this is an internal node ID, that will not work, since b.id looks for a property with the key 'id'. For the internal node ID, use 'id(b)'.
If your 'id' is actually a node property (and not its internal ID), then I think @deemeetree's suggestion is better, using an index over this property.

Right now your query seems to be scanning every node in the graph that has a :RTREE_REFERENCE relationship and checking the id property of each one.
Why don't you instead start your search from the node whose id you need and then get the paths from there?
I also don't quite understand why you need to return the node that you're already specifying, but anyway.
As you're running Neo4j v2, I recommend adding labels to your nodes (all of them in the example below):
START n=node(*) SET n:YOUR_LABEL_NAME
Then create an index on the id property for that label:
CREATE INDEX ON :YOUR_LABEL_NAME(id)
Once you've done that, run a query like this:
MATCH (a)-[:RTREE_REFERENCE]->(b:YOUR_LABEL_NAME {id: 989898}) RETURN a, b;
That should increase the speed of your query.
Let me know if that works and please explain why you were querying b in your original question if you already knew it...
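To make the difference concrete, here is a plain-Python analogy (it does not model Neo4j internals). The slow query behaves like a scan over every RTREE_REFERENCE relationship, while an index behaves like a lookup table keyed by the id property. The data layout and the names `references`, `find_by_scan`, and `build_index` are invented for illustration.

```python
# Toy model: each tuple is (rtree_node_name, data_node_properties).
# Hypothetical data; a real spatial layer stores this inside Neo4j.
references = [("rtree-%d" % (i // 100), {"id": i}) for i in range(41000)]


def find_by_scan(refs, wanted_id):
    """Check every reference in turn, like the unindexed MATCH ... WHERE."""
    for parent, props in refs:
        if props["id"] == wanted_id:
            return parent
    return None


def build_index(refs):
    """Build a lookup table once, roughly what an index on id provides."""
    return {props["id"]: parent for parent, props in refs}


index = build_index(references)

# Both approaches agree; the indexed one needs a single hash lookup
# instead of walking the whole list.
assert find_by_scan(references, 40999) == index.get(40999)
```

The scan cost grows with the number of references, whereas the lookup cost stays roughly constant once the index exists.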

Empty N1QL result set

I have a Couchbase bucket (myBucket) which contains 1.7 billion documents. I have a primary index on the bucket that should make myBucket fully queryable.
CREATE PRIMARY INDEX `my_primary` ON myBucket;
The issue is that I cannot get ANY results from N1QL. All responses are empty. Even doing something as simple as:
SELECT * from myBucket LIMIT 1;
Winds up returning an empty set.
Can you provide some basic information about your setup, such as the server version and document size? Also, check the logs (especially indexer.log and query.log) for any reported errors or warnings.
To sanity-check the setup, can you first try with a smaller dataset, or create a partial index on a smaller amount of data and run the query using that index? Based on that we can guide you further.
-Prasad

How to avoid duplicate vertex entries in DSE Graph/Titan

I have the following graph definition.
schema.propertyKey("ID").text().create()
schema.vertexLabel("Student").properties("ID").create()
When I execute the below Gremlin query, a new vertex is created.
g.addV(label, 'Student').property('ID', '1234')
When I execute it again, a new vertex with the same ID is created. I'm looking for a way to make the ID value unique, meaning I should get an error when I try to add a new student with the same ID (1234). Any help is highly appreciated.
I don't know about DSE Graph, but in Titan you can create an index and configure it to be unique. However, that is not recommended (at least not if it affects many vertices), as Titan has to use locks when inserting vertices under such an index in order to avoid duplicates.
You will get better performance if you check whether the vertex already exists before inserting it.
Daniel Kuppitz provided a query for that on the mailing list [1]:
g.V().has('Student','ID','1234').tryNext().orElseGet{ g.addV(T.label,'Student','ID','1234').next() }
Of course, you could get into race conditions here where two of those queries are evaluated for the same ID at the same time. But this should only occur very rarely and you can probably perform a regular clean-up with an OLAP job with an upcoming version of TinkerPop. (Unfortunately, it is currently not possible to modify the graph with an OLAP job.)
[1] https://groups.google.com/forum/#!topic/gremlin-users/pCYf6h3Frb8
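The check-then-insert pattern and its race condition can be sketched in plain Python. This is an in-memory stand-in, not DSE Graph or Titan: the `VertexStore` class and its methods are hypothetical. The lock shows what a uniqueness guarantee costs, since every insert has to serialize on it.

```python
import threading


class VertexStore:
    """In-memory stand-in for a graph (hypothetical, for illustration)."""

    def __init__(self):
        self._by_id = {}
        self._lock = threading.Lock()

    def get_or_create(self, student_id):
        # The lock makes "check, then insert" a single atomic step, so two
        # threads asking for the same ID cannot both insert a vertex.
        with self._lock:
            vertex = self._by_id.get(student_id)
            if vertex is None:
                vertex = {"label": "Student", "ID": student_id}
                self._by_id[student_id] = vertex
            return vertex

    def count(self):
        return len(self._by_id)


store = VertexStore()
threads = [
    threading.Thread(target=store.get_or_create, args=("1234",))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, two threads could both see that the ID is missing and both insert, which is the rare duplicate case described above.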
When you define the schema for your graph, set the cardinality of the ID property to SINGLE.
From the Titan schema docs:
SINGLE: Allows at most one value per element for such key. In other
words, the key→value mapping is unique for all elements in the graph.
The property key birthDate is an example with SINGLE cardinality since
each person has exactly one birth date.
Here's a link to the docs http://s3.thinkaurelius.com/docs/titan/1.0.0/schema.html

Indexing in Neo4j, text or integer

I am creating an application which uses both MySQL and Neo4j. I think that listing the nodes' many properties in a table will make it faster to read them all after querying for a specific set of nodes (or even before), but I am open to being proven wrong. After all, finding the properties of a row is what relational databases are for.
To ensure consistency, I have created a property on each node which is the auto_increment ID in my MySQL table.
I wish Neo4j would allow indexing a property regardless of labels, but that's not the case, and I struggle to understand why this is not possible at all.
The question is: do you think that performance in Neo4j would be much better if the index is on a number rather than a string? I am considering whether to drop the numeric ID and just stick with node.name.
You can configure indexes on properties without referring to particular labels. You do this by editing node_auto_indexing in conf/neo4j.properties.
If you're looking to compare simple equality, I'd guess that indexing on numbers might be slightly faster, but I doubt the difference is big enough to be very meaningful, unless the string alternatives are very large.
Another option would be to put an AutoInc label and index on that label with your auto_id node property.
Assuming that auto_id is the property you added to all nodes to reference the MySQL auto_increment ID column, then:
CREATE INDEX ON :AutoInc(auto_id)
MATCH (n)
SET n:AutoInc

MySQL: SELECTing by hash: is this possible?

I don't think it makes much sense, although this way you could hide the real static value from the .php file, keeping only its hash in the PHP file for the MySQL query. The source of a PHP file can't be reached from the user's machine, but you have to make backups of your files, and that static value is in them. Selecting by the hash of a column would resolve this problem, I believe.
But I didn't find any examples or documentation saying that it's possible to use such functions in queries (not on values in SQL queries, but on the columns being selected).
Is this possible?
Here is a query that simply selects all rows with an empty "column". It is extremely slow, because MD5(column) has to be computed for every row:
SELECT * FROM table WHERE MD5(column) = 'd41d8cd98f00b204e9800998ecf8427e'
If you're doing a lot of these queries, consider saving the MD5 hash in a column or index. Even better would be to do all MD5 calculations on the script's end - the day you're going to need an extra server for your project you'll notice that webservers scale a lot better than database servers. (That's something to worry about in the future, of course)
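A minimal sketch of that advice, assuming Python on the script side and using SQLite as a stand-in for MySQL so the example runs on its own. The table `items`, the column `value_md5`, and the helper names are hypothetical.

```python
import hashlib
import sqlite3


def md5_hex(value):
    """Return the MD5 digest of a string as 32 hex characters."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.execute("CREATE TABLE items (value TEXT, value_md5 TEXT)")
conn.execute("CREATE INDEX idx_items_md5 ON items (value_md5)")

# Hash in the application at insert time and store the result.
for value in ["", "alpha", "beta"]:
    conn.execute(
        "INSERT INTO items (value, value_md5) VALUES (?, ?)",
        (value, md5_hex(value)),
    )


def find_by_hash(connection, digest):
    """Look up rows by the stored hash; the index can serve this query."""
    rows = connection.execute(
        "SELECT value FROM items WHERE value_md5 = ?", (digest,)
    )
    return [row[0] for row in rows]
```

Because the comparison is against a stored, indexed column rather than a function applied to each row, the database no longer has to hash every row per query.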
It should be noted that setting up your system this way won't actually solve any problem in your particular case. You are not making your system more secure doing this, you are just making it more convoluted.
The standard way to hide secret values from the source base is to store them in a separate file, and never submit that file to source control or make a backup of it. Load the secret value using PHP code and then work with the value directly in MySQL (one way to do this is to store a "config.php" file or something along those lines that just sets variables/constants, and then just php-include the file).
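The same idea, sketched in Python rather than PHP to stay self-contained: the secret lives in a separate file that the code reads at runtime. The JSON format, the file name `config.json`, and the key `lookup_value` are assumptions made for this example.

```python
import json
import os
import tempfile


def load_secret(path, key):
    """Read one secret from a JSON config file kept out of source control."""
    with open(path, "r", encoding="utf-8") as handle:
        config = json.load(handle)
    return config[key]


# Demo only: write a throwaway config file, then load the value from it.
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "config.json")
    with open(config_path, "w", encoding="utf-8") as handle:
        json.dump({"lookup_value": "example-secret"}, handle)
    secret = load_secret(config_path, "lookup_value")
```

In a real deployment the config file would already exist on the server and be excluded from the repository, so only the loading function would appear in the code base.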
That said, I'll answer your question anyway.
MySQL actually has a wide variety of hashing and encryption functions. See http://dev.mysql.com/doc/refman/5.0/en/encryption-functions.html
Since you tagged your question md5 I'm assuming the function you're looking for is MD5: http://dev.mysql.com/doc/refman/5.0/en/encryption-functions.html#function_md5
You select it just like this:
SELECT MD5(column) AS hashed_column FROM table
Then the value to compare to the hash will be in the column alias 'hashed_column'.
Or to select a particular row based on the hash:
SELECT * FROM table WHERE MD5(column) = 'hashed-value-to-compare'
If I understand correctly, you want to use a hash as a primary key:
INSERT INTO MyTable (pk) VALUES (MD5('plain-value'));
Then you want to retrieve it by hash without knowing what its hash digest is:
SELECT * FROM MyTable WHERE pk = MD5('plain-value');
Somehow this is supposed to provide greater security in case people steal a backup of your database and PHP code? Well, it doesn't. If I know the original plain-value and the method of hashing, I can find the data just as easily as if you didn't hash the value.
I agree with the comment from @scunliffe: we're not sure exactly what problem you're trying to solve, but it sounds like this method will not solve it.
It's also inefficient to use an MD5 hash digest as a primary key. You have to store it in a CHAR(32), or else UNHEX it and store it in BINARY(16). Regardless, you can't use INT or even BIGINT as the primary key datatype. The key values are more bulky, and therefore make larger indexes.
Also, new rows will be inserted at an arbitrary location in the clustered index. That's more expensive than adding new values to the end of the B-tree, as you would do if you used simple auto-incrementing integers like everyone else.
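The size difference is easy to check with a short Python sketch. The input `b"plain-value"` mirrors the example above; the `sizes` dictionary is just a convenient way to hold the comparison.

```python
import hashlib

digest = hashlib.md5(b"plain-value")

hex_form = digest.hexdigest()  # 32 hex characters, what CHAR(32) would hold
raw_form = digest.digest()     # 16 raw bytes, what BINARY(16) would hold

# A BIGINT key occupies 8 bytes, for comparison.
bigint_bytes = (2**63 - 1).to_bytes(8, "big")

sizes = {
    "CHAR(32) hex": len(hex_form),
    "BINARY(16) raw": len(raw_form),
    "BIGINT": len(bigint_bytes),
}
```

Converting the hex form back to bytes, which is what UNHEX does in MySQL, yields the 16-byte raw form, still twice the width of a BIGINT key.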

MySQL: is there something like an internal record identifier for every record in a MySQL table?

I'm building a spreadsheet app using MySQL as storage, and I need to identify records that are being updated client-side in order to save the changes.
Is there a way, such as some kind of "internal record identifier" (internal as in used by the database engine itself), to uniquely identify records, so that I'll be able to update the correct one?
Certainly, a SELECT query can be used to identify the record, including all the fields in the table, but obviously that has the downside of returning multiple records in most situations.
IMPORTANT: the spreadsheet app aims to work on ANY table, even tremendously poorly designed ones without any keys, so solutions such as "define a field with a UNIQUE index and work with that" are not an option. Table structure may be extremely variable and must not matter.
Many thanks.
AFAIK no such unique internal identifier (say, a simple row ID) exists.
You may be able to run a SELECT without any sorting and then get the n-th row using a LIMIT. Under what conditions that is reliable and safe to use, a MySQL guru would need to confirm. It probably never is.
Try playing around with phpMyAdmin, the web frontend to MySQL. It is designed to deal with badly designed tables without keys. If I remember correctly, it uses all the columns it can get hold of in such cases:
UPDATE xyz SET a = b WHERE `fieldname` = 'value'
AND `fieldname2` = 'value2'
AND `fieldname3` = 'value3'
LIMIT 1;
and so on.
That isn't entirely duplicate-safe either, of course.
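A sketch of how an application could generate such a statement for an arbitrary table. It only builds the SQL text and parameter list and does not talk to a database. The function `build_update` is hypothetical, and the `%s` placeholders assume the parameter style used by common MySQL drivers for Python.

```python
def build_update(table, original_row, changes):
    """Build a MySQL UPDATE that matches a row by all of its old values.

    original_row and changes map column names to values. Returns the SQL
    text with %s placeholders plus the matching parameter list.
    """
    def quote(name):
        # Backtick-quote identifiers, doubling any embedded backtick.
        return "`" + name.replace("`", "``") + "`"

    set_sql = ", ".join(quote(col) + " = %s" for col in changes)
    # <=> is MySQL's NULL-safe equality, so NULL columns still match.
    where_sql = " AND ".join(quote(col) + " <=> %s" for col in original_row)
    sql = "UPDATE {} SET {} WHERE {} LIMIT 1".format(
        quote(table), set_sql, where_sql
    )
    params = list(changes.values()) + list(original_row.values())
    return sql, params


sql, params = build_update(
    "xyz",
    {"fieldname": "value", "fieldname2": None},
    {"a": "b"},
)
```

LIMIT 1 keeps the statement from touching more than one row, but if two rows are identical in every column there is no way to say which of them gets updated.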
The only idea that comes to my mind is to add a key column at runtime, and to remove it when your app is done. It's a goose-bump-inducing idea, but maybe better than nothing.
MySQL has "auto-increment" numeric columns that you can add and even define as a primary key. That gives you a unique record ID generated automatically by the database. You can query the ID of the record you just inserted with SELECT LAST_INSERT_ID().
example from mysql's official documentation here
To my knowledge, MySQL lacks the implicit ROWID feature seen in Oracle (and present in other engines with their own syntax). You'll have to create your own AUTO_INCREMENT field.