How to avoid duplicate vertex entries in DSE Graph/Titan - unique

I have the following graph definition.
schema.propertyKey("ID").text().create()
schema.vertexLabel("Student").properties("ID").create()
When I execute the below Gremlin query, a new vertex is created.
g.addV(label, 'Student').property('ID', '1234')
When I execute it again, a new vertex with the same ID is created. I'm looking for a way to make the ID value unique, meaning I should get an error when I try to add a new student with the same ID (1234). Any help is highly appreciated.

I don't know about DSE Graph, but in Titan you can create an index and configure it to be unique. However, this is not recommended (at least not if it affects many vertices), as Titan has to use locks when inserting vertices covered by such an index in order to avoid duplicates.
You will get a better performance if you check whether the vertex exists already before inserting it.
Daniel Kuppitz provided a query for that on the mailing list [1]:
g.V().has('Student','ID','1234').tryNext().orElseGet{ g.addV(T.label,'Student','ID','1234').next() }
(Note that orElseGet takes a closure; without it, the addV would be executed even when the vertex already exists.)
Of course, you could still run into race conditions here, where two of those queries are evaluated for the same ID at the same time. But this should occur only rarely, and you can probably perform a regular clean-up with an OLAP job in an upcoming version of TinkerPop. (Unfortunately, it is currently not possible to modify the graph with an OLAP job.)
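The check-then-insert ("get or create") idea is independent of Gremlin; here is a minimal Python sketch of the same pattern, using an in-memory dict in place of the graph (all names are illustrative):

```python
# Minimal sketch of the check-then-insert ("get or create") pattern.
# A dict keyed by ID stands in for the graph; names are illustrative.
students = {}

def get_or_create(student_id):
    # Check for an existing vertex first; only insert when the lookup misses.
    existing = students.get(student_id)
    if existing is not None:
        return existing
    vertex = {"label": "Student", "ID": student_id}
    students[student_id] = vertex
    return vertex

a = get_or_create("1234")
b = get_or_create("1234")  # second call returns the existing entry
```

As in the Gremlin version, two concurrent callers could still race between the lookup and the insert; the dict sketch just shows the happy path.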
[1] https://groups.google.com/forum/#!topic/gremlin-users/pCYf6h3Frb8

When you define the schema for your graph, set the cardinality of the ID property to SINGLE.
From the Titan schema docs
SINGLE: Allows at most one value per element for such key. In other
words, the key→value mapping is unique for all elements in the graph.
The property key birthDate is an example with SINGLE cardinality since
each person has exactly one birth date.
Here's a link to the docs http://s3.thinkaurelius.com/docs/titan/1.0.0/schema.html

Related

Can/should I make id column that is part of a composite key non-unique [duplicate]

I have a table with an id (primary key with auto-increment), a uid (a key referring to a user's id, for example), and some other columns that don't matter for this question.
I want to create what you might call a separate auto-increment sequence on id for each uid value.
So, if I add an entry with uid 10, the id field for this entry will be 1 because there were no previous entries with uid 10. If I then add a new one with uid 4, its id will be 3 because there were already two entries with uid 4.
...A very obvious explanation, but I am trying to be as clear as I can to demonstrate the idea... clearly.
What SQL engine can provide such a functionality natively? (non Microsoft/Oracle based)
If there is none, how could I best replicate it? Triggers perhaps?
Does this functionality have a more suitable name?
In case you know about a non-SQL database engine providing such functionality, name it anyway; I am curious.
Thanks.
MySQL's MyISAM engine can do this. See their manual, in section Using AUTO_INCREMENT:
For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.
The docs go on after that paragraph, showing an example.
The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.
You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:
Mario starts a transaction and inserts a new row for user 4.
Bill starts a transaction and inserts a new row for user 4.
Mario's session fires a trigger to compute MAX(id)+1 for user 4. It gets 3.
Bill's session fires a trigger to compute MAX(id)+1 for user 4. It also gets 3.
Bill's session finishes its INSERT and commits.
Mario's session tries to finish its INSERT, but a row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.
In general, you can't control the order of execution of these steps without some kind of synchronization.
The solutions to this are either:
Get an exclusive table lock. Before attempting an INSERT, lock the table. This prevents concurrent INSERTs from creating a race condition like the one above. It's necessary to lock the whole table: since you're trying to restrict INSERT, there's no specific row to lock (if you were governing access to a given row with UPDATE, you could lock just that row). But locking the table serializes access to it, which limits your throughput.
Do it outside transaction scope. Generate the id number in a way that won't be hidden from two concurrent transactions. By the way, this is what AUTO_INCREMENT does. Two concurrent sessions will each get a unique id value, regardless of their order of execution or order of commit. But tracking the last generated id per userid requires access to the database, or a duplicate data store. For example, a memcached key per userid, which can be incremented atomically.
It's relatively easy to ensure that inserts get unique values. But it's hard to ensure they will get consecutive ordinal values. Also consider:
What happens if you INSERT in a transaction but then roll back? You've allocated id value 3 in that transaction, and then I allocated value 4, so if you roll back and I commit, now there's a gap.
What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
If you ever DELETE a row, do you need to renumber all the following rows for the same userid? What does that do to your memcached entries if you use that solution?
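A minimal sketch of option 2 above, generating per-user ids outside transaction scope. A real deployment would use an atomic external store such as memcached; here a lock-protected in-process dict stands in for it (all names are made up):

```python
import threading
from collections import defaultdict

# Sketch of a per-user id generator living outside transaction scope.
# The lock plays the role of the store's atomic-increment operation,
# so two concurrent sessions never receive the same value.
_lock = threading.Lock()
_counters = defaultdict(int)

def next_id(userid):
    # Atomically increment and return the counter for this user.
    with _lock:
        _counters[userid] += 1
        return _counters[userid]
```

Usage: `next_id(4)` yields 1, 2, 3, ... per user, regardless of when the callers commit. Note the caveats above still apply: a rolled-back INSERT leaves a gap, because the id was handed out before the transaction finished.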
SQL Server should allow you to do this. If you can't implement it using a computed column (probably not; there are some restrictions), surely you can implement it in a trigger.
MySQL would also allow you to implement this via triggers.
In a comment you ask the question about efficiency. Unless you are dealing with extreme volumes, storing an 8 byte DATETIME isn't much of an overhead compared to using, for example, a 4 byte INT.
It also massively simplifies your data inserts, as well as being able to cope with records being deleted without creating 'holes' in your sequence.
If you DO need this, be careful with the field names. If you have uid and id in a table, I'd expect id to be unique in that table, and uid to refer to something else. Perhaps, instead, use the field names property_id and amendment_id.
In terms of implementation, there are generally two options.
1). A trigger
Implementations vary, but the logic remains the same. As you don't specify an RDBMS (other than not MS/Oracle), the general logic is simple...
Start a transaction (often one is implicitly already started inside triggers)
Find the MAX(amendment_id) for the property_id being inserted
Update the newly inserted value with MAX(amendment_id) + 1
Commit the transaction
Things to be aware of are...
- multiple records being inserted at the same time
- records being inserted with amendment_id being already populated
- updates altering existing records
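As a rough illustration, the trigger logic above can be sketched in SQLite, driven from Python. Table, column, and trigger names are made up; a production trigger would also need to handle the caveats just listed:

```python
import sqlite3

# Sketch of the trigger approach: on INSERT, fill amendment_id with
# MAX(amendment_id) + 1 for the inserted property_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE amendments (
    property_id  INTEGER NOT NULL,
    amendment_id INTEGER,
    note         TEXT
);

-- Steps 2 and 3 of the logic above: find MAX(amendment_id) for the
-- inserted property_id and update the new row with MAX + 1.
-- The WHEN clause skips rows inserted with amendment_id already set.
CREATE TRIGGER number_amendment AFTER INSERT ON amendments
WHEN NEW.amendment_id IS NULL
BEGIN
    UPDATE amendments
    SET amendment_id = (SELECT COALESCE(MAX(amendment_id), 0) + 1
                        FROM amendments
                        WHERE property_id = NEW.property_id)
    WHERE rowid = NEW.rowid;
END;
""")

conn.execute("INSERT INTO amendments (property_id, note) VALUES (10, 'first')")
conn.execute("INSERT INTO amendments (property_id, note) VALUES (10, 'second')")
conn.execute("INSERT INTO amendments (property_id, note) VALUES (4, 'other')")
rows = conn.execute(
    "SELECT property_id, amendment_id FROM amendments ORDER BY rowid").fetchall()
```

Here `rows` comes back as [(10, 1), (10, 2), (4, 1)]: each property gets its own sequence. As noted earlier, without table locking this is still racy under concurrent inserts.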
2). A Stored Procedure
If you use a stored procedure to control writes to the table, you gain a lot more control.
Implicitly, you know you're only dealing with one record.
You simply don't provide a parameter for DEFAULT fields.
You know what updates / deletes can and can't happen.
You can implement all the business logic you like without hidden triggers
I personally recommend the Stored Procedure route, but triggers do work.
It is important to get your data types right.
What you are describing is a multi-part key. So use a multi-part key. Don't try to encode everything into a magic integer, you will poison the rest of your code.
If a record is identified by (entity_id,version_number) then embrace that description and use it directly instead of mangling the meaning of your keys. You will have to write queries which constrain the version number but that's OK. Databases are good at this sort of thing.
version_number could be a timestamp, as a_horse_with_no_name suggests. This is quite a good idea. There is no meaningful performance disadvantage to using timestamps instead of plain integers. What you gain is meaning, which is more important.
You could maintain a "latest version" table which contains, for each entity_id, only the record with the most-recent version_number. This will be more work for you, so only do it if you really need the performance.

Database Design: Storing previous version versus storing first version in version control systems

I am really new to using databases and database design/creating schemas and I'd really appreciate some advice/suggestions. I am creating an application where users enter data and I'm providing version control for that data to the users. Users can go in and revert changes or update values (sort of like git), etc and I am creating a database structure to store these values. Currently I have two different possibilities in mind, however I am not sure which one has more advantages.
First possibility: Store pointer to previous version
Data_Table
id IntegerField
data_content TextField
version_control_first_version ForeignKeyField(Data_Version_Control_Table)
Data_Version_Control_Table
id IntegerField
previous_version SelfReferentialForeignKey Nullable
In the first possibility, I store a link to the previous version of the data in the version control table. As new versions pour in, I create new rows in the version control table and link each row to the previous version's row. The Data Table only holds the newest version of the data. (I've decided to hold the current version in a different table, as this is the best approach for my use case, and the version control table will be significantly larger than the current-version table since there are a lot of entries there.)
Second Possibility: Store pointer to root/first version
Data_Table
id IntegerField
data_content TextField
version_control_first_version ForeignKeyField(Data_Version_Control_Table)
Data_Version_Control_Table
id IntegerField
first_version SelfReferentialForeignKey Nullable
version_number IntegerField
For this design, for all versions of the same data, I store a pointer to the first version of the data rather than the previous version and a version number. When I want to rollback to a particular version, I jump the number of versions that I need to, to get to the version I'm looking for. That's the only difference here and the rest is pretty much the same. I might also add that I am storing the date time that these versions were created as well if that might help.
Are there any significant advantages or disadvantages with these options that would justify using one over the other? Will I take a performance hit if I use one over the other? Which one will allow faster and easier queries and which one is the optimal model? Are there any flaws with any of these models?
Thanks for your help in advance and have a wonderful day :)
In both cases, you are trying to emulate pointers within a relational database, which:
Isn't readily enforceable through database constraints. For example, you can't declaratively guarantee that the version_control_first_version or first_version really points to the first version or that the previous_version doesn't form branches or cycles.
Doesn't fit very well with how indexes work. For example, traversing the "linked list" of previous_version would require either a recursive CTE or repeated querying. The first_version design is better in this regard, but still unnecessarily complicated.
It would be better, IMO, to just make the version table a weak entity:
CREATE TABLE Data_Table (
id INT PRIMARY KEY
-- Other fields...
);
CREATE TABLE Data_Version_Control_Table (
id INT REFERENCES Data_Table,
version_number INT,
PRIMARY KEY (id, version_number)
-- Other fields...
);
And then:
The first version can be efficiently identified by finding MIN(version_number) for the given id. MIN/MAX is efficient to find in a B-Tree, essentially equivalent to an index seek.
The previous version can also be efficiently found (by searching for the previous version_number). This is just an index seek.
All versions of the same object can be efficiently found by searching for object's id. This is just an index range scan, no need for "list traversal" (and as a bonus, versions are already in the correct order).
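A minimal SQLite sketch of this schema and the three lookups just described (names follow the question; data_content is moved into the version table for illustration, and this is a sketch rather than a full design):

```python
import sqlite3

# Sketch of the "weak entity" version table with a composite primary key
# (id, version_number), plus the three lookups described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Data_Table (
    id INTEGER PRIMARY KEY
);
CREATE TABLE Data_Version_Control_Table (
    id             INTEGER REFERENCES Data_Table,
    version_number INTEGER,
    data_content   TEXT,
    PRIMARY KEY (id, version_number)
);
""")
conn.execute("INSERT INTO Data_Table (id) VALUES (1)")
conn.executemany(
    "INSERT INTO Data_Version_Control_Table VALUES (1, ?, ?)",
    [(1, "draft"), (2, "edited"), (3, "final")])

# First version: MIN(version_number) for the id -- an index seek.
first = conn.execute(
    "SELECT MIN(version_number) FROM Data_Version_Control_Table WHERE id = 1"
).fetchone()[0]

# Previous version of version 3: the largest version_number below it.
prev = conn.execute(
    """SELECT MAX(version_number) FROM Data_Version_Control_Table
       WHERE id = 1 AND version_number < 3""").fetchone()[0]

# All versions of the object, already ordered -- an index range scan.
versions = [r[0] for r in conn.execute(
    """SELECT version_number FROM Data_Version_Control_Table
       WHERE id = 1 ORDER BY version_number""")]
```

No pointer-chasing is needed for any of the three queries; the composite primary key's B-tree answers all of them directly.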

Indexing in Neo4j, text or integer

I am creating an application which uses both MySQL and Neo4j. I think that listing the many node properties in a table will make it faster to read all those properties after querying for a specific set of nodes (or even before), but I am open to being proven wrong. After all, finding the properties of a row is what relational DBs are for.
To ensure consistency, I have created a property on each node which is the auto_increment ID in my sql table.
I wish neo4j would allow indexing a property regardless of labels but that's not the case and I struggle to understand why this is not possible at all.
Question is: do you think that performance in Neo4j would be much better if the index is on a number versus a string? I am wondering whether to drop the numeric id and just stick with node.name.
You can configure indexes on properties without referring to particular labels. You do this by editing node_auto_indexing in conf/neo4j.properties.
If you're looking to compare simple equality, I'd guess that indexing on numbers might be slightly faster, but I doubt the difference is big enough to be very meaningful, unless the string alternatives are very large.
Another option would be to put an AutoInc label and index on that label with your auto_id node property.
Assuming that auto_id is the property you added to all nodes to reference the MySQL auto_increment ID column, then:
CREATE INDEX ON :AutoInc(auto_id)
MATCH (n)
SET n:AutoInc

How do I design a schema to handle periodic bulk inserts/updates?

(tldr; I think that periodic updates force the table to use a natural key, and so I'll have to migrate my database schema.)
I have a production database with a table like planets, which, although it has good potential natural keys (e.g., the planet names, which never really change), uses a typical incremented integer as the primary key. The planets table has a self-referencing column or two, such as parent_planet_id.
Now I'm building offline cloud-based workers that re-create subsets of the planets records each week, and they need to be integrated with the main server. My plan is:
A worker instance has a mini version of the database (same schema, but no planets records)
Once per week, the worker fires up, does all its processing, creates its 100,000 or so planets records, and exports the data. (I don't think the export format matters for this particular problem: could be mysqldump, yaml, etc.)
Then, the production server imports the records: some are new records, most are updates.
This last step is what I don't know how to solve. I'm not entirely replacing the planets table each time, so the problem is that the two databases each have their own incrementing integer PK's. And so I can't just do a simple import.
I thought about exporting without the id column, but then I realized that the self-referencing columns prevent this.
I see two possible solutions:
Redesign the schema to use a natural key for the planets table. This would be a pain.
Use UUID instead of an incrementing integer for the key. Would be easier, I think, to move to. The id's would be unique, and the new rows could be safely imported. This also avoids the issues with depending on natural data in keys.
Modify the planets table to use an alternate hierarchy technique, like nested sets, a closure table, or path enumeration, and then export. This will break the ID dependency.
Or, if you still do not like the idea, consider your export and import as an ETL problem.
Modify the record during the export to include PlanetName, ParentPlanetName
Import all Planets (PlanetNames) first
Then import the hierarchy (ParentPlanetName)
In any case, the surrogate key from the first DB should never leave that DB -- it has no meaning outside of it.
The best solution (in terms of design) would be to refine your key architecture and implement a composite key carrying information about when and from where the planets were imported, but you do not want to do this.
An easier (I think), yet slightly "happy engineering" solution would be to modify the imported keys. You can do this, for example, like so:
1. lock the planets table in the main system (so no new key appears during the import),
2. create a lookup table with two columns, ID and PLANET NAME, based on the planet table in the main system,
3. get the maximum key value from that table,
4. increment every imported key value (both those identifying and those referencing the parent-child planet relationship) by adding the MAX value retrieved in step #3,
5. alter the main planet table and change the current auto-increment value to the actual MAX + 1,
6. now go over the table (a cursor loop within a procedure), checking whether the current planet name has a different key in your lookup; if so, first remove the record with the old key from the table, then update the key value in the currently inspected row to that old id (that was an update),
7. unlock the table.
Most operations are updates
So you need a "real" merge. In other words, you'll have to identify a proper order in which you can INSERT/UPDATE the data, so no FKs are violated in the process.
I'm not sure what parent_planet_id means, but assuming it means "orbits" and the word "planet" is stretched to also include moons, imagine you have only Phobos in your master database and Mars and Deimos need to be imported. This can only be done in a certain order:
INSERT Mars.
INSERT Deimos, set its parent_planet_id so it points to Mars.
UPDATE Phobos' parent_planet_id so it points to Mars.
While you could swap steps (2) and (3), you couldn't do either before step (1).
You'll need a recursive descent to determine the proper order and then compare natural keys1 to see what needs to be UPDATEd and what INSERTed. Unfortunately, MySQL doesn't support recursive queries, so you'll need to do it manually.
I don't quite see how surrogate keys help in this process - if anything, they add one more level of indirection you'll have to reconcile eventually.
1 Which, unlike surrogates, are meaningful across different databases. You can't just compare auto-incremented integers because the same integer value might identify different planets in different databases - you'll have false UPDATEs. GUIDs, on the other hand, will never match, even when rows describe the same planet - you'll have false INSERTs.
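The ordering requirement above amounts to a small topological sort over the parent links: a parent must be written before any row that references it. A Python sketch, with the data and names made up from the Mars/Deimos/Phobos example (in reality Phobos would be an UPDATE rather than an INSERT, but the ordering logic is the same):

```python
# Sketch of finding an FK-safe write order: emit a row only after its
# parent has been emitted. Data and names are illustrative.
incoming = {
    "Mars":   None,      # no parent: can go first
    "Deimos": "Mars",    # references Mars
    "Phobos": "Mars",    # references Mars
}

def insert_order(rows):
    # Repeatedly emit rows whose parent is already emitted (or absent).
    ordered, done = [], set()
    pending = dict(rows)
    while pending:
        ready = [name for name, parent in pending.items()
                 if parent is None or parent in done]
        if not ready:
            raise ValueError("cycle in parent links")
        for name in sorted(ready):
            ordered.append(name)
            done.add(name)
            del pending[name]
    return ordered

order = insert_order(incoming)
```

Here `order` comes out as ["Mars", "Deimos", "Phobos"]: Mars first, then its moons in either order. Since MySQL lacks recursive queries, a loop like this in application code is one way to do the "recursive descent" the answer mentions.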

How does a hash table work? Is it faster than "SELECT * from .."

Let's say I have:
Key | Indexes | Key-values
----+---------+------------
001 | 100001 | Alex
002 | 100002 | Micheal
003 | 100003 | Daniel
Let's say we want to search for 001. How does the fast search work using a hash table?
Isn't it the same as using "SELECT * from .." in MySQL? I have read a lot; they say "SELECT *" searches from beginning to end, but a hash table does not. Why and how?
By using a hash table, are we reducing the number of records we search? How?
Can anyone demonstrate how to insert and retrieve hash table process in mysql query code? e.g.,
SELECT * from table1 where hash_value="bla" ...
Another scenario:
If the indexes are like S0001, S0002, T0001, T0002, etc., in MySQL I could use:
SELECT * from table WHERE value = S*
Isn't that the same, and faster?
A simple hash table works by keeping the items on several lists, instead of just one. It uses a very fast and repeatable (i.e. non-random) method to choose which list to keep each item on. So when it is time to find the item again, it repeats that method to discover which list to look in, and then does a normal (slow) linear search in that list.
By dividing the items up into 17 lists, the search becomes 17 times faster, which is a good improvement.
Although of course this is only true if the lists are roughly the same length, so it is important to choose a good method of distributing the items between the lists.
In your example table, the first column is the key, the thing we need to find the item by. And let's suppose we will maintain 17 lists. To insert something, we perform an operation on the key called hashing. This just turns the key into a number. It doesn't return a random number, because it must always return the same number for the same key. But at the same time, the numbers must be "spread out" widely.
Then we take the resulting number and use modulus to shrink it down to the number of lists:
Hash(key) % 17
This all happens extremely fast. Our lists are in an array, so:
_lists[Hash(key) % 17].Add(record);
And then later, to find the item using that key:
Record found = _lists[Hash(key) % 17].Find(key);
Note that each list can just be any container type, or a linked list class that you write by hand. When we execute a Find in that list, it works the slow way (examine the key of each record).
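The whole scheme above (17 lists, hash, modulus, linear scan within a list) fits in a few lines of Python; this is purely illustrative, using the keys from the question's table:

```python
# Sketch of the 17-list hash table described above.
NUM_LISTS = 17
_lists = [[] for _ in range(NUM_LISTS)]

def add(key, record):
    # Hash the key, shrink it with modulus, append to the chosen list.
    _lists[hash(key) % NUM_LISTS].append((key, record))

def find(key):
    # Repeat the same hashing to pick the list, then scan it linearly.
    for k, record in _lists[hash(key) % NUM_LISTS]:
        if k == key:
            return record
    return None

add("001", "Alex")
add("002", "Micheal")
add("003", "Daniel")
```

A `find("001")` call touches only one of the 17 lists, which is the entire speed-up: the other 16 lists are never examined.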
Do not worry about what MySQL is doing internally to locate records quickly. The job of a database is to do that sort of thing for you. Just run a SELECT [columns] FROM table WHERE [condition]; query and let the database generate a query plan for you. Note that you don't want to use SELECT *, since if you ever add a column to the table that will break all your old queries that relied on there being a certain number of columns in a certain order.
If you really want to know what's going on under the hood (it's good to know, but do not implement it yourself: that is the purpose of a database!), you need to know what indexes are and how they work. If a table has no index on the columns involved in the WHERE clause, then, as you say, the database will have to search through every row in the table to find the ones matching your condition. But if there is an index, the database will search the index to find the exact location of the rows you want, and jump directly to them. Indexes are usually implemented as B+-trees, a type of search tree that uses very few comparisons to locate a specific element. Searching a B-tree for a specific key is very fast. MySQL is also capable of using hash indexes, but these tend to be slower for database uses. Hash indexes usually only perform well on long keys (character strings especially), since they reduce the size of the key to a fixed hash size. For data types like integers and real numbers, which have a well-defined ordering and fixed length, the easy searchability of a B-tree usually provides better performance.
You might like to look at the chapters in the MySQL manual and PostgreSQL manual on indexing.
http://en.wikipedia.org/wiki/Hash_table
Hash tables may be used as in-memory data structures. Hash tables may also be adopted for use with persistent data structures; database indices sometimes use disk-based data structures based on hash tables, although balanced trees are more popular.
I guess you could use a hash function to get the ID you want to select from. Like
SELECT * FROM table WHERE value = hash_fn(whatever_input_you_build_your_hash_value_from)
Then you don't need to know the id of the row you want to select, and you can do an exact query: the row will always have the same id because of the input you build the hash value from, and you can always recreate this id through the hash function.
However, this isn't always true, depending on the size of the table and the maximum number of hash values (you often have "X mod hash-table-size" somewhere in your hash). To take care of this, you need a deterministic strategy to apply each time two values end up with the same id. Check Wikipedia for more info on this strategy; it's called collision handling and should be covered in the same article as hash tables.
MySQL probably uses hash tables somewhere because of the O(1) lookups that norheim.se mentioned.
Hash tables are great for locating entries at O(1) cost where the key (that is used for hashing) is already known. They are in widespread use both in collection libraries and in database engines. You should be able to find plenty of information about them on the internet. Why don't you start with Wikipedia or just do a Google search?
I don't know the details of mysql. If there is a structure in there called "hash table", that would probably be a kind of table that uses hashing for locating the keys. I'm sure someone else will tell you about that. =)
EDIT: (in response to comment)
Ok. I'll try to make a grossly simplified explanation: A hash table is a table where the entries are located based on a function of the key. For instance, say that you want to store info about a set of persons. If you store it in a plain unsorted array, you would need to iterate over the elements in sequence in order to find the entry you are looking for. On average, this will need N/2 comparisons.
If, instead, you put all entries at indexes based on the first character of the person's first name (A=0, B=1, C=2, etc.), you will immediately be able to find the correct entry as long as you know the first name. This is the basic idea. You probably realize that some special handling (rehashing, or allowing lists of entries) is required in order to support multiple entries having the same first letter. If you have a well-dimensioned hash table, you should be able to get straight to the item you are searching for. This means approximately one comparison, with the disclaimer of the special handling I just mentioned.
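The first-letter scheme in the last paragraph can be sketched as a tiny Python example, with one list per letter to handle several people sharing an initial (names and data are made up):

```python
from string import ascii_uppercase

# Sketch of bucketing entries by the first letter of the name
# (A=0, B=1, ..., Z=25), with a list per bucket for collisions.
buckets = {letter: [] for letter in ascii_uppercase}

def put(name, info):
    buckets[name[0].upper()].append((name, info))

def get(name):
    # Jump straight to the right bucket, then compare within it.
    for n, info in buckets[name[0].upper()]:
        if n == name:
            return info
    return None

put("Alice", {"age": 30})
put("Adam", {"age": 25})   # lands in the same bucket as Alice
put("Bob", {"age": 40})
```

This is just a hash table whose hash function is "take the first letter"; a real hash function spreads keys out far more evenly.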