MySQL: is there something like an internal record identifier for every record in a MySQL table?

I'm building a spreadsheet app using MySQL as storage, I need to identify records that are being updated client-side in order to save the changes.
Is there a way, such as some kind of "internal record identifier" (internal as in used by the database engine itself), to uniquely identify records, so that I'll be able to update the correct one?
Certainly, a SELECT query that matches on every field in the table can be used to identify the record, but that has the obvious downside of matching multiple records whenever duplicate rows exist.
IMPORTANT: the spreadsheet app aims to work on ANY table, even ones tremendously poorly designed, without any keys, so solutions such as "define a field with an UNIQUE index and work with that" are not an option, table structure may be extremely variable and must not matter.
Many thanks.

AFAIK no such unique internal identifier (say, a simple row ID) exists.
You might be able to run a SELECT without any sorting and then fetch the n-th row using LIMIT, but a MySQL guru would need to confirm under what conditions that is reliable and safe to use. It probably never is: without an ORDER BY, row order is not guaranteed.
Try playing around with phpMyAdmin, the web frontend to MySQL. It is designed to deal with badly designed tables without keys. If I remember correctly, it uses all columns it can get hold of in such cases:
UPDATE xyz SET a = b
WHERE `fieldname` = 'value'
AND `fieldname2` = 'value2'
AND `fieldname3` = 'value3'
LIMIT 1;
and so on.
That isn't entirely duplicate-safe either, of course.
The only idea that comes to my mind is to add a key column at runtime, and to remove it when your app is done. It's a goose-bump-inducing idea, but maybe better than nothing.
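A minimal sketch of that idea, assuming the table has no primary key yet and no column already named __rowid (both names here are illustrative):

-- add a temporary surrogate key; MySQL backfills it for existing rows
ALTER TABLE xyz ADD COLUMN `__rowid` INT NOT NULL AUTO_INCREMENT PRIMARY KEY;

-- the spreadsheet app can now address individual rows unambiguously
UPDATE xyz SET a = b WHERE `__rowid` = 42;

-- remove the helper column when the app is done
ALTER TABLE xyz DROP COLUMN `__rowid`;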

MySQL has AUTO_INCREMENT numeric columns that you can add and even define as a primary key; that would give you a unique record id automatically generated by the database. You can query the id of the record you just inserted with SELECT LAST_INSERT_ID().
There is an example in MySQL's official documentation.
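A quick sketch of that approach (table and column names are illustrative):

CREATE TABLE records (
  id INT NOT NULL AUTO_INCREMENT,
  data VARCHAR(100),
  PRIMARY KEY (id)
);

INSERT INTO records (data) VALUES ('some value');

SELECT LAST_INSERT_ID();  -- returns the id generated by this session's last INSERT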

To my knowledge, MySQL lacks the implicit ROWID feature found in Oracle (other engines have similar features under their own syntax). You'll have to create your own AUTO_INCREMENT field.

Related

Can/should I make id column that is part of a composite key non-unique [duplicate]

I have got a table which has an id (primary key with auto increment), uid (a key referring to a user's id, for example) and something else which, for my question, won't matter.
I want to make, let's call it, a separate auto-increment sequence on id for each uid entry.
So, I will add an entry with uid 10, and the id field for this entry will be 1 because there were no previous entries with a value of 10 in uid. I will then add a new one with uid 4, and its id will be 3 because there were already two entries with uid 4.
...A very obvious explanation, but I am trying to be as explanatory and clear as I can to demonstrate the idea... clearly.
What SQL engine can provide such a functionality natively? (non Microsoft/Oracle based)
If there is none, how could I best replicate it? Triggers perhaps?
Does this functionality have a more suitable name?
In case you know about a non-SQL database engine providing such functionality, name it anyway, I am curious.
Thanks.
MySQL's MyISAM engine can do this. See their manual, in section Using AUTO_INCREMENT:
For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.
The docs go on after that paragraph, showing an example.
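From memory, the manual's example looks roughly like this (check the current docs for the authoritative version):

CREATE TABLE animals (
  grp ENUM('fish','mammal','bird') NOT NULL,
  id MEDIUMINT NOT NULL AUTO_INCREMENT,
  name CHAR(30) NOT NULL,
  PRIMARY KEY (grp, id)
) ENGINE=MyISAM;

INSERT INTO animals (grp, name) VALUES
  ('mammal','dog'), ('mammal','cat'),
  ('bird','penguin'), ('fish','lax');
-- id restarts at 1 within each distinct grp value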
The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.
You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:
Mario starts transaction and inserts a new row for user 4.
Bill starts transaction and inserts a new row for user 4.
Mario's session fires a trigger to compute MAX(id) + 1 for user 4. He gets 3.
Bill's session fires a trigger to compute MAX(id) + 1 for user 4. He also gets 3.
Bill's session finishes his INSERT and commits.
Mario's session tries to finish his INSERT, but the row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.
In general, you can't control the order of execution of these steps without some kind of synchronization.
The solutions to this are either:
Get an exclusive table lock. Before trying an INSERT, lock the table. This is necessary to prevent concurrent INSERTs from creating a race condition like in the example above. It's necessary to lock the whole table: since you're trying to restrict INSERT, there's no specific row to lock (if you were trying to govern access to a given row with UPDATE, you could lock just that row). But locking the table makes access to it serial, which limits your throughput. (See the sketch after this list.)
Do it outside transaction scope. Generate the id number in a way that won't be hidden from two concurrent transactions. By the way, this is what AUTO_INCREMENT does. Two concurrent sessions will each get a unique id value, regardless of their order of execution or order of commit. But tracking the last generated id per userid requires access to the database, or a duplicate data store. For example, a memcached key per userid, which can be incremented atomically.
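A minimal sketch of the table-lock option, assuming a table amendments(userid, id, note); note that under LOCK TABLES a table referenced twice in one statement needs a lock per alias:

LOCK TABLES amendments WRITE, amendments AS a READ;

INSERT INTO amendments (userid, id, note)
SELECT 4, COALESCE(MAX(a.id), 0) + 1, 'new row'
FROM amendments AS a
WHERE a.userid = 4;

UNLOCK TABLES;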
It's relatively easy to ensure that inserts get unique values. But it's hard to ensure they will get consecutive ordinal values. Also consider:
What happens if you INSERT in a transaction but then roll back? You've allocated id value 3 in that transaction, and then I allocated value 4, so if you roll back and I commit, now there's a gap.
What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
If you ever DELETE a row, do you need to renumber all the following rows for the same userid? What does that do to your memcached entries if you use that solution?
SQL Server should allow you to do this. If you can't implement this using a computed column (probably not - there are some restrictions), surely you can implement it in a trigger.
MySQL also would allow you to implement this via triggers.
In a comment you ask about efficiency. Unless you are dealing with extreme volumes, storing an 8-byte DATETIME isn't much of an overhead compared to, for example, a 4-byte INT.
It also massively simplifies your data inserts, and it copes with records being deleted without creating 'holes' in your sequence.
If you DO need this, be careful with the field names. If you have uid and id in a table, I'd expect id to be unique in that table, and uid to refer to something else. Perhaps, instead, use the field names property_id and amendment_id.
In terms of implementation, there are generally two options.
1). A trigger
Implementations vary, but the logic remains the same. As you don't specify an RDBMS (other than not MS/Oracle), the general logic is simple (a MySQL sketch follows the caveats below)...
Start a transaction (often one is implicitly already started inside triggers)
Find the MAX(amendment_id) for the property_id being inserted
Update the newly inserted value with MAX(amendment_id) + 1
Commit the transaction
Things to be aware of are...
- multiple records being inserted at the same time
- records being inserted with amendment_id being already populated
- updates altering existing records
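For illustration, a hedged MySQL sketch of such a trigger, assuming a table amendments(property_id, amendment_id, ...); the race condition described in the earlier answer still applies under concurrent inserts:

DELIMITER //
CREATE TRIGGER amendments_bi
BEFORE INSERT ON amendments
FOR EACH ROW
BEGIN
  -- only fill amendment_id when the caller didn't supply one
  IF NEW.amendment_id IS NULL THEN
    SET NEW.amendment_id = (
      SELECT COALESCE(MAX(amendment_id), 0) + 1
      FROM amendments
      WHERE property_id = NEW.property_id
    );
  END IF;
END//
DELIMITER ;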
2). A Stored Procedure
If you use a stored procedure to control writes to the table, you gain a lot more control.
Implicitly, you know you're only dealing with one record.
You simply don't provide a parameter for DEFAULT fields.
You know what updates / deletes can and can't happen.
You can implement all the business logic you like without hidden triggers
I personally recommend the Stored Procedure route, but triggers do work.
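For completeness, a rough stored-procedure sketch under the same assumed amendments(property_id, amendment_id, note) schema. With InnoDB and an index on (property_id, amendment_id), the FOR UPDATE should serialize concurrent callers per property rather than per table, though you should verify the locking behavior on your version:

DELIMITER //
CREATE PROCEDURE insert_amendment(IN p_property_id INT, IN p_note TEXT)
BEGIN
  START TRANSACTION;
  -- compute the next per-property number and insert in one statement
  INSERT INTO amendments (property_id, amendment_id, note)
  SELECT p_property_id, COALESCE(MAX(amendment_id), 0) + 1, p_note
  FROM amendments
  WHERE property_id = p_property_id
  FOR UPDATE;
  COMMIT;
END//
DELIMITER ;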
It is important to get your data types right.
What you are describing is a multi-part key. So use a multi-part key. Don't try to encode everything into a magic integer, you will poison the rest of your code.
If a record is identified by (entity_id,version_number) then embrace that description and use it directly instead of mangling the meaning of your keys. You will have to write queries which constrain the version number but that's OK. Databases are good at this sort of thing.
version_number could be a timestamp, as a_horse_with_no_name suggests. This is quite a good idea. There is no meaningful performance disadvantage to using timestamps instead of plain integers. What you gain is meaning, which is more important.
You could maintain a "latest version" table which contains, for each entity_id, only the record with the most-recent version_number. This will be more work for you, so only do it if you really need the performance.
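A sketch of embracing the multi-part key directly (entity_versions and payload are illustrative names):

CREATE TABLE entity_versions (
  entity_id INT NOT NULL,
  version_number TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  payload TEXT,
  PRIMARY KEY (entity_id, version_number)
);

-- fetch the latest version of one entity
SELECT *
FROM entity_versions
WHERE entity_id = 42
ORDER BY version_number DESC
LIMIT 1;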

Index for one column only for specific other column value

I've got a table logs where there are, among others, two fields: action (VARCHAR 45) and info (VARCHAR 10000).
There are multiple things logged to this table, and one of them is the user's IP when visiting a page. For this situation action='ip', info='IP.ADD.RE.SS'.
Because info can hold a large amount of text for specific things logged, I would like to create an INDEX that works on the info field for action='ip' only, so I can search for IPs quickly and don't have an index overgrown with "actions".
I've already tried creating an INDEX on the first 15 characters, but IP entries are still only about 1% of all the stuff, so it seems a bit of an overkill to me.
This entire solution has been inherited from someone else, and unfortunately there is little I can do right now to change the entire architecture.
Any suggestion how to do it the right way? Is it even possible?
Some RDBMS products support what you're describing. It's called partial or filtered indexes by different products.
PostgreSQL has partial indexes
Microsoft SQL Server has filtered indexes
SQLite has partial indexes
MySQL does not implement this idea (they are under no obligation to implement it, since it's a nonstandard feature). There has been a request for this as a new feature: https://bugs.mysql.com/bug.php?id=76631
One workaround you can do in MySQL 5.7 to simulate a partial index is to create a virtual column where the value is NULL unless the action is 'ip'. Then index that virtual column:
ALTER TABLE logs
ADD COLUMN ip_info VARCHAR(15)
AS (CASE `action` WHEN 'ip' THEN LEFT(info, 15) END),
ADD KEY (ip_info);
Strictly speaking, that still indexes every row, but at least it doesn't store any of your values in the index except where the action is 'ip'.
P.S.: I haven't tested the above example, so apologies if there are syntax errors.
This seems to fall under the "EAV" category. You have a bunch of things (ip, postdel, etc), each of which is optional. Some of them need indexing, some do not.
My recommendation is to put the key-value pairs in a JSON string. And make a special column for anything that you do want to index (IP, in your case). It can be NULLable in order to minimize (but not totally eliminate) the 'wasted' space.
See also my blog on EAV.
See also MySQL's and MariaDB's implementations involving JSON. Caution: they require relatively new versions of MySQL or MariaDB.
You're filtering on the action column anyway, so a combined index is the solution here. Create an index over both columns: (action, info(15)).
The order of the columns in the index is important, though; do not reverse it.
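For example (the 15-character prefix mirrors what the question already tried):

ALTER TABLE logs ADD INDEX idx_action_info (`action`, info(15));

-- a lookup like this can then use the index:
SELECT * FROM logs WHERE `action` = 'ip' AND info = '203.0.113.7';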

MySQL self join performance: fact or just bad indexing?

As an example: I have a database to detect visitors (bots, etc.), and since not every visitor has the same set of 'credentials' I made a 'dynamic' table, like so: see fiddle: http://sqlfiddle.com/#!9/ca4c8/1 (simplified version).
This returns me the profile ID that I use to gather info about each profile (in another DB). Depending on the profile type, I query the table with a different name clause (name='something') (e.g.: hostname, ipAddr, userAgent, HumanId, etc.).
I'm not an expert in SQL but I'm familiar with indexes, constraints, primary, unique, foreign key etc. And from what I saw from these search results:
Mysql Self-Join Performance
How to tune self-join table in mysql like this?
Optimize MySQL self join query
JOIN Performance Issue MySQL
MySQL JOIN performance issue
Most of them have comments about bad performance of self-joins, but the answers tend to blame missing indexes.
So the final question is: is self joining a table makes it more prone to bad performance assuming that everything is indexed properly?
On a side note, here is more information about the table; it might be irrelevant to the question, but it gives context for my particular situation:
The flag column is used to mark records for deletion, as the user I use from PHP doesn't have DELETE permission on this database. Sorry, security is more important than performance.
I added the 'type' that will go with info I get from the user agent (i.e.: if anything is, or at least seems to be, a bot, we will only search for type 5000).
Column 'name' is unfortunately a varchar indexed in the primary key (with profile and type).
I tried to use as many INTs and as much filtering (WHERE) in the SELECT query as possible to reduce eventual loss of performance (if that even matters).
I'm willing to study and tweak the thing if needed, unless someone with a strong background in MySQL tells me it's really not a good thing to do.
This is a big project I have in development, so I cannot test it with millions of records for now, but I wonder if performance will be an issue as this grows. Any input, links, references, documentation or test procedure (maybe in comments) will be appreciated.
A self-join is no different than joining two different tables. The optimizer will pick one 'table', usually based on the WHERE, then do a Nested Loop Join into the other. In your case, you have implied, via LEFT, that it should work only one way. (The optimizer will ignore that if it sees no need for it.)
Your keys are fine for that Fiddle.
The real problem is "Entity-Attribute-Value", which is a messy way to lay out data in tables. Your query seems to be saying "find a (LIMIT 1) profile (entity) that has a certain pair of attributes (name = Googlebot AND addr = ...)".
It would be so much easier, and faster, to have two columns (name and addr) and a "composite" INDEX(name, addr).
I recommend doing that for the common "attributes", then put the rest into a single column with a JSON string. See here.
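A hypothetical sketch of that reshaping, assuming MySQL 5.7+ for the JSON column; table and column sizes are illustrative:

CREATE TABLE profile_attrs (
  profile_id INT NOT NULL,
  name VARCHAR(64) NOT NULL,
  addr VARCHAR(45) NOT NULL,
  other_attrs JSON,             -- the rarely-queried attributes
  PRIMARY KEY (profile_id),
  INDEX idx_name_addr (name, addr)
);

-- the "find a profile with this attribute pair" lookup becomes one indexed probe:
SELECT profile_id
FROM profile_attrs
WHERE name = 'Googlebot' AND addr = '66.249.66.1'
LIMIT 1;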

MySQL - Working with Encrypted Columns

I have some tables with encrypted fields. After looking through the MySQL docs, I found that you can't create a custom datatype for encryption / decryption, which would be ideal. So, instead, I have a view similar to the one below:
CREATE VIEW EMPLOYEE AS
SELECT ID, FIRST_NAME, LAST_NAME, SUPER_SECURE_DECRYPT(SSN) AS SSN
FROM EMPLOYEE_ENCRYPTED
Again, after reading through the MySQL documentation, I've learned that the view isn't insertable because it has a derived column and the SSN field isn't updatable since it's a derived column, which makes sense. However, you can't add a trigger to a view so writing to the view is not really an option.
In an attempt to get around this, I've created a couple of triggers similar to:
CREATE TRIGGER EMPLOYEE_ENCRYPTED_UPDATE
BEFORE UPDATE ON EMPLOYEE_ENCRYPTED FOR EACH ROW
BEGIN
    IF NEW.SSN <> OLD.SSN THEN
        SET NEW.SSN = SUPER_SECURE_ENCRYPT(NEW.SSN);
    END IF;
END;
as well as one for inserting (which, since it's so similar, I'm not going to post it). This simply means I have to read from the view and write to the table.
This is a decent solution except when the WHERE clause of the UPDATE statement queries the encrypted column (as in, updating an employee by their SSN). Typically this isn't an issue, since I normally use the primary key for updates, but I need to know whether there's a way to do this for other encrypted fields.
I want to make MySQL do the heavy lifting for encryption and decryption so that it can be as frictionless as possible to work with as a developer. I would like the application developer to not have to worry about encrypted fields as much as possible while still using encrypted fields, that's the ultimate goal here. Any help or advice is appreciated.
It is difficult to answer your question without knowing the type of encryption you are using. If it's a standard algorithm such as MD5, you can use it directly in MySQL with a WHERE ssn=MD5('ssnStr') type of clause, though MD5 is a hash and isn't meant for decryption. Otherwise, if it's some sort of customized encryption, you'll have to either
1) create a procedure that performs the encryption/decryption and use that in your WHERE clause
or
2) perform the encryption beforehand and use its result to match the condition you desire in your WHERE clause or wherever in your query.
It may be best to supply your query with the encrypted value for the SSN and use that to match to your field. If you have to perform some sort of decryption for each row in your DB, this won't be efficient at all. In other words, supply your query with input that directly matches the data stored for best performance.
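A sketch of that, reusing the placeholder SUPER_SECURE_ENCRYPT from the question and assuming the encryption is deterministic (equal plaintexts yield equal ciphertexts):

-- encrypt the search value once, then match on the stored ciphertext;
-- an index on SSN stays usable because no per-row decryption happens
UPDATE EMPLOYEE_ENCRYPTED
SET LAST_NAME = 'Smith'
WHERE SSN = SUPER_SECURE_ENCRYPT('123-45-6789');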

What is the best method/options for expiring records within a database?

In a lot of the databases I work on these days I can't just delete a record, for any number of reasons: for instance, so it can still be displayed later (say, a product that no longer exists), or just to keep a history of what was.
So my question is how best to expire the record.
I have often added a date_expired column, which is a DATETIME field. Generally I query either WHERE date_expired = 0, or WHERE date_expired = 0 OR date_expired > NOW(), depending on whether the data may expire in the future. Similarly, I have also added a field called expired_flag; when this is set to true/1, the record is considered expired. This is probably the easiest method, although you need to remember to include the expire clause any time you only want the current items.
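For instance, mirroring the zero-datetime convention above (products is an assumed table name):

SELECT *
FROM products
WHERE date_expired = 0        -- sentinel: never expires
   OR date_expired > NOW();   -- expires at some future time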
Another method I have seen is moving the record to an archive table, but this can get quite messy when there are a large number of tables that require history tables. It also makes the retrieval of the value (say country) more difficult as you have to first do a left join (for example) and then do a second query to find the actual value (or redo the query with a modified left join).
Another option, which I haven't seen done nor have I fully attempted myself is to have a table that contains either all of the data from all of the expired records or some form of it--some kind of history table. In this case, retrieval would be even more difficult as you would need to search possibly a massive table and then parse the data.
Are there other solutions or modifications of these that are better?
I am using MySQL (with PHP), so I don't know if other databases have better methods to deal with this issue.
I prefer the date_expired field method. However, sometimes it is useful to have two dates: an initial date as well as the date expired. If data can expire, it is often useful to know when it was active, and that means also knowing when it started existing.
I like the expired_flag option over the date_expired option, if query speed is important to you.
I think adding the date_expired column is the easiest and least invasive method. As long as your INSERTS and SELECTS use explicit column lists (they should be if they're not) then there is no impact to your existing CRUD operations. Add an index on the date_expired column and developers can add it as a property to any classes or logic that depend on the data in the existing table. All in all the best value for the effort. I agree that the other methods (i.e. archive tables) are troublesome at best, by comparison.
I usually don't like database triggers, since they can lead to strange "behind the scenes" behavior, but putting a trigger on delete to insert the about-to-be-deleted data into a history table might be an option.
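A minimal sketch of such a trigger, with products and products_history as assumed table names (the history table mirrors the live table's columns):

CREATE TRIGGER products_bd
BEFORE DELETE ON products
FOR EACH ROW
  INSERT INTO products_history (id, name, price, deleted_at)
  VALUES (OLD.id, OLD.name, OLD.price, NOW());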
In my experience, we usually just use an "Active" bit, or a "DateExpired" datetime like you mentioned. That works pretty well, and is really easy to deal with and query.
There's a related post here that offers a few other options. Maybe the CDC option?
SQL Server history table - populate through SP or Trigger?
May I also suggest adding a "Status" column that matches an enumerated type in the code you're using. Put an index on the column and you'll be able to very easily and efficiently narrow down your returned data via your WHERE clauses.
Some possible enumerated values to use, depending on your needs:
Active
Deleted
Suspended
InUse (Sort of a pseudo-locking mechanism)
Set the column up as a TINYINT (that's the SQL Server type; MySQL has TINYINT too). You can also set up a matching lookup table with the key/value pairs and a foreign key constraint between the tables if you wish.
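A sketch of the lookup-table variant in MySQL terms (statuses and products are assumed names):

CREATE TABLE statuses (
  status_id TINYINT UNSIGNED PRIMARY KEY,
  name VARCHAR(20) NOT NULL UNIQUE
);

INSERT INTO statuses (status_id, name) VALUES
  (1, 'Active'), (2, 'Deleted'), (3, 'Suspended'), (4, 'InUse');

ALTER TABLE products
  ADD COLUMN status_id TINYINT UNSIGNED NOT NULL DEFAULT 1,
  ADD INDEX idx_status (status_id),
  ADD CONSTRAINT fk_products_status
    FOREIGN KEY (status_id) REFERENCES statuses (status_id);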
I've always used the ValidFrom, ValidTo approach, where each table has these two additional fields. If ValidTo IS NULL or ValidTo > NOW(), you know you have a valid record. This way you can also add data to the table before it goes live.
There are some fields that my tables usually have: creation_date, last_modification, last_modifier (fk to user), is_active (boolean or number, depending on the database).
Look at the "Slowly Changing Dimension" SCD algorithms. There are several choices from the Data Warehousing world that apply here.
None is "best" -- each responds to different requirements.
Here's a tidy summary.
Type 1: The new record replaces the original record. No trace of the old record exists.
(Type 4 is a variation on this that moves the history to another table.)
Type 2: A new record is added to the dimension table. To distinguish versions, a "valid date range" pair of columns is required. It helps to have a "this record is current" flag.
Type 3: The original record is modified to reflect the change.
In this case, there are columns for one or more previous values of the columns likely to change. This has an obvious limitation because it's bound to a specific number of columns; however, it is often used in conjunction with the other types.
You can read more about this if you search for "Slowly Changing Dimension".
http://en.wikipedia.org/wiki/Slowly_Changing_Dimension
A very nice approach Oracle takes to this problem is partitions. I don't think MySQL has something similar, though.