I have a MySQL DB with 2 tables:
sample_name (stores the name of a file; there can be multiple names for the same sample_hash);
sample_hash (stores the hashes of a file; will not store duplicate md5 values);
(all tables have an id int unsigned NOT NULL auto_increment)
My first option for relating these two tables is to create an md5 column in both tables and join on it. However, this seems to have a downside: I would be duplicating a varchar(32), which can be a waste of space with millions of records.
My second option is to calculate the file hashes first, grab the mysql_insert_id() from the sample_hash table, and use it when inserting into the sample_name table. This makes sense when the hash is new to the sample_hash table, since then I have the mysql_insert_id() value at my disposal.
But if the hash already exists in sample_hash, I don't want to store it again, so I will have no mysql_insert_id().
Is there an alternative, other than searching for the id of a given md5, for storing it in the sample_name table when the md5 already exists? If so, how can I do that?
From the requirements that you describe, there is no need for the sample_hash table at all.
You can keep the hashes in the sample_name table and do all your lookups of hash values in that table.
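A minimal sketch of that single-table layout (the name column size is illustrative, and the lookup value is a placeholder):
CREATE TABLE sample_name (
id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(255) NOT NULL,
md5 VARCHAR(32) NOT NULL,
KEY idx_md5 (md5)
);
-- all names recorded for a given hash:
SELECT name FROM sample_name WHERE md5 = '...';
The index on md5 keeps lookups cheap, and duplicate hash values across rows cost you only the repeated varchar(32), not a join.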
My normalized DB has a table with a 1-N relationship for attributes related to the user. One of these attributes stores a yes/no binary status. This attribute is updated relatively frequently (TRUE to FALSE and back to TRUE) but is also frequently retrieved.
Table
user_id (FK user table) | value_id (regular int) | yes_no (bool)
A user has multiple value_id rows, but these are always retrieved as an entire set, i.e. SELECT * FROM table WHERE user_id=ID
I'm thinking a bulk SELECT like that would benefit from having fewer rows to return if all the FALSE rows were deleted from the table instead of being updated.
However, I understand that updating a single field would also be less taxing than repeated INSERT/DELETE cycles.
Thoughts appreciated!
Delete/insert will require MySQL to maintain the indexes and update statistics (you have FKs == indexes).
If you have many records in the table and change data often, this will be expensive.
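To make the comparison concrete (a sketch, using a hypothetical table name user_values and sample ids):
-- flipping the flag in place: one indexed row change
UPDATE user_values SET yes_no = FALSE WHERE user_id = 42 AND value_id = 7;
-- versus the delete/insert churn each toggle would cause:
DELETE FROM user_values WHERE user_id = 42 AND value_id = 7;
INSERT INTO user_values (user_id, value_id, yes_no) VALUES (42, 7, TRUE);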
I'm in the process of migrating a Ruby on Rails application from MySQL to Postgres. Is there a recommended way to preserve the record IDs from MySQL, including the gaps left by deleted records?
In testing, a dump-and-restore didn't seem to preserve those gaps.
Also, in the event that I manage to keep the records where they are, what will happen with the unused IDs in Postgres? Will they be skipped over or reused?
Example
Say I have a user with an ID of 101 and I've deleted users up to 100. I need 101 to stay at 101.
So you don't want the IDs of existing records to be reassigned when you migrate.
That should be the default in any sane migration. When you copy the data rows over - say, exporting from MySQL with SELECT ... INTO OUTFILE and importing into PostgreSQL with COPY tablename FROM 'filename.csv' WITH (FORMAT CSV) - the IDs won't change.
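Something along these lines (a sketch; the file path is a placeholder, and COPY reads the file on the PostgreSQL server, so it needs the appropriate permissions):
-- in MySQL:
SELECT * FROM users
INTO OUTFILE '/tmp/users.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';
-- in PostgreSQL:
COPY users FROM '/tmp/users.csv' WITH (FORMAT CSV);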
All you'll need to do is to set the next ID to be generated in the sequence on the PostgreSQL table afterwards. So, say you have the table:
CREATE TABLE users
(
id serial primary key,
name text not null,
...
);
and you've just copied a user with id = 101 into it.
You'll now just assign a new value to the key generation sequence for the table, e.g.:
SELECT setval('users_id_seq', (SELECT max(id) FROM users));
To learn more about sequences and key generation in PostgreSQL, see SERIAL in the numeric types documentation, the documentation for CREATE SEQUENCE, the docs for setval, etc. The default name for a key generation sequence is tablename_columnname_seq.
Does MySQL automatically store the records of a table sorted on disk when using a primary key or an auto-increment field? If that is the case, then what about the records of a table that are inserted at different points in time? Confused!
What is the best way to store boolean values in a database if you want the best query performance and the least wasted space on SELECT statements?
For example:
I have a table with 36 fields, 30 of which hold boolean values (zero or one), and I need to search for records where particular boolean fields are true.
SELECT * FROM `myTable`
WHERE
`field_5th` = 1
AND `field_12th` = 1
AND `field_20` = 1
AND `field_8` = 1
Is there any solution?
If you want to store boolean values or flags, there are basically three options:
Individual columns
This is reflected in your example above. The advantage is that you will be able to put indexes on the flags you intend to use most often for lookups. The disadvantage is that this takes up more space (since the minimum column size that can be allocated is 1 byte).
However, if your column names are really going to be field_20, field_21, etc., then this is absolutely NOT the way to go. Numbered columns are a sign you should use one of the other two methods.
Bitmasks
As was suggested above, you can store multiple values in a single integer column. A BIGINT column would give you up to 64 possible flags.
Setting flags would look something like:
UPDATE `table` SET flags = flags | b'100';
UPDATE `table` SET flags = flags | b'10000';
(note the bitwise OR, so that setting one flag doesn't overwrite the others)
Then the field would look something like: 10100
That would represent having two flag values set. To query for rows with a particular flag set, you would do:
SELECT flags FROM `table` WHERE flags & b'100';
The advantage of this is that your flags are very compact space-wise. The disadvantage is that you can't put an index on an individual flag within the field, so searching for specific flags can't use an index.
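One thing the examples above don't show: to clear a flag, you mask it out rather than overwrite the whole column, e.g.:
UPDATE `table` SET flags = flags & ~b'100';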
One-to-many relationship
This is where you create another table, and each row there would have the id of the row it's linked to, and the flag:
CREATE TABLE main (
main_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
);
CREATE TABLE flag (
main_id INT UNSIGNED NOT NULL,
name VARCHAR(16) NOT NULL,
KEY (main_id), -- fetch all flags for a row
KEY (name) -- find all rows with a given flag
);
Then you would insert multiple rows into the flag table.
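For example (the flag names here are just illustrative):
INSERT INTO flag (main_id, name) VALUES (1, 'active'), (1, 'verified');
SELECT main_id FROM flag WHERE name = 'verified';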
The advantage is that you can use indexes for lookups, and you can have any number of flags per row without changing your schema. This works best for sparse values, where most rows do not have a value set. If every row needs all flags defined, then this isn't very efficient.
For a performance comparison, you can read a blog post I wrote on the topic:
Set Performance Compare
Also, when you ask which is "best", that's a very subjective question. Best at what? It all depends on what your data looks like, what your requirements are, and how you want to query it.
Keep in mind that if you want to do a query like:
SELECT * FROM `table` WHERE some_flag = true
Indexes will only help you if few rows have that value set. If most of the rows in the table have some_flag = true, then MySQL will ignore the index and do a full table scan instead.
How many rows of data are you querying over? You can store the boolean values in a single integer column and use bit operations to test for them. It's not indexable, but the storage is very compact. With indexed TINYINT fields, MySQL would pick one index to use and scan from there.
I have one table with some usual columns such as id, name, email, etc. I'm also inserting a variable number of records in each transaction. To be more efficient, I need one unique id - let's call it transaction_id - that is the same for each group of records inserted in one transaction, and it should increment.
I thought of using
select max(transaction_id) from users
and incrementing that value on the server side, but that seems like an old-fashioned solution.
You could have another table, usergroups, with an auto-incrementing primary key. You first insert a record there (maybe including some other useful information about the group), then get the group's unique id generated by that insert using mysql_insert_id(), and use it as the groupid for your inserts into the first table.
This way you're still using MySQL's auto-numbering, which guarantees a unique groupid. Doing select max(transaction_id) from users and incrementing it isn't safe, since it's non-atomic (another thread may have read the same max(transaction_id) before you've had a chance to increment it, and will start inserting records with a conflicting groupid).
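A minimal sketch of that approach (the usergroups table and the sample values are illustrative):
CREATE TABLE usergroups (
groupid INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
);
INSERT INTO usergroups () VALUES ();
SET @groupid = LAST_INSERT_ID(); -- the same value mysql_insert_id() returns in PHP
INSERT INTO users (groupid, name, email) VALUES (@groupid, 'Alice', 'alice@example.com');
INSERT INTO users (groupid, name, email) VALUES (@groupid, 'Bob', 'bob@example.com');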
Add a new table with an auto_increment column
You can create a new table with an auto_increment column, which lets you generate unique integers in a thread-safe way. It'll work like this:
DB::insert_into_transaction_table()
transaction_id = DB::mysql_last_insert_id() ## this is an integer value
for each record:
    DB::insert_into_table(transaction_id, ...other parameters...)
And you don't need MySQL transactions for this.
Generate a unique string on the server side before inserting
You can generate a unique id (for example a GUID) on the server side and use it when inserting all the records. But your transaction_id field should be long enough to store values generated this way (some char(...) type). It'll work like this:
transaction_id = new_GUID() ## this is usually a string value
for each record:
    DB::insert_into_table(transaction_id, ...other parameters...)
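If you'd rather generate the value in MySQL itself, its built-in UUID() function can play the role of new_GUID() (a sketch, assuming a CHAR(36) transaction_id column; the sample values are illustrative):
SET @transaction_id = UUID();
INSERT INTO users (transaction_id, name, email) VALUES (@transaction_id, 'Alice', 'alice@example.com');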