A rather complicated auto increment - mysql

This is a step more complicated than a one-to-many relationship. I have a bunch of tables (photos, posts, users, etc.) that can be commented on. My comments table contains 3 fields that help identify each comment:
item-id - the id of the item the comment belongs to
table - the table that item-id refers to (stored as an integer, but shown as a name below to avoid confusion)
id - the id of the comment, relative to the item-id
A sample for better understanding:
id item-id table
1 1 photos
2 1 photos
1 1 posts
2 1 posts
1 2 posts
1 1 users
Now the problem is with inserts. I find it hard to determine the current last id. Given the table above if a user is to comment on a photo with item-id = 1, then the new comment needs to have an id of 3. The only way I could think of is to run a sub-query on insert but I'm not a big fan of sub-queries. Is there some mechanism built in mysql that can help me achieve this, or any other easy and robust way?

From your comment:
I've come up with this because of the fear of unique ids running out. I know that the maximum integer value MySQL can store is 1×10^19 or something, which is a ridiculously large number, but not infinite. Also, don't numbers that huge take up more space?
MySQL's signed INT type can go up to 2^31 - 1. An unsigned INT can go up to 2^32 - 1, which is 4,294,967,295.
You're right this is not infinite, but 4.2 billion is pretty high and easily able to handle most needs.
You can also use a signed or unsigned BIGINT, which at 8 bytes is twice the size of an INT; but if you need values larger than an INT can hold, then you have no choice but to store them that way.
Unsigned BIGINT goes up to 2^64 - 1, or 18,446,744,073,709,551,615. You're really, really, really unlikely to exhaust these values in your lifetime, even if you re-load your entire database multiple times per hour.
Re your comment.
Yes, most data types are fixed-size, meaning they use the same number of bytes on every row, regardless of the value you store in it on any given row. The reason for this is that you could change the value later, and if MySQL had to find more space to grow a small numeric value into a large numeric value, it would lead to other kinds of performance problems.
See http://dev.mysql.com/doc/refman/5.6/en/storage-requirements.html for more info on the number of bytes MySQL uses for each data type.
The exception is some string data types (VARCHAR, VARBINARY, TEXT, BLOB), which use a variable amount of space per row depending on the lengths of the strings you actually store.
But there are no numeric or date/time data types in MySQL that vary in size.
Another comment: you should ask yourself how much time & effort you're spending on optimizing this, and whether it would be more economical to just get a bigger disk. It's true the extra 4 bytes per row per integer adds up if you have a large database, but you'd need to store billions of rows before it really matters.

One thing you should consider: why is this important to you? The purpose of an ID is to be a unique identifier. Sure, it can represent order in that it's monotonically increasing, but is there any reason it specifically has to go from 1 to 2 to 3 for each (item-id, table) pair? Would it be that harmful if it was instead 1, 6, 20?
If you're using PHP you'll still receive that data in the same order, and in PHP it'll be very easy to know which is 1, 2 and 3.

MyISAM allows you to do this easily:
For MyISAM and BDB tables you can specify AUTO_INCREMENT on a
secondary column in a multiple-column index.
However, it's limited to two columns, so you still need to normalize this to remove one of the columns.
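For illustration, a minimal sketch of that MyISAM approach might look like the following (the table and column names are hypothetical; the AUTO_INCREMENT column is the second column of a composite primary key and restarts at 1 for each distinct group_id):
CREATE TABLE comments_myisam (
group_id INT UNSIGNED NOT NULL,
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
body TEXT,
PRIMARY KEY (group_id, id)
) ENGINE=MyISAM;
-- Each insert receives the next id within its own group_id:
INSERT INTO comments_myisam (group_id, body) VALUES (1, 'first'), (1, 'second'), (2, 'first');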
Otherwise, you can insert the next users (item 1) row like this:
INSERT INTO table1 (id, `item-id`, `table`)
SELECT MAX(id) + 1, 1, 'users' FROM table1 WHERE `item-id` = 1 AND `table` = 'users'
To extend it a little, the IFNULL part allows you to use the same clause for inserting the first row.
INSERT INTO table1 (id, `item-id`, `table`)
SELECT IFNULL(MAX(id), 0) + 1, 2, 'users' FROM table1 WHERE `item-id` = 2 AND `table` = 'users'
In this case, you would probably have a multi-column primary key, consisting of all three columns.
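A rough sketch of such a table definition, assuming a hypothetical comments table with the `table` reference stored as an integer as described in the question:
CREATE TABLE comments (
id INT UNSIGNED NOT NULL,
`item-id` INT UNSIGNED NOT NULL,
`table` TINYINT UNSIGNED NOT NULL,
body TEXT,
PRIMARY KEY (`item-id`, `table`, id)
) ENGINE=InnoDB;
The composite primary key also guarantees the (item-id, table, id) combination stays unique, so if two sessions happen to compute the same MAX(id) + 1 at once, the second INSERT fails with a duplicate-key error rather than silently creating a duplicate.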

MySql Indexing Strategy With Multiple Shared Columns

We have a database table which stores browser data for visitors, broken down by multiple different subtypes. For simplicity, let's use the table schema below. The querying will basically be on any single id column, the metric column, the timestamp column (stored as seconds since epoch), and one of the device, browser, or os columns.
We are going to performance test the star vs snowflake schema (where all of the ids go into a single column, but then an additional column id_type is added to determine which type of identifier it is) for this table, but as long as the star schema (which is how it is now) is within 80% of the snowflake performance, we are going to keep it since it will make our load process much easier. Before I do that however, I want to make sure the indexes are optimized on the star schema.
create table browser_data (
id_1 int,
id_2 int,
id_3 int,
id_4 int,
metric varchar(20),
browser varchar(20),
device varchar(20),
os varchar(20),
timestamp bigint
)
Would it be better to create individual indexes on just the id columns, or also include the metric and timestamp columns in those indexes as well?
Do not normalize "continuous" values, such as DATETIME, FLOAT, INT. Do leave the values in the main table.
When you move the value to other table(s), especially with a "snowflake", it makes querying based on the values somewhere between a little slower and a lot slower. This especially happens when you need to filter on more than one metric that is not in the main table. Either of these performs very poorly because of "snowflake" or "over-normalization":
WHERE a.x = 123 AND b.y = 345
ORDER BY a.x, b.y
As for what indexes to create -- that depends entirely on the queries you need to perform. So, I strongly recommend you sketch out the likely SELECTs based on your tentative CREATE TABLEs.
INT is 4 bytes. TIMESTAMP is 5, FLOAT is 4, etc. That is, normalizing such things is also inefficient on space.
More
When doing JOINs, the Optimizer will almost always start with one table, then move on to another table, etc. (See "Nested Loop Join".)
For example (building on the above 'code'), when 2 columns are normalized, and you are testing on the values, you do not have two ids in hand, you only have the two values. This makes the query execution very inefficient. For
SELECT ...
FROM main
JOIN a USING(a_id)
JOIN b USING(b_id)
WHERE a.x = 123 AND b.y = 345
The following is very likely to be the 'execution plan':
1. Reach into a to find the row(s) with x=123; get the id(s) for those rows. This may include many rows that have yet to be filtered by b.y. Table a needs INDEX(x).
2. Go back to the main table, looking up rows with those id(s); main needs INDEX(a_id). Again, more rows than necessary may be hauled around.
3. Only now do you get to b (using b_id) to check for y=345 and toss the unnecessary rows you have been hauling around. Table b needs INDEX(b_id).
Note my comment about "haul around". Blindly using * (in SELECT *) adds to the problem -- all the columns are being hauled around while performing the steps.
On the other hand... If x and y were in the main table, then the code works like:
WHERE main.x = 123
AND main.y = 345
only needs INDEX(x,y) (in either order). And it quickly locates exactly the rows desired.
In the case of ORDER BY a.x, b.y, it cannot use any index on any table. So the query must create a tmp table, sort it, then deliver the rows in the desired order.
But if x and y are in the same table, then INDEX(x,y) (in that order) may be useful for ORDER BY x,y and avoid the tmp table and the sort.
With a single table, the Optimizer might use an index for WHERE, or it might use an index for ORDER BY, depending on the phase of the moon. In some cases, one index can be used for both -- this is optimal.
Another note: if you also have something like LIMIT 10 and the sort is avoided, then only 10 rows need to be looked at, not the entire set from the WHERE.
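To make the single-table advice concrete, here is a hedged sketch against the question's browser_data table. The index name and the example filter values (42, 'page_views', the epoch timestamp) are made up; the point is that one composite index covers equality on id_1 and metric followed by a range on timestamp.
ALTER TABLE browser_data
ADD INDEX idx_id1_metric_ts (id_1, metric, timestamp);
-- The WHERE clause below can then be satisfied from the index:
SELECT browser, device, os
FROM browser_data
WHERE id_1 = 42
AND metric = 'page_views'
AND timestamp >= 1609459200;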

Best solution for saving boolean values and saving cpu and memory on searches

What is the best way to store boolean values in the database if you want better query performance and minimal memory cost on SELECT statements?
For example:
I have a table with 36 fields, 30 of which hold boolean values (zero or one), and I need to search for records where certain boolean fields are true.
SELECT * FROM `myTable`
WHERE
`field_5th` = 1
AND `field_12th` = 1
AND `field_20` = 1
AND `field_8` = 1
Is there any solution?
If you want to store boolean values or flags there are basically three options:
Individual columns
This is reflected in your example above. The advantage is that you will be able to put indexes on the flags you intend to use most often for lookups. The disadvantage is that this will take up more space (since the minimum column size that can be allocated is 1 byte.)
However, if your column names are really going to be field_20, field_21, etc., then this is absolutely NOT the way to go. Numbered columns are a sign you should use one of the other two methods.
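That said, a minimal sketch of the individual-column option could look like this (the table and flag names are hypothetical, chosen to be descriptive rather than numbered; BOOL is just a synonym for TINYINT(1) in MySQL):
CREATE TABLE articles (
id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
is_published BOOL NOT NULL DEFAULT 0,
is_featured BOOL NOT NULL DEFAULT 0,
INDEX idx_published_featured (is_published, is_featured)
);
-- Lookups on the indexed flags:
SELECT id FROM articles WHERE is_published = 1 AND is_featured = 1;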
Bitmasks
As was suggested above you can store multiple values in a single integer column. A BIGINT column would give you up to 64 possible flags.
Values would be something like:
UPDATE table SET flags = flags | b'100';
UPDATE table SET flags = flags | b'10000';
After both updates the field would look something like: 10100
That would represent having two flag values set. To query for any particular flag value set, you would do
SELECT flags FROM table WHERE flags & b'100';
The advantage of this is that your flags are very compact space-wise. The disadvantage is that you can't place indexes on the field which would help improve the performance of searching for specific flags.
One-to-many relationship
This is where you create another table, and each row there would have the id of the row it's linked to, and the flag:
CREATE TABLE main (
main_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
);
CREATE TABLE flag (
main_id INT UNSIGNED NOT NULL,
name VARCHAR(16)
);
Then you would insert multiple rows into the flag table.
The advantage is that you can use indexes for lookups, and you can have any number of flags per row without changing your schema. This works best for sparse values, where most rows do not have a value set. If every row needs all flags defined, then this isn't very efficient.
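A hypothetical usage sketch for the flag table above (the flag names 'featured' and 'archived' are made up): find the main rows that have both flags set, and index the flag table so those lookups stay cheap.
SELECT m.main_id
FROM main m
JOIN flag f1 ON f1.main_id = m.main_id AND f1.name = 'featured'
JOIN flag f2 ON f2.main_id = m.main_id AND f2.name = 'archived';
-- An index such as this one lets each join probe by flag name directly:
ALTER TABLE flag ADD INDEX idx_name_main (name, main_id);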
For a performance comparison, you can read a blog post I wrote on the topic:
Set Performance Compare
Also, when you ask which is "best", that's a very subjective question. Best at what? It all really depends on what your data looks like, what your requirements are, and how you want to query it.
Keep in mind that if you want to do a query like:
SELECT * FROM table WHERE some_flag=true
Indexes will only help you if few rows have that value set. If most of the rows in the table have some_flag=true, then MySQL will ignore the index and do a full table scan instead.
How many rows of data are you querying over? You can store the boolean values in an integer and use bit operations to test for them. It's not indexable, but the storage is very well packed. With TINYINT fields and indexes, MySQL would pick one index to use and scan from there.

Primary Key Index Automatic

I'm currently doing a project using MySQL and am a complete beginner with it.
I made a table with the following columns:
ID // an integer column which is the primary key
Date // a DATE column
Day // a string column
Now I just want to know whether there is any method by which the ID column's value is automatically generated on insert.
For example, if I insert Date = 4/10/1992 and Day = WED, the MySQL server should automatically generate an integer value starting from 1, checking whether each candidate value already exists.
i.e. in a table containing the values
ID Date Day
1 01/02/1987 Sun
3 04/08/1990 Sun
If I insert the Date and Day values specified in the example into the above table, the row should be inserted as
2 04/10/1992 WED
I tried methods like using auto increment, but I'm afraid it only ever increments the ID value rather than filling in gaps.
There's a way to do this, but it's going to affect performance. Go ahead and keep auto_increment on the column, just for the first insert, or for when you want to insert more quickly.
Even with auto_increment on a column, you can specify the value, so long as it doesn't collide with an existing value.
To get the next value or first gap:
SELECT a.ID + 1 AS NextID FROM tbl a
LEFT JOIN tbl b ON b.ID = a.ID + 1
WHERE b.ID IS NULL
ORDER BY a.ID
LIMIT 1
If you get an empty set, just use 1, or let auto_increment do its thing.
For concurrency's sake, you will need to lock the table to keep other sessions from using the next ID which you just found.
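Putting that together, a hedged sketch of the lock-then-insert pattern (table and column names follow the question; the '1992-10-04' / 'Wed' values come from the example). Note that LOCK TABLES requires a separate lock for each alias used in the query.
LOCK TABLES tbl WRITE, tbl AS a READ, tbl AS b READ;
-- Insert into the first gap, or after the current maximum if there is no gap.
-- If the table is empty, this inserts nothing; insert ID 1 manually in that case.
INSERT INTO tbl (ID, `Date`, `Day`)
SELECT a.ID + 1, '1992-10-04', 'Wed'
FROM tbl a
LEFT JOIN tbl b ON b.ID = a.ID + 1
WHERE b.ID IS NULL
ORDER BY a.ID
LIMIT 1;
UNLOCK TABLES;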
Well, I understood your problem: you want to generate the entries in a way that keeps their values under control.
I've got a solution which is quite hacky; you may use it if you feel like it.
Create your table with your primary key in auto increment mode using an unsigned INT (as everyone suggested here).
Now consider two situations:
If your table needs to be cleared every year or after some other fixed duration (if such a situation exists),
perform an ALTER TABLE operation to disable auto increment mode and delete all your contents,
and then enable it again.
If what you are doing is some sort of data warehousing, so the database is kept for years,
then include an SQL query to find the largest primary key value before you insert; if it is more than 2^33, create a new table with the same structure, and maintain a separate table to track the number of tables of this type.
The trick is a bit complicated and, I'm afraid, there is no simple way to do exactly what you expected.
You really don't need to cover the gaps created by deleting values from integer primary key columns. They were especially designed to ignore those gaps.
The auto increment mechanism could have been designed to take into consideration either the gaps at the top (after you delete some rows with the biggest id values) or all gaps. But it wasn't, because it was designed not to save space but to save time and to ensure that different transactions don't accidentally generate the same id.
In fact, PostgreSQL implements its SEQUENCE type / SERIAL column (their equivalent to MySQL's auto_increment) in such a way that if a transaction asks the sequence to increment a few times but ends up not using those ids, they never get used. That is also designed to avoid the possibility of transactions ever accidentally generating and using the same id.
You can't even save space: when you decide your table is going to use SMALLINT, that's a fixed-length 2-byte integer, and it doesn't matter whether the values are all 0 or maxed out. If you use a normal INTEGER, that's a fixed-length 4-byte integer.
If you use an UNSIGNED BIGINT, that's an 8-byte integer, which means it uses 8*8 bits = 64 bits. With an 8-byte integer you can count up to about 2^64; even if your application runs continuously for years and years, it shouldn't reach a 20-digit number like 18,446,744,073,709,551,615 (and if it does, what the hell are you counting, the molecules in the known universe?).
But, assuming you really are concerned that the ids might run out in a couple of years, perhaps you should be using UUIDs instead of integers.
Wikipedia states that "Only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%".
UUIDs can be stored as BINARY(16) if you convert them into raw binary, as CHAR(32) if you strip the dashes or as CHAR(36) if you leave the dashes.
Out of the 16 bytes = 128 bits of data, a version-4 UUID uses 122 random bits plus 6 version/variant bits, while version-1 UUIDs are constructed from information about when and where they were created. Either way, it is safe to create billions of UUIDs on different computers, and the likelihood of a collision is overwhelmingly minuscule (as opposed to generating auto-incremented integers on different machines).
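If you go the UUID route, here is a small sketch of the BINARY(16) storage mentioned above. It assumes MySQL 8.0's UUID_TO_BIN()/BIN_TO_UUID() functions (on older versions, UNHEX(REPLACE(UUID(), '-', '')) works instead), and the table name is made up.
CREATE TABLE events (
id BINARY(16) NOT NULL PRIMARY KEY,
created_at DATETIME NOT NULL
);
INSERT INTO events (id, created_at) VALUES (UUID_TO_BIN(UUID()), NOW());
-- Convert back to the dashed text form when reading:
SELECT BIN_TO_UUID(id) AS id, created_at FROM events;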

Voting issue - storing votes as serialized data instead of multiple rows

My question is: what is the best way to store a user's vote, including the IP address and the answer id?
Should each vote be a new MySQL row, or should there be just one field in the answer row holding all the votes as serialized data?
And if serialized, what column type should be used to store this data?
It is almost always a bad idea to store multiple values in one column, as it becomes difficult to parse out the values you need and can make a column index unusable following the string operations necessary to extract a part.
Make a normalized table which stores one row per answer, per user. If an individual answer is itself a single data point, it belongs as its own row.
If you are tracking users by IP:
CREATE TABLE votes (
voteid INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
userip VARCHAR(15) NOT NULL,
answerid INT NOT NULL
);
Plus, this gives you the benefit of being able to query your data in ways like:
/* Get vote count per user */
SELECT userip, COUNT(*) FROM votes GROUP BY userip;
/* Get users who have voted 3 or more times */
SELECT userip FROM votes GROUP BY userip HAVING COUNT(*) >= 3;
To accomplish the same thing with a serialized column, you would need to query it into application code, parse out the delimiters, and then perform your analysis. To implement the second (count >= 3) in application code requires re-implementing lots of the things the database is already very good at, like sorting, grouping, and counting.
You should create each distinct item as its own column.
This makes it much less complex to access individual columns. The queries will be less complex for you to write, in addition to generally being more efficient for MySQL.
It also allows you to specify different type and length requirements for each column. For example, you might have one Integer column, and one Character column, each with different storage requirements and implications for the type.
I recommend storing the IP address in an INT UNSIGNED column, using the INET_ATON() and INET_NTOA() functions.
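A sketch of that integer-IP variant of the votes table (IPv4 only, since INET_ATON()/INET_NTOA() do not handle IPv6; the sample address and answer id are made up):
CREATE TABLE votes (
voteid INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
userip INT UNSIGNED NOT NULL,
answerid INT NOT NULL
);
INSERT INTO votes (userip, answerid) VALUES (INET_ATON('203.0.113.7'), 42);
-- Convert back to dotted-quad form when reporting:
SELECT INET_NTOA(userip) AS userip, COUNT(*) AS votes_cast
FROM votes
GROUP BY userip;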

Can I optimize my database by splitting one big table into many small ones?

Assume that I have one big table with three columns: "user_name", "user_property", "value_of_property". Let's also assume that I have a lot of users (say 100,000) and a lot of properties (say 10,000). Then the table is going to be huge (1 billion rows).
When I extract information from the table I always need information about a particular user. So I use, for example, WHERE user_name = 'Albert Gates'. Every time, the MySQL server needs to scan 1 billion rows to find those that contain "Albert Gates" as user_name.
Would it not be wise to split the big table into many small ones, one per user?
No, I don't think that is a good idea. A better approach is to add an index on the user_name column - and perhaps another index on (user_name, user_property) for looking up a single property. Then the database does not need to scan all the rows - it just needs to find the appropriate entry in the index, which is stored in a B-Tree, making it easy to find a record in a very small amount of time.
If your application is still slow even after correct indexing, it can sometimes be a good idea to partition your largest tables.
One other thing you could consider is normalizing your database so that the user_name is stored in a separate table and an integer foreign key is used in its place. This can reduce storage requirements and can increase performance. The same may apply to user_property.
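As a hedged sketch of that indexing advice, assuming the flat table from the question is called user_properties (the 'age' value is just an example property): a single composite index works for both query shapes, because its leftmost column also covers lookups on user_name alone.
ALTER TABLE user_properties
ADD INDEX idx_user_property (user_name, user_property);
-- Both of these can now use the index instead of scanning a billion rows:
SELECT user_property, value_of_property
FROM user_properties
WHERE user_name = 'Albert Gates';
SELECT value_of_property
FROM user_properties
WHERE user_name = 'Albert Gates' AND user_property = 'age';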
You should normalise your design as follows:
drop table if exists users;
create table users
(
user_id int unsigned not null auto_increment primary key,
username varbinary(32) unique not null
)
engine=innodb;
drop table if exists properties;
create table properties
(
property_id smallint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;
drop table if exists user_property_values;
create table user_property_values
(
user_id int unsigned not null,
property_id smallint unsigned not null,
value varchar(255) not null,
primary key (user_id, property_id),
key (property_id)
)
engine=innodb;
insert into users (username) values ('f00'),('bar'),('alpha'),('beta');
insert into properties (name) values ('age'),('gender');
insert into user_property_values values
(1,1,'30'),(1,2,'Male'),
(2,1,'24'),(2,2,'Female'),
(3,1,'18'),
(4,1,'26'),(4,2,'Male');
From a performance perspective the innodb clustered index works wonders in this similar example (COLD run):
select count(*) from product
count(*)
========
1,000,000 (1M)
select count(*) from category
count(*)
========
250,000 (250K)
select count(*) from product_category
count(*)
========
125,431,192 (125M)
select
c.*,
p.*
from
product_category pc
inner join category c on pc.cat_id = c.cat_id
inner join product p on pc.prod_id = p.prod_id
where
pc.cat_id = 1001;
0:00:00.030: Query OK (0.03 secs)
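For context, a hypothetical sketch of the link-table layout that makes the query above fast: a composite primary key clusters all rows for a given cat_id together, so WHERE pc.cat_id = 1001 reads one contiguous range of the clustered index. The exact table definition isn't shown above, so treat this as an assumption about its shape.
create table product_category
(
cat_id int unsigned not null,
prod_id int unsigned not null,
primary key (cat_id, prod_id),
key (prod_id)
)
engine=innodb;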
Properly indexing your database will be the number one way of improving performance. I once had a query take half an hour (on a large dataset, but nonetheless). Then we came to find out that the tables had no indexes. Once indexed, the query took less than 10 seconds.
Why do you need to have this table structure? My fundamental problem is that you are going to have to cast the data in value_of_property every time you want to use it. That is bad in my opinion - also, storing numbers as text is crazy given that it's all binary anyway. For instance, how are you going to have required fields? Or fields that need constraints based on other fields, e.g. start and end dates?
Why not simply have the properties as fields rather than some many to many relationship?
Have one flat table. When your business rules begin to show that properties should be grouped, you can consider moving them out into other tables and having several 1:0-1 relationships with the users table. But this is not normalization, and it will degrade performance slightly due to the extra join (however, the self-documenting nature of the table names will greatly aid any developers).
One way I regularly see database performance get totally castrated is by having a generic
Id, Property Type, Property Name, Property Value table.
This is really lazy but exceptionally flexible, and it totally kills performance. In fact, on a new job where performance is bad, I actually ask if they have a table with this structure - it invariably becomes the center point of the database and is slow. The whole point of relational database design is that the relations are determined ahead of time. This is simply a technique that aims to speed up development at a huge cost to application speed. It also puts a huge reliance on business logic in the application layer behaving itself, which is not defensive at all. Eventually you find that you want to use properties in a key relationship, which leads to all kinds of casting on the join and further degrades performance.
If data has a 1:1 relationship with an entity, then it should be a field on the same table. If your table gets to more than 30 fields wide, then consider moving them into another table, but don't call it normalisation, because it isn't. It is a technique to help developers group fields together, at the cost of performance, in an attempt to aid understanding.
I don't know if MySQL has an equivalent, but SQL Server 2008 has sparse columns - null values take no space.
Sparse column datatypes
I'm not saying an EAV approach is always wrong, but I think using a relational database for this approach is probably not the best choice.