Indexing on BigInt column in MySQL

I have a table with a BIGINT column used for storing a timestamp. The timestamp value we get from our application is a 13-digit number like 1280505757693, and there are already many rows in this table, probably more than half a million. Should I put an index on the timestamp column or not? Any suggestions?

Are the numbers contiguous, or do they contain encoded information in some form? By this I mean: is 1280505757693 + 1 one tick beyond 1280505757693? If so, you can create an index and it will be useful both for equality matches and for range matches; otherwise, only for equality matches.
If you are keeping timestamps in your database, you may wish to consider MySQL's TIMESTAMP and DATETIME types; see http://dev.mysql.com/doc/refman/5.1/en/datetime.html. You can certainly create indexes on those.
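For instance, a minimal sketch, assuming the table is named events and the BIGINT column is named ts (both names are illustrative):
CREATE TABLE events (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  ts BIGINT UNSIGNED NOT NULL,   -- millisecond epoch value such as 1280505757693
  INDEX idx_ts (ts)              -- BTree index, usable for = and for range predicates
);
-- Range query over one hour; the optimizer can use idx_ts because the BIGINT
-- values sort in the same order as the instants they represent.
SELECT * FROM events
WHERE ts >= 1280505757693
  AND ts <  1280505757693 + 60 * 60 * 1000;
With around half a million rows, the extra write cost of maintaining such an index is usually modest compared to the benefit on reads.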

Related

Performant way to store data containing 2 fixed chars and 4 digits in MySQL?

We have to store a "file ID" in a multi-million-row table. The format is the Brazilian state abbreviation (e.g. PA for Pará, BA for Bahia, SP for São Paulo, RJ for Rio de Janeiro, and so on) followed by a "scope" built from a two-digit year and month (e.g. 19 for 2019 plus 08 for August), resulting in a value like 'PA1908'.
As said before, the table has multiple millions of rows, and every month we have to compare its data with an external data source. If the external source is more up to date than our table, we must replace the entire set of STATE-YEAR-MONTH records, so the file ID exists only to be a parameter in the query's WHERE clause used to select the rows to delete.
In the first modelling version, I split the file ID into two columns: fileid_state as CHAR(2) with a hash index, and fileid_scope as SMALLINT. But I'm not sure this is the only way to achieve acceptable performance; maybe a single column named file_id of type CHAR(6) with a hash index would perform just as well. Any suggestions on which of the two methods is better, or on another way to store the file ID so that rows can be selected for deletion as fast as possible?
Keep in mind that it's rather hard for me to benchmark the methods, because we have almost 1 billion rows on limited hardware.
Q1: Datatype: First ask yourself what will be done with the string:
Do you ever need to look at just the 'state' part? The 'year' part? The 'month' part? If you answer "yes" to any of those, then you should probably store the parts in 2 or 3 columns: state CHAR(2) CHARACTER SET ascii, plus TINYINT UNSIGNED or SMALLINT UNSIGNED for the numeric part(s).
If not, then simply use CHAR(6) CHARACTER SET ascii. If needed, this can be indexed, either by itself or together with other column(s) in a "composite" index. Please provide the UPDATE and SELECT statements that may need this index; we will critique them.
There is no "hash" indexing in InnoDB, only BTree.
"select rows for deleting as fast as possible" -- What percentage of the table will be deleted? If, for example, you will DELETE FROM tbl WHERE sym = 'PA1908', and it is only a small part of the table, then INDEX(sym) works optimally.
I say "ascii" so that you avoid the space/processing needed for utf8, etc.
Q2: "is most update then our table, we must replace entire STATE-YEAR-MONTH records" -- Please elaborate on what happens here.

How to reduce table size for date data type?

I am saving data in a table defined as follows.
create table A
(
RecordDate date not null, -- 3 bytes
SomeNumber int not null -- 4 bytes
)
For each calendar day, about 20 thousand records are created in table A, all with the same RecordDate but different SomeNumber values. That makes RecordDate redundant across many records, so I feel there is some room to reduce the table size. Is there a way I can cut table A's size without losing the date information? Thanks.
As per my comment, I don't think it is worth the effort in your case, even with 20,000 transactions per day. I suspect it's not worthwhile even for millions of transactions per day. A DATE field uses 3 bytes, which is already one of the smallest datatypes in your database.
However, I think it's an interesting question.
Bansi's suggestion (a date table with min and max order numbers) is a very space-efficient solution, but without any table keys you could (and eventually would) break your data, and it would be very hard to fix or even to notice. You would also massively increase your query time, since every record would need two comparisons to find dates within a range.
My solution would be a dimensional-modelling-style date table, with a date key and the date. Your transaction table would store the date key for you to join on. For this to be of any benefit the date key would have to be a SMALLINT (2 bytes), so you would save 1 byte per row, minus a relatively insignificant amount for the date table itself.
Note that using SMALLINT restricts you to 65,535 different days, or roughly 180 years if the days are consecutive.
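A sketch of that dimensional approach, assuming a dimension table named dim_date and a date_key column replacing RecordDate in table A (names are illustrative):
CREATE TABLE dim_date (
  date_key  SMALLINT UNSIGNED NOT NULL PRIMARY KEY,  -- 2 bytes, up to 65,535 days
  full_date DATE NOT NULL
);
CREATE TABLE A (
  date_key   SMALLINT UNSIGNED NOT NULL,             -- replaces the 3-byte RecordDate
  SomeNumber INT NOT NULL
);
-- Query by calendar date via a join back to the dimension table.
SELECT a.SomeNumber
FROM A a
JOIN dim_date d ON d.date_key = a.date_key
WHERE d.full_date = '2020-01-15';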

Use timestamp (or datetime) as part of primary key (or part of clustered index)

I use following query frequently:
SELECT * FROM table WHERE Timestamp > [SomeTime] AND Timestamp < [SomeOtherTime] and publish = 1 and type = 2 order by Timestamp
I would like to optimize this query, and I am thinking about making the timestamp part of the primary key for the clustered index. My thinking is that if the timestamp is part of the primary key, rows inserted into the table are written to disk sequentially by the timestamp field. I also think this would improve my query a lot, but I am not sure whether it would really help.
The table has 3-4 million+ rows.
The timestamp field never changes.
I use MySQL 5.6.11.
Another point: if this does improve my query, is it better to use TIMESTAMP (4 bytes in MySQL 5.6) or DATETIME (5 bytes in MySQL 5.6)?
Four million rows isn't huge.
A one-byte difference between the data types datetime and timestamp is the last thing you should consider in choosing between those two data types. Review their specs.
Making a timestamp part of your primary key is a bad, bad idea. Think about reviewing what primary key means in a SQL database.
Put an index on your timestamp column. Get an execution plan, and paste that into your question. Determine your median query performance, and paste that into your question, too.
Returning a single day's rows from an indexed, 4 million row table on my desktop computer takes 2ms. (It returns around 8000 rows.)
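For instance, a sketch only (the table name and the bound values are placeholders):
ALTER TABLE `table` ADD INDEX idx_timestamp (`Timestamp`);
EXPLAIN
SELECT * FROM `table`
WHERE `Timestamp` > '2015-01-01 00:00:00'
  AND `Timestamp` < '2015-01-02 00:00:00'
  AND publish = 1
  AND type = 2
ORDER BY `Timestamp`;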
1) If the values of timestamp are unique, you can make it the primary key. If not, create an index on the timestamp column anyway, since you use it frequently in the WHERE clause.
2) Using a BETWEEN clause looks more natural here. I suggest you use a TREE index (the default index type), not HASH.
3) When the timestamp column is indexed, the ORDER BY costs almost nothing: the rows already come back in index order
(provided, of course, the index is TREE, not HASH).
4) An integer unix_timestamp is better than DATETIME on both the memory-usage side and the performance side: comparing dates is a more complex operation than comparing integers.
Searching an indexed field takes O(log rows) tree lookups. Comparing integers is O(1), while comparing date strings is O(len), where len is the length of the date string. So the difference is (number of tree lookups) × (comparison cost): O(len) · O(log rows) for dates versus O(1) · O(log rows) for integers, i.e. a factor of O(len).
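Putting points 1, 2 and 4 together, a sketch might look like this, assuming the column is stored as an integer Unix timestamp named ts and already has a TREE index on it (names and bounds are illustrative):
SELECT * FROM `table`
WHERE ts BETWEEN UNIX_TIMESTAMP('2015-01-01 00:00:00')
             AND UNIX_TIMESTAMP('2015-01-02 00:00:00')
  AND publish = 1
  AND type = 2
ORDER BY ts;  -- cheap: the index already returns rows in ts order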

Best solution for saving boolean values and saving CPU and memory on searches

What is the best way to store boolean values in a database if you want the best query performance and minimal memory overhead on SELECT statements?
For example:
I have a table with 36 fields, 30 of which hold boolean values (zero or one), and I need to search for records where certain boolean fields are true.
SELECT * FROM `myTable`
WHERE
`field_5th` = 1
AND `field_12th` = 1
AND `field_20` = 1
AND `field_8` = 1
Is there any solution?
If you want to store boolean values or flags there are basically three options:
Individual columns
This is reflected in your example above. The advantage is that you will be able to put indexes on the flags you intend to use most often for lookups. The disadvantage is that this will take up more space (since the minimum column size that can be allocated is 1 byte.)
However, if your column names are really going to be field_20, field_21, etc., then this is absolutely NOT the way to go. Numbered columns are a sign that you should use one of the other two methods.
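A sketch of this option with descriptive names instead of numbered ones (all names below are invented for illustration):
CREATE TABLE myTable (
  id           INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  is_published TINYINT(1) NOT NULL DEFAULT 0,
  is_featured  TINYINT(1) NOT NULL DEFAULT 0,
  is_archived  TINYINT(1) NOT NULL DEFAULT 0,
  INDEX idx_published (is_published)   -- index only the flags you filter on most
);
SELECT * FROM myTable
WHERE is_published = 1 AND is_featured = 1;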
Bitmasks
As was suggested above you can store multiple values in a single integer column. A BIGINT column would give you up to 64 possible flags.
Values would be something like:
UPDATE table SET flags = flags | b'100';    -- set the third bit
UPDATE table SET flags = flags | b'10000';  -- set the fifth bit
Then the field would look something like: 10100
That would represent having two flag values set. To query for rows with a particular flag set, you would do
SELECT flags FROM table WHERE flags & b'100';
The advantage of this is that your flags are very compact space-wise. The disadvantage is that you can't place indexes on the field which would help improve the performance of searching for specific flags.
One-to-many relationship
This is where you create another table, and each row there would have the id of the row it's linked to, and the flag:
CREATE TABLE main (
main_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
);
CREATE TABLE flag (
main_id INT UNSIGNED NOT NULL,
name VARCHAR(16)
);
Then you would insert multiple rows into the flag table.
The advantage is that you can use indexes for lookups, and you can have any number of flags per row without changing your schema. This works best for sparse values, where most rows do not have a value set. If every row needs all flags defined, then this isn't very efficient.
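For example, a sketch of how the flag table might be used; the secondary index and the flag names below are assumptions added for illustration, not part of the schema above:
-- an index to support lookups by flag name
ALTER TABLE flag ADD INDEX idx_flag (name, main_id);
-- give row 42 two flags
INSERT INTO flag (main_id, name) VALUES (42, 'published'), (42, 'featured');
-- find all main rows that have a particular flag set
SELECT m.*
FROM main m
JOIN flag f ON f.main_id = m.main_id
WHERE f.name = 'published';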
For a performance comparison you can read a blog post I wrote on the topic:
Set Performance Compare
Also, when you ask which is "best", that's a very subjective question. Best at what? It all really depends on what your data looks like, what your requirements are, and how you want to query it.
Keep in mind that if you want to do a query like:
SELECT * FROM table WHERE some_flag=true
Indexes will only help you if few rows have that value set. If most of the rows in the table have some_flag=true, then MySQL will ignore the index and do a full table scan instead.
How many rows of data are you querying over? You can store the boolean values in a single integer and use bit operations to test them. That isn't indexable, but the storage is very compact. With separate TINYINT fields and indexes, MySQL would pick one index to use and scan from there.

Primary Key Index Automatic

I'm currently doing a project using MySQL and am a complete beginner with it.
I made a table with the following columns:
ID // an integer column, which is the primary key
Date // a DATE column
Day // a string column
Now I just want to know whether there is any method by which the value inserted into the ID column is generated automatically.
For example, if I insert Date = 4/10/1992 and Day = WED, the MySQL server should automatically generate an integer value, starting from 1 and checking which values already exist.
That is, in a table containing the values
ID  Date        Day
1   01/02/1987  Sun
3   04/08/1990  Sun
if I insert the Date and Day values specified in the example into the above table, the row should be inserted as
2   04/10/1992  WED
I tried using an auto-increment column, but I'm afraid it only ever increments the ID value and never fills in the gaps.
There's a way to do this, but it's going to affect performance. Go ahead and keep auto_increment on the column, just for the first insert, or for when you want to insert more quickly.
Even with auto_increment on a column, you can specify the value, so long as it doesn't collide with an existing value.
To get the next value or first gap:
SELECT a.ID + 1 AS NextID FROM tbl a
LEFT JOIN tbl b ON b.ID = a.ID + 1
WHERE b.ID IS NULL
ORDER BY a.ID
LIMIT 1
If you get an empty set, just use 1, or let auto_increment do its thing.
For concurrency's sake, you will need to lock the table to keep other sessions from using the next ID you just found.
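A sketch of the whole sequence, using the table and column names from the query above (the explicit value 2 is just the gap found for the example data in the question):
-- lock the table plus the aliases used in the gap-finding query
LOCK TABLES tbl WRITE, tbl AS a READ, tbl AS b READ;
SELECT a.ID + 1 AS NextID FROM tbl a
LEFT JOIN tbl b ON b.ID = a.ID + 1
WHERE b.ID IS NULL
ORDER BY a.ID
LIMIT 1;
-- insert using the ID returned above, or omit ID to fall back to auto_increment
-- when no gap was found
INSERT INTO tbl (ID, `Date`, `Day`) VALUES (2, '1992-10-04', 'WED');
UNLOCK TABLES;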
Well, I understood your problem: you want to generate the IDs in a way that keeps them under control.
I've got a solution which is quite whacky; you may accept it if you feel like it.
Create your table with your primary key as an unsigned integer in auto-increment mode (as everyone here suggested).
Now consider two situations:
If your table needs to be cleared every year, or within some other fixed duration (if such a situation exists):
perform an ALTER TABLE to disable auto-increment mode and delete all the contents,
and then enable it again.
If what you are doing is some sort of data warehousing, so the database accumulates over the years:
then run a SQL query with the predefined key functions to find the smallest primary key value before you insert, and if it is more than 2^33, create a new table with the same structure; you should also maintain a separate table to track the number of tables of this type.
The trick is a bit complicated, and I'm afraid there is no simple way to do exactly what you expected.
You really don't need to fill the gaps created by deleting values from integer primary key columns. They were specifically designed to ignore those gaps.
The auto-increment mechanism could have been designed to take into account either the gaps at the top (after you delete the rows with the biggest id values) or all gaps. But it wasn't, because it was designed not to save space but to save time, and to ensure that different transactions don't accidentally generate the same id.
In fact PostgreSQL implements its SEQUENCE data type / SERIAL columns (their equivalent of MySQL's auto_increment) in such a way that if a transaction asks the sequence to increment a few times but ends up not using those ids, they never get used. That, too, is designed to avoid transactions ever accidentally generating and using the same id.
You can't even save space, because when you decide your table is going to use SMALLINT, that is a fixed-length 2-byte integer; it doesn't matter whether the values are all 0 or maxed out. A normal INTEGER is a fixed-length 4-byte integer.
An UNSIGNED BIGINT is an 8-byte integer, which means it uses 8 × 8 = 64 bits. With an 8-byte integer you can count up to 2^64 - 1; even if your application runs continuously for years and years, it shouldn't reach a 20-digit number like 18446744070000000000 (and if it does, what on earth are you counting, the molecules in the known universe?).
But, assuming you really are concerned that the ids might run out in a couple of years, perhaps you should be using UUIDs instead of integers.
Wikipedia states that "Only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%".
UUIDs can be stored as BINARY(16) if you convert them into raw binary, as CHAR(32) if you strip the dashes or as CHAR(36) if you leave the dashes.
Out of the 16 bytes = 128 bits of data, a random (version 4) UUID uses 122 random bits and 6 version/variant bits, while other versions are constructed from information about when and where they were created. Either way it is safe to create billions of UUIDs on different computers, and the likelihood of a collision is overwhelmingly minuscule (as opposed to generating auto-incremented integers on different machines).
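A sketch of the BINARY(16) option (the table name is illustrative; UUID() is MySQL's built-in generator):
CREATE TABLE item (
  id BINARY(16) NOT NULL PRIMARY KEY
);
-- strip the dashes and turn the 32 hex characters into 16 raw bytes
INSERT INTO item (id) VALUES (UNHEX(REPLACE(UUID(), '-', '')));
-- reconstruct the dashed text form when reading
SELECT LOWER(CONCAT_WS('-',
         HEX(SUBSTR(id,  1, 4)),
         HEX(SUBSTR(id,  5, 2)),
         HEX(SUBSTR(id,  7, 2)),
         HEX(SUBSTR(id,  9, 2)),
         HEX(SUBSTR(id, 11, 6)))) AS uuid_text
FROM item;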