Hidden Features of MySQL

I've been working with Microsoft SQL Server for many years now, but have only recently started to use MySQL with my web applications, and I'm hungry for knowledge.
To continue with the long line of "hidden feature" questions, I would like to know any hidden or handy features of MySQL which will hopefully improve my knowledge of this open source database.

Since you put up a bounty, I'll share my hard won secrets...
In general, all the SQL I tuned today required sub-queries. Coming from the Oracle database world, things I took for granted don't work the same way in MySQL, and my reading on MySQL tuning leads me to conclude that MySQL is behind Oracle in terms of optimizing queries.
While the simple queries required by most B2C applications work well in MySQL, most of the aggregate reporting queries needed for intelligence reporting seem to require a fair bit of planning and reorganizing of the SQL to guide MySQL into executing them faster.
Administration:
max_connections is the maximum number of concurrent connections. The default value is 100 connections (151 since 5.0), which is very small.
Note:
connections take memory and your OS might not be able to handle a lot of connections.
MySQL binaries for Linux/x86 allow you to have up to 4096 concurrent connections, but self compiled binaries often have less of a limit.
Set table_cache to match the number of your open tables and concurrent connections. Watch the open_tables value and if it is growing quickly you will need to increase its size.
Note:
The two previous parameters may require a lot of open files. 20 + max_connections + table_cache*2 is a good estimate of what you need. MySQL on Linux has an open_files_limit option; set this limit.
If you have complex queries, sort_buffer_size and tmp_table_size are likely to be very important. Values will depend on query complexity and available resources, but 4MB and 32MB, respectively, are recommended starting points.
Note: these are "per connection" values (along with read_buffer_size, read_rnd_buffer_size and some others), meaning that this amount of memory may be needed for each connection. So, consider your load and available resources when setting these parameters. sort_buffer_size, for example, is allocated only if MySQL needs to do a sort. Be careful not to run out of memory.
If you have many connections being established (i.e. a web site without persistent connections), you might improve performance by setting thread_cache_size to a non-zero value. 16 is a good value to start with. Increase the value until Threads_created stops growing quickly.
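A quick way to check how these caches are holding up on a running server (a sketch; variable names are as in the 5.0/5.1 era, and table_cache was later renamed table_open_cache):
SHOW GLOBAL STATUS LIKE 'Opened_tables';    -- if this grows quickly, table_cache is too small
SHOW GLOBAL STATUS LIKE 'Threads_created';  -- if this grows quickly, increase thread_cache_size
SHOW GLOBAL VARIABLES LIKE 'table_cache';
SHOW GLOBAL VARIABLES LIKE 'open_files_limit';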
PRIMARY KEY:
There can be only one AUTO_INCREMENT column per table, it must be indexed, and it cannot have a DEFAULT value
KEY is normally a synonym for INDEX. The key attribute PRIMARY KEY can also be specified as just KEY when given in a column definition. This was implemented for compatibility with other database systems.
A PRIMARY KEY is a unique index where all key columns must be defined as NOT NULL
If a PRIMARY KEY or UNIQUE index consists of only one column that has an integer type,
you can also refer to the column as "_rowid" in SELECT statements.
In MySQL, the name of a PRIMARY KEY is PRIMARY
Currently, only InnoDB (v5.1?) tables support foreign keys.
Usually, you create all the indexes you need when you are creating tables.
Any column declared as PRIMARY KEY, KEY, UNIQUE, or INDEX will be indexed.
NULL means "not having a value". To test for NULL, you cannot use the arithmetic comparison operators such as =, <, or <>. Use the IS NULL and IS NOT NULL operators instead:
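For example (mytable and mycolumn are placeholder names):
SELECT * FROM mytable WHERE mycolumn IS NULL;
SELECT * FROM mytable WHERE mycolumn IS NOT NULL;
-- note: WHERE mycolumn = NULL never matches any row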
NO_AUTO_VALUE_ON_ZERO suppresses auto increment for 0 so that only NULL generates the next sequence number. This mode can be useful if 0 has been stored in a table's AUTO_INCREMENT column. (Storing 0 is not a recommended practice, by the way.)
To change the value of the AUTO_INCREMENT counter to be used for new rows:
ALTER TABLE mytable AUTO_INCREMENT = value;
or
SET INSERT_ID = value;
Unless otherwise specified, the counter begins at 1. You can also specify the starting value in the table definition:
...) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1
TIMESTAMPS:
Values for TIMESTAMP columns are converted from the current time zone to UTC for storage,
and from UTC to the current time zone for retrieval.
http://dev.mysql.com/doc/refman/5.1/en/timestamp.html
For one TIMESTAMP column in a table, you can assign the current timestamp as the default value and the auto-update value.
One thing to watch out for when using one of these types in a WHERE clause: it is best to write
WHERE datecolumn = FROM_UNIXTIME(1057941242)
and not
WHERE UNIX_TIMESTAMP(datecolumn) = 1057941242.
doing the latter won't take advantage of an index on that column.
http://dev.mysql.com/doc/refman/5.1/en/date-and-time-functions.html
UNIX_TIMESTAMP()
FROM_UNIXTIME()
UTC_DATE()
UTC_TIME()
UTC_TIMESTAMP()
If you convert a datetime to a Unix timestamp in MySQL, then add 24 hours to it, and then convert it back to a datetime, it magically loses an hour!
Here's what's happening. When converting the unix timestamp back to a datetime the timezone is taken into consideration and it just so happens that between the 28th and the 29th of October 2006 we went off daylight savings time and lost an hour.
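A sketch of that round trip using the dates from the example above (the result assumes a session time zone that left daylight saving time that night):
SELECT FROM_UNIXTIME(UNIX_TIMESTAMP('2006-10-28 13:30:00') + 86400);
-- returns 2006-10-29 12:30:00 rather than 13:30:00 in such a time zone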
Beginning with MySQL 4.1.3, the CURRENT_TIMESTAMP(), CURRENT_TIME(), CURRENT_DATE(), and FROM_UNIXTIME() functions return values in the connection's current time zone, which is available as the value of the time_zone system variable. In addition, UNIX_TIMESTAMP() assumes that its argument is a datetime value in the current time zone.
The current time zone setting does not affect values displayed by functions such as UTC_TIMESTAMP() or values in DATE, TIME, or DATETIME columns.
NOTE: ON UPDATE only updates the timestamp if a field is changed. If an UPDATE results in no fields being changed, then the timestamp is NOT updated!
Additionally, the first TIMESTAMP column is always auto-updated by default, even if not specified.
When working with dates, I almost always convert to Julian day numbers, because date math is then a simple matter of adding or subtracting integers, and to seconds since midnight for the same reason. It is rare that I need time resolution finer than seconds.
Both of these can be stored as 4-byte integers, and if space is really tight they can be combined into Unix time (seconds since the epoch, 1/1/1970) as an unsigned integer, which will be good until around 2106:
' secs in 24Hrs = 86400
' Signed Integer max val = 2,147,483,647 - can hold 68 years of Seconds
' Unsigned Integer max val = 4,294,967,295 - can hold 136 years of Seconds
Binary Protocol:
MySQL 4.1 introduced a binary protocol that allows non-string data values to be sent
and returned in native format without conversion to and from string format. (Very useful.)
Aside, mysql_real_query() is faster than mysql_query() because it does not call strlen()
to operate on the statement string.
http://dev.mysql.com/tech-resources/articles/4.1/prepared-statements.html
The binary protocol supports server-side prepared statements and allows transmission of data values in native format. The binary protocol underwent quite a bit of revision during the earlier releases of MySQL 4.1.
You can use the IS_NUM() macro to test whether a field has a numeric type.
Pass the type value to IS_NUM() and it evaluates to TRUE if the field is numeric.
One thing to note is that binary data CAN be sent inside a regular query if you escape it; remember that MySQL requires only the backslash and the quote character to be escaped.
So that is a really easy way to INSERT shorter binary strings, like encrypted/salted passwords for example.
Master Server:
http://www.experts-exchange.com/Database/MySQL/Q_22967482.html
http://www.databasejournal.com/features/mysql/article.php/10897_3355201_2
GRANT REPLICATION SLAVE ON *.* TO slave_user IDENTIFIED BY 'slave_password';
#Master Binary Logging Config
#STATEMENT causes replication to be statement-based (the default)
log-bin=Mike
binlog-format=STATEMENT
server-id=1
max_binlog_size = 10M
expire_logs_days = 120
#Slave Config
master-host=master-hostname
master-user=slave-user
master-password=slave-password
server-id=2
Binary log file must-reads:
http://dev.mysql.com/doc/refman/5.0/en/binary-log.html
http://www.mydigitallife.info/2007/10/06/how-to-read-mysql-binary-log-files-binlog-with-mysqlbinlog/
http://dev.mysql.com/doc/refman/5.1/en/mysqlbinlog.html
http://dev.mysql.com/doc/refman/5.1/en/binary-log-setting.html
You can delete all binary log files with the RESET MASTER statement, or a subset of them with PURGE MASTER LOGS.
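For example (the log file name and date here are hypothetical):
PURGE MASTER LOGS TO 'mysql-bin.000030';
PURGE MASTER LOGS BEFORE '2008-04-02 22:46:26';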
mysqlbinlog --result-file=binlog.txt TrustedFriend-bin.000030
Normalization:
http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html
UDF functions
http://www.koders.com/cpp/fid10666379322B54AD41AEB0E4100D87C8CDDF1D8C.aspx
http://souptonuts.sourceforge.net/readme_mysql.htm
DataTypes:
http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html
http://www.informit.com/articles/article.aspx?p=1238838&seqNum=2
http://bitfilm.net/2008/03/24/saving-bytes-efficient-data-storage-mysql-part-1/
One thing to note is that in a table that mixes CHAR and VARCHAR columns, MySQL will silently change the CHARs to VARCHARs.
RecNum integer_type UNSIGNED NOT NULL AUTO_INCREMENT, PRIMARY KEY (RecNum)
MySQL always represents dates with the year first, in accordance with the standard SQL and ISO 8601 specifications
Misc:
Turning off some MySQL functionality will result in smaller data files
and faster access. For example:
--datadir will specify the data directory and
--skip-innodb will turn off the InnoDB engine and save you 10-20 MB
More here
http://dev.mysql.com/tech-resources/articles/mysql-c-api.html
Download Chapter 7 - Free
InnoDB is transactional but there is a performance overhead that comes with it. I have found MyISAM tables to be sufficient for 90% of my projects.
Non-transaction-safe tables (MyISAM) have several advantages of their own, all of which occur because:
there is no transaction overhead:
Much faster
Lower disk space requirements
Less memory required to perform updates
Each MyISAM table is stored on disk in three files. The files have names that begin with the table name and have an extension to indicate the file type. An .frm file stores the table format. The data file has an .MYD (MYData) extension. The index file has an .MYI (MYIndex) extension.
These files can be copied to a storage location intact without using MySQL Administrator's backup feature, which is time consuming (so is the restore).
The trick is to make a copy of these files and then DROP the table. When you put the files back,
MySQL will recognize them and update the table tracking.
If you must backup/restore:
Restoring a backup, or importing from an existing dump file, can take a long time depending on the number of indexes and primary keys on each table. You can speed this process up dramatically by modifying your original dump file to surround it with the following:
SET AUTOCOMMIT = 0;
SET FOREIGN_KEY_CHECKS=0;
.. your dump file ..
SET FOREIGN_KEY_CHECKS = 1;
COMMIT;
SET AUTOCOMMIT = 1;
To vastly increase the speed of the reload, add the SQL command SET AUTOCOMMIT = 0; at the beginning of the dump file, and add the COMMIT; command to the end.
By default, autocommit is on, meaning that each and every insert command in
the dump file will be treated as a separate transaction and written to disk before the next one is started. If you don't add these commands, reloading a large database into InnoDB can take many hours...
The maximum size of a row in a MySQL table is 65,535 bytes
The effective maximum length of a VARCHAR in MySQL 5.0.3 and on = maximum row size (65,535 bytes)
VARCHAR values are not padded when they are stored. Trailing spaces are retained when
values are stored and retrieved, in conformance with standard SQL.
CHAR and VARCHAR values in MySQL are compared without regard to trailing spaces.
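A quick way to see this (with the default, non-binary collation):
SELECT 'abc' = 'abc ' AS equal_despite_trailing_space;   -- returns 1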
Using CHAR will only speed up your access if the whole record is fixed size. That is,
if you use any variable size object, you might as well make all of them variable size.
You gain no speed by using a CHAR in a table that also contains a VARCHAR.
The VARCHAR limit of 255 characters was raised to 65535 characters as of MySQL 5.0.3
Full-text searches are supported for MyISAM tables only.
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
BLOB columns have no character set, and sorting and comparison are based on the
numeric values of the bytes in column values
If strict SQL mode is not enabled and you assign a value to a BLOB or TEXT column that
exceeds the column's maximum length, the value is truncated to fit and a warning is generated.
Useful Commands:
check strict mode:
SELECT @@global.sql_mode;
turn off strict mode:
SET @@global.sql_mode = '';
SET @@global.sql_mode = 'MYSQL40';
or remove:
sql-mode="STRICT_TRANS_TABLES,...
SHOW COLUMNS FROM mytable
SELECT max(namecount) AS virtualcolumn FROM mytable ORDER BY virtualcolumn
http://dev.mysql.com/doc/refman/5.0/en/group-by-hidden-fields.html
http://dev.mysql.com/doc/refman/5.1/en/information-functions.html#function_last-insert-id
last_insert_id()
gets you the PK of the last row inserted in the current connection; max(pkcolname) gets you the last PK overall.
Note: if the table is empty, max(pkcolname) returns 1. mysql_insert_id() converts the return type of the native MySQL C API function mysql_insert_id() to a type of
long (named int in PHP).
If your AUTO_INCREMENT column has a column type of BIGINT, the value returned by
mysql_insert_id() will be incorrect. Instead, use the internal MySQL SQL function LAST_INSERT_ID() in an SQL query.
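A minimal sketch (table and column names are made up; mytable is assumed to have an AUTO_INCREMENT primary key):
INSERT INTO mytable (name) VALUES ('example');
SELECT LAST_INSERT_ID();   -- the id generated by this connection's last insert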
http://dev.mysql.com/doc/refman/5.0/en/information-functions.html#function_last-insert-id
Just a note that when you're trying to insert data into a table and you get the error:
Unknown column 'the first bit of data that you want to put into the table' in 'field list'
using something like
INSERT INTO table (this, that) VALUES ($this, $that)
it’s because you’ve not got any apostrophes around the values you’re trying to stick into the table. So you should change your code to:
INSERT INTO table (this, that) VALUES ('$this', '$that')
A reminder that backticks (``) are used to quote MySQL fields, databases, or tables, not values ;)
Lost connection to server during query:
http://dev.mysql.com/doc/refman/5.1/en/gone-away.html
http://dev.mysql.com/doc/refman/5.1/en/packet-too-large.html
http://dev.mysql.com/doc/refman/5.0/en/server-parameters.html
http://dev.mysql.com/doc/refman/5.1/en/show-variables.html
http://dev.mysql.com/doc/refman/5.1/en/option-files.html
http://dev.mysql.com/doc/refman/5.1/en/error-log.html
Tuning Queries
http://www.artfulsoftware.com/infotree/queries.php?&bw=1313
Well, that should be enough to earn the bounty, I would think... the fruits of many hours and many projects with a great free database. I develop application data servers on Windows platforms, mostly with MySQL. The worst mess I had to straighten out was
The ultimate MySQL legacy database nightmare
This required a series of applications to process the tables into something useful, using many of the tricks mentioned here.
If you found this astoundingly helpful, express your thanks by voting it up.
Also check out my other articles and white papers at: www.coastrd.com

One of the not-so-hidden features of MySQL is that it's not really good at being SQL compliant; well, not bugs really, but more gotchas... :-)

A command to find out what tables are currently in the cache:
mysql> SHOW open TABLES FROM test;
+----------+-------+--------+-------------+
| DATABASE | TABLE | In_use | Name_locked |
+----------+-------+--------+-------------+
| test | a | 3 | 0 |
+----------+-------+--------+-------------+
1 row IN SET (0.00 sec)
(From MySQL performance blog)

A command to find out who is doing what:
mysql> show processlist;
show processlist;
+----+-------------+-----------------+------+---------+------+----------------------------------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-------------+-----------------+------+---------+------+----------------------------------+------------------+
| 1 | root | localhost:32893 | NULL | Sleep | 0 | | NULL |
| 5 | system user | | NULL | Connect | 98 | Waiting for master to send event | NULL |
| 6 | system user | | NULL | Connect | 5018 | Reading event from the relay log | NULL |
+----+-------------+-----------------+------+---------+------+----------------------------------+------------------+
3 rows in set (0.00 sec)
And you can kill a process with:
mysql>kill 5

I particularly like MySQL's built-in support for inet_ntoa() and inet_aton(). It makes handling of IP addresses in tables very straightforward (at least so long as they're only IPv4 addresses!)
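For example (the numeric form can then be stored in an INT UNSIGNED column):
SELECT INET_ATON('10.0.5.9');    -- 167773449
SELECT INET_NTOA(167773449);     -- '10.0.5.9'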

I love on duplicate key (AKA upsert, merge) for all kinds of counters created lazily:
insert into occurances (word, `count`) values ('foo', 1), ('bar', 1)
on duplicate key update `count` = `count` + 1
You can insert many rows in one query, and immediately handle duplicate index for each of the rows.

Again - not really hidden features, but really handy:
Feature
Easily grab DDL:
SHOW CREATE TABLE CountryLanguage
output:
CountryLanguage | CREATE TABLE countrylanguage (
CountryCode char(3) NOT NULL DEFAULT '',
Language char(30) NOT NULL DEFAULT '',
IsOfficial enum('T','F') NOT NULL DEFAULT 'F',
Percentage float(4,1) NOT NULL DEFAULT '0.0',
PRIMARY KEY (CountryCode,Language)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
Feature: GROUP_CONCAT() aggregate function
Creates a concatenated string of its arguments per detail, and aggregates by concatenating those per group.
Example 1: simple
SELECT CountryCode
, GROUP_CONCAT(Language) AS List
FROM CountryLanguage
GROUP BY CountryCode
Output:
+-------------+------------------------------------+
| CountryCode | List |
+-------------+------------------------------------+
| ABW | Dutch,English,Papiamento,Spanish |
. ... . ... .
| ZWE | English,Ndebele,Nyanja,Shona |
+-------------+------------------------------------+
Example 2: multiple arguments
SELECT CountryCode
, GROUP_CONCAT(
Language
, IF(IsOfficial='T', ' (Official)', '')
) AS List
FROM CountryLanguage
GROUP BY CountryCode
Output:
+-------------+---------------------------------------------+
| CountryCode | List |
+-------------+---------------------------------------------+
| ABW | Dutch (Official),English,Papiamento,Spanish |
. ... . ... .
| ZWE | English (Official),Ndebele,Nyanja,Shona |
+-------------+---------------------------------------------+
Example 3: Using a custom separator
SELECT CountryCode
, GROUP_CONCAT(Language SEPARATOR ' and ') AS List
FROM CountryLanguage
GROUP BY CountryCode
Output:
+-------------+----------------------------------------------+
| CountryCode | List |
+-------------+----------------------------------------------+
| ABW | Dutch and English and Papiamento and Spanish |
. ... . ... .
| ZWE | English and Ndebele and Nyanja and Shona |
+-------------+----------------------------------------------+
Example 4: Controlling the order of the list elements
SELECT CountryCode
, GROUP_CONCAT(
Language
ORDER BY CASE IsOfficial WHEN 'T' THEN 1 ELSE 2 END DESC
, Language
) AS List
FROM CountryLanguage
GROUP BY CountryCode
Output:
+-------------+------------------------------------+
| CountryCode | List |
+-------------+------------------------------------+
| ABW | English,Papiamento,Spanish,Dutch |
. ... . ... .
| ZWE | Ndebele,Nyanja,Shona,English |
+-------------+------------------------------------+
Feature: COUNT(DISTINCT ) with multiple expressions
You can use multiple expressions in a COUNT(DISTINCT ...) expression to count the number of combinations.
SELECT COUNT(DISTINCT CountryCode, Language) FROM CountryLanguage
Feature / Gotcha: No need to include non-aggregated expressions in the GROUP BY list
Most RDBMS-es enforce a SQL92 compliant GROUP BY which requires all non-aggregated expressions in the SELECT list to appear in the GROUP BY. In these RDBMS-es, this statement:
SELECT Country.Code, Country.Continent, COUNT(CountryLanguage.Language)
FROM CountryLanguage
INNER JOIN Country
ON CountryLanguage.CountryCode = Country.Code
GROUP BY Country.Code
is not valid, because the SELECT list contains the non-aggregated column Country.Continent which does not appear in the GROUP BY list. In these RDBMS-es, you must either modify the GROUP BY list to read
GROUP BY Country.Code, Country.Continent
or you must add some non-sense aggregate to Country.Continent, for example
SELECT Country.Code, MAX(Country.Continent), COUNT(CountryLanguage.Language)
Now, the thing is, logically there is nothing that demands that Country.Continent should be aggregated. See, Country.Code is the primary key of the Country table. Country.Continent is also a column from the Country table and is thus by definition functionally dependent upon the primary key Country.Code. So, there must exist exactly one value of Country.Continent for each distinct Country.Code. If you realize that, then you realize that it does not make sense to aggregate it (there is just one value, right) nor to group by it (as it won't make the result more unique, since you're already grouping by the pk).
Anyway - MySQL lets you include non-aggregated columns in the SELECT list without requiring you to also add them to the GROUP BY clause.
The gotcha with this is that MySQL does not protect you in case you happen to use a non-aggregated column. So, a query like this:
SELECT Country.Code, COUNT(CountryLanguage.Language), CountryLanguage.Percentage
FROM CountryLanguage
INNER JOIN Country
ON CountryLanguage.CountryCode = Country.Code
GROUP BY Country.Code
will be executed without complaint, but the CountryLanguage.Percentage column will contain nonsense (that is to say, of all the language percentages, one of the available values will be picked at random, or at least outside your control).
See: Debunking Group By Myths

The "pager" command in the client
If you've got, say, 10,000 rows in your result and want to view them (this assumes the "less" and "tee" commands are available, which is normally the case under Linux; on Windows YMMV), do:
pager less
select lots_of_stuff FROM tbl WHERE clause_which_matches_10k_rows;
And you'll get them in the "less" file viewer so you can page through them nicely, search etc.
Also
pager tee myfile.txt
select a_few_things FROM tbl WHERE i_want_to_save_output_to_a_file;
Will conveniently write to a file.

Some things you may find interesting:
<query>\G -- \G in the CLI instead of the ; will show one column per row
explain <query>; -- this will show the execution plan for the query

Not a hidden feature, but useful nonetheless: http://mtop.sourceforge.net/

Here are some of my tips - I blogged about them (Link)
You don't need to use the '@' sign when declaring variables.
You have to use a delimiter (the default is ';') to demarcate the end of a statement - Link
If you're trying to move data between MS-SQL 2005 and MySQL, there are a few hoops to jump through - Link
Doing case-sensitive matches in MySQL - link

If you're going to be working with large and/or high-transaction InnoDB databases, learn and understand "SHOW INNODB STATUS" (Mysql Performance Blog); it will become your friend.

If using the command-line mysql client, you can interact with the shell (on Linux machines - not sure if there is an equivalent on Windows) by using the shriek/exclamation mark. For example:
\! cat file1.sql
will display the code for file1.sql. To save your statement and query to a file, use the tee facility
\T filename
to turn this off use \t
Lastly, to run a script you've already saved, use "source filename". Of course, the normal alternative is to redirect the script into mysql when starting it from the command line:
mysql -u root -p < case1.sql
Hope that's of use to someone !
Edit: Just remembered another one - when invoking mysql from the command line you can use the -t switch so that output is in table format - a real boon with some queries (although of course terminating queries with \G as mentioned elsewhere here is also helpful in this respect). A lot more on various switches Command Line Tool
Just found out a neat way to change the order of a sort (normally you'd use CASE...).
If you want to change the order of a sort (perhaps sort by 1, 4, 3, 2 instead of 1, 2, 3, 4) you can use the FIELD function within the ORDER BY clause.
For example
Order By Field(sort_field,1,4,3,2)
Found this out here: Order by day_of_week in MySQL, courtesy of user gms8994.

I don't think this is MySQL specific, but it was enlightening for me:
Instead of writing
WHERE (x.id > y.id) OR (x.id = y.id AND x.f2 > y.f2)
You can just write
WHERE (x.id, x.f2) > (y.id, y.f2)

mysqlsla - one of the most commonly used slow query log analysis tools. You can see the top 10 worst queries since you last rotated the slow query logs. It can also tell you the number of times that BAD query was fired and how much total time it took on the server.

Actually documented, but very annoying: automatic conversions for incorrect dates and other incorrect input.
Before MySQL 5.0.2, MySQL is forgiving of illegal or improper data values and coerces them to legal values for data entry. In MySQL 5.0.2 and up, that remains the default behavior, but you can change the server SQL mode to select more traditional treatment of bad values such that the server rejects them and aborts the statement in which they occur.
As for dates: sometimes you'll be "lucky" when MySQL doesn't adjust the input to nearby valid dates, but instead stores it as 0000-00-00 which by definition is invalid. However, even then you might have wanted MySQL to fail rather than silently storing this value for you.

The built-in SQL Profiler.

InnoDB by default stores all tables in one global tablespace that will never shrink.
You can use innodb_file_per_table which will put each table in a separate tablespace that will be deleted when you drop the table or database.
Plan ahead for this since you have to dump and restore the database to reclaim space otherwise.
Using Per-Table Tablespaces
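For example, a my.cnf sketch of enabling it (only tables created or rebuilt after the change get their own .ibd tablespace):
[mysqld]
innodb_file_per_table = 1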

If you insert an empty string value "" into a DATETIME column, MySQL will retain the value as 0000-00-00 00:00:00, unlike Oracle, which will save a NULL value.

During my benchmarks with large datasets and DATETIME fields, it's always slower to do this query:
SELECT * FROM mytable
WHERE date(date_column) BETWEEN '2011-01-01' AND '2011-03-03';
than this approach:
SELECT * FROM mytable
WHERE date_column BETWEEN '2011-01-01 00:00:00' AND '2011-03-03 23:59:59';
The reason is that wrapping the column in date() prevents MySQL from using an index on date_column.

Related

Mysql: Duplicate key error with autoincrement primary key

I have a table 'logging' in which we log visitor history. We have 14 million pageviews in a day, so we insert 14 million records into the table per day, and traffic is highest in the afternoon. For some days we have been facing problems with duplicate key entry 'id', which according to me should not be the case, since id is an auto-incremented field and we are not explicitly passing id in the insert query. Following are the details:
logging (MyISAM)
----------------------------------------
| id | int(20) |
| virtual_user_id | varchar(1000) |
| visited_page | varchar(255) |
| /* More such columns are there */ |
----------------------------------------
Please let me know what the problem is here. Is keeping the table in MyISAM a problem?
Problem 1: size of your primary key
http://dev.mysql.com/doc/refman/5.0/en/integer-types.html
The max value of an INT, regardless of the display width you give it, is 2147483647 - twice that if unsigned.
At 14 million inserts a day, that means you hit a problem after about 153 days.
To prevent that, you might want to change the datatype to an unsigned BIGINT.
Or, for even more ridiculously large volumes, even a Unix timestamp + microtime as a composite key. Or a different DB solution altogether.
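A sketch of the datatype change (the column name is taken from the table above; adjust to the real definition, and note that on a table this size the ALTER will rebuild the table and lock it for a while):
ALTER TABLE logging MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;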
Problem 2: the actual error
It might be concurrency, even though I don't find that very plausible.
You'll have to provide the insert IDs / errors for that. Do you use transactions?
Another possibility is a corrupt table.
Don't know your mysql version, but this might work:
CHECK TABLE tablename
See if that has any complaints.
REPAIR TABLE tablename
General advice:
Is this a sensible amount of data to be inserting into a database, and doesn't it slow everything down too much anyhow?
I wonder how your DB performs, with locking and all, during a delete or, for example, an ALTER TABLE.
The right way to do it totally depends on the goals and requirements of your system which I don't know, but here's an idea:
Log lines to a log file. Import the log files at your own pace. Don't bother your visitors with errors or delays when your DB is having trouble or when you need to do some big operation that locks everything.

MySQL GROUP BY query is taking a long time

I have a table "Words" in mysql database. This table contains 2 fields. word(VARCHAR(256)) and p_id(INTEGER).
Create table statement for the table:
CREATE TABLE `Words` (
`word` varchar(256) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
`p_id` int(11) NOT NULL DEFAULT '0',
KEY `word_i` (`word`(255))
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Sample entries in the table are:
+------+------+
| word | p_id |
+------+------+
| a | 1 |
| a | 2 |
| b | 1 |
| a | 4 |
+------+------+
This table contains 30+ million entries. I am running a GROUP BY query and it takes 90+ minutes to run. The query is:
SELECT word,group_concat(p_id) FROM Words group by word;
To work around this problem, I exported all the data in the table to a text file using the following query:
SELECT p_id,word FROM Words INTO OUTFILE "/tmp/word_map.txt";
After that I wrote a Perl script to read all the content of the file, parse it, and build a hash out of it. It took very little time compared to the GROUP BY query (<3 min). In the end the hash has 14 million keys (words), and it occupies a lot of memory. So is there any way to improve the performance of the GROUP BY query so that I don't need to go through all the above-mentioned steps?
EDIT: I am adding the my.cnf file entries below.
[mysqld]
datadir=/media/data/.mysql_data/mysql
tmpdir=/media/data/.mysql_tmp_data
innodb_log_file_size=5M
socket=/var/lib/mysql/mysql.sock
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
group_concat_max_len=4M
max_allowed_packet=20M
[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
tmpdir=/media/data/.mysql_tmp_data/
Thanks,
Vinod
I think the index you want is:
create index words_word_pid on Words(word, p_id);
This does two things. First, the group by can be handled by an index scan rather than loading the original table and sorting the results.
Secondly, this index also eliminates the need to load the original data.
My guess is that the original data does not fit into memory. So, the processing goes through the index (efficiently), finds the word, and then needs to load the pages with the word on it. Well, eventually memory fills up and the page with the word is not in memory. The page is loaded from disk. And the next page is probably not in memory, and that page is loaded from disk. And so on.
You can fix this problem by increasing the memory size. You can also fix the problem by having an index that covers all the columns used in the query.
The problem is that it is hardly a frequent use case for a database to output a whole 30M-row table into a file. The advantage of your approach with the Perl script is that you do not need random disk IO. To simulate the behaviour in MySQL you would need to load everything into an index (p_id, word) (the whole word, not a prefix), which might turn out to be overkill for the database.
You can put only p_id into an index, this will speed up grouping, but will require a lot of random disk IO to fetch words for each row.
By the way, the covering index will take ~(4+4+3*256)*30M bytes, that is more than 23Gb of memory. It seems that the solution with the Perl script is the best you can do.
Another thing you should be aware of is that you will need to get more than 20 GB of results through a MySQL connection, and that those 20 GB of results should be collected into a temporary table (and sorted by p_id if you do not append ORDER BY NULL). If you are going to download it through a MySQL binding to a programming language, you will need to force the binding to use streaming (by default, bindings usually fetch the whole result set).
Index the table on the word column. This will accelerate the grouping substantially as the SQL engine can locate the records for grouping with minimal searching through the table.
CREATE INDEX word_idx ON Words(word);

How to implement an efficient database driven ticket/protocol control system?

Scenario: WAMP server, InnoDB Table, auto-increment Unique ID field [INT(10)], 100+ concurrent SQL requests. VB.Net should also be used if needed.
My database has an auto-increment field which is used to generate a unique ticket/protocol number for each new record stored (a service order).
The issue is that this number must be reset each year. I.e., it starts at 000001/12 on 01/01/2012 00:00:00, goes up to a maximum of 999999/12, and then at 01/01/2013 00:00:00 it must start over again at 000001/13.
Obviously this could easily be accomplished using some kind of algorithm, but I'm trying to find a more efficient way to do it. Consider:
It must (?) use auto-increment since the database has some concurrency (+100).
Each ticket must be unique. So 000001 on 2012 is not equal to 000001 on 2013.
It must be automatic. (no human interaction needed to make the reset, or whatever)
It should be reasonably efficient. (A watch program could check the database daily (?), but that seems not to be the best solution since it will 'fail' 364 times to succeed only once.)
The 'best' approach I could think of is to store the ticket number using year, such as:
12000001 - 12999999 (it never should reach the 999.999, anyway)
and then a watch program would set the auto-increment field to 13000000 on 01/01/2013.
Any Suggestions?
PS: Thanks for reading... ;)
So, for further reference, I've adopted the following solution:
I create n tables in the database (one for each year), each with a single auto-increment field which is responsible for generating that year's unique id.
New inserts go into the corresponding table based on the event date. After that, the algorithm takes the last_insert_id() and stores that value in the main table using the format 000001/12 (ticket/year).
That's because each year must have its own counter, since a 2012 event could be inserted even when the current date is already 2013.
That way events can be retroactive, no reset is needed, and it's simple to implement.
Sample code for insertion:
$eventdate = "2012-11-30";
$eventyear = "2012";
$sql = "INSERT INTO tbl$eventyear VALUES (NULL)";
mysql_query($sql);
$sql = "SELECT LAST_INSERT_ID()";
$row = mysql_fetch_row(mysql_query($sql));
$eventID = $row[0];
$sql = "INSERT INTO tblMain VALUES ('$eventID/$eventyear', ... ";
mysql_query($sql);
MongoDB uses something very similar to this that encodes the date, process id and host that generated an id along with some random entropy to create UUIDs. Not something that fulfills your requirement of monotonic increase, but something interesting to look at for some ideas on approach.
If I were implementing it, I would create a simple ID broker server that would perform the date logic and create a unique slug for the id like you described. As long as you know how it's constructed, have native MySQL equivalents to get your sorting/grouping working, and the representation serializes gracefully, this will work. Something with a UTC date string and then a monotonic serial appended as a string.
Twitter had some interesting insights into custom index design here as they implemented their custom ID server Snowflake.
The idea of a broker endpoint that generates UUIDs that are not just simple integers, but also contain some business logic is becoming more and more widespread.
You can set up a combined PRIMARY KEY over the two columns ID and YEAR; this way each ID value can occur only once per year.
MySQL:
CREATE TABLE IF NOT EXISTS `ticket` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`year` YEAR NOT NULL DEFAULT '2012',
`data` TEXT NOT NULL DEFAULT '',
PRIMARY KEY (`id`, `year`)
)
ENGINE = InnoDB DEFAULT CHARACTER SET = utf8 COLLATE = utf8_unicode_ci
UPDATE: (in reply to the comment from @Paulo Bueno)
How to reset the auto-increment value can be found in the MySQL documentation: mysql> ALTER TABLE ticket AUTO_INCREMENT = 1;
If you also increase the default value of the year column when resetting the auto-increment value, you'll have a continuous two-column primary key.
But I think you still need some sort of trigger program to execute the reset. Maybe a yearly cron job which launches a batch script to do so on each first of January.
UPDATE 2:
OK, I've tested that just now, and one cannot set the auto-increment value to a number lower than any existing ID in that specific column. My mistake - I thought it would work on combined primary keys…
INSERT INTO `ticket` (`id`, `year`, `data`) VALUES
(NULL , '2012', 'dtg htg het'),
-- some more rows in 2012
);
-- this works of course
ALTER TABLE `ticket` CHANGE `year` `year` YEAR( 4 ) NOT NULL DEFAULT '2013';
-- this does not reset the auto-increment
ALTER TABLE `ticket` AUTO_INCREMENT = 1;
INSERT INTO `ticket` (`id`, `year`, `data`) VALUES
(NULL , '2013', 'sadfadf asdf a'),
-- some more rows in 2013
);
-- this will result in continuously counted IDs
UPDATE 3:
The MySQL documentation page has a working example, which uses grouped primary keys on a MyISAM table. It uses a table similar to the one above, but with reversed column order, because the auto-increment column must not be the first column of the index. It seems this works only with MyISAM, not InnoDB. If MyISAM still fits your needs, you don't need to reset the ID; merely increase the year and you still get the result you asked for.
See: http://dev.mysql.com/doc/refman/5.0/en/example-auto-increment.html (second example, after "For MyISAM and BDB tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index.")
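A minimal sketch of that documented MyISAM pattern applied to the ticket/year case (the table name is made up):
CREATE TABLE `ticket_myisam` (
  `year` YEAR NOT NULL,
  `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `data` TEXT NOT NULL,
  PRIMARY KEY (`year`, `id`)
) ENGINE=MyISAM;
INSERT INTO `ticket_myisam` (`year`, `data`) VALUES
  ('2012', 'first of 2012'), ('2012', 'second of 2012'), ('2013', 'first of 2013');
-- generated ids: 1 and 2 for 2012, then 1 again for 2013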

"is not null" vs boolean MySQL - Performance

I have a column that is a datetime, converted_at.
I plan on making calls that check WHERE converted_at IS NOT NULL very often. As such, I'm considering having a boolean field converted. Is there a significant performance difference between checking if a field is not null vs. if it is false?
Thanks.
If something can be answered with a single field, favour that over splitting the same thing into two fields. Splitting creates more infrastructure, which, in your case, is avoidable.
As to the nub of the question, I believe most database implementations, MySQL included, have an internal boolean flag anyway representing the NULLness of a field.
You can rely on this being done for you correctly.
As to performance, the bigger question is about profiling the typical queries that you run on your database, creating appropriate indexes, and running ANALYZE TABLE to improve execution plans and the indexes used by queries. That will have a far bigger impact on performance.
Using WHERE converted_at is not null or WHERE converted = FALSE will probably be the same in matters of query performance.
But if you have this additional bit field that stores whether the converted_at field is NULL or not, you'll have to somehow maintain integrity (via triggers?) whenever a new row is added and every time the column is updated. So this is a denormalization, and it also means more complicated code. Moreover, you'll have at least one more index on the table (which means slightly slower INSERT/UPDATE/DELETE operations).
Therefore, I don't think it's good to add this bit field.
If you can change the column in question from NULL to NOT NULL (possibly by normalizing the table), you may get some performance gain (at the cost/gain of having more tables).
I had the same question for my own usage. So I decided to put it to the test.
So I created all the fields required for the 3 possibilities I imagined:
# option 1
ALTER TABLE mytable ADD deleted_at DATETIME NULL;
ALTER TABLE mytable ADD archived_at DATETIME NULL;
# option 2
ALTER TABLE mytable ADD deleted boolean NOT NULL DEFAULT 0;
ALTER TABLE mytable ADD archived boolean NOT NULL DEFAULT 0;
# option 3
ALTER TABLE mytable ADD invisibility TINYINT(1) UNSIGNED NOT NULL DEFAULT 0
COMMENT '4 values possible' ;
The last is a bitfield where 1=archived, 2=deleted, 3=deleted + archived
First difference: you have to create indexes for options 2 and 3.
CREATE INDEX mytable_deleted_IDX USING BTREE ON mytable (deleted) ;
CREATE INDEX mytable_archived_IDX USING BTREE ON mytable (archived) ;
CREATE INDEX mytable_invisibility_IDX USING BTREE ON mytable (invisibility) ;
Then I tried all of the options using a real-life SQL request on 13k records in the main table. Here is how it looks:
SELECT *
FROM mytable
LEFT JOIN table1 ON mytable.id_qcm = table1.id_qcm
LEFT JOIN table2 ON table2.id_class = mytable.id_class
INNER JOIN user ON mytable.id_user = user.id_user
where mytable.id_user=1
and mytable.deleted_at is null and mytable.archived_at is null
# and deleted=0
# and invisibility=0
order BY id_mytable
I alternately used the commented-out filter options above.
Tested with MySQL 5.7.21-1 on Debian 9.
My conclusion:
The "is null" solution (option 1) is a bit faster, or at least offers the same performance.
The two others ("deleted=0" and "invisibility=0") seem on average a bit slower.
But the nullable-fields option has decisive advantages: no index to create, easier to update, easier to query, and less storage space used.
(Additionally, inserts and updates should in principle be faster as well, since MySQL does not need to update indexes, but you would never be able to notice it.)
So you should use the nullable datetime fields option.

Database optimization advice

I have a table called members. I am looking on advice how to improve it.
id : This is user id (unique) (auto increment) (indexed)
status : Can contain 'activated', 'suspended', 'verify', 'delete'
admin : This just contains either 0 or 1 (if person is admin or not)
suspended_note : If a members account is suspended i can add a note so when they try and login they will see the note.
failed_login_count : basically 1 digit from 0 to 4, counts failed logins
last_visited : unix timestamp of when they last visited site; (updated on logout) (i do this via php with time() )
username : can contain from 3 to 15 characters (unique and indexed)
first_name : can contain letters only and from 3 to 40 chars in length
last_name : can contain letters only and from 2 to 50 chars in length
email : can contain an email address (i use php email filter to check if valid)
password : can contain from 6 to 10 chars in length and is hashed and contains fixed length of 40 chars in database once hashed
date_time : unix timestamp (i do this via php with time() ). When user logs in
ip : members ip on registration/logins
activationkey : i use md5 and a salt to create a unique activation key; length is always 32 chars
gender : either blank or male/female and nothing else.
websiteurl : can add their site URL;
msn : can contain msn email address (use regular expression to match this)
aim : aim nickname (use regular expression to match this)
yim : yim nickname (use regular expression to match this)
twitter : twitter username (use regular expression to match this)
suspended_note; first_name; last_name; date_time; ip; gender; websiteurl; msn; aim; yim; twitter can be null, because on registration only username, email and password are required, so those fields will be null until filled in (they are basically optional), apart from ip, which is taken on signup/login.
Could anyone tell me, based on the information I have given, how I can improve and alter this table to be more efficient? I would say I could improve it, as I tend to use VARCHAR for most things, and I am looking to get the best performance out of it.
I tend to do quite a few SELECTs and store the user data in sessions to avoid having to query the database every time. Username is unique and indexed, like id, since most of my SELECTs filter on username with LIMIT 1.
UPDATE:
I wanted to ask: if I changed to ENUM, for example, how would I do a select-and-compare query for an ENUM in PHP? I did look online but cannot find any example queries with ENUM being used. Also, if I changed date_time to TIMESTAMP, for example, do I still use time() in PHP to insert the Unix timestamp into the date_time column?
The reason I ask is that I was reading a tutorial online that says that when the row is queried, selected, updated, etc., MySQL automatically updates the timestamp for that row; is this true? I would rather insert the timestamp using PHP time() into the timestamp field. I already use PHP time() for date_time, but currently use VARCHAR, not TIMESTAMP.
Plus, the server time is US time, and in php.ini I set it to UK time, but I guess MySQL would store it in the server's time, which again is no good as I want them in UK time.
Some tips:
Your status should be an int connected to a lookup, or an enum.
ditto for gender
You could use a CHAR instead of a VARCHAR. There is a lot of discussion available on that, but while VARCHAR does help you cut down on size, that is hardly a big issue most of the time. CHAR can be quicker. This is a tricky point though.
Save your date_time as a TIMESTAMP. There is a datatype for that.
ditto for last_visited
Your ip field looks a bit long to me.
an INT(5) can hold far too much. If your failed count is at most 4, you don't need that big a number! A TINYINT can hold up to 127 signed, or 255 unsigned.
A note from the comments:
You could probably normalize some
fields: fields that update often, like
failed_login_count, ip, last_visited
could be in another table. This way
your members table itself doesn't
change as often and can be in cache
I agree with this :)
Edit: some updates after your new questions.
example how would I do a select and compare query for example in php for enum?
You can just compare it to the value as if it was a string. The only difference is that with an insert or update, you can only use the defined values. Just use:
SELECT * FROM table WHERE table.enum = "yourEnumOption"
changed date_time for example to timestamp do I still use time() in php to insert the unix timestamp into date_time column database?
You can use NOW() in MySQL (this is just a quick one off the top of my head, so it could have a minor mistake, but):
INSERT INTO table (yourTime) VALUES (NOW());
reason I ask is I was reading one tutorial online that says when the row is queried, selected, updated etc MySQL automatically updates the timestamp for that row; is this true as I rather insert the timestamp using php time() in timestamp field. I use php time() already for date_time but use currently use varchar not timestamp.
You can use the php time. The timestamp does not get updated automatically, see the manual (http://dev.mysql.com/doc/refman/5.0/en/timestamp.html): you would use something like this in the definition:
CREATE TABLE t (
ts1 TIMESTAMP DEFAULT 0,
ts2 TIMESTAMP DEFAULT CURRENT_TIMESTAMP
ON UPDATE CURRENT_TIMESTAMP)
First of all you should use mysql's built in field types:
status is ENUM('activated', 'suspended', 'verify', 'delete');
gender is ENUM('male','female','unknown')
last_visited is TIMESTAMP
suspended_note is TEXT
failed_login_count should be TINYINT(1), not INT(5) - you wouldn't have 10000 failed logins, right?
date_time is DATETIME or TIMESTAMP
add an index on username and password (combined) so that logins are faster
index, unique email since you'll query by it to retrieve pwds and it should be unique
Also, you might want to normalize this table and separate suspended_note, website, IP, aim, etc. into a separate table called profile. This way logins, session updates and password retrievals are queries run against a much smaller table, and the rest of the data is selected only on pages where you need it, such as the profile/member pages.
However, this tends to vary a lot depending on how your app is thought out, but generally it's better practice to normalize.
You could probably normalize even more
and have a user_stats table too:
fields that update often, like
failed_login_count, ip, last_visited
could be in another table. This way
your members table itself doesn't
change as often and can be in cache. –
Konerak 1 hour ago
VARCHAR is good, but when you know the size of something - for example, the activation key is always 32 chars - use CHAR(32).
Well first the basics..
IP should be stored as an unsigned INT and you would use INET_ATON and INET_NTOA to retrieve and store the IP.
Status could be an enum or a tinyint 1/0.
For last visited you could insert a unix timestamp using the mysql function UNIX_TIMESTAMP (Store this in a timestamp column). To retrieve the date you would use the FROM_UNIXTIME function.
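A small sketch of both tips (here ip and last_visited are assumed to be INT UNSIGNED columns, the latter holding a raw Unix timestamp rather than a TIMESTAMP value):
UPDATE members SET ip = INET_ATON('203.0.113.9'), last_visited = UNIX_TIMESTAMP() WHERE id = 1;
SELECT INET_NTOA(ip) AS ip, FROM_UNIXTIME(last_visited) AS last_visited FROM members WHERE id = 1;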
Most answers have touched on the basics of using Enum's. However using 1 for Male and 2 for Female may speed up your application as a numeric field may be faster than an alphanumeric field if you do a lot of queries by that field. You should test to find out.
Secondly, we would need to know how you use the table. How does your app query the table? Where are your indexes? Are you using MyISAM? InnoDB? etc. Most of my recommendations would be based on how your app hits the table. The table is also wide, so I would look into normalizing it, as some others have pointed out.
admin can be of type bit
Activation key can be smaller