MySQL partitioning on categorical fields and a timestamp column which is varchar

Currently we have table:
CREATE TABLE `T_TRANS` (
`CASE_ID` varchar(20) DEFAULT NULL,
`C_ID` varchar(20) DEFAULT NULL,
`C_ST_IND` smallint(6) DEFAULT NULL,
`D_DTTM` int(11) DEFAULT NULL,
`E_ID` varchar(10) DEFAULT NULL,
`E_LONG` decimal(11,7) DEFAULT NULL,
`E_LAT` decimal(9,7) DEFAULT NULL,
`EV_IND` smallint(6) DEFAULT NULL,
`H_B_IND` smallint(6) DEFAULT NULL,
`V_IND` varchar(15) DEFAULT NULL,
`I_IND` smallint(6) DEFAULT NULL,
`I_P_IND` smallint(6) DEFAULT NULL,
`I_S_IND` smallint(6) DEFAULT NULL,
`IS_D_IND` smallint(6) DEFAULT NULL,
`IS_R_IND` smallint(6) DEFAULT NULL,
`L_IND` smallint(6) DEFAULT NULL,
`D_LONG` decimal(11,7) DEFAULT NULL,
`D_LAT` decimal(9,7) DEFAULT NULL,
`L_P_C_DTTM` int(11) DEFAULT NULL,
`L_T_E_DTTM` int(11) DEFAULT NULL,
`M_IND` varchar(20) DEFAULT NULL,
`N_D_COUNTER` smallint(6) DEFAULT NULL,
`O_ID` smallint(6) NOT NULL,
`P_ID` varchar(50) DEFAULT NULL,
`R_E_IND` smallint(6) DEFAULT NULL,
`R_IND` smallint(6) DEFAULT NULL,
`S_C_DTTM` varchar(20) DEFAULT NULL,
`S_IND` smallint(6) DEFAULT NULL,
`T_T_RED` varchar(20) DEFAULT NULL,
`U_D` int(11) DEFAULT NULL,
`V_D` int(11) DEFAULT NULL,
`CRT_USR_NAM` varchar(45) DEFAULT NULL,
`CRT_DTTM` varchar(45) DEFAULT NULL,
`UPD_USR_NAM` varchar(45) DEFAULT NULL,
`UPD_DTTM` varchar(45) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
My WHERE clauses will filter on the following columns, for a specific value or a combination of values:
C_ST_IND values range from (0,1,2,3,4,5,6,7,8,9,10,11,12)
EV_IND values range from (0,1,2,3,4,5,6,7)
R_IND values range from (0,1)
R_E_IND values range from (0,1)
L_IND values range from (0,1)
IS_D_IND values range from (0,1)
I_S_IND values range from (0,1)
I_P_IND values range from (0,1)
I_IND values range from (0,1)
S_IND values range from (0,1,2,3)
H_B_IND values range from (0,1)
O_ID values range from (1,2,3,4,5,6)
Also, my date columns (CRT_DTTM and UPD_DTTM) are varchar with the format '2019-01-25 01:01:59'.
On average, the daily load is:
CRT_DTTM Count
2019-01-20 656601
2019-01-21 686018
2019-01-22 668486
2019-01-23 680922
2019-01-24 693700
This table now has millions of records and is currently in production without any partitions or indexes.
It takes a lot of time to run any query.
Now I need to create partitions/indexes. I tried partitioning the existing table, but it takes forever to run.
What are the best partitioning methods for the columns listed above (frequently used in WHERE clauses) and for the date columns (CRT_DTTM and UPD_DTTM): year, month, week or day partitions?
Also, any indexes?
This table will hold three years of data. Right now we have three months of data.
How do I move my current table to a new partitioned table? I am new to MySQL; any information that would help reduce production query run time and report generation would be appreciated.

PARTITIONs do not intrinsically provide any performance benefit. Let's see the queries so we can judge whether you have one of the rare cases where partitioning helps, such as purging 'old' data.
Suggest you shrink the data -- SMALLINT takes 2 bytes; TINYINT UNSIGNED takes 1 byte and can easily hold all those small values you mention. Seven decimal places for lat/lng gives you a precision of under 16mm, less than one inch. Do you need that much precision? Consider DECIMAL(8,6) for latitude and DECIMAL(9,6) for longitude; that will save 3 bytes for each pair. (Hmmm... why are there two pairs?)
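A sketch of those changes, assuming the flag columns really never exceed the values listed above and the varchar timestamps all hold well-formed 'YYYY-MM-DD hh:mm:ss' strings (test on a copy first; the exact type choices are illustrative):
ALTER TABLE T_TRANS
  MODIFY C_ST_IND TINYINT UNSIGNED DEFAULT NULL,  -- 0..12 fits in 1 byte
  MODIFY EV_IND TINYINT UNSIGNED DEFAULT NULL,    -- 0..7; same idea for the other 0/1 flags
  MODIFY O_ID TINYINT UNSIGNED NOT NULL,          -- 1..6
  MODIFY E_LAT DECIMAL(8,6) DEFAULT NULL,         -- ~16cm precision
  MODIFY E_LONG DECIMAL(9,6) DEFAULT NULL,
  MODIFY CRT_DTTM DATETIME DEFAULT NULL,          -- 5 bytes instead of ~20 bytes of varchar
  MODIFY UPD_DTTM DATETIME DEFAULT NULL;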
"A long time to run 'any' query"? Let's see some of them and work on optimizing them. The usual problem is that you need to touch lots of rows. Shrinking the rows (as mentioned above) will help some. But the big improvement comes with not touching as many rows.
This smells like a Data Warehouse application? If so, perhaps building and maintaining Summary tables is the way to go. See http://mysql.rjweb.org/doc.php/summarytables . Show me some more info, and I will help you.
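For example, a minimal summary-table sketch, assuming CRT_DTTM has been converted to DATETIME as above and assuming a daily rollup by O_ID and C_ST_IND would be useful for the reports (table and column names here are illustrative):
CREATE TABLE T_TRANS_DAILY (
  dy DATE NOT NULL,
  O_ID TINYINT UNSIGNED NOT NULL,
  C_ST_IND TINYINT UNSIGNED NOT NULL,
  row_count INT UNSIGNED NOT NULL,
  PRIMARY KEY (dy, O_ID, C_ST_IND)
) ENGINE=InnoDB;
-- Populate nightly for yesterday's rows; reports then read this small
-- table instead of scanning millions of detail rows:
INSERT INTO T_TRANS_DAILY
SELECT DATE(CRT_DTTM), O_ID, C_ST_IND, COUNT(*)
FROM T_TRANS
WHERE CRT_DTTM >= CURDATE() - INTERVAL 1 DAY
  AND CRT_DTTM <  CURDATE()
GROUP BY DATE(CRT_DTTM), O_ID, C_ST_IND;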
Do you intend to purge data after 3 years? If so, I recommend partitioning by month and have 38 partitions. Details here: http://mysql.rjweb.org/doc.php/partitionmaint . With that, the 680K-row nightly DELETE becomes a much quicker DROP PARTITION. (Meanwhile, there is probably no benefit to the performance of queries.)
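A sketch of that layout plus the usual migration path for an existing table, assuming CRT_DTTM has first been converted to DATETIME (TO_DAYS partitioning cannot use a varchar column), showing just a few of the 38 partitions:
CREATE TABLE T_TRANS_NEW LIKE T_TRANS;
ALTER TABLE T_TRANS_NEW PARTITION BY RANGE (TO_DAYS(CRT_DTTM)) (
  PARTITION p201901 VALUES LESS THAN (TO_DAYS('2019-02-01')),
  PARTITION p201902 VALUES LESS THAN (TO_DAYS('2019-03-01')),
  PARTITION pFuture VALUES LESS THAN MAXVALUE
);
-- Copy in modest chunks to avoid one huge transaction, then swap:
INSERT INTO T_TRANS_NEW
  SELECT * FROM T_TRANS
  WHERE CRT_DTTM >= '2019-01-01' AND CRT_DTTM < '2019-01-08';
-- ...repeat for each week, then:
RENAME TABLE T_TRANS TO T_TRANS_OLD, T_TRANS_NEW TO T_TRANS;
-- Purging a month later is a near-instant metadata operation:
ALTER TABLE T_TRANS DROP PARTITION p201901;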
My Index Cookbook: http://mysql.rjweb.org/doc.php/index_cookbook_mysql

Related

MySQL Query Optimization that touches three tables via a union of two of them

I have a query that returns results from a single table based on whether the provided ID exists in a column in one of two, or both, of the other tables. The DB schema for the relevant tables is provided below, as well as the initial query and what was later recommended to me by a peer. I go into some detail below as to why this query works, but I need to optimize it further for larger datasets and pagination.
CREATE TABLE `killmails` (
`id` BIGINT(20) UNSIGNED NOT NULL,
`hash` VARCHAR(255) NOT NULL,
`moon_id` BIGINT(20) NULL DEFAULT NULL,
`solar_system_id` BIGINT(20) UNSIGNED NOT NULL,
`war_id` BIGINT(20) NULL DEFAULT NULL,
`is_npc` TINYINT(1) NOT NULL DEFAULT '0',
`is_awox` TINYINT(1) NOT NULL DEFAULT '0',
`is_solo` TINYINT(1) NOT NULL DEFAULT '0',
`dropped_value` DECIMAL(18,4) UNSIGNED NOT NULL DEFAULT '0.0000',
`destroyed_value` DECIMAL(18,4) UNSIGNED NOT NULL DEFAULT '0.0000',
`fitted_value` DECIMAL(18,4) UNSIGNED NOT NULL DEFAULT '0.0000',
`total_value` DECIMAL(18,4) UNSIGNED NOT NULL DEFAULT '0.0000',
`killmail_time` DATETIME NOT NULL,
`created_at` DATETIME NOT NULL,
`updated_at` DATETIME NOT NULL,
PRIMARY KEY (`id`, `hash`),
INDEX `total_value` (`total_value`),
INDEX `killmail_time` (`killmail_time`),
INDEX `solar_system_id` (`solar_system_id`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
;
CREATE TABLE `killmail_attackers` (
`id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`killmail_id` BIGINT(20) UNSIGNED NOT NULL,
`alliance_id` BIGINT(20) UNSIGNED NULL DEFAULT NULL,
`character_id` BIGINT(20) UNSIGNED NULL DEFAULT NULL,
`corporation_id` BIGINT(20) UNSIGNED NULL DEFAULT NULL,
`faction_id` BIGINT(20) UNSIGNED NULL DEFAULT NULL,
`damage_done` BIGINT(20) UNSIGNED NOT NULL,
`final_blow` TINYINT(1) NOT NULL DEFAULT '0',
`security_status` DECIMAL(17,15) NOT NULL,
`ship_type_id` BIGINT(20) UNSIGNED NULL DEFAULT NULL,
`weapon_type_id` BIGINT(20) UNSIGNED NULL DEFAULT NULL,
`created_at` DATETIME NOT NULL,
`updated_at` DATETIME NOT NULL,
PRIMARY KEY (`id`),
INDEX `ship_type_id` (`ship_type_id`),
INDEX `weapon_type_id` (`weapon_type_id`),
INDEX `alliance_id` (`alliance_id`),
INDEX `corporation_id` (`corporation_id`),
INDEX `killmail_id_character_id` (`killmail_id`, `character_id`),
CONSTRAINT `killmail_attackers_killmail_id_killmails_id_foreign_key` FOREIGN KEY (`killmail_id`) REFERENCES `killmails` (`id`) ON UPDATE CASCADE ON DELETE CASCADE
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
;
CREATE TABLE `killmail_victim` (
`id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`killmail_id` BIGINT(20) UNSIGNED NOT NULL,
`alliance_id` BIGINT(20) UNSIGNED NULL DEFAULT NULL,
`character_id` BIGINT(20) UNSIGNED NULL DEFAULT NULL,
`corporation_id` BIGINT(20) UNSIGNED NULL DEFAULT NULL,
`faction_id` BIGINT(20) UNSIGNED NULL DEFAULT NULL,
`damage_taken` BIGINT(20) UNSIGNED NOT NULL,
`ship_type_id` BIGINT(20) UNSIGNED NOT NULL,
`ship_value` DECIMAL(18,4) NOT NULL DEFAULT '0.0000',
`pos_x` DECIMAL(30,10) NULL DEFAULT NULL,
`pos_y` DECIMAL(30,10) NULL DEFAULT NULL,
`pos_z` DECIMAL(30,10) NULL DEFAULT NULL,
`created_at` DATETIME NOT NULL,
`updated_at` DATETIME NOT NULL,
PRIMARY KEY (`id`),
INDEX `corporation_id` (`corporation_id`),
INDEX `alliance_id` (`alliance_id`),
INDEX `ship_type_id` (`ship_type_id`),
INDEX `killmail_id_character_id` (`killmail_id`, `character_id`),
CONSTRAINT `killmail_victim_killmail_id_killmails_id_foreign_key` FOREIGN KEY (`killmail_id`) REFERENCES `killmails` (`id`) ON UPDATE CASCADE ON DELETE CASCADE
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
;
This first query is where the problem started:
SELECT
*
FROM
killmails k
LEFT JOIN killmail_attackers ka ON k.id = ka.killmail_id
LEFT JOIN killmail_victim kv ON k.id = kv.killmail_id
WHERE
ka.character_id = ?
OR kv.character_id = ?
ORDER BY k.killmail_time DESC
LIMIT ? OFFSET ?
This worked okay, but with long query times. We optimized it to this:
SELECT
killmails.*
FROM (
SELECT killmail_victim.killmail_id FROM killmail_victim
WHERE killmail_victim.corporation_id = ?
UNION
SELECT killmail_attackers.killmail_id FROM killmail_attackers
WHERE killmail_attackers.corporation_id = ?
) SELECTED_KMS
LEFT JOIN killmails ON killmails.id = SELECTED_KMS.killmail_id
ORDER BY killmails.killmail_time DESC
LIMIT ? OFFSET ?
I saw a huge improvement in query times when looking up killmails for characters. However, when I started querying for larger datasets like corporation and alliance killmails, the query slows down. This is because the queries that are UNIONed together can potentially return large sets of data, and the time it takes to read all that into memory so that the SELECTED_KMS table can be created is, I believe, what is taking so much time. Most of the time, with alliances, my connection to the database times out from the application. One alliance returned 900K killmail IDs from one of the UNIONed tables; I'm not sure what the other returned.
I can easily add LIMIT clauses to the internal queries, but this will introduce a lot of complications when I get to paginating the data, or when I introduce a feature to search for KMs by date, for example.
I am looking for suggestions on how this query can be optimized and still allow for easy pagination in the near future.
Thank You
Change INDEX(corporation_id) in both tables to INDEX(corporation_id, killmail_id) so that the inner queries will be "covering".
In general, INDEX(a) is useless when you also have INDEX(a,b). Any query that needs just a can use either of those indexes. (This rule does not apply to b; only the "leftmost" column(s).)
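A sketch of that change, using the index names from the schema above (verify with SHOW INDEX before dropping anything in production):
ALTER TABLE killmail_attackers
  DROP INDEX `corporation_id`,
  ADD INDEX `corporation_id_killmail_id` (`corporation_id`, `killmail_id`);
ALTER TABLE killmail_victim
  DROP INDEX `corporation_id`,
  ADD INDEX `corporation_id_killmail_id` (`corporation_id`, `killmail_id`);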
Where does killmails.id come from? It's not AUTO_INCREMENT; it is not alone in the PRIMARY KEY, so there is no specified "uniqueness" constraint. Is it unique by some other design? Is it computed somewhere else in the code? (I ask because I need a feel for its uniqueness and other characteristics.)
Add INDEX(id, killmail_time).
What version are you using?
Perhaps UNION ALL would give the same results? It would be faster because it would not need to de-dup.
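For reference, the derived table with that one-word swap; this is only equivalent if the same killmail_id cannot appear in both branches (e.g. a corporation appearing as both attacker and victim on one killmail), otherwise plain UNION's de-dup is doing real work:
SELECT killmails.*
FROM (
    SELECT killmail_victim.killmail_id FROM killmail_victim
    WHERE killmail_victim.corporation_id = ?
    UNION ALL
    SELECT killmail_attackers.killmail_id FROM killmail_attackers
    WHERE killmail_attackers.corporation_id = ?
) SELECTED_KMS
LEFT JOIN killmails ON killmails.id = SELECTED_KMS.killmail_id
ORDER BY killmails.killmail_time DESC
LIMIT ? OFFSET ?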
How much RAM do you have? What is the value of innodb_buffer_pool_size?
Do you really need 8-byte BIGINTs? Even if your application is using longlong (or whatever it calls it), you can probably change the schema without changing the app.
Do you need this much precision and range? DECIMAL(30,10) -- it takes 14 bytes each. DOUBLE would give you about 16 significant digits in 8 bytes, with a wider range of values (up to about 10^308). What "units" are you using? (Overkill for light-years or parsecs; inadequate for miles or km. Perhaps AUs? Then the bottom digit would be a precision of a few meters?)
The last few questions are aimed at shrinking the table and seeing if we can avoid it being as I/O-bound as it apparently is now.
Important
innodb_buffer_pool_size = 128M is terribly small, especially for a 32GB machine, and especially if your dataset is much bigger than 128MB. If there are not any other apps running on the server, bump that setting up to 20G.
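A sketch of the change, assuming a typical Linux install (the option-file path varies by distro; restart mysqld after editing):
# /etc/mysql/my.cnf
[mysqld]
innodb_buffer_pool_size = 20G
In MySQL 5.7+ the buffer pool can also be resized online, without a restart:
SET GLOBAL innodb_buffer_pool_size = 21474836480;  -- 20G in bytes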

How to optimize a large mysql table (stock data)

I have a table with the following structure:
CREATE TABLE `trading_daily_price` (
`id` int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
`date` date DEFAULT NULL,
`Symbol` varchar(20) DEFAULT NULL,
`Market` varchar(12) DEFAULT NULL,
`QuoteName` text,
`Price` float DEFAULT NULL,
`PriceChange` float DEFAULT NULL,
`PriceChangePct` float DEFAULT NULL,
`Volume` float DEFAULT NULL,
`DayLow` float DEFAULT NULL,
`DayHigh` float DEFAULT NULL,
`Week52Low` float DEFAULT NULL,
`Week52High` float DEFAULT NULL,
`Open` float DEFAULT NULL,
`High` float DEFAULT NULL,
`Bid` float DEFAULT NULL,
`BidSize` float DEFAULT NULL,
`Beta` float DEFAULT NULL,
`PrevClose` float DEFAULT NULL,
`Low` float DEFAULT NULL,
`Ask` float DEFAULT NULL,
`AskSize` float DEFAULT NULL,
`VWAP` float DEFAULT NULL,
`Yield` float DEFAULT NULL,
`Dividend` char(12) DEFAULT NULL,
`DivFrequency` varchar(24) DEFAULT NULL,
`SharesOut` float DEFAULT NULL,
`PERatio` float DEFAULT NULL,
`EPS` float DEFAULT NULL,
`ExDivDate` date DEFAULT NULL,
`MarketCap` float DEFAULT NULL,
`PBRatio` float DEFAULT NULL,
`Exchange` varchar(32) DEFAULT NULL,
`NewsTitle` varchar(1024) DEFAULT NULL,
`NewsSource` varchar(32) DEFAULT NULL,
`NewsPublicationDate` date DEFAULT NULL,
`NewsURL` varchar(256) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I haven't found a good way to break it down; the frontend presentation needs all of these columns displayed. I am writing a query like:
SELECT * FROM trading_daily_price WHERE `date`='SOME_DATE' AND Symbol LIKE '%search_key%' ORDER BY `column` LIMIT 10
The table has millions of records, and every day new records are added. Now the problem is that every query takes a long time to generate output. On a 4GB DigitalOcean VPS with some configuration it runs nicely, but on GoDaddy business hosting it runs very slowly.
I want to know whether it is a better idea to break the columns into multiple tables and use JOIN statements. Will that increase performance, or do I need to follow some other optimization logic?
As suggested by Madhur, I have added indexes to date, Symbol, and Market. That improves the speed of the query above, but the following query still takes a long time:
SELECT `date`,`Price` FROM trading_daily_price WHERE `Symbol` = 'GNCP:US' ORDER BY date ASC
Thanks in advance,
Rajib
As suggested by Madhur and JNevill, I found the only solution is to create multiple indexes as required.
For the first SQL,
SELECT * FROM trading_daily_price WHERE `date`='SOME_DATE' AND Symbol LIKE '%search_key%' ORDER BY `column` LIMIT 10
we need to create index as below,
CREATE INDEX index_DCS ON trading_daily_price (`date`, `column`, `Symbol`);
and for the second SQL,
SELECT `date`,`Price` FROM trading_daily_price WHERE `Symbol` = 'GNCP:US' ORDER BY date ASC
we need to create index as below,
CREATE INDEX index_DPS ON trading_daily_price (`date`, `Price`, `Symbol`);
Thanks
You shouldn't need the (date, column, symbol) index for your first query, because you are searching Symbol with a leading-wildcard %text% pattern and MySQL can only use the date part of that index. An index on just date and column should be better, because MySQL can then utilize both columns of the index.
For your new query, you will need an index on Symbol, date and Price. With this index, your query won't need to go back to the clustered index for data.
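Putting both suggestions into DDL, a sketch (index names here are illustrative):
-- First query: equality on date, then sort on `column`; Symbol is left out
-- because a leading-wildcard LIKE cannot use the index:
CREATE INDEX idx_date_col ON trading_daily_price (`date`, `column`);
-- Second query: filter on Symbol, sort by date, and cover Price so the
-- lookup never has to go back to the clustered index:
CREATE INDEX idx_symbol_date_price ON trading_daily_price (`Symbol`, `date`, `Price`);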
Whether to split the table depends on your use case: how will you handle old data? If old data won't be frequently accessed, you can consider splitting, but your application needs to cater for it.
Split up that table.
One table has the open/high/low/close/volume, indexed by stock and date.
Another table provides static information about each stock.
Perhaps another has statistics derived from the raw data.
Make changes like those, then come back for more advice/abuse.
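A minimal sketch of that split (table names, column types and the statistics table are illustrative, not a prescribed design):
-- Daily market facts, keyed by stock and date:
CREATE TABLE daily_price (
  `symbol` VARCHAR(20) NOT NULL,
  `date` DATE NOT NULL,
  `open` FLOAT, `high` FLOAT, `low` FLOAT, `close` FLOAT, `volume` FLOAT,
  PRIMARY KEY (`symbol`, `date`)
) ENGINE=InnoDB;
-- Static information about each stock, stored once:
CREATE TABLE stock_info (
  `symbol` VARCHAR(20) NOT NULL PRIMARY KEY,
  `quote_name` TEXT,
  `market` VARCHAR(12),
  `exchange` VARCHAR(32)
) ENGINE=InnoDB;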

MySQL index help - which is faster?

What I'm dealing with:
I have a project which uses ActiveCollab 2, and the database structure is new to me: practically everything gets stored in a project_objects table and has a recursively hierarchical relationship:
Record 1234 might be type "Ticket" with parent_id of 123
Record 123 might be type "Category" with parent_id of 12
Record 12 might be type "Milestone" and so on.
Currently there are upwards of 450,000 records in this table and many of the queries in the code reference the name field which does NOT have an index on it. An example value might be Design or Development.
This might be an example query:
SELECT * FROM project_objects WHERE type = "Ticket" and name = "Design"
My problem:
I have a query that is taking upwards of 12-15 seconds, and I have a feeling it's because the name column lacks an index, forcing a full table scan. My understanding of indexes is that adding one to the name field will speed up reads but slow down inserts and updates. Does the index need to be rebuilt completely every time a record is added or updated, or is it just altered/appended? I don't want to optimize this query with an index if it means drastically slowing down other parts of the code base which depend on faster writes.
My question:
Assume 100 reads and 100 writes per day: which is more likely to be the faster process for MySQL, executing the above query on the above table without the index, or having to maintain the index every time a record is added?
I don't have the knowledge or authority to start running benchmarks, but I would like to offer a suggestion to the client without sounding like a complete novice. Thanks!
EDIT: Here is the table:
CREATE TABLE `project_objects` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`source` varchar(50) DEFAULT NULL,
`type` varchar(30) NOT NULL DEFAULT 'ProjectObject',
`module` varchar(30) NOT NULL DEFAULT 'system',
`project_id` int(10) unsigned NOT NULL DEFAULT '0',
`milestone_id` int(10) unsigned DEFAULT NULL,
`parent_id` int(10) unsigned DEFAULT NULL,
`parent_type` varchar(30) DEFAULT NULL,
`name` varchar(150) DEFAULT NULL,
`body` longtext,
`tags` text,
`state` tinyint(4) NOT NULL DEFAULT '0',
`visibility` tinyint(4) NOT NULL DEFAULT '0',
`priority` tinyint(4) DEFAULT NULL,
`created_on` datetime DEFAULT NULL,
`created_by_id` smallint(5) unsigned NOT NULL DEFAULT '0',
`created_by_name` varchar(100) DEFAULT NULL,
`created_by_email` varchar(100) DEFAULT NULL,
`updated_on` datetime DEFAULT NULL,
`updated_by_id` smallint(5) unsigned DEFAULT NULL,
`updated_by_name` varchar(100) DEFAULT NULL,
`updated_by_email` varchar(100) DEFAULT NULL,
`due_on` date DEFAULT NULL,
`completed_on` datetime DEFAULT NULL,
`completed_by_id` smallint(5) unsigned DEFAULT NULL,
`completed_by_name` varchar(100) DEFAULT NULL,
`completed_by_email` varchar(100) DEFAULT NULL,
`comments_count` smallint(5) unsigned DEFAULT NULL,
`has_time` tinyint(1) unsigned NOT NULL DEFAULT '0',
`is_locked` tinyint(3) unsigned DEFAULT NULL,
`estimate` float(9,2) DEFAULT NULL,
`start_on` date DEFAULT NULL,
`start_on_text` varchar(50) DEFAULT NULL,
`due_on_text` varchar(50) DEFAULT NULL,
`workflow_status` int(4) DEFAULT NULL,
`varchar_field_1` varchar(255) DEFAULT NULL,
`varchar_field_2` varchar(255) DEFAULT NULL,
`integer_field_1` int(11) DEFAULT NULL,
`integer_field_2` int(11) DEFAULT NULL,
`float_field_1` double(10,2) DEFAULT NULL,
`float_field_2` double(10,2) DEFAULT NULL,
`text_field_1` longtext,
`text_field_2` longtext,
`date_field_1` date DEFAULT NULL,
`date_field_2` date DEFAULT NULL,
`datetime_field_1` datetime DEFAULT NULL,
`datetime_field_2` datetime DEFAULT NULL,
`boolean_field_1` tinyint(1) unsigned DEFAULT NULL,
`boolean_field_2` tinyint(1) unsigned DEFAULT NULL,
`position` int(10) unsigned DEFAULT NULL,
`version` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
KEY `type` (`type`),
KEY `module` (`module`),
KEY `project_id` (`project_id`),
KEY `parent_id` (`parent_id`),
KEY `created_on` (`created_on`),
KEY `due_on` (`due_on`),
KEY `milestone_id` (`milestone_id`)
) ENGINE=InnoDB AUTO_INCREMENT=993109 DEFAULT CHARSET=utf8
As @Ray points out, indexes do not have to be rebuilt on every INSERT, UPDATE or DELETE operation. So, if you only want to improve the efficiency of this (or similar) queries, add either an index on (name, type) or on (type, name).
Since you already have an index on (type) alone, I would add the first one:
ALTER TABLE project_objects
ADD INDEX name_type_IDX
(name, type) ;
It may take a few seconds on a busy server, but it only has to be done once, and then all queries with conditions like yours will benefit. It may also improve the efficiency of several other types of queries that involve name only, or name and type:
WHERE name = 'Design' AND type = 'Ticket' --- your query
WHERE name = 'Design' --- condition on `name` only
GROUP BY name --- group by `name`
WHERE name LIKE 'Design%' --- range condition on `name` only
WHERE name = 'Design' --- equality condition on `name`
AND type LIKE 'Ticket%' --- and range condition on `type`
WHERE name = 'Design' --- equality condition on `name`
GROUP BY type --- and group by `type`
GROUP BY name --- group by `name`
, type --- and `type`
The insert cost of adding a single-column index on the name column is most likely negligible; it will probably amount to a constant time increase of no more than a few milliseconds. You will eat up some extra disk space, but that's usually not a concern. Nothing like the multiple seconds you're experiencing in SELECT performance.
Add the index, enjoy the performance improvement.
BTW: Indexes aren't 'rebuilt' on every insert. They're usually implemented as B-trees and, unless you're deleting frequently, should require very little rebalancing once you get larger than a few levels (and rebalancing at shallow depth is pretty cheap).
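To see whether the new index is actually chosen, a quick check with the example query from above (output details vary by MySQL version and data distribution):
EXPLAIN SELECT * FROM project_objects
WHERE type = 'Ticket' AND name = 'Design';
-- Before: key is `type` (or NULL with a full scan) and a large rows estimate.
-- After: key should be name_type_IDX with a much smaller rows estimate.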

error in mysql partitioning

I need to create the table in the format below, but it returns an error in the partitioning clause: Error Code: 1659. Field 'fldassigndate' is of a not allowed type for this type of partitioning. How do I resolve this error and make the partitioning work?
CREATE TABLE tblattendancesetup (
fldattendanceid int(11) NOT NULL AUTO_INCREMENT,
flddept varchar(100) DEFAULT NULL,
fldemployee varchar(100) DEFAULT NULL,
fldintime varchar(20) DEFAULT NULL,
fldouttime varchar(20) DEFAULT NULL,
fldlateafter varchar(20) DEFAULT NULL,
fldearlybefore varchar(20) DEFAULT NULL,
fldweekoff varchar(20) DEFAULT NULL,
fldshiftname varchar(20) DEFAULT NULL,
fldassigndate varchar(20) DEFAULT NULL,
fldfromdate varchar(20) DEFAULT NULL,
fldtodate varchar(20) DEFAULT NULL,
fldrefid varchar(20) DEFAULT NULL,
UNIQUE KEY fldattendanceid (fldattendanceid),
KEY in_attendancesetup (fldemployee,fldintime,fldouttime,fldlateafter,fldearlybefore,fldfromdate,fldtodate,fldattendanceid),
KEY i_emp_tmp (fldemployee),
KEY i_emp_attendance (fldemployee)
)
PARTITION BY RANGE (fldassigndate)
(PARTITION p_Apr VALUES LESS THAN (TO_DAYS('2012-05-01')),
PARTITION p_May VALUES LESS THAN (TO_DAYS('2012-06-01')),
PARTITION p_Nov VALUES LESS THAN MAXVALUE );
From the MySQL manual (Section 18):
Data type of partitioning key. A partitioning key must be either an
integer column or an expression that resolves to an integer.
Neither dates nor varchars can be used directly for partitioning.
Although this question is very old, I thought it'd help those searching, like I was at one point.
The OP should first fix the column structures by using a correct type for the date fields, such as DATETIME.
Once that is done: I have not tried to set my keys in that exact manner, but I have found that I needed to use a composite primary key if I wanted to keep the structure of my table the same, because every unique key on a partitioned table must include the partitioning column. In this case, it'd be
PRIMARY KEY(`fldattendanceid`, `fldassigndate`)
Then, there is a TO_DAYS missing from the PARTITION BY RANGE. Lastly, I'm not 100 percent sure, but I don't think you can partition on a nullable field. Even if you can, it'd be awful practice to do so.
CREATE TABLE tblattendancesetup (
fldattendanceid INT(11) NOT NULL AUTO_INCREMENT,
flddept VARCHAR(100) DEFAULT NULL,
fldemployee VARCHAR(100) DEFAULT NULL,
fldintime DATETIME DEFAULT NULL,
fldouttime DATETIME DEFAULT NULL,
fldlateafter DATETIME DEFAULT NULL,
fldearlybefore DATETIME DEFAULT NULL,
fldweekoff VARCHAR(20) DEFAULT NULL,
fldshiftname VARCHAR(20) DEFAULT NULL,
fldassigndate DATETIME NOT NULL,
fldfromdate DATETIME DEFAULT NULL,
fldtodate DATETIME DEFAULT NULL,
fldrefid VARCHAR(20) DEFAULT NULL,
PRIMARY KEY(`fldattendanceid`, `fldassigndate`),
KEY in_attendancesetup (fldemployee,fldintime,fldouttime,fldlateafter,fldearlybefore,fldfromdate,fldtodate,fldattendanceid),
KEY i_emp_tmp (fldemployee),
KEY i_emp_attendance (fldemployee)
)
PARTITION BY RANGE ( TO_DAYS(fldassigndate))
(PARTITION p_Apr VALUES LESS THAN (TO_DAYS('2012-05-01')),
PARTITION p_May VALUES LESS THAN (TO_DAYS('2012-06-01')),
PARTITION p_Nov VALUES LESS THAN MAXVALUE );
I hope this helps someone!

MYSQL: Find and delete similar records - Updated with example

I'm trying to dedup a table where I know there are 'close' (but not exact) rows that need to be removed.
I have a single table with 22 fields, and uniqueness can be established by comparing 5 of those fields. Of the remaining 17 fields (including the unique key), there are 3 fields that cause each row to be unique, meaning the proper dedup method will not work.
I was looking at the multi-table delete method outlined here: http://blog.krisgielen.be/archives/111 but I can't make sense of the final line of code (AND M1.cd*100+M1.track > M2.cd*100+M2.track) as I am unsure what the cd*100 part achieves...
Can anyone assist me with this? I suspect I could do better exporting the whole thing to Python, doing something with it, then re-importing it, but then (1) I'm stuck with knowing how to dedup the strings anyway, and (2) I had to break the records into chunks to import them into MySQL because it was timing out after 300 seconds, so it was a whole debacle to get the data into MySQL in the first place... (I am very novice at both MySQL and Python.)
The table is a dump of some 40 log files from some testing. The test set for each log is some 20,000 files. The repeating values are either the test conditions, the file name/parameters or the results of the tests.
SHOW CREATE TABLE:
CREATE TABLE `t1` (
`DROID_V` int(1) DEFAULT NULL,
`Sig_V` varchar(7) DEFAULT NULL,
`SPEED` varchar(4) DEFAULT NULL,
`ID` varchar(7) DEFAULT NULL,
`PARENT_ID` varchar(10) DEFAULT NULL,
`URI` varchar(10) DEFAULT NULL,
`FILE_PATH` varchar(68) DEFAULT NULL,
`NAME` varchar(17) DEFAULT NULL,
`METHOD` varchar(10) DEFAULT NULL,
`STATUS` varchar(14) DEFAULT NULL,
`SIZE` int(10) DEFAULT NULL,
`TYPE` varchar(10) DEFAULT NULL,
`EXT` varchar(4) DEFAULT NULL,
`LAST_MODIFIED` varchar(10) DEFAULT NULL,
`EXTENSION_MISMATCH` varchar(32) DEFAULT NULL,
`MD5_HASH` varchar(10) DEFAULT NULL,
`FORMAT_COUNT` varchar(10) DEFAULT NULL,
`PUID` varchar(15) DEFAULT NULL,
`MIME_TYPE` varchar(24) DEFAULT NULL,
`FORMAT_NAME` varchar(10) DEFAULT NULL,
`FORMAT_VERSION` varchar(10) DEFAULT NULL,
`INDEX` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`INDEX`)
) ENGINE=MyISAM AUTO_INCREMENT=960831 DEFAULT CHARSET=utf8
The only unique field is the primary key, `INDEX`.
Unique records can be established by looking at DROID_V, Sig_V, SPEED, NAME and PUID.
Of the ~900,000 rows, I have about 10,000 dups that are either a single duplicate of a record or have up to 6 repetitions of the record.
Row examples: As Is
5;"v37";"slow";"10266";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/7";"image/tiff";"Tagged Ima";"3";"191977"
5;"v37";"slow";"10268";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/8";"image/tiff";"Tagged Ima";"4";"191978"
5;"v37";"slow";"10269";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/9";"image/tiff";"Tagged Ima";"5";"191979"
5;"v37";"slow";"10270";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/10";"image/tiff";"Tagged Ima";"6";"191980"
5;"v37";"slow";"12766";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/7";"image/tiff";"Tagged Ima";"3";"193977"
5;"v37";"slow";"12768";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/8";"image/tiff";"Tagged Ima";"4";"193978"
5;"v37";"slow";"12769";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/9";"image/tiff";"Tagged Ima";"5";"193979"
5;"v37";"slow";"12770";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/10";"image/tiff";"Tagged Ima";"6";"193980"
Row Example: As It should be
5;"v37";"slow";"10266";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/7";"image/tiff";"Tagged Ima";"3";"191977"
5;"v37";"slow";"10268";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/8";"image/tiff";"Tagged Ima";"4";"191978"
5;"v37";"slow";"10269";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/9";"image/tiff";"Tagged Ima";"5";"191979"
5;"v37";"slow";"10270";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/10";"image/tiff";"Tagged Ima";"6";"191980"
Please note: you can see from the INDEX column at the end that I have cut out some other rows; I have only identified a very small set of repeating rows. Please let me know if you need any more 'noise' from the rest of the DB.
Thanks.
I figured out a fix. I was using COUNT(*), which just counted everything in the table; by using COUNT(DISTINCT NAME) instead, I am able to weed out the dup rows that fit the dup criteria (as set out by the field selection in the WHERE clause).
Example:
SELECT `PUID`, `DROID_V`, `SIG_V`, `SPEED`, COUNT(DISTINCT NAME) AS Hit
FROM sourcelist, main_small
WHERE sourcelist.SourcePUID = 'MyVariableHere' AND main_small.NAME = sourcelist.SourceFileName
GROUP BY `PUID`, `DROID_V`, `SIG_V`, `SPEED`
ORDER BY `DROID_V` ASC, `SIG_V` ASC, `SPEED`;
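To go from finding the dups to removing them, a hedged sketch of a multi-table delete that keeps the row with the lowest `INDEX` in each group matching on the five uniqueness fields stated above (back up the table first; rows with NULL in any of the five columns will not match the join and are left untouched):
-- A composite index keeps the self-join from being a 900K x 900K scan:
ALTER TABLE t1 ADD INDEX dedup_idx (DROID_V, Sig_V, SPEED, NAME, PUID);
DELETE b
FROM t1 AS a
JOIN t1 AS b
  ON  a.DROID_V = b.DROID_V
  AND a.Sig_V   = b.Sig_V
  AND a.SPEED   = b.SPEED
  AND a.NAME    = b.NAME
  AND a.PUID    = b.PUID
  AND a.`INDEX` < b.`INDEX`;  -- b is a later copy of a, so delete it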