How to optimize a large MySQL table (stock data)

I have a table with following structure,
CREATE TABLE `trading_daily_price` (
`id` int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
`date` date DEFAULT NULL,
`Symbol` varchar(20) DEFAULT NULL,
`Market` varchar(12) DEFAULT NULL,
`QuoteName` text,
`Price` float DEFAULT NULL,
`PriceChange` float DEFAULT NULL,
`PriceChangePct` float DEFAULT NULL,
`Volume` float DEFAULT NULL,
`DayLow` float DEFAULT NULL,
`DayHigh` float DEFAULT NULL,
`Week52Low` float DEFAULT NULL,
`Week52High` float DEFAULT NULL,
`Open` float DEFAULT NULL,
`High` float DEFAULT NULL,
`Bid` float DEFAULT NULL,
`BidSize` float DEFAULT NULL,
`Beta` float DEFAULT NULL,
`PrevClose` float DEFAULT NULL,
`Low` float DEFAULT NULL,
`Ask` float DEFAULT NULL,
`AskSize` float DEFAULT NULL,
`VWAP` float DEFAULT NULL,
`Yield` float DEFAULT NULL,
`Dividend` char(12) DEFAULT NULL,
`DivFrequency` varchar(24) DEFAULT NULL,
`SharesOut` float DEFAULT NULL,
`PERatio` float DEFAULT NULL,
`EPS` float DEFAULT NULL,
`ExDivDate` date DEFAULT NULL,
`MarketCap` float DEFAULT NULL,
`PBRatio` float DEFAULT NULL,
`Exchange` varchar(32) DEFAULT NULL,
`NewsTitle` varchar(1024) DEFAULT NULL,
`NewsSource` varchar(32) DEFAULT NULL,
`NewsPublicationDate` date DEFAULT NULL,
`NewsURL` varchar(256) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I haven't found a way to break it down; in the frontend presentation I need to display all of these columns. I am writing a query like,
SELECT * FROM trading_daily_price WHERE date='SOME_DATE' AND Symbol LIKE '%search_key%' ORDER BY `column` LIMIT 10
The table has millions of records, and new records are added every day. Now the problem is that every query takes a long time to generate output. On a 4GB DigitalOcean VPS with some configuration it runs nicely, but on GoDaddy business hosting it runs very slowly.
I want to know whether it is a better idea to break the columns into multiple tables and use JOIN statements. Will that increase performance, or do I need to follow some other optimization logic?
As suggested by Madhur, I have added an INDEX on each of date, Symbol, and Market. It improves the speed of the query above, but the following query is still taking a long time.
SELECT `date`,`Price` FROM trading_daily_price WHERE `Symbol` = 'GNCP:US' ORDER BY date ASC
Thanks in advance,
Rajib

As suggested by Madhur and JNevill, I found that the only solution is to create multiple indexes as required.
For the first SQL,
SELECT * FROM trading_daily_price WHERE date='SOME_DATE' AND Symbol LIKE '%search_key%' ORDER BY `column` LIMIT 10
we need to create an index as below:
CREATE INDEX index_DCS ON trading_daily_price (`date`, `column`, `Symbol`);
and for the second SQL,
SELECT `date`,`Price` FROM trading_daily_price WHERE `Symbol` = 'GNCP:US' ORDER BY date ASC
we need to create an index as below:
CREATE INDEX index_DPS ON trading_daily_price (`date`, `Price`, `Symbol`);
Thanks

You shouldn't need the (date, column, symbol) index for your first query, because you are searching Symbol with a leading-wildcard LIKE ('%text%') and MySQL can only use the date part of that index. An index on just (date, column) should be better, because MySQL can utilize both of its columns.
For your new query, you will need an index on (Symbol, date, Price). With this index, your query won't need to go back to the clustered index for the data.
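For concreteness, a sketch of that covering index (the index name is arbitrary):
CREATE INDEX idx_symbol_date_price ON trading_daily_price (`Symbol`, `date`, `Price`);
The WHERE filter, the ORDER BY, and both selected columns are then all served from the index alone.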
Whether to split the table depends on your use case: how will you handle old data? If old data won't be accessed frequently, you can consider splitting it, but your application will need to cater for that.

Split up that table.
One table has the open/high/low/close/volume, indexed by stock and date.
Another table provides static information about each stock.
Perhaps another has statistics derived from the raw data.
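A minimal sketch of such a split, with hypothetical table and column names (not taken from your schema):
CREATE TABLE daily_bar (
  symbol VARCHAR(20) NOT NULL,
  `date` DATE NOT NULL,
  open FLOAT, high FLOAT, low FLOAT, close FLOAT,
  volume FLOAT,
  PRIMARY KEY (symbol, `date`)  -- clustered: one stock's history is stored together
) ENGINE=InnoDB;
CREATE TABLE stock_info (
  symbol VARCHAR(20) NOT NULL,
  market VARCHAR(12),
  quote_name TEXT,
  exchange VARCHAR(32),
  PRIMARY KEY (symbol)
) ENGINE=InnoDB;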
Make changes like those, then come back for more advice/abuse.

Related

MySQL partitioning on categorical fields and a timestamp column that is varchar

Currently we have this table:
CREATE TABLE `T_TRANS` (
`CASE_ID` varchar(20) DEFAULT NULL,
`C_ID` varchar(20) DEFAULT NULL,
`C_ST_IND` smallint(6) DEFAULT NULL,
`D_DTTM` int(11) DEFAULT NULL,
`E_ID` varchar(10) DEFAULT NULL,
`E_LONG` decimal(11,7) DEFAULT NULL,
`E_LAT` decimal(9,7) DEFAULT NULL,
`EV_IND` smallint(6) DEFAULT NULL,
`H_B_IND` smallint(6) DEFAULT NULL,
`V_IND` varchar(15) DEFAULT NULL,
`I_IND` smallint(6) DEFAULT NULL,
`I_P_IND` smallint(6) DEFAULT NULL,
`I_S_IND` smallint(6) DEFAULT NULL,
`IS_D_IND` smallint(6) DEFAULT NULL,
`IS_R_IND` smallint(6) DEFAULT NULL,
`L_IND` smallint(6) DEFAULT NULL,
`D_LONG` decimal(11,7) DEFAULT NULL,
`D_LAT` decimal(9,7) DEFAULT NULL,
`L_P_C_DTTM` int(11) DEFAULT NULL,
`L_T_E_DTTM` int(11) DEFAULT NULL,
`M_IND` varchar(20) DEFAULT NULL,
`N_D_COUNTER` smallint(6) DEFAULT NULL,
`O_ID` smallint(6) NOT NULL,
`P_ID` varchar(50) DEFAULT NULL,
`R_E_IND` smallint(6) DEFAULT NULL,
`R_IND` smallint(6) DEFAULT NULL,
`S_C_DTTM` varchar(20) DEFAULT NULL,
`S_IND` smallint(6) DEFAULT NULL,
`T_T_RED` varchar(20) DEFAULT NULL,
`U_D` int(11) DEFAULT NULL,
`V_D` int(11) DEFAULT NULL,
`CRT_USR_NAM` varchar(45) DEFAULT NULL,
`CRT_DTTM` varchar(45) DEFAULT NULL,
`UPD_USR_NAM` varchar(45) DEFAULT NULL,
`UPD_DTTM` varchar(45) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
My WHERE clauses will filter on the following columns, individually or in combination:
C_ST_IND values range from (0,1,2,3,4,5,6,7,8,9,10,11,12)
E_IND values range from (0,1,2,3,4,5,6,7)
R_IND Values range from (0,1)
R_E_IND Values range from (0,1)
L_IND Values range from (0,1)
IS_D_IND Values range from (0,1)
I_S_IND Values range from (0,1)
I_P_IND Values range from (0,1)
I_IND Values range from (0,1)
S_IND Values range from (0,1,2,3)
H_B_IND Values range from (0,1)
O_ID Values range from (1,2,3,4,5,6)
Also, my date columns (CRT_DTTM and UPD_DTTM) are varchar with the format '2019-01-25 01:01:59'.
On average, the daily load will be:
CRT_DTTM Count
2019-01-20 656601
2019-01-21 686018
2019-01-22 668486
2019-01-23 680922
2019-01-24 693700
This table now has millions of records and is currently in production, without any partitions or indexes.
Any query takes a lot of time to run.
Now I need to create partitions/indexes. I tried partitioning the existing table, but it takes forever to run.
What are the best partitioning methods for the columns listed above (frequently used in WHERE clauses) and for the date columns (CRT_DTTM and UPD_DTTM), for year, month, week, and day partitions?
Also, any indexes?
This table will hold three years of data. Right now we have three months of data.
How do I move my current table to a new partitioned table? I am new to MySQL; any information would help reduce production query run time and report generation time.
PARTITIONs do not intrinsically provide any performance benefit. Let's see the queries so we can judge whether you have one of the rare cases where they help, such as purging 'old' data.
Suggest you shrink the data -- SMALLINT takes 2 bytes; TINYINT UNSIGNED takes 1 byte and can easily hold all those small values you mention. Seven decimal places for lat/lng gives you precision of under 16mm, less than one inch. Do you need that much precision? Consider DECIMAL(8,6) for latitude and DECIMAL(9,6) for longitude; that will save 3 bytes for each pair. (Hmmm... Why are there two pairs?)
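A sketch of that shrinking (the column list is illustrative, not exhaustive):
ALTER TABLE T_TRANS
  MODIFY C_ST_IND TINYINT UNSIGNED,
  MODIFY O_ID TINYINT UNSIGNED NOT NULL,
  MODIFY E_LAT DECIMAL(8,6),
  MODIFY E_LONG DECIMAL(9,6),
  MODIFY D_LAT DECIMAL(8,6),
  MODIFY D_LONG DECIMAL(9,6);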
"A long time to run 'any' query"? Let's see some of them and work on optimizing them. The usual problem is that you need to touch lots of rows. Shrinking the rows (as mentioned above) will help some. But the big improvement comes with not touching as many rows.
This smells like a Data Warehouse application? If so, perhaps building and maintaining Summary tables is the way to go. See http://mysql.rjweb.org/doc.php/summarytables . Show me some more info, and I will help you.
Do you intend to purge data after 3 years? If so, I recommend partitioning by month and having 38 partitions. Details here: http://mysql.rjweb.org/doc.php/partitionmaint . With that, the 680K-row nightly DELETE becomes a much quicker DROP PARTITION. (Meanwhile, there is probably no benefit to the performance of queries.)
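A sketch of that monthly partitioning, assuming CRT_DTTM has first been converted from VARCHAR to DATETIME (a VARCHAR date cannot be used in the partitioning function):
ALTER TABLE T_TRANS
  PARTITION BY RANGE (TO_DAYS(CRT_DTTM)) (
    PARTITION p201901 VALUES LESS THAN (TO_DAYS('2019-02-01')),
    PARTITION p201902 VALUES LESS THAN (TO_DAYS('2019-03-01')),
    -- ... one partition per month, 38 in all ...
    PARTITION pfuture VALUES LESS THAN MAXVALUE
  );
-- Purging a month then becomes:
ALTER TABLE T_TRANS DROP PARTITION p201901;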
My Index Cookbook: http://mysql.rjweb.org/doc.php/index_cookbook_mysql

How can I define a column as tinyint(4) using the Sequelize ORM

I need to define my table as
CREATE TABLE `test` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`o_id` int(11) unsigned NOT NULL,
`m_name` varchar(45) NOT NULL,
`o_name` varchar(45) NOT NULL,
`customer_id` int(11) unsigned NOT NULL,
`client_id` tinyint(4) unsigned DEFAULT '1',
`set_id` tinyint(4) unsigned DEFAULT NULL,
`s1` tinyint(4) unsigned DEFAULT NULL,
`s2` tinyint(4) unsigned DEFAULT NULL,
`s3` tinyint(4) unsigned DEFAULT NULL,
`review` varchar(2045) DEFAULT NULL,
`created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
UNIQUE KEY `br_o_id_idx` (`o_id`),
KEY `br_date_idx` (`created_at`),
KEY `br_on_idx` (`o_name`),
KEY `br_mn_idx` (`m_name`)
)
but looking at the Sequelize documentation, it does not have support for TINYINT with a display width.
From the lack of ZEROFILL in your table definition, I suspect tinyint(4) probably does not do what you think it does. From 11.2.5 Numeric Type Attributes:
MySQL supports an extension for optionally specifying the display width of integer data types in parentheses following the base keyword for the type.
...
The display width does not constrain the range of values that can be stored in the column.
...
For example, a column specified as SMALLINT(3) has the usual SMALLINT range of -32768 to 32767, and values outside the range permitted by three digits are displayed in full using more than three digits.
I'm not sure if other RDBMSs treat the number in parens differently, but from perusing the sequelize source it looks like they're under the same, incorrect, impression you are.
That being said, the part of your schema that actually matters, storing those fields as TINYINTs (using only a byte of storage to contain values between 0 and 255), is sadly not available in the Sequelize DataTypes. I might suggest opening a PR to add it...
On the other hand, if you really are looking for the ZEROFILL functionality, and need to specify that display width of 4, you could do something like Sequelize.INTEGER(4).ZEROFILL, but obviously, that would be pretty wasteful of space in your DB.
For MySQL, the Sequelize.BOOLEAN data type maps to TINYINT(1). See
https://github.com/sequelize/sequelize/blob/3e5b8772ef75169685fc96024366bca9958fee63/lib/data-types.js#L397
and
http://docs.sequelizejs.com/en/v3/api/datatypes/
As noted by @user866762, the number in parentheses only affects how the data is displayed, not how it is stored. So, TINYINT(1) vs. TINYINT(4) should have no effect on your data.
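A quick sketch to convince yourself of this (the table name is made up):
CREATE TABLE width_demo (
  a TINYINT(1) UNSIGNED,
  b TINYINT(4) UNSIGNED
);
INSERT INTO width_demo VALUES (200, 200); -- both succeed: each column holds 0-255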

MySQL table design / architecture, table is too big

I have a MySQL DB that contains a lot of text. I'm fetching data from a website and inserting it into a table.
I'm using a 100GB SSD for the DB and I'm out of space. I think something in the table structure made it too big. I can't predict the size of all the columns, so I'm using varchar/text/mediumtext for most of the fields. When I insert the data into the DB I monitor the errors, and when I see that a certain field is too small for the data I'm trying to insert, I increase the size of the field (e.g. from varchar(1000) to varchar(2000)).
So far I have about 1.8M rows; I think I'm doing something wrong.
here is the structure of my table -
CREATE TABLE `PT` (
`patID` int(11) NOT NULL,
`Title` varchar(450) DEFAULT NULL,
`IssueDate` date DEFAULT NULL,
`NoFullText` tinyint(1) DEFAULT NULL,
`Abstract` text,
`ForeignReferences` varchar(15000) DEFAULT NULL,
`CurrentUSClass` varchar(2200) DEFAULT NULL,
`OtherReferences` mediumtext,
`ForeignPrio` varchar(900) DEFAULT NULL,
`CurrentIntlClass` varchar(3000) DEFAULT NULL,
`AppNum` varchar(45) DEFAULT NULL,
`AppDate` date DEFAULT NULL,
`Assignee` varchar(300) DEFAULT NULL,
`Inventors` varchar(1500) DEFAULT NULL,
`RelatedUSAppData` text,
`PrimaryExaminer` varchar(100) DEFAULT NULL,
`AssistantExaminer` varchar(100) DEFAULT NULL,
`AttorneyOrAgent` varchar(300) DEFAULT NULL,
`ReferencedBy` text,
`AssigneeName` varchar(150) DEFAULT NULL,
`AssigneeState` varchar(80) DEFAULT NULL,
`AssigneeCity` varchar(150) DEFAULT NULL,
`InventorsName` varchar(800) DEFAULT NULL,
`InventorsState` varchar(300) DEFAULT NULL,
`InventorsCity` varchar(800) DEFAULT NULL,
`Claims` mediumtext,
`Description` mediumtext,
`InsertionTime` datetime NOT NULL,
`LastUpdatedOn` datetime NOT NULL,
PRIMARY KEY (`patID`),
UNIQUE KEY `patID_UNIQUE` (`patID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
What should I do? I have about 20% of the data (which means I'm going to need ~350GB of space). What is the performance impact here? Should I divide the table into several tables over several HDs? I'm going to use Sphinx to index and query the data in the end.
All of the non-TEXT column values are stored in one 8KB record (undivided unit of space on your HDD). TEXT column values are stored as pointers to external blocks of data.
These kinds of structures (very text oriented) are better handled by NOSQL (Not Only SQL) databases like MongoDB.
But I suspect that there are a lot of things you could do regarding how to handle & structure your data in order to avoid saving huge chunks of text.
The process of structuring a database to avoid repetitious information and to allow for easy updates (update in one place - visible everywhere) is called normalization.
If the data you're storing in those big VARCHARs (e.g. Inventors, length 1500) is structured as multiple elements of data (e.g. names of inventors separated by commas), then you can restructure your DB table by creating an inventors table and referencing it.
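A minimal sketch of that normalization (table and column names are hypothetical):
CREATE TABLE inventor (
  inventorID INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name VARCHAR(100) NOT NULL,
  city VARCHAR(150) DEFAULT NULL,
  state VARCHAR(80) DEFAULT NULL,
  PRIMARY KEY (inventorID)
);
CREATE TABLE patent_inventor (
  patID INT NOT NULL,               -- references PT.patID
  inventorID INT UNSIGNED NOT NULL, -- references inventor.inventorID
  PRIMARY KEY (patID, inventorID)
);
Each inventor is then stored once, and the comma-separated VARCHAR columns disappear from PT.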

MySQL index creation strategy and inner logic

This question expects a generic answer to the broad problem of index creation on a MySQL database.
Let's take this table example :
CREATE TABLE IF NOT EXISTS `article` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`published` tinyint(1) NOT NULL DEFAULT '0',
`author_id` int(11) unsigned NOT NULL,
`modificator_id` int(11) unsigned DEFAULT NULL,
`category_id` int(11) unsigned DEFAULT NULL,
`title` varchar(200) COLLATE utf8_unicode_ci NOT NULL,
`headline` text COLLATE utf8_unicode_ci NOT NULL,
`content` text COLLATE utf8_unicode_ci NOT NULL,
`url_alias` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
`priority` mediumint(11) unsigned NOT NULL DEFAULT '50',
`publication_date` datetime NOT NULL,
`creation_date` datetime NOT NULL,
`modification_date` datetime NOT NULL,
PRIMARY KEY (`id`)
);
Over such a sample there is a wide range of queries that could be performed on different criteria:
category_id
published
publication_date
e.g.:
SELECT id FROM article WHERE NOT published AND category_id = '2' ORDER BY publication_date;
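For instance, one candidate composite index for this query might be the following (writing the filter as published = 0, so that both equality columns can use the index before the ORDER BY column):
CREATE INDEX idx_pub_cat_date ON article (published, category_id, publication_date);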
On many tables you can see a wide range of state fields (like published here), date fields, or reference fields (like author_id or category_id). What strategy should be picked to create indexes?
This can be developed under the following points:
Should we make an index on every field that can be used in a query (either as a WHERE condition or an ORDER BY), even if this can lead to having a lot of indexes per table?
Should we also make an index on fields that have only a small set of values, like booleans or enums? This only reduces the scope of the scan by a factor of n (assuming n distinct values, each used equally often).
I've read that MySQL prior to 5.0 used only one index per query; how does the system pick it? By choosing the most restrictive one?
How is an OR condition processed?
How much is this going to slow down inserts?
Does InnoDB vs. MyISAM change anything about this problem?
I know the EXPLAIN statement can be used to find out whether a query is optimized or not, but a bit of concrete theoretical material would really be more constructive than a purely empirical approach!

MySQL: Find and delete similar records - updated with example

I'm trying to dedup a table where I know there are 'close' (but not exact) rows that need to be removed.
I have a single table with 22 fields, and uniqueness can be established by comparing 5 of those fields. Of the remaining 17 fields (including the unique key), there are 3 fields that cause each row to be unique, meaning the usual dedup methods will not work.
I was looking at the multi-table delete method outlined here: http://blog.krisgielen.be/archives/111 but I can't make sense of the final line of code (AND M1.cd*100+M1.track > M2.cd*100+M2.track), as I am unsure what the cd*100 part achieves...
Can anyone assist me with this? I suspect I could do better by exporting the whole thing to Python, doing something with it, then re-importing it, but then (1) I'm stuck with knowing how to dedup the strings anyway, and (2) I had to break the data into chunks just to import it into MySQL, as it was timing out after 300 seconds, so it turned into a whole debacle to get it into MySQL in the first place... (I am very much a novice at both MySQL and Python.)
The table is a dump of some 40 log files from some testing. The test set for each log is some 20,000 files. The repeating values are either the test conditions, the file name/parameters or the results of the tests.
SHOW CREATE TABLE:
CREATE TABLE `t1` (
`DROID_V` int(1) DEFAULT NULL,
`Sig_V` varchar(7) DEFAULT NULL,
`SPEED` varchar(4) DEFAULT NULL,
`ID` varchar(7) DEFAULT NULL,
`PARENT_ID` varchar(10) DEFAULT NULL,
`URI` varchar(10) DEFAULT NULL,
`FILE_PATH` varchar(68) DEFAULT NULL,
`NAME` varchar(17) DEFAULT NULL,
`METHOD` varchar(10) DEFAULT NULL,
`STATUS` varchar(14) DEFAULT NULL,
`SIZE` int(10) DEFAULT NULL,
`TYPE` varchar(10) DEFAULT NULL,
`EXT` varchar(4) DEFAULT NULL,
`LAST_MODIFIED` varchar(10) DEFAULT NULL,
`EXTENSION_MISMATCH` varchar(32) DEFAULT NULL,
`MD5_HASH` varchar(10) DEFAULT NULL,
`FORMAT_COUNT` varchar(10) DEFAULT NULL,
`PUID` varchar(15) DEFAULT NULL,
`MIME_TYPE` varchar(24) DEFAULT NULL,
`FORMAT_NAME` varchar(10) DEFAULT NULL,
`FORMAT_VERSION` varchar(10) DEFAULT NULL,
`INDEX` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`INDEX`)
) ENGINE=MyISAM AUTO_INCREMENT=960831 DEFAULT CHARSET=utf8
The only unique field is the primary key, `INDEX`.
Unique records can be established by looking at DROID_V, Sig_V, SPEED, NAME, and PUID.
Of the ~900,000 rows, I have about 10,000 dups that are either a single duplicate of a record or have up to 6 repetitions of the record.
Row examples: As Is
5;"v37";"slow";"10266";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/7";"image/tiff";"Tagged Ima";"3";"191977"
5;"v37";"slow";"10268";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/8";"image/tiff";"Tagged Ima";"4";"191978"
5;"v37";"slow";"10269";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/9";"image/tiff";"Tagged Ima";"5";"191979"
5;"v37";"slow";"10270";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/10";"image/tiff";"Tagged Ima";"6";"191980"
5;"v37";"slow";"12766";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/7";"image/tiff";"Tagged Ima";"3";"193977"
5;"v37";"slow";"12768";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/8";"image/tiff";"Tagged Ima";"4";"193978"
5;"v37";"slow";"12769";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/9";"image/tiff";"Tagged Ima";"5";"193979"
5;"v37";"slow";"12770";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/10";"image/tiff";"Tagged Ima";"6";"193980"
Row Example: As It should be
5;"v37";"slow";"10266";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/7";"image/tiff";"Tagged Ima";"3";"191977"
5;"v37";"slow";"10268";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/8";"image/tiff";"Tagged Ima";"4";"191978"
5;"v37";"slow";"10269";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/9";"image/tiff";"Tagged Ima";"5";"191979"
5;"v37";"slow";"10270";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/10";"image/tiff";"Tagged Ima";"6";"191980"
Please note, you can see from the index column at the end that I have cut out some other rows; I have only identified a very small set of the repeating rows. Please let me know if you need any more 'noise' from the rest of the DB.
Thanks.
I figured out a fix. I had been using COUNT(*), which just counted everything in the table; by using COUNT(DISTINCT NAME) I am able to weed out the dup rows that fit the dup criteria (as set out by the field selection in the WHERE clause).
Example:
SELECT `PUID`, `DROID_V`, `SIG_V`, `SPEED`, COUNT(DISTINCT NAME) AS Hit FROM sourcelist, main_small
WHERE sourcelist.SourcePUID = 'MyVariableHere' AND main_small.NAME = sourcelist.SourceFileName
GROUP BY `PUID`, `DROID_V`, `SIG_V`, `SPEED` ORDER BY `DROID_V` ASC, `SIG_V` ASC, `SPEED`;
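For the record, a sketch of a multi-table delete that would remove the later copies directly in MySQL, keeping the row with the lowest `INDEX` among rows agreeing on the five uniqueness fields (untested against the full data; <=> is the NULL-safe equality operator, used because these columns allow NULL):
DELETE dup
FROM t1 AS keeper
JOIN t1 AS dup
  ON  dup.DROID_V <=> keeper.DROID_V
  AND dup.Sig_V   <=> keeper.Sig_V
  AND dup.SPEED   <=> keeper.SPEED
  AND dup.NAME    <=> keeper.NAME
  AND dup.PUID    <=> keeper.PUID
  AND dup.`INDEX` > keeper.`INDEX`;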