I have a MySQL database that contains a lot of text: I'm fetching data from a website and inserting it into a table.
The database lives on a 100 GB SSD and I'm out of space. I suspect something in the table structure has made it too big. I can't predict the size of all the columns, so I use VARCHAR, TEXT, or MEDIUMTEXT for most of the fields. While inserting the data I monitor the errors, and whenever a field turns out to be too small for the value I'm inserting, I enlarge it (e.g. from varchar(1000) to varchar(2000)).
So far I have about 1.8M rows, and I think I'm doing something wrong.
Here is the structure of my table:
CREATE TABLE `PT` (
`patID` int(11) NOT NULL,
`Title` varchar(450) DEFAULT NULL,
`IssueDate` date DEFAULT NULL,
`NoFullText` tinyint(1) DEFAULT NULL,
`Abstract` text,
`ForeignReferences` varchar(15000) DEFAULT NULL,
`CurrentUSClass` varchar(2200) DEFAULT NULL,
`OtherReferences` mediumtext,
`ForeignPrio` varchar(900) DEFAULT NULL,
`CurrentIntlClass` varchar(3000) DEFAULT NULL,
`AppNum` varchar(45) DEFAULT NULL,
`AppDate` date DEFAULT NULL,
`Assignee` varchar(300) DEFAULT NULL,
`Inventors` varchar(1500) DEFAULT NULL,
`RelatedUSAppData` text,
`PrimaryExaminer` varchar(100) DEFAULT NULL,
`AssistantExaminer` varchar(100) DEFAULT NULL,
`AttorneyOrAgent` varchar(300) DEFAULT NULL,
`ReferencedBy` text,
`AssigneeName` varchar(150) DEFAULT NULL,
`AssigneeState` varchar(80) DEFAULT NULL,
`AssigneeCity` varchar(150) DEFAULT NULL,
`InventorsName` varchar(800) DEFAULT NULL,
`InventorsState` varchar(300) DEFAULT NULL,
`InventorsCity` varchar(800) DEFAULT NULL,
`Claims` mediumtext,
`Description` mediumtext,
`InsertionTime` datetime NOT NULL,
`LastUpdatedOn` datetime NOT NULL,
PRIMARY KEY (`patID`),
UNIQUE KEY `patID_UNIQUE` (`patID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
What should I do? I have only about 20% of the data so far, which means I'm going to need roughly 350 GB of space. What is the performance impact here? Should I divide the table into several tables over several drives? I'm going to use Sphinx to index and query the data in the end.
All of the non-TEXT column values are stored together in the row itself (an undivided unit of space on your disk), while TEXT column values can be stored as pointers to external blocks of data.
These kinds of structures (very text-oriented) are often better handled by NoSQL ("Not Only SQL") databases like MongoDB.
But I suspect there are a lot of things you could do in how you handle and structure your data in order to avoid saving huge chunks of text.
The process of structuring a database to avoid repetitious information and to allow for easy updates (update in one place - visible everywhere) is called normalization.
If the data you're storing in those big VARCHARs (e.g. Inventors, length 1500) is actually structured as multiple elements of data (e.g. names of inventors separated by a comma), then you can restructure your table by creating an inventors table and referencing it.
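For example, a minimal sketch of that restructuring, assuming the Inventors field holds comma-separated names (table and column names here are illustrative, not prescribed):
CREATE TABLE `inventor` (
  `inventorID` int(11) NOT NULL AUTO_INCREMENT,
  `Name` varchar(200) NOT NULL,
  `City` varchar(100) DEFAULT NULL,
  `State` varchar(50) DEFAULT NULL,
  PRIMARY KEY (`inventorID`)
) ENGINE=InnoDB;

-- Many-to-many link: one row per (patent, inventor) pair instead of one long VARCHAR
CREATE TABLE `patent_inventor` (
  `patID` int(11) NOT NULL,        -- references PT.patID
  `inventorID` int(11) NOT NULL,   -- references inventor.inventorID
  PRIMARY KEY (`patID`, `inventorID`)
) ENGINE=InnoDB;
Each inventor's name is then stored only once, and the per-patent row no longer needs the 1500-character column.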
Related
I have a MySQL query
SELECT * FROM table WHERE INET_ATON("10.0.0.1") BETWEEN INET_ATON(s_ip) AND INET_ATON(e_ip);
Here "10.0.0.1" comes dynamically when a user visits the website and s_ip is the starting ip address column which would probably have "10.0.0.0" as starting ip address range and e_ip is the ending IP address.
Now, the problem is I have almost ~350K records which do only one thing when this query is executed and that is to get me the country code of the visitor.
When this query is executed MySQL peaks CPU consumption at 1100% and multiply that by 1000 requests/minute and my server just cannot handle it.
My server is running CentOS 7 with 100 GB of RAM and 24 Cores clocked at 3.0 GHz but still the performance is becoming a nightmare for me to handle.
I was thinking of outsourcing this functionality to third party service but I just want to make sure that nothing can be done from my side to fix this issue.
(From Comments)
CREATE TABLE `ip` (
ip_ip varbinary(16) NOT NULL,
ip_last_request_time timestamp(3) NULL DEFAULT NULL,
ip_min_timeSpan_get smallint(5) unsigned NOT NULL,
ip_min_timeSpan_post smallint(5) unsigned NOT NULL,
ip_violationsCount_get smallint(5) unsigned NOT NULL,
ip_violationsCount_post smallint(5) unsigned NOT NULL,
ip_maxViolations_get smallint(5) unsigned NOT NULL,
ip_maxViolations_post smallint(5) unsigned NOT NULL,
ip_bannedAt timestamp(3) NULL DEFAULT NULL,
ip_banSeconds mediumint(8) unsigned NOT NULL DEFAULT '300',
ip_isCapatchaResolved tinyint(1) NOT NULL DEFAULT '0',
ip_isManualBanned tinyint(1) NOT NULL DEFAULT '0',
ip_city varchar(45) DEFAULT '',
ip_region varchar(45) DEFAULT '',
ip_regionCode varchar(5) DEFAULT '',
ip_regionName varchar(45) DEFAULT '',
ip_countryCode varchar(3) DEFAULT '',
ip_countryName varchar(45) DEFAULT '',
ip_continentCode varchar(3) DEFAULT '',
ip_continentName varchar(45) DEFAULT '',
ip_timezone varchar(45) DEFAULT '',
ip_currencyCode varchar(4) DEFAULT '',
ip_currencySymbol_UTF8 varchar(5) DEFAULT '',
PRIMARY KEY (ip_ip),
KEY countryCode_index (ip_countryCode)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
CREATE TABLE `country` (
co_id char(2) COLLATE utf8mb4_unicode_ci NOT NULL,
co_re_id smallint(6) DEFAULT NULL,
co_flag_id char(4) COLLATE utf8mb4_unicode_ci NOT NULL,
co_english_name varchar(40) COLLATE utf8mb4_unicode_ci NOT NULL,
PRIMARY KEY (co_id),
KEY fk_country_region1_idx (co_re_id),
CONSTRAINT fk_country_region1 FOREIGN KEY (co_re_id)
REFERENCES region (re_id) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
Currently you're doing a full table scan for every query. There are a couple of things you can try (a sketch follows the list).
Store INET_ATON(s_ip) in the table so it's not computed during the query. Same for e_ip.
Add an index that has these two new columns, and the country code.
Change the query to select only the country code, and use the two new columns.
Use EXPLAIN to make sure the DB uses the index for the query.
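A sketch of those steps, assuming the ranges live in a table called ip_ranges with s_ip, e_ip, and country_code columns (the question does not show that table, so the names are assumptions):
ALTER TABLE `ip_ranges`
  ADD COLUMN `s_ip_num` int(10) unsigned NOT NULL DEFAULT 0,
  ADD COLUMN `e_ip_num` int(10) unsigned NOT NULL DEFAULT 0;

-- Precompute the numeric forms once instead of on every request
UPDATE `ip_ranges` SET `s_ip_num` = INET_ATON(`s_ip`), `e_ip_num` = INET_ATON(`e_ip`);

CREATE INDEX `ix_range_country` ON `ip_ranges` (`s_ip_num`, `e_ip_num`, `country_code`);

-- Select only the country code and compare the stored integers directly;
-- EXPLAIN shows whether the new index is actually used
EXPLAIN SELECT `country_code`
FROM `ip_ranges`
WHERE INET_ATON('10.0.0.1') BETWEEN `s_ip_num` AND `e_ip_num`;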
The optimizer does not know that you have a set of non-overlapping ranges that it could optimize around, so you have to work harder to optimize the queries.
Instead of doing table scans, the code described here will do typical queries 'instantly'.
To put it bluntly, you cannot optimize the query without restructuring the data. I'm speaking also to all who have provided Answers and Comments.
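A common shape for that restructuring (a sketch of the general technique, not necessarily the exact code the answer refers to) is to keep one row per non-overlapping range, keyed on the range start, so a single indexed lookup finds the only candidate row:
CREATE TABLE `ip_country` (
  `ip_start` int(10) unsigned NOT NULL,   -- INET_ATON of the first address in the range
  `ip_end` int(10) unsigned NOT NULL,
  `country_code` char(2) NOT NULL,
  PRIMARY KEY (`ip_start`)
) ENGINE=InnoDB;

-- Largest ip_start not exceeding the visitor's address; if that row's ip_end is
-- below the address, the address falls in a gap (verify in application code)
SELECT `country_code`, `ip_end`
FROM `ip_country`
WHERE `ip_start` <= INET_ATON('10.0.0.1')
ORDER BY `ip_start` DESC
LIMIT 1;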
(critique of schema)
The ip table is awfully bulky. I suggest moving ip_city and all the fields after it to another table in order to 'normalize' that data (see the sketch after these points).
It is 'wrong' to have both a ..code and ..name in the same table (except for the normalization table).
Several fields can (and should) be ascii, not utf8mb4. Example: countryCode.
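A minimal sketch of that normalization (illustrative names; the exact split is up to you):
-- The repeated location/currency strings move to a small lookup table
CREATE TABLE `location` (
  `loc_id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
  `loc_city` varchar(45) NOT NULL DEFAULT '',
  `loc_region` varchar(45) NOT NULL DEFAULT '',
  `loc_countryCode` char(2) CHARACTER SET ascii NOT NULL DEFAULT '',  -- ascii, not utf8mb4
  `loc_timezone` varchar(45) CHARACTER SET ascii NOT NULL DEFAULT '',
  PRIMARY KEY (`loc_id`)
) ENGINE=InnoDB;

-- `ip` then keeps only the per-address counters plus a short reference;
-- the bulky city/region/country/currency columns can be dropped afterwards
ALTER TABLE `ip` ADD COLUMN `ip_loc_id` smallint(5) unsigned NULL;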
On another topic... How will you handle AOL IP addresses? As I understand it, these are shared among its customers. That is, a "violator" will move around, tainting all of the AOL IPs.
Addresses in the private (RFC 1918) ranges 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16 all come from behind a NAT and cannot be associated with a given country, nor with a given computer.
I have a table with the following structure:
CREATE TABLE `trading_daily_price` (
`id` int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
`date` date DEFAULT NULL,
`Symbol` varchar(20) DEFAULT NULL,
`Market` varchar(12) DEFAULT NULL,
`QuoteName` text,
`Price` float DEFAULT NULL,
`PriceChange` float DEFAULT NULL,
`PriceChangePct` float DEFAULT NULL,
`Volume` float DEFAULT NULL,
`DayLow` float DEFAULT NULL,
`DayHigh` float DEFAULT NULL,
`Week52Low` float DEFAULT NULL,
`Week52High` float DEFAULT NULL,
`Open` float DEFAULT NULL,
`High` float DEFAULT NULL,
`Bid` float DEFAULT NULL,
`BidSize` float DEFAULT NULL,
`Beta` float DEFAULT NULL,
`PrevClose` float DEFAULT NULL,
`Low` float DEFAULT NULL,
`Ask` float DEFAULT NULL,
`AskSize` float DEFAULT NULL,
`VWAP` float DEFAULT NULL,
`Yield` float DEFAULT NULL,
`Dividend` char(12) DEFAULT NULL,
`DivFrequency` varchar(24) DEFAULT NULL,
`SharesOut` float DEFAULT NULL,
`PERatio` float DEFAULT NULL,
`EPS` float DEFAULT NULL,
`ExDivDate` date DEFAULT NULL,
`MarketCap` float DEFAULT NULL,
`PBRatio` float DEFAULT NULL,
`Exchange` varchar(32) DEFAULT NULL,
`NewsTitle` varchar(1024) DEFAULT NULL,
`NewsSource` varchar(32) DEFAULT NULL,
`NewsPublicationDate` date DEFAULT NULL,
`NewsURL` varchar(256) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I haven't found a good way to break it down; for the frontend presentation I need to display all of these columns. I am writing a query like:
SELECT * FROM trading_daily_price WHERE date='SOME_DATE' AND Symbol='%search_key%' ORDER BY 'column' LIMIT 10
The table has millions of records, and new records are added every day. Now the problem is that every query takes a long time to generate output. On a 4 GB DigitalOcean VPS with some tuning it runs nicely, but on GoDaddy business hosting it runs very slowly.
I want to know whether it is a better idea to break the columns into multiple tables and use JOIN statements. Will that increase performance, or do I need to follow some other optimization approach?
As suggested by Madhur, I have added an INDEX on date, Symbol, and Market. It improves the speed of the query above, but the following query is still taking a long time:
SELECT `date`,`Price` FROM trading_daily_price WHERE `Symbol` = 'GNCP:US' ORDER BY date ASC
Thanks in advance,
Rajib
As suggested by Madhur and JNevill, I found that the only solution is to create multiple indexes as required.
For the first query,
SELECT * FROM trading_daily_price WHERE date='SOME_DATE' AND Symbol='%search_key%' ORDER BY 'column' LIMIT 10
we need to create an index as below:
CREATE INDEX index_DCS ON trading_daily_price (`date`,column, symbol);
and for the second query,
SELECT `date`,`Price` FROM trading_daily_price WHERE `Symbol` = 'GNCP:US' ORDER BY date ASC
we need to create an index as below:
CREATE INDEX index_DPS ON trading_daily_price (`date`,Price, symbol);
Thanks
You shouldn't need the (date, symbol, column) index for your first query, because you are searching Symbol with a %text% pattern and MySQL can only use the date part of that index. An index on (date, column) should be better, because MySQL can then utilize both columns from the index.
For your new query, you will need an index on (Symbol, date, Price). With this index, your query won't need to go back to the clustered index for the data.
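A sketch of that covering index and the check (one possible column order):
-- Covers the WHERE on Symbol, the ORDER BY on date, and the selected Price,
-- so InnoDB can answer the query from the index alone
CREATE INDEX `ix_symbol_date_price` ON trading_daily_price (`Symbol`, `date`, `Price`);

EXPLAIN SELECT `date`, `Price`
FROM trading_daily_price
WHERE `Symbol` = 'GNCP:US'
ORDER BY `date` ASC;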
Whether to split the table depends on your use case, in particular how you will handle old data. If old data won't be accessed frequently, you can consider splitting it out, but your application needs to cater for that.
Split up that table.
One table has the open/high/low/close/volume, indexed by stock and date.
Another table provides static information about each stock.
Perhaps another has statistics derived from the raw data.
Make changes like those, then come back for more advice/abuse.
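For example, one way that split might look (a sketch only; table and column names are illustrative, not taken from the answer):
-- Per-day market data, keyed by stock and date
CREATE TABLE `daily_quote` (
  `symbol_id` int(10) unsigned NOT NULL,
  `date` date NOT NULL,
  `Open` float DEFAULT NULL,
  `High` float DEFAULT NULL,
  `Low` float DEFAULT NULL,
  `PrevClose` float DEFAULT NULL,
  `Volume` float DEFAULT NULL,
  PRIMARY KEY (`symbol_id`, `date`)
) ENGINE=InnoDB;

-- Static (or rarely changing) facts about each stock
CREATE TABLE `stock` (
  `symbol_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `Symbol` varchar(20) NOT NULL,
  `Market` varchar(12) DEFAULT NULL,
  `Exchange` varchar(32) DEFAULT NULL,
  `QuoteName` text,
  PRIMARY KEY (`symbol_id`),
  UNIQUE KEY `ux_symbol` (`Symbol`)
) ENGINE=InnoDB;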
I need to define my table as
CREATE TABLE `test` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`o_id` int(11) unsigned NOT NULL,
`m_name` varchar(45) NOT NULL,
`o_name` varchar(45) NOT NULL,
`customer_id` int(11) unsigned NOT NULL,
`client_id` tinyint(4) unsigned DEFAULT '1',
`set_id` tinyint(4) unsigned DEFAULT NULL,
`s1` tinyint(4) unsigned DEFAULT NULL,
`s2` tinyint(4) unsigned DEFAULT NULL,
`s3` tinyint(4) unsigned DEFAULT NULL,
`review` varchar(2045) DEFAULT NULL,
`created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
UNIQUE KEY `br_o_id_idx` (`o_id`),
KEY `br_date_idx` (`created_at`),
KEY `br_on_idx` (`o_name`),
KEY `br_mn_idx` (`m_name`)
)
But as I look at the Sequelize documentation, it does not have support for TINYINT with a display width.
From the lack of ZEROFILL in your table definition, I suspect tinyint(4) probably does not do what you think it does. From 11.2.5 Numeric Type Attributes:
MySQL supports an extension for optionally specifying the display width of integer data types in parentheses following the base keyword for the type.
...
The display width does not constrain the range of values that can be stored in the column.
...
For example, a column specified as SMALLINT(3) has the usual SMALLINT range of -32768 to 32767, and values outside the range permitted by three digits are displayed in full using more than three digits.
I'm not sure if other RDBMSs treat the number in parens differently, but from perusing the sequelize source it looks like they're under the same, incorrect, impression you are.
That being said, the important part of your schema, being that you want to store those fields as TINYINTs (using only a byte of storage to contain values between 0-255), is sadly not available in the Sequelize DataTypes. I might suggest opening a PR to add it...
On the other hand, if you really are looking for the ZEROFILL functionality, and need to specify that display width of 4, you could do something like Sequelize.INTEGER(4).ZEROFILL, but obviously, that would be pretty wasteful of space in your DB.
For MySQL, the Sequelize.BOOLEAN data type maps to TINYINT(1). See
https://github.com/sequelize/sequelize/blob/3e5b8772ef75169685fc96024366bca9958fee63/lib/data-types.js#L397
and
http://docs.sequelizejs.com/en/v3/api/datatypes/
As noted by @user866762, the number in parentheses only affects how the data is displayed, not how it is stored. So, TINYINT(1) vs. TINYINT(4) should have no effect on your data.
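A quick demonstration of that point (hypothetical throwaway table; the values are only there to show the range is unchanged):
CREATE TABLE `width_demo` (
  `a` tinyint(1) unsigned DEFAULT NULL,
  `b` tinyint(4) unsigned DEFAULT NULL
);

INSERT INTO `width_demo` VALUES (200, 200);

-- Both columns store and return 200: (1) and (4) are display widths, not size or range limits
SELECT `a`, `b` FROM `width_demo`;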
The problem is that after I insert 200,000-300,000 rows of data into those columns, searching becomes very slow, and the first thing that came to mind is that I may not have added the indexes correctly. I tried adding as many BTREE indexes as possible, but phpMyAdmin did not let me add them for all columns. What would be the correct indexes for my table? I have the following table with the following indexes:
CREATE TABLE IF NOT EXISTS `carads` (
`ADID` int(7) NOT NULL AUTO_INCREMENT,
`LINK` varchar(255) CHARACTER SET latin1 NOT NULL,
`TITLE` varchar(255) NOT NULL,
`MAKE` varchar(50) CHARACTER SET latin1 NOT NULL,
`MODEL` varchar(100) CHARACTER SET latin1 NOT NULL,
`FUEL` varchar(50) CHARACTER SET latin1 NOT NULL,
`LOC` varchar(100) NOT NULL,
`TRANS` varchar(50) NOT NULL,
`YEAR` varchar(4) CHARACTER SET latin1 NOT NULL,
`BODY` varchar(255) CHARACTER SET latin1 NOT NULL,
`DESCRIPT` text CHARACTER SET latin1 NOT NULL,
`PHONENR` varchar(20) NOT NULL,
`MILEAGE` int(11) NOT NULL,
`PRICE` int(20) NOT NULL,
`DISTANCE` int(250) NOT NULL,
`POSTCODE` varchar(250) NOT NULL,
`IMAGE1` varchar(255) NOT NULL,
`IMAGE2` varchar(255) NOT NULL,
`IMAGE3` varchar(255) NOT NULL,
`IMAGE4` varchar(255) NOT NULL,
`IMAGE5` varchar(255) NOT NULL,
`CPHONE` varchar(250) NOT NULL,
`CEMAIL` varchar(500) NOT NULL,
`COLOUR` varchar(250) NOT NULL,
`EQUIPMENT` text NOT NULL,
`STATUS` tinyint(1) NOT NULL DEFAULT '1',
`DATE` date NOT NULL,
`DEL` int(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`ADID`),
KEY `ix_MakeModelPrice` (`STATUS`,`MAKE`(25),`MODEL`(25),`PRICE`),
KEY `ix_Price` (`PRICE`,`STATUS`,`DEL`,`TITLE`(30),`ADID`),
KEY `ix_Date` (`DATE`,`STATUS`,`DEL`,`TITLE`(30),`ADID`),
KEY `LINK` (`LINK`),
FULLTEXT KEY `MODEL` (`MODEL`),
FULLTEXT KEY `SearchIndex` (`TITLE`,`LOC`,`TRANS`,`CPHONE`,`CEMAIL`,`COLOUR`,`EQUIPMENT`),
FULLTEXT KEY `MAKE` (`MAKE`)
)
ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=2478687;
This is very complicated, and we cannot give you the one correct answer; you have to understand the trade-offs and find the best answer yourself.
You have to keep the following in mind:
The query optimizer will choose only one index.
Indexes which start with something like "status" and/or "del" (boolean values, or values where 95% of the rows share the selected value) don't add any value, unless these low-selectivity columns are followed by often-queried, highly selective columns.
You should first find the attributes which:
are filtered on in most of the queries (I could imagine that "make", "price" and "year" are good candidates)
are most selective (meaning that the resulting rows are < 10% of the table)
You have to find out what distribution of values exists in your table for each of these columns (a query to compute this follows the examples). Examples:
Make:
BMW: 5%
Alfa Romeo: 1%
VW: 7%
...
Price-Range:
0..999: 3%
1000..1999: 4%
2000..3000: 5%
...
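One way to measure such a distribution on the actual table (a sketch for the make column; repeat for the others):
-- Share of ads per make, most common first
SELECT `MAKE`,
       COUNT(*) AS ads,
       ROUND(100 * COUNT(*) / (SELECT COUNT(*) FROM carads), 1) AS pct
FROM carads
GROUP BY `MAKE`
ORDER BY ads DESC;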
If 80% of all searches contain "make", "price" and "year", then create an index with all 3 columns. Put the columns which are most selective and/or are mentioned in most searches to the front, followed by the other columns.
With some luck you can improve the response time of many searches dramatically. You can then dig deeper into the statistics and add some other indexes. Maybe 80% of all searches have a selection on "make", but among the rest there are still many searches without "make" that focus on "price" and "fuel"; then create an index for those searches too.
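For example (a sketch; the right column order depends on the distributions you measured, with the most selective and most frequently filtered columns first):
-- For searches that always filter on make, year and a price range
CREATE INDEX `ix_make_year_price` ON carads (`MAKE`, `YEAR`, `PRICE`);

-- For the remaining searches that focus on fuel and price instead
CREATE INDEX `ix_fuel_price` ON carads (`FUEL`, `PRICE`);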
You could also improve performance by using "codes" (e.g. Alfa Romeo=1, BMW=2, VW=3, ...) or by clustering ranges of values (e.g. price_range: 0..999, 1000..2000, ...). This can help MySQL build somewhat more efficient indexes (smaller indexes mean a lower memory footprint and less I/O).
And to understand indexes better, try submitting a query like this (the intent is that index ix_MakeModelPrice is used):
-- ix_MakeModelPrice: `STATUS`, `MAKE`(25), `MODEL`(25), `PRICE`
SELECT * FROM carads
where STATUS=1 AND MAKE='Alfa Romeo'
AND MODEL='159' and PRICE BETWEEN 100 and 1000
order by ADID Desc
LIMIT 10
This query should be fast (hopefully with some matching rows). Do you see why it is fast? "STATUS" is not selective, but the rest should reduce the number of rows found with an index-scan to probably way below 1%. The number of physical reads (rows) is reduced to a minimum => faster response.
This question expects a generic answer to the broad problem of index creation on a MySQL database.
Let's take this example table:
CREATE TABLE IF NOT EXISTS `article` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`published` tinyint(1) NOT NULL DEFAULT '0',
`author_id` int(11) unsigned NOT NULL,
`modificator_id` int(11) unsigned DEFAULT NULL,
`category_id` int(11) unsigned DEFAULT NULL,
`title` varchar(200) COLLATE utf8_unicode_ci NOT NULL,
`headline` text COLLATE utf8_unicode_ci NOT NULL,
`content` text COLLATE utf8_unicode_ci NOT NULL,
`url_alias` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
`priority` mediumint(11) unsigned NOT NULL DEFAULT '50',
`publication_date` datetime NOT NULL,
`creation_date` datetime NOT NULL,
`modification_date` datetime NOT NULL,
PRIMARY KEY (`id`)
);
On such a table there is a wide range of queries that could be performed on different criteria:
category_id
published
publication_date
e.g.:
SELECT id FROM article WHERE NOT published AND category_id = '2' ORDER BY publication_date;
On many tables you can see a wide range of state fields (like published here), date fields, or reference fields (like author_id or category_id). What strategy should be picked for creating indexes?
This can be broken down into the following points:
Should I make an index on every field that can be used in a query (either in a WHERE clause or an ORDER BY), even if this leads to a lot of indexes per table?
Should I also make an index on fields that have only a small set of values, like a boolean or an enum? This only reduces the scanned rows by a factor of n (assuming n distinct values, each used about equally often).
I've read that MySQL prior to 5.0 used only one index per query; how does the system pick it? (By choosing the most restrictive one?)
How is an OR condition processed?
How much is this going to slow down inserts?
Does InnoDB vs. MyISAM change anything about this problem?
I know the EXPLAIN statement can be used to find out whether a query is optimized or not, but a bit of concrete theory would really be more constructive than a purely empirical approach!
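As a concrete starting point for the example query above (a sketch; the column order follows the usual rule of putting equality-filtered columns before the ORDER BY column):
-- Equality filters first, then the sort column, so the index can also satisfy the ORDER BY
CREATE INDEX `ix_pub_cat_date` ON article (`published`, `category_id`, `publication_date`);

-- Writing published = 0 (rather than NOT published) keeps the predicate index-friendly
EXPLAIN SELECT id
FROM article
WHERE published = 0 AND category_id = 2
ORDER BY publication_date;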