I'm trying to implement a very fast table meant to store relationships between users.
CREATE TABLE IF NOT EXISTS `friends_ram` (
`a` varchar(16) CHARACTER SET latin1 COLLATE latin1_general_ci NOT NULL,
`b` varchar(16) CHARACTER SET latin1 COLLATE latin1_general_ci NOT NULL
) ENGINE=MEMORY DEFAULT CHARSET=latin1;
INSERT INTO friends_ram (a, b) VALUES ('user_a', 'user_b');  -- one row per relationship (example values)
I made some tests with circa 5M relations; it's blazing fast and occupies circa 134 MB of RAM. My question is, since the queries will be:
SELECT a FROM friends_ram WHERE b = 'foo';
or
SELECT b FROM friends_ram WHERE a = 'baar';
I'd like to know whether I should add proper indexes (at the cost of the extra RAM they require).
I'm actually ashamed of the results; probably the first time I ran the tests I misread the output.
It turns out that 1000 random queries without an index on a or b take roughly 1000 times as long as with proper indexing. Ahem...
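For reference, a minimal sketch of the indexing that makes that difference (the index names are my own):

ALTER TABLE friends_ram
  ADD INDEX idx_a (a),  -- serves SELECT b ... WHERE a = '...'
  ADD INDEX idx_b (b);  -- serves SELECT a ... WHERE b = '...'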
Another very important thing to note is that I also tried memcached: while it takes a little longer to store the data, it's faster for retrieval, and it consumes far less memory.
MySQL MEMORY engine: 192 MB, did it in 0.50138092041016 seconds
memcached: 76 MB, did it in 0.34592795372009 seconds
memcached (compressed): 45.4 MB, did it in 0.31583189964294 seconds
So, if you need to store simple things such as these, I'd recommend memcached (compressed).
I am facing serious performance issue in inserting, selecting and updating rows to a table in mysql.
The table structure I am using is
CREATE TABLE `sessions` (
`sessionid` varchar(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
`expiry` datetime NOT NULL,
`value` text NOT NULL,
`data` text,
PRIMARY KEY (`sessionid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Sessions';
The queries for which I face issues are:
INSERT INTO sessions (SESSIONID, EXPIRY, DATA, VALUE) VALUES ('b8c10810c505ba170dd9403072b310ed', '2019-05-01 17:25:50', 'PFJlc3BvbnNlIHhtbG5zPSJ1cm46b2FzaXM6bmFtZXM', '7bKDofc/pyFSQhm7QE5jb6951Ahg6Sk8OCVZI7AcbUPb4jZpHdrCAKuCPupJO14DNY3jULxKppLadGlpsKBifiJavZ/');
UPDATE sessions SET EXPIRY = '2019-05-01 17:26:07' WHERE (SESSIONID = 'e99a0889437448091a06a43a44d0f170');
SELECT SESSIONID, EXPIRY, DATA, VALUE FROM sessions WHERE (SESSIONID = '507a752c48fc9cc3043a3dbe889c52eb');
I tried EXPLAINing the queries but was not able to infer much about optimizing the table or the queries.
From the slow query report, the average time taken for the SELECT is 23.45, for the UPDATE 15.93, and for the INSERT 22.31.
Any help in identifying the issue is much appreciated.
How many queries per second?
How big is the table?
How much RAM?
What is the value of innodb_buffer_pool_size?
UUIDs are terrible for performance. (Is that a SHA1?) This is because they are so random that the 'next' query (any of those you mentioned) is likely not to be in cache, hence necessitating a disk hit.
So, with a table that is much larger than the buffer_pool, you won't be able to sustain more than about 100 queries per second with a spinning drive. SSD would be faster.
More on the evils of UUIDs (SHA1 has the same unfortunate properties, but no solution like the one for uuids): http://mysql.rjweb.org/doc.php/uuid
One minor thing you can do is to shrink the table:
session_id BINARY(20)
and use UNHEX() when inserting/updating/deleting and HEX() when selecting.
More
51KB avg row len --> The TEXT columns are big, and "off-record", hence multiple blocks needed to work with a row.
0.8GB buffer_pool, but 20GB of data, and 'random' PRIMARY KEY --> The cache is virtually useless.
These mean that there will be multiple disk hits for each query, but probably under 10.
300ms (a fast time) --> about 30 disk hits on an HDD (an SSD could do more in that time; which do you have?).
So, I must guess that 20s for a query happened when there was a burst of activity that had the queries stumbling over each other, leading to lots of I/O contention.
What to do? Most of the data looks like hex. If that is true, you could cut the disk footprint in half (and cut back some on disk hits needed) by packing and using BINARY(..) or BLOB.
INSERT INTO sessions (SESSIONID, EXPIRY, DATA, VALUE)
VALUES (UNHEX('b8c10810c505ba170dd9403072b310ed'),
'2019-05-01 17:25:50',
UNHEX('PFJlc3BvbnNlIHhtbG5zPSJ1cm46b2FzaXM6bmFtZXM'),
UNHEX('7bKDofc/pyFSQhm7QE5jb6951Ahg6Sk8OCVZI7AcbUPb4jZpHdrCAKuCPupJO14DNY3jULxKppLadGlpsKBifiJavZ/'));
UPDATE sessions SET EXPIRY = '2019-05-01 17:26:07'
WHERE SESSIONID = UNHEX('e99a0889437448091a06a43a44d0f170');
SELECT SESSIONID, EXPIRY, DATA, VALUE FROM sessions
WHERE SESSIONID = UNHEX('507a752c48fc9cc3043a3dbe889c52eb');
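And, per the HEX()-when-selecting note above, a sketch of reading the packed values back out as hex strings:

SELECT HEX(SESSIONID) AS sessionid, EXPIRY, HEX(DATA) AS data, HEX(VALUE) AS value
FROM sessions
WHERE SESSIONID = UNHEX('507a752c48fc9cc3043a3dbe889c52eb');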
and
`sessionid` VARBINARY(20) NOT NULL,
`expiry` datetime NOT NULL,
`value` BLOB NOT NULL,
`data` BLOB,
And ROW_FORMAT=DYNAMIC might be optimal (but this is not critical).
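Pulled together, a minimal sketch of the revised table (assuming the values really can be packed with UNHEX()):

CREATE TABLE `sessions` (
  `sessionid` VARBINARY(20) NOT NULL,  -- packed with UNHEX() on the way in
  `expiry` DATETIME NOT NULL,
  `value` BLOB NOT NULL,
  `data` BLOB,
  PRIMARY KEY (`sessionid`)
) ENGINE=InnoDB ROW_FORMAT=DYNAMIC COMMENT='Sessions';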
Your queries look good, but the problem is with your server; it may not have enough memory to handle such requests. You can increase your database server's memory to get better response times.
I have a table like this:
create table test (
id int primary key auto_increment,
idcard varchar(30),
name varchar(30),
custom_value varchar(50),
index i1(idcard)
)
I insert 30,000,000 rows into the table
and then I execute:
select * from test where idcard='?'
The statement takes 12 seconds to return.
When I use iostat to monitor the disk, the read speed is about 6 MB/s while utilization is 94%.
Is there any way to optimize it?
12 seconds may be realistic.
Assumptions about the question:
A total of 30M rows, but only 3000 rows in the resultset.
Not enough room to cache things in RAM or you are running from a cold start.
InnoDB or MyISAM (the analysis is the same; the details are radically different).
Any CHARACTER SET and COLLATION for idcard.
INDEX(idcard) exists and is used in the query.
HDD disk drive, not SSD.
Here's a breakdown of the processing:
Go to the index, find the first entry with ?, scan forward until hitting an entry that is not ? (about 3K rows later).
For each of those 3K items, reach into the table to find all the columns (cf. SELECT *).
Deliver them.
Step 1: Fast.
Step 2: This is (based on the assumption of not being cached) costly. It may involve about 3K disk hits. For an HDD, that would be about 30 seconds. So, 12 seconds could imply some of the stuff was cached or happened to be near each other.
Step 3: This is a network cost, which I am not considering.
Run the query a second time. It may take only 1 second this time -- because all 3K blocks are cached in RAM! And iostat will show zero activity!
Is there any way to optimize it?
Well...
You already have the best index.
What are you going to do with 3000 rows all at once? Is this a one-time task?
When using InnoDB, innodb_buffer_pool_size should be about 70% of available RAM, but not so big that it leads to swapping. What is its setting, how much RAM do you have, and what else is running on the machine? (See the sketch after this list for how to check it.)
Could you do more of the task while you are fetching the 3K rows?
Switching to SSDs would help, but I don't like hardware bandaids; they are not reusable.
How big is the table (in GB) -- perhaps 3GB data plus index? (SHOW TABLE STATUS.) If you can't make the buffer_pool big enough for it, and you have a variety of queries that compete for different parts of this (and other) tables, then more RAM may be beneficial.
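A hedged sketch for checking the settings mentioned above (the resize value is only an illustration and assumes MySQL 5.7.5+ for online resizing):

SELECT @@innodb_buffer_pool_size / 1024 / 1024 AS buffer_pool_mb;
SHOW TABLE STATUS LIKE 'test';   -- Data_length + Index_length approximate the table's footprint
SET GLOBAL innodb_buffer_pool_size = 6 * 1024 * 1024 * 1024;   -- example only; size to ~70% of RAM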
This seems more like an I/O limitation than something that could be solved by adding indexes. What will improve the speed is changing the collation of the idcard column to latin1_bin. This uses only 1 byte per character, and it uses binary comparison, which is faster than case-insensitive comparison.
Only do this if you have no special characters in the idcard column, because the latin1 character set is quite limited.
ALTER TABLE `test` CHANGE COLUMN `idcard` `idcard` VARCHAR(30) COLLATE 'latin1_bin' AFTER `id`;
Furthermore, ROW_FORMAT=FIXED also improves the speed. ROW_FORMAT=FIXED is not available with the InnoDB engine, but it is with MyISAM. The resulting table I now have is shown below. It's 5 times quicker (80% less time) for SELECT statements than the initial table.
Note that I also changed the collation for 'name' and 'custom_value' to latin1_bin. This does make quite a difference in speed in my test setup, and I'm still figuring out why.
CREATE TABLE `test` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`idcard` VARCHAR(30) COLLATE 'latin1_bin',
`name` VARCHAR(30) COLLATE 'latin1_bin',
`custom_value` VARCHAR(50) COLLATE 'latin1_bin',
PRIMARY KEY (`id`),
INDEX `i1` (`idcard`)
)
ENGINE=MyISAM
ROW_FORMAT=FIXED ;
You may try adding the three other columns in the select clause to the index:
CREATE INDEX idx ON test (idcard, id, name, custom_value);
The three columns other than idcard are being added to allow the index to cover everything being selected. The problem with your current index is that it is only on idcard. This means that once MySQL has traversed down to each leaf node in the index, it would have to do another seek back to the clustered index to look up the values of all columns mentioned in the SELECT *. As a result of this, MySQL may choose to ignore the index completely. The suggestion I made above avoids this additional seek.
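A quick way to check whether the new index actually covers the query is EXPLAIN; with a covering index MySQL reports "Using index" in the Extra column (the literal value here is just a placeholder):

EXPLAIN SELECT * FROM test WHERE idcard = '123456789';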
I would appreciate it if someone could explain how it is possible that MySQL is not churning with a large table on a default config.
note: I don't need advice how to increase the memory, improve the performance or migrate etc. I want to understand why it is working and performing well.
I have the following table:
CREATE TABLE `daily_reads` (
`a` varchar(32) NOT NULL DEFAULT '',
`b` varchar(50) NOT NULL DEFAULT '',
`c` varchar(20) NOT NULL DEFAULT '',
`d` varchar(20) NOT NULL DEFAULT '',
`e` varchar(20) NOT NULL DEFAULT '',
`f` varchar(10) NOT NULL DEFAULT 'Wh',
`g` datetime NOT NULL,
`PERIOD_START` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`i` decimal(16,3) NOT NULL,
`j` decimal(16,3) NOT NULL DEFAULT '0.000',
`k` decimal(16,2) NOT NULL DEFAULT '0.00',
`l` varchar(1) NOT NULL DEFAULT 'N',
`m` varchar(1) NOT NULL DEFAULT 'N',
PRIMARY KEY (`a`,`b`,`c`,`PERIOD_START`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
It is running on a VM with 1 CPU Core, 6GB RAM, CentOS 7 (have very limited access to that VM).
It is running on a default MySQL config with 128MB buffer pool (SELECT @@innodb_buffer_pool_size/1024/1024)
DB size is ~96GB, ~560M rows in the 'reads' table, ~710M rows with other tables.
select database_name, table_name, index_name, stat_value*@@innodb_page_size
from mysql.innodb_index_stats where stat_name='size';
PRIMARY: 83,213,500,416 (no other indexes)
I get like ~500K reads/month and writes are done only as part of an ETL process directly from Informatica to the DB (~ 75M writes/month).
The read queries are called only via stored procedure:
CALL sp_get_meter_data('678912345678', '1234567765432', '2017-01-13 00:00:00', '2017-05-20 00:00:00');
-- stripped out the unimportant bits:
...
SET daily_from_date = DATE_FORMAT(FROM_DATE_TIME, '%Y-%m-%d 00:00:00');
SET daily_to_date = DATE_FORMAT(TO_DATE_TIME, '%Y-%m-%d 23:59:59');
...
SELECT *
FROM daily_reads
WHERE A = FIRST_NUMBER
  AND B = SECOND_NUMBER
  AND daily_from_date <= PERIOD_START
  AND daily_to_date >= PERIOD_START
ORDER BY PERIOD_START ASC;
My understanding of InnoDB is quite limited, but I thought I needed to fit all indexes into memory to do fast queries. The read procedure takes only a few milliseconds. I thought it was not technically possible to query a 500M+ row table this quickly on a default MySQL config...?
What am I missing?
Long answer: Your primary key is a composite of several columns starting with a and b.
Your WHERE clause says this.
WHERE a = FIRST_NUMBER
AND b = SECOND_NUMBER
AND etc etc.
This WHERE clause exploits the index associated with your primary key very efficiently indeed. It random-accesses the index to precisely the first row it needs, and then scans it sequentially. So it doesn't actually have to page in much of your index or your table to satisfy your query.
Short answer: When queries exploit indexes, MySQL is fast and cheap.
If you wanted an index that was perfect for this query, it would be a composite index on (a, b, PERIOD_START). This would use equality matching to hit the first matching row in the index, then range-scan the index for your chosen date range. But the performance you have now is pretty good.
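If you did want to try it, a sketch (the index name is my own; given the existing PRIMARY KEY (a, b, c, PERIOD_START) it may add little in practice):

ALTER TABLE daily_reads ADD INDEX idx_a_b_period (a, b, PERIOD_START);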
You asked whether the index must fit entirely in memory. No. The entire purpose of DBMS software is to handle volumes of data that can't possibly fit in memory at once. Good DBMS implementations do a good job of maintaining memory caches, and refreshing those caches from mass storage, when needed. The innodb buffer pool is one such cache. Keep in mind that any insertions or updates to a table require both the table data and the index data to be written to mass storage eventually.
Performance can be improved with an index.
In your specific case, you are filtering on 3 columns: A, B, and PERIOD_START.
To speed up the query you can use an index on these columns.
Adding an index on PERIOD_START alone can be inefficient because this type also stores time information, so you have many different values within the same day.
You can add a new column that stores the DATE part of PERIOD_START in the proper type (DATE), something like PERIOD_START_DATE, and add an index on that column.
This makes the indexing more effective and can improve performance because you are using a lookup (key -> values).
If you do not want to change your client code, you can use a generated stored column (see the MySQL manual), as sketched below.
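A minimal sketch of that generated-column approach (requires MySQL 5.7+; the column and index names are my own):

ALTER TABLE daily_reads
  ADD COLUMN PERIOD_START_DATE DATE
      GENERATED ALWAYS AS (DATE(PERIOD_START)) STORED,
  ADD INDEX idx_period_start_date (PERIOD_START_DATE);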
Best regards
It's possible your index is getting used (probably not, given the leading edge doesn't match the columns in your query), but even if it isn't, you'd only ever read through the table once, because the query doesn't have any joins and subsequent runs would pick up the cached results.
Since you're using Informatica to load the data (it's a Swiss Army knife of data loading), it may be doing a lot more than you realise; e.g. assuming the data load is all inserts, it may drop and recreate indexes and run in bulk mode to load the data really quickly. It may even pre-run the query to prime your cache with the first post-load run.
Doesn't the index have to fit in memory?
No, the entire index does not have to fit in memory. Only the part of the index that needs to be examined during the query execution.
Since you have conditions on the left-most columns of your primary key (which is your clustered index), the query only examines rows that match the values you search for. The rest of the table is not examined at all.
You can try using EXPLAIN with your query and see an estimate of the number of rows examined. This is only a rough estimate calculated by the optimizer, but it should show that your query only needs to examine a small subset of the 550 million rows.
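For example, using the values from the CALL above (a sketch; the date literals reflect how the procedure builds daily_from_date and daily_to_date):

EXPLAIN SELECT *
FROM daily_reads
WHERE a = '678912345678'
  AND b = '1234567765432'
  AND PERIOD_START >= '2017-01-13 00:00:00'
  AND PERIOD_START <= '2017-05-20 23:59:59';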
The InnoDB buffer pool keeps copies of frequently-used pages in RAM. The more frequently a page is used, the more likely it is to stay in the buffer pool and not get kicked out. Over time, as you run queries, your buffer pool gradually stabilizes with the set of pages that is most worth keeping in RAM.
If your query workload were to really scan your entire table frequently, then the small buffer pool would churn a lot more. But it's likely that your queries request the same small subset of the table repeatedly. A phenomenon called the Pareto Principle applies in many real-world applications: the majority of the requests are satisfied by a small minority of data.
This principle tends to fail when we run complex analytical queries, because those queries are more likely to scan the entire table.
I have this database "stats19" with all the data from UK accidents from 2005 to 2013.
Now I have to create a data warehouse (star schema).
These are the tables I'm trying to join into one table, omitting some other variables that are not important:
stats19.casualty (2,020,000 rows)
AccidentIndex varchar(13)
VehicleReference int(11)
CasualtyReference_id int(11)
CasualtyClass int(11)
CasualtySeverity varchar(7)
CasualtySex varchar(28)
CasualtyAgeBand varchar(7)
...
stats19.typeperson (2,020,000 rows)
CasualtyType_id int(11)
fk_AccidentIndex varchar(13)
fk_VehicleReference int(11)
fk_CasualtyReference_id int(11)
...
stats19.accident (1,494,275 rows)
AccidentIndex varchar(13)
AccidentDate date
AccidentTime time
...
The final table must have these variables:
dw.casualtytemporary (should have 2,020,000 rows)
AccidentIndex VARCHAR(255),
VehicleReference INT,
CasualtyReference INT,
CasualtyClass INT,
CasualtyType INT,
AccidentDate DATE,
AccidentTime TIME,
CasualtySex VARCHAR(255),
CasualtyAgeBand VARCHAR(255)
I have been trying to execute this INSERT:
INSERT INTO CasualtyTemp
SELECT c.AccidentIndex, c.VehicleReference, c.CasualtyReference_id,
       c.CasualtyClass, t.CasualtyType_id, a.AccidentDate, a.AccidentTime,
       c.CasualtySex, c.CasualtyAgeBand
FROM stats19.Casualty AS c
INNER JOIN stats19.typeperson AS t
        ON c.CasualtyReference_id = t.fk_CasualtyReference_id
INNER JOIN stats19.accident AS a
        ON a.AccidentIndex = c.AccidentIndex;
The problem is that both the MySQL command line and Workbench fail to insert, either with an error (disconnection) or by taking too much time to do the insert.
The final table dw.casualtytemporary should have 2,020,000 rows, as that is what the original table had.
Since you are doing a full-table join without a WHERE clause, I think the search complexity is n1*log(n2)*log(n3), where ni is the row count of each table (if you use an index on the inner join fields).
I think your SQL statement is right, and the MySQL optimizer will further optimize it, so there is no need to work on the SQL itself. But you can tune the MySQL side; I list some things that may be important.
Both storage engines should be the same; this ensures that the tables are joined at the engine level, otherwise they will be joined at the server level, which is slow.
If you use InnoDB, maybe you can tune the important InnoDB parameters, like innodb_buffer_pool_size (see the sketch at the end of this answer), because enough space will let InnoDB build hash indexes in memory.
If you use the MyISAM engine, maybe you can tune the MyISAM index buffer size to ensure the indexes can be loaded into memory.
Also, since you will produce a derived table, tmp_table_size will be important; if tmp_table_size is too small, an on-disk MyISAM table will be used as the temporary table. Also note that InnoDB is very slow to write due to the doublewrite mechanism, and this gets worse when you are using INSERT ... SELECT, since concurrent insert is disabled.
Other factors: whether there are NULLs in your fields, and whether a field repeats heavily; if so, you can use ENUM, which is faster than VARCHAR. Also note that CHAR is about 20% faster than VARCHAR, so if disk space is not a concern and the strings are short, you could try it.
If all of the above can't resolve your problem, or you don't have a big machine, then since you are only joining three tables you could write some code in C/C++, which is the most efficient way.
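For reference, a quick sketch for checking the MySQL settings mentioned above:

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'key_buffer_size';        -- MyISAM index buffer
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';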
I have a very strange situation with MySQL.
Take the following example:
A table containing 15000 entries
CREATE TABLE `temp_geonis_export` (
`auto_id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
`obj_id` VARCHAR(10),
`gis_id` VARCHAR(45),
`letzteReinigung_id` VARCHAR(10),
INDEX `Index_2`(`gis_id`),
PRIMARY KEY (`auto_id`)
)
ENGINE = InnoDB;
Now I'm updating 14,000 of the rows in the table. I know it's a pretty ugly statement that could easily be rewritten, but that is not the question...
Update temp_geonis_export as temp
inner join (
Select gis_id, obj_id from
(
Select abw.gis_id, abw.bezeichnung, erh.obj_id
from od_abwasserbauwerk as abw
inner join od_erhaltungsereignis as erh on erh.fs_abwasserbauwerk = abw.obj_id and erh.status = 2
inner join od_reinigung as unter on unter.obj_id = erh.obj_id
order by fs_abwasserbauwerk asc, erh.zeitpunkt asc
) as alleSortiert group by alleSortiert.gis_id
) as naechsteRein on temp.gis_id = naechsteRein.gis_id
set temp.naechsteReinigung_id = naechsteRein.obj_id;
Now, if I run the Update-statement on our development server, it takes about 1 sec. On one of our production servers it takes 90 seconds!!
These are my observations:
Handler_read_rnd_next 101000 (development), 266177000 (production)
Very high CPU Usage on production system (due to the above observation)
Almost no Disk IO on both Systems
When I rewrite the Update-Query and store the output of the subquery into a temporary table, the Update-Statement is fast on both systems
Based on these observations, my conclusion is that for some reason our production server has to perform full table scans for each updated row, while the development server does not. It must be a configuration issue, since our servers are all running 5.1.25 and the hardware is comparable.
Do you have a clue, what I have to change on our production server to make it perform better?
Thanks for your help
After hours I finally got the solution:
The problem was that the DEFAULT CHARSET on the production server was different from the charset used in the database. When the table was created without explicitly specifying the charset, MySQL did not use the indexes. This is because MySQL can't use indexes on CHAR fields when the charsets of the joined tables differ.
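A minimal sketch of the fix: specify the charset explicitly when creating the table, so it matches the tables it is joined against (utf8 here is only an assumption; use whatever charset the database actually uses):

CREATE TABLE `temp_geonis_export` (
  `auto_id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
  `obj_id` VARCHAR(10),
  `gis_id` VARCHAR(45),
  `letzteReinigung_id` VARCHAR(10),
  INDEX `Index_2` (`gis_id`),
  PRIMARY KEY (`auto_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;   -- charset pinned explicitly (assumed value)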
Thank you so much for your help, peterm. You pointed me in the right direction.