JOIN Three tables - mysql

I have a database "stats19" with all the data from UK accidents from 2005 to 2013.
Now I have to create a data warehouse (star schema).
These are the tables I'm trying to join into one table, omitting some other variables that are not important:
stats19.casualty (2,020,000 rows)
AccidentIndex varchar(13)
VehicleReference int(11)
CasualtyReference_id int(11)
CasualtyClass int(11)
CasualtySeverity varchar(7)
CasualtySex varchar(28)
CasualtyAgeBand varchar(7)
...
stats19.typeperson (2,020,000 rows)
CasualtyType_id int(11)
fk_AccidentIndex varchar(13)
fk_VehicleReference int(11)
fk_CasualtyReference_id int(11)
...
stats19.accident (1,494,275 rows)
AccidentIndex varchar(13)
AccidentDate date
AccidentTime time
...
The final table must have these variables:
dw.casualtytemporary (should have 2,020,000 rows)
AccidentIndex VARCHAR(255),
VehicleReference INT,
CasualtyReference INT,
CasualtyClass INT,
CasualtyType INT,
AccidentDate DATE,
AccidentTime TIME,
CasualtySex VARCHAR(255),
CasualtyAgeBand VARCHAR(255)
I have been trying to execute this insert:
INSERT INTO CasualtyTemp
SELECT c.AccidentIndex, c.VehicleReference, c.CasualtyReference_id,
       c.CasualtyClass, t.CasualtyType_id, a.AccidentDate, a.AccidentTime,
       c.CasualtySex, c.CasualtyAgeBand
FROM stats19.casualty AS c
INNER JOIN stats19.typeperson AS t
    ON c.CasualtyReference_id = t.fk_CasualtyReference_id
INNER JOIN stats19.accident AS a
    ON a.AccidentIndex = c.AccidentIndex;
The problem is that both the MySQL command line and Workbench fail to complete the insert, either returning an error (disconnection) or taking too much time.
The final table dw.casualtytemporary should have 2,020,000 rows, as this is what the original casualty table had.

Since you are doing a full-table join without a WHERE clause, the search complexity is roughly n1*log(n2)*log(n3), where ni is the row count of each table (assuming indexes on the inner-join fields).
I think your SQL statement is right, and the MySQL optimizer will further optimize it, so there is nothing to do on the SQL side. But you can tune the MySQL side; here are some things that may be important.
All the tables should use the same storage engine. This ensures the tables are joined at the engine level; otherwise they are joined at the server level, which is slow.
If you use InnoDB, tune the important InnoDB parameters, such as innodb_buffer_pool_size, because enough space lets InnoDB keep its adaptive hash index in memory.
If you use the MyISAM engine, tune the key buffer size so the indexes can be loaded into memory.
Also, since the query produces a derived table, tmp_table_size is important: if tmp_table_size is too small, an on-disk MyISAM table is used as the temporary table. Note too that InnoDB is comparatively slow to write because of its doublewrite mechanism, and this gets worse with INSERT ... SELECT, since concurrent insert is disabled.
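A quick way to inspect the settings mentioned above is shown below; the SET values are only examples and should be sized to your machine (on older MySQL versions innodb_buffer_pool_size is not dynamic and must be changed in my.cnf followed by a restart):
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';
-- Example values only; in-memory temporary tables are capped by the
-- smaller of tmp_table_size and max_heap_table_size.
SET GLOBAL tmp_table_size      = 256 * 1024 * 1024;
SET GLOBAL max_heap_table_size = 256 * 1024 * 1024;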
Other factors: whether there are NULLs in your fields, and whether a field's values repeat heavily; if so, you can use ENUM, which is faster than VARCHAR. Also note that CHAR is about 20% faster than VARCHAR, so if disk space is not a concern and the strings are short, it may be worth trying.
If none of the above resolves your problem, or you don't have a big machine, then since you are only joining three tables you could write some code in C/C++, which is the most efficient way.

Related

mysql select * by index is very slow

I have a table like this:
create table test (
id int primary key auto_increment,
idcard varchar(30),
name varchar(30),
custom_value varchar(50),
index i1(idcard)
)
I inserted 30,000,000 rows into the table
and then I execute:
select * from test where idcard='?'
The statement takes 12 seconds to return.
When I use iostat to monitor the disk,
the read speed is about 6 MB/s while the utilization is 94%.
Is there any way to optimize it?
12 seconds may be realistic.
Assumptions about the question:
A total of 30M rows, but only 3000 rows in the resultset.
Not enough room to cache things in RAM or you are running from a cold start.
InnoDB or MyISAM (the analysis is the same; the details are radically different).
Any CHARACTER SET and COLLATION for idcard.
INDEX(idcard) exists and is used in the query.
HDD disk drive, not SSD.
Here's a breakdown of the processing:
Go to the index, find the first entry with ?, scan forward until hitting an entry that is not ? (about 3K rows later).
For each of those 3K items, reach into the table to find all the columns (cf. SELECT *).
Deliver them.
Step 1: Fast.
Step 2: This is (based on the assumption of not being cached) costly. It may involve about 3K disk hits. For an HDD, that would be about 30 seconds. So, 12 seconds could imply some of the stuff was cached or happened to be near each other.
Step 3: This is a network cost, which I am not considering.
Run the query a second time. It may take only 1 second this time -- because all 3K blocks are cached in RAM! And iostat will show zero activity!
Is there any way to optimize it?
Well...
You already have the best index.
What are you going to do with 3000 rows all at once? Is this a one-time task?
When using InnoDB, innodb_buffer_pool_size should be about 70% of available RAM, but not so big that it leads to swapping. What is its setting, how much RAM do you have, and what else is running on the machine? (See the quick checks after this list.)
Could you do more of the task while you are fetching the 3K rows?
Switching to SSDs would help, but I don't like hardware bandaids; they are not reusable.
How big is the table (in GB) -- perhaps 3GB data plus index? (SHOW TABLE STATUS.) If you can't make the buffer_pool big enough for it, and you have a variety of queries that compete for different parts of this (and other) tables, then more RAM may be beneficial.
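To answer the sizing questions above, checks along these lines may help (the table name is from the posted schema):
SELECT @@innodb_buffer_pool_size / 1024 / 1024 AS buffer_pool_mb;
SHOW TABLE STATUS LIKE 'test';  -- Data_length + Index_length approximates the on-disk size in bytes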
This seems more like an I/O limitation than something that could be solved by adding indexes. What will improve the speed is changing the collation of the idcard column to latin1_bin. This uses only 1 byte per character, and it uses binary comparison, which is faster than case-insensitive comparison.
Only do this if you have no special characters in the idcard column, because the character set of latin1 is quite limited.
ALTER TABLE `test` CHANGE COLUMN `idcard` `idcard` VARCHAR(30) COLLATE 'latin1_bin' AFTER `id`;
Furthermore, ROW_FORMAT=FIXED also improves the speed. ROW_FORMAT=FIXED is not available with the InnoDB engine, but it is with MyISAM. The resulting table I now have is shown below; it is 5 times quicker (80% less time) on select statements than the initial table.
Note that I also changed the collation for 'name' and 'custom_value' to latin1_bin. This does make quite a difference in speed in my test setup, and I'm still figuring out why.
CREATE TABLE `test` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`idcard` VARCHAR(30) COLLATE 'latin1_bin',
`name` VARCHAR(30) COLLATE 'latin1_bin',
`custom_value` VARCHAR(50) COLLATE 'latin1_bin',
PRIMARY KEY (`id`),
INDEX `i1` (`idcard`)
)
ENGINE=MyISAM
ROW_FORMAT=FIXED ;
You may try adding the three other columns in the select clause to the index:
CREATE INDEX idx ON test (idcard, id, name, custom_value);
The three columns other than idcard are being added to allow the index to cover everything being selected. The problem with your current index is that it is only on idcard. This means that once MySQL has traversed down to each leaf node in the index, it has to do another seek back to the clustered index to look up the values of all the columns mentioned in the SELECT *. As a result, MySQL may choose to ignore the index completely. The suggestion above avoids this additional seek.
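If you add such a covering index, EXPLAIN should report "Using index" in the Extra column when it is chosen; for example (the idcard value is just a placeholder):
EXPLAIN SELECT * FROM test WHERE idcard = '1234567890';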

mysql table with 550M rows with 128MB memory

I would appreciate it if someone could explain how it is possible that MySQL is not churning with a large table on the default config.
note: I don't need advice how to increase the memory, improve the performance or migrate etc. I want to understand why it is working and performing well.
I have the following table:
CREATE TABLE `daily_reads` (
`a` varchar(32) NOT NULL DEFAULT '',
`b` varchar(50) NOT NULL DEFAULT '',
`c` varchar(20) NOT NULL DEFAULT '',
`d` varchar(20) NOT NULL DEFAULT '',
`e` varchar(20) NOT NULL DEFAULT '',
`f` varchar(10) NOT NULL DEFAULT 'Wh',
`g` datetime NOT NULL,
`PERIOD_START` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`i` decimal(16,3) NOT NULL,
`j` decimal(16,3) NOT NULL DEFAULT '0.000',
`k` decimal(16,2) NOT NULL DEFAULT '0.00',
`l` varchar(1) NOT NULL DEFAULT 'N',
`m` varchar(1) NOT NULL DEFAULT 'N',
PRIMARY KEY (`a`,`b`,`c`,`PERIOD_START`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
It is running on a VM with 1 CPU Core, 6GB RAM, CentOS 7 (have very limited access to that VM).
It is running on a default MySQL config with a 128MB buffer pool (SELECT @@innodb_buffer_pool_size/1024/1024).
DB size is ~96GB, ~560M rows in the 'reads' table, ~710M rows with other tables.
select database_name, table_name, index_name, stat_value*@@innodb_page_size
from mysql.innodb_index_stats where stat_name='size';
PRIMARY: 83,213,500,416 (no other indexes)
I get like ~500K reads/month and writes are done only as part of an ETL process directly from Informatica to the DB (~ 75M writes/month).
The read queries are called only via stored procedure:
CALL sp_get_meter_data('678912345678', '1234567765432', '2017-01-13 00:00:00', '2017-05-20 00:00:00');
-- stripped out the unimportant bits:
...
SET daily_from_date = DATE_FORMAT(FROM_DATE_TIME, '%Y-%m-%d 00:00:00');
SET daily_to_date = DATE_FORMAT(TO_DATE_TIME, '%Y-%m-%d 23:59:59');
...
SELECT
*
FROM
daily_reads
WHERE
A = FIRST_NUMBER
AND
B = SECOND_NUMBER
AND
daily_from_date <= PERIOD_START
AND
daily_to_date >= PERIOD_START
ORDER BY
PERIOD_START ASC;
My understanding of InnoDB is quite limited, but I thought I needed to fit all indexes into memory to get fast queries. The read procedure takes only a few milliseconds. I thought it was not technically possible to query a 500M+ row table fast enough on a default MySQL config...?
What am I missing?
note: I don't need advice how to increase the memory, improve the performance or migrate etc. I want to understand why it is working and performing well.
Long answer: Your primary key is a composite of several columns starting with a and b.
Your WHERE clause says this.
WHERE a = FIRST_NUMBER
AND b = SECOND_NUMBER
AND etc etc.
This WHERE clause exploits the index associated with your primary key very efficiently indeed. It random-accesses the index to precisely the first row it needs, and then scans it sequentially. So it doesn't actually have to page in much of your index or your table to satisfy your query.
Short answer: When queries exploit indexes, MySQL is fast and cheap.
If you wanted an index that was perfect for this query, it would be a composite index on (a, b, PERIOD_START). This would use equality matching to hit the first matching row in the index, then range-scan the index for your chosen date range. But the performance you have now is pretty good.
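If you ever did want to add it, a sketch might look like this (the index name is illustrative, and with the existing primary key on (a, b, c, PERIOD_START) it is probably unnecessary):
ALTER TABLE daily_reads ADD INDEX idx_a_b_period (a, b, PERIOD_START);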
You asked whether the index must fit entirely in memory. No. The entire purpose of DBMS software is to handle volumes of data that can't possibly fit in memory at once. Good DBMS implementations do a good job of maintaining memory caches, and refreshing those caches from mass storage, when needed. The innodb buffer pool is one such cache. Keep in mind that any insertions or updates to a table require both the table data and the index data to be written to mass storage eventually.
The performance can be improved with some indexes.
In your specific case, you are filtering on 3 columns: A, B, and PERIOD_START.
To speed up the query you can use an index on these columns.
Adding an index on PERIOD_START alone can be inefficient, because the column also stores time information, so you have many different values within the same day.
You can add a new column that stores the DATE part of PERIOD_START in the correct type (DATE), something like PERIOD_START_DATE, and add an index on this column.
This makes the indexing more effective and can improve performance, because the index acts as a lookup table (key -> values).
If you do not want to change your client code, you can use a generated stored column; see the MySQL manual.
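A minimal sketch of that approach, assuming MySQL 5.7+ for generated columns (the column and index names are illustrative):
ALTER TABLE daily_reads
  ADD COLUMN PERIOD_START_DATE DATE AS (DATE(PERIOD_START)) STORED,
  ADD INDEX idx_period_start_date (PERIOD_START_DATE);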
It's possible your index is getting used (though probably not, given the leading edge doesn't match the columns in your query), but even if it isn't, you'd only ever read through the table once, because the query doesn't have any joins and subsequent runs would pick up the cached results.
Since you're using Informatica to load the data (it's a Swiss Army knife of data loading), it may be doing a lot more than you realise. For example, assuming the data load is all inserts, it may drop and recreate indexes and run in bulk mode to load the data really quickly. It may even pre-run the query to prime your cache on the first post-load run.
Doesn't the index have to fit in memory?
No, the entire index does not have to fit in memory. Only the part of the index that needs to be examined during the query execution.
Since you have conditions on the left-most columns of your primary key (which is your clustered index), the query only examines rows that match the values you search for. The rest of the table is not examined at all.
You can try using EXPLAIN with your query and see an estimate of the number of rows examined. This is only a rough estimate calculated by the optimizer, but it should show that your query only needs to examine a small subset of the 550 million rows.
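For example, something like this (the literal values are taken from the sample CALL above) should show a rows estimate that is a tiny fraction of the table:
EXPLAIN SELECT *
FROM daily_reads
WHERE a = '678912345678'
  AND b = '1234567765432'
  AND PERIOD_START BETWEEN '2017-01-13 00:00:00' AND '2017-05-20 23:59:59';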
The InnoDB buffer pool keeps copies of frequently-used pages in RAM. The more frequently a page is used, the more likely it is to stay in the buffer pool and not get kicked out. Over time, as you run queries, your buffer pool gradually stabilizes with the set of pages that is most worth keeping in RAM.
If your query workload were to really scan your entire table frequently, then the small buffer pool would churn a lot more. But it's likely that your queries request the same small subset of the table repeatedly. A phenomenon called the Pareto Principle applies in many real-world applications: the majority of the requests are satisfied by a small minority of data.
This principle tends to fail when we run complex analytical queries, because those queries are more likely to scan the entire table.

split table performance in mysql

Hi everyone. Here is a problem with my MySQL server.
I have a table with about 40,000,000 rows and 10 columns.
Its size is about 4GB and its engine is InnoDB.
It is a master database, and it only executes one kind of SQL statement:
insert into mytable ... on duplicate key update ...
About 99% of the statements execute the update part.
Now the server is becoming slower and slower.
I heard that splitting the table may enhance its performance, so I tried it on my personal computer: I split it into 10 tables, which failed, and then into 100, which failed too. The speed became slower instead. So I wonder why splitting tables didn't enhance the performance?
Thanks in advance.
more details:
CREATE TABLE my_table (
id BIGINT AUTO_INCREMENT,
user_id BIGINT,
identifier VARCHAR(64),
account_id VARCHAR(64),
top_speed INT UNSIGNED NOT NULL,
total_chars INT UNSIGNED NOT NULL,
total_time INT UNSIGNED NOT NULL,
keystrokes INT UNSIGNED NOT NULL,
avg_speed INT UNSIGNED NOT NULL,
country_code VARCHAR(16),
update_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY(id), UNIQUE KEY(user_id)
);
PS:
I also tried different computers with a solid state drive and a hard disk drive, but that didn't help either.
Splitting up a table is unlikely to help at all. Ditto for PARTITIONing.
Let's count the disk hits. I will skip counting non-leaf nodes in BTrees; they tend to be cached; I will count leaf nodes in the data and indexes; they tend not to be cached.
IODKU (INSERT ... ON DUPLICATE KEY UPDATE) does:
Read the index block containing the entry for any UNIQUE key. In your case, that is probably user_id. Please provide a sample SQL statement. 1 read.
If the user_id entry is found in the index, read the record from the data as indexed by the PK(id) and do the UPDATE, and leave this second block in the buffer_pool for eventual rewrite to disk. 1 read now, 1 write later.
If the record is not found, do INSERT. The index block that needs the new row was already read, so it is ready to have a new entry inserted. Meanwhile, the "last" block in the table (due to id being AUTO_INCREMENT) is probably already cached. Add the new row to it. 0 reads now, 1 write later (UNIQUE). (Rewriting the "last" block is amortized over, say, 100 rows, so I am ignoring it.)
Eventually do the write(s).
Total, assuming essentially all take the UPDATE path: 2 reads and 1 write. Assuming the user_id follows no simple pattern, I will assume that all 3 I/Os are "random".
Let's consider a variation... What if you got rid of id? Do you need id anywhere else? Since you have a UNIQUE key, it could be the PK. That is, replace your two indexes with just PRIMARY KEY(user_id); a sketch of that change follows the counts below. Now the counts are:
1 read
If UPDATE, 0 read, 1 write
If INSERT, 0 read, 0 write
Total: 1 read, 1 write. 2/3 as many as before. Better, but still not great.
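A hedged sketch of that schema change, assuming id is not referenced anywhere else (drop the AUTO_INCREMENT first so the old primary key can be removed):
ALTER TABLE my_table MODIFY id BIGINT NOT NULL;   -- remove AUTO_INCREMENT
ALTER TABLE my_table
  DROP PRIMARY KEY,
  DROP COLUMN id,
  DROP INDEX user_id,                             -- the old UNIQUE key
  ADD PRIMARY KEY (user_id);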
Caching
How much RAM do you have?
What is the value of innodb_buffer_pool_size?
SHOW TABLE STATUS -- What are Data_length and Index_length?
I suspect that the buffer pool is not big enough and could possibly be raised. If you have more than 4GB of RAM, make it about 70% of RAM.
Others
SSDs should have helped significantly, since you appear to be I/O bound. Can you tell whether you are I/O-bound or CPU-bound?
How many rows are you updating at once? How long does it take? Is it batched, or one at a time? There may be a significant improvement possible here.
Do you really need BIGINT (8 bytes)? INT UNSIGNED is only 4 bytes.
Is a transaction involved?
Is the Master having a problem? The Slave? Both? I don't want to fix the Master in such a way that it messes up the Slave.
Try splitting your database across several MySQL instances using a proxy such as mysql-proxy or HAProxy instead of a single MySQL instance. Maybe you can get better performance.

InnoDB vs MyIsam on a frequently sorted MySQL 5.5 table

I have a table (currently InnoDB) with roughly 100k records. These records have an order column so they can make up an ordered queue. Actually, these records belong to about 40 departments, each of which has its own queue and therefore its own records in this table.
The problem is that we're constantly getting "lock wait time" errors because various departments are sorting their queues (and records) simultaneously.
I know that MyISAM uses table-level locking and InnoDB row-level locking. The thing is, I'm not sure which one is faster for this kind of operation.
The other thing is that this table is joined with other InnoDB tables in various queries, and I don't know what can happen if I switch the table to MyISAM.
Here's the table structure:
CREATE TABLE `ssti` (
`demand_nber` MEDIUMINT(8) UNSIGNED NOT NULL,
`year` CHAR(4) NULL DEFAULT NULL,
`department` CHAR(4) NULL DEFAULT NULL COMMENT '4 characters',
-- [other columns ]
`priority` INT(10) UNSIGNED NOT NULL DEFAULT '9999999',
PRIMARY KEY (`NR_DMD`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB;
And here's the piece of Java code that updates the priorities:
PreparedStatement psUpdatePriority = con.prepareStatement("UPDATE `ssti` SET `priority` = ? WHERE demand_nber=?;");
for (int i = 0; i < demands.length(); ++i) {
JSONObject d = demands.getJSONObject(i);
psUpdatePriority.setInt(1, d.getInt("newPriority"));
psUpdatePriority.setInt(2, d.getInt("demandNumber"));
psUpdatePriority.addBatch();
}
int[] totalUpdated = psUpdatePriority.executeBatch();
When investigating performance problems, be sure to enable the slow query log so you have a record of the specific queries causing problems.
What it looks like here is you're including a column in your WHERE clause that's not indexed. That's extremely painful on large data sets as it requires a "table scan", or reading every record and evaluating them sequentially.
Once that column is indexed, your queries should be significantly faster.
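If demand_nber is indeed not covered by the primary key as posted, an index along these lines (the name is illustrative) would avoid the table scan:
ALTER TABLE `ssti` ADD INDEX idx_demand_nber (demand_nber);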
If you're really up against the wall, you may want to break out each department into its own table. This is very difficult to undo, so I'd only pursue it as a last resort.
SELECT statements normally do not block each other; sorting is done separately for each query in temporary space. If you are getting lock waits, then look at the updates that are blocking the selects.
With row locking, an UPDATE blocks only the (small number of) rows it needs, allowing other statements to access the other rows. With table locking, an UPDATE blocks the whole table, and no other statement can access it until the UPDATE is finished. So MyISAM will make your problem worse in any case.
--
It seems that you are using this table for many purposes. Therefore, you need to consider all of them and their importance when tuning the performance of this table.
Case 1: Department queries its own data and needs it sorted
When the result of some data manipulation is reused many times, the general rule is to save it. That allows reading the result straight away, rather than computing it every time.
To allow queries to read sorted data you need to create an index.
However, an index on just the sorting column priority will not help. Since each department can see only its own data, every query also filters on the department number. Hence your index should contain two key columns: KEY (department, priority).
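For example (the index name is illustrative):
ALTER TABLE `ssti` ADD INDEX idx_department_priority (department, priority);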
Case 2: Table is joined to several other tables
To speed up queries with JOINs, you'll need indexes whose key columns match the columns used in the joins.
Case 3: Inserting new, possibly transactional, data
A single table is limited in how well it can handle both inserts of new data and reporting queries. Usually, transactional and reporting uses are considered alternatives to each other. It is good practice to use reporting tables that summarise data from the transactional tables. Also, joins to dimensions are easier when the data is aggregated (there are fewer rows).

Very different execution times for Update on different MySQL systems

I have a very strange situation with MySQL.
Take the following example:
A table containing 15,000 entries:
CREATE TABLE `temp_geonis_export` (
`auto_id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
`obj_id` VARCHAR(10),
`gis_id` VARCHAR(45),
`letzteReinigung_id` VARCHAR(10),
INDEX `Index_2`(`gis_id`),
PRIMARY KEY (`auto_id`)
)
ENGINE = InnoDB;
Now I'm updating 14,000 of the rows in the table. I know it's a pretty ugly statement that could easily be rewritten, but that is not the question...
Update temp_geonis_export as temp
inner join (
Select gis_id, obj_id from
(
Select abw.gis_id, abw.bezeichnung, erh.obj_id
from od_abwasserbauwerk as abw
inner join od_erhaltungsereignis as erh on erh.fs_abwasserbauwerk = abw.obj_id and erh.status = 2
inner join od_reinigung as unter on unter.obj_id = erh.obj_id
order by fs_abwasserbauwerk asc, erh.zeitpunkt asc
) as alleSortiert group by alleSortiert.gis_id
) as naechsteRein on temp.gis_id = naechsteRein.gis_id
set temp.naechsteReinigung_id = naechsteRein.obj_id;
Now, if I run the Update-statement on our development server, it takes about 1 sec. On one of our production servers it takes 90 seconds!!
These are my observations:
Handler_read_rnd_next: 101,000 (development) vs. 266,177,000 (production)
Very high CPU usage on the production system (due to the above observation)
Almost no disk I/O on either system
When I rewrite the update query and store the output of the subquery in a temporary table, the UPDATE statement is fast on both systems
Based on these observations, my conclusion is that for some reason our production server has to perform a full table scan for each updated row, while the development server does not. It must be a configuration issue, since our servers are all running 5.1.25 and the hardware is comparable.
Do you have a clue what I have to change on our production server to make it perform better?
Thanks for your help
After hours I finally got the solution:
The problem was that the default charset on the production server was different from the charset used in the database. When the table was created without explicitly specifying the charset, MySQL did not use the indexes. This is because MySQL can't use indexes on CHAR fields when the character sets of the joined tables differ.
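A few statements that may help diagnose and fix this; the target character set below is only an example, so use whatever the joined tables actually share:
-- Compare the server default with the charsets of the joined tables
SHOW VARIABLES LIKE 'character_set_server';
SHOW CREATE TABLE temp_geonis_export;
SHOW CREATE TABLE od_abwasserbauwerk;
-- Example fix: convert the table so the joined CHAR/VARCHAR columns share a charset
ALTER TABLE temp_geonis_export CONVERT TO CHARACTER SET utf8;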
Thank you so much for your help, peterm. You pointed me in the right direction.