Design of mysql database for large number of large matrix data

Design of mysql database for large number of large matrix data - mysql

I am looking into storing a "large" amount of data and not sure what the best solution is, so any help would be most appreciated. The structure of the data is
450,000 rows
11,000 columns
My requirements are:
1) Need as fast access as possible to a small subset of the data e.g. rows (1,2,3) and columns (5,10,1000)
2) Needs to be scalable will be adding columns every month but the number of rows are fixed.
My understanding is that often its best to store as:
id| row_number| column_number| value
but this would create 4,950,000,000 entries? I have tried storing as just rows and columns as is in MySQL but it is very slow at subsetting the data.
Thanks!

Build the giant matrix table
As N.B. said in comments, there's no cleaner way than using one mysql row for each matrix value.
You can do it without the id column:
CREATE TABLE `stackoverflow`.`matrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
You may add a UNIQUE INDEX on colNum, rowNum, or only a non-unique INDEX on colNum if you often access matrix by column (because PRIMARY INDEX is on ( `rowNum`, `colNum` ), note the order, so it will be inefficient when it comes to select a whole column).
You'll probably need more than 200Go to store the 450.000x11.000 lines, including indexes.
Inserting data may be slow (because there are two indexes to rebuild, and 450.000 entries [1 per row] to add when adding a column).
Edit should be very fast, as index wouldn't change and value is of fixed size
If you access same subsets (rows + cols) often, maybe you can use PARTITIONing of the table if you need something "faster" than what mysql provides by default.
After years of experience (20201 edit)
Re-reading myself years later, I would say the "cache" ideas are totally dumb, as it's MySQL role to handle these sort of cache (it should actually already be in the innodb pool cache).
A better thing would be, if matrix is full of zeroes, not storing the zero values, and consider 0 as "default" in the client code. That way, you may lightenup the storage (if needed: mysql should actually be pretty fast responding to queries event on such 5 billion row table)
Another thing, if storage makes issue, is to use a single ID to identify both row and col: you say number of rows is fixed (450000) so you may replace (row, col) with a single (id = 450000*col+row) value [tho it needs BIGINT so maybe not better than 2 columns)
Don't do like below: don't reinvent MySQL cache
Add a cache (actually no)
Since you said you add values, and doesn't seem to edit matrix values, a cache can speed up frequently asked rows/columns.
If you often read the same rows/columns, you can cache their result in another table (same structure to make it easier):
CREATE TABLE `stackoverflow`.`cachedPartialMatrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
That table will be void at the beginning, and each SELECT on the matrix table will feed the cache. When you want to get a column / row:
SELECT the row/column from that caching table
If the SELECT returns a void/partial result (no data returned or not enough data to match the expected row/column number) then do the SELECT on the matrix table
Save the SELECT from the matrix table to the cachingPartialMatrix
If the caching matrix gets too big, clear it (the bigger cached matrix is, the slower it becomes)
Smarter cache (actually, no)
You can make it even smarter with a third table to count how many times a selection is done:
CREATE TABLE `stackoverflow`.`requestsCounter` (
`isRowSelect` BOOLEAN NOT NULL ,
`index` INT NOT NULL ,
`count` INT NOT NULL ,
`lastDate` DATETIME NOT NULL,
PRIMARY KEY ( `isRowSelect` , `index` )
) ENGINE = MYISAM ;
When you do a request on your matrix (one may use TRIGGERS) for the Nth-row or Kth-column, increment the counter. When the counter gets big enough, feed the cache.
lastDate can be used to remove some old values from the cache (take care: if you remove the Nth-column from cache entries because its ``lastDate```is old enough, you may break some other entries cache) or to regularly clear the cache and only leave the recently selected values.

Related

Updating single table frequently vs using another table and CRON to import changes into main table in MySQL?

I have a table with login logs which is EXTREMELY busy and large InnoDB table. New rows are inserted all the time, the table is queried by other parts of the system, it is by far the busiest table in the DB. In this table, there is logid which is PRIMARY KEY and its generated as a random hash by software (not auto increment ID). I also want to store some data like number of items viewed.
create table loginlogs
(
logid bigint unsigned primary key,
some_data varchar(255),
viewed_items biging unsigned
)
viewed_items is a value that will get updated for multiple rows very often (assume thousands of updates / second). The dilemma I am facing now is:
Should I
UPDATE loginlogs SET viewed_items = XXXX WHERE logid = YYYYY
or should I create
create table loginlogs_viewed_items
(
logid bigint unsigned primary key,
viewed_items biging unsigned,
exported tinyint unsigned default 0
)
and then execute with CRON
UPDATE loginlogs_viewed_items t
INNER JOIN loginlogs l ON l.logid = t.logid
SET
t.exported = 1,
l.viewed_items = t.viewed_items
WHERE
t.exported = 0;
e.g. every hour?
Note that either way the viewed_items counter will be updated MANY TIMES for one logid, it can be even 100 / hour / logid and there is tons of rows. So whichever table I chose for this, either the main one or the separate one, it will be getting updated quite frequently.
I want to avoid unnecessary locking of loginlogs table and at the same time I do not want to degrade performance by duplicating data in another table.

Hmm, I wonder why you'd want to change log entries and not just add new ones...
But anyway, as you said either way the updates have to happen, whether individually or in bulk.
If you have less busy time windows updating in bulk then might have an advantage. Otherwise the bulk update may have more significant impact when running in contrast to individual updates that might "interleave" more with the other operations making the impact less "feelable".
If the column you need to update is not needed all the time, you could think of having a separate table just for this column. That way queries that just need the other columns may be less affected by the updates.

"Tons of rows" -- To some people, that is "millions". To others, even "billions" is not really big. Please provide some numbers; the answer can be different. Meanwhile, here are some general principles.
I will assume the table is ENGINE=InnoDB.
UPDATEing one row at a time is 10 times as costly as updating 100 rows at a time.
UPDATEing more than 1000 rows in a single statement is problematic. It will lock each row, potentially leading to delays in other statements and maybe even deadlocks.
Having a 'random' PRIMARY KEY (as opposed to AUTO_INCREMENT or something roughly chronologically ordered) is very costly when the table is bigger than the buffer_pool. How much RAM do you have?
"the table is queried by other parts of the system" -- by the random PK? One row at a time? How frequently?
Please elaborate on how exported works. For example, does it get reset to 0 by something else?
Is there a single client doing all the work? Or are there multiple servers throwing data and queries at the table? (Different techniques are needed.)

Improving MySQL Query Speeds - 150,000+ Rows Returned Slows Query

Hi I currently have a query which is taking 11(sec) to run. I have a report which is displayed on a website which runs 4 different queries which are similar and all take 11(sec) each to run. I don't really want the customer having to wait a minute for all of these queries to run and display the data.
I am using 4 different AJAX requests to call an APIs to get the data I need and these all start at once but the queries are running one after another. If there was a way to get these queries to all run at once (parallel) so the total load time is only 11(sec) that would also fix my issue, I don't believe that is possible though.
Here is the query I am running:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
I can't think of anyway to speed this query up at all, below are pictures of the table indexes and the explain statement on this query.
I think the above query is using relevant indexes in the where conditions.
If there is anything you can think of to speed this query up please let me know, I have been working on it for 3 days and can't seem to figure out the problem. It would be great to get the query times down to 5(sec) maximum. If I am wrong about the AJAX issue please let me know as this would also fix my issue.
" EDIT "
I have came across something quite strange which might be causing the issue. When I change the day_epoch range to something smaller (5th - 9th) which returns 130,000 rows the query time is 0.7(sec) but then I add one more day onto that range (5th - 10th) and it returns over 150,000 rows the query time is 13(sec). I have ran loads of different ranges and have came to the conclusion if the amount of rows returned is over 150,000 that has a huge effect on the query times.
Table Definition -
CREATE TABLE `tracking_daily_stats_zone_unique_device_uuids_per_hour` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`day_epoch` int(10) NOT NULL,
`day_of_week` tinyint(1) NOT NULL COMMENT 'day of week, monday = 1',
`hour` int(2) NOT NULL,
`venue_id` int(5) NOT NULL,
`zone_id` int(5) NOT NULL,
`device_uuid` binary(16) NOT NULL COMMENT 'binary representation of the device_uuid, unique for a single day',
`device_vendor_id` int(5) unsigned NOT NULL DEFAULT '0' COMMENT 'id of the device vendor',
`first_seen` int(10) unsigned NOT NULL DEFAULT '0',
`last_seen` int(10) unsigned NOT NULL DEFAULT '0',
`is_repeat` tinyint(1) NOT NULL COMMENT 'is the device a repeat for this day?',
`prev_last_seen` int(10) NOT NULL DEFAULT '0' COMMENT 'previous last seen ts',
PRIMARY KEY (`id`,`venue_id`) USING BTREE,
KEY `venue_id` (`venue_id`),
KEY `zone_id` (`zone_id`),
KEY `day_of_week` (`day_of_week`),
KEY `day_epoch` (`day_epoch`),
KEY `hour` (`hour`),
KEY `device_uuid` (`device_uuid`),
KEY `is_repeat` (`is_repeat`),
KEY `device_vendor_id` (`device_vendor_id`)
) ENGINE=InnoDB AUTO_INCREMENT=450967720 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY HASH (venue_id)
PARTITIONS 100 */

The straight forward solution is to add this query specific index to the table:
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
ADD INDEX complex_idx (`venue_id`, `day_epoch`, `zone_id`)
WARNING This query change can take a while on DB.
And then force it when you call:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
USE INDEX (complex_idx)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
It is definitely not universal but should work for this particular query.
UPDATE When you have partitioned table you can get profit by forcing particular PARTITION. In our case since that is venue_id just force it:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
PARTITION (`p46`)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
Where p46 is concatenated string of p and venue_id = 46
And another trick if you go this way. You can remove AND venue_id = 46 from WHERE clause. Because there is no other data in that partition.

What happens if you change the order of conditions? Put venue_id = ? first. The order matters.
Now it first checks all rows for:
- day_epoch >= 1552435200
- then, the remaining set for day_epoch < 1553040000
- then, the remaining set for venue_id = 46
- then, the remaining set for zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
When working with heavy queries, you should always try to make the first "selector" the most effective. You can do that by using a proper index for 1 (or combination) index and to make sure that first selector narrows down the most (at least for integers, in case of strings you need another tactic).
Sometimes, a query simply is slow. When you have a lot of data (and/or not enough resources) you just cant really do anything about that. Thats where you need another solution: Make a summary table. I doubt you show 150.000 rows x4 to your visitor. You can sum it, e.g., hourly or every few minutes and select from that way smaller table.
Offtopic: Putting an index on everything only slows you down when inserting/updating/deleting. Index the least amount of columns, just the once you actually filter on (e.g. use in a WHERE or GROUP BY).

450M rows is rather large. So, I will discuss a variety of issues that can help.
Shrink data A big table leads to more I/O, which is the main performance killer. ('Small' tables tend to stay cached, and not have an I/O burden.)
Any kind of INT, even INT(2) takes 4 bytes. An "hour" can easily fit in a 1-byte TINYINT. That saves over a 1GB in the data, plus a similar amount in INDEX(hour).
If hour and day_of_week can be derived, don't bother having them as separate columns. This will save more space.
Some reason to use a 4-byte day_epoch instead of a 3-byte DATE? Or perhaps you do need a 5-byte DATETIME or TIMESTAMP.
Optimal INDEX (take #1)
If it is always a single venue_id, then either this is a good first cut at the optimal index:
INDEX(venue_id, zone_id, day_epoch)
First is the constant, then the IN, then a range. The Optimizer does well with this in many cases. (It is unclear whether the number of items in an IN clause can lead to inefficiencies.)
Better Primary Key (better index)
With AUTO_INCREMENT, there is probably no good reason to include columns after the auto_inc column in the PK. That is, PRIMARY KEY(id, venue_id) is no better than PRIMARY KEY(id).
InnoDB orders the data's BTree according to the PRIMARY KEY. So, if you are fetching several rows and can arrange for them to be adjacent to each other based on the PK, you get extra performance. (cf "Clustered".) So:
PRIMARY KEY(venue_id, zone_id, day_epoch, -- this order, as discussed above;
id) -- to make sure that the entire PK is unique.
INDEX(id) -- to keep AUTO_INCREMENT happy
And, I agree with DROPping any indexes that are not in use, including the one I recommended above. It is rarely useful to index flags (is_repeat).
UUID
Indexing a UUID can be deadly for performance once the table is really big. This is because of the randomness of UUIDs/GUIDs, leading to ever-increasing I/O burden to insert new entries in the index.
Multi-dimensional
Assuming day_epoch is sometimes multiple days, you seem to have 2 or 3 "dimensions":
A date range
A list of zones
A venue.
INDEXes are 1-dimensional. Therein lies the problem. However, PARTITIONing can sometimes help. I discuss this briefly as "case 2" in http://mysql.rjweb.org/doc.php/partitionmaint .
There is no good way to get 3 dimensions, so let's focus on 2.
You should partition on something that is a "range", such as day_epoch or zone_id.
After that, you should decide what to put in the PRIMARY KEY so that you can further take advantage of "clustering".
Plan A: This assumes you are searching for only one venue_id at a time:
PARTITION BY RANGE(day_epoch) -- see note below
PRIMARY KEY(venue_id, zone_id, id)
Plan B: This assumes you sometimes srefineearch for venue_id IN (.., .., ...), hence it does not make a good first column for the PK:
Well, I don't have good advice here; so let's go with Plan A.
The RANGE expression must be numeric. Your day_epoch works fine as is. Changing to a DATE, would necessitate BY RANGE(TO_DAYS(...)), which works fine.
You should limit the number of partitions to 50. (The 81 mentioned above is not bad.) The problem is that "lots" of partitions introduces different inefficiencies; "too few" partitions leads to "why bother".
Note that almost always the optimal PK is different for a partitioned table than the equivalent non-partitioned table.
Note that I disagree with partitioning on venue_id since it is so easy to put that column at the start of the PK instead.
Analysis
Assuming you search for a single venue_id and use my suggested partitioning & PK, here's how the SELECT performs:
Filter on the date range. This is likely to limit the activity to a single partition.
Drill into the data's BTree for that one partition to find the one venue_id.
Hopscotch through the data from there, landing on the desired zone_ids.
For each, further filter based the date.

Database table with million of rows

example i have some gps devices that send info to my database every seconds
so 1 device create 1 row in mysql database with these columns (8)
id=12341 date=22.02.2018 time=22:40
langitude=22.236558789 longitude=78.9654582 deviceID=24 name=device-name someinfo=asdadadasd
so for 1 minute it create 60 rows , for 24 hours it create 864000 rows
and for 1 month(31days) 2678400 ROWS
so 1 device is creating 2.6 million rows per month in my db table ( records are deleted every month.)
so if there are more devices will be 2.6 Million * number of devices
so my questions are like this:
Question 1: if i make a search like this from php ( just for current day and for 1 device)
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24'
max possible results will be 86400 rows
will it overload my server too much
Question 2: limit with 5 hours (18000 rows) will that be problem for database or will it load server like first example or less
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24' LIMIT 18000
Question 3: if i show just 1 result from db will it overload server
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24' LIMIT 1
does it mean that if i have millions of rows and 1000rows will load server same if i show just 1 result

Millions of rows is not a problem, this is what SQL databases are designed to handle, if you have a well designed schema and good indexes.
Use proper types
Instead of storing your dates and times as separate strings, store them either as a single datetime or separate date and time types. See indexing below for more about which one to use. This is both more compact, allows indexing, faster sorting, and it makes available date and time functions without having to do conversions.
Similarly, be sure to use the appropriate numeric type for latitude, and longitude. You'll probably want to use numeric to ensure precision.
Since you're going to be storing billions of rows, be sure to use a bigint for your primary key. A regular int can only go up to about 2 billion.
Move repeated data into another table.
Instead of storing information about the device in every row, store that in a separate table. Then only store the device's ID in your log. This will cut down on your storage size, and eliminate mistakes due to data duplication. Be sure to declare the device ID as a foreign key, this will provide referential integrity and an index.
Add indexes
Indexes are what allows a database to search through millions or billions of rows very, very efficiently. Be sure there are indexes on the rows you use frequently, such as your timestamp.
A lack of indexes on date and deviceID is likely why your queries are so slow. Without an index, MySQL has to look at every row in the database known as a full table scan. This is why your queries are so slow, you're lacking indexes.
You can discover whether your queries are using indexes with explain.
datetime or time + date?
Normally it's best to store your date and time in a single column, conventionally called created_at. Then you can use date to get just the date part like so.
select *
from gps_logs
where date(created_at) = '2018-07-14'
There's a problem. The problem is how indexes work... or don't. Because of the function call, where date(created_at) = '2018-07-14' will not use an index. MySQL will run date(created_at) on every single row. This means a performance killing full table scan.
You can work around this by working with just the datetime column. This will use an index and be efficient.
select *
from gps_logs
where '2018-07-14 00:00:00' <= created_at and created_at < '2018-07-15 00:00:00'
Or you can split your single datetime column into date and time columns, but this introduces new problems. Querying ranges which cross a day boundary becomes difficult. Like maybe you want a day in a different time zone. It's easy with a single column.
select *
from gps_logs
where '2018-07-12 10:00:00' <= created_at and created_at < '2018-07-13 10:00:00'
But it's more involved with a separate date and time.
select *
from gps_logs
where (created_date = '2018-07-12' and created_time >= '10:00:00')
or (created_date = '2018-07-13' and created_time < '10:00:00');
Or you can switch to a database with partial indexes like Postgresql. A partial index allows you to index only part of a value, or the result of a function. And Postgresql does a lot of things better than MySQL. This is what I recommend.
Do as much work in SQL as possible.
For example, if you want to know how many log entries there are per device per day, rather than pulling all the rows out and calculating them yourself, you'd use group by to group them by device and day.
select gps_device_id, count(id) as num_entries, created_at::date as day
from gps_logs
group by gps_device_id, day;
gps_device_id | num_entries | day
---------------+-------------+------------
1 | 29310 | 2018-07-12
2 | 23923 | 2018-07-11
2 | 23988 | 2018-07-12
With this much data, you will want to rely heavily on group by and the associated aggregate functions like sum, count, max, min and so on.
Avoid select *
If you must retrieve 86400 rows, the cost of simply fetching all that data from the database can be costly. You can speed this up significantly by only fetching the columns you need. This means using select only, the, specific, columns, you, need rather than select *.
Putting it all together.
In PostgreSQL
Your schema in PostgreSQL should look something like this.
create table gps_devices (
id serial primary key,
name text not null
-- any other columns about the devices
);
create table gps_logs (
id bigserial primary key,
gps_device_id int references gps_devices(id),
created_at timestamp not null default current_timestamp,
latitude numeric(12,9) not null,
longitude numeric(12,9) not null
);
create index timestamp_and_device on gps_logs(created_at, gps_device_id);
create index date_and_device on gps_logs((created_at::date), gps_device_id);
A query can generally only use one index per table. Since you'll be searching on the timestamp and device ID together a lot timestamp_and_device combines indexing both the timestamp and device ID.
date_and_device is the same thing, but it's a partial index on just the date part of the timestamp. This will make where created_at::date = '2018-07-12' and gps_device_id = 42 very efficient.
In MySQL
create table gps_devices (
id int primary key auto_increment,
name text not null
-- any other columns about the devices
);
create table gps_logs (
id bigint primary key auto_increment,
gps_device_id int references gps_devices(id),
foreign key (gps_device_id) references gps_devices(id),
created_at timestamp not null default current_timestamp,
latitude numeric(12,9) not null,
longitude numeric(12,9) not null
);
create index timestamp_and_device on gps_logs(created_at, gps_device_id);
Very similar, but no partial index. So you'll either need to always use a bare created_at in your where clauses, or switch to separate date and time types.

Just read you question, for me the Answer is
Just create a separate table for Latitude and longitude and make your ID Foreign key and save it their.

Without knowing the exact queries you want to run I can just guess the best structure. Having said that, you should aim for the optimal types that use the minimum number of bytes per row. This should make your queries faster.
For example, you could use the structure below:
create table device (
id int primary key not null,
name varchar(20),
someinfo varchar(100)
);
create table location (
device_id int not null,
recorded_at timestamp not null,
latitude double not null, -- instead of varchar; maybe float?
longitude double not null, -- instead of varchar; maybe float?
foreign key (device_id) references device (id)
);
create index ix_loc_dev on location (device_id, recorded_at);
If you include the exact queries (naming the columns) we can create better indexes for them.
Since probably your query selectivity is bad, your queries may run Full Table Scans. For this case I took it a step further I used the smallest possible data types for the columns, so it will be faster:
create table location (
device_id tinyint not null,
recorded_at timestamp not null,
latitude float not null,
longitude float not null,
foreign key (device_id) references device (id)
);
Can't really think of anything smaller than this.

The best what I can recommend to you is to use time-series database for storing and accessing time-series data. You can host any kind of time-series database engine locally, just put a little bit more resources into development of it's access methods or use any specialized databases for telematics data like this.

Search 1 row data on bigtable 800'000'000 row MariaDB InnoDB

I have table storing phone numbers with 800M rows.
column
region_code_id smallint(4) unsigned YES
local_number mediumint(7) unsigned YES
region_id smallint(4) unsigned YES
operator_id smallint(4) unsigned YES
id int(10) unsigned NO PRI auto_increment
I need find number.id where region_code_id = 119 and localnumber = 1234567
select * from numbers where numbers.region_code_id = 119 and numbers.local_number = 1234567;
this query execute over 600 second.
How can I improve it ?
UPDATE
Thank for unswer, i understand i need index for this column, i try this as soon as I get the server with more SSD, now i have free 1GB SSD space. How i can to find out how much space the index will occupy?

Consider adding INDEX on columns which you use in WHERE clause.
Start with:
ALTER TABLE `numbers`
ADD INDEX `region_code_id_local_number`
(`region_code_id`, `local_number`);
Note : it can take some time for index to build.
Before and after change, execute explain plan to compare:
EXPLAIN EXTENDED select * from numbers where numbers.region_code_id = 119 and numbers.local_number = 1234567;
References:
How MySQL uses indexes

For this query:
select *
from numbers
where numbers.region_code_id = 119 and
numbers.local_number = 1234567;
You want an index on numbers(region_code_id, local_number) or numbers(local_number, region_code_id). The order of the columns doesn't matter because the conditions are equality for both columns.
create index idx_numbers_region_local on numbers(region_code_id, local_number);

I agree that INDEX(region_code_id, local_number) (in either order) is mandatory for this problem, but I am sticking my nose in to carry it a step further. Isn't that pair "unique"? Or do you have duplicate numbers in the table? If it is unique, then get rid of id and make that pair PRIMARY KEY(region_code_id, local_number). The table will possibly be smaller after the change.
Back to your question of "how big". How big is the table now? Perhaps 40GB? A secondary index (as originally proposed) would probably add about 20GB. And you would need 20-60GB of free disk space to perform the ALTER. This depends on whether adding the index can be done "inplace" in that version.
Changing the PK (as I suggest) would result in a little less than 40GB for the table. It will take 40GB of free space to perform the ALTER.
In general (and pessimistically), plan on an ALTER needing the both the original table and the new table sitting on disk at one time. That includes full copies of the data and index(es).
(A side question: Are you sure local_number is limited to 7 digits everywhere?)
Another approach to the question... For calculating the size of a table or index in InnoDB, add up the datatype sizes (3 bytes for MEDIUMINT, some average for VARCHAR, etc). Then multiply by the number of rows. Then multiply by 4; this will give you the approximate disk space needed. (Usually 2-3 is sufficient for the last multiplier.)
When changing the PK, do it in one step:
ALTER TABLE foo
DROP PRIMARY KEY,
ADD PRIMARY KEY(region_code_id, local_number);
Changing the PK cannot be done "inplace".
Edit (mostly for other readers)
#berap points out that id is needed for other purposes. Hence, dropping id and switching the PK is not an option.
However, this is sometimes an option (perhaps not in this case):
ALTER TABLE foo
DROP PRIMARY KEY,
ADD PRIMARY KEY(region_code_id, local_number),
ADD INDEX(id);
Notes:
The id..AUTO_INCREMENT will continue to work even with just INDEX.
The SELECT in question will be more efficient because it is the PK.
SELECT .. WHERE id = ... will be less efficient because id is a secondary key.
The table will be the same size either way; the secondary key would also be the same size either way -- because every secondary key contains the PK columns, too. (This note is InnoDB-specific.)

Best solution for saving boolean values and saving cpu and memory on searches

What is the best solution for inserting boolean values on database if you want more query performance and minimum losing of memory on select statement.
For example:
I have a table with 36 fields that 30 of them has boolean values (zero or one) and i need to search records using the boolean fields that just have true values.
SELECT * FROM `myTable`
WHERE
`field_5th` = 1
AND `field_12th` = 1
AND `field_20` = 1
AND `field_8` = 1
Is there any solution?

If you want to store boolean values or flags there are basically three options:
Individual columns
This is reflected in your example above. The advantage is that you will be able to put indexes on the flags you intend to use most often for lookups. The disadvantage is that this will take up more space (since the minimum column size that can be allocated is 1 byte.)
However, if you're column names are really going to be field_20, field_21, etc. Then this is absolutely NOT the way to go. Numbered columns are a sign you should use either of the other two methods.
Bitmasks
As was suggested above you can store multiple values in a single integer column. A BIGINT column would give you up to 64 possible flags.
Values would be something like:
UPDATE table SET flags=b'100';
UPDATE table SET flags=b'10000';
Then the field would look something like: 10100
That would represent having two flag values set. To query for any particular flag value set, you would do
SELECT flags FROM table WHERE flags & b'100';
The advantage of this is that your flags are very compact space-wise. The disadvantage is that you can't place indexes on the field which would help improve the performance of searching for specific flags.
One-to-many relationship
This is where you create another table, and each row there would have the id of the row it's linked to, and the flag:
CREATE TABLE main (
main_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
);
CREATE TABLE flag (
main_id INT UNSIGNED NOT NULL,
name VARCHAR(16)
);
Then you would insert multiple rows into the flag table.
The advantage is that you can use indexes for lookups, and you can have any number of flags per row without changing your schema. This works best for sparse values, where most rows do not have a value set. If every row needs all flags defined, then this isn't very efficient.
For performance comparisson you can read a blog post I wrote on the topic:
Set Performance Compare
Also when you ask which is "Best" that's a very subjective question. Best at what? It all really depends on what your data looks like and what your requirements are and how you want to query it.
Keep in mind that if you want to do a query like:
SELECT * FROM table WHERE some_flag=true
Indexes will only help you if few rows have that value set. If most of the rows in the table have some_flag=true, then mysql will ignore indexes and do a full table scan instead.

How many rows of data are you querying over? You can store the boolean values in an integer value and use bit operations to test for them them. It's not indexable, but storage is very well packed. Using TINYINT fields with indexes would pick one index to use and scan from there.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008