I've been running a website that accumulates a large amount of data as it goes. Users send data such as IP, ID, and date to the server, and it is stored in a MySQL database. Each entry is stored as a single row in a table.
Right now there are approximately 24 million rows in the table.
Problem 1:
Things are getting slow now: a full table scan can take several minutes, even though I have already indexed the table.
Problem 2:
If a user pulls data from the table with a SELECT, it can block all other users' access to the site (because the table is locked) until the query is complete.
Our server:
32 GB RAM
12-core / 24-thread CPU
The table uses the MyISAM engine.
EXPLAIN SELECT SUM(impresn), SUM(rae), SUM(reve), `date` FROM `publisher_ads_hits` WHERE date between '2015-05-01' AND '2016-04-02' AND userid='168' GROUP BY date ORDER BY date DESC
To add to the comment from @Max P.: if you write to MyISAM tables, ALL SELECTs are blocked, because MyISAM only has table-level locking. If you use InnoDB, there is row-level locking that only locks the rows a statement needs. Also show us the EXPLAIN of your queries; it may turn out that you need to create some new indexes. MySQL will generally use only one index per table in a query, so if you use several fields in the WHERE condition it can be useful to have a COMPOSITE INDEX over those fields.
According to EXPLAIN, the query doesn't use an index. Try adding a composite index on (userid, date).
If you have many update and delete operations, also try changing the engine to InnoDB.
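A minimal sketch of both suggestions, assuming the table and column names from the EXPLAIN above (the index name is made up):
-- Composite index so WHERE userid = ... AND date BETWEEN ... can be resolved via the index
ALTER TABLE publisher_ads_hits
  ADD INDEX idx_userid_date (userid, `date`);
-- Optional: switch to InnoDB for row-level locking instead of table locks
-- (this rebuilds the table and can take a while on 24 million rows)
ALTER TABLE publisher_ads_hits ENGINE=InnoDB;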
The basic problem is the full table scan. Some suggestions:
Partition the table based on date and don't keep more than 6-12 months of data in the live system (see the sketch after this list)
Add an index on user_id
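A rough sketch of date-based RANGE partitioning, assuming the table from the question (the partition boundaries are illustrative, and note that MySQL requires the partitioning column to be part of every unique key, so an existing primary key may need to include `date`):
ALTER TABLE publisher_ads_hits
  PARTITION BY RANGE COLUMNS(`date`) (
    PARTITION p2015h1 VALUES LESS THAN ('2015-07-01'),
    PARTITION p2015h2 VALUES LESS THAN ('2016-01-01'),
    PARTITION p2016h1 VALUES LESS THAN ('2016-07-01'),
    PARTITION pmax    VALUES LESS THAN (MAXVALUE)
  );
-- Old data can then be removed cheaply by dropping a whole partition:
ALTER TABLE publisher_ads_hits DROP PARTITION p2015h1;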
Related
I have a database table which is around 700 GB with 1 billion rows; the data is approximately 500 GB and the index is 200 GB.
I am trying to delete all the data before 2021.
There are roughly 298,970,576 rows in 2021, and 708,337,583 rows remaining.
To delete the old rows I am running this query non-stop in my Python shell:
DELETE FROM table_name WHERE id < 1762163840 LIMIT 1000000;
The id 1762163840 corresponds to the start of the 2021 data. Deleting 1 million rows takes almost 1200-1800 seconds.
Is there any way I can speed this up? The current approach has been running for more than 15 days, not much data has been deleted so far, and it is going to take many more days.
I thought that if I make a table with just the ids of all the records that I want to delete and then do an exact match like
DELETE FROM table_name WHERE id IN (SELECT id FROM _tmp_table_name);
Will that be fast? Is it going to be faster than first making a new table with all the records to keep and then dropping the old one?
The database is set up on RDS; the instance class is db.r3.large (2 vCPU and 15.25 GB RAM), with only 4-5 connections running.
I would suggest recreating the data you want to keep -- if you have enough space:
create table keep_data as
select *
from table_name
where id >= 1762163840;
Then you can truncate the table and re-insert new data:
truncate table table_name;
insert into table_name
select *
from keep_data;
This will rebuild the indexes as the rows are re-inserted.
The downside is that this will still take a while to re-insert the data (renaming keep_data instead would be faster; see the sketch below). But it should be much faster than deleting the rows.
AND . . . this will give you the opportunity to partition the table so future deletes can be handled much faster. You should look into table partitioning if you have such a large table.
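The rename variant mentioned above might look something like this (a sketch only; note that keep_data created via CREATE TABLE ... AS SELECT does not carry over indexes or the primary key, so they would need to be added before the swap):
-- e.g. ALTER TABLE keep_data ADD PRIMARY KEY (id);  -- plus any secondary indexes
RENAME TABLE table_name TO table_name_old,
             keep_data  TO table_name;
DROP TABLE table_name_old;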
Multiple techniques for big deletes: http://mysql.rjweb.org/doc.php/deletebig
It points out that LIMIT 1000000 is unnecessarily big and causes more locking than might be desirable.
In the long run, PARTITIONing would be beneficial; it mentions that, too.
If you do Gordon's technique (rebuilding the table with only what you need), you lose access to the table for a long time; I provide an alternative that has essentially zero downtime.
id IN (SELECT ...) can be terribly slow, both because of the inefficiency of the IN-subquery and because the DELETE will hang on to a huge number of rows for transactional integrity.
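A sketch of the smaller-batch approach the article recommends, using the same table name and id cutoff as the question (the batch size of 1000 is just an example):
-- Repeat until it affects 0 rows; each small batch holds locks only briefly.
DELETE FROM table_name
WHERE id < 1762163840
ORDER BY id
LIMIT 1000;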
I'm using a MySQL database and have to perform some select queries on large/huge tables (e.g. 267,736 rows and 30 columns).
Query details:
Only select queries (the data in the table is fixed, never an update, insert or delete)
Select query on all the columns (business requirement)
Mostly a limited number of rows (anywhere from LIMIT 10 up to all rows; the user can choose)
Could be ordered by one or multiple columns (creating indexes here will not help, since the user can order by any column they like)
Could be filtered by a value the user chooses (WHERE filter on one or more columns)
Currently the queries take up to 2 seconds, which is too long.
Is there a way to speed them up?
Which storage engine should I use: InnoDB, MyISAM, ...?
Should I have a primary key, even if I will never use it?
...?
You should (must actually) use indexes.
Create indexes on all columns that are used in WHERE or ORDER BY clauses. Also study and use EXPLAIN to see the impact of the indexes and to optimize your queries.
You don't have to create a primary key if there is no column with unique data in your table, but it is very likely that you do have such a column (id, time, ...). In that case you should use the primary key to filter your queries.
The number of columns in the query has close to no impact on SELECT speed.
As long as you run "only select queries", the storage engine does not matter much either. MyISAM might be a bit faster, but InnoDB has many features you will need once you decide that your "only select queries" rule must be broken.
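As a sketch of the above, with made-up table and column names (your_table, status, created_at) standing in for whatever the users actually filter and sort on:
-- One index per common WHERE / ORDER BY pattern
ALTER TABLE your_table
  ADD INDEX idx_status (status),
  ADD INDEX idx_created_at (created_at);
-- Check whether an index is actually used (look at the "key" column in the output):
EXPLAIN SELECT * FROM your_table
WHERE status = 'active'
ORDER BY created_at DESC
LIMIT 10;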
We have a table in a MySQL DB with a size of approximately 35 gigabytes.
I ran a simple query:
select count(*) from table_name
This query takes more than 10 minutes and then the connection gets dropped. Why is it taking so long?
We don't have a primary key in our table schema; is this the reason?
If you need any other details, I can provide them here.
Thanks
It's probably an InnoDB table. Since InnoDB supports transactions, the table is never in a static state; parts of it can always be changing. COUNT(*) has to walk through and count every record, which is why it takes so long, and the result reflects only the rows visible to your transaction, so it can vary with the activity on the table.
A quicker way to get a close count on InnoDB tables is to look at the cardinality of a unique index (i.e. a primary key on an auto-increment field). You can see this by running SHOW INDEX FROM table_name. The cardinality is the approximate number of unique values in that index; for a unique index, that is roughly the number of records.
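For example (these figures are estimates maintained by the engine, not exact counts):
SHOW INDEX FROM table_name;
-- look at the Cardinality column of the PRIMARY (or any unique) index row
-- information_schema keeps a similar per-table estimate:
SELECT TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'table_name';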
In MSSQL, I would do
Select count(*) from table_name(nolock)
There must be a way to read uncommitted in MySQL too.
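There is; the rough MySQL equivalent is to lower the isolation level for the session before counting (a sketch; note that a plain SELECT under InnoDB does not block on row locks anyway, so the gain, if any, is usually small):
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT COUNT(*) FROM table_name;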
Background
I have spent a couple of days trying to figure out how I should handle large amounts of data in MySQL. I have selected some programs and techniques for the software's new server. I am probably going to use Ubuntu 14.04 LTS running nginx and Percona Server, with TokuDB for the 3 tables I have planned and InnoDB for the rest of the tables.
But the major problem is still unresolved: how do I handle the huge amount of data in the database?
Data
My estimate for the incoming data is 500 million rows a year. I will be receiving measurement data from sensors every 4 minutes.
Requirements
Insertion speed is not very critical, but I want to be able to select a few hundred measurements in 1-2 seconds. The amount of required resources is also a key factor.
Current plan
I am currently thinking of splitting the sensor data into 3 tables.
EDIT:
On every table:
id = PK, AI
sensor_id will be indexed
CREATE TABLE measurements_minute(
id bigint(20),
value float,
sensor_id mediumint(8),
created timestamp
) ENGINE=TokuDB;
CREATE TABLE measurements_hour(
id bigint(20),
value float,
sensor_id mediumint(8),
created timestamp
) ENGINE=TokuDB;
CREATE TABLE measurements_day(
id bigint(20),
value float,
sensor_id mediumint(8),
created timestamp
) ENGINE=TokuDB;
So I would store this 4-minute data for one month. After the data is 1 month old, it would be deleted from the minute table; hourly averages would be calculated from the minute values and inserted into the measurements_hour table. Likewise, when the data is 1 year old, all the hourly data would be deleted and daily averages would be stored in the measurements_day table. The rollup step I have in mind looks roughly like the sketch below.
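(Untested sketch, using the table definitions above; the literal hour boundaries are placeholders for whatever window is being rolled up.)
-- Roll minute data up into an hourly average for one completed hour
INSERT INTO measurements_hour (value, sensor_id, created)
SELECT AVG(value), sensor_id, '2015-06-01 12:00:00'
FROM measurements_minute
WHERE created >= '2015-06-01 12:00:00'
  AND created <  '2015-06-01 13:00:00'
GROUP BY sensor_id;
-- Then purge minute rows older than one month
DELETE FROM measurements_minute
WHERE created < NOW() - INTERVAL 1 MONTH;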
Questions
Is this considered a good way of doing this? Is there something else to take into consideration? How about table partitioning, should I do that? How should I execute the splitting of the data into the different tables? Triggers and procedures?
EDIT: My ideas
Any idea if MonetDB or Infobright would be any good for this?
I have a few suggestions, and further questions.
You have not defined a primary key on your tables, so MySQL will create one automatically. Assuming that you meant for "id" to be your primary key, you need to change the line in all your table create statements to be something like "id bigint(20) NOT NULL AUTO_INCREMENT PRIMARY KEY,".
You haven't defined any indexes on the tables; how do you plan on querying? Without indexes, all queries will be full table scans and likely very slow.
Lastly, for this use case, I'd partition the tables to make the removal of old data quick and easy.
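Putting those three suggestions together for the minute table might look roughly like this (a sketch only; the partition boundaries and index names are illustrative, the same shape applies to the hour/day tables, and the partitioning column has to be part of the primary key):
CREATE TABLE measurements_minute(
  id bigint(20) NOT NULL AUTO_INCREMENT,
  value float,
  sensor_id mediumint(8),
  created timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id, created),
  KEY idx_sensor_created (sensor_id, created)
) ENGINE=TokuDB
PARTITION BY RANGE (UNIX_TIMESTAMP(created)) (
  PARTITION p2015_07 VALUES LESS THAN (UNIX_TIMESTAMP('2015-08-01')),
  PARTITION p2015_08 VALUES LESS THAN (UNIX_TIMESTAMP('2015-09-01')),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);
-- Dropping a month of old minute data then becomes a cheap metadata operation:
ALTER TABLE measurements_minute DROP PARTITION p2015_07;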
I had to solve that type of problem before, with nearly a million rows per hour.
Some tips:
Engine: MyISAM. You don't need updates or transaction management on those tables; you are going to insert the values, select them, and eventually delete them.
Be careful with the indexes. In my case insertion was critical, and sometimes the MySQL queue was full of pending inserts. An insert takes longer the more indexes the table has. Which indexes you need depends on your calculated values and on when you compute them.
Shard your buffer tables. I only triggered the calculations when a table was ready: while I was calculating the values in the buffer_a table, the insertions were going into buffer_b. In my case I calculated the values every day, so I switched the destination table every day. In fact, I dumped all the data and exported it to another database to compute the averages and do the other processing without disturbing the inserts. One way to do the daily switch is sketched below.
I hope you find this helpful.
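One way to implement the daily switch without touching the application is an atomic name swap (a sketch, reusing the buffer_a/buffer_b names from above; RENAME TABLE swaps all names in a single atomic step, so writers never see a missing table):
RENAME TABLE buffer_a TO buffer_tmp,
             buffer_b TO buffer_a,
             buffer_tmp TO buffer_b;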
I have a MySQL MYISAM table (say tbl) consisting of 2 unsigned int fields, say, f1 and f2. There is an index on f2 and the table is very large (approximately 320,000,000+ rows). I update this table periodically (with approximately 100,000 new rows a week), and, in order to be able to search this table without doing an ORDER BY (which would be very time consuming in real-time queries), I physically ORDER the table according to the way in which I want to retrieve its rows.
So, I perform an ALTER TABLE tbl ORDER BY f1 DESC. (I know I have enough physical space on the server for a copy of the table.) I have read that during this operation a temporary table is created and that SELECT statements against the current rows are not affected.
However, I have found that this is not the case: SELECT statements on the table that run at the same time as the ALTER TABLE are blocked and do not complete. After the ALTER TABLE tbl completes (about 40 minutes on the production server), the SELECT statements on tbl start executing fine again.
Is there any reason why the "ALTER table tbl ORDER BY f1 DESC" seems to be blocking other clients from querying tbl?
Altering a table will always grab a lock on the table, preventing SELECTs from running.
I'll admit that I didn't even know you could do that with an ALTER TABLE.
What are you trying to get from the table? For example, all records in a given range? 320 million rows is not a trivial number. I'll give you my gut reactions:
Switch to InnoDB (allows #2, also gives transactions, but without #2 may hurt performance)
Partition the table (makes it act like a number of slightly smaller tables)
Consider a redesign, such as having a "working set" table and a "historical" table, i.e. manual partitioning (see the sketch at the end of this answer). If you usually look for recently inserted data, this (along with partitioning) will help a lot. If your lookups are evenly distributed, it probably won't make a difference.
Consider adding a new column you could use in conjunction to narrow down selects (so instead of searching on date, search on date and customer ID)
Since I don't know what you're storing, some of these (such as #4) may not apply.
There are some other things you could try. OPTIMIZE TABLE might help and take less time, but I doubt it; I believe it's internally implemented as a dump/reload, at least on the InnoDB side.
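A rough sketch of the "working set" / "historical" split from item #3, with hypothetical names (tbl_recent as the working set, tbl staying as the full historical table) and an arbitrary f1 cutoff:
-- Same structure as tbl; holds only the hot range that most queries touch
CREATE TABLE tbl_recent LIKE tbl;
INSERT INTO tbl_recent
SELECT * FROM tbl WHERE f1 >= 1000000000;
-- Periodically trim rows that have aged out of the working set
DELETE FROM tbl_recent WHERE f1 < 1000000000;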