Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers. Closed 9 years ago.
Lately I have been tasked with deleting and reinserting approximately 15 million rows in a MyISAM table that has about 150 million rows, while keeping the table/DB available for inserts and reads.
To do this, I started a process that takes small chunks of data and reinserts them via INSERT ... SELECT statements into a cloned table with the same structure, sleeping between runs so as not to overload the server, skipping over the data to be deleted and inserting the replacement data.
This way, while the cloned table was being built (which took 8+ hours), new data kept coming into the source table. At the end I only had to sync the tables with the data added during those 8+ hours and rename the tables.
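In rough SQL terms, each chunk looked something like this (a sketch only; the table names, id ranges and SLEEP value are hypothetical, not the exact statements used):
CREATE TABLE big_table_clone LIKE big_table;          -- clone with the same structure and indexes
INSERT INTO big_table_clone
SELECT * FROM big_table
WHERE id BETWEEN 1 AND 100000                         -- one small chunk
  AND id NOT IN (SELECT id FROM rows_to_replace);     -- skip the rows scheduled for deletion
SELECT SLEEP(5);                                      -- pause between chunks to avoid overloading the server
-- advance the id range, repeat, then insert the replacement rows at the end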
Everything was fine except for one thing: the cardinality of the indexes on the cloned table is way off, and execution plans for queries executed against it are awful (some went from a few seconds to 30+ minutes).
I know this can be fixed by running ANALYZE TABLE on it, but that also takes a lot of time (I'm currently running one on a slave server and it has been executing for more than 10 hours now), and I can't afford to take this table offline for writes while the analysis runs. It would also stress the server's I/O and slow it down.
Can someone explain why building a MyISAM table via INSERT ... SELECT statements results in a table with such poor internal index statistics?
Also, is there a way to incrementally build the table and have the indexes in good shape at the end?
Thanks in advance.
Closed. This question does not appear to be about programming within the scope defined in the help center. It is not currently accepting answers. Closed 9 days ago.
I have a WordPress site, and when updating the main theme I saw that MySQL was consuming a high percentage of CPU. I opened phpMyAdmin and this appeared in the process list:
"Waiting for table metadata lock" and "copy to tmp table"
What should I do? My site stopped working and my server space is running out.
Only the process running "copying to tmp table" is doing any work. The others are waiting.
Many types of ALTER TABLE operations in MySQL work by making a copy of the table and filling it with an altered form of the data. In your case, ALTER TABLE wp_posts ENGINE=InnoDB converts the table to the InnoDB storage engine. If the table was already using that storage engine, it's almost a no-op, but it can serve to defragment a tablespace after you delete a lot of rows.
Because it is incrementally copying rows to a new tablespace, it takes more storage space. Once it is done, it will drop the original tablespace. So it will temporarily need to use up to double the size of that table.
There should be no reason to run that command many times. Did you do that? The one that's doing the work is in progress, but it takes some time, depending on how many rows are stored in the table and on how powerful your database server is. Be patient, and don't start the ALTER TABLE again in another tab.
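If you did start the ALTER TABLE in several tabs, one way to inspect (and, if you decide to, cancel) the waiting duplicates is something like this (a sketch; the thread id is hypothetical):
SHOW FULL PROCESSLIST;   -- the duplicate threads show "Waiting for table metadata lock"
KILL 1234;               -- cancel a waiting duplicate by its Id; leave the thread in "copy to tmp table" running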
Closed. This question needs to be more focused. It is not currently accepting answers. Closed 6 years ago.
I'm working on a database that has one table with 21 million records. Data is loaded once when the database is created and there are no further insert, update or delete operations. A web application accesses the database to run select statements.
It currently takes 25 seconds per request for the server to return a response. However, if multiple clients make simultaneous requests, the response time increases significantly. Is there a way of speeding this process up?
I'm using MyISAM instead of InnoDB, with fixed max rows, and have indexed the searched field.
If no data is being updated/inserted/deleted, then this might be a case where you want to tell the database not to lock the table while you are reading it.
For MySQL this seems to be something along the lines of:
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED ;
SELECT * FROM TABLE_NAME ;
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ ;
(ref: http://itecsoftware.com/with-nolock-table-hint-equivalent-for-mysql)
More reading in the docs, if it helps:
https://dev.mysql.com/doc/refman/5.7/en/innodb-transaction-isolation-levels.html
The TSQL equivalent, which may help if you need to google further, is
SELECT * FROM TABLE WITH (nolock)
This may improve performance. As noted in other comments, some good indexing may help, and maybe breaking the table out further (if possible) to spread things around, so you aren't accessing all the data if you don't need it.
As a note: locking a table prevents other people from changing data while you are using it. Not locking a table that has a lot of inserts/deletes/updates may cause your selects to return multiple rows of the same data (as it gets moved around on the hard drive), rows with missing columns, and so forth.
Since you've got one table that you are selecting against, your requests are all taking turns locking and unlocking it. If you aren't doing updates, inserts or deletes, then your data shouldn't change, so you should be OK to forgo the locks.
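On the indexing point above, if the searched field still isn't well covered, a composite index over the filtered and sorted columns is one option (a sketch; the table and column names are hypothetical):
CREATE INDEX idx_search ON big_table (search_col);
CREATE INDEX idx_search_sort ON big_table (search_col, sort_col);   -- covers both WHERE and ORDER BY, avoiding an extra sort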
Closed. This question needs to be more focused. It is not currently accepting answers. Closed 6 years ago.
So say you have a single SQL table in some relational DB (SQL Server, MySQL, whatever). Say you have 500 tasks that run every 15 minutes. Each task deletes a portion of data related to that task and inserts new data related to that task, and then an external source runs some selects for data related to that task.
From my experience this inevitably leads to deadlocks, timeouts and all-around sub-par performance for selects, even when doing dirty reads.
You can try to stagger the start times of the tasks as best you can, but it doesn't really solve the issue. There are so many tasks that there will always be overlap.
You can try upgrading the server with a better CPU to handle the connections, but this is very costly for just 500 tasks, IMO.
So what I've done is duplicate the table's schema and give every task its own distinct table with that schema.
Also, when constructing new data for the table, all it does is insert into a new table and then flip the name of the current table with that one, i.e.
CREATE TABLE Task001_Table_Inactive LIKE Task001_Table_Active;
-- bulk insert the fresh data into Task001_Table_Inactive
DROP TABLE Task001_Table_Active;
RENAME TABLE Task001_Table_Inactive TO Task001_Table_Active;   -- MySQL syntax; SQL Server would use sp_rename
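On MySQL the last two steps can also be collapsed into a single atomic swap, so there is never a moment when the active table doesn't exist (a sketch following the question's naming):
RENAME TABLE Task001_Table_Active TO Task001_Table_Old,
             Task001_Table_Inactive TO Task001_Table_Active;
DROP TABLE Task001_Table_Old;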
The advantages:
-Fast processing. SQL has NEVER been good at constant deletes. The ability to just bulk insert and flip the name has drastically reduced the processing time for the task.
-Scalable. Now that all these inserts and deletes aren't constantly fighting over one table, I can run many tasks on one very cheap machine in the cloud.
-Anti-fragmentation. Since the table is recreated every time, the fragmentation issues that plague systems with constant deletes are no longer an issue.
-Selects are as quick as possible WITHOUT the need for dirty reads. Since the insertion is done in a separate table, the select statements issued by the external source will be as quick as they can be without needing a dirty read. This is the biggest advantage, IMO!
-Fast migrations. Eventually I'll have too many tasks and run out of processing power, even with this setup. So if I need to migrate a task to a different server, it's a simple matter of copying two tables rather than a hacky and extremely slow select statement on a blob table...
-Indexability. When a table gets too big (300 million rows+) you can't index it. No matter what, it will just chug for a few days and give up because of a transaction buffer limit. This is just how SQL is. By segmenting the huge blob table into smaller tables you can index successfully. Combine this with parallelization and you can index all the data FASTER than if you were indexing one big table.
Disadvantages:
-Makes navigation of tables in a GUI difficult
-Makes select / schema-alter statements slightly tricky, because now they have to loop or cursor over every table matching %Table% and apply the SQL command to each.
So why do many SQL enthusiasts loathe schema duplication if it has SO many advantages? Is there a better way to handle this?
There is not enough information to advise, but the following should be considered:
Snapshot isolation
https://msdn.microsoft.com/en-us/library/ms173763.aspx
Except when a database is being recovered, SNAPSHOT transactions do not request locks when reading data. SNAPSHOT transactions reading data do not block other transactions from writing data. Transactions writing data do not block SNAPSHOT transactions from reading data.
Use transactional replication / log shipping / mirroring / Change Data Capture etc. to offload the main table
Instead of deletion, soft deletion (updating an IsDeleted flag or maintaining an xxxx_deleted table) is an option.
If the system (DB, hardware, network architecture) is properly designed, there are no issues even when there are thousands of DML requests per second.
In Microsoft SQL Server, if your tasks never touch rows that do not belong to them, you can try to use either optimistic or pessimistic versioning - Snapshot Isolation in SQL Server.
Depending on your situation, either Read Committed Snapshot Isolation (RCSI) or Snapshot isolation level will do the trick. If both seem to suit you, I would recommend the former since it results in much less performance overhead.
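In SQL Server both are database-level settings; a minimal sketch (the database name is a placeholder):
ALTER DATABASE YourTaskDb SET READ_COMMITTED_SNAPSHOT ON;   -- RCSI: readers see the last committed row version, writers don't block readers
ALTER DATABASE YourTaskDb SET ALLOW_SNAPSHOT_ISOLATION ON;  -- lets sessions opt into SNAPSHOT isolation explicitly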
Closed. This question is off-topic. It is not currently accepting answers. Closed 9 years ago.
I'm debugging a problem with slow queries on a MySQL server. Queries normally complete in 100-400 milliseconds but sometimes rocket to tens or hundreds of seconds.
The queries are generated by an application over which I have no control, and there are multiple databases (one for each customer). The slow queries seem to appear randomly, and neither RAM, disk nor CPU is loaded when the slow queries are logged. When I run the queries manually, they run fine (as in milliseconds), which makes me suspect locking issues in combination with other read and write queries. The queries themselves are horrible (unable to use an index in either the WHERE or ORDER BY clause), but the largest tables are relatively small (up to 200,000 rows) and there are almost no JOINs. When I profile the queries, most time is spent sorting the result (in the case where the query runs fine).
I'm unable to reproduce the extreme slowness in a test environment, and my best idea right now is to stop the production MySQL server, create a copy of the databases, enable full query logging and start the server again. This way I should be able to replay the load and reproduce the problem. But the general query log seems to record only the query, not the target database for the query. Do I have any other record/replay options for MySQL?
You can use the slow query log: http://dev.mysql.com/doc/refman/5.1/en/slow-query-log.html
Just set the threshold to a very small value (hopefully you're running MySQL > 5.1).
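For example, something along these lines should log essentially every statement (a sketch; the path and threshold are only illustrative):
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 0;                              -- log everything, not just slow queries
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';  -- hypothetical path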
Otherwise you can use tcpdump:
http://www.mysqlperformanceblog.com/2008/11/07/poor-mans-query-logging/
and of course if you use that, you may want to look at the percona toolkit's pt-query-digest to process the tcpdump output: http://www.percona.com/doc/percona-toolkit/2.1/pt-query-digest.html
For future reference, you may want to set up query and server monitoring:
https://github.com/box/Anemometer/wiki
and
https://github.com/box/RainGauge/wiki/What-is-Rain-Gauge%3F
I finally nailed the problem. The application is doing something like this:
cursor = conn.execute("SELECT * FROM `LargeTable`")
while cursor.has_more_rows():
    cursor.fetchrow()
    do_something_that_takes_a_while()
cursor.close()
It's fetching and processing the result set, 1 row at a time. If the loop takes 100 seconds to complete, then the table is locked on the server for 100 seconds.
Changing this setting on the MySQL server:
set global SQL_BUFFER_RESULT=ON;
made the slow queries disappear instantly, because result sets are now pushed to a temp table so the table lock can be removed, regardless of how slowly the application consumes the result set. The setting brings in a host of other performance problems, but fortunately the server is dimensioned to handle these problems.
Percona is working on a new tool called Playback which does exactly what you want:
http://www.mysqlperformanceblog.com/2013/04/09/percona-playback-0-6-for-mysql-now-available/
Closed. This question is opinion-based. It is not currently accepting answers. Closed 3 years ago.
We deploy an (AJAX-based) instant messenger which is serviced by a Comet server. We have a requirement to store sent messages in a DB for long-term archival purposes in order to meet legal retention requirements.
Which DB engine provides the best performance in this write-once, read-never (with rare exceptions) scenario?
We need at least 5000 inserts/sec. I am assuming neither MySQL nor PostgreSQL can meet these requirements.
Any proposals for a higher performance solution? HamsterDB, SQLite, MongoDB ...?
Please ignore the earlier benchmark; it had a bug.
We insert 1M records with the following columns: id (int), status (int), message (140 chars, random).
All tests were done with the C++ driver on a desktop PC: an i5 with a 500 GB SATA disk.
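For reference, the benchmark table would have looked roughly like this (a sketch reconstructed from the column list above, not necessarily the exact DDL used):
CREATE TABLE messages (
    id      INT,
    status  INT,
    message CHAR(140)
);
CREATE INDEX idx_id ON messages (id);   -- only for the "with Index on Id" runs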
Benchmark with MongoDB:
1M Records Insert without Index
time: 23s, insert/s: 43478
1M Records Insert with Index on Id
time: 50s, insert/s: 20000
next we add 1M records to the same table, which already holds the index and 1M records
time: 78s, insert/s: 12820
all of that results in roughly 4 GB of files on the filesystem.
Benchmark with MySQL:
1M Records Insert without Index
time: 49s, insert/s: 20408
1M Records Insert with Index
time: 56s, insert/s: 17857
next we add 1M records to the same table, which already holds the index and 1M records
time: 56s, insert/s: 17857
exactly the same performance; no loss on MySQL as the table grows
We saw Mongo eat around 384 MB of RAM during this test and load 3 CPU cores; MySQL was happy with 14 MB and loaded only 1 core.
Edorian was on the right track with his proposal; I will do some more benchmarking, and I'm sure we can reach 50K inserts/sec on a 2x quad-core server.
I think MySQL will be the right way to go.
If you are never going to query the data, then I wouldn't store it in a database at all; you will never beat the performance of just writing it to a flat file.
What you might want to consider are the scaling issues: what happens when it's too slow to write the data to a flat file, will you invest in faster disks, or something else?
Another thing to consider is how to scale the service so that you can add more servers without having to coordinate the logs of each server and consolidate them manually.
edit: You wrote that you want to have it in a database, so I would also consider the security issues of having the data online. What happens when your service gets compromised? Do you want your attackers to be able to alter the history of what has been said?
It might be smarter to store it temporarily in a file, and then dump it to an off-site place that's not accessible if your Internet-facing servers get hacked.
If you don't need to do queries, then a database is not what you need. Use a log file.
It's only stored for legal reasons.
And what about the detailed requirements? You mention the NoSQL solutions, but these can't promise the data is really stored on disk. In PostgreSQL everything is transaction-safe, so you're 100% sure the data is on disk and available. (Just don't turn off fsync.)
Speed has a lot to do with your hardware, your configuration and your application. PostgreSQL can insert thousands of records per second on good hardware with a correct configuration; it can be painfully slow on the same hardware with a plain stupid configuration and/or the wrong approach in your application. A single INSERT is slow, many INSERTs in a single transaction are much faster, prepared statements are even faster, and COPY does magic when you need speed. It's up to you.
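In PostgreSQL terms the difference looks roughly like this (a sketch; the messages table and the file path are hypothetical):
-- one row per statement, autocommitted: slow
INSERT INTO messages (id, status, message) VALUES (1, 0, 'hi');
-- many rows in one transaction: much faster
BEGIN;
INSERT INTO messages (id, status, message) VALUES
    (2, 0, 'hi'), (3, 0, 'hi'), (4, 0, 'hi');
COMMIT;
-- bulk load with COPY: fastest
COPY messages (id, status, message) FROM '/tmp/messages.csv' WITH (FORMAT csv);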
I don't know why you would rule out MySQL. It can handle high inserts per second. If you really want high inserts, use the BLACKHOLE storage engine with replication. It's essentially writing to a log file that eventually gets replicated to a regular database table. You could even query the slave without affecting insert speeds.
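A minimal sketch of that setup (table and column names are hypothetical): on the master the table uses the BLACKHOLE engine, so rows are discarded locally but still written to the binary log and replicated to a slave that stores them in a real engine.
CREATE TABLE messages (
    id      INT NOT NULL,
    status  INT NOT NULL,
    message VARCHAR(140)
) ENGINE=BLACKHOLE;   -- on the master: accepts writes, stores nothing locally
-- on the slave, the same table is created with ENGINE=InnoDB (or MyISAM) and receives the replicated rows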
Firebird can easily handle 5000 inserts/sec if the table doesn't have indices.
Depending on your system setup, MySQL can easily handle over 50,000 inserts per second.
In tests on a current system I am working on, we got to over 200k inserts per second with 100 concurrent connections on 10 tables (just some values).
I'm not saying this is the best choice, since other systems like Couch could make replication/backups/scaling easier, but dismissing MySQL solely on the assumption that it can't handle such minor amounts of data is a little too harsh.
I guess there are better (read: cheaper, easier to administer) solutions out there.
Use Event Store (https://eventstore.org); you can read (https://eventstore.org/docs/getting-started/which-api-sdk/index.html) that when using the TCP client you can achieve 15,000-20,000 writes per second. If you ever need to do anything with the data, you can use projections or do transformations based on streams to populate any other datastore you wish.
You can even create a cluster.
If money plays no role, you can use TimesTen.
http://www.oracle.com/timesten/index.html
A complete in memory database, with amazing speed.
I would use a log file for this, but if you must use a database, I highly recommend Firebird. I just tested the speed: it inserts about 10k records per second on quite average hardware (a 3-year-old desktop computer). The table has one compound index, so I guess it would work even faster without it:
milanb@kiklop:~$ fbexport -i -d test -f test.fbx -v table1 -p **
Connecting to: 'LOCALHOST'...Connected.
Creating and starting transaction...Done.
Create statement...Done.
Doing verbatim import of table: TABLE1
Importing data...
SQL: INSERT INTO TABLE1 (AKCIJA,DATUM,KORISNIK,PK,TABELA) VALUES (?,?,?,?,?)
Prepare statement...Done.
Checkpoint at: 1000 lines.
Checkpoint at: 2000 lines.
Checkpoint at: 3000 lines.
...etc.
Checkpoint at: 20000 lines.
Checkpoint at: 21000 lines.
Checkpoint at: 22000 lines.
Start : Thu Aug 19 10:43:12 2010
End : Thu Aug 19 10:43:14 2010
Elapsed : 2 seconds.
22264 rows imported from test.fbx.
Firebird is open source, and completely free even for commercial projects.
I believe the answer will also depend on the hard disk type (SSD or not) and the size of the data you insert. I was inserting single-field data into MongoDB on a dual-core Ubuntu machine and was hitting over 100 records per second. I introduced some quite large data to a field and it dropped to about 9 per second, with the CPU running at about 175%! The box doesn't have an SSD, so I wonder if I'd have gotten better results with one.
I also ran MySQL, and it was taking 50 seconds just to insert 50 records into a table with 20m records (with about 4 decent indexes too), so with MySQL as well it will depend on how many indexes you have in place.