Can't remove duplicates from a ClickHouse table

We have a sharded, replicated table on the ReplicatedMergeTree engine. One of the three shards now holds 484 million rows (about 21 GB) in this table.
The engine's deduplication does not work on that many rows, and optimization also hangs.
Has anyone run into something similar and found a way to solve it?

Related

MyISAM table-level locking / deadlock

I have a number of tables; one of them is InnoDB and the others are MyISAM.
My problem is that my website is not responding. When I checked the process list in MySQL, I found the following:
On the table that holds the list of all questions, one process is in the "Sending data" state, some other processes are waiting to update a column of that table, and others are waiting to insert data into the same table.
This table contains the list of questions. Its type is MyISAM and its size is about 5 GB.
How can I resolve this?
There are a couple of solutions for this; some may apply, some may not.
Add indexes to some columns
Change the table engine to InnoDB
Take a look at your query and improve its performance
When you add indexes to your table, lookups will be much faster and therefore the table lock won't be held as long.
When you change the engine to InnoDB, the lock will only affect the row(s) the query is using, so other rows are still available to other queries.
In the query itself, a lot of performance can usually be gained by removing unnecessary joins or ORDER BY clauses. Maybe you can use temporary tables instead of multiple subselects, etc.

Performance of MySQL LOAD DATA INFILE over small file

I'm loading a small data file of around 1K rows into a MyISAM table
{
id INT(8),
text TEXT(or VARCHAR(1000))
}
LOAD DATA INFILE takes around 2 seconds. I've seen MySQL load more than 10K rows per second on average when loading large files, and I roughly know there are fixed costs such as opening and closing tables. Can someone help me understand what exactly happens in these 2 seconds, and whether it can be brought to under a second? My program runs in a time-critical environment. Thanks.
Somebody asked a similar question here
http://forums.mysql.com/read.php?144,558753,558753.
Looks like it has not been well answered yet.
Scenario Description
The whole MySQL setup is for some academic projects and holds around 300 GB of databases for various projects. Most of these databases, if not all, use the MyISAM engine. They contain imported dumps and intermediate tables processed in experiments. There are delete and update operations on these tables, but right now they are all idle. I have a project that generates result tuples, which are inserted into a table in one of the databases. The table starts out empty. The schema is very simple and contains only the two columns I pasted. Now, if I set ENGINE=MyISAM, it always takes 2s to insert 1 to 1K rows; however, if I switch to ENGINE=INNODB, it takes 0.01s. I installed a fresh MySQL on another machine, created the table with ENGINE=MyISAM, and inserted the same number of rows: it takes only 0.01s.
At 1k rows, you may find that multi-inserts are faster. Some benchmarking should help. This should be helpful as well:
http://dev.mysql.com/doc/refman/5.5/en/optimizing-innodb.html
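For what it's worth, here is a minimal sketch of the multi-insert idea in Python. It only builds one `INSERT ... VALUES (...), (...)` statement plus a flat parameter list. The table name `results` and the helper `build_multi_insert` are made up for illustration, and the `%s` placeholder style is an assumption (it is what drivers such as PyMySQL use). Whether this actually beats LOAD DATA INFILE at 1K rows is something you would have to benchmark.

```python
from typing import List, Sequence, Tuple


def build_multi_insert(
    table: str,
    columns: Sequence[str],
    rows: Sequence[Sequence[object]],
) -> Tuple[str, List[object]]:
    """Build a single multi-row INSERT statement and its flat parameter list.

    `table` and `columns` are hypothetical names supplied by the caller;
    they are interpolated as-is, so they must come from trusted code.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("every row needs one value per column")

    # One "(%s, %s, ...)" group per row, joined with commas.
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table,
        ", ".join(columns),
        ", ".join([row_placeholder] * len(rows)),
    )
    # Flatten the row values in the same order as the placeholders.
    params = [value for row in rows for value in row]
    return sql, params
```

You would pass the returned pair to your driver's `cursor.execute(sql, params)`, so the 1K rows travel in one statement instead of 1K round trips.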

Update in MySQL InnoDB with a million records

MySQL InnoDB update issue:
Once I receive a response (status) for a record, I need to write that response to a very large table (approximately 1 million records, and growing). This will happen perhaps 100 times per second. Will there be any performance issue? Or is there any setting I can change to avoid table locking or slow queries?
Thanks.
It sounds like a design issue.
Instead of storing the flag (which the status-record update changes) on a million data records, you should store a reference in the data records that points to the status record. Then, when you update the status record, no further database operation is required. When you scan through the data records, JOIN to the status records if you need to display them. If status records change often, this is better than updating millions of data records.
Maybe I'm wrong; you should describe the database (structure, table record counts) to get more accurate answers.
If you store your table using the MyISAM storage engine, then your table will lock with every update.
However, the InnoDB storage engine is capable of locking individual rows.
If you need to UPDATE multiple records simultaneously, InnoDB may be better.
Any indexes you have on the database (especially clustered indexes) will slow your writes down.
Indexes speed up reading, but they slow down writing. Most databases get read more than written to, but it sounds like yours gets written to much more.

MySql - transfer a lot of records from one table to the other

I have a large table (~50M records) and I want to copy the records from this table to a different table that has the same structure (the new table has one extra index).
I'm using INSERT IGNORE INTO... to copy the records.
What's the fastest way to do this? Is it by passing small chunks (let's say 1M records each) or bigger chunks?
Is there any way I could speed up the process?
Before performing the insert, disable indexes (DISABLE KEYS) on the destination table, if you can:
Reference can be found: Here
Also, if you are not using transactions or foreign-key relations, consider switching to the MyISAM engine.
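To make the chunking and DISABLE KEYS suggestions concrete, here is a rough Python sketch that only generates the SQL statements for a chunked copy. It assumes both tables have an integer primary key column called `id` (not stated in the question), and the table names and helper functions are hypothetical. Note that `ALTER TABLE ... DISABLE KEYS` only affects non-unique indexes on MyISAM tables. The chunk size is something to benchmark, not a recommendation.

```python
from typing import Iterator, List, Tuple


def id_ranges(min_id: int, max_id: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (low, high) id ranges that cover min_id..max_id."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    start = min_id
    while start <= max_id:
        end = min(start + chunk_size - 1, max_id)
        yield start, end
        start = end + 1


def transfer_statements(src: str, dst: str, min_id: int, max_id: int,
                        chunk_size: int) -> List[str]:
    """Return the statements for a chunked INSERT IGNORE copy from src to dst.

    Assumes an integer column named `id` in both tables (hypothetical).
    """
    statements = ["ALTER TABLE {} DISABLE KEYS".format(dst)]
    for low, high in id_ranges(min_id, max_id, chunk_size):
        statements.append(
            "INSERT IGNORE INTO {} SELECT * FROM {} WHERE id BETWEEN {} AND {}".format(
                dst, src, low, high
            )
        )
    # Rebuild the non-unique indexes once, after all rows are in place.
    statements.append("ALTER TABLE {} ENABLE KEYS".format(dst))
    return statements
```

Running the statements one by one lets you try a few chunk sizes and time them, and an interrupted copy can be resumed from the last completed range.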

Generating a massive 150M-row MySQL table

I have a C program that mines a huge data source (20GB of raw text) and generates loads of INSERTs to execute on simple blank table (4 integer columns with 1 primary key). Setup as a MEMORY table, the entire task completes in 8 hours. After finishing, about 150 million rows exist in the table. Eight hours is a completely-decent number for me. This is a one-time deal.
The problem comes when trying to convert the MEMORY table back into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be killed when I restart the computer.
ALTER TABLE memtable ENGINE = MyISAM
I've let this ALTER TABLE query run for over two days now, and it's not done. I've now killed it.
If I create the table initially as MyISAM, the write speed seems terribly poor (especially due to the fact that the query requires the use of the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys. The table would become over 1000 times larger if I were to and then I'd have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.
One of the key constraints to realize: The INSERT query UPDATEs records if the primary key (a hash) exists in the table already.
At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will tank even more.
I have 16GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?
Clarification: There are many, many UPDATEs going on from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was for speeding-up all the index lookups and table-changes that occur for every INSERT.
If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.
You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're deciding that when your current method is taking days). You may be able to turn off or remove the uniqueness constraints, use a DELETE query later to re-establish uniqueness, and then re-enable or re-add the constraints. I have used this technique when importing into an InnoDB table in the past, and found that even with the later delete it was much faster overall.
Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster than the inserts, but I can't find a reference at present) or use it directly via the CSV storage engine, depending on your needs.
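A sketch of the CSV route, in Python rather than C to keep it short. Since the INSERTs in the question do `val=val+whatever`, this version sums the values per key in memory first and then writes one CSV line per key, which is my own addition on top of the CSV suggestion. It simplifies the table to a (key, value) pair, which is an assumption (the real table has 4 integer columns), and the function names are made up. Whether 150M keys fit in memory this way depends on your data; processing in batches may be needed.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, TextIO, Tuple


def aggregate(records: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Sum the values per key, mimicking ON DUPLICATE KEY UPDATE val=val+x."""
    totals: Dict[int, int] = defaultdict(int)
    for key, delta in records:
        totals[key] += delta
    return dict(totals)


def write_csv(totals: Dict[int, int], out: TextIO) -> int:
    """Write one "key,value" line per key, in key order; return the row count."""
    writer = csv.writer(out, lineterminator="\n")
    for key in sorted(totals):
        writer.writerow([key, totals[key]])
    return len(totals)


def to_csv_text(totals: Dict[int, int]) -> str:
    """Convenience wrapper that returns the CSV as a string."""
    buffer = io.StringIO(newline="")
    write_csv(totals, buffer)
    return buffer.getvalue()
```

The resulting file has no duplicate keys, so it could be loaded with LOAD DATA INFILE using `FIELDS TERMINATED BY ','` and `LINES TERMINATED BY '\n'` without needing any update logic on the MySQL side.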
Sorry to keep throwing comments at you (last one, probably).
I just found this article, which gives an example of converting a large table from MyISAM to InnoDB. While this isn't what you are doing, the author uses an intermediate MEMORY table and describes an efficient way of going from memory to InnoDB: ordering the table in memory the way InnoDB expects it to be ordered in the end. If you aren't tied to MyISAM, it might be worth a look, since you already have a "correct" memory table built.
I don't use MySQL but use SQL Server, and this is the process I use to handle a file of similar size. First I dump the file into a staging table that has no constraints. Then I identify and delete the dups from the staging table. Then I search for existing records that might match and put the id field into a column in the staging table. Then I update where the id field column is not null and insert where it is null. One of the reasons I do all the work of getting rid of the dups in the staging table is that it means less impact on the prod table when I run it, and thus it is faster in the end. My whole process runs in less than an hour (and actually does much more than I describe, as I also have to denormalize and clean the data) and affects production tables for less than 15 minutes of that time. I don't have to worry about adjusting any constraints or dropping indexes or any of that, since I do most of my processing before I hit the prod table.
Consider whether a similar process might work better for you. Also, could you use some sort of bulk import to get the raw data into the staging table (I pull the 22 gig file I have into staging in around 16 minutes) instead of working row by row?
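The staging-table steps described above can be sketched roughly as follows. This uses Python with an in-memory SQLite database purely as a stand-in so the example is runnable; it is not SQL Server or MySQL syntax, and the table names (`prod`, `staging`), columns (`k`, `val`, `prod_id`) and the rule of keeping the first-loaded duplicate are all assumptions made for illustration.

```python
import sqlite3
from typing import Iterable, Tuple


def create_prod_table(conn: sqlite3.Connection) -> None:
    """Create the hypothetical production table with a unique key column."""
    conn.execute(
        "CREATE TABLE prod (id INTEGER PRIMARY KEY, k INTEGER UNIQUE, val INTEGER)"
    )


def merge_via_staging(conn: sqlite3.Connection,
                      rows: Iterable[Tuple[int, int]]) -> None:
    """Load rows into a constraint-free staging table, then merge into prod."""
    cur = conn.cursor()
    # 1. Dump everything into a staging table that has no constraints.
    cur.execute("CREATE TEMP TABLE staging (k INTEGER, val INTEGER, prod_id INTEGER)")
    cur.executemany("INSERT INTO staging (k, val) VALUES (?, ?)", rows)
    # 2. Delete duplicates in staging, keeping the first-loaded row per key.
    cur.execute(
        "DELETE FROM staging WHERE rowid NOT IN "
        "(SELECT MIN(rowid) FROM staging GROUP BY k)"
    )
    # 3. Record the id of any matching production row in the staging table.
    cur.execute(
        "UPDATE staging SET prod_id = (SELECT id FROM prod WHERE prod.k = staging.k)"
    )
    # 4. Update production rows that already exist (prod_id is not null).
    cur.execute(
        "UPDATE prod SET val = "
        "(SELECT val FROM staging WHERE staging.prod_id = prod.id) "
        "WHERE id IN (SELECT prod_id FROM staging WHERE prod_id IS NOT NULL)"
    )
    # 5. Insert the rows that had no match (prod_id is null).
    cur.execute(
        "INSERT INTO prod (k, val) SELECT k, val FROM staging WHERE prod_id IS NULL"
    )
    cur.execute("DROP TABLE staging")
    conn.commit()
```

Only steps 4 and 5 touch the production table, which is the point of the approach: the deduplication and matching work happens on the staging side.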