Copying large amount of data in ClickHouse except one column - duplicates

I am trying to duplicate data from one table to another table with slight modification.
Essentially taking 1x volume of data and converting into 100x volume of data into another table.
Is there any fastest way to do that same? The schema is identical. I would like to alter one column value while duplicating the data.
Something in this line
INSERT INTO 100x_table (test_id, ......) FROM (SELECT 1, * FROM 10x table EXCEPT test_id)
INSERT INTO 100x_table (test_id, ......) FROM (SELECT 2, * FROM 10x table EXCEPT test_id)
INSERT INTO 100x_table (test_id, ......) FROM (SELECT 3, * FROM 10x table EXCEPT test_id)
and so on
Is there better way of duplicating the table data by changing only one specific column?

You can use the EXCEPT clause
e.g.
SELECT * EXCEPT county
FROM uk_price_paid
LIMIT 10

Related

MySQL updates getting very slow towards end of the table

I have a table "data" which holds around 100,000,000 records.
I have added a new column to it "batch_id" (Integer).
On the application layer, I'm updating the batch_id in batches of 10,000 records for each of the 100,000,000 records (the batch_id is always the same for 10k).
I'm doing something like this (application layer pseudo code):
loop {
$batch_id = $batch_id + 1;
mysql.query("UPDATE data SET batch_id='$batch_id' WHERE batch_id IS NULL LIMIT 10000");
}
I have an index on the batch_id column.
In the beginning, this update statement took ~30 seconds. I'm now halfway through the Table and it's getting slower and slower. At the moment the same statement takes around 10 minutes(!). It reached a point where this is no longer feasible as it would take over a month to update the whole table at the current speed.
What could I do to speed it up, and why is MySQL Getting slower towards the end of the table?
Could an index on the primary key help?
Is the primary key automatically indexed in MySQL? The answer is Yes
So instead one index for batch_id will help.
The problem is without index the engine do a full table scan. At first is easy find 10k with null values, but when more and more records are updated the engine have to scan much more to find those nulls.
But should be easier create batch_id as an autonumeric column
OTHER OPTION: Create a new table and then add the index and replace old table.
CREATE newTable as
SELECT IF(#newID := #newID + 1,
#newID DIV 10000,
#newID DIV 10000) as batch_id,
<other fields>
FROM YourTable
CROSS JOIN (SELECT #newID :=0 ) as v
Insert auto increment primary key to existing table
Do you have a monotonically increasing id in the table? And all rows for a "batch" have 'consecutive' ids? Then don't add batch_id to the table, instead, create another table Batches with one row per batch: (batch_id (PK), id_start, id_end, start_time, end_time, etc).
If you stick to exact chunks of 10K, then don't even materialize batch_id. Instead, compute it from id DIV 10000 whenever you need it.
If you want to discuss this further, please provide SHOW CREATE TABLE for the existing table, and explain what you will be doing with the "batches".
To answer your question about "slow near the end": It is having to scan farther and farther in the table to find the NULLs. You would be better to walk through the table once, fiddling with each 10K chunk as you go. Do this using the PRIMARY KEY, whatever it is. (That is, even if it is not AUTO_INCREMENT.) More Details .

update sql table current row

Complete noob alert! I need to store a largish set of data fields (480) for each of many devices i am measuring. Each field is a Decimal(8,5). First, is this an unreasonably large table? I have no experience really, so if it is unmanageable, I might start thinking of an alternative storage method.
Right now, I am creating a new row using INSERT, then trying to put the 480 data values in to the new row using UPDATE (in a loop). Currently each UPDATE is overwriting the entire column. How do I specify only to modify the last row? For example, with a table ("magnitude") having columns "id", "field1", "field2",...:
sql UPDATE magnitude SET field1 = 3.14; this modifies the entire "field1" column.
Was trying to do something like:
sql UPDATE magnitude SET field1 = 3.14 WHERE id = MAX(id)
Obviously I am a complete noob. Just trying to get this one thing working and move on... Did look around a lot but can't find a solution. Any help appreciated.
Instead of inserting a row and then updating it with values, you should insert an entire row, with populated values, at once, using the insert command.
I.e.
insert into tTable (column1, column2, ..., column n) values (datum1, datum2, ..., datum n)
Your table's definition should have the ID column with property identity, which means that it will autofill it for you when you insert, i.e. you don't need to specify it.
Re: appropriateness of the schema, I think 480 is a large number of columns. However, this is a straightforward enough example that you could try it and determine empirically if your system is able to give you the performance you need.
If I were doing this myself, I would go for a different solution that has many rows instead of many columns:
Create a table tDevice (ID int, Name nvarchar)
Create a table tData (ID int, Device_ID int, Value decimal(8,5))
-- With a foreign key on Device_ID back to tDevice.ID
Then, to populate:
Insert all your devices in tDevice
Insert one row into tData for every Device / Data combination
-- i.e. 480 x n rows, n being the number of devices
Then, you can query the data you want like so:
select * from tData join tDevice on tDevice.ID = tData.Device_ID

How to insert 13 million records with selected columns of a table to another existing table?

I've two tables one is the main table having data and I want to insert data from another existing table having about 13 million records. I'm using the query to insert from another table i.e.
insert into table1 ( column1, col2 ...) select col1, col2... from table2;
But, unfortunately the query fails as lock wait timeout comes Error 1205.
What is the best way to do it in least time without timeout.
If you have a primary key on table2, then you can use that for ordering and inserting in batches:
insert into table1 ( column1, col2 ...)
select col1, col2...
from table2
order by <primary key>
limit 0, 100000
Then repeat this for additional values. (Of course, the 100,000 is arbitrary. A larger value might work. A smaller value might be necessary.)
Another possibility is to remove all indexes and insert triggers from table1, try the insert without them, and then add them back after the new data is in the table.

MySQL - how to check which items in an arbitrary list (~1,000 items) are in a table?

Here's my problem...
I need to be able to check which items in a list of about 1,000 items (the needles) are in a fairly large table containing about ~500,000 rows (the haystack).
My question is, what's the best/fastest/most efficient way to do this?
I know that I can create a SQL statement like this:
SELECT id FROM haystack WHERE id IN (ID1, ID2, ID3, ..., IDn)
(assuming ID1, ID2, ID3, ..., IDn are the the needles.)
However, I'm not sure how performant or wise that is if the needles list contains 1,000+ items.
I also know that, if my needles list was in a table of it's own, I could join that table to the haystack table. However, the needles list isn't already in a table.
So - I guess another possible option is to put those 1,000 items into a temporary table and then join that to the haystack table. If that's the best option - then what's the best way to quickly load 1,000 items into a temporary table? (E.g., 1,000 individual INSERT statements? Insert all rows in a single INSERT statment? Is there a limit on how long an INSERT statement can be?)
A third possible option - write the needles list to a text file, then use LOAD DATA INFILE to load that into a (temporary) table, then join the temp table to the haystack table. But, wow... that seems like a lot of overhead.
Is there another, better option?
For what it's worth, the context of this is PHP, and I'm getting the needles list from a JSON web-service response, and using MySQLi for the database interaction.
According to this benchmark, it is faster in your case to use a temporary table and the JOIN method.
I am not sure though that's not a premature optimisation. You should perform your own benchmark and determine if the added complexity deserves the effort. I would recommend going with the simple IN method and only start to optimise when you detect a performance issue.
Just remember that according to the manual:
The number of values in the IN list is only limited by the max_allowed_packet value.
I think your query SELECT id FROM haystack WHERE id IN (ID1, ID2, ID3, ..., IDn) would be fine. I have a very similar use case where I have millions of "needles" and I pass them to the IN clause in blocks of 10,000 via PDO with no issues.
I would add that the column you are checking should be indexed. In my case it is the primary key of the table.
If the needles are going to be used to query the haystack frequently, you absolutely want to create a new table. For this example, I'm going to assume that the needles are int values and will label them as id in the table needle.
First, you need to create the table
CREATE TABLE needle (
id INT(11) PRIMARY KEY
)
Next, you need to insert the values
INSERT INTO needle (id)
VALUES (ID1),
(ID2),
...,
(IDn)
Now, you can query haystack using a join.
SELECT h.id
FROM haystack h
JOIN needle n
ON h.id = n.id
If this is an infrequent query and the number of needles won't grow beyond the 1,000, using the IN clause won't hurt your performance greatly.

Index counter shared by multiple tables in mysql

I have two tables, each one has a primary ID column as key. I want the two tables to share one increasing key counter.
For example, when the two tables are empty, and counter = 1. When record A is about to be inserted to table 1, its ID will be 1 and the counter will be increased to 2. When record B is about to be inserted to table 2, its ID will be 2 and the counter will be increased to 3. When record C is about to be inserted to table 1 again, its ID will be 3 and so on.
I am using PHP as the outside language. Now I have two options:
Keep the counter in the database as a single-row-single-column table. But every time I add things to table A or B, I need to update this counter table.
I can keep the counter as a global variable in PHP. But then I need to initialize the counter from the maximum key of the two tables at the start of apache, which I have no idea how to do.
Any suggestion for this?
The background is, I want to display a mix of records from the two tables in either ASC or DESC order of the creation time of the records. Furthermore, the records will be displayed in page-style, say, 50 records per page. Records are only added to the database rather than being removed. Following my above implementation, I can just perform a "select ... where key between 1 and 50" from two tables and merge the select datasets together, sort the 50 records according to IDs and display them.
Is there any other idea of implementing this requirement?
Thank you very much
Well, you will gain next to nothing with this setup; if you just keep the datetime of the insert you can easily do
SELECT * FROM
(
SELECT columnA, columnB, inserttime
FROM table1
UNION ALL
SELECT columnA, columnB, inserttime
FROM table2
)
ORDER BY inserttime
LIMIT 1, 50
And it will perform decently.
Alternatively (if chasing last drop of preformance), if you are merging the results it can be an indicator to merge the tables (why have two tables anyway if you are merging the results).
Or do it as SQL subclass (then you can have one table maintain IDs and other common attributes, and the other two reference the common ID sequence as foreign key).
if you need creatin time wont it be easier to add a timestamp field to your db and sort them according to that field?
i believe using ids as a refrence of creation is bad practice.
If you really must do this, there is a way. Create a one-row, one-column table to hold the last-used row number, and set it to zero. On each of your two data tables, create an AFTER INSERT trigger to read that table, increment it, and set the newly-inserted row number to that value. I can't remember the exact syntax because I haven't created a trigger for years; see here http://dev.mysql.com/doc/refman/5.0/en/triggers.html