ClickHouse - Merge similar entries into a new one

This is my current (simplified) model in ClickHouse:
Credential table:
  user: String
  domain: String
  password: String
  leak: UInt64
The leak field is a reference to the id of a leak.
Leak table:
  id: UInt64
  name: String
  desc: String
  date: String
I'm using the MergeTree engine for Credential and the Log engine for Leak.
Sometimes the same credential is inserted twice, so I have, for example:
| user | domain | password | 0 |
| user | domain | password | 1 |
I would like to turn this into:
| user | domain | password | [0, 1] |
I've read about ReplacingMergeTree, but I could not find any way to specify rewrite/delete rules other than the version parameter.
My problem is that I have billions of entries, and it sounds crazy to process the data before insertion. Even though ClickHouse's response times are incredibly low, it seems ridiculous to check whether every new credential I'm trying to insert is already there and, if it is, update its data, right?
I'm trying to reduce disk storage as much as possible, but it looks hard to balance that with keeping these incredible response times. I'm open to ideas.

There are a number of ways to deal with this:
1. Create a materialized view alongside your actual table.
2. Use ReplacingMergeTree.
3. Mutate (ALTER ... UPDATE) the existing data.
4. Use your table as is, but query it with groupArray to get the result you want.
Here I demonstrate the fourth option. You insert your data into the MergeTree table as you always do, but when you query it, you use groupArray to roll the desired column up into an array, grouped by the other columns.
Let's say your table is something like this:
SELECT *
FROM mem
┌─a─────┬─b───────┬─c─────────┬─d─┐
│ user1 │ domain1 │ password1 │ 0 │
│ user1 │ domain1 │ password1 │ 1 │
│ user2 │ domain2 │ password2 │ 0 │
│ user2 │ domain2 │ password2 │ 2 │
└───────┴─────────┴───────────┴───┘
You use groupArray to solve your problem:
SELECT a, b, c, groupArray(d)
FROM mem
GROUP BY a, b, c
┌─a─────┬─b───────┬─c─────────┬─groupArray(d)─┐
│ user2 │ domain2 │ password2 │ [0,2] │
│ user1 │ domain1 │ password1 │ [0,1] │
└───────┴─────────┴───────────┴───────────────┘
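To make the effect of the GROUP BY a, b, c with groupArray(d) concrete, here is a minimal Python sketch of the same roll-up done in memory. The function name group_leaks and the sample rows are assumptions made for illustration; it only mirrors what the query computes and is not how ClickHouse implements it.

```python
from collections import defaultdict


def group_leaks(rows):
    """Collect the leak ids of identical (user, domain, password) rows into one list."""
    grouped = defaultdict(list)
    for user, domain, password, leak in rows:
        grouped[(user, domain, password)].append(leak)
    return dict(grouped)


# Same sample data as the "mem" table above.
rows = [
    ("user1", "domain1", "password1", 0),
    ("user1", "domain1", "password1", 1),
    ("user2", "domain2", "password2", 0),
    ("user2", "domain2", "password2", 2),
]

result = group_leaks(rows)
for credential, leaks in result.items():
    print(credential, leaks)
```

Note that the storage on disk is unchanged with this approach: the duplicates are only collapsed at query time.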

Related

MariaDB server isn't rejoining as slave after being down

I was testing a few failover cases & initially this was my setup
maxctrl list servers
┌─────────┬────────────────┬──────┬─────────────┬─────────────────┬────────────┐
│ Server │ Address │ Port │ Connections │ State │ GTID │
├─────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ server1 │ XXX.XXX.XX.XXX │ 3306 │ 0 │ Slave, Running │ 0-1-853336 │
├─────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ server2 │ XXX.XXX.XX.XXX │ 3306 │ 0 │ Master, Running │ 0-1-853336 │
├─────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ server3 │ XXX.XXX.XX.XXX │ 3306 │ 0 │ Slave, Running │ 0-1-853336 │
├─────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ server4 │ XXX.XXX.XX.XXX │ 3307 │ 0 │ Slave, Running │ 0-1-853336 │
└─────────┴────────────────┴──────┴─────────────┴─────────────────┴────────────┘
I shut down Master (server2) & Slave (server1) & started them again manually, so this became the setup -
maxctrl list servers
┌─────────┬────────────────┬──────┬─────────────┬─────────────────┬────────────┐
│ Server │ Address │ Port │ Connections │ State │ GTID │
├─────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ server1 │ XXX.XXX.XX.XXX │ 3306 │ 0 │ Running │ 0-1-853336 │
├─────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ server2 │ XXX.XXX.XX.XXX │ 3306 │ 0 │ Running │ 0-1-853336 │
├─────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ server3 │ XXX.XXX.XX.XXX │ 3306 │ 0 │ Master, Running │ 0-1-853336 │
├─────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┤
│ server4 │ XXX.XXX.XX.XXX │ 3307 │ 0 │ Slave, Running │ 0-1-853336 │
└─────────┴────────────────┴──────┴─────────────┴─────────────────┴────────────┘
Now, since auto_failover=true & auto_rejoin=true, server1 & server2 should rejoin as slaves, but they continue to show their status as Running. I even tried manually rejoining them with the command maxctrl call command mariadbmon rejoin DatabaseMonitor server1, but it shows this error -
Error: Server at 127.0.0.1:8989 responded with status code 403 to `POST maxscale/modules/mariadbmon/rejoin?DatabaseMonitor&server1`:{
"errors": [
{
"detail": "'server1' cannot replicate from master server 'server3': gtid_current_pos of 'server1' (0-1-853336) is incompatible with gtid_binlog_pos of 'server3' (0-200-3)."
}
]
I'm sure I'm missing something about GTID replication, but I can't work out what. Can anyone tell me what's happening or how to fix this? Thanks.

Make sure you have log_slave_updates enabled on all your database nodes: this is required for both failover and switchover to work as the binlog events must be available on all nodes.
This might also be related to this bug report, which describes a similar situation: if no new transactions occur between the failover from one node to another, the rejoining nodes cannot join because the gtid_binlog_pos of the new master server is not compatible with the gtid_current_pos of the old master server, exactly as the error message describes.
If you run a command that creates a binlog event (e.g. FLUSH LOGS) on the new master server, the rejoin should work after that.

How can I insert into a table related data in other tables using stored procedure and transaction?

I've got a table that has been imported into the database. This table has a series of user-friendly data as it comes from an Excel template that is exported to CSV then imported into a table.
In an oversimplified example, I've got the following:
MYDB.CSV
+----------+--------------------+------------------+-------------------------------------+
| language | item | title | description |
+----------+--------------------+------------------+-------------------------------------+
| Spanish | Grandioso Artículo | Gran Título | Este es un ejemplo en español. |
| English | Powerful Item | Power Title | This is an example in english. |
| French | Incroyable Article | Incroyable Titre | Ceci est un exemple en français. |
| English | Dull Item | Dull Title | This is another example in english. |
+----------+--------------------+------------------+-------------------------------------+
The data of these columns belong to different tables that contain more related characteristics, like these:
MYDB.LANGUAGES
════╤════════════════════
id │ code description
────┼────────────────────
1 │ eng English
2 │ spa Spanish
3 │ fra French
════╧════════════════════
MYDB.ITEMS
════╤═══════════════════════════════
id │ code description
────┼───────────────────────────────
1 │ dull-789 Dull Item
2 │ incr-456 Incroyable Article
3 │ gran-123 Grandioso Artículo
4 │ powe-951 Powerful Item
════╧═══════════════════════════════
How can I get related data to the mydb.csv columns to be able to insert it into another table using a stored procedure? I'm trying to get something like this:
MYDB.NICETABLE
id │ code description title
════╪═══════════════════════════════════════
1 │ spa gran-123 Gran Título
2 │ eng powe-951 Powerful Item
3 │ fra incr-456 Incroyable Titre
4 │ eng dull-789 Dull Title
I've tried the following code in a stored procedure. Data that has to be inserted as is, like csv.title in this example, I can insert:
DELIMITER //
DROP PROCEDURE IF EXISTS addItems //
CREATE PROCEDURE addItems ()
BEGIN
START TRANSACTION;
/* INSERT INTO the corresponding tables */
INSERT INTO mydb.nicetable (title)
SELECT `title` FROM csv;
COMMIT;
END //
DELIMITER ;
CALL addItems();
How do I compare row by row data (i.e. csv.language) and get what I need (i.e. languages.code) and insert it into another table (i.e. nicetable.code)?

The SELECT in an INSERT ... SELECT is like any other query and follows the same rules.
In your case, you can use a subselect on languages.
If you need data from other tables, you need to join them.
DELIMITER //
DROP PROCEDURE IF EXISTS addItems //
CREATE PROCEDURE addItems ()
BEGIN
START TRANSACTION;
/* INSERT INTO the corresponding tables */
INSERT INTO mydb.nicetable (code ,title)
SELECT (SELECT `code` FROM LANGUAGES LS WHERE LS.description = c.language LIMIT 1), c.`title` FROM csv c;
COMMIT;
END //
DELIMITER ;
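The join variant mentioned above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module rather than MySQL, so the schema types and the in-memory database are assumptions; the table and column names follow the question, and the INSERT ... SELECT with two joins is the part that carries over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE languages (id INTEGER PRIMARY KEY, code TEXT, description TEXT);
CREATE TABLE items     (id INTEGER PRIMARY KEY, code TEXT, description TEXT);
CREATE TABLE csv       (language TEXT, item TEXT, title TEXT, description TEXT);
CREATE TABLE nicetable (id INTEGER PRIMARY KEY, code TEXT, description TEXT, title TEXT);

INSERT INTO languages (code, description) VALUES
  ('eng', 'English'), ('spa', 'Spanish'), ('fra', 'French');
INSERT INTO items (code, description) VALUES
  ('dull-789', 'Dull Item'), ('incr-456', 'Incroyable Article'),
  ('gran-123', 'Grandioso Artículo'), ('powe-951', 'Powerful Item');
INSERT INTO csv VALUES
  ('Spanish', 'Grandioso Artículo', 'Gran Título', 'Este es un ejemplo en español.'),
  ('English', 'Powerful Item', 'Power Title', 'This is an example in english.'),
  ('French', 'Incroyable Article', 'Incroyable Titre', 'Ceci est un exemple en français.'),
  ('English', 'Dull Item', 'Dull Title', 'This is another example in english.');
""")

# "with conn" wraps the statement in a transaction: commit on success, rollback on error.
with conn:
    conn.execute("""
        INSERT INTO nicetable (code, description, title)
        SELECT l.code, i.code, c.title
        FROM csv AS c
        JOIN languages AS l ON l.description = c.language  -- language name -> language code
        JOIN items     AS i ON i.description = c.item      -- item name -> item code
    """)

rows = conn.execute(
    "SELECT code, description, title FROM nicetable ORDER BY title"
).fetchall()
for row in rows:
    print(row)
```

An inner join silently drops CSV rows whose language or item has no match; a LEFT JOIN would keep them with NULL codes, if that is preferable.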

Use table fields as stored procedure parameters (to redistribute a given table fields into other tables)

I have the structure of the following tables:
ITEMS:
╔═══════════╤══════════════╤══════╤═════╤═════════╤════════════════╗
║ FIELD │ TYPE │ NULL │ KEY │ DEFAULT │ EXTRA ║
╠═══════════╪══════════════╪══════╪═════╪═════════╪════════════════╣
║ id │ int │ NO │ PRI │ │ auto_increment ║
╟───────────┼──────────────┼──────┼─────┼─────────┼────────────────╢
║ image_url │ varchar(255) │ NO │ │ │ ║
╚═══════════╧══════════════╧══════╧═════╧═════════╧════════════════╝
ITEM_TRANSLATIONS:
╔═════════════╤══════════════╤══════╤═════╤═════════╤════════════════╗
║ FIELD │ TYPE │ NULL │ KEY │ DEFAULT │ EXTRA ║
╠═════════════╪══════════════╪══════╪═════╪═════════╪════════════════╣
║ id │ int │ NO │ PRI │ │ auto_increment ║
╟─────────────┼──────────────┼──────┼─────┼─────────┼────────────────╢
║ description │ varchar(255) │ NO │ │ │ ║
╟─────────────┼──────────────┼──────┼─────┼─────────┼────────────────╢
║ title │ varchar(45) │ NO │ │ │ ║
╚═════════════╧══════════════╧══════╧═════╧═════════╧════════════════╝
I also have a stored procedure that redistributes its parameters to the needed tables this way:
DELIMITER //
DROP PROCEDURE IF EXISTS addItem //
CREATE PROCEDURE addItem (
IN _item__image_url VARCHAR(255),
IN _item_translations__title VARCHAR(45),
IN _item_translations__description VARCHAR(255)
)
BEGIN
START TRANSACTION;
INSERT INTO items (
image_url
)
VALUES (
_item__image_url
);
INSERT INTO item_translations (
item_id,
title,
`description`
)
VALUES (
LAST_INSERT_ID(),
_item_translations__title,
_item_translations__description
);
COMMIT ;
END //
DELIMITER ;
If I call this procedure this way:
CALL addItem(
"/images/items.png",
"My Item",
"An oversimplified item just for this question."
);
I got the following as expected:
ITEMS:
╔════╤═══════════════════╗
║ ID │ IMAGE_URL ║
╠════╪═══════════════════╣
║ 19 │ /images/items.png ║
╚════╧═══════════════════╝
ITEM_TRANSLATIONS:
╔════╤═════════╤═════════╤════════════════════════════════════════════════╗
║ ID │ ITEM_ID │ TITLE │ DESCRIPTION ║
╠════╪═════════╪═════════╪════════════════════════════════════════════════╣
║ 7 │ 19 │ My Item │ An oversimplified item just for this question. ║
╚════╧═════════╧═════════╧════════════════════════════════════════════════╝
I have a third table with N rows with all the required fields:
IMPORTED_TABLE
╔════╤══════════════╤══════════════════════════════════╤════════════════════════╗
║ ID │ TITLE │ DESCRIPTION │ IMAGE_URL ║
╠════╪══════════════╪══════════════════════════════════╪════════════════════════╣
║ 42 │ Another Item │ Yet another oversimplified item. │ /images/items_2.png ║
╟────┼──────────────┼──────────────────────────────────┼────────────────────────╢
║ 43 │ This Item │ A nice item │ /images/thanks.png ║
╟────┼──────────────┼──────────────────────────────────┼────────────────────────╢
║ 44 │ Trixie Item │ The great and powerful item! │ /images/mlp/trixie.png ║
╚════╧══════════════╧══════════════════════════════════╧════════════════════════╝
How can I use this table contents as parameters of the stored procedure to be able to populate the required tables as needed?
In order to get this:
ITEMS:
╔════╤════════════════════════╗
║ ID │ IMAGE_URL ║
╠════╪════════════════════════╣
║ 19 │ /images/items.png ║
╟────┼────────────────────────╢
║ 20 │ /images/items_2.png ║
╟────┼────────────────────────╢
║ 21 │ /images/thanks.png ║
╟────┼────────────────────────╢
║ 22 │ /images/mlp/trixie.png ║
╚════╧════════════════════════╝
ITEM_TRANSLATIONS
╔════╤═════════╤══════════════╤════════════════════════════════════════════════╗
║ ID │ ITEM_ID │ TITLE │ DESCRIPTION ║
╠════╪═════════╪══════════════╪════════════════════════════════════════════════╣
║ 7 │ 19 │ My Item │ An oversimplified item just for this question. ║
╟────┼─────────┼──────────────┼────────────────────────────────────────────────╢
║ 8 │ 20 │ Another Item │ Yet another oversimplified item. ║
╟────┼─────────┼──────────────┼────────────────────────────────────────────────╢
║ 9 │ 21 │ This Item │ A nice item ║
╟────┼─────────┼──────────────┼────────────────────────────────────────────────╢
║ 10 │ 22 │ Trixie Item │ The great and powerful item! ║
╚════╧═════════╧══════════════╧════════════════════════════════════════════════╝
Obviously this is an oversimplified example. The real stored procedure processes each of its parameters differently, so I would rather reuse it than recreate that logic.

Are the ID values in your IMPORTED_TABLE supposed to be used for the ID in ITEMS?
If so, then you can do this:
START TRANSACTION;
INSERT INTO ITEMS (ID, IMAGE_URL)
SELECT ID, IMAGE_URL FROM IMPORTED_TABLE;
INSERT INTO ITEM_TRANSLATIONS (ITEM_ID, TITLE, DESCRIPTION)
SELECT ID, TITLE, DESCRIPTION FROM IMPORTED_TABLE;
COMMIT;
This will use the ID values verbatim for both ITEMS.ID and ITEM_TRANSLATIONS.ITEM_ID.
However, if you want to insert the URLs and ignore the ID values in the imported data, and let the ITEMS table generate new ID values, then you can do it in a batch and assume the batch is a set of consecutive values.
START TRANSACTION;
INSERT INTO ITEMS (IMAGE_URL)
SELECT IMAGE_URL FROM IMPORTED_TABLE;
SET @START_ID = LAST_INSERT_ID() - 1;
INSERT INTO ITEM_TRANSLATIONS (ITEM_ID, TITLE, DESCRIPTION)
SELECT (@START_ID := @START_ID+1), TITLE, DESCRIPTION FROM IMPORTED_TABLE;
COMMIT;
Is it safe to assume the values are consecutive? With innodb_autoinc_lock_mode set to 0 or 1, yes, it is safe. For example, MySQL's JDBC driver makes this assumption when you do batched inserts, so it can return the set of generated ID values.
The exception is innodb_autoinc_lock_mode=2: with that setting, the values are not guaranteed to be consecutive. It is not the default in MySQL 5.7 and earlier, but it is the default from MySQL 8.0 onward, so check the setting on your instance.
(read https://dev.mysql.com/doc/refman/8.0/en/innodb-auto-increment-handling.html for details)
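If you would rather not depend on the generated IDs being consecutive, a row-by-row alternative is to read each generated ID back right after inserting the parent row. The sketch below is only an illustration of that idea: it uses Python's built-in sqlite3 module instead of MySQL, and the schema details are assumptions based on the tables in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY AUTOINCREMENT, image_url TEXT NOT NULL);
CREATE TABLE item_translations (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  item_id INTEGER NOT NULL,
  title TEXT NOT NULL,
  description TEXT NOT NULL
);
CREATE TABLE imported_table (id INTEGER PRIMARY KEY, title TEXT, description TEXT, image_url TEXT);

INSERT INTO items (id, image_url) VALUES (19, '/images/items.png');
INSERT INTO imported_table VALUES
  (42, 'Another Item', 'Yet another oversimplified item.', '/images/items_2.png'),
  (43, 'This Item', 'A nice item', '/images/thanks.png'),
  (44, 'Trixie Item', 'The great and powerful item!', '/images/mlp/trixie.png');
""")

# One transaction for the whole import: commit on success, rollback on error.
with conn:
    imported = conn.execute(
        "SELECT title, description, image_url FROM imported_table ORDER BY id"
    ).fetchall()
    for title, description, image_url in imported:
        cur = conn.execute("INSERT INTO items (image_url) VALUES (?)", (image_url,))
        # cur.lastrowid is the id generated for the row just inserted,
        # so no assumption about consecutive values is needed.
        conn.execute(
            "INSERT INTO item_translations (item_id, title, description) VALUES (?, ?, ?)",
            (cur.lastrowid, title, description),
        )

pairs = conn.execute("""
    SELECT i.id, i.image_url, t.title
    FROM items AS i
    JOIN item_translations AS t ON t.item_id = i.id
    ORDER BY i.id
""").fetchall()
for pair in pairs:
    print(pair)
```

The trade-off is one round trip per row instead of two set-based statements, which matters for large imports. In MySQL itself, the equivalent would be a cursor loop that calls the existing procedure once per imported row.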

(Posted an answer improvement on behalf of the question author, in order to move it to the answer space).
Thanks to the accepted answer, I was able to solve this challenge. However, the auto_increment value is increased by the number of rows inserted/read (possibly because of a bug in MySQL, as stated in a comment).
So I needed to create another stored procedure to be able to get the max ID value and use it to alter all the involved tables. This stored procedure is the following:
DELIMITER //
DROP PROCEDURE IF EXISTS tableMaxID //
CREATE PROCEDURE tableMaxID(IN nameOfTable VARCHAR(20))
BEGIN
SET @qry=CONCAT('SELECT MAX(id) INTO @maxIdValue FROM ', nameOfTable);
PREPARE st FROM @qry;
EXECUTE st;
DEALLOCATE PREPARE st;
SET @maxIdValue := @maxIdValue + 1;
END //
DELIMITER ;
Then, I added this fix to the original stored procedure for each involved table:
DELIMITER //
DROP PROCEDURE IF EXISTS addItem //
CREATE PROCEDURE addItem ()
BEGIN
START TRANSACTION;
/* INSERT INTO the corresponding tables */
INSERT INTO items (image_url)
SELECT CONCAT(`image_url`, ' testing') FROM imported_table;
SET @START_ID = LAST_INSERT_ID() - 1;
INSERT INTO items_translations (item_id, title, `description`)
SELECT (@START_ID := @START_ID+1), `title`, `description` FROM imported_table;
/***********/
/* Note: ALTER TABLE causes an implicit commit in MySQL */
/* Fix items auto_increment */
CALL tableMaxID('items');
SET @qry=CONCAT('ALTER TABLE items AUTO_INCREMENT =', @maxIdValue);
PREPARE st FROM @qry;
EXECUTE st;
DEALLOCATE PREPARE st;
/* Fix items_translations auto_increment */
CALL tableMaxID('items_translations');
SET @qry=CONCAT('ALTER TABLE items_translations AUTO_INCREMENT =', @maxIdValue);
PREPARE st FROM @qry;
EXECUTE st;
DEALLOCATE PREPARE st;
COMMIT;
END //
DELIMITER ;
Now, all that's needed is to call the stored procedure and the given data will be inserted into the corresponding tables, fixing the involved tables with the correct auto_increment value:
CALL addItem();

How do I properly sort a materialized paths in SQL?

I am using Materialized Path to store a tree structure in SQL (MySQL 5.7 in my case). I am storing the paths as slash-separated slugs. All the tutorials I have read said to sort the rows by the path to extract it in the right order, but it appears not to work when parts of the path have a similar prefix.
Some example code:
CREATE TABLE categories (
id int(11),
parent_id int(11) DEFAULT NULL,
slug varchar(255),
path varchar(255)
);
INSERT INTO categories VALUES
(1, null, 'foo', '/foo'),
(2, 1, 'bar', '/foo/bar'),
(3, null, 'foo-it', '/foo-it'),
(4, 3, 'boy', '/foo-it/boy');
Now, when sorting by path I get the wrong order:
SELECT * FROM categories ORDER BY path;
Output:
+------+-----------+--------+-------------+
| id | parent_id | slug | path |
+------+-----------+--------+-------------+
| 1 | NULL | foo | /foo |
| 3 | NULL | foo-it | /foo-it |
| 4 | 3 | boy | /foo-it/boy |
| 2 | 1 | bar | /foo/bar |
+------+-----------+--------+-------------+
4 rows in set (0.00 sec)
This appears to happen because - precedes / in most (all?) collations.
Crazy thing is, the unix sort commandline utility does the right thing. If I put all the paths in a file and sort it, I get the correct output:
$ sort paths.txt
/foo
/foo/bar
/foo-it
/foo-it/boy
Is there any way to make MySQL sort the tree properly? To sort it the same way that unix's sort utility does? Perhaps a different collation or something? Or any other tricks?

Try this:
SELECT * FROM categories ORDER BY CONCAT(path, '/');
Produces:
/foo-it
/foo-it/boy
/foo
/foo/bar
/foo is now sorted after /foo-it because /foo/ comes after /foo-. (In MySQL, strings are concatenated with CONCAT(); the + operator performs numeric addition.)
You can go one step further and replace - with a character that sorts after / and is not allowed in your paths or file names:
SELECT * FROM categories ORDER BY CONCAT(REPLACE(path, '-', '?'), '/');
Produces:
/foo
/foo/bar
/foo-it
/foo-it/boy

What "restart" shows in PM2 list command

When I execute the command pm2 list, it shows output with the following columns:
App name id mode pid status restart uptime memory watching
What is the significance of the restart column here?

The restart column in PM2 shows how many times that particular script has been restarted.
So if you initially start a script it will be 0 as in the below output.
┌──────────┬────┬──────┬───────┬────────┬─────────┬────────┬─────┬──────────┬──────────┐
│ App name │ id │ mode │ pid │ status │ restart │ uptime │ cpu │ mem │ watching │
├──────────┼────┼──────┼───────┼────────┼─────────┼────────┼─────┼──────────┼──────────┤
│ server │ 0 │ fork │ 10505 │ online │ 0 │ 0s │ 0% │ 14.0 MB │ disabled │
└──────────┴────┴──────┴───────┴────────┴─────────┴────────┴─────┴──────────┴──────────┘
When you run the command pm2 restart script.js the output will be as below.
┌──────────┬────┬──────┬───────┬────────┬─────────┬────────┬─────┬──────────┬──────────┐
│ App name │ id │ mode │ pid │ status │ restart │ uptime │ cpu │ mem │ watching │
├──────────┼────┼──────┼───────┼────────┼─────────┼────────┼─────┼──────────┼──────────┤
│ server │ 0 │ fork │ 10525 │ online │ 1 │ 0s │ 0% │ 11.5 MB │ disabled │
└──────────┴────┴──────┴───────┴────────┴─────────┴────────┴─────┴──────────┴──────────┘
The value of restart is 1, and it will be incremented every time you restart the script.

Actually, the restart column counts all restarts.
When you use 'pm2 restart process.js' or 'pm2 restart process_name', that adds to the restart count.
When pm2 restarts the process automatically after a crash, that adds to the restart count as well.
It's rather simple to try out: make a simple node file, run it, and restart it. Then introduce an error in the code and run it again to see the automatic restarts increase the count.
