Generate anonymous ids in JOIN query - mysql

The task is to extract customer and order data from a sensitive system. Data is stored in a MySQL database.
A customer can be associated with many orders. A simple LEFT JOIN gives me exactly what I require:
---------------------------------------------------------
| customer_id | order_id | order_quantity | order_value |
---------------------------------------------------------
| 1 | 100 | 3 | 100.00 |
| 1 | 105 | 12 | 400.00 |
| 2 | 103 | 2 | 75.00 |
---------------------------------------------------------
However, in the generated extract, I'm not allowed to reveal the customer_id nor the order_id. Instead, these ids need to be replaced by random, anonymized identifiers generated at the time of data export.
The relationship between customers and their orders still needs to be maintained in the resulting, extracted data export.
Desired outcome:
-------------------------------------------------------------------
| anon_customer_id | anon_order_id | order_quantity | order_value |
-------------------------------------------------------------------
| xyz | abc123 | 3 | 100.00 |
| xyz | def567 | 12 | 400.00 |
| pqr | hij890 | 2 | 75.00 |
-------------------------------------------------------------------
Is there a way to generate anon_customer_id and anon_order_id as part of the SELECT I'm running to build the data result?

One option is to use MySQL's built-in hash functions, such as SHA1() or SHA2(), and create a VIEW that you query and join with.
I've chosen SHA-512 because the probability that different inputs produce the same hash is very low.
CREATE VIEW Table1_VIEW AS (
SELECT
<table>.*
, SHA2(<table>.customer_id, 512) AS anon_customer_id
, SHA2(<table>.order_id, 512) AS anon_order_id
FROM
<table>
)
Query and result
SELECT
*
FROM
Table1_VIEW
| customer_id | order_id | order_quantity | order_value | anon_customer_id | anon_order_id |
| ----------- | -------- | -------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
| 1 | 100 | 3 | 100 | 4dff4ea340f0a823f15d3f4f01ab62eae0e5da579ccb851f8db9dfe84c58b2b37b89903a740e1ee172da793a6e79d560e5f7f9bd058a12a280433ed6fa46510a | 643c30f73a3017050b287794fc8c5bb9ab06b9ce38a1fc58df402a8b66ff58f69bf0a606ae17585352a0306f0e9752de8c5c064aed7003f52808b43ff992a603 |
| 1 | 105 | 12 | 400 | 4dff4ea340f0a823f15d3f4f01ab62eae0e5da579ccb851f8db9dfe84c58b2b37b89903a740e1ee172da793a6e79d560e5f7f9bd058a12a280433ed6fa46510a | 03d25c7071bce10d6b462d53854b969d9f61b982e3aee8771bdcca1ecb70495574e6929042f52e859ee9a253b58f776514180ff16e1338f5505e86c7ff328f72 |
| 2 | 103 | 2 | 75 | 40b244112641dd78dd4f93b6c9190dd46e0099194d5a44257b7efad6ef9ff4683da1eda0244448cb343aa688f5d3efd7314dafe580ac0bcbf115aeca9e8dc114 | 947de04bfae0bf062a66fc055d4c284c9779793d9bd58833ee7549fde1ff1effaf7aefdbc6c90ed0ac86c0acc82329e7c057d900c28ea7ed4724486f717ee38d |
P.S. You can of course also use SHA2() directly in a JOIN.
Example Query
SELECT
table11.*
, SHA2(table11.customer_id, 512) AS anon_customer_id
, SHA2(table11.order_id, 512) AS anon_order_id
FROM
Table1 table11
LEFT JOIN
Table1 table12
ON
table11.customer_id = table12.customer_id
MySQL 5.7+ only
If you have at least MySQL 5.7, you have an even better option: generated columns.
CREATE TABLE Table1 (
`customer_id` INTEGER,
`order_id` INTEGER,
`order_quantity` INTEGER,
`order_value` INTEGER,
anon_customer_id VARCHAR(255) AS ( SHA2(Table1.customer_id, 512) ) VIRTUAL,
anon_order_id VARCHAR(255) AS ( SHA2(Table1.order_id, 512) ) VIRTUAL
);
Edit because of the comment from Louis
my point was that someone will be able to extract the sensitive
customer id after it's hashed. Simply by calculating hashes of all
possible or likely customer id and seeing which is the same. If the
customer ID is not an increasing number with predictable range but
some randomly assigned very large number or indeed a long random
string it might be better
This is very true. What you can do is add more entropy to the hash input so the real ids aren't that easy to brute-force anymore.
In this case you add at least 52 characters (a DATETIME(6) value and its reverse) as entropy, which should be more than enough to protect against brute-forcing for (some) years to come.
CREATE VIEW Table1_VIEW_more_entropy AS (
SELECT
Table1.*
, SHA2(CONCAT_WS(':', Table1.id, Table1.date_created, REVERSE(Table1.date_created), Table1.customer_id), 512) AS anon_customer_id
, SHA2(CONCAT_WS(':', Table1.id, Table1.date_created, REVERSE(Table1.date_created), Table1.order_id), 512) AS anon_order_id
FROM
Table1
);
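The brute-force concern from the comment, and one possible alternative to concatenating extra columns, can be sketched outside the database. The Python below is only an illustration under assumptions that are not in the original answers: the export is post-processed in a script, and `plain_hash`, `brute_force`, `anonymize` and `export_key` are hypothetical names. It first shows that a plain SHA-512 of a small integer id can be recovered by trying candidate ids, then uses a keyed hash (HMAC) with a key generated at export time, so the same customer still maps to the same token within one export.

```python
import hashlib
import hmac
import secrets


def plain_hash(value):
    # Assumed to mirror SHA2(value, 512): hash the decimal string form of the id.
    return hashlib.sha512(str(value).encode()).hexdigest()


def brute_force(target_hash, max_id):
    # Try every candidate id up to max_id and return the one whose hash matches.
    for candidate in range(max_id + 1):
        if plain_hash(candidate) == target_hash:
            return candidate
    return None


def anonymize(key, kind, value):
    # Keyed hash: without the key, enumerating ids does not reproduce the token.
    # "kind" keeps customer 100 and order 100 from sharing a token.
    message = f"{kind}:{value}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


# Rows as produced by the LEFT JOIN in the question.
rows = [(1, 100, 3, "100.00"), (1, 105, 12, "400.00"), (2, 103, 2, "75.00")]

# A plain hash of a small sequential id is easy to reverse.
recovered = brute_force(plain_hash(105), max_id=1000)

# A fresh random key per export; discard it afterwards if the mapping
# must never be reproducible.
export_key = secrets.token_bytes(32)
export = [
    (anonymize(export_key, "customer", c), anonymize(export_key, "order", o), q, v)
    for c, o, q, v in rows
]
```

Because the key is random and generated per export, the tokens differ between exports, which is close to the "random identifiers generated at the time of data export" the question asks for, while rows of the same customer keep a common `anon_customer_id`.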

You could use any hash function, for example, MD5:
SELECT MD5(customer_id) AS anon_customer_id FROM customers;
But be aware though, MD5 is not very secure: https://security.stackexchange.com/questions/19906/is-md5-considered-insecure

Related

MySQL query for data by step size in a given range

So basically I have a users table with a column named "completed_surveys" that holds the total number of completed surveys.
I need to create a query that takes a step size and groups the users by ranges of that size.
Example result which would suit my needs:
+---------+-------------------+
| range | completed_surveys |
+---------+-------------------+
| 0-14 | 4566 |
| 14-28 | 3412 |
| 28-42 | 5456 |
| 42-56 | 33 |
| 56-70 | 31 |
| 70-84 | 441 |
| 84-98 | 576 |
| 98-112 | 23 |
| 112-126 | 12 |
| 126-140 | 1 |
+---------+-------------------+
What I have so far:
select concat(what should i add here??) as `range`,
count(users.completed_surveys) as `completed_surveys` from users WHERE users.completed_surveys > 0 group by 1 order by users.completed_surveys;
I think this query is correct; however, in the concat function I don't really know how to increase the previous number by 14. Any ideas?
One idea is to first create a helper table with the values 0..9.
CREATE TABLE tmp ( i int );
INSERT INTO tmp VALUES (0) , (1) ... (9);
Then join the two tables:
SELECT concat(i*14,'-',(i+1)*14) as `range`,
count(users.completed_surveys) as `completed_surveys` from users
INNER JOIN tmp ON (users.completed_surveys>tmp.i*14 AND users.completed_surveys<=(tmp.i+1)*14)
WHERE users.completed_surveys > 0
GROUP BY tmp.i
ORDER BY tmp.i
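The same bucketing can also be computed arithmetically, without a helper table. The sketch below is an assumption-laden illustration, not the answer's method: it runs in Python against an in-memory SQLite database as a stand-in for MySQL, the sample values are made up, and the bucket index is `(completed_surveys - 1)` integer-divided by the step so that 14 falls into "0-14" and 15 into "14-28". In MySQL the integer division would be written with `DIV`.

```python
import sqlite3

STEP = 14  # step size from the question

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (completed_surveys INTEGER)")
# Made-up sample data, only to exercise the bucket boundaries.
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [(n,) for n in (1, 14, 15, 28, 29, 30, 100)],
)

# SQLite divides two integers as integers; in MySQL use "DIV" instead of "/".
rows = conn.execute(
    """
    SELECT (completed_surveys - 1) / :step AS bucket, COUNT(*) AS n
    FROM users
    WHERE completed_surveys > 0
    GROUP BY bucket
    ORDER BY bucket
    """,
    {"step": STEP},
).fetchall()

# Turn the bucket index into the "lower-upper" label used in the question.
result = [(f"{b * STEP}-{(b + 1) * STEP}", n) for b, n in rows]
```

Unlike the helper-table join, this variant only returns ranges that actually contain users, and it is not limited to ten ranges.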

Splitting a cell in mySQL into multiple rows while keeping the same "ID"

In my table I have two columns "sku" and "fitment". The sku represents a part and the fitment represents all the vehicles this part will fit on. The problem is, in the fitment cells, there could be up to 20 vehicles in there, separated by ^^. For example
**sku -- fitment**
part1 -- Vehicle 1 information ^^ Vehicle 2 information ^^ Vehicle 3 etc
I am looking to split the cells in the fitment column, so it would look like this:
**sku -- fitment**
part1 -- Vehicle 1 information
part1 -- Vehicle 2 information
part1 -- Vehicle 3 information
Is this possible to do? And if so, would a mySQL db be able to handle hundreds of thousands of items "splitting" like this? I imagine it would turn my db of around 250k lines to about 20million lines. Any help is appreciated!
Also a little more background, this is going to be used for a drill down search function so I would be able to match up parts to vehicles (year, make, model, etc) so if you have a better solution, I am all ears.
Thanks
Possible duplicate of this: Split value from one field to two
Unfortunately, MySQL does not feature a split-string function. As the link above indicates, there are user-defined split functions.
A more verbose version to fetch the data can be the following:
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(fitment, '^^', 1), '^^', -1) as fitmentvehicle1,
SUBSTRING_INDEX(SUBSTRING_INDEX(fitment, '^^', 2), '^^', -1) as fitmentvehicle2
....
SUBSTRING_INDEX(SUBSTRING_INDEX(fitment, '^^', n), '^^', -1) as fitmentvehiclen
FROM table_name;
Since your requirement asks for the data to be retrieved in a normalized format (i.e. not separated by ^^), it is always better to store it that way in the first place. With regard to the growth in DB size, you might want to look into archiving older data and deleting it from the table.
Also, you should partition your table using an efficient partitioning strategy based on your requirements. It is easier to archive and truncate a whole partition of the table than to do it row by row.
E.g.
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table (user_id INT NOT NULL PRIMARY KEY,stuff VARCHAR(50) NOT NULL);
INSERT INTO my_table VALUES (101,'1,2,3'),(102,'3,4'),(103,'4,5,6');
SELECT *
FROM my_table;
+---------+-------+
| user_id | stuff |
+---------+-------+
| 101 | 1,2,3 |
| 102 | 3,4 |
| 103 | 4,5,6 |
+---------+-------+
SELECT * FROM ints;
+---+
| i |
+---+
| 0 |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
+---+
SELECT DISTINCT user_id
, SUBSTRING_INDEX(SUBSTRING_INDEX(stuff,',',i2.i*10+i1.i+1),',',-1) x
FROM my_table
, ints i1
, ints i2
ORDER
BY user_id,x;
+---------+---+
| user_id | x |
+---------+---+
| 101 | 1 |
| 101 | 2 |
| 101 | 3 |
| 102 | 3 |
| 102 | 4 |
| 103 | 4 |
| 103 | 5 |
| 103 | 6 |
+---------+---+
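To see why the nested SUBSTRING_INDEX call picks the n-th item, and why the query above needs DISTINCT, the logic can be mimicked in a few lines of Python. This is only a sketch: `substring_index` is a hypothetical re-implementation of the MySQL function's documented behaviour (positive count keeps everything left of the count-th delimiter, negative count keeps everything right of it, counting from the end), and the fitment string is adapted from the question.

```python
def substring_index(text, delim, count):
    # Rough stand-in for MySQL's SUBSTRING_INDEX(text, delim, count).
    parts = text.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    if count < 0:
        return delim.join(parts[count:])
    return ""


def nth_item(text, delim, n):
    # Inner call: keep the first n items. Outer call: keep the last of those.
    return substring_index(substring_index(text, delim, n), delim, -1)


def split_rows(rows, delim, max_items):
    # Cross every row with 1..max_items, like the join against the ints table.
    # Past the last item the same value repeats, so duplicates are dropped,
    # which is the role DISTINCT plays in the SQL.
    seen = set()
    for key, packed in rows:
        for n in range(1, max_items + 1):
            seen.add((key, nth_item(packed, delim, n).strip()))
    return sorted(seen)


fitments = [
    ("part1", "Vehicle 1 information ^^ Vehicle 2 information ^^ Vehicle 3 information"),
]
result = split_rows(fitments, "^^", max_items=20)
```

Note that dropping duplicates also collapses a vehicle that is genuinely listed twice for the same part, which is usually what a drill-down search wants anyway.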

Script to combine multiple MySQL records into one via summing

I'm a MySQL newbie, but I'm sure there must be a way to do this. I've been looking through StackOverflow for quite a while, though, and haven't found it yet.
I have a MySQL table that is generated from a multi-reducer Hadoop MapReduce job which is analyzing log files. The table is being used in the database that supports a Ruby-on-Rails app, and it looks like this:
+----+-----+------+---------+-----------+
| id | src | dest | time | requests |
+----+-----+------+---------+-----------+
| 0 | abc | xyz | 1000000 | 200000000 |
| 1 | def | uvw | 10 | 300 |
| 2 | abc | xyz | 100000 | 200000 |
| 3 | def | xyz | 1000 | 40000 |
| 4 | abc | uvw | 100 | 5000 |
| 5 | def | xyz | 10000 | 100000 |
+----+-----+------+---------+-----------+
I'm trying to coalesce/sum the columns which have the same src and dest, but I just can't figure out how to do it even after searching through the MySQL 5.1 documentation.
I'm trying to write a script which I could run and obtain something like this at the end (neither the order of the rows nor the id column is important):
+----+-----+------+---------+-----------+
| id | src | dest | time | requests |
+----+-----+------+---------+-----------+
| 6 | abc | xyz | 1100000 | 200200000 |
| 7 | def | uvw | 10 | 300 |
| 8 | abc | uvw | 100 | 5000 |
| 9 | def | xyz | 11000 | 140000 |
+----+-----+------+---------+-----------+
Any ideas on how I could figure this out?
You can't really combine the rows in a single table -- at least not easily. That would require both updates and deletes.
So, just create another table:
create table summary_t as
select src, dest, sum(time) as time, sum(requests) as requests
from t
group by src, dest;
If you really want this to go back into the original table, then use a temporary table and re-insert the data:
create temporary table summary_t as
select src, dest, sum(time) as time, sum(requests) as requests
from t
group by src, dest;
truncate table t;
insert into t(src, dest, time, requests)
select src, dest, time, requests
from summary_t;
However, having said all that, you should just add another step to your map-reduce application to do that final summary.
Group By with SUM aggregate should work
select src, dest, sum(`time`) as `time`, sum(requests) as requests
from yourtable
group by src, dest
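As a quick check that this GROUP BY produces the totals the question asks for, the query can be run against the sample rows. The sketch below uses Python with an in-memory SQLite database purely as a stand-in for MySQL (an assumption for the sake of a runnable example); `yourtable` is the placeholder name from the answer, and an ORDER BY is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourtable (id INTEGER, src TEXT, dest TEXT, time INTEGER, requests INTEGER)"
)
# Sample rows copied from the question.
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?, ?, ?, ?)",
    [
        (0, "abc", "xyz", 1000000, 200000000),
        (1, "def", "uvw", 10, 300),
        (2, "abc", "xyz", 100000, 200000),
        (3, "def", "xyz", 1000, 40000),
        (4, "abc", "uvw", 100, 5000),
        (5, "def", "xyz", 10000, 100000),
    ],
)

# One output row per (src, dest) pair, with time and requests summed.
summary = conn.execute(
    """
    SELECT src, dest, SUM(time) AS time, SUM(requests) AS requests
    FROM yourtable
    GROUP BY src, dest
    ORDER BY src, dest
    """
).fetchall()
```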
Check whether this suits your needs: create a table with the columns src and dest as the primary key and other fields such as totaltime and totalrequest.
Create an AFTER INSERT trigger on the existing table that updates totaltime and totalrequest in the other table with (old + new), using src and dest as the key in the WHERE condition.

MySQL- List inside of a column

What I'm trying to do is create a table that will keep track of users who report a comment on a website. Right now, I have a table that would look something like this:
id | num_reports | users
-----------------------------------
12345 1
12489 4
For this table, I'd like id to be unique and num_reports to keep incrementing starting at 1. But for users, I'm getting confused because I'd like to keep a record of the user_ids that created a report and I'm unsure of how to make it so I can store multiple user_ids.
I thought of doing something like
id | user_id
---------------
123 567
123 689
and in this case, you would just count the number of rows with id being duplicated and user_id being unique, but this just seemed inefficient.
I've been looking around, and it looks like the correct way would be creating another table, but how does that allow me to store multiple user_ids?
That's the right way to do it. Here is what you should have:
USERS COMMENTS
+---------+------+ +------------+---------+------------+---------------------+
| id_user | name | | id_comment | id_user | id_article | date |
+---------+------+ +------------+---------+------------+---------------------+
| 171 | Joe | | 245 | 245 | 24 | 2015-03-22 10:12:00 |
| 180 | Jack | | 1245 | 180 | 68 | 2015-03-23 23:01:19 |
| ... | ... | | ... | ... | ... | ... |
+---------+------+ +------------+---------+------------+---------------------+
COMMENT_REPORTS
+-----------+------------+---------+---------------------+
| id_report | id_comment | id_user | date |
+-----------+------------+---------+---------------------+
| 1 | 245 | 171 | 2015-03-24 16:11:15 |
| 2 | 654 | 180 | 2015-03-24 18:13:42 |
| 3 | 1245 | 180 | 2015-03-24 18:34:01 |
| 4 | 1245 | 456 | 2015-03-25 09:58:10 |
| ... | ... | ... | ... |
+-----------+------------+---------+---------------------+
You then will be able to get:
# Every report made by a user
SELECT *
FROM comment_reports
WHERE id_user = 180;
# Every report related to a comment
SELECT *
FROM comment_reports
WHERE id_comment = 1245;
# Every report made today
SELECT *
FROM comment_reports
WHERE date >= CURDATE();
# The number of reports related to a user's comments
SELECT c.id_user AS User, COUNT(cr.id_report) AS Reported
FROM comment_reports cr
JOIN comments c ON (cr.id_comment = c.id_comment)
WHERE c.id_user = 180
GROUP BY c.id_user;
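With one row per report, the `num_reports` counter from the question does not need to be stored; it can be derived with COUNT. The sketch below illustrates this in Python with an in-memory SQLite database as a stand-in for MySQL. The UNIQUE constraint on (id_comment, id_user) is an assumption added here, not part of the answer above, to show how a user can be prevented from reporting the same comment twice; the rows are the sample reports from the COMMENT_REPORTS table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE comment_reports (
        id_report  INTEGER PRIMARY KEY,
        id_comment INTEGER NOT NULL,
        id_user    INTEGER NOT NULL,
        date       TEXT NOT NULL,
        UNIQUE (id_comment, id_user)  -- assumption: one report per user and comment
    )
    """
)

insert_sql = "INSERT INTO comment_reports (id_comment, id_user, date) VALUES (?, ?, ?)"
conn.executemany(
    insert_sql,
    [
        (245, 171, "2015-03-24 16:11:15"),
        (654, 180, "2015-03-24 18:13:42"),
        (1245, 180, "2015-03-24 18:34:01"),
        (1245, 456, "2015-03-25 09:58:10"),
    ],
)

# A second report by user 456 on comment 1245 violates the UNIQUE constraint.
try:
    conn.execute(insert_sql, (1245, 456, "2015-03-26 08:00:00"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The report counter is computed on demand instead of being stored.
num_reports = conn.execute(
    """
    SELECT id_comment, COUNT(*) AS num_reports
    FROM comment_reports
    GROUP BY id_comment
    ORDER BY id_comment
    """
).fetchall()
```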
Are you building a data warehouse? Normally the number of reports is not stored. It is calculated on the fly by taking COUNT(*) per website_id from the table where the reports are saved. There you can also save the user who made each report. You can then take the total number of reports, the total per user, and so on.
However, if you keep a solution like the one above, you have no other option than to create a separate link table for storing report<-->user links.
You can find users by their unique id; because it is auto-incremented, each user id is always unique and is never overwritten.

SQL algorithm to as near to linear time as possible and tweaking of select statement

I am using MySQL version 5.5 on Ubuntu.
My database tables are setup as follows:
DDLs:
CREATE TABLE `asx` (
`code` char(3) NOT NULL,
`high` decimal(9,3),
`low` decimal(9,3),
`close` decimal(9,3),
`histID` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`histID`),
UNIQUE KEY `code` (`code`)
);
CREATE TABLE `asxhist` (
`date` date NOT NULL,
`average` decimal(9,3),
`histID` int(11) NOT NULL,
PRIMARY KEY (`date`,`histID`),
KEY `histID` (`histID`),
CONSTRAINT `asxhist_ibfk_1` FOREIGN KEY (`histID`) REFERENCES `asx` (`histID`)
ON UPDATE CASCADE
);
t1:
| code | high | low | close | histID (primary key) |
| asx | 10.000 | 9.500 | 9.800 | 1 |
| nab | 42.000 | 41.250 | 41.350 | 2 |
t2:
| date | average | histID (foreign key) |
| 2013-01-01| 10.000 | 1 |
| 2013-01-01| 39.000 | 2 |
| 2013-01-02| 9.000 | 1 |
| 2013-01-02| 38.000 | 2 |
| 2013-01-03| 9.500 | 1 |
| 2013-01-03| 39.500 | 2 |
| 2013-01-04| 11.000 | 1 |
| 2013-01-04| 38.500 | 2 |
I am attempting to complete a select query that produces this as a result:
| code | high | low | close | asxhist.average |
| asx | 10.000 | 9.500 | 9.800 | 11.000,9.500 |
| nab | 42.000 | 41.250 | 41.350 | 38.500,39.500 |
Where the most recent information in table 2 is returned with table 1 in a csv format.
I have managed to get this far:
SELECT code, high, low, close,
(SELECT GROUP_CONCAT(DISTINCT t2.average ORDER BY date DESC SEPARATOR ',') FROM t2
WHERE t2.histID = t1.histID)
FROM t1;
Unfortunately this returns all values associated with histID. I'm taking a look at xaprb.com's first/least/max-row-per-group-in-SQL solution, but I have been banging my head all day and the slight wooziness seems to be dimming my ability to comprehend how I should use it to my benefit. How can I limit the results to the 5 most recent values and, considering the tables will eventually be megabytes in size, try to remain in O(n^2) or less? (Or can I?)
A temporary workaround using SUBSTRING_INDEX (not a feasible solution for huge data):
SELECT code, high, low, close,
(SELECT SUBSTRING_INDEX(GROUP_CONCAT(asxhist.average ORDER BY asxhist.date DESC), ',', 3)
FROM asxhist
WHERE asxhist.histID = asx.histID)
FROM asx;
From what I gather, a LIMIT option in GROUP_CONCAT is still an open feature request.
See also on Stack Overflow: hack MySQL GROUP_CONCAT
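If trimming inside the query proves awkward, another option is to keep only the newest N values per histID in the client. The Python below is a sketch under assumptions that are not in the thread: the rows of asxhist have already been fetched as (date, average, histID) tuples, averages are kept as strings to preserve their formatting, and `latest_averages` is a hypothetical helper. After one sort by date, it makes a single pass over the rows.

```python
from collections import defaultdict

# Rows of asxhist from the question: (date, average, histID).
rows = [
    ("2013-01-01", "10.000", 1),
    ("2013-01-01", "39.000", 2),
    ("2013-01-02", "9.000", 1),
    ("2013-01-02", "38.000", 2),
    ("2013-01-03", "9.500", 1),
    ("2013-01-03", "39.500", 2),
    ("2013-01-04", "11.000", 1),
    ("2013-01-04", "38.500", 2),
]


def latest_averages(history, n):
    # ISO dates sort correctly as strings; newest rows come first.
    newest_first = sorted(history, key=lambda row: row[0], reverse=True)
    per_group = defaultdict(list)
    for _date, average, hist_id in newest_first:
        # Keep a value only while the group has fewer than n entries.
        if len(per_group[hist_id]) < n:
            per_group[hist_id].append(average)
    return {hist_id: ",".join(values) for hist_id, values in per_group.items()}


result = latest_averages(rows, n=2)
```

With n=2 this reproduces the comma-separated averages shown in the desired result; the sort dominates the cost at O(n log n), and it can be pushed to the database with an ORDER BY on the indexed date column.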