MYSQL masking data from update very slow on large DB - mysql

I have a DEV DB with 16 million(ish) records. I need to 'mask' columns of personal data (name, address, phone, etc.). I found a nice function that will do the data masking wonderfully: Howto generate meaningful test data using a MySQL function.
The problem is, when I call the function, it is only processing about 30 records per second.
This is way too slow.
Is there any way to speed this up? Maybe create a temp table or something.
Here is the UPDATE statement that calls the function.
UPDATE table1
SET first_name = (str_random('Cc{3}c(4)')),
    last_name = (str_random('Cc{5}c(6)')),
    email = (str_random('c{3}c(5)[.|_]c{8}c(8)@[google|yahoo|live|mail]".com"')),
    address1 = (str_random('d{3}d{1} Cc{5} [Street|Lane|Road|Park]')),
    city = (str_random('Cc{5}c(6)')),
    state = (str_random('C{2}')),
    zip = (str_random('d{5}-d{4}'))
Thanks!!

Instead of calling a random function 7*16m times, it would probably be faster if you operated on procedurally generated text.
I checked out the str_random function you linked to. (That's very clever btw - cool stuff)
It calls RAND() once for each random character in the string and once each time you say "choose from list". That's a lot of rands.
I think one way to improve it would be to create and cache (in a table) a large set of random characters. Instead of calling RAND() (say) 5 times for 5 random characters, call it once to determine an offset into the big string of random characters, then just increment the index it uses to pull from the string (if it needs a bunch in a row, it can pull them all at once and multi-increment the offset).
The str_random_character function that the parent function calls could be replaced by something that does this instead of calling rand into an array.
It's a bit beyond me for a throwaway piece of code, but it might put you (or a better mysql guru) on a path for speeding this puppy up (maybe).
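To make the idea concrete, here's a rough Python sketch of it (illustration only - the real change would live inside the MySQL str_random_character function, and the names here are made up):

import random
import string

# Pre-build one big pool of random characters up front; each request then costs
# a single RNG call for the offset instead of one RAND() call per character.
POOL_SIZE = 1_000_000
pool = "".join(random.choices(string.ascii_lowercase, k=POOL_SIZE))

def random_chars(n):
    offset = random.randrange(POOL_SIZE - n)   # one random call per request
    return pool[offset:offset + n]

print(random_chars(5))   # e.g. 'qwxat'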
A different option would be, rather than random-masking all the data... can you transform the data in some way? Since you don't need the original back, you could do something like a Caesar cipher on each character, based on a (single) RAND() call for the rotation count. (If you rotate the uppers, lowers, and digits in each string separately, the data will stay looking "normal" while not being easily reversible, because of the randomized rotation.) I wouldn't slap a SECURE sticker on it, but it would be a lot quicker and not easy to reverse.
I think I have a Caesar rotator that does that somewhere if it suffices.
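For what it's worth, here's a minimal sketch of that rotation idea in Python (the per-class rotation and names are my own guess at how you'd wire it up; in practice you'd port it into a MySQL function or run it during an export):

import random

# One random rotation for the whole run (or per row), applied separately to
# uppercase letters, lowercase letters, and digits so the masked value still
# "looks like" a name, address, etc.
def rotate_mask(text, rot):
    out = []
    for ch in text:
        if ch.isupper():
            out.append(chr((ord(ch) - ord('A') + rot) % 26 + ord('A')))
        elif ch.islower():
            out.append(chr((ord(ch) - ord('a') + rot) % 26 + ord('a')))
        elif ch.isdigit():
            out.append(chr((ord(ch) - ord('0') + rot) % 10 + ord('0')))
        else:
            out.append(ch)           # punctuation, spaces, '@', etc. pass through
    return "".join(out)

rot = random.randrange(1, 26)        # the single random call
print(rotate_mask("John Smith, 123 Main Street", rot))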

Related

How to store a row index for very large (csv) files?

I am working with large csv files (>> 10^6 lines) and need a row index for some operations. I need to do comparisons between two versions of the files, identifying deletions, so I thought it'd be easiest to include a row index. I guess that the number of lines would quickly render traditional integers inefficient. I'm averse to the idea of having a column containing, let's say, 634567775577 in plain text as a row index (followed by the actual data row). Are there any best practice suggestions for this scenario?
The resulting files have to remain plain text, so serialisation / sqlite is not an option.
At the moment, I'm considering an index based on the actual row data (for example concatenating the row data and converting to base64 or the like), but would that be more reasonable than a plain integer? There should be no duplicate row within each file, so I guess this could be one way.
Cheers, Sacha
Ps: I heavily modified the initial question for clarification
you can use regular numbers.
Python is not afraid of large numbers :) (well, to the order of magnitude you described...)
just open a python shell and type 10**999 and see that it doesn't overflow or anything.
In Python, there's no actual bit limit for integers. In Python 2 there technically is: an int is a fixed-size machine integer, and a long is arbitrary precision, but if you just write a number that's too big for an int, the conversion to long happens implicitly.
Python 3 has just one integer type, and it's limited only by available memory.
So there's no real reason why you can't use an integer if you really want to add an index.
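For instance (the 634567775577 value is just the example index from the question):

# Python integers are arbitrary precision, so even a huge row index is just a number:
i = 634567775577            # the example index from the question
print(i + 1)                # 634567775578 - no overflow, nothing special needed
print(len(str(10 ** 999)))  # 1000 digits, still an ordinary int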
Python's built-in library contains SQLite, a self-contained, one-file-fits-everything DBMS which, contrary to normal perception, can be quite performant. If the records are to be consulted by a single application with no concurrency, it compares well to specialized DBMSs that require a separate daemon.
So, essentially, you can dump your CSV to an SQLite database and create the indices you need - even on all four columns, if need be.
Here is a template script you could customize to create such a DB. I guessed the value of 1000 for the number of inserts at a time, but it may not be optimal - try tweaking it if inserting is too slow.
import sqlite3
import csv

inserts_at_time = 1000

def create_and_populate_db(dbfilename, csvfilename):
    db = sqlite3.connect(dbfilename)
    db.execute("""CREATE TABLE data (col1, col2, col3, col4)""")
    # One index per column, so lookups on any of them are fast:
    for col_name in "col1 col2 col3 col4".split():
        db.execute(f"""CREATE INDEX idx_{col_name} ON data ({col_name})""")
    with open(csvfilename) as in_file:
        reader = csv.reader(in_file)
        next(reader)  # skip the header row
        total = 0
        while True:
            # Read up to `inserts_at_time` rows and insert them as one batch:
            lines = [line for _, line in zip(range(inserts_at_time), reader)]
            if not lines:
                break
            db.executemany("INSERT INTO data VALUES (?,?,?,?)", lines)
            total += len(lines)
            print(f"Inserted {len(lines)} lines - total {total}")
    db.commit()
    db.close()
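Once both versions of the file have been loaded this way (the file names old.db and new.db below are made up, and both are assumed to have been built with the script above), finding the deleted rows becomes a single query, for example:

import sqlite3

db = sqlite3.connect("old.db")
db.execute("ATTACH DATABASE 'new.db' AS v2")
# Rows present in the old version but missing from the new one, i.e. deletions:
deleted = db.execute(
    "SELECT col1, col2, col3, col4 FROM data "
    "EXCEPT "
    "SELECT col1, col2, col3, col4 FROM v2.data"
).fetchall()
print(len(deleted), "rows were deleted")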

Picking JSON objects out of array based on their value

Perhaps I think about this wrong, but here is a problem:
I have an NSMutableArray full of JSON objects. Each object looks like this; here are 2 of them, for example:
{
player = "Lorenz";
speed = "12.12";
},
{
player = "Firmino";
speed = "15.35";
}
Okay, so this is fine; this is dynamic info I get from a webserver feed. Now what I want, though: let's pretend there are 22 such entries, and the speeds vary.
I want to have a timer going that starts at 1.0 seconds and goes to 60.0 seconds, and a few times a second I want it to grab all the players whose speed has just been passed. So for instance, if the timer goes off at 12.0 and then goes off again at 12.5, I want it to grab out all the player names whose speed is between 12.0 and 12.5, you see?
The obvious easy way would be to iterate over the array completely every time that the timer goes off, but I would like the timer to go off a LOT, like 10 times a second or more, so that would be a fairly wasteful algorithm I think. Any better ideas? I could attempt to alter the way data comes from the webserver but don't feel that should be necessary.
NOTE: edited to reflect a corrected understanding that the number from 1 to 60 is incremented continuously across that range rather than being a random number in that interval.
Before you enter the timer loop, you should do some common preprocessing:
Convert the speeds from strings to numeric values upfront for fast comparison without having to parse each time. This is O(1) for each item and O(n) to process all the items.
Put the data in an ordered container such as a sorted list or sorted binary tree. This will allow you to easily find elements in the target range. This is O(n log n) to sort all the items.
On the first iteration:
Use binary search to find the start index. This is O(log n).
Use binary search to find the end index, using the start index to bound the search.
On subsequent iterations:
If each iteration increases by a predictable amount and the step between elements in the list is likewise a predictable amount, then just maintain a pointer and increment as per Pete's comment. This would make each iteration cost O(1) (just stepping ahead by a fixed amount).
If the steps between iterations and/or the entries in the list are not predictable, then do a binary search as in the initial case. If the values are monotonically increasing (as I now understand the problem to be stating), even if they are unpredictable, you can still exploit this: maintain an index as in the other case, but instead of resuming iteration directly from there, use the remembered index as a lower bound on the binary search so that you narrow the region being searched. This makes each iteration cost O(log m), where m is the number of remaining elements to be considered.
Overall, this produces an algorithm that is no worse than O((N + I) log N), where I is the number of iterations, compared to the previous algorithm, which was O(I * N). It also shifts most of the computation outside of the loop rather than inside it.
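To make that concrete, here's a small sketch (Python rather than Objective-C, and it assumes the feed has already been parsed into numeric (speed, name) pairs):

import bisect

# players: list of (speed, name) tuples parsed once from the feed
players = [(12.12, "Lorenz"), (15.35, "Firmino")]   # ... 22 entries in practice
players.sort()                       # O(n log n), done once before the timer starts
speeds = [s for s, _ in players]     # parallel list of numeric speeds for bisect

last_index = 0                       # everything before this index has been handled

def on_timer(current_speed):
    global last_index
    # Binary search, bounded below by where we stopped last tick: O(log m)
    end = bisect.bisect_right(speeds, current_speed, lo=last_index)
    for speed, name in players[last_index:end]:
        print(name)                  # players whose speed was just passed
    last_index = end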
A modern computer can do billions of operations per second. Even if your timer goes off 1000 times per second and you need to process 1000 entries, you will still be fine with a naive approach.
But to answer the question, the best approach would be to sort the data first based on speed, and then have an index of the last player whose speed was already passed. At the beginning the pointer, obviously, points at the first player. Then every time your timer goes off, you will need to process some continuous chunk of players starting at that index. Something along the lines of (in pseudocode):
global index = 0;
sort(players); // sort on speed

onTimer = function(currentSpeed) {
    while (index < players.length && players[index].speed < currentSpeed) {
        processPlayer(players[index]);
        ++index;
    }
}

Will an MD5 hash keep changing as its input grows?

Does the value returned by MySQL's MD5 hash function continue to change indefinitely as the string given to it grows indefinitely?
E.g., will these continue to return different values:
MD5("A"+"B"+"C")
MD5("A"+"B"+"C"+"D")
MD5("A"+"B"+"C"+"D"+"E")
MD5("A"+"B"+"C"+"D"+"E"+"D")
... and so on until a very long list of values ....
At some point, when we are giving the function very long input strings, will the results stop changing, as if the input were being truncated?
I'm asking because I want to use the MD5 function to compare two records with a large set of fields by storing the MD5 hash of these fields.
======== MADE-UP EXAMPLE (YOU DON'T NEED THIS TO ANSWER THE QUESTION, BUT IT MIGHT INTEREST YOU) ========
I have a database application that periodically grabs data from an external source and uses it to update a MySQL table.
Let's imagine that in month #1, I do my first download:
downloaded data, where the first field is an ID, a key:
1,"A","B","C"
2,"A","D","E"
3,"B","D","E"
I store this
1,"A","B","C"
2,"A","D","E"
3,"B","D","E"
Month #2, I get
1,"A","B","C"
2,"A","D","X"
3,"B","D","E"
4,"B","F","E"
Notice that the record with ID 2 has changed. Record with ID 4 is new. So I store two new records:
1,"A","B","C"
2,"A","D","E"
3,"B","D","E"
2,"A","D","X"
4,"B","F","E"
This way I have a history of *changes* to the data.
I don't want to have to compare each field of the incoming data with each field of each of the stored records.
E.g., if I'm comparing incoming record x with existing record a, I don't want to have to say:
Add record x to the stored data if there is no record a such that x.ID == a.ID AND x.F1 == a.F1 AND x.F2 == a.F2 AND x.F3 == a.F3 [4 comparisons]
What I want to do is to compute an MD5 hash and store it:
1,"A","B","C",MD5("A"+"B"+"C")
Let's suppose that it is month #3, and I get a record:
1,"A","G","C"
What I want to do is compute the MD5 hash of the new fields: MD5("A"+"G"+"C") and compare the resulting hash with the hashes in the stored data.
If it doesn't match, then I add it as a new record.
I.e., Add record x to the stored data if there is no record a such that x.ID == a.ID AND MD5(x.F1 + x.F2 + x.F3) == a.stored_MD5_value [2 comparisons]
My question is "Can I compare the MD5 hash of, say, 50 fields without increasing the likelihood of clashes?"
Yes, practically, it should keep changing. Due to the pigeonhole principle, if you continue doing that long enough you will eventually get a collision, but in practice it's vanishingly unlikely that you'll ever reach that point.
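As a quick sanity check of that (using Python's hashlib here, but it computes the same 128-bit MD5 digest that MySQL's MD5() returns as hex):

import hashlib

# The digest never "stops changing" as the input grows; it just stays a fixed
# 32-hex-character value no matter how long the input gets.
s = ""
digests = set()
for ch in "ABCDEFGHIJKLMNOPQRSTUVWXYZ" * 100:    # inputs growing up to 2,600 chars
    s += ch
    digests.add(hashlib.md5(s.encode()).hexdigest())
print(len(digests))   # 2600 - every longer input produced a distinct digest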
The security of the MD5 hash function is severely compromised. A collision attack exists that can find collisions within seconds on a computer with a 2.6 GHz Pentium 4 processor (complexity of 2^24).
Further, there is also a chosen-prefix collision attack that can produce a collision for two chosen arbitrarily different inputs within hours, using off-the-shelf computing hardware (complexity 2^39).
The ability to find collisions has been greatly aided by the use of off-the-shelf GPUs. On an NVIDIA GeForce 8400GS graphics processor, 16-18 million hashes per second can be computed. An NVIDIA GeForce 8800 Ultra can calculate more than 200 million hashes per second.
These hash and collision attacks have been demonstrated in the public in various situations, including colliding document files and digital certificates.
See http://www.win.tue.nl/hashclash/On%20Collisions%20for%20MD5%20-%20M.M.J.%20Stevens.pdf
A number of projects have published MD5 rainbow tables online, that can be used to reverse many MD5 hashes into strings that collide with the original input, usually for the purposes of password cracking.

How can I get better randomization in my sql query?

I am attempting to get a random bearing, from 0 to 359.9.
SET bearing = FLOOR((RAND() * 359.9));
I may call the procedure that runs this request within the same while loop, immediately one after the next. Unfortunately, the randomization seems to be anything but unique. e.g.
Results
358.07
359.15
357.85
I understand how randomization works, and I know because of my quick calls to the same function, the ticks used to generate the random number are very close to one another.
In any other situation, I would wait a few milliseconds in between calls or reinit my Random object (such as in C#), which would greatly vary my randomness. However, I don't want to wait in this situation.
How can I increase randomness without waiting?
I understand how randomization works, and I know because of my quick calls to the same function, the ticks used to generate the random number are very close to one another.
That's not quite right. Where folks get into trouble is when they re-seed a random number generator repeatedly with the current time, and because they do it very quickly the time is the same and they end up re-seeding the RNG with the same seed. This results in the RNG spitting out the same sequence of numbers each time it is re-seeded.
Importantly, by "the same" I mean exactly the same. An RNG is either going to return an identical sequence or a completely different one. A "close" seed won't result in a "similar" sequence. You will either get an identical sequence or a totally different one.
The correct solution to this is not to stagger your re-seeds, but actually to stop re-seeding the RNG. You only need to seed an RNG once.
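You can see that effect easily enough outside MySQL; for example with Python's RNG (different generator, same principle):

import random

# Re-seeding with the same value replays the exact same sequence - not a
# "similar" one - which is the failure mode described above.
random.seed(1234)
first = [random.random() for _ in range(3)]
random.seed(1234)
second = [random.random() for _ in range(3)]
print(first == second)   # True: identical, not merely close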
Anyways, that is neither here nor there. MySQL's RAND() function does not require explicit seeding. When you call RAND() without arguments the seeding is taken care of for you meaning you can call it repeatedly without issue. There's no time-based limitation with how often you can call it.
Actually your SQL looks fine as is. There's something missing from your post, in fact. Since you're calling FLOOR() the result you get should always be an integer. There's no way you'll get a fractional result from that assignment. You should see integral results like this:
187
274
89
345
That's what I got from running SELECT FLOOR(RAND() * 359.9) repeatedly.
Also, for what it's worth, RAND() will never return 1.0. Its range is 0 ≤ RAND() < 1.0. You are safe using 360 vs. 359.9:
SET bearing = FLOOR(RAND() * 360);

Generating unique codes that are different in two digits

I want to generate unique code numbers (composed of 7 digits exactly). The code number is generated randomly and saved in MySQL table.
I have another requirement: all generated codes should differ in at least two digits. This is useful to prevent errors when typing in a user code. Hopefully, it will prevent referring to another user's code during some operation, since it is unlikely to mistype two digits and still match another existing user code.
The generation algorithm works simply, like this:
Retrieve all previous codes if any from MySQL table.
Generate one code at a time.
Subtract the generated code from each of the previous codes.
Check the number of non-zero digits in the subtraction result.
If it is > 1, accept the generated code and add it to previous codes.
Otherwise, jump to 2.
Repeat steps from 2 to 6 for the number of requested codes.
Save the generated codes in the DB table.
The algorithm works fine, but the problem is performance. It takes a very long time to finish when a large number of codes is requested, like 10,000.
The question: Is there any way to improve the performance of this algorithm?
I am using perl + MySQL on Ubuntu server if that matters.
Have you considered a variant of the Luhn algorithm? Luhn is used to generate a check digit for strings of numbers in lots of applications, including credit card account numbers. It's part of the ISO 7812-1 standard for generating identifiers. It will catch any number that is entered with one incorrect digit, which implies any two valid numbers differ in at least two digits.
Check out Algorithm::LUHN in CPAN for a perl implementation.
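If you'd rather see the mechanics than pull in the module, here's a minimal sketch of the check-digit calculation (Python here; Algorithm::LUHN gives you the equivalent in Perl):

def luhn_check_digit(payload):
    """Check digit that makes payload + digit pass the Luhn test."""
    digits = [int(d) for d in payload]
    # Double every second digit, starting from the rightmost payload digit;
    # subtract 9 whenever doubling produces a two-digit number.
    for i in range(len(digits) - 1, -1, -2):
        digits[i] *= 2
        if digits[i] > 9:
            digits[i] -= 9
    return str((10 - sum(digits) % 10) % 10)

print("123456" + luhn_check_digit("123456"))   # a 7-digit, Luhn-valid code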
Don't retrieve the existing codes, just generate a potential new code and see if there are any conflicting ones in the database:
SELECT code FROM table WHERE abs(code-?) regexp '^[1-9]?0*$';
(where the placeholder is the newly generated code).
Ah, I missed the generating lots of codes at once part. Do it like this (completely untested):
use strict;
use warnings;

my @codes = existing_codes();
my $frontwards_index = {};
my $backwards_index = {};

# Index each code by its first half, and (reversed) by its last half; two codes
# that differ in at most one digit must agree on at least one of those halves.
for my $code (@codes) {
    index_code($code, $frontwards_index);
    index_code(scalar reverse($code), $backwards_index);
}

my @new_codes = map generate_code($frontwards_index, $backwards_index), 1 .. 10000;

sub index_code {
    my ($code, $index) = @_;
    push @{ $index->{ substr($code, 0, length($code) / 2) } }, $code;
    return;
}

sub check_index {
    my ($code, $index) = @_;
    # XOR the strings and count differing characters; <= 1 means a conflict.
    my $found = grep { ($_ ^ $code) =~ y/\0//c <= 1 }
                @{ $index->{ substr($code, 0, length($code) / 2) } || [] };
    return $found;
}

sub generate_code {
    my ($frontwards_index, $backwards_index) = @_;
    my $new_code;
    do {
        $new_code = sprintf("%07d", rand(10000000));
    } while check_index($new_code, $frontwards_index)
          || check_index(scalar reverse($new_code), $backwards_index);
    index_code($new_code, $frontwards_index);
    index_code(scalar reverse($new_code), $backwards_index);
    return $new_code;
}
Put the numbers 0 through 9,999,999 in an augmented binary search tree. The augmentation is to keep track of the number of sub-nodes to the left and to the right. So for example when your algorithm begins, the top node should have value 5,000,000, and it should know that it has 5,000,000 nodes to the left, and 4,999,999 nodes to the right. Now create a hashtable. For each value you've used already, remove its node from the augmented binary search tree and add the value to the hashtable. Make sure to maintain the augmentation.
To get a single value, follow these steps.
Use the top node to determine how many nodes are left in the tree. Let's say you have n nodes left. Pick a random number k between 0 and n. Using the augmentation, you can find the kth node in your tree in log(n) time.
Once you've found that node, compute all the values that would make the value at that node invalid. Let's say your node has value 1,111,111. If you already have 2,111,111 or 3,111,111 or... then you can't use 1,111,111. Since there are 9 other options per digit and 7 digits, you only need to check 63 possible values. Check to see if any of those values are in your hashtable. If you haven't used any of those values yet, you can use your random node. If you have used any of them, then you can't.
Remove your node from the augmented tree. Make sure that you maintain the augmented information.
If you can't use that value, return to step 1.
If you can use that value, you have a new random code. Add it to the hashtable.
Now, checking to see if a value is available takes O(1) time instead of O(n) time. Also, finding another available random value to check takes O(log n) time instead of... ah... I'm not sure how to analyze your algorithm.
Long story short, if you start from scratch and use this algorithm, you will generate a complete list of valid codes in O(n log n). Since n is 10,000,000, it will take a few seconds or something.
Did I do the math right there everybody? Let me know if that doesn't check out or if I need to clarify anything.
Use a hash.
After generating a successful code (one not conflicting with any existing code), put that code in the hash table, and also put the 63 other codes that differ from it by exactly one digit into the hash.
To see if a randomly generated code will conflict with an existing code, just check if that code exists in the hash.
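A minimal Python sketch of that approach (the question is in Perl, so treat this as an illustration of the idea rather than a drop-in):

import random

def one_digit_variants(code):
    """The 63 codes that differ from `code` in exactly one digit."""
    for i, ch in enumerate(code):
        for d in "0123456789":
            if d != ch:
                yield code[:i] + d + code[i + 1:]

def generate_codes(n):
    codes = []
    blocked = set()    # every accepted code plus all of its one-digit variants
    while len(codes) < n:
        candidate = "%07d" % random.randrange(10 ** 7)
        if candidate in blocked:
            continue   # equal to, or one digit away from, an existing code
        codes.append(candidate)
        blocked.add(candidate)
        blocked.update(one_digit_variants(candidate))
    return codes

print(len(generate_codes(10000)))   # 10000 codes, each conflict check is O(1)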
How about:
Generate a 6 digit code by autoincrementing the previous one.
Generate a 1 digit code by incrementing the previous one mod 10.
Concatenate the two.
Presto, guaranteed to differ in two digits. :D
(Yes, I'm being slightly facetious. I'm assuming that 'random', or at least quasi-random, is necessary. In which case, generate a 6-digit random key, repeat until it's not a duplicate (i.e. make the column unique and repeat until the insert doesn't fail the constraint), then generate a check digit, as someone already said.)