Anonymization of Account Numbers in 2TB of CSVs

I have ~2TB of CSVs where the first 2 columns contain two ID numbers. These need to be anonymized so the data can be used in academic research. The anonymization can be (but does not have to be) irreversible. These are NOT medical records, so I do not need the fanciest cryptographic algorithm.
The Question:
Standard hashing algorithms make really long strings, but I will have to do a bunch of ID-matching (i.e. 'for the subset of rows in the data containing ID XXX, do...') to process the anonymized data, so this is not ideal. Is there a better way?
For example, if I know there are ~10 million unique account numbers, is there a standard way of using the set of integers [1:10 million] as replacement/anonymized IDs?
The computational constraint is that data will likely be anonymized on a 32-core ~500GB server machine.

I will assume that you want to make a single pass: one CSV with ID numbers as input, another CSV with anonymized numbers as output. I will also assume the number of unique IDs is somewhere on the order of 10 million or less.
My thought is that it would be best to use some totally arbitrary one-to-one function from the set of ID numbers (N) to the set of de-identified numbers (D). This would be more secure: if you used some sort of hash function and an adversary learned what the hash was, the numbers in N could be recovered without too much trouble with a dictionary attack. Instead I suggest a simple lookup table: ID 1234567 maps to de-identified number 4672592, etc. The correspondence would be stored in another file, and an adversary without that file would not be able to do much.
With 10 million or fewer unique IDs, on a machine such as you describe, this is not a big problem. A sketch program in Python:
import csv
import random

# Filenames here are placeholders for your actual input/output.
mapping = {}
unused_numbers = list(range(10_000_000))
random.shuffle(unused_numbers)  # shuffled once, so pop() draws a random unused number in O(1)

with open('input.csv', newline='') as infile, \
     open('anonymized.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for record in reader:
        for i in (0, 1):  # the first two columns hold the ID numbers
            N = record[i]
            if N in mapping:
                D = mapping[N]
            else:
                D = unused_numbers.pop()
                mapping[N] = D
            record[i] = str(D)  # replace N with D in the record
        writer.writerow(record)

# Write the mapping to the lookup table file; keep this file secret.
with open('lookup_table.csv', 'w', newline='') as f:
    csv.writer(f).writerows(mapping.items())

It seems you don't care about the IDs being reversible, but if it helps, you can try one of the format-preserving encryption schemes. They are pretty much designed for this use case.
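For illustration only, here is a minimal sketch of the format-preserving idea in Python: a toy Feistel cipher that maps 16-digit numbers one-to-one onto 16-digit numbers. This is just to show the shape of the technique, not a vetted construction; a real deployment should use a standardized scheme such as NIST FF1, and the key here is an assumed placeholder.

import hashlib

def toy_fpe(n: int, key: bytes, rounds: int = 4, digits: int = 16) -> int:
    # Toy balanced Feistel network over [0, 10**digits): same-size output,
    # one-to-one and invertible given the key. NOT a vetted scheme.
    half = 10 ** (digits // 2)
    left, right = divmod(n, half)
    for r in range(rounds):
        f = int.from_bytes(hashlib.sha256(key + bytes([r]) + right.to_bytes(8, 'big')).digest(), 'big') % half
        left, right = right, (left + f) % half
    return left * half + right

print(toy_fpe(1234567890, b'my secret key'))  # deterministic; recoverable only with the key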
Otherwise, if hashes are too long, you can always just truncate them. Even if you replace each digit of the original ID with a hex digit from the hash, collisions are unlikely. You could first run over the file and check for collisions, though.
PS. If you end up doing hashing, make sure you prepend a salt of a reasonable size. Hashes of IDs in the range [1:10M] would be trivial to brute-force otherwise.
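A minimal sketch of the salted-then-truncated approach, assuming SHA-256 and a 16-byte salt (both arbitrary choices here):

import hashlib, os

salt = os.urandom(16)  # keep the salt secret; destroy it if you want irreversibility

def anonymize(account_id: str, digits: int = 10) -> str:
    digest = hashlib.sha256(salt + account_id.encode()).hexdigest()
    return digest[:digits]  # truncated hash; scan the output for collisions afterwards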

Related

Comparing a Set with a Big Collection of Sets

How to match a set against a big collection of sets stored in a database.
[The collection may have millions of sets.]
Detailed Statement
[Prerequisite] A cluster has a special property, which is a set of attributes.
I will get an entity having a set of attributes.
If I have an existing cluster with the exact same set of attributes (neither more nor less), then I will add the entity to that cluster. Otherwise I will create a new cluster whose property is the attribute set of the new entity.
That is the clustering process.
The problem is how I should store the data so that the system can run smoothly on a very large dataset without performance issues.
What kind of database should I use for this: SQL or NoSQL?
Possible solutions I thought of:
[MySQL] Store the attributes with the cluster in a table, so that clusterId to attributeId is an m:n relation [table cluster_attribute].
Whenever an entity comes, we run:
select clusterId, count(1) from cluster_attribute where attributeId in ("comma separated IDs of attributes") group by clusterId;
But this will not be good, since we may find a long list of clusterIds that satisfy the above query.
On the same table we could run a query like:
select a.clusterId, count(1) cnt from cluster_attributes a
inner join cluster_attributes b on a.clusterId = b.clusterId
where b.attributeId in ("comma separated IDs of attributes")
group by a.clusterId
having cnt = #sizeOfEntityAttributeSet;
But this will scan many rows, resulting in a slow query.
We could store the attributes as a sorted concatenation, joined by some character such as |, and index that column. That way we can query faster. But whenever I need to know which clusters have a certain attribute (A1), the query will be slow, since I will need a regexp search in MySQL.
Items in a set are non-duplicate: [a1,b1,c1] is valid, while [a1,b1,a1,c1] is not.
There are millions of sets, each with hundreds of items.
Have 2 columns in the table for searching. One is the exact, complete list of the values, sorted. It's a long string, probably TEXT. The other is a hash of that string. I might suggest MD5, then chop to 32 bits and put into INT UNSIGNED (or BINARY(4)). INDEX this column, but not UNIQUE.
Now, to check for existence, do likewise with the incoming 'set': build the string and compute the hash. Look up the hashed value in the table. It will give you only a few rows, including some duds. Double-check with the long string:
WHERE hash = $hash
AND str = '$str'
The lookup will be quite fast. The prep work (building the sorted string and computing the hash) will not be too difficult. It will be quite easy to code in, say, PHP.
Caveats:
This works only for an exact match of the set.
It scales quite well. If you have more than, say, a billion sets, then a 32-bit hash won't be adequate. (But BIGINT and a longer BINARY would work.)
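The answer suggests PHP; here is a minimal sketch of the same prep work in Python (table and column names are hypothetical):

import hashlib

def set_key(attrs):
    # Canonical sorted string plus its 32-bit hash (first 4 bytes of MD5),
    # matching the TEXT + INT UNSIGNED scheme described above.
    s = '|'.join(sorted(attrs))
    h = int.from_bytes(hashlib.md5(s.encode()).digest()[:4], 'big')
    return s, h

s, h = set_key(['b1', 'a1', 'c1'])
# SELECT clusterId FROM clusters WHERE hash = %s AND str = %s
# -- the indexed hash narrows it to a few rows; the full string weeds out the duds.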

How to store big matrix(data frame) that can be subsetted easily later

I will generate a big matrix (data frame) in R whose size is about 1300000*10000, about 50 GB. I want to store this matrix in an appropriate format, so that later I can feed the data into Python or other programs for analysis. Of course I cannot feed the data in all at once, so I have to subset the matrix and feed it in little by little.
But I don't know how to store the matrix. I can think of two ways, but neither seems appropriate:
(1) plain text (including a CSV or Excel table), because it is very hard to subset (e.g. if I just want some columns and some rows of the data);
(2) a database; I have searched for information about MySQL and SQLite, but it seems that the number of columns in a SQL database is limited (1024).
So I just want to know if there are any good strategies to store the data, so that I can subset it by row/column index or name.
Have separate column(s) for each of the few columns you need to search/filter on. Then put the entire 10K columns into some data format that is convenient for the client code to parse. JSON is one common possibility.
So the table would have 1.3M rows and perhaps 3 columns: an id (auto_increment, primary key), the column to search on, and a JSON blob, stored as datatype JSON or TEXT (depending on software version), holding the many data values.
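A minimal sketch of that layout, using SQLite for brevity (the answer describes MySQL; the table and column names are made up):

import json, sqlite3

conn = sqlite3.connect('matrix.db')
conn.execute('CREATE TABLE m (id INTEGER PRIMARY KEY, row_name TEXT, vals TEXT)')

# One table row per matrix row: the searchable key in its own (indexable)
# column, the 10K values packed into one JSON blob.
row_values = [0.12, 3.4, 5.6]  # in practice, the 10,000 values of one matrix row
conn.execute('INSERT INTO m (row_name, vals) VALUES (?, ?)',
             ('row_0001', json.dumps(row_values)))
conn.commit()

# Subset by row name; slice out the wanted columns client-side after parsing.
(vals_json,) = conn.execute("SELECT vals FROM m WHERE row_name = 'row_0001'").fetchone()
wanted = json.loads(vals_json)[0:2]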

MySQL PRIMARY KEYs: UUID / GUID vs BIGINT (timestamp+random)

tl;dr: Is assigning rows IDs of {unixtimestamp}{randomdigits} (such as 1308022796123456) as a BIGINT a good idea if I don't want to deal with UUIDs?
Just wondering if anyone has some insight into any performance or other technical considerations / limitations in regards to IDs / PRIMARY KEYs assigned to database records across multiple servers.
My PHP+MySQL application runs on multiple servers, and the data needs to be able to be merged. So I've outgrown the standard sequential / auto_increment integer method of identifying rows.
My research into a solution brought me to the concept of using UUIDs/GUIDs. However, the need to alter my code to deal with converting UUID strings to binary values in MySQL seems like a bit of a pain. I don't want to store the UUIDs as VARCHAR, for storage and performance reasons.
Another possible annoyance of UUIDs stored in a binary column is that row IDs aren't obvious when looking at the data in phpMyAdmin (I could be wrong about this), but straight numbers seem a lot simpler overall anyway, and are universal across any kind of database system with no conversion required.
As a middle ground I came up with the idea of making my ID columns a BIGINT and assigning IDs using the current unix timestamp followed by 6 random digits. So let's say my random number came out to be 123456; my generated ID today would come out as: 1308022796123456.
A roughly one-in-a-million chance of a conflict for rows created within the same second is fine with me. I'm not doing any sort of mass row creation quickly.
One issue I've read about with randomly generated UUIDs is that they're bad for indexes, as the values are not sequential (they're spread out all over the place). The UUID() function in MySQL addresses this by generating the first part of the UUID from the current timestamp. Therefore I've copied that idea of having the unix timestamp at the start of my BIGINT. Will my indexes be slow?
Pros of my BIGINT idea:
Gives me the multi-server/merging advantages of UUIDs
Requires very little change to my application code (everything is already programmed to handle integers for IDs)
Half the storage of a UUID (8 bytes vs 16 bytes)
Cons:
??? - Please let me know if you can think of any.
Some follow up questions to go along with this:
Should I use more or less than 6 random digits at the end? Will it make a difference to index performance?
Is one of these methods any "more random"?: getting PHP to generate 6 digits one at a time and concatenating them together, -vs- getting PHP to generate a number in the 1 - 999999 range and then zero-filling to ensure 6 digits.
Thanks for any tips. Sorry about the wall of text.
I have run into this very problem in my professional life. We used timestamp + random number and ran into serious issues when our applications scaled up (more clients, more servers, more requests). Granted, we (stupidly) used only 4 digits, and then changed to 6, but you would be surprised how often the errors still happened.
Over a long enough period of time, you are guaranteed to get duplicate key errors. Our application is mission critical, and therefore even the smallest chance it could fail due to inherently random behavior was unacceptable. We started using UUIDs to avoid this issue, and carefully managed their creation.
Using UUIDs, your index size will increase, and a larger index will result in poorer performance (perhaps unnoticeable, but poorer nonetheless). However, MySQL can store UUIDs compactly as BINARY(16) (never use VARCHAR as a primary key!!), and can handle indexing, searching, etc. pretty efficiently, even compared to BIGINT. The biggest performance hit to your index is almost always the number of rows indexed, rather than the size of the item being indexed (unless you want to index a LONGTEXT or something ridiculous like that).
To answer your question: BIGINT (with random numbers attached) will be OK if you do not plan on scaling your application/service significantly. If your code can handle the change without much alteration and your application will not explode if a duplicate key error occurs, go with it. Otherwise, bite the bullet and go for the more substantial option.
You can always implement a larger change later, like switching to an entirely different backend (which we are now facing... :P)
You can manually change the autonumber starting number.
ALTER TABLE foo AUTO_INCREMENT = ####
An unsigned INT can store up to 4,294,967,295; let's round it down to 4,290,000,000.
Use the first 3 digits for the server serial number, and the final 7 digits for the row id.
This gives you up to 429 servers (000 through 428), and up to 10 million IDs for each server.
So for server #172 you manually change the autonumber to start at 1,720,000,000, then let it assign IDs sequentially.
If you think you might have more servers, but fewer IDs per server, then adjust it to 4 digits per server and 6 for the ID (i.e. up to 1 million IDs).
You can also split the number using binary digits instead of decimal digits (perhaps 10 binary digits per server, and 22 for the ID. So, for example, server 76 starts at 2^22*76 = 318,767,104 and ends at 322,961,407).
For that matter, you don't even need a clean split. Take 4,294,967,295, divide it by the maximum number of servers you think you will ever have, and that's your spacing.
You could use a bigint if you think you need more identifiers, but that's a seriously huge number.
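As a quick sketch of the binary-split variant above (10 bits of server number, 22 bits of row ID):

def auto_increment_start(server_id: int, id_bits: int = 22) -> int:
    # Each server's block of IDs starts at server_id * 2**id_bits.
    return server_id << id_bits

# Server #76: ALTER TABLE foo AUTO_INCREMENT = 318767104
print(auto_increment_start(76))       # 318767104, start of server 76's range
print(auto_increment_start(77) - 1)   # 322961407, end of server 76's range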
Use the GUID as a unique index, but also calculate a 64-bit (BIGINT) hash of the GUID, store that in a separate NOT UNIQUE column, and index it. To retrieve, query for a match to both columns - the 64-bit index should make this efficient.
What's good about this is that the hash:
a. Doesn't have to be unique.
b. Is likely to be well-distributed.
The cost: extra 8-byte column and its index.
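A minimal sketch of computing such a hash column value in Python (taking the first 8 bytes of SHA-256 is just one reasonable choice):

import hashlib, uuid

def guid_hash64(guid: uuid.UUID) -> int:
    # First 8 bytes of SHA-256 of the GUID: well-distributed, not necessarily
    # unique, and fits a BIGINT UNSIGNED column.
    return int.from_bytes(hashlib.sha256(guid.bytes).digest()[:8], 'big')

g = uuid.uuid4()
# INSERT ... (guid, guid_hash) VALUES (%s, %s); later:
# SELECT ... WHERE guid_hash = %s AND guid = %s  -- the indexed hash narrows it down
print(guid_hash64(g))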
If you want to use the timestamp method, then do this:
Give each server a number. To that, append the process ID of the application doing the insert (or the thread ID; in PHP it's getmypid()). Then append how long that process has been alive/active (in PHP, see getrusage()), and finally add a counter that starts at 0 at each script invocation (i.e. each insert within the same script adds one to it).
Also, you don't need to store the full unix timestamp; most of those digits are for saying it's year 2011 and not year 1970. So if you can't get a number saying how long the process has been alive, then at least subtract a fixed timestamp representing today. That way you'll need far fewer digits.
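A rough Python sketch of that composite scheme (the field widths, server number, and epoch are arbitrary assumptions; the total stays within 18 digits so it fits a BIGINT):

import itertools, os, time

SERVER_ID = 3                  # hypothetical per-server number
EPOCH = 1_300_000_000          # fixed recent timestamp, so fewer digits are needed
_counter = itertools.count()   # restarts at 0 with each script invocation

def next_id() -> int:
    # server | pid | seconds since fixed epoch | per-invocation counter
    return int(f"{SERVER_ID:02d}"
               f"{os.getpid() % 10_000:04d}"
               f"{int(time.time()) - EPOCH:09d}"
               f"{next(_counter) % 1000:03d}")

print(next_id())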

Storing large prime numbers in a database

This problem struck me as a bit odd. I'm curious how you could represent a list of prime numbers in a database. I do not know of a single datatype that would be able to accurately and consistently store a large quantity of prime numbers. My concern is that when the prime numbers start to contain thousands of digits, it might be a bit difficult to reference them from the database. Is there a way to represent a large set of primes in a DB? I'm quite sure this topic has been approached before.
One of the issues that makes this difficult is that prime numbers cannot be broken down into factors. If they could, this problem would be much easier.
If you really want to store primes as numbers, and one of the things stopping you is that "prime numbers cannot be broken down into factors", there is another way: store each number as a list of remainders modulo some fixed base, ordered by position.
Small example:
2831781 == 2*100^3 + 83*100^2 + 17*100^1 + 81*100^0
List is:
81, 17, 83, 2
In a real application it is useful to split by modulus 2^32 (32-bit integers), especially if the processing application stores the prime numbers as byte arrays.
Storage in DB:
create table PRIMES
(
PRIME_ID NUMBER not null,
PART_ORDER NUMBER(20) not null,
PRIME_PART_VALUE NUMBER not null
);
alter table PRIMES
add constraint PRIMES_PK primary key (PRIME_ID, PART_ORDER) using index;
Inserts for the example above (the id 1647 is for illustration only):
insert into primes(PRIME_ID, PART_ORDER, PRIME_PART_VALUE) values (1647, 0, 81);
insert into primes(PRIME_ID, PART_ORDER, PRIME_PART_VALUE) values (1647, 1, 17);
insert into primes(PRIME_ID, PART_ORDER, PRIME_PART_VALUE) values (1647, 2, 83);
insert into primes(PRIME_ID, PART_ORDER, PRIME_PART_VALUE) values (1647, 3, 2);
The prime_id value can be assigned from an Oracle sequence:
create sequence seq_primes start with 1 increment by 1;
Get ID of next prime number to insert:
select seq_primes.nextval from dual;
Select the prime number's content for a specified id:
select PART_ORDER, PRIME_PART_VALUE
from primes where prime_id = 1647
order by part_order
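A small Python sketch of the splitting step, run here on the base-100 worked example above (use 2**32 in practice):

def to_limbs(n: int, base: int = 2**32):
    # Split a big integer into remainders ('limbs'), least significant first;
    # these become the (PART_ORDER, PRIME_PART_VALUE) rows.
    limbs = []
    while n:
        n, r = divmod(n, base)
        limbs.append(r)
    return limbs

print(to_limbs(2831781, 100))  # [81, 17, 83, 2] -- the worked example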
You could store them as binary data. They won't be human readable straight from the database, but that shouldn't be a problem.
Databases (depending on which) can routinely store numbers up to 38-39 digits accurately. That gets you reasonably far.
Beyond that you won't be doing arithmetic operations on them (accurately) in databases (barring arbitrary-precision modules that may exist for your particular database). But numbers can be stored as text up to several thousand digits. Beyond that you can use CLOB type fields to store millions of digits.
Also, it's worth noting that if you're storing sequences of prime numbers and your interest is in space-compression of that sequence, you could start by storing the difference between one number and the next rather than the whole number.
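A quick sketch of that delta encoding (the gaps between consecutive primes stay small even when the primes themselves are huge):

def deltas(primes):
    # Store each prime as its gap from the previous one.
    out, prev = [], 0
    for p in primes:
        out.append(p - prev)
        prev = p
    return out

print(deltas([2, 3, 5, 7, 11]))  # [2, 1, 2, 2, 4]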
This is a bit inefficient, but you could store them as strings.
If you are not going to use database-side calculations with these numbers, just store them as bit sequences of their binary representation (BLOB, VARBINARY etc.)
Here's my 2 cents' worth. If you want to store them as numbers in a database, then you'll be constrained by the maximum size of integer that your database can handle. You'd probably want a 2-column table, with the prime number in one column and its sequence number in the other. Then you'd want some indexes to make finding the stored values quick.
But you don't really want to do that, do you: you want to store humongous primes way beyond any integer datatype you've ever thought of. And you say that you are averse to strings, so it's binary data for you. (It would be for me too.) Yes, you could store them in a BLOB in a database, but what sort of facilities will the DBMS offer you for finding the n-th prime or checking the primality of a candidate integer?
How to design a suitable file structure? This is the best I could come up with after about 5 minutes' thinking:
1. Set a counter to 2.
2. Write the two bits which represent the first prime number.
3. Write them again, to mark the end of the section containing the 2-bit primes.
4. Set the counter to counter+1.
5. Write the 3-bit primes in order. (I think there are two: 5 and 7.)
6. Write the last of the 3-bit primes again, to mark the end of the section containing the 3-bit primes.
7. Go back to 4 and carry on, mutatis mutandis.
The point about writing the last n-bit prime twice is to provide you with a means to identify the end of the part of the file with n-bit primes in it when you come to read the file.
As you write the file, you'll probably also want to make note of the offsets into the files at various points, perhaps the start of each section containing n-bit primes.
I think this would work, and it would handle primes up to 2^(the largest unsigned integer you can represent). I guess it would be easy enough to find code for translating a 325467-bit (say) value into a big integer.
Sure, you could store this file as a BLOB but I'm not sure why you'd bother.
It all depends on what kinds of operations you want to do with the numbers. If you just store and look them up, then just use strings, and use a check constraint / domain datatype to enforce that they are numbers. If you want more control, then PostgreSQL will let you define custom datatypes and functions. You can, for instance, interface with the GMP library to get correct ordering and arithmetic for arbitrary-precision integers. Using such a library will even let you implement a check constraint that uses a probabilistic primality test to check whether the numbers really are prime.
The real question is actually whether a relational database is the correct tool for the job.
I think you're best off using a BLOB. How the data is stored in your BLOB depends on your intended use of the numbers. If you want to use them in calculations, I think you'll need to create a class or type to store the values as some variety of ordered binary value and allow them to be treated as numbers, etc. If you just need to display them, then storing them as a sequence of characters would be sufficient, and would eliminate the need to convert your calculable values to something displayable, which can be very time-consuming for large values.
Share and enjoy.
Probably not brilliant, but what if you stored them in some recursive data structure? You could store each as an int, its exponent, and a reference to the lower-bit numbers.
Like the string idea, it probably wouldn't be very good for memory. And query time would be increased due to the recursive nature of the query.

How does a hash table work? Is it faster than "SELECT * from .."

Let's say, I have :
Key | Indexes | Key-values
----+---------+------------
001 | 100001 | Alex
002 | 100002 | Micheal
003 | 100003 | Daniel
Let's say we want to search for 001. How does the fast searching process using a hash table work?
Isn't it the same as using "SELECT * from .." in MySQL? I have read a lot; people say that "SELECT *" searches from beginning to end, but a hash table does not. Why, and how?
By using a hash table, are we reducing the number of records we are searching? How?
Can anyone demonstrate how to insert into and retrieve from a hash table in MySQL query code? e.g.,
SELECT * from table1 where hash_value="bla" ...
Another scenario:
If the indexes are like S0001, S0002, T0001, T0002, etc., in MySQL I could use:
SELECT * from table WHERE value LIKE 'S%'
Isn't that the same, and faster?
A simple hash table works by keeping the items on several lists, instead of just one. It uses a very fast and repeatable (i.e. non-random) method to choose which list to keep each item on. So when it is time to find the item again, it repeats that method to discover which list to look in, and then does a normal (slow) linear search in that list.
By dividing the items up into 17 lists, the search becomes 17 times faster, which is a good improvement.
Although of course this is only true if the lists are roughly the same length, so it is important to choose a good method of distributing the items between the lists.
In your example table, the first column is the key: the thing we use to find the item. And let's suppose we will maintain 17 lists. To insert something, we perform an operation on the key called hashing. This just turns the key into a number. It doesn't return a random number, because it must always return the same number for the same key. But at the same time, the numbers must be "spread out" widely.
Then we take the resulting number and use modulus to shrink it down to the size of our list:
Hash(key) % 17
This all happens extremely fast. Our lists are in an array, so:
_lists[Hash(key) % 17].Add(record);
And then later, to find the item using that key:
Record found = _lists[Hash(key) % 17].Find(key);
Note that each list can just be any container type, or a linked list class that you write by hand. When we execute a Find in that list, it works the slow way (examine the key of each record).
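To make the scheme concrete, here is a minimal Python sketch of the 17-list table just described:

class HashTable:
    def __init__(self, buckets: int = 17):
        self._lists = [[] for _ in range(buckets)]  # 17 lists instead of one

    def add(self, key, record):
        # Fast, repeatable choice of list: hash the key, shrink with modulus.
        self._lists[hash(key) % len(self._lists)].append((key, record))

    def find(self, key):
        # Hashing picks the one list to look in; the slow linear search
        # then only has to examine that list's records.
        for k, record in self._lists[hash(key) % len(self._lists)]:
            if k == key:
                return record
        return None

table = HashTable()
table.add('001', 'Alex')
print(table.find('001'))  # Alex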
Do not worry about what MySQL is doing internally to locate records quickly. The job of a database is to do that sort of thing for you. Just run a SELECT [columns] FROM table WHERE [condition]; query and let the database generate a query plan for you. Note that you don't want to use SELECT *, since if you ever add a column to the table that will break all your old queries that relied on there being a certain number of columns in a certain order.
If you really want to know what's going on under the hood (it's good to know, but do not implement it yourself: that is the purpose of a database!), you need to know what indexes are and how they work. If a table has no index on the columns involved in the WHERE clause, then, as you say, the database will have to search through every row in the table to find the ones matching your condition. But if there is an index, the database will search the index to find the exact location of the rows you want, and jump directly to them.
Indexes are usually implemented as B+-trees, a type of search tree that uses very few comparisons to locate a specific element. Searching a B-tree for a specific key is very fast. MySQL is also capable of using hash indexes, but these tend to be slower for database uses. Hash indexes usually only perform well on long keys (character strings especially), since they reduce the size of the key to a fixed hash size. For data types like integers and real numbers, which have a well-defined ordering and fixed length, the easy searchability of a B-tree usually provides better performance.
You might like to look at the chapters in the MySQL manual and PostgreSQL manual on indexing.
http://en.wikipedia.org/wiki/Hash_table
Hash tables may be used as in-memory data structures. Hash tables may also be adopted for use with persistent data structures; database indices sometimes use disk-based data structures based on hash tables, although balanced trees are more popular.
I guess you could use a hash function to get the ID you want to select from. Like
SELECT * FROM table WHERE value = hash_fn(whatever_input_you_build_your_hash_value_from)
Then you don't need to know the id of the row you want to select, and can do an exact query. Since the row will always have the same id (because of the input you build the hash value from), you can always recreate the id through the hash function.
However, this isn't always true, depending on the size of the table and the maximum number of hash values (you often have "X mod hash-table-size" somewhere in your hash). To take care of this you need a deterministic strategy to use each time you get two values with the same id. Check Wikipedia for more info on this strategy; it's called collision handling and is covered in the same article as hash tables.
MySQL probably uses hash tables somewhere, because of the O(1) lookup that norheim.se mentioned above.
Hash tables are great for locating entries at O(1) cost where the key (that is used for hashing) is already known. They are in widespread use both in collection libraries and in database engines. You should be able to find plenty of information about them on the internet. Why don't you start with Wikipedia or just do a Google search?
I don't know the details of MySQL. If there is a structure in there called a "hash table", it is probably a kind of table that uses hashing for locating the keys. I'm sure someone else will tell you about that. =)
EDIT: (in response to comment)
Ok. I'll try to make a grossly simplified explanation: A hash table is a table where the entries are located based on a function of the key. For instance, say that you want to store info about a set of persons. If you store it in a plain unsorted array, you need to iterate over the elements in sequence to find the entry you are looking for. On average, this will need N/2 comparisons.
If, instead, you put all entries at indexes based on the first character of the person's first name (A=0, B=1, C=2, etc.), you will immediately be able to find the correct entry as long as you know the first name. This is the basic idea. You probably realize that some special handling (rehashing, or allowing lists of entries) is required to support multiple entries having the same first letter. If you have a well-dimensioned hash table, you should be able to get straight to the item you are searching for. That means approximately one comparison, with the disclaimer of the special handling just mentioned.
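A tiny Python sketch of that first-letter scheme (the names are invented for the example):

# 26 buckets, indexed by the first letter of the name (A=0 ... Z=25).
buckets = [[] for _ in range(26)]

def bucket_index(name: str) -> int:
    return ord(name[0].upper()) - ord('A')

for person in ['Alex', 'Micheal', 'Daniel']:
    buckets[bucket_index(person)].append(person)

print(buckets[bucket_index('Daniel')])  # ['Daniel'] -- found without scanning everyone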