indexing varchars without duplicating the data

indexing varchars without duplicating the data - mysql

I've huge data-set of (~1 billion) records in the following format
|KEY(varchar(300),UNIQE,PK)|DATA1(int)|DATA2(bool)|DATA4(varchar(10)|
Currently the data is stored in MySAM MYSQL table, but the problem is that the key data (10G out of 12G table size) is stored twice - once in the table and once as index. (the data is append only there won't ever be UPDATE query on the table)
There are two major actions that run against the data-set :
contains - Simple check if a key is found
count - Aggregation (mostly) functions according to the data fields
Is there a way to store the key data only once?
One idea I had is to drop the DB all together and simply create 2-5 char folder structure.
this why the data assigned to the key "thesimon_wrote_this" would be stored in the fs as
~/data/the/sim/on_/wro/te_/thi/s.data
This way the data set will function much as btree and the "contains" and data retrieval functions will run in almost O(1) (with the obvious HDD limitations).
This makes the backups pretty easy (backing up only files with A attribute) but the aggregating functions became almost useless as I need to grep 1 billion of files every time. The allocation unit size is irrelevant as I can adjust the file structure so that only 5% of the disk space is taken without use.
I'm pretty sure that there is another - much more elegant way to do that, I can't Google it out :).

It would seem like a very good idea to consider having a fixed-width, integral key, like a 64-bit integer. Storing and searching a varchar key is very slow by comparison! You can still add an additional index on the KEY column for fast lookup, but it shouldn't be your primary key.

Related

Duplicate table fields vs indexing only

I have a huge and very busy table (few thousands INSERT / second). The table stores loginlogs, it has a bigint ID which is not generated by MySQL but rather by pseudorandom generator on MySQL client.
Simply put, the table has loginlog_id, client_id, tons,of,other,columns,with,details,about,session....
I have few indexes on this table such as PRIMARY_KEY(loginlog_id) and INDEX(client_id)
In some other part of our system I need to fetch client_id based on loginlog_id. This does not happen that often (just few hundreds SELECT client_id FROM loginlogs WHERE loginlog_id=XXXXXX / second). Table loginlogs is read by various other scripts now and then, and always various columns are needed. But the most frequent call to read is for sure the above mentioned get client_id by loginlog_id.
My question is: should I create another table loginlogs_clientids and duplicate loginlog_id, client_id in there (this means another few thousands INSERTS, as for every loginlogs INSERT I get this new one). Or should I be happy with InnoDB handling my lookups by PRIMARY KEY efficiently.
We have tons of RAM (128GB, most of which is used by MySQL). Load of MySQL is between 40% and 350% CPU (we have 12 core CPU). When I tried to use the new table, I did not see any difference. But I am asking for the future, if our usage grows even more, what is the suggested approach? Duplicate or index?
Thanks!

No.
Looking up table data for a single row using the primary key is extremely efficient, and will take the same time for both tables.
Exceptions to that might be very large row sizes (e.g. 8KB+), and client_id is e.g. a varchar that is stored off-page, in which case you might need to read an additional data block, which at least theoretically could cost you some milliseconds.
Even if this strategy would have an advantage, you would not actually do it by creating a new table, but by adding an index (loginlog_id, client_id) to your original table. InnoDB stores everything, including the actual data, in an index structure, so that adding an index is basically the same as adding a new table with the same columns, but without (you) having the problem of synchronizing those two "tables".
Having a structure with a smaller row size can have some advantages for ranged scans, e.g. MySQL will evaluate select count(*) from tablename using the smallest index of the table, as it has to read less bytes. You already have such a small index (on client_id), so even in that regard, adding such an additonal table/index shouldn't have an effect. If you have any range scan on the primary key (which is probably unlikely for pseudorandom data), you may want to consider this though, or keep it in mind for cases when you have.

Work around to use two unique keys on partitioned table

We have a dataset of roughly 400M rows, 200G in size. 200k rows are added in a daily batch. It mainly serves as an archive that is indexed for full text search by another application.
In order to reduce the database footprint, the data is stored in plain myIsam.
We are considering a range-partitioned table to streamline the backup process, but cannot figure a good way to handle unique keys. We absolutely need two of them. One to be directly compatible with the rest of the schema (ex: custId), another to be compatible with the full text search app (ex: seqId).
My understanding is that partitions do not support more than one globally unique key. We would have to merge both unique keys (custId, seqId), which will not work in our case.
Am I missing something?

Most efficient way to store VARCHAR in 150B-row table

We have to ingest and store 150 billion records in our MySQL InnoDB database. One field in particular is a field that is a VARCHAR as is taking up a lot of space. Its characteristics:
Can be NULL
Highly duplicated but we can't de-dupe because it increases ingestion time exponentially
Average length is about 75 characters
It has to have an index as it will have to join with another table
We don't need to store it in human readable format but we need to be able to match it to another table which would have to have the same format for this column
I've tried the following:
Compressing the table, this helps with space but dramatically increases ingestion time, so I'm not sure compression is going to work for us
Tried hashing to SHA2 which reduced the string length to 56, which gives us reasonable space saving but just not quite enough. Also I'm not sure SHA2 will generate unique values for this sort of data
Was thinking about MD5 which would further reduce string length to probably the right level but not sure again whether MD5 is string enough to generate unique values to be able to match with another table

A hash function like MD5 produces a 128-bit hash in a string of 32 hex characters, but you can use UNHEX() to cut that in half to 16 binary characters, and store the result in a column of type BINARY(16). See my answer to What data type to use for hashed password field and what length?
MD5 has 2128 distinct hashes, or 340,282,366,920,938,463,463,374,607,431,768,211,456. The chances of two different strings resulting in a collision is pretty reasonably low, even if you have 15 billion distinct inputs. See How many random elements before MD5 produces collisions? If you're still concerned, use SHA1 or SHA2.
I'm a bit puzzled by your attempts to use a hash function, though. You must not care what the original string is, since you must understand that hashing is not reversible. That is, you can't get the original string from a hash.
I like the answer from #Data Mechanics, that you should enumerate the unique string inputs in a lookup table, and use a BIGINT primary key (a INT has only 4+ billion values so it isn't large enough for 15 billion rows).
I understand what you mean that you'd have to look up the strings to get the primary key. What you'll have to do is write your own program to do this data input. Your program will do the following:
Create an in-memory hash table to map strings to integer primary keys.
Read a line of your input
If the hash table does not yet have an entry for the input, insert that string into the lookup table and fetch the generated insert id. Store this as a new entry in your hash table, with the string as the key and the insert id as the value of that entry.
Otherwise the hash table does have an entry already, and just read the primary key bigint from the hash table.
Insert the bigint into your real data table, as a foreign key, along with other data you want to load.
Loop to step 2.
Unfortunately it would take over 1 TB of memory to hold a HashMap of 15 billion entries, even if you MD5 the string before using it as the key of your HashMap.
So I would recommend putting the full collection of mappings into a database table, and keep a subset of it in memory. So you have to do an extra step around 3. above, if the in-memory HashMap doesn't have an entry for your string, first check the database. If it's in the database, load it into the HashMap. If it isn't in the database, then proceed to insert it to the database and then to the HashMap.
You might be interested in using a class like LruHashMap. It's a HashMap with a maximum size (which you choose according to how much memory you can dedicate to it). If you put a new element when it's full, it kicks out the least recently referenced element. I found an implementation of this in Apache Lucene, but there are other implementations too. Just Google for it.

Is the varchar ordinary text? Such is compressible 3:1. Compressing just the one field may get it down to 25-30 bytes. Then use something like VARBINARY(99).
INT (4 bytes) is not big enough for normalizing 15 billion distinct values, so you need something bigger. BIGINT takes 8 bytes. BINARY(5) and DECIMAL(11,0) are 5 bytes each, but are messier to deal with.
But you are concerned by the normalization speed. I would be more concerned by the ingestion speed, especially if you need to index this column!
How long does it take to build the table? You haven't said what the schema is; I'll guess that you can put 100 rows in an InnoDB block. I'll say you are using SSDs and can get 10K IOPs. 1.5B blocks / 10K blocks/sec = 150K seconds = 2 days. This assumes no index other than an ordered PRIMARY KEY. (If it is not ordered, then you will be jumping around the table, and you will need a lot more IOPs; change the estimate to 6 months.)
The index on the column will effectively be a table 150 billion 'rows' -- it will take several terabytes just for the index BTree. You can either index the field as you insert the rows, or you can build the index later.
Building index as you insert, even with the benefit of InnoDB's "change buffer", will eventually slow down to not much faster than 1 disk hit per row inserted. Are you using SSDs? (Spinning drives are rated about 10ms/hit.) Let's say you can get 10K hits (inserts) per second. That works out to 15M seconds, which is 6 months.
Building the index after loading the entire table... This effectively builds a file with 150 billion lines, sorts it, then constructs the index in order. This may take a week, not months. But... It will require enough disk space for a second copy of the table (probably more) during the index-building.
So, maybe we can do the normalization in a similar way? But wait. You said the column was so big that you can't even get the table loaded? So we have to compress or normalize that column?
How will the load be done?
Multiple LOAD DATA calls (probably best)? Single-row INSERTs (change "2 days" to "2 weeks" at least)? Multi-row INSERTs (100-1000 is good)?
autocommit? Short transactions? One huge transaction (this is deadly)? (Recommend 1K-10K rows per COMMIT.)
Single threaded (perhaps cannot go fast enough)? Multi-threaded (other issues)?
My discussion of high-speed-ingestion.
Or will the table be MyISAM? The disk footprint will be significantly smaller. Most of my other comments still apply.
Back to MD5/SHA2. Building the normalization table, assuming it is much bigger than can be cached in RAM, will be a killer, too, no matter how you do it. But, let's get some of the other details ironed out first.
See also TokuDB (available with newer versions of MariaDB) for good high-speed ingestion and indexing. TokuDB will slow down some for your table size, whereas InnoDB/MyISAM will slow to a crawl, as I already explained. TokuDB also compresses automatically; some say by 10x. I don't have any speed or space estimates, but I see TokuDB as very promising.
Plan B
It seems that the real problem is in compressing or normalizing the 'router address'. To recap: Of the 150 billion rows, there are about 15 billion distinct values, plus a small percentage of NULLs. The strings average 75 bytes. Compressing may be ineffective because of the nature of the strings. So, let's focus on normalizing.
The id needs to be at least 5 bytes (to handle 15B distinct values); the string averages 75 bytes. (I assume that is bytes, not characters.) Add on some overhead for BTree, etc, and the total ends up somewhere around 2TB.
I assume the router addresses are rather random during the load of the table, so lookups for the 'next' address to insert is a random lookup in the ever-growing index BTree. Once the index grows past what can fit in the buffer_pool (less than 768GB), I/O will be needed more and more often. By the end of the load, approximately 3 out of 4 rows inserted will have to wait for a read from that index BTree to check for the row already existing. We are looking at a load time of months, even with SSDs.
So, what can be done? Consider the following. Hash the address with MD5 and UNHEX it - 16 bytes. Leave that in the table. Meanwhile write a file with the hex value of the md5, plus the router address - 150B lines (skipping the NULLs). Sort, with deduplication, the file. (Sort on the md5.) Build the normalization table from the sorted file (15B lines).
Result: The load is reasonably fast (but complex). The router address is not 75 bytes (nor 5 bytes), but 16. The normalization table exists and works.

You state its highly duplicated ?
My first thought would be to create another table with the actual varchar value and a primary int key pointing to this value.
Then the existing table can simply change to contain as a foreign key the reference to this value (and additionally be efficiently index able).

MySQL DB normalization

I've got a single table DB with 100K rows. There are about 30 columns and 28 of them are varchars / tiny text and one of them is an int primary key and one of them is a blob.
My question, is in terms of performance, would it be better to separate the blob from the rest of the table and store them in their own table with foreign key constraint to the primary id?
The table will eventually be turned into a sqlite persistent store for iOS core data and a lot of the searching / filtering will be done based on the NSPredicate for the lighter varchar columns.
Sorry if this is too subjective, but I'm thinking there is a recommended way.
Thanks!

If you do SELECT * FROM table (which you shouldn't if you don't need the BLOB field actually) then yes, the query will be faster because in that case pages with BLOB won't be touched.
If you do frequent SELECT f1, f2, f3 FROM table (all fields are non-BLOBs) then yes, storing BLOBS in a separate table will make the query faster because of the same reason - MySQL will have to read less pages.
If however the BLOB is selected frequently then it makes no sense to keep it separately.

This totally depends on data usage.
If you need the data everytime you query the table, there is no difference in haviong a separate table for it (as long as blob data is unique in each row - that is, "as long as the database is normalized").
If you don'T need the blob data but only metadata from other columns, there may be a speed bonus qhen querying if the blob has its own table. querying the blob data is slower thoguh, as you need to query bowth tables.
The USUAL way is not to store any blob data inside the database (at least not huge data), but store the binary data into files and have the fiel path inside the database instead. This is recommended, as binary data most likely doesn'T benefit from being inside a DBMS (not indexable, sortable, groupable, ..), so there is no drawback of storing it inside files, while the database isn't optimized for binary data ('cause, again, it can't do much with it anyway).

Blobs are stored on disk only the point to the storage is in memory in Mysql. Moving it to another table with a foreign key will not noticeably help your performance. Don't know if this is the case for sqlite.

SQL - performance in varchar vs. int

I have a table which has a primary key with varchar data type. And another table with foreign key as varchar datatype.
I am making a join statement using this pair of varchar datatype. Though I am dealing with few number of rows (say hunderd rows), it is taking 60ms. But when the system will finally be deployed, it will have around thousands of rows.
I read Performance of string comparison vs int join in SQL, and concluded that the performance of SQL Query depend upon DB and number of rows it is dealing with.
But when I am dealing with a very large amount of data, would it matter much?
Should I create a new column with a number datatype in both the table and join the table to reduce the time taken by the SQL Query.?

You should use the correct data type for that data that you are representing -- any dubious theoretical performance gains are secondary to the overhead of having to deal with data conversions.
It's really impossible to say what that is based on the question, but most cases are rather obvious. Where they are not obvious are in situations where you have a data element that is represented by a set of digits but which you do not treat as a number -- for example, a phone number.
Clues that you are dealing with this situation are:
leading zeroes that must be preserved
no arithmetic operations are carried out on the element.
string operations are carried out: eg. "take the last four characters"
If that's the case then you probably want to store your "number" as a varchar.

Yes, you should give that a shot. But before you do, make a test version of your db that you populate with the level of data you expect to have in production, and run some tests on not just SELECT, but also INSERT, UPDATE, and DELETE as well. Then make a version with integer keys, and perform equvialent tests.
The numeric-keys WILL be faster, for the simple reason that the keys are of smaller size, but the difference may not be noticeable. Don't blindly trust books when you can test and measure the difference yourself.
(One thing to remember: if there are occasions when all you need from a relation is the value you currently have as its key, your database may run significantly faster if you can skip entire table lookups by just referencing the foreign-key on the records you have.)

Should I create a new column with a number datatype in both the table and join the table to reduce the time taken by the SQL Query.?
If you're in a position where you can change the design of the database with ease then yes, your Primary Key should be an integer. Unless there is a really good reason to have an FK as a varchar, then they should be integers as well.
If you can't change the PK or FK fields, then make sure they're indexed properly. This will eventually become a bottleneck though.

It just does not sound right to me. It will use more space result in more reads etc. Then is the varchar the clustered index key? If so the table is going to get very fragmented.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008