Fastest database engine to store huge string list - mysql

I have a huge list of unique strings (1,000,000,000+ lines).
I need to know whether a given string exists in this list or not.
What is the fastest way to do it?
I guess I need a very simple database engine with a B-tree index that lets me do fast lookups... and MySQL may be too slow and complex for this.

If this is all you need to do, you should take a long look at tries and related data structures specialized for strings (e.g. suffix array). With this many strings, you are guaranteed to have a lot of overlap, and these data structures can eliminate such overlap (saving not only memory but also processing time).
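If you do go the database route instead, a minimal MySQL sketch of the B-tree lookup described in the question might look like the following; the table and column names are made up, and the VARCHAR length is just an assumption:

    -- Hypothetical table for exact-match lookups; the PRIMARY KEY is a B-tree index.
    CREATE TABLE strings (
        s VARCHAR(255) NOT NULL,
        PRIMARY KEY (s)
    ) ENGINE=InnoDB;

    -- Existence check: resolved by a single B-tree lookup on the primary key.
    SELECT EXISTS(SELECT 1 FROM strings WHERE s = 'some string') AS found;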

Related

How does a relational database organize data?

I was thinking that a relational database will store every possible query and the values to return for that query in a hash table.
So like, if each entry in your table had 5 attributes, then you would make a copy of that element for each subset of the 5 attributes that appear in any given query that should return that specific entry. So every individual entry would appear 2^5 = 32 times in the table. This seems like it would be very memory inefficient for large data sets with many entries, but it also allows for the fastest possible query time.
Do real world relational-databases have a mixed version of this where some response time for queries/lookups is traded off for more memory efficiency? If so, how would this be implemented?
That's not how relational databases store data. Keep in mind it's a lot more than 2^5 combinations, because you can write queries that contain expressions, not simply references to attribute columns. There are also queries that are joins, which expands the possibilities immensely.
Even if you could store all possible combinations, it would be a waste because most of them will never be needed.
Instead, databases typically store records, where a record includes all columns of one table. If you run a query that only needs some columns, the DBMS still fetches the whole record, and simply ignores columns that you didn't ask for. Then it evaluates any expressions in your query. And finally returns the result set.
MySQL does not use hash tables to store these records; it uses a B+Tree data structure, so looking up a record by its primary key takes O(log n) time.
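A minimal sketch of what that looks like in practice (table, column names, and values are made up):

    -- In InnoDB the primary key is a clustered B+Tree;
    -- looking up one id touches only O(log n) pages.
    CREATE TABLE items (
        id INT NOT NULL PRIMARY KEY,
        a  INT,
        b  INT,
        c  VARCHAR(100)
    ) ENGINE=InnoDB;

    -- Even though only column b is requested, the whole record is read
    -- from the leaf page; the other columns are simply not returned.
    SELECT b FROM items WHERE id = 42;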

MySQL multiple rows vs storing values all in one string

I was just wondering about the efficiency of storing a large number of boolean values inside a CHAR or VARCHAR
    data
    "TFTFTTF"

vs

    isFoo   isBar   isText
    false   true    false
Would it be worth the worse performance to switch to storing these values in this manner? I figured it would just be easier to set a single value rather than having all of those other fields.
thanks
Don't do it. MySQL offers types such as char(1) and tinyint that occupy the same space as a single character. In addition, MySQL offers enumerated types, if you want your flags to have more than one value -- and for the values to be recognizable.
That last point is the critical point. You want your code to make sense. The string 'FTF' does not make sense. The columns isFoo, isBar, and isText do make sense.
There is no need to obfuscate your data model.
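For illustration, a minimal sketch of the explicit-columns schema, using the column names from the question; the types are one reasonable choice, since BOOLEAN in MySQL is just a synonym for TINYINT(1):

    CREATE TABLE flags (
        id     INT NOT NULL PRIMARY KEY,
        isFoo  TINYINT(1) NOT NULL DEFAULT 0,   -- BOOLEAN is a synonym for TINYINT(1)
        isBar  TINYINT(1) NOT NULL DEFAULT 0,
        isText TINYINT(1) NOT NULL DEFAULT 0
    );

    -- Each flag is readable and writable on its own:
    UPDATE flags SET isBar = 1 WHERE id = 7;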
This would be a bad idea: not only does it have no advantage in terms of the space used, it also has a bad influence on query performance and the comprehensibility of your data model.
Disk Space
In terms of storage usage, it makes no real difference whether the data is stored in a single varchar(n) or char(n) column or in multiple tinyint, char(1) or bit(1) columns. Only when using varchar would you need 1 to 2 bytes more disk space per entry.
For more information about the storage requirements of the different data types, see the MySQL documentation.
Query Performance
If the boolean values were stored in a varchar, the search for all entries where a specific value is true would take much longer, since string operations would be necessary to find the correct entries. Even when searching for a combination of boolean values such as "TFTFTFTFTT", the query would still take longer than if the boolean values were stored in individual columns. Furthermore, you can assign indexes to single columns like isFoo or isBar, which has a great positive effect on query performance.
Data Model
A data model should be as comprehensible as possible and, if possible, independent of any implementation considerations.
Realistically, a database field should only contain one atomic value, that is to say: a value that can't be subdivided into separate parts.
Columns that do not contain atomic values:
cannot be sorted
cannot be grouped
cannot be indexed
So let's say you want to find all rows where isFoo is true: you wouldn't be able to do it unless you did string operations like "take the character at isFoo's position in the string and see if it equals 'T'". This would imply a full table scan for every query, which would degrade performance quite dramatically.
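To make this concrete, a hedged comparison; flags, packed_flags, and the data column are illustrative names, assuming a schema like the one sketched above:

    -- Separate columns: an index on isFoo can be used directly.
    SELECT * FROM flags WHERE isFoo = 1;

    -- Packed string such as 'FTF': a string operation on every row,
    -- i.e. a full table scan, and no ordinary index can help.
    SELECT * FROM packed_flags WHERE SUBSTRING(data, 1, 1) = 'T';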
It depends on what you want to do after storing the data in this format.
After retrieving such a record you would have to do further processing on the server side, which worsens performance if you want to load the data by checking specific conditions, and the logic on the server would become complex.
The columns isFoo, isBar, and isText would help you write better queries.

How do columnar databases do indexing?

I understand that columnar databases put column data together on disk rather than rows. I also understand that in a traditional row-wise RDBMS, the leaf index node of a B-Tree contains a pointer to the actual row.
But since columnar databases don't store rows together, and they are particularly designed for columnar operations, how do they differ in their indexing techniques?
Do they also use B-trees?
How do they index inside whatever data structure they use?
Or is there no accepted format, and every vendor has their own indexing scheme to cater to their needs?
I have been searching, but have been unable to find any text. Every text I found is about row-wise DBMSs.
There are no B-Trees. (Or, if there are, they are not the main part of the design.)
InfiniDB stores 64K rows per chunk. Each column in that chunk is compressed and indexed. Stored with the chunk is a list of things like the min, max, and avg for each column, which may or may not help in queries.
Running a SELECT first looks at that summary info for each chunk to see if the WHERE clause might be satisfied by any of the rows in the chunk.
The chunks that pass that filtering get looked at in more detail.
There is no copy of a row. Instead, if, say, you ask for SELECT a,b,c, then the compressed info for 64K rows (in one chunk) for each of a, b, c needs to be decompressed to further filter and deliver the rows. So, it behooves you to list only the desired columns, not blindly say SELECT *.
Since every column is separately indexed all the time, there is no need to say INDEX(a). (I don't know if INDEX(a,b) can even be specified for a columnar DB.)
Caveat: I am describing InfiniDB, which is available with MariaDB. I don't know about any other columnar engines.
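For reference, a rough sketch of what using that engine looks like, assuming the MariaDB ColumnStore engine (the successor of InfiniDB) is installed; the table and columns are made up:

    -- No explicit indexes are declared; every column gets per-chunk min/max metadata.
    CREATE TABLE facts (
        a INT,
        b INT,
        c VARCHAR(50)
    ) ENGINE=ColumnStore;

    -- List only the columns you need, so only those columns' chunks are decompressed.
    SELECT a, b, c FROM facts WHERE a BETWEEN 100 AND 200;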
If you understand
1) how columnar DBs actually store the data, and
2) how indexes work (how they store the data),
then you may feel that there is no need for indexing in columnar DBs.
For any kind of database the rowid is very important; it is like the address where the data is stored.
Indexing is nothing but mapping the rowids to the indexed column values, in sorted order.
Columnar databases are built on this logic. They try to store the data in this fashion itself, meaning they store the data as serialized key-value pairs where the actual column value is the key and the rowid where the data resides is its value, and if they find any duplicates for a key, they just compress and store them.
So if you compare the format in which columnar databases actually store data on disk, it is almost the same as how row-oriented databases store indexes (not exactly the same, because of compression and because the key-value pair is stored the other way around).
That's the reason you don't need separate indexing again, and you won't find columnar databases trying to implement indexing.
Columnar indexes (also known as "vertical data storage") store data in a hashed and compressed form. All columns involved in the index key are indexed separately. Hashing decreases the volume of data stored. The compression method keeps only one value for repetitive occurrences (a dictionary, possibly partial).
This technique has two major difficulties:
First, you can have collisions, because a hash result can be the same for two distinct values, so the index must manage collisions.
Second, the hashing and compression algorithms used are very heavy consumers of resources such as CPU.
These types of indexes are stored as vectors.
Ordinarily, these indexes are used only for read-only tables, especially for business intelligence (OLAP databases).
A columnar index can be used in a "seekable" way only for an equality predicate (COLUMN_A = OneValue), but it is also adequate for GROUP BY or DISTINCT operations. A columnar index does not support range seeks, including LIKE 'foo%'.
Some database vendors have worked around the huge resources needed while inserting or updating by adding intermediate structures that decrease the CPU load. This is the case for Microsoft SQL Server, which uses a delta store for newly modified rows. With this technique, the table can be used in a relational way like any classical OLTP database.
For instance, Microsoft SQL Server first introduced the columnstore index in the 2012 version, but it made the table read-only. In the 2014 version the clustered columnstore index (in which all the columns of the table are indexed) was released and the table became writable. And finally, in the 2016 version, the columnstore index, clustered or not, no longer requires any part of the table to be read-only.
This was made possible because a particular search algorithm, named "Batch Mode", was developed by Microsoft Research; it does not work by reading the data row by row.
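For illustration, creating such indexes in SQL Server looks roughly like this (a sketch only; the table dbo.Sales and its columns are made up):

    -- Clustered columnstore index: the whole table is stored column-wise (SQL Server 2014+).
    CREATE CLUSTERED COLUMNSTORE INDEX ccsi_sales ON dbo.Sales;

    -- Or a nonclustered columnstore index on selected columns, leaving the row store in place.
    CREATE NONCLUSTERED COLUMNSTORE INDEX ncsi_sales
        ON dbo.Sales (OrderDate, Quantity, Price);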
To read:
Enhancements to SQL Server Column Stores
Columnstore and B+ tree – Are Hybrid Physical Designs Important?

Strategies for large databases with changing schemas

We have a MySQL database table with hundreds of millions of rows. We run into issues performing any kind of operation on it. For example, adding columns is becoming impossible to do within any kind of predictable time frame. When we want to roll out a new column, the "ALTER TABLE" command takes forever, so we don't have a good idea as to what the maintenance window is.
We're not tied to keeping this data in mysql, but I was wondering if there are strategies for mysql or databases in general, for updating schemas for large tables.
One idea, which I don't particularly like, would be to create a new table with the old schema plus the additional column, and run queries against a view which unions the results until all data could be moved to the new table's schema.
Right now we already run into issues where deleting large numbers of records based on a WHERE clause exits with an error.
Ideas?
In MySQL, you can create a new table using an entity-attribute-value (EAV) model. This would have one row per entity and attribute, rather than putting the attribute in a new column.
This is particularly useful for sparse data. Words of caution: types are problematic (everything tends to get turned into strings) and you cannot define foreign key relationships.
EAV models are particularly useful for sparse values -- when you have attributes that only apply to a small number of rows. They may prove useful in your case.
In NoSQL data models, adding new attributes or lists of attributes is simpler. However, there is no relationship to the attributes in other rows.
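A minimal sketch of such an EAV table (names and types are illustrative):

    -- One row per (entity, attribute) pair instead of one column per attribute.
    CREATE TABLE entity_attributes (
        entity_id  BIGINT       NOT NULL,
        attribute  VARCHAR(64)  NOT NULL,
        attr_value VARCHAR(255),             -- everything tends to end up as a string
        PRIMARY KEY (entity_id, attribute)
    );

    -- "Adding a column" becomes inserting rows; no ALTER TABLE needed.
    INSERT INTO entity_attributes (entity_id, attribute, attr_value)
    VALUES (42, 'new_attribute', 'some value');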
Columnar databases (at least the one in MariaDB) are very frugal with space -- some say 10x smaller than InnoDB. The shrinkage alone may be well worth it for 100M rows.
You have not explained whether your data is sparse. If it is, JSON is not that costly in space -- you completely leave out any 'fields' that are missing: zero space. With almost any other approach, there is at least some overhead for missing cells.
As you suggest, use regular columns for common fields. But also for the main fields that you are likely to search on. Then throw the rest into JSON.
I like to compress the JSON string (in the client) and use a BLOB. This gives 3x shrinkage over using uncompressed TEXT.
I dislike the one-row per attribute EAV approach; it is very costly in space, JOINs, etc, etc.
[More thoughts] on EAV.
Do avoid ALTER whenever possible.
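A hedged sketch of that layout; column names and types are made up, and the compression is assumed to happen in the client before the INSERT:

    CREATE TABLE wide_data (
        id      BIGINT   NOT NULL PRIMARY KEY,
        user_id BIGINT   NOT NULL,        -- common / searchable fields stay as real columns
        created DATETIME NOT NULL,
        extra   BLOB,                     -- client-side compressed JSON holding the rest
        INDEX idx_user (user_id)
    );

    -- Filter on the real columns; decompress and parse `extra` in the client.
    SELECT extra FROM wide_data WHERE user_id = 123 AND created >= '2020-01-01';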

MySQL or NoSQL? Recommended way of dealing with large amount of data

I have a database which will be used by a large number of users to store random long strings (up to 100 characters). The table columns will be: userid, stringid and the actual long string.
So it will look pretty much like this:
Userid will be unique and stringid will be unique for each user.
The app is like a simple todo-list app, so each user will have an average of 50 todos.
I am using the stringid so that users will be able to delete a specific task at any given time.
I assume this todo app could end up with 7 million tasks in 3 years' time, and that scares me away from using MySQL.
So my question is: is this the recommended way of dealing with a large amount of data with long strings (every new task gets a new row)? And is MySQL the right database solution to choose for this kind of project?
I have not worked with large amounts of data yet, and I am trying to save myself trouble in the far future.
This is not a question of "large amounts" of data (MySQL handles large amounts of data just fine, and 2 million rows isn't a "large amount" in any case).
MySQL is a relational database. So if you have data that can be normalized, that is, distributed among a number of tables in a way that ensures every data point is saved only once, then you should use MySQL (or MariaDB, or any other relational database).
If you have schema-less data and speed is more important than consistency, then you can/should use some NoSQL database. Personally I don't see how a todo list would profit from NoSQL (it doesn't really matter in this case, but I guess as of now most programming frameworks have better support for relational databases than for NoSQL).
This is a pretty straightforward relational use case. I wouldn't see a need for NoSQL here.
The table you present should work fine; however, I personally would question the need for the compound primary key as you present it. I would probably have a primary key on stringid only to enforce uniqueness across all records, rather than a compound primary key across userid and stringid. I would then put a regular index on userid.
The reason for this is that, if you ever want to query by stringid only (e.g. for deletes or updates), you are not tied into always having to query across both fields to leverage your index (or having to add individual indexes on stringid and userid to enable querying by each field, which means more space in memory and on disk taken up by indexes).
As far as whether MySQL is the right solution, this would really be for you to determine. I would say that MySQL should have no problem handling tables with 2 million rows and 2 indexes on two integer id fields. This is assuming you have allocated enough memory to hold these indexes in memory. There is certainly a ton of information available on working with MySQL, so if you are just trying to learn, it would likely be a good choice.
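A sketch of the schema this answer suggests; the types are assumptions:

    CREATE TABLE todos (
        stringid BIGINT       NOT NULL PRIMARY KEY,  -- unique across all records
        userid   BIGINT       NOT NULL,
        task     VARCHAR(100) NOT NULL,
        INDEX idx_userid (userid)                    -- secondary index for per-user queries
    ) ENGINE=InnoDB;

    -- A task can be deleted by its id alone:
    DELETE FROM todos WHERE stringid = 12345;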
Regardless of what you consider a "large amount of data", modern DB engines are designed to handle a lot. The question of "Relational or NoSQL?" isn't about which option can support more data. Different relational and NoSQL solutions will handle the large amounts of data differently, some better than others.
MySQL can handle many millions of records; SQLite cannot (at least not as effectively). Mongo (NoSQL) attempts to hold its collections in memory (as well as on the file system), so I have seen it fail with fewer than 1 million records on servers with limited memory, although it offers sharding, which can help it scale more effectively.
The bottom line is: the number of records you store should not play into SQL vs. NoSQL decisions; that decision should be based on how you will save and retrieve the data. It sounds like your data is already normalized (e.g. UserID), and if you also want consistency when you, for example, delete a user (the TODO items also get deleted), then I would suggest using a SQL solution.
I assume that all queries will reference a specific userid. I also assume that the stringid is a dummy value used internally instead of the actual task-text (your random string).
Use an InnoDB table with a compound primary key on {userid, stringid} and you will have all the performance you need, due to the way a clustered index works.
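And a sketch of the compound-key alternative described here, again with assumed types:

    -- The clustered index is (userid, stringid), so all of one user's tasks
    -- are stored together and fetched with a single range scan.
    CREATE TABLE todos (
        userid   BIGINT       NOT NULL,
        stringid BIGINT       NOT NULL,
        task     VARCHAR(100) NOT NULL,
        PRIMARY KEY (userid, stringid)
    ) ENGINE=InnoDB;

    SELECT stringid, task FROM todos WHERE userid = 123;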