Do we use hash tables in practice? - language-agnostic

I just read about hash tables and am curious whether we use them in practice, because if I write a program that stores data in a hash table, the storage is only temporary. So why not use a database to store it instead?
In other words, what kinds of real-world programs rely on hash tables for their functioning?

You would use hash tables to store data while you are working. Using the database for that would in many cases be orders of magnitude slower than using in-memory hash tables. See for example:
http://en.wikipedia.org/wiki/Caching
Hash maps are about speed, not persistence.
Take a look at the other uses in the Uses section of the Hash table entry on Wikipedia:
http://en.wikipedia.org/wiki/Hash_table
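A minimal sketch of the caching idea, assuming a Python dict (a hash table under the hood) in front of some slow backing store; the function names are illustrative, not from any library:

```python
import time

_cache = {}  # hash table: key -> previously computed result

def slow_lookup(key):
    """Stand-in for a database query or other expensive operation."""
    time.sleep(0.1)  # simulate latency
    return key.upper()

def cached_lookup(key):
    if key not in _cache:              # O(1) average-case membership test
        _cache[key] = slow_lookup(key)
    return _cache[key]

cached_lookup("example")  # slow: hits the backing store
cached_lookup("example")  # fast: served from the in-memory hash table
```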

There is such a thing as an on-disk hash table, e.g. Tokyo Cabinet's hash database.
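Tokyo Cabinet has its own API; as a stand-in, Python's standard-library dbm module shows the same idea in a few lines: a hash table whose buckets live in a file, so the data survives process restarts.

```python
import dbm

with dbm.open("store.db", "c") as db:    # "c" = create the file if missing
    db[b"alice"] = b"alice@example.com"  # keys and values are bytes

with dbm.open("store.db", "r") as db:
    print(db[b"alice"])                  # persisted across opens
```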

A hash table is for fast access. Say you need to search a lot of records; scanning them all would be very expensive, whereas a hash function takes you almost directly to the portion you want to search. You may also design a database schema along hash-table lines, so hash tables are not necessarily temporary storage.
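A toy Python sketch of that "almost directly to the portion you want" behaviour: the hash of the key selects a single bucket, so only that bucket is scanned instead of every record.

```python
NUM_BUCKETS = 8
buckets = [[] for _ in range(NUM_BUCKETS)]

def put(key, value):
    buckets[hash(key) % NUM_BUCKETS].append((key, value))

def get(key):
    for k, v in buckets[hash(key) % NUM_BUCKETS]:  # only one bucket scanned
        if k == key:
            return v
    raise KeyError(key)

put("id-42", {"name": "Ada"})
print(get("id-42"))  # found without touching the other buckets
```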

Related

What is the effect of storing a lot of rows in a MySQL table (maybe more than one billion) for a search engine?

I want to store more than one billion rows in my search engine's database.
I think that once the number of rows grows past a certain point, the queries that fetch results come under pressure.
What is the best way to store a lot of data in MySQL? One approach I tested was splitting the data across several tables instead of using just one, but that forced me to write some complicated methods for fetching the data from the different tables on each search.
What is the effect of storing a lot of rows in a MySQL table (maybe more than one billion) for a search engine, and what can we do to increase the number of rows without a negative effect on fetch speed?
I think Google and other search engines use a technique of preparing some query sets and producing results based on a ranking algorithm; is this true?
I have tried storing data in split tables for better efficiency; it worked, but fetching became complicated.
Please give me a technical approach for storing more data in one table with minimal resource usage.
In any database, the primary methods for accessing large tables are:
Proper indexing (see the sketch after this list)
Partitioning (i.e. storing a single table in multiple "files")
These usually suffice, and should be enough for your data.
There are some additional techniques that depend on the data and database. Two that come to mind are:
Vertical/column partitioning: storing separate columns in separate table spaces.
Data compression for wide columns.
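A quick, hedged illustration of the first point, using SQLite (which ships with Python) as a stand-in for MySQL; table and column names are made up. EXPLAIN QUERY PLAN shows the planner switch from a full scan to an index seek once the index exists. Partitioning is not shown, since SQLite has no PARTITION BY; in MySQL it is plain DDL on the table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, term TEXT)")
con.executemany("INSERT INTO docs (term) VALUES (?)",
                [(f"term{i}",) for i in range(10000)])

query = "SELECT id FROM docs WHERE term = 'term1234'"
print(con.execute("EXPLAIN QUERY PLAN " + query).fetchall())  # full SCAN of docs

con.execute("CREATE INDEX idx_term ON docs(term)")
print(con.execute("EXPLAIN QUERY PLAN " + query).fetchall())  # SEARCH using idx_term
```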
You can easily store billions of rows in MySQL; for fast handling, you should do proper indexing. Splitting tables is mainly for normalization, i.e. column-wise distribution, not row-wise distribution.
This earlier link may help: How to store large number of records in MySql database?

How do columnar databases do indexing?

I understand that columnar databases put column data together on the disk rather than rows. I also understand that in a traditional row-wise RDBMS, a leaf index node of the B-tree contains a pointer to the actual row.
But since a columnar database doesn't store rows together, and is particularly designed for columnar operations, how do its indexing techniques differ?
Do they also use B-trees?
How do they index inside whatever data structure they use?
Or is there no accepted format, with every vendor having its own indexing scheme to cater to its needs?
I have been searching, but I am unable to find any text. Every text I found is for row-wise DBMSs.
There are no B-trees. (Or, if there are, they are not the main part of the design.)
InfiniDB stores 64K rows per chunk. Each column in that chunk is compressed and indexed. Stored with each chunk is a list of summaries (min, max, avg, etc.) for each column, which may or may not help in queries.
Running a SELECT first looks at that summary info for each chunk to see if the WHERE clause might be satisfied by any of the rows in the chunk.
The chunks that pass that filtering get looked at in more detail.
There is no copy of a row. Instead, if, say, you ask for SELECT a,b,c, then the compressed info for the 64K rows in a chunk needs to be decompressed for each of a, b, and c to further filter and deliver the rows. So it behooves you to list only the desired columns, not blindly say SELECT *.
Since every column is separately indexed all the time, there is no need to say INDEX(a). (I don't know if INDEX(a,b) can even be specified for a columnar DB.)
Caveat: I am describing InfiniDB, which is available with MariaDB. I don't know about any other columnar engines.
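A hedged Python sketch of the chunk-summary filtering described above (the general technique is sometimes called a zone map). The chunk size and data are made up and tiny; the point is that the per-chunk min/max lets whole chunks be skipped before any detailed, decompressed pass.

```python
CHUNK_ROWS = 4  # tiny for the example; InfiniDB uses 64K

column = [3, 7, 5, 2, 18, 21, 19, 25, 40, 44, 41, 47]
chunks = [column[i:i + CHUNK_ROWS] for i in range(0, len(column), CHUNK_ROWS)]
summaries = [(min(c), max(c)) for c in chunks]  # built once, at load time

def select_where_equals(target):
    hits = []
    for chunk, (lo, hi) in zip(chunks, summaries):
        if lo <= target <= hi:          # summary says chunk *might* match
            hits += [v for v in chunk if v == target]  # detailed pass
        # else: chunk skipped entirely, nothing decompressed
    return hits

print(select_where_equals(21))  # only the middle chunk is examined
```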
If you understand
1) how columnar DBs actually store the data, and
2) how indexes work (i.e. how they store the data),
then you may feel that there is no need for indexing in columnar DBs.
For any kind of database, the rowid is very important; it is like the address where the data is stored.
Indexing is nothing but mapping rowids to the indexed column values, in sorted order.
Columnar databases are built on exactly this logic: they store the data in that fashion to begin with. Each entry is a serialized key-value pair in which the actual column value is the key and the rowid where the data resides is the value; any duplicates of a key are simply compressed together.
So if you compare how columnar databases actually store data on disk, it is almost the same as how row-oriented databases store their indexes (not exactly the same, because of the compression and because the key-value roles are reversed).
That is why you don't need separate indexing again, and why you won't find columnar databases trying to implement indexing.
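A minimal Python sketch of the layout just described, with made-up data: the column value acts as the key, and the rowids where it occurs are the value, so duplicates collapse naturally and lookups by value need no separate index.

```python
rows = ["red", "blue", "red", "green", "blue", "red"]  # one column, rowids 0..5

column_store = {}
for rowid, value in enumerate(rows):
    column_store.setdefault(value, []).append(rowid)  # duplicate keys collapse

print(column_store)
# {'red': [0, 2, 5], 'blue': [1, 4], 'green': [3]}

# Lookup by value is already "indexed"; no extra structure is needed:
print(column_store["red"])  # [0, 2, 5]
```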
Columnar indexes (also known as "vertical data storage") store data in a hashed and compressed form. All columns involved in the index key are indexed separately. Hashing decreases the volume of data stored. The compression method keeps only one value for repeated occurrences (a dictionary, possibly a partial one).
This technique has two major difficulties:
First, you can have collisions, because a hash result can be the same for two distinct values. So the index must manage collisions.
Second, the hashing and compression algorithms used are very heavy consumers of resources such as CPU.
These indexes are stored as vectors.
Ordinarily, these indexes are used only for read-only tables, especially in business intelligence (OLAP) databases.
A columnar index can be used in a "seekable" way only for an equality predicate (COLUMN_A = OneValue). It is also adequate for GROUPING or DISTINCT operations. A columnar index does not support range seeks, including LIKE 'foo%'.
Some database vendors have gotten around the huge resources needed for inserting or updating by adding intermediate algorithms that decrease the CPU cost. This is the case for Microsoft SQL Server, which uses a delta store for newly modified rows. With this technique, the table can be used in a relational way like any classical OLTP database.
For instance, Microsoft SQL Server first introduced the columnstore index in the 2012 version, but it made the table read-only. In 2014 the clustered columnstore index (all the columns of the table indexed) was released, and the table became writable. And finally, in the 2016 version, the columnstore index, clustered or not, no longer demands that any part of the table be read-only.
This was made possible by a particular search algorithm named "Batch Mode", developed by Microsoft Research, which does not work by reading the data row by row...
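A hedged sketch of the dictionary compression described above, in Python for illustration (real engines do this over compressed vectors; all names here are made up). Repeated column values are stored once, the column becomes a vector of small codes, and an equality predicate resolves via a hash lookup in the dictionary, which is why equality seeks work but range and LIKE seeks get no such shortcut.

```python
values = ["DE", "FR", "DE", "US", "FR", "DE"]

dictionary = {}   # value -> code (one entry per distinct value)
codes = []        # the stored column: one small int per row
for v in values:
    code = dictionary.setdefault(v, len(dictionary))
    codes.append(code)

print(dictionary)  # {'DE': 0, 'FR': 1, 'US': 2}
print(codes)       # [0, 1, 0, 2, 1, 0]

# Equality predicate: hash lookup in the dictionary, then scan the codes.
target = dictionary.get("FR")
print([rowid for rowid, c in enumerate(codes) if c == target])  # [1, 4]
```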
To read :
Enhancements to SQL Server Column Stores
Columnstore and B+ tree – Are Hybrid Physical Designs Important?

MySQL DB normalization

I've got a single-table DB with 100K rows. There are about 30 columns: 28 of them are varchar/tinytext, one is an int primary key, and one is a blob.
My question is, in terms of performance, would it be better to separate the blob from the rest of the table and store it in its own table with a foreign key constraint to the primary id?
The table will eventually be turned into a SQLite persistent store for iOS Core Data, and a lot of the searching/filtering will be done via NSPredicate against the lighter varchar columns.
Sorry if this is too subjective, but I'm thinking there is a recommended way.
Thanks!
If you do SELECT * FROM table (which you shouldn't do if you don't actually need the BLOB field) then yes, the query will be faster, because in that case pages with the BLOB won't be touched.
If you frequently do SELECT f1, f2, f3 FROM table (all fields non-BLOBs), then yes, storing BLOBs in a separate table will make the query faster for the same reason - MySQL will have to read fewer pages.
If, however, the BLOB is selected frequently, then it makes no sense to keep it separately.
This totally depends on data usage.
If you need the blob data every time you query the table, there is no difference in having a separate table for it (as long as the blob data is unique in each row - that is, "as long as the database is normalized").
If you don't need the blob data but only metadata from the other columns, there may be a speed bonus when querying if the blob has its own table. Querying the blob data is slower though, as you need to query both tables.
The USUAL way is not to store blob data inside the database at all (at least not huge data), but to store the binary data in files and keep the file path in the database instead. This is recommended because binary data most likely doesn't benefit from being inside a DBMS (it is not indexable, sortable, groupable, ...), so there is no drawback to storing it in files, while the database isn't optimized for binary data (because, again, it can't do much with it anyway).
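A minimal sketch of that usual pattern, using SQLite and made-up names: the binary payload goes to a file, and only its path is stored in the database.

```python
import sqlite3, pathlib

blob_dir = pathlib.Path("blobs")
blob_dir.mkdir(exist_ok=True)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, "
            "name TEXT, blob_path TEXT)")

def insert_item(name, data):
    path = blob_dir / (name + ".bin")
    path.write_bytes(data)                      # binary payload lives on disk
    con.execute("INSERT INTO items (name, blob_path) VALUES (?, ?)",
                (name, str(path)))              # only the path goes in the DB

insert_item("photo1", b"\x89PNG\r\n")           # made-up example payload
name, blob_path = con.execute("SELECT name, blob_path FROM items").fetchone()
print(name, pathlib.Path(blob_path).read_bytes())
```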
In MySQL, blobs are stored on disk; only the pointer to the storage is kept in memory. Moving the blob to another table with a foreign key will not noticeably help your performance. I don't know whether this is the case for SQLite.

Fastest database engine to store huge string list

I have a huge list of unique strings (1,000,000,000+ lines).
I need to know whether a given string exists in this list or not.
What is the fastest way to do it?
I guess I need a very simple database engine with a B-tree index that lets me do fast lookups... and MySQL may be too slow and complex for this.
If this is all you need to do, you should take a long look at tries and related data structures specialized for strings (e.g. suffix array). With this many strings, you are guaranteed to have a lot of overlap, and these data structures can eliminate such overlap (saving not only memory but also processing time).
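A hedged sketch of the trie idea in Python: shared prefixes are stored once, and a membership test walks one node per character. At a billion strings you would want a compact implementation (a DAWG, or a library such as marisa-trie) rather than nested dicts, but the structure is the same.

```python
def trie_insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})  # shared prefixes reuse existing nodes
    node["$"] = True                    # end-of-word marker

def trie_contains(root, word):
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

root = {}
for s in ["apple", "applet", "application"]:  # heavy prefix overlap
    trie_insert(root, s)

print(trie_contains(root, "applet"))  # True
print(trie_contains(root, "app"))     # False (a prefix, not a member)
```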

Do I need to store domain names in MD5 form in the database?

I have a feeling that searching domain names is taking more time than usual in MySQL. The domain name column has a unique index, yet the query seems slow.
My question is: do I need to convert them to a binary form, say an MD5 hash or something?
Normally, keeping the domain names in a VARCHAR column, with a UNIQUE index defined on that field, is the simplest and most efficient way of managing your data.
Never add complexity (like binary storage or the BLOB data type) for the sake of one or two fields, as it will further deteriorate your MySQL performance.
Hope it helps.
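A small sanity check of this advice, using SQLite for illustration (the principle carries over to MySQL; table and column names are made up): a UNIQUE constraint on a plain text column already gives an index seek, so an extra MD5 layer adds work without changing the access path.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sites (domain TEXT UNIQUE)")  # implicit unique index
con.executemany("INSERT INTO sites VALUES (?)",
                [(f"host{i}.example.com",) for i in range(10000)])

plan = con.execute("EXPLAIN QUERY PLAN SELECT 1 FROM sites WHERE domain = ?",
                   ("host4242.example.com",)).fetchall()
print(plan)  # SEARCH sites USING COVERING INDEX (no table scan, no MD5 needed)
```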