How do columnar databases do indexing? - mysql

I understand that columnar databases put column data together on disk rather than rows. I also understand that in a traditional row-wise RDBMS, a leaf index node of a B-Tree contains a pointer to the actual row.
But since columnar databases don't store rows together, and they are designed specifically for column-oriented operations, how do their indexing techniques differ?
Do they also use B-trees?
How do they index inside whatever data structure they use?
Or is there no accepted format, and does every vendor have its own indexing scheme to suit its needs?
I have been searching, but am unable to find any text on this. Every text I found covers row-wise DBMSs.

There are no B-Trees. (Or, if there are, they are not the main part of the design.)
InfiniDB stores 64K rows per chunk. Each column in that chunk is compressed and indexed. Stored with the chunk is a list of summary values such as min, max, and avg for each column, which may or may not help in queries.
Running a SELECT first looks at that summary info for each chunk to see if the WHERE clause might be satisfied by any of the rows in the chunk.
The chunks that pass that filtering get looked at in more detail.
There is no copy of a row. Instead, if, say, you ask for SELECT a,b,c, then the compressed info for 64K rows (in one chunk) for each of a, b, and c needs to be decompressed to further filter and deliver the rows. So it behooves you to list only the desired columns, not blindly say SELECT *.
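As a rough mental model only (not InfiniDB's actual on-disk format or API; the chunk_summary table below is hypothetical), the chunk-elimination step behaves as if the engine first consulted a per-chunk summary and only then decompressed the surviving chunks:

    -- Hypothetical per-chunk summary: chunk_summary(chunk_id, col_name, min_val, max_val)
    -- For a query such as
    --   SELECT a, b, c FROM t WHERE a BETWEEN 100 AND 200;
    -- the engine conceptually does something like:
    SELECT chunk_id
    FROM   chunk_summary
    WHERE  col_name = 'a'
      AND  min_val <= 200
      AND  max_val >= 100;   -- only these chunks can possibly contain matching rows
    -- ...and then decompresses a, b and c only for the surviving chunks.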
Since every column is separately indexed all the time, there is no need to say INDEX(a). (I don't know whether INDEX(a,b) can even be specified for a columnar DB.)
Caveat: I am describing InfiniDB, which is available with MariaDB. I don't know about any other columnar engines.

If you understand
1) how columnar DBs actually store the data, and
2) how indexes work (i.e. how they store the data),
then you may feel that there is no need for indexing in columnar DBs.
For any kind of database the rowid is very important; it is like the address where the data is stored.
Indexing is nothing but mapping the indexed column values to rowids, in sorted order.
Columnar databases are built on this logic. They try to store the data in this fashion itself, meaning they store the data as serialized key-value pairs where the actual column value is the key and the rowid where the data resides is the value; if they find duplicates for a key, they simply compress and store them together.
So if you compare the on-disk format of a columnar database with how row-oriented databases store indexes, it is almost the same (not exactly the same, because of the compression and the reversed key-value representation).
That's the reason you don't need separate indexing again, and why you won't find columnar databases trying to implement indexing.
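To make the comparison concrete, here is a small sketch; the table, values, and layouts are made up for illustration and deliberately simplified, not any specific engine's on-disk format.

    -- A row-store table and a secondary index on one of its columns:
    CREATE TABLE orders (
        id     INT PRIMARY KEY,   -- acts as the rowid here
        city   VARCHAR(25),
        amount INT
    );
    CREATE INDEX idx_city ON orders (city);

    -- Conceptually, idx_city is a sorted list of (value -> rowid) pairs:
    --   'Austin' -> 2
    --   'Boston' -> 1, 3
    --
    -- A columnar engine stores the city column itself in roughly that same sorted,
    -- compressed (value -> list of rowids) shape, which is why a separate
    -- CREATE INDEX is usually unnecessary there.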

Columnar indexes (also known as "vertical data storage") store data in a hashed and compressed form. All columns involved in the index key are indexed separately. Hashing decreases the volume of data stored. The compression method keeps only one value for repeated occurrences (a dictionary, possibly partial).
This technique has two major difficulties:
First, you can have collisions, because a hash result can be the same for two distinct values, so the index must manage collisions.
Second, the hash and compression algorithms used are very heavy consumers of resources such as CPU.
These indexes are stored as vectors.
Ordinarily, these indexes are used only for read-only tables, especially for business intelligence (OLAP databases).
A columnar index can be used in a "seekable" way only for an equality predicate (COLUMN_A = OneValue), but it is also well suited to GROUP BY and DISTINCT operations. Columnar indexes do not support range seeks, including LIKE 'foo%'.
Some database vendors have gotten around the huge resources needed for inserts and updates by adding intermediate structures that reduce the CPU cost. This is the case for Microsoft SQL Server, which uses a delta store for newly modified rows. With this technique, the table can be used in a relational way like any classical OLTP database.
For instance, Microsoft SQL Server first introduced the columnstore index in the 2012 version, but it made the table read-only. In the 2014 version the clustered columnstore index (in which all the columns of the table are indexed) was released and the table became writable. Finally, in the 2016 version, the columnstore index, clustered or not, no longer requires any part of the table to be read-only.
This was made possible by a particular processing algorithm named "Batch Mode", developed by Microsoft Research, which does not work by reading the data row by row...
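For reference, here is a minimal SQL Server sketch of the two flavors mentioned above; the table and column names are invented for illustration, and the syntax applies to SQL Server 2016 and later.

    -- Hypothetical fact table:
    CREATE TABLE dbo.Sales (
        SaleID   INT           NOT NULL,
        SaleDate DATE          NOT NULL,
        Amount   DECIMAL(10,2) NOT NULL
    );

    -- Clustered columnstore: the whole table is stored column-wise and stays writable
    -- (new and modified rows land in the delta store until they are compressed).
    CREATE CLUSTERED COLUMNSTORE INDEX cci_Sales ON dbo.Sales;

    -- Alternatively, a nonclustered columnstore on a normal row-store table,
    -- covering only the columns used by analytic queries:
    -- CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_Sales ON dbo.Sales (SaleDate, Amount);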
To read:
Enhancements to SQL Server Column Stores
Columnstore and B+ tree – Are Hybrid Physical Designs Important?

Related

What is the effect of storing a lot of rows in a mysql table (maybe more than one billion) for a search engine?

I want to store more than one billion rows in my search engine DB.
I think that when the number of rows climbs past a certain point, fetch queries come under pressure to produce results.
What is the best way to store a lot of data in MySQL? One approach I tested was splitting the data across several tables instead of using just one table, but it forced me to write some complicated methods for fetching the data from the different tables in each search.
What is the effect of storing a lot of rows in a MySQL table (maybe more than one billion) for a search engine, and what can we do to increase the number of rows without negative effects on fetch speed?
I think Google and other search engines use a technique of rendering some query sets and producing results based on a ranking algorithm; is this true?
I have tried storing data in split tables for better efficiency, and it worked, but it made fetching complicated.
Please give me a technical approach for storing more data in one table with minimal resource usage.
In any database, the primary methods for accessing large tables are:
Proper indexing
Partitioning (i.e. storing a single table in multiple "files")
These usually suffice and should be sufficient for your data (a minimal sketch combining both follows below).
There are some additional techniques that depend on the data and the database. Two that come to mind are:
Vertical/column partitioning: storing separate columns in separate table spaces.
Data compression for wide columns.
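A minimal MySQL sketch combining the two primary techniques; the table and column names are hypothetical and the partition count is arbitrary.

    CREATE TABLE documents (
        doc_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        term   VARCHAR(64)     NOT NULL,
        url    VARCHAR(255)    NOT NULL,
        PRIMARY KEY (doc_id),
        KEY idx_term (term)                    -- proper indexing on the searched column
    ) ENGINE=InnoDB
    PARTITION BY HASH (doc_id) PARTITIONS 16;  -- partitioning: one logical table, many files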
You can easily store billions of rows in MySQL, and for fast handling you should do proper indexing. Splitting tables is mainly for normalization (column-wise distribution), not for row-wise distribution.
This earlier link may help: How to store large number of records in MySql database?

MySQL or NoSQL? Recommended way of dealing with large amounts of data

I have a database which will be used by a large number of users to store random long strings (up to 100 characters). The table columns will be: userid, stringid, and the actual long string.
So it will look pretty much like this:
The userid will be unique, and the stringid will be unique for each user.
The app is like a simple todo-list app, so each user will have an average of 50 todos.
I am using the stringid so that users will be able to delete a specific task at any given time.
I assume this todo app could end up with 7 million tasks in 3 years' time, and that scares me away from using MySQL.
So my question is whether this is the recommended way of dealing with large amounts of data with long strings (every new task gets a new row), and whether MySQL is the right database solution to choose for this kind of project.
I have no experience with large amounts of data yet, and I am trying to prepare for the future.
This is not a question of "large amounts" of data (MySQL handles large amounts of data just fine, and 2 million rows isn't a "large amount" in any case).
MySQL is a relational database. So if you have data that can be normalized, that is, distributed among a number of tables in a way that ensures every data point is saved only once, then you should use MySQL (or MariaDB, or any other relational database).
If you have schema-less data and speed is more important than consistency, then you can/should use some NoSQL database. Personally, I don't see how a todo list would profit from NoSQL (it doesn't really matter in this case, but I guess as of now most programming frameworks have better support for relational databases than for NoSQL).
This is a pretty straightforward relational use case. I wouldn't see a need for NoSQL here.
The table you present should work fine; however, I personally would question the need for the compound primary key as you present it. I would probably have a primary key on stringid only, to enforce uniqueness across all records, rather than a compound primary key across userid and stringid. I would then put a regular index on userid.
The reason for this is that if you want to query by stringid only (i.e. for deletes or updates), you are not tied into always having to query across both fields to leverage your index (or having to add individual indexes on stringid and userid to enable querying by each field, which means more space in memory and on disk taken up by indexes).
As far as whether MySQL is the right solution, this is really for you to determine. I would say that MySQL should have no problem handling tables with 2 million rows and 2 indexes on two integer id fields. This assumes you have allocated enough memory to hold these indexes in memory. There is certainly a ton of information available on working with MySQL, so if you are just trying to learn, it would likely be a good choice.
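A sketch of the layout described in this answer; the column types and the task column name are assumptions.

    CREATE TABLE todos (
        stringid INT UNSIGNED NOT NULL,
        userid   INT UNSIGNED NOT NULL,
        task     VARCHAR(100) NOT NULL,
        PRIMARY KEY (stringid),      -- enforces uniqueness across all records
        KEY idx_userid (userid)      -- regular index for per-user lookups
    ) ENGINE=InnoDB;

    -- Delete by stringid alone, without needing userid in the WHERE clause:
    DELETE FROM todos WHERE stringid = 12345;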
Regardless of what you consider a "large amount of data", modern DB engines are designed to handle a lot. The question of "relational or NoSQL?" isn't about which option can support more data. Different relational and NoSQL solutions handle large amounts of data differently, some better than others.
MySQL can handle many millions of records; SQLite cannot (at least not as effectively). Mongo (NoSQL) attempts to hold its collections in memory (as well as on the file system), so I have seen it fail with fewer than 1 million records on servers with limited memory, although it offers sharding, which can help it scale more effectively.
The bottom line is: the number of records you store should not drive the SQL vs. NoSQL decision; that decision should be based on how you will save and retrieve the data. It sounds like your data is already normalized (e.g. UserID), and if you also want consistency when you, for example, delete a user (the todo items also get deleted), then I would suggest using a SQL solution.
I assume that all queries will reference a specific userid. I also assume that the stringid is a dummy value used internally instead of the actual task text (your random string).
Use an InnoDB table with a compound primary key on {userid, stringid} and you will have all the performance you need, due to the way a clustered index works.
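A minimal sketch of that suggestion; the column types and the task column name are assumptions.

    CREATE TABLE todos (
        userid   INT UNSIGNED NOT NULL,
        stringid INT UNSIGNED NOT NULL,
        task     VARCHAR(100) NOT NULL,
        PRIMARY KEY (userid, stringid)   -- InnoDB clusters rows by this key
    ) ENGINE=InnoDB;

    -- All of a user's ~50 todos sit on adjacent pages, so this is a short range scan:
    SELECT stringid, task FROM todos WHERE userid = 42;

    -- And a single task can still be located and removed directly via the same key:
    DELETE FROM todos WHERE userid = 42 AND stringid = 7;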

sql query LIKE % on Index

I am using a MySQL database.
My website is divided into different elements (PRJ_12 for project 12, TSK_14 for task 14, DOC_18 for document 18, etc.). We currently store the references to these elements in our database as VARCHAR. These reference columns are indexed so selecting is faster.
We are thinking of splitting these columns into 2 columns (one column "element_type" containing PRJ and one "element_id" containing 12). We are considering this solution because we run a lot of queries containing LIKE '...%' (for example, retrieving all tasks of one user, no matter the id of the task).
However, splitting these columns in 2 will increase the number of indexed columns.
So, I have two questions:
Is a LIKE '...%' query on an indexed column really slower than a simple WHERE query (without LIKE)? I know that if the column is not indexed, it is not advisable to do WHERE ... LIKE '%' queries, but I don't really know how indexes work.
Splitting the reference column in two will double the number of indexed columns. Is that a problem?
Thanks,
1) A LIKE is always more costly than a full comparison (with =); however, it all comes down to the field data types and the number of records (unless we're talking about a huge table, you shouldn't have issues).
2) Multi-column indexes are not a problem. Yes, they make the index bigger, but so what? Data types and the total number of rows matter, but that's what indexes are for.
So go for it.
There are a number of factors involved, but in general, adding one more index to a table that has only one index already is unlikely to be a big problem. Some things to consider:
If the table is mostly read-only, then it is almost certainly not a problem. If updates are rare, then the indexes won't need to be modified often, meaning there will be very little extra cost (aside from the additional disk space).
If updates to existing records do not change either of those key values, then no index modification should be needed, and so again there would be no additional runtime cost.
DELETEs and INSERTs will need to update both indexes. So if those are the majority of the operations (and far exceed reads), then an additional index might incur measurable performance degradation (but it might not be a lot, and not noticeable from a human perspective).
The LIKE operator, as you describe its usage, should be fully optimized. In other words, the clause WHERE combinedfield LIKE 'PRJ%' should perform essentially the same as WHERE element_type = 'PRJ' if there is an index in both situations. The more expensive situation is using a wildcard at the beginning (e.g., LIKE '%abc%'). You can think of a LIKE search as being equivalent to looking up a word in a dictionary. The search for 'overf%' is basically the same as a search for 'overflow': you can do a "manual" binary search in the dictionary and quickly find the first word beginning with 'overf'. Searching for '%low', though, is much more expensive: you have to scan the entire dictionary in order to find all the words that end with "low".
Having two separate fields to represent two separate values is almost always better in the long run since you can construct more efficient queries, easily perform joins, etc.
So based on the given information, I would recommend splitting it into two fields and indexing both fields.
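A sketch comparing the two designs; the table and column names are made up, and the index behavior noted in the comments is the general rule for B-tree indexes rather than a measurement.

    -- Current design: one combined reference column.
    CREATE TABLE elements_combined (
        reference VARCHAR(20) NOT NULL,   -- e.g. 'PRJ_12', 'TSK_14'
        KEY idx_reference (reference)
    );
    SELECT * FROM elements_combined WHERE reference LIKE 'PRJ%';   -- prefix match: can use the index
    SELECT * FROM elements_combined WHERE reference LIKE '%_12';   -- leading wildcard: full scan

    -- Proposed design: type and id split into two columns.
    CREATE TABLE elements_split (
        element_type CHAR(3)      NOT NULL,   -- 'PRJ', 'TSK', 'DOC'
        element_id   INT UNSIGNED NOT NULL,
        KEY idx_type_id (element_type, element_id)
    );
    SELECT * FROM elements_split WHERE element_type = 'PRJ';                      -- uses the index
    SELECT * FROM elements_split WHERE element_type = 'PRJ' AND element_id = 12;  -- uses both key parts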

How to structure an extremely large table

This is more of a conceptual question. It's inspired by working with an extremely large table where even a simple query takes a long time (properly indexed). I was wondering whether there is a better structure than just letting the table grow continually.
By large I mean 10,000,000+ records that grow every day by something like 10,000/day. A table like that would hit 10,000,000 additional records every 2.7 years. Let's say that the more recent records are accessed the most, but the older ones need to remain available.
I have two conceptual ideas to speed it up.
1) Maintain a master table that holds all the data, indexed by date in reverse order. Create a separate view for each year that holds only the data for that year. Then when querying, and let's say the query is expected to pull only a few records from a three-year span, I could use a UNION to combine the three views and select from those.
2) The other option would be to create a separate table for every year, and then, again, use a UNION to combine them when querying.
Does anyone else have any other ideas or concepts? I know this is a problem Facebook has faced, so how do you think they handled it? I doubt they have a single table (status_updates) that contains 100,000,000,000 records.
The main RDBMS providers all have similar concepts in terms of partitioned tables and partitioned views (as well as combinations of the two).
There is one immediate benefit in that the data is now split across multiple conceptual tables, so any query that includes the partition key can automatically ignore any partition that the key would not be in.
From an RDBMS management perspective, having the data divided into separate partitions allows operations to be performed at the partition level: backup, restore, indexing, etc. This helps reduce downtime and allows for far faster archiving by simply removing an entire partition at a time.
There are also non-relational storage mechanisms such as NoSQL, map-reduce, etc., but ultimately how the data is used, loaded, and archived becomes the driving factor in deciding which structure to use.
10 million rows is not that large on the scale of large systems; partitioned systems can and will hold billions of rows.
Your second idea looks like partitioning.
I don't know how well it works, but there is support for partitioning in MySQL -- see, in its manual: Chapter 17. Partitioning.
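A minimal MySQL sketch of the per-year partitioning these answers describe; the table and column names are hypothetical.

    CREATE TABLE events (
        id         BIGINT UNSIGNED NOT NULL,
        created_at DATE            NOT NULL,
        payload    VARCHAR(255),
        PRIMARY KEY (id, created_at)              -- the partition column must be part of the key
    )
    PARTITION BY RANGE (YEAR(created_at)) (
        PARTITION p2022 VALUES LESS THAN (2023),
        PARTITION p2023 VALUES LESS THAN (2024),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

    -- A query restricted by the partition key touches only the matching partition
    -- (EXPLAIN shows the pruning), with no UNION needed in the SQL:
    SELECT COUNT(*) FROM events
    WHERE created_at BETWEEN '2023-01-01' AND '2023-12-31';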
There is a good scalability approach for these tables. UNION is one way, but there is a better way.
If your database engine supports "semantic partitioning", then you can split one table into partitions, with each partition covering some subrange (say, one partition per year). It will not affect anything in the SQL syntax except the DDL, and the engine will transparently run the hidden union logic and partitioned index scans with all the parallel hardware it has (CPU, I/O, storage).
For example, Sybase allows up to 255 partitions, as that is the limit of a union. But you will never need the UNION keyword in your queries.
Often the best plan is to have one table and then use database partitioning.
Or you can archive data and create a view over the archived and current data combined, keeping only the active data in the table that most functions reference. You will need a good (automated) archiving strategy, though, or you can lose data or fail to move things efficiently. This is typically more difficult to maintain.
What you're talking about is horizontal partitioning or sharding.

What is the optimal amount of data for a table?

How much data should be in a table so that reading is optimal? Assume that I have 3 varchar(25) fields. This is in MySQL.
I would suggest that you consider the following in optimizing your database design:
Consider what you want to accomplish with the database. Will you be performing a lot of inserts to a single table at very high rates? Or will you be performing reporting and analytical functions with the data?
Once you've determined the purpose of the database, define what data you need to store to perform whatever functions are necessary.
Normalize till it hurts. If you're performing transaction processing (the most common function for a database), then you'll want a highly normalized database structure. If you're performing analytical functions, then you'll want a more denormalized structure that doesn't have to rely on joins to generate report results.
Typically, if you've really normalized the structure till it hurts, then you need to take your normalization back a step or two to have a data structure that is both normalized and functional.
A normalized database is mostly pointless if you fail to use keys. Make certain that each table has a primary key defined. Don't use surrogate keys just because that's what you always see; consider what natural keys might exist in any given table. Once you are certain that you have the right primary key for each table, you need to define your foreign key references. Establishing explicit foreign key relationships rather than relying on implicit definitions will give you a performance boost, provide integrity for your data, and self-document the database structure.
Look for other indexes that exist within your tables. Do you have a column or set of columns that you will search against frequently, like a username and password field? Indexes can be on a single column or multiple columns, so think about how you'll be querying for data and create indexes as necessary for the values you'll query against. A short sketch of these key and index choices follows below.
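A minimal sketch of the keys-and-indexes advice above; the schema and names are invented for illustration, not taken from the question.

    CREATE TABLE users (
        user_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
        username VARCHAR(25)  NOT NULL,
        pw_hash  VARCHAR(255) NOT NULL,
        PRIMARY KEY (user_id),
        UNIQUE KEY idx_username (username)              -- frequently searched column
    ) ENGINE=InnoDB;

    CREATE TABLE orders (
        order_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
        user_id   INT UNSIGNED NOT NULL,
        placed_at DATETIME     NOT NULL,
        PRIMARY KEY (order_id),
        KEY idx_user_placed (user_id, placed_at),       -- composite index for common lookups
        CONSTRAINT fk_orders_user FOREIGN KEY (user_id) -- explicit foreign key
            REFERENCES users (user_id)
    ) ENGINE=InnoDB;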
The number of rows should not matter. Make sure the fields you're searching on are indexed properly. If you only have 3 varchar(25) fields, then you probably need to add a primary key that is not a varchar.
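For the table in the question (three varchar(25) fields), that advice might look like this; the column names are hypothetical.

    CREATE TABLE items (
        id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- non-varchar primary key
        field_a VARCHAR(25),
        field_b VARCHAR(25),
        field_c VARCHAR(25),
        KEY idx_field_a (field_a)   -- index whichever field you actually search on
    ) ENGINE=InnoDB;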
I agree that you should ensure that your data is properly indexed.
Apart from that, if you are worried about table size, you can always implement some type of data archival strategy later down the line.
Don't worry too much about this until you see problems cropping up, and don't optimise prematurely.
For optimal reading you should have an index. A table exists to hold the rows it was designed to contain. As the number of rows increases, the value of the index comes into play and reading remains brisk.
Phrased as such, I don't know how to answer this question. An indexed table of 100,000 records is faster than an unindexed table of 1,000.
What are your requirements? How much data do you have? Once you know the answer to these questions you can make decisions about indexing and/or partitioning.
This is a very loose question, so a very loose answer :-)
In general if you do the basics - reasonable normalization, a sensible primary key and run-of-the-mill queries - then on today's hardware you'll get away with most things on a small to medium sized database - i.e. one with the largest table having less than 50,000 records.
However once you get past the 50k - 100k rows, which roughly corresponds to the point when the RDBMS is likely to be memory constrained, then unless you have your access paths set up correctly (i.e. indexes) performance will start to fall off catastrophically. That is in the mathematical sense: in such scenarios it's not unusual to see performance deteriorate by an order of magnitude or two for a doubling in table size.
Obviously therefore the critical table size at which you need to pay attention will vary depending upon row size, machine memory, activity and other environmental issues, so there is no single answer, but it is well to be aware that performance generally does not degrade gracefully with table size and plan accordingly.
I have to disagree with Cruachan about "50k - 100k rows .... roughly correspond(ing) to the point when the rdbms is likely to be memory constrained". This blanket statement is just misleading without two additional data points: the approximate size of the row, and the available memory. I'm currently developing a database to find the longest common subsequence (a la bio-informatics) of lines within source code files, and I reached millions of rows in one table, even with a VARCHAR field of close to 1000, before it became memory constrained. So, with proper indexing and sufficient RAM (a gig or two), as regards the original question, with rows of 75 bytes at most, there is no reason why the proposed table couldn't hold tens of millions of records.
The proper amount of data is a function of your application, not of the database. There are very few cases where a MySQL problem is solved by breaking a table into multiple subtables, if that's the intent of your question.
If you have a particular situation where queries are slow, it would probably be more useful to discuss how to improve that situation by modifying query or the table design.