How are non-leaf B-tree nodes physically represented in InnoDB?
Recall that a b-tree (more specifically a b+tree) has both leaf nodes and non-leaf nodes. In a b+tree all the leaf nodes sit below a tree of non-leaf or "internal" nodes and point to the pages that actually contain row data.
I know that non-leaf nodes are stored in the non-leaf node segment and use pages sort of like data pages. I have found ample documentation on how data pages are physically stored, but I haven't been able to find anything on what the non-leaf index pages look like.
In On learning InnoDB: A journey to the core, I introduced the innodb_diagrams project to document the InnoDB internals, which provides the diagrams used in this post. Later on in A quick introduction to innodb_ruby I walked through installation and a few quick demos of the innodb_space command-line tool.
The physical structure of InnoDB’s INDEX pages was described in The physical structure of InnoDB index pages. We’ll now look into how InnoDB logically structures its indexes, using some practical examples.
An aside on terminology: B+Tree, root, leaf, and level
InnoDB uses a B+Tree structure for its indexes. A B+Tree is particularly efficient when data doesn’t fit in memory and must be read from disk, as it ensures that a fixed maximum number of reads is required to access any requested data, based only on the depth of the tree, which scales nicely.
An index tree starts at a “root” page, whose location is fixed (and permanently stored in the InnoDB’s data dictionary) as a starting point for accessing the tree. The tree may be as small as the single root page, or as large as many millions of pages in a multi-level tree.
Pages are referred to as being “leaf” pages or “non-leaf” pages (also called “internal” or “node” pages in some contexts). Leaf pages contain actual row data. Non-leaf pages contain only pointers to other non-leaf pages, or to leaf pages. The tree is balanced, so all branches of the tree have the same depth.
InnoDB assigns each page in the tree a “level”: leaf pages are assigned level 0, and the level increments going up the tree. The root page level is based on the depth of the tree. All pages that are neither leaf pages nor the root page can also be called “internal” pages, if a distinction is important.
Leaf and non-leaf pages
For both leaf and non-leaf pages, each record (including the infimum and supremum system records) contains a “next record” pointer, which stores an offset (within the page) to the next record. The linked list starts at infimum and links all records in ascending key order, terminating at supremum. The records are not physically ordered within the page (they take whatever space is available at the time of insertion); their only order comes from their position in the linked list.
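To make that concrete, here is a toy model (plain Python, not InnoDB’s actual on-disk structures; the offsets and values are made up) of a page whose records sit at arbitrary offsets but are linked in key order from infimum to supremum:

# Toy model of the in-page singly-linked record list (illustrative only; a real
# page stores records as raw bytes at heap offsets, not as Python dicts).
page = {
    "infimum":  {"next": 170},                      # infimum points to the lowest key
    99:         {"key": 2, "value": "C", "next": "supremum"},
    170:        {"key": 0, "value": "A", "next": 128},
    128:        {"key": 1, "value": "B", "next": 99},
    "supremum": {"next": None},                     # supremum terminates the list
}

# Walking the "next record" pointers yields records in ascending key order,
# even though the physical offsets (170, 128, 99) are in no particular order.
offset = page["infimum"]["next"]
while offset != "supremum":
    record = page[offset]
    print(offset, record["key"], record["value"])
    offset = record["next"]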
Leaf pages contain the non-key values as part of the “data” contained in each record:
Non-leaf pages have an identical structure, but instead of non-key fields, their “data” is the page number of the child page, and instead of an exact key, they represent the minimum key on the child page they point to:
Pages at the same level
Most indexes contain more than one page, so multiple pages are linked together in ascending and descending order:
Each page contains pointers (in the FIL header) for “previous page” and “next page”, which for INDEX pages are used to form a doubly-linked list of pages at the same level (e.g. leaf pages at level 0 form one list, level 1 pages form a separate list, etc.).
A detailed look at a single-page table
Let’s take a look at most of what’s B+Tree related in a single index page:
Create and populate the table
The test table in use in the illustration above can be created and populated as follows (make sure you’re using innodb_file_per_table and the Barracuda file format):
CREATE TABLE t_btree (
i INT NOT NULL,
s CHAR(10) NOT NULL,
PRIMARY KEY(i)
) ENGINE=InnoDB;
INSERT INTO t_btree (i, s)
VALUES (0, "A"), (1, "B"), (2, "C");
While this table is quite small and not realistic, it does demonstrate nicely how records and record traversal work.
Verify the basic structure of the space file
The table should match what we’ve examined before, with the three standard overhead pages (FSP_HDR, IBUF_BITMAP, and INODE) followed by a single INDEX page for the root of the index, and in this case two unused ALLOCATED pages.
$ innodb_space -f t_btree.ibd space-page-type-regions
start  end  count  type
0      0    1      FSP_HDR
1      1    1      IBUF_BITMAP
2      2    1      INODE
3      3    1      INDEX
4      5    2      FREE (ALLOCATED)
The space-index-pages-summary mode will give us a count of records in each page, and is showing the expected 3 records:
$ innodb_space -f t_btree.ibd space-index-pages-summary
page  index  level  data  free   records
3     18     0      96    16156  3
4     0      0      0     16384  0
5     0      0      0     16384  0
(Note that space-index-pages-summary also shows the empty ALLOCATED pages as empty pages with zero records, since that’s often what you’re interested in for plotting purposes.)
The space-indexes mode will show the stats about our PRIMARY KEY index, which is consuming a single page on its internal file segment:
$ innodb_space -f t_btree.ibd space-indexes
id  root  fseg      used  allocated  fill_factor
18  3     internal  1     1          100.00%
18  3     leaf      0     0          0.00%
Set up a record describer
In order for innodb_ruby to parse the contents of records, we need to provide a record describer, which is just a Ruby class providing a method that returns a description of an index:
class SimpleTBTreeDescriber < Innodb::RecordDescriber
type :clustered
key "i", :INT, :NOT_NULL
row "s", "CHAR(10)", :NOT_NULL
end
We need to note that this is the clustered key, provide the column descriptions for the key, and the column descriptions for the non-key (“row”) fields. It’s necessary to ask innodb_space to load this class with the following additional arguments:
-r ./simple_t_btree_describer.rb -d SimpleTBTreeDescriber
Look at the record contents
The root page (which is a leaf page) in this example can be dumped using the page-dump mode and providing the page number for the root page:
$ innodb_space -f t_btree.ibd -r ./simple_t_btree_describer.rb -d
SimpleTBTreeDescriber -p 3 page-dump
Aside from some parts of this output we’ve looked at before, it will now print a “records:” section with the following structure per record:
{:format=>:compact,
:offset=>125,
:header=>
{:next=>157,
:type=>:conventional,
:heap_number=>2,
:n_owned=>0,
:min_rec=>false,
:deleted=>false,
:field_nulls=>nil,
:field_lengths=>[0, 0, 0, 0],
:field_externs=>[false, false, false, false]},
:next=>157,
:type=>:clustered,
:key=>[{:name=>"i", :type=>"INT", :value=>0, :extern=>nil}],
:transaction_id=>"0000000f4745",
:roll_pointer=>
{:is_insert=>true, :rseg_id=>8, :undo_log=>{:page=>312, :offset=>272}},
:row=>[{:name=>"s", :type=>"CHAR(10)", :value=>"A", :extern=>nil}]}
This should align with the above detailed illustration perfectly, as I’ve copied most of the information from this example for accuracy. Note the following aspects:
The :format being :compact indicates that the record is the new “compact” format in Barracuda format tables (as opposed to “redundant” in Antelope tables).
The :key listed in the output is an array of key fields for the index, and :row is an array of non-key fields.
The :transaction_id and :roll_pointer fields are internal fields for MVCC included in each record, since this is a clustered key (the PRIMARY KEY).
The :next field within the :header hash is a relative offset (32) which must be added to the current record offset (125) to yield the actual offset of the next record (157). For convenience this calculated offset is included as :next in the record hash.
Recurse the index
A nice and simple output of recursing the entire index can be achieved with the index-recurse mode, but since this is still a single-page index, the output will be very short:
$ innodb_space -f t_btree.ibd -r ./simple_t_btree_describer.rb -d
SimpleTBTreeDescriber -p 3 index-recurse
ROOT NODE #3: 3 records, 96 bytes
RECORD: (i=0) -> (s=A)
RECORD: (i=1) -> (s=B)
RECORD: (i=2) -> (s=C)
Building a non-trivial index tree
A multi-level index tree (overly simplified) in InnoDB looks like:
As previously described, all pages at each level are doubly-linked to each other, and within each page, records are singly-linked in ascending order. Non-leaf pages contain “pointers” (containing the child page number) rather than non-key row data.
If we use the simpler table schema with 1 million rows created in A quick introduction to innodb_ruby, the tree structure looks a little more interesting:
$ innodb_space -f t.ibd -r ./simple_t_describer.rb -d SimpleTDescriber
-p 3 index-recurse
ROOT NODE #3: 2 records, 26 bytes
NODE POINTER RECORD >= (i=252) -> #36
INTERNAL NODE #36: 1117 records, 14521 bytes
NODE POINTER RECORD >= (i=252) -> #4
LEAF NODE #4: 446 records, 9812 bytes
RECORD: (i=1) -> ()
RECORD: (i=2) -> ()
RECORD: (i=3) -> ()
RECORD: (i=4) -> ()
NODE POINTER RECORD >= (i=447) -> #1676
LEAF NODE #1676: 444 records, 9768 bytes
RECORD: (i=447) -> ()
RECORD: (i=448) -> ()
RECORD: (i=449) -> ()
RECORD: (i=450) -> ()
NODE POINTER RECORD >= (i=891) -> #771
LEAF NODE #771: 512 records, 11264 bytes
RECORD: (i=891) -> ()
RECORD: (i=892) -> ()
RECORD: (i=893) -> ()
RECORD: (i=894) -> ()
This is a three-level index tree, which can be seen by the ROOT, INTERNAL, and LEAF lines above. We can see that some pages are nearly full; the internal node (page 36), for example, holds 1,117 records consuming over 14 KiB of the 16 KiB page.
Looking at a non-leaf page (page 36, in the above output) using the page-dump mode, records look slightly different than the leaf pages shown previously:
$ innodb_space -f t.ibd -r ./simple_t_describer.rb -d SimpleTDescriber
-p 36 page-dump
{:format=>:compact,
:offset=>125,
:header=>
{:next=>11877,
:type=>:node_pointer,
:heap_number=>2,
:n_owned=>0,
:min_rec=>true,
:deleted=>false,
:field_nulls=>nil,
:field_lengths=>[0],
:field_externs=>[false]},
:next=>11877,
:type=>:clustered,
:key=>[{:name=>"i", :type=>"INT UNSIGNED", :value=>252, :extern=>nil}],
:child_page_number=>4}
The :key array is present, although it represents the minimum key rather than an exact key, and no :row is present, as a :child_page_number takes its place.
The root page is a bit special
Since the root page is allocated when the index is first created, and that page number is stored in the data dictionary, the root page can never be relocated or removed. Once the root page fills up, it will need to be split, forming a small tree of a root page plus two leaf pages.
However, the root page itself can’t actually be split, since it cannot be relocated. Instead, a new, empty page is allocated, the records in the root are moved there (the root is “raised” a level), and that new page is split into two. The root page then does not need to be split again until the level immediately below it has enough pages that the root becomes full of child page pointers (called “node pointers”), which in practice often means several hundred to more than a thousand child pages.
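As a rough illustration of that root raise, here is a toy model (a sketch with a made-up, absurdly small page capacity; real InnoDB pages and page-split code are nothing like this):

# Toy model of raising the root: the root page itself never moves; its records are
# pushed down into a newly allocated page, which is then split, and the root is left
# holding only node pointers (minimum key on child, child page) to the two halves.
PAGE_CAPACITY = 4   # absurdly small, purely for illustration

class Page:
    def __init__(self, records, level=0):
        self.records = records     # sorted keys stand in for full records
        self.level = level
        self.node_pointers = []    # (minimum key on child, child page)

def raise_root(root):
    if len(root.records) <= PAGE_CAPACITY:
        return
    child = Page(root.records, root.level)       # allocate a new page, move records down
    mid = len(child.records) // 2
    right = Page(child.records[mid:], child.level)
    child.records = child.records[:mid]          # split the new page into two
    root.records = []
    root.level += 1                              # the root is now one level higher
    root.node_pointers = [(child.records[0], child), (right.records[0], right)]

root = Page([1, 2, 3, 4, 5])
raise_root(root)
print(root.level, [(k, c.records) for k, c in root.node_pointers])
# 1 [(1, [1, 2]), (3, [3, 4, 5])]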
B+Tree levels and increasing tree depth
As an example of the efficiency of B+Tree indexes, assume perfect record packing (every page full, which will never quite happen in practice, but is useful for discussion). A B+Tree index in InnoDB for the simple table in the examples above will be able to store 468 records per leaf page, or 1203 records per non-leaf page. The index tree can then be a maximum of the following sizes at the given tree heights:
Height  Non-leaf pages  Leaf pages     Rows            Size in bytes
1       0               1              468             16.0 KiB
2       1               1,203          > 563 thousand  18.8 MiB
3       1,204           1,447,209      > 677 million   22.1 GiB
4       1,448,413       1,740,992,427  > 814 billion   25.9 TiB
As you can imagine, most tables with sensible PRIMARY KEY definitions are 2-3 levels, with some achieving 4 levels. Using an excessively large PRIMARY KEY can cause the B+Tree to be much less efficient, however, since primary key values must be stored in the non-leaf pages. This can drastically inflate the size of the records in non-leaf pages, meaning many fewer of those records fit in each non-leaf page, making the whole structure less efficient.
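For the curious, the table above can be reproduced with a few lines of arithmetic (a sketch under the same perfect-packing assumption; the 468 and 1203 record counts are taken from the text above):

# Reproduce the capacity table: a tree of height h has 1203^(h-1) leaf pages,
# 1 + 1203 + ... + 1203^(h-2) non-leaf pages above them, and 468 rows per leaf page.
LEAF_RECORDS, NODE_POINTERS, PAGE_BYTES = 468, 1203, 16 * 1024

def human(n):
    for unit in ("KiB", "MiB", "GiB", "TiB"):
        n /= 1024.0
        if n < 1024:
            return f"{n:.1f} {unit}"
    return f"{n:.1f} TiB"

for height in range(1, 5):
    leaf_pages = NODE_POINTERS ** (height - 1)
    non_leaf_pages = sum(NODE_POINTERS ** i for i in range(height - 1))
    rows = leaf_pages * LEAF_RECORDS
    size = (leaf_pages + non_leaf_pages) * PAGE_BYTES
    print(f"height {height}: {non_leaf_pages} non-leaf, {leaf_pages} leaf, "
          f"{rows:,} rows, {human(size)}")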
Related
I am reading the docs to study the wasm binary format. I am finding it very tough to understand the composition of the element section. Can someone please give me an example/explanation of it? Maybe similar to the one given here.
The element segments section
The idea of this section is to fill the WebAssembly.Table objects with content. Initially there was only one table, and its only possible contents were indexes/ids of functions. You could write:
(elem 0 (offset (i32.const 1)) 2)
It means: during instantiation of the instance, fill index 1 of table 0 with a value of 2, as in tables[0][1] = 2;. Here 2 is the index of the function the table will store.
The type of element segment above is nowadays called active, and after instantiation it is no longer accessible to the application (it is "dropped"). From the specs:
An active element segment copies its elements into a table during instantiation, as specified by a table index and a constant expression defining an offset into that table
So far so good. But it was recognized that there is a need for a more powerful element segment section, so the passive and the declarative element segments were introduced.
A passive segment is not used during instantiation; it remains available at runtime until it is dropped by the application itself with elem.drop. There are instructions (from the Bulk memory and table instructions proposal, already integrated into the standard) that can be used to perform operations on tables and element segments.
A declarative element segment is not available at runtime but merely serves to forward-declare references that are formed in code with instructions like ref.func.
Here is the test suite, where you can see many examples of element segments (in a text format).
The binary format
Assuming that you are parsing the code: you read one u32, and based on its value you expect one of the following formats from the specification:
0 means an active segment, as the one above, with an implicit table index of 0 and a vector of func.refs (function indexes).
1 means a passive segment: an elemkind (0x00, for func.ref, at this time) followed by a vector of the respective items (func.refs).
2 means an active segment with an explicit table index and an elemkind.
3 means a declarative segment (elemkind plus a vector of func.refs).
4 means an active segment where the values in the vector are expressions, not just plain indexes (so you could have (i32.const 2) in the above example, instead of 2).
5 means a passive segment with expressions.
6 means an active segment with an explicit table index and expressions.
7 means a declarative segment with expressions.
For this reason the spec says that for this u32 in [0..7] you can use its three lower bits to detect which format you have to parse. For example, the third bit (bit 2) signifies "is the vector made of expressions?".
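As a small sketch of that bit test (my own helper, not something from the spec or a particular toolchain):

def describe_elem_segment_flags(prefix):
    """Interpret the low three bits of the element-segment prefix (0..7)."""
    assert 0 <= prefix <= 7
    uses_expressions = bool(prefix & 0b100)     # bit 2: vec(expr) vs elemkind + func indexes
    if prefix & 0b001:                          # bit 0: not a plain active segment
        mode = "declarative" if prefix & 0b010 else "passive"
        has_table_index = False
    else:
        mode = "active"
        has_table_index = bool(prefix & 0b010)  # bit 1: explicit table index follows
    return {"mode": mode,
            "explicit_table_index": has_table_index,
            "element_expressions": uses_expressions}

for p in range(8):
    print(p, describe_elem_segment_flags(p))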
Now, all that said, it seems that the reference types proposal is not (yet) fully integrated into the specification's binary format (though it seems to be in the text one). When it is, you will be able to have something other than 0x00 (func.ref) for an elemkind.
Some of these formats visibly overlap, but the specification evolves, and for backward compatibility with the earliest versions the encoding is like this today.
I think I understand how each of repartition, hive partitioning, and bucketing affects the number of output files, but I am not quite clear on how the various features interact. Can someone help fill in the number of output files for each of the situations below where I've left a blank? The intent is to understand the right code for a situation where I have a mix of high- and low-cardinality columns that I need to partition/bucket by, with frequent operations that filter on the low-cardinality columns and join on the high-cardinality columns.
Assume that we have a data frame df that starts with 200 input partitions, colA has 10 unique values, and colB has 1000 unique values.
First a few ones to check my understanding:
df.repartition(100) = 100 output files of the same size
df.repartition('colA') = 10 output files of different sizes, since each file will contain all rows for 1 value of colA
df.repartition('colB') = 1000 output files
df.repartition(50, 'colA') = 50 output files?
df.repartition(50, 'colB') = 50 output files, so some files will contain more than one value of colB?
Hive partitions:
output.write_dataframe(df, partition_cols=['colA']) = 1,000 output files (because I get potentially 100 files in each of the 10 hive partitions)
output.write_dataframe(df, partition_cols=['colB']) = 10,000 output files
output.write_dataframe(df, partition_cols=['colA', 'colB']) = 100,000 output files
output.write_dataframe(df.repartition('colA'), partition_cols=['colA']) = 10 output files of different sizes (1 file in each hive partition)
Bucketing:
output.write_dataframe(df, bucket_cols=[‘colB’], bucket_count=100) = 100 output files? In an experiment, this did not seem to be the case
output.write_dataframe(df, bucket_cols=[‘colA’], bucket_count=10) = 10 output files?
output.write_dataframe(df.repartition(‘colA’), bucket_cols=[‘colA’], bucket_count=10) = ???
All together now:
output.write_dataframe(df, partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200) = ???
output.write_dataframe(df.repartition(‘colA’, ‘colB’), partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200) = ??? -- Is this the command that I want to use in the end? And anything downstream would first filter on colA to take advantage of the hive partitioning, then join on colB to take advantage of the bucketing?
For hive partitioning + bucketing, the # of output files is not constant and will depend on the actual data in each input partition. To clarify, let's say df is 200 partitions, not 200 files. Output files scale with the # of input partitions, not the # of files; 200 files could be misleading, as that could be anywhere from 1 partition to 1000s of partitions.
First a few ones to check my understanding:
df.repartition(100) = 100 output files of the same size
df.repartition('colA') = 10 output files of different sizes, since each file will contain all rows for 1 value of colA
df.repartition('colB') = 1000 output files
df.repartition(50, 'colA') = 50 output files
df.repartition(50, 'colB') = 50 output files
Hive partitions:
output.write_dataframe(df, partition_cols=['colA']) = upper bound of 2,000 output files (200 input partitions * max 10 values per partition)
output.write_dataframe(df, partition_cols=['colB']) = max 200,000 output files (200 * 1000 values per partition)
output.write_dataframe(df, partition_cols=['colA', 'colB']) = max 2,000,000 output files (200 partitions * 10 values * 1000)
output.write_dataframe(df.repartition('colA'), partition_cols=['colA']) = 10 output files of different sizes (1 file in each hive partition)
Bucketing:
output.write_dataframe(df, bucket_cols=[‘colB’], bucket_count=100) = max 20,000 files (200 partitions * max 100 buckets per partition)
output.write_dataframe(df, bucket_cols=[‘colA’], bucket_count=10) = max 2,000 files (200 partitions * max 10 buckets per partition)
output.write_dataframe(df.repartition(‘colA’), bucket_cols=[‘colA’], bucket_count=10) = exactly 10 files (repartitioned dataset makes 10 input partitions, each partition outputs to only 1 bucket)
All together now:
output.write_dataframe(df, partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200) = I could be wrong on this, but I believe it's max of 400,000 output files (200 input partitions * 10 colA partitions * 200 colB buckets)
output.write_dataframe(df.repartition(‘colA’, ‘colB’), partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200) = I believe this is exactly 10,000 output files (repartition colA,colB = 10,000 partitions, each partition contains exactly 1 colA and 1 bucket of colB)
Background
The key to being able to reason about output file counts is understanding at which level each concept applies.
Repartition (df.repartition(N, 'colA', 'colB')) creates a new spark stage with the data shuffled as requested, into the specified number of shuffle partitions. This will change the number of tasks in the following stage, as well as the data layout in those tasks.
Hive partitioning (partition_cols=['colA', 'colB']) and bucketing (bucket_cols/bucket_count) only have any effect within the scope of the final stage's tasks, and affect how each task writes its data into files on disk.
In particular, each final-stage task will write one file per hive-partition/bucket combination present in its data. Combinations not present in a task will not produce an empty file when you're using hive-partitioning or bucketing.
Note: if not using hive-partitioning or bucketing, each task will write out exactly one file, even if that file is empty.
So in general you always want to make sure you repartition your data before writing to make sure the data layout matches your hive-partitioning/bucketing settings (i.e. each hive-partition/bucket combination is not split between multiple tasks), otherwise you could end up writing huge numbers of files.
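For example, a rough open-source PySpark equivalent of that advice might look like this (the output.write_dataframe wrapper above is platform-specific, so this sketch uses the plain DataFrameWriter API with made-up table and column data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up data with the cardinalities from the question:
# colA has 10 distinct values, colB has 1000 distinct values.
df = spark.range(0, 1_000_000).selectExpr(
    "id % 10 AS colA",
    "id % 1000 AS colB",
    "id AS value",
)

# Shuffle so each task holds exactly one colB bucket before writing, then
# hive-partition by colA and bucket by colB (bucketBy requires saveAsTable).
(df.repartition(200, "colB")
   .write
   .mode("overwrite")
   .partitionBy("colA")
   .bucketBy(200, "colB")
   .sortBy("colB")
   .saveAsTable("example_partitioned_bucketed"))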
Your examples
I think there is some misunderstanding floating around, so let's go through these one by one.
First a few ones to check my understanding:
df.repartition(100) = 100 output files of the same size
Yes - the data will be randomly, evenly shuffled into 100 partitions, causing 100 tasks, each of which will write exactly one file.
df.repartition('colA') = 10 output files of different sizes, since each file will contain all rows for 1 value of colA
No - the number of partitions to shuffle into is unspecified, so it will default to 200. So you'll have 200 tasks, at most 10 of which will contain any data (could be fewer due to hash collisions), so you will end up with 190 empty files, and 10 with data.
*Note: with AQE in spark 3, spark may decide to coalesce the 200 partitions into fewer when it realizes most of them are very small. I don't know the exact logic there, so technically the answer is actually "200 or fewer, only 10 will contain data".
df.repartition('colB') = 1000 output files
No - Similar to above, the data will be shuffled into 200 partitions. However in this case they will (likely) all contain data, so you will get 200 roughly-equally sized files.
Note: due to hash collisions, files may be larger or smaller depending on how many values of colB happened to land in each partition.
df.repartition(50, 'colA') = 50 output files?
Yes - Similar to before, except now we've overridden the partition count from 200 to 50. So 10 files with data, 40 empty. (or fewer because of AQE)
df.repartition(50, 'colB') = 50 output files, so some files will contain more than one value of colB?
Yes - Same as before, we'll get 50 files of slightly varying sizes depending on how the hashes of the colB values work out.
Hive partitions:
(I think the below examples are written assuming df is in 100 partitions to start rather than 200 as specified, so I'm going to go with that)
output.write_dataframe(df, partition_cols=['colA']) = 1,000 output files (because I get potentially 100 files in each of the 10 hive partitions 10)
Yes - You'll have 100 tasks, each of which will write one file for each colA value they see. So up to 1,000 files in the case the data is randomly distributed.
output.write_dataframe(df, partition_cols=['colB']) = 10,000 output files
No - Missing a 0 here. 100 tasks, each of which could write as many as 1,000 files (one for each colB value), for a total of up to 100,000 files.
output.write_dataframe(df, partition_cols=['colA', 'colB']) = 100,000 output files
No - 100 tasks, each of which will write one file for each combination of partition cols it sees. There are 10,000 such combinations, so this could write as many as 100 * 10,000 = 1,000,000 files!
output.write_dataframe(df.repartition('colA'), partition_cols=['colA']) = 10 output files of different sizes (1 file in each hive partition)
Yes - The repartition will shuffle our data into 200 tasks, but only 10 will contain data. Each will contain exactly one value of colA, so will write exactly one file. The other 190 tasks will write no files. So 10 files exactly.
Bucketing:
Again, assuming 100 partitions for df, not 200
output.write_dataframe(df, bucket_cols=[‘colB’], bucket_count=100) = 100 output files? In an experiment, this did not seem to be the case
No - Since we haven't laid out the data carefully, we have 100 tasks with (maybe) randomly distributed data. Each task will write one file per bucket it sees. So this could write up to 100 * 100 = 10,000 files!
output.write_dataframe(df, bucket_cols=[‘colA’], bucket_count=10) = 10 output files?
No - Similar to above, 100 tasks, each could write up to 10 files. So worst-case is 1,000 files here.
output.write_dataframe(df.repartition(‘colA’), bucket_cols=[‘colA’], bucket_count=10) = ???
Now that we're adjusting the data layout before writing, we'll have 200 tasks, at most 10 of which will contain any data. Each value of colA will exist in only one task.
Each task will write one file per bucket it sees. So we should get at most 10 files here.
Note: Due to hash collisions, one or more buckets might be empty, so we might not get exactly 10.
All together now:
Again, assuming 100 partitions for df, not 200
output.write_dataframe(df, partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200) = ???
100 tasks. 10 hive-partitions. 200 buckets.
Worst case is each task writes one file per hive-partition/bucket combination. i.e. 100 * 10 * 200 = 200,000 files.
output.write_dataframe(df.repartition(‘colA’, ‘colB’), partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200) = ??? -- Is this the command that I want to use in the end? And anything downstream would first filter on colA to take advantage of the hive partitioning, then join on colB to take advantage of the bucketing?
This one is sneaky. We have 200 tasks and the data is shuffled carefully so each colA/colB combination is in just one task. So everything seems good.
BUT each bucket contains multiple values of colB, and we have done nothing to make sure that an entire bucket is localized to one spark task.
So at worst, we could get one file per value of colB, per hive partition (colA value). i.e. 10 * 1,000 = 10,000 files.
Given our particular parameters, we can do slightly better by just focusing on getting the buckets laid out optimally:
output.write_dataframe(df.repartition(200, ‘colB’), partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200)
Now we're making sure that colB is shuffled exactly how it will be bucketed, so each task will contain exactly one bucket.
Then we'll get one file for each colA value in the task (likely 10 since colA is randomly shuffled), so at most 200 * 10 = 2,000 files.
This is the best we can do, assuming colA and colB are not correlated.
Conclusion
There's no one-size-fits-all approach to controlling file sizes.
Generally you want to make sure you shuffle your data so it's laid out in accordance with the hive-partition/bucketing strategy you're applying before writing.
However the specifics of what to do may vary in each case depending on your exact parameters.
The most important thing is to understand how these 3 concepts interact (as described in "Background" above), so you can reason about what will happen from first principles.
I am new to Cassandra and have just set up a Cassandra cluster (version 1.2.8) with 5 nodes, and I have created several keyspaces and tables on it. However, I found that all data is stored on one node (in the output below, I have replaced IP addresses with node numbers manually):
Datacenter: 105
==========
Address  Rack  Status  State   Load       Owns      Token
                                                    4
node-1   155   Up      Normal  249.89 KB  100.00%   0
node-2   155   Up      Normal  265.39 KB    0.00%   1
node-3   155   Up      Normal  262.31 KB    0.00%   2
node-4   155   Up      Normal   98.35 KB    0.00%   3
node-5   155   Up      Normal  113.58 KB    0.00%   4
and in their cassandra.yaml files, I use all default settings except cluster_name, initial_token, endpoint_snitch, listen_address, rpc_address, seeds, and internode_compression. Below I list those non-ip address fields I modified:
endpoint_snitch: RackInferringSnitch
rpc_address: 0.0.0.0
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
parameters:
- seeds: "node-1, node-2"
internode_compression: none
and all nodes using the same seeds.
Can you tell me where I might have gone wrong in the config? And please feel free to let me know if any additional information is needed to figure out the problem.
Thank you!
If you are starting with Cassandra 1.2.8 you should try using the vnodes feature. Instead of setting the initial_token, uncomment # num_tokens: 256 in cassandra.yaml, and leave initial_token blank or comment it out. Then you don't have to calculate token positions. Each node will randomly assign itself 256 tokens, and your cluster will be mostly balanced (within a few %). Using vnodes also means that you don't have to "rebalance" your cluster every time you add or remove nodes.
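In other words, the relevant part of cassandra.yaml on every node would look something like this (a sketch; only these two settings change, everything else stays as it is):

# cassandra.yaml (each node): enable vnodes instead of manual token assignment
num_tokens: 256
# initial_token:       <- left blank or commented out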
See this blog post for a full description of vnodes and how they work:
http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2
Your token assignment is the problem here. A node's assigned token determines its position in the ring and the range of data it stores. When you generate tokens, the aim is to spread them across the entire range from 0 to (2^127 - 1). Tokens aren't IDs like in MySQL Cluster, where you have to increment them sequentially.
There is a tool on git that can help you calculate the tokens based on the size of your cluster.
Read this article to gain a deeper understanding of the tokens. And if you want to understand the meaning of the numbers that are generated check this article out.
You should provide a replication_factor when creating a keyspace:
CREATE KEYSPACE demodb
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 3};
If you use DESCRIBE KEYSPACE x in cqlsh you'll see what replication_factor is currently set for your keyspace (I assume the answer is 1).
More details here
I am looking for the best way to search through a very large rainbow table file (13GB file). It is a CSV-style file, looking something like this:
1f129c42de5e4f043cbd88ff6360486f; somestring
78f640ec8bf82c0f9264c277eb714bcf; anotherstring
4ed312643e945ec4a5a1a18a7ccd6a70; yetanotherstring
... you get the idea - there are roughly 900 million lines, each with a hash, a semicolon, and a clear-text string.
So basically, the program should check whether a specific hash is listed in this file.
What's the fastest way to do this?
Obviously, I can't read the entire file into memory and then run strstr() on it.
So what's the most efficient way to do this?
Read the file line by line and run strstr() on each line?
Read a larger chunk of the file (e.g. 10,000 lines) and run strstr() on that chunk?
Or would it be more efficient to import all this data into a MySQL database and then search for the hash via SQL queries?
Any help is appreciated
The best way to do it would be to sort it and then use a binary search-like algorithm on it. After sorting it, it will take around O(log n) time to find a particular entry where n is the number of entries you have. Your algorithm might look like this:
1. Keep a start offset and an end offset. Initialize the start offset to zero and the end offset to the file size.
2. If start = end, there is no match.
3. Read some data from the offset (start + end) / 2.
4. Skip forward until you see a newline. (You may need to read more, but if you pick an appropriate size (bigger than most of your records) to read in step 3, you probably won't have to read any more.)
5. If the hash you're on is the hash you're looking for, go on to step 8.
6. Otherwise, if the hash you're on is less than the hash you're looking for, set start to the current position and go to step 2.
7. If the hash you're on is greater than the hash you're looking for, set end to the current position and go to step 2.
8. Skip to the semicolon and trailing space. The unhashed data will be from the current position to the next newline.
This can be easily converted into a while loop with breaks.
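For example, here is one possible sketch of that loop in Python (assuming the file has already been sorted by hash as plain text and keeps the "hash; string" line format; the step numbers in the comments refer to the list above, and the file name is made up):

import os

def lookup(path, target_hash):
    # Binary search over a hash-sorted "hash; string" file using byte offsets.
    with open(path, "rb") as f:
        start, end = 0, os.path.getsize(path)
        while start < end:
            f.seek((start + end) // 2)
            f.readline()                     # skip the partial line we landed in
            line_start = f.tell()
            line = f.readline()
            if not line or line_start >= end:
                # The window is down to a single record: scan it linearly.
                f.seek(start)
                while f.tell() < end:
                    h, _, s = f.readline().partition(b"; ")
                    if h.decode() == target_hash:
                        return s.rstrip(b"\r\n").decode()
                return None
            h, _, s = line.partition(b"; ")
            h = h.decode()
            if h == target_hash:
                return s.rstrip(b"\r\n").decode()   # step 8: the unhashed string
            elif h < target_hash:
                start = f.tell()                    # step 6
            else:
                end = line_start                    # step 7
    return None

print(lookup("rainbow.txt", "4ed312643e945ec4a5a1a18a7ccd6a70"))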
Importing it into MySQL with appropriate indexes would use a similarly efficient algorithm (or an even more efficient one, since the data is probably packed nicely).
Your last solution might be the easiest one to implement, as you move all of the performance optimization to the database (and databases are usually optimized for exactly that).
strstr() is not useful here, since it searches for an arbitrary substring; you know the specific format and can jump and compare in a more targeted way. Think about strncmp() and strchr().
The overhead of reading a single line at a time would be really high (as is often the case with file I/O). So I'd recommend reading a larger chunk and performing your search on that chunk. I'd even think about parallelizing the search by reading the next chunk in another thread and doing the comparison there as well.
You can also think about using memory-mapped I/O instead of the standard C file API. That way you leave loading the contents to the operating system and don't have to handle caching yourself.
Of course, restructuring the data for faster access would help you too. For example, insert padding bytes so all records are equally long. This gives you "random" access to your data stream, since you can easily calculate the position of the nth entry.
I'd start by splitting the single large file into 65536 smaller files, so that if the hash begins with 0000 it's in the file 00/00data.txt, if the hash begins with 0001 it's in the file 00/01data.txt, etc. If the full file was 12 GiB then each of the smaller files would be (on average) 208 KiB.
Next, separate the hash from the string; such that you've got 65536 "hash files" and 65536 "string files". Each hash file would contain the remainder of the hash (the last 12 digits only, because the first 4 digits aren't needed anymore) and the offset of the string in the corresponding string file. This would mean that (instead of 65536 files at an average of 208 KiB each) you'd have 65536 hash files at maybe 120 KiB each and 65536 string files at maybe 100 KiB each.
Next, the hash files should be in a binary format. 12 hexadecimal digits cost 48 bits (not 12*8 = 96 bits). This alone would halve the size of the hash files. If the strings are aligned on a 4-byte boundary in the strings file, then a 16-bit "offset of the string / 4" would be fine (as long as the string file is less than 256 KiB). Entries in the hash file should be sorted in order, and the corresponding strings file should be in the same order.
After all these changes, you'd use the highest 16 bits of the hash to find the right hash file, load the hash file and do a binary search. Then (if found) you'd get the offset for the start of the string (in the strings file) from the entry in the hash file, plus the offset for the next string from the next entry in the hash file. Then you'd load data from the strings file, starting at the start of the correct string and ending at the start of the next string.
Finally, you'd implement a "hash file cache" in memory. If your application can allocate 1.5 GiB of RAM, then that'd be enough to cache half of the hash files. In this case (half the hash files cached) you'd expect that half the time the only thing you'd need to load from disk is the string itself (e.g. probably less than 20 bytes) and the other half the time you'd need to load the hash file into the cache first (e.g. 60 KiB); so on average for each lookup you'd be loading about 30 KiB from disk. Of course more memory is better (and less is worse); and if you can allocate more than about 3 GiB of RAM you can cache all of the hash files and start thinking about caching some of the strings.
A faster way would be to have a reversible encoding, so that you can convert a string into an integer and then convert the integer back into the original string without doing any sort of lookup at all. For an example; if all your strings use lower case ASCII letters and are a max. of 13 characters long, then they could all be converted into a 64-bit integer and back (as 26^13 < 2^63). This could lead to a different approach - e.g. use a reversible encoding (with bit 64 of the integer/hash clear) where possible; and only use some sort of lookup (with bit 64 of the integer/hash set) for strings that can't be encoded in a reversible way. With a little knowledge (e.g. carefully selecting the best reversible encoding for your strings) this could slash the size of your 13 GiB file down to "small enough to fit in RAM easily" and be many orders of magnitude faster.
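As a toy sketch of that last idea (my own example encoding, assuming lowercase ASCII strings of at most 13 characters):

# Reversible encoding: lowercase ASCII strings up to 13 chars map into an integer
# below 2^63 (27^13 < 2^63) and back, with no lookup table at all.
def encode(s):
    assert len(s) <= 13 and all("a" <= c <= "z" for c in s)
    n = 0
    for c in reversed(s):
        n = n * 27 + (ord(c) - ord("a") + 1)   # digits 1..26; 0 marks end of string
    return n

def decode(n):
    out = []
    while n:
        n, d = divmod(n, 27)
        out.append(chr(d - 1 + ord("a")))
    return "".join(out)

assert decode(encode("yetanotherstr")) == "yetanotherstr"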
Does the value returned by MySQL's MD5 hash function continue to change indefinitely as the string given to it grows indefinitely?
E.g., will these continue to return different values:
MD5("A"+"B"+"C")
MD5("A"+"B"+"C"+"D")
MD5("A"+"B"+"C"+"D"+"E")
MD5("A"+"B"+"C"+"D"+"E"+"D")
... and so on until a very long list of values ....
At some point, when we are giving the function very long input strings, will the results stop changing, as if the input were being truncated?
I'm asking because I want to use the MD5 function to compare two records with a large set of fields by storing the MD5 hash of these fields.
======== MADE-UP EXAMPLE (YOU DON'T NEED THIS TO ANSWER THE QUESTION, BUT IT MIGHT INTEREST YOU) ========
I have a database application that periodically grabs data from an external source and uses it to update a MySQL table.
Let's imagine that in month #1, I do my first download:
downloaded data, where the first field is an ID, a key:
1,"A","B","C"
2,"A","D","E"
3,"B","D","E"
I store this
1,"A","B","C"
2,"A","D","E"
3,"B","D","E"
Month #2, I get
1,"A","B","C"
2,"A","D","X"
3,"B","D","E"
4,"B","F","E"
Notice that the record with ID 2 has changed. Record with ID 4 is new. So I store two new records:
1,"A","B","C"
2,"A","D","E"
3,"B","D","E"
2,"A","D","X"
4,"B","F","E"
This way I have a history of *changes* to the data.
I don't want to have to compare each field of the incoming data with each field of each of the stored records.
E.g., if I'm comparing incoming record x with existing record a, I don't want to have to say:
Add record x to the stored data if there is no record a such that x.ID == a.ID AND x.F1 == a.F1 AND x.F2 == a.F2 AND x.F3 == a.F3 [4 comparisons]
What I want to do is to compute an MD5 hash and store it:
1,"A","B","C",MD5("A"+"B"+"C")
Let's suppose that it is month #3, and I get a record:
1,"A","G","C"
What I want to do is compute the MD5 hash of the new fields: MD5("A"+"G"+"C") and compare the resulting hash with the hashes in the stored data.
If it doesn't match, then I add it as a new record.
I.e., Add record x to the stored data if there is no record a such that x.ID == a.ID AND MD5(x.F1 + x.F2 + x.F3) == a.stored_MD5_value [2 comparisons]
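For what it's worth, here is a small sketch of that check outside of MySQL, using Python's hashlib (the field separator is my own addition, to keep ("AB", "C") and ("A", "BC") from hashing identically; names are made up):

import hashlib

def row_fingerprint(fields):
    # MD5 over the concatenated non-key fields, like MD5(F1 + F2 + F3).
    return hashlib.md5("\x1f".join(fields).encode()).hexdigest()

stored = {1: row_fingerprint(["A", "B", "C"])}      # month #1
incoming_id, incoming_fields = 1, ["A", "G", "C"]   # month #3
if row_fingerprint(incoming_fields) != stored.get(incoming_id):
    print("changed (or new) record: store a new row")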
My question is "Can I compare the MD5 hash of, say, 50 fields without increasing the likelihood of clashes?"
Yes, practically speaking, it will keep changing. Due to the pigeonhole principle, if you keep going long enough you will eventually get a collision, but in practice you will never reach that point.
The security of the MD5 hash function is severely compromised. A collision attack exists that can find collisions within seconds on a computer with a 2.6 GHz Pentium 4 processor (complexity of 2^24).
Further, there is also a chosen-prefix collision attack that can produce a collision for two chosen arbitrarily different inputs within hours, using off-the-shelf computing hardware (complexity 2^39).
The ability to find collisions has been greatly aided by the use of off-the-shelf GPUs. On an NVIDIA GeForce 8400GS graphics processor, 16-18 million hashes per second can be computed. An NVIDIA GeForce 8800 Ultra can calculate more than 200 million hashes per second.
These hash and collision attacks have been demonstrated in the public in various situations, including colliding document files and digital certificates.
See http://www.win.tue.nl/hashclash/On%20Collisions%20for%20MD5%20-%20M.M.J.%20Stevens.pdf
A number of projects have published MD5 rainbow tables online, that can be used to reverse many MD5 hashes into strings that collide with the original input, usually for the purposes of password cracking.