Where might I begin on this optimization problem?

Where might I begin on this optimization problem? - language-agnostic

I have a simple program which reads a bunch of things from the filesystem, filters the results , and prints them. This simple program implements a domain specific language to make selection easier. This DSL "compiles" down into an execution plan that looks like this (Input was C:\Windows\System32\* OR -md5"ABCDEFG" OR -tf):
Index Success Failure Description
0 S 1 File Matches C:\Windows\System32\*
1 S 2 File MD5 Matches ABCDEFG
2 S F File is file. (Not directory)
The filter is applied to the given file, and if it succeeds, the index pointer jumps to the index indicated in the success field, and if it fails, the index pointer jumps to the number indicated in the failure field. "S" means that the file passes the filter, F means that the file should be rejected.
Of course, a filter based upon a simple file attribute (!FILE_ATTRIBUTE_DIRECTORY) check is much faster than a check based upon the MD5 of the file, which requires opening and performing the actual hash of the file. Each filter "opcode" has a time class associated with it; MD5 gets a high timing number, ISFILE gets a low timing number.
I would like to reorder this execution plan so that opcodes that take a long time are executed as rarely as possible. For the above plan, that would mean it would have to be:
Index Success Failure Description
0 S 1 File is file. (Not directory)
1 S 2 File Matches C:\Windows\System32\*
2 S F File MD5 Matches ABCDEFG
According to the "Dragon Book", picking the best order of execution for three address code is an NP-Complete problem (At least according to page 511 of the second edition of that text), but in that case they are talking about register allocation and other issues of the machine. In my case, the actual "intermediate code" is much simpler. I'm wondering of a scheme exists that would allow me to reorder the source IL into the optimal execution plan.
Here is another example:
{ C:\Windows\Inf* AND -tp } OR { -tf AND NOT C:\Windows\System32\Drivers* }
Parsed to:
Index Success Failure Description
0 1 2 File Matches C:\Windows\Inf\*
1 S 2 File is a Portable Executable
2 3 F File is file. (Not directory)
3 F S File Matches C:\Windows\System32\Drivers\*
which is optimally:
Index Success Failure Description
0 1 2 File is file. (Not directory)
1 2 S File Matches C:\Windows\System32\Drivers\*
2 3 F File Matches C:\Windows\Inf\*
3 S F File is a Portable Executable

It sounds like it might be easier to pick an optimal order before compiling down to your opcodes. If you have a parse tree, and it is as "flat" as possible, then you can assign a score to each node and then sort each node's children by the lowest total score first.
For example:
{ C:\Windows\Inf* AND -tp } OR { -tf AND NOT C:\Windows\System32\Drivers* }
1 2 3 4
OR
/ \
AND AND
/ \ / \
1 2 3 4
You could sort the AND nodes (1, 2) and (3, 4) by the lowest score and then assign that score to each node. Then sort the children of the OR node by the lowest score of their children.
Since AND and OR are commutative, this sorting operation won't change the meaning of your overall expression.

#Greg Hewgill is right, this is easier to perform on the AST than on the Intermediate code. As you want to work on the Intermediate code, the first goal is to transform it into a dependency tree (which will look like the AST /shrug).
Start with the leaves - and it is probably easiest if you use negative-predicates for NOT.
Index Success Failure Description
0 1 2 File Matches C:\Windows\Inf\*
1 S 2 File is a Portable Executable
2 3 F File is file. (Not directory)
3 F S File Matches C:\Windows\System32\Drivers\*
Extract Leaf (anything with both children as S, F, or an extracted Node; insert NOT where required; Replace all references to Leaf with reference to parent node of leaf)
Index Success Failure Description
0 1 2 File Matches C:\Windows\Inf\*
1 S 2 File is a Portable Executable
2 L1 F File is file. (Not directory)
L1=NOT(cost(child))
|
Pred(cost(PATH))
Extract Node (If Success points to Extracted Node use conjunction to join; Failure uses disjunction; Replace all references to Node with reference to resulting root of tree containing Node).
Index Success Failure Description
0 1 L3 File Matches C:\Windows\Inf\*
1 S L3 File is a Portable Executable
L3=AND L1 L2 (cost(Min(L1,L2) + Selectivity(Min(L1,L2)) * Max(L1,L2)))
/ \
L1=NOT(cost(child)) L2=IS(cost(child))
| |
3=Pred(cost(PATH)) 2=Pred(cost(ISFILE))
Extract Node
Index Success Failure Description
0 L5 L3 File Matches C:\Windows\Inf\*
L5=OR L3 L4 (cost(Min(L3,L4) + (1.0 - Selectivity(Min(L3,L4))) * Max(L3,L4)))
/ \
| L4=IS(cost(child))
| |
| 1=Pred(cost(PORT_EXE))
|
L3=AND L1 L2 (cost(Min(L1,L2) + Selectivity(Min(L1,L2)) * Max(L1,L2)))
/ \
L1=NOT(cost(child)) L2=IS(cost(child))
| |
3=Pred(cost(PATH)) 2=Pred(cost(ISFILE))
Extract Node (In the case where Success and Failure both refer to Nodes, you will have to inject the Node into the tree by pattern matching on the root of the sub-tree defined by the Node)
If root is OR, invert predicate if necessary to ensure reference is Success and inject as conjunction with child not referenced by Failure.
If root is AND, invert predicate if necessary to ensure reference is Failure and inject as disjunction with child root referenced by Success.
Resulting in:
L5=OR L3 L4 (cost(Min(L3,L4) + (1.0 - Selectivity(Min(L3,L4))) * Max(L3,L4)))
/ \
| L4=AND(cost(as for L3))
| / \
| L6=IS(cost(child)) L7=IS(cost(child))
| | |
| 1=Pred(cost(PORT_EXE)) 0=Pred(cost(PATH))
|
L3=AND L1 L2 (cost(Min(L1,L2) + Selectivity(Min(L1,L2)) * Max(L1,L2)))
/ \
L1=NOT(cost(child)) L2=IS(cost(child))
| |
3=Pred(cost(PATH)) 2=Pred(cost(ISFILE))

Related

How the get value function works in sparse merkle trees?

I have just started reading about sparse merkle trees and I came across a function(get value) which is used to find value for the specified key. I can't find an explanation on the internet which can explain how the get value function works.
My understanding is that each node is of 256 bits so there can be 2^256 leaf nodes and keys are indexed. So we start from root and keep choosing left or right node based on weather the bit is 0 or 1 but I'm not able to understand v = db.get(v)[32:] statement. How is it leading me to the value for the key provided?
def get(db, root, key):
v = root
path = key_to_path(key)
for i in range(256):
if (path >> 255) & 1:
v = db.get(v)[32:]
else:
v = db.get(v)[:32]
path <<= 1
return v

"A Merkle tree [21] is a binary tree that incorporates the use of cryptographic hash
functions. One or many attributes are inserted into the leaves, and every node
derives a digest which is recursively dependent on all attributes in its subtree.
That is, leaves compute the hash of their own attributes, and parents derive the
hash of their children’s digests concatenated left-to-right."
This is a citation from "https://eprint.iacr.org/2016/683.pdf"
Each hash has a roadmap to all it's dependent hashes.

How to do a low RAM full cross join?

I have a hope to perform a full self-cross join on a large data file of points. However, I cannot use programming language to perform the operation because I cannot store it in memory. I would like to find all combinations of points within the set. Below would be an example of my dataset.
x y
1 9
2 8
3 7
4 6
5 5
I would like to cross join on this data to generate 25-row table containing all the combination of points. Would there be a low memory solution? perhaps with awk ?
Thank you,
Nicholas Hayden
P.S. I am a novice programmer.

perhaps in two steps, create a header, column1 and column2 files and join the column1 and column2 and append to header file
awk 'NR==1{print > "cross"} NR>1 {print $1 > "col1"; print $2 > "col2"}' file
join -j9 col1 col2 -o1.1,2.1 >> cross
rm col1, col2
obviously make sure the temp and final file names won't clash with the existing ones.
Note, the join command on MacOS doesn't have the -j option, so change it to equivalent long form
join -19 -29 col1 col2 -o1.1,2.1 >> cross
in both alternatives we're asking join to use the non-existent 9th field as the key which matches every line of the first file to every line in the second to generate the cross product of the two files.

If the memory usage wasn't an issue I'd probably do this:
$ awk 'NR==1 { print; next } # print the header
{ x[NR]=$1; y[NR]=$2 } # read data ro two hashes x and y
END { for(i=2;i<=NR;i++)
for(j=2;j<=NR;j++)
print x[i],y[j] # print all combinations of x and y
}' file
Keeping the memory usage low obviously requires keeping data out of memory and that means accessing the file a lot. So while processing FILENAME for x, open the same file with another name (file below) and process that record by record for y:
$ awk 'NR==1 { print; next } # print header
{ file=FILENAME; x=$1; nr=1 # duplicate FILENAME, keep $1, create local nr
while((getline <file) > 0) # process file record by record
if(nr++>1) {print x,$2 } # print $1 of FILENAME and $2 of file
close(file) }' file # close the file
x y
1 9
1 8
1 7
1 6
1 5
2 9
...
I'd probably never use that code as it is for anything useful but maybe you can mix those 2 solutions to create something suitable.

Understanding Recursion and the Traversal of Stack Frames

This is not a homework question. I'm merely trying to understand the process for my own edification. As a computer science student, I have attended several lectures where the concept of recursion was discussed. However, the lecturer was slightly vague, in my opinion, regarding the concept of a stack frame and how the call stack is traversed in order to calculate the final value. The manner in which I currently envision the process is analogous to building a tree from the top down (pushing items onto the call stack - a last in, first out data structure) then climbing the newly constructed tree where upon the final value is obtained upon reaching the top. Perhaps the canonical example:
def fact(n):
if n == 0:
ans = 1
else:
ans = n * fact(n-1)
return ans
value = fact(5)
print (value)
As indicated above, I think the call stack eventually resembles the following (crudely) drawn diagram:
+----------+
| 5 |
| 4 |
| 3 |
| 2 |
| 1 |
+----------+
Each number would be "enclosed" within a stack frame and control now proceeds from the bottom (with the value of 1) to the 2 then 3, etc. I'm not entirely certain where the operator resides in the process though. Would I be mistaken in assuming an abstract syntax tree (AST) in involved at some point or is a second stack present that contains the operator(s)?
Thanks for the help.
~Caitlin
Edit: Removed the 'recursion' tag and added 'function' and 'stackframe' tags.

The call stack frame stores arguments, return address and local variables. The code (not only the operator) itself is stored elsewhere. The same code is executed on different stack frames.
You can find some more information and visualization here: http://www.programmerinterview.com/index.php/recursion/explanation-of-recursion/

This question is more about how function calls work, rather than about recursion. When a function is called a frame is created and pushed on the stack. The frame includes a pointer to the calling code, so that the program knows where to return after the function call. The operator resides in the executable code, after the call point.

How are internal nodes in a innodb b-tree physically stored?

How are non-leaf b-tree nodes physically represented in innodb?
Recall that a b-tree (more specifically a b+tree) has both leaf nodes and non-leaf nodes. In a b+tree all the leaf nodes sit below a tree of non-leaf or "internal" nodes and point to the pages that actually contain row data.
I know that non-leaf nodes are stored in the non-leaf node segment and use pages sort of like data pages. I have found ample documentation on how data pages are physically stored, but I haven't been able to find anything on what the non-leaf index pages look like.

In On learning InnoDB: A journey to the core, I introduced the innodb_diagrams project to document the InnoDB internals, which provides the diagrams used in this post. Later on in A quick introduction to innodb_ruby I walked through installation and a few quick demos of the innodb_space command-line tool.
The physical structure of InnoDB’s INDEX pages was described in The physical structure of InnoDB index pages. We’ll now look into how InnoDB logically structures its indexes, using some practical examples.
An aside on terminology: B+Tree, root, leaf, and level
InnoDB uses a B+Tree structure for its indexes. A B+Tree is particularly efficient when data doesn’t fit in memory and must be read from the disk, as it ensures that a fixed maximum number of reads would be required to access any data requested, based only on the depth of the tree, which scales nicely.
An index tree starts at a “root” page, whose location is fixed (and permanently stored in the InnoDB’s data dictionary) as a starting point for accessing the tree. The tree may be as small as the single root page, or as large as many millions of pages in a multi-level tree.
Pages are referred to as being “leaf” pages or “non-leaf” pages (also called “internal” or “node” pages in some contexts). Leaf pages contain actual row data. Non-leaf pages contain only pointers to other non-leaf pages, or to leaf pages. The tree is balanced, so all branches of the tree have the same depth.
InnoDB assigns each page in the tree a “level”: leaf pages are assigned level 0, and the level increments going up the tree. The root page level is based on the depth of the tree. All pages that are neither leaf pages nor the root page can also be called “internal” pages, if a distinction is important.
Leaf and non-leaf pages
For both leaf and non-leaf pages, each record (including the infimum and supremum system records) contain a “next record” pointer, which stores an offset (within the page) to the next record. The linked list starts at infimum and links all records in ascending order by key, terminating at supremum. The records are not physically ordered within the page (they take whatever space is available at the time of insertion); their only order comes from their position in the linked list.
Leaf pages contain the non-key values as part of the “data” contained in each record:
Non-leaf pages have an identical structure, but instead of non-key fields, their “data” is the page number of the child page, and instead of an exact key, they represent the minimum key on the child page they point to:
Pages at the same level
Most indexes contain more than one page, so multiple pages are linked together in ascending and descending order:
Each page contains pointers (in the FIL header) for “previous page” and “next page”, which for INDEX pages are used to form a doubly-linked list of pages at the same level (e.g. leaf pages, at level 0 form one list, level 1 pages form a separate list, etc.).
A detailed look at a single-page table
Let’s take a look at most of what’s B+Tree related in a single index page:
Create and populate the table
The test table in use in the illustration above can be created and populated with (make sure you’re using innodb_file_per_table and using Barracuda file format):
CREATE TABLE t_btree (
i INT NOT NULL,
s CHAR(10) NOT NULL,
PRIMARY KEY(i)
) ENGINE=InnoDB;
INSERT INTO t_btree (i, s)
VALUES (0, "A"), (1, "B"), (2, "C");
While this table is quite small and not realistic, it does demonstrate nicely how records and record traversal works.
Verify the basic structure of the space file
The table should match what we’ve examined before, with the three standard overhead pages (FSP_HDR, IBUF_BITMAP, and INODE) followed by a single INDEX page for the root of the index, and in this case two unused ALLOCATED pages.
$ innodb_space -f t_btree.ibd space-page-type-regions
start end count type
0 0 1 FSP_HDR
1 1 1 IBUF_BITMAP
2 2 1 INODE
3 3 1 INDEX
4 5 2 FREE (ALLOCATED)
The space-index-pages-summary mode will give us a count of records in each page, and is showing the expected 3 records:
$ innodb_space -f t_btree.ibd space-index-pages-summary
page index level data free records
3 18 0 96 16156 3
4 0 0 0 16384 0
5 0 0 0 16384 0
(Note that space-index-pages-summary also shows the empty ALLOCATED pages as empty pages with zero records, since that’s often what you’re interested in for plotting purposes.)
The space-indexes mode will show the stats about our PRIMARY KEY index, which is consuming a single page on its internal file segment:
$ innodb_space -f t_btree.ibd space-indexes
id root fseg used allocated fill_factor
18 3 internal 1 1 100.00%
18 3 leaf 0 0 0.00%
Set up a record describer
In order for innodb_ruby to parse the contents of records, we need to provide a record describer, which is just a Ruby class providing a method that returns a description of an index:
class SimpleTBTreeDescriber < Innodb::RecordDescriber
type :clustered
key "i", :INT, :NOT_NULL
row "s", "CHAR(10)", :NOT_NULL
end
We need to note that this is the clustered key, provide the column descriptions for the key, and the column descriptions for the non-key (“row”) fields. It’s necessary to ask innodb_space to load this class with the following additional arguments:
-r -r ./simple_t_btree_describer.rb -d SimpleTBTreeDescriber
Look at the record contents
The root page (which is a leaf page) in this example can be dumped using the page-dump mode and providing the page number for the root page:
$ innodb_space -f t_btree.ibd -r ./simple_t_btree_describer.rb -d
SimpleTBTreeDescriber -p 3 page-dump
Aside from some parts of this output we’ve looked at before, it will now print a “records:” section with the following structure per record:
{:format=>:compact,
:offset=>125,
:header=>
{:next=>157,
:type=>:conventional,
:heap_number=>2,
:n_owned=>0,
:min_rec=>false,
:deleted=>false,
:field_nulls=>nil,
:field_lengths=>[0, 0, 0, 0],
:field_externs=>[false, false, false, false]},
:next=>157,
:type=>:clustered,
:key=>[{:name=>"i", :type=>"INT", :value=>0, :extern=>nil}],
:transaction_id=>"0000000f4745",
:roll_pointer=>
{:is_insert=>true, :rseg_id=>8, :undo_log=>{:page=>312, :offset=>272}},
:row=>[{:name=>"s", :type=>"CHAR(10)", :value=>"A", :extern=>nil}]}
This should align with the above detailed illustration perfectly, as I’ve copied most of the information from this example for accuracy. Note the following aspects:
The :format being :compact indicates that the record is the new “compact” format in Barracuda format tables (as opposed to “redundant” in Antelope tables).
The :key listed in the output is an array of key fields for the index, and :row is an array of non-key fields.
The :transaction_id and :roll_pointer fields are internal fields for MVCC included in each record, since this is a clustered key (the PRIMARY KEY).
The :next field within the :header hash is a relative offset (32) which must be added to the current record offset (125) to yield the actual offset of the next record (157). For convenience this calculated offset is included as :next in the record hash.
Recurse the index
A nice and simple output of recursing the entire index can be achieved with the index-recurse mode, but since this is still a single-page index, the output will be very short:
$ innodb_space -f t_btree.ibd -r ./simple_t_btree_describer.rb -d
SimpleTBTreeDescriber -p 3 index-recurse
ROOT NODE #3: 3 records, 96 bytes
RECORD: (i=0) -> (s=A)
RECORD: (i=1) -> (s=B)
RECORD: (i=2) -> (s=C)
Building a non-trivial index tree
A multi-level index tree (overly simplified) in InnoDB looks like:
As previously described, all pages at each level are doubly-linked to each other, and within each page, records are singly-linked in ascending order. Non-leaf pages contain “pointers” (containing the child page number) rather than non-key row data.
If we use the simpler table schema with 1 million rows created in A quick introduction to innodb_ruby, the tree structure looks a little more interesting:
$ innodb_space -f t.ibd -r ./simple_t_describer.rb -d SimpleTDescriber
-p 3 index-recurse
ROOT NODE #3: 2 records, 26 bytes
NODE POINTER RECORD >= (i=252) -> #36
INTERNAL NODE #36: 1117 records, 14521 bytes
NODE POINTER RECORD >= (i=252) -> #4
LEAF NODE #4: 446 records, 9812 bytes
RECORD: (i=1) -> ()
RECORD: (i=2) -> ()
RECORD: (i=3) -> ()
RECORD: (i=4) -> ()
NODE POINTER RECORD >= (i=447) -> #1676
LEAF NODE #1676: 444 records, 9768 bytes
RECORD: (i=447) -> ()
RECORD: (i=448) -> ()
RECORD: (i=449) -> ()
RECORD: (i=450) -> ()
NODE POINTER RECORD >= (i=891) -> #771
LEAF NODE #771: 512 records, 11264 bytes
RECORD: (i=891) -> ()
RECORD: (i=892) -> ()
RECORD: (i=893) -> ()
RECORD: (i=894) -> ()
This is a three-level index tree, which can be seen by the ROOT, INTERNAL, LEAF lines above. We can see that some pages are completely full, with 468 records consuming almost 15 KiB of the 16 KiB page.
Looking at a non-leaf page (page 36, in the above output) using the page-dump mode, records look slightly different than the leaf pages shown previously:
$ innodb_space -f t.ibd -r ./simple_t_describer.rb -d SimpleTDescriber
-p 36 page-dump
{:format=>:compact,
:offset=>125,
:header=>
{:next=>11877,
:type=>:node_pointer,
:heap_number=>2,
:n_owned=>0,
:min_rec=>true,
:deleted=>false,
:field_nulls=>nil,
:field_lengths=>[0],
:field_externs=>[false]},
:next=>11877,
:type=>:clustered,
:key=>[{:name=>"i", :type=>"INT UNSIGNED", :value=>252, :extern=>nil}],
:child_page_number=>4}
The :key array is present, although it represents the minimum key rather than an exact key, and no :row is present, as a :child_page_number takes its place.
The root page is a bit special
Since the root page is allocated when the index is first created, and that page number is stored in the data dictionary, the root page can never relocated or removed. Once the root page fills up, it will need to be split, forming a small tree of a root page plus two leaf pages.
However, the root page itself can’t actually be split, since it cannot be relocated. Instead, a new, empty page is allocated, the records in the root are moved there (the root is “raised” a level), and that new page is split into two. The root page then does not need to be split again until the level immediately below it has enough pages that the root becomes full of child page pointers (called “node pointers”), which in practice often means several hundred to more than a thousand.
B+Tree levels and increasing tree depth
As an example of the efficiency of B+Tree indexes, assume perfect record packing (every page full, which will never quite happen in practice, but is useful for discussion). A B+Tree index in InnoDB for the simple table in the examples above will be able to store 468 records per leaf page, or 1203 records per non-leaf page. The index tree can then be a maximum of the following sizes at the given tree heights:
Height Non-leaf pages Leaf pages Rows Size in bytes
1 0 1 468 16.0 KiB
2 1 1203 > 563 thousand 18.8 MiB
3 1204 1447209 > 677 million 22.1 GiB
4 1448413 1740992427 > 814 billion 25.9 TiB
As you can imagine, most tables with sensible PRIMARY KEY definitions are 2-3 levels, with some achieving 4 levels. Using an excessively large PRIMARY KEY can cause the B+Tree to be much less efficient, however, since primary key values must be stored in the non-leaf pages. This can drastically inflate the size of the records in non-leaf pages, meaning many fewer of those records fit in each non-leaf page, making the whole structure less efficient.

How do I detect circular logic or recursion in a multi-levels references and dependencies

I have a graph of multi-level dependecies like this, and I need to detect any circular reference in this graph.
A = B
B = C
C = [D, B]
D = [C, A]
Somebody have a problem like this?
Any solution???
Thanks and sorry by english.
========= updated ==========
I had another situation.
1
2 = 1
3 = 2
4 = [2, 3]
5 = 4
In this case, my recursive code iterate two times in "4" reference, but this references don't generate a infinite loop. My problem is to know when function iterate more than one time a reference and is not infinite loop and when is a infinite loop, to inform user.
1 = 4
2 = 1
3 = 2
4 = [2, 3]
5 = 4
This case is a bit diferent from 2th example. This generate a infinite loop. how can I know when cases generate a infinite loop or not?

Topological sorting. The description on Wikipedia is clear and works for all your examples.
Basically you start with a node that has no dependencies, put it in a list of sorted nodes, and then remove that dependency from every node. For you second example that means you start with 1. Once you remove all dependencies on 1 you're left with 2. You end up sorting them 1,2,3,4,5 and seeing that there's no cycle.
For your third example, every node has a dependency, so there's nowhere to start. Such a graph must contain at least one cycle.

Keep a list of uniquely identified nodes. Try to loop through the entire tree but keep checking nodes in the list till you get a node being referred as a child which is already there in the unique list - take it from there (handle the loop or simply ignore it depending on your requirement)

One way to detect circular dependency is to keep a record of the length of the dependency chains that your ordering algorithm detects. If a chain becomes longer than the total number of nodes (due to repetition over a loop) then there is a circular dependency. This should work both for an iterative and for a recursive algorithm.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008