Boolean functions: what is the purpose of DNF and CNF?

Boolean functions can be expressed in Disjunctive normal form (DNF) or Conjunctive normal form (CNF). Can anyone explain why these forms are useful?

CNF is useful because this form directly describes the Boolean SAT problem which, while NP-complete, has many incomplete and heuristic solvers that work well in practice despite worst-case exponential time. CNF has been further standardized into a file format called the DIMACS CNF format, which most solvers can operate on. Thus, for example, the chip industry can verify circuit designs by converting them to DIMACS CNF and feeding them into any of the available solvers. The Tseitin transformation can convert a circuit into an equisatisfiable CNF formula.
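For a sense of what that looks like, here is a minimal sketch (plain Python, nothing solver-specific; the variable numbering and the single AND gate are only illustrative) of the Tseitin encoding of one gate, printed in the DIMACS convention where variables are positive integers, negation is a minus sign, and each clause ends with 0:

# Tseitin encoding of one AND gate: z <-> (x AND y)
x, y, z = 1, 2, 3
clauses = [
    [-z, x],        # z implies x
    [-z, y],        # z implies y
    [-x, -y, z],    # (x AND y) implies z
]

print(f"p cnf 3 {len(clauses)}")          # DIMACS header: variable count, clause count
for clause in clauses:
    print(" ".join(map(str, clause)), 0)  # each clause is terminated by 0

Repeating this for every gate, and adding a final unit clause asserting the circuit's output, yields a CNF that is satisfiable exactly when the circuit can output true.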
DNF is not as useful practically, but converting a formula to DNF means one can read off a list of assignments that would satisfy the formula. Unfortunately, converting a formula to DNF is hard in general and can lead to exponential blow-up (a very large DNF), which is to be expected, because a formula can have exponentially many satisfying assignments.
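To see the blow-up concretely, here is a small sketch (the formula shape is just an illustration): the CNF (a1 OR b1) AND ... AND (an OR bn) has 2n literals, but distributing AND over OR produces one DNF term per choice of a literal from each clause, i.e. 2^n terms.

from itertools import product

n = 4
cnf = [(f"a{i}", f"b{i}") for i in range(1, n + 1)]   # (a1 or b1) and ... and (an or bn)

# Distributing AND over OR picks one literal from every clause,
# so the DNF has one conjunction per element of the Cartesian product.
dnf = [" & ".join(choice) for choice in product(*cnf)]

print(len(dnf))   # 2**n == 16 conjunctions from only 2*n == 8 CNF literals
print(dnf[0])     # a1 & a2 & a3 & a4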
While CNF can be succinct compared with DNF, it is sometimes hard to reason with, because it can lose structure when converted from a circuit, for example, so another succinct form would be useful. The and-inverter graph data structure was designed for this purpose; it models the structure of circuits more closely and is thus much easier to reason with for some kinds of formulas. However, there are not many solvers that operate on it.
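To illustrate the idea (a rough sketch; the encoding of nodes and literals here is made up for the example, not any particular library's API), an and-inverter graph is just a DAG of 2-input AND gates whose edges may carry an inversion:

# A node is either an input name (str) or an AND of two literals;
# a literal is (node_id, inverted), where inverted means a NOT on that edge.
def evaluate(nodes, lit, inputs):
    node_id, inverted = lit
    node = nodes[node_id]
    if isinstance(node, str):                # primary input
        value = inputs[node]
    else:                                    # 2-input AND gate
        left, right = node
        value = evaluate(nodes, left, inputs) and evaluate(nodes, right, inputs)
    return not value if inverted else value

# x XOR y built from AND gates and inverted edges only:
nodes = {
    0: "x",
    1: "y",
    2: ((0, False), (1, True)),   # x AND NOT y
    3: ((0, True), (1, False)),   # NOT x AND y
    4: ((2, True), (3, True)),    # NOT(node 2) AND NOT(node 3)
}
output = (4, True)                # inverting node 4 gives x XOR y
print(evaluate(nodes, output, {"x": True, "y": False}))   # True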
Other forms include:
Binary Decision Diagram
Propositional directed acyclic graph
Negation normal form
Others (wikipedia)

It helps to express the functions in some standard way. With a standard form, it's easier to run them through many algorithms automatically.
Both forms can be used, for example, in automated theorem proving; CNF in particular is the form used by the resolution method.
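The resolution rule itself is tiny. A sketch (clauses as sets of integer literals, with a negative sign meaning negation; this is only an illustration, not a full prover):

def resolve(c1, c2, var):
    """Resolve clauses c1 and c2 on variable var.

    Clauses are sets of integer literals; -v means "not v".
    Assumes var is in c1 and -var is in c2.
    """
    return (c1 - {var}) | (c2 - {-var})

# From (A or x) and (B or not x) we may conclude (A or B):
print(resolve({1, 3}, {2, -3}, 3))   # {1, 2}

Repeatedly resolving clauses until either the empty clause appears (unsatisfiable) or nothing new can be derived is the essence of the resolution method.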

Related

High order forward mode automatic differentiation

For quite some time I have been wondering how automatic differentiation works. However, I am a bit confused on how the forward mode works -- I am not equipped to deal with reverse mode at the moment. I have tried to read the source code of some libraries (mainly autodiff) and read some papers (e.g. FAD) in order to understand how people are doing it, with little success.
My main issue is I don't get how dual numbers are used. For example, let's say we define a class of dual numbers (in C++) that holds two numbers: value and derivative. Then we can overload different mathematical functions and operators in order to define the dual number algebra (as in the complex number case). Then, and this is my problem, no matter what we do, we are only going to get first derivatives.
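For reference, a minimal sketch of what I mean by such a dual-number class (in Python here for brevity rather than C++; the names and the few overloads are just for illustration):

class Dual:
    """A value together with its first derivative: f is stored as (f, f')."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (f*g)' = f'*g + f*g'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def f(x):
    return x * x * x + 2 * x

result = f(Dual(4.0, 1.0))          # seed the derivative slot with dx/dx = 1
print(result.value, result.deriv)   # 72.0 50.0, i.e. f(4) and f'(4) = 3*16 + 2

No matter how the overloads are composed, there is only that single deriv slot, hence only first derivatives.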
I keep reading about implementations of hyper-dual numbers, which are described as duals that store value, Jacobian, Hessian, etc. If this is true, then if I have a function of 15 variables and I need the third derivative with respect to all of them, my computer is going to blow up... Since there are very efficient libraries out there that do such calculations, I am clearly missing something.
I don't have a specific coding question, I would appreciate any input on how forward mode autodiff can be implemented in a practical way.
More info
I have written a basic dual number library in C++, which you can find on github. However, once I finished writing the class and a few function overloads, I gave up due to the problem I describe above (DualNumbers.cpp has several examples, though).
Recently I also started again, this time using expression templates (because I wanted to learn how to use them) -- see github, but this approach has another issue I describe in another question.

What is differentiable programming?

Native support for differentiable programming has been added to Swift for the Swift for TensorFlow project. Julia has something similar with Zygote.
What exactly is differentiable programming?
What does it enable? Wikipedia says
the programs can be differentiated throughout
but what does that mean?
How would one use it (e.g. a simple example)?
And how does it relate to automatic differentiation (the two seem conflated a lot of the time)?
I like to think about this question in terms of user-facing features (differentiable programming) vs implementation details (automatic differentiation).
From a user's perspective:
"Differentiable programming" is APIs for differentiation. An example is a def gradient(f) higher-order function for computing the gradient of f. These APIs may be first-class language features, or implemented in and provided by libraries.
"Automatic differentiation" is an implementation detail for automatically computing derivative functions. There are many techniques (e.g. source code transformation, operator overloading) and multiple modes (e.g. forward-mode, reverse-mode).
Explained in code:
def f(x):
    return x * x * x

df = gradient(f)   # ∇f
print(df(4))       # 48.0

# Using the `gradient` API:
# ▶ differentiable programming.
# How `gradient` works to compute the gradient of `f`:
# ▶ automatic differentiation.
I had never heard the term "differentiable programming" before reading your question, but having used the concepts noted in your references, both from the side of writing code to compute a derivative with symbolic differentiation and with automatic differentiation, and having written interpreters and compilers, to me this just means that they have made it easier to calculate the numeric value of the derivative of a function. I don't know if they made it a first-class citizen, but the new way doesn't require the use of a function/method call; it is done with syntax, and the compiler/interpreter hides the translation into calls.
If you look at the Zygote example it clearly shows the use of prime notation
julia> f(10), f'(10)
Most seasoned programmers could guess what I just noted even without a research paper explaining it. In other words, it is just that obvious.
Another way to think about it is that if you have ever tried to calculate a derivative in a programming language, you know how hard it can be at times, and then ask yourself why don't they (the language designers and programmers) just add it to the language. In these cases they did.
What surprises me is how long it took before derivatives became available via syntax instead of calls, but if you have ever worked with scientific code or coded neural networks at that level, then you will understand why this is a concept that is being touted as something of value.
Also I would not view this as another programming paradigm, but I am sure it will be added to the list.
How does it relate to automatic differentiation (the two seem conflated a lot of the time)?
In both cases that you referenced, they use automatic differentiation to calculate the derivative instead of using symbolic differentiation. I do not view differentiable programming and automatic differentiation as two distinct sets; rather, differentiable programming needs a means of being implemented, and the way they chose was automatic differentiation. They could have chosen symbolic differentiation or some other means.
It seems you are trying to read more into what differentiable programming is than it really is. It is not a new way of programming, but just a nice feature added for doing derivatives.
Perhaps if they named it differentiable syntax it might have been more clear. The use of the word programming gives it more panache than I think it deserves.
EDIT
After skimming Swift Differentiable Programming Mega-Proposal and trying to compare that with the Julia example using Zygote, I would have to split the answer into parts that talk about Zygote and then switch gears to talk about Swift. They each took a different path, but the commonality and bottom line is that the languages know something about differentiation, which makes the job of coding with it easier and hopefully produces fewer errors.
About the Wikipedia quote that
the programs can be differentiated throughout
At first reading it seems like nonsense, or at least it lacks enough detail to understand it in context, which is why I am sure you asked.
After many years of digging into what others are trying to communicate, one learns to take a source with a grain of salt unless it has been peer reviewed and, unless it is absolutely necessary to understand, to just ignore it. In this case, if you ignore that sentence, most of your reference makes sense. However, I take it that you want an answer, so let's try to figure out what it means.
The key word that has me perplexed is "throughout". You note that the statement came from Wikipedia, and Wikipedia gives three references for it; a search shows the word "throughout" appears in only one of them:
∂P: A Differentiable Programming System to Bridge Machine Learning and Scientific Computing
Thus, since our ∂P system does not require primitives to handle new types, this means that almost all functions and types defined throughout the language are automatically supported by Zygote, and users can easily accelerate specific functions as they deem necessary.
So my take on this is that by going back to the source, e.g. the paper, you can better understand how that percolated up into Wikipedia, but it seems that the meaning was lost along the way.
In this case if you really want to know the meaning of that statement you should ask on the Wikipedia talk page and ask the author of the statement directly.
Also note that the paper referenced is not peer reviewed, so the statements in there may not have any meaning amongst peers at present. As I said, I would just ignore it and get on with writing wonderful code.
You can guess its definition by application of differentiability.
It's been used for optimization, i.e. to calculate a minimum or maximum value.
Many such problems can be solved by finding the appropriate function and then using techniques to find the maximum or minimum value required.

Tools to help reverse engineer binary file formats

What tools are available to aid in decoding unknown binary data formats?
I know Hex Workshop and 010 Editor both support structures. These are okay to a limited extent for a known fixed format but get difficult to use with anything more complicated, especially for unknown formats. I guess I'm looking at a module for a scripting language or a scriptable GUI tool.
For example, I'd like to be able to find a structure within a block of data from limited known information, perhaps a magic number. Once I've found a structure, then follow known length and offset words to find other structures. Then repeat this recursively and iteratively where it makes sense.
In my dreams, perhaps even automatically identify possible offsets and lengths based on what I've already told the system!
Here are some tips that come to mind:
From my experience, interactive scripting languages (I use Python) can be a great help. You can write a simple framework to deal with binary streams and some simple algorithms. Then you can write scripts that will take your binary and check various things. For example:
Do some statistical analysis on various parts. Random data, for example, will tell you that a part is probably compressed or encrypted. Zeros may mean padding between parts. Scattered zeros may mean integer values or Unicode strings, and so on.
Try to spot various offsets. Try to convert parts of the binary into 2- or 4-byte integers or into floats, print them, and see if they make sense (see the sketch after these tips).
Write some functions that will search for repeating or very similar parts in the data; this way you can easily spot headers.
Try to find as many strings as possible, try different encodings (c strings, pascal strings, utf8/16, etc.). There are some good tools for that (I think that Hex Workshop has such a tool). Strings can tell you a lot.
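To make the scripting idea concrete, here is a small sketch using only the standard library (the file name, window size and offsets are placeholders): it flags high-entropy regions, which often indicate compression or encryption, and tries a few fixed-size interpretations at an offset.

import math
import struct

data = open("mystery.bin", "rb").read()   # placeholder file name

def entropy(chunk):
    """Shannon entropy in bits per byte; values near 8 suggest compressed/encrypted data."""
    counts = [chunk.count(b) for b in range(256)]
    total = len(chunk)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# Flag 256-byte windows that look random.
for offset in range(0, len(data) - 256, 256):
    window = data[offset:offset + 256]
    e = entropy(window)
    if e > 7.5:
        print(f"0x{offset:06x}: high entropy ({e:.2f})")

# Try interpreting a few bytes as little-endian integers and a float.
offset = 0x10                                         # arbitrary offset to inspect
if len(data) >= offset + struct.calcsize("<HIf"):
    print(struct.unpack_from("<HIf", data, offset))   # uint16, uint32, float32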
Good luck!
For Mac OS X, there's a great tool that's even better than my iBored: Synalyze It!
(http://www.synalysis.net/)
Compared to iBored, it is better suited for non-blocked files, while also giving full control over structures, including scriptability (with Lua). And it visualizes structures better, too.
Tupni: to my knowledge not directly available out of Microsoft Research, but there is a paper about this tool which can be of interest to someone wanting to write a similar program (perhaps open source):
Tupni: Automatic Reverse Engineering of Input Formats (ACM Digital Library)
Abstract
Recent work has established the importance of automatic reverse engineering of protocol or file format specifications. However, the formats reverse engineered by previous tools have missed important information that is critical for security applications. In this paper, we present Tupni, a tool that can reverse engineer an input format with a rich set of information, including record sequences, record types, and input constraints. Tupni can generalize the format specification over multiple inputs. We have implemented a prototype of Tupni and evaluated it on 10 different formats: five file formats (WMF, BMP, JPG, PNG and TIF) and five network protocols (DNS, RPC, TFTP, HTTP and FTP). Tupni identified all record sequences in the test inputs. We also show that, by aggregating over multiple WMF files, Tupni can derive a more complete format specification for WMF. Furthermore, we demonstrate the utility of Tupni by using the rich information it provides for zero-day vulnerability signature generation, which was not possible with previous reverse engineering tools.
My own tool "iBored", which I released just recently, can do parts of this. I wrote the tool to visualize and debug file system formats (UDF, HFS, ISO9660, FAT etc.), and implemented search, copy and later even structure and templates support. The structure support is pretty straight-forward, and the templates are a way to identify structures dynamically.
The entire thing is programmable in a Visual BASIC dialect, allowing you to test values, read specific blocks, and all.
The tool is free, works on all platforms (Win, Mac, Linux), but as it's a personal tool which I just released to the public to share, it's not much documented.
However, if you want to give it a try, and like to give feedback, I might add more useful features.
I'd even open source it, but as it's written in REALbasic, I doubt many people will join such a project.
Link: iBored home page
I still occasionally use an old hex editor called A.X.E., Advanced Hex Editor. It seems to have largely disappeared from the Internet now, though Google should still be able to find it for you. The last version I know of was version 3.4, but I've really only used the free-for-personal-use version 2.1.
Its most interesting feature, and the one I've had the most use for deciphering various game and graphics formats, is its graphical view mode. That basically just shows you the file with each byte turned into a color-coded pixel. And as simple as that sounds, it has made my reverse-engineering attempts a lot easier at times.
I suppose doing it by eye is quite the opposite of doing automatic analysis, though, and the graphical mode won't be much use for finding and following offsets...
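(If you want to reproduce that kind of graphical view without A.X.E., a rough sketch with the Pillow library; the file name and row width are placeholders:)

from PIL import Image   # pip install Pillow

data = open("mystery.bin", "rb").read()   # placeholder file name
width = 256                               # bytes per row; tweak to taste
height = len(data) // width

# One grayscale pixel per byte: headers, padding and compressed regions
# tend to show up as distinct bands and textures.
img = Image.frombytes("L", (width, height), data[:width * height])
img.save("mystery.png")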
The later version has some features that sound like they could fit your needs (scripts, regularity finder, grammar generator), but I have no idea how good they are.
There is Hachoir, which is a Python library for parsing any binary format into fields and then browsing the fields. It has lots of parsers for common formats, but you can also write your own parsers for your files (e.g. when working with code that reads or writes binary files, I usually write a Hachoir parser first to have a debugging aid). It looks like the project is pretty much inactive by now, though.
Kaitai is an open-source language for describing binary structures in data streams. It comes with a translator that can output parsing code for many programming languages, for inclusion in your own program code.
My project icebuddha.com supports this, using Python to describe the format in the browser.
A cut'n'paste of my answer to a similar question:
One tool is WinOLS, which is designed for interpreting and editing vehicle engine management computer binary images (mostly the numeric data in their lookup tables). It has support for various endian formats (though not PDP, I think), viewing data at various widths and offsets, defining array areas (maps) and visualising them in 2D or 3D with all kinds of scaling and offset options. It also has a heuristic/statistical automatic map finder, which might work for you.
It's a commercial tool, but the free demo will let you do everything but save changes to the binary and use engine management features you don't need.

Can a Sequence Diagram realistically capture your logic in the same depth as code?

I use UML Sequence Diagrams all the time, and am familiar with the UML2 notation.
But I only ever use them to capture the essence of what I intend to do. In other words, the diagram always exists at a level of abstraction above the actual code. Every time I use them to try to describe exactly what I intend to do, I end up using so much horizontal space and so many alt/loop frames that it's not worth the effort.
So it may be possible in theory, but has anyone ever really used the diagram at this level of detail? If so, can you provide an example please?
I have the same problem but when I realize that I am going low-level I re-read this:
You should use sequence diagrams when you want to look at the behavior of several objects within a single use case. Sequence diagrams are good at showing collaborations among the objects; they are not so good at precise definition of the behavior.
If you want to look at the behavior of a single object across many use cases, use a state diagram. If you want to look at behavior across many use cases or many threads, consider an activity diagram.
If you want to explore multiple alternative interactions quickly, you may be better off with CRC cards, as that avoids a lot of drawing and erasing. It’s often handy to have a CRC card session to explore design alternatives and then use sequence diagrams to capture any interactions that you want to refer to later.
[excerpt from Martin Fowler's UML Distilled book]
It's all relative. The law of diminishing returns always applies when making a diagram. I think it's good to show the interaction between objects (objectA initializes objectB and calls method foo on it). But it's not practical to show the internals of a function. In that regard, a sequence diagram is not practical to capture the logic at the same depth as code. I would argue for intricate logic, you'd want to use a flowchart.
I think there are two issues to consider.
Be concrete
Sequence diagrams are at their best when they are used to convey a single concrete scenario (of a use case, for example).
When you use them to depict more than one scenario, usually to show what happens in every possible path through a use case, they get complicated very quickly.
Since source code is just like a use case in this regard (i.e. a general description instead of a specific one), sequence diagrams aren't a good fit. Imagine expanding x levels of the call graph of some method and showing all that information on a single diagram, including all if & loop conditions..
That's why 'capturing the essence' as you put it, is so important.
Ideally a sequence diagram fits on a single A4/Letter page, anything larger makes the diagram unwieldy. Perhaps as a rule of thumb, limit the number of objects to 6-10 and the number of calls to 10-25.
Focus on communication
Sequence diagrams are meant to highlight communication, not internal processing.
They're very expressive when it comes to specifying the communication that happens (involved parties, asynchronous, synchronous, immediate, delayed, signal, call, etc.), but not when it comes to internal processing (only actions, really).
Also, although you can use variables, it's far from perfect. The objects at the top are, well, objects. You could consider them as variables (i.e. use their names as variables), but it just isn't very convenient.
For example, try depicting the traversal of a linked list where you need to keep tabs on an element and its predecessor with a sequence diagram. You could use two 'variable' objects called 'current' and 'previous' and add the necessary actions to make current=current.next and previous=current, but the result is just awkward.
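(For reference, the loop being described is only a few lines of code, which is exactly why a sequence diagram feels like the wrong tool for it; a sketch:)

class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

head = Node(1, Node(2, Node(3)))     # a tiny list: 1 -> 2 -> 3

# Walk the list while keeping tabs on the current node and its predecessor,
# e.g. so that `previous` can later be relinked to remove `current`.
previous, current = None, head
while current is not None and current.value != 3:
    previous, current = current, current.next

print(previous.value, current.value)   # 2 3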
Personally I have used sequence diagrams only as a description of the general interaction between different objects, i.e. as a quick "temporal interaction sketch". When I tried to go into more depth, everything quickly became confusing...
I've found that the best compromise is a "simplified" sequence diagram followed by a clear but in depth description of the logic underneath.
The answer is no - it does capture it better than your source code!
At least in some aspects. Let me elaborate.
You - like the majority of programmers, including me - think in source code lines. But the software end product - let's call it the System - is much more than that. It only exists in the minds of your team members. In better cases it also exists on paper or in other documented forms.
There are plenty of standard 'views' to describe the System. Like UML Class diagrams, UML activity diagrams etc. Each diagram shows the System from another point of view. You have static views, dynamic views, but in an architectural/software document you don't have to stop there. You can present nonstandard views in your own words, e.g. deployment view, performance view, usability view, company-values view, boss's favourite things view etc.
Each view captures and documents certain properties of the System.
It's very important to realize that the source code is just one view. The most important one, though, because it's needed to generate a computer program. But it doesn't contain every piece of information about your System, neither explicitly nor implicitly (e.g. data shared between program modules that are only connected via offline user activity leaves no trace in the source). It's just a static view, which helps very little to understand your processes, the runtime dynamics of your living, breathing program.
A classic example is the Observer pattern. Especially if it is used heavily, you'll hardly understand the System's mechanics from the source code. That's why you use sequence diagrams in that case. They capture the 'dynamic logic' of your system a lot better than your source code.
But if you meant some kind of business logic in great detail, you are better off with plain text/source code/pseudocode etc. You don't have to use UML diagrams just because they are the standard. You can use use case modeling without drawing use case diagrams. Always choose the view that is best for you and for your purpose.
UML diagrams are guidelines, not strict rules.
You don't have to make them exactly as detailed as the source code, but you may try it if you want.
Sometimes it's possible to do it, sometimes it's not, because of the detail or complexity of the system, or because you don't have the time or the details to do it.
Cheers.
P.S. Any cheeseburger or tuna-fish burger for the cat?

how are serial generators / cracks developed?

I mean, I have always wondered how the hell somebody can develop algorithms to break/cheat the constraints of legal use in many shareware programs out there.
Just for curiosity.
Apart from being illegal, it's a very complex task.
Speaking just at a theoretical level, the common way is to disassemble the program to crack and try to find where the key or the serial code is checked.
Easier said than done since any serious protection scheme will check values in multiple places and also will derive critical information from the serial key for later use so that when you think you guessed it, the program will crash.
To create a crack you have to identify all the points where a check is done and modify the assembly code appropriately (often inverting a conditional jump or storing constants into memory locations).
To create a keygen you have to understand the algorithm and write a program to re-do the exact same calculation (I remember an old version of MS Office whose serial had a very simple rule: the sum of the digits had to be a multiple of 7, so writing the keygen was rather trivial).
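To make the keygen idea concrete, here is a sketch of that anecdotal digit-sum rule (the serial format and length are invented; real schemes are of course more involved):

import random

def is_valid(serial):
    """The anecdotal check: the digits must sum to a multiple of 7."""
    return sum(int(c) for c in serial if c.isdigit()) % 7 == 0

def generate(length=10):
    """Brute-force keygen: throw random candidates at the reversed check."""
    while True:
        candidate = "".join(random.choice("0123456789") for _ in range(length))
        if is_valid(candidate):
            return candidate

key = generate()
print(key, is_valid(key))   # prints a valid serial and True (output varies)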
Both activities require you to follow the execution of the application in a debugger and try to figure out what's happening. And you need to know the low-level API of your operating system.
Some heavily protected applications have their code encrypted so that the file can't be disassembled. It is decrypted when loaded into memory, but then they refuse to start if they detect that an in-memory debugger has been attached.
In essence it's something that requires very deep knowledge, ingenuity and a lot of time! Oh, did I mention that it is illegal in most countries?
If you want to know more, Google for the +ORC Cracking Tutorials; they are very old and probably useless nowadays, but they will give you a good idea of what it means.
Anyway, a very good reason to know all this is if you want to write your own protection scheme.
The bad guys search for the key-check code using a disassembler. This is relatively easy if you know how to do it.
Afterwards you translate the key-checking code to C or another language (this step is optional). Reversing the process of key-checking gives you a key-generator.
If you know assembler, it takes roughly a weekend to learn how to do this. I did it just a few years ago (never released anything, though; it was just research for my game-development job. To write a hard-to-crack key you have to understand how people approach cracking).
Nils's post deals with key generators. For cracks, usually you find a branch point and invert (or remove the condition) the logic. For example, you'll test to see if the software is registered, and the test may return zero if so, and then jump accordingly. You can change the "jump if equals zero (je)" to "jump if not-equals zero (jne)" by modifying a single byte. Or you can write no-operations over various portions of the code that do things that you don't want to do.
Compiled programs can be disassembled and with enough time, determined people can develop binary patches. A crack is simply a binary patch to get the program to behave differently.
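At the byte level, such a patch can be as small as a single flipped opcode. A sketch (the file name and offset are invented; the opcodes are the real x86 short jumps: 0x74 is je, 0x75 is jne, 0x90 is nop):

PATCH_OFFSET = 0x1A2B   # invented offset of the jump located with a disassembler

with open("target.exe", "rb") as f:          # invented file name
    image = bytearray(f.read())

assert image[PATCH_OFFSET] == 0x74, "expected a short je at this offset"
image[PATCH_OFFSET] = 0x75                   # je -> jne inverts the registration check

# Alternatively, overwrite an unwanted check with no-ops:
# image[start:end] = b"\x90" * (end - start)

with open("target.patched.exe", "wb") as f:
    f.write(image)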
First, most copy-protection schemes aren't terribly well advanced, which is why you don't see a lot of people rolling their own these days.
There are a few methods used to do this. You can step through the code in a debugger, which does generally require a decent knowledge of assembly. Using that you can get an idea of where in the program copy protection/keygen methods are called. With that, you can use a disassembler like IDA Pro to analyze the code more closely and try to understand what is going on, and how you can bypass it. I've cracked time-limited Betas before by inserting NOOP instructions over the date-check.
It really just comes down to a good understanding of software and a basic understanding of assembly. Hak5 did a two-part series in the first two episodes this season on the basics of reverse engineering and cracking. It's really basic, but it's probably exactly what you're looking for.
A would-be cracker disassembles the program and looks for the "copy protection" bits, specifically for the algorithm that determines if a serial number is valid. From that code, you can often see what pattern of bits is required to unlock the functionality, and then write a generator to create numbers with those patterns.
Another alternative is to look for functions that return "true" if the serial number is valid and "false" if it's not, then develop a binary patch so that the function always returns "true".
Everything else is largely a variant on those two ideas. Copy protection is always breakable by definition - at some point you have to end up with executable code or the processor couldn't run it.
For the serial number, you can just extract the algorithm and start throwing guesses at it, looking for a positive response. Computers are powerful; it usually only takes a little while before it starts spitting out hits.
As for hacking, I used to be able to step through programs at a high level and look for a point where it stopped working. Then you go back to the last "Call" that succeeded and step into it, then repeat. Back then, the copy protection was usually writing to the disk and seeing if a subsequent read succeeded (If so, the copy protection failed because they used to burn part of the floppy with a laser so it couldn't be written to).
Then it was just a matter of finding the right call and hardcoding the correct return value from that call.
I'm sure it's still similar, but they go through a lot of effort to hide the location of the call. Last one I tried I gave up because it kept loading code over the code I was single-stepping through, and I'm sure it's gotten lots more complicated since then.
I wonder why they don't just distribute personalized binaries, where the name of the owner is stored somewhere (encrypted and obfuscated) in the binary, or better, distributed over the whole binary. AFAIK Apple does this with the music files from the iTunes store; however, there it's far too easy to remove the name from the files.
I assume each crack is different, but I would guess in most cases somebody spends a lot of time in the debugger tracing the application in question.
The serial generator takes that one step further by analyzing the algorithm that checks the serial number for validity and reverse engineers it.