how are serial generators / cracks developed? - reverse-engineering

I have always wondered how on earth somebody can develop algorithms to break or cheat the legal-use constraints of the many shareware programs out there.
Just for curiosity.

Apart from being illegal, it's a very complex task.
Speaking at a purely theoretical level, the common way is to disassemble the program you want to crack and try to find where the key or serial code is checked.
Easier said than done, since any serious protection scheme will check values in multiple places and will also derive critical information from the serial key for later use, so that just when you think you have guessed it, the program crashes.
To create a crack you have to identify all the points where a check is done and modify the assembly code appropriately (often by inverting a conditional jump or storing constants into memory locations).
To create a keygen you have to understand the algorithm and write a program that redoes exactly the same calculation (I remember an old version of MS Office whose serial had a very simple rule: the sum of the digits had to be a multiple of 7, so writing the keygen was rather trivial).
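To show how little such a rule protects, here is a minimal Python sketch of the digit-sum check mentioned above. It only encodes the rule as recalled in this answer; the function names `is_valid_toy_serial` and `make_toy_serial` are made up for the example, and this is not claimed to be the actual scheme of any product.

```python
import random


def is_valid_toy_serial(serial: str) -> bool:
    """Toy rule from the text: the digits must sum to a multiple of 7."""
    if not (serial.isascii() and serial.isdigit()):
        return False
    return sum(int(ch) for ch in serial) % 7 == 0


def make_toy_serial(length: int = 7, rng=None) -> str:
    """Pick random digits, then choose the last digit so the sum is 0 mod 7."""
    rng = rng or random.Random()
    digits = [rng.randrange(10) for _ in range(length - 1)]
    digits.append((-sum(digits)) % 7)  # always in 0..6, so a single digit
    return "".join(str(d) for d in digits)
```

Since roughly one in seven random digit strings already passes, a check like this is a formality rather than a protection, which is the point the answer is making.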
Both activities require you to follow the execution of the application in a debugger and try to figure out what's happening. You also need to know the low-level API of your operating system.
Some heavily protected applications have their code encrypted so that the file can't be disassembled. The code is decrypted when loaded into memory, but the application then refuses to start if it detects that an in-memory debugger is running.
In essence it's something that requires very deep knowledge, ingenuity and a lot of time! Oh, did I mention that it is illegal in most countries?
If you want to know more, Google for the +ORC Cracking Tutorials. They are very old and probably useless nowadays, but they will give you a good idea of what it means.
Anyway, a very good reason to know all this is if you want to write your own protection scheme.

The bad guys search for the key-check code using a disassembler. This is relatively easy if you know how to do it.
Afterwards you translate the key-checking code to C or another language (this step is optional). Reversing the process of key-checking gives you a key generator.
If you know assembler, it takes roughly a weekend to learn how to do this. I did it some years ago (never released anything, though; it was just research for my game-development job. To write a hard-to-crack key you have to understand how people approach cracking).

Nils's post deals with key generators. For cracks, you usually find a branch point and invert the logic (or remove the condition). For example, the code tests whether the software is registered, the test may return zero if so, and the code then jumps accordingly. You can change the "jump if equal to zero" (je) into "jump if not equal to zero" (jne) by modifying a single byte. Or you can write no-operations over the portions of the code that do things you don't want done.
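To make the "single byte" remark concrete: on x86 the short forms of je and jne are the opcodes 0x74 and 0x75, and a one-byte no-operation is 0x90. The sketch below applies both kinds of edit to a made-up four-byte fragment held in memory; the helper names and the fragment are assumptions for illustration only.

```python
JE_SHORT = 0x74   # x86: jump short if equal / zero
JNE_SHORT = 0x75  # x86: jump short if not equal / not zero
NOP = 0x90        # x86: no operation


def invert_short_jump(code: bytes, offset: int) -> bytes:
    """Return a copy of code with the short je/jne at offset flipped."""
    if code[offset] not in (JE_SHORT, JNE_SHORT):
        raise ValueError("no short je/jne at this offset")
    patched = bytearray(code)
    patched[offset] ^= 0x01  # 0x74 and 0x75 differ only in the lowest bit
    return bytes(patched)


def nop_out(code: bytes, start: int, length: int) -> bytes:
    """Return a copy of code with length bytes from start replaced by NOPs."""
    patched = bytearray(code)
    patched[start:start + length] = bytes([NOP]) * length
    return bytes(patched)


# Toy fragment: "test eax, eax" (85 C0) followed by "je +5" (74 05).
fragment = bytes.fromhex("85c07405")
```

Calling `invert_short_jump(fragment, 2)` yields `85 c0 75 05`, which differs from the input in exactly one byte.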
Compiled programs can be disassembled and with enough time, determined people can develop binary patches. A crack is simply a binary patch to get the program to behave differently.

First, most copy-protection schemes aren't terribly well advanced, which is why you don't see a lot of people rolling their own these days.
There are a few methods used to do this. You can step through the code in a debugger, which generally requires a decent knowledge of assembly. Using that, you can get an idea of where in the program the copy-protection/keygen methods are called. With that, you can use a disassembler like IDA Pro to analyze the code more closely and try to understand what is going on and how you can bypass it. I've cracked time-limited betas before by writing NOP instructions over the date check.
It really just comes down to a good understanding of software and a basic understanding of assembly. Hak5 did a two-part series on the first two episodes this season on kind of the basics of reverse engineering and cracking. It's really basic, but it's probably exactly what you're looking for.

A would-be cracker disassembles the program and looks for the "copy protection" bits, specifically for the algorithm that determines if a serial number is valid. From that code, you can often see what pattern of bits is required to unlock the functionality, and then write a generator to create numbers with those patterns.
Another alternative is to look for functions that return "true" if the serial number is valid and "false" if it's not, then develop a binary patch so that the function always returns "true".
Everything else is largely a variant on those two ideas. Copy protection is always breakable by definition - at some point you have to end up with executable code or the processor couldn't run it.

For the serial number, you can just extract the algorithm, start throwing "guesses" at it and look for a positive response. Computers are powerful; it usually takes only a little while before hits start coming out.
As for hacking, I used to be able to step through programs at a high level and look for a point where it stopped working. Then you go back to the last "Call" that succeeded and step into it, then repeat. Back then, the copy protection was usually writing to the disk and seeing if a subsequent read succeeded (If so, the copy protection failed because they used to burn part of the floppy with a laser so it couldn't be written to).
Then it was just a matter of finding the right call and hardcoding the correct return value from that call.
I'm sure it's still similar, but they go through a lot of effort to hide the location of the call. Last one I tried I gave up because it kept loading code over the code I was single-stepping through, and I'm sure it's gotten lots more complicated since then.

I wonder why they don't just distribute personalized binaries, where the name of the owner is stored somewhere (encrypted and obfuscated) in the binary, or better, distributed over the whole binary. AFAIK Apple is doing this with the music files from the iTunes store; however, there it's far too easy to remove the name from the files.

I assume each crack is different, but I would guess in most cases somebody spends
a lot of time in the debugger tracing the application in question.
The serial generator takes that one step further by analyzing the algorithm that
checks the serial number for validity and reverse engineers it.

High order forward mode automatic differentiation

For quite some time I have been wondering how automatic differentiation works. However, I am a bit confused on how the forward mode works -- I am not equipped to deal with reverse mode at the moment. I have tried to read the source code of some libraries (mainly autodiff) and read some papers (e.g. FAD) in order to understand how people are doing it, with little success.
My main issue is that I don't get how dual numbers are used. For example, let's say we define a class of dual numbers (in C++) that holds two numbers: value and derivative. Then we can overload different mathematical functions and operators in order to define the dual number algebra (as in the complex number case). Then, and this is my problem, no matter what we do, we are only going to get first derivatives.
I keep reading about implementation of hyper-dual numbers, which are described as duals that store values, Jacobian, Hessian, etc. If this is true, then if I have a function of 15 variables and I need the third derivative wrt all of them, my computer is going to blow up... Since there are very efficient libraries out there that do such calculations, I am clearly missing something.
I don't have a specific coding question, I would appreciate any input on how forward mode autodiff can be implemented in a practical way.
More info
I have written a basic dual number library in C++, which you can find on github. However, once I finished writing the class and a few function overloads, I gave up due to the problem I describe above (DualNumbers.cpp has several examples, though).
Recently I also started again, this time using expression templates (because I wanted to learn how to use them) -- see github, but this approach has another issue I describe in another question.
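One common way to get past first derivatives in forward mode is to carry a truncated Taylor series through the computation instead of a (value, derivative) pair; a dual number is then just the order-1 case. The following is a minimal single-variable sketch in Python under that assumption. The class name `Jet` and its methods are invented for this example, only + and * are overloaded, and nothing here is claimed to be how autodiff or any other library is implemented.

```python
from math import factorial


class Jet:
    """Truncated Taylor series in t: coeffs[k] multiplies t**k."""

    def __init__(self, coeffs):
        self.coeffs = [float(c) for c in coeffs]

    @classmethod
    def variable(cls, x0, order):
        """The input x(t) = x0 + t, kept up to t**order."""
        coeffs = [0.0] * (order + 1)
        coeffs[0] = x0
        if order >= 1:
            coeffs[1] = 1.0
        return cls(coeffs)

    def _lift(self, other):
        """Turn a plain number into a constant series of matching length."""
        if isinstance(other, Jet):
            return other
        return Jet([other] + [0.0] * (len(self.coeffs) - 1))

    def __add__(self, other):
        other = self._lift(other)
        return Jet(a + b for a, b in zip(self.coeffs, other.coeffs))

    __radd__ = __add__

    def __mul__(self, other):
        other = self._lift(other)
        n = len(self.coeffs)
        # Cauchy product, dropping every term above t**(n - 1).
        return Jet(
            sum(self.coeffs[i] * other.coeffs[k - i] for i in range(k + 1))
            for k in range(n)
        )

    __rmul__ = __mul__

    def derivative(self, k):
        """k-th derivative at x0: k! times the k-th Taylor coefficient."""
        return factorial(k) * self.coeffs[k]


def f(x):
    return x * x * x + 3 * x


y = f(Jet.variable(2.0, order=3))
derivatives = [y.derivative(k) for k in range(4)]  # f, f', f'', f''' at x = 2
```

The storage here grows linearly with the order because there is a single input direction. With many inputs, the number of mixed partial derivatives does grow quickly, so the concern raised in the question still applies to full higher-order tensors.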

How to make a good anti-crack protection?

I will start off with saying I know that it is impossible to prevent your software from reverse engineering.
But when I take a look at crackmes.de, there are crackmes with a difficulty grade of 8 and 9 (on a scale of 1 to 10). These crackmes are getting cracked by genius brains, who write a tutorial on how to crack them. Sometimes such tutorials are 13+ pages long!
When I try to make a crackme, they crack it in 10 minutes. Followed by a "how-to-crack" tutorial with a length of 20 lines.
So the questions are:
How can I make a relatively good anti-crack protection?
Which techniques should I use?
How can I learn it?
...
Disclaimer: I work for a software-protection tools vendor (Wibu-Systems).
Stopping cracking is all we do and all we have done since 1989. So we thoroughly understand how SW gets cracked and how to avoid it. Bottom line: only with a secure hardware dongle, implemented correctly, can you guarantee against cracking.
Most strong anti-cracking relies on encryption (symmetric or public key). The encryption can be very strong, but unless the key storage/generation is equally strong it can be attacked. Lots of other methods are possible too, even with good encryption, unless you know what you are doing. A software-only solution will have to store the key in an accessible place, easily found or vulnerable to a man-in-the-middle attack. Same thing is true with keys stored on a web server. Even with good encryption and secure key storage, unless you can detect debuggers the cracker can just take a snapshot of memory and build an exe from that. So you need to never completely decrypt in memory at any one time and have some code for debugger detection. Obfuscation, dead code, etc, won't slow them down for long because they don't crack by starting at the beginning and working through your code. They are far more clever than that. Just look at some of the how-to cracking videos on the net to see how to find the security detection code and crack from there.
Brief shameless promotion: Our hardware system has NEVER been cracked. We have one major client who uses it solely for anti-reverse engineering. So we know it can be done.
Languages like Java and C# are too high-level and do not provide any effective structures against cracking. You could make it hard for script kiddies through obfuscation, but if your product is worth it, it will be broken anyway.
I would turn this round slightly and think about:
(1) putting in place simple(ish) measures so that your program isn't trivial to hack, so e.g. in Java:
obfuscate your code so that your enemy at least has to go to the moderate hassle of looking through a decompilation of obfuscated code
maybe write a custom class loader to load some classes encrypted in a custom format
look at what information your classes HAVE to expose (e.g. subclass/interface information can't be obfuscated away) and think about ways round that
put some small key functionality in a DLL/format less easy to disassemble
However, the more effort you go to, the more serious hackers will see it as a "challenge". You really just want to make sure that, say, an average 1st year computer science degree student can't hack your program in a few hours.
(2) putting more subtle copyright/authorship markers (e.g. metadata in images, maybe subtly embed a popup that will appear in 1 year's time to all copies that don't connect and authenticate with your server...) that hackers might not bother to look for/disable because their hacked program "works" as it is.
(3) just give your program away in countries where you don't realistically have a chance of making a profit from it and don't worry about it too much-- if anything, it's a form of viral marketing. Remember that in many countries, what we see in the UK/US as "piracy" of our Precious Things is openly tolerated by government/law enforcement; don't base your business model around copyright enforcement that doesn't exist.
I have a pretty popular app (which I won't name here, to avoid crackers' curiosity, of course) and it suffered from cracked versions several times in the past, which caused me many headaches.
After months of struggling with lots of anti-cracking techniques, in 2009 I settled on a method that has proved effective, at least in my case: my app has not been cracked since then.
My method combines three measures:
1 - Lots of checks in the source code (size, CRC, date and so on; use your creativity. For instance, if my app detects tools like OllyDbg running, it forces the machine to shut down)
2 - CodeVirtualizer virtualization of sensitive functions in the source code
3 - EXE encryption
None of these is really effective alone: checks can be bypassed with a debugger, virtualization can be reversed and EXE encryption can be decrypted.
But when used together, they cause BIG pain to any cracker.
It's not perfect, though: so many checks make the app slower, and the EXE encryption can lead to false positives in some anti-virus software.
Even so, there is nothing like not being cracked ;)
Good luck.
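As an illustration of the CRC check mentioned in point 1, here is a minimal Python sketch using the standard `zlib.crc32`. The helper names are made up, and a real program would compute the expected value at build time over its own file contents; the sketch just works on byte strings.

```python
import zlib


def crc32_of(data: bytes) -> int:
    """CRC-32 of data as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


def is_unmodified(data: bytes, expected_crc: int) -> bool:
    """True if data still matches the CRC recorded when it was built."""
    return crc32_of(data) == expected_crc


original = b"pretend these are the bytes of the shipped executable"
recorded_crc = crc32_of(original)            # stored at build time
tampered = original.replace(b"shipped", b"patched")
```

As the post itself notes, a check like this can be bypassed on its own, because the comparison and the stored value live in the same binary that is being patched.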
Personally I am a fan of server-side checks.
It can be as simple as authenticating the application or user each time it runs. However, that can be easily cracked. Or you can put some part of the code on the server side, which would require a lot more work to crack.
However, your program will then require an internet connection as a must-have, and you will have server expenses. But that is the only way to make it relatively well protected. Any standalone application will be cracked relatively fast.
The more logic you move to the server side, the harder it will be to crack. But it will still be cracked if it is worth it. Even large companies like Blizzard can't prevent their server side from being reverse engineered.
I propose the following:
Create in-house a key named KEY1 consisting of N random bytes.
Sell the user a "License number" with the software. Take note of his/her name and surname and tell him/her that those data, and an Internet connection, are required to activate the software.
Within the next 24 hours, upload to your server the "License number", the name and surname, and KEY3 = (KEY1 XOR hash_N_bytes(License_number, name and surname)).
The installer asks for the "License_number" and the name and surname, then sends those data to the server and downloads the key named "KEY3" if those data correspond to a valid sale.
Then the installer computes KEY1 = KEY3 XOR hash_N_bytes(License_number, name and surname)
The installer checks KEY1 using a 16-bit "hash". The application is encrypted with KEY1, so the installer then decrypts the application with the key and it's ready.
Both the installer and the application must have a CRC content check.
Both could check whether they are being debugged.
Both could keep parts of their code encrypted during execution.
What do you think about this method?
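The key arithmetic in this proposal can be sketched in a few lines of Python. The post does not say which hash to use, so the choice of SHAKE-256 for `hash_N_bytes` and of truncated SHA-256 for the 16-bit check are assumptions made for this example, as are all function names. Server communication and the actual decryption of the application are left out.

```python
import hashlib
import secrets


def hash_n_bytes(license_number: str, full_name: str, n: int) -> bytes:
    """N-byte digest of the license data (SHAKE-256 is an assumed choice)."""
    data = f"{license_number}|{full_name}".encode("utf-8")
    return hashlib.shake_256(data).digest(n)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def make_key3(key1: bytes, license_number: str, full_name: str) -> bytes:
    """Vendor side: KEY3 = KEY1 XOR hash_N_bytes(license data)."""
    return xor_bytes(key1, hash_n_bytes(license_number, full_name, len(key1)))


def recover_key1(key3: bytes, license_number: str, full_name: str) -> bytes:
    """Installer side: KEY1 = KEY3 XOR hash_N_bytes(license data)."""
    return xor_bytes(key3, hash_n_bytes(license_number, full_name, len(key3)))


def check16(key: bytes) -> int:
    """16-bit check value for KEY1 (first two bytes of SHA-256, assumed)."""
    return int.from_bytes(hashlib.sha256(key).digest()[:2], "big")


key1 = secrets.token_bytes(32)                      # N = 32 random bytes
key3 = make_key3(key1, "LIC-0001", "Ada Lovelace")  # hypothetical customer
```

Note that anyone who obtains KEY3 together with the matching license data can recompute KEY1, and KEY1 is the same for every customer, so the scheme mainly ties activation to the server lookup.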

Being pressured to GOTO the dark-side

We have a situation at work where developers working on a legacy (core) system are being pressured into using GOTO statements when adding new features into existing code that is already infected with spaghetti code.
Now, I understand there may be arguments for using 'just one little GOTO' instead of spending the time on refactoring to a more maintainable solution. The issue is, this isolated 'just one little GOTO' isn't so isolated. At least once every week or so there is a new 'one little GOTO' to add. This codebase is already a horror to work with due to code dating back to or before 1984 being riddled with GOTOs that would make many Pastafarians believe it was inspired by the Flying Spaghetti Monster itself.
Unfortunately the language this is written in doesn't have any ready made refactoring tools, so it makes it harder to push the 'Refactor to increase productivity later' because short-term wins are the only wins paid attention to here...
Has anyone else experienced this issue, whereby everybody agrees that we cannot be adding new GOTOs to jump 2000 lines to a random section, but Analysts continually insist on doing it just this one time and management approves it?
tldr;
How can one go about addressing the issue of developers being pressured (forced) to continually add GOTO statements (by add, I mean add to jump to random sections many lines away) because it 'gets that feature in quicker'?
I'm beginning to fear we may lose valuable developers to the raptors over this...
Clarification:
Goto here
alsoThere: No, I'm talking about the kind of goto that jumps 1000 lines out of one subroutine into another one mid way into a while loop. Goto somewhereClose
there: I wasn't even talking about the kind of gotos you can reasonably read over and determine what a program was doing. Goto alsoThere
somewhereClose: This is the sort of code that makes meatballs midpoint: If first time here Goto nextpoint detail:(each one almost completely different) Goto pointlessReturn
here: In this question, I was not talking about the occasionally okay use of a goto. Goto there
tacoBell: and it has just gone back to the drawing board. Goto Jail
elsewhere: When it takes Analysts weeks to decypher what a program is doing each time it is touched, something is deeply wrong with your codebase. In fact, I'm actually up to my hell:if not up-to-date goto 4 rendition of the spec goto detail pointlessReturn: goto tacoBell
Jail: Actually, just a small update with a small victory. I spent 4 hours refactoring a portion of this particular program a single label at a time, saving each iteration in svn as I went. Each step (about 20 of them) was smallish, logical and easy enough to goto bypass nextpoint: spontaneously jump out of your meal and onto you screen through some weird sort of spaghetti-meatball magnetism. Goto elseWhere
bypass: reasonably verify that it should not introduce any logic changes. Using this new more readable version, I've sat down with the analyst and completed almost all of this change now. Goto end
4: first *if first time here goto hell, no second if first time here goto hell, no third if first time here goto hell fourth now up-to-date goto hell
end:
How many bugs have been introduced because of incorrectly written GOTOs? How much money did they cost the company? Turn the issue into something concrete, rather than "this feels bad". Once you can get it recognized as a problem by the people in charge, turn it into a policy like, "no new GOTOs for anything except simplifying the exit logic for a function", or "no new GOTOs for any functions that don't have 100% unit test coverage". Over time, tighten the policies.
GOTOs don't make good code spaghetti, there are a multitude of other factors involved. This linux mailing list discussion can help put a few things into perspective (comments from Linus Torvalds about the bigger picture of using gotos).
Trying to institute a "no goto policy" just for the sake of not having gotos will not achieve anything in the long run, and will not make your code more maintainable. The changes will need to be more subtle and focus on increasing the overall quality of the code: think along the lines of using best practices for the platform/language, unit test coverage, static analysis etc.
The reality of development is that despite all the flowery words about doing it the right way, most clients are more interested in doing it the fast way. The concept of a code base rapidly moving towards the point of imploding and the resulting fallout on their business is something that they cannot comprehend because that would mean having to think beyond today.
What you have is just one example. How you stand on this will dictate how you do development in the future. I think you have 4 options:
Accept the request and accept that you will always be doing this.
Accept the request, and immediately start looking for a new job.
Refuse to do it and be prepared to fight to fix the mess.
Resign.
Which option you choose is going to depend on how much you value your job and your self esteem.
Maybe you can use the boy scout pattern: leave the place a little cleaner than you found it. So for every feature request: don't introduce new gotos, but remove one.
This doesn't spend too much time on improvements and leaves more time to find newly introduced bugs. It may not cost much additional time if you remove a goto from the part you already had to spend time understanding in order to bring the new feature in.
Argue that a refactoring of 2 hours will save 20 times 15 minutes in the future, and allow faster and deeper improvements.
Back to principles:
Is it readable?
Does it work?
Is it maintainable?
This is the classic "management" vs. "techies" conflict.
In spite of being on the "techie" team, over the years I have settled firmly on the "management" side of this debate.
There are a number of reasons for this:
It's quite possible to have well-written, easy-to-read, properly structured programs with gotos! Ask any assembler programmer; a conditional branch and a primitive do loop are all they have to work with. Just because the "style" is not current doesn't mean it's not well written.
If it is a mess then it's going to be a real pain to extract the business rules you will need if you are going for a complete rewrite, or, if you are doing a technical refactoring of the program, you will never be sure the behaviour of the refactored program is "correct", i.e. that it does exactly what the old program does.
Return on investment -- sticking to the original programming style and making minimal changes will take minimum effort and quickly satisfy the customer's request. Spending a lot of time and effort refactoring will be more expensive and take longer.
Risk -- rewrites and refactoring are hard to get right, extensive testing of the refactored code is required and things that look like "bugs" may have been "features" as far as the customer is concerned. A particular problem with "improving" legacy code is that business users may have well-established workarounds that depend on a bug being there, or exploit the existence of a bug to change the business procedures without informing the IT department.
So all in all management is faced with a decision -- "add one little goto" test the change and get it into production in double quick time with little risk -- or -- go in for a major programming effort and have the business scream at them when a new set of bugs crops up.
And if you do decide to refactor, in five years' time some snotty-nosed college graduate will be moaning that your refactored program is not buzzword compliant any more and demanding he be allowed to spend weeks rewriting it.
If it ain't broke, don't fix it!
PS: Even Joel thinks this is a bad idea: things you should never do
Update!-
OK if you want to refactor and improve the code you need to go about it properly.
The basic problem is that what you are saying to the client is "I want to spend n weeks working on this program and, if everything goes well, it will do exactly what it does now." -- this is a hard sell to say the least.
You need to gather long-term data on the number of crashes and outages, the time spent analysing and programming seemingly small changes, the number of change requests that were not done because they were too hard, and business opportunities lost because the system could not change fast enough. Also gather some academic data on the costs of maintaining well-structured programs vs. letting them sink.
Unless you have a watertight case to present to the beancounters you will not get the budget. You really have to sell this to the business management, not, your immediate bosses.
I recently had to work on some code that wasn't legacy per se, but the former developers' habits certainly were and thus GOTO's were everywhere. I don't like GOTO's; they make a hideous mess of things and make debugging a nightmare. Worse yet, replacing them with normal code is not always straightforward.
If you can't unwind your GOTO's, I certainly recommend that you no longer use them.
Unfortunately the language this is written in doesn't have any ready made refactoring tools
Don't you have editors with macro capabilities? Don't you have shell scripts? I've done tons of refactoring over the years, very little of it with refactoring browsers.
The underlying problem here seems to be that you have 'analysts' who describe, presumably necessary, functional changes in terms of adding a goto to some code. And then you have 'programmers' whose job appears to be limited to typing in that change and then complaining about it.
To make anything different, that dysfunctional allocation of responsibilities between those people needs to change. There are a lot of different ways to split up the work: the classic, conventional one (that is quite likely the official, but ignored, policy in your work) is to have analysts write an implementation-independent specification document and programmers implement it as maintainably as they can.
The problem with that theoretical approach is that it is unworkably unrealistic in many common situations. In particular, doing it 'properly' requires employees seen as junior to win an argument with colleagues who are more senior, have better social connections in the office, and are more business-savvy. If the argument 'goto is not implementation-independent, so as an analyst you simply can't say that word' doesn't fly at your workplace, then presumably this is the case.
Far better in many circumstances are alternatives like:
Analysts write client-facing code and developers write infrastructure.
Analysts write executable specifications which are used as reference implementations for unit tests.
Developers create a domain specific language in which analysts program.
You drop the distinction between one and the other, having only a hybrid.
If a change to the program requires "just one little goto" I would say that the code was very maintainable.
This is a common problem when dealing with legacy code. Stick to the style the program was originally written in or "modernize" the code. For me the answer is usually to stick with the original style unless you have a really big change that would justify a complete re-write.
Also be aware that several "modern" language features like Java's "throw" statement, or SQL's "ON ERROR", are really "GO TO"s in disguise.
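To illustrate the "goto in disguise" point, here is a small Python sketch in which raising an exception transfers control out of two nested loops in one step, much as a forward goto to a label after the loops would. Python's `raise` plays the role of Java's `throw` here; the `Found` class and the function are made up for the example.

```python
class Found(Exception):
    """Carries the result out of the nested loops."""

    def __init__(self, pair):
        super().__init__(pair)
        self.pair = pair


def first_pair_with_sum(numbers, target):
    """Return the first pair (a, b), with a before b, that sums to target."""
    try:
        for i, a in enumerate(numbers):
            for b in numbers[i + 1:]:
                if a + b == target:
                    raise Found((a, b))  # jumps straight to the except clause
    except Found as hit:
        return hit.pair
    return None
```

The difference from a raw goto is that the jump target is constrained: control can only land in an enclosing handler, never in the middle of an unrelated routine 2000 lines away.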

How to make sure that your code is secure?

I am a programmer with about 5 years of experience programming in different kinds of languages. I used to care about the speed of my code, about optimizing the memory my code uses, about good coding style and so on, but I never thought about how secure my code is. So I disassembled my code to see what a hacker could do. Would it be easy to crack my code?
And I saw that it is! It is very easy, because I was storing
serial number as a string
encryption-decryption codes as well
So if someone has a minimal knowledge of assembler, he/she can simply disassemble it, and after 10-20 minutes of debugging my code is cracked!!! It could probably even be done by opening the exe with Notepad, I guess! :-)
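The Notepad remark is easy to demonstrate: any run of printable characters stored as-is in a binary can be pulled out with a few lines of code, without a disassembler. The sketch below scans a byte string for printable ASCII runs; the function name and the sample bytes are invented for the example.

```python
import re


def printable_strings(blob: bytes, min_len: int = 4):
    """Return every run of at least min_len printable ASCII characters."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, blob)]


# Made-up stand-in for the contents of a compiled program.
sample = b"\x00\x01\x02SERIAL-1234-ABCD\x00\xff\x10ab\x00"
found = printable_strings(sample)
```

Here `found` contains the embedded serial and nothing else, which is why a serial number or key stored as a plain string offers no protection at all.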
So what I am asking are the following:
Where should I store that kind of secure information?
What are the common strategies of delivering a secure code?
First thing you must realize is that you'll never prevent a determined reverser from cracking any protection schemes because anything that the code can do, the reverser will eventually find out how to replicate it. The only way you can achieve any sort of reliable protection is to have the shipped program be nothing more than a dumb client and have the brunt of the software on some server the reverser has no access to.
With that out of the way, you can certainly make it harder for a would be reverser to break your protections. Obfuscation is the sort of first step in achieving this. I have no experience using obfuscators but I'm sure you can find some suggestions for some on SO. Also if you're using a lower level language like C/C++, simply compiling the code with full optimization and stripping all debugging symbols gets you a decent amount of obfuscation.
I read this article a few years ago, but I still think its techniques hold up today. It's one of the developers of a video game called Spyro talking about the set of techniques they used to prevent piracy. They claim it wasn't until 3 months after the release that a cracked version became available, which is fairly impressive.
If you are concerned about piracy, then there are many avenues you can take. Making the code security tighter (obfuscation, license codes, binding the software to a particular PC, hardware/dongle protection, etc) is one, but it's worth bearing in mind that every piece of software can be cracked if someone sufficiently talented can be bothered.
Another approach is to consider the pricing model for your software. If you charge $1000 a copy, then there is a big incentive for someone to have a go at cracking it. If you only charge $5 then why should anyone bother to crack it?
So what is needed is a balance. Even the most basic protection will stop ordinary people making casual copies. Beyond that, simple techniques (obfuscation and license codes) and a sensible pricing strategy will hold most would-be crackers at bay by making it not worth the bother of cracking. After that, you start getting into ever more sophisticated techniques (dongles/CDs needing to be present to run the software, only being able to run the software after logging on to an online licensing system) that take a lot of effort/cost to implement and significantly increase the risk of annoying genuine customers (remember how annoyed everyone got when they bought Half-Life but it wouldn't let them play the game?). Unless you have a popular mainstream product (i.e. a huge revenue stream to protect), there probably isn't much point going to that much effort.
Make it a web app.
It will generally not be well protected unless there's an external service doing the checking that you are in control of, and that service can still be spoofed by those who really want to "crack" it. Instead, trust the customer and provide only minimal copyright protection. I'm sure there was an article or podcast about this by Joel Spolsky somewhere... here's another related SO question.
I have no idea if it will help but Windows provides (since 2000) a mechanism to retrieve and store encrypted information and you can also salt this storage on a per-application basis if needed: Data Protection API (DPAPI)
This is on a machine or a user level but storing serials and perhaps some keys using it might be better than having them hidden in the application?
What sort of secure are you talking about?
Secure from the perspective that you are guarding your users' data well? If so, study some real cryptography and use existing libraries to encrypt your data. The win32 API is pretty good for this.
But if you're talking about stopping a cracker from stealing your application? There are many methods, but just give up. They slow crackers down, they don't stop them.
Look at How to hide a string in binary code? question
First you have to define what your code should be secure against, being secure as such is meaningless.
You seem to be worried about reverse engineering and users generating license codes without paying, though you don't say so. To make this harder you can obfuscate your code and key information in various ways. There are also techniques to make the use of debuggers harder, to prevent the reverse engineer from stepping through the code and seeing the information in the clear.
But this only makes reverse engineering somewhat harder, not impossible.
Another common security threat is execution of unwanted code, for example via buffer overflows.
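A simple technique for hiding code is to XOR over it and XOR it back when you need it, but this needs a good knowledge of assembly. I'm not sure, but you could try something like this (x86 only, and the memory holding the code has to be made writable first, e.g. with VirtualProtect or mprotect):
    void (*encryptionFunctn)(void);      /* points at the function to hide */

    /* XORs the bytes of the function up to (not including) its first 0xC3.
       Caveats: 0xC3 can also occur inside an instruction, and a byte that
       encrypts to 0xC3 stops the second (decrypting) pass early, so storing
       the length is more robust than scanning for ret. */
    void hideEncryptnFunctn(void)
    {
        volatile unsigned char *i = (volatile unsigned char *)encryptionFunctn;
        while (*i != 0xC3)               /* 0xC3 is the x86 opcode for ret */
        {
            *i++ ^= 0x45;                /* or any other key byte */
        }
    }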
To protect against hackers viewing your code, you should use an obfuscator. An obfuscator will use various techniques which make it extremely difficult to make sense of the obfuscated code. Some techniques used are string encryption, symbol renaming, control flow obfuscation, etc. Check out Crypto Obfuscator, which also has external method call hiding, Anti-Reflector, Anti-Debugging, etc.
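To make "string encryption" concrete, here is a hand-rolled sketch of the idea. It is not how Crypto Obfuscator or any particular tool does it; the names and the single-byte XOR key are assumptions chosen for brevity. The string is stored pre-encoded, so a plain strings dump of the binary does not show it, and it is decoded into a buffer only when needed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define XOR_KEY 0x45u   /* illustrative key byte, any value would do */

/* XOR every byte of buf with key. Applying it twice restores the input. */
void xor_bytes(unsigned char *buf, size_t len, unsigned char key)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++)
        buf[i] ^= key;
}

/* The text "secret" with each byte XORed with XOR_KEY beforehand. */
static const unsigned char hidden[] = { 0x36, 0x20, 0x26, 0x37, 0x20, 0x31 };

/* Decode the hidden string into out, which must hold at least
   sizeof hidden + 1 bytes. */
void reveal_hidden(char *out, size_t out_size)
{
    assert(out != NULL && out_size > sizeof hidden);
    memcpy(out, hidden, sizeof hidden);
    xor_bytes((unsigned char *)out, sizeof hidden, XOR_KEY);
    out[sizeof hidden] = '\0';
}
```

Anyone stepping through reveal_hidden in a debugger still sees the plain text once it is decoded, which is why this counts as an obstacle and not as real protection.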
The goal is to erect as many obstacles as possible in the path of a would-be hacker.

The best way to familiarize yourself with an inherited codebase

Stacker Nobody asked about the most shocking thing new programmers find as they enter the field.
Very high on the list is the impact of inheriting a codebase with which one must rapidly become acquainted. It can be quite a shock to suddenly find yourself charged with maintaining N lines of code that have been cobbled together for who knows how long, and to have a short time in which to start contributing to it.
How do you efficiently absorb all this new data? What eases this transition? Is the only real solution to have already contributed to enough open-source projects that the shock wears off?
This also applies to veteran programmers. What techniques do you use to ease the transition into a new codebase?
I added the Community-Building tag to this because I'd also like to hear some war-stories about these transitions. Feel free to share how you handled a particularly stressful learning curve.
Pencil & Notebook (don't get distracted trying to create an unrequested solution)
Make notes as you go and take an hour every Monday to read through and arrange the notes from previous weeks.
With large codebases first impressions can be deceiving, and issues tend to rearrange themselves rapidly while you are familiarizing yourself.
Remember the issues from your last work environment aren't necessarily valid or germane in your new environment. Beware of preconceived notions.
The notes/observations you make will help you learn quickly what questions to ask and of whom.
Hopefully you've been gathering the names of all the official (and unofficial) stakeholders.
One of the best ways to familiarize yourself with inherited code is to get your hands dirty. Start with fixing a few simple bugs and work your way into more complex ones. That will warm you up to the code better than trying to systematically review the code.
If there's a requirements or functional specification document (which is hopefully up-to-date), you must read it.
If there's a high-level or detailed design document (which is hopefully up-to-date), you probably should read it.
Another good way is to arrange a "transfer of information" session with the people who are familiar with the code, where they provide a presentation of the high level design and also do a walk-through of important/tricky parts of the code.
Write unit tests. You'll find the warts quicker, and you'll be more confident when the time comes to change the code.
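For inherited code, such tests usually start out as a record of what the code does today and not of what it should do. A small sketch, where legacy_discount_percent is a made-up stand-in for whatever function you have inherited:

```c
#include <assert.h>

/* Hypothetical stand-in for an inherited function you do not yet trust. */
int legacy_discount_percent(int quantity)
{
    if (quantity >= 100) return 15;
    if (quantity >= 10)  return 5;
    if (quantity < 0)    return 5;   /* wart: a negative quantity earns a discount */
    return 0;
}

/* Pins down the current behaviour, warts included, so that a later
   change which alters any of it fails loudly. */
void test_legacy_discount(void)
{
    assert(legacy_discount_percent(0)   == 0);
    assert(legacy_discount_percent(9)   == 0);    /* just below the first band */
    assert(legacy_discount_percent(10)  == 5);
    assert(legacy_discount_percent(99)  == 5);
    assert(legacy_discount_percent(100) == 15);
    assert(legacy_discount_percent(-1)  == 5);    /* surprising, recorded as-is */
}
```

Writing the boundary cases is what turns up warts like the last one; whether it is a bug is a question for later, once the test is in place.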
Try to understand the business logic behind the code. Once you know why the code was written in the first place and what it is supposed to do, you can start reading through it or, as someone said, probably fixing a few bugs here and there.
My steps would be:
1.) Set up a Source Insight (or any good source code browser you use) workspace/project with all the source and header files in the code base. Browse at a high level from the topmost function (main) down to the lowest-level functions. While browsing, keep notes on paper or in a Word document tracing the flow of function calls. Do not get into the nitty-gritty of function implementations in this step; keep that for later iterations. In this step, keep track of what arguments are passed to functions, how those arguments are initialized, set and modified, what the return values are, and how the return values are used.
2.) After one iteration of step 1.), by which point you have some understanding of the code and the data structures used in the code base, set up an MSVC project (or a project for whichever compiler suits the programming language of the code base), compile the code, run it with a valid test case, and single-step through the code again from main down to the lowest level of functions. Between function calls, keep noting the values of variables passed and returned, the code paths taken, the code paths avoided, etc.
3.) Keep repeating 1.) and 2.) iteratively until you are comfortable enough to change some code, add some code, or find and fix a bug in the existing code!
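Where single-stepping is awkward, the same notes (which function was entered, with what argument, and what it returned) can be collected with temporary trace statements. This is only a sketch: the macro and function names are invented, it handles a single int argument, and it relies on C99's __func__.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

char trace_log[1024];      /* collected trace text, starts out empty */
int  trace_depth = 0;      /* current call nesting, used for indentation */

/* Append one indented line such as "  > square(2)" to trace_log. */
void trace_line(const char *marker, const char *func, int value)
{
    size_t used = strlen(trace_log);

    assert(trace_depth >= 0);
    snprintf(trace_log + used, sizeof trace_log - used, "%*s%s %s(%d)\n",
             trace_depth * 2, "", marker, func, value);
}

#define TRACE_ENTER(arg) do { trace_line(">", __func__, (arg)); trace_depth++; } while (0)
#define TRACE_LEAVE(ret) do { trace_depth--; trace_line("<", __func__, (ret)); } while (0)

/* Two toy functions standing in for the code being studied. */
int square(int x)
{
    int r;
    TRACE_ENTER(x);
    r = x * x;
    TRACE_LEAVE(r);
    return r;
}

int sum_of_squares(int n)
{
    int i, total = 0;
    TRACE_ENTER(n);
    for (i = 1; i <= n; i++)
        total += square(i);
    TRACE_LEAVE(total);
    return total;
}
```

After calling sum_of_squares(2), trace_log holds an indented listing of every entry (">") and exit ("<"), which can be printed with fputs(trace_log, stderr) and pasted into your notes.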
-AD
I don't know about this being "the best way", but something I did at a recent job was to write a code spider/parser (in Ruby) that went through and built a call tree (and a reverse call tree) which I could later query. This was slightly non-trivial because we had PHP which called Perl which called SQL functions/procedures. Any other code-crawling tools would help in a similar fashion (i.e. javadoc, rdoc, perldoc, Doxygen etc.).
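The queryable part of such a tool is small. Below is a rough sketch in C (not the original Ruby spider) of a call tree that can be asked both "what does X call?" and "who calls X?". All names are invented for the example, the parsing that would feed it is left out, and the fixed-size matrix is just the simplest thing that works for a few dozen functions.

```c
#include <assert.h>
#include <string.h>

#define MAX_FUNCS 64

struct call_graph {
    const char *names[MAX_FUNCS];               /* index = function id */
    int count;
    unsigned char calls[MAX_FUNCS][MAX_FUNCS];  /* calls[a][b] != 0: a calls b */
};

/* Id of name, or -1 if it has not been seen. */
static int cg_find(const struct call_graph *g, const char *name)
{
    int i;
    for (i = 0; i < g->count; i++)
        if (strcmp(g->names[i], name) == 0)
            return i;
    return -1;
}

/* Id of name, adding it if new; -1 if the table is full. */
static int cg_intern(struct call_graph *g, const char *name)
{
    int id = cg_find(g, name);
    if (id >= 0)
        return id;
    if (g->count == MAX_FUNCS)
        return -1;
    g->names[g->count] = name;   /* the caller keeps the string alive */
    return g->count++;
}

/* Record that caller calls callee. Returns 0, or -1 if the table is full. */
int cg_add_call(struct call_graph *g, const char *caller, const char *callee)
{
    int a, b;
    assert(g != NULL && caller != NULL && callee != NULL);
    a = cg_intern(g, caller);
    b = cg_intern(g, callee);
    if (a < 0 || b < 0)
        return -1;
    g->calls[a][b] = 1;
    return 0;
}

/* reverse == 0 lists what name calls, reverse != 0 lists who calls name.
   Writes up to max names into out and returns how many were written. */
static int cg_query(const struct call_graph *g, const char *name,
                    int reverse, const char **out, int max)
{
    int id = cg_find(g, name), i, n = 0;
    if (id < 0)
        return 0;
    for (i = 0; i < g->count && n < max; i++)
        if (reverse ? g->calls[i][id] : g->calls[id][i])
            out[n++] = g->names[i];
    return n;
}

int cg_callees(const struct call_graph *g, const char *name, const char **out, int max)
{
    return cg_query(g, name, 0, out, max);
}

int cg_callers(const struct call_graph *g, const char *name, const char **out, int max)
{
    return cg_query(g, name, 1, out, max);
}
```

The graph must start zeroed (a static or global struct call_graph is enough). The reverse query is the one that pays off when asking "can I safely change this stored procedure?".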
Reading any unit tests or specs can be quite enlightening.
Documenting things helps (either for yourself, or for other teammates, current and future). Read any existing documentation.
Of course, don't underestimate the power of simply asking a fellow teammate (or your boss!) questions. Early on, I asked as often as necessary "do we have a function/script/foo that does X?"
Go over the core libraries and read the function declarations. If it's C/C++, this means only the headers. Document whatever you don't understand.
The last time I did this, one of the comments I inserted was "This class is never used".
Do try to understand the code by fixing bugs in it. Do correct or maintain documentation. Don't modify comments in the code itself, that risks introducing new bugs.
In our line of work, generally speaking, we make no changes to production code without good reason. This includes cosmetic changes; even these can introduce bugs.
No matter how disgusting a section of code seems, don't be tempted to rewrite it unless you have a bugfix or other change to do. If you spot a bug (or possible bug) when reading the code trying to learn it, record the bug for later triage, but don't attempt to fix it.
Another Procedure...
After reading Andy Hunt's "Pragmatic Thinking and Learning - Refactor Your Wetware" (which doesn't address this directly), I picked up a few tips that may be worth mentioning:
Observe Behavior:
If there's a UI, all the better. Use the app and get a mental map of relationships (e.g. links, modals, etc). Look at HTTP requests if it helps, but don't put too much emphasis on them -- you just want a light, friendly acquaintance with the app.
Acknowledge the Folder Structure:
Once again, this is light. Just see what belongs where, and hope that the structure is semantic enough -- you can always get some top-level information from here.
Analyze Call-Stacks, Top-Down:
List function calls, objects, and variables on paper or some other medium, starting with those closest to the top level. Try not to type it -- writing by hand gets different parts of your brain engaged (build it out of Legos if you have to). Look at constants and modules, and make sure you don't dive into fine-grained features if you can help it.
MindMap It!:
Maybe the most important step. Create a very rough draft mapping of your current understanding of the code. Make sure you run through the mindmap quickly. This lets different parts of your brain (mostly R-Mode) have an even say in the map.
Create clouds, boxes, etc. wherever you initially think they should go on the paper. Feel free to label boxes with syntactic symbols (e.g. 'F'-Function, 'f'-closure, 'C'-Constant, 'V'-Global Var, 'v'-low-level var, etc). Use arrows: incoming arrows for arguments, outgoing for returns, or whatever comes more naturally to you.
Start drawing connections to denote relationships. It's OK if it looks messy - this is a first draft.
Make a quick rough revision. If it's too hard to read, do another quick organization of it, but don't do more than one revision.
Open the Debugger:
Validate or invalidate any notions you had after the mapping. Track variables, arguments, returns, etc.
Track HTTP requests etc to get an idea of where the data is coming from. Look at the headers themselves but don't dive into the details of the request body.
MindMap Again!:
Now you should have a decent idea of most of the top-level functionality.
Create a new MindMap that has anything you missed in the first one. You can take more time with this one and even add some relatively small details -- but don't be afraid of what previous notions they may conflict with.
Compare this map with your last one and eliminate any question you had before, jot down new questions, and jot down conflicting perspectives.
Revise this map if it's too hazy. Revise as much as you want, but keep revisions to a minimum.
Pretend It's Not Code:
If you can put it into mechanical terms, do so. The most important part of this is to come up with a metaphor for the app's behavior and/or smaller parts of the code. Think of ridiculous things, seriously. If it were an animal, a monster, a star, a robot, what kind would it be? If it were in Star Trek, what would they use it for? Think of many things to weigh it against.
Synthesis over Analysis:
Now you want to see not 'what' but 'how'. Any low-level parts that threw you for a loop could be taken out and put into a sterile environment (where you control the inputs). What sort of outputs are you getting? Is the system more complex than you originally thought? Simpler? Does it need improvements?
Contribute Something, Dude!:
Write a test, fix a bug, comment it, abstract it. You should have enough ability to start making minor contributions and FAILING IS OK :)! Make a note of any changes you made in commits, chat, email. If you did something dastardly, you guys can catch it before it goes to production -- and if something is wrong, it's a great way to get a teammate to clear things up for you. Usually listening to a teammate talk will clear up a lot of what made your MindMaps clash.
In a nutshell, the most important thing to do is use a top-down fashion of getting as many different parts of your brain engaged as possible. It may even help to close your laptop and face your seat out the window if possible. Studies have shown that enforcing a deadline creates a "Pressure Hangover" for ~2.5 days after the deadline, which is why deadlines are often best to have on a Friday. So, BE RELAXED, THERE'S NO TIMECRUNCH, AND NOW PROVIDE YOURSELF WITH AN ENVIRONMENT THAT'S SAFE TO FAIL IN. Most of this can be fairly rushed through until you get down to details. Make sure that you don't bypass understanding of high-level topics.
Hope this helps you as well :)
All really good answers here. Just wanted to add few more things:
One can pair architectural understanding with flash cards, and revisiting those can solidify understanding. I find questions such as "Which part of the code does X?" useful, where X is some functionality in your code base.
I also like to open a buffer in emacs and start re-writing some parts of the code base that I want to familiarize myself with and add my own comments etc.
One thing vi and emacs users can do is use tags. Tags are contained in a file (usually called TAGS). You generate one or more tags files with a command (etags for emacs, ctags for vi). Then when you edit source code and see a confusing function or variable, you load the tags file and it will take you to where the function is declared (not perfect, but good enough). I've actually written some macros that let you navigate source using Alt-cursor, sort of like popd and pushd in many flavors of UNIX.
BubbaT
The first thing I do before going down into code is to use the application (as several different users, if necessary) to understand all the functionalities and see how they connect (how information flows inside the application).
After that I examine the framework in which the application was built, so that I can make a direct relationship between all the interfaces I have just seen with some View or UI code.
Then I look at the database and any database commands handling layer (if applicable), to understand how that information (which users manipulate) is stored and how it goes to and comes from the application
Finally, after learning where data comes from and how it is displayed I look at the business logic layer to see how data gets transformed.
I believe every application architecture can be divided like this, and knowing the overall function (a who's who of your application) might be beneficial before really debugging it or adding new stuff - that is, if you have enough time to do so.
And yes, it also helps a lot to talk with someone who developed the current version of the software. However, if he/she is going to leave the company soon, keep a note of his/her wish list (what they wanted to do for the project but were unable to because of budget constraints).
create documentation for each thing you figured out from the codebase.
find out how it works by experimentation - changing a few lines here and there and seeing what happens.
use geany as it speeds up the searching of commonly used variables and functions in the program and adds it to autocomplete.
find out if you can contact the original developers of the code base, through facebook or through googling for them.
find out the original purpose of the code and see if the code still fits that purpose or should be rewritten from scratch, in fulfillment of the intended purpose.
find out what frameworks did the code use, what editors did they use to produce the code.
the easiest way to deduce how a code works is by actually replicating how a certain part would have been done by you and rechecking the code if there is such a part.
it's reverse engineering - figuring out something by just trying to reengineer the solution.
most computer programmers have experience in coding, and there are certain patterns that you could look up if that's present in the code.
there are two types of code, object oriented and structurally oriented.
if you know how to do both, you're good to go, but if you aren't familiar with one or the other, you'd have to relearn how to program in that fashion to understand why it was coded that way.
in object-oriented code, you can easily create diagrams documenting the behaviors and methods of each object class.
if it's structurally oriented, meaning by function, create a functions list documenting what each function does and where it appears in the code..
i haven't done either of the above myself, as i'm a web developer it is relatively easy to figure out starting from index.php to the rest of the other pages how something works.
good luck.