I occasionally find myself needing certain filesystem operations that could be implemented very efficiently if the filesystem supported them, but I've never heard of APIs for them. For example:
Truncate file from the beginning, on an allocation unit boundary
Split file into two on an allocation unit boundary
Insert or remove a chunk from the middle of the file, again, on an allocation unit boundary
The only way that I know of to do things like these is to rewrite the data into a new file. This has the benefit that the allocation unit is no longer relevant, but is extremely slow in comparison to some low-level filesystem magic.
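For concreteness, the rewrite workaround looks something like this. It's a minimal sketch in Java (the file names and cut point are made up for illustration): it drops the first bytes of a file by copying everything after the cut point into a new file and swapping it into place, which is exactly the data copy I'd like the filesystem to spare me.

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class TruncateFromStart {
    public static void main(String[] args) throws IOException {
        Path src = Paths.get("archive.dat");   // hypothetical input file
        Path dst = Paths.get("archive.tmp");   // hypothetical scratch file
        long cut = 4096;                       // bytes to drop from the front
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long pos = cut;
            long remaining = in.size() - cut;
            while (remaining > 0) {            // copy the tail, chunk by chunk
                long n = in.transferTo(pos, remaining, out);
                pos += n;
                remaining -= n;
            }
        }
        Files.move(dst, src, StandardCopyOption.REPLACE_EXISTING);
    }
}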
I understand that the alignment requirements mean that the methods aren't always applicable, but I think they can still be useful. For example, a file archiver may be able to trim down the archive very efficiently after the user deletes a file from the archive, even if that leaves a small amount of garbage on either side for alignment reasons.
Is it really the case that such APIs don't exist, or am I simply not aware of them? I am mostly interested in NTFS, but hearing about other filesystems will be interesting too.
For NTFS and FAT there are no such APIs. You can obviously truncate the end of a file, but not the beginning.
Implementing this yourself is inadvisable due to file system caching.
Most of the time people implement a layer "on top" of NTFS to support this.
Raymond Chen has essentially answered this question.
His answer is that no, such APIs don't exist, because there is too little demand for them. Raymond also suggests using sparse files and decommitting blocks by zeroing them (on NTFS this means marking the file sparse and issuing an FSCTL_SET_ZERO_DATA control code, which deallocates the zeroed range).
There's a similar question about streaming large results, but the answer just points at the docs and no clear answer emerges.
I believe that merely treating a full result set as a stream still takes a lot of memory on the JDBC driver side.
I am wondering if there's any clear-cut pattern, or best practice, for making it work, especially on the JDBC driver side.
And in particular I am not sure why setFetchSize(Integer.MIN_VALUE) is considered such a good idea, as it seems far from optimal if it means each row is sent on the wire on its own.
I believe libraries like jOOQ and Slick already take care of that... and am curious how to accomplish it both with and without them.
Thanks!
I am wondering if there's any clear-cut pattern, or best practice, for making it work, especially on the JDBC driver side.
The best practice is not to do synchronous streaming but rather to fetch in moderate-size chunks (a sketch follows). However, avoid paginating with OFFSET, which forces the database to re-scan all the rows you skip. If you're doing a batch process, this can be facilitated by first selecting and pushing the data into a temporary table (i.e. turn the original results you want into a table first, and then select chunks from that table... databases are really fast at copying data internally).
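Here is a minimal sketch of what I mean by chunked fetching, using keyset pagination instead of OFFSET (the connection URL, table, and column names are hypothetical): each query resumes from the last key seen, so the database never re-scans skipped rows.

import java.sql.*;

public class ChunkedFetch {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection URL; any JDBC source works the same way.
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb")) {
            long lastId = 0;
            final int chunkSize = 1000;
            while (true) {
                int rows = 0;
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, payload FROM work_items WHERE id > ? ORDER BY id LIMIT ?")) {
                    ps.setLong(1, lastId);
                    ps.setInt(2, chunkSize);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            lastId = rs.getLong("id");   // remember where this chunk ended
                            rows++;
                            // ... process rs.getString("payload") here ...
                        }
                    }
                }
                if (rows < chunkSize) break;             // a short chunk means we're done
            }
        }
    }
}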
Synchronous streaming (i.e. handing the consumer an iterator) in general does not scale. It does not scale well for batch processing and it certainly does not scale for handling loads of clients. This is why the drivers vary and do so many different things: because it is a pull model, it is fairly difficult to figure out how many resources to load ahead of the consumer (the sketch below shows how two common drivers are configured to stream). Async streaming (a push model) would probably help, but unfortunately the JDBC standard does not support it.
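For reference, here is how two common drivers are coaxed into streaming instead of buffering the whole result set; both behaviors are documented by the respective drivers, and the table name is made up. It also explains the odd-looking Integer.MIN_VALUE from the question: it is MySQL Connector/J's signal for row-by-row streaming, not a tunable fetch size.

import java.sql.*;

public class StreamingFetch {
    // PostgreSQL: cursor-based fetching requires autocommit off, a
    // forward-only ResultSet, and a positive fetch size.
    static ResultSet streamPostgres(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        Statement st = conn.createStatement();
        st.setFetchSize(50);                  // rows per round trip
        return st.executeQuery("SELECT * FROM big_table");
    }

    // MySQL Connector/J: streams one row at a time when given a
    // forward-only, read-only statement and fetch size Integer.MIN_VALUE.
    static ResultSet streamMysql(Connection conn) throws SQLException {
        Statement st = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        st.setFetchSize(Integer.MIN_VALUE);
        return st.executeQuery("SELECT * FROM big_table");
    }
}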
You might notice that this is one of the reasons why many of the wrappers around JDBC, such as Spring JDBC, do not return Iterators (along with the fact that the resource also needs to be manually cleaned up). Some of the wrappers provide iterators, but really they just turn the results into a list.
It is rather disturbing that the Scala version you link to is upvoted, given the stateful nature of managing a ResultSet... it's very un-Scala-like. I'm not sure those folks know they have to consume the iterator and close the connection/ResultSet properly, which requires a fair amount of imperative programming.
While it may seem inefficient to let the database decide how much to buffer, just remember that most database connections are extremely heavy memory-wise (at least on Postgres they are). So if you take a long time streaming and have many clients, you're going to have to create more connections and put a serious burden on the database. Not to mention that the default buffers have probably been highly optimized (i.e. the result-set size the client ends up with).
Finally, for batch processing, chunks can be done in parallel, which is obviously more efficient than a synchronous pipeline; the work can also be restarted (without having to rework already-processed data) if a problem occurs. A sketch follows.
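A rough sketch of the parallel variant (the pool size and id bounds are arbitrary): each worker claims a disjoint id range from the temporary table, so chunks can be processed independently and re-run after a failure.

import java.util.concurrent.*;

public class ParallelChunks {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        final long chunk = 10_000, maxId = 1_000_000;  // hypothetical bounds
        for (long lo = 0; lo < maxId; lo += chunk) {
            final long from = lo, to = lo + chunk;
            // Each worker should open its own connection and select
            // WHERE id >= from AND id < to from the temporary table.
            pool.submit(() -> processRange(from, to));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void processRange(long from, long to) {
        // ... hypothetical per-chunk work goes here ...
    }
}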
I've read some web development materials, and every time someone asks how to organize a website's JS, CSS, HTML, and PHP files, people suggest a single JS file for the whole website. And the argument is speed.
I clearly understand that the fewer requests there are, the faster the page responds. But I have never understood the single-JS argument. Suppose you have 10 webpages and each webpage needs a JS function to manipulate its DOM objects. Put all 10 functions in a single JS file and execute that file on every webpage, and 9 out of 10 functions are doing useless work, with CPU time wasted searching for nonexistent DOM objects.
I know that CPU time on an individual client machine is trivial compared to bandwidth on a single server machine. I am not saying that you should have many JS files on a single webpage. But I don't see anything wrong with every webpage referring to 1 to 3 JS files that are cached on the client machine. There are many good ways to do caching: for example, you can use an expiry date, or you can include a version number in the JS file name. Compared to mixing the functionality for all the pages of a website into one big JS file, I far prefer splitting the JS code into smaller files.
Any criticism/agreement on my argument? Am I wrong? Thank you for your suggestions.
A function does zero work unless it is called. So 9 uncalled functions do zero work; they just take a little extra space.
A client only has to make one request to download one big JS file, which is then cached for every other page load. That is less work than making small requests on every single page.
I'll give you the answer I always give: it depends.
Combining everything into one file has many great benefits, including:
less network traffic - you might be retrieving one file, but each request/response rides on TCP, with its own series of SYN, SYN-ACK, and ACK messages. A large majority of the transfer time is establishing the session, and there is a lot of overhead in the packet headers.
one location/manageability - although you may only have a few files, it's easy for functions (and class objects) to grow between versions. With the multiple-file approach, functions in one file sometimes call functions/objects in another file (e.g. Ajax in one file, arithmetic functions in another; your arithmetic functions might grow to need to call the Ajax code and have a certain variable type returned). What ends up happening is that your set of files needs to be seen as one version, rather than each file being its own version. Things get hairy down the road if you don't have good management in place, and it's easy to fall out of line with JavaScript files, which are always changing. Having one file makes it easy to manage the version across each of your pages and your (1 to many) websites.
Other topics to consider:
dormant code - you might think that the uncalled functions reduce performance by taking up space in memory, and you'd be right; however, this cost is so minuscule that it doesn't matter. Functions are indexed in memory, and while the index table may grow, the impact is trivial for small projects, especially given today's hardware.
memory leaks - this is probably the largest reason not to combine all the code, but it is a small issue given the amount of memory in systems today and the better garbage collection browsers now have. Also, this is something that you, as a programmer, have the ability to control. Quality code leads to fewer problems like this.
Why does it depend?
While it's easy to say throw all your code into one file, that would be wrong. It depends on how large your code is, how many functions there are, who maintains it, etc. Surely you wouldn't pack your locally written functions into the jQuery package, and you may have different programmers maintaining different blocks of code - it depends on your setup.
It also depends on size. Some programmers embed images in their files as base64-encoded ASCII (data URIs) to reduce the number of requests, and these can bloat files. Surely you don't want to package everything into one 50MB file, especially if there are core functions that are needed for the page to load.
So to bring my response to a close, we'd need more information about your setup, because it depends. Surely 3 files is acceptable regardless of size, combining where you see fit; it probably wouldn't really hurt network traffic. But 50 files is unreasonable. I use the hand rule (no more than 5), and surely you'll see a benefit combining those five 1KB files into one 5KB file.
Two reasons that I can think of:
Less network latency. Each .js file requires another request/response round trip to the server it's downloaded from.
Fewer bytes on the wire and less memory. If it's a single file you can strip out unnecessary characters and minify the whole thing.
The JavaScript should be designed so that the extra functions don't execute at all unless they're needed.
For example, you can define a set of functions in your script but only call them in (very short) inline <script> blocks in the pages themselves.
My line of thought is that you have fewer requests. When you make requests in the head of the page, they stall the output of the rest of the page: the user agent cannot render the rest of the page until the JavaScript files have been fetched. Also, JavaScript files download synchronously; they queue up instead of being pulled at once (at least that is the theory).
I am a programmer with about 5 years' experience programming in different kinds of languages. I used to worry about my code's speed, about optimizing the memory my code uses, about good coding style, and so on. But I had never thought about how secure my code is. So I disassembled my code to see what a hacker could do. Would it be easy to crack my code?
And I saw that it is! It is very easy, because I was storing
the serial number as a plain string
the encryption/decryption keys as well
So anyone with minimal knowledge of assembler can simply disassemble it, and after 10-20 minutes of debugging my code is cracked!!! I guess it could even be done by opening the exe with Notepad! :-)
So what I am asking is the following:
Where should I store that kind of secret information?
What are the common strategies for delivering secure code?
The first thing you must realize is that you'll never prevent a determined reverser from cracking any protection scheme, because anything the code can do, the reverser will eventually figure out how to replicate. The only way to achieve any sort of reliable protection is to ship a program that is nothing more than a dumb client and keep the brunt of the software on a server the reverser has no access to.
With that out of the way, you can certainly make it harder for a would-be reverser to break your protections. Obfuscation is the first step in achieving this. I have no experience using obfuscators, but I'm sure you can find some suggestions on SO. Also, if you're using a lower-level language like C/C++, simply compiling the code with full optimization and stripping all debugging symbols gets you a decent amount of obfuscation.
I read this article a few years ago, but I still think its techniques hold up today. It's by one of the developers of a video game called Spyro, talking about the set of techniques they used to prevent piracy. They claim it wasn't until 3 months after the release that a cracked version became available, which is fairly impressive.
If you are concerned about piracy, then there are many avenues you can take. Making the code security tighter (obfuscation, license codes, binding the software to a particular PC, hardware/dongle protection, etc) is one, but it's worth bearing in mind that every piece of software can be cracked if someone sufficiently talented can be bothered.
Another approach is to consider the pricing model for your software. If you charge $1000 a copy, then there is a big incentive for someone to have a go at cracking it. If you only charge $5 then why should anyone bother to crack it?
So what is needed is a balance. Even the most basic protection will stop ordinary people from making casual copies. Beyond that, simple techniques (obfuscation and license codes) and a sensible pricing strategy will hold most would-be crackers at bay by making the crack not worth the bother. After that, you get into ever more sophisticated techniques (dongles/CDs that must be present to run the software, or software that only runs after logging on to an online licensing system) that take a lot of effort/cost to implement and significantly increase the risk of annoying genuine customers (remember how annoyed everyone got when they bought Half-Life but it wouldn't let them play the game?). Unless you have a popular mainstream product (i.e. a huge revenue stream to protect), there probably isn't much point going to that much effort.
Make it a web app.
It will generally not be well protected unless there's an external service doing the checking that you are in control of - and even that service can be spoofed by those who really want to "crack" it. Instead, trust the customer and provide only minimal copyright protection. I'm sure there was an article or podcast about this by Joel Spolsky somewhere... here's another related SO question.
I have no idea if it will help, but Windows has provided (since 2000) a mechanism to store and retrieve encrypted information, and you can also salt this storage on a per-application basis if needed: the Data Protection API (DPAPI).
This works at a machine or user level, but storing serials and perhaps some keys using it might be better than having them hidden in the application.
What sort of security are you talking about?
Secure from the perspective that you are guarding your users' data well? If so, study some real cryptography and use existing libraries to encrypt your data. The Win32 API is pretty good for this.
But if you're talking about stopping a cracker from stealing your application? There are many methods, but just give up: they slow crackers down, they don't stop them.
Look at the "How to hide a string in binary code?" question.
First you have to define what your code should be secure against; being secure as such is meaningless.
You seem to be worried about reverse engineering and about users generating license codes without paying, though you don't say so. To make this harder you can obfuscate your code and key information in various ways. There are also techniques to make the use of debuggers harder, to prevent the reverse engineer from stepping through the code and seeing the information in the clear.
But this only makes reverse engineering somewhat harder, not impossible.
Another common security threat is execution of unwanted code, for example via buffer overflows.
A simple technique is to XOR over all your code and XOR it back when you need it... but this requires intimate knowledge of assembly. I'm not sure, but you could try something like this:
void (*encryptionFunctn)(void);

void hideEncryptnFunctn(void)
{
    /* Sketch only: the code page must first be made writable
       (VirtualProtect on Windows, mprotect on POSIX), and scanning
       for the ret byte is fragile, since 0xC3 can occur mid-instruction. */
    volatile unsigned char *i = (volatile unsigned char *)encryptionFunctn;
    while (*i != 0xC3)   // 0xC3 is the x86 opcode for ret
    {
        *i++ ^= 0x45;    // or any other key byte
    }
}
To prevent hackers from viewing your code, you should use an obfuscator. An obfuscator applies various techniques that make it extremely difficult to make sense of the obfuscated code. Some techniques used are string encryption, symbol renaming, control-flow obfuscation, etc. Check out Crypto Obfuscator, which additionally has external method call hiding, anti-reflection, anti-debugging, etc.
The goal is to erect as many obstacles as possible in the path of a would-be hacker.
I'm in need of a distributed file system that must scale to very large sizes (about 100TB realistic max). File sizes are mostly in the 10-1500KB range, though some files may peak at about 250MB.
I very much like the thought of systems like GFS with built-in redundancy for backup, which would - statistically - render file loss a thing of the past.
I have a couple of requirements:
Open source
No SPOFs
Automatic file replication (that is, no need for RAID)
Managed client access
Flat namespace of files - preferably
Built in versioning / delayed deletes
Proven deployments
I've looked seriously at MogileFS, as it does fulfill most of the requirements. It does not have any managed clients, but it should be rather straightforward to port the Java client. However, there is no versioning built in. Without versioning, I will have to do normal backups besides the file replication built into MogileFS.
Basically I need protection from a programming error that suddenly purges a lot of files it shouldn't have. While MogileFS does protect me from disk & machine errors by replicating my files over X number of devices, it doesn't save me if I do an unwarranted delete.
I would like to be able to specify that a delete operation doesn't actually take effect until after Y days. The delete will logically have taken place, but I can restore the file state for Y days until it's actually deleted. MogileFS also does not have the ability to check for disk corruption during writes - though again, this could be added.
Since we're a Microsoft shop (Windows, .NET, MSSQL) I'd optimally like the core parts to be running on Windows for easy maintainability, while the storage nodes run *nix (or a combination) due to licensing.
Before I even consider rolling my own, do you have any suggestions? I've also checked out HadoopFS (HDFS), OpenAFS, Lustre, and GFS, but none of them seems to match my requirements.
Do you absolutely need to host this on your own servers? Much of what you need could be provided by Amazon S3. The delayed-delete feature could be implemented by recording deletes in a SimpleDB table and running a garbage-collection pass periodically to expunge files when necessary; a sketch follows.
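A rough sketch of that garbage-collection pass, with hypothetical helper methods standing in for the SimpleDB query and the S3 delete (the real SDK calls would go in their place):

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.List;

public class DelayedDeleteGc {
    static final int GRACE_DAYS = 14;  // the "Y days" grace window

    public static void main(String[] args) {
        Instant cutoff = Instant.now().minus(GRACE_DAYS, ChronoUnit.DAYS);
        for (String key : tombstonesOlderThan(cutoff)) {
            deleteFromS3(key);       // hypothetical: S3 DeleteObject call
            removeTombstone(key);    // hypothetical: SimpleDB delete call
        }
    }

    // Hypothetical: query the SimpleDB tombstone table for deletes
    // recorded before the cutoff.
    static List<String> tombstonesOlderThan(Instant cutoff) { return List.of(); }
    static void deleteFromS3(String key) { }
    static void removeTombstone(String key) { }
}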
There is still a single point of failure if you rely on a single internet connection. And of course you could consider Amazon themselves to be a point of failure but the failure rate is always going to be far lower because of scale.
And hopefully you realize the other benefits: the ability to scale to any capacity, no need for IT staff to replace failed disks or systems, and usage costs that continually drop as disk capacity and bandwidth get cheaper (while disks you purchase depreciate in value).
It's also possible to take a hybrid approach, using S3 as a secure backend archive while caching "hot" data locally, with a caching strategy that best fits your usage model. This can greatly reduce bandwidth usage and improve I/O, especially if data changes infrequently.
Downsides:
Files on S3 are immutable; they can only be replaced entirely or deleted. This is great for caching, not so great for efficiency when making small changes to large files.
Latency and bandwidth are those of your network connection. Caching can help improve this, but you'll never get the same level of performance.
Versioning would also be a custom solution, but it could be implemented using SimpleDB along with S3 to track sets of revisions to a file. Overall, it really depends on your use case whether this would be a good fit.
You could try running a source control system on top of your reliable file system. The problem then becomes how to expunge old check-ins after your timeout. You can set up an Apache server with DAV_SVN, and it will commit each change made through the DAV interface. I'm not sure how well this would scale with the large file sizes you describe.
@tweakt
I've considered S3 extensively as well, but I don't think it will be satisfactory for us in the long run. We have a lot of files that must be stored securely - not through file ACLs, but through our application layer. While this can also be done with S3, we have a bit less control over our file storage. Furthermore, there will be a major downside in the form of latency when we do file operations - both initial saves (which can be done asynchronously, though) and later reads, when we have to fetch the files and perform operations on them.
As for the SPOF, that's not really an issue. We do have redundant connections to our datacenter and while I do not want any SPOFs, the little downtime S3 has had is acceptable.
Unlimited scalability and no need for maintenance is definitely an advantage.
Regarding a hybrid approach: if we are to host directly from S3 - which would be the case unless we want to store everything locally anyway (and just use S3 as backup) - the bandwidth prices are simply too steep once we add S3 + CloudFront (CloudFront would be necessary, as we have clients from all around). Currently we host everything from our datacenter in Europe, and we have our own reverse Squid proxies set up in the US for low-budget CDN functionality.
While it's very domain-dependent, immutability is not an issue for us. We may replace files (that is, key X gets new content), but we will never make minor modifications to a file. All our files are blobs.
I mean, I've always wondered how the hell somebody can develop algorithms to break/cheat the constraints of legal use in many shareware programs out there.
Just out of curiosity.
Apart from being illegal, it's a very complex task.
Speaking at a purely theoretical level, the common way is to disassemble the program you want to crack and find where the key or serial code is checked.
Easier said than done, since any serious protection scheme will check values in multiple places and will also derive critical information from the serial key for later use, so that when you think you've guessed it, the program will crash.
To create a crack you have to identify all the points where a check is done and modify the assembly code appropriately (often inverting a conditional jump, or storing constants into memory locations).
To create a keygen you have to understand the algorithm and write a program to redo the exact same calculation. (I remember an old version of MS Office whose serial had a very simple rule: the sum of the digits had to be a multiple of 7, so writing the keygen was rather trivial - a sketch follows.)
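As a sketch of how trivial that keygen would be, here is the digit-sum rule above coded up; the 9-digit serial format is an assumption for illustration, not the real Office format.

import java.util.Random;

public class DigitSumKeygen {
    public static void main(String[] args) {
        Random rnd = new Random();
        while (true) {
            // Hypothetical 9-digit serial; only the digit-sum rule matters.
            int candidate = 100_000_000 + rnd.nextInt(900_000_000);
            int digitSum = 0;
            for (int n = candidate; n > 0; n /= 10) {
                digitSum += n % 10;
            }
            if (digitSum % 7 == 0) {           // the rule described above
                System.out.println("Serial: " + candidate);
                break;
            }
        }
    }
}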
Both activities require you to follow the execution of the application in a debugger and try to figure out what's happening. And you need to know the low-level API of your operating system.
Some heavily protected applications have their code encrypted so that the file can't be disassembled. It is decrypted when loaded into memory, but then they refuse to start if they detect that an in-memory debugger has been attached.
In essence it's something that requires very deep knowledge, ingenuity, and a lot of time! Oh, did I mention that it is illegal in most countries?
If you want to know more, Google for the +ORC Cracking Tutorials; they are very old and probably useless nowadays, but they will give you a good idea of what it means.
Anyway, a very good reason to know all this is if you want to write your own protection scheme.
The bad guys search for the key-check code using a disassembler. This is relatively easy if you know how to do it.
Afterwards you translate the key-checking code to C or another language (this step is optional). Reversing the process of key-checking gives you a key-generator.
If you know assembler, it takes roughly a weekend to learn how to do this. I did it just a few years ago (never released anything, though; it was just research for my game-development job - to write a hard-to-crack key, you have to understand how people approach cracking).
Nils's post deals with key generators. For cracks, you usually find a branch point and invert (or remove) the condition. For example, the software will test whether it is registered; the test may return zero if so, and the code then jumps accordingly. You can change "jump if equal to zero" (je) to "jump if not equal to zero" (jne) by modifying a single byte (see the sketch below). Or you can write no-operations over various portions of code that do things you don't want done.
Compiled programs can be disassembled and with enough time, determined people can develop binary patches. A crack is simply a binary patch to get the program to behave differently.
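A minimal sketch of such a patch: flip a short je (opcode 0x74) to jne (0x75) at a known offset. The file name and offset here are hypothetical; in practice you find the offset with a disassembler or debugger first.

import java.io.RandomAccessFile;

public class BinaryPatch {
    public static void main(String[] args) throws Exception {
        long offset = 0x1A2B;  // hypothetical offset of the je instruction
        try (RandomAccessFile exe = new RandomAccessFile("target.exe", "rw")) {
            exe.seek(offset);
            int opcode = exe.read();
            if (opcode == 0x74) {   // je rel8
                exe.seek(offset);
                exe.write(0x75);    // jne rel8
            }
        }
    }
}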
First, most copy-protection schemes aren't terribly well advanced, which is why you don't see a lot of people rolling their own these days.
There are a few methods used to do this. You can step through the code in a debugger, which generally requires a decent knowledge of assembly. That gives you an idea of where in the program the copy-protection/keygen methods are called. With that, you can use a disassembler like IDA Pro to analyze the code more closely and try to understand what is going on and how to bypass it. I've cracked time-limited betas before by writing NOP instructions over the date check.
It really just comes down to a good understanding of software and a basic understanding of assembly. Hak5 did a two-part series in the first two episodes this season on the basics of reverse engineering and cracking. It's really basic, but it's probably exactly what you're looking for.
A would-be cracker disassembles the program and looks for the "copy protection" bits, specifically for the algorithm that determines if a serial number is valid. From that code, you can often see what pattern of bits is required to unlock the functionality, and then write a generator to create numbers with those patterns.
Another alternative is to look for functions that return "true" if the serial number is valid and "false" if it's not, then develop a binary patch so that the function always returns "true".
Everything else is largely a variant on those two ideas. Copy protection is always breakable by definition - at some point you have to end up with executable code or the processor couldn't run it.
For the serial number, you can just extract the algorithm and start throwing guesses at it, looking for a positive response. Computers are powerful; it usually takes only a little while before it starts spitting out hits.
As for cracking, I used to be able to step through programs at a high level and look for the point where they stopped working. Then you go back to the last call that succeeded, step into it, and repeat. Back then, copy protection usually meant writing to the disk and checking whether a subsequent read succeeded (if the read succeeded, you were running a copy, because part of the original floppy was burned with a laser so that it couldn't be written to).
Then it was just a matter of finding the right call and hardcoding the correct return value from that call.
I'm sure it's still similar, but they go through a lot of effort to hide the location of the call. The last one I tried, I gave up on because it kept loading code over the code I was single-stepping through, and I'm sure it's become a lot more complicated since then.
I wonder why they don't just distribute personalized binaries, where the name of the owner is stored somewhere (encrypted and obfuscated) in the binary, or better, spread over the whole binary. AFAIK Apple does this with music files from the iTunes Store, although there it's far too easy to remove the name from the files.
I assume each crack is different, but I would guess in most cases somebody spends a lot of time in the debugger tracing the application in question.
The serial generator takes that one step further by analyzing the algorithm that checks the serial number for validity and reverse-engineering it.