How to reverse the lines of a huge file efficiently with limited main memory - reverse

How might I reverse the lines of a huge text file efficiently with limited main memory? What is an efficient algorithm to use?

Should start reading from End of file, then go backwards all to way to the top.
You can read the file one character at a time backwards.
Cache/save all characters until you reach a carriage return.
Reverse the collected string and make it a line.

I am not sure what exactly you want to do, but have a look at the rev and tac commands (if you are on a system which has them).

Related

Cracking a binary file format if I have the contents of one of these files

I have about 300 measurements (each stored in a dat file) that I would like to read using MATLAB or Python. The files can be exported to text or csv using a proprietary program, but this has to be done one by one.
The question is: what would be the best approach to crack the format of the binary file using the known content from the exported file?
Not sure if this makes any difference to make the cracking easier, but the files are just two columns of (900k) numbers, and from the dat files' size (1,800,668 bytes), it appears as if each number is 16 bits (float) and there is some other information (possible the header).
I tried using HEX-Editor, but wasn't able to pick up any trends from there.
Lastly, I want to make sure to specify that these are measurements I made and the data in them belongs to me. I am not trying to obtain data that I am not supposed to.
Thanks for any help.
EDIT: Reading up a little more I realized that there my be some kind of compression going on. When you look at the data in StreamWare, it gives 7 decimal places, leading me to believe that it is a single precision value (4 bytes). However, the size of the files suggests that each value only takes 2 bytes.
After thinking about it a little more, I finally figured it out. This is very specific, but just in case another Dantec StreamWare user runs into the same problem, it could save him/her a little time.
First, the data is actually only a single vector. The time column is calculated from the length of the recorded signal and the sampling frequency. That information is probably in the header (but I wasn't able to crack that portion).
To obtain the values in MATLAB, I skipped the header bytes using fseek(fid, 668, 'bof'), then I read the data as uint16 using fread(fid, 900000, 'uint16'). This gives you integers.
To get the float value, all you have to do is divide by 2^16 (it's a 16 bit resolution system) and multiply by ten. I assume the factor of ten depends on the range of your data acquisition system.
I hope this helps.

Cython: string to list of strings

in pure Python it is easy:
in_string = 'abc,def,ghi,jklmnop,, '
out = in_string.lower().rstrip().split(',') # too slow!!!
out -> ['abc','def','ghi','jklmnop','']
In my case this is called several million times and I need to speed it up a little. I am already using Cython but I do not know not to speed up this particular portion of code.
There can be up to 300 substrings. Pure ASCII. Letters, numbers and some other printable characters. There can be no comma "," in a substring. So a comma is the separator.
Edit:
OK, I see that a simple question turns into a big one. So the data comes from files which have a CSV-like format (no ready to run software works on this) and in total can be 100GB in size. The method reads the file line by line, needs to get the substrings and then sends the substrings to a SQlite database (I am already using executemany). The whole is done in multiprocessing manner, so each file is processed by its own process. The whole is already fast, but I want to squeeze out the last bit of performance. Additionally I want to learn more about Cython. So I have picked this (simple) part of Python code and have run it with "cython -a" which produces a big amount of generated code. So I think this is the best part to start optimizing.
Profiling the code is not that easy because of multiprocessing and cython is being used.
So once someone answers my question, I could implement this code and make a test run. So even I might not improve the speed of my code I will for sure learn a lot. Unfortunately I am a C noob
Yes you can do this in Cython, larger question is if you should.
Where does the input come from?
Is it a file? Then other optimisations are possible, e.g. you could map the file into memory.
Is it a database or network connection? In that case your runtime is probably dominated by waiting for disk/network.
What do you plan to do with the output?
Does the output have to be a string, or can it be a buffer?
"abc,def" -> "abc\0def\0"
buffer1 ------^ ^
buffer2 -----------!
You mentioned that string splitting fragment was called millions of times, processing the string is not that slow, what probably kills performance is allocating all the small strings, an array to hold the result, and then collecting the garbage once substrings are no longer user.
If you could give out pointers to existing data instead, you could speed things up a bit.
How often are these substrings used? If split is called millions of times, it seems to suggest that most substrings are discarded (or you'd run out of memory).
For example, consider the problem "split into substrings and return numbers only"
filter(str.isdigit, "dfasdf,6785,2,dhs,dfgsd,dsg,dsffg".split(","))
If you know in advance that most substrings are not numbers, you'd want to optimise this larger problem as a single block.
How many substrings are there in a typical input?
If there are 4, like in your example, it's not worth it. If there are millions, or even thousands, you may get somewhere.
Is there unicode?
.lower() on an ASCII string is trivial, but not so on unicode. I'd stick to Python if you expect unicode.

MySQL LONGTEXT pagination

I have table posts which contains LONGTEXT. My issue is that I want to retrieve parts of a specific post (basically paging)
I use the following query:
SELECT SUBSTRING(post_content,1000,1000) FROM posts WHERE id=x
This is somehow good, but the problem is the position and the length. Most of the time, the first word and the last word is not complete, which makes sense.
How can I retrieve complete words from position x for length y?
Presumably you're doing this for the purpose of saving on network traffic overhead between the MySQL server and the machine on which your application is running. As it happens, you're not saving any other sort of workload on the MySQL server. It has to fetch the LONGTEXT item from disk, then run it through SUBSTRING.
Presumably you've already decided based on solid performance analysis that you must save this network traffic. You might want to revisit this analysis now that you know it doesn't save much MySQL server workload. Your savings will be marginal, unless you have zillions of very long LONGTEXT items and lots of traffic to retrieve and display parts of them.
In other words, this is an optimization task. YAGNI? http://en.wikipedia.org/wiki/YAGNI
If you do need it you are going to have to create software to process the LONGTEXT item word by word. Your best bet is to do this in your client software. Start by retrieving the first page plus a k or two of the article. Then, parse the text looking for complete words. After you find the last complete word in the first page and its following whitespace, then that character position is the starting place for the next page.
This kind of task is a huge pain in the neck in a MySQL stored procedure. Plus, when you do it in a stored procedure you're going to use processing cycles on a shared and hard-to-scale-up resource (the MySQL server machine) rather than on a cloneable client machine.
I know I didn't give you clean code to just do what you ask. But it's not obviously a good idea to do what you're suggesting.
Edit:
An observation: A gigabyte of server RAM costs roughly USD20. A caching system like memcached does a good job of exploiting USD100 worth of memory efficiently. That's plenty for the use case you have described.
Another observation: many companies who serve large-scale documents use file systems rather than DBMSs to store them. File systems can be shared or replicated very easily among content servers, and files can be random-accessed trivially without any overhead.
It's a bit innovative to store whole books in single BLOBs or CLOBs. If you can break up the books by some kind of segment -- page? chapter? thousand-word chunk? -- and create separate data rows for each segment, your DBMS will scale up MUCH MUCH better than what you have described.
If you're going to do it anyway, here's what you do:
always retrieve 100 characters more than you need in each segment. For example, when you need characters 30000 - 35000, retrieve 30000 - 35100.
after you retrieve the segment, look for the first word break in the data (except on the very first segment) and display starting from that word.
similarly, find the very first word break in the 100 extra bytes, and display up to that word break.
So your fetched data might be 30000 - 35100 and your displayed data might be 30013 - 35048, but it would be whole words.

Checking for Duplicate Files without Storing their Checksums

For instance, you have an application which processes files that are sent by different clients. The clients send tons of files everyday and you load the content of those files into your system. The files have the same format. The only constraint that you are given is you are not allowed to run the same file twice.
In order to check if you ran a particular file is to create a checksum of the file and store it in another file. So when you get a new file, you can create the checksum of that file and compare against the checksums of others files that you have run and stored.
Now, the file that contains all the checksums of all the files that you have run so far is getting really, really huge. Searching and comparing is taking too much time.
NOTE: The application uses flat files as its database. Please do not suggest to use rdbms or the like. It is simply not possible at the moment.
Do you think there could be another way to check the duplicate files?
Keep them in different places: have one directory where the client(s) upload files for processing, have another where those files are stored.
Or are you in a situation where the client can upload the same file multiple times? If that's the case, then you pretty much have to do a full comparison each time.
And checksums, while they give you confidence that two files are different (and, depending on the checksum, a very high confidence), are not 100% guaranteed. You simply can't take a practically-infinite universe of possible multi-byte streams and reduce them to a 32 byte checksum, and be guaranteed uniqueness.
Also: consider a layered directory structure. For example, a file foobar.txt would be stored using the path /f/fo/foobar.txt. This will minimize the cost of scanning directories (a linear operation) for the specific file.
And if you retain checksums, this can be used for your layering: /1/21/321/myfile.txt (using least-significant digits for the structure; the checksum in this case might be 87654321).
Nope. You need to compare all files. Strictly, need to to compare the contents of each new file against all already seen files. You can approximate this with a checksum or hash function, but should you find a new file already listed in your index then you then need to do a full comparison to be sure, since hashes and checksums can have collisions.
So it comes down to how to store the file more efficiently.
I'd recommend you leave it to professional software such as berkleydb or memcached or voldemort or such.
If you must roll your own you could look at the principles behind binary searching (qsort, bsearch etc).
If you maintain the list of seen checksums (and the path to the full file, for that double-check I mentioned above) in sorted form, you can search for it using a binary search. However, the cost of inserting each new item in the correct order becomes increasingly expensive.
One mitigation for a large number of hashes is to bin-sort your hashes e.g. have 256 bins corresponding to the first byte of the hash. You obviously only have to search and insert in the list of hashes that start with that byte-code, and you omit the first byte from storage.
If you are managing hundreds of millions of hashes (in each bin), then you might consider a two-phase sort such that you have a main list for each hash and then a 'recent' list; once the recent list reaches some threshold, say 100000 items, then you do a merge into the main list (O(n)) and reset the recent list.
You need to compare any new document against all previous documents, the efficient way to do that is with hashes.
But you don't have to store all the hashes in a single unordered list, nor does the next step up have to be a full database. Instead you can have directories based on the first digit, or 2 digits of the hash, then files based on the next 2 digits, and those files containing sorted lists of hashes. (Or any similar scheme - you can even make it adaptive, increasing the levels when the files get too big)
That way searching for matches involves, a couple of directory lookups, followed by a binary search in a file.
If you get lots of quick repeats (the same file submitted at the same time), then a Look-aside cache might also be worth having.
I think you're going to have to redesign the system, if I understand your situation and requirements correctly.
Just to clarify, I'm working on the basis that clients send you files throughout the day, with filenames that we can assume are irrelevant, and when you receive a file you need to ensure its [i]contents[/i] are not the same as another file's contents.
In which case, you do need to compare every file against every other file. That's not really avoidable, and you're doing about the best you can manage at the moment. At the very least, asking for a way to avoid the checksum is asking the wrong question - you have to compare an incoming file against the entire corpus of files already processed today, and comparing the checksums is going to be much faster than comparing entire file bodies (not to mention the memory requirements for the latter...).
However, perhaps you can speed up the checking somewhat. If you store the already-processed checksums in something like a trie, it should be a lot quicker to see if a given file (rather, checksum) has already been processed. For a 32-character hash, you'd need to do a maximum of 32 lookups to see if that file had already been processed rather than comparing with potentially every other file. It's effectively a binary search of the existing checksums rather than a linear search.
You should at the very least move the checksums file into a proper database file (assuming it isn't already) - although SQLExpress with its 4GB limit might not be enough here. Then, along with each checksum store the filename, file size and date received, add indexes to file size and checksum, and run your query against only the checksums of files with an identical size.
But as Will says, your method of checking for duplicates isn't guaranteed anyway.
Despite you asking not to suggets and RDBMS I still will suggest SQLite - if you store all checksums in one table with an index searches will be quite fast and integrating SQLite is not a problem at all.
As Will pointed out in his longer answer, you should not store all hashes in a single large file, but simply split them up into several files.
Let's say the alphanumeric-formatted hash is pIqxc9WI. You store that hash in a file named pI_hashes.db (based on the first two characters).
When a new file comes in, calculate the hash, take the first 2 characters, and only do the lookup in the CHARS_hashes.db file
After creating a checksum, create a directory with the checksum as the name and then put the file in there. If there are already files in there, compare your new file with the existing ones.
That way, you only have to check one (or a few) files.
I also suggest to add a header (a single line) to the file which explains what's inside: The date it was created, the IP address of the client, some business keys. The header should be selected in such a way that you can detect duplicates be reading this single line.
[EDIT] Some file systems bog down when you have a directory with many entries (in this case: the checksum directories). If this is an issue for you, create a second layer by using the first two characters of the checksum as the name of the parent directory. Repeat as necessary.
Don't cut off the two characters from the next level; this way, you can easily find files by checksum if something goes wrong without cutting checksums manually.
As mentioned by others, having a different data structure for storing the checksums is the correct way to go. Anyways, although you have mentioned that you dont want to go the RDBMS way, why not try sqlite? You can use it like a file, and it is lightning fast. It is also very simple to use - most languages has sqlite support built-in, too. It will take you less than 40 lines of code in say python.

What are important points when designing a (binary) file format? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
When designing a file format for recording binary data, what attributes would you think the format should have? So far, I've come up with the following important points:
have some "magic bytes" at the beginning, to be able to recognize the files (in my specific case, this should also help to distinguish the files from "legacy" files)
have a file version number at the beginning, so that the file format can be changed later without breaking compatibility
specify the endianness and size of all data items; or: include some space to describe endianness/size of data (I would tend towards the former)
possibly reserve some space for further per-file attributes that might be necessary in the future?
What else would be useful to make the format more future-proof and minimize headache in the future?
Take a look at the PNG spec. This format has some very good rationale behind it.
Also, decide what's important for your future format: compactness, compatibility, allowing to embed other formats (different compression algorithms) inside it. Another interesting example would be the Google's protocol buffers, where size of the transferred data is the king.
As for endianness, I'd suggest you to pick one option and stick with it, not allowing different byte orders. Otherwise, reading and writing libraries will only get more complex and slower.
I agree that these are good ideas:
Magic numbers at the beginning. Pretty much required in *nix:
File version number for backwards compatibility.
Endianness specification.
But your fourth one is overkill, because #2 lets you add fields as long as you change the version number (and as long as you don't need forward compatibility).
possibly reserve some space for further per-file attributes that might be necessary in the future?
Also, the idea of imposing a block-structure on your file, expressed in many other answers, seems less like a universal requirement for binary files than a solution to a problem with certain kinds of payloads.
In addition to 1-3 above, I'd add these:
simple checksum or other way of detecting that the contents are intact. Otherwise you can't trust magic bytes or version numbers. Be careful to spec which bytes are included in the checksum. Typically you would include all bytes in the file that don't already have error detection.
version of your software (including the most granular number you have, e.g. build number) that wrote the file. You're going to get a bug report with an attached file from someone who can't open it and they will have no clue when they wrote the file because the error didn't occur then. But the bug is in the version that wrote it, not in the one trying to read it.
Make it clear in the spec that this is a binary format, i.e. all values 0-255 are allowed for all bytes (except the magic numbers).
And here are some optional ones:
If you do need forward compatibility, you need some way of expressing which "chunks" are "optional" (like png does), so that a previous version of your software can skip over them gracefully.
If you expect these files to be found "in the wild", you might consider embedding some clue to find the spec. Imagine how helpful it would be to find the string http://www.w3.org/TR/PNG/ in a png file.
It all depends on the purpose of the format, of course.
One flexible approach is to structure entire file as TLV (Tag-Length-Value) triplets.
For example, make your file comprized of records, each record beginning with a 4-byte header:
1 byte = record type
3 bytes = record length
followed by record content
Regarding the endianness, if you store endianness indicator in the file, all your applications will have to support all endianness formats. On the other hand, if you specify a particular endianness for your files, only applications on platforms with non-matching endiannes will have to do additional work, and it can be decided at compile time (using conditional compilation).
Another point, taken from .xz file spec (http://tukaani.org/xz/xz-file-format.txt): one of the first few bytes should be a non-character, "to prevent applications from misdetecting the file as a text file.". Note sure how many header bytes are usually inspected by editors and other tools, but using a non-binary byte in the first four or eight bytes seems useful.
One of the most important things to know before even starting is how your file will be used.
Will random or sequential access be the norm?
How often will the data be read?
How often will the data be written?
Will you write out the file in one go or will you be slowing writing it as data comes in.
Will the file need to be portable? Not all formats need to be.
Does it need to be compatible with other versions? Maybe updating the file is sufficient.
Does it need to be easy to read/write?
Size/Speed/Compexity tradeoff.
Most answers here give good advise on the portability/compatibility front so I am not going to add more. But consider the following (often overlooked) things.
Some files are often written and rarely read (backups, logs, ...) and you may want to focus on filesize and easy-writing.
Converting endianness is slow (relatively) if your file will never leave the host, or leaves rarely enough that conversion is a good option you can get a significant performance boost. Consider writing a number such as 0x1234 as part of the header so that you can detect (and instruct the user to convert) if this is the case.
Sometimes easy reading is really useful. If you are doing logs or text documents, consider compressing all in one go rather than per-entry so that you can zcat | strings the file and see what is inside.
There are many things to keep in mind and designing a good format takes a lot of planning and foresight. The little things such as zcating a file and getting useful information or the small performance boost from using native integers can give your product an edge, however you need to be careful that you don't sacrifice something important to get it.
One way to future proof the file would be to provide for blocks. Straight after your file header data, you can begin the first block. The block could have a byte or word code for the type of block, then a size in bytes. Now you can arbitrarily add new block types, and you can skip to the end of a block.
I would consider defining a substructure that higher levels use to store data, a little like a mini file system inside the file.
For example, even though your file format is going to store application-specific data, I would consider defining records / streams etc. inside the file in such a way that application-agnostic code is able to understand the layout of the file, but not of course understand the opaque payloads.
Let's get a little more concrete. Consider the usual ways of storing data in memory: generally they can be boiled down to either contiguous expandable arrays / lists, pointer/reference-based graphs, and binary blobs of data in particular formats.
Thus, it may be fruitful to define the binary file format along similar lines. Use record headers which indicate the length and composition of the following data, whether it's in the form of an array (a list of identically-typed records), references (offsets to other records in the file), or data blobs (e.g. string data in a particular encoding, but not containing any references).
If carefully designed, this can permit the file format to be used not just for persisting data in and out all in one go, but on an incremental, as-needed basis. If the substructure is properly designed, it can be application agnostic yet still permit e.g. a garbage collection application to be written, which understands the blobs, arrays and reference record types, and is able to trace through the file and eliminate unused records (i.e. records that are no longer pointed to).
That's just one idea. Other places to look for ideas are in general file system designs, or relational database physical storage strategies.
Of course, depending on your requirements, this may be overkill. You may simply be after a binary format for persisting in-memory data, in which case an approach to consider is tagged records.
In this approach, every piece of data is prefixed with a tag. The tag indicates the type of the immediately following data, and possibly its length and name. Lists may be suffixed with an "end-list" tag that has no payload. The tag may have an embedded identifier, so tags that aren't understood can be ignored by the serialization mechanism when it's reading things in. It's a bit like XML in this respect, except using binary idioms instead.
Actually, XML is a good place to look for long-term longevity of a file format. Look at its namespacing capabilities. If you construct your reading and writing code carefully, it ought to be possible to write applications that preserve the location and content of tagged (recursively) data they don't understand, possibly because it's been written by a later version of the same application.
Make sure that you reserve a tag code (or better yet reserve a bit in each tag) that specifies a deleted/free block/chunk.
Blocks can then be deleted by simply changing a block's current tag code to the deleted tag code or set the tag's deleted bit.
This way you don't need to right away completely restructure your file when you delete a block.
Reserving a bit in the tag provides the the option of possibly undeleting the block
(if you leave the block's data unchanged).
For security, however you might want to zero out the deleted block's data, in this case you would use a special deleted/free tag.
I agree with Stepan, that you should choose an endianess, but I would also have an endianess indicator in the file.
If you use an endianess indicator you might consider using
one of the UniCode Byte Order Marks also as an inidicator of any UniCode text encoding used for any text blocks. The BOM is usually the first few bytes of UniCoded text files, so if your BOM is the first entry in your file there might be a problem of some utility identifying your file as UniCode text (I don't think this is much an issue).
I would treat/reserve the BOM as one of your normal tags (using either the UTF16 BOM if using the 16bit tags or the UTF32 BOM if using 32bit tags) with a 0 length block/chunk.
See also http://en.wikipedia.org/wiki/File_format
I agree with atzz's suggestion of using a Tag Length Value system. For future compatibility, you could store a set of "pointers" to TLV entries at the start (or maybe Tag,Pointer and have the pointer point to a Length,Value; or perhaps Tag,Length,Pointer and then have all the data together elsewhere?).
So, my file could look something like:
magic number/file id
version
tag for first data entry
pointer to first data entry --------+
tag for second data entry |
pointer to second data entry |
... |
length of first data entry <--------+
value for first data entry
...
magic number, version, tags, pointers and lengths would all be a predefined set length, for easy decoding. Say, 2 bytes. Or 4, depending on what you need. They don't all need to be the same (eg, all tags are 1 byte, pointers are 4 etc).
The tag lets you know what is being stored. The pointer tells you where (either an offset or absolute value, in bytes), the length tells you how large the data is, and the value is length bytes of data of type tag.
If you use a MyFileFormat v1 decoder on a MyFileFormat v2 file, the pointers allow you to skip sections which the v1 decoder doesn't understand. If you simply skip invalid tags, you can probably simply use TLV instead of TPLV.
I would either hand code something like that, or maybe define my format in ASN.1 and generate a codec (I work in telecommunications, so ASN.1/TLV makes sense to me :-D)
If you're dealing with variable-length data, it's much more efficient to use pointers: Have an array of pointers to your data, ideally near the start of the file, rather than storing the data in an array directly.
Indirection is preferrable in this instance because it allows random access, which is only possible if all items are the same size. If the data was directly stored in an array, without specifying the locations of any records, data access would take O(n) time in the worst case; in order for your file-reading code to access a particular element it would have to know the length of all previous elements, and the only way to find that out is to look at each one. If you're reading the entire file at once, then you'd be doing this anyway, so it wouldn't be a problem. But if you only want one thing, then this isn't the way to go.
Whereas with an array of pointers, it's O(1) time all around: all you need is an index number, and you can retrieve and follow the pointer to get at your data.
When writing a file using this method, you would of course have to build up your table in memory before doing any writing.