At my place of work we have a legacy document management system that, for various reasons, is now unsupported by the developers. I have been asked to look into extracting the documents contained in this system so they can eventually be imported into a new third-party system.
From tracing and process monitoring I have determined that the document images (mainly TIFF files) are stored in a number of 1.5 GB files. The images seem to be read from a specific offset in these files, written to a temporary file that is served to the client via a web app, and then deleted.
I guess I am looking for suggestions as to how I can inspect these large files that contain the TIFF images, and eventually extract the images and write them to individual files.
Are the TIFFs compressed in some way? If not, then your job may be pretty easy: stitch the TIFFs together from the 1.5 GB files.
Can you see the output of a particular 1.5 GB file (or series of them)? If so, then you should be able to piece together what the bytes should look like for that TIFF if it were uncompressed.
If the bytes don't appear to be there, then try some standard compression and archive formats (zip, gzip, etc.) to see if you get a match.
I'd open a file, seek to the required offset, and then stream the data into a TIFF object (ideally one that supports streaming from memory or a file). Then you've got it. Poke around at some of the other bits, as there's likely metadata about the document that may be useful to the next system.
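If you end up scanning the containers yourself, a first step is simply locating where TIFF data begins. Below is a rough sketch in Delphi (the language used elsewhere on this page); the procedure name, buffer size and console output are illustrative. It only finds candidate start offsets by looking for the two TIFF signatures, so you would still need the legacy system's index, the next signature, or the TIFF structure itself to decide where each image ends.

uses
  Classes, SysUtils;

procedure FindTiffOffsets(const AContainer: string);
const
  BufSize = 4 * 1024 * 1024; // read in 4 MB chunks
var
  Stream: TFileStream;
  Buf: TBytes;
  BytesRead, I: Integer;
  Base: Int64;
begin
  Stream := TFileStream.Create(AContainer, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Buf, BufSize);
    Base := 0;
    repeat
      Stream.Position := Base;
      BytesRead := Stream.Read(Buf[0], BufSize);
      for I := 0 to BytesRead - 4 do
        // $49 $49 $2A $00 = little-endian TIFF, $4D $4D $00 $2A = big-endian TIFF
        if ((Buf[I] = $49) and (Buf[I + 1] = $49) and (Buf[I + 2] = $2A) and (Buf[I + 3] = $00)) or
           ((Buf[I] = $4D) and (Buf[I + 1] = $4D) and (Buf[I + 2] = $00) and (Buf[I + 3] = $2A)) then
          Writeln('Candidate TIFF at offset ', Base + I);
      // Overlap the reads by 3 bytes so a signature straddling a buffer
      // boundary is not missed.
      Inc(Base, BufSize - 3);
    until BytesRead < BufSize;
  finally
    Stream.Free;
  end;
end;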
Related
I have a FindFile routine in my program which will list files, but if the "Containing Text" field is filled in, then it should only list files containing that text.
If the "Containing Text" field is entered, then I search each file found for the text. My current method of doing that is:
var
  FileContents: TStringList;
begin
  FileContents := TStringList.Create;
  try
    FileContents.LoadFromFile(Filepath);
    Found := Pos(TextToFind, FileContents.Text) > 0;
  finally
    FileContents.Free;
  end;
The above code is simple, and it generally works okay. But it has two problems:
It fails for very large files (e.g. 300 MB)
I feel it could be faster. It isn't bad, but why wait 10 minutes searching through 1000 files, if there might be a simple way to speed it up a bit?
I need this to work for Delphi 2009 and to search text files that may or may not be Unicode. It only needs to work for text files.
So how can I speed this search up and also make it work for very large files?
Bonus: I would also want to allow an "ignore case" option. That's a tougher one to make efficient. Any ideas?
Solution:
Well, mghie pointed out my earlier question How Can I Efficiently Read The First Few Lines of Many Files in Delphi, and as I answered, it was different and didn't provide the solution.
But he got me thinking that I had done this before, and I had. I built a block reading routine for large files that breaks them into 32 MB blocks. I use it to read the input file of my program, which can be huge. The routine works fine and fast. So step one is to do the same for these files I am looking through.
So now the question was how to efficiently search within those blocks. Well I did have a previous question on that topic: Is There An Efficient Whole Word Search Function in Delphi? and RRUZ pointed out the SearchBuf routine to me.
That solves the "bonus" as well, because SearchBuf has options which include Whole Word Search (the answer to that question) and MatchCase/noMatchCase (the answer to the bonus).
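A stripped-down sketch of that combination, for illustration only; a real routine also has to handle matches that straddle a block boundary and detect whether each file is ANSI or Unicode:

uses
  Classes, SysUtils, StrUtils;

function FileContainsText(const AFileName, AText: string;
  AMatchCase: Boolean): Boolean;
const
  BlockSize = 32 * 1024 * 1024; // 32 MB blocks, as described above
var
  Stream: TFileStream;
  Buffer: TBytes;
  Block: string;
  BytesRead: Integer;
  Options: TStringSearchOptions;
begin
  Result := False;
  Options := [soDown];
  if AMatchCase then
    Include(Options, soMatchCase);
  Stream := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Buffer, BlockSize);
    repeat
      BytesRead := Stream.Read(Buffer[0], BlockSize);
      if BytesRead <= 0 then
        Break;
      // Treats the block as ANSI text; a BOM check would be needed to decide
      // between ANSI and Unicode files.
      Block := TEncoding.ANSI.GetString(Buffer, 0, BytesRead);
      if SearchBuf(PChar(Block), Length(Block), 0, 0, AText, Options) <> nil then
        Exit(True);
    until BytesRead < BlockSize;
  finally
    Stream.Free;
  end;
end;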
So I'm off and running. Thanks once again SO community.
The best approach here is probably to use memory mapped files.
First you need a file handle; use the CreateFile Windows API function for that.
Then pass that to CreateFileMapping to get a file mapping handle. Finally use MapViewOfFile to map the file into memory.
To handle large files, MapViewOfFile is able to map only a certain range into memory, so you can e.g. map the first 32 MB, then use UnmapViewOfFile to unmap it, followed by a MapViewOfFile for the next 32 MB, and so on. (EDIT: as was pointed out below, make sure that the blocks you map this way overlap by a multiple of the system allocation granularity — usually 64 KB, the alignment MapViewOfFile requires for its offset — and by at least as much as the length of the text you are searching for, so that you are not overlooking any text which might be split at a block boundary.)
To do the actual searching once the file (or part of it) is mapped into memory, you can make a copy of the source for StrPosLen from SysUtils.pas (it's unfortunately defined in the implementation section only and not exposed in the interface). Leave one copy as is, and make another copy replacing Wide with Ansi every time. Also, if you want to be able to search binary files which might contain embedded #0's, you can remove the '(Str1[I] <> #0) and' part of the condition.
Either find a way to identify if a file is ANSI or Unicode, or simply call both the Ansi and Unicode version on each mapped part of the file.
Once you are done with each file, make sure to call CloseHandle first on the file mapping handle and then on the file handle. (And don't forget to call UnmapViewOfFile first.)
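Here is a minimal Delphi sketch of that mapping loop; error handling and the actual search are stripped down, and the overlap/alignment caveats from the note above apply (the mapping offset must stay a multiple of the allocation granularity, so choose OverlapBytes accordingly):

uses
  Windows, SysUtils;

procedure ScanFileMapped(const AFileName: string; OverlapBytes: Cardinal);
const
  WindowSize = 32 * 1024 * 1024; // 32 MB view
var
  hFile, hMapping: THandle;
  SizeLow, SizeHigh: DWORD;
  FileSize, Offset: Int64;
  View: Pointer;
  ViewSize: Cardinal;
begin
  hFile := CreateFile(PChar(AFileName), GENERIC_READ, FILE_SHARE_READ, nil,
    OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
  if hFile = INVALID_HANDLE_VALUE then
    RaiseLastOSError;
  try
    SizeLow := GetFileSize(hFile, @SizeHigh);
    FileSize := (Int64(SizeHigh) shl 32) or SizeLow;
    hMapping := CreateFileMapping(hFile, nil, PAGE_READONLY, 0, 0, nil);
    if hMapping = 0 then
      RaiseLastOSError;
    try
      Offset := 0;
      while Offset < FileSize do
      begin
        if FileSize - Offset < WindowSize then
          ViewSize := Cardinal(FileSize - Offset)
        else
          ViewSize := WindowSize;
        View := MapViewOfFile(hMapping, FILE_MAP_READ,
          Int64Rec(Offset).Hi, Int64Rec(Offset).Lo, ViewSize);
        if View = nil then
          RaiseLastOSError;
        try
          // ... search the ViewSize bytes at View here, e.g. with the
          // StrPosLen copies described above ...
        finally
          UnmapViewOfFile(View);
        end;
        // Step forward keeping an overlap so a match straddling a window
        // boundary is not missed; the new offset must remain a multiple of
        // the allocation granularity (64 KB).
        Inc(Offset, WindowSize - Int64(OverlapBytes));
      end;
    finally
      CloseHandle(hMapping);
    end;
  finally
    CloseHandle(hFile);
  end;
end;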
EDIT:
A big advantage of using memory mapped files instead of using e.g. a TFileStream to read the file into memory in blocks is that the bytes will only end up in memory once.
Normally, on file access, Windows first reads the bytes into the OS file cache and then copies them from there into the application's memory.
If you use memory mapped files, the OS can directly map the physical pages from the OS file cache into the address space of the application without making another copy (reducing the time needed for making the copy and halving memory usage).
Bonus Answer: By calling StrLIComp instead of StrLComp you can do a case insensitive search.
If you are looking for text string searches, look at the Boyer-Moore search algorithm. The setup I use combines memory-mapped files with a really fast search engine, and there are some Delphi units around that contain implementations of this algorithm.
To give you an idea of the speed: I currently search through 10-20 MB files and it takes on the order of milliseconds.
Oh, I just read that the files might be Unicode; I'm not sure whether it supports that, but definitely look down this path.
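For reference, this is roughly what the core of a Boyer-Moore-Horspool byte search looks like. It is not the unit referred to above, just a hedged sketch of why the algorithm is fast (the skip table lets it jump up to Length(Pattern) bytes at a time), and it does no Unicode handling:

uses
  SysUtils;

function HorspoolFind(const Buffer, Pattern: TBytes): Integer;
var
  Skip: array[Byte] of Integer;
  I, J, M, N, Start: Integer;
begin
  Result := -1;
  M := Length(Pattern);
  N := Length(Buffer);
  if (M = 0) or (N < M) then
    Exit;
  // Default skip is the full pattern length; bytes that occur in the pattern
  // (except its last byte) get a smaller skip.
  for I := Low(Skip) to High(Skip) do
    Skip[I] := M;
  for I := 0 to M - 2 do
    Skip[Pattern[I]] := M - 1 - I;
  Start := 0;
  while Start <= N - M do
  begin
    J := M - 1;
    while (J >= 0) and (Buffer[Start + J] = Pattern[J]) do
      Dec(J);
    if J < 0 then
      Exit(Start); // match found at offset Start
    Inc(Start, Skip[Buffer[Start + M - 1]]);
  end;
end;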
This is a problem connected with your previous question How Can I Efficiently Read The First Few Lines of Many Files in Delphi, and the same answers apply. If you don't read the files completely but in blocks then large files won't pose a problem. There's also a big speed-up to be had for files containing the text, in that you should cancel the search upon the first match. Currently you read the whole files even when the text to be found is in the first few lines.
May I suggest a component? If so, I would recommend ATStreamSearch.
It handles ANSI and UNICODE (and even EBCDIC and Korean and more).
Or the TUTBMSearch class from JclUnicode (JEDI JCL). It was mainly written by Mike Lischke (VirtualTreeView). It uses a tuned Boyer-Moore algorithm that ensures speed. The drawback in your case is that it works entirely in Unicode (WideStrings), so the conversion from String to WideString risks being a penalty.
It depends on what kind of data you are going to search. To achieve really efficient results you would need to let your program parse the interesting directories, including all the files in them, and keep the data in a database that you can then query for a specific word in a specific list of files (generated from the search path). A database query can return results in milliseconds.
The issue is that you will have to let it run and parse all the files after installation, which may take more than an hour depending on the amount of data you wish to index.
The database should be updated each time your program starts; this can be done by comparing the MD5 value of each file to see whether it has changed, so you don't have to re-parse all your files every time.
This way of working is interesting if your data lives in a fixed place and you analyse the same files repeatedly rather than entirely new files each time; some code analysers work like this and they are really efficient. You invest some time in parsing and saving the interesting data, and afterwards you can jump to the exact place where a search word appears and provide a list of all its occurrences in a very short time.
If the files are to be searched multiple times, it could be a good idea to use a word index.
This is called "Full Text Search".
It will be slower the first time (text must be parsed and indexes must be created), but any future search will be immediate: in short, it will use only the indexes, and not read all text again.
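As a toy illustration of what a word index is (not the parser from the article cited below, and nothing like a production FTS engine), assume something along these lines:

uses
  SysUtils, Classes, Generics.Collections;

type
  // Maps each lower-cased word to the list of files it occurs in. Building it
  // costs one pass over all files; afterwards a lookup is a single dictionary
  // access. Tokenisation here is deliberately naive.
  TWordIndex = class
  private
    FIndex: TObjectDictionary<string, TStringList>;
  public
    constructor Create;
    destructor Destroy; override;
    procedure AddFile(const AFileName: string);
    function FilesContaining(const AWord: string): TStringList;
  end;

constructor TWordIndex.Create;
begin
  inherited Create;
  FIndex := TObjectDictionary<string, TStringList>.Create([doOwnsValues]);
end;

destructor TWordIndex.Destroy;
begin
  FIndex.Free;
  inherited;
end;

procedure TWordIndex.AddFile(const AFileName: string);
var
  Content, Tokens, Files: TStringList;
  I: Integer;
  Token: string;
begin
  Content := TStringList.Create;
  Tokens := TStringList.Create;
  try
    Content.LoadFromFile(AFileName);
    Tokens.Delimiter := ' ';
    Tokens.DelimitedText := Content.Text; // naive whitespace tokenisation
    for I := 0 to Tokens.Count - 1 do
    begin
      Token := LowerCase(Trim(Tokens[I]));
      if Token = '' then
        Continue;
      if not FIndex.TryGetValue(Token, Files) then
      begin
        Files := TStringList.Create;
        Files.Sorted := True;
        Files.Duplicates := dupIgnore;
        FIndex.Add(Token, Files);
      end;
      Files.Add(AFileName);
    end;
  finally
    Tokens.Free;
    Content.Free;
  end;
end;

function TWordIndex.FilesContaining(const AWord: string): TStringList;
begin
  // Returns nil if the word occurs in none of the indexed files.
  if not FIndex.TryGetValue(LowerCase(AWord), Result) then
    Result := nil;
end;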
You have the exact parser you need in The Delphi Magazine Issue 78, February 2002:
"Algorithms Alfresco: Ask A Thousand Times
Julian Bucknall discusses word indexing and document searches: if you want to know how Google works its magic this is the page to turn to."
There are several FTS implementations for Delphi:
Rubicon
Mutis
ColiGet
Google is your friend..
I'd like to add that most databases have an embedded FTS engine. SQLite3 even has a very small but efficient implementation, with page ranking and such.
We provide direct access from Delphi, with ORM classes, to this Full Text Search engine, named FTS3/FTS4.
I recently downloaded my location history from Google. From 2014 to present.
The resulting .json file was 997,000 lines, plus a few.
All of the online converters would freeze and lock up unless I did it in really small slices, which isn't an option (time constraints).
I've gotten a manual process down between Sublime Text and Libre Office to get my information transferred, but I know there's an easier way somewhere.
I even tried the fastFedora plug-in which I couldn't get to work.
Even though I'm halfway done, and will likely finish up using my process, is there an easier way?
I can play with Java though I'm no pro. Any other languages that play well with .json?
I need a solution that supports nesting without flattening the file. The location data is nested and needs to remain nested (or the like) to make sense, or at least grouped.
I have a ".pcapng" binary file, created by Wireshark.
How to detect the beginning of every new package in it?
Is there any specific bytes sequence?
Alternatively, how to detect the end of a package?
(I've seen people whose native language isn't English speak of "packages" rather than "packets" - both words come from the same word "pack", and the same word may be used for both concepts in other languages - so I'm assuming you're referring to network packets; "packages" is generally not used in that sense in English.)
The pcap-NG file format is described in the PCAP Next Generation Dump File Format document. A pcap-NG file is a sequence of blocks; each block has a length field at the beginning (and at the end, to simplify scanning backwards through a file). Not all blocks contain packets; the blocks that do are the Packet Block, Enhanced Packet Block, and Simple Packet Block.
Note that libpcap 1.1 and later can read pcap-NG files, so any program that uses libpcap to read capture files can, if dynamically linked with libpcap and running on a system where the libpcap shared library is 1.1 or later, or statically linked with libpcap 1.1 or later, read some pcap-NG files using the same APIs that are used to read pcap files, without any change to the program. (pcap-NG files containing multiple interfaces where not all of them have the same link-layer header type or snapshot length cannot be read, as the current libpcap APIs don't support that.) There is no version of WinPcap based on libpcap 1.1 or later, so WinPcap cannot currently be used to read pcap-NG files.
Another library that can read pcap-NG files is the NTAR library. It, however, can only read pcap-NG files, not pcap files.
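To make the framing concrete, here is a hedged Delphi sketch that walks the generic block structure (4-byte block type, 4-byte total length, body padded to 32 bits, trailing copy of the length). It assumes the common little-endian case; the Section Header Block's byte-order magic (0x1A2B3C4D) tells you whether you need to byte-swap, and real code should also verify the trailing length copy:

uses
  Classes, SysUtils;

procedure ListPcapNgBlocks(const AFileName: string);
const
  SHB_TYPE = $0A0D0D0A; // Section Header Block
  EPB_TYPE = $00000006; // Enhanced Packet Block
  SPB_TYPE = $00000003; // Simple Packet Block
  OPB_TYPE = $00000002; // (obsolete) Packet Block
var
  Stream: TFileStream;
  BlockType, BlockLen: Cardinal;
  BlockStart: Int64;
begin
  Stream := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
  try
    while Stream.Position + 8 <= Stream.Size do
    begin
      BlockStart := Stream.Position;
      Stream.ReadBuffer(BlockType, 4); // Block Type
      Stream.ReadBuffer(BlockLen, 4);  // Block Total Length (header + body + trailer)
      if (BlockLen < 12) or (BlockLen mod 4 <> 0) then
        raise Exception.CreateFmt('Corrupt block at offset %d', [BlockStart]);
      case BlockType of
        SHB_TYPE: Writeln(Format('%8d: Section Header Block', [BlockStart]));
        EPB_TYPE,
        SPB_TYPE,
        OPB_TYPE: Writeln(Format('%8d: packet block (type $%.8x, %d bytes)',
                    [BlockStart, BlockType, BlockLen]));
      else
        Writeln(Format('%8d: other block (type $%.8x)', [BlockStart, BlockType]));
      end;
      // Jump to the next block: total lengths are padded to 32-bit boundaries.
      Stream.Position := BlockStart + BlockLen;
    end;
  finally
    Stream.Free;
  end;
end;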
I've been working on a WebGL application that requires a tremendous amount of point data to draw to the screen. Currently, that point and surface data is stored on the webserver I am using, and ALL 120MB of JSON is being downloaded by the browser on page load. This takes well over a minute on my network, which is not optimal. I was wondering if anyone has any experience/tips about loading data this large. I've tried eliminating as much whitespace as possible, but that barely made a dent in file size.
Is there any way to either compress this file immensely, or otherwise a better way to download such a large amount of data? Any help would be great!
JSON is incredibly redundant, so it compresses well; compress it on the server and decompress it on the client.
JavaScript implementation of Gzip
Alternatively you could chunk the data up into 1 MB chunks to be sent over one at a time.
Also the user probably can't interact with 120 MB of data at a time, so maybe implement some sort of level of detail system?
If you control the web server sending the data, you could try enabling compression of json data.
This is done by adding the following in applicationhost.config (IIS 7):
<system.webServer>
  <urlCompression doDynamicCompression="true" />
  <httpCompression>
    <dynamicTypes>
      <add mimeType="application/json" enabled="true" />
      <add mimeType="application/json; charset=utf-8" enabled="true" />
    </dynamicTypes>
  </httpCompression>
</system.webServer>
You would then need to restart the App pool for your app.
A few things you might consider:
Is your server compressing the file before sending?
Does this data change often? If it does not, you could set your expires header to a very long time, so the browser could keep it in cache. It wouldn't help on the first page access, but on subsequent ones the file wouldn't have to be loaded again.
Is there a lot of repeating stuff in your json file? For instance, if your object keys are long, you could replace them with shorter ones, send, and replace again in the browser. The benefits will not be that great if the file is compressed (see item 1) but depending on your file it might help a little.
Is all this data consumed by the browser at once? If it's not, you could try breaking it down into smaller pieces, and start processing the first parts while the others load.
But the most important: are you sure JSON is the right tool for this job? A general purpose compression tool can only go so far, but if you explore the particular characteristics of your data you might be able to achieve better results. If you give more details on the format you're using we may be able to help you more.
I had an issue somewhat like yours, so I decided to use binary numbers instead of strings: when the user creates a request to my server, I answer with numbers.
For example, let's say the user is a table in a restaurant. Instead of sending a string like 'burger, orange juice, water, etc...' I can send the number 15, which breaks down in binary as 8, 4, 2, 1 (one bit per item).
When the user asks for multiples of one thing, say 4 burgers, it becomes hard to follow, so you can send an array of numbers alongside the binary flags.
I found it very useful and more secure.
If you decide to do it, I suggest using strings in development mode and translating to binary when you deploy.
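To make the flag idea concrete, here is a minimal sketch in Delphi (the language used elsewhere on this page); the item names and values are made up for illustration:

const
  ITEM_BURGER       = 1; // 0001
  ITEM_ORANGE_JUICE = 2; // 0010
  ITEM_WATER        = 4; // 0100
  ITEM_FRIES        = 8; // 1000

// Each menu item gets one bit; an order is the OR of its items' bits.
function EncodeOrder(Burger, Juice, Water, Fries: Boolean): Integer;
begin
  Result := 0;
  if Burger then Result := Result or ITEM_BURGER;
  if Juice  then Result := Result or ITEM_ORANGE_JUICE;
  if Water  then Result := Result or ITEM_WATER;
  if Fries  then Result := Result or ITEM_FRIES;
end;

// The receiver decodes by testing individual bits.
function OrderContains(Order, Item: Integer): Boolean;
begin
  Result := (Order and Item) <> 0;
end;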
I have a large HTML file being generated for a report at the moment (around 2-3 mb) and this file is going to be transferred a lot of times. It is not being access through any form of a web host, it is just a file being accessed by a network, but the network is all around the world and therefore not fast everywhere.
I know about gzip compression, but from the looks of it that only works with an Apache web server or something similar, configured via the .htaccess file. I have already stripped the whitespace from the HTML file; my question is, besides just zipping it up in a standard archive, what else can I do to minimize the size of the file?
Thanks, and I will be happy to answer any other questions.
You can certainly look at the HTML structure itself to see if you can reduce the number of tags. For example, do you have a bunch of nested table structures that could be replaced? Do you have inline styles that could be put into a separate stylesheet? Do you have any JavaScript content which could be put into a separate file?
I don't think you can compress it without a proper web server, because it is the web server that tells the browser, in the HTTP response, that the file needs to be decompressed.
If the markup makes up the greater part of the file (i.e. there are more tags and scripts than text), you can use CSS to minimize the size.
If the data is the greater part, i.e. there is more information than tags, I suggest you use a web server (you can also compress it with Microsoft IIS).
But, if possible, also consider splitting the data into several files, for example with different levels of detail.
It is possible to embed compressed data within the HTML file and use JavaScript to dynamically decompress it as the page is rendered, using a JavaScript implementation of a decompression module. See this answer for references: JavaScript implementation of Gzip