I'm trying to interface the HTML Help API (which is a big word for one function in two variants), and I have a problem with the following use case:
Assume I have a simple programmer's editor with a bunch of help files (.CHMs). Some are from the core runtime library, some from more exotic libraries. Assume the CHMs are crafted normally and their indexes contain all the keywords I want to search. I want to simply be able to search through the various CHMs when a user presses F1 on a keyword in the editor.
So roughly I want (in pseudocode):
FirstCHM;
while not OutOfCHMs and not Found do
begin
  if KeywordInCHM(Keyword) then
  begin
    Found := True;
    Break;
  end;
  NextCHM;
end;
I've played a bit with HH_HELP_TOPIC, but that would pop up a window for every attempted file, and worse it would be dog slow since the CHMs would not remain cached.
Is there really no solution except DIY with e.g. chmlib? Or is it worth making a study of merged CHM files first?
Language is Delphi or clone, but anything win32/COM and somewhat readable will do.
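(For reference, a keyword lookup against a single CHM looks roughly like the sketch below; the HH_KEYWORD_LOOKUP constant and the HH_AKLINK record are paraphrased from htmlhelp.h rather than taken from a tested import unit, so treat the exact declarations as an assumption.)

uses Windows;

const
  HH_KEYWORD_LOOKUP = $000D;

type
  THHAKLink = record
    cbStruct: Integer;
    fReserved: BOOL;
    pszKeywords: PAnsiChar;
    pszUrl: PAnsiChar;
    pszMsgText: PAnsiChar;
    pszMsgTitle: PAnsiChar;
    pszWindow: PAnsiChar;
    fIndexOnFail: BOOL;
  end;

// ANSI entry point of the HTML Help API in hhctrl.ocx
function HtmlHelp(hwndCaller: HWND; pszFile: PAnsiChar; uCommand: UINT;
  dwData: Pointer): HWND; stdcall; external 'hhctrl.ocx' name 'HtmlHelpA';

procedure LookupKeywordInChm(const ChmFile, Keyword: AnsiString);
var
  Link: THHAKLink;
begin
  FillChar(Link, SizeOf(Link), 0);
  Link.cbStruct := SizeOf(Link);
  Link.pszKeywords := PAnsiChar(Keyword);
  Link.fIndexOnFail := False;  // don't fall back to the index tab on failure
  // Displays the topic if the keyword is in this CHM's index -- i.e. it
  // pops up a viewer window per attempted file, which is exactly the problem.
  HtmlHelp(0, PAnsiChar(ChmFile), HH_KEYWORD_LOOKUP, @Link);
end;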
(edit) Search results for nested index entries might be the next problem:
HTML Help keyword lookup
(/edit)
update 2
After a long time, I got a potential hint elsewhere. Craft a master CHM at runtime that merges in all the other CHMs; Windows will generate CHW files for it containing all the slave CHMs' TOCs and indexes. It requires Binary TOC=off and Binary Index=on for all slave CHMs, though, and a CHM compiler installed/available. But since that is the HTML Help Workshop default, that might not be too bad.
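For illustration, a master project (.hhp) for such a merge setup might look roughly like this; the Binary TOC=No / Binary Index=Yes options would go into each slave project as well, and all file names here are made up:

[OPTIONS]
Compatibility=1.1 or later
Compiled file=master.chm
Title=Merged runtime help
Binary TOC=No
Binary Index=Yes

[MERGE FILES]
rtl.chm
classes.chm
exoticlib.chm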
Do you want to create an index or do a one-time search for these keywords?
Couldn't you extract the HTML content from the CHM files with logical filenames, search the HTML content, and relate that back to the CHM file?
I have an application that needs a configuration file with several inputs which depend on the project that is going to be delivered. Things included in this conf file are IPs of databases, activating certain functions depending on the customer's needs, changing the values of some title screens, etc. A short example of a file could be something like:
postgresdb=192.156.98.98
transactions.enabled=true
application.name="client-1-logistics"
historicaldb=196.125.125.16
....
These files can become large and it might be difficult to find which parameters must be changed, especially if the configuration process has to be done by an external department.
I was looking into some kind of tool or framework that allows you to create a sort of questionnaire in which the user answers yes/no questions and fills out boxes with specific IPs or messages, and gets as a result the configuration file needed. This would be much tidier, as you could group the questions into sections, and it has the potential of customising the configuration process with more context on the different parameters.
Does anyone know of such a framework? How do you handle this kind of complex configuration process?
The approach I outline below is not exactly what you are looking for, but it might provide some food for thought.
1. Use a template engine (for example Velocity, or any of the several dozen listed in Wikipedia) to create a templated version of your configuration file, containing lots of boilerplate configuration that won't change, with the occasional ${variable_name} placeholder (the syntax for a placeholder will vary from one template engine to another).
2. Write a small metadata file containing variable_name=value settings.
3. Write a trivial program that: (a) parses the metadata file and loads the variable_name=value settings into a Map (the template engine might refer to the Map as, say, a context object); (b) uses the template engine to parse the template file; (c) merges/evaluates/instantiates the parsed template file with the settings in the Map; and (d) writes the result to the target configuration file.
You might be able to use steps 1 and 3 above without change. It is only step 2 that you need to adapt to your questionnaire requirements. Instead of a questionnaire, perhaps you could give users a document that explains how to write the metadata file.
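For illustration, a fragment of such a template plus its metadata file might look like this (Velocity-style ${...} placeholders; the variable names are made up, the values are taken from the question):

# config.template -- mostly boilerplate, with the occasional placeholder
postgresdb=${postgres_ip}
transactions.enabled=${transactions_enabled}
application.name="${application_name}"
historicaldb=${historical_ip}

# settings.properties -- the small metadata file the user actually edits
postgres_ip=192.156.98.98
transactions_enabled=true
application_name=client-1-logistics
historical_ip=196.125.125.16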
I have a FindFile routine in my program which will list files, but if the "Containing Text" field is filled in, then it should only list files containing that text.
If the "Containing Text" field is entered, then I search each file found for the text. My current method of doing that is:
var
  FileContents: TStringList;
begin
  FileContents := TStringList.Create;
  try
    FileContents.LoadFromFile(Filepath);
    Found := Pos(TextToFind, FileContents.Text) > 0;
  finally
    FileContents.Free;
  end;
The above code is simple, and it generally works okay. But it has two problems:
It fails for very large files (e.g. 300 MB)
I feel it could be faster. It isn't bad, but why wait 10 minutes searching through 1000 files, if there might be a simple way to speed it up a bit?
I need this to work for Delphi 2009 and to search text files that may or may not be Unicode. It only needs to work for text files.
So how can I speed this search up and also make it work for very large files?
Bonus: I would also want to allow an "ignore case" option. That's a tougher one to make efficient. Any ideas?
Solution:
Well, mghie pointed out my earlier question How Can I Efficiently Read The First Few Lines of Many Files in Delphi, and as I answered, it was different and didn't provide the solution.
But he got me thinking that I had done this before, and I had. I built a block reading routine for large files that breaks them into 32 MB blocks. I use that to read the input file of my program, which can be huge. The routine works fine and fast. So step one is to do the same for these files I am looking through.
So now the question was how to efficiently search within those blocks. Well, I did have a previous question on that topic: Is There An Efficient Whole Word Search Function in Delphi? and RRUZ pointed out the SearchBuf routine to me.
That solves the "bonus" as well, because SearchBuf has options which include Whole Word Search (the answer to that question) and MatchCase/noMatchCase (the answer to the bonus).
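A minimal sketch of the SearchBuf call over one block looks like this (the 32 MB block reading itself is omitted, and the helper name is made up):

uses StrUtils;

// Buf points at one block of already-loaded text, BufLen is its length in
// characters. soDown searches forward; add soWholeWord for whole-word search.
function BlockContains(Buf: PChar; BufLen: Integer; const TextToFind: string;
  MatchCase: Boolean): Boolean;
var
  Options: TStringSearchOptions;
begin
  Options := [soDown];
  if MatchCase then
    Include(Options, soMatchCase);
  Result := SearchBuf(Buf, BufLen, 0, 0, TextToFind, Options) <> nil;
end;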
So I'm off and running. Thanks once again SO community.
The best approach here is probably to use memory mapped files.
First you need a file handle; use the CreateFile Windows API function for that.
Then pass that to CreateFileMapping to get a file mapping handle. Finally use MapViewOfFile to map the file into memory.
To handle large files, MapViewOfFile is able to map only a certain range into memory, so you can e.g. map the first 32MB, then use UnmapViewOfFile to unmap it followed by a MapViewOfFile for the next 32MB and so on. (EDIT: as was pointed out below, make sure that the blocks you map this way overlap by a multiple of 4kb, and at least as much as the length of the text you are searching for, so that you are not overlooking any text which might be split at the block boundary)
To do the actual searching once the (part of) the file is mapped into memory, you can make a copy of the source for StrPosLen from SysUtils.pas (it's unfortunately defined in the implementation section only and not exposed in the interface). Leave one copy as is and make another copy, replacing Wide with Ansi every time. Also, if you want to be able to search in binary files which might contain embedded #0's, you can remove the (Str1[I] <> #0) and part.
Either find a way to identify if a file is ANSI or Unicode, or simply call both the Ansi and Unicode version on each mapped part of the file.
Once you are done with each file, make sure to call CloseHandle first on the file mapping handle and then on the file handle (and don't forget to call UnmapViewOfFile before that).
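A bare-bones sketch of that sequence (error handling reduced to early exits, and the whole file mapped in one go; for the 32 MB chunks you would pass file offsets to MapViewOfFile, aligned to the system allocation granularity):

uses Windows, SysUtils;

procedure SearchMappedFile(const FileName: string; const TextToFind: AnsiString);
var
  hFile, hMap: THandle;
  View: Pointer;
  Size: DWORD;
begin
  hFile := CreateFile(PChar(FileName), GENERIC_READ, FILE_SHARE_READ, nil,
    OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
  if hFile = INVALID_HANDLE_VALUE then
    Exit;
  try
    Size := GetFileSize(hFile, nil);
    hMap := CreateFileMapping(hFile, nil, PAGE_READONLY, 0, 0, nil);
    if hMap = 0 then
      Exit;
    try
      View := MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0); // 0 = map the whole file
      if View = nil then
        Exit;
      try
        // scan the Size bytes at View for TextToFind here, e.g. with the
        // Ansi/Wide copies of StrPosLen mentioned above
      finally
        UnmapViewOfFile(View);
      end;
    finally
      CloseHandle(hMap);
    end;
  finally
    CloseHandle(hFile);
  end;
end;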
EDIT:
A big advantage of using memory mapped files instead of using e.g. a TFileStream to read the file into memory in blocks is that the bytes will only end up in memory once.
Normally, on file access, Windows first reads the bytes into the OS file cache and then copies them from there into the application's memory.
If you use memory mapped files, the OS can directly map the physical pages from the OS file cache into the address space of the application without making another copy (reducing the time needed for the copy and halving memory usage).
Bonus Answer: By calling StrLIComp instead of StrLComp you can do a case insensitive search.
If you are looking for text string searches, look at the Boyer-Moore search algorithm. It can be combined with memory mapped files to give a really fast search engine. There are some Delphi units around that contain implementations of this algorithm.
To give you an idea of the speed: I currently search through 10-20 MB files and it takes on the order of milliseconds.
Oh, just read that it might be Unicode - not sure if it supports that - but definitely look down this path.
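For a flavour of the algorithm, here is a minimal Boyer-Moore-Horspool sketch over an ANSI buffer (my own illustration, not taken from any particular unit); it returns the zero-based offset of the first match, or -1:

function HorspoolSearch(Buffer: PAnsiChar; BufLen: Integer;
  const Pattern: AnsiString): Integer;
var
  Skip: array[Byte] of Integer;
  i, j, PatLen: Integer;
begin
  Result := -1;
  PatLen := Length(Pattern);
  if (PatLen = 0) or (PatLen > BufLen) then
    Exit;
  // Default shift is the full pattern length...
  for i := Low(Skip) to High(Skip) do
    Skip[i] := PatLen;
  // ...characters occurring in the pattern allow a smaller shift
  for i := 1 to PatLen - 1 do
    Skip[Ord(Pattern[i])] := PatLen - i;
  i := 0;
  while i <= BufLen - PatLen do
  begin
    j := PatLen;
    while (j > 0) and (Buffer[i + j - 1] = Pattern[j]) do
      Dec(j);
    if j = 0 then
    begin
      Result := i;
      Exit;
    end;
    Inc(i, Skip[Ord(Buffer[i + PatLen - 1])]);
  end;
end;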
This is a problem connected with your previous question How Can I Efficiently Read The First Few Lines of Many Files in Delphi, and the same answers apply. If you don't read the files completely but in blocks then large files won't pose a problem. There's also a big speed-up to be had for files containing the text, in that you should cancel the search upon the first match. Currently you read the whole files even when the text to be found is in the first few lines.
May I suggest a component? If so, I would recommend ATStreamSearch.
It handles ANSI and Unicode (and even EBCDIC and Korean and more).
Or the class TUTBMSearch from JclUnicode (Jedi JCL). It was mainly written by Mike Lischke (of VirtualTreeview). It uses a tuned Boyer-Moore algorithm that ensures speed. The drawback in your case is that it works entirely in Unicode (WideStrings), so the conversion from String to WideString risks being a penalty.
It depends on what kind of data you are going to search. To get really efficient results, you would let your program parse the interesting directories, including all the files in them, and keep the data in a database which you can then query for a specific word over a specific list of files (generated from the search path). A database query can give you results in milliseconds.
The issue is that you have to let it run and parse all the files after installation, which may take more than an hour depending on the amount of data to parse.
The database should be updated each time your program starts; this can be done by comparing the MD5 value of each file to see whether it was changed, so you don't have to parse every file every time.
This way of working is interesting if your data sits in a constant place and you analyse the same files more often than completely new ones; some code analysers work like this, and they are really efficient. You invest some time in parsing and saving the interesting data, and afterwards you can jump to the exact place where a search term appears and produce a list of all the places it appears in very little time.
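A change-detection helper along those lines might look like this (a sketch; it assumes Indy's TIdHashMessageDigest5 class, which ships with Delphi, but any MD5 routine would do):

uses Classes, SysUtils, IdHashMessageDigest;

// Returns the MD5 of a file as a hex string; compare it with the value stored
// in the database to decide whether the file needs to be re-parsed.
function FileMD5(const FileName: string): string;
var
  Hash: TIdHashMessageDigest5;
  Stream: TFileStream;
begin
  Hash := TIdHashMessageDigest5.Create;
  try
    Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
    try
      Result := Hash.HashStreamAsHex(Stream);
    finally
      Stream.Free;
    end;
  finally
    Hash.Free;
  end;
end;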
If the files are to be searched multiple times, it could be a good idea to use a word index.
This is called "Full Text Search".
It will be slower the first time (text must be parsed and indexes must be created), but any future search will be immediate: in short, it will use only the indexes, and not read all text again.
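To make the idea concrete, a toy in-memory word index could look like the sketch below (my own illustration, not the parser from the article mentioned next; real FTS engines add ranking, stemming and on-disk storage):

uses SysUtils, Classes, Generics.Collections;

type
  // Maps each lower-cased word to the (sorted, duplicate-free) list of files
  // it occurs in, so later lookups never re-read the files.
  TWordIndex = class
  private
    FWords: TObjectDictionary<string, TStringList>;
  public
    constructor Create;
    destructor Destroy; override;
    procedure IndexFile(const FileName, Contents: string);
    function FilesContaining(const AWord: string): TStringList; // nil if unknown
  end;

constructor TWordIndex.Create;
begin
  inherited Create;
  FWords := TObjectDictionary<string, TStringList>.Create([doOwnsValues]);
end;

destructor TWordIndex.Destroy;
begin
  FWords.Free;
  inherited;
end;

procedure TWordIndex.IndexFile(const FileName, Contents: string);
const
  WordChars: TSysCharSet = ['a'..'z', 'A'..'Z', '0'..'9'];
var
  i, Start: Integer;
  W: string;
  Files: TStringList;
begin
  i := 1;
  while i <= Length(Contents) do
  begin
    // skip separators, then collect the next run of word characters
    while (i <= Length(Contents)) and not CharInSet(Contents[i], WordChars) do
      Inc(i);
    Start := i;
    while (i <= Length(Contents)) and CharInSet(Contents[i], WordChars) do
      Inc(i);
    if i > Start then
    begin
      W := LowerCase(Copy(Contents, Start, i - Start));
      if not FWords.TryGetValue(W, Files) then
      begin
        Files := TStringList.Create;
        Files.Sorted := True;
        Files.Duplicates := dupIgnore;
        FWords.Add(W, Files);
      end;
      Files.Add(FileName);
    end;
  end;
end;

function TWordIndex.FilesContaining(const AWord: string): TStringList;
begin
  if not FWords.TryGetValue(LowerCase(AWord), Result) then
    Result := nil;
end;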
You have the exact parser you need in The Delphi Magazine Issue 78, February 2002:
"Algorithms Alfresco: Ask A Thousand Times
Julian Bucknall discusses word indexing and document searches: if you want to know how Google works its magic this is the page to turn to."
There are several FTS implementations for Delphi:
Rubicon
Mutis
ColiGet
Google is your friend..
I'd like to add that most DBs have an embedded FTS engine. SQLite3 even has a very small but efficient implementation, with page ranking and such.
We provide direct access from Delphi, with ORM classes, to this Full Text Search engine, named FTS3/FTS4.
I have a ~600 MB .DAT file that contains an Italian dictionary (accented words with their definitions).
I would like to extract all the strings from this file (a raw dump containing strings and dirty headers/binary data would be all right as long as I can read the words and definitions).
So my question is:
Is there a software that could do this in an automated way?
I would tell it:
'I know that this file contains the strings "TREE", "DOG", "CAT", "COLLISION"... now use some brute force, statistical analysis or whatever method to try and find how these strings are encoded'
2 things I'd like to mention:
I am a software developer but have absolutely no experience or knowledge in reverse engineering, hex editing, etc.
I do not want to spend hours reading reverse-engineering tutorials and doing trial and error with lots of different tools. If I don't succeed in extracting what I need in a simple manner, I'll just abandon this task.
I realize that it's probable (if the text is encrypted for instance) that this task could not be performed simply, I just want to give it a try with the best tool available.
It seems that such an automated tool does not exist, or if it did, it would only work for a very small set of input files.
I finally found a solution to my problem.
I have an EXE program that allows browsing the dictionary and displaying the definition of a word.
Using AutoHotkey, I wrote a relatively simple script that searches for the definition of every word from a 400k-word input list, copies it to the clipboard, and then pastes it into an output text file.
I had to insert some Sleep statements between the keystrokes, window switching etc. to make the script stable.
Estimated time to "parse" the whole dictionary: 20 days :)
I have lots of scanned images of a magazine (published monthly) and I have to organize them in a searchable manner.
Users should be able to browse the magazine issue-wise or search for predefined categories/keywords.
What I have thought of for now is to create a CHM, as it will need less effort than creating new custom-built software.
For that I will programmatically create a separate HTML page for each image, with the image embedded in it along with the keywords (stored in an Excel sheet along with the path of the image) for which that image should be included in the results.
So I want a CHM creator that can parse HTML meta tags and add the keywords to the CHM keyword list.
One such tool I have found is Abee CHM Maker,
but I need a free alternative.
If you have any other idea for organizing this with minimal effort, that is also welcome...
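For context, one generated page of that kind could look like this (file names and keywords are made up; whether the meta tag is picked up for the index depends on the tool):

<html>
<head>
  <title>Issue 2010-04, page 12</title>
  <!-- keywords pulled from the Excel sheet for this scanned page -->
  <meta name="keywords" content="logistics, warehouse, barcode">
</head>
<body>
  <img src="images/2010-04-012.png" alt="scanned page 12 of issue 2010-04">
</body>
</html>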
The standard (free) way to create CHM files is using Microsoft's HTML Help Workshop:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms670169(v=vs.85).aspx
Kind regards,
Bo
Free Pascal has a CHM creator package, an HTML DOM implementation and a basic command-line compiler for CHM projects (.hhp). The creator package is independent of MS tools or any other binary blob, and is available in source. It is portable as far as FPC is portable (not as portable as gcc on paper, but enough in practice, with all major architectures and OSes supported).
One could make something like that. I made something similar, but instead of meta tags I folded titles back into the TOC and index, cleaned up the HTML (TeX4ht output) and fixed links before turning it into a CHM.
But it will require some work, and if you are not familiar with Object Pascal/Delphi (the language), it might be a bridge too far (the hours required would not compare favorably with the cost of the Abee tool, if that would suit your goals).
On the other hand, in a freely programmable system you can decide for yourself how far you automate things. I put in a lot of work once, and now all new output of TeX4ht (with a certain fixed set of settings) formats nicely into CHMs.
See if this helps you (it certainly does what you need):
KEL CHM Creator: http://dumah7.wordpress.com/2009/02/17/kel-chm-creator-v-1-4-0-0/
Alternatively, I think you could add tags to each picture (right-click on it -> Properties -> Details -> Tags) and use Windows Explorer to search them. I have never done this, but it is supposed to work (I guess).
MonoDevelop creates those for every project. Should I include them in source control?
From a MonoDevelop blog post:
There were several long time pending bug reports, and I also wanted to improve a bit the performance and memory use. MonoDevelop creates a Parser Information Database (pidb) file for each assembly or project. This file contains all the information about classes implemented in an assembly, together with documentation pulled from Monodoc. A pidb file has three sections: the first one is a header which contains among other things the version of the file format (that version is checked when loading the pidb, and the file will be regenerated if it doesn't match the current implementation version). The second section is the index of the pidb file. It contains an index of all classes in the database. The index is always fully loaded in memory to be able to quickly locate classes. The third section of the file contains all the class information: list of methods, fields, properties, documentation for each of those, and so on. Each entry in the index has a file offset field, which can be used to completely load all the information of a class (the index only contains the name).
So it sounds like it's really just an optimization. I would personally not include it in source control unless you find it makes a big difference to performance: my guess is it will only really stay valid if only one person is working on the project at a time. (If it's big and changes regularly, you could find it adds significant overhead to the repository too. I haven't checked to see what the size is actually like, but it's worth checking.)
They're just cached code completion data. As the post Jon linked explains, the main reason is to save memory, though they do also save you from waiting for MD to parse all the source files and referenced assemblies when you open a project.
The pidb files can be regenerated pretty quickly, so there's no advantage to keeping them in the VCS. Indeed, as well as the VCS repository overhead, it could also cause problems if people are using different versions of MD with different pidb formats, so I'd strongly recommend against keeping them in source control.
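In practice that just means adding the pattern to your VCS ignore list, for example (a sketch; the exact mechanism depends on the VCS):

# .gitignore entry (or svn:ignore property): MonoDevelop's cached parser databases
*.pidb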