Is there a faster and more efficient way of reading files than using AssignFile, Reset and Read (on Windows x86 and x64)?
I need to grep (using TRegExpr) many text files for a pattern.
Two directions:
Keep using text files, and to get the max out of AssignFile and friends, use SetTextBuf to increase the buffer size to, say, 8 KB (bigger values are possible, but don't really speed things up anymore).
Otherwise you must craft your own text access using normal binary file access.
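For illustration, a minimal sketch of the SetTextBuf route (the file name and buffer size are placeholders; the key point is that SetTextBuf must be called after AssignFile and before Reset):

var
  F: TextFile;
  Buf: array[0..8191] of AnsiChar; // 8 KB instead of the default 128-byte buffer
  Line: string;
begin
  AssignFile(F, 'input.txt');
  SetTextBuf(F, Buf, SizeOf(Buf)); // takes effect because it precedes Reset
  Reset(F);
  try
    while not Eof(F) do
    begin
      ReadLn(F, Line);
      // run the TRegExpr match against Line here
    end;
  finally
    CloseFile(F);
  end;
end;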
How can I check whether a PDF already exists (or is 80% the same) in MySQL?
Users want to upload PDFs.
But the problem is re-uploads of the same document.
My idea: convert the PDF to binary
=> so I have a string "X" (the binary of that PDF) to save in MySQL
=> then SELECT with LIKE % on a slice of it (from 1/3 of length(X) to 2/3 of length(X)).
Could that work?
I'm using Laravel.
Thanks for reading.
This cannot reasonably be done in MySQL. Since you are also using a PHP environment, it may be possible to do it in PHP, but a general solution will take substantial effort.
PDF files are composed of (possibly compressed) streams of images and text. Several libraries can attempt to extract the text, and they work reasonably well if the PDF was generated in a straightforward way; however, they will typically fail if some text was rendered as images of its characters, or if other obfuscation has been applied. In those cases, you will need OCR to recover the text as it appears when the PDF is displayed. Note also that tables and images are out of scope for these tools.
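For example, a minimal extraction sketch (assuming the Python pdfminer.six library; any comparable extractor plays the same role):

# pip install pdfminer.six
from pdfminer.high_level import extract_text

text = extract_text('upload.pdf')
if not text.strip():
    # no usable text layer: this is where an OCR fallback would go
    ...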
Once you have two text files, finding overlaps becomes much easier, and there are several techniques. "Same 80%" can be interpreted in several ways, but let us assume that copying a contiguous 79% of the text from a file and saving it again should not trigger alarms, while copying 81% of that same text should. Any diff tool can report duplicate chunks, and may be enough for your purposes. A more sophisticated approach, which however does not provide exact percentages, is the normalized compression distance.
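A minimal sketch of the normalized compression distance in Python (zlib as the compressor; the 0.3 threshold and the placeholder strings are assumptions to tune against real extracted text):

import zlib

def ncd(x: bytes, y: bytes) -> float:
    # near 0.0 for near-identical inputs, near 1.0 for unrelated ones
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

new_text = 'stand-in for the extracted text of the uploaded PDF'
stored_text = 'stand-in for the extracted text of a PDF already stored'
if ncd(new_text.encode(), stored_text.encode()) < 0.3:
    print('likely a near-duplicate')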
What is the correct way of storing image files in a database?
I am using file-paths to store the images.
But here is the problem: I basically have to show 3 different sizes of one image on my website. One would be used as a thumbnail, the second would be around 290px*240px, and the third would be full size (approx 500px*500px). As it is not considered good practice to scale images down using HTML img elements, what should the solution be?
Currently, what I am doing is storing 3 different size images for one thing. Is there any better way?
Frankly the correct way to store images in a database is not to store them in a database. Store them in the file system, and keep the DB restricted to the path.
At the end of the day, you're talking about storage space. That's the real issue. If you're storing multiple copies of the same file at different resolutions, it will take more space than storing just a single copy of the file.
On the other hand, if you only keep one copy of the file and scale it dynamically, you don't need the storage space, but it does take more processing power instead.
And as you already stated in the question, sending the full-size image every time is costly in terms of bandwidth.
So that's the trade-off: storage space on your server vs. processor work vs. bandwidth costs.
The simple fact is that the cheapest of those three things is storage space. Therefore, you should store the multiple copies of the files on your server.
This is also the solution that will give you the best performance, which in many ways is an even more important point than the direct cost.
In terms of storing the file paths, my advice is to give the scaled versions predictable names with a standard prefix or suffix compared to the original file. This means you only need to have the single filename on the database; you can simply add the prefix for the relevant version of the image that has been requested.
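For instance (a sketch; the "_thumb" and "_290x240" suffixes are made-up conventions, use whatever scheme you like):

def variant_name(original: str, suffix: str) -> str:
    # 'photo.jpg' + 'thumb' -> 'photo_thumb.jpg'
    stem, dot, ext = original.rpartition('.')
    return f'{stem}_{suffix}.{ext}' if dot else f'{original}_{suffix}'

# the database stores only 'photo.jpg'; the application derives the rest
print(variant_name('photo.jpg', 'thumb'))    # photo_thumb.jpg
print(variant_name('photo.jpg', '290x240'))  # photo_290x240.jpg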
Nothing wrong with storing multiple versions of the same image.
Ideally you want even more – the @2x ones for retina screens.
You can use a server side script to generate the smaller ones dynamically, but depending on traffic levels this may be a bad idea.
You are really trading storage space and speed for higher CPU and RAM usage to generate them on the fly – depending on your situation that might be a good trade off, or it might not.
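If you do pre-generate the variants instead, here is a sketch using the Pillow Python library (the 290x240 and 500x500 sizes come from the question; the 100x100 thumbnail and the suffix naming are assumptions):

from PIL import Image  # pip install Pillow

SIZES = {'thumb': (100, 100), 'medium': (290, 240), 'full': (500, 500)}

def make_variants(path: str) -> None:
    stem, _, ext = path.rpartition('.')
    for suffix, size in SIZES.items():
        img = Image.open(path)
        img.thumbnail(size)  # shrinks in place, preserving aspect ratio
        img.save(f'{stem}_{suffix}.{ext}')

make_variants('photo.jpg')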
I agree with rick; you can store multiple image sizes as your business requires. You should store the image in a folder on the server and its location in the database. Build a hierarchy in the folder and store the low-res images inside subfolders so that you can always refer to them with only one address.
you can do it like this in your web.config:
<add key="WebResources" value="~/Assets/WebResources/" />
<add key="ImageRoot" value="Images\Web" />
Make folders named after the sizes (for example 290x240 and 500x500) and store the pictures with the same name inside them, so you can easily access them.
It looks like Hadoop MapReduce requires a key-value pair structure in the text or binary data.
In reality we might have files that need to be split into chunks for processing, but the keys may be
spread across the file. It may not be clear-cut that one key is followed by one value. Is there any InputFileFormatter that can read this type of binary file? I don't want to run one MapReduce job after another; that would slow down the performance and defeat the purpose of using MapReduce.
Any suggestions? Thanks.
According to Hadoop: The Definitive Guide:
The logical records that FileInputFormats define do not usually fit neatly into HDFS blocks. For example, a TextInputFormat’s logical records are lines, which will cross HDFS boundaries more often than not. This has no bearing on the functioning of your program—lines are not missed or broken, for example—but it’s worth knowing about, as it does mean that data-local maps (that is, maps that are running on the same host as their input data) will perform some remote reads. The slight overhead this causes is not normally significant.
If the file is split by HDFS at block boundaries, the Hadoop framework will take care of it. But if you split the file manually, then the boundaries have to be taken into consideration.
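To make that concrete, here is a rough Python sketch of the idea behind a line-oriented record reader (the real Hadoop implementation is Java, and it also handles compression and the boundary off-by-one details elided here):

def read_split(f, split_start: int, split_end: int):
    # Yield the complete lines belonging to one split of a large file.
    f.seek(split_start)
    if split_start != 0:
        f.readline()  # partial first line: it belongs to the previous split
    while f.tell() < split_end:
        line = f.readline()  # may read past split_end to finish the last record
        if not line:
            break
        yield line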
In reality we might have files that need to be split into chunks for processing, but the keys may be spread across the file. It may not be clear-cut that one key is followed by one value.
What's the exact scenario? Then we can look at a workaround for it.
It's no secret that application logs can go well beyond the limits of naive log viewers, and the desired viewer functionality (say, filtering the log based on a condition, or highlighting particular message types, or splitting it into sublogs based on a field value, or merging several logs based on a time axis, or bookmarking etc.) is beyond the abilities of large-file text viewers.
I wonder:
Whether decent specialized applications exist (I haven't found any)
What functionality might one expect from such an application? (I'm asking because my student is writing such an application, and the functionality above has already been implemented to a usable extent.)
I've been using Log Expert lately.
(screenshot: http://www.log-expert.de/images/stories/logexpertshowcard.gif)
It can take a while to load large files, but it will in fact load them. I couldn't find a file size limit (if there is one) on the site, but just to test it, I loaded a 300 MB log file, so it can at least go that high.
Windows Commander has a built-in program called Lister which works very quickly for any file size. I've used it with GBs worth of log files without a problem.
http://www.ghisler.com/lister/
A slightly more powerful tool I sometimes use is Universal Viewer from http://www.uvviewsoft.com/.
Does anyone know how I can store large binary values in Riak?
For now, they don't recommend storing files larger than 50MB in size without splitting them. See: FAQ - Riak Wiki
If your files are smaller than 50 MB, then proceed as you would with storing non-binary data in Riak.
Another reason one might pick Riak is for flexibility in modeling your data. Riak will store any data you tell it to in a content-agnostic way — it does not enforce tables, columns, or referential integrity. This means you can store binary files right alongside more programmer-transparent formats like JSON or XML. Using Riak as a sort of “document database” (semi-structured, mostly de-normalized data) and “attachment storage” will have different needs than the key/value-style scheme — namely, the need for efficient online-queries, conflict resolution, increased internal semantics, and robust expressions of relationships. (Schema Design in Riak - Introduction)
Brian Mansell's answer is on the right track - you don't really want to store large binary values (over 50 MB) as a single object in Riak (the cluster becomes unusably slow after a while).
You have 2 options instead:
1) If a binary object is small enough, store it directly. If it's over a certain threshold (50 MB is a decent arbitrary value to start with, but really, run some performance tests to see at what object size your cluster starts to crawl) -- break the file into several chunks, and store the chunks separately. (In fact, most people I've seen go this route use chunks of 1 MB in size.)
This means, of course, that you have to keep track of the "manifest" -- which chunks got stored where, and in what order. And then, to retrieve the file, you would first have to fetch the object tracking the chunks, then fetch the individual file chunks and reassemble them back into the original file. Take a look at a project like https://github.com/podados/python-riakfs to see how they did it.
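A rough sketch of that chunk-and-manifest bookkeeping (store() is a placeholder for your Riak client's put operation, not a real library call):

import json

CHUNK_SIZE = 1024 * 1024  # the 1 MB chunk size mentioned above

def store_file(store, path: str) -> None:
    chunk_keys = []
    with open(path, 'rb') as f:
        for i, chunk in enumerate(iter(lambda: f.read(CHUNK_SIZE), b'')):
            key = f'{path}:chunk:{i}'
            store(key, chunk)  # one Riak object per chunk
            chunk_keys.append(key)
    # fetch the manifest first when reassembling, then each chunk in order
    store(f'{path}:manifest', json.dumps(chunk_keys).encode())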
2) Alternatively, you can just use Riak CS (Riak Cloud Storage), to do all of the above, but the code is written for you. That's exactly how RiakCS works -- it breaks an incoming file into chunks, stores and tracks them individually in plain Riak, and reassembles them when it comes time to fetch it back. And provides an Amazon S3 API for file storage, for your convenience. I highly recommend this route (so as not to reinvent the wheel -- chunking and tracking files is hard enough). Yes, CS is a paid product, but check out the free Developer Trial, if you're curious.
Just like every other value. Why would it be different?
Use either the Erlang interface ( http://hg.basho.com/riak/src/461421125af9/doc/basic-client.txt ) or the "raw" HTTP interface ( http://hg.basho.com/riak/src/tip/doc/raw-http-howto.txt ). It should "just work."
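For instance, via the raw HTTP interface (a sketch: the host, port, and /riak/<bucket>/<key> path follow the defaults of Riak's HTTP API from that era, so adjust them to your install):

import urllib.request

data = open('photo.bin', 'rb').read()
req = urllib.request.Request(
    'http://localhost:8098/riak/images/photo.bin',
    data=data,
    headers={'Content-Type': 'application/octet-stream'},
    method='PUT',
)
urllib.request.urlopen(req)  # Riak stores the bytes as an opaque value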
Also, you'll generally find a better response on the riak-users mailing list than you will here. http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com (No offense to z8000, who seems to also have answers.)