Large text file manipulation - text-manipulation

I'm seeking help with manipulating large text files. I'm not a programmer and apparently have some serious trouble with regex. The text files in question are the output of a tool that logs an enormous amount of information about the server it's running on. The output needs to be adjusted to defined requirements. Each text file is between 4 and 10 MB in size. The file is broken into sections that look like this:
http://imageshack.us/photo/my-images/853/79515724.jpg/
What I need is to somehow remove some of the sections while leaving SOME of them intact.
I have around 200 of those files.
Thank you.
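One way to approach this without a heavy regex is to stream each file line by line and switch output on or off at every section header. The sketch below is only an illustration: the real section layout is shown in the linked image and is not reproduced here, so the header pattern `=== Name ===`, the function names, and the choice to keep lines before the first header are all assumptions that would need adapting.

```python
import re
from typing import Iterable, Iterator

# Hypothetical header format, e.g. "=== CPU ===". Adjust to the real files.
SECTION_HEADER = re.compile(r"^=== (?P<name>.+?) ===\s*$")


def keep_sections(lines: Iterable[str], wanted: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that belong to sections named in `wanted`."""
    wanted = set(wanted)
    keeping = True  # assumption: text before the first header is kept
    for line in lines:
        match = SECTION_HEADER.match(line)
        if match:
            # A new section starts: decide whether to keep it.
            keeping = match.group("name") in wanted
        if keeping:
            yield line


def filter_file(src: str, dst: str, wanted: Iterable[str]) -> None:
    """Stream `src` to `dst`, so a 4-10 MB file is never held in memory at once."""
    with open(src, encoding="utf-8", errors="replace") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        fout.writelines(keep_sections(fin, wanted))
```

For around 200 files, `filter_file` could be called in a loop over `pathlib.Path(folder).glob("*.txt")`, writing each result to a separate output folder so the originals stay untouched.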

Related

Writing TIFF pixel data last and making 8 GB TIFF files

Is there any reason not to write TIFF pixel data last? Typically a simple TIFF file starts with a header that describes endianness and contains the offset to the first IFD, then the pixel data, followed at the end of the file by the IFD and then the extra data that the IFD tags point to. All TIFF files I've seen are written in this order, however the TIFF standard says nothing about mandating such an order, in fact it says that "Compressed or uncompressed image data can be stored almost anywhere in a TIFF file", and I can't imagine that any TIFF parser would mind the order as they must follow the IFD offset and then follow the StripOffsets (tag 273) tags. I think it makes more sense to put the pixel data last so that one goes through a TIFF file more sequentially without jumping from the top to the bottom back to the top for no good reason, but yet I don't see anyone else doing this, which perplexes me slightly.
Part of the reason why I'm asking is that a client of mine tries to create TIFF files slightly over 4 GB which doesn't work due to the IFD offsets overflowing, and I'm thinking that even though the TIFF standard claims TIFF files cannot exceed 2^32 bytes there might be a way to create 8 GB TIFF files that would be accepted by most TIFF parsers if we put everything that isn't pixel data first so that they all have very small offsets, and then point to two strips of pixel data, a first strip followed by a second that would start before offset 2^32 and that is itself no larger than 2^32-1, thus giving us TIFF files that can be up to 2*(2^32-1) bytes and still be in theory readable despite being limited to 32 bit offsets and sizes.
Please note that this is a question about the TIFF format, I'm not talking about what any third-party library would accept to write as I wrote my own TIFF writing code, nor am I asking about BigTIFF.
It's OK to write pixel data after the IFD structure and tag data. Many TIFF writing programs and libraries do that. The most notable exception is libtiff-based software, which writes the image data first. As for writing image data in huge strips, possibly exceeding the 4 GB file size, check with the software and libraries you intend to read the files with. Software compiled for 32-bit, or other implementation details, might prevent reading such files. I found that, for example, modern libtiff-based software, Photoshop, and tifffile can read such files, while ImageJ, BioFormats, and Paint.NET cannot.
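To make the layout concrete, here is a minimal sketch of a writer that puts the header first, then the IFD, and the pixel data last. It assumes an uncompressed 8-bit grayscale image stored as a single strip in a little-endian file; the function name is made up for illustration, and the resolution tags are left out, which a strict baseline reader might object to. It only demonstrates the ordering on a small image and does not attempt the over-4 GB two-strip trick.

```python
import struct

SHORT, LONG = 3, 4  # TIFF field type codes


def tiff_ifd_first(width: int, height: int, pixels: bytes) -> bytes:
    """Return TIFF bytes laid out as: 8-byte header, IFD, then pixel data."""
    if len(pixels) != width * height:
        raise ValueError("pixels must hold exactly width * height bytes")
    # (tag, field type, value), in ascending tag order as the spec requires.
    tags = [
        (256, LONG, width),        # ImageWidth
        (257, LONG, height),       # ImageLength
        (258, SHORT, 8),           # BitsPerSample
        (259, SHORT, 1),           # Compression: none
        (262, SHORT, 1),           # PhotometricInterpretation: BlackIsZero
        (273, LONG, None),         # StripOffsets, filled in below
        (277, SHORT, 1),           # SamplesPerPixel
        (278, LONG, height),       # RowsPerStrip: the whole image is one strip
        (279, LONG, len(pixels)),  # StripByteCounts
    ]
    ifd_offset = 8
    ifd_size = 2 + 12 * len(tags) + 4   # entry count + entries + next-IFD offset
    data_offset = ifd_offset + ifd_size  # pixel data goes after the IFD

    out = bytearray(struct.pack("<2sHI", b"II", 42, ifd_offset))
    out += struct.pack("<H", len(tags))
    for tag, field_type, value in tags:
        if tag == 273:
            value = data_offset
        # In a little-endian file a SHORT value sits in the low bytes of the
        # 4-byte value field, so packing it as a 32-bit integer is equivalent.
        out += struct.pack("<HHII", tag, field_type, 1, value)
    out += struct.pack("<I", 0)  # no further IFDs
    out += pixels
    return bytes(out)
```

With nine tags the IFD ends at byte 122, so every offset stored in the file is tiny regardless of how large the pixel data that follows is, which is the property the question relies on.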

How can I open a very large HTML file?

I am using a Python script to compare data, and the differences are saved to an HTML file. Because there are so many differences, the resulting file is 158 MB, which I am unable to open in any browser.
How can I open it, or should I convert it to PDF or some other format that I can open?
HTML is for rendering content for consumption. A 158 MB web page is too large; a human cannot be expected to process this amount of information in a single viewing.
Alter your script to restrict the number of displayed differences between files to a more manageable amount, and include a count of the total number of differences.
eg:
<p>[X] differences identified between [file1] and [file2], the first 5 differences are listed below:</p>
<ul>
<li>[line]:[difference]</li>
<li>[line]:[difference]</li>
<li>[line]:[difference]</li>
<li>[line]:[difference]</li>
<li>[line]:[difference]</li>
</ul>
If you need to have all the differences available, consider a different file format. A plain text file will allow you to use the various large-file viewing tools available (less, more, etc.).
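Since the report comes from a Python script, the truncated summary could be produced along these lines. This is a sketch only: it assumes the comparison is line based and uses the standard library's `difflib.SequenceMatcher`; the function name and the exact wording of the list items are illustrative, not taken from the original script.

```python
import difflib
import html


def summarize_differences(old, new, old_name, new_name, limit=5):
    """Return a short HTML report: the total count plus the first `limit` changes.

    `old` and `new` are lists of lines; each non-equal block counts as one difference.
    """
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    changes = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    shown = changes[:limit]
    parts = [
        f"<p>{len(changes)} differences identified between "
        f"{html.escape(old_name)} and {html.escape(new_name)}, "
        f"the first {len(shown)} are listed below:</p>",
        "<ul>",
    ]
    for tag, i1, i2, j1, j2 in shown:
        # Escape the file content so stray "<" or "&" cannot break the page.
        before = html.escape("".join(old[i1:i2]).strip())
        after = html.escape("".join(new[j1:j2]).strip())
        parts.append(f"<li>{i1 + 1}: {tag}: {before} =&gt; {after}</li>")
    parts.append("</ul>")
    return "\n".join(parts)
```

The full list of changes can still be written to a plain text file alongside the report, so nothing is lost by keeping the HTML short.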

How to check in MySQL whether a PDF already exists or is 80% the same?

A user wants to upload a PDF.
The problem is re-uploads of the same file.
My idea is to convert the PDF to binary:
=> I will have a string "X" (the binary of that PDF) to save in MySQL.
=> Select with LIKE %(slice of X from 1/3 length(X) to 2/3 length(X))%.
Could that work?
I'm using Laravel.
Thanks for reading.
This cannot be done reasonably in MySQL. Since you are also using a PHP environment, it may be possible to do it in PHP, but a general solution will take substantial effort.
PDF files are composed of (possibly compressed) streams of images and text. Several libraries can attempt to extract the text, and will work reasonably well if the PDF was generated in a straightforward way; however, they will typically fail if some text was rendered as images of its characters, or if other obfuscation has been applied. In those cases, you will need to use OCR to generate the actual text as it is seen when the PDF is displayed. Note also that tables and images are out of scope for these tools.
Once you have two text files, finding overlaps becomes much easier, although there are several techniques. "Same 80%" can be interpreted in several ways, but let us assume that copying a contiguous 79% of the text from a file and saving it again should not trigger alarms, while copying 81% of that same text should trigger them. Any diff tool can provide information on duplicate chunks, and may be enough for your purposes. A more sophisticated approach, which however does not provide exact percentages, is to use the normalized compression distance.
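For reference, the normalized compression distance of two inputs x and y is (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. A minimal sketch follows, written in Python rather than PHP and assuming zlib as the compressor, which is my choice and not something the approach prescribes. Values near 0 suggest near-duplicates and values near 1 suggest unrelated content; any cut-off for "same 80%" would have to be calibrated on real documents.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # shared content compresses away here
    return (cxy - min(cx, cy)) / max(cx, cy)
```

One caveat: zlib only looks back 32 KB, so for extracted texts longer than that a compressor with a larger window (for example the standard `lzma` or `bz2` modules) is likely a better fit.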

war packaging error grails 2.3.4

I am using Grails 2.3.4 with the MySQL connector mysql:mysql-connector-java:5.1.24, and there are 163 GSP files. Every time I run the war script, or any other command that creates a WAR file, it shows the following error:
.Error
|
WAR packaging error: encoded string too long: 70621 bytes
No GSP file is larger than 64 KB, and I have already commented out grails.project.fork in BuildConfig.groovy, but I am still getting the problem. Please help.
I doubt that this is the answer you want to see :) I can't imagine that you have a good reason for being anywhere near the max size of a GSP. You shouldn't even know what the number is, only that it's way higher than you would ever need it to be.
You've either got a ton of code or a ton of HTML (or both) in these gigantic pages. There are plenty of obvious strategies for putting your GSPs on a diet. Use taglibs to move a lot of the code (which should not be used at all in a GSP, this isn't PHP) out of the view rendering tier and into the controller and service tiers where it belongs. You can extract static and mostly-static HTML blocks to includes/templates.
There's probably a lot of duplicated work here too - it's difficult to get this many files this large without a significant amount of copypasta. As a file gets very large it gets very hard to maintain an overall sense of what's where - our brains can only handle a certain amount of data before overloading. You also tend to start misplacing small objects and partially eaten lunches in there, and that just makes things worse.
If you don't have the time for the significant refactoring this project likely needs if you've gotten this far off track, even a quick simple move to taglibs and templates without much thought about properly engineering the work would get things going. At least until you hit the limit again :)

Definition of a text file

I'm not really a professional programmer (I just do some number crunching); I'm just trying to learn a few more things about computing.
I'm here to ask for -a reference- for reading about the basic aspects of a 'file'. I'm having difficulty understanding the difference between text files and binary files. With my current understanding, an image file is no more 'binary' than a text file. I'd like to understand what makes a file a text file. Is it a special sequence of bits?
Please, I just need a good reading reference (although some clarification would be welcome), and I'm not really trying to ask a vague, generic question.
Preferably, I'd like to be pointed to technical reading containing definitions such as "a text file is a sequence of bits whose etc..."
Thanks,
Seneika.
BTW: what one finds on Wikipedia, for example, is not what I want.
Edit: horrible grammar mistake corrected...
A text file is a computer file that stores a typed document as a
series of alphanumeric characters, usually without visual formatting
information. The content may be a personal note or list, a journal or
newspaper article, a book, or any other text that can be rendered
accurately in typewritten form. Text files are similar to word
processing files in that the content of both is primarily textual;
they differ in that text files usually do not record information such
as character style and size, pagination, or other details that would
specify the appearance of a finished document. Some computer operating
systems make a basic distinction between a text file, which is
intended to be translated directly into human-readable text, and a
binary file, which is interpreted directly by the computer.
Source : Binary File & Text File
More details on Wikipedia
This should help you in your quest to answer your question :-)
http://www.wisegeek.com/what-is-a-binary-file.htm
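As a small clarification on "is it a special sequence of bits?": on disk both kinds of file are just bytes, and a file is usually called a text file when those bytes decode cleanly under some character encoding into printable characters and ordinary whitespace. The sketch below is a heuristic under that reading, not a formal definition; the function name, the UTF-8 default, and the set of allowed control characters are assumptions.

```python
def looks_like_text(data: bytes, encoding: str = "utf-8") -> bool:
    """Heuristic: True if `data` decodes cleanly and has no unusual control characters."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        # The bytes are not valid in this encoding, e.g. most image data.
        return False
    allowed_controls = "\n\r\t\f"
    return all(ch.isprintable() or ch in allowed_controls for ch in text)
```

For example, `looks_like_text("naïve café\n".encode("utf-8"))` is true, while the 8-byte PNG signature `b"\x89PNG\r\n\x1a\n"` fails to decode as UTF-8 and is reported as not text.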