How to check whether a PDF already exists, or is 80% the same, in MySQL? - mysql

A user wants to upload a PDF, but the problem is re-uploads of a file that is already stored.
My idea is to convert the PDF to binary, which gives me a string "X" (the binary content of that PDF) to save in MySQL.
I would then run a SELECT ... LIKE '%...%' using the middle slice of X (from 1/3 to 2/3 of its length).
Would that work?
I'm using Laravel.
Thanks for reading.

This cannot reasonably be done in MySQL. Since you are also using PHP, it may be possible to do it there, but a general solution will take substantial effort.
PDF files are composed of (possibly compressed) streams of images and text. Several libraries can attempt to extract the text, and will work reasonably well if the PDF was generated in a straightforward way; however, they will typically fail if some text was rendered as images of its characters, or if other obfuscation has been applied. In those cases, you will need to use OCR to generate the actual text as it is seen when the PDF is displayed. Note also that tables and images are out of scope for these tools.
Once you have two text files, finding overlaps becomes much easier, although there are several techniques. "Same 80%" can be interpreted in several ways, but let us assume that copying a contiguous 79% of the text from a file and saving it again should not trigger alarms, while copying 81% of that same text should trigger them. Any diff tool can provide information on duplicate chunks, and may be enough for your purposes. A more sophisticated approach, which however does not provide exact percentages, is to use the normalized compression distance.
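Both ideas can be sketched in a few lines once the text has been extracted. The following is only an illustration in Python (the same logic would port to PHP); the function names, the `min_chunk` threshold, and the choice of zlib as the compressor are assumptions, not part of any library designed for this task. Note that `difflib` aligns chunks in order, so text that was copied and then reordered would be under-counted.

```python
import zlib
from difflib import SequenceMatcher


def shared_fraction(candidate: str, stored: str, min_chunk: int = 20) -> float:
    """Fraction of `stored` covered by chunks that also occur in `candidate`.

    Chunks shorter than `min_chunk` characters are ignored so that common
    words do not count as copied text (the threshold is an assumption).
    """
    if not stored:
        return 0.0
    matcher = SequenceMatcher(None, candidate, stored, autojunk=False)
    copied = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_chunk
    )
    return copied / len(stored)


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance using zlib.

    Values near 0 suggest very similar inputs, values near 1 unrelated ones.
    """
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

An upload could then be flagged when `shared_fraction(new_text, old_text)` exceeds 0.8 for any stored document. `SequenceMatcher` is quadratic in the worst case, and zlib only looks back about 32 KB, so for long documents both functions would probably need to be applied to sections rather than whole files.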

Related

What data storage model is used to store articles in Wikipedia?

Articles in Wikipedia get edited: they can grow, shrink, be updated, and so on. What file system or database storage layout is used underneath to support this? In a database course I read a bit about variable-length records, but that seemed aimed at small strings rather than whole documents. In a file system, files can grow and shrink, and I think that is done by chaining blocks together, so that an update does not rewrite the whole file. Perhaps something similar is done here.
I am looking for specific names and terminology, maybe even how the schema is defined in MySQL (I think Wikipedia uses MySQL).
Below are links to some write-ups on Wikipedia's architecture, but I have not been able to answer my question from them:
http://swe.web.cs.unibo.it/twiki/pub/WikiFactory/AntonelloDiMuroThesis/Wikipedia-cheapandexplosivescalingwithLAMP.pdf
http://dom.as/uc/workbook2007.pdf
Thanks,
See:
http://www.mediawiki.org/wiki/Manual:Database_layout
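As a rough illustration of the idea behind that layout: rather than patching blocks in place, each edit is stored as a complete new revision, and the page row simply points at the latest one. The sketch below uses SQLite from Python and is loosely modeled on the `page` / `revision` / `text` tables of older MediaWiki versions; the column set is heavily simplified, the helper functions are hypothetical, and newer releases organize the link between a revision and its text differently, so treat the linked manual as the authority.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE text     (old_id INTEGER PRIMARY KEY, old_text TEXT);
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER, rev_text_id INTEGER);
CREATE TABLE page     (page_id INTEGER PRIMARY KEY, page_title TEXT, page_latest INTEGER);
""")


def save_edit(title: str, wikitext: str) -> int:
    """Store the full new text as a new revision and repoint the page at it."""
    cur = conn.cursor()
    cur.execute("INSERT INTO text (old_text) VALUES (?)", (wikitext,))
    text_id = cur.lastrowid
    row = cur.execute(
        "SELECT page_id FROM page WHERE page_title = ?", (title,)
    ).fetchone()
    if row is None:
        cur.execute("INSERT INTO page (page_title) VALUES (?)", (title,))
        page_id = cur.lastrowid
    else:
        page_id = row[0]
    cur.execute(
        "INSERT INTO revision (rev_page, rev_text_id) VALUES (?, ?)",
        (page_id, text_id),
    )
    rev_id = cur.lastrowid
    # Old revisions are never modified; only this pointer moves.
    cur.execute("UPDATE page SET page_latest = ? WHERE page_id = ?", (rev_id, page_id))
    conn.commit()
    return rev_id


def current_text(title: str) -> str:
    """Follow page -> latest revision -> text to fetch the current article body."""
    row = conn.execute(
        "SELECT t.old_text FROM page p "
        "JOIN revision r ON r.rev_id = p.page_latest "
        "JOIN text t ON t.old_id = r.rev_text_id "
        "WHERE p.page_title = ?",
        (title,),
    ).fetchone()
    return row[0]
```

Keeping every revision whole makes history and rollback trivial at the cost of storage, which is why such systems usually compress old text or move it to separate storage.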

Tools for viewing logs of unlimited size

It's no secret that application logs can go well beyond the limits of naive log viewers, and the desired viewer functionality (say, filtering the log based on a condition, or highlighting particular message types, or splitting it into sublogs based on a field value, or merging several logs based on a time axis, or bookmarking etc.) is beyond the abilities of large-file text viewers.
I wonder:
Whether decent specialized applications exist (I haven't found any)
What functionality might one expect from such an application? (I'm asking because my student is writing such an application, and the functionality above has already been implemented to a certain extent of usability)
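Two of the features mentioned above, merging several logs on a time axis and filtering on a condition, can be done lazily so that file size stops mattering. This is only a minimal sketch: it assumes every line starts with an ISO-8601 timestamp followed by a space, that each input is already sorted by time, and the names `parse`, `merge_logs` and `where` are made up for illustration.

```python
import heapq
from datetime import datetime
from typing import Callable, Iterable, Iterator, Tuple

Entry = Tuple[datetime, str, str]  # (timestamp, source name, raw line)


def parse(lines: Iterable[str], source: str) -> Iterator[Entry]:
    """Yield entries one at a time; assumes a leading ISO-8601 timestamp."""
    for line in lines:
        line = line.rstrip("\n")
        stamp, _, _rest = line.partition(" ")
        yield datetime.fromisoformat(stamp), source, line


def merge_logs(*streams: Iterator[Entry]) -> Iterator[Entry]:
    """Merge time-sorted streams without loading any of them into memory."""
    return heapq.merge(*streams, key=lambda entry: entry[0])


def where(entries: Iterable[Entry], predicate: Callable[[Entry], bool]) -> Iterator[Entry]:
    """Lazily keep only the entries for which the predicate holds."""
    return (entry for entry in entries if predicate(entry))
```

Because everything is an iterator, passing open file objects to `parse` reads each log only as fast as the viewer consumes it.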
I've been using Log Expert lately.
It can take a while to load large files, but it will in fact load them. I couldn't find the file size limit (if there is any) on the site, but just to test it, I loaded a 300 MB log file, so it can at least go that high.
Windows Commander has a built-in program called Lister which works very quickly for any file size. I've used it with GBs worth of log files without a problem.
http://www.ghisler.com/lister/
A slightly more powerful tool I sometimes use is Universal Viewer from http://www.uvviewsoft.com/.

PDF Report generation [closed]

EDIT : I completed this project using ABCpdf. For anyone interested, I love this product and their support is A+. Everything I listed as a 'Con' for the HTML -> PDF solution was easily doable in ABCpdf.
I've been charged with creating a data-driven PDF report. After reviewing the plethora of options, I have narrowed it down to two. I need you all to help me decide, or offer alternatives I haven't considered. Here are the requirements:
100% Data driven
Eventually PDF (a stop in HTML is fine, so long as it is converted)
Can be run with multiple sets of data (the layout is always the same, the data is variable)
Contains normal analysis-style copy (saved in DB with html markup)
Contains tables (data for tables is generated at run-time)
Header/Page # on each page
Table of Contents
.NET (VB or C#)
Done quickly
Now, because of the fact that the report is going to be generated with multiple sets of data, I don't think a stamped pdf template will work since I won't know how long or how many pages a certain piece of the report could require.
So, I think my best options are:
Programmatic creation using an iText-like solution.
Generate in HTML and convert to PDF using a third-party application (ABCPdf is the tool I have played with so far)
Both solutions have their pros and cons.
Programmatic solution:
Pros:
Flexible
Easy page numbering/page header/table of contents
Free
Cons:
Time consuming (to write a layer on top of iText to do what I need and keep maintainable)
Since the copy is already stored in the db with html markup, I would have to parse through the data before I place it into the pdf, ensuring I don't have to break the paragraph into chunks so I can apply bold, italic, underline, etc. to specific phrases. This seems like a huge PITA, and I hope I am wrong about that assumption.
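The parsing step worried about in the last point is fairly mechanical. The sketch below is in Python rather than .NET and is only meant to show the shape of the problem: it splits stored markup into runs of text, each tagged with the styles active at that point, which could then be handed one by one to whatever PDF library is used. The tag-to-style table and the `split_runs` name are assumptions, and real copy with lists, links or malformed markup would need more handling.

```python
from html.parser import HTMLParser
from typing import FrozenSet, List, Tuple

# Which tags map to which style (an assumed, minimal set).
STYLE_TAGS = {"b": "bold", "strong": "bold", "i": "italic", "em": "italic", "u": "underline"}


class RunSplitter(HTMLParser):
    """Collects (text, active styles) pairs while walking the markup."""

    def __init__(self) -> None:
        super().__init__()
        self.active: List[str] = []
        self.runs: List[Tuple[str, FrozenSet[str]]] = []

    def handle_starttag(self, tag, attrs):
        if tag in STYLE_TAGS:
            self.active.append(STYLE_TAGS[tag])

    def handle_endtag(self, tag):
        style = STYLE_TAGS.get(tag)
        if style in self.active:
            self.active.remove(style)

    def handle_data(self, data):
        if data:
            self.runs.append((data, frozenset(self.active)))


def split_runs(markup: str) -> List[Tuple[str, FrozenSet[str]]]:
    """Return the markup as a flat list of styled text runs."""
    parser = RunSplitter()
    parser.feed(markup)
    parser.close()  # flushes any trailing text
    return parser.runs
```

For example, `split_runs("Sales rose <b>12%</b>.")` yields three runs, of which only the middle one carries the `bold` style.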
HTML -> PDF
Pros:
Easy to generate from db (no parsing necessary)
Many tools for conversion
Uses technology I am already familiar with
Built-in "Print Preview" - not a req, but nice
Cons:
(Edited after project completion. All of my assumptions were incorrect and ABCpdf is awesome)
1. Almost impossible to generate page headers - Not True
2. Very difficult to generate page numbers - Not True
3. Nearly impossible to generate table of contents - Not True
4. (Cross-browser support isn't a con; since it's internal, I can dictate what browser to use)
5. Conversion tool quirks - may not convert exactly as rendered in browser - Not True
6. Overall, I think it would be very hard to format the HTML exactly as I would want it to appear/convert to PDF. - Not True
That's it - I need the community's help in deciding which way I should go. I might be wrong about some of my Pro/Con assumptions. If I am, please tell me. All thoughts and suggestions are welcome and appreciated.
Thanks
Decided on using an approach similar to the one used at
http://alistapart.com/articles/boom
Using ABCPdf instead of Prince for the eventual HTML -> PDF generation.
Anyone who is interested in the same thing, feel free to message me about this approach.
I think that if you have a full version of Adobe Acrobat Pro, it comes with Adobe Live Cycle. You should be able to produce reports generated from a database from it. It will give you everything you need in formatting since you will create the report from scratch.
You can create a database connection to an OLE database that will feed data to your form fields. You select the tables to be used, any stored procedures that will run, any queries, and then the data will appear on one of the palettes in the designer.
You can also use Web Services (WSDL) to receive and process commands and return the results to the form.
Either way, you would bind fields to your data source and then the data would be displayed in your form.
If you're willing to do a little .NET work there's this:
http://www.dotnetvj.com/2009/05/populating-pdf-from-aspnet-using.html
Depending on which platform you are using and targeting, you might want to consider a reporting solution. These are not perfect but the one thing they do give you is the ability to write a report once and then render it in HTML, PDF, or even Excel.
Usually they also provide an editor that helps you design the report and make it look just right. They provide things like paging, headers, footers, graphs, etc. They also provide an API that you can use to programmatically create and run the reports.
I've used Reporting Services in a MS environment and Jasper Reports in a Java environment with good results in both. I'm sure there are other options but these are the ones I've been able to use successfully.
For the HTML→PDF step, I really love Prince. It looks like you can call it from VB.
My recommendation is to use SQL Reporting Services.
Can design every page & table of your report
Include Header and Footer
Include Page Numbers
Table of Contents
Can span through multiple pages
Supports Images & Charts
Can be rendered to PDF without a need for any third-party PDF converters

Tools to help reverse engineer binary file formats

What tools are available to aid in decoding unknown binary data formats?
I know Hex Workshop and 010 Editor both support structures. These are okay to a limited extent for a known fixed format but get difficult to use with anything more complicated, especially for unknown formats. I guess I'm looking at a module for a scripting language or a scriptable GUI tool.
For example, I'd like to be able to find a structure within a block of data from limited known information, perhaps a magic number. Once I've found a structure, then follow known length and offset words to find other structures. Then repeat this recursively and iteratively where it makes sense.
In my dreams, perhaps even automatically identify possible offsets and lengths based on what I've already told the system!
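The "find a magic number, then follow length and offset words" workflow described above might look like this as a script. The record layout here is entirely hypothetical (a 4-byte magic `RECD`, a little-endian payload length, and a little-endian absolute offset of the next record, with 0 meaning "last record"); the point is only to show how Python's `struct` module supports this kind of exploration.

```python
import struct
from typing import Iterator, Tuple

MAGIC = b"RECD"                  # hypothetical magic number
HEADER = struct.Struct("<4sII")  # magic, payload length, offset of next record


def walk_records(data: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, payload) for each record in the chain starting at the first magic."""
    offset = data.find(MAGIC)
    seen = set()
    while offset != -1 and offset not in seen:  # `seen` guards against offset loops
        seen.add(offset)
        if offset + HEADER.size > len(data):
            break  # header would run past the end of the data
        magic, length, next_offset = HEADER.unpack_from(data, offset)
        if magic != MAGIC:
            break  # the offset word did not lead to another record
        start = offset + HEADER.size
        yield offset, data[start:start + length]
        offset = next_offset if next_offset else -1
```

When a guess about the layout is wrong, the walk stops early or yields implausible payloads, which is itself useful feedback while refining the hypothesis.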
Here are some tips that come to mind:
From my experience, interactive scripting languages (I use Python) can be a great help. You can write a simple framework to deal with binary streams and some simple algorithms. Then you can write scripts that will take your binary and check various things. For example:
Do some statistical analysis on various parts. Random data, for example, will tell you that this part is probably compressed/encrypted. Zeros may mean padding between parts. Scattered zeros may mean integer values or Unicode strings and so on. Try to spot various offsets. Try to convert parts of the binary into 2 or 4 byte integers or into floats, print them and see if they make sense. Write some functions that will search for repeating or very similar parts in the data, this way you can easily spot headers.
Try to find as many strings as possible, try different encodings (c strings, pascal strings, utf8/16, etc.). There are some good tools for that (I think that Hex Workshop has such a tool). Strings can tell you a lot.
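A small starting point for the kind of framework described in these tips, as a sketch only: a windowed entropy profile to spot compressed or encrypted regions and padding, and a printable-string finder. The window size and minimum string length are arbitrary assumptions, and only plain ASCII strings are handled here.

```python
import math
import re
from collections import Counter
from typing import Iterator, List, Tuple


def entropy(block: bytes) -> float:
    """Shannon entropy in bits per byte: about 8 looks random, 0 is constant."""
    if not block:
        return 0.0
    total = len(block)
    probabilities = (count / total for count in Counter(block).values())
    return sum(-p * math.log2(p) for p in probabilities)


def entropy_profile(data: bytes, window: int = 256) -> List[Tuple[int, float]]:
    """Entropy of each consecutive window, paired with the window's offset."""
    return [(i, entropy(data[i:i + window])) for i in range(0, len(data), window)]


def ascii_strings(data: bytes, min_len: int = 4) -> Iterator[Tuple[int, str]]:
    """Yield (offset, text) for each run of printable ASCII of at least min_len bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.start(), match.group().decode("ascii")
```

Windows with entropy close to 8 are candidates for compressed or encrypted data, while long runs near 0 usually indicate padding.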
Good luck!
For Mac OS X, there's a great tool that's even better than my iBored: Synalyze It!
(http://www.synalysis.net/)
Compared to iBored, it is better suited for non-blocked files, while also giving full control over structures, including scriptability (with Lua). And it visualizes structures better, too.
Tupni: to my knowledge it is not directly available from Microsoft Research, but there is a paper about the tool that may interest anyone wanting to write a similar program (perhaps open source):
Tupni: Automatic Reverse Engineering of Input Formats (ACM Digital Library)
Abstract
Recent work has established the importance of automatic reverse
engineering of protocol or file format specifications. However, the
formats reverse engineered by previous tools have missed important
information that is critical for security applications. In this
paper, we present Tupni, a tool that can reverse engineer an input
format with a rich set of information, including record sequences,
record types, and input constraints. Tupni can generalize the format
specification over multiple inputs. We have implemented a
prototype of Tupni and evaluated it on 10 different formats: five
file formats (WMF, BMP, JPG, PNG and TIF) and five network
protocols (DNS, RPC, TFTP, HTTP and FTP). Tupni identified all
record sequences in the test inputs. We also show that, by aggregating
over multiple WMF files, Tupni can derive a more complete
format specification for WMF. Furthermore, we demonstrate the
utility of Tupni by using the rich information it provides for zero-day
vulnerability signature generation, which was not possible with
previous reverse engineering tools.
My own tool "iBored", which I released just recently, can do parts of this. I wrote the tool to visualize and debug file system formats (UDF, HFS, ISO9660, FAT etc.), and implemented search, copy and later even structure and templates support. The structure support is pretty straight-forward, and the templates are a way to identify structures dynamically.
The entire thing is programmable in a Visual BASIC dialect, allowing you to test values, read specific blocks, and all.
The tool is free and works on all platforms (Win, Mac, Linux), but as it's a personal tool which I just released to the public to share it, it isn't well documented.
However, if you want to give it a try, and like to give feedback, I might add more useful features.
I'd even open source it, but as it's written in REALbasic, I doubt many people will join such a project.
Link: iBored home page
I still occasionally use an old hex editor called A.X.E., Advanced Hex Editor. It seems to have largely disappeared from the Internet now, though Google should still be able to find it for you. The last version I know of was version 3.4, but I've really only used the free-for-personal-use version 2.1.
Its most interesting feature, and the one I've had the most use for deciphering various game and graphics formats, is its graphical view mode. That basically just shows you the file with each byte turned into a color-coded pixel. And as simple as that sounds, it has made my reverse-engineering attempts a lot easier at times.
I suppose doing it by eye is quite the opposite of doing automatic analysis, though, and the graphical mode won't be much use for finding and following offsets...
The later version has some features that sound like they could fit your needs (scripts, regularity finder, grammar generator), but I have no idea how good they are.
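The byte-per-pixel view described above is easy to approximate without any special tool. This sketch writes a binary PGM image in which each byte becomes one pixel; it uses grayscale rather than the color coding A.X.E. applies, and the default row width of 256 is an arbitrary assumption worth varying, since structures often become visible only when the width matches a record size.

```python
def bytes_to_pgm(data: bytes, width: int = 256) -> bytes:
    """Render data as a binary PGM (P5) image, one grayscale pixel per byte."""
    if width <= 0:
        raise ValueError("width must be positive")
    height = max(1, -(-len(data) // width))       # ceiling division
    padded = data.ljust(width * height, b"\x00")  # pad the last row with black
    header = b"P5\n%d %d\n255\n" % (width, height)
    return header + padded
```

Writing the result to a `.pgm` file gives something most image viewers can open and zoom.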
There is Hachoir, which is a Python library for parsing any binary format into fields and then browsing the fields. It has lots of parsers for common formats, but you can also write your own parsers for your files (e.g. when working with code that reads or writes binary files, I usually write a Hachoir parser first to have a debugging aid). Looks like the project is pretty much inactive by now, though.
Kaitai is an open-source language for describing binary structures in data streams. It comes with a translator that can output parsing code for many programming languages, for inclusion in your own program code.
My project icebuddha.com supports this using python to describe the format in the browser.
A cut'n'paste of my answer to a similar question:
One tool is WinOLS, which is designed for interpreting and editing vehicle engine management computer binary images (mostly the numeric data in their lookup tables). It has support for various endian formats (though not PDP, I think) and viewing data at various widths and offsets, defining array areas (maps) and visualising them in 2D or 3D with all kinds of scaling and offset options. It also has a heuristic/statistical automatic map finder, which might work for you.
It's a commercial tool, but the free demo will let you do everything but save changes to the binary and use engine management features you don't need.

MS Access Reporting - can it be pretty?

I am working on a project converting a "spreadsheet application" to a database solution. A macro was written that takes screen shots of each page and pastes them into a PowerPoint presentation. Because of the nice formatting options in Excel, the presentation looks very pretty.
The problem I'm having is that I haven't ever seen an Access report that would be pretty enough to display to upper management. I think the output still has to be a PowerPoint presentation. It needs to look as close as possible to the original output.
I am currently trying to write some code to use a .pot (presentation template) and fill in the data programmatically. Putting the data into a PowerPoint table has been tricky because the tables are not easy to manipulate. For example, if a particular description is too long, I need to break into the next cell down (word-wrap isn't allowed because I can only have n lines per page).
Is there a way to make an Access report pretty, am I headed down the right path, or should I just try to programmatically fill in the Excel spreadsheet and use the code that already exists there to produce the presentation? (I'd still need to figure out how to know when to break a line when using a non-monospaced font, as the users are currently doing that manually when they enter the data in the spreadsheet)
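The line-breaking part of the question comes down to greedy wrapping against a measured width instead of a character count. The sketch below is in Python rather than VBA and shows the logic only; `measure` stands in for whatever text-measurement call the real environment provides, and `rough_width` is a made-up placeholder metric, not real font data.

```python
from typing import Callable, List


def wrap_by_width(text: str, max_width: float, measure: Callable[[str], float]) -> List[str]:
    """Greedily pack words into lines whose measured width stays within max_width."""
    lines: List[str] = []
    current = ""
    for word in text.split():
        trial = word if not current else current + " " + word
        if current and measure(trial) > max_width:
            lines.append(current)  # the word does not fit, so start a new line
            current = word
        else:
            current = trial
    if current:
        lines.append(current)
    return lines


def rough_width(s: str) -> float:
    """Placeholder metric (assumption): narrow characters count as half a unit."""
    return sum(0.5 if ch in "iljt.,' " else 1.0 for ch in s)
```

The length of the returned list says how many cells a description will occupy, so the code can check it against the n-lines-per-page limit before writing anything to the slide.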
Jason Z:
If I set it to wrap, and I already have n lines, it would make n+1 or 2 lines on the slide, which is unacceptable.
Dennis:
That article looks very good, I should be able to glean something from it. Thanks!
Access has the capability to create downright beautiful reports. The problem is that it can't make a spreadsheet look better than Excel. You have to know when to use each tool.
Use Excel when you have spreadsheet-like formatting, need a lot of boxes and lines, or want to draw charts.
Use Access when you will output a report as a PDF. It's very useful for one-record-per-page detail reports, formatting where you need to position things very precisely, and where you need to embed subreports with related or unrelated data.
Think about the reports that would be nasty in Excel because you'd have to merge cells all over the place and do funny things with the placement and the layout would never work. That's where Access shines.
Joel (your co-host here) wrote about using Access reports for shipping labels a few years back... maybe this could be an inspiration for you?
http://www.joelonsoftware.com/articles/HowToShipAnything.html
I have implemented Access reports which were 'pretty' enough. The downside is that it takes a lot of time and effort, and trial and error to produce the desired output.
You can definitely get there, but it requires the patience of a saint.
I guess it depends on what you mean by pretty. For example, I do not find it particularly difficult to produce, say, reasonable graphs and tables with alternate line shading in Access. It is also possible to use MS Word and fill in bookmarks, or mail merge. If the present system uses VBA to create the PowerPoint presentation, perhaps much of it could be transferred to Access? Microsoft have an article on Access to PowerPoint: http://msdn.microsoft.com/en-us/library/aa159920(office.11).aspx
Finally, it is not impossible to build HTML output from Access.
We create multi-colored, conditionally formatted reports that are printed for the partner meeting each month of a publicly traded corporation. They're real pretty.
I would suggest that the problem you're having is because the requirement to replicate the old method identically is an incredibly bad idea.
You're not using Excel any more.
You're using a different tool with different capabilities.
Thus, you will use different methods to get results.
Re-evaluate the original requirements to see if they still make sense (e.g., exactly why is PowerPoint involved at all? Can PowerPoint import from the Access report snapshot viewer? Can PowerPoint import from a PDF produced from an Access report?), or if they are too connected to the old tools, and then determine what is important and what isn't, and only then should you start designing your solution.
I personally would not try to re-invent the wheel here. If you already have an Excel sheet that has the formatting you want, just export the data from Access into Excel for the report. Now, if you didn't have the original Excel sheet to begin with, that would be a completely different story.
As for breaking lines with non-monospaced fonts, have you tried setting the cell format to wrap?
It sounds like the path of least resistance is to fill in the Excel spreadsheet. We have a contractor who does our Access stuff, and for the more complicated reports he uses Excel. I guess complicated == hard to make look good.
Rather than filling in the Excel spreadsheet programmatically, you may want to use the external data features of Excel and Access. Generally I put a query on each tab, which of course may be hidden. An "update all" causes all the queries to be updated. Then summary tabs show the pretty results from all the individual queries.
For one particularly complex system, a bit of Excel VBA programmatically changed a query and then walked through the tabs updating each one.
Finally, rather than doing screen shots, Excel has a "copy cells as an image" option that populates the copy buffer with a resizable image. This could give you better-looking results than a pure screenshot, since a screenshot can have various deficiencies depending on pixel density.
Just an update:
After a few hours of work, I was able to get a nice report out of Access (almost an exact copy of the Excel version). It wasn't as difficult as I thought; I just had to figure out the correct mixture of subreports and page breaks.
Working with the word-wrap features of Excel/PowerPoint was a dead end because there could only be a set number of lines per page, period; plus I was too lazy to nail down all the pagination issues with VBA code myself. Like Shelley says, Access shines at report generation.
The output ended up being a PDF (using Adobe Acrobat Professional). The problem I have left is getting select pages of said PDF into PowerPoint without PowerPoint antialiasing the results for me and making the resulting slide's text fuzzy. I found a couple of articles on converting .snp output to .wmf, which sounds like the way to go on that front.