I have a requirement to produce letters to send to customers which will contain a report within the letter text. The idea is that the user can create letter paragraphs which can be saved in a database for later use, can be sequenced and can appear either before or after a report. The report will be in table form.
I've looked at PDF::Table and PDF::API2 (both of which are good at what they do); however, both place items on the page at fixed positions rather than creating a free-flowing document.
Unless I've missed something, there is no way to add a table immediately after a paragraph of text, or vice versa, because page positions are required.
I have thought about using HTML::Template to create the basic letter, then HTML::HTMLDoc to convert it to PDF, but I would need the ability to insert a page break on change of customer.
What is my best option to achieve the above result please?
Many Thanks
There are only two ways that I've had any success with.
The first is the Apache XML-FOP project. This is a huge, sprawling Java library and specification for turning XML documents into nicely formatted PDFs. I was never good enough with XML stylesheets and transformations to get to grips with this.
The second is to generate OpenOffice/LibreOffice documents and then use a copy of LibreOffice in headless mode to convert them to PDFs. This is what I generally end up doing. You may want a minimal X11 installation (for fonts etc.) with Xvfb as a fake display.
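For what it's worth, the conversion step itself is small once you have a document; a minimal sketch (Python purely for illustration, and the file names are made up):

    import subprocess

    # Convert an ODT file to PDF with LibreOffice in headless mode;
    # --outdir controls where the resulting PDF is written.
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "pdf",
         "--outdir", "out", "letters.odt"],
        check=True)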
For editing the documents I've had success with the OpenOffice-OODoc distribution. HTH.
I have a large text file, which is an Italian-English dictionary. A typical line is:
Mazzapícchio, a long pole that fishers vse to bob vp and down for Eeles, and also to make fish to stirre. Also a kind of meate or custard in some parts of Italie made with milke and egges.
(Yes, it's a 17th-century dictionary.)
I'm looking for the best/easiest way to turn this into a searchable database.
The search would need to ignore the diacritics, with everything up to the first comma treated as the 'entry'. There are some cross-references, e.g.: Mefíte, as Mephíte.
My first thought is simply to turn it into HTML, with anchor tags for the word/phrase up to the first comma. That should be easy enough with a bit of grep. I could also add links to the cross-references in the same way (using BBEdit to confirm each change). It would then be easy to query just using a browser's search field.
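Something like this rough sketch is what I have in mind (Python rather than grep, just to make the idea concrete; the file names are made up, and it assumes one entry per line):

    import unicodedata

    def strip_diacritics(s):
        # Decompose accented characters, then drop the combining marks,
        # so 'Mazzapícchio' can be found by searching 'mazzapicchio'.
        return "".join(c for c in unicodedata.normalize("NFD", s)
                       if not unicodedata.combining(c))

    with open("dictionary.txt", encoding="utf-8") as src, \
         open("dictionary.html", "w", encoding="utf-8") as out:
        for line in src:
            entry, _, definition = line.partition(",")
            anchor = strip_diacritics(entry.strip()).lower()
            out.write('<p id="%s"><b>%s</b>,%s</p>\n'
                      % (anchor, entry.strip(), definition.rstrip()))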
However, ideally, I'd like something that returned only (and all) the matching results. XML/HTML tagging is the easy bit; the problem is the front-end to access/query it.
I'm on MacOS. (I'm also investigating Apple's Dictionary format...)
Any ideas on how to proceed would be welcome. Thanks.
This is a huge question. So many choices in so many areas.
A small start:
A searchable DB: look at https://solr.apache.org/
PHP to handle the front-end interaction with Solr and to serve your HTML search form and results.
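To give a feel for the moving parts, here is a minimal query against a Solr core (a sketch only: it assumes a core named "dictionary" with an "entry" field on the default port, and uses Python rather than PHP just to keep it short):

    import json
    import urllib.request

    # Ask Solr's select handler for documents matching the query;
    # Solr answers with JSON.
    url = ("http://localhost:8983/solr/dictionary/select"
           "?q=entry:mazzapicchio&wt=json")
    with urllib.request.urlopen(url) as resp:
        results = json.load(resp)

    for doc in results["response"]["docs"]:
        print(doc)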
I'm having trouble wrapping my head around using XML as a complement to HTML. I know what they are used for, but I don't quite understand how to use them together.
I know that you can use JavaScript to convert an XML file to HTML, but I don't get how that's going to do the trick. How would I be able to style this HTML file?
I have a template form, which I want to be accessible on a server and for which I want to enable edits. Once edited, I want to save the edits in a separate file, so that the template is still available. (Just so you guys have a little bit of background regarding what I need this for.)
After a lot of research I came to the conclusion that I would need to use XML, as I will have to store and transport data.
Could anyone explain in more detail how exactly XML can be used as a complement to HTML?
If you need more details or information please let me know. I did do a lot of research and I read the other posts regarding how to convert XML to HTML with JavaScript, but that doesn't answer my question about how EXACTLY they complement each other.
I guess my problem here is that I have yet to manage to wrap my head around the concept.
XML is related to HTML: it uses the same magic characters for its markup and the same logic for where to put the data.
The characters < and > are used to separate the markup from the content.
The character &, together with an entity name such as &lt;, is used to encode characters which would otherwise lead to trouble.
Elements can contain attributes, like <someElement someAttribute="attr value">.
Elements can contain text or sub-elements.
The big difference is that XML leaves you absolutely free in how you name your elements and attributes, while HTML relies on dedicated names (like <body>); conversely, XML is absolutely strict about structure, while HTML allows a lot (like unclosed tags).
As a thing in the middle there is XHTML, which is as strict as XML but sticks to the vocabulary of HTML.
It is almost impossible to parse arbitrary HTML as XML, but you can easily create XML which any browser will accept as a valid web page.
Your issue cries out for XSLT. This is a method to transform a given XML document into a new format. It allows you, for example, to export your data as XML and create a nice web page from it. Different XSLT stylesheets will present the same data in different ways.
There are several online tools to test this feature. You might have a look here.
Your statement "After a lot of research I came to the conclusion that I would need to use XML, as I will have to store and transport data" is not entirely clear... How you send data (to a web application), and the way you send the (manipulated) data back, is not bound to XML. This is very often done with JSON, using JavaScript to read, edit and send it back.
XML -> XSLT -> HTML is often used to create (rather static) reports for a web viewer.
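To make that XML -> XSLT -> HTML pipeline concrete, here is a minimal sketch (Python with the lxml library; the element names are invented for the example):

    from lxml import etree

    # A tiny XML document and an XSLT stylesheet, inline for brevity.
    xml = etree.XML("<items><item>first</item><item>second</item></items>")
    xslt = etree.XML("""
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/">
        <html><body>
          <ul>
            <xsl:for-each select="//item">
              <li><xsl:value-of select="."/></li>
            </xsl:for-each>
          </ul>
        </body></html>
      </xsl:template>
    </xsl:stylesheet>
    """)

    transform = etree.XSLT(xslt)
    print(str(transform(xml)))  # the same data, now rendered as an HTML list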
I'm trying to write a crawler that gets raw HTML data and finds the title, price, update date, photo, etc. fields and writes them to a database. This is the classic, old way to crawl data.
I think I can do this job another way.
If I crawl all the pages of the web site (maybe more than 1000) and compare them, I can find the specific areas.
I mean, the HTML tags will always be the same; only specific areas, like the title and images, will change.
So, what is the best way to determine changed areas?
"compare them all I can find the specific areas"
"what is the best way to determine changed areas?"
In your question you describe a scraping/crawling approach of comparing parts of pages to find the data in specific areas. That smells like a regex approach; don't use it, as it is very inefficient. Use XPath instead, operating on XML structures.
So, keep it simple:
Get the HTML
Make it a DOM
Make the DOM valid XML
Apply XPath queries to the XML
Believe me, XML libraries are well able to handle huge structures (including stray HTML tags) and traverse them. A classic example of using XPath is in this post of mine.
To determine the paths of the data nodes, just use the web inspector tools (F12 in Chrome and IE, Ctrl+Shift+I in Firefox) to see the HTML tags containing the useful info.
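A minimal version of those four steps (Python with lxml for illustration; the URL and the XPath expression are placeholders for whatever the inspector shows you):

    import urllib.request
    from lxml import html

    # Step 1: get the HTML.
    raw = urllib.request.urlopen("http://example.com/product/1").read()

    # Steps 2 and 3: lxml's HTML parser builds a DOM and copes with
    # sloppy markup, giving you an XML-like tree.
    doc = html.fromstring(raw)

    # Step 4: apply an XPath query to pull out the interesting nodes.
    titles = doc.xpath("//h1[@class='title']/text()")
    print(titles)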
I have an intranet page that uses #include to pull in other files, such as rotas or phone number tables. These included files are maintained in Microsoft Excel.
Not all of them are maintained by me (the guy in charge of the intranet itself), so there isn't really the option of refusing to accept Excel-produced HTML files.
The problem I have is that these files are crammed full of crap that is almost certainly not needed by the browser to display what is essentially a simple table with some colour formatting in places (and sometimes bold or italic text in particular cells).
What, in your opinion, would be a better way to go about this? Is there some code that can clean all the crap out of a file saved as HTML by Excel? Is there a neater, more widely used way to display inline content generated by third parties?
Any suggestions welcome.
Edit: solutions that use ASP, PHP or JavaScript are also welcome.
Is there some repeating structure to your files? Exporting to CSV (comma-separated values) and rebuilding the tables from that source could be easier and faster than trying to remove the dozens of unwanted elements and attributes Excel thinks it has to add.
If your bold and italic cells really are that particular (e.g. a column of data that is part italic, part normal), then CSV won't help, though.
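Rebuilding the table from CSV really is only a few lines; a rough sketch (Python, with made-up file names):

    import csv
    import html

    rows = []
    with open("rota.csv", newline="") as f:
        for record in csv.reader(f):
            cells = "".join("<td>%s</td>" % html.escape(c) for c in record)
            rows.append("<tr>%s</tr>" % cells)

    with open("rota.html", "w") as out:
        out.write("<table>\n%s\n</table>\n" % "\n".join(rows))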
Why do I need Markdown with a front-end editor like WMD? What does Markdown do to the content that's sent from the WMD editor?
How does Markdown store the content in the backend? Is it stored as-is, like *bold*, or in some other format? Why can't I just do an HTML encode?
Sorry if I sounded very naïve.
It's probably helpful to take a step back and ask some of the larger questions. The issue Markdown is trying to solve is that of rich editing in the browser. Consider this: at some point, for any piece of software to enable rich text, it has to describe the richness in some manner, however that may be.
We could call that description of richness "markup" (by "description of richness" I mean things like "this bit of text is bold" or "this bit of text is a hyperlink") -- it marks up the text with meta-information about its richness.
Implementations of rich text can take one of two approaches: either (a) hide the markup from the user or (b) give them access to the markup.
For those who choose to hide it, the end result is very often WYSIWYG. The user is oblivious to what is happening behind the scenes; the editor takes care of the details. Think of MS Word as an example. No regular end user manipulates the Word markup format.
For implementations which choose to expose the markup, a markup language is then in order to allow users to interacat with it. Such markup languages would be things like HTML doing <tag> or BB code for example, doing things like [tag].
Markdown is one of these languages.
As opposed to the former types I mentioned, Markdown has tried to design itself so that the markup resembles the plain ASCII conventions people already use. For example, it's common for people to asterisk their text to set it off, *important*, and in Markdown this notation indicates emphasis (italics).
Regarding storage, as Stephan pointed out, the system will most likely store the raw Markdown, because the user will most likely need the possibility of editing, and the original Markdown can be recalled for that purpose.
In most of the systems I've built, I store the Markdown and then normalize it to a second field that caches the HTML rendering of the Markdown. That way I don't have to do Markdown-to-HTML rendering for every Markdown field. It takes a little more space, but I'd rather give the user a faster response than use less DB storage space.
Care should also be taken when accepting Markdown from the browser, as it can easily contain <script> tags, which need to be filtered out. Most Markdown implementations will also recognize HTML intermingled with Markdown formatting, so to be safe, you need to make sure your inputs and caches are sanitized properly.
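A sketch of that store-and-cache pattern, sanitizing included (Python; the markdown and bleach libraries are one common pairing for this, and the table and field names are invented):

    import bleach
    import markdown

    def save_post(db, raw_markdown):
        # Render once at save time instead of on every page view.
        rendered = markdown.markdown(raw_markdown)
        # Keep only a whitelist of harmless tags, so intermingled
        # <script> tags never make it into the cached HTML.
        safe = bleach.clean(
            rendered,
            tags=["p", "em", "strong", "ul", "ol", "li",
                  "a", "code", "pre", "blockquote"],
            attributes={"a": ["href"]})
        # db is assumed to be a DB-API connection (e.g. sqlite3).
        db.execute("INSERT INTO posts (body_md, body_html) VALUES (?, ?)",
                   (raw_markdown, safe))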
The reason for using an alternate encoding system other than HTML is security:
Markdown and other such wiki-style encoding systems do not usually support scripting languages
HTML supports scripting languages in many ways
The two main security issues are:
Malware: criminals use scripts in user-generated content to attempt malware actions on the content reader's computer, by scripting against known security holes
Freeloaders: using scripts to subvert the rest of the site by changing the content frame or styles, i.e. ads, menus, logos, etc. This can also be criminal behaviour, if not just annoying
By using an intermediate language such as Markdown, you have total control over the rendered output
Filtering HTML is possible, but it is also complex and risky
The other significant reason for an alternate encoding system is enforcement of style. Normal HTML has too many options. By limiting the available options, users can only use certain styles. This usually makes for cleaner-looking and more readable content (compare SO to eBay).
The main reason for using Markdown is the readability of the marked-up text. For instance, you can send it in a plain-text email and the reader will still understand the emphasis and bullets, the text will be divided into paragraphs, et cetera.
When you ask about storing data, it depends. If you enable Markdown in the WordPress blog engine, it stores data as the user input it - in Markdown. In Stack Overflow, however, it seems the data is stored as HTML; at least, the "Stack Overflow data dumps" contain HTML, not Markdown (I've seen people complaining that they have to convert it back).
If you use the WMD editor, you can show the user what the output will look like after it is converted to HTML. Even though Markdown syntax is really simple, it is not hard to make mistakes; hence, it is best to show users the output.
Another reason for using Markdown instead of a WYSIWYG control: a WYSIWYG control allows the user to put HTML into the data you are displaying on your web page, so you have to be the one who decides when there is simply incorrect HTML and when it is an evil XSS/CSRF/whatever injection. With Markdown, you simply convert *something* to <em>something</em>, remove any unknown HTML elements, and you're done.