Tool to remove leading/trailing spaces in HTML files?

I have searched but could not find anything similar to what I need. I am looking for a tool that can remove leading/trailing spaces in my HTML files, which also contain embedded JavaScript. In the end, I plan to use this tool within my NAnt scripts so that the task runs on the fly with every deployment.
Is there already a tool that can do this, or, failing that, what would be the best scripting language to write one in?
Basically, I would like something like what MS Word does for text with "justify" (Ctrl+J), but applied to my HTML files.

Here is the solution I found for this.
Using the htmlcompressor command-line tool, I was able to remove only the leading spaces of the HTML file, whereas fully minifying the files didn't work for me.
Solution:
java -jar htmlcompressor.jar --preserve-comments --preserve-multi-spaces --preserve-line-breaks --output D:\html\foo-leading_spaces.htm D:\html\foo.htm
Since this tool generates the results I want, I can call it from my build scripts to perform this step on the fly.
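If you would rather avoid the Java dependency, a plain sed one-liner can do the same per-line trimming. This is only a sketch: it assumes GNU sed is available on the build machine (e.g. via Cygwin or GnuWin32 on Windows), and foo.htm stands in for whichever file the build step processes.
sed -i "s/^[ \t]*//;s/[ \t]*$//" foo.htm
Either the htmlcompressor call or the sed call can then be wrapped in an exec task in the NAnt script.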
Thanks to everyone for their input; I hope this helps others in a similar situation.

Related

What is DiveintoHTML5.info based on?

I was wondering what programming/markup languages Diveintohtml5.info uses. I am planning to create an online book (on some math topics) similar to Mark Pilgrim's, but I need to know exactly what he used to create it.
Did he use a CMS like WordPress, or is it just plain old HTML and CSS?
I am a bit new to the world of web development. Be kind.
Thanks in advance
Looking at the book’s source code on GitHub, it seems to be mostly static HTML, CSS, and JavaScript. However, it uses Python, Java, and shell code too, as you can see in the Makefile. (Makefiles are run with make.)
The Makefile contains a lot of shell code doing things like substitution, file copying, and concatenation. It also calls the Python and Java code, which is all in the util folder. The Python and Java programs compress the HTML and CSS, build the table of contents from the headings in each file, and do a few other things.
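Purely as an illustration of the kind of shell work such a Makefile drives (these are generic commands, not taken from the actual repository; the file names and the @YEAR@ placeholder are made up):
cat header.html ch01-content.html footer.html > publish/ch01.html    # concatenate page parts into one chapter file
sed "s/@YEAR@/2011/g" publish/ch01.html > publish/ch01.tmp && mv publish/ch01.tmp publish/ch01.html    # simple text substitution pass
The real Makefile does the same sorts of steps, plus the calls into the Python and Java helpers.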

How can I create a well-formatted PDF?

I'm working on automating our company invoicing system. Currently all data is stored in our local MySQL database; someone manually updates an Excel spreadsheet and then merges this data into an MS Word template. The goal is to automate this process so that the invoice can be generated from our intranet website as a PDF.
My original plan was to create a template in HTML/CSS and use wkhtmltopdf to generate the PDF but I ran into problems with getting a repeatable header and footer on each page. thead and tfoot aren't supported by Webkit and the fix suggested in this other question does not seem to work either.
So I then stumbled on using XML and XSL-FO, the latter I know nothing about. Is this the best path to take? Are there any libraries or utilities out there that will make converting my HTML+CSS into XML+XSL-FO easier? Are there any other alternatives I'm overlooking?
EDIT
Currently the server is CentOS Linux with a MySQL database. All other code is in PHP at the moment, but that may change as the whole system is being revamped. Linux and MySQL will almost certainly remain, though.
For your requirement, XSL-FO might just do the trick. It is much cleaner to produce the PDFs directly from the data than to go down the cumbersome HTML path; if you need to display the HTML as well, you might consider converting from HTML to PDF instead, but that will always be messy.
You can get XML results from MySQL quite easily (mysql --xml), and then you write one (or several) XSL-FO stylesheets for the data. With some processors you can then produce not only PDFs but also PostScript files or RTFs.
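As a rough sketch of that pipeline, assuming Apache FOP as the XSL-FO processor (the database, table, and file names below are made up for illustration):
mysql --xml -u invoice_user -p -e "SELECT * FROM invoices WHERE id = 42" billing > invoice.xml
fop -xml invoice.xml -xsl invoice.xsl -pdf invoice-42.pdf
The invoice.xsl stylesheet is the XSL-FO part you would write yourself, matched to the resultset/row/field structure that mysql --xml emits.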
XSL-FO has its limitations, though, but for your situation it should suffice.
I admit the learning curve can be steep, and maintaining XSLT stylesheets can get very tiring, but as you get to know the technology better, you end up writing less code.
Another possibility is to do the whole thing in, for example, Java or C#: send SELECT statements, loop over the results, and iteratively build the PDF using a library like iText.
You could try JODReports or Docmosis as less code-intensive options. You supply Word or OpenOffice Writer documents to act as templates and use these engines to manipulate/populate the templates, then spit out the documents in the format(s) you require. This may mean your existing Word templates can be used directly, which should save you some effort/time.
iText is another library that will let you build and pump out PDFs from code. It's pretty good.
If you could use ASP.NET for the web part, you could use the free ReportViewer library and designer for automated publishing of PDFs.
Here are some references:
http://gotreportviewer.com
http://weblogs.asp.net/srkirkland/archive/2007/10/29/exporting-a-sql-server-reporting-services-2005-report-directly-to-pdf-or-excel.aspx
If you're OK using .NET and C#, you could use DotPdf from Atalasoft (obligatory disclaimer: I work for Atalasoft and wrote most of DotPdf). The Generating namespace is geared for exactly what you're trying to do: automate report generation. From the very basics, you could just create docs directly with the toolkit or you can create template documents that have unpopulated text fields that you can reload and fill later (see here and here for examples).

How difficult would it be to add a message to 1000+ HTML files?

I have over 1000 HTML files that I need to edit in exactly the same way. I need to:
Add a simple piece of JavaScript code at the top of each file.
Put some kind of message at the top (it can be anything, as long as it displays the message I want it to).
I was wondering, do I have to edit each file manually to do this? Aren't there .htaccess hacks or anything like that?
Any suggestions/help would be appreciated.
If you are using Linux, or have installed Cygwin on Windows, then sed may be the quickest way to edit the files.
Combined with find, it can be used to very quickly add (or indeed edit) many files.
For example, the following command will replace all instances of the word 'old' with 'new' in all .html files:
find . -name "*.html" -exec sed -i "s/old/new/g" '{}' \;
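Closer to what you are actually asking, the following sketch prepends a line to every file. It assumes GNU sed (the -i and 1i behaviour differs on BSD/macOS sed), and banner.js is just a placeholder name for your script:
find . -name "*.html" -exec sed -i '1i <script src="banner.js"></script>' '{}' \;
The message line could be added the same way with a second 1i expression.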
There are many other examples online.
You can use .htaccess to autoprepend some code, but to be honest, a global find/replace would be a better idea in many ways.
I don't know what OS you use, but speaking as a Mac developer, http://www.hexmonkeysoftware.com/ is a neat little tool that does find-and-replace over loads of files.
Otherwise, a quick Python script would be easy to write to do this.
If there is any common structure to the files, and their content is valuable and going to be used further in some way, then I would consider going the opposite route and extracting all that information, storing it in a database (or something) and presenting it like normal. This would provide more flexibility in presentation, and could even make the data useful/usable in other ways.

Which technology should I use to transform my LaTeX documents into HTML documents?

I want to write a little program that transforms my TeX files into HTML. I want to parse the documents and turn the macros (the built-in ones and of course my own) into HTML pieces. Here are my requirements:
predefined rules (e.g. \begin{itemize} \item text \end{itemize} => <ul><li>text</li></ul>)
defining own CSS style
ability to convert formulas (extract the formulas, load them into an image creator, and then save them as JPG/PNG)
easy to maintain and concise
I know there are several technologies out there, but I don't know exactly which is the best for me. Here are the ones that come to mind:
Ruby (I/O is easy, formula loading via Webrat),
XML/XSLT (I don't think I need it; it seems like just overhead),
Perl (there are many libraries out there, but I'm not very familiar with it),
Bash (I worked with sed and was surprised how easy it was to work with regular expressions),
latex2html and similar converters (these won't work for me, and they don't give me any freedom in parsing).
Any suggestions, hints and comments are welcome.
Thanks for your time, folks.
Have a look at Pandoc. It can also be installed on Linux or OS X, though it won't handle your custom macros. The only thing I've seen that can do a decent job with custom macros is tex4ht, but to really work well it needs you to be producing .DVI files. If you have a ton of custom macros, writing your own converter is going to take a huge amount of time; even if you only have a few custom macros, it's still going to be a pain. Good luck!
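As a rough sketch of what a Pandoc run might look like (the file names are placeholders; this assumes a reasonably recent Pandoc installation):
pandoc -s --css=style.css --webtex chapter1.tex -o chapter1.html
The --webtex option renders the formulas as images via an external service, which loosely matches the formula-to-image requirement; --mathjax or --mathml are alternatives if you prefer to keep the math as markup.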
Six: TeX
Seven: Haskell
(I gave up trying to persuade SO to start numbering my list from 6).

How can I extract HTML content efficiently with Perl?

I am writing a crawler in Perl, which has to extract contents of web pages that reside on the same server. I am currently using the HTML::Extract module to do the job, but I found the module a bit slow, so I looked into its source code and found out it does not use any connection cache for LWP::UserAgent.
My last resort is to grab HTML::Extract's source code and modify it to use a cache, but I really want to avoid that if I can. Does anyone know any other module that can perform the same job better? I basically just need to grab all the text in the <body> element with the HTML tags removed.
I use pQuery for my web scraping. But I've also heard good things about Web::Scraper.
Both of these along with other modules have appeared in answers on SO for similar questions to yours:
how can i screen scrape with perl
how can i extract xml of a website and save in a file using perls lwp
how do i extract an html title with perl
can you provide an example of parsing html with your favorite parser
how do I extract content from html file using perl
HTML::Extract's features look very basic and uninteresting. If the modules that draegfun mentioned don't interest you, you could do everything that HTML::Extract does using LWP::UserAgent and HTML::TreeBuilder yourself, without requiring very much code at all, and then you would be free to work in caching on your own terms.
I've been using Web::Scraper for my scraping needs. It's very nice indeed for extracting data, and because you can call ->scrape($html, $originating_uri), it's very easy to cache the result you need as well.
Do you need to do this in real-time? How does the inefficiency affect you? Are you doing the task serially so that you have to extract one page before you move onto the next one? Why do you want to avoid a cache?
Can your crawler download the pages and pass them off to something else? Perhaps your crawler can even run in parallel, or in some distributed manner.