Trouble with printing <bbb>-constructs in perl

I have a small CGI Perl script that should print a plain-text file as HTML output. So far, so good; however, the file contains some text enclosed in < and >, e.g. <bbb>. The problem is that the <bbb> parts are removed. The same happens when I do a simple print statement like this:
print "aaa <bbb> ccc";
displays
aaa ccc
I have searched around, but not been able to find the solution. Amongst others, I have found this Printing string in Perl, but can't really see how the answer applies?

Are you displaying the output in a web browser? Then try
print "aaa &lt;bbb&gt; ccc";
and see the HTML::Entities module if you'll need to make conversions like this on other lines of text.
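
For comparison, here is a minimal sketch of the same escaping step in Python rather than Perl, using the standard library's html.escape. The helper name escape_for_html is made up for illustration; the point is only that < and > must become &lt; and &gt; before the browser sees them.

```python
import html


def escape_for_html(text: str) -> str:
    """Replace &, <, > and quotes with HTML entities so a browser shows them literally."""
    return html.escape(text)


line = "aaa <bbb> ccc"
escaped = escape_for_html(line)
print(escaped)  # prints: aaa &lt;bbb&gt; ccc
```

In a CGI script you would apply this to every line read from the text file before printing it.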

Related

Can Excel functions recognize bold text?

For convenience's sake in something work related, I need to convert text style into HTML format. If I have this sentence for example, "the sky is Blue", in an MS Word .doc document, I want to be able to copy it to Excel and have the bold portion written with HTML tags.
The question is, can Excel functions detect text styles? And if so, which function would be correct? I was thinking of SUBSTITUTE, but I'm not so sure anymore.
Any help would be appreciated!

I think this is better done in Word before you copy the text to Excel. I found this article about it (https://word.tips.net/T001904_Adding_Tags_to_Text.html). Basically, use Find and Replace, set the format you are looking for (such as italic) in the "Find what" field, and replace it with tags like this:
<i>^&</i>
The ^& part tells Word to include the string it found, so you do not lose the content; the tags are added before and after each string in the given format.

Linebreaks in middle of URL in HTML

I have this strange issue, where I get random linebreaks in my HTML when I copy & paste links from mails I get.
The problem is that linebreaks look exactly like any other whitespace, and on long lines I have trouble seeing whether there are any linebreaks.
Normally this wouldn't be a problem, but we are also using an emailing system that doesn't like linebreaks in the middle of an element.
Is there a way to see these without manually scanning all the lines, which is impossible due to the number of mails we are sending?
Regex maybe?
I'm using Notepad++ as an editor.

In Notepad++, you can use the "Extended" search mode in the Find dialog. Search for "\r\n" to find all Windows-style line breaks in the file, or "\r" to find all carriage returns.
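
If the number of mails makes manual searching impractical, the check could also be scripted. The following is a rough Python sketch, not something from the answer above: it assumes that "a linebreak in the middle of an element" means a \r or \n between a tag's < and >, and the names BROKEN_TAG and find_broken_tags are made up for illustration.

```python
import re

# Matches a tag that contains at least one line break between "<" and ">".
BROKEN_TAG = re.compile(r"<[^<>]*[\r\n][^<>]*>")


def find_broken_tags(html_text: str) -> list[tuple[int, str]]:
    """Return (line_number, tag_text) for each tag that is split across lines."""
    results = []
    for match in BROKEN_TAG.finditer(html_text):
        # Line number of the tag's opening "<", counted from 1.
        line_number = html_text.count("\n", 0, match.start()) + 1
        results.append((line_number, match.group(0)))
    return results


sample = (
    'ok <a href="http://example.com/a">x</a>\n'
    'bad <a href="http://example.com/\nb">y</a>\n'
)
broken = find_broken_tags(sample)
for line_number, tag in broken:
    print(line_number, repr(tag))
```

Run over each outgoing mail body, this reports only the tags that need fixing, so intact lines never have to be read by eye.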

convert pdf into small chunks of data(many chunks per page)?

I have a PDF file and I need to get small pieces of data from it.
It is structured like this :
Page1:
Question 1
......................................
......................................
Question 2
......................................
......................................
Page End
I want to get Question 1 and Question 2 as separate html files, which contain text and image.
I've tried
pdftohtml -c pdffile.pdf output.html
And I got files with PNG images, but how do I cut the image into smaller chunks to fit the size of each Question (I want to separate each question into individual files)?
P.S. I have a lot of PDF files, so a command-line tool would be nice.

I'll try to give you an approach on how I would go about it. You mention that every page in your PDF document might have multiple questions, and you basically want to have one HTML file for every question.
It's great if pdftohtml works for you, but I also found another decent command line utility that you might want to try out.
OK, so assuming you have an HTML file converted from the PDF you initially had, you might want to use csplit or awk to split your file into multiple files based on the delimiter 'Question' in your case. (Side note: csplit and awk are Linux/Unix utilities, but I'm sure there are alternatives if you are on Windows or a Mac. I haven't specifically tried the following code.)
From a relevant SO Post :
csplit input.txt '/^Question$/' '{*}'
awk '/Question/{filename=NR".txt"}; {print >filename}' input.txt
So, assuming this works, you will have a couple of broken html files. Broken because they'll be unsanitized due to dangling < or > or some other stray HTML elements after the splitting.
So you could start by saving the initial .html as .txt, removing the html, head and body elements specifically, and going through the general structure of how the program converts the PDF into HTML. I'm sure you'll see a pattern in how the string 'Question' is wrapped in an element, and that is something you can take care of. That is why I mention .txt files in the code snippets.
You will basically have a bunch of text files with just the content html and not the usual starting tags for an html file because we removed that initially. Then it's only a matter of reading each file, just taking care of the element that surrounds the string 'Question' and adding the html, head and body elements around the content and saving them as .html files. You could do this in any programming language of your choice that supports file reading and writing (would be a fun exercise)
I hope this gets you started in the right direction.
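
As a rough illustration of the splitting step, here is a Python sketch that does what the csplit/awk one-liners aim for. It assumes the content has already been reduced to text in which every question starts on a line beginning with the word "Question"; the function name split_questions and the sample text are made up for illustration.

```python
import re


def split_questions(text: str) -> list[str]:
    """Split text into chunks, each starting at a line that begins with 'Question'."""
    # The zero-width lookahead splits *before* each 'Question' line,
    # so the delimiter stays at the start of its own chunk.
    chunks = re.split(r"(?m)^(?=Question\b)", text)
    # Drop any preamble (e.g. a page header) that precedes the first question.
    return [chunk for chunk in chunks if chunk.startswith("Question")]


sample = "Page1:\nQuestion 1\nfirst body\nQuestion 2\nsecond body\n"
questions = split_questions(sample)
for number, chunk in enumerate(questions, start=1):
    print(f"--- chunk {number} ---")
    print(chunk, end="")
```

Each chunk could then be wrapped in html, head and body elements and written to its own .html file, as described above.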

How to display plain text in webpage?

I have inserted my code into a MySQL database using a text area.
What I saved appears like this in MySQL:
This is Line 1
test
This lis Line 3
Now, my problem is displaying the saved "file" in my browser, where I expect it to appear like this:
This is Line 1
test
This lis Line 3
Has anyone run into a situation like this?

Use htmlentities when you output the text for display. You can save HTML or any other code as-is in MySQL with no special attention. You will need to escape it on output, though, so that user-supplied input can't be malicious.
http://www.php.net/manual/en/function.htmlentities.php
htmlentities("URL", ENT_QUOTES, 'UTF-8');
If you run this in php you will display the whole html tag. Likewise, you can spew out results from a mysql query, wrapping the relevant content in htmlentities to achieve what you're looking to do.
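
The same idea, sketched in Python instead of PHP for illustration: escape the stored text, then turn newlines into <br> so the line structure survives in the browser. The <br> step is an assumption based on the three-line sample in the question, not part of the answer; text_to_html and the stored sample string are hypothetical.

```python
import html


def text_to_html(text: str) -> str:
    """Escape stored text for HTML and keep its line breaks visible."""
    escaped = html.escape(text, quote=True)
    # Normalise Windows line endings, then mark each line break with <br>.
    return escaped.replace("\r\n", "\n").replace("\n", "<br>\n")


# Hypothetical stored value containing markup that should be shown literally.
stored = "This is Line 1\n<b>test</b>\nThis is Line 3"
rendered = text_to_html(stored)
print(rendered)
```

Wrapping the escaped text in a <pre> element instead of inserting <br> is another option when the exact whitespace matters.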

How do I match text in HTML that's not inside tags?

Given a string like this:
This is the foo link
... and a search string like "foo", I would like to highlight all occurrences of "foo" in the text of the HTML -- but not inside a tag. In other words, I want to get this:
This is the <b>foo</b> link
However, a simple search-and-replace won't work, because it will match part of the URL in the <a> tag's href.
So, to express the above in the form of a question: How do I restrict a regex so that it only matches text outside of HTML tags?
Note: I promise that the HTML in question will never be anything pathological like:
<img title="Haha! Here are some angle brackets to screw you up: ><" />
Edit: Yes, of course I'm aware that there are complex libraries in CPAN that can parse even the most heinous HTML, and thus alleviate the need for such a regex. On many occasions, that's what I would use. However, this is not one of those occasions, since keeping this script short and simple, without external dependencies, is important. I just want a one-line regex.
Edit 2: Again, I know that Template::Refine::Fragment can parse all my HTML for me. If I were writing an application I would certainly use a solution like that. But this isn't an application. It's barely more than a shell script. It's a piece of disposable code. Being a single, self-contained file that can be passed around is of great value in this case. "Hey, run this program" is a much simpler instruction than, "Hey, install a Perl module and then run this-- wait, what, you've never used CPAN before? Okay, run perl -MCPAN -e shell (preferably as root) and then it's going to ask you a bunch of questions, but you don't really need to answer them. No, don't be afraid, this isn't going to break anything. Look, you don't need to answer every question carefully -- just hit enter over and over. No, I promise, it's not going to break anything."
Now multiply the above across a great deal of users who are wondering why the simple script they've been using isn't so simple anymore, when all that's changed is to make the search term boldface.
So while Template::Refine::Fragment may be the answer to someone else's HTML parsing question, it's not the answer to this question. I just want a regular expression that works on the very limited subset of HTML that the script will actually be asked to parse.

If you can absolutely guarantee that there are no angle brackets in the HTML other than those used to open and close tags, this should work:
s%(>|\G)([^<]*?)($key)%$1$2<b>$3</b>%g
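
For readers who want to try the idea outside Perl, here is a Python sketch under the same "no stray angle brackets" assumption. Python's re module has no \G anchor, so this swaps in a different technique: one pattern matches either a whole tag or the search key, and a callback leaves tags untouched. The function name highlight and the sample href are made up for illustration.

```python
import re


def highlight(html_text: str, key: str) -> str:
    """Wrap occurrences of key in <b>...</b>, skipping anything inside <...>."""
    # Alternation: either a whole tag (group 1) or the literal search key (group 2).
    pattern = re.compile(r"(<[^>]*>)|(" + re.escape(key) + r")")

    def replace(match: re.Match) -> str:
        if match.group(1) is not None:
            return match.group(1)  # a tag: emit it unchanged
        return "<b>" + match.group(2) + "</b>"

    return pattern.sub(replace, html_text)


# Hypothetical input: the key appears both in the href and in the link text.
sample = 'This <a href="foo.html">is the foo link</a>'
result = highlight(sample, "foo")
print(result)
```

Because the tag alternative is tried first at every position, text inside a tag is consumed as part of the tag and never reaches the key alternative.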

In general, you want to parse the HTML into a DOM, and then traverse the text nodes. I would use Template::Refine for this:
#!/usr/bin/env perl
use strict;
use warnings;
use feature ':5.10';
use Template::Refine::Fragment;

my $frag = Template::Refine::Fragment->new_from_string('<p>Hello, world. This is a test of foo finding. Here is another foo.');
say $frag->process(
    simple_replace {
        my $n = shift;
        my $text = $n->textContent;
        $text =~ s/foo/<foo>/g;
        return XML::LibXML::Text->new($text);
    } '//text()',
)->render;
This outputs:
<p>Hello, world. This is a test of &lt;foo&gt; finding. Here is another &lt;foo&gt;.</p>
Anyway, don't parse structured data with regular expressions. HTML is not "regular", it's "context-free".
Edit: finally, if you are generating the HTML inside your program, and you have to do transformations like this on strings, "UR DOIN IT WRONG". You should build a DOM, and only serialize it when everything has been transformed. (You can still use TR, however, via the new_from_dom constructor.)
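
To illustrate the "only touch text nodes" idea without any external dependency, here is a Python sketch built on the standard library's html.parser. It is a streaming tokenizer rather than a DOM, so it is a swapped-in technique and not a port of the Template::Refine code above; the class name TextNodeHighlighter is made up, and comments and doctype declarations are simply dropped in this sketch.

```python
from html.parser import HTMLParser


class TextNodeHighlighter(HTMLParser):
    """Re-emit the input, wrapping key in <b>...</b> only inside text nodes."""

    def __init__(self, key: str) -> None:
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.key = key
        self.out: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag text, untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data.replace(self.key, f"<b>{self.key}</b>"))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


parser = TextNodeHighlighter("foo")
parser.feed('<p class="foo">Hello foo</p>')
parser.close()
highlighted = "".join(parser.out)
print(highlighted)
```

The attribute value is left alone because the parser hands it over as part of the start tag, never as text.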

The following regex will match all text between tags or outside of tags:
<.*?>(.*?)<.*?>|>(.*?)<
Then you can operate on that as desired.

Try this one
(?=>)?(\w[^>]+?)(?=<)
it matches all words between tags

To strip off the variable size contents from even nested tags you can use this regex that is in fact a mini-regular grammar for that. (note: PCRE machine)
(?<=>)((?:\w+)(?:\s*))(?1)*
