PERL/CGI- gets more than text from input textarea - html

I'm not at all familiar with Perl, but I have some understanding of HTML. I'm trying to adapt code from an online program that processes text entered by the user and outputs a few important numbers, so that I can do the same for a large number of text files in a local directory. The problem is that I don't understand how or why the site's code splits the input text on & and =, since the input text never contains these characters, and neither do my files. Here's some of the code from the online program:
if ($ENV{'REQUEST_METHOD'} ne "POST") {
&error('Error','Use Standard Input by METHOD=POST');
}
read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
if ($buffer eq '') {
&error('Error','Can not execute directly','Check the usage');
}
$ref = $ENV{'HTTP_REFERER'};
@pairs = split(/&/,$buffer);
foreach $pair (@pairs) {
($name,$value) = split(/=/,$pair);
if ($name eq "ATOMS") { $atoms = $value; }
It then uses these "pairs" to appropriately calculate the required numbers. The input from the user is simply a textarea named "ATOMS", and the form action is the cgi script:
<form method=POST action="/path/to/the/cgi/file.cgi">
<textarea name="ATOMS" rows=20 cols=80></textarea>
</form>
I've left out the less important details of both the html and perl codes. So far all I've been able to do is get all the content from all files in a given directory in a text format, but when I input this into the script that uses the text from textarea to calculate the values (in place of the variable $buffer), it doesn't work, which I suspect is due to the split codes, which cannot find the & and = symbols. How does the code get these symbols from the online script, and how can I implement that to use for my local files? Let me know if any additional information is needed, and thanks in advance!

The encoding scheme forms use (by default) to POST data over HTTP consists of key=value pairs (hence the =) which are separated by & characters.
The latter doesn't much matter for your form since it has only one control in it.
This is described pretty succinctly in the HTML 4 specification and in more detail in the HTML 5 specification.
If you aren't dealing with data from a form, you should remove all the form decoding code.
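As a rough illustration of that encoding (sketched in Python rather than Perl, purely for demonstration), the snippet below decodes a POST body the way the old script's split calls attempt to, and contrasts it with reading a local file, where no decoding is needed. The sample body and the helper names atoms_from_post_body and atoms_from_file are assumptions made for this example, not part of the original program.

```python
from urllib.parse import parse_qs


def atoms_from_post_body(body):
    """Decode an application/x-www-form-urlencoded body and return the ATOMS field."""
    # parse_qs splits on & and =, then undoes the + and %XX escaping.
    fields = parse_qs(body, keep_blank_values=True)
    return fields.get("ATOMS", [""])[0]


def atoms_from_file(path):
    """A local file is already plain text, so it is used as-is (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Hypothetical body, as a browser might send it for a two-line textarea.
body = "ATOMS=C+0.0+0.0+0.0%0D%0AH+1.0+0.0+0.0"
print(repr(atoms_from_post_body(body)))
```

The point of the contrast: the & and = come from the browser's encoding of the form, not from the text itself, so file contents can be assigned to the variable that holds the decoded value ($atoms in the script) rather than to $buffer.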

Not sure where you got that code from, but it's prehistoric (from the last millennium).
You should be able to replace it all with:
use CGI ':cgi';
my $atoms = param('ATOMS');

Related

Regex for different pair of html tags

I need regex matching every pair of <p>...<br> and <p CLASS='extmsg' >...<br> to distinguish parts of chat conversation, which I receive as string in following format:
<p CLASS='extmsg'>16:30:24 ~ customer@home.com: hello<br>
<p>16:30:14 ~ consultant@company.com: hello to you<br>
<p CLASS='extmsg'>16:30:03 ~ sam.i.am@greeneggs.ham: how are you<br>
<p>03/06/2018 16:29:55 ~ bok.kier@ccc.pl: im fine<br>
I need it for parsing method.
Don't parse HTML with regex, use a proper XML/HTML parser.
theory :
According to compiler theory, HTML can't be parsed with regular expressions, which are based on finite state machines. Because HTML is hierarchical, you need a pushdown automaton, for example an LALR grammar processed by a tool like YACC.
realLife©®™ everyday tool in a shell :
You can use one of the following :
xmllint
xmlstarlet
saxon-lint (my own project)
Check: Using regular expressions with HTML tags
Example :
xmllint --html --xpath '//p[@CLASS="extmsg"]/text()' file
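If a command-line tool is not an option, the same parser-based idea can be sketched in a few lines of Python with the standard library's html.parser. This is only an illustration under the assumption that every message is a p element terminated by br, as in the sample; the ChatParser class and parse_chat function are names invented for this example.

```python
from html.parser import HTMLParser


class ChatParser(HTMLParser):
    """Collects one message per <p>...<br> pair and notes whether it is an extmsg."""

    def __init__(self):
        super().__init__()
        self.messages = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so CLASS arrives as "class".
        if tag == "p":
            is_external = dict(attrs).get("class") == "extmsg"
            self._current = {"external": is_external, "text": ""}
        elif tag == "br" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.messages.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


def parse_chat(html_text):
    parser = ChatParser()
    parser.feed(html_text)
    parser.close()
    return parser.messages


sample = (
    "<p CLASS='extmsg'>16:30:24 ~ customer@home.com: hello<br>\n"
    "<p>16:30:14 ~ consultant@company.com: hello to you<br>\n"
)
for message in parse_chat(sample):
    print(message["external"], message["text"])
```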
Regexes are not suitable for this, as per Giles Quenot's answer. Using a proper parser is a much better way to do this. If you do receive messages in the format shown:
One message per line
Every message starts with "<p"
Every message ends with "<br>"
an easier idea might be string-matching the start of the line instead. I don't know what language you're using, but an example in JavaScript might be:
var inputString = "" // From wherever you get your data
var lines = inputString.split("\n")
for (var i = 0; i < lines.length; i++) {
var line = lines[i]
if (line.indexOf("<p CLASS='extmsg'>") == 0) {
console.log("Customer just said: " + line)
} else {
console.log("Representative just said: " + line)
}
}
You can trim the <p> and <br> tags out too, as you already know how long they are.
NOTE This will break if the format of the data changes (e.g. a designer gets into the CSS file and starts using BEM notation, changing extmsg to message--external, and adding message--internal to the rep's messages). As it would if you used a regex or a parser. The best way to deal with this would be to get whoever is supplying the data to make you a proper API for this info.

What the difference between 2 name of input?

What is the difference between these two input names?
Normally I use this format, name="xxxxxx":
<input type="text" name="xxxxxx"/>
But today I saw a name format that I do not understand, name="xxxxx[]":
<input type="text" name="xxxxxx[]"/>
What is the [] in name="xxxxx[]"?
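With the format xxxxx[], the variable $_POST['xxxxx'] is an array when the form is posted.
For example, it is possible to iterate over $_POST['xxxxx']:
<?php
// FILTER_REQUIRE_ARRAY is needed here; without it filter_input() returns false for array input
$data = filter_input(INPUT_POST, 'xxxxx', FILTER_DEFAULT, FILTER_REQUIRE_ARRAY);
if (is_array($data)) {
foreach ($data as $value) {
echo $value;
}
}
?>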
In HTML5, the name attribute is just a string (without any special syntax). The only thing that can have a special meaning are the _charset_ and isindex strings. Thus square brackets themselves are nothing special.
However, the authors of programming languages or libraries that interact with HTML forms sometimes decide to define special syntaxes. That's the case of the PHP server-side language, where paired brackets in form element names are used by the language to automatically define variables of array type. See "How do I create arrays in a HTML <form>?" for further details.
(It's possible that other langs make use of similar conventions but I don't really know.)
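To illustrate that the brackets carry no meaning at the HTTP level, here is a small sketch in Python, whose standard library applies no bracket convention: repeated names are simply collected into a list under the literal name, brackets included. The field name and values are made up for the example.

```python
from urllib.parse import parse_qs

# Body a browser could send for two inputs that are both named xxxxx[]
# (the brackets are percent-encoded as %5B and %5D on the wire).
body = "xxxxx%5B%5D=first&xxxxx%5B%5D=second"

fields = parse_qs(body)
print(fields)
```

The key stays "xxxxx[]"; turning it into an array called xxxxx is something PHP does on top of this.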
Nothing; it's just another name. It may be auto-generated, or someone may have chosen it with a particular purpose in mind.
This is mainly done because of server side frameworks.
With PHP, for instance, if you had
<input type="text" name="address[firstline]">
and
<input type="text" name="address[secondline]">
and submitted the form, in your PHP code on the server you'd retrieve a single address object from the request and it would have the keys firstline and secondline on it.
You can still query these elements using jQuery:
$('input[name=address\\[firstline\\]]')
The reason for needing two backslashes is because a single backslash is interpreted as a JavaScript string escape character, so you need two to specify a literal backslash, which provides the escape character to the selector...

$_GET textarea losing HTML characters

This is probably a really simple one but I can't find the answer anywhere!
I have a self submitting form with a textarea field like this
<textarea name="desc" wrap="1" cols="64" rows="5"></textarea>
When I type HTML characters in to the textarea field and hit the submit button, the HTML characters are being stripped and I can't see what is doing it!
Do $_GET variables have their HTML stripped automatically?
For example, If I type '[strong]Just[/strong] a test' in to the textarea, and echo the contents of 'desc' like this
echo(print_r($_GET));
I see $_GET['desc'] contains 'Just a test' rather than '[strong]Just[/strong] a test'.
Is this normal? If so, is there a way to keep the HTML so I can store it in a database?
I am using angle '<>' brackets rather than square '[]' in my code, but this forum converts them if I use them here!
Use CDATA
A CDATA section starts with "<![CDATA[" and ends with "]]>"
Source : http://www.w3schools.com/xml/xml_cdata.asp
Where are you printing the data to? The browser will parse the HTML, and if you're not looking at the page source you're only going to see the non-HTML parts.
However, you should be using print htmlentities($_GET['desc']) to print out the contents with the HTML properly encoded, so that it's printed instead of parsed.
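The same escape-on-output step, sketched in Python for illustration (html.escape is the standard-library counterpart; render_user_text is a name made up for this example):

```python
import html


def render_user_text(text):
    """Escape &, < and > (and quotes) so the browser displays the markup instead of parsing it."""
    return html.escape(text)


print(render_user_text("<strong>Just</strong> a test"))
# prints: &lt;strong&gt;Just&lt;/strong&gt; a test
```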

How to stop an html TEXTAREA from decoding html entities

I have a strange problem:
In the database, I have a literal ampersand lt semicolon:
&lt;div
Whenever it's printed into an HTML textarea tag, the source code of the page shows the &gt; as >.
How do I stop this decoding?
You can't stop entities being decoded in a textarea since the content of a textarea is not (unlike a script or style element) intrinsic CDATA, even though error recovery may sometimes give the impression that it is.
The definition of the textarea element is:
<!ELEMENT TEXTAREA - - (#PCDATA) -- multi-line text field -->
i.e. it contains PCDATA which is described as:
Document text (indicated by the SGML construct "#PCDATA"). Text may contain character references. Recall that these begin with & and end with a semicolon (e.g., Herg&eacute;'s adventures of Tintin contains the character entity reference for the e acute character).
This means that when you type (the invalid HTML of) "start of tag" (<) the browser corrects it to "less than sign" (&lt;), but when you type "start of entity" (&), which is allowed, no error correction takes place.
You need to write what you mean. If you want to include some HTML as data then you must convert any character with special meaning to its respective character reference.
If the data is:
&lt;div
Then the HTML must be:
<textarea>&amp;lt;div</textarea>
You can use the standard functions for converting this (e.g. PHP's htmlspecialchars or Perl's HTML::Entities module).
NB 1: If you were using XHTML[2] (and really using it, it doesn't count if you serve it as text/html) then you could use an explicit CDATA block:
<textarea><![CDATA[&lt;div]]></textarea>
NB 2: Or if browsers implemented HTML 4 correctly
OK, but the question is: why does it decode them anyway? Assuming I've added &lt; and saved the textarea, it will be saved as &lt; but displayed as <; saving it again will convert it to < (until then it remains &lt; in the database), so saving again stores it as < in the database. Why does the textarea decode it?
The server sends (to the browser) data encoded as HTML.
The browser sends (to the server) data encoded as application/x-www-form-urlencoded (or multipart/form-data).
Since the browser is not sending the data as HTML, the characters are not represented as HTML entities.
If you take the data received from the client and then put it into an HTML document, then you must encode it as HTML first.
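The round trip described here can be sketched in Python (for illustration only; render_textarea is a hypothetical helper, and html.unescape stands in for the decoding a browser performs when it parses the page):

```python
import html


def render_textarea(name, value):
    """Encode the stored value as HTML before placing it inside the textarea."""
    return '<textarea name="{}">{}</textarea>'.format(html.escape(name), html.escape(value))


stored = "&lt;div"                       # what the database holds
page = render_textarea("somename", stored)
print(page)                              # what the server sends

# The browser decodes one level of entities, so the user sees the stored text again.
shown = html.unescape("&amp;lt;div")
print(shown)
```

Because the form submission is not HTML-encoded, the server receives exactly what was shown, and the value survives any number of save cycles as long as it is escaped on every output.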
In PHP, this can be done using htmlentities(). Example below.
<?php
$content = "This string contains the TM symbol: &trade;";
print "<textarea>". htmlentities($content) ."</textarea>";
?>
Without htmlentities(), the textarea would interpret and display the TM symbol (™) instead of "&trade;".
http://php.net/manual/en/function.htmlentities.php
You have to be sure that this is rendered to the browser:
<textarea name="somename">&amp;lt;div</textarea>
Essentially, this means that the & in &lt; has to be HTML-encoded to &amp;. How to do it will depend on the technologies you're using.
UPDATE: Think about it like this. If you want to display <div> inside a textarea, you'll have to encode <> because otherwise, <div> would be a normal HTML element to the browser:
<textarea name="somename">&lt;div&gt;</textarea>
Having said this, if you want to display &lt;div&gt; inside a textarea, you'll have to encode & again, because the browser decodes HTML entities when rendering HTML. It has nothing to do with your database.
You can serve your DB-content from a separate page and then place it in the textarea using a Javascript (jQuery) Ajax-call:
request = $.ajax
({
type: "GET",
url: "url-with-the-troubled-content.php",
success: function(data)
{
document.getElementById('id-of-text-area').value = data;
}
});
Explained at
http://www.endtask.net/how-to-prevent-a-textarea-element-from-decoding-html-entities/
I had the same problem and I just made two replacements on the text to show from the database before letting it into the text area:
myString = Replace(myString, "&", "&amp;")
myString = Replace(myString, "<", "&lt;")
Replacement no. 1 tricks the textarea into showing the codes.
Replacement no. 2: without this replacement you cannot show the text "</textarea>" inside the textarea (it would end the textarea tag).
(Asp / vbscript code above, translate to a replace method of your language choice)
I found an alternative solution for reading and working with in-browser, simply read the element's text() using jQuery, it returns the characters as display characters and allows me to write from a textarea to a div's innerHTML using the property via html()...
With only JS and HTML...
...to answer the actual question, with a bare-minimal example:
<textarea id=myta></textarea>
<script id=mytext type=text/plain>
&trade;
</script>
<script> myta.value = mytext.innerText; </script>
Explanation:
Script tags do not render HTML or entities. By storing text in a script tag, it will remain unadulterated; the problem is that the browser will try to execute it as JavaScript. So we use an empty textarea and store the text in a script tag (here, the first one).
To prevent that, we change the MIME type to text/plain instead of its default, which is text/javascript. This will prevent it from running.
Then to populate the textarea, we copy the script tag's content to it (here done in the second script tag).
The only caveats I have found with this are you have to use JavaScript and you cannot include script tags directly in it.

Regex: Extracting readable (non-code) text and URLs from HTML documents

I am creating an application that will take a URL as input, retrieve the page's html content off the web and extract everything that isn't contained in a tag. In other words, the textual content of the page, as seen by the visitor to that page. That includes 'masking' out everything encapsuled in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not enveloped within a tag (but is best left alone).
I have constructed this regex:
(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)
It correctly selects all the content that i want to ignore, and only leaves the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?
So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
This is supposed to work generically, without knowing any specific tags in the html. It's supposed to extract all text. Additionally, I need to preserve the original html so the page retains all its links and scripts - i only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes or script variables etc (so I can't just do a "replace with nothing" on all the matches I get, because even though I am then left with what I need, it's a hassle to reinsert that back into the correct places of the fully functional document).
I want to know if this is at all possible using regex (and I know about HTML Agility Pack and XPath, but don't feel like).
Any suggestions?
Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).
what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.
That's what one would normally do. Or even simpler, replace every match of the markup pattern with an empty string and what you've got left is the stuff you're looking for.
It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.
Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
I would like to once again suggest the benefits of HTML Agility Pack.
ETA: since you seem to want it, here's some examples of markup that looks like it'll trip up your expression.
<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
- very common URL char missing in group
- more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: url('link');"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)
and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
Regex is not reliable for retrieving the textual contents of HTML documents. Regex cannot handle nested tags. Even supposing a document doesn't contain any nested tags, regex still requires that every tag is properly closed.
If you are using PHP, for simplicity, I strongly recommend you to use DOM (Document Object Model) to parse/extract HTML documents. DOM library usually exists in every programming language.
If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.
Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.
However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.
You cannot parse HTML with regular expressions.
Parsing HTML with regular expressions leads to sadness.
I know you're just doing it for fun, but there are so many packages out there that actually do the parsing the right way, AND do it reliably, AND have been tested.
Don't go reinventing the wheel, and doing it a way that is all but guaranteed to frustrate you down the road.
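For comparison with the regex approach, here is a minimal sketch of text extraction with an actual parser, written in Python with the standard library's html.parser (the question uses VB.Net, where HTML Agility Pack would play this role). The class name VisibleTextExtractor, the function visible_text and the sample markup are made up for this example, and it only collects text; it does not write replacements back into the document.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of script and style elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method; the parser routes them to handle_comment.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html_text):
    parser = VisibleTextExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.chunks


sample = (
    '<p title="a > b">Hello <b>world</b></p>'
    '<script>var x = "<p>no</p>";</script>'
    "<!-- hidden --><style>p { color: red; }</style>"
)
print(visible_text(sample))
```

Note the unescaped > inside the title attribute and the markup inside the script string: both are cases that the tag-matching alternative <[\s\S]*?> in the regex gets wrong, while the parser handles them without special treatment.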
OK, so here's how I'm doing it:
Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):
(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)
Then in VB.Net:
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)
The actual replacing of text happens here:
Private Function MatchEvalFunction(ByVal match As Match) As String
Dim plainText As String = match.Groups("text").Value
If plainText IsNot Nothing AndAlso plainText <> "" Then
MatchEvalFunction = match.Value.Replace(plainText, plainText.Replace("Original word", "Replacement word"))
Else
MatchEvalFunction = match.Value
End If
End Function
Voila. newHtml now contains an exact copy of the original, except every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all html and script code is preserved untouched. Of course, one could / would put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including function declaration and loading of html code etc. I'd be very interested in seeing a parallel solution, done in DOM etc for comparison (yes, I know this approach can be thrown off balance by certain occurrences of some nested tags quirks - in SCRIPT rewriting - but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well).
For your information:
Instead of a regex, jQuery can extract the text alone from HTML markup. For that you can use the following pattern:
$("<div/>").html($("#elementId").html()).text()
You can refer this JSFIDDLE