I am posting HTML data from an input text field called "textbox" to a backend application. The backend application (a Django view) receives the data, but it is garbled with random "=" characters in between, even though the HTML content in "textbox" was perfectly fine before posting.
I suspect this is a problem with the encoding of POST data, but I am not able to figure out a solution to avoid this.
The textbox data can have any html data, and special characters like <, >, {, }.
To summarize the problem:
The text data like:
<p>This is a <b>sample</b> text<p>
<p> This is the second line </p>
becomes something like the following when I check the request.POST["textbox"] value in the Django view:
<p>Th=is is a <b>sample</b> tex=t<p>
<p> This i=s the se=cond line </p>
Is anyone facing a similar problem? I did not find any related questions on Stack Overflow. AFAIK this problem might not be specific to Django, but I am adding that information in case it's useful.
I have a bunch of records in a QuickBase table that contain a rich text field. In other words, they each contain some paragraphs of text intermingled with HTML tags like <p>, <strong>, etc.
I need to migrate the records to a new table where the corresponding field is a plain text field. For this, I would like to strip out all HTML tags and leave only the text in the field values.
For example, from the below input, I would expect to extract just a small example link to a webpage:
<p>just a small <a href="#">
example</a> link</p><p>to a webpage</p>
As I am trying to get this done quickly and without coding or using an external tool, I am constrained to using Quickbase Pipelines' Text channel tool. The way it works is that I define a regex pattern and it outputs only the bits that match the pattern.
So far I've been able to come up with this regular expression (Python-flavored as QB's backend is written in Python) that correctly does the exact opposite of what I need. I.e. it matches only the HTML tags:
/(<[^>]*>)/
In a sense, I need the negative image of this expression but have not been able to build it myself.
Your help in "negating" the above expression is most appreciated.
Assuming there are no < or > characters elsewhere (or that they are entity-encoded), here is an idea using a lookbehind:
(?:(?<=>)|^)[^<]+
See this demo at regex101
(?:(?<=>)|^) is an alternation between either ^ start of the string or looking behind for any >. From there [^<]+ matches one or more characters that are not < (negated character class).
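A quick way to check the pattern outside of Pipelines is a short Python sketch with the standard re module. The sample string is the one from the question; how Pipelines' Text channel joins the individual matches is not something this sketch can confirm.

```python
import re

# Pattern from the answer: a run of non-'<' characters that begins either
# at the start of the string or immediately after a '>'.
TEXT_RUNS = re.compile(r"(?:(?<=>)|^)[^<]+")

sample = '<p>just a small <a href="#">\nexample</a> link</p><p>to a webpage</p>'

# Each match is one run of text found between two tags.
runs = TEXT_RUNS.findall(sample)
print(runs)
```

Note that [^<] also matches line breaks, so the newline in front of "example" is kept as part of that match.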
My problem occurs when I try to use some data/strings in a p-element.
I start off with data like this:
data: function() {
return {
reportText: {
text1: "This is some subject text",
text2: "This is the conclusion",
}
}
}
I use this data as follows in my (vue-)html:
<p> {{ reportText.text1 }} </p>
<p> {{ reportText.text2 }} </p>
In my browser, when I inspect my elements I get to see the following results:
<p>This is some subject text</p>
<p>This is the conclusion</p>
In the inspector, there is suddenly a difference: one p element uses &nbsp; and the other uses ordinary spaces, even though I started off with both strings using only ordinary spaces. I know &nbsp; and an ordinary space technically represent the same thing, but the problem with the &nbsp; string is that it gets treated as one large word instead of multiple separate words. This breaks my layout, and I can't solve it with CSS properties (word-wrap etc.).
Other things I have tried:
Tried sanitizing the strings by using .replace('&nbsp;', ' '), but that doesn't do anything. I assume this is because the two are basically the same, so there is nothing to really replace. It is the same reason why I have to use a code block on Stack Overflow to make the distinction between &nbsp; and an ordinary space.
Logged the data from Vue to see if there is any noticeable difference, but I can't see any. If I log the data/reportText I again only see strings with ordinary spaces.
So I have the following questions:
Why does this happen? I can't find any logical explanation for why it sometimes uses ordinary spaces and sometimes uses &nbsp;; it seems random, but I am sure I am missing something.
Are there any other things I could try to follow the path my string takes, so I can see where the transformation from an ordinary space to &nbsp; happens?
Per the comments, the solution devised ended up being a simple unicode character replacement targeting the \u00A0 unicode code point (i.e. replacing unicode non-breaking spaces with ordinary spaces):
str.replace(/\u00A0/g, ' ')
Explanation:
JavaScript typically allows the use of unicode characters in two ways: you can input the rendered character directly, or you can use a unicode code point (i.e. in the case of JavaScript, a hexadecimal code prefixed with \u like \u00A0). It has no concept of an HTML entity (i.e. a character sequence between a & and ; like &nbsp;).
The inspector tool for some browsers, however, utilizes the HTML concept of the HTML entity and will often display unicode characters using their corresponding HTML entities where applicable. If you check the same source code in Chrome's inspector vs. Firefox's inspector (as of writing this answer, anyway), you will see that Chrome uses HTML entities while Firefox uses the rendered character result. While it's a handy feature to be able to see non-printable unicode characters in the inspector, Chrome's use of HTML entities is only a convenience feature, not a reflection of the actual contents of your source code.
With that in mind, we can infer that your source code contains unicode characters in their fully rendered form. Regardless of the form of your unicode character, the fix is identical: you need to target these unicode space characters explicitly and replace them with ordinary spaces.
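The difference between the entity and the code point can also be illustrated outside the browser. The following is a minimal Python sketch, not the JavaScript fix itself; normalize_spaces is a hypothetical helper name, and html.unescape is only used here to produce a string that really contains U+00A0.

```python
import html

def normalize_spaces(text: str) -> str:
    """Replace non-breaking spaces (U+00A0) with ordinary spaces."""
    return text.replace("\u00a0", " ")

# Decoding the entity yields the actual character, not the text "&nbsp;".
raw = html.unescape("This&nbsp;is&nbsp;some subject text")

print("&nbsp;" in raw)    # the entity text is gone after decoding
print("\u00a0" in raw)    # the non-breaking space character is present
print(normalize_spaces(raw))
```

Searching the decoded string for the literal text "&nbsp;" finds nothing, which is the same reason the .replace attempt in the question had no effect.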
Consider the following setup of HTML Purifier:
require_once 'library/HTMLPurifier.auto.php';
$config = HTMLPurifier_Config::createDefault();
$config->set('Core.EscapeInvalidTags', true);
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);
If you run the following case:
$dirty_html = "<p>lorem <script>ipsum</script></p>";
//output
<p>lorem &lt;script&gt;ipsum&lt;/script&gt;</p>
As expected, instead of removing the invalid tags, it just escaped them all.
However, consider these other test cases:
case 1
$dirty_html = "<p>lorem <b>ipsum</p>";
//output
<p>lorem <b>ipsum</b></p>
//desired output
<p>lorem &lt;b&gt;ipsum</p>
case 2
$dirty_html = "<p>lorem ipsum</b></p>";
//output
<p>lorem ipsum</p>
//desired output
<p>lorem ipsum&lt;/b&gt;</p>
case 3
$dirty_html = "<p>lorem ipsum<script></script></p>";
//output
<p>lorem ipsum&lt;script /&gt;</p>
//desired output
<p>lorem ipsum&lt;script&gt;&lt;/script&gt;</p>
Instead of just escaping the invalid tags, first it repairs them and then escapes them. This way things can get very strange, for example:
case 4
$dirty_html = "<p><a href='...'><div>Text</div></a></p>";
//output
<p></p><div>Text</div></p>
Question
Therefore, is it possible to disable the syntax repair and just escape the invalid tags?
The reason you're seeing a syntax repair is because of the fundamental way that HTML Purifier approaches the topic of HTML sanitation: It first parses the HTML to understand it, then decides which of the elements to keep in the parsed representation, then renders the HTML.
You might be familiar with one of stackoverflow's most famous answers, which is an amused and exasperated observation that true regular expressions can't parse HTML - you need additional logic, since HTML is a context-free language, not a regular language. (Modern 'regular' expressions are not formal regular expressions, but that's another matter.) In other words, if you actually want to know what's going on in your HTML - so that you correctly apply your white- or blacklisting - you need to parse it, which means the text ends up in a totally different representation.
An example of how parsing causes changes between input and output is that HTML Purifier strips extraneous whitespace from between attributes. That may not bother you in your case, but it still stems from the fact that the parsed representation of HTML is quite different from the text representation. It's not trying to preserve the form of your input - it's trying to preserve the function.
This gets tricky when there is no clear function and it has to start guessing. To pick an example, imagine while going through the HTML input, you come across what looks like an opening <td> tag in the middle of nowhere - you can consider it valid if there was an unclosed <td> tag a while back as long as you add a closing tag, but if you had escaped the first tag as &lt;td&gt;, you would need to discard the text data that would have been in the <td> since - depending on browser rendering - it may put data into parts of the page visually outside the fragment, i.e. places that are not clearly user-submitted.
In brief: You can't easily disable all syntax repair and/or tidying without having to rummage through the parsing guts of HTML Purifier and ensuring no information you find valuable is lost.
That said, you can try switching the underlying parsing engine with Core.LexerImpl and see if it gets you better results! :) DOMLex definitely adds missing ending nodes right from the get-go, but from a cursory glance, DirectLex may not. There is a large chunk of autoclosing logic in HTMLPurifier's MakeWellFormed strategy class which might also pose a problem for you.
Depending on why you want to preserve this data, though (to allow analysis?), saving the original input separately (while leaving HTML Purifier itself be) may provide you with a better solution.
I'm working on a small app that requires me to parse an html site on the web.
My problem is as follows :
The parsing routine is working fine for some infos BUT I'm searching for hours for a way to get some infos that refuse to appear.
Here is the partial code structure I'm willing to parse :
<body>
<header>
<nav>
<div.....>
<aside......>
<main>
<div .....>
<a ......>
<a ......>
</div>
.
.
.
<div id="general">
<h2> ........</h2>
<p>
<span class="label">text</span>
"text 2 to be parsed"
<br>
<span class="label">other text</span>
"text 3 to be parsed"
<br>
This is just an example of the structure; to be precise, the URL is http://www.ourairports.com/airports/EBBR/pilot-info.html
OK it seems that the html code is not appearing on the preview so in the source code of the page above, when you see [div id="general"], below you have a [p] followed by [span class="label"]some text[/span] and just below that you have text between brackets. This happens on several lines and I need to catch those infos .
I've tried //body/div/main/div[@id='general']/p as the XPath query string, but the result is 1 node and it is empty;
also div[@id='general'], but the result is no node found;
div[@id='general']/p/span also gives no node found;
with //div/p/span[@class='label'] the results are the titles between the <span class="label"> and </span> tags, but I'm looking to retrieve the text between quotes just after them, and I cannot figure out how to succeed. I think I've tried all combinations (many more than those listed above) but with no luck. Is there a special path to get to this text?
Thanks for your advice.
By the way, this is my very first post on stackoverflow.com and My first language is french, so I do apologize in advance for any rule not followed or my bad english.
Enjoy your day, evening, ... night on the keyboard.
Alain
Your first expression, //body/div/main/div[@id='general']/p, is expected to return a single node, the <p>. And it works exactly that way on the referenced website, as you observed. The expression reaches down to that node but not deeper, where the text is nested. However, you should get the text too, just encapsulated in HTML, with the tags around it. A good XPath selector API used properly should return the HTML node that was matched, including the <p> tag itself.
If all you see in the end is just the text nodes try the following:
Think of the text among the <span>s as html nodes, text() nodes.
//div[@id='general']/p/text()
This will match the "text to be parsed".
A node() will match any html node (even text among tags) and a * any non-text() node.
For any number of steps, use the double slash:
//div[@id='general']/p//text()
Now you match every text node under the <p> tag, regardless of the nesting level. And since text nodes are by definition leaf nodes (cannot contain other nodes), this guarantees that you will not match members of the same path down the tree more than once.
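To see why the text sits beside the <span> elements rather than inside them, here is a minimal Python sketch. It is an illustration only: the standard xml.etree.ElementTree module supports just a subset of XPath and has no text() step, so the sketch reads each span's .tail (the text node that follows the element) instead. The markup is a simplified, well-formed stand-in for the page, not the real document.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the page structure (assumed, made well-formed).
doc = ET.fromstring(
    '<main><div id="general"><h2>General</h2><p>'
    '<span class="label">text</span> text 2 to be parsed<br/>'
    '<span class="label">other text</span> text 3 to be parsed<br/>'
    '</p></div></main>'
)

# The @id predicate selects the div; the <p> is its direct child.
p = doc.find(".//div[@id='general']/p")
spans = p.findall("span[@class='label']")

# .text is the label inside the span, .tail is the text node right after it.
labels = [span.text for span in spans]
values = [span.tail.strip() for span in spans]
print(labels)
print(values)
```

With an API that implements full XPath, the same values would be selected directly by the text() expressions above.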
Some comments on your expressions:
//body is superfluous; there is only one body and HTML defines exactly where it is.
Nodes identified by @id should not need to be preceded by selectors for their parents; start with //div[@id='something unique'].
Learn more about XPath. An API that properly returns selected "nodes" and not just concatenated text can play an important role in the understanding of how the expressions work in practice.
I am creating an application that will take a URL as input, retrieve the page's HTML content off the web and extract everything that isn't contained in a tag. In other words, the textual content of the page, as seen by the visitor to that page. That includes 'masking' out everything encapsulated in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not enveloped within a tag (but is best left alone).
I have constructed this regex:
(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)
It correctly selects all the content that I want to ignore, and only leaves the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?
So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
This is supposed to work generically, without knowing any specific tags in the HTML. It's supposed to extract all text. Additionally, I need to preserve the original HTML so the page retains all its links and scripts. I only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes or script variables etc. (so I can't just do a "replace with nothing" on all the matches I get, because even though I am then left with what I need, it's a hassle to reinsert that back into the correct places of the fully functional document).
I want to know if this is at all possible using regex (and I know about HTML Agility Pack and XPath, but don't feel like using them).
Any suggestions?
Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).
what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.
That's what one would normally do. Or even simpler, replace every match of the markup pattern with an empty string and what you've got left is the stuff you're looking for.
It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.
Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
I would like to once again suggest the benefits of HTML Agility Pack.
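Both points can be tried quickly with the pattern from the question. The sketch below uses Python's re module rather than VB.Net (Python accepts the same (?P<tag>...) syntax); strip_markup is a hypothetical helper name and the sample strings are made up for illustration.

```python
import re

# The markup pattern from the question, unchanged.
MARKUP = re.compile(
    r"(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"
    r"|(?:<!--[\s\S]*?-->)"
    r"|(?:<[\s\S]*?>)",
    re.IGNORECASE,
)

def strip_markup(markup: str) -> str:
    """Replace every match of the markup pattern with an empty string."""
    return MARKUP.sub("", markup)

# On simple markup, only the visible text is left.
simple = ('<html><head><style>p{color:red}</style>'
          '<script>var a = "<b>";</script></head>'
          '<body><!-- note --><p>Hello <b>world</b></p></body></html>')
print(strip_markup(simple))

# An unescaped '>' inside an attribute value ends the tag match too early,
# so part of the attribute leaks into the "text".
tricky = '<a title="a > b">x</a>'
print(repr(strip_markup(tricky)))
```

The second call leaves part of the title attribute in the output, which is the kind of stray string described in the quoted comment.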
ETA: since you seem to want it, here's some examples of markup that looks like it'll trip up your expression.
<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
- very common URL char missing in group
- more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: url('link');"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)
and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
Regex is not reliable for retrieving the textual contents of HTML documents. Regex cannot handle nested tags. Even supposing a document doesn't contain any nested tags, regex still requires that every tag is properly closed.
If you are using PHP, for simplicity, I strongly recommend that you use DOM (Document Object Model) to parse/extract HTML documents. A DOM library usually exists for every programming language.
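As a rough illustration of the parser-based route, here is a sketch in Python rather than PHP. It uses the standard html.parser module, which is an event-based tokenizer and not a full DOM, and the names TextExtractor and visible_text are hypothetical. It is a sketch under those assumptions, not a complete text extractor.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments arrive via handle_comment, so they never reach this method.
        if not self._skip_depth:
            self.parts.append(data)

def visible_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)

sample = ('<p title="a > b">Hello <b>world</b></p>'
          '<script>var s = "<p>no</p>";</script><!-- c -->')
print(visible_text(sample))
```

Because the tokenizer understands quoted attribute values, the > inside the title attribute does not end the tag early.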
If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.
Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.
However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.
You cannot parse HTML with regular expressions.
Parsing HTML with regular expressions leads to sadness.
I know you're just doing it for fun, but there are so many packages out there that actually do the parsing the right way, AND do it reliably, AND have been tested.
Don't go reinventing the wheel, and doing it a way that is all but guaranteed to frustrate you down the road.
OK, so here's how I'm doing it:
Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):
(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)
Then in VB.Net:
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)
The actual replacing of text happens here:
Private Function MatchEvalFunction(ByVal match As Match) As String
Dim plainText As String = match.Groups("text").Value
If plainText IsNot Nothing AndAlso plainText <> "" Then
MatchEvalFunction = match.Value.Replace(plainText, plainText.Replace("Original word", "Replacement word"))
Else
MatchEvalFunction = match.Value
End If
End Function
Voila. newHtml now contains an exact copy of the original, except every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all html and script code is preserved untouched. Of course, one could / would put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including function declaration and loading of html code etc. I'd be very interested in seeing a parallel solution, done in DOM etc for comparison (yes, I know this approach can be thrown off balance by certain occurrences of some nested tags quirks - in SCRIPT rewriting - but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well).
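For readers who want to try the same principle without VB.Net, here is a rough Python equivalent of the match-evaluator approach above. It uses the same pattern with Python's named-group syntax; replace_in_text is a hypothetical name, and the sketch inherits every limitation of the regex that the other answers point out.

```python
import re

# Same pattern as above: markup alternatives first, then the "text" group.
PATTERN = re.compile(
    r"(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"
    r"|(?:<!--[\s\S]*?-->)"
    r"|(?:<[\s\S]*?>))"
    r"|(?P<text>[^<>]*)",
    re.IGNORECASE,
)

def replace_in_text(markup: str, old: str, new: str) -> str:
    """Replace `old` with `new` in text runs only, leaving markup untouched."""
    def evaluator(match):
        text = match.group("text")
        if text:
            return text.replace(old, new)
        # Tags, comments and script/style blocks are returned unchanged.
        return match.group(0)
    return PATTERN.sub(evaluator, markup)

source = ('<p class="Original word">Original word</p>'
          '<script>var x = "Original word";</script>')
print(replace_in_text(source, "Original word", "Replacement word"))
```

Only the paragraph text is changed; the class attribute and the script body keep the original wording.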
For Your Information,
Instead of Regex, With JQuery , Its possible to extract text alone from a html markup. For that you can use the following pattern.
$("<div/>").html("#elementId").text()
You can refer this JSFIDDLE