Yahoo Pipes and Regex with an HTML formatting issue

I am struggling to see how to use the Regex module to insert a non-printable carriage return character into an HTML string.
It's a WordPress thing: to auto-embed a video, I need to put the URL on its own line in the HTML.
First I use a regex:
In item.vid_src replace ($) with \\r$1
s is checked.
After that I use a loop with a string builder in it to prefix vid_src to the start of description, thus:
item.vid_src
<br><br>
item.description
assign results to item.description
Before I include the Regex module in the pipe I get this:
http://www.youtube.com/watch?v=THA_5cqAfCQ<br><br><p><h1 class="MsoNormal">Cheetahs on
the edge</h1>
But I need this:
http://www.youtube.com/watch?v=THA_5cqAfCQ
<br><br><p><h1 class="MsoNormal">Cheetahs on the edge</h1>
Adding the regex module I get this:
http://www.youtube.com/watch?v=THA_5cqAfCQ\r<br><br><p><h1 class="MsoNormal">
Cheetahs on the edge</h1>
Clearly it's inserting exactly what I asked for, but it is not what I was expecting: I need the HTML formatted with a real newline. Does anybody have any insight into how to tackle the problem?

Related

How to use regex (regular expressions) in Notepad++ to remove all HTML and JSON code that does not contain a specific string?

Using regular expressions (in Notepad++), I want to find all JSON sections that contain the string foo. Note that the JSON just happens to be embedded within a limited set of HTML source code which is loaded into Notepad++.
I've written the following regex to accomplish this task:
({[^}]*foo[^}]*})
This works as expected on all possible input.
I want to improve my workflow, so instead of just finding all such JSON sections, I want to write a regex to remove all the HTML & JSON that does not match this expression. The result will be only JSON sections that contain foo.
I tried using the Notepad++ regex Replace functionality with this find expression:
(?:({[^}]*?foo[^}]*?})|.)+
and this replace expression:
$1\n\n$2\n\n$3\n\n$4\n\n$5\n\n$6\n\n$7\n\n$8\n\n$9\n\n
This successfully works for the last occurrence of foo within the JSON, but does not find the rest of the occurrences.
How can I improve my code to find all the occurrences?
Here is a simplified minimal example of input and desired output. I hope I haven't simplified it too much for it to be useful:
Simplified input:
<!DOCTYPE html>
<html>
<div dat="{example foo1}"> </div>
<div dat="{example bar}"> </div>
<div dat="{example foo2}"> </div>
</html>
Desired output:
{example foo1}
{example foo2}
You can use
{[^}]*foo[^}]*}|((?s:.))
Replace with (?1:$0\n). Details:
{[^}]*foo[^}]*} - {, zero or more chars other than }, foo, zero or more chars other than } and then a }
| - or
((?s:.)) - Capturing group 1: any one char ((?s:...) is an inline modifier group where . matches all chars including line break chars, same as if you enabled . matches newline option).
The (?1:$0\n) replacement pattern replaces with an empty string if Group 1 was matched, else the replacement is the match text + a newline.
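For readers working outside Notepad++, the same idea can be sketched in Python. Python's re module has no (?1:...) conditional replacement syntax, so this sketch uses a replacement function instead. The name keep_foo_sections is hypothetical, and the sample text is the simplified input from the question.

```python
import re

# Either a {...} section containing "foo", or (group 1) any single character.
# re.DOTALL makes "." match line breaks too, like (?s:.) in the answer.
PATTERN = re.compile(r"\{[^}]*foo[^}]*\}|(.)", re.DOTALL)


def keep_foo_sections(text):
    """Drop everything except {...} sections containing foo (hypothetical helper)."""
    def replace(match):
        if match.group(1) is not None:
            # Group 1 matched: an unwanted single character, so remove it.
            return ""
        # Otherwise the first alternative matched: keep it plus a newline.
        return match.group(0) + "\n"

    return PATTERN.sub(replace, text)


sample = """<!DOCTYPE html>
<html>
<div dat="{example foo1}"> </div>
<div dat="{example bar}"> </div>
<div dat="{example foo2}"> </div>
</html>
"""

print(keep_foo_sections(sample), end="")
```

If only the matching sections are needed, re.findall with just the first alternative would probably be simpler; the version above is meant to mirror the replace-based approach.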
Updates
The comment section was full, so I am suggesting the code here instead.
Let me know if this is close to your intended result:
Find: ({.+?[\n]*foo[ \d]*})|.*?
Replace all: $1
Also added Toto's example

regex interprete markdown but ignore HTML

In a string like
Hallo, this is <code>`code`</code> and this `is code again`.
Can I parse it with a regex to analyse it?
In this example the user has just typed the rightmost ` as the very last character. The first "code" has obviously already been surrounded by HTML.
I need a regex to get the next part that is marked as code.
There will always be one series that is valid markdown AND not already surrounded by the corresponding HTML tags.
How to get this specific series (regardless if it's *, **, ___, ` or whatever)?
So what you want is a regex that only matches the markdown that isn't surrounded by HTML tags, right?
You can use something like this:
/(?:[^<>]|^)(`[^<>].*?`)/
This will only match text inside backticks that is not placed directly next to a < or > character. This way, no matter what the HTML tag inside the <...> is, the `code` won't match.
See this Regex101.com
If you want to match every emphasized string that is not tagged with "code" you can use
(?<!<code>)`[\w ]+`
You can test it on regex101.com
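As a quick check of the lookbehind variant, here is a minimal Python sketch run against the sample string from the question. The constant name UNTAGGED_CODE is an assumption made for illustration; the pattern itself is the one given above.

```python
import re

# Backtick-delimited text that is NOT immediately preceded by "<code>".
UNTAGGED_CODE = re.compile(r"(?<!<code>)`[\w ]+`")

text = "Hallo, this is <code>`code`</code> and this `is code again`."

matches = UNTAGGED_CODE.findall(text)
print(matches)
```

Note that [\w ]+ only allows word characters and spaces between the backticks, so code spans containing punctuation would not be matched by this sketch.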

Using Regex to find "<img .../>" and "<script ...> </script>" in HTML string

I am trying to use Regular Expressions for the first time to search for images and scripts in webpages in Scala. The expressions I've come up with are
Images:
/(<img\S+\s+\/>)+/
Scripts:
/(<script\s+\S+><\/script>)+/
I don't really know anything about HTML code or using Regex so I'm not sure what I need in order to specify that it should match <img .../> where the ... could be any amount of characters or whitespace. This is just a small part of a programming assignment I'm writing in Scala and we have to use Regex.
A regex like <img[^>]*> would match <img..........>.
A regex like <script.*?</script> would match a single <script...>...</script> instance. The ? is necessary to prevent it from matching everything from the first <script...> tag to the last </script> tag.
(Feel free to add back in the capturing ( )'s, the \ escapes, and surround with the regex delimiting / / tokens. I removed them to focus on the regular expressions themselves, without the leaning toothpick syndrome and other noise.)
While these are better than the ones you proposed, they will still break in many circumstances. RegEx is not designed to parse HTML.
<script>
<!-- This "</script>" doesn't end the script, but fools the RegEx -->
</script>
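The two suggested patterns and the failure case can be tried out in a short sketch. It is written in Python rather than Scala, and the sample page string is made up for illustration. re.DOTALL is added so that . can cross line breaks inside a script element.

```python
import re

IMG = re.compile(r"<img[^>]*>")
# Non-greedy .*? stops at the first </script>; DOTALL lets "." match newlines.
SCRIPT = re.compile(r"<script.*?</script>", re.DOTALL)

# Made-up sample page, not taken from the question.
page = '<p><img src="a.png" /><script src="x.js"></script><img alt="b"></p>'

images = IMG.findall(page)
scripts = SCRIPT.findall(page)
print(images)
print(scripts)

# The tricky case from the answer: the regex stops at the first "</script>"
# it sees, even though that one sits inside a comment in the script body.
tricky = '<script>\n<!-- This "</script>" doesn\'t end the script -->\n</script>'
first = SCRIPT.search(tricky).group(0)
print(repr(first))
```

The last match is cut off in the middle of the comment, which is the kind of breakage the answer warns about.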

How can i convert/replace every newline to '<br/>'?

set tabstop=4
set shiftwidth=4
set nu
set ai
syntax on
filetype plugin indent on
I tried content.gsub("\r\n","<br/>"), but when I click the view/show button to see the contents of these lines, I get this output:
set tabstop=4<br/> set shiftwidth=4<br/> set nu<br/> set ai<br/> syntax on<br/> filetype plugin indent on
I wanted those lines to appear as separate lines, but they all end up on a single line. Why?
How can I give each of those lines an HTML break (<br/>)?
I tried this, but it didn't work:
@addpost = Post.new params[:data]
@temptest = @addpost.content.html_safe
@addpost.content = @temptest
#logger.debug(@addpost)
@addpost.save
I also tried it without saving into the database, only in the view layer: <%= t.content.html_safe %>. That didn't work either.
Got this from page source
vimrc file <br/>
2011-12-06<br/><br/>
set tabstop=4<br/><br/>set shiftwidth=4<br/><br/>set nu<br/><br/>set ai<br/><br/>syntax on<br/><br/>filetype plugin indent on<br/>
Edit
Delete
<br/><br/>
An alternative to converting every newline to an HTML <br> tag would be to use CSS to display the content as it was given:
.wrapped-text {
white-space: pre-wrap;
}
This preserves the line breaks in the content (and still wraps long lines) without altering the text itself.
You need to use html_safe if you want to render embedded HTML:
<%= @the_string.html_safe %>
If it might be nil, raw(@the_string) won't throw an exception. I'm a bit ambivalent about raw; I almost never try to display a string that might be nil.
Ruby on Rails 4.0.1 comes with simple_format from TextHelper. It handles more tags than the OP requested, but it also sanitizes the content, filtering out malicious tags.
simple_format(t.content)
Reference : http://api.rubyonrails.org/classes/ActionView/Helpers/TextHelper.html
http://www.ruby-doc.org/core-1.9.3/String.html
As it says there, gsub expects a pattern and a replacement.
Since "\r\n" is a String, note what the docs say:
if given as a String, any regular expression metacharacters it contains will be interpreted literally, e.g. '\d' will match a backslash followed by ‘d’, instead of a digit.
So you are matching only the exact sequence "\r\n"; you probably want a character class containing \n or \r: [\n\r]
a = <<-EOL
set tabstop=4
set shiftwidth=4
set nu
set ai
syntax on
filetype plugin indent on
EOL
print a.gsub(/[\n\r]/,"<br/>\n");
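One caveat with a character class such as [\n\r]: each \r and each \n is replaced separately, so Windows-style "\r\n" line endings produce two breaks per line. That might be what the doubled <br/><br/> in the page source above reflects, though that is only a guess. A minimal Python sketch of an alternative that treats "\r\n" as a single line break (the function name newlines_to_br is hypothetical):

```python
import re

# Try "\r\n" first so a Windows line ending counts as ONE break.
NEWLINE = re.compile(r"\r\n|\r|\n")


def newlines_to_br(text):
    """Replace each line break with <br/> plus a newline (hypothetical helper)."""
    return NEWLINE.sub("<br/>\n", text)


unix_text = "set nu\nset ai\n"
windows_text = "set nu\r\nset ai\r\n"

print(newlines_to_br(unix_text), end="")
print(newlines_to_br(windows_text), end="")

# For comparison: the character class doubles the breaks on "\r\n" input.
doubled = re.sub(r"[\n\r]", "<br/>\n", windows_text)
print(doubled, end="")
```

This only covers the newline conversion. Whether the resulting <br/> tags are rendered or shown escaped still depends on how the view layer escapes strings, as the other answers discuss.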
I'm not sure I exactly follow the question - are you seeing the output as e.g. preformatted text, or does the source HTML have those tags? If the source HTML has those tags, they should appear on new lines, even if they aren't on line breaks in the source, right?
Anyway, I'm guessing you're dealing with automatic string escaping. Check out this other Stack Overflow question
Also, this: Katz talking about this feature

escaping html inside comment tags

Escaping HTML is fine: it will remove <'s and >'s etc.
I've run into a problem where I am outputting a filename inside a comment tag, e.g. <!-- ${filename} -->
Of course things can go badly if you don't escape, so it becomes:
<!-- <c:out value="${filename}"/> -->
The problem is that if the file has "--" in its name, all the HTML gets screwed up, since you're not allowed to have <!-- -- -->.
The standard HTML escape doesn't escape these dashes, and I was wondering if anyone is familiar with a simple/standard way to escape them.
Definition of a HTML comment:
A comment declaration starts with <!, followed by zero or more comments, followed by >. A comment starts and ends with "--", and does not contain any occurrence of "--".
Of course the parsing of a comment is up to the browser.
Nothing strikes me as an obvious solution here, so I'd suggest you str_replace those double dashes out.
There is no good way to solve this. You can't just escape them because comments are read in plaintext. You will have to do something like put a space between the hyphens, or use some sort of code for hyphens (like [HYPHEN]).
Since it is obvious that you cannot directly display the '--'s, you can either encode them or use the fn:escapeXml or fn:replace functions for appropriate replacements.
JSTL documentation
There's no universally working way to escape those characters in HTML unless the - characters come in multiples of four: -- won't work in Firefox, but ---- will. So it all depends on the browser. For example, in Internet Explorer 8 it is not a problem; those characters are handled properly. The same goes for Google Chrome. However, Firefox, even the latest version (3.0.4), doesn't handle these characters well.
You shouldn't be trying to HTML-escape, the contents of comments are not escapable and it's fine to have a bare ‘>’ or ‘&’ inside.
‘--’ is its own, unrelated problem and is not really fixable. If you don't need to recover the exact string, just do a replacement to get rid of them (eg. replace with ‘__’).
If you do need to get a string through completely unmolested to a JavaScript that will be reading the contents of the comment, use a string literal:
<!-- 'my-string' -->
which the script can then read using eval(commentnode.data). (Yes, a valid use for eval() at last!)
Then your escaping problem becomes how to put things in JS string literals, which is fairly easily solvable by escaping the ‘'’ and ‘-’ characters:
<!-- 'Bob\x27s\x2D\x2Dstring' -->
(You should probably also escape ‘<’, ‘&’ and ‘"’, in case you ever want to use the same escaping scheme to put a JS string literal inside a <script> block or inline handler.)
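Both suggestions from this answer can be sketched in a few lines. This is written in Python for illustration rather than JSP/JSTL, and the function names are hypothetical. Escaping the backslash itself is an extra step assumed here so the literal stays unambiguous; it is not part of the original answer.

```python
def strip_double_hyphens(name):
    """Lossy option: replace '--' with '__' so the text is safe in a comment."""
    return name.replace("--", "__")


def js_literal_for_comment(value):
    """Recoverable option: build a JS string literal containing no ' or - characters.

    Backslashes are escaped first (an assumption added here), then the quote
    and hyphen are replaced with \\x27 and \\x2D as in the answer.
    Line breaks in the value are not handled by this sketch.
    """
    escaped = (
        value.replace("\\", "\\\\")
        .replace("'", "\\x27")
        .replace("-", "\\x2D")
    )
    return "'" + escaped + "'"


print("<!-- " + strip_double_hyphens("my--file.txt") + " -->")
print("<!-- " + js_literal_for_comment("Bob's--string") + " -->")
```

The second line printed matches the <!-- 'Bob\x27s\x2D\x2Dstring' --> form shown above.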