Using regular expressions to remove relative path slashes - html

I am trying to remove all the relative image path slashes from a chunk of HTML that contains several other elements.
For example
<img src="../../../../images/upload/1/test.jpg />
would need to become
<img src="http://s3.amazonaws.com/website/images/upload/1/test.jpg" />
I was thinking of writing this as a rails helper, and just passing the entire block into the method, and make using Nokogiri or Hpricot to parse the HTML instead, but I don't really know.
Any help would be great
Cheers
Adam

No need to reinvent the wheel, when the builtin 'uri' lib can do that for you:
require 'uri'
main_path = "http://s3.amazonaws.com/website/a/b/c"
relative_path = "../../../../images/upload/1/test.jpg"
URI.join(main_path, relative_path).to_s
# ==> "http://s3.amazonaws.com/images/upload/1/test.jpg"

One way to construct an absolute path given the absolute URL of the page and a relative path found on that page:
pageurl = 'http://s3.amazonaws.com/website/foo/bar/baz/quux/index.html'
relative = '../../../../images/upload/1/test.jpg'
absolute = pageurl.sub(/\/[^\/]*$/, '')
relative.split('/').each do |d|
if d == '..'
absolute.sub!(/\/[^\/]*$/, '')
else
absolute << "/#{d}"
end
end
p absolute
Alternatively, you could cheat a bit:
'http:/'+File.expand_path(File.dirname(pageurl.sub(/^http:/, ''))+'/'+relative)

This chunk might help:
html = '<img src="../../../../images/upload/1/test.jpg />'
absolute_uri = "http://s3.amazonaws.com/website/images"
html.gsub(/(\.\.\/)+images/, absolute_uri)

Related

How can I display a TemporaryUploadedFile from Django in HTML as an image?

In Django, I have programmed a form in which you can upload one image. After uploading the image, the image is passed to another method with the type TemporaryUploadedFile, after executing the method it is given to the HTML page.
What I would like to do is display that TemporaryUploadedFile as an image in HTML. It sounds quite simple to me but I could not find the answer on StackOverflow or on Google to the question: How to display a TemporaryUploadedFile in HTML without having to save it first, hence my question.
All help is appreciated.
Edit 1:
To give some more information about the code and the variables while debugging.
input_image = next(iter(request.FILES.values()))
output_b64 = (input_image.content_type, str(base64.b64encode(input_image.read()), 'utf8'))
Well, you can encode the image to base64 and use a data url as the value for src.
A base64 data url looks like this:
<img src="">
\_______/ \__________________/
| |
File type base64 encoded data
Read the Mozilla docs for more on data urls.
Here's some relevant code:
import base64
def my_view(request):
# assuming `image` is a <TemporaryUploadedFile object>
image_b64 = base64.b64encode(image.read())
image_b64 = image_b64.decode('utf8') # convert bytes to string
image_type = image.content_type # png or jpeg or something else
return render('template', {'image_b64': image_b64, 'image_type': image_type})
Then in your template:
<img src="data:{{ image_type }};base64,{{ image_b64 }}">
I want to thank xyres for pushing me in the right direction. As you can see, I used some parts of his solution in the code below:
# As input I take one image from the form.
temp_uploaded_file = next(iter(request.FILES.values()))
# The TemporaryUploadedFile is converted to a Pillow Image
input_image = pil_image.open(temp_uploaded_file)
# The input image does not have a name so I set it afterwards. (This step, of course, is not mandatory)
input_image.filename = temp_uploaded_file.name
# The image is saved to an InMemoryFile
output = BytesIO()
input_image.save(output, format=img.format)
# Then the InMemoryFile is encoded
img_data = str(base64.b64encode(output.getvalue()), 'utf8')
output_b64 = ('image/' + img.format, img_data)
# Pass it to the template
return render(request, 'visualsearch/similarity_output.html', {
"output_image": output_b64
})
In the template:
<img id="output_image" src="data:{{ image.0 }};base64,{{ image.1 }}">
The current solution works but I don't think it is perfect because I expect that it can be done with less code and faster, so if you know how this can be done better you are welcome to post your answer here.

Creating HTML links from images in :colons: with Ruby

I have a simple HTML document:
<div should-not-be-replaced=":smile:">
Hello :smile:!
</div>
How would I replace the :smile: text with <img src="smile.png">, but keeping the first :smile: unchanged, to get this:
<div should-not-be-replaced=":smile:">
Hello <img src="smile.png">!
</div>
I tried this, but Nokogiri escapes my HTML as plain text:
doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.traverse do |x|
next unless x.text?
x.content = x.text.gsub(':smile:', '<img src="smile.png">')
end
My solution is very similar to Ku's, although I've tried to handle situations where the replaced text could be in the source text multiple times by completely replacing the content text with an HTML Doc Fragment
doc = Nokogiri::HTML::DocumentFragment.parse(DATA.read)
doc.traverse do |x|
next unless x.text?
if x.text.match(%r{:(\w+):})
replace_text = x.text.gsub(%r{:(\w+):}, "<img src='#{$1}.png'>")
x.content = ""
x.add_next_sibling replace_text
end
end
I think this might be what you want, and it also deals with strings between two colons like :something: and produces "something.png" as well.
doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.traverse do |x|
if x.text? && x.content =~ /:\w+:/
x.content = x.content.sub(/:(\w+):/, '')
a = Nokogiri::HTML::DocumentFragment.parse('<a src="'+$1+'.png">')
x.add_next_sibling(a)
end
end
You are making it much too hard, and using traverse which is slow because it forces Nokogiri to walk through every node in the document; In a large page that is costly.
Instead take advantage of selectors to find the specific node(s) you want:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div parm=":smile:">
Hello :smile:!
</div>
EOT
div = doc.at('div[parm=":smile:"]')
div.inner_html = div.text.sub(/:smile:/, '<img src="smile.png">')
puts doc.to_html
Running that results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div parm=":smile:">
Hello <img src="smile.png">!
</div>
</body></html>
I'm using at, which finds the first occurrence. If you need to process more than one then use search. search returns a NodeSet, which is like an array so you'll want to iterate over it. That are innumerable examples of doing so on Stack Overflow and elsewhere.
Do you mean it returns &lt or &gt?
I recommend to wrap CGI#unescape_html method
try,
require 'cgi'
CGI::unescape_html(doc.to_s)

Text should change color after # or # [duplicate]

I'm assuming it's not possible, but just in case it is or someone has a nice trick up their sleeves, is there a way to target certain characters using CSS?
For example make all letters z in paragraphs red, or in my particular case set vertical-align:sup on all 7 in elements marked with the class chord.
Hi I know you said in CSS but as everybody told you, you can't, this is a javascript solution, just my 2 cents.
best...
JSFiddle
css
span.highlight{
background:#F60;
padding:5px;
display:inline-block;
color:#FFF;
}
p{
font-family:Verdana;
}
html
<p>
Let's go Zapata let's do it for the revolution, Zapatistas!!!
</p>
javascript
jQuery.fn.highlight = function (str, className) {
var regex = new RegExp(str, "gi");
return this.each(function () {
this.innerHTML = this.innerHTML.replace(regex, function(matched) {return "<span class=\"" + className + "\">" + matched + "</span>";});
});
};
$("p").highlight("Z","highlight");
Result
That's not possible in CSS. Your only option would be first-letter and that's not going to cut it. Either use JavaScript or (as you stated in your comments), your server-side language.
The only way to do it in CSS is to wrap them in a span and give the span a class. Then target all spans with that class.
As far as I understand it only works with regular characters/letters. For example: what if we want to highlight all asterisk (\u002A) symbols on page. Tried
$("p").highlight("u\(u002A)","highlight");in js and inserted * in html but it did not worked.
In reply to #ncubica but too long for a comment, here's a version that doesn't use regular expressions and doesn't alter any DOM nodes other than Text nodes. Pardon my CoffeeScript.
# DOM Element.nodeType:
NodeType =
ELEMENT: 1
ATTRIBUTE: 2
TEXT: 3
COMMENT: 8
# Tags all instances of text `target` with <span class=$class>$target</span>
#
jQuery.fn.tag = (target, css_class)->
#contents().each (index)->
jthis = $ #
switch #.nodeType
when NodeType.ELEMENT
jthis.tag target, css_class
when NodeType.TEXT
text = jthis.text()
altered = text.replaceAll target, "<span class=\"#{css_class}\">$&</span>"
if altered isnt text
jthis.replaceWith altered
($ document).ready ->
($ 'div#page').tag '⚀', 'die'

Jsoup get div contents

Hi,
I can't get the "src" content of this div class :
<div class="myclass"><img border=0 src="./images/myimage.jpg"></div>
I use
Els1 = doc1.getElementsByClass("myclass");
el=Els1.get(i)
but el.attr("src") or any other attributes returns emmpty
Conversely,
el.html() is ok :
<img border="0" src="./images/myimage.jpg" />
Tried also
doc1 = Jsoup.parseBodyFragment(el.outerHtml());
print (doc1.getElementsByAttribute("src").text());
with no success.
How can I get this src value ?
Thanks for any help,
Olivier
From the Jsoup Doc it should look somehow like this:
Element image = document.select("img").first();
String url = image.absUrl("src");
You also could use String url = image.attr("abs:src"); instead of absUrl.
I can't test your case on my system right now, so i hope you ll handle it somehow with the Jsoup Docs (URL part)
Jsoup Docs Working with URLs
Here is what you should be doing, if you are using class attribute.
Elements elements = doc.getElementsByClass("myclass");
String imageUrl = elements.attr("src");
And this one, if you are using id,
Element element = doc.getElementById("myid");
String imageUrl = element.attr("src");
This should just work fine.

Download HTML Text with Ruby

I am trying to create a histogram of the letters (a,b,c,etc..) on a specified web page. I plan to make the histogram itself using a hash. However, I am having a bit of a problem actually getting the HTML.
My current code:
#!/usr/local/bin/ruby
require 'net/http'
require 'open-uri'
# This will be the hash used to store the
# histogram.
histogram = Hash.new(0)
def open(url)
Net::HTTP.get(URI.parse(url))
end
page_content = open('_insert_webpage_here')
page_content.each do |i|
puts i
end
This does a good job of getting the HTML. However, it gets it all. For www.stackoverflow.com it gives me:
<body><h1>Object Moved</h1>This document may be found here</body>
Pretending that it was the right page, I don't want the html tags. I'm just trying to get Object Moved and This document may be found here.
Is there any reasonably easy way to do this?
When you require 'open-uri', you don't need to redefine open with Net::HTTP.
require 'open-uri'
page_content = open('http://www.stackoverflow.com').read
histogram = {}
page_content.each_char do |c|
histogram[c] ||= 0
histogram[c] += 1
end
Note: this does not strip out <tags> within the HTML document, so <html><body>x!</body></html> will have { '<' => 4, 'h' => 2, 't' => 2, ... } instead of { 'x' => 1, '!' => 1 }. To remove the tags, you can use something like Nokogiri (which you said was not available), or some sort of regular expression (such as the one in Dru's answer).
See the section "Following Redirection" on the Net::HTTP Documentation here
Stripping html tags without Nokogiri
puts page_content.gsub(/<\/?[^>]*>/, "")
http://codesnippets.joyent.com/posts/show/615