HTML insert in Ruby

I am trying to build an HTML string in Ruby.
Here is my sample code:
tmp1 = "<div><font face='Arial'><span style='font-size:9pt'>♦Issue : #{@issue[:"Defect Type"]} (#{@issue[:"Checker"]}) found in #{@issue[:"Function"]}</span></font></div>"
doc = Nokogiri::HTML.fragment(tmp1)
doc.to_html
header = Nokogiri::XML.fragment('<html><body>')
header.at('body').children = doc
details = header.to_html
doc = Nokogiri::XML::DocumentFragment.parse(details)
body = doc.at('body')
url = "http://collab.temp.com/main/display/CQ/Checker+Guides"
tmp2 = "<div><font face='Arial'><span style='font-size:9pt'>♦Review Guide : <a href=#{url}>#{url}</a></span></font></div>"
body.add_child(tmp2)
details = doc.to_html(:encoding => 'EUC-KR')
When I display 'details' in a browser, I can see the exact hyperlink, like below:
♦Review Guide : http://collab.temp.com/main/display/CQ/Checker+Guides
But if I click the link, a 'The webpage cannot be found' error occurs.
If I copy the link and paste it into the browser, I can access it successfully.
So I think I may be generating an incorrect HTML statement in Ruby.
Could you help me with this problem?

Surround the URL in the href attribute of the anchor tag (in the tmp2 = line) with quotes:
<a href=#{url}>#{url}</a> ⇐ incorrect
<a href='#{url}'>#{url}</a> ⇐ correct
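A quick way to see why the quotes matter: in HTML, an unquoted attribute value ends at the first whitespace character, so a URL containing a space (hypothetical here) gets truncated, while a quoted value survives intact. A small sketch using Python's standard-library parser:

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href attribute of every anchor tag encountered."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.hrefs.append(dict(attrs).get('href'))

url = 'http://example.com/a page'   # hypothetical URL containing a space

unquoted = HrefCollector()
unquoted.feed(f"<a href={url}>{url}</a>")    # unquoted: value ends at the space
quoted = HrefCollector()
quoted.feed(f"<a href='{url}'>{url}</a>")    # quoted: full URL preserved

print(unquoted.hrefs)  # ['http://example.com/a']
print(quoted.hrefs)    # ['http://example.com/a page']
```

Most URLs without spaces happen to survive unquoted, but quoting the value is the only form that is safe for arbitrary URLs.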


I'm having trouble with sending a form using POST to retrieve data in R

I'm having trouble collecting doctors from https://www.uchealth.org/providers/. I've found out it's a POST method but with httr I can't seem to create the form. Here's what I have
url = 'https://www.uchealth.org/providers/'
formA = list(title = 'Search', onClick = 'swapForms();', doctor-area-of-care-top = 'Cancer Care')
formB = list(Search = 'swapForms();', doctor-area-of-care = 'Cancer Care')
get = POST(url, body = formB, encode = 'form')
I'm fairly certain formB is the correct one. However, I can't test it, since I get an error when trying to create the list. I believe it is because you can't use '-' characters in names, although I could be wrong on that. Could somebody help, please?
I am unable to comment properly, but try this to create the list. The code below worked for me.
library(httr)
url = 'https://www.uchealth.org/providers/'
formB = list(Search = 'swapForms();', `doctor-area-of-care` = 'Cancer Care')
get = POST(url, body = formB, encode = 'form')
When you are creating names containing spaces or other special characters, you have to wrap them in backticks, as above.
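For comparison, in Python (used elsewhere in this thread) hyphenated field names cause no trouble, because form fields are plain dictionary keys rather than identifiers. A minimal stdlib sketch of encoding the same form:

```python
from urllib.parse import urlencode

# Hyphenated form-field names are fine as dict keys; they only break when a
# language treats the name as an identifier (as R does without backticks).
form = {"Search": "swapForms();", "doctor-area-of-care": "Cancer Care"}
print(urlencode(form))  # Search=swapForms%28%29%3B&doctor-area-of-care=Cancer+Care
```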

.text is scrambled with numbers and special keys in BeautifulSoup

Hello, I am currently using Python 3, BeautifulSoup 4, and requests to scrape some information from supremenewyork.com UK. I have implemented a proxy script (that I know works) into the script. The only problem is that this website does not like programs scraping this information automatically, so they have decided to scramble the text, which I think makes it unusable as text.
My question: is there a way to get the text without using the .text thing, and/or is there a way to get the script to read the text, and when it sees a special character like # to skip over it, or when it sees & to skip until it sees ;?
Because basically, this is how the website scrambles the text. Here is an example; the text shown when you inspect the element is:
supremetshirt
which is supposed to say "supreme t-shirt", and so on (you get the idea; they don't use letters to scramble, only numbers and special keys).
This is highlighted in a box automatically when you inspect the element using a VPN on the UK Supreme website, and is different from the visible text (which isn't highlighted at all). Whenever I run my script without the proxy code against my local supremenewyork.com, it works fine (but only because the code is not scrambled on my local website, and I want to pull this info from the UK website). Any ideas? Here is my code:
import requests
from bs4 import BeautifulSoup

categorys = ['jackets', 'shirts', 'tops_sweaters', 'sweatshirts', 'pants', 'shorts', 't-shirts', 'hats', 'bags', 'accessories', 'shoes', 'skate']
catNumb = 0

# use a new proxy every so often for testing (will add something that pulls proxies and uses them for you)
UK_Proxy1 = '51.143.153.167:80'
proxies = {
    'http': 'http://' + UK_Proxy1,
    'https': 'https://' + UK_Proxy1,
}

for cat in categorys:
    catStr = str(categorys[catNumb])
    cUrl = 'http://www.supremenewyork.com/shop/all/' + catStr
    proxy_script = requests.get(cUrl, proxies=proxies).text
    bSoup = BeautifulSoup(proxy_script, 'lxml')
    print('\n*******************"' + catStr.upper() + '"*******************\n')
    catNumb += 1
    for item in bSoup.find_all('div', class_='inner-article'):
        url = item.a['href']
        alt = item.find('img')['alt']
        req = requests.get('http://www.supremenewyork.com' + url)
        item_soup = BeautifulSoup(req.text, 'lxml')
        name = item_soup.find('h1', itemprop='name').text
        #name = item_soup.find('h1', itemprop='name')
        style = item_soup.find('p', itemprop='model').text
        #style = item_soup.find('p', itemprop='model')
        print(alt + ' --- ' + name + ' --- ' + style)
        #print(alt)
        #print(str(name))
        #print(str(style))
When I run this script I get this error:
name = item_soup.find('h1', itemprop='name').text
AttributeError: 'NoneType' object has no attribute 'text'
So I uncommented the lines that are commented out above and commented out the similar-but-different lines, and I got some kind of str error, so I tried print(str(name)). I am able to print the alt fine (in every run, the alt is not scrambled), but when it comes to printing the name and style, all that is printed is None under every alt.
I have been working on fixing this for days and have come up with no solutions. Can anyone help me solve this?
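The immediate AttributeError can be guarded against regardless of the scrambling: find() returns None when no matching element exists, and calling .text on None raises exactly the error shown. A minimal, dependency-free sketch of the guard pattern (FakeTag below is a stand-in for a BeautifulSoup tag; in the real script the argument would be something like item_soup.find('h1', itemprop='name')):

```python
def safe_text(tag, default="N/A"):
    """Return tag.text if find() located the element, else a default."""
    return tag.text if tag is not None else default

# Stand-in for a parsed element; a real BeautifulSoup tag also has a .text
# attribute, so the same guard applies unchanged.
class FakeTag:
    text = "supreme t-shirt"

print(safe_text(FakeTag()))  # supreme t-shirt
print(safe_text(None))       # N/A
```

This does not decode the scrambled text, but it keeps the loop running so you can see which items are missing the expected elements.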
I solved it myself using this solution:
thetable = soup5.find('div', class_='turbolink_scroller')
items = thetable.find_all('div', class_='inner-article')

for item in items:
    alt = item.find('img')['alt']
    name = item.h1.a.text
    color = item.p.a.text
    print(alt, ' --- ', name, ' --- ', color)
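The "scrambling" described in the question is typically done with HTML character references (e.g. &#45; for '-'), which the Python standard library can decode directly. A small sketch with a made-up scrambled string (the real site's markup differs, but the decoding principle is the same):

```python
from html import unescape

# Hypothetical scrambled string built from numeric character references:
# &#32; is a space, &#45; is a hyphen.
scrambled = "supreme&#32;t&#45;shirt"
print(unescape(scrambled))  # supreme t-shirt
```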

How can I convert an HTML string to PDF?

I am using the AspPDF ActiveX component. I have already created a PDF document from ASP, but the content of this PDF is only simple text. How can I convert an ASP variable that contains HTML tags to a PDF document?
Here is my code:
Set Pdf = Server.CreateObject("Persits.Pdf")
Set Doc = Pdf.CreateDocument
Doc.Title = "Mi primer documento PDF"
Doc.Creator = "arsys.es"
Set Page = Doc.Pages.Add
Set Font = Doc.Fonts("Helvetica")
Params = "x=0; y=650; width=612; alignment=center; size=10"
Page.Canvas.DrawText "pdf text", Params, Font 'can I put asp variable instead of "pdf text"?
Filename = Doc.Save( Server.MapPath("salida\archivo.pdf"), False )
Response.Write "Enhorabuena! Descarga tu primer archivo PDF"
I have the same thing running on my server. The guide that comes with it requires you to use Doc.ImportFromUrl, but that will need a full reference to the ASP page. So...
1) Create your ASP page and lay out your page with SQL and bindings if required. I use this with a QueryString parameter, as I only want to reference one value to create the PDF... my example is based around creating an invoice.
2) In a different folder to your ASP page, run the AspPDF script in a page.
3) Ensure you have a full reference to the path, e.g. http://www.domain.com/file.asp
This should help:
<%
Set Pdf = Server.CreateObject("Persits.Pdf")
Set Doc = Pdf.CreateDocument
' A4 paper size - if US required remove/comment the line below.
Set Page = Doc.Pages.Add(595.3, 841.9)
Doc.ImportFromUrl "http://www.yourdomain.com/yourfile.asp?valueID=" & Request.QueryString("valueID")
Filename = Doc.Save( Server.MapPath("quotation.pdf"), False )
'Response.Write "Success! Download your PDF file here"
%>
That works for me anyway!

Page number variable: HTML, Django

I want to do paging, but I only want to know the current page number, so I will call the web service function, send this parameter, and receive the corresponding data. So I only want to know how I can be aware of the current page number. I'm writing my project in Django, and I create the page with XSL. If I know the page number, I think I can write this in urls.py:
url(r'^ask/(\d+)/$',
    'ask',
    name='ask'),
and call the function in views.py like:
ask(request, pageNo)
but I don't know where to put the pageNo variable in the HTML page (so, for example, with pageNo=2, I can do pageNo+1 or pageNo-1 to make the URL like 127.0.0.1/ask/3/ or 127.0.0.1/ask/2/). To make my question clearer: how can I do this when we don't have any variables in HTML?
Sorry for my crazy question; I'm new to creating websites and also to Django.
I'm creating my HTML page with XSLT, so I send the whole HTML page (to show.html, which contains only {{str}}).
def ask(request):
    service = GetConfigLocator().getGetConfigHttpSoap11Endpoint()
    myRequest = GetConfigMethodRequest()
    myXml = service.GetConfigMethod(myRequest)
    myXmlstr = myXml._return
    styledoc = libxml2.parseFile("ask.xsl")
    style = libxslt.parseStylesheetDoc(styledoc)
    doc = libxml2.parseDoc(myXmlstr)
    result = style.applyStylesheet(doc, None)
    out = style.saveResultToString(result)
    ok = mark_safe(out)
    style.freeStylesheet()
    doc.freeDoc()
    result.freeDoc()
    return render_to_response("show.html", {
        'str': ok,
    }, context_instance=RequestContext(request))
I'm not working with a DB; I just receive an XML file and parse it, so I don't have contact_list = Contacts.objects.all(). Can I still use this approach? Should I leave the first parameter in paginator = Paginator(contact_list, 25) blank?
If you use the standard Django paginator, it sends you to the URL http://example.com/?page=N, where N is your page number.
So,
# urls.py
url('^ask/$', 'ask', name='viewName'),
You can get the page number in the view:
# views.py
def ask(request):
    page = request.GET.get('page', 1)
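Regarding paginating a plain list rather than a queryset: Django's Paginator accepts any sliceable sequence, and the underlying arithmetic is simple. A minimal stdlib sketch of the same idea, using made-up entries standing in for the parsed XML data:

```python
def paginate(items, page, per_page=25):
    """Return the slice of `items` for 1-based page number `page`."""
    start = (page - 1) * per_page
    return items[start:start + per_page]

# Stand-in for entries parsed out of the XML response.
contacts = [f"contact-{i}" for i in range(1, 101)]

print(paginate(contacts, 1)[:3])   # ['contact-1', 'contact-2', 'contact-3']
print(len(paginate(contacts, 4)))  # 25
```

So the first Paginator parameter should not be blank; pass it whatever list you built from the XML.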

Using MATLAB to parse HTML for URLs in anchors, help fast

I'm on a strict time limit, and I really need a regex to parse this type of anchor (they're all in this format):
<a href="20120620_0512_c2_1024.jpg">20120620_0512_c2_102..&gt;</a>
for the URL
20120620_0512_c2_1024.jpg
I know it's not a full URL; it's relative. Please help.
Here's my code so far
year = datestr(now,'yyyy');
timestamp = datestr(now,'yyyymmdd');
html = urlread(['http://sohowww.nascom.nasa.gov//data/REPROCESSING/Completed/' year '/c2/' timestamp '/']);
links = regexprep(html, '<a href=.*?>', '');
Try the following:
url = 'http://sohowww.nascom.nasa.gov/data/REPROCESSING/Completed/2012/c2/20120620/';
html = urlread(url);
t = regexp(html, '<a href="([^"]*\.jpg)">', 'tokens');
t = [t{:}]'
The resulting cell array (truncated):
t =
'20120620_0512_c2_1024.jpg'
'20120620_0512_c2_512.jpg'
...
'20120620_2200_c2_1024.jpg'
'20120620_2200_c2_512.jpg'
I think this is what you are looking for:
htmlLink = '<a href="20120620_0512_c2_1024.jpg">20120620_0512_c2_102..&gt;</a>';
link = regexprep(htmlLink, '(<a href=")(.*)(">.*)', '$2');
link =
20120620_0512_c2_1024.jpg
regexprep also works on cell arrays of strings, so this works too:
htmlLinksCellArray = { '<a href="20120620_0512_c2_1024.jpg">20120620_0512_c2_102..&gt;</a>', '<a href="20120620_0512_c2_1025.jpg">20120620_0512_c2_102..&gt;</a>', '<a href="20120620_0512_c2_1026.jpg">20120620_0512_c2_102..&gt;</a>' };
linksCellArray = regexprep(htmlLinksCellArray, '(<a href=")(.*)(">.*)', '$2')
linksCellArray =
'20120620_0512_c2_1024.jpg' '20120620_0512_c2_1025.jpg' '20120620_0512_c2_1026.jpg'
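For reference, the same capture-group idea in Python's re module, using a hypothetical snippet in the Apache directory-listing format shown above:

```python
import re

# Hypothetical directory-listing fragment; the anchor text is truncated by the
# server, but the href attribute carries the full relative filename.
html = ('<a href="20120620_0512_c2_1024.jpg">20120620_0512_c2_102..&gt;</a>\n'
        '<a href="20120620_0512_c2_512.jpg">20120620_0512_c2_512..&gt;</a>')

links = re.findall(r'<a href="([^"]*\.jpg)">', html)
print(links)  # ['20120620_0512_c2_1024.jpg', '20120620_0512_c2_512.jpg']
```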