Need help scraping the img src from a website

Below is my web-scraping code for a website: it clicks a form button, which redirects to a page, and from that page I need to extract the img src URL and export it to CSV as text. The code below worked for extracting the content of a td tag, but here it fails because the td has no text content, only an img tag. Any help would be appreciated; I am new to web scraping. Thanks in advance.
browser.find_element_by_css_selector(".textinput[value='APPLY']").click()
#select_finder = "//tr[contains(text(), 'NB')]//a"
select_finder = "//td[text()='NB')]/../td[2]/a"
browser.find_element_by_css_selector(".content a").click()
assert "Application Details" in browser.title
file_data = []
try:
assert "Application Details" in browser.title
enlargement = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[3]/td[2]/b").text
enlargement_answer1 = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[4]/td[2]").text
enlargement_answer2 = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[4]/td[3]").text
enlargement_text = enlargement + enlargement_answer1 + enlargement_answer2
considerations = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[2]/b").text
considerations_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[3]").text
considerations_text = considerations + considerations_answer
alteration = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[6]/b").text
alteration_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[7]").text
alteration_text = alteration + alteration_answer
units = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[5]/td[3]/b").text
units_answer = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[5]/td[4]").text
units_text = units + units_answer
occupancy = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[6]/td[3]/b").text
occupancy_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[6]/td[4]").text
occupancy_text = occupancy + occupancy_answer
coo = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[7]/td[3]/b").text
coo_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[7]/td[4]").text
coo_text = coo + coo_answer
floors = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[8]/td[3]/b").text
floors_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[8]/td[4]").text
floors_text = floors + floors_answer
except (NoSuchElementException, AssertionError) as e:
floors_text.append("No Zoning Characteristics Present")
coo_text.append("n/a")
occupancy_text.append("n/a")
units_text.append("n/a")
alteration_text.append("n/a")
considerations_text.append("n/a")
enlargement_text.append("n/a")
with open('DOB.csv', 'a') as f:
wr = csv.writer(f, dialect='excel')
wr.writerow((block_number, lot_number, houseno, street, condo_text,
vacant_text, city_owned_text, file_data, floors_text, coo_text, occupancy_text, units_text, alteration_text,
considerations_text, enlargement_text ))
browser.close()

As you stated you are new to web scraping, I encourage you to read up a bit: http://selenium-python.readthedocs.io/locating-elements.html
You are using XPath exclusively, and in ways that are not recommended.
From the docs: "You can use XPath to either locate the element in absolute terms (not advised), or relative to an element that does have an id or name attribute."
Try using other locators to get your image, for example:
driver.find_element_by_css_selector("img[src='images/box_check.gif']")
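Since the td in question contains only an img, the piece you're missing is reading the element's src attribute rather than its text. Here is a minimal sketch using the same try/except pattern as your code; the XPath is a placeholder based on your tables above, so adjust it to the td you actually need:

try:
    # The td has no text content, only an <img>, so locate the img itself
    # and read its src attribute (Selenium returns the resolved URL as a string).
    img = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[3]/img")
    img_src = img.get_attribute("src")
except NoSuchElementException:
    img_src = "n/a"

img_src is then plain text and can go into your wr.writerow(...) tuple like any other field.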

Related

Extracting contents from a PDF to display on web pages

I'm trying to display the contents of PDFs by converting them into HTML using Adobe Acrobat 2021, extracting the paragraph structure, and post-processing. I saw a website whose only source is judgments as PDFs from the Supreme Court website, and it displays them flawlessly. Does anybody have any idea how it's done?
My current flow is to convert the PDF into HTML to preserve the page layout and then extract the text using BeautifulSoup.
Issues I'm currently facing:
- Bulletin numbers are somehow dynamically calculated in the PDF and are tagged as ::before in the browser, so bs4 won't recognize them.
- Some paragraphs in between are missed, as paragraph boundaries are detected incorrectly.
- The table is detected as a table, but with some imperfections.
PDF example : drive link
HTML from Adobe Acrobat : HTML file of the above PDF
This is my goal : Advocatekhoj
This is how accurate I'm expecting it to be.
Could someone please shed light on this? Any how-tos or suggestions are welcome.
Note: I tried various PDF-to-HTML tools, and Adobe Acrobat was the best at detecting paragraph layout and preserving structure.
from bs4 import BeautifulSoup
from pprint import pprint
from os import listdir
from os.path import isfile, join

mypath = "sup_del_htmls/"
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
counter = 0
for f in onlyfiles:
    print(counter)
    with open("output_txt/" + f + ".txt", 'w', encoding='utf-8') as txtfile:
        with open(mypath + f, encoding='utf-8') as fp:
            soup = BeautifulSoup(fp, "html.parser")
            para_counter = 1
            for li in soup.select("li"):
                if li.find_parent("li"):
                    continue
                full_para = ""
                for para in li.select("p"):
                    for match in para.findAll('span'):
                        match.unwrap()
                    para_txt = para.get_text().replace("¶", "")
                    para_txt = para_txt.strip()
                    if para_txt.endswith((".", ":", ";", ",", '"', "'")):
                        full_para += para_txt + "\n"
                    else:
                        full_para += para_txt + " "
                txtfile.write(full_para)
                txtfile.write("\n" + "--sep--" + "\n")
                if li.find("table"):
                    tables = li.find_all("table")
                    for table in tables:
                        txtfile.write("--table--" + "\n")
                        txtfile.write(str(table) + "\n")
                        txtfile.write("--sep--" + "\n")
            reversed_end = []
            for p in reversed(soup.select("p")):
                if p.find_parent('li') or p.find_parent('ol'):
                    break
                reversed_end.append(" ".join(p.text.split()))
            if reversed_end != []:
                for final_end in reversed(reversed_end):
                    txtfile.write(final_end + "\n")
                txtfile.write("--sep--" + "\n")
The Result : output.txt
For the numbering with ::before in CSS, you can try to extract the selector(s) for the numbered items with a function like this:
def getLctrSelectors(stsh):
    stsh = stsh.get_text() if stsh else ''
    ll_ids = list(set([
        l.replace('>li', '> li').split('> li')[0].strip()
        for l in stsh.splitlines() if l.strip()[:1] == '#'
        and '> li' in l.replace('>li', '> li') and
        'counter-increment' in l.split('{')[-1].split(':')[0]
    ]))
    for i, l in enumerate(ll_ids):
        sel = f'{l} > li > *:first-child'
        ll_ids[i] = (sel, 1)
        crl = [
            ll for ll in stsh.splitlines() if ll.strip().startswith(l)
            and 'counter-reset' in ll.split('{')[-1].split(':')[-2:][0]
        ][:1]
        if not crl:
            continue
        crl = crl[0].split('{')[-1].split('counter-reset')[-1].split(':')[-1]
        crl = [w for w in crl.split(';')[0].split() if w.isdigit()]
        ll_ids[i] = (sel, int(crl[-1]) if crl else 1)
    return ll_ids
(It should take a style tag as input and return a list of selectors and starting counts - like [('#l1 > li > *:first-child', 3)] for your sample html.)
You can use it in your code to insert the numbers into the text in the bs4 tree:
soup = BeautifulSoup(fp, "html.parser")
for sel, ctStart in getLctrSelectors(soup.select_one('style')):
    for i, lif in enumerate(soup.select(sel)):
        lif.insert(0, f'{i + ctStart}. ')
para_counter = 1
### REST OF CODE ###
I'm not sure I can help you with the paragraph and table issues... Are you sure the site uses the same PDFs as you have? (Or that they use PDFs at all, rather than something closer to the original/raw data?) Your PDF itself looked rather different from its corresponding page on the site.

Adding words in bold to table using python-docx

I'm new here (and to Python), so any feedback on my post is welcome.
I have some code which asks for an input and then adds it to an entry in various tables, e.g.:
import docx

doc = docx.Document('ABC.docx')
length = len(doc.tables)
name = input("What is your name?")
x = range(0, length)
for r in x:
    doc.tables[r].cell(0, 1).text = name + ": " + doc.tables[r].cell(0, 1).text
doc.save("ABC_.docx")
and this will take text like "I love you" and change it to "Bob: I love you", which is great. However, I'd like "Bob" to appear in bold. How do I do that?
Not sure this is the perfect way to do it, but it works. Basically, you store the current cell text in a variable, then clear the cell. After that, get the first paragraph of the cell and add formatted runs of text to it, as follows:
import docx

name = input("What is your name?")
doc = docx.Document('ABC.docx')
length = len(doc.tables)
x = range(0, length)
for r in x:
    # Save the existing text, then clear the cell.
    celltext = doc.tables[r].cell(0, 1).text
    doc.tables[r].cell(0, 1).text = ""
    # Rebuild the cell content as two runs: a bold name, then the original text.
    paragraph = doc.tables[r].cell(0, 1).paragraphs[0]
    paragraph.add_run(name).bold = True
    paragraph.add_run(": " + celltext)
doc.save("ABC_.docx")
Input:
What is your name?NewCoder
Result: the cell text becomes "NewCoder: I love you", with "NewCoder" in bold.

I'm having trouble sending a form using POST to retrieve data in R

I'm having trouble collecting doctors from https://www.uchealth.org/providers/. I've found out it uses a POST method, but with httr I can't seem to create the form. Here's what I have:
url = 'https://www.uchealth.org/providers/'
formA = list(title = 'Search', onClick = 'swapForms();', doctor-area-of-care-top = 'Cancer Care')
formB = list(Search = 'swapForms();', doctor-area-of-care = 'Cancer Care')
get = POST(url, body = formB, encode = 'form')
I'm fairly certain formB is the correct one. However, I can't test it, since I get an error when trying to make the list. I believe it is because you can't use "-" characters in names, although I could be wrong on that. Could somebody help, please?
I am unable to comment properly, but try this to create the list; the code below worked for me.
library(httr)
url = 'https://www.uchealth.org/providers/'
formB = list(Search = 'swapForms();', `doctor-area-of-care` = 'Cancer Care')
get = POST(url, body = formB, encode = 'form')
When you are creating names with spaces or other special characters, you have to wrap them in backticks, as above.

Image uploads with Pyramid and SQLAlchemy

How should one do image file uploads with Pyramid, SQLAlchemy, and deform? Preferably in a way that makes it easy to get image thumbnail tags in the templates. What configuration is needed (storing images on a file-system backend, and so on)?
This question is by no means asking one specific thing. Here, however, is a view which defines a form upload with deform, tests the input for a valid image file, saves a record to a database, and then even uploads the file to Amazon S3. The example is shown below the links to the various pieces of documentation I have referenced.
To upload a file with deform, see the documentation.
If you want to learn how to save image files to disk, see the official documentation.
If you want to learn how to save new items with SQLAlchemy, see the SQLAlchemy tutorial.
If you want to ask a narrower question where a more precise, detailed answer can be given for each section, then please do so.
@view_config(route_name='add_portfolio_item',
             renderer='templates/user_settings/deform.jinja2',
             permission='view')
def add_portfolio_item(request):
    user = request.user
    # define store for uploaded files
    class Store(dict):
        def preview_url(self, name):
            return ""
    store = Store()
    # create a form schema
    class PortfolioSchema(colander.MappingSchema):
        description = colander.SchemaNode(colander.String(),
            validator=Length(max=300),
            widget=text_area,
            title="Description, tell us a few short words describing your picture")
        upload = colander.SchemaNode(
            deform.FileData(),
            widget=widget.FileUploadWidget(store))
    schema = PortfolioSchema()
    myform = Form(schema, buttons=('submit',), action=request.url)
    # if the form has been submitted
    if 'submit' in request.POST:
        controls = request.POST.items()
        try:
            appstruct = myform.validate(controls)
        except ValidationFailure as e:
            return {'form': e.render(), 'values': False}
        # Data is valid as far as colander goes
        f = appstruct['upload']
        upload_filename = f['filename']
        extension = os.path.splitext(upload_filename)[1]
        image_file = f['fp']
        # Now we test for a valid image upload
        image_test = imghdr.what(image_file)
        if image_test is None:
            error_message = "I'm sorry, the image file seems to be invalid"
            return {'form': myform.render(), 'values': False,
                    'error_message': error_message, 'user': user}
        # generate date and random timestamp
        pub_date = datetime.datetime.now()
        random_n = str(time.time())
        filename = random_n + '-' + user.user_name + extension
        upload_dir = tmp_dir
        output_file = open(os.path.join(upload_dir, filename), 'wb')
        image_file.seek(0)
        while 1:
            data = image_file.read(2 << 16)
            if not data:
                break
            output_file.write(data)
        output_file.close()
        # we need to create a thumbnail for the user's profile pic
        basewidth = 530
        max_height = 200
        # open the image we just saved
        root_location = open(os.path.join(upload_dir, filename), 'r')
        image = pilImage.open(root_location)
        if image.size[0] > basewidth:
            # if the image width is greater than basewidth,
            # we need to reduce its size
            wpercent = (basewidth / float(image.size[0]))
            hsize = int((float(image.size[1]) * float(wpercent)))
            portfolio_pic = image.resize((basewidth, hsize), pilImage.ANTIALIAS)
        else:
            # else the image can stay the same size as it is;
            # assign the portfolio_pic var to the image
            portfolio_pic = image
        portfolio_pics_dir = os.path.join(upload_dir, 'work')
        quality_val = 90
        output_file = open(os.path.join(portfolio_pics_dir, filename), 'wb')
        portfolio_pic.save(output_file, quality=quality_val)
        profile_full_loc = portfolio_pics_dir + '/' + filename
        # S3 stuff
        new_key = user.user_name + '/portfolio_pics/' + filename
        key = bucket.new_key(new_key)
        key.set_contents_from_filename(profile_full_loc)
        key.set_acl('public-read')
        public_url = key.generate_url(0, query_auth=False, force_http=True)
        output_dir = os.path.join(upload_dir)
        output_file = output_dir + '/' + filename
        os.remove(output_file)
        os.remove(profile_full_loc)
        new_image = Image(s3_key=new_key, public_url=public_url,
                          pub_date=pub_date, bucket=bucket_name, uid=user.id,
                          description=appstruct['description'])
        DBSession.add(new_image)
        # add the new entry to the association table
        user.portfolio.append(new_image)
        return HTTPFound(location=route_url('list_portfolio', request))
    return dict(user=user, form=myform.render())
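As for getting thumbnail tags into templates, the view above stops at storing public_url on the Image record. A minimal sketch of one way to finish the job, assuming the Image model above (the helper name and css_class parameter are made up for illustration):

def thumbnail_tag(image, css_class="thumb"):
    # Hypothetical helper: build an <img> tag for a stored portfolio image,
    # using the public_url and description columns of the Image model above.
    return '<img src="%s" class="%s" alt="%s" />' % (
        image.public_url, css_class, image.description or "")

Pass the rendered strings into your template context; in a Jinja2 template you would still need the |safe filter (or Markup) so the tag isn't escaped.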

Using MATLAB to parse HTML for URL in anchors, help fast

I'm on a strict time limit and I really need a regex to parse this type of anchor (they're all in this format):
<a href="20120620_0512_c2_1024.jpg">20120620_0512_c2_102..&gt;</a>
for the URL
20120620_0512_c2_1024.jpg
I know it's not a full URL, it's relative; please help. Here's my code so far:
year = datestr(now,'yyyy');
timestamp = datestr(now,'yyyymmdd');
html = urlread(['http://sohowww.nascom.nasa.gov//data/REPROCESSING/Completed/' year '/c2/' timestamp '/']);
links = regexprep(html, '<a href=.*?>', '');
Try the following:
url = 'http://sohowww.nascom.nasa.gov/data/REPROCESSING/Completed/2012/c2/20120620/';
html = urlread(url);
t = regexp(html, '<a href="([^"]*\.jpg)">', 'tokens');
t = [t{:}]'
The resulting cell array (truncated):
t =
'20120620_0512_c2_1024.jpg'
'20120620_0512_c2_512.jpg'
...
'20120620_2200_c2_1024.jpg'
'20120620_2200_c2_512.jpg'
I think this is what you are looking for:
htmlLink = '<a href="20120620_0512_c2_1024.jpg">20120620_0512_c2_102..&gt;</a>';
link = regexprep(htmlLink, '<a href="([^"]*)">.*', '$1')
link =
20120620_0512_c2_1024.jpg
regexprep works also for cell arrays of strings, so this works too:
htmlLinksCellArray = { '<a href="20120620_0512_c2_1024.jpg">20120620_0512_c2_102..&gt;</a>', ...
                       '<a href="20120620_0512_c2_1025.jpg">20120620_0512_c2_102..&gt;</a>', ...
                       '<a href="20120620_0512_c2_1026.jpg">20120620_0512_c2_102..&gt;</a>' };
linksCellArray = regexprep(htmlLinksCellArray, '<a href="([^"]*)">.*', '$1')
linksCellArray =
'20120620_0512_c2_1024.jpg' '20120620_0512_c2_1025.jpg' '20120620_0512_c2_1026.jpg'