Currently I'm trying to automatically scrape/download Yahoo Finance historical data. I plan to download the data using the download link provided on the website.
My approach is to list all the available links and work from there; the problem is that the exact link doesn't appear in the result. Here is my code (partial):
import requests
import bs4 as bs
from datetime import datetime, timedelta

def scrape_page(url, header):
    page = requests.get(url, headers=header)
    if page.status_code == 200:
        soup = bs.BeautifulSoup(page.content, 'html.parser')
        return soup
    return None
if __name__ == '__main__':
    symbol = 'GOOGL'
    dt_start = datetime.today() - timedelta(days=(365*5+1))
    dt_end = datetime.today()
    start = format_date(dt_start)
    end = format_date(dt_end)
    sub = subdomain(symbol, start, end)
    header = header_function(sub)
    base_url = 'https://finance.yahoo.com'
    url = base_url + sub
    soup = scrape_page(url, header)
    result = soup.find_all('a')
    for a in result:
        print('URL :', a['href'])
UPDATE 10/9/2020:
I managed to find the span that is the parent of the link with this code:
spans = soup.find_all('span',{"class":"Fl(end) Pos(r) T(-6px)"})
However, when I print it out, it does not show the link. Here is the output:
>>> spans
[<span class="Fl(end) Pos(r) T(-6px)" data-reactid="31"></span>]
To download the historical data in CSV format from Yahoo Finance, you can use this example:
import requests
from datetime import datetime

# period1/period2 are Unix timestamps in seconds; interval=1d gives daily rows
csv_link = 'https://query1.finance.yahoo.com/v7/finance/download/{quote}?period1={from_}&period2={to_}&interval=1d&events=history'

quote = 'GOOGL'
from_ = str(datetime.timestamp(datetime(2019, 9, 27, 0, 0))).split('.')[0]
to_ = str(datetime.timestamp(datetime(2020, 9, 27, 23, 59))).split('.')[0]

print(requests.get(csv_link.format(quote=quote, from_=from_, to_=to_)).text)
Prints:
Date,Open,High,Low,Close,Adj Close,Volume
2019-09-27,1242.829956,1244.989990,1215.199951,1225.949951,1225.949951,1706100
2019-09-30,1220.599976,1227.410034,1213.420044,1221.140015,1221.140015,1223500
2019-10-01,1222.489990,1232.859985,1205.550049,1206.000000,1206.000000,1225200
2019-10-02,1196.500000,1198.760010,1172.630005,1177.920044,1177.920044,1651500
2019-10-03,1183.339966,1191.000000,1163.140015,1189.430054,1189.430054,1418400
2019-10-04,1194.290039,1212.459961,1190.969971,1210.959961,1210.959961,1214100
2019-10-07,1207.000000,1218.910034,1204.359985,1208.250000,1208.250000,852000
2019-10-08,1198.770020,1206.869995,1189.479980,1190.130005,1190.130005,1004300
2019-10-09,1201.329956,1208.459961,1198.119995,1202.400024,1202.400024,797400
2019-10-10,1198.599976,1215.619995,1197.859985,1209.469971,1209.469971,642100
2019-10-11,1224.030029,1228.750000,1213.640015,1215.709961,1215.709961,1116500
2019-10-14,1213.890015,1225.880005,1211.880005,1217.770020,1217.770020,664800
2019-10-15,1221.500000,1247.130005,1220.920044,1242.239990,1242.239990,1379200
2019-10-16,1241.810059,1254.189941,1238.530029,1243.000000,1243.000000,1149300
2019-10-17,1251.400024,1263.750000,1249.869995,1252.800049,1252.800049,1047900
2019-10-18,1254.689941,1258.109985,1240.140015,1244.410034,1244.410034,1581200
2019-10-21,1248.699951,1253.510010,1239.989990,1244.280029,1244.280029,904700
2019-10-22,1244.479980,1248.729980,1239.849976,1241.199951,1241.199951,1143100
2019-10-23,1240.209961,1258.040039,1240.209961,1257.630005,1257.630005,1064100
2019-10-24,1259.109985,1262.900024,1252.349976,1259.109985,1259.109985,1011200
...and so on.
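If you'd rather save the data or load it into a DataFrame than print it, the same response can be used directly. A minimal sketch, continuing from the variables above (pandas is only needed for the last two lines):

import pandas as pd
from io import StringIO

resp = requests.get(csv_link.format(quote=quote, from_=from_, to_=to_))
# Save the raw CSV to disk
with open('GOOGL.csv', 'w') as f:
    f.write(resp.text)
# Or load it straight into a DataFrame
df = pd.read_csv(StringIO(resp.text), parse_dates=['Date'])
print(df.head())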
I figured it out. That link is generated by JavaScript, and the requests.get() method won't work on dynamic content. I switched to Selenium to download from that link.
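For anyone landing here later, a minimal sketch of the Selenium route (the locator and the wait are assumptions; inspect the page to confirm the actual markup):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://finance.yahoo.com/quote/GOOGL/history')
# Wait for the JS-rendered download link to appear, then click it.
# The XPath below is a guess based on the download URL pattern.
link = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, "//a[contains(@href, '/v7/finance/download/')]"))
)
link.click()  # the browser saves the CSV to its default download directory
driver.quit()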
I'm having trouble collecting doctors from https://www.uchealth.org/providers/. I've found out it's a POST method, but with httr I can't seem to create the form. Here's what I have:
url = 'https://www.uchealth.org/providers/'
formA = list(title = 'Search', onClick = 'swapForms();', doctor-area-of-care-top = 'Cancer Care')
formB = list(Search = 'swapForms();', doctor-area-of-care = 'Cancer Care')
get = POST(url, body = formB, encode = 'form')
I'm fairly certain formB is the correct one. However, I can't test it, since I get an error when trying to make the list. I believe it is because you can't use "-" characters in names, although I could be wrong on that. Could somebody help please?
I am unable to comment properly, but try this to create the list. The code below worked for me.
library(httr)
url = 'https://www.uchealth.org/providers/'
formB = list(Search = 'swapForms();', `doctor-area-of-care` = 'Cancer Care')
get = POST(url, body = formB, encode = 'form')
When you create names containing spaces or other special characters, you have to wrap the name in backticks, as shown above.
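For what it's worth, the same form POST is straightforward in Python's requests, since hyphenated field names are just dictionary string keys. A sketch using the URL and field names from the question (I haven't verified what the endpoint returns):

import requests

url = 'https://www.uchealth.org/providers/'
# Hyphenated field names are ordinary string keys here
form = {'Search': 'swapForms();', 'doctor-area-of-care': 'Cancer Care'}
resp = requests.post(url, data=form)  # form-encoded body, like httr's encode = 'form'
print(resp.status_code)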
I'm looking for an effective way to add bounding boxes to live video displayed on a webpage.
The project is to detect cars in a video stream (1080p, 30 FPS) captured by a camera using its built-in RTSP function, and to display the results on a webpage.
My current solution:
The frame image is extracted using OpenCV, and I run the detection algorithm frame by frame in real time. Each frame is first compressed to JPEG and then encoded into a Base64 string. The Base64 image string, along with the bounding boxes, is encapsulated in JSON, and I transmit the JSON string to the frontend over a WebSocket.
At the frontend, I use a canvas to draw the frame image each time a message is received.
Now the problems are:
The compression process, JPEG (OpenCV imencode) + Base64 (base64.b64encode), takes too much time; the throughput is about 20 FPS (without detection). See the timing sketch below.
The image display in the browser is slow and not smooth; the framerate cannot even reach 20 FPS (without detection). The webpage also uses a lot of memory and may become unresponsive.
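To put a number on the first problem, here is a minimal sketch that times the encode path in isolation on a synthetic 1080p frame (assumes only OpenCV and NumPy; the camera feed is replaced by random data):

import time
import base64
import cv2
import numpy as np

# A synthetic 1080p BGR frame stands in for a captured one
frame = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)
n = 100
start = time.perf_counter()
for _ in range(n):
    jpg = cv2.imencode('.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, 75])[1].tobytes()
    b64 = base64.b64encode(jpg)
elapsed = time.perf_counter() - start
print(f'{n / elapsed:.1f} frames/s, {len(b64) / 1024:.0f} KiB per frame')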
I wonder if there is a solution to my problems (achieving 30 FPS), or whether there are other ways to achieve my goal (synchronizing the backend detection results with the frontend live video).
P.S.
The bounding boxes are not rendered onto the image at the backend because some operations may need to be done on the bounding boxes at the frontend, e.g., determining whether to show them or not.
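For reference, this is roughly what a combined payload could look like, so the frontend can decide how to draw the boxes (the box format here is an assumption, not something my current code produces):

import json

strEnc = '...'  # placeholder for the Base64 JPEG string produced by the pipeline below
# The 'boxes' format (pixel x/y/w/h plus a label) is hypothetical
msg = {
    'image': strEnc,
    'boxes': [{'x': 100, 'y': 200, 'w': 80, 'h': 40, 'label': 'car'}],
}
msgJson = json.dumps(msg)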
EDIT: The code in Python (without detection) at the backend looks like this:
Main.py
from multiprocessing import Process, Queue
import base64
import json
import cv2
import WSServer

def capture(path, outq):
    cap = cv2.VideoCapture(path)
    while True:
        ret, image = cap.read()
        outq.put(image)

def conv2jpg(inq, outq):
    while True:
        image = inq.get()
        imgEnc = cv2.imencode('.jpg', image, [cv2.IMWRITE_JPEG_QUALITY, 75])[1].tobytes()
        outq.put(imgEnc)

def conv2b64(inq, outq):
    while True:
        imgEnc = inq.get()
        strEnc = str(base64.b64encode(imgEnc), 'utf-8')
        outq.put(strEnc)

if __name__ == '__main__':
    # After arg parsing
    ws = WSServer.WSServer(args.ip, args.port)
    capQ = Queue(2)
    jpgQ = Queue(2)
    b64Q = Queue(2)
    capP = Process(target=capture, args=(args.input, capQ, ))
    # args.input is the rtsp address of the camera
    jpgP = Process(target=conv2jpg, args=(capQ, jpgQ, ))
    b64P = Process(target=conv2b64, args=(jpgQ, b64Q, ))
    b64P.start()
    jpgP.start()
    capP.start()
    while True:
        strEnc = b64Q.get()
        msg = {'image': strEnc}
        msgJson = json.dumps(msg)
        # Send the JSON string over the WebSocket
        USERS = ws.getUSERS()
        for user in USERS:
            user.put(msgJson)
WSServer.py
import threading
from queue import Queue
import asyncio
import websockets

class WSServer:
    def __init__(self, ip="localhost", port="11000", interval=1.0/60):
        self.ip = ip
        self.port = port
        self.interval = interval
        self.USERS = set()
        self.USERS_Lock = threading.Lock()
        self.P = threading.Thread(target=self.proc)
        self.P.daemon = True
        self.P.start()

    async def get(self, q):
        if q.empty():
            return None
        return q.get()

    async def register(self, q):
        with self.USERS_Lock:
            self.USERS.add(q)

    async def unregister(self, q):
        with self.USERS_Lock:
            self.USERS.remove(q)

    async def serve(self, websocket, path):
        q = Queue()
        await self.register(q)
        try:
            while True:
                while True:
                    msg = await self.get(q)
                    if msg is None:
                        break
                    await websocket.send(msg)
                await asyncio.sleep(self.interval)
        finally:
            await self.unregister(q)

    def getUSERS(self):
        with self.USERS_Lock:
            return self.USERS.copy()

    def proc(self):
        print('Listening...')
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        loop.run_until_complete(websockets.serve(self.serve, self.ip, self.port))
        loop.run_forever()
EDIT2: The simplified code at the frontend:
<!DOCTYPE html>
<html>
<head>
    <title>demo</title>
</head>
<body>
    <img id="myImg" style="display: none;">
    <canvas id="myCanvas"></canvas>
    <script>
        var img = document.getElementById('myImg')
        var cvs = document.getElementById('myCanvas')
        cvs.width = 1920
        cvs.height = 1080
        var ctx = cvs.getContext('2d')
        var ws = new WebSocket("ws://127.0.0.1:11000/");
        ws.onmessage = function (event) {
            var msg = JSON.parse(event.data)
            img.src = 'data:image/jpg;base64,' + msg.image
        };
        function render () {
            ctx.drawImage(img, 0, 0)
            window.requestAnimationFrame(render)
        }
        window.requestAnimationFrame(render)
    </script>
</body>
</html>
Below is my web scraping code for a website; it submits a form which redirects to a page. From that page I need to extract the [img] src URL and export it to CSV as text. I used the code below to extract text content from a td tag, but here it doesn't work, because the td tag has no text content, only an img tag. Any help will be appreciated. I am new to web scraping. Thanks in advance.
browser.find_element_by_css_selector(".textinput[value='APPLY']").click()
#select_finder = "//tr[contains(text(), 'NB')]//a"
select_finder = "//td[text()='NB']/../td[2]/a"
browser.find_element_by_css_selector(".content a").click()
assert "Application Details" in browser.title
file_data = []
try:
    assert "Application Details" in browser.title
    enlargement = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[3]/td[2]/b").text
    enlargement_answer1 = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[4]/td[2]").text
    enlargement_answer2 = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[4]/td[3]").text
    enlargement_text = enlargement + enlargement_answer1 + enlargement_answer2
    considerations = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[2]/b").text
    considerations_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[3]").text
    considerations_text = considerations + considerations_answer
    alteration = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[6]/b").text
    alteration_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[7]").text
    alteration_text = alteration + alteration_answer
    units = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[5]/td[3]/b").text
    units_answer = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[5]/td[4]").text
    units_text = units + units_answer
    occupancy = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[6]/td[3]/b").text
    occupancy_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[6]/td[4]").text
    occupancy_text = occupancy + occupancy_answer
    coo = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[7]/td[3]/b").text
    coo_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[7]/td[4]").text
    coo_text = coo + coo_answer
    floors = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[8]/td[3]/b").text
    floors_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[8]/td[4]").text
    floors_text = floors + floors_answer
except (NoSuchElementException, AssertionError) as e:
    # these are strings, so assign fallbacks rather than append
    floors_text = "No Zoning Characteristics Present"
    coo_text = "n/a"
    occupancy_text = "n/a"
    units_text = "n/a"
    alteration_text = "n/a"
    considerations_text = "n/a"
    enlargement_text = "n/a"
with open('DOB.csv', 'a') as f:
    wr = csv.writer(f, dialect='excel')
    wr.writerow((block_number, lot_number, houseno, street, condo_text,
                 vacant_text, city_owned_text, file_data, floors_text, coo_text, occupancy_text, units_text, alteration_text,
                 considerations_text, enlargement_text))
browser.close()
As you stated you are new to web scraping, I encourage you to read up a bit: http://selenium-python.readthedocs.io/locating-elements.html
You are using XPath exclusively and in ways that are not recommended.
From the docs: "You can use XPath to either locate the element in absolute terms (not advised), or relative to an element that does have an id or name attribute."
Try using other locators to get your image.
For example: driver.find_element_by_css_selector("img[src='images/box_check.gif']")
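As a rough sketch of how that could feed into your CSV export (the selector and the fallback value are assumptions; adjust them to the actual page):

import csv
from selenium.common.exceptions import NoSuchElementException

try:
    # Grab the img inside the td and keep its src URL as text
    img = browser.find_element_by_css_selector("td img")
    img_src = img.get_attribute("src")
except NoSuchElementException:
    img_src = "n/a"

with open('DOB.csv', 'a') as f:
    wr = csv.writer(f, dialect='excel')
    wr.writerow((img_src,))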
How should one do image file uploads with Pyramid, SQLAlchemy and deform? Preferably so that one can easily get image thumbnail tags in the templates. What configuration is needed (store images on the file system backend, and so on)?
This question is by no means asking one specific thing. Here, however, is a view which defines a form upload with deform, tests the input for a valid image file, saves a record to the database, and then even uploads the image to Amazon S3. This example is shown below the links to the various documentation I have referenced.
To upload a file with deform, see the documentation.
If you want to learn how to save image files to disk, see this article and the official documentation.
Then, if you want to learn how to save new items with SQLAlchemy, see the SQLAlchemy tutorial.
If you want to ask a better question where a more precise, detailed answer can be given for each section, then please do so.
@view_config(route_name='add_portfolio_item',
             renderer='templates/user_settings/deform.jinja2',
             permission='view')
def add_portfolio_item(request):
    user = request.user
    # define store for uploaded files
    class Store(dict):
        def preview_url(self, name):
            return ""
    store = Store()
    # create a form schema
    class PortfolioSchema(colander.MappingSchema):
        description = colander.SchemaNode(colander.String(),
            validator=Length(max=300),
            widget=text_area,
            title="Description, tell us a few short words describing your picture")
        upload = colander.SchemaNode(
            deform.FileData(),
            widget=widget.FileUploadWidget(store))
    schema = PortfolioSchema()
    myform = Form(schema, buttons=('submit',), action=request.url)
    # if the form has been submitted
    if 'submit' in request.POST:
        controls = request.POST.items()
        try:
            appstruct = myform.validate(controls)
        except ValidationFailure as e:
            return {'form': e.render(), 'values': False}
        # Data is valid as far as colander goes
        f = appstruct['upload']
        upload_filename = f['filename']
        extension = os.path.splitext(upload_filename)[1]
        image_file = f['fp']
        # Now we test for a valid image upload
        image_test = imghdr.what(image_file)
        if image_test is None:
            error_message = "I'm sorry, the image file seems to be invalid"
            return {'form': myform.render(), 'values': False, 'error_message': error_message,
                    'user': user}
        # generate date and random timestamp
        pub_date = datetime.datetime.now()
        random_n = str(time.time())
        filename = random_n + '-' + user.user_name + extension
        upload_dir = tmp_dir
        output_file = open(os.path.join(upload_dir, filename), 'wb')
        image_file.seek(0)
        while 1:
            data = image_file.read(2<<16)
            if not data:
                break
            output_file.write(data)
        output_file.close()
        # we need to create a thumbnail for the user's profile pic
        basewidth = 530
        max_height = 200
        # open the image we just saved
        root_location = open(os.path.join(upload_dir, filename), 'rb')
        image = pilImage.open(root_location)
        if image.size[0] > basewidth:
            # if the image width is greater than 530px
            # we need to reduce its size
            wpercent = (basewidth / float(image.size[0]))
            hsize = int((float(image.size[1]) * float(wpercent)))
            portfolio_pic = image.resize((basewidth, hsize), pilImage.ANTIALIAS)
        else:
            # else the image can stay the same size as it is;
            # assign the portfolio_pic var to the image
            portfolio_pic = image
        portfolio_pics_dir = os.path.join(upload_dir, 'work')
        quality_val = 90
        output_file = open(os.path.join(portfolio_pics_dir, filename), 'wb')
        portfolio_pic.save(output_file, quality=quality_val)
        profile_full_loc = portfolio_pics_dir + '/' + filename
        # S3 stuff
        new_key = user.user_name + '/portfolio_pics/' + filename
        key = bucket.new_key(new_key)
        key.set_contents_from_filename(profile_full_loc)
        key.set_acl('public-read')
        public_url = key.generate_url(0, query_auth=False, force_http=True)
        output_dir = os.path.join(upload_dir)
        output_file = output_dir + '/' + filename
        os.remove(output_file)
        os.remove(profile_full_loc)
        new_image = Image(s3_key=new_key, public_url=public_url,
                          pub_date=pub_date, bucket=bucket_name, uid=user.id,
                          description=appstruct['description'])
        DBSession.add(new_image)
        # add the new entry to the association table
        user.portfolio.append(new_image)
        return HTTPFound(location=route_url('list_portfolio', request))
    return dict(user=user, form=myform.render())