Layout Parser: lp.draw_box not working in for loop - layout-parser

I am trying to visualize pages after passing them through Layout Parser using lp.draw_box. It works fine for a single image, but when I loop over a set of images and call it for each one, nothing is displayed. Could someone please help me here? Thanks.
Code:
from pdf2image import convert_from_path, convert_from_bytes
import cv2, time
start = time.time()
images = convert_from_path(f"/content/neft-system-faq.pdf")
page_layout_map = {}
for idx, image in enumerate(images):
    image.save('out.png', 'PNG')
    image = cv2.imread("out.png")
    image = image[..., ::-1]
    layout = model.detect(image)
    lp.draw_box(image, layout, None, show_element_id=True)
    text_blocks = lp.Layout([b for b in layout])
    print(text_blocks)
    block_test = get_coordinates(text_blocks)
    result_arr = layout_map_processor(block_test)
    page_layout_map[idx] = result_arr
print(time.time()-start)
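One likely cause, assuming the loop runs in a notebook such as Colab (the /content/ path suggests this): lp.draw_box returns the rendered image rather than displaying it, and a notebook only auto-displays the value of the last expression in a cell. Inside a for loop the return value is simply discarded. Below is a minimal sketch of keeping the rendered pages instead; render_pages, detect, and draw are hypothetical names, with detect and draw standing in for model.detect and lp.draw_box.

```python
def render_pages(images, detect, draw):
    """Return one rendered image per page instead of discarding draw()'s result.

    detect and draw are placeholders (assumptions) for model.detect and
    lp.draw_box; any callables with the same shape will work.
    """
    rendered = []
    for image in images:
        layout = detect(image)
        # Keep the return value: this is what a notebook would otherwise
        # only show when the call is the last expression in a cell.
        rendered.append(draw(image, layout))
    return rendered
```

Each returned image can then be shown explicitly, for example with IPython's display(page) inside a loop, or written to disk with page.save(...) if it is a PIL image.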

Related

How to scrape only texts from specific HTML elements?

I have a problem with selecting the appropriate items from the list.
For example - I want to omit "1." then the first "5" (as in the example)
Additionally, I would like to write a condition that the letter "W" should be changed to "WIN".
import re
from selenium import webdriver
from bs4 import BeautifulSoup as BS2
from time import sleep
driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
sleep(10)
page = driver.page_source
soup = BS2(page,'html.parser')
content = soup.find('div',{'class':'ui-table__body'})
content_list = content.find_all('span',{"table__cell table__cell--value"})
res = []
for i in content:
    line = i.text.split()[0]
    if re.search('Ajax', line):
        res.append(line)
print(res)
results
['1.Ajax550016:315?WWWWW']
I need
Ajax;5;5;0;16;3;W;W;W;W;W
I would recommend selecting your elements more specifically:
for e in soup.select('.ui-table__row'):
Iterate over the ResultSet and decompose() the unwanted tags:
e.select_one('.wld--tbd').decompose()
Extract texts with stripped_strings and join() them to your expected string:
data.append(';'.join(e.stripped_strings))
Example
This also makes some replacements based on a dict, just to demonstrate how it would work; I do not know what R or P should map to.
...
soup = BS2(page,'html.parser')
data = []
for e in soup.select('.ui-table__row'):
    e.select_one('.wld--tbd').decompose()
    e.select_one('.tableCellRank').decompose()
    e.select_one('.table__cell--points').decompose()
    e.select_one('.table__cell--score').string = ';'.join(e.select_one('.table__cell--score').text.split(':'))
    pattern = {'W':'WIN','R':'RRR','P':'PPP'}
    data.append(';'.join([pattern.get(i,i) for i in e.stripped_strings]))
data
To only get result for Ajax:
data = []
for e in soup.select('.ui-table__row:-soup-contains("Ajax")'):
    e.select_one('.wld--tbd').decompose()
    e.select_one('.tableCellRank').decompose()
    e.select_one('.table__cell--points').decompose()
    e.select_one('.table__cell--score').string = ';'.join(e.select_one('.table__cell--score').text.split(':'))
    pattern = {'W':'WIN','R':'RRR','P':'PPP'}
    data.append(';'.join([pattern.get(i,i) for i in e.stripped_strings]))
data
Output
Based on the actual data, it may differ from the question's example.
['Ajax;6;6;0;0;21;3;WIN;WIN;WIN;WIN;WIN']
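The split-and-replace step can also be looked at separately from the scraping. Here is a minimal sketch that works on plain strings; format_row and REPLACEMENTS are hypothetical names, and only the 'W' to 'WIN' mapping comes from the question.

```python
# Hypothetical mapping; only 'W' -> 'WIN' is taken from the question.
REPLACEMENTS = {"W": "WIN"}

def format_row(cells, replacements=REPLACEMENTS):
    """Join table cells with ';', splitting 'a:b' scores and mapping codes."""
    fields = []
    for cell in cells:
        # A score such as '16:3' becomes two separate fields.
        for part in cell.split(":"):
            fields.append(replacements.get(part, part))
    return ";".join(fields)
```

With BeautifulSoup, the cells could come from list(e.stripped_strings) for a row e, after the unwanted tags have been removed.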
You had the right start by using bs4 to find the table div, but then you gave up and tried to use re to extract from the text. As you can see, that is not going to work. Here is a simple hack to get what you want: I keep grabbing divs from the table div you found, and then grab the text of the next eight divs after finding Ajax. Then I do some dirty string manipulation, because the WWWWW is all in the same top-level div.
import re
from selenium import webdriver
from bs4 import BeautifulSoup as BS2
from time import sleep
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
#driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
driver.implicitly_wait(10)
page = driver.page_source
soup = BS2(page,'html.parser')
content = soup.find('div',{'class':'ui-table__body'})
content_list = content.find_all('span',{"table__cell table__cell--value"})
res = []
found = 0
for i in content.find('div'):
    line = i.text.split()[0]
    if re.search('Ajax', line):
        found = 8
    if found:
        found -= 1
        res.append(line)
# change field 5 into separate values and skip field 6
res = res[:4] +res[5].split(':') + res[7:]
# break the last field into separate values and drop the first '?'
res = res[:-1] + [ i for i in res[-1]][1:]
print(";".join(res))
returns
Ajax;5;5;0;16;3;W;W;W;W;W
This works, but it is very brittle and will break as soon as the website changes its content, so you should add a lot of error checking. I also replaced the sleep with a wait call and added ChromeDriverManager (from webdriver_manager), which allows me to use Selenium with Chrome.

In Gtk, how to make a window smaller when creating

I am trying to display both an image and a box with an Entry widget. I can do that, but the window is so large that the widget at the bottom is mostly out of view. I have tried several calls to set the window's size or unmaximize it, but they seem to have no effect. I determined that the problem only occurs when the image is large, but I still wonder how to display a large image in a resizable window or, for that matter, how to make any changes to the window's geometry from code. All the function calls I tried seem to have no effect.
Here is my code:
import gi
gi.require_version("Gtk", "3.0")
from gi.repository import Gtk
from gi.repository import GdkPixbuf
from urllib.request import urlopen
class Display(object):
    def __init__(self):
        self.window = Gtk.Window()
        self.window.connect('destroy', self.destroy)
        self.window.set_border_width(10)
        # a box underneath would be added every time you do
        # vbox.pack_start(new_widget)
        vbox = Gtk.VBox()
        self.image = Gtk.Image()
        response = urlopen('http://1.bp.blogspot.com/-e-rzcjuCpk8/T3H-mSry7PI/AAAAAAAAOrc/Z3XrqSQNrSA/s1600/rubberDuck.jpg').read()
        pbuf = GdkPixbuf.PixbufLoader()
        pbuf.write(response)
        pbuf.close()
        self.image.set_from_pixbuf(pbuf.get_pixbuf())
        self.window.add(vbox)
        vbox.pack_start(self.image, False, False, 0)
        self.entry = Gtk.Entry()
        vbox.pack_start(self.entry, True,True, 0)
        self.image.show()
        self.window.show_all()
    def main(self):
        Gtk.main()
    def destroy(self, widget, data=None):
        Gtk.main_quit()
a=Display()
a.main()
Most of the posted information seems to pertain to Gtk2 rather than Gtk3, but there is a solution: use a PixbufLoader and set the size:
from gi.repository import Gtk, Gdk, GdkPixbuf
#more stuff
path = config['DEFAULT']['datasets']+'working.png'
with open(path,'rb') as f:
    pixels = f.read()
loader = GdkPixbuf.PixbufLoader()
loader.set_size(400,400)
loader.write(pixels)
# close() signals the end of the data, so the pixbuf is complete before it is read
loader.close()
pb = GdkPixbuf.Pixbuf.new_from_file(path)
self.main_image.set_from_pixbuf(loader.get_pixbuf())
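A fixed set_size(400,400) may distort an image that is not square. The following is a minimal sketch of computing a target size that keeps the aspect ratio; fit_within is a hypothetical helper, not part of Gtk. One way to use it, as an assumption about how it would be wired in, is from a handler for the loader's "size-prepared" signal, which reports the original width and height before decoding.

```python
def fit_within(width, height, max_width, max_height):
    """Return (w, h) scaled to fit the bounding box, preserving aspect ratio.

    Images that already fit are returned unchanged (never scaled up).
    """
    scale = min(max_width / width, max_height / height, 1.0)
    # Clamp to at least one pixel so extreme ratios stay valid sizes.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 1600x1200 image bounded by 400x400 would be loaded at 400x300, and the result could be passed to loader.set_size.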

How can I download link from YahooFinance in BeautifulSoup?

Currently I'm trying to automatically scrape/download Yahoo Finance historical data. I plan to download the data using the download link provided on the website.
My code lists all the available links and works from there; the problem is that the exact link doesn't appear in the result. Here is my code (partial):
def scrape_page(url, header):
    page = requests.get(url, headers=header)
    if page.status_code == 200:
        soup = bs.BeautifulSoup(page.content, 'html.parser')
        return soup
    return None
if __name__ == '__main__':
    symbol = 'GOOGL'
    dt_start = datetime.today() - timedelta(days=(365*5+1))
    dt_end = datetime.today()
    start = format_date(dt_start)
    end = format_date(dt_end)
    sub = subdomain(symbol, start, end)
    header = header_function(sub)
    base_url = 'https://finance.yahoo.com'
    url = base_url + sub
    soup = scrape_page(url, header)
    result = soup.find_all('a')
    for a in result:
        print('URL :',a['href'])
UPDATE 10/9/2020 :
I managed to find the span which is the parent for the link with this code
spans = soup.find_all('span',{"class":"Fl(end) Pos(r) T(-6px)"})
However, when I print it out, it does not show the link, here is the output:
>>> spans
[<span class="Fl(end) Pos(r) T(-6px)" data-reactid="31"></span>]
To download the historical data in CSV format from Yahoo Finance, you can use this example:
import requests
from datetime import datetime
csv_link = 'https://query1.finance.yahoo.com/v7/finance/download/{quote}?period1={from_}&period2={to_}&interval=1d&events=history'
quote = 'GOOGL'
from_ = str(datetime.timestamp(datetime(2019,9,27,0,0))).split('.')[0]
to_ = str(datetime.timestamp(datetime(2020,9,27,23,59))).split('.')[0]
print(requests.get(csv_link.format(quote=quote, from_=from_, to_=to_)).text)
Prints:
Date,Open,High,Low,Close,Adj Close,Volume
2019-09-27,1242.829956,1244.989990,1215.199951,1225.949951,1225.949951,1706100
2019-09-30,1220.599976,1227.410034,1213.420044,1221.140015,1221.140015,1223500
2019-10-01,1222.489990,1232.859985,1205.550049,1206.000000,1206.000000,1225200
2019-10-02,1196.500000,1198.760010,1172.630005,1177.920044,1177.920044,1651500
2019-10-03,1183.339966,1191.000000,1163.140015,1189.430054,1189.430054,1418400
2019-10-04,1194.290039,1212.459961,1190.969971,1210.959961,1210.959961,1214100
2019-10-07,1207.000000,1218.910034,1204.359985,1208.250000,1208.250000,852000
2019-10-08,1198.770020,1206.869995,1189.479980,1190.130005,1190.130005,1004300
2019-10-09,1201.329956,1208.459961,1198.119995,1202.400024,1202.400024,797400
2019-10-10,1198.599976,1215.619995,1197.859985,1209.469971,1209.469971,642100
2019-10-11,1224.030029,1228.750000,1213.640015,1215.709961,1215.709961,1116500
2019-10-14,1213.890015,1225.880005,1211.880005,1217.770020,1217.770020,664800
2019-10-15,1221.500000,1247.130005,1220.920044,1242.239990,1242.239990,1379200
2019-10-16,1241.810059,1254.189941,1238.530029,1243.000000,1243.000000,1149300
2019-10-17,1251.400024,1263.750000,1249.869995,1252.800049,1252.800049,1047900
2019-10-18,1254.689941,1258.109985,1240.140015,1244.410034,1244.410034,1581200
2019-10-21,1248.699951,1253.510010,1239.989990,1244.280029,1244.280029,904700
2019-10-22,1244.479980,1248.729980,1239.849976,1241.199951,1241.199951,1143100
2019-10-23,1240.209961,1258.040039,1240.209961,1257.630005,1257.630005,1064100
2019-10-24,1259.109985,1262.900024,1252.349976,1259.109985,1259.109985,1011200
...and so on.
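The URL construction in this answer can be sketched as a small helper. build_csv_url is a hypothetical name, and whether the endpoint still answers plain unauthenticated requests is an assumption. Unlike the answer's naive local datetimes, the sketch expects timezone-aware datetimes so that the timestamps do not depend on the machine's time zone.

```python
from datetime import datetime, timezone
from urllib.parse import quote, urlencode

# Endpoint taken from the answer above; its current availability is not verified.
BASE = "https://query1.finance.yahoo.com/v7/finance/download/"

def build_csv_url(symbol, start, end, interval="1d"):
    """Build the historical-data CSV URL for symbol between two datetimes."""
    params = {
        "period1": int(start.timestamp()),  # whole seconds since the epoch
        "period2": int(end.timestamp()),
        "interval": interval,
        "events": "history",
    }
    return BASE + quote(symbol) + "?" + urlencode(params)

if __name__ == "__main__":
    print(build_csv_url("GOOGL",
                        datetime(2019, 9, 27, tzinfo=timezone.utc),
                        datetime(2020, 9, 27, 23, 59, tzinfo=timezone.utc)))
```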
I figured it out. That link is generated by JavaScript, and requests.get() only returns the initial HTML, not dynamically generated content. I switched to Selenium to download that link.

Python Full Web Parsing

As of right now I'm attempting to make a simple music player app that streams music or video directly from a Youtube URL, and in order to do that I need the full download of the search page that's used to search for videos to stream. But I'm having some problems with the urlopen module in python 3, which is what I'm using to make the command application. It won't load the ytd-app tag on Youtube, which is what a good deal of the video and playlist references are put on when you first load the search. Anyone know what's going on, or know some type of workaround for it? Thanks!
My code so far:
BASICURL = "https://www.youtube.com/results?"
query = query.split()
ret = ""
stufffound = {}
for x in query:
    ret = ret + x + "+"
ret = (ret[:len(ret)-1])
# URL BUILDER
if filtercriteria:
    URL = BASICURL + "sp={0}".format(filtercriteria) + "&search_query={0}".format(ret)
else:
    URL = BASICURL + "search_query={0}".format(ret)
query = urlopen(str(URL))
passdict = {}
def findvideosonpage(query,dictToAddTo):
    # read the response first, then hand the HTML to BeautifulSoup
    soup = BS(urlopen(query).read(), 'html.parser')
    for index, x in enumerate(soup.findAll(attrs={'class':'yt-simple-endpoint style-scope ytd-video-renderer'})):
        dictToAddTo[index] = x['href']
        print(x)
    return list([x for _,x in sorted(zip(dictToAddTo.values(), dictToAddTo.keys()))])
# Dictionary is meant to be converted into a list later to order the results
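Two notes on the code above, offered as a sketch rather than a fix. First, the ytd-app element is filled in by JavaScript after the page loads, so urlopen only ever sees the initial HTML. A browser-driving tool such as Selenium is one possible workaround. Second, the manual "+" joining can be replaced by urllib.parse.urlencode, which also escapes special characters; build_search_url is a hypothetical name.

```python
from urllib.parse import urlencode

BASIC_URL = "https://www.youtube.com/results?"

def build_search_url(query, filter_criteria=None):
    """Build the search URL; urlencode turns spaces into '+' and escapes the rest."""
    params = {}
    if filter_criteria:
        params["sp"] = filter_criteria  # optional filter, as in the question
    params["search_query"] = query
    return BASIC_URL + urlencode(params)
```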

How to get the size (in pixels) of a jpeg image I get with UrlFetchApp.fetch(photoLink)?

In a script that sends email in HTML format I add an image that is stored in a public shared folder.
I get the blob using UrlFetchApp.fetch(photoLink) but the image has not necessarily the right size so in the html code I use width and height attributes (for now with fixed values, see code below) but I'd like it to be automatically resized with the right ratio.
To achieve that I need to know how to get the original size of the image (height and width) but I just don't know how to get it without inserting the image in an intermediate document (which would work but I find this approach a bit weird and unnecessarily complicated... moreover I don't feel like having a useless doc appearing each time I change the image file).
Here is the relevant part of the code that creates the email message :
function sendMail(test,rowData,genTitle,day,title,stHour,endHour){
  var photoLink = sh.getRange('H1').getValue();
  var image = UrlFetchApp.fetch(photoLink);
  //************* find the pixel size of the image to get its ratio
  var msgTemplate = '<IMG SRC="'+photoLink+'" BORDER=0 ALT="logo" HEIGHT=200 WIDTH=300><BR><BR>'+
    'Résumé de vos réservations au nom de <NOM><BR><BR><BR><table style="background-color:lightblue;border-collapse:collapse;" border = 1 cellpadding = 5><th></th><th><TABLEHEADER></th><EVENTS></table><BR><CONCLUSION><BR>Cordialement,<BR><BR>';
  var mailTitle = 'Confirmation de réservation - '+getTextFromHtml(genTitle);
  var descr = '';
  for(var d = 0;d<day.length;++d){
    Logger.log(Number(rowData[(d+5)]));
    var content = '<tr bgcolor="#ffffbb" width="100%"><td><NUMBER> </td><td > <DESCRIPTION></td></tr>'
    if(Number(rowData[(d+5)])>1){var pl = ' places'}else{var pl = ' place'};
    content = content.replace('<NUMBER>',rowData[(d+5)]+pl);
    content = content.replace('<DESCRIPTION>',title[d]+' de '+stHour[d]+' heures à '+endHour[d]+' heures');
    if(Number(rowData[(d+5)])>0){
      descr += content;
    }
  }
  msgTemplate = msgTemplate.replace('<NOM>',rowData[1]).replace('<EVENTS>',descr).replace('<TABLEHEADER>',genTitle);
  var textVersion = getTextFromHtml(msgTemplate.replace(/<br>/gi,'\n').replace(/<td>/gi,'\n'));
  // Logger.log(textVersion)
  if(test){
    MailApp.sendEmail(Session.getEffectiveUser().getEmail(),mailTitle, textVersion,{'htmlBody':msgTemplate,"replyTo" : retour});
  }
  else
  {
    MailApp.sendEmail(rowData[2],mailTitle, textVersion,{'htmlBody':msgTemplate,"replyTo" : retour});
  }
}
There is no easy way within Apps Script to figure out what an image size would be. There are some other projects that might be able to analyze the bitmap data and give you dimensions.
The last time I had to solve this problem, I just wrote a simple App Engine app to do the image math for me -
import webapp2
from google.appengine.api import urlfetch
from google.appengine.api import images
from django.utils import simplejson
class MainHandler(webapp2.RequestHandler):
    def get(self):
        url = self.request.get('url')
        imgResp = urlfetch.fetch(url) #eg. querystring - url=http://xyz.com/img.jpg
        if imgResp.status_code == 200:
            img = images.Image(imgResp.content)
            jsonResp = {"url":url, "h":img.height, "w":img.width, "format":img.format}
            self.response.headers['Content-Type'] = 'application/json'
            self.response.out.write(simplejson.dumps(jsonResp))
app = webapp2.WSGIApplication([('/imageinfo', MainHandler)], debug=True)
And then I call it from Apps Script like this -
function checkImageSizes() {
  var imageUrls = ['http://developers.google.com/apps-script/images/carousel0.png','http://www.w3.org/MarkUp/Test/xhtml-print/20050519/tests/jpeg420exif.jpg'];
  for(var i in imageUrls){
    var resp = JSON.parse(UrlFetchApp.fetch('http://arunimageinfo.appspot.com/imageinfo?url='+imageUrls[i]).getContentText());
    Logger.log('Image at %s is %s x %s',resp.url,resp.w,resp.h);
  }
}
You are welcome to use my App Engine instance if your volume is a couple of times a week :)
I doubt you can do this in Apps Script, certainly not natively, but you might be able to find or adapt a JPEG library that looks at the binary blob header and extracts the image size.
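As an illustration of the header-parsing idea, here is a minimal sketch in Python (not Apps Script) that walks the JPEG segments until it reaches a start-of-frame marker, which stores the height and width. jpeg_size is a hypothetical name, and the sketch assumes a well-formed file. A port to Apps Script would need to read the same bytes from the fetched blob.

```python
import struct

# Start-of-frame markers carry the dimensions. 0xC4 (DHT), 0xC8 (JPG) and
# 0xCC (DAC) fall in the same numeric range but are not frame headers.
SOF_MARKERS = {0xC0, 0xC1, 0xC2, 0xC3, 0xC5, 0xC6, 0xC7,
               0xC9, 0xCA, 0xCB, 0xCD, 0xCE, 0xCF}

def jpeg_size(data):
    """Return (width, height) from JPEG bytes, or None if no frame header is found."""
    if data[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        return None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return None  # not positioned on a marker: give up
        marker = data[pos + 1]
        if marker == 0xFF:  # padding byte before a marker
            pos += 1
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in SOF_MARKERS:
            if pos + 9 > len(data):
                return None
            # Layout: marker(2) length(2) precision(1) height(2) width(2)
            height, width = struct.unpack(">HH", data[pos + 5:pos + 9])
            return width, height
        pos += 2 + length  # skip this segment (length includes its own 2 bytes)
    return None
```

The ratio for the HTML attributes then follows directly, for example height = round(300 * h / w) for a fixed width of 300.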