I want to add exception handling to my code. I'm scraping data from Transfermarkt. If you look at the attached picture, you will see that in season 10/11 there's a missing entry. BeautifulSoup doesn't find a match there and just skips it. I've written some code that checks the total length at the end and pads the list with 'MISSING'. Unfortunately, I can only append this at the end of a page. Thus, if there is a missing entry in the middle of the table, I have to move it manually. The problem is that the year/season values no longer line up after such a missing entry.
Can this be done with selenium?
Relevant part of my code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.transfermarkt.de/pep-guardiola/erfolge/trainer/5672'
headers = {'Host': 'www.transfermarkt.de',
'Referer': 'https://www.transfermarkt.de/manuel-neuer/erfolge/spieler/17259',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
pageTree = requests.get(url, headers=headers)
soup = BeautifulSoup(pageTree.content, 'lxml')
for title in soup.select('.box td.hauptlink > a'):
    list5.append(str('xhttps://www.transfermarkt.de') + str(title['href']))
    for titlelink in title.find_all('img'):
        list4.append(str(titlelink['alt']))
missingentries = len(list3) - len(list4)
for x in range(0, missingentries):
    list4.append(str('MISSING'))
missinglinks = len(list4) - len(list5)
for x in range(0, missinglinks):
    list5.append(str('MISSING'))
My output:
Output I want:
Any help is appreciated!
The problem seems to be that you are parsing each of these related elements in isolation. Once they have been stored in individual lists, you cannot find the index of the missing element.
What you can do instead is fetch these elements together, row by row, and insert 'MISSING' into the list at that point. You can then use an exception to catch the missing element. I would have preferred to store these in a list of lists instead of a separate list for each element.
import requests
from bs4 import BeautifulSoup
url = 'https://www.transfermarkt.de/pep-guardiola/erfolge/trainer/5672'
headers = {'Host': 'www.transfermarkt.de',
'Referer': 'https://www.transfermarkt.de/manuel-neuer/erfolge/spieler/17259',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
pageTree = requests.get(url, headers=headers)
soup = BeautifulSoup(pageTree.content, 'html5lib')
list3=[]
list4=[]
list5=[]
for td in soup.find_all('td',class_='hauptlink'):
    date=td.find_previous_sibling("td")
    list3.append(date.text)
    try:
        list4.append(str(td.a.find('img')['alt']))
        list5.append(str('xhttps://www.transfermarkt.de') + str(td.a['href']))
    except AttributeError:
        list5.append('MISSING')
        list4.append('MISSING')
#just for viewing output
for item in zip(list3,list4,list5):
    print(item)
Output
('10/11', 'LaLiga', 'xhttps://www.transfermarkt.de/primera-division/startseite/wettbewerb/ES1/saison_id/2010')
('08/09', 'LaLiga', 'xhttps://www.transfermarkt.de/primera-division/startseite/wettbewerb/ES1/saison_id/2008')
('10/11', 'UEFA Champions League', 'xhttps://www.transfermarkt.de/uefa-champions-league/startseite/pokalwettbewerb/CL/saison_id/2010')
('08/09', 'UEFA Champions League', 'xhttps://www.transfermarkt.de/uefa-champions-league/startseite/pokalwettbewerb/CL/saison_id/2008')
('17/18', 'Premier League', 'xhttps://www.transfermarkt.de/premier-league/startseite/wettbewerb/GB1/saison_id/2017')
('10/11', 'LaLiga', 'xhttps://www.transfermarkt.de/primera-division/startseite/wettbewerb/ES1/saison_id/2010')
('09/10', 'LaLiga', 'xhttps://www.transfermarkt.de/primera-division/startseite/wettbewerb/ES1/saison_id/2009')
('08/09', 'LaLiga', 'xhttps://www.transfermarkt.de/primera-division/startseite/wettbewerb/ES1/saison_id/2008')
('15/16', '1.Bundesliga', 'xhttps://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/saison_id/2015')
('14/15', '1.Bundesliga', 'xhttps://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/saison_id/2014')
('13/14', '1.Bundesliga', 'xhttps://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/saison_id/2013')
('18/19', 'EFL Cup', 'xhttps://www.transfermarkt.de/league-cup/startseite/pokalwettbewerb/CGB/saison_id/2018')
('17/18', 'EFL Cup', 'xhttps://www.transfermarkt.de/league-cup/startseite/pokalwettbewerb/CGB/saison_id/2017')
('13/14', 'UEFA Super Cup', 'xhttps://www.transfermarkt.de/uefa-supercup/startseite/pokalwettbewerb/USC/saison_id/2013')
('11/12', 'UEFA Super Cup', 'xhttps://www.transfermarkt.de/uefa-supercup/startseite/pokalwettbewerb/USC/saison_id/2011')
('09/10', 'UEFA Super Cup', 'xhttps://www.transfermarkt.de/uefa-supercup/startseite/pokalwettbewerb/USC/saison_id/2009')
('13/14', 'FIFA Klub-WM', 'xhttps://www.transfermarkt.de/fifa-klub-wm/startseite/pokalwettbewerb/KLUB/saison_id/2013')
('11/12', 'FIFA Klub-WM', 'xhttps://www.transfermarkt.de/fifa-klub-wm/startseite/pokalwettbewerb/KLUB/saison_id/2011')
('09/10', 'FIFA Klub-WM', 'xhttps://www.transfermarkt.de/fifa-klub-wm/startseite/pokalwettbewerb/KLUB/saison_id/2009')
('10/11', 'MISSING', 'MISSING')
('15/16', 'DFB-Pokal', 'xhttps://www.transfermarkt.de/dfb-pokal/startseite/pokalwettbewerb/DFB/saison_id/2015')
('13/14', 'DFB-Pokal', 'xhttps://www.transfermarkt.de/dfb-pokal/startseite/pokalwettbewerb/DFB/saison_id/2013')
('11/12', 'Copa del Rey', 'xhttps://www.transfermarkt.de/copa-del-rey/startseite/pokalwettbewerb/CDR/saison_id/2011')
('08/09', 'Copa del Rey', 'xhttps://www.transfermarkt.de/copa-del-rey/startseite/pokalwettbewerb/CDR/saison_id/2008')
('11/12', 'Supercopa', 'xhttps://www.transfermarkt.de/supercopa/startseite/pokalwettbewerb/SUC/saison_id/2011')
('10/11', 'Supercopa', 'xhttps://www.transfermarkt.de/supercopa/startseite/pokalwettbewerb/SUC/saison_id/2010')
('09/10', 'Supercopa', 'xhttps://www.transfermarkt.de/supercopa/startseite/pokalwettbewerb/SUC/saison_id/2009')
('18/19', 'Community Shield', 'xhttps://www.transfermarkt.de/community-shield/startseite/pokalwettbewerb/GBCS/saison_id/2018')
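The list-of-lists layout mentioned in the answer could look roughly like the sketch below. It is only an illustration: build_rows and the cells input are hypothetical names, and the sample data is typed in by hand rather than scraped. The point is that each row keeps its season, competition and link together, so a gap is filled in place instead of at the end.

```python
MISSING = "MISSING"


def build_rows(cells):
    """Turn per-row scrape results into [season, competition, link] lists.

    `cells` is a hypothetical list of dicts, one per table row. A key is
    simply absent when the page had nothing to scrape for that field.
    """
    rows = []
    for cell in cells:
        rows.append([
            cell["season"],
            cell.get("competition", MISSING),  # filled in place, not at the end
            cell.get("link", MISSING),
        ])
    return rows


# Hand-written sample that mirrors three consecutive rows of the output above.
scraped = [
    {"season": "09/10", "competition": "FIFA Klub-WM",
     "link": "/fifa-klub-wm/startseite/pokalwettbewerb/KLUB/saison_id/2009"},
    {"season": "10/11"},  # the row with no competition and no link
    {"season": "15/16", "competition": "DFB-Pokal",
     "link": "/dfb-pokal/startseite/pokalwettbewerb/DFB/saison_id/2015"},
]

rows = build_rows(scraped)
for row in rows:
    print(row)
```

Because every row is a single list, sorting or filtering the result can never shift a season away from its competition.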
This was working, but it suddenly fails to extract the "event.body" JSON object passed into this AWS Lambda Node.js function:
exports.handler = function (event, context, callback) {
    console.log('Event: ' + JSON.stringify(event));
    console.log('Event.Body: ' + event.body);
    //console.log('Parsed Event: ' + JSON.parse(event));
    let body = event.body;
    console.log('Body: ' + body);
    const tgQueryName = body.queryName;
    const tgQueryParams = body.queryParams;
    console.log('tgQueryName: ' + tgQueryName);
    console.log('tgQueryParams: ' + tgQueryParams);
    ...
Both tgQueryName and tgQueryParams are 'undefined' - see CloudWatch log:
INFO Event: {"version":"2.0","routeKey":"POST /tg-query","rawPath":"/dev/tg-query","rawQueryString":"","headers":{"accept":"application/json, text/plain, */*","accept-encoding":"gzip, deflate","accept-language":"he-IL,he;q=0.9,en-US;q=0.8,en;q=0.7","cache-control":"no-cache","content-length":"51","content-type":"application/json; charset=UTF-8","host":"p6ilp2ts0g.execute-api.us-east-1.amazonaws.com","origin":"http://localhost","referer":"http://localhost/","sec-fetch-dest":"empty","sec-fetch-mode":"cors","sec-fetch-site":"cross-site","user-agent":"Mozilla/5.0 (Linux; Android 11; Redmi Note 8 Build/RKQ1.201004.002; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/101.0.4951.61 Mobile Safari/537.36","x-amzn-trace-id":"Root=1-629b960c-072e8fa475ad26f56893c6f9","x-forwarded-for":"89.139.32.60","x-forwarded-port":"443","x-forwarded-proto":"https","x-requested-with":"com.skillblaster.simplify.dev"},"requestContext":{"accountId":"140360121027","apiId":"p6ilp2ts0g","domainName":"p6ilp2ts0g.execute-api.us-east-1.amazonaws.com","domainPrefix":"p6ilp2ts0g","http":{"method":"POST","path":"/dev/tg-query","protocol":"HTTP/1.1","sourceIp":"89.139.32.60","userAgent":"Mozilla/5.0 (Linux; Android 11; Redmi Note 8 Build/RKQ1.201004.002; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/101.0.4951.61 Mobile Safari/537.36"},"requestId":"TNRh_gq-oAMESEw=","routeKey":"POST /tg-query","stage":"dev","time":"04/Jun/2022:17:27:40 +0000","timeEpoch":1654363660597},"body":"{\"queryName\":\"getActiveCountries\",\"queryParams\":{}}","isBase64Encoded":false}
INFO Event.Body: {"queryName":"getActiveCountries","queryParams":{}}
INFO Body: {"queryName":"getActiveCountries","queryParams":{}}
INFO tgQueryName: undefined
INFO tgQueryParams: undefined
I also tried: body["queryName"] - same result.
What am I missing?
Your body content is a string and you need to JSON.parse it:
let body = JSON.parse(event.body);
This only became clear once I pasted your initial event JSON into a JSON beautifier: the value of "body" is a quoted string with escaped quotes, not a nested object.
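The handler in the question is Node.js, but the string-versus-object distinction is language independent. As a rough illustration only, here is the same situation in Python, where parse_body is a hypothetical helper and the event is a trimmed copy of the one in the log:

```python
import json


def parse_body(event):
    """Return the request body of an API Gateway style event as a dict.

    The body arrives as a JSON *string*, so it has to be decoded before
    its fields can be read.
    """
    body = event.get("body")
    if isinstance(body, str):
        body = json.loads(body)
    return body or {}


# Trimmed version of the event from the CloudWatch log above.
event = {
    "body": '{"queryName":"getActiveCountries","queryParams":{}}',
    "isBase64Encoded": False,
}

body = parse_body(event)
print(body["queryName"])
print(body["queryParams"])
```

Indexing the raw string instead of the decoded object is what produced the undefined values in the log.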
I'm trying to find emails in HTML using a regex, but I have problems with some websites.
The main problem is that the regex call hangs the process and leaves the CPU overloaded.
import re
from urllib.request import urlopen, Request
email_regex = re.compile('([A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})', re.IGNORECASE)
request = Request('http://www.serviciositvyecla.com')
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
html = str(urlopen(request, timeout=5).read().decode("utf-8", "strict"))
email_regex.findall(html) ## here is where regex takes a long time
I have no problems with other websites, for example:
request = Request('https://www.velezmalaga.es/')
If someone knows how to solve this problem, or how to put a timeout on the regex call, I would appreciate it.
I use Windows.
I initially tried fiddling with your approach, but then I ditched it and resorted to BeautifulSoup. It worked.
Try this:
import re
import requests
from bs4 import BeautifulSoup
headers = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36",
}
pages = ['http://www.serviciositvyecla.com', 'https://www.velezmalaga.es/']
emails_found = set()
for page in pages:
    html = requests.get(page, headers=headers).content
    soup = BeautifulSoup(html, "html.parser").select('a[href^=mailto]')
    for item in soup:
        try:
            emails_found.add(item['href'].split(":")[-1].strip())
        except ValueError:
            print("No email :(")
print('\n'.join(email for email in emails_found))
Output:
info#serviciositvyecla.com
oac#velezmalaga.es
EDIT:
One reason your approach doesn't work is, well, the regex itself. The other one is the size (I suspect) of the HTML returned.
See this:
import re
import requests
headers = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36",
}
html = requests.get('https://www.velezmalaga.es/', headers=headers).text
op_regx = r'([A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})'  # raw strings avoid invalid-escape warnings
simplified_regex = r'[\w\.-]+#[\w\.-]+\.\w+'
print(f"OP's regex results: {re.findall(op_regx, html)}")
print(f"Simplified regex results: {re.findall(simplified_regex, html)}")
This prints:
OP's regex results: []
Simplified regex results: ['oac#velezmalaga.es', 'oac#velezmalaga.es']
Finally, I found a way to keep the regex search from consuming all the RAM. In my case, getting an empty result even though there is an email on the page is acceptable, as long as the process is not blocked due to lack of memory.
The HTML of the scraped page contained 5.5 million characters. 5.1 million of them carried no useful information, since they belong to a hidden div filled with unintelligible characters.
I added a check similar to:
if len(html) < 1000000: do whatever
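Spelled out, that check might look like the sketch below. It reuses the simplified pattern from the earlier answer exactly as written there. find_emails and MAX_HTML_CHARS are hypothetical names, and the one-million threshold is simply the figure from the post, not a tuned value.

```python
import re

# Simplified pattern from the answer above, unchanged.
SIMPLIFIED_REGEX = re.compile(r'[\w\.-]+#[\w\.-]+\.\w+')

# Size limit taken from the post; pages at or above it are skipped.
MAX_HTML_CHARS = 1000000


def find_emails(html, max_chars=MAX_HTML_CHARS):
    """Return regex matches, or an empty list when the page is too large."""
    if len(html) >= max_chars:
        return []  # accept an empty result rather than risk a hang
    return SIMPLIFIED_REGEX.findall(html)


print(find_emails('contact: oac#velezmalaga.es'))
print(find_emails('x' * 2000000))
```

The trade-off is the one described above: an oversized page yields no results at all, in exchange for never running the regex over millions of characters.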
I'm an absolute beginner with GET/POST requests and MicroPython.
I'm programming my ESP8266 Wemos D1 mini as an HTTP server with MicroPython. My project consists of using a website to control the RGB values of a neopixel matrix hooked up to the D1 (all the code is on my GitHub here: https://github.com/julien123123/NeoLamp-Micro).
Basically, the website contains three sliders: one for red, one for green and one for blue. Some JavaScript code reads the value of each slider and sends it to the MicroPython code using the POST method, as follows:
getColors = function() {
    var rgb = new Array(slider1.value, slider2.value, slider3.value);
    return rgb;
};
postColors = function(rgb) {
    var xmlhttp = new XMLHttpRequest();
    var npxJSON = '{"R":' + rgb[0] + ', "G":' + rgb[1] + ', "B":' + rgb[2] + '}';
    xmlhttp.open('POST', 'http://' + window.location.hostname + '/npx', true);
    xmlhttp.setRequestHeader('Content-type', 'application/json');
    xmlhttp.send(npxJSON);
};
To receive the request in MicroPython, here's my code:
conn, addr = s.accept()
request = conn.recv(1024)
request = str(request)
print(request)
The request prints as follows:
b'POST /npx HTTP/1.1\r\nHost: 192.xxx.xxx.xxx\r\nConnection: keep-alive\r\nContent-Length: 27\r\nOrigin: http://192.168.0.110\r\nUser-Agent: Mozilla/5.0 (X11; CrOS x86_64 10323.46.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.107 Safari/537.36\r\nContent-type: application/json\r\nAccept: */*\r\nReferer: http://192.xxx.xxx.xxx/\r\nAccept-Encoding: gzip, deflate\r\nAccept-Language: fr,en;q=0.9,fr-CA;q=0.8\r\n\r\n{"R":114, "G":120, "B":236}'
The only important bit for me is at the end: {"R":114, "G":120, "B":236}. I want to use those values to change the color values of my neopixel object.
My question is: how do I process the request so that I keep only the dictionary containing the RGB values at the end?
Thanks in advance (I'm almost there!)
This is really a question about Python data types in general. request is a bytes object, as indicated by the b prefix in b'POST /npx HTTP/1.1...\r\n{"R":114, "G":120, "B":236}'. You will have to use decode() to convert it to a string:
import json
request = b'POST /npx HTTP/1.1\r\nHost: 192.xxx.xxx.xxx\r\nConnection: keep-alive\r\nContent-Length: 27\r\nOrigin: http://192.168.0.110\r\nUser-Agent: Mozilla/5.0 (X11; CrOS x86_64 10323.46.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.107 Safari/537.36\r\nContent-type: application/json\r\nAccept: */*\r\nReferer: http://192.xxx.xxx.xxx/\r\nAccept-Encoding: gzip, deflate\r\nAccept-Language: fr,en;q=0.9,fr-CA;q=0.8\r\n\r\n{"R":114, "G":120, "B":236}'
data = request.decode() # convert to str
rgb = data.split('\r\n')[-1:] #split the str and discard the http header
for color in rgb:
    print(color, type(color))
    d = json.loads(color)
    print(d, type(d))
color is the str representation of a JSON object, and d is the Python dict you can use for further manipulation:
{"R":114, "G":120, "B":236} <class 'str'>
{'R': 114, 'G': 120, 'B': 236} <class 'dict'>
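A slightly more defensive variant, as a sketch only: HTTP separates the headers from the body with an empty line, so splitting once on b'\r\n\r\n' isolates the body even if it happens to contain line breaks. extract_json_body is a hypothetical helper, the sample request is shortened, and the code was written against standard Python, so the json module name may need adjusting on a given MicroPython build.

```python
import json


def extract_json_body(raw_request):
    """Return the JSON body of a raw HTTP request as a dict, or None.

    Headers and body are separated by the first blank line (CRLF CRLF).
    """
    parts = raw_request.split(b"\r\n\r\n", 1)
    if len(parts) < 2 or not parts[1]:
        return None  # no body was sent, e.g. a plain GET
    return json.loads(parts[1].decode())


# Shortened version of the request shown in the question.
raw = (b'POST /npx HTTP/1.1\r\nContent-Length: 27\r\n'
       b'Content-type: application/json\r\n\r\n'
       b'{"R":114, "G":120, "B":236}')

rgb = extract_json_body(raw)
print(rgb["R"], rgb["G"], rgb["B"])
```

Note that this works on the bytes returned by conn.recv() directly, so the request = str(request) line from the question has to be dropped first.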
I am working on my first website scraper and am trying to get the number 41,110 that is saved in a column on the webpage https://mcassessor.maricopa.gov/mcs.php?q=14014003N. Below is my code.
How can I get to this number and print it?
from bs4 import BeautifulSoup
import requests
web_page = 'https://mcassessor.maricopa.gov/mcs.php?q=14014003N'
web_header = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
response = requests.get(web_page,headers=web_header)
soup = BeautifulSoup(response.content,'html.parser')
for row in soup.findAll('table')[0].thread.tr.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    print(first_column)
A straightforward approach would involve getting the "improvements" table, getting the first non-header row and then the last cell in this row:
table = soup.find("table", id="improvements-table")
first_row = table.find_all("tr")[1] # skipping a header
last_cell = first_row.find_all("td")[-1]
print(last_cell.get_text()) # prints 41,110
A more generic approach would involve making a list of dictionaries out of this table where keys are header names:
table = soup.find("table", id="improvements-table")
headers = [th.get_text() for th in table('th')]
data = [dict(zip(headers, [td.get_text() for td in row('td')])) for row in table("tr")[1:]]
print(data)
print(data[0]['Sq Ft.'])
Prints:
[
{u'Imp #': u'000101', u'Description': u'Mini-Warehouse', u'Age': u'1', u'Rank': u'2', u'Sq Ft.': u'41,110', u'CCI': u'C', u'Model': u'386'},
{u'Imp #': u'000201', u'Description': u'Site Improvements', u'Age': u'1', u'Rank': u'2', u'Sq Ft.': u'1', u'CCI': u'D', u'Model': u'163'}
]
41,110
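The dict(zip(...)) step can be tried in isolation, without BeautifulSoup or network access. In the sketch below the header and cell texts are typed in by hand (a subset of the columns printed above), and the last line shows one possible way to turn the '41,110' string into a number.

```python
# Hand-typed stand-ins for the texts of the <th> and <td> cells.
headers = ["Imp #", "Description", "Sq Ft."]
rows = [
    ["000101", "Mini-Warehouse", "41,110"],
    ["000201", "Site Improvements", "1"],
]

# Pair each cell with the header of its column, one dict per row.
data = [dict(zip(headers, row)) for row in rows]
print(data[0]["Sq Ft."])

# The value is still text; strip the thousands separator to get an int.
sq_ft = int(data[0]["Sq Ft."].replace(",", ""))
print(sq_ft)
```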
I have a MongoDB collection "mongocollection". Each document in the collection consists of two fields: first, a string "cid", which is the id, and second, a JSON array.
Ex:
{
    "_id" : "domain.com",
    "className" : "UserAgents",
    "userAgents" : [
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-CA; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5",
        "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.6) Gecko/20100628 Ubuntu/10.04 (lucid) Firefox/3.6.6 (.NET CLR 3.5.30729)",
        "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.2) Gecko/20090803 Firefox/3.5.2 Slackware",
        "Mozilla/5.0 (X11; U; Linux x86_64) Gecko/2008072820 Firefox/3.0.1"
    ]
}
From the mongo command line, I can get the contents of a given document within a collection with:
db.CorruptUserAgents.find({"_id":"domain.com"}).pretty();
How do I get the number of elements in a given array of a given document? For example:
SOMETHING.count();
100
Is it possible to do this via the command line?
I know I can get the document, iterate over the array and count the elements, but I want to do this from the command line.
You can use .findOne() instead of .find(). Since you are looking up by _id, it will return a single document.
So you can simply do something like this:
var document = db.CorruptUserAgents.findOne({"_id":"domain.com"});
var count = document.userAgents.length;
You can also get it directly in a single expression:
db.CorruptUserAgents.findOne({"_id":"domain.com"}).userAgents.length;