Timeout while using regex Python3 - html

I am trying to find email addresses in HTML using a regex, but I have problems with some websites.
The main problem is that the regex call hangs the process and leaves the CPU overloaded.
import re
from urllib.request import urlopen, Request
email_regex = re.compile(r'([A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})', re.IGNORECASE)
request = Request('http://www.serviciositvyecla.com')
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
html = str(urlopen(request, timeout=5).read().decode("utf-8", "strict"))
email_regex.findall(html) ## here is where regex takes a long time
I have no problems with other websites, for example:
request = Request('https://www.velezmalaga.es/')
If someone knows how to solve this problem, or how to put a timeout on the regex call, I would appreciate it.
I use Windows.

I initially tried fiddling with your approach, but then I ditched it and resorted to BeautifulSoup. It worked.
Try this:
import re
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36",
}
pages = ['http://www.serviciositvyecla.com', 'https://www.velezmalaga.es/']
emails_found = set()
for page in pages:
    html = requests.get(page, headers=headers).content
    soup = BeautifulSoup(html, "html.parser").select('a[href^=mailto]')
    for item in soup:
        try:
            emails_found.add(item['href'].split(":")[-1].strip())
        except ValueError:
            print("No email :(")
print('\n'.join(email for email in emails_found))
Output:
info#serviciositvyecla.com
oac#velezmalaga.es
EDIT:
One reason your approach doesn't work is, well, the regex itself. The other one is the size (I suspect) of the HTML returned.
See this:
import re
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36",
}
html = requests.get('https://www.velezmalaga.es/', headers=headers).text
# Raw strings keep the backslashes in the patterns intact.
op_regx = r'([A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})'
simplified_regex = r'[\w\.-]+#[\w\.-]+\.\w+'
print(f"OP's regex results: {re.findall(op_regx, html)}")
print(f"Simplified regex results: {re.findall(simplified_regex, html)}")
This prints:
OP's regex results: []
Simplified regex results: ['oac#velezmalaga.es', 'oac#velezmalaga.es']

Finally, I found a way to keep the regex search from consuming all the RAM. For my use case, getting an empty result even though the page contains an email address is acceptable, as long as the process does not block for lack of memory.
The HTML of the scraped page contained 5.5 million characters. About 5.1 million of them carried no useful information, since they belong to a hidden div full of unintelligible characters.
I added a guard similar to this:
if len(html) < 1000000:
    emails = email_regex.findall(html)  # only search pages below the size limit
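The size guard described above could be wrapped in a small helper. This is only a sketch: the function name `find_emails`, the constant `MAX_HTML_CHARS`, and the use of `@` as the separator in the pattern are assumptions for illustration, so substitute the pattern and limit you actually use.

```python
import re

# Assumed pattern for illustration; swap in the pattern you really search with.
EMAIL_RE = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}", re.IGNORECASE)

# Assumed cutoff, taken from the one-million-character limit mentioned above.
MAX_HTML_CHARS = 1_000_000


def find_emails(html, max_chars=MAX_HTML_CHARS):
    """Return the emails found in html, or [] if the page is too large to scan."""
    if len(html) >= max_chars:
        # Skip oversized pages instead of risking a very slow regex search.
        return []
    return EMAIL_RE.findall(html)


print(find_emails("Contact us at info@example.com for details."))
```

The trade-off is the one accepted above: an oversized page yields an empty result even if it contains an address.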

Related

Webscrape using BeautifulSoup to Dataframe

This is the html code:
<div class="wp-block-atomic-blocks-ab-accordion ab-block-accordion ab-font-size-18"><details><summary class="ab-accordion-title"><strong>American Samoa</strong></summary><div class="ab-accordion-text">
<ul><li><strong>American Samoa Department of Health Travel Advisory</strong></li><li>March 2, 2020—Governor Moliga <a rel="noreferrer noopener" href="https://www.rnz.co.nz/international/pacific-news/410783/american-samoa-establishes-govt-taskforce-to-plan-for-coronavirus" target="_blank">appointed</a> a government taskforce to provide a plan for preparation and response to the covid-19 coronavirus. </li></ul>
<ul><li>March 25, 2020 – The Governor issued an Executive Order 001 recognizing the Declared Public Health Emergency and State of Emergency, and imminent threat to public health. The order requires the immediate and comprehensive enforcement by the Commissioner of Public Safety, Director of Health, Attorney General, and other agency leaders.
<ul>
<li>Business are also required to provide necessary supplies to the public and are prohibited from price gouging.</li>
</ul>
</li></ul>
</div></details></div>
I want to extract the state, date and text and add them to a DataFrame with these three columns:
State: American Samoa
Date: 2020-03-25
Text: The Governor Executive Order 001 recognizing the Declared Public Health Emergency and State of Emergency, and imminent threat to public health
My code so far:
soup = bs4.BeautifulSoup(data)
for tag in soup.find_all("summary"):
    print("{0}: {1}".format(tag.name, tag.text))
for tag1 in soup.find_all("li"):
    #print(type(tag1))
    ln = tag1.text
    dt = (ln.split(' – ')[0])
    dt = (dt.split('—')[0])
    #txt = ln.split(' – ')[1]
    print(dt)
I need help with:
How to get the text only up to the first ".", since I don't need the entire text
How to add a new row to the DataFrame on each loop iteration (I have attached only part of the page's source code)
I appreciate your help!
As a start, I have added the code below. Unfortunately, the web page is not uniform in its use of HTML lists: some ul elements contain nested uls and others don't. This code is not perfect, but it is a starting point. For example, American Samoa has an absolute mess of nested ul elements, so it appears only once in the df.
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
}
# You need to specify User Agent headers or else you get a 403
data = requests.get("https://www.nga.org/coronavirus-state-actions-all/", headers=HEADERS).text
soup = BeautifulSoup(data, 'lxml')
rows_list = []
for detail in soup.find_all("details"):
    state = detail.find('summary')
    ul = detail.find('ul')
    for li in ul.find_all('li', recursive=False):
        # Three types of hyphen are used on this webpage
        split = re.split('(?:-|–|—)', li.text, maxsplit=1)
        if len(split) == 2:
            rows_list.append([state.text, split[0], split[1]])
        else:
            print("Error", li.text)
df = pd.DataFrame(rows_list)
# max_colwidth=None replaces the deprecated -1 (pandas 1.0 and later)
with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', None):
    print(df)
It creates and prints a data frame with 547 rows and prints some error messages for text it cannot split. You will have to work out exactly which data you need and how to tweak the code to suit your purpose.
You can use 'html.parser' if you don't have 'lxml' installed.
UPDATED
Another approach is to use regex to match any string beginning with a date:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
}
# You need to specify User Agent headers or else you get a 403
data = requests.get("https://www.nga.org/coronavirus-state-actions-all/", headers=HEADERS).text
soup = BeautifulSoup(data, 'html.parser')
rows_list = []
# Compile once, outside the loops: month name, day, optional comma(s), year
p = re.compile(r'(\s*(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s*(\d{1,2}),*\s*(\d{4}))', re.IGNORECASE)
for detail in soup.find_all("details"):
    state = detail.find('summary')
    for li in detail.find_all('li'):
        m = p.match(li.text)
        if m:
            rows_list.append([state.text, m.group(0), m.string.replace(m.group(0), '')])
        else:
            print("Error", li.text)
df = pd.DataFrame(rows_list)
df.to_csv('out.csv')
This gives far more records: 4,785. Again, it is a starting point; some data still gets missed, but far less. It writes the data to a CSV file, out.csv.
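The question also asked for the date in a form like 2020-03-25 and for the text only up to the first ".". A minimal sketch of that step follows, using only the standard library. The helper name `parse_entry` is hypothetical, and it assumes full English month names and an English (or C) locale for `strptime`; abbreviated months would need extra handling.

```python
import re
from datetime import datetime

# Month name, day, optional comma, four-digit year at the start of the text.
DATE_RE = re.compile(r"\s*([A-Za-z]+)\s+(\d{1,2}),?\s*(\d{4})")


def parse_entry(text):
    """Return (iso_date, first_sentence), or None if text has no leading date."""
    m = DATE_RE.match(text)
    if m is None:
        return None
    month, day, year = m.groups()
    try:
        date = datetime.strptime(f"{month} {day} {year}", "%B %d %Y").date()
    except ValueError:
        # The leading word was not a full month name (assumption noted above).
        return None
    # Drop the separator (space, hyphen, en dash, em dash) after the date.
    rest = text[m.end():].lstrip(" -\u2013\u2014")
    # Keep only the text up to the first sentence-ending period.
    first_sentence = rest.split(". ", 1)[0].rstrip(". ")
    return date.isoformat(), first_sentence


sample = "March 25, 2020 \u2013 The Governor issued an order. The order requires enforcement."
print(parse_entry(sample))
```

Each non-None result could then be appended to rows_list together with the state name before building the DataFrame.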

BeautifulSoup exception handling of structure

I want to implement exception handling in my code. I'm scraping data from Transfermarkt. If you look at the attached picture, you will see that in season 10/11 there's a missing entry. BS doesn't find any matches there and just skips it. I've implemented some code that checks the full length at the end and pads the list with 'MISSING'. Unfortunately, I can only append this at the end of a page. Thus, if there is a missing entry in the middle of the table, I have to move it manually. The problem is that my year/season values no longer line up after such missing entries.
Can this be done with selenium?
Relevant part of my code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.transfermarkt.de/pep-guardiola/erfolge/trainer/5672'
headers = {'Host': 'www.transfermarkt.de',
           'Referer': 'https://www.transfermarkt.de/manuel-neuer/erfolge/spieler/17259',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
pageTree = requests.get(url, headers=headers)
soup = BeautifulSoup(pageTree.content, 'lxml')
for title in soup.select('.box td.hauptlink > a'):
    list5.append(str('xhttps://www.transfermarkt.de') + str(title['href']))
    for titlelink in title.find_all('img'):
        list4.append(str(titlelink['alt']))
missingentries = len(list3) - len(list4)
for x in range(0, missingentries):
    list4.append(str('MISSING'))
missinglinks = len(list4) - len(list5)
for x in range(0, missinglinks):
    list5.append(str('MISSING'))
My output:
Output I want:
Any help is appreciated!
The problem seems to be that you isolate each of these related elements while parsing them. Once they have been stored in individual lists, you cannot find the index of the missing element.
What you can do is to first get these elements together and insert 'MISSING' into the list at that point. In that case you can use an exception to catch the missing element. I would have preferred to store these in a list of lists instead of a separate list for each element.
import requests
from bs4 import BeautifulSoup
url = 'https://www.transfermarkt.de/pep-guardiola/erfolge/trainer/5672'
headers = {'Host': 'www.transfermarkt.de',
           'Referer': 'https://www.transfermarkt.de/manuel-neuer/erfolge/spieler/17259',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
pageTree = requests.get(url, headers=headers)
soup = BeautifulSoup(pageTree.content, 'html5lib')
list3 = []
list4 = []
list5 = []
for td in soup.find_all('td', class_='hauptlink'):
    # The season sits in the cell just before the competition cell
    date = td.find_previous_sibling("td")
    list3.append(date.text)
    try:
        list4.append(str(td.a.find('img')['alt']))
        list5.append(str('xhttps://www.transfermarkt.de') + str(td.a['href']))
    except (AttributeError, TypeError):
        # AttributeError: the cell has no <a>; TypeError: the <a> has no <img>
        list5.append('MISSING')
        list4.append('MISSING')
# just for viewing output
for item in zip(list3, list4, list5):
    print(item)
Output
('10/11', 'LaLiga', 'xhttps://www.transfermarkt.de/primera-division/startseite/wettbewerb/ES1/saison_id/2010')
('08/09', 'LaLiga', 'xhttps://www.transfermarkt.de/primera-division/startseite/wettbewerb/ES1/saison_id/2008')
('10/11', 'UEFA Champions League', 'xhttps://www.transfermarkt.de/uefa-champions-league/startseite/pokalwettbewerb/CL/saison_id/2010')
('08/09', 'UEFA Champions League', 'xhttps://www.transfermarkt.de/uefa-champions-league/startseite/pokalwettbewerb/CL/saison_id/2008')
('17/18', 'Premier League', 'xhttps://www.transfermarkt.de/premier-league/startseite/wettbewerb/GB1/saison_id/2017')
('10/11', 'LaLiga', 'xhttps://www.transfermarkt.de/primera-division/startseite/wettbewerb/ES1/saison_id/2010')
('09/10', 'LaLiga', 'xhttps://www.transfermarkt.de/primera-division/startseite/wettbewerb/ES1/saison_id/2009')
('08/09', 'LaLiga', 'xhttps://www.transfermarkt.de/primera-division/startseite/wettbewerb/ES1/saison_id/2008')
('15/16', '1.Bundesliga', 'xhttps://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/saison_id/2015')
('14/15', '1.Bundesliga', 'xhttps://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/saison_id/2014')
('13/14', '1.Bundesliga', 'xhttps://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/saison_id/2013')
('18/19', 'EFL Cup', 'xhttps://www.transfermarkt.de/league-cup/startseite/pokalwettbewerb/CGB/saison_id/2018')
('17/18', 'EFL Cup', 'xhttps://www.transfermarkt.de/league-cup/startseite/pokalwettbewerb/CGB/saison_id/2017')
('13/14', 'UEFA Super Cup', 'xhttps://www.transfermarkt.de/uefa-supercup/startseite/pokalwettbewerb/USC/saison_id/2013')
('11/12', 'UEFA Super Cup', 'xhttps://www.transfermarkt.de/uefa-supercup/startseite/pokalwettbewerb/USC/saison_id/2011')
('09/10', 'UEFA Super Cup', 'xhttps://www.transfermarkt.de/uefa-supercup/startseite/pokalwettbewerb/USC/saison_id/2009')
('13/14', 'FIFA Klub-WM', 'xhttps://www.transfermarkt.de/fifa-klub-wm/startseite/pokalwettbewerb/KLUB/saison_id/2013')
('11/12', 'FIFA Klub-WM', 'xhttps://www.transfermarkt.de/fifa-klub-wm/startseite/pokalwettbewerb/KLUB/saison_id/2011')
('09/10', 'FIFA Klub-WM', 'xhttps://www.transfermarkt.de/fifa-klub-wm/startseite/pokalwettbewerb/KLUB/saison_id/2009')
('10/11', 'MISSING', 'MISSING')
('15/16', 'DFB-Pokal', 'xhttps://www.transfermarkt.de/dfb-pokal/startseite/pokalwettbewerb/DFB/saison_id/2015')
('13/14', 'DFB-Pokal', 'xhttps://www.transfermarkt.de/dfb-pokal/startseite/pokalwettbewerb/DFB/saison_id/2013')
('11/12', 'Copa del Rey', 'xhttps://www.transfermarkt.de/copa-del-rey/startseite/pokalwettbewerb/CDR/saison_id/2011')
('08/09', 'Copa del Rey', 'xhttps://www.transfermarkt.de/copa-del-rey/startseite/pokalwettbewerb/CDR/saison_id/2008')
('11/12', 'Supercopa', 'xhttps://www.transfermarkt.de/supercopa/startseite/pokalwettbewerb/SUC/saison_id/2011')
('10/11', 'Supercopa', 'xhttps://www.transfermarkt.de/supercopa/startseite/pokalwettbewerb/SUC/saison_id/2010')
('09/10', 'Supercopa', 'xhttps://www.transfermarkt.de/supercopa/startseite/pokalwettbewerb/SUC/saison_id/2009')
('18/19', 'Community Shield', 'xhttps://www.transfermarkt.de/community-shield/startseite/pokalwettbewerb/GBCS/saison_id/2018')
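The "list of lists" layout mentioned in the answer can be sketched without any scraping. Here `build_row`, `MISSING` and the `base_url` default are hypothetical names chosen for illustration; the sample seasons and paths are taken from the output above.

```python
MISSING = "MISSING"


def build_row(season, competition=None, href=None,
              base_url="https://www.transfermarkt.de"):
    """Return one [season, competition, link] row, padding absent fields in place."""
    name = competition if competition else MISSING
    link = base_url + href if href else MISSING
    return [season, name, link]


# One inner list per table row, so a gap stays next to its own season.
rows = [
    build_row("09/10", "FIFA Klub-WM",
              "/fifa-klub-wm/startseite/pokalwettbewerb/KLUB/saison_id/2009"),
    build_row("10/11"),  # row whose competition cell is empty
    build_row("15/16", "DFB-Pokal",
              "/dfb-pokal/startseite/pokalwettbewerb/DFB/saison_id/2015"),
]

for row in rows:
    print(row)
```

Because each row is built in one step, the season, name and link cannot drift out of alignment the way three independently padded lists can.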

How to handle http JSON POST response in MicroPython? (ESP8266)

I'm an absolute beginner with GET/POST requests and MicroPython.
I'm programming my ESP8266 Wemos D1 mini as an HTTP server with MicroPython. My project consists of using a website to control the RGB values of a neopixel matrix hooked up to the D1 (all the code is on my GitHub here: https://github.com/julien123123/NeoLamp-Micro).
Basically, the website contains three sliders: one for red, one for green and one for blue. A piece of JavaScript reads the value of each slider and sends it to the MicroPython code using the POST method, as follows:
getColors = function() {
  var rgb = new Array(slider1.value, slider2.value, slider3.value);
  return rgb;
};
postColors = function(rgb) {
  var xmlhttp = new XMLHttpRequest();
  var npxJSON = '{"R":' + rgb[0] + ', "G":' + rgb[1] + ', "B":' + rgb[2] + '}';
  xmlhttp.open('POST', 'http://' + window.location.hostname + '/npx', true);
  xmlhttp.setRequestHeader('Content-type', 'application/json');
  xmlhttp.send(npxJSON);
};
To receive the request in MicroPython, here's my code:
conn, addr = s.accept()
request = conn.recv(1024)
request = str(request)
print(request)
The request prints as follows:
b'POST /npx HTTP/1.1\r\nHost: 192.xxx.xxx.xxx\r\nConnection: keep-alive\r\nContent-Length: 27\r\nOrigin: http://192.168.0.110\r\nUser-Agent: Mozilla/5.0 (X11; CrOS x86_64 10323.46.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.107 Safari/537.36\r\nContent-type: application/json\r\nAccept: */*\r\nReferer: http://192.xxx.xxx.xxx/\r\nAccept-Encoding: gzip, deflate\r\nAccept-Language: fr,en;q=0.9,fr-CA;q=0.8\r\n\r\n{"R":114, "G":120, "B":236}'
The only important bit for me is at the end : {"R":114, "G":120, "B":236}. I want to use those values to change the color values of my neopixel object.
My question is: how do I process the request so that I keep only the dictionary containing the RGB values at the end?
Thanks in advance (I'm almost there!)
This is more of a general Python data type question. The request is of type bytes, as indicated by the b prefix in b'POST /npx HTTP/1.1...\r\n{"R":114, "G":120, "B":236}'. You have to use decode() to convert it to a string:
import json
request = b'POST /npx HTTP/1.1\r\nHost: 192.xxx.xxx.xxx\r\nConnection: keep-alive\r\nContent-Length: 27\r\nOrigin: http://192.168.0.110\r\nUser-Agent: Mozilla/5.0 (X11; CrOS x86_64 10323.46.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.107 Safari/537.36\r\nContent-type: application/json\r\nAccept: */*\r\nReferer: http://192.xxx.xxx.xxx/\r\nAccept-Encoding: gzip, deflate\r\nAccept-Language: fr,en;q=0.9,fr-CA;q=0.8\r\n\r\n{"R":114, "G":120, "B":236}'
data = request.decode()  # convert to str
rgb = data.split('\r\n')[-1:]  # split the str and discard the http header
for color in rgb:
    print(color, type(color))
    d = json.loads(color)
    print(d, type(d))
The value of color is a str representation of a JSON object; d is a Python dict that you can use for further manipulation:
{"R":114, "G":120, "B":236} <class 'str'>
{'R': 114, 'G': 120, 'B': 236} <class 'dict'>
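In HTTP, the headers end at the first blank line, so the body can also be isolated by splitting once on the CRLF CRLF sequence. The sketch below was written for CPython; the helper name `parse_json_body` is hypothetical, and on MicroPython it assumes the `json` module is available under that name. Note that a single `recv(1024)` may return only part of a longer request, which this sketch does not handle.

```python
import json


def parse_json_body(raw):
    """Return the JSON body of a raw HTTP request as a dict, or None if absent."""
    # Headers and body are separated by the first blank line (CRLF CRLF).
    parts = raw.split(b"\r\n\r\n", 1)
    if len(parts) != 2 or not parts[1]:
        return None
    return json.loads(parts[1].decode())


# Shortened version of the request shown above.
request = (b'POST /npx HTTP/1.1\r\nContent-Length: 27\r\n'
           b'Content-type: application/json\r\n\r\n{"R":114, "G":120, "B":236}')
colors = parse_json_body(request)
print(colors["R"], colors["G"], colors["B"])
```

The three integers can then be passed to whatever sets the pixel colors in your own code.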

Python Scraper - Find Data in Column

I am working on my first website scraper and am trying to get the number 41,110 that is saved in a column on the webpage https://mcassessor.maricopa.gov/mcs.php?q=14014003N. Below is my code.
How can I get to this number and print it?
from bs4 import BeautifulSoup
import requests
web_page = 'https://mcassessor.maricopa.gov/mcs.php?q=14014003N'
web_header = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
response = requests.get(web_page,headers=web_header)
soup = BeautifulSoup(response.content,'html.parser')
for row in soup.findAll('table')[0].thread.tr.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    print(first_column)
A straightforward approach would involve getting the "improvements" table, getting the first non-header row and then the last cell in this row:
table = soup.find("table", id="improvements-table")
first_row = table.find_all("tr")[1] # skipping a header
last_cell = first_row.find_all("td")[-1]
print(last_cell.get_text()) # prints 41,110
A more generic approach would involve making a list of dictionaries out of this table where keys are header names:
table = soup.find("table", id="improvements-table")
headers = [th.get_text() for th in table('th')]
data = [dict(zip(headers, [td.get_text() for td in row('td')])) for row in table("tr")[1:]]
print(data)
print(data[0]['Sq Ft.'])
Prints:
[
{u'Imp #': u'000101', u'Description': u'Mini-Warehouse', u'Age': u'1', u'Rank': u'2', u'Sq Ft.': u'41,110', u'CCI': u'C', u'Model': u'386'},
{u'Imp #': u'000201', u'Description': u'Site Improvements', u'Age': u'1', u'Rank': u'2', u'Sq Ft.': u'1', u'CCI': u'D', u'Model': u'163'}
]
41,110
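The header-to-cell pairing used above can be tried in isolation, without fetching the page. In this sketch the header names and cell values are copied from the printed output, and the helper `to_int` is a hypothetical addition for turning a value such as '41,110' into a number.

```python
# Header names and cell texts as they appear in the printed output above.
headers = ["Imp #", "Description", "Sq Ft."]
rows = [
    ["000101", "Mini-Warehouse", "41,110"],
    ["000201", "Site Improvements", "1"],
]

# Pair each cell with its column header, one dict per table row.
data = [dict(zip(headers, cells)) for cells in rows]


def to_int(text):
    """Convert a number formatted with thousands separators to an int."""
    return int(text.replace(",", ""))


sq_ft = to_int(data[0]["Sq Ft."])
print(sq_ft)
```

Converting to int is only needed if you want to do arithmetic with the value; for printing, the original string is enough.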

How to count number of elements in json array stored in MongoDB collection

I have a MongoDB collection "mongocollection". Each document in the collection consists of two fields: first a string "cid", which is the id of the document, and second a JSON array.
Ex:
{ "_id" : "domain.com",
"className" : "UserAgents",
"userAgents" : [
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-CA; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5",
"Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.6) Gecko/20100628 Ubuntu/10.04 (lucid) Firefox/3.6.6 (.NET CLR 3.5.30729)",
"Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.2) Gecko/20090803 Firefox/3.5.2 Slackware",
"Mozilla/5.0 (X11; U; Linux x86_64) Gecko/2008072820 Firefox/3.0.1"
]
}
From the mongo command line, I can get the contents of a given document within a collection with:
db.CorruptUserAgents.find({"_id":"domain.com"}).pretty();
How do I get the number of elements in a given array of a given document? For example:
SOMETHING.count();
100
Is it possible to do this from the command line?
I know I can fetch the document, iterate over the array and count the elements, but I want to do this from the command line.
You can use .findOne() instead since you are looking up by _id, it will return a single document.
So you can simply do something like this:
var document = db.CorruptUserAgents.findOne({"_id":"domain.com"});
var count = document.userAgents.length;
You can also get it directly with:
db.CorruptUserAgents.findOne({"_id":"domain.com"}).userAgents.length;