I'm trying to send an email that is written in Hebrew.
Instead of the Hebrew text, the message arrives as an encoded byte string.
I googled the problem but still can't solve it.
import smtplib
import time
from email.mime.text import MIMEText
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
service = Service(r"C:\Development\chromedriver.exe")
driver = webdriver.Chrome(service=service)
SUPERFARM_URL = "https://shop.super-pharm.co.il/"
my_email = "ZZZ"
password = "ZZZ"
driver.get(SUPERFARM_URL)
time.sleep(2)
search_bar = driver.find_element(By.ID, 'search-input')
search_bar.send_keys("סימילאק גולד")
search_bar.send_keys(Keys.ENTER)
time.sleep(2)
The issue is in these lines:
similac_gold_price = int(driver.find_element(By.XPATH, '//*[@id="results-boxes"]/a[3]/div/div[2]').get_attribute(
    "textContent").strip().split(" ")[-1].strip())
similac_gold_price_dec = int(driver.find_element(By.XPATH, '//*[@id="results-boxes"]/a[3]/div/div[2]').get_attribute(
    "textContent").strip().split(" ")[0].strip())
similac_grams = driver.find_element(By.XPATH, '//*[@id="results-boxes"]/a[4]/div/div[3]/div/div/span[1]').get_attribute("textContent")
similac_grams_encode = u' '.join(similac_grams).encode('utf-8').strip()
if similac_gold_price < 70:
    # The "text_encode" and "similac_grams_encode" are in Hebrew.
    text = driver.find_element(By.XPATH, '//*[@id="results-boxes"]/a[3]/div/div[3]/div/h4').get_attribute('textContent')
    text_encode = u' '.join(text).encode('utf-8').strip()
    print(text)
    print(text_encode)
    with smtplib.SMTP("smtp.gmail.com") as connection:
        connection.starttls()
        connection.login(user=my_email, password=password)
        connection.sendmail(
            from_addr=my_email,
            to_addrs="CCC",
            msg=f"Subject:Similac Offer!\n\n{text_encode} with a price of {similac_gold_price}.{similac_gold_price_dec} ({similac_grams_encode})"
        )
I tried to get plain Hebrew, but got text in some encoding instead.
The email I'm getting:
b'\xd7\xa1 \xd7\x99 \xd7\x9e \xd7\x99 \xd7\x9c \xd7\x90 \xd7\xa7 \xc2\xa0 \xd7\xa1 \xd7\x99 \xd7\x9e \xd7\x99 \xd7\x9c \xd7\x90 \xd7\xa7 \xd7\x92 \xd7\x95 \xd7\x9c \xd7\x93 \xd7\xa2 \xd7\x9d H M O \xd7\xa9 \xd7\x9c \xd7\x91 2' with a price of 69.90 (b'7 0 0 \xd7\x92 \xd7\xa8 \xd7\x9d')
The third argument to sendmail needs to be a properly formatted MIME message, not free-form text.
Something like this:
from email.message import EmailMessage
...
message = EmailMessage()
message["from"] = my_email
message["to"] = "example1@gmail.com"
message["subject"] = "Similac Offer!"
# Pass the original str values (text, similac_grams), not the .encode()d bytes.
message.set_content(f"{text} with a price of {similac_gold_price}.{similac_gold_price_dec} ({similac_grams})")
with smtplib.SMTP("smtp.gmail.com") as connection:
    connection.starttls()
    connection.login(user=my_email, password=password)
    connection.send_message(message)
This should convert your Hebrew text into a suitable character set (probably UTF-8 in practice) and create the necessary MIME structure around it to communicate the correct character set to the recipient's email client.
See also https://docs.python.org/3/library/email.examples.html (and maybe notice that you can find a lot of older code on the net which uses MIMEMultipart and other legacy classes; you want the new API, always).
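To see why the original attempt produced `b'\xd7...'` with a space between every letter, and what `EmailMessage` does instead, here is a minimal offline sketch. The addresses, the price, and the sample product text are placeholders rather than values from the question, and nothing is actually sent.

```python
from email.message import EmailMessage

# Placeholder standing in for the scraped Hebrew product name.
text = "סימילאק גולד"

# ' '.join(some_string) iterates over the characters, so it puts a space
# between every letter; .encode() then turns the str into bytes, and an
# f-string renders bytes as their b'...' repr.
text_encode = " ".join(text).encode("utf-8")
broken_body = f"{text_encode} with a price of 69.90"

# Passing the plain str to set_content() lets the email package choose a
# charset and add the MIME headers that declare it.
message = EmailMessage()
message["From"] = "sender@example.com"      # placeholder address
message["To"] = "recipient@example.com"     # placeholder address
message["Subject"] = "Similac Offer!"
message.set_content(f"{text} with a price of 69.90")

print(broken_body[:20])
print(message["Content-Type"])
print(message.get_content())
```

The same `message` object could then be handed to `connection.send_message(message)` as in the answer above.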
I have an html document that I wish to edit such that any word(s) within it can be highlighted/made bold.
I have the HTML in memory and have passed it to BeautifulSoup. I iterate through all tags and take their string elements. If any string contains a matching word, I edit the string and put it back into the HTML with markup wrapped around the desired word.
from flask import Flask, Markup
from bs4 import BeautifulSoup
def match( documentText: str, searchQuery: str) -> Markup:
    words = documentText.split( ' ')
    if len( words) >= 3:
        words[2] = f'<strong>{ words[2]}</strong>'
    logger.info( f'{ words=}')
    return Markup( ' '.join( words))
for link in html.find_all( True):
    if ( link.string):
        link.string = match( link.string, searchQuery)
app = Flask( __name__)
@app.route( '/')
def home():
    logger.info( 'trying markup and testing logging')
    return str( html), 200
app.run( debug=True)
Now, instead of rendering a page with bold words where I would like them, I see the literal HTML tags. If I view the source, the angle brackets of the tags are actually represented by entities such as &gt;. This appears to come from the line "link.string = match( link.string, searchQuery)", which I guess makes sense: BeautifulSoup is doing type checking and ensuring that the only thing that goes in the tag.string field is indeed a string. The ideal end state, I guess, would be to branch off the tag to include a child tag.
Is this a problem anybody else has previously solved? My solution to this whole thing seems chunky and inelegant so I wouldn't mind a better route if anybody has one.
For a quick fix, just replace those HTML special characters back with str.replace():
from flask import Flask, Markup
from bs4 import BeautifulSoup
# ...
@app.route( '/')
def home():
    logger.info( 'trying markup and testing logging')
    return str(html).replace("&gt;", ">").replace("&lt;", "<"), 200
app.run( debug=True)
Be careful, since the HTML special characters are not just < and >.
HTML special characters reference: https://www.html.am/reference/html-special-characters.cfm
Better Approach:
This approach changes all HTML special characters back to their unescaped form:
# HTMLParser().unescape() was removed in Python 3.9; html.unescape() replaces it.
# Importing only the function avoids rebinding the name "html", which holds the soup here.
from html import unescape
# ... inside home():
html_decoded_string = unescape(str(html))
return html_decoded_string, 200
Do note that on Python 2 the import statement (module name) is a little different.
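The escaping that makes the tags visible, and the unescaping suggested above, can be reproduced with the standard library alone. This is only an illustration on a made-up fragment, not the page from the question; note that unescaping a whole page also unescapes any entities that were meant to stay escaped.

```python
from html import escape, unescape

fragment = "some <strong>bold</strong> word"   # made-up sample markup

# Escaping turns markup characters into entities, which a browser then
# displays literally instead of interpreting as tags.
escaped = escape(fragment)

# Unescaping reverses that for every named or numeric entity, not just < and >.
restored = unescape(escaped)

print(escaped)
print(restored)
```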
I tried to scrape the table data from a binary signals website. The data updates from time to time, and I want to get the data as it updates. The problem is that when I scrape the page, it returns empty values. The table has a table tag.
I'm not sure whether it uses something other than HTML, because it updates without reloading. I had to use a browser user agent to get past the security.
When I print the table on its own, it returns the correct data, but I have noticed that the signal id increments by 1:
<table class="ui stripe hover dt-center table" id="isosignal-table" style="width:100%"><thead><tr><th></th><th class="no-sort">Current Price</th><th class="no-sort">Direction</th><th class="no-sort">Asset</th><th class="no-sort">Strike Price</th><th class="no-sort">Expiry Time</th></tr></thead><tbody><tr :class="[ signal.direction.toLowerCase() == 'call' ? 'call' : 'put' ]" :id="'signal-' + signal.id" :key="signal.id" ref="signals" v-for="signal in signals"><td style="display: none;" v-text="signal.id"></td><td v-text="signal.current_price"></td><td v-html="showDirection(signal.direction)"></td><td v-text="signal.asset"></td><td v-text="signal.strike_price"></td><td v-text="parseTime(signal.expiry)"></td></tr></tbody></table>
table = soup.table
print(table)
But when I run the whole code it returns this:
[]
['', '', '', '', '', '']
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
url = "https://signals.investingstockonline.com/free-binary-signal-page"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(req)
data = page.read()
soup = BeautifulSoup(data, 'html.parser')
table = soup.table
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    if len(row) < 1:
        pass
    print(row)
I thought it would display the whole table but it just displayed empty strings. What could be the problem?
In the HTML you've provided, there is no text content in the elements, so you're getting that correctly. When you look at the live website, text content that appears in the table was inserted dynamically by JS fetching information from a server via ajax. In other words, if you perform a request, you'll get the skeleton (HTML) but no meat (live data).
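As a rough illustration of the "skeleton but no meat" point, the sketch below parses a shortened copy of the table row from the question with the standard library. The `CellCollector` class is a hypothetical helper written for this example; it shows that each cell only carries a `v-text` binding for the page's JavaScript to fill in later, and no text of its own.

```python
from html.parser import HTMLParser

# Shortened copy of the row template from the question's HTML.
STATIC_HTML = (
    '<table><tbody><tr v-for="signal in signals">'
    '<td v-text="signal.asset"></td>'
    '<td v-text="signal.strike_price"></td>'
    '</tr></tbody></table>'
)

class CellCollector(HTMLParser):
    """Hypothetical helper: records the attributes and text of every <td>."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append({"attrs": dict(attrs), "text": ""})

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1]["text"] += data

collector = CellCollector()
collector.feed(STATIC_HTML)

cell_texts = [cell["text"] for cell in collector.cells]
bindings = [cell["attrs"].get("v-text") for cell in collector.cells]
print(cell_texts)   # the static markup has no text in the cells
print(bindings)     # only the names the page's JavaScript will fill in
```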
You can use something like Selenium to extract this information as follows:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
chrome_options = Options()
chrome_options.add_argument("--headless")
# Selenium 4 takes options=; the older chrome_options= keyword was removed.
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://signals.investingstockonline.com/free-binary-signal-page")
# find_elements(By.TAG_NAME, ...) replaces the removed find_elements_by_tag_name().
for tr in driver.find_elements(By.TAG_NAME, "tr"):
    for td in tr.find_elements(By.TAG_NAME, "td"):
        print(td.get_attribute("innerText"))
Output (truncated):
EURJPY
126.044
22:00:00
1.50318
EURCAD
1.50332
22:00:00
1.12595
EURUSD
1.12604
22:00:00
0.86732
EURGBP
0.86743
22:00:00
1.29825
GBPUSD
1.29841
22:00:00
145.320
I'm aiming to scrape two tables (in different formats) from a website, https://info.fsc.org/details.php?id=a0240000005sQjGAAU&type=certificate, after using the search bar, and to repeat this over a list of license codes. I haven't fully written the loop yet, but I added it (commented out) at the top for completeness.
My issue is that the two tables I want, Product Data and Certificate Data, are in two different formats, so I have to scrape them separately. The Product Data is in the normal "tr" format on the webpage, so that part is easy, and I've managed to extract it to a CSV file. The harder part is extracting the Certificate Data, as it is in "div" form.
I've managed to print the Certificate Data as a list of text by looking it up by class, but I need it in tabular form, saved in a CSV file. As you can see, I've tried several unsuccessful ways of converting it to a CSV. If you have any suggestions, they would be much appreciated, thank you! Any other general tips to improve my code would be great too, as I am new to web scraping.
#namelist = open('example.csv', newline='', delimiter = 'example')
#for name in namelist:
#include all of the below
driver = webdriver.Chrome(executable_path="/Users/jamesozden/Downloads/chromedriver")
url = "https://info.fsc.org/certificate.php"
driver.get(url)
search_bar = driver.find_element_by_xpath('//*[@id="code"]')
search_bar.send_keys("FSC-C001777")
search_bar.send_keys(Keys.RETURN)
new_url = driver.current_url
r = requests.get(new_url)
soup = BeautifulSoup(r.content,'lxml')
table = soup.find_all('table')[0]
df, = pd.read_html(str(table))
certificate = soup.find(class_= 'certificatecl').text
##certificate1 = pd.read_html(str(certificate))
driver.quit()
df.to_csv("Product_Data.csv", index=False)
##certificate1.to_csv("Certificate_Data.csv", index=False)
#print(df[0].to_json(orient='records'))
print(certificate)
Output:
Status
Valid
First Issue Date
2009-04-01
Last Issue Date
2018-02-16
Expiry Date
2019-04-01
Standard
FSC-STD-40-004 V3-0
What I want but over hundreds/thousands of license codes (I just manually created this one sample in Excel):
Desired output
EDIT
So whilst this is now working for Certificate Data, I also want to scrape the Product Data and output that into another .csv file. However currently it is only printing 5 copies of the product data for the final license code which is not what I want.
New Code:
df = pd.read_csv("MS_License_Codes.csv")
codes = df["License Code"]
def get_data_by_code(code):
    data = [
        ('code', code),
        ('submit', 'Search'),
    ]
    response = requests.post('https://info.fsc.org/certificate.php', data=data)
    soup = BeautifulSoup(response.content, 'lxml')
    status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
    first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
    last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
    expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
    standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text
    return [code, status, first_issue_date, last_issue_date, expiry_date, standard]
# Just insert here output filename and codes to parse...
OUTPUT_FILE_NAME = 'Certificate_Data.csv'
#codes = ['C001777', 'C001777', 'C001777', 'C001777']
df3=pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        writer.writerow((get_data_by_code(code)))
        table = soup.find_all('table')[0]
        df1, = pd.read_html(str(table))
        df3 = df3.append(df1)
df3.to_csv('Product_Data.csv', index = False, encoding='utf-8')
Here's all you need.
No chromedriver. No pandas. Forget about them in the context of scraping.
import requests
import csv
from bs4 import BeautifulSoup
# This is all what you need for your task. Really.
# No chromedriver. Don't use it for scraping. EVER.
# No pandas. Don't use it for writing csv. It's not what pandas was made for.
#Function to parse single data page based on single input code.
def get_data_by_code(code):
    # Parameters to build POST-request.
    # "type" and "submit" params are static. "code" is your desired code to scrape.
    data = [
        ('type', 'certificate'),
        ('code', code),
        ('submit', 'Search'),
    ]
    # POST-request to gain page data.
    response = requests.post('https://info.fsc.org/certificate.php', data=data)
    # "soup" object to parse html data.
    soup = BeautifulSoup(response.content, 'lxml')
    # "status" variable. Contains the text of the DIV that follows the first LABEL tag with text="Status".
    status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
    # Same for issue dates... etc.
    first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
    last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
    expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
    standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text
    # Returning found data as list of values.
    return [response.url, status, first_issue_date, last_issue_date, expiry_date, standard]
# Just insert here output filename and codes to parse...
OUTPUT_FILE_NAME = 'output.csv'
codes = ['C001777', 'C001777', 'C001777', 'C001777']
# newline='' is what the csv module documentation recommends for files passed to csv.writer.
with open(OUTPUT_FILE_NAME, 'w', newline='') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        #Writing list of values to file as single row.
        writer.writerow((get_data_by_code(code)))
Everything is really straightforward here. I'd suggest you spend some time in the Chrome dev tools "Network" tab to get a better understanding of request forging, which is a must for scraping tasks.
In general, you don't need to run Chrome to click the "Search" button; you need to forge the request generated by that click. The same goes for any form or AJAX call.
well... you should sharpen your skills (:
df3=pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        writer.writerow((get_data_by_code(code)))
        ### HERE'S THE PROBLEM:
        # "soup" variable is declared inside of "get_data_by_code" function.
        # So you can't use it in outer context.
        table = soup.find_all('table')[0] # <--- you should move this line to
        # the definition of "get_data_by_code" function and return its value somehow...
        df1, = pd.read_html(str(table))
        df3 = df3.append(df1)
df3.to_csv('Product_Data.csv', index = False, encoding='utf-8')
For example, you can return a dictionary of values from the "get_data_by_code" function:
def get_data_by_code(code):
    ...
    table = soup.find_all('table')[0]
    return dict(row=row, table=table)
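One way the calling loop might consume such a dictionary is sketched below. To keep it runnable offline, `get_data_by_code` here is a hypothetical stand-in that makes no request and returns placeholder product rows, and `io.StringIO` buffers stand in for the two output files; the second code in the list is also made up.

```python
import csv
import io

def get_data_by_code(code):
    """Hypothetical stand-in for the real scraper: no request is made."""
    row = [code, "Valid", "2009-04-01", "2018-02-16", "2019-04-01"]
    # Placeholder product rows; the real ones would come from the page's table.
    table = [[code, "product type A"], [code, "product type B"]]
    return dict(row=row, table=table)

codes = ["C001777", "C001778"]   # the second code is made up for the example

certificate_file = io.StringIO()   # stands in for open(..., 'w', newline='')
product_file = io.StringIO()
certificate_writer = csv.writer(certificate_file)
product_writer = csv.writer(product_file)

for code in codes:
    result = get_data_by_code(code)
    certificate_writer.writerow(result["row"])
    # Written inside the loop, so every code contributes its own product rows.
    product_writer.writerows(result["table"])

print(certificate_file.getvalue())
print(product_file.getvalue())
```

Because both writes happen inside the loop, the product output is not limited to the last license code.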
import json
import urllib2
response = urllib2.urlopen('http://www.energyhive.com/mobile_proxy/getCurrentVa$
content = response.read()
for x in json.loads(content):
    if x["cid"] == "PWER":
        print(x["data"])
Hi all, I have some code, and I need part of its output sent to a text file. For example, given [{u'1438923522000': 98}], I only need the value after the colon (98) saved to a text file, or better, to SQL.
If x["data"] == [{u'1438923522000': 98}] then x["data"][0] == {u'1438923522000': 98}. If you can guarantee that the dict will only have one key (as your example does) then the expression you are looking for is something like
next(iter(x["data"][0].values()))
dict.values() returns a list on Python 2 (which the urllib2 import and the u'' prefix suggest you are using) and a view object on Python 3. Neither is an iterator, so wrap it in iter() first; calling next() on that gives you the first value without needing to know what the associated key is.
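Since the question also mentions saving to SQL, here is a minimal sketch that pulls the single key/value pair out of the sample data and stores it with the standard-library `sqlite3` module. The in-memory database, the table name `readings`, and its column names are assumptions made for this example (and the syntax is Python 3), not anything from the original code.

```python
import sqlite3

data = [{u'1438923522000': 98}]   # sample value of x["data"] from the question

# items() pairs the single key (the timestamp) with its value.
timestamp, value = next(iter(data[0].items()))

connection = sqlite3.connect(":memory:")   # hypothetical in-memory database
connection.execute("CREATE TABLE readings (timestamp_ms INTEGER, value INTEGER)")
connection.execute("INSERT INTO readings VALUES (?, ?)", (int(timestamp), value))
stored = connection.execute("SELECT timestamp_ms, value FROM readings").fetchall()
connection.close()

print(value)
print(stored)
```

Writing to a text file instead would only need `open("output.txt", "w")` and a `write(str(value))` call in place of the database statements.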
#!/usr/bin/env python
import urllib2
import json
api_key = 'VtxgIC2UnhfUmXe_pBksov7-lguAQMZD'
url = 'http://www.energyhive.com/mobile_proxy/getCurrentValuesSummary?token='+api_key
response = urllib2.urlopen(url)
content = response.read()
for x in json.loads(content):
    if x["cid"] == "PWER":
        print (x["data"])
for y in json.loads(content):
    if y["cid"] == "PWER_GAC":
        print(y["data"])
When I run this code I get
[{u'1439087809000': 36}]
[{u'1439087809000': 0}]
I would like to drop everything apart from the values:
36
0
(Updated the API key so the code runs.)
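A minimal sketch of printing only the values is shown below. The `content` string is a made-up payload shaped like the output above (the real response may contain more fields), and the loop parses it once and checks both ids in a single pass rather than decoding the JSON twice.

```python
import json

# Made-up payload shaped like the output in the question.
content = (
    '[{"cid": "PWER", "data": [{"1439087809000": 36}]},'
    ' {"cid": "PWER_GAC", "data": [{"1439087809000": 0}]}]'
)

wanted = ("PWER", "PWER_GAC")
values = []
for entry in json.loads(content):
    if entry["cid"] in wanted:
        for reading in entry["data"]:
            # Keep only the values; the timestamp keys are discarded.
            values.extend(reading.values())

for value in values:
    print(value)
```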
I am using Python3's csv module and am wondering why I cannot control quoting correctly. I am using the option quoting = csv.QUOTE_NONNUMERIC but am still seeing all entries quoted. Any idea as to why that is?
Here's my code. Essentially, I am reading in a csv file and want to remove all duplicate lines that have the same text string:
import sys
import csv
class Row:
    def __init__(self, row):
        self.text, self.a, self.b = row
        self.elements = row
with open(sys.argv[2], 'w', newline='') as output:
    writer = csv.writer(output, delimiter=';', quotechar='"',
                        quoting=csv.QUOTE_NONNUMERIC)
    with open(sys.argv[1]) as input:
        reader = csv.reader(input, delimiter=';')
        header = next(reader)
        Row.labels = header
        assert Row.labels[1] == 'Label1'
        writer.writerow(header)
        texts = set()
        for row in reader:
            row_object = Row(row)
            if row_object.text not in texts:
                writer.writerow(row_object.elements)
                texts.add(row_object.text)
When I look at the generated file, the content looks like this:
"Label1";"Label2";"Label3"
"AAA";"123";"456"
...
But I want this:
"Label1";"Label2";"Label3"
"AAA";123;456
...
OK ... I figured it out myself. The answer, I am afraid, was rather simple, and obvious in retrospect. Since the content of each line is obtained from a csv.reader(), its elements are strings by default. As a result, they get quoted by the subsequently employed csv.writer().
To be treated as an int, they first need to be cast to an int:
row_object.elements[1]= int(row_object.a)
This explanation can be proven by inserting a type check before and after this cast:
print('Type: {}'.format(type(row_object.elements[1])))
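The effect can be reproduced without any files. In the sketch below, `to_number` is a hypothetical helper (not part of the original script) that converts integer-looking fields, and an `io.StringIO` buffer stands in for the output file.

```python
import csv
import io

def to_number(field):
    """Return an int when the field looks like one, else the original string."""
    try:
        return int(field)
    except ValueError:
        return field

row_from_reader = ["AAA", "123", "456"]   # csv.reader yields strings only

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", quotechar='"',
                    quoting=csv.QUOTE_NONNUMERIC)
writer.writerow(row_from_reader)                              # all strings: all quoted
writer.writerow([to_number(field) for field in row_from_reader])  # ints stay unquoted

print(buffer.getvalue())
```

Converting every field this way avoids hard-coding which column indexes hold numbers, at the cost of also converting any text column that happens to contain only digits.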