I am trying to make a program that uses web crawling to retrieve stock info, but somehow the program is not able to decode the webpage. I want this code to be strictly for Python 2.
import urllib2
import re

stock = raw_input("Give the stock name: ")
# The stock symbol has to be appended to the query URL
url = "https://www.google.com/finance?q=" + stock
req = urllib2.Request(url)
response = urllib2.urlopen(req)
# response.read() returns a byte string; decode it once to unicode
data1 = response.read().decode('utf-8')
# print(data1)  # debug: dump the whole page

# Locate the price metadata tag and cut out a window around it
m = re.search('meta itemprop="price"', data1)
start = m.start()
end = start + 50
newString = data1[start:end]

# Move past the content=" attribute
m = re.search('content="', newString)
start = m.end()
newString1 = newString[start:]

# The value ends just before the closing of the tag
m = re.search("/", newString1)
end = m.end() - 3
final = newString1[0:end]
print(final)
This is not a direct answer to your question but a suggestion: try the BeautifulSoup Python library. It has many functions for web scraping and crawling plus other functionality, handles most of what you are trying to achieve in your question, and is compatible with both Python 2 and Python 3.
See https://pypi.python.org/pypi/beautifulsoup4 for the documentation.
A sample example:
import urllib
from bs4 import BeautifulSoup

url = 'http://www.py4inf.com/code/romeo.txt'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
# soup('a') is shorthand for soup.find_all('a')
x = soup('a')
print x
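Applied to your task, here is a minimal Python 2 sketch (assuming the page still serves a meta itemprop="price" tag with the value in its content attribute, as the regex in your question implies):

import urllib2
from bs4 import BeautifulSoup

stock = raw_input("Give the stock name: ")
html = urllib2.urlopen("https://www.google.com/finance?q=" + stock).read()
soup = BeautifulSoup(html, "html.parser")
# Assumption: the price is in <meta itemprop="price" content="...">
tag = soup.find("meta", itemprop="price")
if tag is not None:
    print tag["content"]

This replaces the regex-and-slicing arithmetic with a single attribute lookup, which is less brittle when the surrounding markup changes.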
I want to scrape data at the county level from https://apidocs.covidactnow.org.
However, I can only get a dataframe with one row per county, where the data for each date is stored inside a dictionary in each row. I would like to access this data and store it in long format (i.e. one row per county-date).
import requests
import pandas as pd
import os

if __name__ == '__main__':
    os.chdir('/home/username/Desktop/')
    url = 'https://api.covidactnow.org/v2/counties.timeseries.json?apiKey=ENTER_YOUR_KEY'
    response = requests.get(url).json()
    data = pd.DataFrame(response)
This seems like a trivial question, but I've tried for hours. What would be the best way to achieve that?
Do you mean something like this?
import requests

url = 'https://api.covidactnow.org/v2/states.timeseries.csv?apiKey=YOURAPIKEY'
response = requests.get(url)
csv_response = response.text
# Then you can transform the string to CSV
Check this for string to CSV --> python parsing string to csv format
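If you want the long format you described (one row per county-date) straight from the JSON endpoint, here is a sketch using pandas.json_normalize. The field names actualsTimeseries, fips, county, and state are assumptions about the v2 payload; check them against the API docs and adjust to the series you actually need.

import requests
import pandas as pd

url = 'https://api.covidactnow.org/v2/counties.timeseries.json?apiKey=ENTER_YOUR_KEY'
counties = requests.get(url).json()

# Explode the per-date list inside each county record into rows,
# repeating the county-level identifiers on every row.
long_df = pd.json_normalize(
    counties,
    record_path='actualsTimeseries',   # assumed name of the per-date list
    meta=['fips', 'county', 'state'],  # assumed county-level identifiers
)
print(long_df.head())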
I'm trying to scrape the URLs for the individual players from this website.
I've already tried doing this with bs4, and it just returns [] every time I try to find the table. I switched to lxml to give this a try.
from urllib.request import urlopen
from lxml import etree

url = "https://www.espn.com/soccer/team/squad/_/id/359/arsenal"
tree = etree.HTML(urlopen(url).read())
table = tree.xpath('//*[@id="fittPageContainer"]/div[2]/div[5]/div[1]/div/article/div/section/div[5]/section/table/tbody/tr/td[1]/div/table/tbody/tr[1]/td/span')
print(table)
I expect some sort of output that I could use to get the links, but the code just returns an empty list.
I think this is what you want. The squad table is rendered by JavaScript, so the raw HTML you fetch with urlopen doesn't contain it; Selenium drives a real browser, which runs the scripts first.
from selenium import webdriver

driver = webdriver.Firefox(executable_path=r'C:\files\geckodriver.exe')
driver.set_page_load_timeout(30)
driver.get("https://www.espn.com/soccer/team/squad/_/id/359/arsenal")

# Collect every anchor that has an href attribute
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
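If you only need the player pages, you can then filter the collected hrefs. The '/soccer/player/' path segment below is an assumption about ESPN's URL scheme, so verify it against the links the loop above prints:

# Keep only links that look like player pages ('/soccer/player/' is
# an assumed path segment; check it against the printed output).
player_links = [
    elem.get_attribute("href")
    for elem in driver.find_elements_by_xpath("//a[@href]")
    if "/soccer/player/" in elem.get_attribute("href")
]
print(player_links)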
Here I am trying to extract a table from a website, as shown in the Python code below. I am able to get the HTML table, but I am unable to convert it to a data frame. Here is the code:
# import libraries
import requests
from bs4 import BeautifulSoup
# specify url
url = 'http://my-trade.in/'
# request html
page = requests.get(url)
# Parse html using BeautifulSoup, you can use a different parser like lxml if present
soup = BeautifulSoup(page.content, 'html.parser')
tbl = soup.find("table", {"id": "MainContent_dataGridView1"})
You can just use the pandas read_html function for that. Remember to convert the tag you get back to a string first, or you will get a parsing error.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://my-trade.in/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
tbl = soup.find("table", {"id": "MainContent_dataGridView1"})
# read_html returns a list of DataFrames; take the first (and only) match
data_frame = pd.read_html(str(tbl))[0]
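Alternatively, pandas can fetch the page and pick out the table in one step via read_html's attrs filter (a standard read_html parameter; the id comes from the question):

import pandas as pd

# Fetch the page and select the table by its id in one call;
# read_html still returns a list, so take the first element.
data_frame = pd.read_html('http://my-trade.in/',
                          attrs={'id': 'MainContent_dataGridView1'})[0]
print(data_frame.head())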
I am a novice at writing web crawlers. I want to use the search engine at http://www.creditchina.gov.cn/search_all#keyword=&searchtype=0&templateId=&creditType=&areas=&objectType=2&page=1 to check whether my input is valid.
For example, 912101127157655762 is a valid input, and 912101127157655760 is invalid.
After observing the page source in the developer tools, I found that if the input is an invalid number, the tags would be:
While if the input is valid, the tags would be:
So I want to determine whether the input is valid by checking whether there is anything inside the ul class="credit-info-results public-results-left item-template" tag. Here is how I wrote my web crawler:
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.creditchina.gov.cn/search_all#keyword=912101127157655762&searchtype=0&templateId=&creditType=&areas=&objectType=2&page=1'
req = urllib.request.Request(url)
data = urllib.request.urlopen(req)
bs = data.read().decode('utf-8')
soup = BeautifulSoup(bs, 'lxml')
check = soup.find_all("ul", {"class": "credit-info-results public-results-left item-template"})
if check == []:
    pass  # TODO: handle invalid input
else:
    pass  # TODO: handle valid input
However, the value of check is always []. I cannot understand why there is nothing inside the tag. I hope somebody can help me solve the problem.
What you get back is not the rendered HTML but the page shell: everything after the # in the URL is a fragment, which the browser never sends to the server, and the result list is filled in afterwards by JavaScript. That's why BS finds nothing to parse.
You can use a substring search to check whether the raw response contains the marker or not.
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.creditchina.gov.cn/search_all#keyword=912101127157655762&searchtype=0&templateId=&creditType=&areas=&objectType=2&page=1'
req = urllib.request.Request(url)
data = urllib.request.urlopen(req)
bs = data.read().decode('utf-8')
# str.find returns -1 when the substring is absent
ul_pos = bs.find('credit-info-results public-results-left item-template')
if ul_pos != -1:
    bs = bs[ul_pos:]
soup = BeautifulSoup(bs, 'lxml')
check = soup.find_all("ul", {"class": "credit-info-results public-results-left item-template"})
if check == []:
    pass  # TODO: handle invalid input
else:
    pass  # TODO: handle valid input
I receive a POSTed JSON with mod_wsgi on Apache. I have to forward the JSON to some API (using POST), take the API's response, and respond back to where the initial POST came from.
Here goes the Python code:
import requests
import urllib.parse

def application(environ, start_response):
    url = "http://texchange.nowtaxi.ru/api/secret_api_key/"
    query = environ['QUERY_STRING']
    if query == "get":
        url += "tariff/list"
        r = requests.get(url)
        response_headers = [('Content-type', 'application/json')]
    else:
        url += "order/put"
        input_len = int(environ.get('CONTENT_LENGTH', '0'))
        data = environ['wsgi.input'].read(input_len)
        decoded = data.decode('utf-8')
        unquoted = urllib.parse.unquote(decoded)
        print(decoded)   # 'from%5Baddress%5D=%D0%'
        print(unquoted)  # 'from[address]=\xd0\xa0'
        r = requests.post(url, data)
        # Content-Length must be the byte length of the encoded body
        output_len = len(r.text.encode('utf-8'))
        response_headers = [('Content-type', 'application/json'),
                            ('Content-Length', str(output_len))]
    status = "200 OK"
    start_response(status, response_headers)
    return [r.text.encode('utf-8')]
The actual JSON starts with {"from":{"address":"Россия.
I thought those \x's are called escape sequences, so I tried ast.literal_eval and codecs.getdecoder("unicode_escape"), but that didn't help. I can't google the case properly, because I feel like I've misunderstood what is actually happening here. Maybe I have to somehow change the $.post() call in the .js file that sends the POST to the WSGI script?
UPD: my bro said that it's totally unclear what I need, so I'll clarify. I need to get a string that represents the received JSON in its initial form, with Cyrillic letters, quotes, braces, etc. What I DO get after decoding the received byte sequence is 'from%5Baddress%5D=%D0%'. If I unquote it, it turns into 'from[address]=\xd0\xa0', but that's still not what I want.
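For what it's worth, a minimal sketch of what seems to be going on: jQuery's $.post() sends application/x-www-form-urlencoded data by default, not JSON, so the body should be parsed as form fields rather than decoded as JSON. The body string below is an assumption, reconstructed by extending the truncated debug output from the question:

from urllib.parse import parse_qs

# Hypothetical full body, extending the truncated 'from%5Baddress%5D=%D0%'
# from the question; %D0%A0... is 'Россия' percent-encoded as UTF-8.
body = 'from%5Baddress%5D=%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D1%8F'

fields = parse_qs(body)          # percent-decodes keys and values
print(fields['from[address]'])   # ['Россия']

If you want real JSON on the WSGI side instead, the JS would have to send JSON.stringify(payload) with the Content-Type set to application/json (e.g. via $.ajax), after which json.loads(decoded) returns the original structure.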