`import json
import urllib2
response = urllib2.urlopen('http://www.energyhive.com/mobile_proxy/getCurrentVa$
content = response.read()
for x in json.loads(content):
if x["cid"] == "PWER":
print(x["data"])
`
Hi all, I have some code that I require part of the code sent to a txt file, example [{u'1438923522000': 98}], after running code, I just need the txt after : save as txt, or better sql.
If x["data"] == [{u'1438923522000': 98}] then x["data"][0] == {u'1438923522000': 98}. If you can guarantee that the dict will only have one key (as your example does) then the expression you are looking for is something like
next(x["data"][0].values())
Since you appear to be using Python 3, dict.values() is a generator, so calling next() on it gives you the first value without needing to know what the associated key is.
#!/usr/bin/env python
import urllib2
import json
api_key = 'VtxgIC2UnhfUmXe_pBksov7-lguAQMZD'
url = 'http://www.energyhive.com/mobile_proxy/getCurrentValuesSummary?token='+api_key
response = urllib2.urlopen(url)
content = response.read()
for x in json.loads(content):
if x["cid"] == "PWER":
print (x["data"])
for y in json.loads(content):
if y["cid"] == "PWER_GAC":
print(y["data"])
when i load this code i get
[{u'1439087809000': 36}]
[{u'1439087809000': 0}]
i would like to delete everything apart from the results
36
0
updated api to run code
Related
I am trying to scrape the pokemon API and create a dataset for all pokemon. So I have written a function which looks like this:
import requests
import json
import pandas as pd
def poke_scrape(x, y):
'''
A function that takes in a range of pokemon (based on pokedex ID) and returns
a pandas dataframe with information related to the pokemon using the Poke API
'''
#GATERING THE DATA FROM API
url = 'https://pokeapi.co/api/v2/pokemon/'
ids = range(x, (y+1))
pkmn = []
for id_ in ids:
url = 'https://pokeapi.co/api/v2/pokemon/' + str(id_)
pages = requests.get(url).json()
# content = json.dumps(pages, indent = 4, sort_keys=True)
if 'error' not in pages:
pkmn.append([pages['id'], pages['name'], pages['abilities'], pages['stats'], pages['types']])
#MAKING A DATAFRAME FROM GATHERED API DATA
cols = ['id', 'name', 'abilities', 'stats', 'types']
df = pd.DataFrame(pkmn, columns=cols)
The code works fine for most pokemon. However, when I am trying to run poke_scrape(229, 229) (so trying to load ONLY the 229th pokemon), it gives me the JSONDecodeError. It looks like this:
So far I have tried using json.loads() instead but that has not solved the issue. What is even more perplexing is that specific pokemon has loaded before and the same issue was with another ID - otherwise I could just manually enter the stats for the specific pokemon that is unable to load into my dataframe. Any help is appreciated!
Because of the way the PokeAPI works, some links to the JSON data for each pokemon only load when the links end with a '/' (such as https://pokeapi.co/api/v2/pokemon/229/ vs https://pokeapi.co/api/v2/pokemon/229 - first link will work and the second will return not found). However, others will respond with a response error because of the added '/' so fixed the issue with a few if statements right after the for loop in the beginning of the function
this is my second try to explain a bit more precisely what I'm looking for ;-)
I set a webhook in Mailchimp that fires every time a new subscriber of an audience appears. Mailchimp sends a HTTP POST request to a Jira Sriptrunner REST endpoint.
The content type of this request is application/x-www-form-urlencoded.
Within the Jira endpoint I would like to read the request data. How can I do that?
The payload (raw body) I receive looks like this:
type=unsubscribe&fired_at=2020-05-26+07%3A04%3A42&data%5Baction%5D=unsub&data%5Breason%5D=manual&data%5Bid%5D=34f28a4516&data%5Bemail%5D=examlple%40bla.com&data%5Bemail_type%5D=html&data%5Bip_opt%5D=xx.xxx.xxx.198&data%5Bweb_id%5D=118321378&data%5Bmerges%5D%5BEMAIL%5D=example%40bla.com&data%5Bmerges%5D%5BFNAME%5D=Horst&data%5Bmerges%5D%5BLNAME%5D=Schlemmer&data%5Bmerges%5D%5BCOMPANY%5D=First&data%5Bmerges%5D%5BADDRESS%5D%5Baddr1%5D=XXX
Now I would like to parse the data of the raw body into a JSON or something similiar.
The result might look like this:
{
"web_id": 123,
"email": "example#bla.com",
"company": "First",
...
}
Meanwhile I searched around a little and found something like the node.js "querystring" module. It would be great if there is something similiar within Groovy or any other way to parse the data of application/x-www-form-urlencoded to json format.
Best regards and thanks in advance
Bernhard
def body = "type=unsubscribe&fired_at=2020-05-26+07%3A04%3A42&data%5Baction%5D=unsub&data%5Breason%5D=manual&data%5Bid%5D=34f28a4516&data%5Bemail%5D=examlple%40bla.com&data%5Bemail_type%5D=html&data%5Bip_opt%5D=xx.xxx.xxx.198&data%5Bweb_id%5D=118321378&data%5Bmerges%5D%5BEMAIL%5D=example%40bla.com&data%5Bmerges%5D%5BFNAME%5D=Horst&data%5Bmerges%5D%5BLNAME%5D=Schlemmer&data%5Bmerges%5D%5BCOMPANY%5D=First&data%5Bmerges%5D%5BADDRESS%5D%5Baddr1%5D=XXX"
def map = body.split('&').collectEntries{e->
e.split('=').collect{ URLDecoder.decode(it, "UTF-8") }
}
assert map.'data[merges][EMAIL]'=='example#bla.com'
map.each{println it}
prints:
type=unsubscribe
fired_at=2020-05-26 07:04:42
data[action]=unsub
data[reason]=manual
data[id]=34f28a4516
data[email]=examlple#bla.com
data[email_type]=html
data[ip_opt]=xx.xxx.xxx.198
data[web_id]=118321378
data[merges][EMAIL]=example#bla.com
data[merges][FNAME]=Horst
data[merges][LNAME]=Schlemmer
data[merges][COMPANY]=First
data[merges][ADDRESS][addr1]=XXX
A imple no-brainer groovy:
def a = '''
data[email_type]: html
data[web_id]: 123
fired_at: 2020-05-26 07:28:25
data[email]: example#bla.com
data[merges][COMPANY]: First
data[merges][FNAME]: Horst
data[ip_opt]: xx.xxx.xxx.xxx
data[merges][PHONE]: xxxxx
data[merges][ADDRESS][zip]: 33615
type: subscribe
data[list_id]: xxXXyyXX
data[merges][ADDRESS][addr1]: xxx.xxx'''
def res = [:]
a.eachLine{
def parts = it.split( /\s*:\s*/, 2 )
if( 2 != parts.size() ) return
def ( k, v ) = parts
def complexKey = ( k =~ /\[(\w+)\]/ ).findAll()
if( complexKey ) complexKey = complexKey.last().last()
res[ ( complexKey ?: k ).toLowerCase() ] = v
}
res
gives:
[email_type:html, web_id:123, fired_at:2020-05-26 07:28:25,
email:example#bla.com, company:First, fname:Horst, ip_opt:xx.xxx.xxx.xxx,
phone:xxxxx, zip:33615, type:subscribe, list_id:xxXXyyXX, addr1:xxx.xxx]
I found a solution finally. I hope you understand and maybe it helps others too ;-)
Starting from daggett's answer I did the following:
// Split body and remove unnecessary characters
def map = body.split('&').collectEntries{e->
e.split('=').collect{ URLDecoder.decode(it, "UTF-8") }
}
// Processing the map to readable stuff
def prettyMap = new JsonBuilder(map).toPrettyString()
// Convert the pretty map into a json object
def slurper = new JsonSlurper()
def jsonObject = slurper.parseText(prettyMap)
(The map looks pretty much like in daggett's answer.
prettyMap)
Then I extract the keys:
// Finally extracting customer data
def type = jsonObject['type']
And I get the data I need. For example
Type : subscribe
...
First Name : Heinz
...
Thanks to daggett!
My goal is to get specific data on many profiles on khanacademy by using their API.
My problem is: in their API, json files have different list orders. It can vary from one to another.
Here is my code:
from urllib.request import urlopen
import json
# here is a list with two json file links:
profiles=['https://www.khanacademy.org/api/internal/user/kaid_329989584305166460858587/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959','https://www.khanacademy.org/api/internal/user/kaid_901866966302088310331512/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959']
# for each json file, take some specific data out
for profile in profiles:
print(profile)
with urlopen(profile) as response:
source = response.read()
data = json.loads(source)
votes = data[1]['renderData']['discussionData']['statistics']['votes']
print(votes)
I expected something like this:
https://www.khanacademy.org/api/internal/user/kaid_329989584305166460858587/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959
100
https://www.khanacademy.org/api/internal/user/kaid_901866966302088310331512/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959
41
Instead I got an error:
https://www.khanacademy.org/api/internal/user/kaid_329989584305166460858587/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959
100
https://www.khanacademy.org/api/internal/user/kaid_901866966302088310331512/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959
Traceback (most recent call last):
File "bitch.py", line 12, in <module>
votes = data[1]['renderData']['discussionData']['statistics']['votes']
KeyError: 'discussionData'
As we can see:
This link A is working fine: https://www.khanacademy.org/api/internal/user/kaid_329989584305166460858587/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959
But this link B is not working: https://www.khanacademy.org/api/internal/user/kaid_901866966302088310331512/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959 And that's because in this json file. The list is not in the same order as it is in the A link.
My question is: Why? And how can I write my script to get into account these variation of orders?
There is probably something to do with .sort(). But I am missing something.
Maybe I should also precise that I am using python 3.7.2.
Link A: desired data (yellow) is in the second item of the list (blue):
Link B: desired data (yellow) is in the third item of the list (blue):
You could use an if to test if votes in current index dictionary
import requests
urls = ['https://www.khanacademy.org/api/internal/user/kaid_329989584305166460858587/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959',
'https://www.khanacademy.org/api/internal/user/kaid_901866966302088310331512/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959']
for url in urls:
r = requests.get(url).json()
result = [item['renderData']['discussionData']['statistics']['votes'] for item in r if 'votes' in str(item)]
print(result)
Catching exceptions in python doesn't take much overhead unlike other languages so I would recommend the "better ask forgiveness then permission" solution. This will be slightly faster than searching through a str for the word votes as it will fail instantly if the key is invalid.
import requests
urls = ['https://www.khanacademy.org/api/internal/user/kaid_329989584305166460858587/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959',
'https://www.khanacademy.org/api/internal/user/kaid_901866966302088310331512/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959']
for url in urls:
response = requests.get(url).json()
result = []
for item in response:
try:
result.append(item['renderData']['discussionData']['statistics']['votes'])
except KeyError:
pass # Could not find votes
print(result)
So I'm aiming to scrape 2 tables (in different formats) from a website - https://info.fsc.org/details.php?id=a0240000005sQjGAAU&type=certificate after using the search bar to iterate this over a list of license codes. I haven't included the loop fully yet but I added it at the top for completeness.
My issue is that because the two tables I want, Product Data and Certificate Data are in 2 different formats, so I have to scrape them separately. As the Product data is in the normal "tr" format on the webpage, this bit is easy and I've managed to extract a CSV file of this. The harder bit is extracting Certificate Data, as it is in "div" form.
I've managed to print the Certificate Data as a list of text, using the class function, however I need to have it in a tabular form saved in a CSV file. As you can see, I've tried several unsuccessful ways of converting it to a CSV but If you have any suggestions, it would be much appreciated, thank you!! Also any other general tips to improve my code would be great too, as I am new to web-scraping.
#namelist = open('example.csv', newline='', delimiter = 'example')
#for name in namelist:
#include all of the below
driver = webdriver.Chrome(executable_path="/Users/jamesozden/Downloads/chromedriver")
url = "https://info.fsc.org/certificate.php"
driver.get(url)
search_bar = driver.find_element_by_xpath('//*[#id="code"]')
search_bar.send_keys("FSC-C001777")
search_bar.send_keys(Keys.RETURN)
new_url = driver.current_url
r = requests.get(new_url)
soup = BeautifulSoup(r.content,'lxml')
table = soup.find_all('table')[0]
df, = pd.read_html(str(table))
certificate = soup.find(class_= 'certificatecl').text
##certificate1 = pd.read_html(str(certificate))
driver.quit()
df.to_csv("Product_Data.csv", index=False)
##certificate1.to_csv("Certificate_Data.csv", index=False)
#print(df[0].to_json(orient='records'))
print certificate
Output:
Status
Valid
First Issue Date
2009-04-01
Last Issue Date
2018-02-16
Expiry Date
2019-04-01
Standard
FSC-STD-40-004 V3-0
What I want but over hundreds/thousands of license codes (I just manually created this one sample in Excel):
Desired output
EDIT
So whilst this is now working for Certificate Data, I also want to scrape the Product Data and output that into another .csv file. However currently it is only printing 5 copies of the product data for the final license code which is not what I want.
New Code:
df = pd.read_csv("MS_License_Codes.csv")
codes = df["License Code"]
def get_data_by_code(code):
data = [
('code', code),
('submit', 'Search'),
]
response = requests.post('https://info.fsc.org/certificate.php', data=data)
soup = BeautifulSoup(response.content, 'lxml')
status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text
return [code, status, first_issue_date, last_issue_date, expiry_date, standard]
# Just insert here output filename and codes to parse...
OUTPUT_FILE_NAME = 'Certificate_Data.csv'
#codes = ['C001777', 'C001777', 'C001777', 'C001777']
df3=pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
writer = csv.writer(f)
for code in codes:
print('Getting code# {}'.format(code))
writer.writerow((get_data_by_code(code)))
table = soup.find_all('table')[0]
df1, = pd.read_html(str(table))
df3 = df3.append(df1)
df3.to_csv('Product_Data.csv', index = False, encoding='utf-8')
Here's all you need.
No chromedriver. No pandas. Forget about it in context of scraping.
import requests
import csv
from bs4 import BeautifulSoup
# This is all what you need for your task. Really.
# No chromedriver. Don't use it for scraping. EVER.
# No pandas. Don't use it for writing csv. It's not what pandas was made for.
#Function to parse single data page based on single input code.
def get_data_by_code(code):
# Parameters to build POST-request.
# "type" and "submit" params are static. "code" is your desired code to scrape.
data = [
('type', 'certificate'),
('code', code),
('submit', 'Search'),
]
# POST-request to gain page data.
response = requests.post('https://info.fsc.org/certificate.php', data=data)
# "soup" object to parse html data.
soup = BeautifulSoup(response.content, 'lxml')
# "status" variable. Contains first's found [LABEL tag, with text="Status"] following sibling DIV text. Which is status.
status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
# Same for issue dates... etc.
first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text
# Returning found data as list of values.
return [response.url, status, first_issue_date, last_issue_date, expiry_date, standard]
# Just insert here output filename and codes to parse...
OUTPUT_FILE_NAME = 'output.csv'
codes = ['C001777', 'C001777', 'C001777', 'C001777']
with open(OUTPUT_FILE_NAME, 'w') as f:
writer = csv.writer(f)
for code in codes:
print('Getting code# {}'.format(code))
#Writing list of values to file as single row.
writer.writerow((get_data_by_code(code)))
Everything is really straightforward here. I'd suggest you spend some time in Chrome dev tools "network" tab to have a better understanding of request forging, which is a must for scraping tasks.
In general, you don't need to run chrome to click the "search" button, you need to forge request generated by this click. Same for any form and ajax.
well... you should sharpen your skills (:
df3=pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
writer = csv.writer(f)
for code in codes:
print('Getting code# {}'.format(code))
writer.writerow((get_data_by_code(code)))
### HERE'S THE PROBLEM:
# "soup" variable is declared inside of "get_data_by_code" function.
# So you can't use it in outer context.
table = soup.find_all('table')[0] # <--- you should move this line to
#definition of "get_data_by_code" function and return it's value somehow...
df1, = pd.read_html(str(table))
df3 = df3.append(df1)
df3.to_csv('Product_Data.csv', index = False, encoding='utf-8')
As per example you can return dictionary of values from "get_data_by_code" function:
def get_data_by_code(code):
...
table = soup.find_all('table')[0]
return dict(row=row, table=table)
I need to take this following URL, get the JSON data, and some how print a certain part of it... For example the "BTC-PCN" part, specifically the "High". Any help? Thanks:
import urllib2
import json
response = urllib2.urlopen('https://bittrex.com/api/v1/public/getmarketsummaries')
data = json.load(response)
data_dump = json.dumps(data)
parsed = json.loads(data_dump)
print parsed['result'][80]
For some reason I get a "5e-08".... Thank you very much in advance for your help. :)
What's the confusion?
parsed['result'][80] is:
{
u'Volume': 10864.00514914,
u'Last': 3e-08,
u'TimeStamp': u'2014-04-22T01:38:24.103',
u'High': 5e-08,
u'MarketName': u'BTC-PCN',
u'Low': 2e-08,
u'BaseVolume': 0.00028593
}
u'BTC-PCN' is the value of the u'MarketName':
parsed['result'][80]['MarketName'] # u'BTC-PCN'
5e-08 is the value of the u'High':
parsed['result'][80]['High'] # 5e-08