My goal is to get specific data from many Khan Academy profiles by using their API.
My problem is: in their API, the JSON responses contain a list whose item order varies from one profile to another.
Here is my code:
from urllib.request import urlopen
import json
# here is a list with two json file links:
profiles=['https://www.khanacademy.org/api/internal/user/kaid_329989584305166460858587/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959','https://www.khanacademy.org/api/internal/user/kaid_901866966302088310331512/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959']
# for each json file, take some specific data out
for profile in profiles:
    print(profile)
    with urlopen(profile) as response:
        source = response.read()
    data = json.loads(source)
    votes = data[1]['renderData']['discussionData']['statistics']['votes']
    print(votes)
I expected something like this:
https://www.khanacademy.org/api/internal/user/kaid_329989584305166460858587/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959
100
https://www.khanacademy.org/api/internal/user/kaid_901866966302088310331512/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959
41
Instead I got an error:
https://www.khanacademy.org/api/internal/user/kaid_329989584305166460858587/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959
100
https://www.khanacademy.org/api/internal/user/kaid_901866966302088310331512/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959
Traceback (most recent call last):
File "bitch.py", line 12, in <module>
votes = data[1]['renderData']['discussionData']['statistics']['votes']
KeyError: 'discussionData'
As we can see:
Link A is working fine: https://www.khanacademy.org/api/internal/user/kaid_329989584305166460858587/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959
But link B is not working: https://www.khanacademy.org/api/internal/user/kaid_901866966302088310331512/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959 That is because, in this JSON file, the list is not in the same order as it is in link A.
My question is: why? And how can I write my script to take these variations in order into account?
It probably has something to do with .sort(), but I am missing something.
I should also specify that I am using Python 3.7.2.
Link A: desired data (yellow) is in the second item of the list (blue):
Link B: desired data (yellow) is in the third item of the list (blue):
You could use an if to test whether 'votes' appears in the current item of the list:
import requests
urls = ['https://www.khanacademy.org/api/internal/user/kaid_329989584305166460858587/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959',
'https://www.khanacademy.org/api/internal/user/kaid_901866966302088310331512/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959']
for url in urls:
    r = requests.get(url).json()
    result = [item['renderData']['discussionData']['statistics']['votes'] for item in r if 'votes' in str(item)]
    print(result)
Catching exceptions in Python carries little overhead, unlike in some other languages, so I would recommend the "easier to ask forgiveness than permission" approach. This will be slightly faster than searching through a str for the word votes, because the lookup fails immediately if a key is missing.
import requests
urls = ['https://www.khanacademy.org/api/internal/user/kaid_329989584305166460858587/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959',
'https://www.khanacademy.org/api/internal/user/kaid_901866966302088310331512/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959']
for url in urls:
    response = requests.get(url).json()
    result = []
    for item in response:
        try:
            result.append(item['renderData']['discussionData']['statistics']['votes'])
        except KeyError:
            pass # Could not find votes
    print(result)
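As for the "why": judging from the two links in the question, the endpoint returns a list of widgets, and which widgets are present (and at which position) appears to differ per profile, so a fixed index such as data[1] is not reliable. Here is a minimal offline sketch of a third option that walks the nested keys and skips any item that lacks one of them. The helper name find_votes and the sample data (including the badgeCountData key) are made up for illustration; only the renderData/discussionData/statistics/votes path comes from the question.

```python
# Key path taken from the question; everything else below is illustrative.
PATH = ("renderData", "discussionData", "statistics", "votes")


def find_votes(widgets):
    """Return the first 'votes' value found in a list of widget dicts, or None."""
    for widget in widgets:
        node = widget
        for key in PATH:
            # Stop walking this widget as soon as a level is missing.
            if not isinstance(node, dict) or key not in node:
                break
            node = node[key]
        else:
            # The inner loop finished without break: every key was present.
            return node
    return None


# Made-up sample data: the discussion widget sits at a different index each time.
profile_a = [
    {"renderData": {}},
    {"renderData": {"discussionData": {"statistics": {"votes": 100}}}},
]
profile_b = [
    {"renderData": {}},
    {"renderData": {"badgeCountData": {}}},
    {"renderData": {"discussionData": {"statistics": {"votes": 41}}}},
]

print(find_votes(profile_a))
print(find_votes(profile_b))
```

With real data you would pass the decoded JSON list (for example the result of requests.get(url).json()) to find_votes instead of the samples.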
I am trying to request geocoding data from the Census Bureau. The code runs and has retrieved more than 1,450 records so far (out of about 60K in total), but then it breaks and returns this error:
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
My data looks like this:
Here is my function:
def get_fips(df):
    num=len(ven_lst)
    for i,e in df.itertuples(index=False):
        if e not in repostory_lst:
            try:
                num+=1
                address=i
                vendor=e
                link="https://geocoding.geo.census.gov/geocoder/geographies/onelineaddress?address={0}&benchmark=Public_AR_Census2020&vintage=Census2020_Census2020&layers=10&format=json".format(address)
                response = requests.get(link).text
                reponse_1=json.loads(response)
                x=reponse_1['result']['addressMatches'][0]['geographies']['Census Blocks'][0]['GEOID'][:11]
                fields=[num,e,x]
                # because my data is big, I save every row into this csv in case the code breaks
                with open(r'fibs&ven.csv', 'a',newline='') as f:
                    writer = csv.writer(f)
                    writer.writerow(fields)
            except (RuntimeError, TypeError, NameError,IndexError):
                pass
        elif e in repostory_lst:
            pass
    #df_result=pd.DataFrame(columns=['Vendor Code','fips'],index=range(len(fips_lst)))
    #df_result['Vendor Code']=ven_lst
    #df_result['fips']=fips_lst
    #x.to_csv('fibs&ven.csv', mode='a', header=False)
    return None
Normally, if the server-side API is well written and you have specified format=json, it should return only JSON data.
As a fail-safe, though, you should add an Accept header to your request to ask for a JSON response.
Also, instead of .text you can call .json(), so you do not need json.loads.
For example:
headers = {'Accept': 'application/json'}
reponse_1 = requests.get(link, headers=headers).json()
# reponse_1=json.loads(response)
It turns out the issue was that some of the addresses were PO boxes. Once I removed them, the code worked as expected.
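One more observation on the function in the question: its except clause does not list json.JSONDecodeError (a subclass of ValueError), so a single non-JSON response stops the whole loop. Below is a rough sketch of a parser that returns None for a body that is not JSON or has no address match, so the caller can skip that row. The function name extract_tract_geoid and the sample payloads are made up; only the key path comes from the question.

```python
import json


def extract_tract_geoid(response_text):
    """Return the first 11 characters of the block GEOID, or None if unavailable."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        # Empty body or a non-JSON page came back instead of JSON.
        return None
    try:
        block = payload["result"]["addressMatches"][0]["geographies"]["Census Blocks"][0]
        return block["GEOID"][:11]
    except (KeyError, IndexError, TypeError):
        # Valid JSON, but no usable match for this address.
        return None


# Made-up payloads mirroring the structure indexed in the question.
good = json.dumps({
    "result": {"addressMatches": [
        {"geographies": {"Census Blocks": [{"GEOID": "390610007001000"}]}}
    ]}
})
no_match = json.dumps({"result": {"addressMatches": []}})
not_json = ""

print(extract_tract_geoid(good))
print(extract_tract_geoid(no_match))
print(extract_tract_geoid(not_json))
```

In the loop you would call extract_tract_geoid(requests.get(link).text) and write the row only when the result is not None, perhaps logging the skipped addresses so they can be inspected later.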
I got this error when I had a DecimalField.
The response status was 500.
After I changed it to an IntegerField, I got 200 OK.
From
price = models.DecimalField(max_digits=12, decimal_places=2)
To
price = models.IntegerField()
I am trying to scrape the pokemon API and create a dataset for all pokemon. So I have written a function which looks like this:
import requests
import json
import pandas as pd
def poke_scrape(x, y):
    '''
    A function that takes in a range of pokemon (based on pokedex ID) and returns
    a pandas dataframe with information related to the pokemon using the Poke API
    '''
    #GATHERING THE DATA FROM API
    url = 'https://pokeapi.co/api/v2/pokemon/'
    ids = range(x, (y+1))
    pkmn = []
    for id_ in ids:
        url = 'https://pokeapi.co/api/v2/pokemon/' + str(id_)
        pages = requests.get(url).json()
        # content = json.dumps(pages, indent = 4, sort_keys=True)
        if 'error' not in pages:
            pkmn.append([pages['id'], pages['name'], pages['abilities'], pages['stats'], pages['types']])
    #MAKING A DATAFRAME FROM GATHERED API DATA
    cols = ['id', 'name', 'abilities', 'stats', 'types']
    df = pd.DataFrame(pkmn, columns=cols)
    return df
The code works fine for most pokemon. However, when I am trying to run poke_scrape(229, 229) (so trying to load ONLY the 229th pokemon), it gives me the JSONDecodeError. It looks like this:
So far I have tried using json.loads() instead but that has not solved the issue. What is even more perplexing is that specific pokemon has loaded before and the same issue was with another ID - otherwise I could just manually enter the stats for the specific pokemon that is unable to load into my dataframe. Any help is appreciated!
Because of the way the PokeAPI works, some links to the JSON data for a pokemon only load when they end with a '/' (for example, https://pokeapi.co/api/v2/pokemon/229/ works, while https://pokeapi.co/api/v2/pokemon/229 returns "not found"). Others, however, respond with an error because of the added '/', so I fixed the issue with a few if statements right after the for loop at the beginning of the function.
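The if statements themselves are not shown above, so the following is only a sketch of one way to do it, not the exact fix: try the URL with the trailing slash first, then without it, and keep the first response that decodes. The names fetch_pokemon, get_json and fake_get_json are hypothetical; get_json is injected so the example runs offline.

```python
def fetch_pokemon(id_, get_json):
    """Try the URL with and without a trailing slash; return the first JSON found.

    get_json is a caller-supplied (hypothetical) function that returns the decoded
    JSON for a URL, or None when the response is missing or is not valid JSON.
    """
    base = "https://pokeapi.co/api/v2/pokemon/" + str(id_)
    for url in (base + "/", base):
        data = get_json(url)
        if data is not None:
            return data
    return None


# Stand-in fetcher for the demo: pretend only the trailing-slash URL works.
def fake_get_json(url):
    return {"id": 229, "name": "example"} if url.endswith("/") else None


print(fetch_pokemon(229, fake_get_json))
```

A real get_json could call requests.get(url), return None unless status_code is 200, and wrap the .json() call in try/except ValueError, since the decode errors raised by requests are ValueError subclasses.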
My goal is to (1) import Twitter JSON, (2) extract data of interest, (3) create pandas data frame for the variables of interest. Here is my code:
import json
import pandas as pd
tweets = []
for line in open('00.json'):
    try:
        tweet = json.loads(line)
        tweets.append(tweet)
    except:
        continue
# Tweets often have missing data, therefore use -if- when extracting "keys"
tweet = tweets[0]
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]
place = [tweet['place'] for tweet in tweets if 'place' in tweet]
# Create a data frame (using pd.Index may be "incorrect", but I am a noob)
df=pd.DataFrame({'Ids':pd.Index(ids),
                 'Text':pd.Index(text),
                 'Lang':pd.Index(lang),
                 'Geo':pd.Index(geo),
                 'Place':pd.Index(place)})
# Create a data frame satisfying conditions:
df2 = df[(df['Lang']==('en')) & (df['Geo'].notna())]
So far, everything seems to be working fine.
Now, the extracted values for Geo result in the following example:
df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
To get rid of everything except the coordinates inside the square brackets, I tried using:
df2.Geo.str.replace("[({':]", "") ### results in NaN
# and also this:
df2['Geo'] = df2['Geo'].map(lambda x: x.lstrip('{'coordinates': [').rstrip('], 'type': 'Point'')) ### results in syntax error
Please advise on the correct way to obtain coordinates values only.
The following line from your question indicates that this is an issue with understanding the underlying data type of the returned object.
df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
You are returning a Python dictionary here -- not a string! If you want to return just the values of the coordinates, you should just use the 'coordinates' key to return those values, e.g.
df2.loc[1921,'Geo']['coordinates']
[39.11890951, -84.48903638]
The returned object in this case will be a Python list object containing the two coordinate values. If you want just one of the values, you can slice the list, e.g.
df2.loc[1921,'Geo']['coordinates'][0]
39.11890951
This workflow is much easier to deal with than casting the dictionary to a string, parsing the string, and recapturing the coordinate values as you are trying to do.
So let's say you want to create a new column called "geo_coord0" which contains all of the coordinates in the first position (as shown above). You could use something like the following:
df2["geo_coord0"] = [x['coordinates'][0] for x in df2['Geo']]
This uses a Python list comprehension to iterate over all entries in the df2['Geo'] column and for each entry it uses the same syntax we used above to return the first coordinate value. It then assigns these values to a new column in df2.
See the Python documentation on data structures for more details on the data structures discussed above.
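If some entries might not be dictionaries at all (for example None for tweets without geo data), the list comprehension above would raise. Here is a small defensive sketch, assuming the same 'coordinates' layout as in the question; the helper name extract_coordinates is made up.

```python
def extract_coordinates(geo):
    """Return (first, second) coordinate values from a geo dict, or (None, None)."""
    if not isinstance(geo, dict):
        return (None, None)  # e.g. None or NaN for tweets without geo data
    coords = geo.get("coordinates") or []
    if len(coords) != 2:
        return (None, None)
    return (coords[0], coords[1])


sample = {"coordinates": [39.11890951, -84.48903638], "type": "Point"}
print(extract_coordinates(sample))
print(extract_coordinates(None))
```

It could be applied per entry with df2['Geo'].map(extract_coordinates), which gives a Series of tuples that you can then split into two columns.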
Is it possible to obtain similar PubMed articles given a PMID? For example, this link shows similar articles on the right-hand side.
You can do it with BioPython using the NCBI API. The command you are looking for is neighbor_score. Alternatively you can get the data directly via the URL.
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
handle = Entrez.elink(db="pubmed", id="26998445", cmd="neighbor_score", rettype="xml")
records = Entrez.read(handle)
scores = sorted(records[0]['LinkSetDb'][0]['Link'], key=lambda k: int(k['Score']))
#show the top 5 results
for i in range(1, 6):
    handle = Entrez.efetch(db="pubmed", id=scores[-i]['Id'], rettype="xml")
    record = Entrez.read(handle)
    print(record)
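As a side note, if you only need the highest-scoring links, heapq.nlargest avoids sorting the whole list and indexing it from the end. This is an offline sketch with made-up links in the same 'Id'/'Score' shape used above; the helper name top_links is hypothetical.

```python
import heapq

# Made-up links: each has an 'Id' and a 'Score' stored as a string.
links = [
    {"Id": "111", "Score": "500"},
    {"Id": "222", "Score": "9000"},
    {"Id": "333", "Score": "1200"},
]


def top_links(links, n=5):
    """Return the n links with the highest numeric score, best first."""
    # Convert to int so scores compare numerically rather than as text.
    return heapq.nlargest(n, links, key=lambda link: int(link["Score"]))


top_ids = [link["Id"] for link in top_links(links, n=2)]
print(top_ids)
```

With real data you would pass records[0]['LinkSetDb'][0]['Link'] instead of the sample list.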
BACKGROUND:
I am having issues trying to search through some CSV files.
I've gone through the python documentation: http://docs.python.org/2/library/csv.html
about the csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds) object of the csv module.
My understanding is that csv.DictReader assumes the first line/row of the file contains the fieldnames; however, my CSV dictionary file simply starts with "key","value" and goes on for at least 500,000 lines.
My program will ask the user for the title (thus the key) they are looking for, and present the value (which is the 2nd column) to the screen using the print function. My problem is how to use csv.DictReader to search for a specific key and print its value.
Sample Data:
Below is an example of the csv file and its contents...
"Mamer","285713:13"
"Champhol","461034:2"
"Station Palais","972811:0"
So if I want to find "Station Palais" (input), my output will be 972811:0. I am able to manipulate the string and create the overall program; I just need help with csv.DictReader. I appreciate any assistance.
EDITED PART:
import csv
def main():
    with open('anchor_summary2.csv', 'rb') as file_data:
        list_of_stuff = []
        reader = csv.DictReader(file_data, ("title", "value"))
        for i in reader:
            list_of_stuff.append(i)
    print list_of_stuff
main()
The documentation you linked to provides half the answer:
class csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds)
[...] maps the information read into a dict whose keys are given by the optional fieldnames parameter. If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as the fieldnames.
It would seem that if the fieldnames parameter is passed, the given file will not have its first record interpreted as headers (the parameter will be used instead).
# file_data is the open file object, not the filename
reader = csv.DictReader(file_data, ("title", "value"))
for i in reader:
    list_of_stuff.append(i)
which will (apparently; I've been having trouble with it) produce the following data structure:
[{"title": "Mamer", "value": "285713:13"},
{"title": "Champhol", "value": "461034:2"},
{"title": "Station Palais", "value": "972811:0"}]
which may need to be further massaged into a title-to-value mapping by something like this:
data = {}
for i in list_of_stuff:
    data[i["title"]] = i["value"]
Now just use the keys and values of data to complete your task.
And here it is as a dictionary comprehension:
data = {row["title"]: row["value"] for row in csv.DictReader(file_data, ("title", "value"))}
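To check this behaviour without the real file, here is a small Python 3 sketch that feeds the three sample rows from the question through csv.DictReader using io.StringIO. The fieldnames "title" and "value" are the ones chosen above; the rest is just a test harness.

```python
import csv
import io

# The three sample rows from the question, with no header row.
sample = '"Mamer","285713:13"\n"Champhol","461034:2"\n"Station Palais","972811:0"\n'

# Because fieldnames is passed, the first row is treated as data, not as a header.
reader = csv.DictReader(io.StringIO(sample), fieldnames=("title", "value"))
rows = list(reader)

# Collapse the list of row dicts into a single title-to-value mapping.
data = {row["title"]: row["value"] for row in rows}

print(rows[0])
print(data["Station Palais"])
```

The first row comes back as a record with title "Mamer", which supports the reading of the documentation given above.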
The currently accepted answer is fine, but there's a slightly more direct way of getting at the data. The dict() constructor in Python can take any iterable of key-value pairs, and each two-column row produced by csv.reader is exactly such a pair.
In addition, your code might have issues on Python 3, because Python 3's csv module expects the file to be opened in text mode, not binary mode. You can make your code compatible with 2 and 3 by using io.open instead of open.
import csv
import io
with io.open('anchor_summary2.csv', 'r', newline='', encoding='utf-8') as f:
    data = dict(csv.reader(f))
print(data['Champhol'])
As a warning, if your csv file has two rows with the same value in the first column, the later value will overwrite the earlier value. (This is also true of the other posted solution.)
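If that overwriting matters for your data, one option is to keep the first value for each key and collect the repeats separately so you can inspect them. This is a minimal sketch with made-up sample rows (the second "Mamer" row is invented to show the effect).

```python
import csv
import io

# Made-up sample with a repeated key in the first column.
sample = '"Mamer","285713:13"\n"Mamer","999999:9"\n"Champhol","461034:2"\n'

first_seen = {}   # key -> first value encountered
duplicates = []   # (key, value) pairs that were skipped

for key, value in csv.reader(io.StringIO(sample)):
    if key in first_seen:
        duplicates.append((key, value))
    else:
        first_seen[key] = value

print(first_seen)
print(duplicates)
```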
If your program really is only supposed to print the result, there's really no reason to build a keyed dictionary.
import csv
import io
# Python 2/3 compat
try:
    input = raw_input
except NameError:
    pass
def main():
    # Case-insensitive & leading/trailing whitespace insensitive
    user_city = input('Enter a city: ').strip().lower()
    with io.open('anchor_summary2.csv', 'r', newline='', encoding='utf-8') as f:
        for city, value in csv.reader(f):
            if user_city == city.lower():
                print(value)
                break
        else:
            print("City not found.")
if __name__ == '__main__':
    main()
The advantage of this technique is that the csv isn't loaded into memory and the data is only iterated over once. I also added a little code that calls lower on both keys to make the match case-insensitive. Another advantage is that if the city the user requests is near the top of the file, it returns almost immediately and stops looking through the file.
With all that said, if searching performance is your primary consideration, you should consider storing the data in a database.
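As a rough illustration of the database route, here is a sketch using the standard-library sqlite3 module with an in-memory database and the sample rows from the question. The table name anchors, the column names and the lookup helper are made up; a real program would connect to a file on disk and load the CSV once.

```python
import csv
import io
import sqlite3

sample = '"Mamer","285713:13"\n"Champhol","461034:2"\n"Station Palais","972811:0"\n'

# In-memory database for the demo; pass a filename to persist it between runs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE anchors (title TEXT PRIMARY KEY, value TEXT)")

# Each CSV row is a two-item list, which matches the two placeholders.
# OR REPLACE keeps the last value when a title is repeated.
conn.executemany(
    "INSERT OR REPLACE INTO anchors (title, value) VALUES (?, ?)",
    csv.reader(io.StringIO(sample)),
)
conn.commit()


def lookup(conn, title):
    """Return the value stored for title, or None if it is not present."""
    row = conn.execute(
        "SELECT value FROM anchors WHERE title = ?", (title,)
    ).fetchone()
    return row[0] if row else None


print(lookup(conn, "Station Palais"))
print(lookup(conn, "Nowhere"))
```

The PRIMARY KEY gives the title column an index, so each lookup does not have to scan all 500,000 rows.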