I am writing a piece of code to retrieve certain information from the League of Legends API.
I have everything working fine and printing to my console, and I have even managed to access the data and print only the information that I need. The only issue is that there are 299 values I would like printed, and I can only manage to print one at a time. Writing them out one by one would obviously be the worst way to do it, as it would take forever to write the program. I have spent over 3 days researching and watching videos with no success so far.
Below is the code I currently have (minus imports).
url = ('https://na1.api.riotgames.com/lol/league/v4/challengerleagues/by-queue/RANKED_SOLO_5x5?api_key=RGAPI-b5187110-2f16-48b4-8b0c-938ae5bddccb')
r = requests.get(url)
response_dict = r.json()
print(response_dict['entries'][0]['summonerName'])
print(response_dict['entries'][1]['summonerName'])
When I attempt to index entries like '[0:299]' I get the following error: list indices must be integers or slices, not str.
I would simply convert the list of dictionaries within entries into a dataframe. You have all the info nicely organised and can access specific items easily including your column for summonerName .
import requests
from bs4 import BeautifulSoup as bs
import json
import pandas as pd
#url = yourURL
res = requests.get(url, headers = {'user-agent' : 'Mozilla/5.0'})
soup = bs(res.content, 'lxml')
data = json.loads(soup.select_one('p').text)
df = pd.DataFrame(data['entries'])
print(df)
You can loop over the index; that will print them all out:
for i in range(len(response_dict['entries'])):
    print(response_dict['entries'][i]['summonerName'])
When you use response_dict['entries'][M:N], you create a new list of dictionaries, and each dictionary has to be extracted before you can reference ['summonerName'] directly. If you print(response_dict['entries'][0:3]), you'll see what I mean.
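To make the slicing point concrete, here is a minimal sketch. The `response_dict` below is a small made-up stand-in for the real API response (the names and points are invented sample data), so treat it as an illustration rather than the exact shape of the live payload.

```python
# Hypothetical stand-in for r.json(); the real response has far more entries.
response_dict = {
    "entries": [
        {"summonerName": "Alpha", "leaguePoints": 1200},
        {"summonerName": "Bravo", "leaguePoints": 1100},
        {"summonerName": "Charlie", "leaguePoints": 1000},
    ]
}

# A slice is still a list of dicts, so the key has to be read from each element.
names = [entry["summonerName"] for entry in response_dict["entries"][0:2]]
print(names)

# Iterating the list directly avoids hard-coding how many entries there are.
for entry in response_dict["entries"]:
    print(entry["summonerName"])
```

Indexing the slice itself with ['summonerName'] is what triggers the "list indices must be integers or slices, not str" error, because the slice is a list, not a dictionary.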
Related
I have this code:
import pandas as pd
import json
file = "/Users/mickelborg/Desktop/Dataset/2018/Carbon_Minoxide_(CO)_2018.json"
with open(file, 'r') as j:
    contents = json.loads(j.read())
oxide = pd.DataFrame.from_dict(contents, orient='index')
oxide
I'm trying to get a readout of the JSON dataset by the features/columns, but they don't seem to load properly.
Currently this is the output that I have:
LINK
As can be seen from the image, the data loads incorrectly. "county_code" should each have their own row in the dataset, along with all the other following features.
What am I doing wrong in this regard?
Thanks a lot for your help!
Is there a concise way to save the results of a for loop in JSON format? Thank you for your help.
import requests
import json
results = []
for i in range(1,143):
    res = requests.get("https://www.bhhs.com/bin/bhhs/officeSearchServlet?PageSize=10&Sort=1&Page={}&office_country=US".format(i))
    results.append(res.json())
# What goes next? Thank you!
Your job is much easier now. The website uses JavaScript to get this information.
The code below scrapes all 141 pages.
import requests, json
results = []
for i in range(1,142):
    res = requests.get("https://www.bhhs.com/bin/bhhs/officeSearchServlet?PageSize=10&Sort=1&Page={}&office_country=US".format(i))
    results.append(res.json())
with open("result.json", "w") as f:
    json.dump(results, f)
Sending all the requests in one run can cause some of them to fail. Hence, I recommend crawling the pages in batches and saving the data after each batch (pages 1-10, then 11-20, and so on). You can then consolidate all the scraped results.
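The batching idea could look roughly like the sketch below. The helper names (`crawl_in_batches`, `consolidate`), the `fetch_page` callback, and the `result_<start>_<end>.json` file naming are all assumptions made for illustration; none of them come from the site or from requests.

```python
import json
import os

def crawl_in_batches(fetch_page, first_page, last_page, batch_size=10, out_dir="."):
    """Fetch pages first_page..last_page (inclusive) and save each batch to its own file."""
    batch_files = []
    for start in range(first_page, last_page + 1, batch_size):
        end = min(start + batch_size - 1, last_page)
        # fetch_page is a caller-supplied function: page number -> parsed JSON
        batch = [fetch_page(page) for page in range(start, end + 1)]
        path = os.path.join(out_dir, "result_{}_{}.json".format(start, end))
        with open(path, "w") as f:
            json.dump(batch, f)
        batch_files.append(path)
    return batch_files

def consolidate(batch_files, out_file):
    """Merge the per-batch files into one list and write it to out_file."""
    merged = []
    for path in batch_files:
        with open(path) as f:
            merged.extend(json.load(f))
    with open(out_file, "w") as f:
        json.dump(merged, f)
    return merged
```

In real use, `fetch_page` would wrap the request from the code above, for example a function that returns `requests.get(url.format(i)).json()`. If one batch fails, only that batch needs to be fetched again.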
About two months ago I asked a question about pulling data from the CME in the json format. I was successfully able to pull the appropriate data with your help.
Want to remind everyone that I am still pretty new to Python, so please bear with me if my question is relatively straightforward.
I am trying to pull data again in JSON format, but from a different website, and things do not appear to be cooperating. In particular, I am trying to pull the following data:
https://api.tmxmoney.com/marketactivity/candeal?ts=1567086212742
This is what I have tried.
import pandas as pd
import json
import requests
cadGovt = 'https://api.tmxmoney.com/marketactivity/candeal?ts=1567086212742'
sample_data = requests.get(cadGovt)
sample_data.encoding = 'utf-8'
test = sample_data.json()
print(test)
I would like to get a json of the information (which is literally just a table that has term, description, bid yield, ask yield, change, bid price, ask price, change).
Instead I am getting 'JSONDecodeError: Expecting value: line 1 column 1 (char 0)'.
If anyone has any guidance or advice that would be greatly appreciated.
It's because the page you're requesting is not returning JSON but an HTML page. So when you try to use
test = sample_data.json()
you're trying to parse HTML as JSON, which won't work. You can scrape the data off the page though. Here's an example in bs4 you can try; it's a bit rough around the edges, but it should work.
import requests as r
from bs4 import BeautifulSoup
url = 'https://api.tmxmoney.com/marketactivity/candeal?ts=1567086212742'
response = r.get(url)
soup = BeautifulSoup(response.text, 'lxml')
# Print the text of every table row
for tr in soup.find_all('tr'):
    print(tr.text + "\n")
You can get the individual td cells of each row like this:
for tr in soup.find_all('tr'):
    tds = tr.find_all('td')
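The JSONDecodeError described above can be reproduced without touching the network. The sketch below uses only the standard library's json module on plain strings; `parse_json_or_none` is a hypothetical helper name, and the HTML string is a made-up stand-in for the page body.

```python
import json

def parse_json_or_none(text):
    """Return the parsed JSON, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None

# Made-up response bodies: one HTML, one JSON.
html_body = "<html><body><table><tr><td>2Y</td></tr></table></body></html>"
json_body = '{"term": "2Y", "bidYield": 1.35}'

print(parse_json_or_none(html_body))
print(parse_json_or_none(json_body))
```

A body that starts with `<` fails at the very first character, which matches the "line 1 column 1 (char 0)" part of the error message. Checking the response text this way before parsing is one way to decide whether to fall back to scraping the table.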
I am trying to work out how to take results from Python regarding the sentiment polarity of tweets (original input from a JSON file) and turn them into a CSV I can export for use in R. I'm using Python 2.7.
I have tried a couple of different ways from similar Stack Overflow queries, but no success so far.
For example, using the pandas package:
tweet_polarity = []
for tweet in tweet_text:
    polarity = analyser.polarity_scores(tweet[1])
    tweet_polarity.append([tweet[0], tweet[1], polarity['compound'],
                           polarity['neg'], polarity['neu'], polarity['pos']])
import pandas
df = pandas.DataFrame(data={"tweet_polarity": tweet_polarity, "tweet_text": tweet_text,
                            "tweets}": tweets})
df.to_csv("polarityRES.csv")
This creates a CSV file, but it seems to just repeat the same tweet over and over again rather than creating a nice dataframe with the polarity scores.
I thought about using csv.writer, but haven't been able to find an example relevant to what I'm trying to do. Any suggestions, gang?
(Sorry for my terrible explanation, I'm still getting to grips with the basics while trying to do this - and typing one handed with tendonitis!)
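For the csv.writer route mentioned above, a minimal sketch might look like this. It assumes each row of tweet_polarity has the six values appended in the loop (taking tweet[0] to be an id and tweet[1] the text, which is a guess), and the sample rows and column names are invented. It is written for Python 3; on Python 2.7 the output file would need to be opened in 'wb' mode instead.

```python
import csv
import io

# Hypothetical sample rows in the shape built by the loop:
# [id, text, compound, neg, neu, pos]
tweet_polarity = [
    ["1", "great day", 0.62, 0.0, 0.3, 0.7],
    ["2", "awful, just awful", -0.7, 0.8, 0.2, 0.0],
]

def write_polarity_csv(rows, fileobj):
    """Write a header line followed by one CSV line per tweet."""
    writer = csv.writer(fileobj)
    writer.writerow(["id", "text", "compound", "neg", "neu", "pos"])
    writer.writerows(rows)

# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
write_polarity_csv(tweet_polarity, buffer)
print(buffer.getvalue())
```

To write a real file, the same function can be called inside `with open("polarityRES.csv", "w", newline="") as f:`. Each tweet then gets exactly one row, with the scores in their own columns, which R can read directly.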
I have a large dataset stored in an S3 bucket, but instead of being a single large file, it's composed of many (113K to be exact) individual JSON files, each of which contains 100-1000 observations. These observations aren't at the top level, but require some navigation within each JSON to access.
i.e.
json["interactions"] is a list of dictionaries.
I'm trying to utilize Spark/PySpark (version 1.1.1) to parse through and reduce this data, but I can't figure out the right way to load it into an RDD, because it's neither all records > one file (in which case I'd use sc.textFile, though added complication here of JSON) nor each record > one file (in which case I'd use sc.wholeTextFiles).
Is my best option to use sc.wholeTextFiles and then use a map (or in this case flatMap?) to pull the multiple observations from being stored under a single filename key to their own key? Or is there an easier way to do this that I'm missing?
I've seen answers here that suggest just using json.loads() on all files loaded via sc.textFile, but it doesn't seem like that would work for me because the JSONs aren't simple highest-level lists.
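The wholeTextFiles-plus-flatMap idea from the question can be sketched as below. Only the per-file function is real code here; the Spark call is shown in a comment because it needs a live SparkContext, and the sample pairs are made-up stand-ins for the (path, content) tuples that wholeTextFiles produces. The name `extract_interactions` is hypothetical.

```python
import json

def extract_interactions(name_and_content):
    """Turn one (filename, file_content) pair into its list of observations."""
    _filename, content = name_and_content
    # Missing "interactions" keys are treated as "no observations" (an assumption).
    return json.loads(content).get("interactions", [])

# With a live SparkContext `sc`, this would be used roughly as:
#   records = sc.wholeTextFiles("s3n://<bucket>/<prefix>").flatMap(extract_interactions)

# Local stand-in that mimics what flatMap does with the function.
sample = [
    ("a.json", '{"interactions": [{"id": 1}, {"id": 2}]}'),
    ("b.json", '{"interactions": [{"id": 3}]}'),
]
records = [obs for pair in sample for obs in extract_interactions(pair)]
print(records)
```

flatMap is the right fit rather than map because each file yields several records, and the result should be one flat collection of observations rather than a collection of lists.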
The previous answers are not going to read the files in a distributed fashion (see reference). To do so, you would need to parallelize the s3 keys and then read in the files during a flatMap step like below.
import boto3
import json
from pyspark.sql import Row
def distributedJsonRead(s3Key):
    s3obj = boto3.resource('s3').Object(bucket_name='bucketName', key=s3Key)
    contents = json.loads(s3obj.get()['Body'].read().decode('utf-8'))
    # Emit one Row per observation in the file
    for dicts in contents['interactions']:
        yield Row(**dicts)
pkeys = sc.parallelize(keyList) #keyList is a list of s3 keys
dataRdd = pkeys.flatMap(distributedJsonRead)
Boto3 Reference
What about using DataFrames?
Does
testFrame = sqlContext.read.json('s3n://<bucket>/<key>')
give you what you want from one file?
Does every observation have the same "columns" (# of keys)?
If so you could use boto to list each object you want to add, read them in and union them with each other.
from pyspark.sql import SQLContext
import boto3
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
s3 = boto3.resource('s3')
bucket = s3.Bucket('<bucket>')
aws_secret_access_key = '<secret>'
aws_access_key_id = '<key>'
#Configure spark with your S3 access keys
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_access_key_id)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_secret_access_key)
object_list = [k for k in bucket.objects.all() ]
key_list = [k.key for k in bucket.objects.all()]
paths = ['s3n://'+o.bucket_name+'/'+ o.key for o in object_list ]
dataframes = [sqlContext.read.json(path) for path in paths]
# Start from the first frame and union the remaining ones onto it
df = dataframes[0]
for frame in dataframes[1:]:
    df = df.unionAll(frame)
I'm new to spark myself so I'm wondering if there's a better way to use dataframes with a lot of s3 files, but so far this is working for me.
The name is misleading (because it's singular), but sparkContext.textFile() (at least in the Scala case) also accepts a directory name or a wildcard path, so you should be able to say textFile("/my/dir/*.json").