Saving JSON from a for loop

Is there a concise way to save the output of this for loop in JSON format? Thank you for your help.
import requests
import json

results = []
for i in range(1, 143):
    res = requests.get("https://www.bhhs.com/bin/bhhs/officeSearchServlet?PageSize=10&Sort=1&Page={}&office_country=US".format(i))
    results.append(res.json())
# What goes next? Thank you!

Your job is much easier than it looks. The website uses JavaScript to fetch this information from a JSON endpoint, so you can request that endpoint directly.
The code below scrapes all 141 pages.
import requests, json

results = []
for i in range(1, 142):
    res = requests.get("https://www.bhhs.com/bin/bhhs/officeSearchServlet?PageSize=10&Sort=1&Page={}&office_country=US".format(i))
    results.append(res.json())

with open("result.json", "w") as f:
    json.dump(results, f)
Firing off all the requests in a single run means one failed request can cost you everything scraped so far. Hence, I recommend crawling the pages in batches and saving as you go: scrape pages 1-10 and save, then pages 11-20 and save, and so on. Afterwards you can consolidate all the scraped results, as in the sketch below.
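A minimal sketch of that batching idea (the batch size of 10 and the result_*.json file names are my own choices, not part of the original answer):
import requests, json

BATCH_SIZE = 10
for start in range(1, 142, BATCH_SIZE):
    batch = []
    for i in range(start, min(start + BATCH_SIZE, 142)):
        res = requests.get("https://www.bhhs.com/bin/bhhs/officeSearchServlet?PageSize=10&Sort=1&Page={}&office_country=US".format(i))
        batch.append(res.json())
    # each batch gets its own file, so a failure only loses one batch
    with open("result_{}.json".format(start), "w") as f:
        json.dump(batch, f)

# afterwards, consolidate the batch files into a single list
results = []
for start in range(1, 142, BATCH_SIZE):
    with open("result_{}.json".format(start)) as f:
        results.extend(json.load(f))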

How to extract all values from one key type of a json file?

I'm trying to learn how to do this (I can barely code); I'm not trying to get you (the wonderful and generous reader of my post) to do my job for me. Obviously full solutions are welcome, but my goal is to figure out the HOW so I can do this myself.
Project - Summary
Extract just the attachment file URLs from a massive JSON file (I believe the proper term is "parsing JSON strings").
Project - Wordy Explanation
I'm trying to get all the attachments from a .json file that is an export of my entire Trello board. It has a specific key field for these attachments at the end of a JSON tree, like below:
TrelloBoard.json
> cards
>> 0
>>> attachments
>>>> 0
>>>>> url "https://trello-attachments.s3.amazonaws.com/###/####/#####/AttachedFile.pdf"
(The first 0 goes up to 300+, representing each Trello card; the second 0 has never gone above 0, as it represents the number of attachments per card.)
I've looked up tutorials online on how to parse strings from JSON files, but I haven't been able to get anything to print out (write) from those attempts. Seeing as I have over 100 attachments per month to download, code is clearly the best way to do it, but I'm completely stumped on how and am asking you, dear reader, to help point me in the right direction.
Code Attempt
Any programming language is fine (I'm new enough to not be attached to any), but I've tried the following in Python (among other attempts) to no avail in Command Prompt.
import json

# raw string so the backslashes in the Windows path aren't treated as escapes
with open(r'G:\~WORK~\~Codes~\trello.json') as f:
    data = json.load(f)

print(data)
# Output: {'cards': '0', 'attachments': '0', 'url': ['https://trello-attachments.s3.amazonaws.com']}
Walk the nested Python dicts and lists to get the needed values. Given the structure you describe (cards → attachments → url), you can collect every attachment URL like this:
import json

with open(r'G:\~WORK~\~Codes~\trello.json') as f:
    data = json.load(f)

# every card's attachments, each attachment's url
urls = [a['url'] for card in data['cards'] for a in card['attachments']]
print(urls)
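If the next step is downloading each attachment, here is a minimal sketch with requests, continuing from the urls list above (the attachments output folder is my own assumption):
import os
import requests

os.makedirs('attachments', exist_ok=True)
for url in urls:
    # use the last path segment as the local file name
    filename = url.rsplit('/', 1)[-1]
    res = requests.get(url)
    with open(os.path.join('attachments', filename), 'wb') as out:
        out.write(res.content)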

Pull Data from TMX Using Python 3.6.8

About two months ago I asked a question about pulling data from the CME in JSON format, and with your help I was able to pull the appropriate data successfully.
I want to remind everyone that I am still pretty new to Python, so please bear with me if my question is relatively straightforward.
I am trying to pull data in JSON format again, but from a different website, and things do not appear to be cooperating. In particular I am trying to pull the following data:
https://api.tmxmoney.com/marketactivity/candeal?ts=1567086212742
This is what I have tried.
import pandas as pd
import json
import requests
cadGovt = 'https://api.tmxmoney.com/marketactivity/candeal?ts=1567086212742'
sample_data = requests.get(cadGovt)
sample_data.encoding = 'utf-8'
test = sample_data.json()
print(test)
I would like to get a json of the information (which is literally just a table that has term, description, bid yield, ask yield, change, bid price, ask price, change).
Instead I am getting 'JSONDecodeError: Expecting value: line 1 column 1 (char 0)'.
If anyone has any guidance or advice that would be greatly appreciated.
That's because the page you're getting is not returning JSON but an HTML page. So when you try to use
test = sample_data.json()
you're trying to parse HTML as JSON, which won't work. You can scrape the data off of the page instead; here's an example in bs4 you can try. It's a bit rough around the edges, but it should work.
import requests as r
from bs4 import BeautifulSoup

url = 'https://api.tmxmoney.com/marketactivity/candeal?ts=1567086212742'
response = r.get(url)
soup = BeautifulSoup(response.text, 'lxml')

# print the text of every table row on the page
for tr in soup.find_all('tr'):
    print(tr.text + "\n")
You can get the individual td cells like this:
for tr in soup.find_all('tr'):
    tds = tr.find_all('td')
    print([td.text for td in tds])
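Since the response is really just an HTML table, pandas can also parse it directly; a minimal sketch, assuming the table parses cleanly (read_html needs lxml or html5lib installed):
import pandas as pd
import requests

url = 'https://api.tmxmoney.com/marketactivity/candeal?ts=1567086212742'
response = requests.get(url)

# read_html returns one DataFrame per <table> found in the HTML
tables = pd.read_html(response.text)
print(tables[0])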

Json Parsing from API With Dicts

I am writing a piece of code to retrieve certain information from the League of Legends API.
I have everything working fine and printing to my console, and I have even managed to access the data and print only the information I need. The only issue is that there are 299 values I would like printed, and I can only manage to print one at a time. Writing a print statement for each one would obviously be the worst way to do it, since it would take forever. I have spent over 3 days researching and watching videos with no success so far.
Below is the code I currently have (minus imports).
url = ('https://na1.api.riotgames.com/lol/league/v4/challengerleagues/by-queue/RANKED_SOLO_5x5?api_key=RGAPI-b5187110-2f16-48b4-8b0c-938ae5bddccb')
r = requests.get(url)
response_dict = r.json()
print(response_dict['entries'][0]['summonerName'])
print(response_dict['entries'][1]['summonerName'])
When I attempt to index entries with a slice like [0:299] and then ask for ['summonerName'], I get the following error: list indices must be integers or slices, not str.
I would simply convert the list of dictionaries within entries into a DataFrame. You have all the info nicely organised and can access specific items easily, including your summonerName column.
import requests
from bs4 import BeautifulSoup as bs
import json
import pandas as pd

# url = the URL from the question above
res = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
soup = bs(res.content, 'lxml')
data = json.loads(soup.select_one('p').text)
df = pd.DataFrame(data['entries'])
print(df)
You can loop over the index; that'll print them all out:
for i in range(len(response_dict['entries'])):
    print(response_dict['entries'][i]['summonerName'])
When you use response_dict['entries'][M:N], you create a new list of dictionaries, and each dictionary has to be taken out of that list before you can reference ['summonerName'] directly.
If you print(response_dict['entries'][0:3]), you'll see what I mean.
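To make that concrete, a small sketch: the slice is a list, so you loop over it (or index into it) before asking for the key.
# a slice is a list of dicts, not a single dict
first_three = response_dict['entries'][0:3]

# take each dict out of the list, then look up the key
names = [entry['summonerName'] for entry in first_three]
print(names)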

PySpark loading from URL

I wanted to load CSV files from a URL in PySpark; is it even possible to do so?
I keep the files on GitHub.
Thanks!
There is no native way to do this in PySpark.
However, if you have a function that takes a URL as input and returns the CSV contents:
def read_from_URL(url):
    # your logic here
    return data
You can use Spark to parallelize this operation:
URL_list = ['http://github.com/file/location/file1.csv', ...]
data = sc.parallelize(URL_list).map(read_from_URL)
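One possible read_from_URL is just pandas; a sketch assuming the files are public and you use GitHub's raw-file URLs, with an existing SparkContext sc as above (the example URL is hypothetical):
import pandas as pd

def read_from_URL(url):
    # pandas reads a CSV straight from a URL; for GitHub,
    # point at the raw.githubusercontent.com form of the link
    return pd.read_csv(url)

# hypothetical raw URL; replace with your own
URL_list = ['https://raw.githubusercontent.com/user/repo/main/file1.csv']
data = sc.parallelize(URL_list).map(read_from_URL).collect()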

JSON - Opening Yelp Data Challenge's data set

I am interested in data mining and I am writing my thesis about it. For my thesis I want to use Yelp's Data Challenge dataset; however, I cannot open it, since it is in JSON format and almost 2 GB. The website says the dataset can be opened in Python using mrjob, but I am not very good with programming. I searched online and looked at some of the code Yelp provides on GitHub, but I couldn't find an article or anything that clearly explains how to open the dataset.
Can you please tell me step by step how to open this file, and maybe how to convert it to CSV?
https://www.yelp.com.tr/dataset_challenge
https://github.com/Yelp/dataset-examples
The data is in .tar format. When you extract it, it contains another file; rename that file to .tar and extract it again. You will get all the JSON files.
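A minimal sketch of that double extraction in Python (the file and folder names here are placeholders for whatever your download is actually called):
import tarfile

# first pass: unpack the downloaded archive
with tarfile.open('yelp_dataset.tar') as outer:
    outer.extractall('extracted')

# second pass: the inner file is itself a tar archive
with tarfile.open('extracted/yelp_dataset.tar') as inner:
    inner.extractall('json_files')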
Yes, you can use pandas. Take a look:
import pandas as pd

# read the entire file into a python list, one JSON object per line
with open('yelp_academic_dataset_review.json', 'r') as f:
    data = f.readlines()

# remove the trailing "\n" from each line
data = [x.rstrip() for x in data]

# join the lines into one big JSON array string
data_json_str = "[" + ','.join(data) + "]"

# now, load it into pandas
data_df = pd.read_json(data_json_str)
Now 'data_df' contains the yelp data ;)
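As an aside, pandas can also read newline-delimited JSON directly with the lines=True flag, which avoids building that huge string in memory:
import pandas as pd

# one JSON object per line, parsed straight into a DataFrame
data_df = pd.read_json('yelp_academic_dataset_review.json', lines=True)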
In case you want to convert it directly to CSV, you can use this script:
https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py
I hope it helps.
To process huge JSON files, use a streaming parser.
Many of these files aren't a single JSON document but a stream of JSON objects, one per line (commonly called JSON Lines). A regular JSON parser will consider everything after the first object to be junk.
With a streaming parser, you can start reading the file, process parts, and write them to the desired output, then continue reading; see the sketch below.
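For a newline-delimited file, the simplest streaming approach needs no extra library; a minimal sketch (the field names in the print line are illustrative):
import json

# parse one record at a time instead of loading ~2 GB at once
with open('yelp_academic_dataset_review.json', 'r') as f:
    for line in f:
        record = json.loads(line)
        # process each record here, e.g. pull out the fields you need
        print(record.get('business_id'), record.get('stars'))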
There is no single JSON-to-CSV conversion.
Thus, you will not find a general conversion utility; you have to customize the conversion for your needs.
The reason is that JSON is a tree but CSV is not: there is no universal, efficient conversion from trees to table rows. I'd stick with JSON unless you are always extracting only the same x attributes from the tree.
Start coding: to succeed with this amount of data, you need to become a better programmer.