Receive JSON data from link in whattomine without scraping HTML

Explanation
This link is where you are sent after entering your hardware stats (hashrate, power, power cost, etc.). On the top bar (below the blue Twitter follow button) is a link to a JSON file created after the page loads with the hardware stats you entered; clicking that JSON link redirects you to another URL (https://whattomine.com/asic.json).
Goal
My goal is to access that JSON file directly, after manipulating the values in the URL string via the terminal. For example, say I would like to change the hashrate from 100 to 150 in this portion of the URL:
[sha256_hr]=100& ---> [sha256_hr]=150&
After the URL manipulations (like the one above, but not limited to it), I would like to receive the JSON output so that I can pick out the desired data.
My Code
Advisory: I started Python programming around June 2017, so please forgive any rookie mistakes.
import json
import pandas as pd
import urllib2
import requests

hashrate_ths = float(raw_input('Hash Rate (TH/s): '))
power_W = float(raw_input('Power of Miner (W): '))
electric_cost = float(raw_input('Cost of Power ($/kWh): '))
hashrate_scaled = hashrate_ths * 1000  # scaled by 1000 for the URL parameter

initial_request = ('https://whattomine.com/asic?utf8=%E2%9C%93&sha256f=true&factor[sha256_hr]={0}&factor[sha256_p]={1}&factor[cost]={2}&sort=Profitability24&volume=0&revenue=24h&factor[exchanges][]=&factor[exchanges][]=bittrex&dataset=Main&commit=Calculate'.format(hashrate_scaled, power_W, electric_cost))
data_stream_mine = urllib2.Request(initial_request)  # builds a Request object, but never opens it
json_data = requests.get('https://whattomine.com/asic.json')
print json_data
Error from My Code
I am getting an HTTPS handshake error. This is where my Python freshness is most blatantly visible:
Traceback (most recent call last):
  File "calc_1.py", line 16, in <module>
    s.get('https://whattomine.com/asic.json')
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 521, in get
    return self.request('GET', url, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/adapters.py", line 506, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='whattomine.com', port=443): Max retries exceeded with url: /asic.json (Caused by SSLError(SSLError(1, u'[SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:590)'),))
Thank you for your help and time!
Please advise me of any changes, or ask me for more information concerning this question.

This is just a comment, but the following approach should suffice (Python 3).
import requests
initial_request = 'http://whattomine.com/asic.json?utf8=1&dataset=Main&commit=Calculate'
json_data = requests.get(initial_request)
print(json_data.json())
The key point is this: put .json into your initial_request URL and that will be enough.
You may add all your parameters, as you did before, in the query string after the ? sign.
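For example, you could keep the parameters in a dict and let requests build the query string for you. This is only a sketch: the parameter names are copied from the question's URL, and the numeric values are placeholders.
import requests

params = {
    'utf8': u'\u2713',
    'sha256f': 'true',
    'factor[sha256_hr]': 150,                # hashrate, as in the question's example
    'factor[sha256_p]': 1450,                # miner power in W (placeholder)
    'factor[cost]': 0.1,                     # power cost in $/kWh (placeholder)
    'factor[exchanges][]': ['', 'bittrex'],  # list values become repeated keys
    'sort': 'Profitability24',
    'volume': 0,
    'revenue': '24h',
    'dataset': 'Main',
    'commit': 'Calculate',
}
json_data = requests.get('http://whattomine.com/asic.json', params=params)
print(json_data.json())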

It looks like a few others have faced similar problems. For some it seemed to be a pyOpenSSL version issue, and uninstalling and reinstalling pyOpenSSL fixed the problem. Another, older answer on SO asks to do the following.


dask.delayed KeyError with distributed scheduler

I have a function interpolate_to_particles written in C and wrapped with ctypes. I want to use dask.delayed to make a series of calls to this function. The code runs successfully without dask:
# Interpolate w/o dask
result = interpolate_to_particles(arg1, arg2, arg3)
and with the distributed scheduler in single-threaded mode:
# Interpolate w/ dask
import dask
from dask.distributed import Client

client = Client()
result = dask.delayed(interpolate_to_particles)(arg1, arg2, arg3)
result_c = result.compute(scheduler='single-threaded')
but if I instead call
result_c = result.compute()
I get the following KeyError:
Traceback (most recent call last):
  File "/path/to/lib/python3.6/site-packages/distributed/worker.py", line 3287, in dumps_function
    result = cache_dumps[func]
  File "/path/to/lib/python3.6/site-packages/distributed/utils.py", line 1518, in __getitem__
    value = super().__getitem__(key)
  File "/path/to/lib/python3.6/collections/__init__.py", line 991, in __getitem__
    raise KeyError(key)
KeyError: <function interpolate_to_particles at 0x1228ce510>
The worker logs accessed from the dask dashboard do not provide any information. In fact, I do not see any indication that the workers have done anything besides starting up.
Any ideas on what could be occurring, or suggested tools that I can use to further debug? Thanks!
Given your comments, it sounds like your function does not serialize well. To test this, you might try pickling the function in one process and unpickling it in another.
>>> import pickle
>>> print(pickle.dumps(interpolate_to_particles))
b'some bytes printed out here'
And then in another process
>>> import pickle
>>> interpolate_to_particles = pickle.loads(b'the same bytes you had before')
If this doesn't work then you'll know that that's your problem. I would encourage you to look up "how to make sure that ctypes functions are serializable" or something similar, or ask another question with that smaller scope here on Stack Overflow.
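If the ctypes wrapper is what refuses to pickle, one common workaround is to expose a plain module-level function that loads the shared library lazily, so only the Python function itself is serialized. This is a sketch under that assumption; the library path and the signature are hypothetical.
import ctypes

_lib = None  # loaded lazily, once per worker process

def interpolate_to_particles(arg1, arg2, arg3):
    # Re-open the shared library inside the worker instead of shipping
    # a ctypes handle through pickle (ctypes objects do not pickle).
    global _lib
    if _lib is None:
        _lib = ctypes.CDLL('./libinterpolate.so')          # hypothetical path
        _lib.interpolate.argtypes = [ctypes.c_double] * 3  # assumed signature
        _lib.interpolate.restype = ctypes.c_double
    return _lib.interpolate(arg1, arg2, arg3)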

I am trying to parse 350 files on my local disk and store the data into a database as JSON objects

I am parsing 350 txt files containing JSON data using Python. I am able to retrieve 62 of those objects and store them in a MySQL database, but after that I get an error saying JSONDecodeError: Extra data.
Python:
import os
import ast
import json
import mysql.connector as mariadb
from mysql.connector.constants import ClientFlag
mariadb_connection = mariadb.connect(user='root', password='137800000', database='shaproject', client_flags=[ClientFlag.LOCAL_FILES])
cursor = mariadb_connection.cursor()
sql3 = """INSERT INTO shaproject.alttwo (alttwo_id,responses) VALUES """
os.chdir('F:/Code Blocks/SEM 2/DM/Project/350/For Merge Disqus')
current_list_dir = os.listdir()
print(current_list_dir)
cur_cwd = os.getcwd()
cur_cwd = cur_cwd.replace('\\', '/')
twoid = 1
for every_file in current_list_dir:
    file = open(cur_cwd + "/" + every_file)
    utffile = file.read()
    data = json.loads(utffile)
    for i in range(0, len(data['response'])):
        data123 = json.dumps(data['response'][i])
        tup = (twoid, data123)
        print(sql3 + str(tup))
        twoid += 1
        cursor.execute(sql3 + str(tup) + ";")
        print(tup)
        mariadb_connection.commit()
I have searched online and found that multiple dump statements can result in this error, but I am unable to resolve it.
You want to use glob. Rather than os.listdir(), which is too permissive, use glob to focus on just the *.json files.
Print out the name of each file before asking .loads() to parse it, and rename any badly formatted files to .txt rather than .json in order to skip them.
Note that you can pass the open file directly to .load(), if you wish. Closing open files would also be a good thing; rather than a direct assignment (with no close()!) you would be better off with with:
with open(cur_cwd + "/" + every_file) as file:
    data = json.load(file)
Talking about the "current current working directory" (cur_cwd) seems both repetitive and redundant. It would suffice to call it cwd.
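A minimal sketch pulling those suggestions together (assuming cwd holds the directory used above):
import glob
import json
import os

cwd = os.getcwd()
for path in sorted(glob.glob(os.path.join(cwd, '*.json'))):
    print(path)  # know which file is being parsed before any failure
    with open(path) as f:
        data = json.load(f)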

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

I am trying to read Twitter data from a JSON file using Python 2.7.12. The code I used is as follows:
import json
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
def get_tweets_from_file(file_name):
    tweets = []
    with open(file_name, 'rw') as twitter_file:
        for line in twitter_file:
            if line != '\r\n':
                line = line.encode('ascii', 'ignore')
                tweet = json.loads(line)
                if u'info' not in tweet.keys():
                    tweets.append(tweet)
    return tweets
Result I got:
Traceback (most recent call last):
  File "twitter_project.py", line 100, in <module>
    main()
  File "twitter_project.py", line 95, in main
    tweets = get_tweets_from_dir(src_dir, dest_dir)
  File "twitter_project.py", line 59, in get_tweets_from_dir
    new_tweets = get_tweets_from_file(file_name)
  File "twitter_project.py", line 71, in get_tweets_from_file
    line = line.encode('ascii', 'ignore')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte
I went through all the answers from similar issues and came up with this code and it worked last time. I have no clue why it isn't working now.
In my case (macOS), there was a .DS_Store file in my data folder, a hidden and auto-generated file, and it caused the issue. I was able to fix the problem after removing it.
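A small guard for that case (a sketch; data_dir is a placeholder for your data folder) is to filter out hidden files before parsing anything:
import os

data_dir = '.'  # placeholder for your data folder
files = [name for name in os.listdir(data_dir)
         if not name.startswith('.')]  # skips .DS_Store and other hidden files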
It doesn't help that you have sys.setdefaultencoding('utf-8'), which is confusing things further. It's a nasty hack and you need to remove it from your code.
See https://stackoverflow.com/a/34378962/1554386 for more information.
The error is happening because line is a byte string and you're calling encode(). encode() only makes sense if the string is Unicode, so Python tries to convert it to Unicode first using the default encoding, which in your case is UTF-8 (but should be ASCII). Either way, 0x80 is not valid ASCII or UTF-8, so it fails.
0x80 is valid in some character sets. In windows-1252/cp1252 it's €.
The trick here is to understand the encoding of your data all the way through your code. At the moment, you're leaving too much up to chance. Unicode string types are a handy Python feature that allows you to decode encoded strings and forget about the encoding until you need to write or transmit the data.
Use the io module to open the file in text mode and decode the file as it goes; no more .decode()! You need to make sure the encoding of your incoming data is consistent. You can either re-encode it externally or change the encoding in your script. Here I've set the encoding to windows-1252:
import io

with io.open(file_name, 'r', encoding='windows-1252') as twitter_file:
    for line in twitter_file:
        # line is now a <type 'unicode'>
        tweet = json.loads(line)
The io module also provides universal newlines. This means \r\n is detected as a newline, so you don't have to watch for it.
For others who come across this question due to the error message, I ran into this error trying to open a pickle file when I opened the file in text mode instead of binary mode.
This was the original code:
import pickle as pkl
with open(pkl_path, 'r') as f:
    obj = pkl.load(f)
And this fixed the error:
import pickle as pkl
with open(pkl_path, 'rb') as f:
    obj = pkl.load(f)
I got a similar error by accidentally trying to read a parquet file as a CSV:
pd.read_csv('file.parquet')      # raises the UnicodeDecodeError
pd.read_parquet('file.parquet')  # reads the file correctly
The error occurs when you are trying to read a tweet containing text like
"#Mike http:\www.google.com \A8&^)((&() how are&^%()( you "
which cannot be read as a plain string; you are supposed to read it as a raw string. But since converting to a raw string still gives an error, I suggest you read the JSON file like this instead:
import codecs
import json

keys = []
fulldata = []
with codecs.open('tweetfile', 'rU', 'utf-8') as f:
    for line in f:
        data = json.loads(line)
        print data["tweet"]
        keys.append(data["id"])
        fulldata.append(data["tweet"])
This will load the data from the JSON file. You can then write it to a CSV using pandas:
import pandas as pd

output = pd.DataFrame(data={"tweet": fulldata, "id": keys})
output.to_csv("tweets.csv", index=False, quoting=1)
Then read from the CSV to avoid the encoding and decoding problem.
Hope this helps you solve your problem.
Midhun

jsonstat.from_file() returns error "can't multiply sequence by non-int of type 'list'"

I'm trying to parse a JSON-stat file using jsonstat.py (v0.1.7) but am getting an error.
The code below is copied from the examples on github (https://github.com/26fe/jsonstat.py/tree/master/examples-notebooks):
from __future__ import print_function
import os
import jsonstat
os.chdir(r'D:\Desktop\JSON_Stat')
url = 'http://www.cso.ie/StatbankServices/StatbankServices.svc/jsonservice/responseinstance/NQQ25'
file_name = "test02.json"
file_path = os.path.abspath(os.path.join("..","JSON_Stat", "CSO", file_name))
I added this line to deal with non-ASCII characters in the file:
# -*- coding: utf-8 -*-
This successfully downloads the JSON file to my desktop:
if os.path.exists(file_path):
    print("using already downloaded file {}".format(file_path))
else:
    print("download file and storing on disk")
    jsonstat.download(url, file_path)
From here, I can load and pprint the data using the json module:
import json
import pprint as pp
with open(r"CSO\test02.json") as data_file:
data = json.load(data_file)
pp.pprint(data)
... but when I try to use the jsonstat module (as specified in the examples), I get the error mentioned in the subject:
collection = jsonstat.from_file(r"D:\Desktop\JSON_Stat\CSO\test02.json")
collection
# abbreviated error message
--> 384 self.__pos2cat = self.__size * [None]
TypeError: can't multiply sequence by non-int of type 'list'
I understand what the error message itself means but, having studied the dimensions.py module where it occurs, I am stuck trying to understand why. I was able to run the sample OECD code without issue, so perhaps the data itself is not formatted in the expected way, though the source site (http://www.cso.ie/webserviceclient/) states that the JSON-stat format is being used.
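One quick check (a sketch; the key names follow the JSON-stat 1.x layout, where each dataset carries a dimension block whose id and size lists must line up) is to inspect those fields with plain json:
import json

with open(r"CSO\test02.json") as data_file:
    raw = json.load(data_file)

# each top-level key should be a dataset whose dimension block holds
# parallel 'id' and 'size' lists that jsonstat.py multiplies out
for name, dataset in raw.items():
    if isinstance(dataset, dict):
        dim = dataset.get('dimension', {})
        print(name, 'ids:', dim.get('id'), 'sizes:', dim.get('size'))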
So, finally, my questions are: has anyone run into this error and resolved it? Has anyone successfully used the jsonstat module to parse this specific data? Alternatively, any general advice towards troubleshooting this issue is welcome.
Thanks

Encoding Issues with BeautifulSoup Parsing and Scraping

Utilizing Python 3, BeautifulSoup, and very minimal regex, I'm trying to scrape the text off of this webpage:
http://www.presidency.ucsb.edu/ws/?pid=2921
I have already successfully extracted its HTML into a file. In fact, I've done this with almost all of the presidential speeches available on this website; I have 247 (out of 258 possible) speeches' HTML saved locally on my computer.
My code for extracting just the text off of each page looks like this:
import re
from bs4 import BeautifulSoup
with open('scan_here.txt') as reference:  # 'scan_here.txt' is a file listing all the pages whose html I have downloaded successfully
    for line in reference:
        line_unclean = reference.readline()  # each file's name is just a random string of 5-6 integers
        line = str(re.sub(r'\n', '', line_unclean))  # remove '\n' from each file name
        f = open(('local_path_to_folder_containing_all_the_html_files\\') + line)
        doc = f.read()
        soup = BeautifulSoup(doc, 'html.parser')
        for speech in soup.select('span.display-text'):
            final_speech = str(speech)
            print(final_speech)
Utilizing this code, I get the following error message:
Traceback (most recent call last):
  File "extract_individual_speeches.py", line 11, in <module>
    doc = f.read()
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 56443: invalid start byte
I understand this is a decode error, and I have tried to run this code on other HTML files, not just the first one which appears on the list of file names in 'scan_here.txt'. I get the same error, so I think it's an encoding issue local to the HTML files.
I think the problem might lie with this third line of the HTML, which is the same across all my HTML files:
<html>
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">
What is 'windows-1251'? I assume it's the problem here. I've looked it up and seen that there are some windows-1251-to-UTF-8 converters, but I didn't see one which works well with Python.
I found this SO thread which seems to deal with this issue of conversion, but I'm not sure how to integrate it with my existing code.
Any help on this issue is much appreciated, TIA.
'windows-1251' is a standard Windows encoding (used for Cyrillic text). What you need is UTF-8. You can specify an encoding when you open a file.
Try something like this:
with open(file, 'r', encoding='windows-1251') as f:
    text = f.read()
or:
text = text.decode('windows-1251')
You can also use codecs:
import codecs

f = codecs.open(file, 'r', 'windows-1251').read()  # read the file as windows-1251
codecs.open(file, 'w', 'UTF-8').write(f)           # write it back out as UTF-8
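To tie this back to the scraping loop in the question (a sketch; the folder path is the question's placeholder):
from bs4 import BeautifulSoup

path = 'local_path_to_folder_containing_all_the_html_files\\' + line
with open(path, 'r', encoding='windows-1251') as f:
    doc = f.read()
soup = BeautifulSoup(doc, 'html.parser')
for speech in soup.select('span.display-text'):
    print(str(speech))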