Multiple JSON root element errors - json

The file I am calling is JSON with padding (JSONP), and I have written some simple code to remove the padding, but by stringing together multiple JSON strings the formatting is not valid and I get root element errors.
I am checking my output by running the Python program's output through an online JSON formatter and validator website. I am a learner, so please bear with my inexperience. All help appreciated.
import json
import re
import requests

payload = {}
headers = {}

for race in range(1, 3):
    url = f"https://s3-ap-southeast-2.amazonaws.com/racevic.static/2018-01-01/flemington/sectionaltimes/race-{race}.json?callback=sectionaltimes_callback"
    response = requests.request("GET", url, headers=headers, data=payload)
    strip = 'sectionaltimes_callback'
    string = response.text
    repl = ''
    result = re.sub(strip, repl, string)
    print(result)
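For what it's worth, removing only the callback name leaves the wrapping parentheses behind, which is why a validator still rejects the output; a minimal sketch with a made-up payload:

```python
import json
import re

# Hypothetical JSONP response, shaped like the ones above
jsonp = 'sectionaltimes_callback({"race": 1})'

# Removing only the callback name keeps the parentheses
stripped = re.sub('sectionaltimes_callback', '', jsonp)
print(stripped)  # ({"race": 1})

try:
    json.loads(stripped)
except json.JSONDecodeError:
    print("still not valid JSON")
```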

This is one way of obtaining the data you're looking for:
import requests
import json
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0',
           'Accept-Language': 'en-US,en;q=0.5'}

for race in range(1, 3):
    url = f"https://s3-ap-southeast-2.amazonaws.com/racevic.static/2018-01-01/flemington/sectionaltimes/race-{race}.json?callback=sectionaltimes_callback"
    r = requests.get(url, headers=headers)
    json_obj = json.loads(r.text.split('sectionaltimes_callback(')[1].rsplit(')', 1)[0])
    df = pd.DataFrame(json_obj['Horses'])
    print(df)
This would print a dataframe in the terminal for each race:
Comment FinalPosition FinalPositionAbbreviation FullName SaddleNumber HorseUrl SilkUrl Trainer TrainerUrl Jockey ... DistanceVarToWinner SixHundredMetresTime TwoHundredMetresTime Early Mid Late OverallPeakSpeed PeakSpeedLocation OverallAvgSpeed DistanceFromRail
0 Resumes. Showed pace to lead well off the rail... 1 1st Crossing the Abbey 2 /horses/crossing-the-abbey //s3-ap-southeast-2.amazonaws.com/racevic.silk... T.Hughes /trainers/tim-hughes C.Williams ... 32.84 11.43 57.4 68.2 65.3 68.9 400m 63.3 0.8
1 Same sire as Katy's Daughter out of dual stake... 2 2nd Khulaasa 5 /horses/khulaasa //s3-ap-southeast-2.amazonaws.com/racevic.silk... D. & B.Hayes & T.Dabernig /trainers/david-hayes D.Oliver ... 0 32.61 11.29 56.6 68.4 66.0 69.2 700m 63.4 1.2
2 Trialled nicely before pleasing debut in what ... 3 3rd Graceful Star 4 /horses/graceful-star //s3-ap-southeast-2.amazonaws.com/racevic.silk... D. & B.Hayes & T.Dabernig /trainers/david-hayes A.Mallyon ... 0 33.10 11.56 56.9 67.4 64.8 68.5 400m 62.8 4.4
3 Sat second at debut, hampered at the 700m then... 4 4th Carnina 1 /horses/carnina //s3-ap-southeast-2.amazonaws.com/racevic.silk... T.Busuttin & N.Young /trainers/trent-busuttin B.Mertens ... +1 33.30 11.80 56.9 68.2 63.9 68.9 400m 62.7 3.0
4 $75k yearling by a Magic Millions winner out o... 5 5th Mirette 7 /horses/mirette //s3-ap-southeast-2.amazonaws.com/racevic.silk... A.Alexander /trainers/archie-alexander J.Childs ... 0 33.53 11.89 57.0 67.9 63.5 68.5 700m 62.5 3.8
5 $95k yearling by same sire as Pinot out of a s... 6 6th Dark Confidant 3 /horses/dark-confidant //s3-ap-southeast-2.amazonaws.com/racevic.silk... D. & B.Hayes & T.Dabernig /trainers/david-hayes D.Dunn ... +2 33.74 11.91 56.4 67.1 63.3 68.8 700m 61.9 5.0
6 Same sire as Vega Magic out of imported stakes... 7 7th La Celestina 6 /horses/la-celestina //s3-ap-southeast-2.amazonaws.com/racevic.silk... D.R.Brideoake /trainers/david-brideoake D.M.Lane ... +1 34.46 12.27 57.5 67.3 61.4 68.2 700m 61.7 0.8
7 rows × 29 columns
Comment FinalPosition FinalPositionAbbreviation FullName SaddleNumber HorseUrl SilkUrl Trainer TrainerUrl Jockey ... DistanceVarToWinner SixHundredMetresTime TwoHundredMetresTime Early Mid Late OverallPeakSpeed PeakSpeedLocation OverallAvgSpeed DistanceFromRail
0 Game in defeat both runs this campaign. Better... 1 1st Wise Hero 2 /horses/wise-hero //s3-ap-southeast-2.amazonaws.com/racevic.silk... J.W.Price /trainers/john-price S.M.Thornton ... 33.13 11.43 55.4 62.7 65.5 68.2 300m 61.7 0.7
1 Two runs since racing wide over this trip at C... 2 2nd Just Hifalutin 5 /horses/just-hifalutin //s3-ap-southeast-2.amazonaws.com/racevic.silk... E.Jusufovic /trainers/enver-jusufovic L.Currie ... +3 32.75 11.37 53.1 63.8 65.8 68.5 400m 61.7 3.3
2 Did a bit of early work at Seymour and was not... 3 3rd King Kohei 10 /horses/king-kohei //s3-ap-southeast-2.amazonaws.com/racevic.silk... Michael & Luke Cerchi /trainers/mick-cerchi
[...]
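The split/rsplit stripping used above can be checked offline; a hedged, self-contained sketch with a made-up JSONP string shaped like the race files:

```python
import json

# A made-up JSONP string shaped like the race files above
jsonp = 'sectionaltimes_callback({"Horses": [{"FullName": "Crossing the Abbey"}]})'

# Cut off the callback name on the left and the trailing ')' on the right
payload = jsonp.split('sectionaltimes_callback(')[1].rsplit(')', 1)[0]
obj = json.loads(payload)
print(obj['Horses'][0]['FullName'])  # Crossing the Abbey
```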

Related

Mediation Analysis Diary Study

I did a diary study.
It was 5 days, 2 times a day (morning and afternoon).
Morning measures I have:
sleeping problems
Afternoon measures I have:
Emotions
Incivility
I want to test a mediation. First I want to test the model with just the IV. But when I do this in R:
summary(m1 <- lmer(sleepingproblems ~ incivility + (1 + incivility |code), data = data_within_measures))
I get the error:
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'object' in selecting a method for function 'summary': missing value where TRUE/FALSE needed
The reason is that in the morning I only have measures for sleeping problems and in the afternoon I only have data for incivility, so they don't match, but I don't know how to solve it.
I would like to know if there is a way I can match both (e.g. matching the data from day 1 morning with the day 1 afternoon data, and so on). The idea is to see whether someone who experiences incivility is also more likely to experience sleeping problems.
head(base_within_merged,6)[,c("code","register","sleepingproblems","expf2fwi","emotionalreactivity")]
code register sleepingproblems expf2fwi emotionalreactivity
1 aaja28 7 1.8 NA NA
2 aaja28 8 NA 1 1.000000
3 aaja28 5 2.0 NA NA
4 aaja28 6 NA 1 2.666667
5 aaja28 10 NA NA NA
6 aaja28 3 2.6 NA NA
library(nlme)
library(lme4)
m1 <- lmer(sleepingproblems ~ expf2fwi + (1|code), data = base_within_merged)
Error in lme4::lFormula(formula = sleepingproblems ~ expf2fwi + (1 | code), : 0 (non-NA) cases
I appreciate the help!
Thank you!
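The pairing the asker describes (day 1 morning matched with day 1 afternoon, and so on) amounts to collapsing the two daily registers into one row per person-day; a hedged, made-up sketch of that idea in Python (the actual reshaping would be done on the R side, e.g. with reshape or pivot_wider):

```python
# Made-up diary rows: (code, day, period, measure, value)
rows = [
    ("aaja28", 1, "morning", "sleepingproblems", 1.8),
    ("aaja28", 1, "afternoon", "incivility", 1.0),
    ("aaja28", 2, "morning", "sleepingproblems", 2.0),
    ("aaja28", 2, "afternoon", "incivility", 2.5),
]

# Collapse morning and afternoon of the same (code, day) into one record,
# so both measures land in the same row and no longer miss each other
paired = {}
for code, day, period, measure, value in rows:
    paired.setdefault((code, day), {})[measure] = value

print(paired[("aaja28", 1)])
# {'sleepingproblems': 1.8, 'incivility': 1.0}
```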

Pandas local HTML errors

Hi, I am trying to read local HTML files with pandas, and one field is not giving the numeric value but a string that is not shown on the page but is inside the HTML code. How can I read the table with the values shown in the HTML?
In the picture below you can see that I should be getting the 00:21:44 value but instead I am getting the string
document.Write(Timefactor("0:19:46","raster"))
Any help?
I am attaching the file.
Your problem is that you are reading the raw HTML, but the browser also executes the JavaScript it contains. You need to render the HTML the same way the browser does.
For that you will need to install the requests_html and html5lib packages. Then load and render your HTML, and you can proceed as usual.
import pandas as pd
from requests_html import HTML

with open(<< your file here >>, 'r', encoding='ISO-8859-1') as fi:
    html_orig = fi.read()

html_rendered = HTML(html=html_orig)
html_rendered.render()
dfs = pd.read_html(html_rendered.html)  # read_html returns a list of DataFrames
I would also suggest cleaning the rendered HTML a little before feeding it to pandas, for example:
import re

last_table = html_rendered.find('table')[-1].html
# re.DOTALL is needed so the pattern can match script blocks that span lines
last_table_noscript = re.sub(r'<script[^<]*.+?<\/script>', '', last_table, flags=re.DOTALL)
df2 = pd.read_html(last_table_noscript)
df2
[ ASS. Programa T Ferramenta Ø RC ID Cone H Total H RESP. ZMin ap/ae STK(xy/z) Comentário F RPM Tempo WKPL Notas
0 NaN 5414TR20112 2 TR32R1.6 des 32 16 M16L100 37 NaN 12793 0,2/17 0,15/ Desbaste Raster 3500 1800 00:09:46 (3+2) 2POS NaN
1 NaN 5414TR20113 3 TR35R1 35 1 M16L100 34 NaN -957 0,2/16 0/ Desbaste Raster 2000 2500 00:03:50 (3+2) 2POS NaN
2 NaN 5414TR20114 3 TR35R1 35 1 M16L100 34 NaN 12591 0,2/17 0/ Desbaste Raster 2000 2500 00:01:36 (3+2) 2POS NaN
3 NaN 5414TR20115 2 TR32R1.6 des 32 16 M16L100 37 NaN -1865 0,2/ 0/ Z Constante 3500 1800 00:34:55 (3+2) 2POS NaN
4 NaN 5414TR20116 160 EHHB-4120-Ap 12 6 CT12L75 60 36.0 505 /0,3 0/ Raster 3500 6200 00:21:44 (3+2) 2POS NaN]
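The script-stripping regex can be exercised in isolation; a minimal sketch with a made-up table cell (re.DOTALL lets the pattern cross line breaks when a script block is multi-line):

```python
import re

# Hypothetical cell: a script tag followed by the rendered time value
cell = '<td><script>document.Write(Timefactor("0:19:46","raster"))</script>00:21:44</td>'

# Drop the <script>...</script> block, keeping only the visible value
clean = re.sub(r'<script[^<]*.+?</script>', '', cell, flags=re.DOTALL)
print(clean)  # <td>00:21:44</td>
```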

Beautiful Soup - find_all function is returning only 20 items from the page. The actual results are around 250

I am using find_all in the Beautiful Soup library to parse the HTML text.
code
from requests import get
from bs4 import BeautifulSoup

headers = {'User-Agent':
           'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
URL = "https://housing.com/in/buy/searches/M1Pmp1mc1ak4wflhbs_735yq6kvim3c7hqz_3g8uxzo18sqqdcuwU2yr9t"
response = get(URL, headers=headers)
html_soup = BeautifulSoup(response.text, 'lxml')
len(html_soup)
This is returning only 20 items even though the page shows 250 results. What am I doing wrong here ?
Try this (it collects all 291):
from selenium import webdriver
import time

driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
URL = "https://housing.com/in/buy/searches/M1Pmp1mc1ak4wflhbs_735yq6kvim3c7hqz_3g8uxzo18sqqdcuwU2yr9t"
driver.get(URL)
driver.maximize_window()

PAUSE_TIME = 2
lh = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(PAUSE_TIME)
    nh = driver.execute_script("return document.body.scrollHeight")
    if nh == lh:
        break
    lh = nh

articles = driver.find_elements_by_css_selector('.css-h7k7mr')
for article in articles:
    print(article.text)
    print('-' * 80)
driver.close()
prints:
₹45.11 L
EMI starts at ₹28.13 K
3 BHK Apartment
Bachupally, Nizampet, Hyderabad
Build Up Area
1556 sq.ft
Avg. Price
₹2.90 K/sq.ft
Special Highlights
24x7 Security
Badminton Court
Cycling & Jogging Track
Gated Community
3 BHK Apartment available for sale in Bachapally,hyderabad,beside Mama Medical College, Nizampet, Hyderabad. Available amenities are: Gym, Swimming pool, Garden, Kids area, Sports facility, Lift. Apartment has 3 bedroom, 2 bathroom.
Read more
M Srikanth
Housing Prime Agent
Contact
--------------------------------------------------------------------------------
₹37.96 L - 62.05 L
EMI starts at ₹23.67 K
Bhuvanteza Evk Aura
Marketed by Sri Avani Infra Projects
Kollur, Hyderabad
Configurations
2, 3 BHK Apartments
Possession Starts
Nov, 2022
Avg. Price
₹3.65 K/sq.ft
Real estate developer Bhuvanteza Infrastructures has launched prime housing project Evk Aura in Kollur, Hyderabad. The project is offering beautiful and comfortable 2 and 3 BHK apartments for sale. Built-up area for 2 BHK apartments is in the range of 1040 to 1185 sq ft. and for 3 BHK apartments it is 1700 sq ft. Amenities which are required for a comfortable living will be available in the complex, they are car parking, club house, swimming pool, children play area, power backup and others. Developer Bhuvanteza Infrastructures can be contacted for owning an apartment in Evk Aura. Kollur is a ...
Read more
SA
Sri Avani Infra Projects
Seller
Contact
--------------------------------------------------------------------------------
and so on....
Note on selenium: you need selenium and geckodriver, and in this code geckodriver is expected to be at c:/program/geckodriver.exe.
You're not reading it right: there are 250 results in total but only 20 are loaded on the page initially, and that's why you get 20 in Python.
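As an aside, `len(html_soup)` never counts results: it returns the number of top-level children of the parsed document. Counting matches is done on the list that find_all returns; a minimal sketch with made-up markup and class name:

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup with a made-up 'card' class
html = "<div class='card'>1</div><div class='card'>2</div><div class='card'>3</div>"
soup = BeautifulSoup(html, 'html.parser')

# find_all returns a list of matching tags; len() of that list is the count
cards = soup.find_all('div', class_='card')
print(len(cards))  # 3
```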

How to use HuggingFace nlp library's GLUE for CoLA

I've been trying to use the HuggingFace nlp library's GLUE metric to check whether a given sentence is a grammatical English sentence. But I'm getting an error and am stuck without being able to proceed.
What I've tried so far;
reference and prediction are 2 text sentences
!pip install transformers
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
reference="Security has been beefed across the country as a 2 day nation wide curfew came into effect."
prediction="Security has been tightened across the country as a 2-day nationwide curfew came into effect."
import nlp
glue_metric = nlp.load_metric('glue',name="cola")
#Using BertTokenizer
encoded_reference=tokenizer.encode(reference, add_special_tokens=False)
encoded_prediction=tokenizer.encode(prediction, add_special_tokens=False)
glue_score = glue_metric.compute(encoded_prediction, encoded_reference)
Error I'm getting;
ValueError Traceback (most recent call last)
<ipython-input-9-4c3a3ce7b583> in <module>()
----> 1 glue_score = glue_metric.compute(encoded_prediction, encoded_reference)
6 frames
/usr/local/lib/python3.6/dist-packages/nlp/metric.py in compute(self, predictions, references, timeout, **metrics_kwargs)
198 predictions = self.data["predictions"]
199 references = self.data["references"]
--> 200 output = self._compute(predictions=predictions, references=references, **metrics_kwargs)
201 return output
202
/usr/local/lib/python3.6/dist-packages/nlp/metrics/glue/27b1bc63e520833054bd0d7a8d0bc7f6aab84cc9eed1b576e98c806f9466d302/glue.py in _compute(self, predictions, references)
101 return pearson_and_spearman(predictions, references)
102 elif self.config_name in ["mrpc", "qqp"]:
--> 103 return acc_and_f1(predictions, references)
104 elif self.config_name in ["sst2", "mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]:
105 return {"accuracy": simple_accuracy(predictions, references)}
/usr/local/lib/python3.6/dist-packages/nlp/metrics/glue/27b1bc63e520833054bd0d7a8d0bc7f6aab84cc9eed1b576e98c806f9466d302/glue.py in acc_and_f1(preds, labels)
60 def acc_and_f1(preds, labels):
61 acc = simple_accuracy(preds, labels)
---> 62 f1 = f1_score(y_true=labels, y_pred=preds)
63 return {
64 "accuracy": acc,
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py in f1_score(y_true, y_pred, labels, pos_label, average, sample_weight, zero_division)
1097 pos_label=pos_label, average=average,
1098 sample_weight=sample_weight,
-> 1099 zero_division=zero_division)
1100
1101
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py in fbeta_score(y_true, y_pred, beta, labels, pos_label, average, sample_weight, zero_division)
1224 warn_for=('f-score',),
1225 sample_weight=sample_weight,
-> 1226 zero_division=zero_division)
1227 return f
1228
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py in precision_recall_fscore_support(y_true, y_pred, beta, labels, pos_label, average, warn_for, sample_weight, zero_division)
1482 raise ValueError("beta should be >=0 in the F-beta score")
1483 labels = _check_set_wise_labels(y_true, y_pred, average, labels,
-> 1484 pos_label)
1485
1486 # Calculate tp_sum, pred_sum, true_sum ###
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py in _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
1314 raise ValueError("Target is %s but average='binary'. Please "
1315 "choose another average setting, one of %r."
-> 1316 % (y_type, average_options))
1317 elif pos_label not in (None, 1):
1318 warnings.warn("Note that pos_label (set to %r) is ignored when "
ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
However, I'm able to get results (pearson and spearmanr) for 'stsb' with the same code as given above.
Some help and a workaround for this ('cola') would be really appreciated. Thank you.
In general, if you are seeing this error with HuggingFace, you are trying to use the F-score as a metric on a text classification problem with more than 2 classes. Pick a different metric, such as "accuracy".
For this specific question:
Despite what you entered, it is trying to compute the f-score. From the example notebook, you should set the metric name as:
metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"
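For reference, CoLA is scored with Matthews correlation, which for binary 0/1 labels can be sketched without any library (made-up label and prediction vectors):

```python
from math import sqrt

def matthews_corrcoef(y_true, y_pred):
    # Confusion-matrix counts for binary 0/1 labels
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 1]))  # 0.0
```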

How to scrape json data from an interactive chart?

I have a specific section of a website that I want to scrape data from and here's the screenshot of the section -
I inspected the elements of that particular section and noticed that it's within a canvas tag. However, I also checked the source code of the website and I found that the data lies within the source code in a format I'm not familiar with. Here's a sample of that data
JSON.parse('\x5B\x7B\x22id\x22\x3A\x2232522\x22,\x22minute\x22\x3A\x2222\x22,\x22result\x22\x3A\x22MissedShots\x22,
\x22X\x22\x3A\x220.7859999847412109\x22,\x22Y\x22\x3A\x220.52\x22,\x22xG\x22\x3A\x220.03867039829492569\x22,
\x22player\x22\x3A\x22Lionel\x20Messi\x22,
\x22h_a\x22\x3A\x22h\x22,
\x22player_id\x22\x3A\x222097\x22,\x22situation\x22\x3A\x22OpenPlay\x22,
\x22season\x22\x3A\x222014\x22,\x22shotType\x22\x3A\x22LeftFoot\x22,
\x22match_id\x22\x3A...);
How do I parse through this data to give me the x,y co-ordinates of every shot from the map in the screenshot?
Ya, the issue is with the encoding/decoding.
You can pull that string and then essentially need to ignore the escape characters. Once you do that, you can use json.loads() to read it in and then navigate the JSON structure.
Now I only looked quickly, but did not see the data in there to show where the plot is on the shot chart. But you can have a look to see if you can find it. The data does however have a shotZones key.
import requests
from bs4 import BeautifulSoup
import json
import codecs

url = 'https://understat.com/player/2097'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')
for script in scripts:
    if 'var groupsData = JSON.parse' in script.text:
        encoded_string = script.text
        encoded_string = encoded_string.split("var groupsData = JSON.parse('")[-1]
        encoded_string = encoded_string.rsplit("'),", 1)[0]
        jsonStr = codecs.getdecoder('unicode-escape')(encoded_string)[0]
        jsonObj = json.loads(jsonStr)
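The escape-ignoring step can be seen on its own with a self-contained sketch, using a shortened, made-up blob in the same \x encoding:

```python
import codecs
import json

# Shortened, made-up version of the escaped blob in the page source
encoded = r'\x5B\x7B\x22id\x22\x3A\x2232522\x22\x7D\x5D'

# Decode the \xNN escapes into the literal characters they represent
decoded = codecs.decode(encoded, 'unicode-escape')
print(decoded)  # [{"id":"32522"}]

obj = json.loads(decoded)
print(obj[0]['id'])  # 32522
```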
Edit
Actually I found it. Here you go:
import requests
from bs4 import BeautifulSoup
import json
import codecs
from pandas.io.json import json_normalize

url = 'https://understat.com/player/2097'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')

# I noticed the data was embedded in the script tag that started with `var shotsData`
for script in scripts:
    if 'var shotsData' in script.text:
        # I store that text, then trim off the strings on the ends so that
        # it's in a valid json format
        encoded_string = script.text
        encoded_string = encoded_string.split("JSON.parse('", 1)[-1]
        encoded_string = encoded_string.rsplit("player_info =", 1)[0]
        encoded_string = encoded_string.rsplit("'),", 1)[0]
        # Have it ignore the escape characters so it can decode the ascii
        # and be able to use json.loads
        jsonStr = codecs.getdecoder('unicode-escape')(encoded_string)[0]
        jsonObj = json.loads(jsonStr)

df = json_normalize(jsonObj)
Output:
print (df)
X ... xG
0 0.7859999847412109 ... 0.03867039829492569
1 0.8619999694824219 ... 0.06870150566101074
2 0.86 ... 0.15034306049346924
3 0.8180000305175781 ... 0.045503295958042145
4 0.8690000152587891 ... 0.06531666964292526
5 0.7230000305175781 ... 0.054804932326078415
6 0.9119999694824219 ... 0.0971858948469162
7 0.885 ... 0.11467907577753067
8 0.875999984741211 ... 0.10627452284097672
9 0.9540000152587891 ... 0.3100203275680542
10 0.8969999694824219 ... 0.12571729719638824
11 0.8959999847412109 ... 0.04122981056571007
12 0.8730000305175781 ... 0.09942527115345001
13 0.769000015258789 ... 0.025321772322058678
14 0.885 ... 0.7432776093482971
15 0.86 ... 0.4680374562740326
16 0.7619999694824219 ... 0.05699075385928154
17 0.919000015258789 ... 0.10647356510162354
18 0.9530000305175781 ... 0.571601390838623
19 0.8280000305175781 ... 0.07561512291431427
20 0.9030000305175782 ... 0.4600500166416168
21 0.9469999694824218 ... 0.3132372796535492
22 0.92 ... 0.2869703769683838
23 0.7659999847412109 ... 0.07576987147331238
24 0.9640000152587891 ... 0.3824153244495392
25 0.8590000152587891 ... 0.1282796859741211
26 0.9330000305175781 ... 0.42914989590644836
27 0.9230000305175782 ... 0.4968196153640747
28 0.8240000152587891 ... 0.08198583126068115
29 0.965999984741211 ... 0.4309735596179962
.. ... ... ...
843 0.9159999847412109 ... 0.4672183692455292
844 0.7430000305175781 ... 0.04068271815776825
845 0.815 ... 0.07300572842359543
846 0.8980000305175782 ... 0.06551901996135712
847 0.7680000305175781 ... 0.028392281383275986
848 0.885 ... 0.7432776093482971
849 0.875999984741211 ... 0.4060465097427368
850 0.7880000305175782 ... 0.09496577084064484
851 0.7190000152587891 ... 0.05071594566106796
852 0.7680000305175781 ... 0.090679831802845
853 0.7440000152587891 ... 0.06875557452440262
854 0.9069999694824219 ... 0.45824503898620605
855 0.850999984741211 ... 0.06454816460609436
856 0.935 ... 0.5926618576049805
857 0.9219999694824219 ... 0.16091874241828918
858 0.73 ... 0.05882067605853081
859 0.9080000305175782 ... 0.3522365391254425
860 0.8209999847412109 ... 0.1690768003463745
861 0.850999984741211 ... 0.11893663555383682
862 0.88 ... 0.11993970721960068
863 0.8119999694824219 ... 0.15579797327518463
864 0.7019999694824218 ... 0.011425728909671307
865 0.7530000305175781 ... 0.06945621967315674
866 0.850999984741211 ... 0.08273076266050339
867 0.8180000305175781 ... 0.06529481709003448
868 0.86 ... 0.10793478786945343
869 0.8190000152587891 ... 0.061923813074827194
870 0.8130000305175781 ... 0.05294585973024368
871 0.799000015258789 ... 0.06358513236045837
872 0.9019999694824219 ... 0.5841030478477478
[873 rows x 20 columns]