Pandas local HTML errors - html

Hi, I am trying to read local HTML files with pandas, but one field is not returning the numeric value shown on the page; instead it returns a string that is not displayed but is present inside the HTML code. How can I read the table with the values shown in the HTML?
In the picture below you can see that I should be getting the 00:21:44 value, but instead I am getting the string
"document.Write(Timefactor("0:19:46","raster"))"
Any help?
I am attaching the file.

Your problem is that you are reading the raw HTML, while the browser also executes the JavaScript the page contains. You need to render the HTML the same way the browser does.
For that you will need to install the requests_html and html5lib packages. Then load and render your HTML, and proceed as usual:
import pandas as pd
from requests_html import HTML

with open(<< your file here >>, 'r', encoding='ISO-8859-1') as fi:
    html_orig = fi.read()

html_rendered = HTML(html=html_orig)
html_rendered.render()
df = pd.read_html(html_rendered.html)
I would also suggest cleaning the rendered HTML a little before feeding it to pandas, for example:
import re
last_table = html_rendered.find('table')[-1].html
last_table_noscript = re.sub(r'<script[^<]*.+?<\/script>','', last_table, flags=re.MULTILINE)
df2 = pd.read_html(last_table_noscript)
df2
[ ASS. Programa T Ferramenta Ø RC ID Cone H Total H RESP. ZMin ap/ae STK(xy/z) Comentário F RPM Tempo WKPL Notas
0 NaN 5414TR20112 2 TR32R1.6 des 32 16 M16L100 37 NaN 12793 0,2/17 0,15/ Desbaste Raster 3500 1800 00:09:46 (3+2) 2POS NaN
1 NaN 5414TR20113 3 TR35R1 35 1 M16L100 34 NaN -957 0,2/16 0/ Desbaste Raster 2000 2500 00:03:50 (3+2) 2POS NaN
2 NaN 5414TR20114 3 TR35R1 35 1 M16L100 34 NaN 12591 0,2/17 0/ Desbaste Raster 2000 2500 00:01:36 (3+2) 2POS NaN
3 NaN 5414TR20115 2 TR32R1.6 des 32 16 M16L100 37 NaN -1865 0,2/ 0/ Z Constante 3500 1800 00:34:55 (3+2) 2POS NaN
4 NaN 5414TR20116 160 EHHB-4120-Ap 12 6 CT12L75 60 36.0 505 /0,3 0/ Raster 3500 6200 00:21:44 (3+2) 2POS NaN]
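Once the table is rendered with the real values, the Tempo column can be turned into proper durations for summing or sorting. A small sketch, using two made-up values in the same HH:MM:SS format as the table:

```python
import pandas as pd

# Hypothetical sample of the "Tempo" column after rendering
tempos = pd.Series(["00:09:46", "00:21:44"])

# pandas parses HH:MM:SS strings directly into timedeltas,
# which makes the per-operation times easy to sum or compare
durations = pd.to_timedelta(tempos)
total = durations.sum()
print(total)  # 0 days 00:31:30
```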

Multiple JSON root element errors

The file I am calling is JSON with padding. I have written some simple code to remove the padding, but by stringing together multiple JSON strings the formatting ends up incorrect and I get root element errors.
I am taking the output of the Python program and running it through an online JSON formatter and validator website to check it. I am a learner, so please bear with my inexperience. All help appreciated.
import json
import re
import requests

payload = {}
headers = {}

for race in range(1, 3):
    url = f"https://s3-ap-southeast-2.amazonaws.com/racevic.static/2018-01-01/flemington/sectionaltimes/race-{race}.json?callback=sectionaltimes_callback"
    response = requests.request("GET", url, headers=headers, data=payload)
    strip = 'sectionaltimes_callback'
    string = response.text
    repl = ''
    result = re.sub(strip, repl, string)
    print(result)
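Note that substituting only the callback name leaves the wrapping parentheses in place, so the result is still not valid JSON. A more robust sketch (reusing the sectionaltimes_callback name from above) unwraps the padding and validates the payload in one step:

```python
import json
import re

def unwrap_jsonp(text, callback="sectionaltimes_callback"):
    # Match callback( ... ) with an optional trailing semicolon, then hand
    # the inner payload to json.loads so malformed output fails fast
    m = re.fullmatch(rf"\s*{re.escape(callback)}\((.*)\)\s*;?\s*", text, flags=re.DOTALL)
    if m is None:
        raise ValueError("response is not a JSONP payload")
    return json.loads(m.group(1))

# A toy payload in the same shape as the race files
obj = unwrap_jsonp('sectionaltimes_callback({"Horses": []})')
print(obj)  # {'Horses': []}
```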
This is one way of obtaining the data you're looking for:
import requests
import json
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0',
           'Accept-Language': 'en-US,en;q=0.5'}

for race in range(1, 3):
    url = f"https://s3-ap-southeast-2.amazonaws.com/racevic.static/2018-01-01/flemington/sectionaltimes/race-{race}.json?callback=sectionaltimes_callback"
    r = requests.get(url, headers=headers)
    json_obj = json.loads(r.text.split('sectionaltimes_callback(')[1].rsplit(')', 1)[0])
    df = pd.DataFrame(json_obj['Horses'])
    print(df)
This would return (print out in terminal) a dataframe for each race:
Comment FinalPosition FinalPositionAbbreviation FullName SaddleNumber HorseUrl SilkUrl Trainer TrainerUrl Jockey ... DistanceVarToWinner SixHundredMetresTime TwoHundredMetresTime Early Mid Late OverallPeakSpeed PeakSpeedLocation OverallAvgSpeed DistanceFromRail
0 Resumes. Showed pace to lead well off the rail... 1 1st Crossing the Abbey 2 /horses/crossing-the-abbey //s3-ap-southeast-2.amazonaws.com/racevic.silk... T.Hughes /trainers/tim-hughes C.Williams ... 32.84 11.43 57.4 68.2 65.3 68.9 400m 63.3 0.8
1 Same sire as Katy's Daughter out of dual stake... 2 2nd Khulaasa 5 /horses/khulaasa //s3-ap-southeast-2.amazonaws.com/racevic.silk... D. & B.Hayes & T.Dabernig /trainers/david-hayes D.Oliver ... 0 32.61 11.29 56.6 68.4 66.0 69.2 700m 63.4 1.2
2 Trialled nicely before pleasing debut in what ... 3 3rd Graceful Star 4 /horses/graceful-star //s3-ap-southeast-2.amazonaws.com/racevic.silk... D. & B.Hayes & T.Dabernig /trainers/david-hayes A.Mallyon ... 0 33.10 11.56 56.9 67.4 64.8 68.5 400m 62.8 4.4
3 Sat second at debut, hampered at the 700m then... 4 4th Carnina 1 /horses/carnina //s3-ap-southeast-2.amazonaws.com/racevic.silk... T.Busuttin & N.Young /trainers/trent-busuttin B.Mertens ... +1 33.30 11.80 56.9 68.2 63.9 68.9 400m 62.7 3.0
4 $75k yearling by a Magic Millions winner out o... 5 5th Mirette 7 /horses/mirette //s3-ap-southeast-2.amazonaws.com/racevic.silk... A.Alexander /trainers/archie-alexander J.Childs ... 0 33.53 11.89 57.0 67.9 63.5 68.5 700m 62.5 3.8
5 $95k yearling by same sire as Pinot out of a s... 6 6th Dark Confidant 3 /horses/dark-confidant //s3-ap-southeast-2.amazonaws.com/racevic.silk... D. & B.Hayes & T.Dabernig /trainers/david-hayes D.Dunn ... +2 33.74 11.91 56.4 67.1 63.3 68.8 700m 61.9 5.0
6 Same sire as Vega Magic out of imported stakes... 7 7th La Celestina 6 /horses/la-celestina //s3-ap-southeast-2.amazonaws.com/racevic.silk... D.R.Brideoake /trainers/david-brideoake D.M.Lane ... +1 34.46 12.27 57.5 67.3 61.4 68.2 700m 61.7 0.8
7 rows × 29 columns
Comment FinalPosition FinalPositionAbbreviation FullName SaddleNumber HorseUrl SilkUrl Trainer TrainerUrl Jockey ... DistanceVarToWinner SixHundredMetresTime TwoHundredMetresTime Early Mid Late OverallPeakSpeed PeakSpeedLocation OverallAvgSpeed DistanceFromRail
0 Game in defeat both runs this campaign. Better... 1 1st Wise Hero 2 /horses/wise-hero //s3-ap-southeast-2.amazonaws.com/racevic.silk... J.W.Price /trainers/john-price S.M.Thornton ... 33.13 11.43 55.4 62.7 65.5 68.2 300m 61.7 0.7
1 Two runs since racing wide over this trip at C... 2 2nd Just Hifalutin 5 /horses/just-hifalutin //s3-ap-southeast-2.amazonaws.com/racevic.silk... E.Jusufovic /trainers/enver-jusufovic L.Currie ... +3 32.75 11.37 53.1 63.8 65.8 68.5 400m 61.7 3.3
2 Did a bit of early work at Seymour and was not... 3 3rd King Kohei 10 /horses/king-kohei //s3-ap-southeast-2.amazonaws.com/racevic.silk... Michael & Luke Cerchi /trainers/mick-cerchi
[...]

Python OCR Tesseract, find a certain word in the image and return me the coordinates

I would like your help: I have been trying for a few months to write code that finds a word in an image and returns the coordinates where that word appears in the image.
I was trying this using OpenCV and Tesseract OCR, but I was not successful. Could someone here in the community help me?
I'll leave an image here as an example:
Here is something you can start with:
import pytesseract
from PIL import Image

pytesseract.pytesseract.tesseract_cmd = r'C:\<path-to-your-tesseract>\Tesseract-OCR\tesseract.exe'

img = Image.open("img.png")
data = pytesseract.image_to_data(img, output_type='dict')
boxes = len(data['level'])

for i in range(boxes):
    if data['text'][i] != '':
        print(data['left'][i], data['top'][i], data['width'][i], data['height'][i], data['text'][i])
If you have difficulties installing pytesseract, see: https://stackoverflow.com/a/53672281/18667225
Output:
153 107 277 50 Palavras
151 197 133 37 com
309 186 154 48 R/RR
154 303 126 47 Rato
726 302 158 47 Resto
154 377 144 50 Rodo
720 379 159 47 Arroz
152 457 160 48 Carro
726 457 151 46 Ferro
154 532 142 50 Rede
726 534 159 47 Barro
154 609 202 50 Parede
726 611 186 47 Barata
154 690 124 47 Faro
726 685 288 50 Beterraba
154 767 192 47 Escuro
726 766 151 47 Ferro
I managed to find the solution and I'll post it here for you:
import pytesseract
import cv2
from pytesseract import Output

pytesseract.pytesseract.tesseract_cmd = r'C:\<path-to-your-tesseract>\Tesseract-OCR\tesseract.exe'

filepath = 'image.jpg'
image = cv2.imread(filepath, 1)

# converting image to grayscale
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# converting to a binary image by thresholding
# this step is necessary if you have a color image, because if you skip it
# tesseract will not be able to detect the text correctly and will give an incorrect result
threshold_img = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# display the image
cv2.imshow('threshold image', threshold_img)
# hold the output window until the user presses a key
cv2.waitKey(0)
# destroy the windows present on the screen
cv2.destroyAllWindows()

# setting parameters for tesseract
custom_config = r'--oem 3 --psm 6'

# now feeding the image to tesseract
details = pytesseract.image_to_data(threshold_img, output_type=Output.DICT, config=custom_config, lang='eng')

# rectangle color (red, in BGR)
vermelho = (0, 0, 255)

# print all the keys found
print(details.keys())
print(details['text'])

# loop over all found texts
for i in range(len(details['text'])):
    # if it finds the text "UNIVERSIDADE", print the coordinates and draw a rectangle around the word
    if details['text'][i] == 'UNIVERSIDADE':
        print(details['text'][i])
        print(f"left: {details['left'][i]}")
        print(f"top: {details['top'][i]}")
        print(f"width: {details['width'][i]}")
        print(f"height: {details['height'][i]}")
        cv2.rectangle(image, (details['left'][i], details['top'][i]), (details['left'][i] + details['width'][i], details['top'][i] + details['height'][i]), vermelho)
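The coordinate lookup in the loop above can also be wrapped in a small reusable function. A minimal sketch of just the box-filtering step; the dict below mimics the shape of `pytesseract.image_to_data(..., output_type='dict')` output, with made-up values for illustration:

```python
# Hypothetical OCR output in the same dict shape pytesseract produces
data = {
    'text':   ['', 'Palavras', 'com', 'UNIVERSIDADE'],
    'left':   [0, 153, 151, 200],
    'top':    [0, 107, 197, 300],
    'width':  [0, 277, 133, 320],
    'height': [0, 50, 37, 45],
}

def find_word(data, word):
    # Collect one (left, top, width, height) box per OCR entry matching `word`
    return [
        (data['left'][i], data['top'][i], data['width'][i], data['height'][i])
        for i, text in enumerate(data['text'])
        if text == word
    ]

print(find_word(data, 'UNIVERSIDADE'))  # [(200, 300, 320, 45)]
```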

BeautifulSoup returns nothing

I'm trying to learn how to scrape components from a website, specifically this one: https://genshin-impact.fandom.com/wiki/Serenitea_Pot/Load
Following guidance from the internet, I collected several important elements, such as the class
"article-table sortable mw-collapsible jquery-tablesorter mw-made-collapsible"
and HTML elements like th and td, to get at their specific content with this code:
import requests
from bs4 import BeautifulSoup

URL = "https://genshin-impact.fandom.com/wiki/Serenitea_Pot/Load"
page = requests.get(URL)
#print(page.text)

soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="mw-content-text")
teapot_loads = results.find_all("table", class_="article-table sortable mw-collapsible jquery-tablesorter mw-made-collapsible")
for teapot_loads in teapot_loads:
    table_head_element = teapot_loads.find("th", class_="headerSort")
    print(table_head_element)
    print()
I seem to have written the correct element (th) and the correct class name ("headerSort"), and the program runs without errors, but it doesn't return anything. What did I do wrong?
You can debug your code to see what went wrong, where. One such debugging effort is below, where we keep only one class for tables, and then print out the full class of the actual elements:
import requests
from bs4 import BeautifulSoup

URL = "https://genshin-impact.fandom.com/wiki/Serenitea_Pot/Load"
page = requests.get(URL)
#print(page.text)

soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="mw-content-text")
# print(results)
teapot_loads = results.find_all("table", class_="article-table")
for teapot_load in teapot_loads:
    print(teapot_load.get_attribute_list('class'))
    table_head_element = teapot_load.find("th", class_="headerSort")
    print(table_head_element)
This will print out (besides the element you want) the table's class as seen by requests/BeautifulSoup: ['article-table', 'sortable', 'mw-collapsible']. After the original HTML loads in the page (with the original classes, which is what requests/BeautifulSoup sees), the JavaScript on that page kicks in and adds new classes to the table. Since you are searching for elements by those dynamically added classes, your search fails.
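This class-matching behaviour can be reproduced offline with a small snippet of hypothetical markup mirroring the page's pre-JavaScript state:

```python
from bs4 import BeautifulSoup

# Markup as requests sees it, before JavaScript adds the extra classes
html = '<table class="article-table sortable mw-collapsible"><tr><th class="headerSort">Name</th></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

# A single-class search matches, because class_ with one token matches
# any element carrying that token among its classes
print(soup.find('table', class_='article-table') is not None)  # True

# The full five-class string fails: a multi-token string only matches the
# exact class attribute value, and the dynamically added classes are not
# present in the raw HTML
full = 'article-table sortable mw-collapsible jquery-tablesorter mw-made-collapsible'
print(soup.find('table', class_=full) is None)  # True
```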
Nonetheless, here is a more elegant way of obtaining that table:
import pandas as pd
url = 'https://genshin-impact.fandom.com/wiki/Serenitea_Pot/Load'
dfs = pd.read_html(url)
print(dfs[1])
This will return a dataframe with that table:
   Image                                            Name  Adeptal Energy  Load  ReducedLoad  Ratio
0    NaN          "A Bloatty Floatty's Dream of the Sky"              60    65           47   0.92
1    NaN                   "A Guide in the Summer Woods"              60    35           24   1.71
2    NaN               "A Messenger in the Summer Woods"              60    35           24   1.71
3    NaN  "A Portrait of Paimon, the Greatest Companion"              90    35           24   2.57
4    NaN                      "A Seat in the Wilderness"              20    50           50   0.40
5    NaN                     "Ballad-Spinning Windwheel"              90   185          185   0.49
6    NaN                            "Between Nine Steps"              30   550          550   0.05
[...]
Documentation for bs4 (BeautifulSoup) can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#
Also, docs for pandas.read_html: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html

How to scrape json data from an interactive chart?

I have a specific section of a website that I want to scrape data from and here's the screenshot of the section -
I inspected the elements of that particular section and noticed that it's within a canvas tag. However, I also checked the source code of the website and I found that the data lies within the source code in a format I'm not familiar with. Here's a sample of that data
JSON.parse('\x5B\x7B\x22id\x22\x3A\x2232522\x22,\x22minute\x22\x3A\x2222\x22,\x22result\x22\x3A\x22MissedShots\x22,
\x22X\x22\x3A\x220.7859999847412109\x22,\x22Y\x22\x3A\x220.52\x22,\x22xG\x22\x3A\x220.03867039829492569\x22,
\x22player\x22\x3A\x22Lionel\x20Messi\x22,
\x22h_a\x22\x3A\x22h\x22,
\x22player_id\x22\x3A\x222097\x22,\x22situation\x22\x3A\x22OpenPlay\x22,
\x22season\x22\x3A\x222014\x22,\x22shotType\x22\x3A\x22LeftFoot\x22,
\x22match_id\x22\x3A...);
How do I parse through this data to give me the x,y co-ordinates of every shot from the map in the screenshot?
Yes, the issue is with the encoding/decoding.
You can pull that string, but you then need to interpret the escape characters. Once you do that, you can use json.loads() to read it in and navigate the JSON structure.
Now, I only looked quickly and did not see data in there showing where each plot point sits on the shot chart, but you can have a look to see if you can find it. The data does, however, have a shotZones key.
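The escape-decoding step can be seen in isolation on a short string in the same \xNN-escaped form as the page source (the snippet below is a shortened, made-up echo of the sample above):

```python
import codecs
import json

# '\x5B\x7B\x22id\x22\x3A\x2232522\x22\x7D\x5D' decodes to '[{"id":"32522"}]'
encoded = '\\x5B\\x7B\\x22id\\x22\\x3A\\x2232522\\x22\\x7D\\x5D'
decoded = codecs.decode(encoded, 'unicode-escape')
obj = json.loads(decoded)
print(obj)  # [{'id': '32522'}]
```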
import requests
from bs4 import BeautifulSoup
import json
import codecs

url = 'https://understat.com/player/2097'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

scripts = soup.find_all('script')
for script in scripts:
    if 'var groupsData = JSON.parse' in script.text:
        encoded_string = script.text
        encoded_string = encoded_string.split("var groupsData = JSON.parse('")[-1]
        encoded_string = encoded_string.rsplit("'),", 1)[0]
        jsonStr = codecs.getdecoder('unicode-escape')(encoded_string)[0]
        jsonObj = json.loads(jsonStr)
Edit
Actually I found it. Here you go:
import requests
from bs4 import BeautifulSoup
import json
import codecs
import pandas as pd

url = 'https://understat.com/player/2097'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

scripts = soup.find_all('script')

# I noticed the data was embedded in the script tag that started with `var shotsData`
for script in scripts:
    if 'var shotsData' in script.text:
        # Store that text, then trim the ends off so that
        # it's in a valid JSON format
        encoded_string = script.text
        encoded_string = encoded_string.split("JSON.parse('", 1)[-1]
        encoded_string = encoded_string.rsplit("player_info =", 1)[0]
        encoded_string = encoded_string.rsplit("'),", 1)[0]

        # Have it interpret the escape characters so it can decode the text
        # and json.loads can parse it
        jsonStr = codecs.getdecoder('unicode-escape')(encoded_string)[0]
        jsonObj = json.loads(jsonStr)
        # json_normalize now lives on the top-level pandas namespace
        df = pd.json_normalize(jsonObj)
Output:
print(df)
X ... xG
0 0.7859999847412109 ... 0.03867039829492569
1 0.8619999694824219 ... 0.06870150566101074
2 0.86 ... 0.15034306049346924
3 0.8180000305175781 ... 0.045503295958042145
4 0.8690000152587891 ... 0.06531666964292526
5 0.7230000305175781 ... 0.054804932326078415
6 0.9119999694824219 ... 0.0971858948469162
7 0.885 ... 0.11467907577753067
8 0.875999984741211 ... 0.10627452284097672
9 0.9540000152587891 ... 0.3100203275680542
10 0.8969999694824219 ... 0.12571729719638824
11 0.8959999847412109 ... 0.04122981056571007
12 0.8730000305175781 ... 0.09942527115345001
13 0.769000015258789 ... 0.025321772322058678
14 0.885 ... 0.7432776093482971
15 0.86 ... 0.4680374562740326
16 0.7619999694824219 ... 0.05699075385928154
17 0.919000015258789 ... 0.10647356510162354
18 0.9530000305175781 ... 0.571601390838623
19 0.8280000305175781 ... 0.07561512291431427
20 0.9030000305175782 ... 0.4600500166416168
21 0.9469999694824218 ... 0.3132372796535492
22 0.92 ... 0.2869703769683838
23 0.7659999847412109 ... 0.07576987147331238
24 0.9640000152587891 ... 0.3824153244495392
25 0.8590000152587891 ... 0.1282796859741211
26 0.9330000305175781 ... 0.42914989590644836
27 0.9230000305175782 ... 0.4968196153640747
28 0.8240000152587891 ... 0.08198583126068115
29 0.965999984741211 ... 0.4309735596179962
.. ... ... ...
843 0.9159999847412109 ... 0.4672183692455292
844 0.7430000305175781 ... 0.04068271815776825
845 0.815 ... 0.07300572842359543
846 0.8980000305175782 ... 0.06551901996135712
847 0.7680000305175781 ... 0.028392281383275986
848 0.885 ... 0.7432776093482971
849 0.875999984741211 ... 0.4060465097427368
850 0.7880000305175782 ... 0.09496577084064484
851 0.7190000152587891 ... 0.05071594566106796
852 0.7680000305175781 ... 0.090679831802845
853 0.7440000152587891 ... 0.06875557452440262
854 0.9069999694824219 ... 0.45824503898620605
855 0.850999984741211 ... 0.06454816460609436
856 0.935 ... 0.5926618576049805
857 0.9219999694824219 ... 0.16091874241828918
858 0.73 ... 0.05882067605853081
859 0.9080000305175782 ... 0.3522365391254425
860 0.8209999847412109 ... 0.1690768003463745
861 0.850999984741211 ... 0.11893663555383682
862 0.88 ... 0.11993970721960068
863 0.8119999694824219 ... 0.15579797327518463
864 0.7019999694824218 ... 0.011425728909671307
865 0.7530000305175781 ... 0.06945621967315674
866 0.850999984741211 ... 0.08273076266050339
867 0.8180000305175781 ... 0.06529481709003448
868 0.86 ... 0.10793478786945343
869 0.8190000152587891 ... 0.061923813074827194
870 0.8130000305175781 ... 0.05294585973024368
871 0.799000015258789 ... 0.06358513236045837
872 0.9019999694824219 ... 0.5841030478477478
[873 rows x 20 columns]

Create line chart from csv file without excel

Suppose I have the following data in CSV format:
Time Total Allocated Deallocated
0.00004 0 16 0
0.000516 16 31 0
0.046274 47 4100 0
0.047036 4147 0 31
0.047602 4116 35 0
0.214296 4151 4100 0
0.215109 8251 0 35
I am looking for some kind of software that will allow me to make a line chart of it (where the Time column will be the X axis). I used Excel for now, but I am looking for something else that will let me see greater detail.
Any ideas?
Use Datawrapper. It's very easy, and you can publish the chart on the web or export it to a PNG file.
You can also use R. Here is an example of code to generate a time series plot:
library("ggplot2")
df <- data.frame(date = seq(as.Date("2012-01-01"),as.Date("2012-12-01"), by = "month"), x = rnorm(12))
ggplot(df, aes(x=date, y = x)) + geom_line() + theme_bw()
This is an old question, but still: https://plot.ly is also a good site for that kind of stuff.
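Since the rest of this page uses Python, the same chart can also be sketched with pandas and matplotlib (assuming both are installed). Note the sample data is whitespace-delimited rather than comma-delimited, hence sep=r"\s+":

```python
import io

import matplotlib
matplotlib.use("Agg")  # draw off-screen so this also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

csv_text = """\
Time Total Allocated Deallocated
0.00004 0 16 0
0.000516 16 31 0
0.046274 47 4100 0
0.047036 4147 0 31
0.047602 4116 35 0
0.214296 4151 4100 0
0.215109 8251 0 35
"""

# The sample is whitespace-delimited, not comma-delimited
df = pd.read_csv(io.StringIO(csv_text), sep=r"\s+")

# One line per column, with Time on the X axis
ax = df.plot(x="Time", y=["Total", "Allocated", "Deallocated"])
ax.set_xlabel("Time (s)")
plt.savefig("memory.png", dpi=150)
```

Matplotlib's interactive zoom (or saving at a higher dpi) gives the extra detail that Excel's default charts lack.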