I'm trying to scrape https://PickleballBrackets.com using Selenium and BeautifulSoup with this code:
browser = webdriver.Safari()
browser.get('https://pickleballbrackets.com')
soup = BeautifulSoup(browser.page_source, 'lxml')
If I look at browser.page_source after I get the html, I can see 50 instances of
<div class="browse-row-box">
but after I create a soup object, they are lost. I believe that means that I have poorly formed html. I've tried all three parsers ('lxml', 'html5lib', 'html.parser') without any luck.
Suggestions on how to proceed?
It's a lot easier to get the data straight from the JSON endpoint than to parse the rendered HTML.
import pandas as pd
import requests
url = 'https://pickleballbrackets.com/Json.asmx/EventsSearch_PublicUI'
payload = {
'AgeIDs': "",
'Alpha': "All",
'ClubID': "",
'CountryID': "",
'DateFilter': "future",
'EventTypeIDs': "1",
'FormatIDs': "",
'FromDate': "",
'IncludeTestEvents': "0",
'OrderBy': "EventActivityFirstDate",
'OrderDirection': "Asc",
'PageNumber': "1",
'PageSize': 9999,
'PlayerGroupIDs': "",
'PrizeMoney': "All",
'RankIDs': "",
'ReturnType': "json",
'SearchWord': "",
'ShowOnCalendar': "0",
'SportIDs': "dc1894c6-7e85-43bc-bfa2-3993b0dd630f",
'StateIDs': "",
'ToDate': "",
'prt': ""}
jsonData = requests.post(url, json=payload).json()
df = pd.DataFrame(jsonData['d'])
Output:
print(df.head(2).to_string())
RowNumber RecordCount PageCount CurrPage EventID ClubID Title TimeZoneAbbreviation UTCOffset HasDST StartTimesPosted Logo OnlineRegistration_Active Registration_DateOpen Registration_DateClosed IsSanctioned CancelTourney LocationOfEvent_Venue LocationOfEvent_StreetAddress LocationOfEvent_City LocationOfEvent_CountryTitle LocationOfEvent_StateTitle LocationOfEvent_Zip ShowDraws IsFavorite IsPrizeMoney MaxRegistrationsForEntireEvent Sanction_PCO SanctionLevelAppovedStatus_PCO SanctionLevelID_PCO Sanction_SSIPA SanctionLevelAppovedStatus_SSIPA SanctionLevelID_SSIPA Sanction_USAPA SanctionLevelAppovedStatus_USAP SanctionLevelID_USAP Sanction_WPF SanctionLevelAppovedStatus_WPF SanctionLevelID_WPF Sanction_GPA SanctionLevelAppovedStatus_GPA SanctionLevelID_GPA EventActivityFirstDate EventActivityLastDate IsRegClosed Cost_Registration_Current Cost_FeeOnEvents RegistrationCount_InAtLeastOneLiveEvent showResultsButton SantionLevels_PCO_Title SantionLevels_PCO_LevelLogo SantionLevels_SSIPA_Title SantionLevels_SSIPA_LevelLogo SantionLevels_USAP_Title SantionLevels_USAP_LevelLogo SantionLevels_WPF_Title SantionLevels_WPF_LevelLogo SantionLevels_GPA_Title SantionLevels_GPA_LevelLogo mng
0 1 152 1 1 410d04c2-49c5-48a4-847f-0f0ac0aa92f7 91c83e9c-c8e3-460d-b124-52f5c1036336 Cincinnati Pickleball Club 2022 March Mania EST -5 True False 410d04c2-49c5-48a4-847f-0f0ac0aa92f7_Logo.png True 1/24/2022 7:30:00 AM 3/22/2022 5:00:00 PM False False Five Seasons Ohio 11790 Snider Road Cincinnati United States Ohio 45249 -1 0 False 0 False False False False False 3/25/2022 4:00:00 PM 3/27/2022 2:00:00 PM 1 50.0 225.0 238 1 0
1 2 152 1 1 9f0c5976-94e9-4d58-a273-774744bdacec e5cd380b-fe72-4ef4-89e8-5053e94587a3 Flash Fridays Slam Series - March 25th EST -5 True False 9f0c5976-94e9-4d58-a273-774744bdacec_Logo.png True 3/1/2022 5:00:00 PM 3/23/2022 11:45:00 PM False False Holbrook Park 100 Sherwood Dr Huntersville United States North Carolina 28078 1 0 False 6 False False False False False 3/25/2022 4:00:00 PM 3/25/2022 4:00:00 PM 1 25.0 0.0 0 0 0
....
[152 rows x 60 columns]
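If only a few of the 60 columns matter, the records under the 'd' key can be trimmed before they reach pandas. The sketch below is only an illustration: it assumes the response is shaped like {"d": [{...}, ...]}, as the DataFrame construction above implies. The helper name pick_fields is made up for the example, and the stand-in record copies a few values from the first row of the printed output so that no network call is needed.

```python
def pick_fields(response_json, fields):
    """Return one dict per event, keeping only the requested keys."""
    events = response_json.get("d", [])
    # Missing keys become None instead of raising KeyError.
    return [{name: event.get(name) for name in fields} for event in events]


# Stand-in for requests.post(url, json=payload).json(); values copied from the output above.
sample_response = {
    "d": [
        {
            "RowNumber": 1,
            "Title": "Cincinnati Pickleball Club 2022 March Mania",
            "LocationOfEvent_City": "Cincinnati",
            "LocationOfEvent_StateTitle": "Ohio",
            "EventActivityFirstDate": "3/25/2022 4:00:00 PM",
        }
    ]
}

wanted = ["Title", "LocationOfEvent_City", "EventActivityFirstDate"]
trimmed = pick_fields(sample_response, wanted)
print(trimmed)
```

With the real response, pd.DataFrame(pick_fields(jsonData, wanted)) should give a narrower frame that is easier to print.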
I need to extract the address and telephone number from my HTML page using XPath. The
address is sometimes in a single `<p>` and sometimes split across two. I have 11
stores.
This is how the <p> tags appear in my HTML (just an example):
<div class="info-block-value">
<p>36 rue de la Verrerie 75004 PARIS</p>
<p>Tél : 0111 222 222</p>
</div>
<div class="info-block-value">
<p>11 rue des archives</p>
<p>75004 PARIS</p>
<p>Tél : 01 11 11 11 11</p>
</div>
Shop 1: P1 = address, P2 = tel
Shop 2: P1 = address, P2 = tel, P3 = fax
Shop 3: P1 = address line 1, P2 = address line 2, P3 = tel
Shop 4: P1 = address, P2 = tel
Shop 5: P1 = address, P2 = tel
Shops 6, 7, 8, 9, 11: P1 = address line 1, P2 = address line 2 (they have no telephone)
Shop 10: P1 = address line 1, P2 = address line 2, P3 = tel, P4 = space, P5 = email
I tried this:
{
"name": "store Addr",
"key": "Address",
"xPath": "(//div[@class='info-block-value']/p)[1] | (//div[@class='info-block-value']/p)[2]",
"level": 0,
"enabled": true,
"values": []
},
{
"name": "Tel No",
"key": "TelephoneNumber",
"xPath": "(//div[@class='info-block-value']/p)[2] | (//div[@class='info-block-value']/p)[3]",
"regex": "Tél : ((\d+\s*)+)+",
"level": 0,
"enabled": true,
"values": []
}
But I'm not getting the correct results. Can someone help me with this?
Results:
id name address phone
1 a 36 rue de la Verrerie 75004 PARIS 0111 222 222
2 b 11 rue des archives 01 11 11 11 11
Expected results:
id name address phone
1 a 36 rue de la Verrerie 75004 PARIS 0111 222 222
2 b 11 rue des archives 75004 PARIS 01 11 11 11 11
You basically need all the <p> elements except the last one for the address, and the last one for the telephone number.
So, get the address with
//div[@class="info-block-value"]/p[position() < last()]
and similarly, get the telephone number with
//div[@class="info-block-value"]/p[last()]
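To check the "all but the last <p>" idea outside the scraping tool, here is a rough Python sketch. It uses only the standard library's xml.etree.ElementTree, whose XPath subset does not support position() < last(), so the split is done with list slicing instead. The <root> wrapper, the SAMPLE string, and the split_store helper are assumptions made for the example, and the phone pattern is a simplified variant of the regex in the question.

```python
import re
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a made-up <root> so it parses as XML.
SAMPLE = """<root>
<div class="info-block-value">
<p>36 rue de la Verrerie 75004 PARIS</p>
<p>Tél : 0111 222 222</p>
</div>
<div class="info-block-value">
<p>11 rue des archives</p>
<p>75004 PARIS</p>
<p>Tél : 01 11 11 11 11</p>
</div>
</root>"""


def split_store(div):
    """Treat every <p> except the last as address lines and the last as the phone."""
    paragraphs = [(p.text or "").strip() for p in div.findall("p")]
    if not paragraphs:
        return "", None
    address = " ".join(paragraphs[:-1])
    match = re.search(r"Tél\s*:\s*([\d ]+)", paragraphs[-1])
    phone = match.group(1).strip() if match else None
    return address, phone


root = ET.fromstring(SAMPLE)
stores = [split_store(div) for div in root.findall(".//div[@class='info-block-value']")]
for address, phone in stores:
    print(address, "|", phone)
```

Like the XPath above, this assumes the last <p> always holds the telephone number. For the shops listed in the question that end with a fax, an email, or a second address line, the last paragraph would have to be inspected before it is dropped from the address.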
Say I have the following dataframes:
df1 <- data.frame(Name = c("Harry","George"), color=c("#EA0001", "#EEEEEE"))
Name color
1 Harry #EA0001
2 George #EEEEEE
df.details <- data.frame(Name = c(rep("Harry",each=3), rep("George", each=3)),
age=21:23,
total=c(14,19,24,1,9,4)
)
Name age total
1 Harry 21 14
2 Harry 22 19
3 Harry 23 24
4 George 21 1
5 George 22 9
6 George 23 4
I know how to convert each df to json like this:
library(jsonlite)
toJSON(df.details)
[{"Name":"Harry","age":21,"total":14},{"Name":"Harry","age":22,"total":19},{"Name":"Harry","age":23,"total":24},{"Name":"George","age":21,"total":1},{"Name":"George","age":22,"total":9},{"Name":"George","age":23,"total":4}]
However, I am looking to get the following structure to my JSON data:
{
"myjsondata": [
{
"Name": "Harry",
"color": "#EA0001",
"details": [
{
"age": 21,
"total": 14
},
{
"age": 22,
"total": 19
},
{
"age": 23,
"total": 24
}
]
},
{
"Name": "George",
"color": "#EEEEEE",
"details": [
{
"age": 21,
"total": 1
},
{
"age": 22,
"total": 9
},
{
"age": 23,
"total": 4
}
]
}
]
}
I think the answer may lie in how I store the data in a list in R before converting, but I'm not sure.
Try this format:
df1$details <- split(df.details[-1], df.details$Name)[df1$Name]
df1
# Name color details
#1 Harry #EA0001 21, 22, 23, 14, 19, 24
#2 George #EEEEEE 21, 22, 23, 1, 9, 4
toJSON(df1)
#[{
#"Name":"Harry",
#"color":"#EA0001",
#"details":[
# {"age":21,"total":14},
# {"age":22,"total":19},
# {"age":23,"total":24}]},
#{
#"Name":"George",
#"color":"#EEEEEE",
#"details":[
# {"age":21,"total":1},
# {"age":22,"total":9},
# {"age":23,"total":4}]}
#]
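For comparison, the same split-and-attach idea can be written in Python with only the standard library. This is a sketch under assumptions: the two data frames are represented as lists of dicts, and the helper name nest_details is made up for the example.

```python
import json
from collections import defaultdict

people = [
    {"Name": "Harry", "color": "#EA0001"},
    {"Name": "George", "color": "#EEEEEE"},
]
details = [
    {"Name": "Harry", "age": 21, "total": 14},
    {"Name": "Harry", "age": 22, "total": 19},
    {"Name": "Harry", "age": 23, "total": 24},
    {"Name": "George", "age": 21, "total": 1},
    {"Name": "George", "age": 22, "total": 9},
    {"Name": "George", "age": 23, "total": 4},
]


def nest_details(parents, children, key="Name", child_field="details"):
    """Group the child rows by key and attach each group to its parent row."""
    grouped = defaultdict(list)
    for row in children:
        # Drop the grouping key so it is not repeated inside each detail record.
        grouped[row[key]].append({k: v for k, v in row.items() if k != key})
    return [{**parent, child_field: grouped.get(parent[key], [])} for parent in parents]


nested = {"myjsondata": nest_details(people, details)}
print(json.dumps(nested, indent=2))
```

Wrapping the result in a dict with the "myjsondata" key gives the outer object from the question; in R, the equivalent step would presumably be toJSON(list(myjsondata = df1)).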
My current code as seen below attempts to construct a request payload (body), but isn't giving me the desired result.
library(df2json)
library(rjson)
y = rjson::fromJSON((df2json::df2json(dataframe)))
globalparam = ""
req = list(
Inputs = list(
input1 = y
)
,GlobalParameters = paste("{",globalparam,"}",sep="")#globalparam
)
body = enc2utf8((rjson::toJSON(req)))
body currently turns out to be
{
"Inputs": {
"input1": [
{
"X": 7,
"Y": 5,
"month": "mar",
"day": "fri",
"FFMC": 86.2,
"DMC": 26.2,
"DC": 94.3,
"ISI": 5.1,
"temp": 8.2,
"RH": 51,
"wind": 6.7,
"rain": 0,
"area": 0
}
]
},
"GlobalParameters": "{}"
}
However, I need it to look like this:
{
"Inputs": {
"input1": [
{
"X": 7,
"Y": 5,
"month": "mar",
"day": "fri",
"FFMC": 86.2,
"DMC": 26.2,
"DC": 94.3,
"ISI": 5.1,
"temp": 8.2,
"RH": 51,
"wind": 6.7,
"rain": 0,
"area": 0
}
]
},
"GlobalParameters": {}
}
So basically global parameters have to be {}, but not hardcoded. It seemed like a fairly simple problem, but I couldn't fix it. Please help!
EDIT:
This is the dataframe
X Y month day FFMC DMC DC ISI temp RH wind rain area
1 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0
2 7 4 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 0
3 7 4 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 0
4 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 0
This is an example of another data frame
> a = data.frame("col1" = c(81, 81, 81, 81), "col2" = c(72, 69, 79, 84))
Using this sample data
dd<-read.table(text=" X Y month day FFMC DMC DC ISI temp RH wind rain area
1 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0", header=T)
You can do
globalparam = setNames(list(), character(0))
req = list(
Inputs = list(
input1 = dd
)
,GlobalParameters = globalparam
)
body = enc2utf8((rjson::toJSON(req)))
Note that globalparam looks a bit funny because an empty unnamed list would be written as an array ([]); we need to force it to be a named list for rjson to write it as an object ({}). We only have to do this when it's empty.
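The underlying distinction is between a string that happens to contain braces and a real empty object. It is not specific to R, so here is a small Python illustration using the standard json module. The function name build_body and the shortened sample row are made up for the example.

```python
import json


def build_body(rows, global_params=None):
    """Serialize the request; GlobalParameters is always a real (possibly empty) object."""
    return json.dumps({
        "Inputs": {"input1": rows},
        "GlobalParameters": dict(global_params or {}),
    })


rows = [{"X": 7, "Y": 5, "month": "mar", "day": "fri"}]

# Pasting braces around a string produces a JSON string, which is what the question ran into.
as_string = json.dumps({"GlobalParameters": "{" + "" + "}"})
# Passing an empty mapping produces a JSON object.
as_object = build_body(rows)

print(as_string)
print(as_object)
```

The first line printed carries "{}" in quotes, while the second carries a bare {}, which matches the payload the question asks for.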
I am trying to write a data.frame from R into a JSON file, but in a hierarchical structure with nested child nodes. I found examples and the RJSONIO package, but I wasn't able to apply them to my case.
This is the data.frame in R
> DF
Date_by_Month CCG Year Month refYear name OC_5a OC_5b OC_5c
1 2010-01-01 MyTown 2010 01 2009 2009/2010 0 15 27
2 2010-02-01 MyTown 2010 02 2009 2009/2010 1 14 22
3 2010-03-01 MyTown 2010 03 2009 2009/2010 1 6 10
4 2010-04-01 MyTown 2010 04 2010 2010/2011 0 10 10
5 2010-05-01 MyTown 2010 05 2010 2010/2011 1 16 7
6 2010-06-01 MyTown 2010 06 2010 2010/2011 0 13 25
In addition to writing the data by month, I would also like to create an aggregate child, the 'yearly' one, which holds the sum (for example) of all the months that fall in that year. This is how I would like the JSON file to look:
[
{
"ccg":"MyTown",
"data":[
{"period":"yearly",
"scores":[
{"name":"2009/2010","refYear":"2009","OC_5a":2, "OC_5b": 35, "OC_5c": 59},
{"name":"2010/2011","refYear":"2010","OC_5a":1, "OC_5b": 39, "OC_5c": 42}
]
},
{"period":"monthly",
"scores":[
{"name":"2009/2010","refYear":"2009","month":"01","year":"2010","OC_5a":0, "OC_5b": 15, "OC_5c": 27},
{"name":"2009/2010","refYear":"2009","month":"02","year":"2010","OC_5a":1, "OC_5b": 14, "OC_5c": 22},
{"name":"2009/2010","refYear":"2009","month":"03","year":"2010","OC_5a":1, "OC_5b": 6, "OC_5c": 10},
{"name":"2010/2011","refYear":"2010","month":"04","year":"2010","OC_5a":0, "OC_5b": 10, "OC_5c": 10},
{"name":"2010/2011","refYear":"2010","month":"05","year":"2010","OC_5a":1, "OC_5b": 16, "OC_5c": 7},
{"name":"2010/2011","refYear":"2010","month":"06","year":"2010","OC_5a":0, "OC_5b": 13, "OC_5c": 25}
]
}
]
}
]
Thank you so much for your help!
Expanding on my comment:
The jsonlite package has a lot of features, but what you're describing doesn't really map to a data frame anymore, so I doubt any canned routine has this functionality. Your best bet is probably to convert the data frame to a more general list (FYI, data frames are stored internally as lists of columns) with a structure that matches the structure of the JSON exactly, then just use the converter to translate it to JSON.
This is complicated in general but in your case should be fairly simple. The list will be structured exactly like the JSON data:
list(
list(
ccg = "Town1",
data = list(
list(
period = "yearly",
scores = yearly_data_frame_town1
),
list(
period = "monthly",
scores = monthly_data_frame_town1
)
)
),
list(
ccg = "Town2",
data = list(
list(
period = "yearly",
scores = yearly_data_frame_town2
),
list(
period = "monthly",
scores = monthly_data_frame_town2
)
)
)
)
Constructing this list should be a straightforward case of looping over unique(DF$CCG) and using aggregate at each step, to construct the yearly data.
If you need performance, look to either the data.table or dplyr packages to do the looping and aggregating all at once. The former is flexible and performant but a little esoteric. The latter has relatively easy syntax and is similarly performant, but is designed specifically around building pipelines for data frames so it might take some hacking to get it to produce the right output format.
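The loop-and-aggregate approach described above can be sketched in Python as well. This is an illustration under assumptions: the data frame is represented as a list of dicts with the same column names, Year, Month and refYear are kept as strings to match the target JSON, and the helper names (yearly_scores, monthly_scores, build_hierarchy) are invented for the example.

```python
import json

MEASURES = ("OC_5a", "OC_5b", "OC_5c")

# Rows copied from the data frame in the question.
ROWS = [
    {"CCG": "MyTown", "Year": "2010", "Month": "01", "refYear": "2009", "name": "2009/2010", "OC_5a": 0, "OC_5b": 15, "OC_5c": 27},
    {"CCG": "MyTown", "Year": "2010", "Month": "02", "refYear": "2009", "name": "2009/2010", "OC_5a": 1, "OC_5b": 14, "OC_5c": 22},
    {"CCG": "MyTown", "Year": "2010", "Month": "03", "refYear": "2009", "name": "2009/2010", "OC_5a": 1, "OC_5b": 6, "OC_5c": 10},
    {"CCG": "MyTown", "Year": "2010", "Month": "04", "refYear": "2010", "name": "2010/2011", "OC_5a": 0, "OC_5b": 10, "OC_5c": 10},
    {"CCG": "MyTown", "Year": "2010", "Month": "05", "refYear": "2010", "name": "2010/2011", "OC_5a": 1, "OC_5b": 16, "OC_5c": 7},
    {"CCG": "MyTown", "Year": "2010", "Month": "06", "refYear": "2010", "name": "2010/2011", "OC_5a": 0, "OC_5b": 13, "OC_5c": 25},
]


def yearly_scores(rows):
    """Sum the measures per (name, refYear) pair, keeping first-seen order."""
    totals = {}
    for row in rows:
        key = (row["name"], row["refYear"])
        bucket = totals.setdefault(
            key, {"name": row["name"], "refYear": row["refYear"], **{m: 0 for m in MEASURES}}
        )
        for m in MEASURES:
            bucket[m] += row[m]
    return list(totals.values())


def monthly_scores(rows):
    """One record per input row, with the keys used in the target JSON."""
    return [
        {"name": r["name"], "refYear": r["refYear"], "month": r["Month"], "year": r["Year"],
         **{m: r[m] for m in MEASURES}}
        for r in rows
    ]


def build_hierarchy(rows):
    """Group rows by CCG and attach a yearly and a monthly block to each group."""
    by_ccg = {}
    for row in rows:
        by_ccg.setdefault(row["CCG"], []).append(row)
    return [
        {"ccg": ccg, "data": [
            {"period": "yearly", "scores": yearly_scores(group)},
            {"period": "monthly", "scores": monthly_scores(group)},
        ]}
        for ccg, group in by_ccg.items()
    ]


hierarchy = build_hierarchy(ROWS)
print(json.dumps(hierarchy, indent=2))
```

The same shape carries over to R: build one list per unique CCG, put the aggregated and the monthly data frames under scores, and hand the whole list to the JSON converter.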
Looks like ssdecontrol has you covered, but here's my solution. You would need to loop over the unique CCG and Year values to build the entire data set.
df <- read.table(textConnection("Date_by_Month CCG Year Month refYear name OC_5a OC_5b OC_5c
2010-01-01 MyTown 2010 01 2009 2009/2010 0 15 27
2010-02-01 MyTown 2010 02 2009 2009/2010 1 14 22
2010-03-01 MyTown 2010 03 2009 2009/2010 1 6 10
2010-04-01 MyTown 2010 04 2010 2010/2011 0 10 10
2010-05-01 MyTown 2010 05 2010 2010/2011 1 16 7
2010-06-01 MyTown 2010 06 2010 2010/2011 0 13 25"), stringsAsFactors=F, header=T)
library(RJSONIO)
to_list <- function(ccg, year){
  # Keep only the rows for the requested CCG and year
  df_monthly <- subset(df, CCG==ccg & Year==year)
  # Sum the scores per reference year within that subset
  df_yearly <- aggregate(df_monthly[,c("OC_5a", "OC_5b", "OC_5c")], df_monthly[,c("name", "refYear")], sum)
  l <- list("ccg"=ccg,
            data=list(list("period" = "yearly",
                           "scores" = as.list(df_yearly)
                      ),
                      list("period" = "monthly",
                           "scores" = as.list(df_monthly[,c("name", "refYear", "OC_5a", "OC_5b", "OC_5c")])
                      )
            )
  )
  return(l)
}
toJSON(to_list("MyTown", "2010"), pretty=T)
Which returns this:
{
"ccg" : "MyTown",
"data" : [
{
"period" : "yearly",
"scores" : {
"name" : [
"2009/2010",
"2010/2011"
],
"refYear" : [
2009,
2010
],
"OC_5a" : [
2,
1
],
"OC_5b" : [
35,
39
],
"OC_5c" : [
59,
42
]
}
},
{
"period" : "monthly",
"scores" : {
"name" : [
"2009/2010",
"2009/2010",
"2009/2010",
"2010/2011",
"2010/2011",
"2010/2011"
],
"refYear" : [
2009,
2009,
2009,
2010,
2010,
2010
],
"OC_5a" : [
0,
1,
1,
0,
1,
0
],
"OC_5b" : [
15,
14,
6,
10,
16,
13
],
"OC_5c" : [
27,
22,
10,
10,
7,
25
]
}
}
]
}
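The output above is column-oriented (one array per column), whereas the question asks for one object per row under scores. As a hedged illustration of the difference, this Python sketch pivots a column-oriented block into row records. The function name columns_to_records is made up, and the input copies the yearly scores from the output above.

```python
def columns_to_records(columns):
    """Turn {"col": [v1, v2, ...], ...} into [{"col": v1, ...}, {"col": v2, ...}, ...]."""
    names = list(columns)
    # zip(*...) walks all columns in parallel, yielding one tuple of values per row.
    return [dict(zip(names, values)) for values in zip(*(columns[n] for n in names))]


yearly_columns = {
    "name": ["2009/2010", "2010/2011"],
    "refYear": [2009, 2010],
    "OC_5a": [2, 1],
    "OC_5b": [35, 39],
    "OC_5c": [59, 42],
}

records = columns_to_records(yearly_columns)
for record in records:
    print(record)
```

In R, the same effect would need each data frame to be converted row by row rather than with as.list, which splits it by column.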