R: Web scraping JSON, extracting information from nest - json

I am trying to use tidyJSON to extract information from JSON, but I am open to any R package that can achieve my ends. I took a look at the documentation and vignettes and found the complex example helpful. However, the information I want is nested inside objects whose keys are IDs rather than fixed field names, and I am not sure how to access it. I am interested in getting appid, name, developer, etc., but this information is within 570 and 730:
{"570":{"appid":570,"name":"Dota 2","developer":"Valve","publisher":"Valve","score_rank":71,"owners":102151578,"owners_variance":259003,"players_forever":102151578,"players_forever_variance":259003,"players_2weeks":9436299,"players_2weeks_variance":89979,"average_forever":11727,"average_2weeks":1229,"median_forever":277,"median_2weeks":662,"ccu":811259,"price":"0","tags":{"Free to Play":22678,"MOBA":7808,"Strategy":7415,"Multiplayer":6757,"Team-Based":4848,"Action":4602,"e-sports":4089,"Online Co-Op":3669,"Competitive":3553,"PvP":2655,"RTS":2267,"Difficult":2129,"RPG":2114,"Fantasy":2044,"Tower Defense":2024,"Co-op":1898,"Character Customization":1514,"Replay Value":1487,"Action RPG":1397,"Simulation":1024}},
"730":{"appid":730,"name":"Counter-Strike: Global Offensive","developer":"Valve","publisher":"Valve","score_rank":78,"owners":29225079,"owners_variance":154335,"players_forever":28552354,"players_forever_variance":152685,"players_2weeks":9102348,"players_2weeks_variance":88410,"average_forever":17648,"average_2weeks":791,"median_forever":5030,"median_2weeks":358,"ccu":543626,"price":"1499","tags":{"FPS":17082,"Multiplayer":13744,"Shooter":12833,"Action":10881,"Team-Based":10369,"Competitive":9664,"Tactical":8529,"First-Person":7329,"e-sports":6716,"PvP":6383,"Online Co-Op":5714,"Military":4621,"Co-op":4435,"Strategy":4424,"War":4361,"Realistic":3196,"Trading":3191,"Difficult":3158,"Fast-Paced":3100,"Moddable":2496}}
There are many thousands of such entries. Is there a way to skip the "top-level" and look within the nest?
The JSON information is from http://steamspy.com/api.php?request=top100in2weeks

This might be what you need:
library(jsonlite)
data = fromJSON("http://steamspy.com/api.php?request=top100in2weeks")
appid = lapply(data, function(x){x$appid})
name = lapply(data, function(x){x$name})
df = data.frame(appid = unlist(appid),
name = unlist(name),
stringsAsFactors = F)
Result:
> head(df)
appid name
570 570 Dota 2
730 730 Counter-Strike: Global Offensive
578080 578080 PLAYERUNKNOWN'S BATTLEGROUNDS
440 440 Team Fortress 2
271590 271590 Grand Theft Auto V
433850 433850 H1Z1: King of the Kill
I'll let you add the rest of the information
Edit: Adding arrays to a dataframe
It is also possible to add each game's tags, and the number of times each tag was applied, to the data frame. For each game you store the tag names as a list in one column and the tag counts in another.
After the definition of df add the following lines:
# Store the tag names and the tag counts of each game as list columns
for(k in 1:nrow(df)){
  df$tags[k] = list(names(data[[k]]$tags))
  df$tagsQ[k] = list(unlist(data[[k]]$tags))
}
This will give you:
> df["570",]
appid name
570 570 Dota 2
tags
570 Free to Play, MOBA, Strategy, Multiplayer, Team-Based, Action, e-sports, Online Co-Op, Competitive, PvP, RTS, Difficult, RPG, Fantasy, Tower Defense, Co-op, Character Customization, Replay Value, Action RPG, Simulation
tagsQ
570 22686, 7810, 7420, 6759, 4850, 4603, 4092, 3672, 3555, 2657, 2267, 2130, 2116, 2045, 2024, 1898, 1514, 1487, 1397, 1023
In this situation, the columns tags and tagsQ contain lists. To obtain the second tag and its quantity for appid 570, do:
> df["570","tags"][[1]][2]
[1] "MOBA"
> df["570","tagsQ"][[1]][2]
MOBA
7810

Related

selenium by_xpath not returning any results

I am using Selenium 4+, and I do not get any results back when requesting the elements inside a div.
# Wait for the page to load
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, "search-key")))
# wait for the page to load
driver.implicitly_wait(10)
# search for the suppliers you want to message
search_input = driver.find_element(By.ID,"search-key")
search_input.send_keys("suppliers")
search_input.send_keys(Keys.RETURN)
# find all the supplier stores on the page
supplier_stores_div = driver.find_element(By.CLASS_NAME, "list--gallery--34TropR")
print(supplier_stores_div)
supplier_stores = supplier_stores_div.find_elements(By.XPATH, "./a[@class='v3--container--31q8BOL cards--gallery--2o6yJVt']")
print(supplier_stores)
The logging statements gave me <selenium.webdriver.remote.webelement.WebElement (session="a3be4d8c5620e760177247d5b8158823", element="5ae78693-4bae-4826-baa6-bd940fa9a41b")> and an empty list for the Element objects, []
The html code is here:
<div class="list--gallery--34TropR" data-spm="main" data-spm-max-idx="24">
<a class="v3--container--31q8BOL cards--gallery--2o6yJVt" href="(link)" target="_blank" style="text-align: left;" data-spm-anchor-id="a2g0o.productlist.main.1"> (some divs) </a>
That is just one <a> element; there are more.
Before scraping the supplier names, you have to scroll slowly down to the bottom of the page; only then can you get all the supplier names. Try the code below:
from time import sleep

driver.get("https://www.aliexpress.com/premium/supplier.html?spm=a2g0o.best.1000002.0&initiative_id=SB_20221218233848&dida=y")
last_height = driver.execute_script("return document.body.scrollHeight")
# Scroll in steps until the page height stops growing
while True:
    driver.execute_script("window.scrollBy(0, 800);")
    sleep(1)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
suppliers = driver.find_elements(By.XPATH, ".//*[@class='list--gallery--34TropR']//span/a")
print("Total no. of suppliers:", len(suppliers))
print("======")
for supplier in suppliers:
    print(supplier.text)
Output:
Total no. of suppliers: 60
======
Reading Life Store
ASONSTEEL Store
Paper, ink, pen and inkstone Store
Custom Stationery Store
ZHOUYANG Official Store
WOWSOCOOL Store
IFPD Official Store
The 9 Store
QuanRun Store
...
...
find_elements returns a list of WebElement objects, not their text.
If you want the text, read the text property of each element:
[a.text for a in supplier_stores_div.find_elements(By.XPATH, "./a[@class='v3--container--31q8BOL cards--gallery--2o6yJVt']")]
or, to get an attribute (for example the href), call get_attribute on each element:
[a.get_attribute("href") for a in supplier_stores_div.find_elements(By.XPATH, "./a[@class='v3--container--31q8BOL cards--gallery--2o6yJVt']")]

Columns of Data Frame are Being Swapped: Why is my loop switching the column values when I identify and assign the columns by name?

I need help with the specific code I will paste below. I am using the Ames Housing data set collected by Dean De Cock.
I am using a Python notebook and editing thru Anaconda's Jupyter Lab 2.1.5.
The code below is supposed to replace all np.nan or "None" values. For some reason, after repeatedly calling a hand-made function inside a for loop, the columns of the resulting data frame get swapped around.
Note: I am aware I could do this with an "imputer." I plan to select numeric and object type features, impute them separately then put them back together. As a side-note, is there any way I can do that while having the details I output manually using text displayed or otherwise verified?
In the cell in question, the flow is:
Get and assign the number of data points in the data frame df_train.
Get and assign a series that lists the count of null values in df_train. The syntax is sr_null_counts = df_train.isnull().sum().
Create an empty list to which the names of features with more than 5% of their values null are appended. They will be dropped later, outside the for loop. I thought at first that this was the problem, since the command to drop the columns of df_train in-place used to be within the for-loop.
Repeatedly call a hand-made function to impute columns whose null values do not exceed 5% of the row count of df_train.
I used a function that has a for-loop and nested try-except statements to:
Accept a series and, optionally, the series' name when it was a column in a dataframe. It assigns a copy of the passed series to a local variable.
In the exact order, (a) try to replace all null (NaN or None) values with the mean of the passed series.
(b) If that fails, try to replace all null values with the median of the series.
(c) If even that fails, replace all null values with the mode of the series.
Return the edited copy of the series with all null values replaced. It should also print out strings that tell me what feature was modified and what summary statistic was used to replace/impute the missing values.
The final line is to drop all the columns marked as having more than 5% missing values.
Here is the full code:
Splitting the main dataframe into a train and test set.
The full data-set was loaded thru df_housing = pd.read_csv(sep = '\t', filepath_or_buffer = "AmesHousing.tsv").
def make_traintest(df, train_fraction = 0.7, random_state_val = 88):
    df = df.copy()
    df_train = df.sample(frac = train_fraction, random_state = random_state_val)
    bmask_istrain = df.index.isin(df_train.index.values)
    df_test = df.loc[ ~bmask_istrain ]
    return {
        "train":df_train,
        "test":df_test
    }
dict_traintest = make_traintest(df = df_housing)
df_train = dict_traintest["train"]
df_test = dict_traintest["test"]
Get a List of Columns With Null Values
lst_have_nulls = []
for feature in df_housing.columns.values.tolist():
    nullcount = df_housing[feature].isnull().sum()
    if nullcount > 0:
        lst_have_nulls.append(feature)
        print(feature, "\n=====\nNull Count:\t", nullcount, '\n', df_housing[feature].value_counts(dropna = False),'\n*****')
Definition of the hand-made function:
def impute_series(sr_values, feature_name = ''):
    sr_out = sr_values.copy()
    try:
        sr_out.fillna(value = sr_values.mean())
        print("Feature", feature_name, "imputed with mean:", sr_values.mean())
    except Exception as e:
        print("Filling NaN values with mean of feature", feature_name, "caused an error:\n", e)
        try:
            sr_out.fillna(value = sr_values.median())
            print("Feature", feature_name, "imputed with median:", sr_values.median())
        except Exception as e:
            print("Filling NaN values with median for feature", feature_name, "caused an error:\n", e)
            sr_out.fillna(value = sr_values.mode())
            print("Feature", feature_name, "imputed with mode:", sr_values.mode())
    return sr_out
For-Loop
Getting the count of null values, defining the empty list of columns to drop to allow appending, and repeatedly
doing the following: For every column in lst_have_nulls, check if the column has equal, less or more than 5% missing values.
If more, append the column to lst_drop. Else, call the hand-made imputing function. After the for-loop, drop all columns in
lst_drop, in-place.
Where did I go wrong? In case you need the entire notebook, I have uploaded it to Kaggle. Here is a link.
https://www.kaggle.com/joachimrives/ames-housing-public-problem
Update: Problem Still Exists After Testing Anvar's Answer with Changes
When I tried the code of Anvar Kurmukov, my dataframe column values still got swapped. The change I made was adding int and float to the list of dtypes to check. The changes are inside the for-loop:
if dtype in [np.int64, np.float64, int, float].
It may be a problem with another part of my code in the full notebook. I will need to check where it is by calling df_train.info() cell by cell from the top. I tested the code in the notebook I made public. It is in cell 128. For some reason, after running Anvar's code, the df_train.info() method returned this:
1st Flr SF 2nd Flr SF 3Ssn Porch Alley Bedroom AbvGr Bldg Type Bsmt Cond Bsmt Exposure Bsmt Full Bath Bsmt Half Bath ... Roof Style SalePrice Screen Porch Street TotRms AbvGrd Total Bsmt SF Utilities Wood Deck SF Year Built Year Remod/Add
1222 1223 534453140 70 RL 50.0 4882 Pave NaN IR1 Bnk ... 0 0 0 0 0 NaN NaN NaN 0 87000
1642 1643 527256040 20 RL 81.0 13870 Pave NaN IR1 HLS ... 52 0 0 174 0 NaN NaN NaN 0 455000
1408 1409 905427050 50 RL 66.0 21780 Pave NaN Reg Lvl ... 36 0 0 144 0 NaN NaN NaN 0 185000
1729 1730 528218050 60 RL 65.0 10237 Pave NaN Reg Lvl ... 72 0 0 0 0 NaN NaN NaN 0 178900
1069 1070 528180110 120 RL 58.0 10110 Pave NaN IR1 Lvl ... 48 0 0 0 0 NaN NaN NaN 0 336860
tl;dr instead of try: except you should simply use if and check dtype of the column; you do not need to iterate over columns.
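import numpy as np

# Drop columns with more than 5% missing values (drop returns a new frame, so reassign)
drop_columns = df.columns[df.isna().sum() / df.shape[0] > 0.05]
df = df.drop(drop_columns, axis=1)
num_columns = []
cat_columns = []
# items() replaces iteritems(), which was removed in pandas 2.0
for col, dtype in df.dtypes.items():
    if dtype in [np.int64, np.float64]:
        num_columns.append(col)
    else:
        cat_columns.append(col)
df[num_columns] = df[num_columns].fillna(df[num_columns].mean())
# mode() returns a DataFrame; take its first row to get one value per column
df[cat_columns] = df[cat_columns].fillna(df[cat_columns].mode().iloc[0])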
Short comment on make_traintest function: I would simply return 2 separate DataFrames instead of a dictionary or use sklearn.model_selection.train_test_split.
upd. You can check for number of NaN values in a column, but it is unnecessary if your only goal is to impute NaNs.
Answer
I discovered why my columns appeared to be swapped. They were not actually being swapped. The original problem was that I had not set the "Order" column as the index column. To fix the problem in the notebook on my PC, I simply added the following parameter and value to pd.read_csv: index_col = "Order". That fixed the problem in my local notebook. When I tried it on the Kaggle notebook, however, it did not fix the problem.
The version of the Ames Housing data set I first used on the notebook - for some reason - was also the cause for the column swapping.
Anvar's Code is fine. You may test the code I wrote, but to be safe, defer to Anvar's code. Mine is still to be tested.
Testing Done
I modified the Kaggle notebook I linked in my question. I used the data set I was actually working in with my PC. When I did that, the code given by Anvar Kurmukov's answer worked perfectly. I tested my own code and it seems fine, but test both versions before trying. I only reviewed the data sets using head() and manually checked the column inputs. If you want to check the notebook, here it is:
https://www.kaggle.com/joachimrives/ames-housing-public-problem/
To test if the data set was at fault, I created two data frames. One was taken directly from my local file uploaded to Kaggle. The other used the current version of the Ames Iowa Housing data set I had used as input. The columns were properly "aligned" with their expected input. To find the expected column values, I used this source:
http://jse.amstat.org/v19n3/decock/DataDocumentation.txt
Here are the screenshots of the different results I got when I swapped data sets:
With an uploaded copy of my local file:
With the original AmesHousing.csv From Notebook Version 1:
The data set I Used that Caused the Column-swap on the Kaggle Notebook
https://www.kaggle.com/marcopale/housing

Count coords dots in polygon

I would like to count the appearance of several dots in a prespecified polygons. I am loading the EU NUTS Region by
nuts = 'https://raw.githubusercontent.com/eurostat/Nuts2json/master/2016/4258/60M/nutsrg_2.json'
geo_json_nuts = json.loads(requests.get(nuts).text)
and I have a list of tuples or a DataFrame, which contains data as follows:
Index lon lat
0 -178.1328187 -14.3087256
1 -176.2036596 -13.3469813
2 -176.1720255 -13.2789922
3 -151.3381037 -22.4532474
4 -151.0331577 -16.7159449
... ... ...
Now I would like to match the lon/lat in the DataFrame to the Feature properties id contained in geo_json_nuts. Meaning if lon/lat is contained in one of the polygons in geo_json_nuts it should get the properties id, e.g. BE31 or AT32, etc.
Does anyone know how to handle this?
Thank you in advance!
Best regards
Alex
You can try to normalize your JSON file.
import json
import requests
from pandas import json_normalize

nuts = 'https://raw.githubusercontent.com/eurostat/Nuts2json/master/2016/4258/60M/nutsrg_2.json'
geo_json_nuts = json.loads(requests.get(nuts).text)
df = json_normalize(geo_json_nuts, 'features', ['coordinates'],
                    errors='ignore',
                    record_prefix='features_')
After this step, you will have a dataframe containing features_properties.id for the id and features_geometry.coordinates for the polygon coordinates.
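Normalizing the file gives you the ids and the coordinates side by side, but the lon/lat pairs still have to be tested against each polygon. A minimal sketch of that matching step, using a plain ray-casting test instead of a geometry library, might look like the following. It assumes the features are ordinary GeoJSON Polygon or MultiPolygon geometries in lon/lat order with an id under properties; the two square regions and their ids (XX01, XX02) are made up for illustration, and the helper names point_in_ring and region_for_point are hypothetical.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the closed ring of [x, y] pairs?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Does the edge from vertex j to vertex i straddle the horizontal line through the point?
        if (yi > lat) != (yj > lat):
            # x coordinate at which the edge crosses that line
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def region_for_point(lon, lat, features):
    """Return properties['id'] of the first feature containing the point, else None."""
    for feature in features:
        geom = feature["geometry"]
        if geom["type"] == "Polygon":
            polygons = [geom["coordinates"]]
        elif geom["type"] == "MultiPolygon":
            polygons = geom["coordinates"]
        else:
            continue
        for rings in polygons:
            # rings[0] is the outer boundary; any further rings are holes
            in_outer = point_in_ring(lon, lat, rings[0])
            in_hole = any(point_in_ring(lon, lat, hole) for hole in rings[1:])
            if in_outer and not in_hole:
                return feature["properties"]["id"]
    return None


# Made-up features standing in for geo_json_nuts["features"]
features = [
    {"type": "Feature", "properties": {"id": "XX01"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[0, 0], [4, 0], [4, 4], [0, 4], [0, 0]]]}},
    {"type": "Feature", "properties": {"id": "XX02"},
     "geometry": {"type": "MultiPolygon",
                  "coordinates": [[[[10, 10], [12, 10], [12, 12], [10, 12], [10, 10]]]]}},
]

points = [(1.0, 1.0), (11.0, 11.5), (-178.1328187, -14.3087256)]
ids = [region_for_point(lon, lat, features) for lon, lat in points]
print(ids)
```

With the real data you would pass geo_json_nuts["features"] and the lon/lat columns of your DataFrame. For many thousands of points, a spatial library with an index would likely be faster than this loop.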

How to get data from a specific section in MediaWiki supported wikis

I am parsing a Wikia article and trying to get the data from the highlighted block on the right-hand side. I have already got the left one using the following URL:
http://hetalia.wikia.com/api.php?action=parse&prop=revisions&prop=sections&page=America&format=json
But I don't know how to reference the right one. What will be the parameter?
The original URL is,
http://hetalia.wikia.com/wiki/America
I believe the only way to get info from the Infoboxes would be to get the page source, which can be done with this query
http://hetalia.wikia.com/api.php?action=query&prop=revisions&rvprop=content&titles=America&format=json
And then parsing the text to get the information, as the source of that box is in this format
{{Character
|name = America
|jname = アメリカ
|image = America0.png
|country = [[wikipedia:United States|The United States of America]]
|human = Alfred F.Jones (アルフレッド・F・ジョーンズ, ''Arufureddo F. Joonzu'')
|age = 19
...
|japanese = [[Katsuyuki Konishi]], Ryoko Shimizu (Young America, drama CD "Prologue"), [[Ai Iwamura]] (Young America, anime), [[Axis Powers Hetalia: The CD|Osamu Ikeda]] (''Flower Of Iris'')
|english = [[Eric Vale]], Stephanie Young (young America)}}
You could use Regex to extract the data from the text, such as using \|age\s*=\s*(\d*) to get the age attribute.
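As a rough illustration of that regex approach, the sketch below pulls a single field out of template source in Python. The wikitext string is an abbreviated copy of the sample above, and the helper name infobox_field is hypothetical. It assumes each field sits on its own line in the form |field = value; a value that shares its line with the closing }} would keep those braces and need extra cleanup.

```python
import re

# Abbreviated sample of the template source quoted above
wikitext = """{{Character
|name = America
|image = America0.png
|age = 19
}}"""


def infobox_field(text, field):
    """Return the raw value of a `|field = value` line, or None if the field is absent."""
    # [ \t]* rather than \s* so an empty value cannot swallow the next line
    pattern = r"^\|[ \t]*" + re.escape(field) + r"[ \t]*=[ \t]*(.*)$"
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


age = infobox_field(wikitext, "age")
name = infobox_field(wikitext, "name")
missing = infobox_field(wikitext, "height")
print(age, name, missing)
```

Values that contain wiki links, such as the country field, come back as raw markup and would still need their [[...]] syntax stripped.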

list() contains one element but it has matrix of string inside, how do I convert this element into matrix?

After converting JSON data into a list using jsonlite, I end up with one of the list elements looking like the following.
In this case, the 10th element contains a list of 9 columns (always fixed) and 2 rows (varies every time).
mat <- lset$data$comments$data[10]
mat
[[1]]
id can_remove created_time from.id
1 10152663742099258_10152663749369258 TRUE 2014-07-01T11:10:29+0000 10203711779968366
2 10152663742099258_10152663842204258 TRUE 2014-07-01T12:15:57+0000 706804257
3 10152663742099258_10152663929639258 TRUE 2014-07-01T13:25:28+0000 10152738599744416
4 10152663742099258_10152663976344258 TRUE 2014-07-01T13:59:33+0000 706804257
from.name like_count
1 Aileen Yeow 1
2 Tejas Damania 0
3 Sandeep Kulkarni 1
4 Tejas Damania 0
message
1 Lame statement
2 Don't forget, people like you only because they don't know you! <ed><U+00A0><U+00BD><ed><U+00B8><U+00A1>
3 ...for a second I thought it's Accenture Singapore office with some new theme similar to its brand!
4 This is shanghai and nothing to do with firm I work for <ed><U+00A0><U+00BD><ed><U+00B8><U+008E>
user_likes
1 FALSE
2 FALSE
3 TRUE
4 FALSE
The whole of mat shows up as a list of length 1.
As you can see, it contains a list (within a list?). When I print mat, it shows the structure seen above.
typeof(mat)
[1] "list"
substring(mat,1,100)
[1] "list(id = c(\"10152663742099258_10152663749369258\", \"10152663742099258_10152663842204258\", \"101526637"
I can't access specific elements (say, message) from this, nor am I able to convert it into a matrix of strings so that I can access the elements in a structured way.
I changed the fromJSON call by setting simplifyVector = FALSE (the default is TRUE):
lset <- fromJSON(jsonobj, simplifyVector = F, flatten=TRUE, unicode = TRUE)
This changes the way mat is formed: the result keeps its nesting all the way down to each leaf element. I can keep going deeper using $ and find the string value only at the leaf element!
lset$data[[x]]$comments$data[[y]]$from$name
That works for now! Thanks for all the help.