A list() contains one element, but it has a matrix of strings inside; how do I convert this element into a matrix? - json

After converting JSON data into a list using jsonlite, I end up with one element of the list looking like the following.
In this case, the 10th element contains a list of 9 columns (always fixed) and a number of rows that varies every time.
mat <- lset$data$comments$data[10]
mat
[[1]]
id can_remove created_time from.id
1 10152663742099258_10152663749369258 TRUE 2014-07-01T11:10:29+0000 10203711779968366
2 10152663742099258_10152663842204258 TRUE 2014-07-01T12:15:57+0000 706804257
3 10152663742099258_10152663929639258 TRUE 2014-07-01T13:25:28+0000 10152738599744416
4 10152663742099258_10152663976344258 TRUE 2014-07-01T13:59:33+0000 706804257
from.name like_count
1 Aileen Yeow 1
2 Tejas Damania 0
3 Sandeep Kulkarni 1
4 Tejas Damania 0
message
1 Lame statement
2 Don't forget, people like you only because they don't know you! <ed><U+00A0><U+00BD><ed><U+00B8><U+00A1>
3 ...for a second I thought it's Accenture Singapore office with some new theme similar to its brand!
4 This is shanghai and nothing to do with firm I work for <ed><U+00A0><U+00BD><ed><U+00B8><U+008E>
user_likes
1 FALSE
2 FALSE
3 TRUE
4 FALSE
The whole of mat shows up as a list of length 1.
As you can see, it contains a list (within a list?). When I print mat, it shows the structure seen above.
typeof(mat)
[1] "list"
substring(mat,1,100)
[1] "list(id = c(\"10152663742099258_10152663749369258\", \"10152663742099258_10152663842204258\", \"101526637"
I can't access specific elements (say, message) from this, nor am I able to convert it into a matrix of strings so that I can access the elements in a structured way.
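Note that mat was built with single-bracket indexing, which returns a one-element list rather than its contents; double-bracket indexing extracts the underlying data.frame directly. A minimal sketch, assuming the structure printed above:
df <- lset$data$comments$data[[10]]   # [[ ]] extracts the data.frame itself
df$message                            # the message column as a character vector
as.matrix(df)                         # or coerce the whole frame to a character matrix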

I changed the fromJSON call by setting simplifyVector = FALSE (the default is TRUE):
lset <- fromJSON(jsonobj, simplifyVector = F, flatten=TRUE, unicode = TRUE)
This changes the way mat is formed: the code maintains nesting all the way down to each leaf element. I can keep going deeper using $ and find the string value at the leaf element:
lset$data[[x]]$comments$data[[y]]$from$name
That works for now! Thanks for all the help.

Related

Columns of Data Frame are Being Swapped: Why is my loop switching the column values when I identify and assign the columns by name?

I need help with the specific code I will paste below. I am using the Ames Housing data set collected by Dean De Cock.
I am using a Python notebook and editing through Anaconda's Jupyter Lab 2.1.5.
The code below is supposed to replace all np.nan or "None" values. For some reason, after repeatedly calling a hand-made function inside a for loop, the columns of the resulting data frame get swapped around.
Note: I am aware I could do this with an "imputer." I plan to select numeric and object-type features, impute them separately, then put them back together. As a side note, is there any way to do that while still being able to display or otherwise verify the details I currently print out manually?
In the cell in question, the flow is:
1. Get and assign the number of data points in the data frame df_train.
2. Get and assign a series that lists the count of null values in df_train. The syntax is sr_null_counts = df_train.isnull().sum().
3. Create an empty list to which the names of features that have more than 5% of their values equal to null are appended. They will be dropped later, outside the for loop. I thought at first that this was the problem, since the command to drop the columns of df_train in-place used to be within the for loop.
4. Repeatedly call a hand-made function to impute columns whose null values do not exceed 5% of the row count of df_train.
The hand-made function uses a for loop and nested try-except statements to:
1. Accept a series and, optionally, the series' name when it was a column in a dataframe. It assigns a copy of the passed series to a local variable.
2. In that exact order: (a) try to replace all null (NaN or None) values with the mean of the passed series; (b) if that fails, try to replace all null values with the median of the series; (c) if even that fails, replace all null values with the mode of the series.
3. Return the edited copy of the series with all null values replaced. It should also print out strings that tell me which feature was modified and which summary statistic was used to replace/impute the missing values.
The final line drops all the columns marked as having more than 5% missing values.
Here is the full code:
Splitting the main dataframe into a train and test set:
The full data set was loaded through df_housing = pd.read_csv(sep = '\t', filepath_or_buffer = "AmesHousing.tsv").
def make_traintest(df, train_fraction = 0.7, random_state_val = 88):
    df = df.copy()
    df_train = df.sample(frac = train_fraction, random_state = random_state_val)
    bmask_istrain = df.index.isin(df_train.index.values)
    df_test = df.loc[ ~bmask_istrain ]
    return {
        "train": df_train,
        "test": df_test
    }

dict_traintest = make_traintest(df = df_housing)
df_train = dict_traintest["train"]
df_test = dict_traintest["test"]
Get a List of Columns With Null Values
lst_have_nulls = []
for feature in df_housing.columns.values.tolist():
    nullcount = df_housing[feature].isnull().sum()
    if nullcount > 0:
        lst_have_nulls.append(feature)
        print(feature, "\n=====\nNull Count:\t", nullcount, '\n', df_housing[feature].value_counts(dropna = False), '\n*****')
Definition of the hand-made function:
def impute_series(sr_values, feature_name = ''):
    sr_out = sr_values.copy()
    try:
        sr_out.fillna(value = sr_values.mean())
        print("Feature", feature_name, "imputed with mean:", sr_values.mean())
    except Exception as e:
        print("Filling NaN values with mean of feature", feature_name, "caused an error:\n", e)
        try:
            sr_out.fillna(value = sr_values.median())
            print("Feature", feature_name, "imputed with median:", sr_values.median())
        except Exception as e:
            print("Filling NaN values with median for feature", feature_name, "caused an error:\n", e)
            sr_out.fillna(value = sr_values.mode())
            print("Feature", feature_name, "imputed with mode:", sr_values.mode())
    return sr_out
For-Loop
Get the count of null values and define the empty list of columns to drop (so it can be appended to), then repeatedly do the following: for every column in lst_have_nulls, check whether the column has at most or more than 5% missing values. If more, append the column to lst_drop; else, call the hand-made imputing function. After the for loop, drop all columns in lst_drop, in-place. A sketch of this loop follows.
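The loop itself was not included in the paste above; a minimal sketch of what the description amounts to (variable names follow the question, but the original cell may differ):
num_rows = df_train.shape[0]
sr_null_counts = df_train.isnull().sum()
lst_drop = []
for feature in lst_have_nulls:
    if sr_null_counts[feature] > 0.05 * num_rows:
        lst_drop.append(feature)
    else:
        # impute_series returns an edited copy of the passed column
        df_train[feature] = impute_series(df_train[feature], feature_name = feature)
df_train.drop(columns = lst_drop, inplace = True)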
Where did I go wrong? In case you need the entire notebook, I have uploaded it to Kaggle. Here is a link.
https://www.kaggle.com/joachimrives/ames-housing-public-problem
Update: Problem Still Exists After Testing Anvar's Answer with Changes
When I tried Anvar Kurmukov's code, my dataframe column values still got swapped. The change I made was adding int and float to the list of dtypes to check. The change is inside the for loop:
if dtype in [np.int64, np.float64, int, float]:
It may be a problem with another part of my code in the full notebook. I will need to check where it is by calling df_train.info() cell by cell from the top. I tested the code in the notebook I made public; it is in cell 128. For some reason, after running Anvar's code, the df_train.info() method returned this:
1st Flr SF 2nd Flr SF 3Ssn Porch Alley Bedroom AbvGr Bldg Type Bsmt Cond Bsmt Exposure Bsmt Full Bath Bsmt Half Bath ... Roof Style SalePrice Screen Porch Street TotRms AbvGrd Total Bsmt SF Utilities Wood Deck SF Year Built Year Remod/Add
1222 1223 534453140 70 RL 50.0 4882 Pave NaN IR1 Bnk ... 0 0 0 0 0 NaN NaN NaN 0 87000
1642 1643 527256040 20 RL 81.0 13870 Pave NaN IR1 HLS ... 52 0 0 174 0 NaN NaN NaN 0 455000
1408 1409 905427050 50 RL 66.0 21780 Pave NaN Reg Lvl ... 36 0 0 144 0 NaN NaN NaN 0 185000
1729 1730 528218050 60 RL 65.0 10237 Pave NaN Reg Lvl ... 72 0 0 0 0 NaN NaN NaN 0 178900
1069 1070 528180110 120 RL 58.0 10110 Pave NaN IR1 Lvl ... 48 0 0 0 0 NaN NaN NaN 0 336860
tl;dr: instead of try/except you should simply use if and check the dtype of the column; you do not need to iterate over columns.
drop_columns = df.columns[df.isna().sum() / df.shape[0] > 0.05]
df = df.drop(drop_columns, axis=1)  # assign the result; drop is not in-place by default

num_columns = []
cat_columns = []
for col, dtype in df.dtypes.items():
    if dtype in [np.int64, np.float64]:
        num_columns.append(col)
    else:
        cat_columns.append(col)

df[num_columns] = df[num_columns].fillna(df[num_columns].mean())
# mode() returns a DataFrame (there can be ties); take its first row so that
# fillna aligns on column names rather than on row index
df[cat_columns] = df[cat_columns].fillna(df[cat_columns].mode().iloc[0])
Short comment on make_traintest function: I would simply return 2 separate DataFrames instead of a dictionary or use sklearn.model_selection.train_test_split.
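A minimal sketch of the train_test_split alternative (train_size and random_state mirror the values used in the question):
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df_housing, train_size = 0.7, random_state = 88)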
upd. You can check the number of NaN values in a column, but it is unnecessary if your only goal is to impute the NaNs.
Answer
I discovered the answer as to why my columns were being swapped. They were not actually being swapped. The original problem was that I had not set the "Order" column as the index column. To fix the problem in the notebook on my PC, I simply added the following parameter and value to pd.read_csv: index_col = "Order". That fixed the problem in my local notebook. When I tried it on the Kaggle notebook, however, it did not fix the problem.
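For reference, a sketch of the corrected load call, mirroring the read_csv call quoted earlier:
df_housing = pd.read_csv(filepath_or_buffer = "AmesHousing.tsv", sep = '\t', index_col = "Order")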
The version of the Ames Housing data set I first used in the notebook was - for some reason - also a cause of the column swapping.
Anvar's code is fine. You may test the code I wrote, but to be safe, defer to Anvar's code; mine is still to be tested.
Testing Done
I modified the Kaggle notebook I linked in my question, using the data set I was actually working with on my PC. When I did that, the code given in Anvar Kurmukov's answer worked perfectly. I tested my own code and it seems fine, but test both versions before relying on either. I only reviewed the data sets using head() and manually checked the column inputs. If you want to check the notebook, here it is:
https://www.kaggle.com/joachimrives/ames-housing-public-problem/
To test whether the data set was at fault, I created two data frames. One was taken directly from my local file uploaded to Kaggle. The other used the current version of the Ames Iowa Housing data set I had been using as input. With it, the columns were properly "aligned" with their expected input. To find the expected column values, I used this source:
http://jse.amstat.org/v19n3/decock/DataDocumentation.txt
Screenshots of the different results I got when I swapped data sets are in the linked notebook: one run with an uploaded copy of my local file, and one with the original AmesHousing.csv from Notebook Version 1.
The data set I used that caused the column swap on the Kaggle notebook:
https://www.kaggle.com/marcopale/housing

scraping - find the last 5 scores of each match - in html

I would like your help to get the last 5 scores of each match; I can't manage it, please help me.
from selenium import webdriver
import pandas as pd
from pandas import ExcelWriter
from openpyxl.workbook import Workbook
import time as t
import xlsxwriter

pd.set_option('display.max_rows', 5, 'display.max_columns', None, 'display.width', None)

browser = webdriver.Firefox()
browser.get('https://www.mismarcadores.com/futbol/espana/laliga/resultados/')
print("Current Page Title is : %s" % browser.title)

aux_ids = browser.find_elements_by_css_selector('.event__match.event__match--static.event__match--oneLine')
ids = []
i = 0
for aux in aux_ids:
    if i < 1:
        ids.append(aux.get_attribute('id'))
        i += 1

data = []
for idt in ids:
    id_clean = idt.split('_')[-1]
    browser.execute_script("window.open('');")
    browser.switch_to.window(browser.window_handles[1])
    browser.get(f'https://www.mismarcadores.com/partido/{id_clean}/#h2h;overall')
    t.sleep(5)
    p_ids = browser.find_elements_by_css_selector('h2h-wrapper')
    # here the code for the last 5 scores of each match
I believe you can use your Firefox browser, but I have not tested with it. I use Chrome, so if you want to use chromedriver, check the version of your browser, download the right one, and add it to your system path. The only drawback of this approach is that it keeps a browser window open until the page is loaded (because we are waiting for the javascript to generate the matches data). If you need anything else, let me know. Good luck!
https://chromedriver.chromium.org/downloads
Known issues: sometimes it will throw "index out of range" when retrieving matches data. This is something I am looking into, because it looks like the xpath on each link sometimes changes a little.
from selenium import webdriver
from lxml import html
from lxml.html import HtmlElement

def test():
    # urls specified here for testing purposes
    urls = ['https://www.mismarcadores.com/partido/noIPZ3Lj/#h2h;overall']
    # loop over all the urls
    for url in urls:
        # Print the url we are currently checking, formatted into the string,
        # along with the result of get_last_5(url) for that url.
        print("Scores after this match {u}".format(u=url), get_last_5(url))

def get_last_5(url):
    print("processing {u}, please wait...".format(u=url))
    # get an instance of the webdriver
    browser = webdriver.Chrome()
    # pass the url we want to load
    browser.get(url)
    # In this variable we "store" the html data as a string. We take it from
    # the live browser because we need to wait for the page to load and execute
    # its javascript code in order to generate the matches data.
    innerHTML = browser.execute_script("return document.body.innerHTML")
    # Now we parse it into a variable of type HtmlElement.
    tree: HtmlElement = html.fromstring(innerHTML)
    # The following variables (first_team, second_team, match_date and rows)
    # are obtained via the xpath() method. To get the xpath of an element, open
    # one of the urls in Chrome to check the DOM, right-click the element ->
    # Inspect -> the element appears selected in the inspect panel ->
    # right-click it -> Copy -> Copy XPath. first_team, second_team and
    # match_date are obtained from the "title" section; rows are obtained from
    # the table of last matches in the tbody content.
    # xpath() returns a list of HtmlElement (all the elements matching our
    # xpath), which is why we use [0] (the first element of the list). That
    # gives us a HtmlElement object, so we can access its text attribute.
    first_team = tree.xpath('//*[@id="flashscore"]/div[1]/div[1]/div[2]/div/div/a')[0].text
    print(type(first_team))
    second_team = tree.xpath('//*[@id="flashscore"]/div[1]/div[3]/div[2]/div/div/a')[0].text
    # [0:8] slices the string because the title also contains the time of the
    # match, e.g. "10.08.2020 13:00". To compare it against each row we only
    # need "10.08.20", i.e. 8 characters starting at position 0 ([0:8]).
    match_date = tree.xpath('//*[@id="utime"]')[0].text[0:8]
    # Taking the first element with [0] gives the HtmlElement that is the
    # "table" holding all the matches data; getchildren() returns all the
    # "rows" (elements) inside it, again as a list of HtmlElement. We slice
    # with [:-1] because the last element inside the "table" is the
    # "Mostrar más partidos" button, which we want to drop.
    rows = tree.xpath('//*[@id="tab-h2h-overall"]/div[1]/table/tbody')[0].getchildren()[:-1]
    # We quit the browser since we do not need it anymore. We could do it right
    # after assigning innerHTML; no harm doing it here instead, unless you wish
    # to close it before all these variable assignments.
    browser.quit()
    # match_position will be the position of the match we currently have in
    # the title.
    match_position = None
    # Iterate over the rows to find that match; range(len(rows)) just gives the
    # row indices so we know when to stop iterating.
    for i in range(len(rows)):
        # Call is_match with first_team, second_team, match_date and the
        # current row rows[i]. If it returns True we found the match position
        # and assign i + 1 (because we iterate from 0) to match_position.
        if is_match(first_team, second_team, match_date, rows[i]):
            match_position = i + 1
            # stop the loop; no need to go further once we find it
            break
    # Since we only want the scores of the following 5 matches, check whether
    # there are 5 rows beneath our match. If match_position + 5 is less than
    # the number of rows we can take 5; otherwise we only take the rows
    # beneath it (maybe 0, 1, 2, 3 or 4 rows). This assumes the title match
    # was actually found in the table.
    if (match_position + 5) < len(rows):
        # Slice the list twice: first [match_position:] (drop all the rows
        # before the match position), then [:5] on the result (start at 0 and
        # stop at 5). Slicing returns a new list, so it must be assigned to a
        # variable; the same name is reused here, but you can assign it to a
        # new one if you wish.
        rows = rows[match_position:][:5]
    else:
        # not enough rows, so just take everything beneath our position
        rows = rows[match_position:len(rows)]
    # To get the list of scores the original used a list comprehension; it is
    # written as a for loop here to explain it. Each row (<tr> element) has
    # 6 td elements inside it, and number 5 holds the score of the match.
    # Inside each "score element" there is a span and then a strong element,
    # something like:
    # <tr>
    #   <td></td>
    #   <td></td>
    #   <td></td>
    #   <td></td>
    #   <td><span><strong>1:2</strong></span></td>
    #   <td></td>
    # </tr>
    # That being said, since each row is a HtmlElement object, we can loop:
    scores = []
    for row in rows:
        data = row.getchildren()[4].getchildren()[0].text_content()
        # Not the best way, but we take all the text content of the span. A
        # plain score like "1 : 2" is exactly 5 characters; if the string is
        # longer it came through as something like "1 : 2(0 : 1)", so we slice
        # from the 6th character from the right and take 5 characters.
        # Ternary expression: if the length of the string is exactly 5 this is
        # our score; if not, slice [-6:-1], i.e. from the whitespace before
        # the 2 (in our example) up to the 1 before the final ')'.
        score = data if len(data) == 5 else data[-6:-1]
        scores.append(score)
    print("finished processing {u}.".format(u=url))
    # return the scores
    return scores

def is_match(t1, t2, match_date, row):
    # For each row we compare t1, t2 and match_date (obtained from the title)
    # with the row's team1, team2 and date. Each row has 6 elements inside it.
    # (Please read all the code in get_last_5 before this explanation.) For
    # this row, the date is in position 0, team1 in 2 and team2 in 3.
    # <td><span>10.03.20</span></td>
    date = row.getchildren()[0].getchildren()[0].text
    # <td><span>TeamName</span></td> (when the team lost) or
    # <td><span><strong>TeamName</strong></span></td> (when the team won)
    team1element = row.getchildren()[2].getchildren()[0]  # this is the span element
    # Ternary expression (value_if_true if condition else value_if_false):
    # https://book.pythontips.com/en/latest/ternary_operators.html
    # If the span element has children (len(getchildren()) > 0) the team name
    # is team1element.getchildren()[0].text, i.e. the text of the strong
    # element; if not, just take the text of the span element itself.
    mt1 = team1element.getchildren()[0].text if len(team1element.getchildren()) > 0 else team1element.text
    # repeat the same for team 2
    team2element = row.getchildren()[3].getchildren()[0]
    mt2 = team2element.getchildren()[0].text if len(team2element.getchildren()) > 0 else team2element.text
    # We could compare only the date, but compare the names as well just to be
    # sure. If the dates and the names are the same, this is our match row.
    if match_date == date and t1 == mt1 and t2 == mt2:
        # we found it, so return True
        return True
    # not the same, so return False
    return False
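If you want to run the example as a script, a hypothetical entry point would be:
if __name__ == '__main__':
    test()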

Code combination in microsoft access (yyxxxx format)

I'm struggling with a piece of code that I want to implement in Microsoft Access.
The required code is used for project assignments.
The code format consists of the last 2 digits of the year + 4 digits that count up until a new year begins; then the 2-digit year increases by 1 and the 4 digits start at 1 again.
For example:
2019:
190001 = first task;
190002 = second task;
etc...
2020:
200001 = first task;
200002 = second task;
etc...
Could anybody help me out with how to code this in Microsoft Access, preferably in VBA?
That way I can assign the code to a "submit" button and avoid duplicate numbers.
Thanks!
Formatting your code given an integer could be achieved in several ways; here is one possible method:
Function ProjectCode(ByVal n As Long) As Long
    ProjectCode = CLng(Format(Date, "yy") & Format(n, "0000"))
End Function
?ProjectCode(1)
200001
?ProjectCode(2)
200002
?ProjectCode(100)
200100
You probably need to assign the next task id to a project.
So, look up the latest id in use and add 1 to obtain the next task id:
NextTaskId = (Year(Date()) \ 100) * 10000 + Nz(DMax("TaskId", "ProjectTable", "TaskId \ 10000 = Year(Date()) \ 100"), 0) Mod 10000 + 1
Nz ensures that a task id can also be assigned to the very first task of a year.
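Wrapped as a function, the lookup might look like the following sketch (ProjectTable and TaskId are assumed table and field names; adjust them to your schema):
Function NextTaskId() As Long
    ' yy * 10000 puts the 2-digit year in front; DMax finds the highest task id
    ' already issued this year, and Nz falls back to 0 when none exists yet.
    NextTaskId = (Year(Date) \ 100) * 10000 + _
        Nz(DMax("TaskId", "ProjectTable", "TaskId \ 10000 = Year(Date()) \ 100"), 0) Mod 10000 + 1
End Function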

R: Web scraping JSON, extracting information from nest

I am trying to use tidyjson to extract information from JSON, but I am open to any R package that can achieve my ends. I took a look at the documentation and vignettes and found the complex example helpful. However, the information I want is nested inside a non-key-value pair and I am not sure how to access it. I am interested in getting appid, name, developer, etc., but this information is within "570" and "730":
{"570":{"appid":570,"name":"Dota 2","developer":"Valve","publisher":"Valve","score_rank":71,"owners":102151578,"owners_variance":259003,"players_forever":102151578,"players_forever_variance":259003,"players_2weeks":9436299,"players_2weeks_variance":89979,"average_forever":11727,"average_2weeks":1229,"median_forever":277,"median_2weeks":662,"ccu":811259,"price":"0","tags":{"Free to Play":22678,"MOBA":7808,"Strategy":7415,"Multiplayer":6757,"Team-Based":4848,"Action":4602,"e-sports":4089,"Online Co-Op":3669,"Competitive":3553,"PvP":2655,"RTS":2267,"Difficult":2129,"RPG":2114,"Fantasy":2044,"Tower Defense":2024,"Co-op":1898,"Character Customization":1514,"Replay Value":1487,"Action RPG":1397,"Simulation":1024}},
"730":{"appid":730,"name":"Counter-Strike: Global Offensive","developer":"Valve","publisher":"Valve","score_rank":78,"owners":29225079,"owners_variance":154335,"players_forever":28552354,"players_forever_variance":152685,"players_2weeks":9102348,"players_2weeks_variance":88410,"average_forever":17648,"average_2weeks":791,"median_forever":5030,"median_2weeks":358,"ccu":543626,"price":"1499","tags":{"FPS":17082,"Multiplayer":13744,"Shooter":12833,"Action":10881,"Team-Based":10369,"Competitive":9664,"Tactical":8529,"First-Person":7329,"e-sports":6716,"PvP":6383,"Online Co-Op":5714,"Military":4621,"Co-op":4435,"Strategy":4424,"War":4361,"Realistic":3196,"Trading":3191,"Difficult":3158,"Fast-Paced":3100,"Moddable":2496}}
There are many thousands of such entries. Is there a way to skip the "top-level" and look within the nest?
The JSON information is from http://steamspy.com/api.php?request=top100in2weeks
This might be what you need:
library(jsonlite)

data = fromJSON("http://steamspy.com/api.php?request=top100in2weeks")
appid = lapply(data, function(x){x$appid})
name = lapply(data, function(x){x$name})
df = data.frame(appid = unlist(appid),
                name = unlist(name),
                stringsAsFactors = F)
Result:
> head(df)
appid name
570 570 Dota 2
730 730 Counter-Strike: Global Offensive
578080 578080 PLAYERUNKNOWN'S BATTLEGROUNDS
440 440 Team Fortress 2
271590 271590 Grand Theft Auto V
433850 433850 H1Z1: King of the Kill
I'll let you add the rest of the information.
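For the remaining scalar fields, one way to generalize is the sketch below; it assumes every entry in the list carries all of the named keys (true for the two entries shown in the question):
fields = c("appid", "name", "developer", "publisher", "price")
df = do.call(rbind, lapply(data, function(x) {
    data.frame(x[fields], stringsAsFactors = FALSE)
}))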
Edit: Adding arrays to a dataframe
Adding the tags information for each game to the data frame is possible, and the times tagged as well. For each game you must store an array of tag names in one column and the tag quantities in another.
After the definition of df, add the following lines:
for(k in 1:nrow(df)){
    df$tags[k] = list(names(data[[k]]$tags))
    df$tagsQ[k] = list(unlist(data[[k]]$tags))
}
This will give you:
> d["570",]
appid name
570 570 Dota 2
tags
570 Free to Play, MOBA, Strategy, Multiplayer, Team-Based, Action, e-sports, Online Co-Op, Competitive, PvP, RTS, Difficult, RPG, Fantasy, Tower Defense, Co-op, Character Customization, Replay Value, Action RPG, Simulation
tagsQ
570 22686, 7810, 7420, 6759, 4850, 4603, 4092, 3672, 3555, 2657, 2267, 2130, 2116, 2045, 2024, 1898, 1514, 1487, 1397, 1023
In this situation, columns tags and tagsQ contain lists. To obtain the second tag and quantity for appid 570 do:
> df["570","tags"][[1]][2]
[1] "MOBA"
> d["570","tagsQ"][[1]][2]
MOBA
7810

getElementsByTagName seems to not work properly

This question sounds stupid, but how come when I use getElementsByTagName("frame") it only returns 3 as the length and not 5 as I expected?
Here is the HTML of the webpage, where I counted 5 occurrences of the tag name "frame"; yet when I ask for the length in VBA I get 3.
My observations:
1) You can see that 3 is the number of main frames (top_navigation, contentframe, dummyframe).
2) If I try to access one of the main frames via getElementsByName, it works, but if I try to access one of the subframes of contentframe (leftnavigation or postfachcontent), it doesn't work (0 items detected).
Here is my code:
Dim Frame As IHTMLElementCollection
Set Frame = IEDoc.getElementsByName("contentframe") ' this works and returns 1 item
MsgBox Frame.Length
Set Frame = IEDoc.getElementsByName("postfachcontent")
MsgBox Frame.Length ' this returns 0 items
Dim Collection As IHTMLElementCollection
Set Collection = IEDoc.getElementsByTagName("frame")
MsgBox Collection.Length ' this returns 3 and I expected 5...
Only 3 frames are on that page; the rest are inside an embedded HTML frame, which getElementsByTagName cannot access because it is a different DOM tree.
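A sketch of one way to reach into the nested document, assuming the frame names from the question and that the frames are same-origin (cross-origin frame documents cannot be read):
Dim frameWindow As Object
Dim frameDoc As Object
' Each <frame> hosts its own window and document; ask the frames collection for it.
Set frameWindow = IEDoc.frames("contentframe")
Set frameDoc = frameWindow.Document
' Searching inside the child document now finds the nested frames.
MsgBox frameDoc.getElementsByTagName("frame").Length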