How do I read an HTML table with mismatching columns and headers?

An HTML table body has one more column than the table header defines. As a result, the last column is skipped and the remaining columns are mismatched. How can I add the additional column to the resulting data.frame/data.table in R while reading the HTML table with the htmltab package? Post-processing does not help, since the last column has already been dropped.
Here is an example:
install.packages("htmltab")
library(htmltab)
library(data.table)  # needed for the data.table() calls below
bu <- 0
bu <- data.table("Pl.", "Mannschaft", "Kurzname" , "Spiele", "G.", "U.", "V.", "Tore", "Diff.", "Pkt.")
#https://www.bundesliga-prognose.de/1/2009/1/
url <- "https://www.bundesliga-prognose.de/1/2009/1/"
bu <- htmltab(doc = url, column=10,columnnames=c ("Pl." , "Mannschaft", "Kurzname" , "Spiele", "G.", "U.", "V.", "Tore", "Diff.", "Pkt."), which = "//th[text() = 'Pl.']/ancestor::table")
bu <- data.table(bu)
head(bu)
This results in
Pl. Mannschaft Spiele G. U. V. Tore Diff. Pkt.
1: 1. VfL Wolfsburg Wolfsburg 1 1 0 0 2:0 2
2: 2. Eintracht Frankfurt E. Frankfurt 1 1 0 0 3:2 1
3: 3. FC Schalke 04 FC Schalke 04 1 1 0 0 2:1 1
4: 4. Borussia Dortmund B. Dortmund 1 1 0 0 1:0 1
5: NA Hertha BSC Berlin H. BSC Berlin 1 1 0 0 1:0 1
6: 6. Bor. Mönchengladbach M´gladbach 1 0 1 0 3:3 0
Because the short name ("Kurzname") is not specified in the header, its values appear in the games ("Spiele") column, every following column is shifted by one, and the last column is skipped. How can I add the additional "Kurzname" column while reading the header with the htmltab package?
In addition, I would like to replace the NA in row 5 with the row number, also using the htmltab package.

This does indeed seem to be a problem for htmltab. The only solution I have found is to read the tbody of the table directly. You then need to add the header manually.
htmltab(doc = url, which = "//table[2]/tbody")

With that help I found a fairly simple solution:
specify that the header should be skipped
list all columns through colNames
url <- "https://www.bundesliga-prognose.de/1/2007/5/"
sp_2007_5 <- htmltab(doc = url, which = "//table[1]/tbody", header = 0, colNames = c("Datum", "Anpfiff", "Heim", "Heim_Kurzname", "Gast", "Gast_Kurzname", "Ergebnis", "Prognose"), rm_nodata_cols = FALSE, encoding = "UTF-8")
head(sp_2007_5)

Related

Phyloseq: relative abundance otu-table and metadata do not match

I am relatively new to phyloseq and I struggle to obtain a relative abundance otu-table acceptable for input to siamcat R code for meta-analysis.
# this works: from qza to phyloseq object
ps<-qza_to_phyloseq(
features="all-table.qza",
tree="rooted-tree.qza",
taxonomy = "all-taxonomy.qza",
metadata = "metafinal.tsv"
)
# import metadata
metadata <- read_tsv("metafinal.tsv")
# 30 overlap of the metadata-sample_id with ps, 115 only in metadata
gplots::venn(list(metadata=metadata$sample_id, features=sample_names(ps)))
# works: from phyloseq object to relative abundance otu table
table(tax_table(ps)[, "Phylum"])
ps_rel_abund <- transform_sample_counts(ps, function(x){x / sum(x)})
ps_phylum_rel <- tax_glom(ps_rel_abund, "Phylum")
taxa_names(ps_phylum_rel) <- tax_table(ps_phylum_rel)[, "Phylum"]
rel_table <- as(otu_table(ps_phylum_rel), "matrix")
# column names and sample_id are 100% the same
colnames(rel_table)
metadata$sample_id
# 100% overlap:
gplots::venn(list(metadata=metadata$sample_id, featuretable=colnames(rel_table)))
# check that metadata and feature agree
stopifnot(all(colnames(rel_table) == metadata$sample_id))
Here I get the error message: all(colnames(rel_table) == metadata$sample_id) is not TRUE
and the SIAMCAT code that follows does not work at all.
My metadata[1:5, 1:5]:
sample_id absolute_filepath study experiment_acce… study_title
1 SRR8547628 $PWD/Chen_2020_da… Chen… SRX5349649 Dissection of c…
2 SRR8547629 $PWD/Chen_2020_da… Chen… SRX5349648 Dissection of c…
3 SRR8547630 $PWD/Chen_2020_da… Chen… SRX5349647 Dissection of c…
4 SRR8547631 $PWD/Chen_2020_da… Chen… SRX5349646 Dissection of c…
5 SRR8547632 $PWD/Chen_2020_da… Chen… SRX5349645 Dissection of c…
my rel-table[1:5, 1:5]:
SRR5092146 SRR5092147 SRR5092148 SRR5092149
Phragmoplastophyta 0 0.0000000 0.00000000 0.000000000
Vertebrata 0 0.0000000 0.00000000 0.000000000
Apicomplexa 0 0.0000000 0.00000000 0.000000000
Ascomycota 0 0.0000000 0.00000000 0.000000000
Campilobacterota 0 0.2465222 0.01166882 0.004337051
SRR5092150
Phragmoplastophyta 0.00000000
Vertebrata 0.00000000
Apicomplexa 0.00000000
Ascomycota 0.00000000
Campilobacterota 0.02106281
nrow(metadata)= 154
ncol(rel_table)= 154
Please, why is it not working? I tried for weeks now and I can't make the code run properly ...
Thank you for your time and help.
If I understand your question correctly, you are wondering why, despite the perfect overlap between the sample IDs in your metadata and in your feature table,
stopifnot(all(colnames(rel_table) == metadata$sample_id))
fails.
I think it is because your samples are in a different order: == compares the two vectors element by element, position by position. The first five samples in your metadata are:
SRR8547628
SRR8547629
SRR8547630
SRR8547631
SRR8547632
and the first five samples in your feature table are:
SRR5092146
SRR5092147
SRR5092148
SRR5092149
SRR5092150
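To check only that the same IDs are present, regardless of order, try
stopifnot(all(colnames(rel_table) %in% metadata$sample_id))
If the downstream code needs both objects in the same order, you can reorder the metadata first, for example with metadata <- metadata[match(colnames(rel_table), metadata$sample_id), ], after which the original == check should pass (assuming the sample IDs are unique).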

Undefined columns selected using panelvar package

Has anyone used panelvar in R?
I'm currently using the R package panelvar and I'm getting this error:
Error in `[.data.frame`(data, , c(colnames(data)[panel_identifier], required_vars)) :
undefined columns selected
And my syntax currently is:
model1<-pvargmm(
dependent_vars = c("Change.."),
lags = 2,
exog_vars = c("Price"),
transformation = "fd",
data = base1,
panel_identifier = c("id", "t"),
steps = c("twostep"),
system_instruments = FALSE,
max_instr_dependent_vars = 99,
min_instr_dependent_vars = 2L,
collapse = FALSE)
I don't know why my panel_identifier is not working. It is very similar to the example given in the panelvar package, yet it doesn't work. I want to point out that base1 is a data.frame. Any ideas? My data is structured like this:
head(base1)
id t country DDMMYY month month_text day Date_txt year Price Open
1 1 1296 China 1-4-2020 4 Apr 1 Apr 01 2020 12588.24 12614.82
2 1 1295 China 31-3-2020 3 Mar 31 Mar 31 2020 12614.82 12597.61
High Low Vol. Change..
1 12775.83 12570.32 NA -0.0021
2 12737.28 12583.05 NA 0.0014
Thanks in advance!
Check the package documentation and the SSRN paper. What helped me was making sure the formats of all inputs match those of the package's example data (you can inspect yours with str(base1)). For example, they write:
library(panelvar)
data("Dahlberg")
ex1_dahlberg_data <-
pvargmm(dependent_vars = .......
When I look at it I get
>str(Dahlberg)
'data.frame': 2385 obs. of 5 variables:
$ id : Factor w/ 265 levels "114","115","120",..: 1 1 1 1 1 1 1 1 1 2 ...
$ year : Factor w/ 9 levels "1979","1980",..: 1 2 3 4 5 6 7 8 9 1 ...
$ expenditures: num 0.023 0.0266 0.0273 0.0289 0.0226 ...
$ revenues : num 0.0182 0.0209 0.0211 0.0234 0.018 ...
$ grants : num 0.00544 0.00573 0.00566 0.00589 0.00559 ...
In particular, the input data must be a plain data.frame (in my case it carried additional classes such as tibble or data.table). I resolved it by calling as.data.frame() on it.

How to apply CountVectorizer to bigrams in a pandas DataFrame

I'm trying to apply CountVectorizer to a DataFrame column containing bigrams, to convert it into a frequency matrix showing the number of times each bigram appears in each row, but I keep getting error messages.
This is what I tried:
cereal['bigrams'].head()
0 [(best, thing), (thing, I), (I, have),....
1 [(eat, it), (it, every), (every, morning),...
2 [(every, morning), (morning, my), (my, brother),...
3 [(I, have), (five, cartons), (cartons, lying),...
.........
bow = CountVectorizer(max_features=5000, ngram_range=(2,2))
train_bow = bow.fit_transform(cereal['bigrams'])
train_bow
Expected results
(best,thing) (thing, I) (I, have) (eat,it) (every,morning)....
0 1 1 1 0 0
1 0 0 0 1 1
2 0 0 0 0 1
3 0 0 1 0 0
....
I see you are trying to convert a pd.Series into a count representation of each term.
That's a bit different from what CountVectorizer does.
From the function description:
Convert a collection of text documents to a matrix of token counts
The official usage example is:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
So, as one can see, it takes as input a list in which each element is a "document" (a string).
That is probably the cause of the errors you are getting: you are passing a pd.Series in which each element is a list of tuples.
To use CountVectorizer you would have to transform your input into the proper format.
If you have the original corpus/text, you can easily apply CountVectorizer to it (with the ngram_range parameter) to get the desired result.
Otherwise, the best solution would be to treat the data as what it is: a series of lists of items, which must be counted and pivoted.
Sample workaround:
(it would be a lot easier if you just used the text corpus instead)
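The workaround code did not survive in this copy of the answer, so what follows is only a minimal sketch of the "count and pivot" idea, not the original author's code. It uses only the standard library. The names bigram_rows and bigram_count_matrix are hypothetical, and the sample data merely imitates the first rows of cereal['bigrams'] shown in the question.

```python
from collections import Counter

# Hypothetical stand-in for cereal['bigrams']: one list of bigram tuples per row.
bigram_rows = [
    [("best", "thing"), ("thing", "I"), ("I", "have")],
    [("eat", "it"), ("it", "every"), ("every", "morning")],
    [("every", "morning"), ("morning", "my"), ("my", "brother")],
    [("I", "have"), ("five", "cartons"), ("cartons", "lying")],
]


def bigram_count_matrix(rows):
    """Return (vocab, matrix).

    vocab lists each distinct bigram in order of first appearance;
    matrix[i][j] is how often vocab[j] occurs in rows[i].
    """
    vocab = []
    seen = set()
    for row in rows:
        for bigram in row:
            if bigram not in seen:
                seen.add(bigram)
                vocab.append(bigram)

    matrix = []
    for row in rows:
        counts = Counter(row)  # a Counter returns 0 for bigrams absent from the row
        matrix.append([counts[bigram] for bigram in vocab])
    return vocab, matrix


vocab, matrix = bigram_count_matrix(bigram_rows)
```

If pandas is available, the result can be wrapped as pd.DataFrame(matrix, columns=[" ".join(b) for b in vocab]) to get a table shaped like the expected output in the question. Joining each tuple into a string keeps the column labels flat.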
Hope it helps!

Write items from a list to a CSV file column by column using pandas DataFrame.to_csv

I have a list named items
items=['a' , 'b','c']
The code is:
import pandas
df = pandas.DataFrame(items)
df.to_csv("myfile.csv", header=False, index=False)
The values are written to the file in different rows of the same column (vertically).
But I want the values to be written as a b c, i.e. in the same row but in different columns.
Help please.
You get each element in a different row because of the way you build the DataFrame: a flat list becomes a single column.
import pandas as pd
df = pd.DataFrame(items)
df
Out[5]:
   0
0  a
1  b
2  c
If you want the elements in different columns, I would suggest transposing:
df.T
Out[11]:
   0  1  2
0  a  b  c
Or you can load the list as one row, like below:
items = [['a', 'b', 'c']]
df = pd.DataFrame(items)
df
Out[22]:
   0  1  2
0  a  b  c
Then write the output to CSV, e.g. starting from the original flat list:
items = ['a', 'b', 'c']
df = pd.DataFrame(items).T
df.to_csv("myfile.csv", header=False, index=False)
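If the DataFrame is only a means to get the list into a file, the same single-row layout can also be produced without pandas. The sketch below swaps in Python's standard csv module, so it shows a different technique from the DataFrame.to_csv approach above. The helper name write_single_row is hypothetical.

```python
import csv
import io

items = ['a', 'b', 'c']


def write_single_row(values, fileobj):
    """Write all values as one CSV row, one value per column."""
    csv.writer(fileobj).writerow(values)


# Demonstrated on an in-memory buffer; for a real file, pass
# open("myfile.csv", "w", newline="") as the file object instead.
buffer = io.StringIO()
write_single_row(items, buffer)
csv_text = buffer.getvalue()
```

Here csv_text holds the single line a,b,c followed by the csv module's default line terminator.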

Scraping embedded HTML table in R

I am fairly new to scraping/parsing HTML in R. I am trying to get data from the Career Receiving Statistics and Career Rushing Statistics tables at http://totalfootballstats.com/PlayerWR.asp?id=1218565.
I know about the readHTMLTable function, but both of these tables are embedded in so much junk that I can't seem to get past the child nodes of the root.
EDIT: the above problem has been solved. However, for the website http://www.sports-reference.com/cfb/players/a-index.html I am trying to loop through all players and access their data. I'm running into trouble accessing their respective URLs. I have tried:
fb=htmlParse("http://www.sports-reference.com/cfb/players/a-index.html")
p1=getNodeSet(fb,'//pre')
con = textConnection(xmlValue(p1[[100]]))
players100 = read.table(con)
But this results in the error "Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 3 did not have 5 elements"
The other thing I tried is:
links <- xpathSApply(fb, "//a/@href")
But I feel like there should be a better way to do this?
Well here's the same player from a different website, much much cleaner. The data doesn't match though, so someone got it wrong. My money's on totalfootballstats.com. Choose your resources wisely!
readHTMLTable(
"http://www.sports-reference.com/cfb/players/doyle-aaron-1.html"
)
# $receiving
# Year School Conf Class Pos G Rec Yds Avg TD Att Yds Avg TD Plays Yds Avg TD
# 1 1988 Miami (FL) Ind WR 11 1 12 12.0 0 1 34 34.0 0 2 46 23.0 0
# 2 1989 Miami (FL) Ind WR 11 8 93 11.6 1 8 93 11.6 1
# $kick_ret
# Year School Conf Class Pos G Ret Yds Avg TD Ret Yds Avg TD
# 1 1988 Miami (FL) Ind WR 11 1 8 8.0 0
# 2 1989 Miami (FL) Ind WR 11
For specific requests, it looks like you can construct a valid URL like this. Because sprintf() is vectorized, the same call also constructs the paths for multiple players at once.
## base URI
u <- "http://www.sports-reference.com"
## player first and last names
first <- "bill"
last <- "adams"
## use sprintf() to make all the paths at once
fullPath <- sprintf("%s/cfb/players/%s-%s-1.html", u, first, last)
## read the table - I think you'll need to loop readHTMLTable() though
readHTMLTable(fullPath)
# $receiving
# Year School Conf Class Pos G Rec Yds Avg TD Att Yds Avg TD Plays Yds Avg TD
# 1 1969 Dayton Ind WR 10 1 3 3.0 1 1 3 3.0 1
# 2 1970 Dayton Ind WR 10 4 42 10.5 1 4 42 10.5 1