Here is a link to a page that contains a table:
http://pitzavod.ru/products/upakovka/
When I read it with pd.read_html I do get a list, but 1) it is not nested, so when converted to a dataframe it is not readable, and 2) it contains integers from 0 up to the number of rows in the table on the website.
The list I get looks like:
[ 0 1 \
0 Показатели Марка целлюлозы
1 ОСН NaN
2 Механическая прочность при размоле в мельнице ... 10 000 740 520
3 Степень делигнификации, п.е. 28 - 45
4 Сорность - число соринок в условной массе 500г... 6500
5 Влажность, % не более 20
2
0 Методы испытаний
1 NaN
2 ГОСТ13525.1 ГОСТ 13525.3 ГОСТ 13525.8
3 ГОСТ 10070
4 ГОСТ 14363.3
5 ГОСТ 16932 ]
Is there a way to easily clean this pandas output, or do I need to parse the website properly? Thank you.
That's because read_html always returns a list (even if there is only one table).
pandas.read_html: Read HTML tables into a list of DataFrame objects.
You need to index it with [0]:
df = pd.read_html("http://pitzavod.ru/products/upakovka/")[0]
Output (showing the last two columns):
1 2
0 Марка целлюлозы Методы испытаний
1 ОСН Методы испытаний
2 10 000 740 520 ГОСТ13525.1 ГОСТ 13525.3 ГОСТ 13525.8
3 28 - 45 ГОСТ 10070
4 6500 ГОСТ 14363.3
5 20 ГОСТ 16932
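If you also want the first row of that table to act as the column header, here is a minimal follow-up sketch, assuming the first row really does hold the column labels, as the output above suggests:
import pandas as pd

# read_html returns a list of tables; take the first one
df = pd.read_html("http://pitzavod.ru/products/upakovka/")[0]

# promote the first row to the column header and drop it from the data
df.columns = df.iloc[0]
df = df.drop(index=0).reset_index(drop=True)
Alternatively, read_html accepts a header argument (e.g. header=0), which should give the same result in one step.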
Category/Date   jul  aug  sep  oct  nov  dec  ...  jan
AA on 2020Jun     5    6    3    8    2    7  ...    4
AA on 2020May     7    3    2    6    5    5  ...    7
Difference       -2    1    1    2   -3    2  ...   -3
I am using a WCF query to fetch the data. The dataset looks like:
Category - AA
Date - 2020Jun
Month1 4
Month12 7
Month11 2
Month10 8
I am using a switch statement like the one below to display each column value:
switch(Reportdate.Month = 1, Fields.Month1.Value,
       Reportdate.Month = 12, Fields.Month12.Value)
The problem I am facing is with finding the difference in each column.
I have a CSV file with 1461 attributes that I want to load into a pandas data frame. The problem is that many rows do not have values for the trailing consecutive columns, so pandas raises a parsing error because of the irregular row lengths. How can I fill those missing trailing columns with missing values and load the CSV file into the data frame?
Edit 1:
The dataset CSV file looks like this:
a,b,c,d,e,f,g,h,i
1,2,4,5
1,0,9,8,7,6,5,4,7
1,3,5,6,7
6,7,8,8,9,4,5,3,5
I want a pandas data frame like the one below:
a b c d e f g h i
1 2 4 5 ? ? ? ? ?
1 0 9 8 7 6 5 4 7
1 3 5 6 7 ? ? ? ?
6 7 8 8 9 4 5 3 5
NaN in place of ? is fine.
The short rows simply don't have enough commas, which causes the unequal-length problem.
It seems you can use the names parameter of read_csv to supply column names from a range (if the attributes are columns):
import pandas as pd
from io import StringIO

temp = u"""
a,v
c,v,f,r
b,g
y"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
# with real data, change 4 to 1461
names = range(4)
df = pd.read_csv(StringIO(temp), names=names)
print(df)
0 1 2 3
0 a v NaN NaN
1 c v f r
2 b g NaN NaN
3 y NaN NaN NaN
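For comparison, reading the same input without names reproduces the kind of parsing error described in the question: pandas infers two columns from the first non-blank line and then meets a longer row. A minimal sketch of that (the exact error wording may vary between pandas versions):
import pandas as pd
from io import StringIO

temp = u"""
a,v
c,v,f,r
b,g
y"""
try:
    # no names given: pandas expects 2 fields from the first line, then sees 4
    df = pd.read_csv(StringIO(temp))
except pd.errors.ParserError as err:
    print(err)  # roughly: "Expected 2 fields in line 3, saw 4"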
EDIT:
temp = u"""a,b,c,d,e,f,g,h,i
1,2,4,5
1,0,9,8,7,6,5,4,7
1,3,5,6,7
6,7,8,8,9,4,5,3,5"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp))
print(df)
a b c d e f g h i
0 1 2 4 5 NaN NaN NaN NaN NaN
1 1 0 9 8 7.0 6.0 5.0 4.0 7.0
2 1 3 5 6 7.0 NaN NaN NaN NaN
3 6 7 8 8 9.0 4.0 5.0 3.0 5.0
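Applied to the real file, the same idea would look roughly like the sketch below, assuming the file is called 'filename.csv' as in the comments above and has no header line:
import pandas as pd

# 1461 integer labels guarantee a column for every possible field,
# so shorter rows are padded with NaN instead of raising a parsing error
df = pd.read_csv('filename.csv', names=range(1461))
# if the file does have a header line, add header=0 so that line is
# consumed and replaced by these names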
I am trying to read an HTML table using the package XML, but even though it looks easy, I haven't managed to do it. I tried everything, but the names of the columns are always set by R to V1, V2, V3, …
This is the code:
require(XML)
tbl <- readHTMLTable("http://facedata.ornl.gov/ornl/npp_98-08.html",
                     header = c("year", "ring", "CO2", "stem", "root", "leaf", "fine root", "NPP"),
                     skip.rows = c(1, 2), colClasses = c(rep("factor", 3), rep("numeric", 5)))
Many thanks for your help
The first row of the table is causing trouble. It may be easiest to remove it:
library(XML)
appURL <- "http://facedata.ornl.gov/ornl/npp_98-08.html"
doc <- htmlParse(appURL)
removeNodes(doc["//table/tr[1]"]) # remove the first row with the troublesome header
myTable <- readHTMLTable(doc, which = 1)
> head(myTable)
Year Plot CO2 Stem Coarse Root Leaf Fine Root Total NPP
1 1998 1 elev 1540 127 362 168 2197
2 1998 2 elev 1487 139 418 175 2219
3 1998 3 amb 1085 112 333 231 1762
4 1998 4 amb 1204 113 368 185 1870
5 1998 5 amb 1136 109 382 56 1683
6 1999 1 elev 1218 98 475 295 2086
I have a list of data.frames (dfList) from which I'd like to generate directed weighted networks in igraph in R.
The edge-weight variable is "ValueUSD", while vertices are identified by the numbers in the columns "Reporter" and "Partner".
Here follows a function I prepared, called "write.graphs", to be used with lapply on my "dfList".
write.graphs <- function(filename) {
  d <- graph.data.frame(filename[c("Reporter", "Partner")], directed = TRUE)
  d <- set.edge.attribute(d, "weight", value = filename$ValueUSD)
  d
}
graphs<-lapply(dfList, write.graphs)
Everything works perfectly. If I check for vertex names I get:
graphs[1]$names
NULL
But troubles emerge when I want to use names in place of numbers to identify my vertices, using the corresponding columns "ReporterN" and "PartnerN" in each of my data.frames in dfList.
Here you can see what my data.frames look like:
dfList[1]
$`Aug 2014`
Reporter YearPeriod Year Period Commodity Partner NetWeightKg ValueUSD Price PartnerN ReporterN
1 76 201408 2014 8 150910 0 4472917 22028271 4.924811 World Brazil
2 76 201408 2014 8 150910 32 380891 1533948 4.027262 Argentina Brazil
3 76 201408 2014 8 150910 152 239776 1336057 5.572105 Chile Brazil
4 76 201408 2014 8 150910 251 289 2164 7.487889 France Brazil
5 76 201408 2014 8 150910 300 27592 170658 6.185054 Greece Brazil
6
This is the message I get:
> graphs <- lapply(dfList, write.graphs)
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In graph.data.frame(filename[c("ReporterN", "PartnerN")], ... :
In `d' `NA' elements were replaced with string "NA"
2: In graph.data.frame(filename[c("ReporterN", "PartnerN")], ... :
In `d' `NA' elements were replaced with string "NA"
3: In graph.data.frame(filename[c("ReporterN", "PartnerN")], ... :
If it helps, I checked:
class(dfList[1]$PartnerN)
[1] "NULL"
Any suggestions? Can anyone explain what is happening?
Thanks a lot, Umberto.