So I am trying to get a table off of https://www.baseball-reference.com/register/team.cgi?id=9995d2a1, specifically the one labeled "Team Pitching", which is hidden in an HTML comment, preventing me from using pd.read_html() or another simpler method. I have gotten to the point where I have all of the data in a DataFrame, but players with an asterisk after their name (because they are left-handed) disappear: their names turn into 'None'. I really need to remove the '*' so that the name still reads.
This is what I did to get what I have so far, with 'None' as the name for lefties:
import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment

page = BeautifulSoup(requests.get('https://www.baseball-reference.com/register/team.cgi?id=b0a9f9bc').text, features='lxml')
tbls = []
for comment in page.find_all(text=lambda text: isinstance(text, Comment)):
    if comment.find("<table ") > 0:
        comment_soup = BeautifulSoup(comment, 'lxml')
        table = comment_soup.find("table")
        tbls.append(table)
def parse_row(row):
    return [str(x.string) for x in row.find_all('td')]

# pitching table
pitching_tbl = tbls[0]
# html text only used for finding names
html = BeautifulSoup(pitching_tbl.text, features='lxml')
rows = pitching_tbl.find_all('tr')
data = pd.DataFrame([parse_row(row) for row in rows])
What I would like to be able to do is loop through the text within pitching_tbl, use .replace('*', '') to change it in place wherever there is an asterisk, and have the actual html within pitching_tbl be changed.
Any help is appreciated!
The desired table data is inside an HTML comment, so you can use BeautifulSoup's built-in Comment class together with a lambda function to grab the data.
import pandas as pd
import requests
from bs4 import BeautifulSoup
from bs4 import Comment
url = 'https://www.baseball-reference.com/register/team.cgi?id=9995d2a1'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'lxml')
# find the comment that contains the Team Pitching table and parse it
df = pd.read_html([x for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_team_pitching"' in x][0])[0]
print(df)
Output:
Rk Name Age W L W-L% ... H9 HR9 BB9 SO9 SO/W Notes
0 1.0 Logan Bursick-Harrington 21.0 0 2 0.000 ... 4.5 0.0 15.8 15.8 1.00 NaN
1 2.0 Cylis Cox* 19.0 1 0 1.000 ... 23.1 0.0 7.7 11.6 1.50 NaN
2 3.0 Travis Densmore* 21.0 0 1 0.000 ... 7.2 0.0 1.8 14.4 8.00 NaN
3 4.0 Dylan Freeman 22.0 1 0 1.000 ... 13.5 1.1 3.4 14.6 4.33 NaN
4 5.0 Zach Hopman* 22.0 0 1 0.000 ... 12.8 0.0 9.9 11.4 1.14 NaN
5 6.0 Eamon Horwedel 22.0 1 0 1.000 ... 9.0 0.0 6.4 6.4 1.00 NaN
6 7.0 Tyler Johnson 19.0 0 0 NaN ... 5.4 0.0 2.7 10.8 4.00 NaN
7 8.0 Trent Jones 20.0 0 0 NaN ... 14.6 1.1 2.3 12.4 5.50 NaN
8 9.0 Tanner Knapp 21.0 1 1 0.500 ... 11.6 0.0 7.7 4.8 0.63 NaN
9 10.0 Mason Majors 22.0 1 0 1.000 ... 4.9 0.0 7.4 12.3 1.67 NaN
10 11.0 Mason Meeks 21.0 0 1 0.000 ... 6.3 0.9 3.6 5.4 1.50 NaN
11 12.0 Sam Nagelvoort 19.0 0 1 0.000 ... 18.0 2.3 22.5 9.0 0.40 NaN
12 13.0 Tyler Nichol 20.0 0 0 NaN ... 27.0 0.0 27.0 0.0 0.00 NaN
13 14.0 Cole Russo 19.0 0 0 NaN ... 27.0 13.5 0.0 0.0 NaN NaN
14 15.0 Kyle Salley* 22.0 0 1 0.000 ... 9.0 2.3 22.5 9.0 0.40 NaN
15 16.0 Noah Stants 21.0 0 0 NaN ... 4.3 1.4 7.1 11.4 1.60 NaN
16 17.0 Quinn Waterhouse* 21.0 0 0 NaN ... 4.5 0.0 4.5 18.0 4.00 NaN
17 18.0 Nick Weyrich 19.0 0 0 NaN ... 6.4 1.3 7.7 11.6 1.50 NaN
18 19.0 Adam Wheaton 23.0 0 1 0.000 ... 11.7 1.8 4.5 12.6 2.80 NaN
19 NaN 19 Players 20.9 5 9 0.357 ... 9.2 0.8 6.9 10.7 1.55 NaN
[20 rows x 32 columns]
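A note in passing: on newer pandas versions (2.1+), passing a literal HTML string to pd.read_html() is deprecated, so you may need to wrap the extracted comment in StringIO first. A minimal adjustment, assuming the same soup object as above:

from io import StringIO

comment_html = [x for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_team_pitching"' in x][0]
df = pd.read_html(StringIO(comment_html))[0]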
So to get that, you need to change:
def parse_row(row):
    return [str(x.string) for x in row.find_all('td')]
to
def parse_row(row):
    return [str(x.text) for x in row.find_all('td')]
The reason you get None is that the '*' is not part of the <a> tag, so there are essentially 2 children within the <td> element. .string returns None whenever a tag has more than one child; .text joins them all.
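A minimal illustration of the difference (the cell markup here is a simplified stand-in for the real page):

from bs4 import BeautifulSoup

cell = BeautifulSoup('<td><a href="#">Cylis Cox</a>*</td>', 'html.parser').td
print(cell.string)  # None -- the <td> has two children (<a> and '*')
print(cell.text)    # 'Cylis Cox*' -- .text concatenates all nested strings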
So that handles the first issue of the None. The second issue of removing the '*': I'm not sure whether you actually want it removed from the html, or just removed in the dataframe you create, so I will show you both.
To change the actual html:
Here we just remove the '*' element from the <td> .contents list. This alters the actual soup object, changing the html, and as a result your dataframe will come out without it too.
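A minimal sketch of that removal (assuming the pitching_tbl soup object from your code; the exact snippet may differ):

from bs4 import NavigableString

# drop any bare '*' text nodes from the <td> elements in the soup
for td in pitching_tbl.find_all('td'):
    for child in list(td.contents):  # copy, since we mutate while iterating
        if isinstance(child, NavigableString) and child.strip() == '*':
            child.extract()          # removes the node from the soup itself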
HTML before and after: notice the '*' is no longer in the actual html once it has been removed.
Now if you're not interested in changing the html, but rather just want to grab the data and manipulate it with pandas: like F.Hoque did, I would let pandas' .read_html() parse the table for you, since it grabs the headers as well, but this will still work with your code too. Either way it's the last step, once you have data (or, in the other solution, df). For his, you'd do df['Name'] = df['Name'].str.replace('*', '', regex=False).
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

page = BeautifulSoup(requests.get('https://www.baseball-reference.com/register/team.cgi?id=b0a9f9bc').text, features='lxml')
tbls = []
for comment in page.find_all(text=lambda text: isinstance(text, Comment)):
    if comment.find("<table ") > 0:
        comment_soup = BeautifulSoup(comment, 'lxml')
        table = comment_soup.find("table")
        tbls.append(table)

def parse_row(row):
    return [str(x.text) for x in row.find_all('td')]

# pitching table
pitching_tbl = tbls[0]
# html text only used for finding names
html = BeautifulSoup(pitching_tbl.text, features='lxml')
rows = pitching_tbl.find_all('tr')
data = pd.DataFrame([parse_row(row) for row in rows])
# '*' is a regex quantifier, so strip it as a plain string
data[0] = data[0].str.replace('*', '', regex=False)
Using F.Hoque's solution, which is how I would do it. And since the '*' also tells you handedness, that might be a nice extra column to add; if the information is there, why not use it?
import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment
import numpy as np

url = 'https://www.baseball-reference.com/register/team.cgi?id=9995d2a1'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'lxml')
df = pd.read_html([x for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_team_pitching"' in x][0])[0]
# flag the lefties before stripping the marker ('*' has to be escaped in a regex)
df['Handedness'] = np.where(df['Name'].str.contains(r'\*'), 'L', 'R')
df['Name'] = df['Name'].str.replace('*', '', regex=False)
print(df)
Output:
Rk Name Age W ... SO9 SO/W Notes Handedness
0 1.0 Logan Bursick-Harrington 21.0 0 ... 15.8 1.00 NaN R
1 2.0 Cylis Cox 19.0 1 ... 11.6 1.50 NaN L
2 3.0 Travis Densmore 21.0 0 ... 14.4 8.00 NaN L
3 4.0 Dylan Freeman 22.0 1 ... 14.6 4.33 NaN R
4 5.0 Zach Hopman 22.0 0 ... 11.4 1.14 NaN L
5 6.0 Eamon Horwedel 22.0 1 ... 6.4 1.00 NaN R
6 7.0 Tyler Johnson 19.0 0 ... 10.8 4.00 NaN R
7 8.0 Trent Jones 20.0 0 ... 12.4 5.50 NaN R
8 9.0 Tanner Knapp 21.0 1 ... 4.8 0.63 NaN R
9 10.0 Mason Majors 22.0 1 ... 12.3 1.67 NaN R
10 11.0 Mason Meeks 21.0 0 ... 5.4 1.50 NaN R
11 12.0 Sam Nagelvoort 19.0 0 ... 9.0 0.40 NaN R
12 13.0 Tyler Nichol 20.0 0 ... 0.0 0.00 NaN R
13 14.0 Cole Russo 19.0 0 ... 0.0 NaN NaN R
14 15.0 Kyle Salley 22.0 0 ... 9.0 0.40 NaN L
15 16.0 Noah Stants 21.0 0 ... 11.4 1.60 NaN R
16 17.0 Quinn Waterhouse 21.0 0 ... 18.0 4.00 NaN L
17 18.0 Nick Weyrich 19.0 0 ... 11.6 1.50 NaN R
18 19.0 Adam Wheaton 23.0 0 ... 12.6 2.80 NaN R
19 NaN 19 Players 20.9 5 ... 10.7 1.55 NaN R
[20 rows x 33 columns]
Related
I am trying to use pd.read_html to import the table under "Daily Observations" from https://www.wunderground.com/history/monthly/us/mi/ann-arbor/date/2020-1
I tried this, but it raised "HTTPError: HTTP Error 403: Forbidden":
Jan = pd.read_html('https://www.wunderground.com/history/monthly/us/mi/ann-arbor/date/2020-1')
Alternatively, I copied the source code of the table and saved it as an html file.
The html file looks like this:
https://i.stack.imgur.com/YTkF9.jpg
When I use pd.read_html to import this html file, it seems like the imported dataset is not a dataframe. The rows and columns come out messy, like this:
[ Time \
0 Jan 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ...
1 Jan
2 1
3 2
4 3
.. ...
220 0.02
221 0.00
222 0.00
223 0.00
224 0.00
Temperature (° F) \
0 Max Avg Min 38 30.8 26 47 41.8 35 45 42.8 39 3...
1 NaN
2 NaN
3 NaN
4 NaN
.. ...
220 NaN
221 NaN
222 NaN
223 NaN
224 NaN
How can I solve it?
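One thing worth noting about the output above: pd.read_html() always returns a list of DataFrames, one per <table> it finds, so the result has to be indexed before it behaves like a DataFrame. A minimal sketch ('saved_page.html' standing in for the saved file):

import pandas as pd

# pd.read_html() returns a *list* of DataFrames, one per <table> found
with open('saved_page.html') as f:
    tables = pd.read_html(f)
jan = tables[0]                      # index into the list to get a DataFrame
print(type(jan))                     # <class 'pandas.core.frame.DataFrame'>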
I'm using an AWS mariaDB to store some data. My idea was to do the full management with the DBI package. However, I have found that DBI only writes the first row of the data when I try to write a table to the db, so I have to use DBI::dbCreateTable and dbx::dbxInsert instead. I can't figure out why DBI is not writing the full data frame.
I have gone through this post but the conclusion is not quite clear. This is the code/output:
con <- DBI::dbConnect(odbc::odbc(), "my_odbc", timeout = 10)
## Example 1 - doesn't work
DBI::dbWriteTable(con, "test1", mtcars)
DBI::dbReadTable(con, "test1")
row_names mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
# Example 2 - doesn't work
DBI::dbCreateTable(con, "test2", mtcars)
DBI::dbAppendTable(con, "test2", mtcars)
[1] 1
DBI::dbReadTable(con, "test2")
mpg cyl disp hp drat wt qsec vs am gear carb
1 21 6 160 110 3.9 2.62 16.46 0 1 4 4
# Example 3 - does work.
DBI::dbCreateTable(con, "test3", mtcars)
dbx::dbxInsert(con, "test3", mtcars)
DBI::dbReadTable(con, "test3")
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
I had a similar issue: if you aren't careful with how you define and use your primary keys, you get this behaviour. The first row is allowed, as it's the first with that primary key, and the rows after are blocked and hence don't get inserted.
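To illustrate the mechanism with a self-contained example (sqlite via Python used purely for illustration; an ODBC/mariaDB setup may surface or swallow the constraint error differently):

import sqlite3

# with a key value that repeats on every row, only the first insert lands
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE t (k INTEGER PRIMARY KEY, v REAL)')
for v in (21.0, 22.8, 21.4):
    try:
        con.execute('INSERT INTO t (k, v) VALUES (1, ?)', (v,))
    except sqlite3.IntegrityError as e:
        print('blocked:', e)          # UNIQUE constraint failed: t.k
print(con.execute('SELECT COUNT(*) FROM t').fetchone())  # (1,)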
I'm trying to scrape the data from every table at the hockey-reference awards page. I can scrape the first table for the Hart Memorial Trophy, but when I try the rest of them, I end up with empty vectors. I used Selector Gadget and the rvest package to produce the following code.
library(rvest)
url="https://www.hockey-reference.com/awards/voting-2017.html"
byng<-read_html(url)
byng_node<-html_nodes(byng, "#byng_stats .right , #byng_stats a")
byng_text<-html_text(byng_node)
However, once I run this code, I get no data in the byng variables:
> byng_node
{xml_nodeset (0)}
> byng_text
character(0)
What's happening here? Does selector gadget not work for pages with multiple tables? Does it have nothing to do with that and there's something HTMLy I don't understand? Any help is greatly appreciated!
#neilfws was right: if you look at the source code of the HTML page, you see that all but the first table are commented out, so rvest treats them as comments, not as part of the source code itself. Let's do a dirty hack and remove the character sequences that are used to comment out our precious tables:
library(rvest)
url="https://www.hockey-reference.com/awards/voting-2017.html"
byng<-read_html(url)
# Remove commenting sequences
byng <- gsub("<!--", "", byng)
byng <- gsub("-->", "", byng)
byng<-read_html(byng)
#Get tables as a list of dataframes
tables <- html_table(byng)
# Last table
tables[7]
[[1]]
Scoring Scoring Scoring Scoring Goalie Stats Goalie Stats
1 Place Player Age Tm Pos Votes Vote% 1st 2nd 3rd 4th 5th G A PTS +/- W L
2 1 Connor McDavid 20 EDM C 762 94.07 141 18 3 0 0 30 70 100 27
3 2 Sidney Crosby 29 PIT C 526 64.94 20 142 0 0 0 44 45 89 17
4 3 Nicklas Backstrom 29 WSH C 127 15.68 1 2 116 0 0 23 63 86 17
5 4 Mark Scheifele 23 WPG C 21 2.59 0 0 21 0 0 32 50 82 18
6 5 Auston Matthews 19 TOR C 10 1.23 0 0 10 0 0 40 29 69 2
7 6 Evgeni Malkin 30 PIT C 4 0.49 0 0 4 0 0 33 39 72 18
8 7 John Tavares 26 NYI C 2 0.25 0 0 2 0 0 28 38 66 4
9 8 Jonathan Toews 28 CHI C 1 0.12 0 0 1 0 0 21 37 58 7
10 8 Brad Marchand 28 BOS C 1 0.12 0 0 1 0 0 39 46 85 18
11 8 Ryan Kesler 32 ANA C 1 0.12 0 0 1 0 0 22 36 58 8
12 8 Ryan Getzlaf 31 ANA C 1 0.12 0 0 1 0 0 15 58 73 7
Here is my own implementation of the gradient descent algorithm in MATLAB:
m = height(data_training); % number of samples
cols = {'x1', 'x2', 'x3', 'x4', 'x5', 'x6',...
        'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15'};
y = data_training{:, {'y'}}';
X = [ones(m,1) data_training{:,cols}]';
theta = zeros(1, width(data_training));
alpha = 1e-2; % learning rate
iter = 400;
dJ = zeros(1, width(data_training));
J_seq = zeros(1, iter);

for n = 1:iter
    err = (theta*X - y);
    for j = 1:width(data_training)
        dJ(j) = 1/m*sum(err*X(j,:)');
    end
    J = 1/(2*m)*sum((theta*X - y).^2);
    theta = theta - alpha.*dJ;
    J_seq(n) = J;
    if mod(n,100) == 0
        plot(1:iter, J_seq);
    end
end
EDIT
WORKING ALGORITHM
I have applied this algorithm to the following training dataset. The last column is the output variable; the other 15 columns are the features.
For a reason unknown to me, when I plot the cost function J after 50 iterations to check whether it is heading towards convergence, I see that it doesn't converge. Can you help me understand? Is the implementation wrong, or should I change something?
36 27 71 8.1 3.34 11.4 81.5 3243 8.8 42.6 11.7 21 15 59 59 921.87
35 23 72 11.1 3.14 11 78.8 4281 3.6 50.7 14.4 8 10 39 57 997.88
44 29 74 10.4 3.21 9.8 81.6 4260 0.8 39.4 12.4 6 6 33 54 962.35
47 45 79 6.5 3.41 11.1 77.5 3125 27.1 50.2 20.6 18 8 24 56 982.29
43 35 77 7.6 3.44 9.6 84.6 6441 24.4 43.7 14.3 43 38 206 55 1071.3
53 45 80 7.7 3.45 10.2 66.8 3325 38.5 43.1 25.5 30 32 72 54 1030.4
43 30 74 10.9 3.23 12.1 83.9 4679 3.5 49.2 11.3 21 32 62 56 934.7
45 30 73 9.3 3.29 10.6 86 2140 5.3 40.4 10.5 6 4 4 56 899.53
36 24 70 9 3.31 10.5 83.2 6582 8.1 42.5 12.6 18 12 37 61 1001.9
36 27 72 9.5 3.36 10.7 79.3 4213 6.7 41 13.2 12 7 20 59 912.35
52 42 79 7.7 3.39 9.6 69.2 2302 22.2 41.3 24.2 18 8 27 56 1017.6
33 26 76 8.6 3.2 10.9 83.4 6122 16.3 44.9 10.7 88 63 278 58 1024.9
40 34 77 9.2 3.21 10.2 77 4101 13 45.7 15.1 26 26 146 57 970.47
35 28 71 8.8 3.29 11.1 86.3 3042 14.7 44.6 11.4 31 21 64 60 985.95
37 31 75 8 3.26 11.9 78.4 4259 13.1 49.6 13.9 23 9 15 58 958.84
35 46 85 7.1 3.22 11.8 79.9 1441 14.8 51.2 16.1 1 1 1 54 860.1
36 30 75 7.5 3.35 11.4 81.9 4029 12.4 44 12 6 4 16 58 936.23
15 30 73 8.2 3.15 12.2 84.2 4824 4.7 53.1 12.7 17 8 28 38 871.77
31 27 74 7.2 3.44 10.8 87 4834 15.8 43.5 13.6 52 35 124 59 959.22
30 24 72 6.5 3.53 10.8 79.5 3694 13.1 33.8 12.4 11 4 11 61 941.18
31 45 85 7.3 3.22 11.4 80.7 1844 11.5 48.1 18.5 1 1 1 53 891.71
31 24 72 9 3.37 10.9 82.8 3226 5.1 45.2 12.3 5 3 10 61 871.34
42 40 77 6.1 3.45 10.4 71.8 2269 22.7 41.4 19.5 8 3 5 53 971.12
43 27 72 9 3.25 11.5 87.1 2909 7.2 51.6 9.5 7 3 10 56 887.47
46 55 84 5.6 3.35 11.4 79.7 2647 21 46.9 17.9 6 5 1 59 952.53
39 29 76 8.7 3.23 11.4 78.6 4412 15.6 46.6 13.2 13 7 33 60 968.66
35 31 81 9.2 3.1 12 78.3 3262 12.6 48.6 13.9 7 4 4 55 919.73
43 32 74 10.1 3.38 9.5 79.2 3214 2.9 43.7 12 11 7 32 54 844.05
11 53 68 9.2 2.99 12.1 90.6 4700 7.8 48.9 12.3 648 319 130 47 861.83
30 35 71 8.3 3.37 9.9 77.4 4474 13.1 42.6 17.7 38 37 193 57 989.26
50 42 82 7.3 3.49 10.4 72.5 3497 36.7 43.3 26.4 15 10 34 59 1006.5
60 67 82 10 2.98 11.5 88.6 4657 13.6 47.3 22.4 3 1 1 60 861.44
30 20 69 8.8 3.26 11.1 85.4 2934 5.8 44 9.4 33 23 125 64 929.15
25 12 73 9.2 3.28 12.1 83.1 2095 2 51.9 9.8 20 11 26 50 857.62
45 40 80 8.3 3.32 10.1 70.3 2682 21 46.1 24.1 17 14 78 56 961.01
46 30 72 10.2 3.16 11.3 83.2 3327 8.8 45.3 12.2 4 3 8 58 923.23
Not sure I'm following your logic, but it's quite obvious that 'e' (the error) should not be squared.
Let's see what you should be using.
theta is a column vector of unknowns, y is a column vector of measurements and X is the model matrix where each row is an 'example'. So you need to find theta such that:
y = X*theta
Or equivalently, use an optimization method to find theta minimizing the current squared error (this is what makes this a convex optimization problem):
e[n] = (X*theta[n] - y)
e[n]^2 --> minimize
Gradient descent uses the gradient of the error function (with respect to theta) to update the theta vector:
theta[n+1] = theta[n] - alpha*2*X'*e[n]
(Note that e[n] and theta[n] are vectors. This is math notation - not matlab's)
So you see that e[n] is not squared in the update equation.
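Written out as a standard least-squares gradient (with the same 1/(2m) scaling the question's code uses), the cost, gradient and update are:

J(\theta) = \frac{1}{2m} \lVert X\theta - y \rVert^2
\nabla_\theta J(\theta) = \frac{1}{m} X^\top (X\theta - y)
\theta \leftarrow \theta - \alpha \, \nabla_\theta J(\theta)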
I have growth data of trees for the month of June across multiple years. Around the beginning of June in 2012, 2013 and 2014, I planted seeds, and near the end of the month I went back to each seed to see whether it had germinated and the tree was alive, or hadn't germinated and the tree was dead. For each sample (each seed), the number of growing days was calculated.
Sample_ID Tree_Type Check_Date Growing_Days Status Max_Temp Min_Temp Mean_Temp Total_mm_Rain
1 Spruce 25-06-2012 16 Alive
2 Spruce 28-06-2012 25 Alive
3 Fir 23-06-2012 19 Dead
4 Spruce 29-06-2012 23 Alive
5 Fir 28-06-2012 16 Alive
6 Fir 25-06-2013 18 Alive
7 Fir 26-06-2013 15 Dead
8 Spruce 28-06-2013 17 Alive
9 Fir 30-06-2013 24 Dead
10 Fir 27-06-2013 19 Alive
11 Spruce 21-06-2014 16 Alive
12 Fir 24-06-2014 18 Alive
13 Fir 28-06-2014 14 Dead
14 Spruce 29-06-2014 18 Alive
15 Spruce 30-06-2014 15 Dead
What I would like to do is see how weather affected my trees. I have pulled historical weather data as a separate dataframe and would like to add to each sample row the Total_mm_Rain that fell during the growing days, along with the Max, Min and Mean temperatures of that growing period.
Date Max_Temp Min_Temp Mean_Temp Total_mm_Rain
01-05-2012 9 3 6 0
02-05-2012 9 2.5 5.8 0
03-05-2012 9.5 -2.5 3.5 4.6
04-05-2012 11 2.5 6.8 0.6
05-05-2012 10 2 6 1.8
06-05-2012 14 -2 6 0
07-05-2012 18 -2 8 0
08-05-2012 21.5 1 11.3 0
09-05-2012 17.5 4.5 11 2.8
10-05-2012 8 0.5 4.3 0
11-05-2012 14.5 -6 4.3 0
12-05-2012 19.5 -3 8.3 0
13-05-2012 23.5 -1 11.3 0
14-05-2012 25 0.5 12.8 0
15-05-2012 27.5 1.5 14.5 0
16-05-2012 24 2.5 13.3 0
17-05-2012 15.5 4.5 10 10
18-05-2012 12.5 2 7.3 0.4
19-05-2012 15 -2 6.5 0
20-05-2012 17.5 -2 7.8 0.4
21-05-2012 15.5 6.5 11 2.2
22-05-2012 12.5 8 10.3 0.4
23-05-2012 14 5 9.5 9.6
24-05-2012 10 1 5.5 1
25-05-2012 11 3 7 3
26-05-2012 13 2 7.5 0
27-05-2012 11.5 3 7.3 0
28-05-2012 17.5 3 10.3 1.2
29-05-2012 15.5 4 9.8 0.2
30-05-2012 17.6 4 10.8 0
31-05-2012 16 6.5 11.3 0.2
01-05-2013 11.5 -4.9 3.3 0
02-05-2013 17.1 -4.5 6.3 2
03-05-2013 15 5.1 10.1 0
04-05-2013 18.9 -0.2 9.4 0
05-05-2013 24.2 -1.8 11.2 0
06-05-2013 26.6 -0.1 13.3 0
07-05-2013 21.9 1.5 11.7 0
08-05-2013 24.6 4.9 14.8 0
09-05-2013 25.5 0.9 13.2 0
10-05-2013 21.4 2 11.7 0
11-05-2013 26.2 3.9 15.1 0
12-05-2013 25 4.5 14.8 0.2
13-05-2013 19.9 10.2 15.1 11
14-05-2013 13.1 5 9.1 0.2
15-05-2013 17.2 -1.7 7.8 0
16-05-2013 15.3 4.1 9.7 0
17-05-2013 18.6 2.4 10.5 1.6
18-05-2013 15.5 3 9.3 5.6
19-05-2013 12.7 5.6 9.2 1
20-05-2013 22 5 13.5 0
21-05-2013 21.9 1.9 11.9 0
22-05-2013 12 7 9.5 24.8
23-05-2013 7.3 0.1 3.7 4.6
24-05-2013 12.3 1.5 6.9 0.2
25-05-2013 13.7 3.7 8.7 0
26-05-2013 19 -1.5 8.8 0
27-05-2013 20 3.5 11.8 0
28-05-2013 17 5.5 11.3 0
29-05-2013 20.1 7 13.6 0.8
30-05-2013 13.5 7.5 10.5 2.4
31-05-2013 9.9 7 8.5 7.8
01-06-2014 8.8 -1 3.9 3.6
02-06-2014 11.4 0.5 6 0
03-06-2014 11.6 -0.7 5.5 0
04-06-2014 16.9 -3.6 6.7 0
05-06-2014 19.6 -2.3 8.7 0
06-06-2014 16.7 0.9 8.8 0
07-06-2014 9.3 5 7.2 1
08-06-2014 10.1 2.8 6.5 0.4
09-06-2014 13.3 -5.2 4.1 0
10-06-2014 16 -4.3 5.9 0
11-06-2014 17 -1.5 7.8 1.6
12-06-2014 13.9 4.7 9.3 0.3
13-06-2014 16.5 -3.4 6.6 0
14-06-2014 22.9 -2.3 10.3 0
15-06-2014 27 0.6 13.8 0
16-06-2014 29.6 4.1 16.9 0
17-06-2014 29.1 3.3 16.2 0
18-06-2014 28.1 5.6 16.9 0
19-06-2014 25.9 8.1 17 0.2
20-06-2014 15.9 8.7 12.3 3.1
21-06-2014 21.3 8.8 15.1 0.4
22-06-2014 23.7 6.7 15.2 6.9
23-06-2014 18.4 9.3 13.9 0
24-06-2014 18.2 4 11.1 6.4
25-06-2014 16 6.5 11.3 10
26-06-2014 12.2 3.6 7.9 1.9
27-06-2014 11.6 3.5 7.6 2.6
28-06-2014 13.7 4.4 9.1 5.6
29-06-2014 11.7 5.5 8.6 3.4
30-06-2014 17.4 7 12.2 0
I have tried using table functions, as well as diving into the idea of converting dates to numbers (as in Excel) and summing based on dates as numbers instead of dates, but this is above my knowledge of R.
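At its core this is a per-row date-window aggregation. A minimal sketch of that logic, written in pandas for illustration since that's the language used elsewhere on this page (the same filter-and-summarise idea ports to R; 'trees.csv' and 'weather.csv' are placeholder file names, and the column names follow the tables above):

import pandas as pd

trees = pd.read_csv('trees.csv')
weather = pd.read_csv('weather.csv')

# dates in the tables are dd-mm-yyyy
trees['Check_Date'] = pd.to_datetime(trees['Check_Date'], format='%d-%m-%Y')
weather['Date'] = pd.to_datetime(weather['Date'], format='%d-%m-%Y')
# the growing window runs from (check date - growing days) to the check date
trees['Start_Date'] = trees['Check_Date'] - pd.to_timedelta(trees['Growing_Days'], unit='D')

def window_stats(row):
    # weather rows that fall inside this sample's growing window
    w = weather[(weather['Date'] >= row['Start_Date']) & (weather['Date'] <= row['Check_Date'])]
    return pd.Series({'Max_Temp': w['Max_Temp'].max(),
                      'Min_Temp': w['Min_Temp'].min(),
                      'Mean_Temp': w['Mean_Temp'].mean(),
                      'Total_mm_Rain': w['Total_mm_Rain'].sum()})

stats = trees.apply(window_stats, axis=1)
for col in stats.columns:
    trees[col] = stats[col]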