Append information in the th tags to td rows - html

I am an economist struggling with coding and data scraping.
I am scarping data from the main and unique table on this webpage (https://www.oddsportal.com/basketball/europe/euroleague-2013-2014/results/). I can retrieve all the information of the td HTML tags with python selenium by referring to the class element. The same goes for the th tag where it is stored the information of the date and stage of the competition. In my final dataset, I would like to have the information stored in the th tag in two rows (data and stage of the competition) next to the other rows in the table. Basically, for each match, I would like to have the date and the stage of the competition in rows and not as the head of each group of matches.
The only solution I came up with is to index all the rows (with both th and td tags) and build a while loop to append the information in the th tags to the td rows whose index is lower than the next index for the th tag. Hope I made myself clear (if not I will try to give a more graphical explanation). However, I am not able to code such a logic construct due to my poor coding abilities. I do not know if I need two loops to iterate through different tags (td and th) and in case how to do that. If you have any easier solution, it is more than welcome!
Thanks in advance for the precious help!
code below:
from selenium import webdriver
import time
import pandas as pd
# Season to filter
seasons_filt = ['2013-2014', '2014-2015', '2015-2016','2016-2017', '2017-2018', '2018-2019']
# Define empty data
data_keys = ["Season", "Match_Time", "Home_Team", "Away_Team", "Home_Odd", "Away_Odd", "Home_Score",
"Away_Score", "OT", "N_Bookmakers"]
data = dict()
for key in data_keys:
data[key] = list()
del data_keys
# Define 'driver' variable and launch browser
#path = "C:/Users/ALESSANDRO/Downloads/chromedriver_win32/chromedriver.exe"
#path office pc
path = "C:/Users/aldi/Downloads/chromedriver.exe"
driver = webdriver.Chrome(path)
# Loop through pages based on page_num and season
for season_filt in seasons_filt:
page_num = 0
while True:
page_num += 1
# Get url and navigate it
page_str = (1 - len(str(page_num)))* '0' + str(page_num)
url ="https://www.oddsportal.com/basketball/europe/euroleague-" + str(season_filt) + "/results/#/page/" + page_str + "/"
driver.get(url)
time.sleep(3)
# Check if page has no data
if driver.find_elements_by_id("emptyMsg"):
print("Season {} ended at page {}".format(season_filt, page_num))
break
try:
# Teams
for el in driver.find_elements_by_class_name('name.table-participant'):
el = el.text.strip().split(" - ")
data["Home_Team"].append(el[0])
data["Away_Team"].append(el[1])
data["Season"].append(season_filt)
# Scores
for el in driver.find_elements_by_class_name('center.bold.table-odds.table-score'):
el = el.text.split(":")
if el[1][-3:] == " OT":
data["OT"].append(True)
el[1] = el[1][:-3]
else:
data["OT"].append(False)
data["Home_Score"].append(el[0])
data["Away_Score"].append(el[1])
# Match times
for el in driver.find_elements_by_class_name("table-time"):
data["Match_Time"].append(el.text)
# Odds
i = 0
for el in driver.find_elements_by_class_name("odds-nowrp"):
i += 1
if i%2 == 0:
data["Away_Odd"].append(el.text)
else:
data["Home_Odd"].append(el.text)
# N_Bookmakers
for el in driver.find_elements_by_class_name("center.info-value"):
data["N_Bookmakers"].append(el.text)
# TODO think of inserting the dates list in the dataframe even if it has a different size (19 rows and not 50)
except:
pass
driver.quit()
data = pd.DataFrame(data)
data.to_csv("data_odds.csv", index = False)
I would like to add this information to my dataset as two additional rows:
for el in driver.find_elements_by_class_name("first2.tl")[1:]:
el = el.text.strip().split(" - ")
data["date"].append(el[0])
data["stage"].append(el[1])

Few things I would change here.
Don't overwrite variables. You store elements in your el variable, then you over write the element with your strings. It may work for you here, but you may get yourself into trouble with that practice later on, especially since you are iterating through those elements. It makes it hard to debug too.
I know Selenium has ways to parse the html. But I personally feel BeautifulSoup is a tad easier to parse with and is a little more intuitive if you are simply just trying to pull out data from the html. So I went with BeautifulSoup's .find_previous() to get the tags that precede the games, essentially then able to get your date and stage content.
Lastly, I like to construct a list of dictionaries to make up the data frame. Each item in the list is a dictionary key:value where the key is the column name and value is the data. You sort of do the opposite in creating a dictionary of lists. Now there is nothing wrong with that, but if the lists don't have the same length, you're get an error when trying to create the dataframe. Where as with my way, if for what ever reason there is a value missing, it will still create the dataframe, but will just have a null or nan for the missing data.
There may be more work you need to do with the code to go through the pages, but this gets you the data in the form you need.
Code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
import pandas as pd
from bs4 import BeautifulSoup
import re
# Season to filter
seasons_filt = ['2013-2014', '2014-2015', '2015-2016','2016-2017', '2017-2018', '2018-2019']
# Define 'driver' variable and launch browser
path = "C:/Users/ALESSANDRO/Downloads/chromedriver_win32/chromedriver.exe"
driver = webdriver.Chrome(path)
rows = []
# Loop through pages based on page_num and season
for season_filt in seasons_filt:
page_num = 0
while True:
page_num += 1
# Get url and navigate it
page_str = (1 - len(str(page_num)))* '0' + str(page_num)
url ="https://www.oddsportal.com/basketball/europe/euroleague-" + str(season_filt) + "/results/#/page/" + page_str + "/"
driver.get(url)
time.sleep(3)
# Check if page has no data
if driver.find_elements_by_id("emptyMsg"):
print("Season {} ended at page {}".format(season_filt, page_num))
break
try:
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.find('table', {'id':'tournamentTable'})
trs = table.find_all('tr', {'class':re.compile('.*deactivate.*')})
for each in trs:
teams = each.find('td', {'class':'name table-participant'}).text.split(' - ')
scores = each.find('td', {'class':re.compile('.*table-score.*')}).text.split(':')
ot = False
for score in scores:
if 'OT' in score:
ot == True
scores = [x.replace('\xa0OT','') for x in scores]
matchTime = each.find('td', {'class':re.compile('.*table-time.*')}).text
# Odds
i = 0
for each_odd in each.find_all('td',{'class':"odds-nowrp"}):
i += 1
if i%2 == 0:
away_odd = each_odd.text
else:
home_odd = each_odd.text
n_bookmakers = soup.find('td',{'class':'center info-value'}).text
date_stage = each.find_previous('th', {'class':'first2 tl'}).text.split(' - ')
date = date_stage[0]
stage = date_stage[1]
row = {'Season':season_filt,
'Home_Team':teams[0],
'Away_Team':teams[1],
'Home_Score':scores[0],
'Away_Score':scores[1],
'OT':ot,
'Match_Time':matchTime,
'Home_Odd':home_odd,
'Away_Odd':away_odd,
'N_Bookmakers':n_bookmakers,
'Date':date,
'Stage':stage}
rows.append(row)
except:
pass
driver.quit()
data = pd.DataFrame(rows)
data.to_csv("data_odds.csv", index = False)
Output:
print(data.head(15).to_string())
Season Home_Team Away_Team Home_Score Away_Score OT Match_Time Home_Odd Away_Odd N_Bookmakers Date Stage
0 2013-2014 Real Madrid Maccabi Tel Aviv 86 98 False 18:00 -667 +493 7 18 May 2014 Final Four
1 2013-2014 Barcelona CSKA Moscow 93 78 False 15:00 -135 +112 7 18 May 2014 Final Four
2 2013-2014 Barcelona Real Madrid 62 100 False 19:00 +134 -161 7 16 May 2014 Final Four
3 2013-2014 CSKA Moscow Maccabi Tel Aviv 67 68 False 16:00 -278 +224 7 16 May 2014 Final Four
4 2013-2014 Real Madrid Olympiacos 83 69 False 18:45 -500 +374 7 25 Apr 2014 Play Offs
5 2013-2014 CSKA Moscow Panathinaikos 74 44 False 16:00 -370 +295 7 25 Apr 2014 Play Offs
6 2013-2014 Olympiacos Real Madrid 71 62 False 18:45 +127 -152 7 23 Apr 2014 Play Offs
7 2013-2014 Maccabi Tel Aviv Olimpia Milano 86 66 False 17:45 -217 +179 7 23 Apr 2014 Play Offs
8 2013-2014 Panathinaikos CSKA Moscow 73 72 False 16:30 -106 -112 7 23 Apr 2014 Play Offs
9 2013-2014 Panathinaikos CSKA Moscow 65 59 False 18:45 -125 +104 7 21 Apr 2014 Play Offs
10 2013-2014 Maccabi Tel Aviv Olimpia Milano 75 63 False 18:15 -189 +156 7 21 Apr 2014 Play Offs
11 2013-2014 Olympiacos Real Madrid 78 76 False 17:00 +104 -125 7 21 Apr 2014 Play Offs
12 2013-2014 Galatasaray Barcelona 75 78 False 17:00 +264 -333 7 20 Apr 2014 Play Offs
13 2013-2014 Olimpia Milano Maccabi Tel Aviv 91 77 False 18:45 -286 +227 7 18 Apr 2014 Play Offs
14 2013-2014 CSKA Moscow Panathinaikos 77 51 False 16:15 -303 +247 7 18 Apr 2014 Play Offs

Related

Undefined columns selected using panelvar package

Have anyone used panel var in R?
Currently I'm using the package panelvar of R. And I'm getting this error :
Error in `[.data.frame`(data, , c(colnames(data)[panel_identifier], required_vars)) :
undefined columns selected
And my syntax currently is:
model1<-pvargmm(
dependent_vars = c("Change.."),
lags = 2,
exog_vars = c("Price"),
transformation = "fd",
data = base1,
panel_identifier = c("id", "t"),
steps = c("twostep"),
system_instruments = FALSE,
max_instr_dependent_vars = 99,
min_instr_dependent_vars = 2L,
collapse = FALSE)
I don't know why my panel_identifier is not working, it's pretty similar to the example given by panelvar package, however, it doesn't work, I want to appoint that base1 is on data.frame format. any ideas? Also, my data is structured like this:
head(base1)
id t country DDMMYY month month_text day Date_txt year Price Open
1 1 1296 China 1-4-2020 4 Apr 1 Apr 01 2020 12588.24 12614.82
2 1 1295 China 31-3-2020 3 Mar 31 Mar 31 2020 12614.82 12597.61
High Low Vol. Change..
1 12775.83 12570.32 NA -0.0021
2 12737.28 12583.05 NA 0.0014
thanks in advance !
Check the documentation of the package and the SSRN paper. For me it helped to ensure all entered formats are identical (you can check this with str(base1) command). For example they write:
library(panelvar)
data("Dahlberg")
ex1_dahlberg_data <-
pvargmm(dependent_vars = .......
When I look at it I get
>str(Dahlberg)
'data.frame': 2385 obs. of 5 variables:
$ id : Factor w/ 265 levels "114","115","120",..: 1 1 1 1 1 1 1 1 1 2 ...
$ year : Factor w/ 9 levels "1979","1980",..: 1 2 3 4 5 6 7 8 9 1 ...
$ expenditures: num 0.023 0.0266 0.0273 0.0289 0.0226 ...
$ revenues : num 0.0182 0.0209 0.0211 0.0234 0.018 ...
$ grants : num 0.00544 0.00573 0.00566 0.00589 0.00559 ...
For example the input data must be a data.frame (in my case it had additional type specifications like tibble or data.table). I resolved it by casting as.data.frame() on it.

Regression by year and companyID to save coefficients

I am trying to run regressions by companyID and year, and save the coefficients for each firm-year model as new variables in a new column right besides the other columns. There is an additional wrinkle‹ I have panel data for 1990-2010 and want to run each regression using t to t-4 only (I.e., for 2001, use only 1998-2001 years of data and i.e. for 1990 then only the data of 1990 and so on). I am new to using foreach loops and I found some prior coding on the web. I have tried to adapt it to my situation but two issues: anything.....
the output is staying blank
I have not figured out how to use the rolling four year data periods.
Here is the code I tried. Any suggestions would be much appreciated.
use paneldata.dta // the dataset I am working in
generate coeff . //empty variable for coefficient
foreach x of local levels {
forval z = 1990/2010
{
capture reg excess_returns excess_market
replace coeff = _b[fyear] & _b[CompanyID] if e(sample) }
}
So below is a short snapshot of what the data looks like;
CompanyID Re_Rf Rm-Rf Year
10 2 2 1990 
10 3 2 1991 
15 3 2 1991 
15 4 2 1992
15 5 2 1993 
21 4 2 1990 
21 4 2 1991 
34 3 1 1990 
34 3 1 1991
34 4 1 1992
34 2 1 1993  
34 3 1 1994
34 4 1 1995
34 2 1 1996   
 
Re_Rf = excess_returns 
Rm_Rf = excess_market 
I want to run the following regression: ​​​​​​​
reg excess_returns excess_market
There is a good discussion on Statalist, but I think this answer may be helpful for your learning about loops and how Stata syntax work.
the code I would use is as follows:
generate coeff = . //empty variable for coefficient
// put the values of gvkey into a local macro called levels
qui levelsof CompanyID, local(levels)
foreach co of local levels {
forval yr = 1994/2010 {
// run the regression with the condition that year is between yr
// and yr-3 (which is what you write in your example)
// and the CompanyID is the same as in the regression
qui reg Re_Rf Rm_Rf if fyear <= `yr' & fyear >= `yr'-3 & CompanyID== `co'
// now replace coeff equal to the coefficient on Rm_Rf with the same
// condiditions as above, but only for year yr
replace coeff = _b[Rm_Rf] if fyear == `yr' & CompanyID == `co'
}
}
This is a potentially dangerous thing to do if you do not have a balanced panel. If you are worried about this, there may be a way to deal with it using capture or changing the fyear loop to include something like:
levelsof fyear if CompanyID == `co', local(yr_level)
foreach yr of `yr_level' { ...

Select records based on the specific index string value and then remove subsequent fields by python

I have a .csv file named file01.csv that contains many records. Some records are required and some are not. I find that the required records has a string variable “Mi”, but it is not exist into the unnecessary records. So, I want to select the required records based on string value “Mi” in the field for every records.
Finally I want to delete the subsequent fields of each record from the field that contains value “Mi”. Any suggestion and advice is appreciated.
Optional:
In addition, I want to delete the first column.
Split column BB into two column named as a_id, and c_id. Separate the value by _ (underscore) and left side will go to a_id, and right side will go to c_id.
My fileO.csv is as follows:
AA BB CC DD EE FF GG
1 1_1.csv (=0 =10" 27" =57 "Mi"
0.97 0.9 0.8 NaN 0.9 od 0.2
2 1_3.csv (=0 =10" 27" "Mi" 0.5
0.97 0.5 0.8 NaN 0.9 od 0.4
3 1_6.csv (=0 =10" "Mi" =53 cnt
0.97 0.9 0.8 NaN 0.9 od 0.6
4 2_6.csv No Bi 000 000 000 000
5 2_8.csv No Bi 000 000 000 000
6 6_9.csv less 000 000 000 000
7 7_9.csv s(=0 =26" =46" "Mi" 121
My Expected results files (outFile.csv):
a_id b_id CC DD EE FF GG
1 1 0 10 27 57
1 3 0 10 27
1 6 0 10
7 9 0 26 46
The following approach should work fine using Python csv module:
import csv
import re
import string
output_header = ['a_id', 'b_id', 'CC', 'DD', 'EE', 'FF', 'GG']
sanitise_table = string.maketrans("","")
nodigits_table = sanitise_table.translate(sanitise_table, string.digits)
def find_mi(row):
for index, col in enumerate(row):
if col.find('Mi') != -1:
return index
return -1
def sanitise_cell(cell):
return cell.translate(sanitise_table, nodigits_table) # Keep digits
f_input = open('fileO.csv', 'rb')
f_output = open('outFile.csv', 'wb')
csv_input = csv.reader(f_input)
csv_output = csv.writer(f_output)
input_header = next(f_input)
csv_output.writerow(output_header)
for row in csv_input:
#print '%2d %s' % (len(row), row)
if len(row) >= 2:
bb = re.match(r'(\d+)__(\d+).0\.csv', row[1])
mi = find_mi(row)
if bb and mi != -1:
row[:] = row[:mi] + [''] * (len(row) - mi)
row[:] = [sanitise_cell(col) for col in row]
row[0] = bb.group(1)
row[1] = bb.group(2)
csv_output.writerow(row)
f_input.close()
f_output.close()
outFile.csv will contain the following:
a_id,b_id,CC,DD,EE,FF,GG
1,1,0,10,27,57,
1,3,0,10,27,,
1,6,0,10,,,
7,9,0,26,46,,
Tested using Python 2.6.6

Scraping embeded html table in R

I am fairly new to scraping/parsing HTML in R. I am trying to get data from the Career Receiving Statistics and Career Rushing Statistics' tables from http://totalfootballstats.com/PlayerWR.asp?id=1218565.
I know about the read readHTMLtable function but both these tables are embedded in so much junk and I can't seem to get past the children nodes of the root.
EDIT: the above problem has been solved. However for the website http://www.sports-reference.com/cfb/players/a-index.html I am trying to loop through all players and access their data. I'm running into trouble in accessing their respective url links. I have tried:
fb=htmlParse("http://www.sports-reference.com/cfb/players/a-index.html")
p1=getNodeSet(fb,'//pre')
con = textConnection(xmlValue(p1[[100]]))
players100 = read.table(con)
But this results in the error "Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 3 did not have 5 elements"
The other thing I tried is:
links <- xpathSApply(fb, "//a/#href")
But I feel like there should be a better way to do this?
Well here's the same player from a different website, much much cleaner. The data doesn't match though, so someone got it wrong. My money's on totalfootballstats.com. Choose your resources wisely!
readHTMLTable(
"http://www.sports-reference.com/cfb/players/doyle-aaron-1.html"
)
# $receiving
# Year School Conf Class Pos G Rec Yds Avg TD Att Yds Avg TD Plays Yds Avg TD
# 1 1988 Miami (FL) Ind WR 11 1 12 12.0 0 1 34 34.0 0 2 46 23.0 0
# 2 1989 Miami (FL) Ind WR 11 8 93 11.6 1 8 93 11.6 1
# $kick_ret
# Year School Conf Class Pos G Ret Yds Avg TD Ret Yds Avg TD
# 1 1988 Miami (FL) Ind WR 11 1 8 8.0 0
# 2 1989 Miami (FL) Ind WR 11
For specific requests, it looks like you can a construct a valid URL like this, which will also construct the path for multiple players at once.
## base URI
u <- "http://www.sports-reference.com"
## player first and last names
first <- "bill"
last <- "adams"
## use sprintf() to make all the paths at once
fullPath <- sprintf("%s/cfb/players/%s-%s-1.html", u, first, last)
## read the table - I think you'll need to loop readHTMLTable() though
readHTMLTable(fullPath)
# $receiving
# Year School Conf Class Pos G Rec Yds Avg TD Att Yds Avg TD Plays Yds Avg TD
# 1 1969 Dayton Ind WR 10 1 3 3.0 1 1 3 3.0 1
# 2 1970 Dayton Ind WR 10 4 42 10.5 1 4 42 10.5 1

standard unambiguos format [R] MySQL imported data

OK, to set the scene, I have written a function to import multiple tables from MySQL (using RODBC) and run randomForest() on them.
This function is run on multiple databases (as separate instances).
In one particular database, and one particular table, the "error in as.POSIXlt.character(x, tz,.....): character string not in a standard unambiguous format" error is thrown. The function runs on around 150 tables across two databases without any issues except this one table.
Here is a head() print from the table:
MQLTime bar5 bar4 bar3 bar2 bar1 pat1 baXRC
1 2014-11-05 23:35:00 184 24 8 24 67 147 Flat
2 2014-11-05 23:57:00 203 184 204 67 51 147 Flat
3 2014-11-06 00:40:00 179 309 49 189 75 19 Flat
4 2014-11-06 00:46:00 28 192 60 49 152 147 Flat
5 2014-11-06 01:20:00 309 48 9 11 24 19 Flat
6 2014-11-06 01:31:00 24 177 64 152 188 19 Flat
And here is the function:
GenerateRF <- function(db, countstable, RFcutoff) {
'load required libraries'
library(RODBC)
library(randomForest)
library(caret)
library(ff)
library(stringi)
'connection and data preparation'
connection <- odbcConnect ('TTODBC', uid='root', pwd='password', case="nochange")
'import count table and check if RF is allowed to be built'
query.str <- paste0 ('select * from ', db, '.', countstable, ' order by RowCount asc')
row.counts <- sqlQuery (connection, query.str)
'Operate only on tables that have >= RFcutoff'
for (i in 1:nrow (row.counts)) {
table.name <- as.character (row.counts[i,1])
col.count <- as.numeric (row.counts[i,2])
row.count <- as.numeric (row.counts[i,3])
if (row.count >= 20) {
'Delete old RFs and DFs for input pattern'
if (file.exists (paste0 (table.name, '_RF.Rdata'))) {
file.remove (paste0 (table.name, '_RF.Rdata'))
}
if (file.exists (paste0 (table.name, '_DF.Rdata'))) {
file.remove (paste0 (table.name, '_DF.Rdata'))
}
'import and clean data'
query.str2 <- paste0 ('select * from ', db, '.', table.name, ' order by mqltime asc')
raw.data <- sqlQuery(connection, query.str2)
'partition data into training/test sets'
set.seed(489)
index <- createDataPartition(raw.data$baXRC, p=0.66, list=FALSE, times=1)
data.train <- raw.data [index,]
data.test <- raw.data [-index,]
'find optimal trees to grow (without outcome and dates)
data.mtry <- as.data.frame (tuneRF (data.train [, c(-1,-col.count)], data.train$baXRC, ntreetry=100,
stepFactor=.5, improve=0.01, trace=TRUE, plot=TRUE, dobest=FALSE))
best.mtry <- data.mtry [which (data.mtry[,2] == min (data.mtry[,2])), 1]
'compress df'
data.ff <- as.ffdf (data.train)
'run RF. Originally set to 1000 trees but M1 dataset is to large for laptop. Maybe train at the lab?'
data.rf <- randomForest (baXRC~., data=data.ff[,-1], mtry=best.mtry, ntree=500, keep.forest=TRUE,
importance=TRUE, proximity=FALSE)
'generate and print variable importance plot'
varImpPlot (data.rf, main = table.name)
'predict on test data'
data.test.pred <- as.data.frame( predict (data.rf, data.test, type="prob"))
'get dates and name date column'
data.test.dates <- data.frame (data.test[,1])
colnames (data.test.dates) <- 'MQLTime'
'attach dates to prediction df'
data.test.res <- cbind (data.test.dates, data.test.pred)
'force date coercion to attempt negating unambiguous format error '
data.test.res$MQLTime <- format(data.test.res$MQLTime, format = "%Y-%m-%d %H:%M:%S")
'delete row names, coerce to dataframe, generate row table name and export outcomes to MySQL'
rownames (data.test.res)<-NULL
data.test.res <- as.data.frame (data.test.res)
root.table <- stri_sub(table.name, 0, -5)
sqlUpdate (connection, data.test.res, tablename = paste0(db, '.', root.table, '_outcome'), index = "MQLTime")
'save RF and test df/s for future use; save latest version of row_counts to MQL4 folder'
save (data.rf, file = paste0 ("C:/Users/user/Documents/RF_test2/", table.name, '_RF.Rdata'))
save (data.test, file = paste0 ("C:/Users/user/Documents/RF_test2/", table.name, '_DF.Rdata'))
write.table (row.counts, paste0("C:/Users/user/AppData/Roaming/MetaQuotes/Terminal/71FA4710ABEFC21F77A62A104A956F23/MQL4/Files/", db, "_m1_rowcounts.csv"), sep = ",", col.names = F,
row.names = F, quote = F)
'end of conditional block'
}
'end of for loop'
}
'close all connection to MySQL'
odbcCloseAll()
'clear workspace'
rm(list=ls())
'end of function'
}
At this line:
data.test.res$MQLTime <- format(data.test.res$MQLTime, format = "%Y-%m-%d %H:%M:%S")
I have tried coercing MQLTime using various functions including: as.character(), as.POSIXct(), as.POSIXlt(), as.Date(), format(), as.character(as.Date())
and have also tried:
"%y" vs "%Y" and "%OS" vs "%S"
All variants seem to have no effect on the error and the function is still able to run on all other tables. I have checked the table manually (which contains almost 1500 rows) and also in MySQL looking for NULL dates or dates like "0000-00-00 00:00:00".
Also, if I run the function line by line in R terminal, this offending table is processed without any problems which just confuses the hell out me.
I've exhausted all the functions/solutions I can think of (and also all those I could find through Dr. Google) so I am pleading for help here.
I should probably mention that the MQLTime column is stored as varchar() in MySQL. This was done to try and get around issues with type conversions between R and MySQL
SHOW VARIABLES LIKE "%version%";
innodb_version, 5.6.19
protocol_version, 10
slave_type_conversions,
version, 5.6.19
version_comment, MySQL Community Server (GPL)
version_compile_machine, x86
version_compile_os, Win32
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)
Edit: Str() output on the data as imported from MySQl showing MQLTime is already in POSIXct format:
> str(raw.data)
'data.frame': 1472 obs. of 8 variables:
$ MQLTime: POSIXct, format: "2014-11-05 23:35:00" "2014-11-05 23:57:00" "2014-11-06 00:40:00" "2014-11-06 00:46:00" ...
$ bar5 : int 184 203 179 28 309 24 156 48 309 437 ...
$ bar4 : int 24 184 309 192 48 177 48 68 60 71 ...
$ bar3 : int 8 204 49 60 9 64 68 27 192 147 ...
$ bar2 : int 24 67 189 49 11 152 27 56 437 67 ...
$ bar1 : int 67 51 75 152 24 188 56 147 71 0 ...
$ pat1 : int 147 147 19 147 19 19 147 19 147 19 ...
$ baXRC : Factor w/ 3 levels "Down","Flat",..: 2 2 2 2 2 2 2 2 2 3 ...
So I have tried declaring stringsAsfactors = FALSE in the dataframe operations and this had no effect.
Interestingly, if the offending table is removed from processing through an additional conditional statement in the first 'if' block, the function stops on the table immediately preceeding the blocked table.
If both the original and the new offending tables are removed from processing, then the function stops on the table immediately prior to them. I have never seen this sort of behavior before and it really has me stumped.
I watched system resources during the function and they never seem to max out.
Could this be a problem with the 'for' loop and not necessarily date formats?
There appears to be some egg on my face. The table following the table where the function was stopping had a row with value '0000-00-00 00:00:00'. I added another statement in my MySQL function to remove these rows when pre-processing the tables. Thanks to those that had a look at this.