For an assignment, I need to "use SQL to extract all tweets in twitter message-table under those 3 user ids in the previous step." I am currently confused about how to grab the tweet info from MySQL using the vector, x, in R.
I keep getting this error message, "Error in .local(conn, statement, ...) :
unused argument (c(18949452, 34713362, 477583514))."
#use SQL to get a list of unique user id in twitter message table as a
#vector in R.
res <- dbSendQuery(con, statement = "select user_id from
twitter_message")
user_id <- dbFetch(res)
user_id
nrow(user_id)
# randomly select: use R to randomly generate 3 user ids
x <- user_id[sample(nrow(user_id), 3, replace = FALSE, prob = NULL),]
x
res2 = dbSendQuery(con, statement = 'SELECT twitter_message WHERE
user_id =',x)
tweets <- dbFetch(res2)
tweets
x is a vector, so maybe you should use the dbSendQuery function in a loop: for each element in x, pass its value into your dbSendQuery statement. Does that make sense?
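For instance, here is a minimal sketch of that loop, assuming con is your open connection (note the statement also needs "* FROM", which your original query was missing):
tweets <- data.frame()
for (id in x) {
  res2 <- dbSendQuery(con, statement = paste0(
    "SELECT * FROM twitter_message WHERE user_id = ", id))
  tweets <- rbind(tweets, dbFetch(res2))
  dbClearResult(res2) # release each result set before the next query
}
tweets
Alternatively, since the ids are numeric, a single query with an IN clause avoids the loop entirely: dbGetQuery(con, paste0("SELECT * FROM twitter_message WHERE user_id IN (", paste(x, collapse = ","), ")")).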
I tried to use the glue package to pass a list of event IDs into the CALL for the stored procedure, but that did not work. I get the following error: unexpected string constant in "df <- dbGetQuery(mydb, glue::gluesql('CALL reportProfitAndLossDetails(3,' ',{eventids_list},TRUE);'"
Any suggestions on passing the list into the call?
# query eventIds for MLB (433) and MLS (446) events and convert it into a list
eventids <- dbGetQuery(mydb, 'SELECT id
FROM Event
WHERE date >= CURDATE()
AND
(EventTypeId = 433 OR EventTypeId = 446)
ORDER BY date ASC;')
eventids_list <- paste0(eventids$id, collapse=',')
# execute reportProfitAndLossDetails for above eventids
df <- dbGetQuery(mydb, glue::gluesql('CALL reportProfitAndLossDetails(3,' ',{eventids_list},TRUE);'))
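Two things stand out here: the function is glue::glue_sql() (glue::gluesql does not exist), and the stray ' ' inside the single-quoted string is what triggers the "unexpected string constant" parse error. Below is a hedged sketch of a cleaned-up call, assuming the procedure expects the ids as one comma-separated string; its real signature isn't shown above, so adjust the arguments to match:
# Sketch only: glue_sql() renders {eventids_list} as a single quoted SQL
# string literal (e.g. '101,102,103') because .con tells it how to escape
# values for MySQL.
query <- glue::glue_sql(
  "CALL reportProfitAndLossDetails(3, {eventids_list}, TRUE);",
  eventids_list = eventids_list,
  .con = mydb
)
df <- dbGetQuery(mydb, query)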
I have a data frame in PySpark like below:
df = spark.createDataFrame(
    [
        ('2021-10-01', 'A', 25),
        ('2021-10-02', 'B', 24),
        ('2021-10-03', 'C', 20),
        ('2021-10-04', 'D', 21),
        ('2021-10-05', 'E', 20),
        ('2021-10-06', 'F', 22),
        ('2021-10-07', 'G', 23),
        ('2021-10-08', 'H', 24),
    ],
    ("RUN_DATE", "NAME", "VALUE"),
)
Now, using this data frame, I want to update a table in MySQL.
# query to run should be similar to this
update_query = "UPDATE DB.TABLE SET DATE = '2021-10-01', VALUE = 25 WHERE NAME = 'A'"
# mysql_conn is a function which I use to connect to MySQL from PySpark and run queries
# Invoking the function
mysql_conn(host, user_name, password, update_query)
When I invoke the mysql_conn function with those parameters, the query runs successfully and the record gets updated in the MySQL table.
Now I want to run the update statement for all the records in the data frame.
For each NAME it has to pick the corresponding RUN_DATE and VALUE, substitute them into update_query, and trigger mysql_conn.
I think we need a for loop, but I am not sure how to proceed.
Instead of iterating through the dataframe with a for loop, it would be better to distribute the workload across partitions using foreachPartition. Moreover, since you are writing a custom query anyway, instead of executing one query per row it is more efficient to execute a single batch UPDATE per partition, which reduces round trips, latency, and concurrent connections. E.g.:
def update_db(rows):
    # build one UNION ALL derived table out of every row in this partition
    temp_table_query = ""
    for row in rows:
        if len(temp_table_query) > 0:
            temp_table_query = temp_table_query + " UNION ALL "
        temp_table_query = temp_table_query + " SELECT '%s' as RUNDATE, '%s' as NAME, %d as VALUE " % (row.RUN_DATE, row.NAME, row.VALUE)
    if len(temp_table_query) == 0:
        return  # empty partition: nothing to update
    update_query = """
    UPDATE DBTABLE
    INNER JOIN (
        %s
    ) new_records ON DBTABLE.NAME = new_records.NAME
    SET
        DBTABLE.DATE = new_records.RUNDATE,
        DBTABLE.VALUE = new_records.VALUE
    """ % (temp_table_query)
    mysql_conn(host, user_name, password, update_query)

df.foreachPartition(update_db)
Let me know if this works for you.
I am new to R and have a database, DatabaseX, which contains multiple tables a, b, c, d, etc. I want to use R to count the number of rows for a common attribute, message_id, across all these tables and store each count separately.
I can count message_id for all tables using below code:
list <- dbListTables(con)
# print list of tables for testing
print(list)
for (i in 1:length(list)) {
  query <- paste("SELECT COUNT(message_id) FROM ", list[i], sep = "")
  t <- dbGetQuery(con, query)
}
print(t)
This prints :
### COUNT(message_id)
## 1 21519
but I want to keep a record of count(message_id) for each individual table, so that for example table a = 200, b = 300, c = 500, and so on.
Any suggestions on how to do this?
As @kaiten65 suggested, one option would be to create a helper function which executes the COUNT query. Outside of your loop I have defined a numeric vector, counts, which will store the number of records for each table in your database. You can then perform descriptive stats on the tables along with this vector of record counts.
doCountQuery <- function(con, table) {
  query <- paste("SELECT COUNT(message_id) FROM ", table, sep = "")
  t <- dbGetQuery(con, query)
  return(t)
}

list <- dbListTables(con)
counts <- numeric(0) # this will store the counts for all tables
for (i in 1:length(list)) {
  count <- doCountQuery(con, list[i])
  counts[i] <- count[[1]]
}
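If you also want each count labelled with its table, a small follow-up (a sketch reusing the list vector from above) is to name the vector:
names(counts) <- list # label each count with its table name
print(counts)         # e.g. a = 200, b = 300, c = 500, ...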
You can use functions to achieve what you need. For instance
readDB <- function(db, i) {
  query <- paste("SELECT COUNT(message_id) FROM ", db, sep = "")
  t <- dbGetQuery(con, query)
  return(print(paste("Table", i, "count:", t[[1]]))) # t[[1]] pulls the count out of the result data frame
}

list <- dbListTables(con)
for (i in 1:length(list)) {
  readDB(list[i], i) # pass i as well, since readDB takes two arguments
}
This prints the count for each table in turn, with the actual code in a nice editable function. Your output will be
"Table 1 count: 2519"
"Table 2 count: ---- "
More information on R functions here: http://www.statmethods.net/management/userfunctions.html
I have a MySQL table I am attempting to access with R using RMySQL.
There are 1690004 rows that should be returned from
dbGetQuery(con, "SELECT * FROM tablename WHERE export_date ='2015-01-29'")
Unfortunately, I receive the following warning messages:
In is(object, Cl) : error while fetching row
In dbGetQuery(con, "SELECT * FROM tablename WHERE export_date ='2015-01-29'", : pending rows
And only receive ~400K rows.
If I break the query into several "fetches" using dbSendQuery, the warning messages start appearing after ~400K rows are received.
Any help would be appreciated.
So, it looks like it was due to a 60 second timeout imposed by my hosting provider (damn Arvixe!). I got around this by "paging/chunking" the output. Because my data has an auto-incrementing primary key, every row returned is in order, allowing me to take the next X rows after each iteration.
To get 1.6M rows I did the following:
library(RMySQL)
con <- MySQLConnect() # mysql connection function
day <- '2015-01-29' # date of interest
numofids <- 50000 # number of rows to include in each 'chunk'
count <- dbGetQuery(con, paste0("SELECT COUNT(*) as count FROM tablename WHERE export_date = '",day,"'"))$count # get the number of rows returned from the table.
dbDisconnect(con)
ns <- seq(0, count - 1, numofids) # sequence of LIMIT offsets to work over (MySQL offsets are zero-based, so start at 0, not 1)
tosave <- data.frame() # data frame to bind results to
# iterate through table to get data in 50k row chunks
for (nextseries in ns) { # for each chunk offset
  print(nextseries) # print the offset it's on
  con <- MySQLConnect()
  d1 <- dbGetQuery(con, paste0("SELECT * FROM tablename WHERE export_date = '", day, "' LIMIT ", nextseries, ",", numofids)) # extract data in chunks of 50k rows
  dbDisconnect(con)
  # bind data to the tosave data frame (the if/else avoids an error when it
  # tries to rbind d1 to an empty data frame on the first pass)
  if (nrow(tosave) > 0) {
    tosave <- rbind(tosave, d1)
  } else {
    tosave <- d1
  }
}
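As a side note, rbind-ing inside the loop copies the growing data frame on every pass. A variant sketch (same connection function and variables as above) that collects the chunks in a list and binds once at the end scales better:
chunks <- vector("list", length(ns))
for (i in seq_along(ns)) {
  con <- MySQLConnect()
  chunks[[i]] <- dbGetQuery(con, paste0("SELECT * FROM tablename WHERE export_date = '", day, "' LIMIT ", ns[i], ",", numofids))
  dbDisconnect(con)
}
tosave <- do.call(rbind, chunks) # one bind instead of one per chunk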
I need to join two tables where the common id column that I want to use has a different name in each table. The two tables also share a "false" common column name, so the join goes wrong when dplyr takes the default and joins on column "id".
Here's some of the code involved in this problem:
library(dplyr)
library(RMySQL)
SDB <- src_mysql(host = "localhost", user = "foo", dbname = "bar", password = getPassword())
# Then reference a tbl within that src
administrators <- tbl(SDB, "administrators")
members <- tbl(SDB, "members")
Here are 3 attempts -- that all fail -- to pass along the information that the common column on the members side is "id" and on the administrators side it's "idmember":
sqlq <- semi_join(members,administrators, by=c("id","idmember"))
sqlq <- inner_join(members,administrators, by= "id.x = idmember.y")
sqlq <- semi_join(members,administrators, by.x = id, by.y = idmember)
Here's an example of the kinds of error messages I'm getting:
Error in mysqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not run statement: Unknown column '_LEFT.idmember' in 'where clause')
The examples I see out there pertain to data tables and data frames on the R side. My question is about how dplyr sends "by" statements to a SQL engine.
In the next version of dplyr, you'll be able to do:
inner_join(members, administrators, by = c("id" = "idmember"))
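The same named-vector form should also cover the semi_join from your first attempt once that version lands, e.g. (untested sketch):
sqlq <- semi_join(members, administrators, by = c("id" = "idmember"))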
Looks like this is an unresolved issue:
https://github.com/hadley/dplyr/issues/177
However you can use merge:
❥ admin <- as.tbl(data.frame(id = c("1","2","3"),false = c(TRUE,FALSE,FALSE)))
❥ members <- as.tbl(data.frame(idmember = c("1","2","4"),false = c(TRUE,TRUE,FALSE)))
❥ merge(admin,members, by.x = "id", by.y = "idmember")
id false.x false.y
1 1 TRUE TRUE
2 2 FALSE TRUE
If you need to do left or outer joins, you can always use the all.x or all arguments to merge. A thought though... you've got a SQL db, why not use it?
❥ con2 <- dbConnect(MySQL(), host = "localhost", user = "foo", dbname = "bar", password = getPassword())
❥ dbGetQuery(con2, "select * from admin join members on id = idmember")