I want to be able to add a column to an existing table, with its corresponding type.
This is how I tried it:
library("RMySQL")
# Connect to DB
v_db <- dbConnect(MySQL(),
user="USERNAME", password="PASSWORD",
dbname="DBNAME", host="localhost")
on.exit(dbDisconnect(v_db))
#Read in my new data (into R)
newcolumn <- read.csv("test.csv")
newcolumn
id datafornewcolumn
1 4
2 5
3 8
dbq <- dbSendQuery(v_db, "SELECT * FROM `EXISTINGTABLE`")
dbq <- fetch(dbq, n = -1)
dbq
id existing columns
1 ...
2 ...
3 ...
dbWriteTable(v_db, "EXISTINGTABLE", merge(dbq, newcolumn, by="id", all.x=TRUE), row.name=FALSE, overwrite=T)
But with that last statement I overwrite the existing table with the new one, thereby losing all the corresponding column types.
Then I tried a workaround: write the new data into a new SQL table and then merge it into EXISTINGTABLE. However, it seems I'm not able to do that correctly:
dbSendQuery(v_db, "create table workaround (id int not null primary key,
newcolumn DECIMAL(3,1))")
#write data into that new empty table called workaround --> works fine
dbWriteTable(v_db, "workaround", neu, row.name=FALSE, append=TRUE)
# the query below runs, but EXISTINGTABLE itself is not updated
dbSendQuery(v_db, "SELECT * FROM EXISTINGTABLE
LEFT JOIN workaround ON EXISTINGTABLE.id = workaround.id
UNION
SELECT * FROM EXISTINGTABLE
RIGHT JOIN workaround ON EXISTINGTABLE.id = workaround.id")
<MySQLResult:(29344,26,2)>
The result should look like this:
EXISTINGTABLE
id existingcolumns datafornewcolumn
1 ... 4
2 ... 5
3 ... 8
Have you tried modifying your table in SQL with an ALTER TABLE statement?
rs <- dbSendStatement(v_db, "ALTER TABLE table_name ADD COLUMN [...]")
dbHasCompleted(rs)
dbGetRowsAffected(rs)
dbClearResult(rs)
Then you can simply send an UPDATE statement to fill in the new values.
Try the dbExecute() method first, and look at these help pages in the R console:
?dbExecute or ?dbSendStatement
Check this out too: RMySQL dbWriteTable with field.types
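Putting those pieces together, a minimal sketch (not a definitive implementation; it assumes the new column is called datafornewcolumn with type DECIMAL(3,1), and that the values are already staged in the workaround table from the question) could look like:
library(RMySQL)
# add the new column with its type to the existing table
dbExecute(v_db, "ALTER TABLE EXISTINGTABLE ADD COLUMN datafornewcolumn DECIMAL(3,1)")
# fill it from the staging table by joining on id
dbExecute(v_db, "UPDATE EXISTINGTABLE e
                 JOIN workaround w ON e.id = w.id
                 SET e.datafornewcolumn = w.newcolumn")
This keeps the existing columns and their types untouched, since the table is never dropped or recreated.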
Adding a column to an existing table can be done as follows:
library(dbConnect)
# Loading required package: RMySQL
# Loading required package: DBI
# Loading required package: methods
# Loading required package: gWidgets
mydb <- dbConnect(MySQL(), user='root', password='newpass', dbname='database_name', host='localhost')
my_table <- dbReadTable(conn=mydb, name='table_name')
my_table$column_name <- NA # creates a new column named column_name and fills it with NA
head(my_table) # shows the data
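Note that this only adds the column to the local data frame; to persist it you would still have to write the data back, for example with dbWriteTable and an explicit field.types vector (a sketch in the spirit of the field.types link above; the column names and types here are placeholders for the real schema):
# write the modified data frame back, specifying the MySQL type of each column
dbWriteTable(mydb, 'table_name', my_table,
             overwrite = TRUE, row.names = FALSE,
             field.types = c(id = "INT",
                             column_name = "DECIMAL(3,1)"))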
I have a data frame in pyspark like below
df = spark.createDataFrame(
    [
        ('2021-10-01','A',25),
        ('2021-10-02','B',24),
        ('2021-10-03','C',20),
        ('2021-10-04','D',21),
        ('2021-10-05','E',20),
        ('2021-10-06','F',22),
        ('2021-10-07','G',23),
        ('2021-10-08','H',24)
    ],
    ("RUN_DATE", "NAME", "VALUE"))
Now, using this data frame, I want to update a table in MySQL.
# query to run should be similar to this
update_query = "UPDATE DB.TABLE SET DATE = '2021-10-01', VALUE = 25 WHERE NAME = 'A'"
# mysql_conn is a function which I use to connect to `MySql` from `pyspark` and run queries
# Invoking the function
mysql_conn(host, user_name, password, update_query)
When I invoke the mysql_conn function with these parameters, the query runs successfully and the record gets updated in the MySQL table.
Now I want to run the update statement for all the records in the data frame.
For each NAME it has to pick the RUN_DATE and VALUE, substitute them into update_query, and trigger mysql_conn.
I think we need a for loop, but I'm not sure how to proceed.
Instead of iterating through the data frame with a for loop, it would be better to distribute the workload across the partitions using foreachPartition. Moreover, since you are building a custom query anyway, it is more efficient to batch the rows of each partition into a single statement rather than executing one query per row, which reduces round trips, latency and concurrent connections. E.g.
def update_db(rows):
    # build a derived table of (RUNDATE, NAME, VALUE) rows for this partition
    temp_table_query = ""
    for row in rows:
        if len(temp_table_query) > 0:
            temp_table_query = temp_table_query + " UNION ALL "
        temp_table_query = temp_table_query + " SELECT '%s' as RUNDATE, '%s' as NAME, %d as VALUE " % (row.RUN_DATE, row.NAME, row.VALUE)
    if len(temp_table_query) == 0:
        return  # empty partition, nothing to update
    # join the derived table against DBTABLE and update the matching rows in one statement
    update_query = """
    UPDATE DBTABLE
    INNER JOIN (
        %s
    ) new_records ON DBTABLE.NAME = new_records.NAME
    SET
        DBTABLE.DATE = new_records.RUNDATE,
        DBTABLE.VALUE = new_records.VALUE
    """ % (temp_table_query)
    mysql_conn(host, user_name, password, update_query)

df.foreachPartition(update_db)
View Demo on how the UPDATE query works
Let me know if this works for you.
I want to append a new table to the original (existing) table without overwriting the original table. What is the query for appending the new table?
I have tried the following code.
for(j in seq_along(x)){ ## loop over the vector of file names
  ## choose one document from vector of strings
  file <- x[j]
  # read csv for file
  doc <- read.csv(file, sep=";")
  # indicate number of rows for each doc
  n <- nrow(doc)
  # create dataframe for doc
  df <- data.frame(doc_id = numeric(n), doc_name = character(n), doc_number = character(n), stringsAsFactors = FALSE)
  # loop to create df
  for(k in 1:nrow(doc)){
    df$doc_id[k] <- paste0(df$doc_id[k])
    df$doc_name[k] <- paste0(doc$titles[k])
    df$doc_number[k] <- paste0(doc$no[k])
  }
  # query for inserting rows into mysql
  query1 = sprintf('INSERT IGNORE INTO my_sql_table VALUES ("%s","%s","%s");', df[i,1],df[i,2],df[i,3])
  # query for appending table to my_sql_table
  query2 = sqlAppendTable(con, "doc", df)
  # execute the queries
  dbExecute(con, query1)
  dbExecute(con, query2)
  print(j)
} ## end of for loop
I also tried the following queries for appending; unfortunately, they didn't work.
INSERT IGNORE INTO my_sql_table VALUES ("%s","%s");
INSERT IGNORE INTO tableBackup(SELECT * FROM my_sql_table);
I expect to have appended new_table to original_table without deleting or overwriting the original table.
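One approach that may help (a minimal sketch, assuming the data frame columns match the columns of the existing table) is to let dbWriteTable append the rows instead of rebuilding the table:
library(DBI)
# append the rows of df to the existing table without dropping or overwriting it
dbWriteTable(con, "my_sql_table", df, append = TRUE, row.names = FALSE)
With append = TRUE the table definition is left alone and only new rows are inserted, so the original data is never overwritten.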
Situation
Working with Python 3.7.2
I have read privilege on a MariaDB table with 5M rows on a server.
I have a local text file with 7K integers, one per line.
The integers represent IDXs of the table.
The IDX column of the table is the primary key. (so I suppose it is automatically indexed?)
Problem
I need to select all the rows whose IDX is in the text file.
My effort
Version 1
Make 7K queries, one for each line in the text file. This makes approximately 130 queries per second, costing about 1 minute to complete.
import pymysql

connection = pymysql.connect(....)
with connection.cursor() as cursor:
    query = (
        "SELECT *"
        " FROM TABLE1"
        " WHERE IDX = %(idx)s;"
    )
    all_selected = {}
    with open("idx_list.txt", "r") as f:
        for idx in f:
            idx = idx.strip()
            if idx:
                idx = int(idx)
                parameters = {"idx": idx}
                cursor.execute(query, parameters)
                result = cursor.fetchall()[0]
                all_selected[idx] = result
Version 2
Select the whole table, iterate over the cursor, and cherry-pick rows. The for loop over .fetchall_unbuffered() covers 30-40K rows per second, and the whole script takes about 3 minutes to complete.
import pymysql

connection = pymysql.connect(....)
with connection.cursor() as cursor:
    query = "SELECT * FROM TABLE1"
    set_of_idx = set()
    with open("idx_list.txt", "r") as f:
        for line in f:
            if line.strip():
                line = int(line.strip())
                set_of_idx.add(line)
    all_selected = {}
    cursor.execute(query)
    for row in cursor.fetchall_unbuffered():
        if row[0] in set_of_idx:
            all_selected[row[0]] = row[1:]
Expected behavior
I need to select faster, because the number of IDXs in the text file will grow to 10K-100K in the future.
I consulted other answers, including this one, but I can't make use of them since I only have read privilege, so it is impossible to create another table to join with.
So how can I make the selection faster?
A temporary table implementation would look like:
connection = pymysql.connect(...., local_infile=True)
with connection.cursor() as cursor:
    cursor.execute("CREATE TEMPORARY TABLE R (IDX INT PRIMARY KEY)")
    cursor.execute("LOAD DATA LOCAL INFILE 'idx_list.txt' INTO TABLE R")
    cursor.execute("SELECT TABLE1.* FROM TABLE1 JOIN R USING ( IDX )")
    ..
    cursor.execute("DROP TEMPORARY TABLE R")
Thanks to the hint (or more than a hint) from danblack, I was able to achieve the desired result with the following query.
query = (
    "SELECT *"
    " FROM TABLE1"
    " INNER JOIN R"
    " ON R.IDX = TABLE1.IDX;"
)
cursor.execute(query)
danblack's SELECT statement didn't work for me, raising an error:
pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'IDX' at line 1")
This is probably because of MariaDB's join syntax, so I consulted the MariaDB documentation on joining tables.
And now it selects 7K rows in 0.9 seconds.
Leaving here as an answer just for the sake of completeness, and for future readers.
I have a MySQL table I am attempting to access with R using RMySQL.
There are 1690004 rows that should be returned from
dbGetQuery(con, "SELECT * FROM tablename WHERE export_date ='2015-01-29'")
Unfortunately, I receive the following warning messages:
In is(object, Cl) : error while fetching row
In dbGetQuery(con, "SELECT * FROM tablename WHERE export_date ='2015-01-29'", : pending rows
And I only receive ~400K rows.
If I break the query into several "fetches" using dbSendQuery, the warning messages start appearing after ~400K rows are received.
Any help would be appreciated.
So, it looks like it was due to a 60 second timeout imposed by my hosting provider (damn Arvixe!). I got around this by "paging/chunking" the output. Because my data has an auto-incrementing primary key, every row returned is in order, allowing me to take the next X rows after each iteration.
To get 1.6M rows I did the following:
library(RMySQL)
con <- MySQLConnect() # mysql connection function
day <- '2015-01-29' # date of interest
numofids <- 50000 # number of rows to include in each 'chunk'
count <- dbGetQuery(con, paste0("SELECT COUNT(*) as count FROM tablename WHERE export_date = '",day,"'"))$count # get the number of rows returned from the table
dbDisconnect(con)
ns <- seq(0, count - 1, numofids) # sequence of offsets to work over (LIMIT offsets are zero-based)
tosave <- data.frame() # data frame to bind results to
# iterate through the table to get data in 50k-row chunks
for(nextseries in ns){ # for each offset
  print(nextseries) # print the offset it's on
  con <- MySQLConnect()
  d1 <- dbGetQuery(con, paste0("SELECT * FROM tablename WHERE export_date = '",day,"' LIMIT ", nextseries,",",numofids)) # extract data in chunks of 50k rows
  dbDisconnect(con)
  # bind data to the tosave data frame (the if/else avoids an error when rbind-ing d1 to an empty data frame on the first pass)
  if(nrow(tosave)>0){
    tosave <- rbind(tosave, d1)
  }else{
    tosave <- d1
  }
}
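Since the table has an auto-incrementing primary key, a keyset-pagination variant is also possible (a sketch only, assuming a hypothetical id column and the same MySQLConnect() helper); it avoids the increasingly expensive large offsets of LIMIT by always asking for the rows after the last id seen:
library(RMySQL)
day <- '2015-01-29'
chunk <- 50000
last_id <- 0            # assumes ids are positive integers
tosave <- list()
repeat {
  con <- MySQLConnect()
  d1 <- dbGetQuery(con, paste0(
    "SELECT * FROM tablename WHERE export_date = '", day,
    "' AND id > ", last_id, " ORDER BY id LIMIT ", chunk))
  dbDisconnect(con)
  if (nrow(d1) == 0) break          # no more rows for that date
  last_id <- max(d1$id)             # remember where this chunk ended
  tosave[[length(tosave) + 1]] <- d1
}
tosave <- do.call(rbind, tosave)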
I'd like to calculate the fitted values of a model fitted in R with glm on data from a MySQL table and pipe the result back into that table. How could I do that?
# preparation
library("RMySQL")
con <- dbConnect(MySQL(), user="####", password="###", host="127.0.0.1", dbname="###")
on.exit(dbDisconnect(con))
# fetching the data and calculating fit
tab <- dbGetQuery(con,"SELECT ID, dep, indep1, indep2 FROM table WHERE !(ISNULL(ID) OR ISNULL(dep) OR ISNULL(indep1) OR ISNULL(indep2))")
regression <- glm(dep ~ indep1 + indep2, family = gaussian, data = tab)
# preparing data for insertion
insert <- data.frame(tab$ID, fitted.values(regression))
colnames(insert) <- c('ID', 'dep')
# table hassle: insert the fit results, combine them with the previous table, delete the old and fit-result tables, and rename the combined table back to the original name
if (dbExistsTable(con, '_result')) {
dbRemoveTable(con, '_result');
}
dbWriteTable(con, '_result', insert)
dbSendQuery(con, 'CREATE TABLE temporary SELECT table.*, _result.dep FROM table LEFT JOIN _result USING (ID)')
dbRemoveTable(con, 'table')
dbRemoveTable(con, '_result')
dbSendQuery(con, 'RENAME TABLE temporary TO table')
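An alternative that avoids dropping and renaming the original table (a sketch only; the column name fitted is hypothetical, and `table`/ID stand in for the real names) is to stage the fitted values and update the table in place:
# write the fitted values to a staging table
dbWriteTable(con, '_result', insert, row.names = FALSE, overwrite = TRUE)
# add a column for the fitted values (run once)
dbSendQuery(con, 'ALTER TABLE `table` ADD COLUMN fitted DOUBLE')
# copy the fitted values over by joining on ID
dbSendQuery(con, 'UPDATE `table` t JOIN _result r ON t.ID = r.ID SET t.fitted = r.dep')
# clean up the staging table
dbRemoveTable(con, '_result')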