I have a large table in MySQL (about 3 million rows with 15 columns) and I'm trying to use some of the data from the table in an R Shiny app.
I've been able to get the connection and write a query in R:
library(DBI)
library(dplyr)
cn <- dbConnect(drv = RMySQL::MySQL(),
                username = "user",
                password = "my_password",
                host = "host",
                port = 3306)
query = "SELECT * FROM dbo.locations"
However, when I run dbGetQuery(cn, query) it takes a very long time (I ended up closing RStudio after it became unresponsive).
I also tried
res <- DBI::dbSendQuery(cn, query)
repeat {
  df <- DBI::dbFetch(res, n = 1000)
  if (nrow(df) == 0) { break }
}
dbClearResult(dbListResults(cn)[[1]])
since this is similar to reading the data in by chunks, but my resulting df has 0 rows for some reason.
Any suggestions on how to get my table into R? Should I even try to read that table into R? From what I understand, R doesn't handle large data very well.
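For what it's worth, the chunked loop above overwrites df on every dbFetch() call and only stops once an empty batch comes back, which is why df ends up with 0 rows. A sketch that accumulates the batches instead (same cn and query as above, batch size kept at 1000):
res <- DBI::dbSendQuery(cn, query)
chunks <- list()
repeat {
  chunk <- DBI::dbFetch(res, n = 1000)    # pull one batch of rows
  if (nrow(chunk) == 0) break
  chunks[[length(chunks) + 1]] <- chunk   # keep every batch
}
DBI::dbClearResult(res)
df <- do.call(rbind, chunks)              # or dplyr::bind_rows(chunks)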
Related
I use the DBI and RMySQL packages to import a whole table from the database. The code works as expected. I would like to know whether there is a faster way to import the same table multiple times. For example, I import the table, do some calculations, close the R session, and then import the same table again tomorrow. Is there a way to cache that table somehow so it can be imported faster?
The code example (working as expected):
library(RMySQL)
library(DBI)
# connect to database
connection <- function() {
  con <- DBI::dbConnect(RMySQL::MySQL(),
                        host = "91.234.xx.xxx",
                        port = 3306L,
                        dbname = "xxxx",
                        username = "xxxx",
                        password = "xxxx",
                        Trusted_Connection = "True")
}
# import
db <- connection()
vix <- DBI::dbGetQuery(db, 'SELECT * FROM VIX')
invisible(dbDisconnect(db))
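One common pattern (not shown above) is to cache the query result locally with saveRDS() and only hit the database when the cache is missing or stale; a minimal sketch, where the cache path and the "refresh once a day" rule are assumptions:
library(DBI)

cache_file <- "vix_cache.rds"   # hypothetical local cache path

if (file.exists(cache_file) && as.Date(file.mtime(cache_file)) == Sys.Date()) {
  vix <- readRDS(cache_file)    # reuse today's cached copy
} else {
  db <- connection()
  vix <- DBI::dbGetQuery(db, 'SELECT * FROM VIX')
  invisible(dbDisconnect(db))
  saveRDS(vix, cache_file)      # cache for the next session
}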
I have successfully inserted many JSON files (only chosen keys) into a local MongoDB. However, when a collection has a little more than 100 million rows to insert, my code seems very slow. I hope multiprocessing will help speed up the process, but I can't come up with the correct way of doing it without any conflict. Here is my code without multiprocessing:
import json
import os
from pymongo import MongoClient
client = MongoClient('localhost', 27017)
db = client[db_name]
# get file list
def log_list(log_folder):
    log_file = list()
    for entry in os.listdir(log_folder):
        if os.path.isfile(os.path.join(log_folder, entry)):
            log_path = os.path.join(log_folder, entry)
            log_file.append(log_path)
    return log_file

def func():
    collection = db[collection_name]
    print('loading folder_name')
    root = folder_path
    nfile = 0
    nrow = 0
    # insert data
    files = log_list(root)
    files.sort()
    for file in files:
        with open(file, 'r') as f:
            nfile += 1
            table = [json.loads(line) for line in f]
            for row in table:
                nrow += 1
                entry = {'timestamp': row['#timestamp'], 'user_id': row['user']['id'], 'action': row['#type']}
                collection.insert_one(entry).inserted_id
    client.close()
    print(nfile, 'file(s) processed.', nrow, 'row(s) loaded.')
Split your input into several files and run a separate copy of your program for each chunk. When writing to the database, use insert_many rather than insert_one so that each round trip carries a batch of documents (see the sketch below).
You can use Python multiprocessing to fork multiple parallel jobs.
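A sketch of the per-file insert using insert_many; the field names come from the question, while the batch size is an assumption to tune:
import json
from pymongo import MongoClient

BATCH_SIZE = 1000  # assumed; adjust to document size and memory

def insert_file(path, collection):
    """Insert one log file using batched insert_many calls."""
    batch = []
    with open(path, 'r') as f:
        for line in f:
            row = json.loads(line)
            batch.append({'timestamp': row['#timestamp'],
                          'user_id': row['user']['id'],
                          'action': row['#type']})
            if len(batch) >= BATCH_SIZE:
                collection.insert_many(batch)  # one round trip per batch
                batch = []
    if batch:
        collection.insert_many(batch)          # flush the remainder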
We do this in our project: users upload a lot of files for some task, and we handle it using distributed task queues with Celery.
Since this is a similar asynchronous task, Celery can do a great job here; it is designed to pick up tasks and execute them in separate worker processes (a minimal sketch follows after the documentation link below).
Create a task
Set up a broker (like Redis)
Run Celery in another terminal or in the background
Send the task (see task_name.apply_async() or task_name.delay())
https://docs.celeryproject.org/en/latest/index.html
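A minimal sketch of such a task; the module name, broker URL, and MongoDB connection details are assumptions, not taken from the question:
# tasks.py (hypothetical module)
import json

from celery import Celery
from pymongo import MongoClient

app = Celery('tasks', broker='redis://localhost:6379/0')  # assumed Redis broker

@app.task
def load_log_file(path, db_name, collection_name):
    """Parse one log file and bulk-insert the selected keys into MongoDB."""
    client = MongoClient('localhost', 27017)
    collection = client[db_name][collection_name]
    with open(path, 'r') as f:
        docs = [{'timestamp': row['#timestamp'],
                 'user_id': row['user']['id'],
                 'action': row['#type']}
                for row in (json.loads(line) for line in f)]
    if docs:
        collection.insert_many(docs)
    client.close()
    return len(docs)
Each file then becomes one queued task, e.g. load_log_file.delay(path, db_name, collection_name) inside the loop over log_list(root).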
I am trying to get a PyMySQL query in Lambda (Python 3.6) to return whether a user exists or not. I pass my Slack user ID into the query; this is what I want to check in MySQL. I can run the same query directly in MySQL and it returns 0, but for some reason, every time I call this query through Lambda, it tells me the user exists (my database is empty). My query function is this:
def userExists(user):
    statement = f"SELECT EXISTS(SELECT 1 FROM slackDB.Assets WHERE userID LIKE '%{user}%')Assets"
    tempBool = cursor.execute(statement, args=None)
    conn.commit()
    return tempBool
Here is the full code I am working with:
################################
# Slack Lambda handler.
################################
import sys
import logging
import os
import pymysql
import urllib
# Grab data from the environment.
BOT_TOKEN = os.environ["BOT_TOKEN"]
ASSET_TABLE = os.environ["ASSET_TABLE"]
REGION_NAME = os.getenv('REGION_NAME', 'us-east-2')
DB_NAME = "admin"
DB_PASSWORD = "somepassword"
DB_DATABASE = "someDB"
RDS_HOST = "myslackdb.somepseudourl.us-east-2.rds.amazonaws.com"
port = 3306
logger = logging.getLogger()
logger.setLevel(logging.INFO)
try:
    conn = pymysql.connect(RDS_HOST, user=DB_NAME, passwd=DB_PASSWORD, db=DB_DATABASE, connect_timeout=5)
    cursor = conn.cursor()
except:
    logger.error("ERROR: Unexpected error: Could not connect to MySql instance.")
    sys.exit()
# Define the URL of the targeted Slack API resource.
SLACK_URL = "https://slack.com/api/chat.postMessage"
def userExists(user):
    statement = f"SELECT EXISTS(SELECT 1 FROM slackDB.Assets WHERE userID LIKE '%{user}%')Assets"
    tempBool = cursor.execute(statement, args=None)
    conn.commit()
    return tempBool

def addUser(user):
    statement = f"INSERT INTO `slackDB`.`Assets` (`userID`, `money`) VALUES ('{user}', '1000')"
    tempBool = cursor.execute(statement, args=None)
    conn.commit()
    return tempBool
def lambda_handler(data, context):
    # Slack challenge answer.
    if "challenge" in data:
        return data["challenge"]

    # Grab the Slack channel data.
    slack_event = data['event']
    slack_userID = slack_event["user"]
    slack_text = slack_event["text"]
    channel_id = slack_event["channel"]
    slack_reply = ""

    # Ignore bot messages.
    if "bot_id" in slack_event:
        slack_reply = ""
    else:
        # Start data sift.
        if slack_text.startswith("!networth"):
            slack_reply = "Your networth is: "
        elif slack_text.startswith("!price"):
            command, asset = text.split()
            slack_reply = f"The price of a(n) {asset} is: "
        elif slack_text.startswith("!addme"):
            if userExists(slack_userID):
                slack_reply = f"User {slack_userID} already exists"
            else:
                slack_reply = f"Adding user {slack_userID}"
                addUser(slack_userID)

    # We need to send back three pieces of information:
    data = urllib.parse.urlencode(
        (
            ("token", BOT_TOKEN),
            ("channel", channel_id),
            ("text", slack_reply)
        )
    )
    data = data.encode("ascii")

    # Construct the HTTP request that will be sent to the Slack API.
    request = urllib.request.Request(
        SLACK_URL,
        data=data,
        method="POST"
    )
    # Add a header mentioning that the text is URL-encoded.
    request.add_header(
        "Content-Type",
        "application/x-www-form-urlencoded"
    )
    # Fire off the request!
    urllib.request.urlopen(request).read()

    # Everything went fine.
    return "200 OK"
I am typing '!addme' in slack and it always tells me the user exists. I have printed out my query statement and it is inputting my slack ID correctly. I have checked my table, and it is completely empty. I have run the query in MySQL and it returns a 0.
Does anyone have any ideas? Am I just derping this up on something easy? Any help or hints would be much appreciated.
Thanks,
I don't see a fetch from the cursor. Just the execute.
And the return from execute is the number of rows affected. For DML operations (INSERT/UPDATE/DELETE) that makes sense. But I wouldn't rely on the rows affected count for a SELECT.
In this case, the SELECT EXISTS query is going to either return a row, or throw an error. But the fact that the query returns a row doesn't tell us anything about the value of the Assets column.
From the query, it looks like we want to fetch a row, and then determine if the Assets column contains a 0 or 1 (or NULL).
After the query execution, try cursor.fetchone() to retrieve the row.
We could also execute a simpler query, and then use a fetch to determine if a row is returned or not.
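A sketch of the check along those lines; switching to a parameterized LIKE pattern is an adjustment on my part, not part of the original code:
def userExists(user):
    # Let the driver escape the value rather than interpolating it into the SQL.
    statement = "SELECT EXISTS(SELECT 1 FROM slackDB.Assets WHERE userID LIKE %s)"
    cursor.execute(statement, ('%' + user + '%',))
    row = cursor.fetchone()      # e.g. (0,) or (1,)
    return bool(row[0])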
I can't find any solution for loading a huge JSON file. I'm trying with the well-known Yelp dataset. It's 3.2 GB and I want to analyse 9 out of 10 columns. I need to skip importing the $text column, which will give me a much lighter file to load, probably about 70% smaller. I don't want to manipulate the file.
I tried many libraries and got stuck. I've found a solution for data.frame using a pipe function:
df <- read.table(pipe("cut -f1,5,28 myFile.txt"))
from this thread: Ways to read only select columns from a file into R? (A happy medium between `read.table` and `scan`?)
How to do it for JSON? I'd like to do:
json <- read.table(pipe("cut -text yelp_academic_dataset_review.json"))
but this of course throws an error due to the wrong format. Is there any possibility of doing this without parsing the whole file with regex?
EDIT
Structure of one row (I can't even count them all):
{"review_id":"vT2PALXWX794iUOoSnNXSA","user_id":"uKGWRd4fONB1cXXpU73urg","business_id":"D7FK-xpG4LFIxpMauvUStQ","stars":1,"date":"2016-10-31","text":"some long text here","useful":0,"funny":0,"cool":0,"type":"review"}
SECOND EDIT
Finally, I've created a loop to convert all the data into a CSV file, omitting the unwanted column. It's slow, but I've got 150 MB (zipped) from 3.2 GB.
# files to process
filepath <- jfile1
fcsv <- "c:/Users/cp/Documents/R DATA/Yelp/tests/reviews.csv"
write.table(x = paste(colns, collapse=","), file = fcsv, quote = F, row.names = F, col.names = F)
con = file(filepath, "r")
while (TRUE) {
  line = readLines(con, n = 1)
  if (length(line) == 0) {
    break
  }
  # regex process
  d <- NULL
  for (i in rcols) {
    pst <- paste(".*\"", colns[i], "\":*(.*?) *,\"", colns[i+1], "\":.*", sep = "")
    w <- sub(pst, "\\1", line)
    d <- cbind(d, noquote(w))
  }
  # save on the fly
  write.table(x = paste(d, collapse = ","), file = fcsv, append = T, quote = F, row.names = F, col.names = F)
}
close(con)
It could be saved as JSON as well. I wonder whether this is the most efficient way, but the other scripts I tested were slow and often had encoding issues.
Try this:
library(jsonlite)
df <- as.data.frame(fromJSON('yelp_academic_dataset_review.json', flatten=TRUE))
Then, once it is a data frame, delete the column(s) you don't need.
If you don't want to manipulate the file before importing it, I'm not sure what options you have in R. Alternatively, you could make a copy of the file, delete the text column from the copy with this script, import the copy into R, and then delete the copy.
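Since each review sits on its own line (see the sample row in the question), another possibility is jsonlite's stream_in(), which reads the file in pages so the text column can be dropped as it goes; a sketch, assuming the same file name and an arbitrary page size:
library(jsonlite)

chunks <- list()
stream_in(
  file("yelp_academic_dataset_review.json"),
  handler = function(page) {
    page$text <- NULL                       # drop the heavy column immediately
    chunks[[length(chunks) + 1]] <<- page   # keep the slimmed page
  },
  pagesize = 10000
)
reviews <- do.call(rbind, chunks)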
I'm using R to insert a data.frame into a MySQL database. The code below inserts 1000 rows at a time successfully, but it's not practical if I have a data.frame with tens of thousands of rows. How would you do a bulk insert using R? Is it even possible?
## R and MySQL
library(RMySQL)
### create sql connection object
mydb = dbConnect(MySQL(), dbname="db", user='xxx', password='yyy', host='localhost', unix.sock="/Applications/MAMP/mysql/mysql.sock")
# get data ready for mysql
df = data.format
# chunks
df1 <- df[1:1000,]
df2 <- df[1001:2000,]
df3 <- df[2001:nrow(df),]
## SQL insert for data.frame, limit 1000 rows
dbWriteTable(mydb, "table_name", df1, append=TRUE, row.names=FALSE)
dbWriteTable(mydb, "table_name", df2, append=TRUE, row.names=FALSE)
dbWriteTable(mydb, "table_name", df3, append=TRUE, row.names=FALSE)
For completeness, as the link suggests, write the df to a temp table and insert into the destination table as follows:
dbWriteTable(mydb, name = 'temp_table', value = df, row.names = F, append = F)
dbGetQuery(mydb, "insert into table_name select * from temp_table")
Fast bulk insert is now supported by the DBI-based ODBC package, see
this example posted by Jim Hester (https://github.com/r-dbi/odbc/issues/34):
library(DBI);
con <- dbConnect(odbc::odbc(), "MySQL")
dbWriteTable(con, "iris", head(iris), append = TRUE, row.names=FALSE)
dbDisconnect(con)
Since RMySQL is also DBI-based you just have to "switch" the DB connection
to use the odbc package (thanks to the standardized DBI interface of R).
Since the RMySQL package
... is being phased out in favor of the new RMariaDB package.
according to their web site (https://github.com/r-dbi/RMySQL) you could try switching the driver package to RMariaDB (perhaps they have already implemented a bulk insert feature).
For details see: https://github.com/r-dbi/RMariaDB
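A sketch of the same append through RMariaDB; the connection details are placeholders mirroring the question:
library(DBI)

con <- dbConnect(RMariaDB::MariaDB(),
                 dbname = "db", user = "xxx", password = "yyy",
                 host = "localhost")
dbWriteTable(con, "table_name", df, append = TRUE, row.names = FALSE)
dbDisconnect(con)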
If all else fails, you could put it in a loop:
for (i in 0:floor((nrow(df) - 1) / 1000)) {
  # clamp the upper index so the last chunk doesn't run past the end of df
  insert_set <- df[(i * 1000 + 1):min(nrow(df), (i + 1) * 1000), ]
  dbWriteTable(mydb, "table_name", insert_set, append = TRUE, row.names = FALSE)
}