How to parsimoniously refer to a data frame in RMySQL

I have a MySQL table that I am reading with the RMySQL package of R. I would like to be able to refer directly to the data frame stored in the table so I can interact with it seamlessly, rather than having to execute an RMySQL statement every time I want to do something. Is there a way to accomplish this? I tried:
data <- dbReadTable(conn = con, name = 'tablename')
For example, if I now want to check how many rows I have in this table I would run:
nrow(data)
Does this go through the database connection, or am I now storing the object "data" locally, defeating the whole purpose of using an external database?

data <- dbReadTable(conn = con, name = 'tablename')
This command downloads all the data into a local R data frame (assuming you have enough RAM). Any operations on data from that point forward do not go through the SQL connection.
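If the goal is just to query the table without first copying it all into R, there are alternatives (a sketch, not part of the original answer; con and tablename are the connection and table from the question, and the lazy-table approach assumes the dbplyr package is installed):
library(DBI)
library(dplyr)
# Runs COUNT(*) on the MySQL server; only the single number comes back to R
dbGetQuery(con, "SELECT COUNT(*) AS n FROM tablename")$n
# A lazy reference: no rows are fetched until you collect() or print the result
tbl_ref <- tbl(con, "tablename")
tbl_ref %>% count()   # translated to SQL and executed in the database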

How to manipulate/clean data located in a MySQL database using base R commands?

I've connected to a MySQL database using the RMariaDB package and, thanks to the dbplyr package, am able to adjust the data using dplyr commands directly from RStudio. However, there are some basic things I want to do that require base R functions (there are no equivalents in dplyr to my knowledge). Is there a way to clean this data using base R commands? Thanks in advance.
The answer to this arises from how the dbplyr package works. dbplyr translates certain dplyr commands into SQL. For example:
library(dplyr)
library(dbplyr)
data(mtcars)
# setup simulated database connection
df_postgre = tbl_lazy(mtcars, con = simulate_postgres())
# fetch first 5 records
first_five = df_postgre %>% head(5)
# view SQL translation
first_five %>% show_query()
# resulting SQL translation
<SQL>
SELECT *
FROM `df`
LIMIT 5
The major constraint of this approach is that dbplyr can only translate certain commands into SQL. So something like the following will fail:
# setup simulated database connection
df_postgre = tbl_lazy(mtcars, con = simulate_postgres())
# fetch first 5 records
first_five = df_postgre[1:5,]
While head(df, 5) and df[1:5,] produce identical output for data.frames in local R memory, dbplyr cannot translate developer intention, only specific dplyr commands. Hence these two commands are very different when working with database tables.
The other element to consider here is that, from R's point of view, database tables are effectively read-only. In R we can do
df = df %>%
mutate(new_var = 2*old_var)
and this changes the data held in memory. However, with a database the original data remains stored in the database and is transformed according to your instructions only when it is requested. There are ways to write completely new database tables from existing database tables; there are already several Q&As on this under the dbplyr tag.
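As a practical workaround for the base R question (a sketch, assuming the remote table is available as a lazy tbl named remote_tbl), you can push as much filtering as possible to the database with dplyr and then collect() the result into a local data.frame, at which point any base R function applies:
library(dplyr)
library(dbplyr)
local_df <- remote_tbl %>%
  filter(!is.na(old_var)) %>%   # translated to SQL and run inside the database
  collect()                     # pulls the result into local R memory
# From here on ordinary base R works, because local_df is a plain data.frame
local_df[1:5, ]
local_df$new_var <- 2 * local_df$old_var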

Why does R upload data much faster than KNIME or Workbench?

What I want to know is: what the heck happens under the hood when I upload data through R, and why is it so much faster than MySQL Workbench or KNIME?
I work with data and, every day, I upload data into a MySQL server. I used to upload data using KNIME since it was much faster than uploading with MySQL Workbench (select the table -> "import data").
Some info: the CSV has 4000 rows and 15 columns. The library I used in R is RMySQL. The node I used in KNIME is Database Writer.
library('RMySQL')
df=read.csv('C:/Users/my_user/Documents/file.csv', encoding = 'UTF-8', sep=';')
connection <- dbConnect(
  RMySQL::MySQL(),
  dbname = "db_name",
  host = "yyy.xxxxxxx.com",
  user = "vitor",
  password = "****"
)
dbWriteTable(connection, "table_name", df, append=TRUE, row.names=FALSE)
So, to test, I did the exact same process, using the same file. It took 2 minutes in KNIME and only seconds in R.
Everything happens under the hood! Data upload to a DB depends on factors such as the interface between the DB and the tool, network connectivity, the batch size used, the memory available to the tool, and the tool's own data-processing speed, and probably a few more. In your case the RMySQL package uses a batch size of 500 by default and KNIME only 1, so that is probably where the difference comes from. Try setting it to 500 in KNIME and then compare. I have no clue how MySQL Workbench works...
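If you want to verify where the time goes on the R side, timing the write call alone is enough for a rough comparison (a sketch reusing df and connection from the script above):
# Time just the upload step; compare this against the KNIME node's execution time
upload_time <- system.time(
  dbWriteTable(connection, "table_name", df, append = TRUE, row.names = FALSE)
)
print(upload_time)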

RMySQL - dbWriteTable() - The used command is not allowed with this MySQL version

I am trying to read a few Excel files into a dataframe and then write to a MySQL database. The following program is able to read the files and create the dataframe, but when it tries to write to the db using the dbWriteTable command, I get an error message:
Error in .local(conn, statement, ...) :
could not run statement: The used command is not allowed with this MySQL version
library(readxl)
library(RMySQL)
library(DBI)
mydb = dbConnect(RMySQL::MySQL(), host='<ip>', user='username', password='password', dbname="db",port=3306)
setwd("<directory path>")
file.list <- list.files(pattern='*.xlsx')
print(file.list)
dat = lapply(file.list, function(i){
  print(i)
  x = read_xlsx(i, sheet=NULL, range=cell_cols("A:D"), col_names=TRUE, skip=1, trim_ws=TRUE, guess_max=1000)
  x$file = i
  x
})
df = do.call("rbind.data.frame", dat)
dbWriteTable(mydb, name="table_name", value=df, append=TRUE )
dbDisconnect(mydb)
I checked the definition of the dbWriteTable function and it looks like it uses LOAD DATA LOCAL INFILE to store the data in the database. As per some other answered questions on Stack Overflow, I understand that the word LOCAL could be the cause for concern, but since it is already in the function definition, I don't know what I can do. Also, this statement uses "," as the separator, but my data has "," in some of the values, which is why I was interested in using dataframes, hoping they would preserve the source structure. But now I am not so sure.
Is there any other way/function to write the dataframe to the MySQL tables?
I solved this on my system by adding the following line to the my.cnf file on the server (you may need to use root and vi to edit!). In my case it is just below the '[mysqld]' line:
local-infile=1
Then restart the server.
Good luck!
You may need to change
dbWriteTable(mydb, name="table_name", value=df, append=TRUE )
to
dbWriteTable(mydb, name="table_name", value=df,field.types = c(artist="varchar(50)", song.title="varchar(50)"), row.names=FALSE, append=TRUE)
That way, you specify the field types in R and append data to your MySQL table.
Source: Unknown column in field list error Rmysql
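If changing the server configuration is not an option, one workaround (a sketch, not the package's documented fast path) is to build a plain INSERT statement with DBI and send it through the existing connection. This avoids LOAD DATA LOCAL INFILE entirely, and commas inside values are not a problem because DBI quotes them, though a very large data frame may exceed MySQL's max_allowed_packet and will load more slowly:
library(DBI)
# Build a single INSERT ... VALUES statement from the data frame (values are escaped by DBI)
sql <- sqlAppendTable(mydb, "table_name", df, row.names = FALSE)
dbExecute(mydb, sql)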

Shapefile to MDB with custom field structure [duplicate]

This question already has an answer here:
How to create table in mdb from dbf query
I have a Shapefile with 80,000 polygons that are grouped by a specific field called "OTA".
I wanted to convert each Shapefile (its attribute table) to an MDB database (not a Personal Geodatabase) with one table in it, with the same name as the Shapefile and a given field structure.
In the code I used I had to load two additional Python modules:
pypyodbc and adodbapi
The first module was used to create the mdb file for each shapefile and the second to create the table in the mdb and fill it with the data from the attribute table of the shapefile.
The code I came up with is the following:
import arcpy
import pypyodbc
import adodbapi

Folder = ur'C:\TestPO' # Folder to save the mdbs
FD = Folder+ur'\27ALLPO.shp' # Shapefile
Map = u'PO' # Map type
N = u'27' # Prefecture
OTAList = sorted(set([row[0] for row in arcpy.da.SearchCursor(FD,('OTA'))]))
cnt = 0
for OTAvalue in OTAList:
    cnt += 1
    dbname = N+OTAvalue+Map
    pypyodbc.win_create_mdb(Folder+'\\'+dbname+'.mdb')
    conn_str = (r"Provider=Microsoft.Jet.OLEDB.4.0;Data Source="+Folder+"\\"+dbname+ur".mdb;")
    conn = adodbapi.connect(conn_str)
    crsr = conn.cursor()
    SQL = "CREATE TABLE ["+dbname+"] ([FID] INT,[AREA] FLOAT,[PERIMETER] FLOAT,[KA_PO] VARCHAR(10),[NOMOS] VARCHAR(2),[OTA] VARCHAR(3),[KATHGORPO] VARCHAR(2),[KATHGORAL1] VARCHAR(2),[KATHGORAL2] VARCHAR(2),[LABEL_PO] VARCHAR(8),[PHOTO_45] VARCHAR(14),[PHOTO_60] VARCHAR(10),[PHOTO_PO] VARCHAR(8),[POLY_X_CO] DECIMAL(10,3),[POLY_Y_CO] DECIMAL(10,3),[PINAKOKXE] VARCHAR(11),[LANDTYPE] DECIMAL(2,0));"
    crsr.execute(SQL)
    conn.commit()
    with arcpy.da.SearchCursor(FD,['FID','AREA','PERIMETER','KA_PO','NOMOS','OTA','KATHGORPO','KATHGORAL1','KATHGORAL2','LABEL_PO','PHOTO_45','PHOTO_60','PHOTO_PO','POLY_X_CO','POLY_Y_CO','PINAKOKXE','LANDTYPE'],'"OTA" = \'{}\''.format(OTAvalue)) as cur:
        for row in cur:
            crsr.execute("INSERT INTO "+dbname+" VALUES ("+str(row[0])+","+str(row[1])+","+str(row[2])+",'"+row[3]+"','"+row[4]+"','"+row[5]+"','"+row[6]+"','"+row[7]+"','"+row[8]+"','"+row[9]+"','"+row[10]+"','"+row[11]+"','"+row[12]+"',"+str(row[13])+","+str(row[14])+",'"+row[15]+"',"+str(row[16])+");")
    conn.commit()
    crsr.close()
    conn.close()
    print (u'«'+OTAvalue+u'» ('+str(cnt)+u'/'+str(len(OTAList))+u')')
Executing this code took about 5 minutes to complete the task for about 140 mdbs.
As you can see, I execute an "INSERT INTO" statement for each record of the shapefile.
Is this the correct way (and probably the fastest) or should I collect all the statements for each "OTA" and execute them all together?
I don't think anyone's going to write your code for you, but if you try some VBA yourself, and tell us what happened and what worked and what you're stuck on, you'll get a great response.
That said, to start with I don't see any reason to use VB6 when you can use VBA right inside your mdb file.
Use the Dir command and possibly FileSystemObject to loop through all DBFs in a given folder, or use the FileDialog object to select multiple files in one go.
Then process each file with the DoCmd.TransferDatabase command:
DoCmd.TransferDatabase _
    TransferType:=acImport, _
    DatabaseType:="dBASE III", _
    DatabaseName:="your-dbf-filepath", _
    ObjectType:=acTable, _
    Source:="Source", _
    Destination:="your-newtbldbf"
Finally, process each dbf import with a make-table query.
Look at the results and see what might have to be changed based on field types before and after.
Then edit your post and let us know how it went.
In theory you could do something like this by searching the directory the DBF files reside in, writing those filenames to a table, then looping through the table and, for each filename, scanning the DBF for tables and their fieldnames/datatypes and creating those tables in your MDB. You could also bring in all the data from the tables, all within a series of loops.
In theory, you could.
In practice, you can't. And you can't, because DBF and MDB support different data types that aren't compatible.
I suppose you could create a "crosswalk" table such that for each datatype in DBF there is a corresponding, hand-picked datatype in MDB and use that when you're creating the table, but it's probably going to either fail to import some of the data or import corrupted data. And that's assuming you can open a DBF for reading the same way you can open an MDB for reading. Can you run OpenDatabase on a DBF from inside Access? I don't even have the answer to that.
I wouldn't recommend that you do this. The reason that you're doing it is because you want to keep the structure as similar as possible when migrating from dBase/FoxBase to Access. However, the file structure is different between them.
As you are aware, each .DBF ("Database file") file is a table, and the folder or directory in which the .DBF files reside constitutes the "database". With Access, all the tables in one database are in one .MDB ("Microsoft Database") file.
If you try to put each .DBF file in a separate .MDB file, you will have no end of trouble getting the .MDB files to interact. Access treats different .MDB files as different databases, not different tables in the same database, and you will have to do strange things like link all the separate databases just to have basic relational functionality. (I tried this about 25 years ago with Paradox files, which are also a one-file-per-table structure. It didn't take me long to decide it was easier to get used to the one-file-per-database concept.) Do yourself a favor, and migrate all of the .DBF files in one folder into a single .MDB file.
As for what you ought to do with your code, I'd first suggest that you use ADO rather than DAO. But if you want to stick with DAO because you've been using it, then you need to have one connection to the dBase file and another to the Access database. As far as I can tell, you don't have the dBase connection. I've never tried what you're doing before, but I doubt you can use a SQL statement to select directly from a .dbf file in the way you're doing. (I could be wrong, though; Microsoft has come up with stranger things over the years.)

Most effective way to push data from a SQL Server database into a Greenplum database?

Greenplum Database version:
PostgreSQL 8.2.15 (Greenplum Database 4.2.3.0 build 1)
SQL Server Database version:
Microsoft SQL Server 2008 R2 (SP1)
Our current approach:
1) Export each table to a flat file from SQL Server
2) Load the data into Greenplum with pgAdmin III using PSQL Console's psql.exe utility
Benefits...
Speed: OK, but is there anything faster? We load millions of rows of data in minutes
Automation: OK, we call this utility from an SSIS package using a Shell script in VB
Pitfalls...
Reliability: ETL is dependent on the file server to hold the flat files
Security: Lots of potentially sensitive data on the file server
Error handling: It's a problem. psql.exe never raises an error that we can catch even if it does error out and loads no data or a partial file
What else we have tried...
.Net Providers\Odbc Data Provider: We have configured a System DSN using DataDirect 6.0 Greenplum Wire Protocol. Good performance for a DELETE. Dog awful slow for an INSERT.
For reference, this is the aforementioned VB script in SSIS...
Public Sub Main()
    Dim v_shell
    Dim v_psql As String
    v_psql = "C:\Program Files\pgAdmin III\1.10\psql.exe -d ""MyGPDatabase"" -h ""MyGPHost"" -p ""5432"" -U ""MyServiceAccount"" -f \\MyFileLocation\SSIS_load\sql_files\load_MyTable.sql"
    v_shell = Shell(v_psql, AppWinStyle.NormalFocus, True)
End Sub
This is the contents of the "load_MyTable.sql" file...
\copy MyTable from '\\MyFileLocation\SSIS_load\txt_files\MyTable.txt' with delimiter as ';' csv header quote as '"'
If you're getting your data load done in minutes, then the current method is probably good enough. However, if you find yourself having to load larger volumes of data (terabyte scale, for instance), the usual preferred method for bulk-loading into Greenplum is via gpfdist and corresponding EXTERNAL TABLE definitions. gpload is a decent wrapper that provides abstraction over much of this process and is driven by YAML control files. The general idea is that gpfdist instance(s) are spun up at the location(s) where your data is staged, preferably as CSV text files, and then the EXTERNAL TABLE definition within Greenplum is made aware of the URIs of the gpfdist instances. From the admin guide, a sample definition of such an external table could look like this:
CREATE READABLE EXTERNAL TABLE students (
name varchar(20), address varchar(30), age int)
LOCATION ('gpfdist://<host>:<portNum>/file/path/')
FORMAT 'CUSTOM' (formatter=fixedwidth_in,
name=20, address=30, age=4,
preserve_blanks='on',null='NULL');
The above example expects to read text files whose fields from left to right are a 20-character (at most) string, a 30-character string, and an integer. To actually load this data into a staging table inside GP:
CREATE TABLE staging_table AS SELECT * FROM students;
For large volumes of data, this should be the most efficient method since all segment hosts are engaged in the parallel load. Do keep in mind that the simplistic approach above will probably result in a randomly distributed table, which may not be desirable. You'd have to customize your table definitions to specify a distribution key.