Most effective way to push data from a SQL Server database into a Greenplum database? - sql-server-2008

Greenplum Database version:
PostgreSQL 8.2.15 (Greenplum Database 4.2.3.0 build 1)
SQL Server Database version:
Microsoft SQL Server 2008 R2 (SP1)
Our current approach:
1) Export each table to a flat file from SQL Server
2) Load the data into Greenplum with pgAdmin III using PSQL Console's psql.exe utility
Benefits...
Speed: OK, but is there anything faster? We load millions of rows of data in minutes
Automation: OK, we call this utility from an SSIS package using a Shell script in VB
Pitfalls...
Reliability: ETL is dependent on the file server to hold the flat files
Security: Lots of potentially sensitive data on the file server
Error handling: It's a problem. psql.exe never raises an error that we can catch even if it does error out and loads no data or a partial file
What else we have tried...
.Net Providers\Odbc Data Provider: We have configured a System DSN using the DataDirect 6.0 Greenplum Wire Protocol driver. Good performance for a DELETE; painfully slow for an INSERT.
For reference, this is the aforementioned VB script in SSIS...
Public Sub Main()

    Dim v_shell As Integer
    Dim v_psql As String

    ' Quote the exe path (it contains spaces) and double the embedded quotes inside the VB string.
    v_psql = """C:\Program Files\pgAdmin III\1.10\psql.exe"" -d ""MyGPDatabase"" -h ""MyGPHost"" -p ""5432"" -U ""MyServiceAccount"" -f ""\\MyFileLocation\SSIS_load\sql_files\load_MyTable.sql"""

    ' Wait:=True blocks until psql exits; Shell returns the process ID.
    v_shell = Shell(v_psql, AppWinStyle.NormalFocus, True)

End Sub
This is the contents of the "load_MyTable.sql" file...
\copy MyTable from '\\MyFileLocation\SSIS_load\txt_files\MyTable.txt' with delimiter as ';' csv header quote as '"'
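A side note on the error-handling pitfall: psql does report failures through its exit status if ON_ERROR_STOP is set. Below is a minimal sketch (Python purely for illustration; the paths, host and account names are the placeholders from above) of wrapping the same call so a failed or partial load can be caught by the caller:
import subprocess

# Same psql call as above, plus -v ON_ERROR_STOP=1 so psql aborts on the first
# error and exits with a nonzero status instead of silently loading nothing.
cmd = [
    r"C:\Program Files\pgAdmin III\1.10\psql.exe",
    "-d", "MyGPDatabase",
    "-h", "MyGPHost",
    "-p", "5432",
    "-U", "MyServiceAccount",
    "-v", "ON_ERROR_STOP=1",
    "-f", r"\\MyFileLocation\SSIS_load\sql_files\load_MyTable.sql",
]

result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
    # Surface the failure so the SSIS step (or any other caller) can fail loudly.
    raise RuntimeError("psql load failed: " + result.stderr)
The same flag can be added to the VB command line above; to read the exit code there you would use System.Diagnostics.Process (WaitForExit/ExitCode) rather than Shell, which only returns a process ID.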

If you're getting your data load done in minutes, then the current method is probably good enough. However, if you find yourself having to load larger volumes of data (terabyte scale, for instance), the usual preferred method for bulk-loading into Greenplum is via gpfdist and corresponding EXTERNAL TABLE definitions. gpload is a decent wrapper that abstracts much of this process and is driven by YAML control files. The general idea is that gpfdist instance(s) are spun up at the location(s) where your data is staged, preferably as CSV text files, and then the EXTERNAL TABLE definition within Greenplum is made aware of the URIs of the gpfdist instances. From the admin guide, a sample definition of such an external table could look like this:
CREATE READABLE EXTERNAL TABLE students (
    name varchar(20), address varchar(30), age int)
LOCATION ('gpfdist://<host>:<portNum>/file/path/')
FORMAT 'CUSTOM' (formatter=fixedwidth_in,
    name=20, address=30, age=4,
    preserve_blanks='on', null='NULL');
The above example expects to read fixed-width text files whose fields, from left to right, are a 20-character string, a 30-character string, and a 4-character integer field. To actually load this data into a staging table inside GP:
CREATE TABLE staging_table AS SELECT * FROM students;
For large volumes of data, this should be the most efficient method since all segment hosts are engaged in the parallel load. Do keep in mind that the simplistic approach above will probably result in a randomly distributed table, which may not be desirable. You'd have to customize your table definitions to specify a distribution key.
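To make the distribution-key point concrete, here is a rough sketch (Python with psycopg2, purely illustrative; the host, port, file pattern and distribution column are assumptions, and a gpfdist instance is assumed to already be serving the staged CSV files) of defining a CSV-backed external table and loading it into a hash-distributed staging table:
import psycopg2  # assumes network access to the Greenplum master

conn = psycopg2.connect(host="gp-master", port=5432,
                        dbname="MyGPDatabase", user="MyServiceAccount")
cur = conn.cursor()

# External table over the gpfdist location; every segment pulls rows in parallel.
cur.execute("""
    CREATE READABLE EXTERNAL TABLE ext_students (
        name    varchar(20),
        address varchar(30),
        age     int
    )
    LOCATION ('gpfdist://etl-host:8081/students*.csv')
    FORMAT 'CSV' (HEADER)
""")

# Load into a staging table with an explicit distribution key instead of
# relying on whatever distribution a bare CTAS would pick.
cur.execute("""
    CREATE TABLE staging_students AS
    SELECT * FROM ext_students
    DISTRIBUTED BY (name)
""")

conn.commit()
cur.close()
conn.close()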

Related

Shapefile to MDB with custom field structure [duplicate]

This question already has an answer here: "How to create table in mdb from dbf query" (1 answer). Closed 6 years ago.
I have a Shapefile with 80,000 polygons that are grouped by a specific field called "OTA".
I want to convert each Shapefile (its attribute table) to an .mdb database (not a Personal Geodatabase) with one table in it, with the same name as the Shapefile and with a given field structure.
In the code I used, I had to load two additional Python modules:
pypyodbc and adodbapi
The first module was used to create the .mdb file for each shapefile, and the second to create the table in the .mdb and fill it with the data from the attribute table of the shapefile.
The code I came up with is the following:
import arcpy       # needed for the search cursors below
import pypyodbc
import adodbapi

Folder = ur'C:\TestPO'           # Folder to save the mdbs
FD = Folder + ur'\27ALLPO.shp'   # Shapefile
Map = u'PO'                      # Map type
N = u'27'                        # Prefecture

OTAList = sorted(set([row[0] for row in arcpy.da.SearchCursor(FD, ('OTA'))]))
cnt = 0

for OTAvalue in OTAList:
    cnt += 1
    dbname = N + OTAvalue + Map

    # Create an empty .mdb for this OTA group, then connect to it through ADO
    pypyodbc.win_create_mdb(Folder + '\\' + dbname + '.mdb')
    conn_str = (r"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + Folder + "\\" + dbname + ur".mdb;")
    conn = adodbapi.connect(conn_str)
    crsr = conn.cursor()

    SQL = "CREATE TABLE ["+dbname+"] ([FID] INT,[AREA] FLOAT,[PERIMETER] FLOAT,[KA_PO] VARCHAR(10),[NOMOS] VARCHAR(2),[OTA] VARCHAR(3),[KATHGORPO] VARCHAR(2),[KATHGORAL1] VARCHAR(2),[KATHGORAL2] VARCHAR(2),[LABEL_PO] VARCHAR(8),[PHOTO_45] VARCHAR(14),[PHOTO_60] VARCHAR(10),[PHOTO_PO] VARCHAR(8),[POLY_X_CO] DECIMAL(10,3),[POLY_Y_CO] DECIMAL(10,3),[PINAKOKXE] VARCHAR(11),[LANDTYPE] DECIMAL(2,0));"
    crsr.execute(SQL)
    conn.commit()

    # Copy every row of this OTA group from the shapefile into the new table
    with arcpy.da.SearchCursor(FD,['FID','AREA','PERIMETER','KA_PO','NOMOS','OTA','KATHGORPO','KATHGORAL1','KATHGORAL2','LABEL_PO','PHOTO_45','PHOTO_60','PHOTO_PO','POLY_X_CO','POLY_Y_CO','PINAKOKXE','LANDTYPE'],'"OTA" = \'{}\''.format(OTAvalue)) as cur:
        for row in cur:
            crsr.execute("INSERT INTO "+dbname+" VALUES ("+str(row[0])+","+str(row[1])+","+str(row[2])+",'"+row[3]+"','"+row[4]+"','"+row[5]+"','"+row[6]+"','"+row[7]+"','"+row[8]+"','"+row[9]+"','"+row[10]+"','"+row[11]+"','"+row[12]+"',"+str(row[13])+","+str(row[14])+",'"+row[15]+"',"+str(row[16])+");")
    conn.commit()
    crsr.close()
    conn.close()
    print (u'«'+OTAvalue+u'» ('+str(cnt)+u'/'+str(len(OTAList))+u')')
Executing this code took about 5 minutes to complete the task for about 140 mdbs.
As you can see, I execute an "INSERT INTO" statement for each record of the shapefile.
Is this the correct way (and probably the fastest) or should I collect all the statements for each "OTA" and execute them all together?
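Not an authoritative answer, but one way to avoid building a SQL string per row is a parameterized executemany per "OTA" group, which also sidesteps quoting problems. A rough sketch of just the insert part, meant to run inside the same for OTAvalue in OTAList loop with the same crsr, conn, FD and dbname as above (adodbapi uses ? parameter markers):
fields = ['FID', 'AREA', 'PERIMETER', 'KA_PO', 'NOMOS', 'OTA', 'KATHGORPO',
          'KATHGORAL1', 'KATHGORAL2', 'LABEL_PO', 'PHOTO_45', 'PHOTO_60',
          'PHOTO_PO', 'POLY_X_CO', 'POLY_Y_CO', 'PINAKOKXE', 'LANDTYPE']

# One parameterized INSERT reused for every row of this OTA group.
insert_sql = ('INSERT INTO [' + dbname + '] VALUES (' +
              ','.join(['?'] * len(fields)) + ')')

where = '"OTA" = \'{}\''.format(OTAvalue)
with arcpy.da.SearchCursor(FD, fields, where) as cur:
    rows = [tuple(row) for row in cur]

crsr.executemany(insert_sql, rows)   # one batch per OTA instead of one string per record
conn.commit()
With the Jet provider the rows are still sent one at a time under the hood, so don't expect a dramatic speedup; committing once per OTA, as you already do, is the bigger performance lever.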
I don't think anyone's going to write your code for you, but if you try some VBA yourself, and tell us what happened and what worked and what you're stuck on, you'll get a great response.
That said, to start with, I don't see any reason to use VB6 when you can use VBA right inside your .mdb file.
Use the Dir() function (and possibly FileSystemObject) to loop through all the DBFs in a given folder, or use the FileDialog object to select multiple files in one go.
Then process each file with the DoCmd.TransferDatabase method:
DoCmd.TransferDatabase _
    TransferType:=acImport, _
    DatabaseType:="dBASE III", _
    DatabaseName:="your-dbf-filepath", _
    ObjectType:=acTable, _
    Source:="Source", _
    Destination:="your-newtbldbf"
Finally process each dbf import with a make table query
Look at results and see what might have to be changed based on field types before and after.
Then .... edit your post and let us know how it went
In theory you could do something like this by searching the directory the DBF files reside in, writing those filenames to a table, then loop through the table and, for each filename, scan the DBF for tables and their fieldnames/datatypes and create those tables in your MDB. You could also bring in all the data from the tables, all within a series of loops.
In theory, you could.
In practice, you can't. And you can't, because DBF and MDB support different data types that aren't compatible.
I suppose you could create a "crosswalk" table such that for each datatype in DBF there is a corresponding, hand-picked datatype in MDB and use that when you're creating the table, but it's probably going to either fail to import some of the data or import corrupted data. And that's assuming you can open a DBF for reading the same way you can open an MDB for reading. Can you run OpenDatabase on a DBF from inside Access? I don't even have the answer to that.
I wouldn't recommend that you do this. The reason that you're doing it is because you want to keep the structure as similar as possible when migrating from dBase/FoxBase to Access. However, the file structure is different between them.
As you are aware, each .DBF ("Database file") file is a table, and the folder or directory in which the .DBF files reside constitutes the "database". With Access, all the tables in one database are in one .MDB ("Microsoft Database") file.
If you try to put each .DBF file in a separate .MDB file, you will have no end of trouble getting the .MDB files to interact. Access treats different .MDB files as different databases, not different tables in the same database, and you will have to do strange things like link all the separate databases just to have basic relational functionality. (I tried this about 25 years ago with Paradox files, which are also a one-file-per-table structure. It didn't take me long to decide it was easier to get used to the one-file-per-database concept.) Do yourself a favor, and migrate all of the .DBF files in one folder into a single .MDB file.
As for what you ought to do with your code, I'd first suggest that you use ADO rather than DAO. But if you want to stick with DAO because you've been using it, then you need to have one connection to the dBase file and another to the Access database. As far as I can tell, you don't have the dBase connection. I've never tried what you're doing before, but I doubt you can use a SQL statement to select directly from a .dbf file in the way you're doing. (I could be wrong, though; Microsoft has come up with stranger things over the years.)

How to parsimoniously refer to a data frame in RMySQL

I have a MySQL table that I am reading with the RMySQL package of R. I would like to be able to refer directly to the data frame stored in the table so I can seamlessly interact with it, rather than having to execute an RMySQL statement every time I want to do something. Is there a way to accomplish this? I tried:
data <- dbReadTable(conn = con, name = 'tablename')
For example, if I now want to check how many rows I have in this table I would run:
nrow(data)
Does this go through the database connection, or am I now storing the object "data" locally, defeating the whole purpose of using an external database?
data <- dbReadTable(conn = con, name = 'tablename')
This command downloads all the data into a local R dataframe (assuming you have enough RAM). Any operations with data from that point forward do not require the SQL connection.
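For what it's worth, the same trade-off exists in any client library; a Python/pandas sketch for illustration only (connection string and table name are placeholders): read_sql_table pulls the whole table into local memory, while a COUNT query keeps the work on the server.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@host/dbname")  # placeholder

data = pd.read_sql_table("tablename", engine)   # entire table copied locally
print(len(data))                                # row count computed in local memory

print(pd.read_sql("SELECT COUNT(*) AS n FROM tablename", engine))  # counted by MySQL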

SSIS - Export multiple SQL Server tables to multiple text files

I have to move data between two SQL Server DBs. My task is to export the data as text (.dat) files, move the files and import into the destination. I have to migrate over 200 tables.
This is what I tried
1) I used an Execute SQL task to fetch my tables.
2) Used a For Each loop to loop through the table names from the collection.
3) Used a Script task inside the For Each loop to build the text file destination path.
4) Called a DFT with the table name in a variable for the OLE DB source and the path name in a variable for the flat file destination.
The first table extracts fine, but the second table bombs with a synchronization error. I have seen this in numerous posts but could not find one that matches my scenario, hence posting here.
Even if I get the package to work with multiple DFTs, the second table from the second DFT does not export columns because the flat file connection manager still remembers the first table's columns. Is there a way to get it to forget the columns?
Any thoughts on how I can export multiple tables to multiple text files using one DFT with dynamic source and destination variables?
Thanks and appreciate your help.
Unfortunately, the Bulk Insert Task only lets us use format files effectively to map the columns between source and destination. The Bulk Insert Task uses the BULK INSERT T-SQL command to import the data; to execute it, the user needs the BULKADMIN server privilege.
Most companies will not grant the BULKADMIN server privilege, for security reasons.
Hence, using a Script task to construct BCP statements is a good and simple option for the export.
You do not need to construct a .bat file, as the script itself can execute DOS commands, which run under the .NET security account.
I figured out a way to do this. I thought I will share if anybody is stuck in the same situation.
So, in summary, I needed to export and import data via files. I also wanted to use a format file if at all possible for various reasons.
What I did was
1) Construct a DFT which gets me a list of table names from the DB that I need to export. I used an OLE DB source and a Recordset Destination as the target, and stored the table names inside an object variable.
A DFT is not really necessary. You can do it any other way. Also, in our application, we store the table names in a table.
2) Add a 'For each loop container' with a 'For Each ADO Enumerator' which takes my object variable from the previous step into the collection.
3) Parse the variable one by one and construct BCP statements like below inside a Script task. Create variables as necessary. The BCP statement will be stored in a variable.
I loop through the tables and construct multiple BCP statements like this.
BCP "DBNAME.DBO.TABLENAME1" out "PATH\FILENAME1.dat" -S SERVERNAME -T -t"|" -r$\n -f "PATH\FILENAME1.fmt"
BCP "DBNAME.DBO.TABLENAME2" out "PATH\FILENAME2.dat" -S SERVERNAME -T -t"|" -r$\n -f "PATH\FILENAME2.fmt"
The statements are put inside a .bat file. This is also done inside the script task.
4) An Execute Process task will next execute the .bat file. I had to do this because I do not have the option to use the 'master..xp_cmdshell' command or the 'BULK INSERT' command in my company. If I had the option to execute xp_cmdshell, I could have run the command directly from the package.
5) Again add a 'For each loop container' with a 'For Each ADO Enumerator' which takes my object variable from the previous step into the collection.
6) Parse the variable one by one and construct BCP statements like this inside a Script task. Create variables as necessary. The BCP statement will be stored in a variable.
I loop through the tables and construct multiple BCP statements like this.
BCP "DBNAME.DBO.TABLENAME1" in "PATH\FILENAME1.dat" -S SERVERNAME -T -t"|" -r$\n -b10000 -f "PATH\FILENAME1.fmt"
BCP "DBNAME.DBO.TABLENAME2" in "PATH\FILENAME2.dat" -S SERVERNAME -T -t"|" -r$\n -b10000 -f "PATH\FILENAME2.fmt"
The statements are put inside a .bat file. This is also done inside the script task.
The -b10000 option was added so I can import in batches. Without it, many of my large tables could not be copied due to insufficient space in tempdb.
7) Run the .bat file to import the file again.
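As an illustration of steps 5-7, the same BCP import commands could also be built and run outside of a .bat file; here is a rough Python sketch (server, database, paths and the table list are placeholders, and the field/row terminators would have to match your format files):
import subprocess

tables = ["TABLENAME1", "TABLENAME2"]        # normally read from the database
data_dir = r"\\MyFileServer\dat_files"       # placeholder paths
fmt_dir = r"\\MyFileServer\fmt_files"

for table in tables:
    cmd = [
        "bcp", "DBNAME.dbo." + table, "in",
        data_dir + "\\" + table + ".dat",
        "-S", "SERVERNAME",
        "-T",                      # trusted connection
        "-t|",                     # field terminator
        r"-r\n",                   # row terminator
        "-b10000",                 # commit in 10,000-row batches to spare tempdb
        "-f", fmt_dir + "\\" + table + ".fmt",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError("BCP import failed for " + table + ": " + result.stderr)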
I am not sure if this is the best solution; I still thought I would share what satisfied my requirement. If my answer is not clear, I would be happy to explain further if you have any questions. This solution can also be optimized. The same can be done purely via VB scripts, but you would have to write some code to do that.
I also created a package configuration file where I can change the DB name, server name, the data and format file locations dynamically.
Thanks.

Import database dump to MySQL using Visual FoxPro

I used Leafe's stru2mysql.prg and vfp2mysql_upload.prg to create a .sql dump file from DBFs. I connect to the MySQL database from VFP using ODBC. I know how to upload the SQL dump file, but I need to automate the whole process, i.e. after creating the dump file, my Visual FoxPro program should upload it without a third party (automatically). I thought of using the source command, but that needs to be run at the mysql prompt. The assumption here is that my end users don't know how to import (which most of them don't). Please advise on how I can automate the importation of the SQL file into the MySQL database. Thank you.
I think what you are looking for are the various SQL* functions in Foxpro. See the VFP help or MSDN on SQLCONNECT (or SQLSTRINGCONNECT), SQLEXEC, and SQLDISCONNECT functions to get you started. Microsoft provided good examples on each in the documentation.
You may also want to use FILETOSTR to get the output from Leafe's programs into a string for the SQLEXEC function.
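Purely as an illustration of that flow (read the dump into a string, then execute its statements over the existing ODBC connection), here is the idea sketched in Python with pyodbc; in VFP the equivalents are FILETOSTR() plus SQLEXEC() in a loop. The DSN name, the file path and the naive split on ';' are all assumptions:
import pyodbc

conn = pyodbc.connect("DSN=MyMySQLDSN")        # same ODBC DSN the VFP app already uses
cur = conn.cursor()

with open(r"C:\exports\dump.sql") as f:
    dump = f.read()

# Crude statement splitter: breaks if a ';' appears inside string literals.
for statement in dump.split(";"):
    statement = statement.strip()
    if statement:
        cur.execute(statement)

conn.commit()
conn.close()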
Here's the steps I use to take data from a Visual FoxPro Database and upload to a MySql Database. These are all put into a custom method on a form, which is fired by a command button. For example the method would be 'uploadnewdata' and I pass parameters for whichever data tables I need
1) Connect to the Server - I use MySql ODBC
2) Validate the user (this uses a SQLEXEC to pull the correct matching record from the users table):
IF M.WorkingDatabase <> -1
    nRetVal = SQLEXEC(m.WorkingDatabase, "SELECT * FROM users", "csrUsersOnServer")
    SELECT csrUsersOnServer
    SELECT userid FROM csrUsersOnServer ;
        WHERE ALLTRIM(UPPER(userid)) = ALLTRIM(UPPER(lcRanchUser)) ;
        AND ALLTRIM(UPPER(lcPassWord)) = ALLTRIM(UPPER(lchPassWord)) ;
        INTO CURSOR ValidUsers
    IF _TALLY >= 1
        * Matching user found - continue
    ELSE
        =MESSAGEBOX("Your Premise ID Does Not Match Any Records On The Server", "System Message")
        RETURN 0
    ENDIF
ELSE
    =MESSAGEBOX("Unable To Connect To Your Database", "System Message")
    RETURN 0
ENDIF
3) Once that is successful I create my base cursor (this is the one I'm sending from)
4) I then loop through that cursor, creating variables for the values in the fields
5) Then, using SQLEXEC and INSERT INTO, I upload each record
6) Once the program has finished processing the cursor, it generates a message box with a 'finished' message and control returns to the form.
All the user has to do is select the starting table and enter their login information.

Share 1 table between 2 different types of databases

The problem that I have is that I want to synchronize one table between two different databases.
Database 1 is on an XP server with MySQL.
Database 2 is on a Novell server with Clarion.
Is it possible to share one table, users, between the two databases?
So when data is put into database 1, it is automatically synchronized with database 2, and the users table ends up identical in both databases?
Thanks in advance!
Diederik,
Your question isn't very clear in that we don't know if you have access to the source code or can only operate on a database level.
You didn't mention clearly if you're using Clarion to drive those databases. I'm assuming you are, since you tagged your post with it.
Also, you didn't mention which file format you're using on the Novell server. I'm assuming you are using the TopSpeed file format; here's a bit of information about it: most programmers think it is the "native" file format for Clarion for Windows. It is not. Clarion for Windows doesn't have such a thing as a native file format, but employs a totally driver-driven approach. Clarion Professional Developer (a DOS IDE) had a native file format, which was the Clarion .DAT format. Clarion for Windows can use whatever file format has a driver or ODBC driver, including the old .DAT.
If you have access to the source code, then it is a pretty straight situation. In Clarion you can easily have different buffers pointing to different tables.
  PROGRAM

  MAP
  END

szConnMySQL   CSTRING( 256 )

users_mysql   FILE, DRIVER( 'ODBC' ), OWNER( szConnMySQL ), NAME( 'users' )
RECORD          RECORD
id                LONG
name              STRING( 20 )
                END
              END

users_tps     FILE, DRIVER( 'TopSpeed' ), NAME( 'users' )
RECORD          RECORD
name              STRING( 20 )
id                LONG
                END
              END

  CODE
  ! Connection string for the MySQL side ({{ yields a literal { in a Clarion string)
  szConnMySQL = 'Driver={{MySQL ODBC 3.51 Driver};' & |
                'Server=myServerAddress;Database=myDataBase;User=myUsername;' & |
                'Password=myPassword;Option=3;'

  OPEN( users_mysql, 42h )
  OPEN( users_tps, 42h )

  ! Add the row to MySQL first, then mirror it into the TopSpeed table
  users_mysql.id   = 1
  users_mysql.name = 'GUSTAVO PINSARD'
  ADD( users_mysql )
  IF NOT ERRORCODE()
    users_tps.RECORD :=: users_mysql.RECORD
    ADD( users_tps )
  ELSE
    ! Do your thing
  END

  CLOSE( users_mysql )
  CLOSE( users_tps )
If you don't have access to the source code, then you'll have to write a MySQL stored procedure to update the remote file. The problem is that the remote file, being a TopSpeed file, would not be directly accessible from the MySQL server, since MySQL doesn't know anything about it.
One solution is to use the TopSpeed ODBC driver at the MySQL server and have the MySQL stored procedure go through that driver. I consider the TopSpeed ODBC driver a must-have, because it provides a way out of situations like this and promotes better integration.
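Just to show the shape of such a bridge (not Clarion, and only a sketch: it assumes the TopSpeed ODBC driver is installed, both sides are exposed as DSNs with made-up names, and no conflict handling is needed), copying new users rows from MySQL into the TopSpeed table could look like this:
import pyodbc

src = pyodbc.connect("DSN=MySQL_Users")        # MySQL side (placeholder DSN)
dst = pyodbc.connect("DSN=TopSpeed_Users")     # TopSpeed ODBC driver side (placeholder DSN)

src_cur = src.cursor()
dst_cur = dst.cursor()

# Copy rows that do not yet exist on the TopSpeed side (id assumed unique).
existing = {row.id for row in dst_cur.execute("SELECT id FROM users")}
for row in src_cur.execute("SELECT id, name FROM users"):
    if row.id not in existing:
        dst_cur.execute("INSERT INTO users (id, name) VALUES (?, ?)", row.id, row.name)

dst.commit()
src.close()
dst.close()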
Details on the MySQL SP are out of the scope of this post, also because I don't know MySQL SPs to that level.
Regards