AWS Glue write and compress with the files in output bucket - partitioning

I have an ETL job that runs daily, uses bookmarks and writes the increment to some output s3 bucket. The output bucket is partitioned by one key.
Now, I want to have just one file by each partition. I can achieve that on the first run of the job as following:
datasource = datasource.repartition(1)
glueContext.write_dynamic_frame.from_options(
connection_type = "s3",
frame = datasource,
connection_options = {"path":output_path, "partitionKeys": ["a_key"]},
format = "glueparquet",format_options={"compression":"gzip"},
transformation_ctx = "write_dynamic_frame")
What I can't figure out is how to write and compress my increment with the files that are already in my output bucket/partition.
One option would be to read the table from the previous day and merge it with the increment, but it seems like an overkill.
Any smarter ideas?

I was running into the same issue, and discovered that the compression setting goes in the connection_options:
connection_options = {"path": file_path, "compression": "gzip", "partitionKeys": ["a_key"]}

Related

Read S3 CSV file and insert into RDS mysql using AWS Glue

I have a CSV file in S3 bucket which gets updated/refreshed with new data generated from ML model every week. I have created an ETL pipeline in AWS glue to read data(CSV file) from S3 bucket and load it into RDS(mysql server). I have connected my RDS via SSMS. I was able to load data successfully into RDS and validate currect row counts with 50000. When I run the job again, the whole table; ie same file contents in CSV file gets appended. Here is the sample code:
datasink5 = glueContext.write_dynamic_frame.from_catalog(frame = resolvechoice4, database = "<dbname>", table_name = "<table schema name>", transformation_ctx = "datasink5")
Next week when I run my model there will be 1000 new rows in that CSV file. So when I run my ETL job in Glue, it should append 1000 new row values with previously loaded 5000 rows. Total row counts should reflect as 6000.
Can anyone tell me how to achieve this? Is there anyway we can truncate or drop table before inserting all new data? In that way we could avoid duplication.
Note: I will have to run "Crawler" to read data from S3 bucket every week to get new data with existing row values.
sample code generate using AWS glue.
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
Any help would be appreciated.

Transfering data from csv file into database table using Apache Flink

I wonder what is the best way to read data from csv file (located on S3) and then insert into database table.
I have deployed apache flink on my k8s cluster.
I have tried with DataSet api in the following way:
Source(Read csv) -> Map(Transform POJO to Row) -> Sink(JdbcOutputFormat)
It seems that Sink (writing into DB) is the bottleneck. Source and Map tasks are idle for ~80% while at the same time Sink is idle for 0ms/1s with input rate rate 1.6MB/s.
I can only speed up the whole operation of inserting csv content into my database by spliting the whole operation on new replicas of task managers.
Is there any room for improving performance of my jdbc sink?
[edit]
DataSource<Order> orders = env.readCsvFile("path/to/file") //
.pojoType(Order.class, pojoFields)
.setParallelism(6) //
.name("Read csv"); //
JDBCOutputFormat jdbcOutput = JDBCOutputFormat.buildJDBCOutputFormat()
.setQuery("INSERT INTO orders(...) values (...)") //
.setBatchInterval(10000) //
.finish();
orders.map(order -> {
Row r = new Row(29);
//assign values from Order pojo to Row
return r;
}).output(jdbcOutput).name("Postgre SQL Output");
I have experimented with batch interval in range 100-50000 but it didn't affect speed of processing significantly, it's still 1.4-1.6MB/s
If instead of writing to external database I print all entries from csv file to stdout (print()) I get rate 6-7MB/s so this is why I assumed the problem is with jdbc sink.
With this post just wanted to make sure my code doesn't have any performance issues and I reach max performance from a single Task Manager.

Why does R upload data much faster than KNIME or Workbench?

What I want to know is, what the heck happens, under the hoods, when I upload data through R and it turns to be way much faster than MySQL Workbench or KNIME?
I work with data and, everyday, I upload data into a MySQL server. I used to upload data using KNIME since it was much faster than uploading with MySQL Workbench (select the table -> "import data").
Some infos: The CSV has 4000 rows and 15 columns. The library I used in R is RMySQL. The node I used in KNIME is database writer.
library('RMySQL')
df=read.csv('C:/Users/my_user/Documents/file.csv', encoding = 'UTF-8', sep=';')
connection <- dbConnect(
RMySQL::MySQL(),
dbname = "db_name",
host = "yyy.xxxxxxx.com",
user = "vitor",
password = "****"
)
dbWriteTable(connection, "table_name", df, append=TRUE, row.names=FALSE)
So, to test, I did the exact same process, using the same file. It took 2 minutes in KNIME and only seconds in R.
Everything happens under the hood! Data upload to DB depends on parameters such as interface between DB and tool, network connectivity, batch size set, memory available for tool and tool data processing speed itself and probably some more. In your case RMySQL package uses batch size of 500 by default and KNIME only 1 so probably that is where the difference comes from. Try setting it to 500 in KNIME and then compare. Have no clue how MySQL Workbench works...

How to parsimoniously refer to a data frame in RMySQL

I have a MySQL table that I am reading with the RMySQL package of R. I would like to be able to directly refer to the data frame stored in the table so I can seamlessly interact with it rather than having to execute RMySQL statement every time I want to do something. Is there a way to accomplish this? I tried:
data <- dbReadTable(conn = con, name = 'tablename')
For example, if I now want to check how many rows I have in this table I would run:
nrow(data)
Does this go through the database connection, or am I now storing the object "data" locally, defeating the whole purpose of using an external database?
data <- dbReadTable(conn = con, name = 'tablename')
This command downloads all the data into a local R dataframe (assuming you have enough RAM). Any operations with data from that point forward do not require the SQL connection.

How do I customise the CREATE DATABASE statement in VSTS DB Edition Deploy?

I'm using VSTS Database Edition GDR Version 9.1.31024.02
I've got a project where we will be creating multiple databases with identical schema, on the fly, as customers are added to the system. It's one DB per customer. I thought I should be able to use the deploy script to do this. Unfortunately I always get the full filenames specified on the CREATE DATABASE statement. For example:
CREATE DATABASE [$(DatabaseName)]
ON
PRIMARY(NAME = [targetDBName], FILENAME = N'$(DefaultDataPath)targetDBName.mdf')
LOG ON (NAME = [targetDBName_log], FILENAME = N'$(DefaultDataPath)targetDBName_log.ldf')
GO
I'd expected something more like this
CREATE DATABASE [$(DatabaseName)]
ON
PRIMARY(NAME = [targetDBName], FILENAME = N'$(DefaultDataPath)$(DatabaseName).mdf')
LOG ON (NAME = [targetDBName_log], FILENAME = N'$(DefaultDataPath)$(DatabaseName)_log.ldf')
GO
Or even
CREATE DATABASE [$(DatabaseName)]
I'm not going to be running this on an on-going basis so I'd like to make it as simple as possible, for the next guy. There are a bunch of options for deployment in the project properties, but I can't get this to work the way I'd like.
Any one know how to set this up?
Better late than never, I know how to get the $(DefaultDataPath)$(DatabaseName) file names from your second example.
The SQL you're showing in your first code snippet suggests that you don't have scripts for creating the database files in your VSTS:DB project, perhaps by deliberately excluded them from any schema comparisons you've done. I found it a little counter-intuitive, but the solution is to let VSTS:DB script the MDF and LDF in you development environment, then edit those scripts to use the SQLCMD variables.
In your database project, go to the folder Schema Objects > Database Level Objects > Storage > Files. In there, add these two files:
Database.sqlfile.sql
ALTER DATABASE [$(DatabaseName)]
ADD FILE (NAME = [$(DatabaseName)],
FILENAME = '$(DefaultDataPath)$(DatabaseName).mdf',
SIZE = 2304 KB, MAXSIZE = UNLIMITED, FILEGROWTH = 1024 KB)
TO FILEGROUP [PRIMARY];
Database_log.sqlfile.sql
ALTER DATABASE [$(DatabaseName)]
ADD LOG FILE (NAME = [$(DatabaseName)_log],
FILENAME = '$(DefaultDataPath)$(DatabaseName)_log.ldf',
SIZE = 1024 KB, MAXSIZE = 2097152 MB, FILEGROWTH = 10 %);
The full database creation script that VSTS:DB, or for that matter VSDBCMD.exe, generates will now use the SQLCMD variables for naming the MDF and LDF files, allowing you to specify them on the command line, or in MSBuild.
We do this using a template database, that we back up, copy, and restore as new customers are brought online. We don't do any of the schema creation with scripts but with a live, empty DB.
Hmm, well it seems that the best answer so far (given the over whelming response) is to edit the file after the fact... Still looking