Couchbase to local files export - couchbase

I need to migrate Couchbase data into HDFS, but the DB and Hadoop clusters are not accessible to each other, so I cannot use Sqoop in the recommended way. Is there a way to export Couchbase data into local files (instead of HDFS) using Sqoop? If that is possible, I can do that, transfer the local files via FTP, and then use Sqoop again to load them into HDFS.
If that's a bad solution, is there any other way I can transfer all the Couchbase data into local files? Creating views on this Couchbase cluster is a difficult task and I would like to avoid that approach.

Alternative solution (perhaps not as elegant, but it works; a rough sketch of the first two steps follows the list):
Use the Couchbase backup utility, cbbackup, to save all the data locally.
Transfer the backup files to a network host that can reach HDFS.
Install Couchbase in the network segment where HDFS is reachable and use the Couchbase restore-from-backup procedure to populate that instance.
Use Sqoop (in the recommended way) against that Couchbase instance, which has access to HDFS.
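A minimal Python sketch of the first two steps, assuming cbbackup is on the PATH; the host names, credentials, and paths are placeholders you would replace with your own:
# Sketch: back up a Couchbase bucket locally with cbbackup, then copy the
# backup directory to a host that can reach HDFS. Hosts, credentials and
# paths below are placeholders.
import subprocess

CB_URL = "http://couchbase-host:8091"          # source cluster (placeholder)
BACKUP_DIR = "/tmp/cb-backup"                  # local backup destination
REMOTE = "user@hdfs-gateway:/data/cb-backup"   # host that can reach HDFS

# 1. Dump the data to local SQLite-based backup files (*.cbb).
subprocess.run(
    ["cbbackup", CB_URL, BACKUP_DIR, "-u", "Administrator", "-p", "password"],
    check=True,
)

# 2. Ship the backup to the network segment that can see HDFS
#    (scp here; plain FTP would work the same way).
subprocess.run(["scp", "-r", BACKUP_DIR, REMOTE], check=True)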

You can use the cbbackup utility that comes with the Couchbase installation to export all data to backup files. By default the backups are actually stored in SQLite format, so you can move them to your Hadoop cluster and then use any JDBC SQLite driver to import the data from each *.cbb file individually with Sqoop. I actually wrote a blog post about this a while ago; you can check it out.
To get you started, here's one of the many JDBC SQLite drivers out there.
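If you just want to peek at what a *.cbb file contains before wiring up Sqoop and a JDBC driver, the files can be opened with any SQLite client. A rough Python sketch; since the exact table layout depends on the Couchbase/cbbackup version, the script only lists whatever tables are actually present:
# Sketch: open a cbbackup *.cbb file as a SQLite database and dump its
# schema plus a few rows per table. Nothing is hard-coded to a layout.
import sqlite3
import sys

path = sys.argv[1]  # e.g. a data-0000.cbb file from the backup directory
conn = sqlite3.connect(path)
cur = conn.cursor()

# List the tables the backup file actually contains.
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
tables = [row[0] for row in cur.fetchall()]
print("tables:", tables)

# Show a small sample from each table.
for table in tables:
    cur.execute(f'SELECT * FROM "{table}" LIMIT 3')
    print(table, "->", cur.fetchall())

conn.close()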

You can use the Couchbase Kafka adapter to stream data from Couchbase to Kafka, and from Kafka you can store the data in any file system you like. The adapter uses the TAP protocol to push data to Kafka.
https://github.com/paypal/couchbasekafka
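For the Kafka route, the sink side can be as simple as a consumer that appends messages to local files (or writes them on to HDFS). A hedged sketch using the kafka-python client; the broker address and topic name are placeholders for whatever the Couchbase-Kafka adapter is configured to publish to:
# Sketch: consume the messages the Couchbase->Kafka adapter publishes and
# append them to a local file. Broker and topic names are placeholders.
from kafka import KafkaConsumer   # pip install kafka-python

consumer = KafkaConsumer(
    "couchbase-data",                      # topic the adapter writes to (assumption)
    bootstrap_servers=["kafka-host:9092"],
    auto_offset_reset="earliest",
)

with open("couchbase_dump.jsonl", "ab") as out:
    for message in consumer:
        out.write(message.value)   # raw document bytes as sent by the adapter
        out.write(b"\n")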

Related

How can I transfer data from AWS RDS PostgreSQL db instance to mysql db which is on a different server?

Until now I have been transferring the data manually by exporting it and then importing it into the MySQL DB, but now I have to automate the whole process.
I want to generate CSV files and FTP them to the MySQL server.
pgAdmin lets me download the file on Windows, but when I export via the COPY command I get an error saying that I need to be a superuser and that I should use \copy instead.
And I cannot access the operating system of the Postgres server.
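One way around the superuser restriction is to run the COPY on the client side (which is exactly what \copy does), for example from Python with psycopg2, and then push the resulting CSV wherever it needs to go. A minimal sketch; the connection details and table name are placeholders:
# Sketch: client-side CSV export from RDS PostgreSQL (no superuser needed),
# equivalent to psql's \copy. Connection details and table are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-rds-instance.abc123.us-east-1.rds.amazonaws.com",
    dbname="mydb", user="myuser", password="mypassword",
)

with conn, conn.cursor() as cur, open("mytable.csv", "w") as f:
    # COPY ... TO STDOUT runs with the client's privileges, so it works on RDS.
    cur.copy_expert("COPY (SELECT * FROM mytable) TO STDOUT WITH CSV HEADER", f)

conn.close()
The resulting CSV can then be transferred by FTP and loaded into MySQL with LOAD DATA LOCAL INFILE.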

AWS RDS: Load XML From S3?

Since AWS Aurora does not support the RDS free tier (it does not have Micro instance support), I am using a MySQL server instead.
I have a script that generates data (currently in XML) that can be imported into MySQL, then writes it to an S3 bucket. I was intending to use the LOAD XML FROM S3 command like in this answer to import it from the bucket, but I get a syntax error when I try.
I've looked at AWS Data Pipelines, but it seems hard to maintain since, from what I can tell, it only supports CSV, and I would have to edit the SQL query to import the lines manually whenever the structure of the database changes. This is an advantage of XML; LOAD XML gets the column names from the file, not the query used.
Does RDS MySQL (not Aurora) support importing from S3? Or do I have to generate the XML, write it locally and to the bucket, and then use LOAD XML LOCAL INFILE on the local file?
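If LOAD XML FROM S3 turns out not to be available on plain RDS MySQL, one workaround is to download the object first and run LOAD XML LOCAL INFILE from wherever the script executes. A rough sketch; the bucket, key, table, and connection details are placeholders, and it assumes your MySQL client library allows LOCAL INFILE:
# Sketch: pull the generated XML from S3 and load it into RDS MySQL with
# LOAD XML LOCAL INFILE. Bucket, key, table and credentials are placeholders.
import boto3
import pymysql

s3 = boto3.client("s3")
s3.download_file("my-bucket", "export/data.xml", "/tmp/data.xml")

conn = pymysql.connect(
    host="my-rds-instance.abc123.us-east-1.rds.amazonaws.com",
    user="myuser", password="mypassword", database="mydb",
    local_infile=True,   # required for LOCAL INFILE from the client
)

with conn.cursor() as cur:
    cur.execute(
        "LOAD XML LOCAL INFILE '/tmp/data.xml' "
        "INTO TABLE mytable ROWS IDENTIFIED BY '<row>'"
    )
conn.commit()
conn.close()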
There are multiple limitations when importing data into RDS from S3, as mentioned in the official documentation. Check whether any of the below applies to you.
Limitations and Recommendations for Importing Backup Files from Amazon S3 to Amazon RDS
The following are some limitations and recommendations for importing backup files from Amazon S3:
You can only import your data to a new DB instance, not an existing DB instance.
You must use Percona XtraBackup to create the backup of your on-premises database.
You can't migrate from a source database that has tables defined outside of the default MySQL data directory.
You can't import a MySQL 5.5 or 8.0 database.
You can't import an on-premises MySQL 5.6 database to an Amazon RDS MySQL 5.7 or 8.0 database. You can upgrade your DB instance after you complete the import.
You can't restore databases larger than the maximum database size supported by Amazon RDS for MySQL. For more information about storage limits, see General Purpose SSD Storage and Provisioned IOPS SSD Storage.
You can't restore from an encrypted source database, but you can restore to an encrypted Amazon RDS DB instance.
You can't restore from an encrypted backup in the Amazon S3 bucket.
You can't restore from an Amazon S3 bucket in a different AWS Region than your Amazon RDS DB instance.
Importing from Amazon S3 is not supported on the db.t2.micro DB instance class. However, you can restore to a different DB instance class and then change the instance class later. For more information about instance classes, see Hardware Specifications for All Available DB Instance Classes.
Amazon S3 limits the size of a file uploaded to an Amazon S3 bucket to 5 TB. If a backup file exceeds 5 TB, then you must split the backup file into smaller files.
Amazon RDS limits the number of files uploaded to an Amazon S3 bucket to 1 million. If the backup data for your database, including all full and incremental backups, exceeds 1 million files, use a tarball (.tar.gz) file to store full and incremental backup files in the Amazon S3 bucket.
User accounts are not imported automatically. Save your user accounts from your source database and add them to your new DB instance later.
Functions are not imported automatically. Save your functions from your source database and add them to your new DB instance later.
Stored procedures are not imported automatically. Save your stored procedures from your source database and add them to your new DB instance later.
Time zone information is not imported automatically. Record the time zone information for your source database, and set the time zone of your new DB instance later. For more information, see Local Time Zone for MySQL DB Instances.
Backward migration is not supported for both major versions and minor versions. For example, you can't migrate from version 5.7 to version 5.6, and you can't migrate from version 5.6.39 to version 5.6.37.

Is there any way to store Prometheus data in an external database like MySQL or PostgreSQL?

Currently I am working with Prometheus and getting good results. The difficulty I am facing is that if the service restarts, all my old data is lost. Is there any way to permanently store the Prometheus data in databases like MySQL or PostgreSQL?
You can't write Prometheus data directly to a relational DB (or any external DB, for that matter). You have two choices:
Mount an external disk on your machine and configure Prometheus to write its data to that mount location.
Write a tiny web script which translates the Prometheus export format to whatever storage format you want, then configure Prometheus to send data to that script.
More information can be found in the Prometheus docs.
Traditional databases like MySQL and PostgreSQL aren't optimized for the time series data collected by Prometheus. Better solutions exist that require less storage space and handle both inserts and selects faster.
Prometheus supports remote storage. When enabled, it stores all new data in both local storage and remote storage. Multiple choices exist for the remote storage DB, with various tradeoffs. I'd recommend trying VictoriaMetrics. It natively supports Prometheus' query language, PromQL, so it can easily be used as a Prometheus datasource in Grafana.
PostgreSQL support for Prometheus is now available here:
https://blog.timescale.com/prometheus-ha-postgresql-8de68d19b6f5
InfluxDB would be another option:
https://www.influxdata.com/blog/influxdb-now-supports-prometheus-remote-read-write-natively/
Just configure the remote_write and remote_read in your Prometheus config and you are ready to go:
remote_write:
- url: 'http://{YOUR_INFLUX-DB}:{YOUR_INFLUX-DB_PORT}/api/v1/prom/write?db=metrics'
remote_read:
- url: 'http://{YOUR_INFLUX-DB}:{YOUR_INFLUX-DB_PORT}/api/v1/prom/read?db=metrics'

Import data from CSV file to Amazon Web Services RDS MySQL database

I have created a relational database (MySQL) hosted on Amazon Web Services. What I would like to do next is import the data in my local CSV files into this database. I would really appreciate it if someone could provide an outline of how to go about it. Thanks!
This is easiest and most hands-off using the MySQL command line. For large loads, consider spinning up a new EC2 instance, installing the MySQL command-line tools, and transferring your file to that machine. Then, after connecting to your database via the command line, you'd do something like:
mysql> LOAD DATA LOCAL INFILE 'C:/upload.csv' INTO TABLE myTable;
There are also options to match your file's details and ignore the header row (plenty more in the docs):
mysql> LOAD DATA LOCAL INFILE 'C:/upload.csv' INTO TABLE myTable FIELDS TERMINATED BY ','
ENCLOSED BY '"' IGNORE 1 LINES;
If you're hesitant to use the command line, download MySQL Workbench. It connects to AWS RDS without a problem.
Closing thoughts:
MySQL LOAD DATA docs
AWS's Aurora RDS is MySQL-compatible, so the command works there too.
The LOCAL flag actually transfers the file from your client machine (where you're running the command) to the DB server. Without LOCAL, the file must be on the DB server, and with RDS it's not possible to transfer it there in advance.
It works great on huge files too! I just sent an 8.2 GB file (260 million rows) via this method; it took just over 10 hours from a t2.medium EC2 instance to a db.t2.small Aurora instance.
It's not a solution if you need to watch out for unique keys or read the CSV row by row and change the data before inserting/updating.
I did some digging and found this official AWS documentation on how to import data from any source to MySQL hosted on RDS.
It is a very detailed step-by-step guide and includes an explanation of how to import CSV files.
http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/MySQL.Procedural.Importing.AnySource.html
Basically, each table must have its own file. Data for multiple tables cannot be combined in the same file. Give each file the same name as the table it corresponds to. The file extension can be anything you like. For example, if the table name is "sales", the file name could be "sales.csv" or "sales.txt", but not "sales_01.csv".
Whenever possible, order the data by the primary key of the table being loaded. This drastically improves load times and minimizes disk storage requirements.
There is another option for importing data into a MySQL database: you can use the external tool Alooma, which can do the data import for you in real time.
It depends on how large your file is, but if it is under 1 GB I found that DataGrip imports smaller files without any issues: https://www.jetbrains.com/datagrip/
You get a nice mapping tool and a graphical IDE to play around with. DataGrip is available as a free 30-day trial.
I have experienced RDS connection dropouts myself with bigger files (> 2 GB); I'm not sure whether the issue is on the DataGrip side or the AWS side.
I think your best bet would be to develop a script in your language of choice to connect to the database and import the data (a rough sketch follows below).
If your database is internet accessible, then you can run that script locally. If it is in a private subnet, then you can run the script either on an EC2 instance with access to the private subnet or on a Lambda function connected to your VPC. You should really only use Lambda if you expect the runtime to be less than 5 minutes or so.
Edit: Note that Lambda only supports a handful of languages:
AWS Lambda supports code written in Node.js (JavaScript), Python, Java (Java 8 compatible), and C# (.NET Core).
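A minimal sketch of such a script in Python, useful when you need the row-by-row transformation or unique-key handling that LOAD DATA cannot do. The host, credentials, table, and column names are placeholders:
# Sketch: read a local CSV, tweak each row, and insert it into RDS MySQL.
# Host, credentials, table and column names are placeholders.
import csv
import pymysql

conn = pymysql.connect(
    host="my-rds-instance.abc123.us-east-1.rds.amazonaws.com",
    user="myuser", password="mypassword", database="mydb",
)

with conn.cursor() as cur, open("upload.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Example of a per-row transformation before inserting.
        row["name"] = row["name"].strip().title()
        cur.execute(
            "INSERT INTO myTable (id, name) VALUES (%s, %s) "
            "ON DUPLICATE KEY UPDATE name = VALUES(name)",
            (row["id"], row["name"]),
        )

conn.commit()
conn.close()
The same script runs unchanged on EC2 or inside a Lambda handler, as long as the function can reach the database.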

Spark integration in existing application using cassandra

We have a working application with one application server and a 3-node Cassandra cluster. Recently we got a new requirement to import large CSV files into our existing database. Rows in the CSV need to be transformed before being saved to Cassandra. Our infrastructure is deployed in Amazon AWS.
We have a couple of questions:
It looks to us like Spark is the right tool for the job, since it has the Spark Cassandra Connector and a Spark CSV plugin. Are we correct?
Maybe a newbie Spark question, but in our deployment scenario, where should the importer app be deployed? Our idea is to have the Spark master on one of the DB nodes, Spark workers spread across the 3 database nodes, and the importer application on the same node as the master. It would be perfect to have a command-line interface to import a CSV, which could later evolve into an API/web interface.
Can we put the importer application on the application server, and what would the network penalty be?
Can we use Spark in this scenario for Cassandra joins as well, and how can we integrate it with the existing application, which already uses the regular DataStax Java driver (along with application-side joins where needed)?
First of all, keep in mind that the Spark Cassandra Connector will only give you data locality if you're loading your data from Cassandra, not from an external source. So, to load a CSV file, you'll have to transport it to your Spark workers using shared storage, HDFS, etc. This means that wherever you place your importer application, it will stream the data to your Spark workers.
Now to address your points:
You're correct about Spark, but incorrect about Spark Cassandra Connector, as it's only useful if you're loading data from Cassandra (which might be the case for #4 when you need to perform Joins between external data and Cassandra data), otherwise it won't give you any significant help.
Your importer application will be deployed to your cluster. In the scenario you described, this is a stand-alone Spark Cluster. So you'll need to package your application, then use the spark-submit command on your master node to deploy your application. Using a command line parameter for your CSV file location, you can deploy and run your application as a normal command line tool.
As described in #2, your importer application will be deployed from your master node to all your workers. What matters here is where your CSV file is. A simple way to deploy it is by splitting the file across your worker nodes (using the same local file path), and load it as a local file. But be aware that you'd lose your local CSV part if the node dies. For more reliable distribution you can place your CSV file on an HDFS cluster then read from there.
Using Spark Cassandra Connector, you can load your data from Cassandra into RDDs on the corresponding local nodes, then using the RDDs you created by loading your CSV data, you can perform Joins and of course write the result back to Cassandra if you need to. You can use the Spark Cassandra Connector as a higher level tool to perform both the reading and writing, you wouldn't need to use the Java Driver directly (as the connector is built on top of it anyway).
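To make the flow concrete, here is a hedged PySpark sketch of the import job. The keyspace, table, column names, and HDFS path are placeholders, and it assumes the job is submitted with the Spark Cassandra Connector package available (e.g. via spark-submit --packages):
# Sketch: read the CSV (from HDFS or a path visible to every worker),
# transform the rows, and write them to Cassandra via the Spark Cassandra
# Connector. Keyspace/table/column names and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("csv-to-cassandra-import")
    .config("spark.cassandra.connection.host", "cassandra-node-1")
    .getOrCreate()
)

df = spark.read.csv("hdfs:///imports/data.csv", header=True, inferSchema=True)

# Example transformation before saving (replace with whatever your rows need).
transformed = df.withColumn("name", F.upper(F.col("name")))

(
    transformed.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="my_table")
    .mode("append")
    .save()
)

spark.stop()
The job would then be packaged and launched with spark-submit from the master node, as described in point 2, with the CSV location passed as a command-line argument.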