Importing large datasets into Couchbase

I am having difficulty importing large datasets into Couchbase. I have experience doing this very quickly with Redis via the command line, but I have not seen anything comparable for Couchbase.
I have tried using the PHP SDK and it imports about 500 documents per second. I have also tried the cbdocloader script in the Couchbase bin folder, but it seems to want each document in its own JSON file. It is a lot of work to create all these files and then load them. Is there some other import process I am missing? If cbdocloader is the only way to load data fast, is it possible to put multiple documents into one JSON file?

Take the file that has all the JSON documents in it and zip up the file:
zip somefile.zip somefile.json
Place the zip file(s) into a directory. I used ~/json_files/ in my home directory.
Then load the file or files by the following command:
cbdocloader -u Administrator -p s3kre7Pa55 -b MyBucketToLoad -n 127.0.0.1:8091 -s 1000 \
~/json_files/somefile.zip
Note: '-s 1000' is the bucket's memory quota in megabytes. You'll need to adjust this value for your bucket.
If successful you'll see output stating how many documents were loaded, success, etc.
Here is a brief script to load up a lot of .zip files in a given directory:
#!/bin/bash
JSON_Dir=~/json_files
for ZipFile in "$JSON_Dir"/*.zip; do
  /Applications/Couchbase\ Server.app/Contents/Resources/couchbase-core/bin/cbdocloader \
    -u Administrator -p s3kre7Pa55 -b MyBucketToLoad \
    -n 127.0.0.1:8091 -s 1000 "$ZipFile"
done
UPDATED: Keep in mind this script will only work if your data is formatted correctly and each document is under the maximum single-document size of 20MB (not the zip file itself, but any document extracted from it).
I have also written a blog post describing bulk loading from a single file; it is here:
Bulk Loading Documents Into Couchbase

Related

Is it possible to specify a file name inside .zip archive compressing standard input to standard output using cli tools?

I'm trying to find a way to specify the file name inside a compressed .zip archive while compressing from standard input to standard output. I want to achieve this without creating a temporary file in the process.
Currently I have an example script which creates a mysqldump, pipes the result into zip, and streams the output to the aws s3 command, which saves the result to S3.
Here is the example:
mysqldump ... | zip | aws s3 cp - s3://[bucket_name]/output.sql.zip
The problem is that by default zip stores the file inside the archive under the name "-".
Is there a way to pass a specific file name inside the archive using the zip command, or any other zip tool?
Recent Linux distributions ship a streamzip script, which I wrote to handle this use case. In the example below the -member command-line option sets the member name within the streamed zip file to data.sql:
mysqldump ... | streamzip -member data.sql | aws s3 cp - s3://[bucket_name]/output.sql.zip
If the mysqldump is larger than 4 GB, include the -zip64 option with streamzip.
If you don't have streamzip available, download it from https://github.com/pmqs/IO-Compress/blob/master/bin/streamzip
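If streamzip isn't installable, a similar result can be had with Python's standard zipfile module (3.6+), which can write to a non-seekable stream and lets you choose the member name. A sketch, assuming python3 is on the PATH; printf stands in for your mysqldump producer, and a local file stands in for `aws s3 cp -`:

```shell
# Write a zip to stdout whose single member is named data.sql, not "-".
printf 'SELECT 1;\n' | python3 -c '
import shutil, sys, zipfile
with zipfile.ZipFile(sys.stdout.buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    with zf.open("data.sql", mode="w") as member:
        shutil.copyfileobj(sys.stdin.buffer, member)
' > output.sql.zip
python3 -c 'import zipfile; print(zipfile.ZipFile("output.sql.zip").namelist())'  # prints ['data.sql']
```

In the real pipeline you would replace the printf with `mysqldump ...` and the redirect with `| aws s3 cp - s3://[bucket_name]/output.sql.zip`.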

to resume the download by using gsutil

I have been downloading a file using gsutil, and the process crashed.
The documentation for gsutil is at:
https://cloud.google.com/storage/docs/gsutil_install#redhat
The file location is described at: https://genebass.org/downloads
How can I resume the file download instead of starting from scratch?
I have been looking for answers to similar questions, although those were provided for different situations. For example:
GSutil resume download using tracker files
As mentioned in GCP docs, using the gsutil cp command:
gsutil automatically performs a resumable upload whenever you use the cp command to upload an object that is larger than 8 MiB. You do not need to specify any special command line options to make this happen. [. . .] Similarly, gsutil automatically performs resumable downloads (using standard HTTP Range GET operations) whenever you use the cp command, unless the destination is a stream. In this case, a partially downloaded temporary file will be visible in the destination directory. Upon completion, the original file is deleted and overwritten with the downloaded contents.
If you're also using gsutil in large production tasks, you may find useful information on Scripting Production Transfers.
Alternatively, you can achieve resumable download from Google Cloud Storage using the Range header (just take note of the HTTP specification threshold).
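To see what a Range-based resume amounts to, here is a purely local sketch (no network, no gsutil, GNU stat assumed): the partial file's size gives the offset, and the equivalent of an HTTP `Range: bytes=OFFSET-` GET is appending only the remaining bytes.

```shell
SRC=$(mktemp); DST=$(mktemp)
head -c 100000 /dev/urandom > "$SRC"   # the full remote object
head -c 40000 "$SRC" > "$DST"          # a crashed, partial download
OFFSET=$(stat -c%s "$DST")             # how many bytes we already have
# local stand-in for a "Range: bytes=OFFSET-" request, appended to the partial file
tail -c +$((OFFSET + 1)) "$SRC" >> "$DST"
cmp -s "$SRC" "$DST" && echo "resume complete"
```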
I'm not sure which command you're using (cp or rsync), but either way gsutil will fortunately take care of resuming downloads for you.
From the docs for gsutil cp:
gsutil automatically resumes interrupted downloads and interrupted resumable uploads, except when performing streaming transfers.
So, if you're using gsutil cp, it will automatically resume the partially downloaded files without starting them over. However, resuming with cp will also re-download the files that were already completed. To avoid this, use the -n flag so the files you've already downloaded are skipped, something like:
gsutil -m cp -n -r gs://ukbb-exome-public/300k/results/variant_results.mt .
If instead you're using gsutil rsync, then it will simply resume downloading.

Wget -i gives no output or results

I'm learning data analysis in Zeppelin; I'm a mechanical engineer, so this is outside my expertise.
I am trying to download two csv files using a file that contains the URLs, test2.txt. When I run it I get no output, but no error message either. I've included a link to a screenshot showing my code and the results.
When I go into the Ambari Sandbox I cannot find any files created. I'm assuming the directory the file is in is where the csv files will be downloaded to. I've tried using -P as well with no luck. I've checked man wget but it did not help.
So I have several questions:
How do I show the output from running wget?
Where is the default directory that wget stores files?
Do I need additional data in the file other than just the URLs?
Screenshot: Code and Output for %sh
Thanks for any and all help.
%sh
wget -i /tmp/test2.txt
%sh
# list the current working directory
pwd   # output: /home/zeppelin
# make a new folder; created under "tmp" because it is temporary
mkdir -p /home/zeppelin/tmp/Folder_Name
# change directory to the new folder
cd /home/zeppelin/tmp/Folder_Name
# copy the file out of the sandbox's HDFS into the current working directory
hadoop fs -get /tmp/test2.txt /home/zeppelin/tmp/Folder_Name/
# download the URLs listed in the file
wget -i test2.txt
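To make the wget -i behaviour concrete, here is a self-contained sketch that serves two files locally and fetches them from a URL list. The port and paths are arbitrary; the input file needs nothing but one URL per line, and -P sets the download directory.

```shell
mkdir -p /tmp/srv /tmp/dl
printf 'a,b\n1,2\n' > /tmp/srv/one.csv
printf 'c,d\n3,4\n' > /tmp/srv/two.csv
# serve /tmp/srv on an arbitrary local port in the background
python3 -m http.server 8765 --directory /tmp/srv >/dev/null 2>&1 &
SRV=$!; sleep 1
# the input file: one URL per line, nothing else
printf 'http://127.0.0.1:8765/one.csv\nhttp://127.0.0.1:8765/two.csv\n' > /tmp/urls.txt
wget -q -i /tmp/urls.txt -P /tmp/dl    # -P chooses where the files land
kill "$SRV"
ls /tmp/dl
```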

Linux shell script command - gzip

I have a shell script on Linux whose output is generated in .csv format.
At the end of the script I convert the .csv to .gz to save space on my machine.
The generated file is named like Output_04-07-2015.csv
The command I have written to compress it is: gzip Output_*.csv
But I am facing an issue: if the compressed file already exists, the script should create the new file with the current time stamp instead.
Can anyone help me with this?
If all you want is to just overwrite the file if it already exists, gzip has a -f flag for it.
gzip -f Output_*.csv
The -f flag forcefully creates the gzip file, overwriting whatever archive of the same name might already be there.
Have a look at the man pages by typing man gzip or even this link for many other options.
If instead you want to handle it more elegantly, you could add a check in the script itself; the exact syntax will differ depending on which shell you use (bash, csh, etc.).
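For the behaviour the question actually asks for (keep the old archive and write the new one under a timestamped name), a small wrapper around gzip works. A sketch: the %H-%M-%S timestamp format and the sample data are assumptions, not from the question.

```shell
cd "$(mktemp -d)"
printf 'a,b\n1,2\n' > Output_04-07-2015.csv
printf 'old' | gzip > Output_04-07-2015.csv.gz   # simulate a leftover archive
for f in Output_*.csv; do
  if [ -e "$f.gz" ]; then
    ts=$(date +%H-%M-%S)                         # assumed timestamp format
    gzip -c "$f" > "${f%.csv}_${ts}.csv.gz" && rm "$f"
  else
    gzip "$f"
  fi
done
ls
```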

Import Multiple .sql dump files into mysql database from shell

I have a directory with a bunch of .sql files that are mysqldumps of each database on my server.
e.g.
database1-2011-01-15.sql
database2-2011-01-15.sql
...
There are quite a lot of them actually.
I need to create a shell script, or probably a one-liner, that will import each database.
I'm running on a Linux Debian machine.
I'm thinking there is some way to pipe the output of ls into a find command or something...
Any help and education is much appreciated.
EDIT
So ultimately I want to automatically import one file at a time into the database.
E.g. if I did it manually on one it would be:
mysql -u root -ppassword < database1-2011-01-15.sql
cat *.sql | mysql? Do you need them in any specific order?
If you have too many to handle this way, then try something like:
find . -name '*.sql' | awk '{ print "source",$0 }' | mysql --batch
This also gets around some problems with passing script input through a pipeline, though you shouldn't have any problems with pipeline processing under Linux. The nice thing about this approach is that the mysql utility reads each file itself (via source) instead of having it read from stdin.
A one-liner that reads in all .sql files and imports them:
for SQL in *.sql; do DB=${SQL/\.sql/}; echo "importing $DB"; mysql "$DB" < "$SQL"; done
The only trick is the bash substring replacement to strip out the .sql to get the database name.
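The substitution can be seen in isolation (bash assumed). `${SQL%.sql}` is an equivalent, slightly stricter form that only strips the extension when it is at the end of the name:

```shell
SQL=database1-2011-01-15.sql
echo "${SQL/\.sql/}"   # replaces the first ".sql" anywhere -> database1-2011-01-15
echo "${SQL%.sql}"     # strips ".sql" only from the end    -> database1-2011-01-15
```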
There is superb little script at http://kedar.nitty-witty.com/blog/mydumpsplitter-extract-tables-from-mysql-dump-shell-script which will take a huge mysqldump file and split it into a single file for each table. Then you can run this very simple script to load the database from those files:
for i in *.sql
do
echo "file=$i"
mysql -u admin_privileged_user --password=whatever your_database_here < "$i"
done
mydumpsplitter even works on .gz files, but it is much, much slower than gunzipping first and then running it on the uncompressed file.
I say huge, but I guess everything is relative. It took about 6-8 minutes to split a 2000-table, 200MB dump file for me.
I don't remember the exact mysql syntax, but it will be something like this (the files have to be concatenated and piped in, since mysql doesn't take .sql files as arguments):
find . -name '*.sql' -exec cat {} + | mysql ...
I created a script some time ago to do precisely this, which I called (completely uncreatively) "myload". It loads SQL files into MySQL.
Here it is on GitHub
It's simple and straightforward: it allows you to specify mysql connection parameters, and will decompress gzipped sql files on the fly. It assumes you have one file per database, and that the base of each filename is the desired database name.
So:
myload foo.sql bar.sql.gz
will create the databases "foo" and "bar" (if they don't already exist) and import the sql file into each.
For the other side of the process, I wrote this script (mydumpall) which creates the corresponding sql (or sql.gz) files for each database (or some subset specified either by name or regex).